Tag Archives: media fragment addressing

Dealing with multi-track video (and audio)

We are slowly approaching the stage where we want to make multi-track video of the following type available and accessible:

  • original video track
  • original audio track
  • dubbed audio tracks in n different languages
  • audio description track in n different languages
  • sign language video tracks in n different sign languages
  • caption tracks in n different languages
  • multiple other time-aligned text tracks in different languages
  • audio and video track from different camera angles
  • music and speech tracks can be separate
  • different quality tracks are available
  • accompanying images, e.g. slides for a presentation

One of the issues with such a sizeable number of tracks is how to display them. Some of them are alternatives, some of them additions. Sign language is typically presented in a PiP (picture-in-picture) approach. If we have a music and a speech (or singing) track, we may want to have control over removing certain tracks – e.g. to be able to do karaoke. Caption and subtitle tracks in the same language are probably alternatives, while in different languages they could be additions. It is not a trivial challenge to handle such complex files in an application.

At this point, I am only trying to solve a sub-challenge. As we talk about a particular track in a multi-track media file, we will want to identify it by name. Should there be a standard for naming the track, so that we can e.g. address them by a URL, e.g. with the intention of only delivering a subset of tracks from the larger file? We could introduce that for Ogg – but maybe there is an opportunity to do this across file formats?

To find some answers to these and related questions, I want to discuss two approaches.

The first approach is a simple numbering approach. In it, the audio, video, and annotation tracks are all ordered and then numbered through. This will result in the following sets of track names: video[0] … [n], audio[0] … [n], timed text[0] … [n], and possibly even timed images[0] … [n]. This approach is simple, easy to understand, and only requires ordering the tracks within their types. It allows addressing of a particular track – e.g. as required by the media fragment URI scheme for track addressing. However, it does not allow identification of alternatives, additions, or presentation styles.
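To make the numbering approach concrete, here is a minimal sketch in Python: tracks are grouped by type and numbered through in order. The example fragment URI at the end is illustrative only, loosely modelled on the track dimension of the W3C Media Fragments work; the exact syntax there is still subject to specification.

```python
def number_tracks(track_types):
    """Assign names like audio[0], audio[1] to an ordered list of track types."""
    counters = {}
    names = []
    for track_type in track_types:
        i = counters.get(track_type, 0)
        names.append(f"{track_type}[{i}]")
        counters[track_type] = i + 1
    return names

# An example multi-track file: one video, two audio tracks, one timed text track.
names = number_tracks(["video", "audio", "audio", "timed text"])
print(names)  # ['video[0]', 'audio[0]', 'audio[1]', 'timed text[0]']

# A media fragment URI could then address a single track, e.g.:
# http://example.com/video.ogv#track=audio[1]
```

Note that the names carry no semantics: nothing tells a player whether `audio[0]` and `audio[1]` are alternatives or additions.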

Should alternatives, additions, and presentation styles be encoded in the name of the track? Or should this information go into a meta description area of the multi-track video? Something like skeleton in Ogg? Or should it go a step further and be buried in an external information file such as an m3u file (or ROE for Ogg)?

I want to experiment here with the naming scheme and what we would need to specify to be able to decide which tracks to ignore and which to combine for a presentation. And I want to ask for your comments and advice.

This requires listing exactly what types of content tracks we may have to deal with.

In the video space, we have at minimum the following track types:

  • main video content – with alternative camera angles
  • subsidiary video content – with alternative camera angles
  • sign language videos – in alternative languages

Alternatives are defined by camera angle and language. Also, each track can be made available in a different quality. I’d also count additional image content, such as slides in a presentation, as subsidiary video content. So, here we could use a scheme such as video_[main,side,sign]_language_angle.

In the audio space, we have at minimum the following track types:

  • main audio content – in alternative languages
  • background audio content – e.g. music, SFX, noise
  • foreground speech or singing content – in alternative languages
  • audio descriptions – in alternative languages

Alternatives are defined by language and content type. Again, each track can be made available in a different quality. Here we could use a scheme such as audio_type_language.

In the text space, we have at minimum the following track types:

  • subtitles – in different languages
  • captions – in different languages
  • textual audio descriptions – in different languages
  • other time-aligned text – in different languages

Alternatives are defined by language and content type – e.g. lyrics, captions and subtitles really compete for the same screen space. Here we could use a scheme such as text_type_language.

A generic track naming scheme
It seems that the generic naming scheme of

<content_type>_<track_type>_<language> [_<angle>]

can cover all cases.
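The scheme above can be sketched as a pair of helper functions, one to build a name and one to take it apart. This is only an illustration of the proposal, not a specification; the field values used below (e.g. "sign", "main", the language codes, "angle2") are example assumptions.

```python
def track_name(content_type, track_type, language, angle=None):
    """Build a name following <content_type>_<track_type>_<language>[_<angle>]."""
    parts = [content_type, track_type, language]
    if angle is not None:
        parts.append(angle)
    return "_".join(parts)

def parse_track_name(name):
    """Split a track name back into its scheme fields; angle is optional."""
    parts = name.split("_")
    return {
        "content_type": parts[0],
        "track_type": parts[1],
        "language": parts[2],
        "angle": parts[3] if len(parts) > 3 else None,
    }

print(track_name("video", "sign", "ase"))           # video_sign_ase
print(track_name("video", "main", "en", "angle2"))  # video_main_en_angle2
print(parse_track_name("audio_description_de"))
```

One design consequence worth noting: because "_" doubles as the field separator, the individual field values themselves must not contain underscores.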

Are there further track types, further alternatives I have missed? What do you think?