Dealing with multi-track video (and audio)

We are slowly approaching the stage where we want to make multi-track video of the following type available and accessible:

  • original video track
  • original audio track
  • dubbed audio tracks in n different languages
  • audio description tracks in n different languages
  • sign language video tracks in n different sign languages
  • caption tracks in n different languages
  • multiple other time-aligned text tracks in different languages
  • audio and video tracks from different camera angles
  • music and speech tracks can be separate
  • different quality tracks are available
  • accompanying images, e.g. slides for a presentation

One of the issues with such a sizeable number of tracks is how to display them. Some of them are alternatives, some of them additions. Sign language is typically presented in a PiP (picture-in-picture) approach. If we have a music and a speech (or singing) track, we may want to have control over removing certain tracks – e.g. to be able to do karaoke. Caption and subtitle tracks in the same language are probably alternatives, while in different languages they could be additions. It is not a trivial challenge to handle such complex files in an application.

At this point, I am only trying to solve a sub-challenge. As we talk about a particular track in a multi-track media file, we will want to identify it by name. Should there be a standard for naming tracks, so that we can address them by URL, e.g. with the intention of delivering only a subset of tracks from the larger file? We could introduce that for Ogg – but maybe there is an opportunity to do this across file formats?

To find some answers to these and related questions, I want to discuss two approaches.

The first approach is a simple numbering approach. In it, the audio, video, and annotation tracks are each ordered within their type and then numbered consecutively. This will result in the following sets of track names: video[0] … [n], audio[0] … [n], timed text[0] … [n], and possibly even timed images[0] … [n]. This approach is simple, easy to understand, and only requires ordering the tracks within their types. It allows addressing of a particular track – e.g. as required by the media fragment URI scheme for track addressing. However, it does not allow identification of alternatives, additions, or presentation styles.
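As a rough illustration of the numbering approach, the sketch below assigns index-based names to tracks in the order they are encountered. The function name `number_tracks` and the short type labels ("video", "audio", "text") are assumptions made for this example, not part of any existing specification.

```python
from collections import defaultdict


def number_tracks(track_types):
    """Name each track by its type and its position among tracks of that type."""
    counters = defaultdict(int)  # next free index per track type
    names = []
    for track_type in track_types:
        names.append(f"{track_type}[{counters[track_type]}]")
        counters[track_type] += 1
    return names


# Hypothetical track order, as it might appear in a multiplexed file.
tracks = ["video", "audio", "audio", "text", "video", "text"]
print(number_tracks(tracks))
```

The names depend only on the order of tracks within each type, which is why the scheme is easy to implement but carries no information about what a track actually contains.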

Should alternatives, additions, and presentation styles be encoded in the name of the track? Or should this information go into a meta description area of the multi-track video, something like Skeleton in Ogg? Or should it go a step further and be placed in an external information file such as an m3u file (or ROE for Ogg)?

I want to experiment here with the naming scheme and what we would need to specify to be able to decide which tracks to ignore and which to combine for a presentation. And I want to ask for your comments and advice.

This requires listing exactly what types of content tracks we may have to deal with.

In the video space, we have at minimum the following track types:

  • main video content – with alternative camera angles
  • subsidiary video content – with alternative camera angles
  • sign language videos – in alternative languages

Alternatives are defined by camera angle and language. Also, each track can be made available in a different quality. I’d also count additional image content, such as slides in a presentation, as subsidiary video content. So, here we could use a scheme such as video_[main,side,sign]_language_angle.

In the audio space, we have at minimum the following track types:

  • main audio content – in alternative languages
  • background audio content – e.g. SFX, noise
  • foreground speech or singing content – in alternative languages
  • audio descriptions – in alternative languages

Alternatives are defined by language and content type. Again, each track can be made available in a different quality. Here we could use a scheme such as audio_type_language.

In the text space, we have at minimum the following track types:

  • subtitles – in different languages
  • captions – in different languages
  • textual audio descriptions – in different languages
  • other time-aligned text – in different languages

Alternatives are defined by language and content type – e.g. lyrics, captions and subtitles really compete for the same screen space. Here we could use a scheme such as text_type_language.

A generic track naming scheme
It seems that the generic naming scheme of

<content_type>_<track_type>_<language> [_<angle>]

can cover all cases.
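To make the scheme concrete, here is a minimal sketch of how such names could be built and parsed. The class `TrackName`, the two helper functions, and the example values ("sign", "description", "left") are assumptions for illustration. The sketch also assumes that no field contains an underscore, which the scheme itself does not guarantee.

```python
from typing import NamedTuple, Optional


class TrackName(NamedTuple):
    content_type: str             # "video", "audio" or "text"
    track_type: str               # e.g. "main", "side", "sign", "caption"
    language: str                 # e.g. "en", "de"
    angle: Optional[str] = None   # optional camera angle


def format_track_name(name):
    """Join the fields with underscores, leaving out a missing angle."""
    parts = [name.content_type, name.track_type, name.language]
    if name.angle is not None:
        parts.append(name.angle)
    return "_".join(parts)


def parse_track_name(text):
    """Split a name into its fields; the angle is optional."""
    parts = text.split("_")
    if len(parts) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(parts)}: {text!r}")
    return TrackName(*parts)


print(parse_track_name("video_main_en_left"))
print(format_track_name(TrackName("audio", "description", "de")))
```

A name built this way has to be split again before a player can reason about language or track type, so any new attribute changes the parsing rules for every consumer.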

Are there further track types, further alternatives I have missed? What do you think?

10 thoughts on “Dealing with multi-track video (and audio)”

  1. Our Internet quota is already small enough as it is. Do we really want to throw it away downloading multiple video streams even if you only want to watch one of them?

    Take sign language — I have perfectly good hearing, so I don’t want to waste bandwidth downloading it. On the other hand, somebody without hearing won’t want to download the audio track.

    1. @jeremy That is exactly the point why I am defining a naming scheme for such multi-track video: the browser can then instruct the server to only deliver those streams that are actually of interest to its particular user. Think of it as content negotiation on multi-track media.

      A deaf person may argue the same for any audio track of a video, and a blind person for any video track – they also don’t want to waste bandwidth on something that is not necessary to them. This naming scheme is a first step in the direction of enabling content negotiation on multi-track media files through the media fragment URI scheme, which allows addressing of tracks.

  2. A few notes from our IRC discussion:

    – such a name, being a list of attributes, might be better off as actual attributes (eg, in Skeleton for Ogg), as it will both avoid having to parse a string into its constituent attributes, and make it easier to add new attributes should the need arise.
    – Sending attribute preferences in the original request from the client (either via HTTP headers or URI parameters) will allow the server to select tracks and stream them to the client, needing only one roundtrip, as opposed to two if the client first requests a list of tracks, then requests a subset of these tracks (after having parsed them and worked out which ones it’s interested in). This, if keeping names as proposed, also moves parsing complexity from client to server, a good thing I think.
    – names might still be interesting as unique identifiers, for those clients that know exactly what they want (eg, editors), but do not have to carry the semantics.
    – Track numbers are easy to break by remuxing a video (eg, simple editing, or even adding a subtitles track).
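
The notes above suggest keeping attributes as separate fields and letting the server pick tracks from preferences sent in the first request. A minimal sketch of that selection step follows. The attribute keys ("content", "role", "lang"), the opaque track names, and the function `select_tracks` are all hypothetical and do not reflect Skeleton or any real server API.

```python
def select_tracks(tracks, preferences):
    """Return the names of tracks whose attributes satisfy every preference.

    tracks:      list of dicts, each with a "name" plus free-form attributes
    preferences: dict mapping an attribute key to the set of accepted values
    """
    selected = []
    for track in tracks:
        if all(track.get(key) in accepted for key, accepted in preferences.items()):
            selected.append(track["name"])
    return selected


# Hypothetical track descriptions a server might hold for one file.
tracks = [
    {"name": "v0", "content": "video", "role": "main", "lang": "en"},
    {"name": "a0", "content": "audio", "role": "main", "lang": "en"},
    {"name": "a1", "content": "audio", "role": "main", "lang": "de"},
    {"name": "t0", "content": "text", "role": "caption", "lang": "en"},
]

# Preferences as they might arrive in a single request from the client.
preferences = {"lang": {"en"}, "content": {"video", "text"}}
print(select_tracks(tracks, preferences))
```

Here the names act only as unique identifiers, and adding a new attribute means adding a key rather than changing a string format.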

  3. I’d prefer to address the tracks through something like ROE. Naming conventions can get lengthy when overloaded with a lot of attributes. Seems like ROE allows for some pretty verbose descriptions of tracks and their attributes.

    As long as the spec allows for content negotiation with ROE descriptors, I think we’re in business.

  4. I have had many questions related to the challenges of multi-track video, so thought I should extend this a bit.

    So, let me explain the idea for how this information could be used with the HTML5 video element:

    Assuming we have a multi-track video linked in the src attribute of a video element, the browser would need to identify what is in the tracks (along the lines described in the blog entry) and then do something useful with it.

    By default, it would use the main audio and video track of the default language that the browser is set to and play them back.

    For blind people (which would need to be a setting in the browser) it would additionally activate the audio description track in the default language. It would also probably disable all video tracks.

    For deaf people (again a browser setting) it would additionally activate the caption or subtitle track in the default language and the sign language video track in the default sign language. It would also probably disable all audio tracks.
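
The default behaviour described so far can be sketched roughly as follows. This is only an illustration under several assumptions: tracks are represented as (content type, track type, language) tuples, the profile names "default", "blind" and "deaf" stand in for a browser setting, and "ase" is used as an example code for a sign language.

```python
def default_tracks(tracks, language, profile="default", sign_language=None):
    """Pick the tracks to activate for a user profile, in file order."""
    wanted = {("video", "main", language), ("audio", "main", language)}
    if profile == "blind":
        # Add the audio description and switch off all video.
        wanted.add(("audio", "description", language))
        wanted = {t for t in wanted if t[0] != "video"}
    elif profile == "deaf":
        # Prefer captions, fall back to subtitles if there are none.
        for kind in ("caption", "subtitle"):
            if ("text", kind, language) in tracks:
                wanted.add(("text", kind, language))
                break
        if sign_language is not None:
            wanted.add(("video", "sign", sign_language))
        # Switch off all audio.
        wanted = {t for t in wanted if t[0] != "audio"}
    return [t for t in tracks if t in wanted]


# Hypothetical contents of a multi-track file.
tracks = [
    ("video", "main", "en"),
    ("video", "sign", "ase"),
    ("audio", "main", "en"),
    ("audio", "description", "en"),
    ("text", "caption", "en"),
    ("text", "subtitle", "en"),
]
print(default_tracks(tracks, "en"))
print(default_tracks(tracks, "en", profile="blind"))
print(default_tracks(tracks, "en", profile="deaf", sign_language="ase"))
```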

    Additionally, the browser would provide a right-click menu that lets you activate/deactivate all tracks individually. If the controls attribute is set in the video element, this menu is also added to the controls bar.

    Additionally, there would be a JavaScript API through which the Web developer can identify the available tracks and turn them on/off selectively.

    It is possible that to make this available in a container-format independent way, an external description format such as ROE is necessary.

  5. I was pointed to QuickTime and how QuickTime does grouping of alternative tracks and associates presentation information.

    Here is the link to the QuickTime “Track Header” description:

    It contains the following fields (amongst others):

    Alternate group – specifies a collection of movie tracks that contain alternate data for one another, choices may be based on e.g. playback quality, language, or the capabilities of the computer.

    Layer – describes the priority in which tracks overlay each other

    Track width/height – pixel size of this track

    Volume – loudness of this track

    Further, here is the QuickTime “Media Header” description – “Media Atoms” define a track

  6. How ambitious a request would it be to have a rudimentary depth-based multi-layering per video track; for example a separate layer for the background and foreground or for moving vs. static objects. Couldn’t seam carving and motion tracking enable such layers of metadata to be more accessible? If this type of innovation is too far ahead for this project, when do you foresee it coming into being?

    I raise this question because I truly believe there are creative, knowledge and financial motivations to see this sort of “object-oriented” approach to the elements within the frame be included in a video’s metadata. Video is needlessly and artificially flattened by modeling it on celluloid.

  7. @Gabriel,

    I think this is a bit too ambitious for right now. Describing such multi-layers in a simple manner has been tried in SMIL and MPEG-7 and I truly think we are not yet ready for it on the Web. We are moving only slightly in that direction with media fragments and the idea of addressing spatial objects.

  8. @silvia

    Which constraints of the web do you see as being prohibitive specifically? I’ve been reading a lot lately on how the costs of storage, bandwidth and processing are predicted to exponentially decline in the near future. Do you think these factors affect the evolution of such standards? Also, what about a cross-media format which might have a rich life of use “offline” that could eventually be ported to the web in the future…

  9. @Gabriel I wasn’t referring to technical restrictions on the Web, just to what people are ready for – and with “people” I mean users, publishers, as well as tool providers. If there were good products out there that were able to produce your background/foreground separation automatically and they were important players in the market place, then there would be a need to look at it on the Web, too. Right now we are mostly trying to solve issues that YouTube and others have had to deal with and partially solved, but not in a manner that will work across the Web. Web standards generally don’t tend to invent new technology, but rather standardise good existing practice.
