Manifests for exposing the structure of a Composite Media Resource

In the previous post I explained that there is a need to expose the tracks of a time-linear media resource to the user agent (UA). Here, I want to look in more detail at different possibilities of how to do so, their advantages and disadvantages.

Note: A lot of this has come out of discussions I had at the recent W3C TPAC and is still in flux, so I am writing this to start discussions and brainstorm.

Declarative Syntax vs JavaScript API

We can expose a media resource’s tracks either through a JavaScript function that can loop through the tracks and provide access to the tracks and their features, or we can do this through declarative syntax.

Using declarative syntax has the advantage of being available even if JavaScript is disabled in a UA. The markup can be parsed easily and default displays can be prepared without having to actually decode the media file(s).

OTOH, it has the disadvantage that it may not necessarily represent what is actually in the binary resource, but instead what the Web developer assumed was in the resource (or what he forgot to update). This may lead to a situation where a “404” may need to be given on a media track.

A further disadvantage is that when somebody copies the media element onto another Web page, together with all the track descriptions, and then the original media resource is changed (e.g. a subtitle track is added), this has not the desired effect, since the change does not propagate to the other Web page.

For these reasons, I thought that a JavaScript interface was preferable over declarative syntax.

However, recent discussions, in particular with some accessibility experts, have convinced me that declarative syntax is preferable, because it allows the creation of a menu for turning tracks on/off without having to even load the media file. Further, declarative syntax allows to treat multiple files and “native tracks” of a virtual media resource in an identical manner.

Extending Existing Declarative Syntax

The HTML5 media elements already have declarative syntax to specify multiple source media files for media elements. The <source> element is typically used to list video in mpeg4 and ogg format for support in different browsers, but has also been envisaged for different screensize and bandwidth encodings.

The <source> elements are generally meant to list different resources that contribute towards the media element. In that respect, let’s try using it for declaring a manifest of tracks of the virtual media resource on an example:

  <video>
    <source id='av1' src='video.3gp' type='video/mp4' media='mobile' lang='en'
                     role='media' >
    <source id='av2' src='video.mp4' type='video/mp4' media='desktop' lang='en'
                     role='media' >
    <source id='av3' src='video.ogv' type='video/ogg' media='desktop' lang='en'
                     role='media' >
    <source id='dub1' src='video.ogv?track=audio[de]' type='audio/ogg' lang='de'
                     role='dub' >
    <source id='dub2' src='audio_ja.oga' type='audio/ogg' lang='ja'
                     role='dub' >
    <source id='ad1' src='video.ogv?track=auddesc[en]' type='audio/ogg' lang='en'
                     role='auddesc' >
    <source id='ad2' src='audiodesc_de.oga' type='audio/ogg' lang='de'
                     role='auddesc' >
    <source id='cc1' src='video.mp4?track=caption[en]' type='application/ttaf+xml'
                     lang='en' role='caption' >
    <source id='cc2' src='video.ogv?track=caption[de]' type='text/srt; charset="ISO-8859-1"'
                     lang='de' role='caption' >
    <source id='cc3' src='caption_ja.ttaf' type='application/ttaf+xml' lang='ja'
                     role='caption' >
    <source id='sign1' src='signvid_ase.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='ase' role='sign' >
    <source id='sign2' src='signvid_gsg.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='gsg' role='sign' >
    <source id='sign3' src='signvid_sfs.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='sfs' role='sign' >
    <source id='tad1' src='tad_en.srt' type='text/srt; charset="ISO-8859-1"'
                     lang='en' role='tad' >
    <source id='tad2' src='video.ogv?track=tad[de]' type='text/srt; charset="ISO-8859-1"'
                     lang='de' role='tad' >
    <source id='tad3' src='tad_ja.srt' type='text/srt; charset="EUC-JP"' lang='ja'
                     role='tad' >
  </video>

Note that this somewhat ignores my previously proposed special itext tag for handling text tracks. I am doing this here to experiment with a more integrative approach with the virtual media resource idea from the previous post. This may well be a better solution than a specific new text-related element. Most of the attributes of the itext element are, incidentally, covered.

You will also notice that some of the tracks are references to tracks inside binary media files using the Media Fragment URI specification while others link to full files. An example is video.ogv?track=auddesc[en]. So, this is a uniform means of exposing all the tracks that are part of a (virtual) media resource to the UA, no matter whether in-band or in external files. It actually relies on the UA or server being able to resolve these URLs.

“type” attribute

“media” and “type” are existing attributes of the <source> element in HTML5 and meant to help the UA determine what to do with the referenced resource. The current spec states:

The “type” attribute gives the type of the media resource, to help the user agent determine if it can play this media resource before fetching it.

The word “play” might need to be replaced with “decode” to cover several different MIME types.

The “type” attribute was also extended with the possibility to add the “charset” MIME parameter of a linked text resource – this is particularly important for SRT files, which don’t handle charsets very well. It avoids having to add an additional attribute and is analogous to the “codecs” MIME parameter used by audio and video resources.

“media” attribute

Further, the spec states:

The “media” attribute gives the intended media type of the media resource, to help the user agent determine if this media resource is useful to the user before fetching it. Its value must be a valid media query.

The “mobile” and “desktop” values are hints that I’ve used for simplicity reasons. They could be improved by giving appropriate bandwidth limits and width/height values, etc. Other values could be different camera angles such as topview, frontview, backview. The media query aspect has to be looked into in more depth.

“lang” attribute

The above example further uses “lang” and “role” attributes:

The “lang” attribute is an existing global attribute of HTML5, which typically indicates the language of the data inside the element. Here, it is used to indicate the language of the referenced resource. This is possibly not quite the best name choice and should maybe be called “hreflang”, which is already used in multiple other elements to signify the language of the referenced resource.

“role” attribute

The “role” attribute is also an existing attribute in HTML5, included from ARIA. It currently doesn’t cover media resources, but could be extended. The suggestion here is to specify the roles of the different media tracks – the ones I have used here are:

  • “media”: a main media resource – typically contains audio and video and possibly more
  • “dub”: a audio track that provides an alternative dubbed language track
  • “auddesc”: a audio track that provides an additional audio description track
  • “caption”: a text track that provides captions
  • “sign”: a video-only track that provides an additional sign language video track
  • “tad”: a text track that provides textual audio descriptions to be read by a screen reader or a braille device

Further roles could be “music”, “speech”, “sfx” for audio tracks, “subtitle”, “lyrics”, “annotation”, “chapters”, “overlay” for text tracks, and “alternate” for a alternate main media resource, e.g. a different camera angle.

Track activation

The given attributes help the UA decide what to display.

It will firstly find out from the “type” attribute if it is capable of decoding the track.

Then, the UA will find out from the “media” query, “role”, and “lang” attributes whether a track is relevant to its user. This will require checking the capabilities of the device, network, and the user preferences.

Further, it could be possible for Web authors to influence whether a track is displayed or not through CSS parameters on the <source> element: “display: none” or “visibility: hidden/visible”.

Examples for track activation that a UA would undertake using the example above:

Given a desktop computer with Firefox, German language preferences, captions and sign language activated, the UA will fetch the original video at video.ogv (for Firefox), the German caption track at video.ogv?track=caption[de], and the German sign language track at signvid_gsg.ogv (maybe also the German dubbed audio track at video.ogv?track=audio[de], which would then replace the original one).

Given a desktop computer with Safari, English language preferences and audio descriptions activated, the UA will fetch the original video at video.mp4 (for Safari) and the textual audio description at tad_en.srt to be displayed through the screen reader, since it cannot decode the Ogg audio description track at video.ogv?track=auddesc[en].

Also, all decodeable tracks could be exposed in a right-click menu and added on-demand.

Display styling

Default styling of these tracks could be:

  • video or alternate video in the video display area,
  • sign language probably as picture-in-picture (making it useless on a mobile and only of limited use on the desktop),
  • captions/subtitles/lyrics as overlays on the bottom of the video display area (or whatever the caption format prescribes),
  • textual audio descriptions as ARIA live regions hidden behind the video or off-screen.

Multiple audio tracks can always be played at the same time.

The Web author could also define the display area for a track through CSS styling and the UA would then render the data into that area at the rate that is required by the track.

How good is this approach?

The advantage of this new proposal is that it builds basically on existing HTML5 components with minimal additions to satisfy requirements for content selection and accessibility of media elements. It is a declarative approach to the multi-track media resource challenge.

However, it leaves most of the decision on what tracks are alternatives of/additions to each other and which tracks should be displayed to the UA. The UA makes an informed decision because it gets a lot of information through the attributes, but it still has to make decisions that may become rather complex. Maybe there needs to be a grouping level for alternative tracks and additional tracks – similar to what I did with the second itext proposal, or similar to the <switch> and <par> elements of SMIL.

A further issue is one that is currently being discussed within the Media Fragments WG: how can you discover the track composition and the track naming/uses of a particular media resource? How, e.g., can a Web author on another Web site know how to address the tracks inside your binary media resource? A HTML specification like the above can help. But what if that doesn’t exist? And what if the file is being used offline?

Alternative Manifest descriptions

The need to manifest the track composition of a media resource is not a new one. Many other formats and applications had to deal with these challenges before – some have defined and published their format.

I am going to list a few of these formats here with examples. They could inspire a next version of the above proposal with grouping elements.

Microsoft ISM files (SMIL subpart)

With the release of IIS7, Microsoft introduced “Smooth Streaming”, which uses chunking on files on the server to deliver adaptive streaming to Silverlight clients over HTTP. To inform a smooth streaming client of the tracks available for a media resource, Microsoft defined ism files: IIS Smooth Streaming Server Manifest files.

This is a short example – a longer one can be found here:

<?xml version=