All posts by silvia

3rd W3C Web and TV Workshop, Hollywood

Curious about any new requirements that the TV community may have for HTML5 video, I attended the W3C Web and TV Workshop in Hollywood last week. It’s already the third of its kind and was also the largest to date showing an increasing interest of the TV community to converge with the Web community.

The Workshop Aim

I went into the Workshop not quite knowing what to expect. My previous contact with members of this community was restricted to email exchanges on the W3C Web and TV Interest Group (IG) mailing list. I knew there was some interest in video accessibility (well: particularly captions) and little knowledge of existing HTML5 specifications around text tracks and why the browsers were going with WebVTT. So I had decided to attend the workshop to get a better understanding of the community, it’s background, needs, and issues, and to hopefully teach some of the ways of HTML5. For that reason I had also submitted a WebVTT presentation/demo.

As it turned out, the workshop had as its key target the facilitation of communication between the TV and the HTML5 community. The aim was to identify features that need to be added to the HTML5 video element to satisfy the needs of the TV community. I obviously came to the right workshop.

The process that is being used by the W3C in the Interest Group is to have TV community members express their needs, then have HTML5 experts express how these needs can be satisfied with existing HTML5 features, then make trial implementations and identify any shortcomings, then move forward to progress these through HTML5 or HTML.next. This workshop clearly focused on the first step: expressing needs.

Often times it was painful for me to watch presenters defending their requirements and trying to impress on the audience how important a certain feature is to them when that features actually already has a HTML5 specification, but just not yet a browser implementations. That there were so few HTML5 video experts present and that they were given very little space to directly reply to the expressed needs and actually explain what is already possible (or specified to be possible) was probably one of the biggest drawbacks of the workshop.

To be fair, detailed technical discussions were not possible in a room with 150 attendees with a panel sitting at the front discussing topics and taking questions. Solving a use case with existing HTML5 markup and identifying the gaps requires smaller break-out groups of a maximum of maybe 20 people and sufficient HTML5 knowledge in the room. Ultimately they require a single person to try to implement it using JavaScript alone, and, failing that, writing browser extensions. Only such code actually proves that a feature is missing.

Now, the video features of HTML5 are still continuing to change almost on a daily basis. Much development is, for example, happening around real-time communication features and around the track element as we speak. So, focusing on further requirements finding around HTML5 video for now is probably a good thing.

The TV Community Approach

Before I move on to some of the topics covered by the workshop, I have to express some concern about the behaviour that I observed with lots of the TV community folks. Many people tried pushing existing solutions from other spaces into the Web unchanged with a claim of not re-inventing the wheel and following paved cowpaths, which are some of the underlying design principles for HTML5. I can understand where such behaviour originates thinking that having solved the same problems elsewhere before, those solutions should apply here, too. But I would like to warn people of this approach.

If we blindly apply solutions that were not developed for HTML5 into HTML we will end up with suboptimal solutions that will hurt us further down the track. The principles of not re-inventing the wheel and following paved cowpaths were introduced for features that were already implemented by browsers or in de-facto standard use by JavaScript libraries. They were not created for new features in HTML. The video element is a completely new feature in HTML thus everything around it is new.

I would therefore like to see some more respect given to HTML5 and the complexities involved in finding the best possible technical solutions for the Web given that the video element does not stand alone in HTML5, but is part of a much larger picture of technical capabilities on the Web where many of the requested features for TV applications may already be solved by existing HTML markup that is not part of the video element.

Also, HTML5 is not just about the HTML markup, but also about CSS and JavaScript and HTTP. There are several layers of technology involved in creating a Web application: not only a separation of work between client and servers, but also between the Operating System, the media framework, the browser, browser plugins, and JavaScript has to be balanced. To get this balance right is a fine art that will take many discussion, many experiments and sometimes several design approaches. We need patience and calm to work through this, not a rushed adoption of existing solutions from other spaces.

New Requirements

Now let’s get to the take-aways I had from the workshop’s sessions:

Session 1 / Content Provider and Consumer Perspective:

The sessions participants postulate that we will see the creation of application stores for TV applications similar to how we have experienced this for mobile phones and tablets. People enjoy collecting apps like they collect badges. Right now, the app store domain is dominated by native apps and now Web apps. The reason is that we haven’t got a standard platform for setting up Web app stores with Web apps that work in all browsers on all operating systems. Thus, developers have to re-deploy their app for many environments.

While essentially an orthogonal need to HTML standardisation, this seems to be one of the key issues that keep Web apps back from making big market inroads and W3C may do well in setting up a new WG to define a standard Web app manifest format and JS APIs.

Session 2+3 / Multi-screen TV in the Home Network:

Several technologies of hybrid TV broadcast and set-top-box Web content delivery were being pointed out, including the European HbbTV and the Japanese Hybridcast, the latter of which gave an in-depth demo.

Web purists would probably say that it would be simpler to just deliver all content over the Web and not have to worry about any further technical challenges encountered by having to synchronize content received via two vastly different delivery mechanisms. I personally believe this development is one of business models: we don’t yet know exactly how to earn money from TV content delivered over the Internet, but we do know how to do so with TV content. So, hybrids allow the continuation of existing income streams while allowing the features to be augmented with those people enjoy from the Internet.

Should requirements that emerge from such a use case for HTML5 video be taken seriously? I think they absolutely should. What I see happening is that a new way of using the Web is starting to emerge. The new way is video-focused rather than text-focused. We receive our Web content by watching video programming online – video channels, not Web pages are the core content that we consume in the living room. Video channels are where we start our browsing experience from. Search may still be our first point of call, but it will be search for video content or a video-centric app rather than search for a Web site.

And it will be a matter of many interconnected devices in the house that contribute to the experience: the 5.1 stereos that are spread all over the house and should receive our video’s sound, the different screens in the different areas of our house between which we move around, and remote controls, laptops or tablets that function as remote controls and preview stations and are used to determine our viewing experience and provide a back-channel to the publishers.

We have barely begun to identify how such interconnected devices within a home fit within the server-client-based view of the Web world, and the new Web Sockets functionality. The Home Networking Task Force of the Web and TV IG is looking at the issues and analysing existing protocols and standards that solve this picture. But I have a gnawing feeling that the best solution will be something new that is more Web-specific and fits better with the technology layers of the Web.

Session 4 / Synchronized Metadata:

The TV environment offers many data services, some of which have been legally prescribed. This session analysed TV needs and how they can be satisfied with current HTML5.

Subtitles and closed captioning support are one of the key requirements that have been legally prescribed to allow for equal access of non-native speakers, and blind and vision-impaired users to TV content. After demonstration of some key features defined into the HTML5 track element and the WebVTT format, it was generally accepted that HTML5 is making big progress in this space, in particular that browsers are in the process of implementing support for the track element. A concern still exists for complete coverage of all the CEA-608/708 features in WebVTT.

Further concern was raised for support of audio descriptions and audio translations, in particular since no browser has as yet committed to implementing the HTML5’s media multitrack API with the @mediagroup attribute. In this context I am excited to see first JavaScript polyfills emerge (see captionator.js & mediagroup.js).

Another concern was that many captions are actually delivered as raster images (in particular DVD captions) and how that would work in the Web context. The proposal was to use WebVTT and encode the raster images as data-URIs included in timed cues, then render them by JavaScript as an overlay. This is something to explore further.

Demos were shown using WebVTT to synchronize ads with videos, to display related metadata from a user’s life log with videos, to display thumbnails along a video’s timeline, and to show the rendering of text descriptions through screen readers. General agreement by the panel was that WebVTT offers many opportunities and that this area will continue to need further development and that we will see new capabilities on the Web around metadata that were not previously possible on TV.

Session 5 / Content Format and Codecs: DASH and Codec standards

The introduction of HTTP adaptive streaming into HTML5 was one of the core issues that kept returning in the discussions. This panel focused on MPEG DASH, but also mentioned the need for programmatic implementation of adaptive streaming functionality.

The work around MPEG DASH would require specifications of how to use DASH with WebM and Ogg Theora, as well as a specification of a HTML5 profile for DASH, which would limit the functionality possible in DASH files to the ones needed in a HTML5 video element. One criticism of DASH was its verbosity. Another was its unclear patent position. Panel attendees with included Qualcomm, Apple and Microsoft made very clear that their position is pro a royalty-free use of DASH.

The work around a programmatic implementation for adaptive streaming would require at least a JavaScript API to measure the quality of service of a presented video element and a JavaScript API to feed the video element with chunks of (encrypted) video content on the fly. Interestingly enough, there are existing experiments both around Video metrics and MediaSource extensions, so we can expect some progress in this space, even if these are not yet a strong focus of the HTML WG.

I would personally support the creation of Community Group at the W3C around HTTP adaptive streaming and DASH. I think it would work towards alleviating the perceived patent issues around DASH and allow the right members of the community to participate in preparing a specification for HTML5 without requiring them to become W3C members.

Session 6 / Content Protection and DRM

A core concern of the TV community is around content protection. The requirements in this space seem, however, very confused.

The key assumption here is that Web browsers should support the decoding of DRM-protected content in the HTML5 video element because the video element provides a desirable JavaScript API, accessibility features (the track element), default controls, and the possibility to synchronize multiple media elements. However, at the same time, the video element is part of the core content of a Web page and thus allows direct access to the image content in a canvas etc, so some of its functionality is not desirable.

The picture is further confused by requests for authentication, authorization, encryption, obfuscation, same-origin, secure transmission, secure decryption key delivery, unique content identification and other “content protection” techniques without a clear understanding of what is already possible on the Web and what requirements to content publishers actually have for delivering their content on the Web. This is further complicated by the fact that there are many competing solutions for DRM systems in the market with no clear standard that all browsers could support.

A thorough analysis of the technologies and solutions available in this space as well as an analysis of the needs for HTML5 is required before it becomes clear what solution HTML5 browsers may need to support. There seemed to be agreement in the group, though, that browsers would not need to implement DRM solutions, but rather only hand through the functionality of the platform on which they are running (including the media frameworks and operating system functionalities). How this is supposed to work was, however, unclear.

Session 7 / Web & TV: Additional Device & User Requirements

This was a catch-all session for topics that had not been addressed in other sessions. Among the topics addressed in this group were:

  • Parental Guidance: how to deal with ratings in an internationally inconsistent ratings landscape, how to deliver the ratings with the content, and how to enforce the viewing restrictions
  • Emergency Notifications: how to replicate on the Web the emergency notification functionality of TV by providing text overlays to alert users
  • TV channels: how to detect what channels of programming are available to users

Overall, the workshop was a worthwhile experience. It seems there is a lot of work still ahead for making HTML5 video the best it can be on the Web.

The new FOMS: Open Media Developers at OVC

Since 2007 I have organised the annual Foundations of Open Media Software (FOMS) developers workshop. Last year it was held for the first time in the northern hemisphere, in fact on the two days straight after the Open Video Conference (OVC).

This year I’m really excited to announce that the workshop will be an integral part of the Open Video Conference on 10-12 September 2011.

FOMS 2011 will take place as the Open Media Developers track at OVC and I would like to see as many if not more open media software developers attend as we had in last year’s FOMS.

Why should you go?

Well, firstly of course the people. As in previous years, we will have some of the key developers in open media software attend – not as celebrities, but to work with other key developers on hard problems and to make progress.

Then, secondly we believe we have some awesome sessions in preparation:

How we run it

I’m actually not quite satisfied with just these sessions. I’d like to be more flexible on how we make the three days a success for everyone. And this implies that there will continue to be room to add more sessions, even while at the conference, and create breakout groups to address really hard issues all the way through the conference.

I insist on this flexibility because I have seen in past years that the most productive outcomes are created by two or three people breaking away from the group, going into a corner and hacking up some demos or solutions to hard problems and taking that momentum away after the workshop.

To allow this to happen, we will have a plenary on the first day during which we will identify who is actually present at the workshop, what they are working on, what sessions they are planning on a attending, and what other topics they are keen to learn about during the conference that may not yet be addressed by existing sessions.

We’ll repeat this exercise on the Monday after all the rest of the conference is finished and we get a quieter day to just focus on being productive.

But is it worth the effort?

As in the past years, whether the workshop is a success for you depends on you and you alone. You have the power to direct what sessions and breakout groups are being created, and you have the possibility to find others at the workshop that share an interest and drag them away for some productive brainstorming or coding.

I’m going to make sure we have an adequate number of rooms available to actually achieve such an environment. I am very happy to have the support of OVC for this and I am assured we have the best location with plenty of space.

Trip sponsorships

As in previous FOMSes, we have again made sure that travel and conference sponsorship is available to community software developers that would otherwise not be able to attend FOMS. We have several such sponsorships and I encourage you to email the FOMS committee or OVC about it. Mention what you’re working on and what you’re interested to take away from OVC and we can give you free entry, hotel and flight sponsorship.

Oh, and don’t forget to Register for OVC!

2009, Oct 9th, Web Directions South: Taking HTML5 a step further

Web Directions South 2009, Sydney Convention Centre, October 8 2.40pm.

This talk focuses on the efforts engaged by W3C to improve the new HTML 5 media elements with mechanisms to allow people to access multimedia content, including audio and video. Such developments are also useful beyond accessibility needs and will lead to a general improvement of the usability of media, making media discoverable and generally a prime citizen on the Web.

Silvia will discuss what is currently technically possible with the HTML5 media elements, and what is still missing. She will describe a general framework of accessibility for HTML5 media elements and present her work for the Mozilla Corporation that includes captions, subtitles, textual audio annotations, timed metadata, and other time-​​aligned text with the HTML5 media elements. Silvia will also discuss work of the W3C Media Fragments group to further enhance video usability and accessibility by making it possible to directly address temporal offsets in video, as well as spatial areas and tracks.

Link to conference Website

Video:

Slides:

2011, Feb 25th, SLUG: “The latest on HTML5 media”

SLUG Sydney Linux Users Group, 25th Feb 2011 “The latest on HTML5 media”

Our speaker this month is Silvia Pfeiffer, giving two short talks on “The latest on HTML5 media”. We’re promised lots of demos, and also some details about the code that makes the demos possible.

Silvia has worked extensively as an invited expert to the W3C in areas such as the new and tags in HTML5. Prior to this, Silvia has worked for almost two decades on projects such as Annodex (http://www.annodex.net/ – standards for annotating and indexing media files).

Silvia has recently published “The Definitive Guide to HTML5 Video”, a few copies of which just may be given away on the night.

Slides

2011, Jan 24th, LCA Multimedia Mini-conf “Audio and Video processing in HTML5”

Linux.conf.au LCA Multimedia Miniconf, 24th Jan 2011

Audio and video processing have traditionally been hard number-crunching tasks to do and nobody would have considered executing them in a Web browser on a remote file. However, with the capabilities of modern hardware and web browser software, and the integration of audio and video into HTML5, it’s now possible to do (almost) everything inside a web browser in real-time with a bit of JavaScript. In this talk we will look at the Firefox Audio API to visualize sound with an FFT and to manipulate sound. For video we look at the possibilities of the Canvas to manipulate video pixels in real-time to achieve things like motion detection.

Slides

2010, Oct 14th, Web Directions South, Sydney: “HTML5 Audio and Video”

On 14th October I gave a talk at Web Directions South on “HTML5 audio and video – using these exciting new elements in practice”.

I wanted to give people an introduction into how to use these elements while at the same time stirring their imagination as to the design possibilities now that these elements are available natively in browsers. I re-used some of the demos that I have put together for the book that I am currently writing, added some of the cool stuff that others have done and finished off with an outlook towards what new features will probably arrive next.

“Slides” are now available, which are really just a Web page with some demos that work in modern browsers.

Table of contents:

HTML5 Audio and Video

  1. Cross browser <video> element
  2. Cross browser <audio> element
  3. Encoding
  4. Fallback considerations
  5. CSS and <video> – samples
  6. <video> and the JavaScript API
  7. <video> and SVG
  8. <video> and Canvas
  9. <video> and Web Workers
  10. <video> and Accessibility
  11. audio plans

Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post to make it more efficient than answering tweets, emails, and skype and other phone conference requests. It’s about time I do a proper post about it.

Implementations

I’m starting with the last area, because it is the simplest to answer.

No, no browser has as yet shipped support for the <track> element and therefore there is no support for WebVTT in browsers yet. However, implementations are in progress. For example, Webkit has recently received first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but is still working on the element’s actual functionality.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator – a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub – a polyfill for track and SRT parsing
  • jscaptions – a polyfill for track and SRT parsing
  • LeanBack player – a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr – a video player that includes track and WebVTT parsing
  • MediaElementJS – a video player that includes track and SRT parsing
  • Kaltura’s video player – a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss with Ian, me and the WHATWG about their meaning. His review notes with visual description of how settings are to be interpreted and his demo will be most useful to authors and other developers.

Standardisation

Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months – and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the <track> element – the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer – I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic:
    Whereas previously the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span:
    The <b> element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors:
    The spec already allowed to use the names of tags, the classes of <c> tags, and the voice annotations of <v> tags as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used.
  • text-decoration support:
    The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as blinking text and bold.

Further to this, the email identifies the means in which WebVTT is extensible:

  • Header area:
    The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and two empty lines as an end. It is possible to add into this area file-wide information header information.
  • Cues:
    Cues are defined to start with an optional identifier, and then a start/end time specification with “–>” separator. They end with two empty lines. Cues that contain a “–>” separator but don’t parse as valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues:
    Finally, within cues, everything that is within a “tag”, i.e. between “”, and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so wen can introduce new markup tags in this way.

Given this background, the following V2 extensions have been discussed:

  • Metadata:
    Enter name-value pairs of metadata into the header area, e.g.

    WEBVTT
    Language=zh
    Kind=Caption
    Version=V1_ABC
    License=CC-BY-SA
    
    1
    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings:
    Default cue settings can come in a “cue” of their own, e.g.

    WEBVTT
    
    DEFAULTS --> D:vertical A:end
    
    00:00.000 --> 00:02.000
    This is vertical and end-aligned.
    
    00:02.500 --> 00:05.000
    As is this.
    
    DEFAULTS --> A:start
    
    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
    
  • Inline CSS:
    Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

    WEBVTT
    
      STYLE -->
      ::cue(v[voice=Bob]) { color: green; }
      ::cue(c.narration) { font-style: italic; }
      ::cue(c.narration i) { font-style: normal; }
    
      00:00.000 --> 00:02.000
      <v Bob>Welcome.
    
      00:02.500 --> 00:05.000
      <c .narration>To <i>WebVTT</i>.
    
  • Comments:
    Both, comments within cues and complete cues commented out are possible, e.g.

    WEBVTT
    
     COMMENT -->
     00:02.000 --> 00:03.000
     two; this is entirely
     commented out
     
     00:06.000 --> 00:07.000
     this part of the cue is visible
     <! this part isn't >
     <and neither is this>
    

Finally, I believe we still need to add the following features:

  • Language tags:
    I’d like to add a language tag that allows to mark up a subpart of cue text as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). But more importantly we will need this feature for cues that contain text descriptions rather than captions, such that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a <lang jp>xxx</lang> type of markup.
  • Roll-up captions:
    When we use timestamp objects and the future text is hidden, then is un-hidden upon reaching its time, we should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which TV live captions have been displayed and so users are acquainted with this display style.
  • Inline navigation:
    For chapter tracks the primary use of cues are for navigation. In other formats – in particular in DAISY-books for blind users – there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a <nav> tag that can group several timestamp objects for navigation.
  • Default caption width:
    At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overruled with the “S” cue setting. I think it should by default rather be the width (height) of the bounding box around all the text inside the cue.

Aside from these changes to WebVTT, there are also some things that can be improved on the <track> element. I personally support the introduction of the source element underneath the track element, because that allows us to provide different caption files for different devices through the @media media queries attribute and it allows support for more than just one default captioning format. This change needs to be made soon so we don’t run into trouble with the currently empty track element.

I further think a oncuelistchange event would be nice as well in cases where the number of tracks is somehow changed – in particular when coming from within a media file.

Other than this, I’m really very happy with the state that we have achieved this far.

HTML5 multi-track audio or video

In the last months, we’ve been working hard at the WHATWG and W3C to spec out new HTML markup and a JavaScript interface for dealing with audio or video content that has more than just one audio and video track.

This is particularly relevant when a Web page author wants to add a sign language track to a video or audio resource for deaf people, or an audio description track (i.e. a sound track in which a speaker explains the key things that can be seen on screen) for blind people. It is also relevant when a Web page author wants to publish a video with multiple audio tracks that are each a different language dub for the video and can be used for less common cases such as a director’s comment track, or making available different camera angles for an event.

Just to be clear: this is not a means to introduce video editing functionality into the Web browser. If you want to do edits, you’re better off with an application that will eventually render a new piece of content and includes fancy transitions etc. Similarly, this is not a means to introduce mixing functionality (as in what DJs do when they play with multiple audio recordings). You’re better off with an actual audio mixing or DJ application that will provide you all sorts of amazing effects and filters.

So, multi-track is squarely focused on synchronizing alternative or additional tracks to a single resource with a single timeline to which all tracks are slaved.

Two means of publishing such multi-track media content are possible:

  • In-band multi-track
  • Synchronized resources

1. In-band multi-track

In in-band multi-track, there is a single file that has all all the tracks inside it. For this single file, there is now an API in HTML5 that allows addressing and controlling these tracks.

Of the video file formats that Web browsers support, WebM is currently not defined to contain more than one audio or video track. However, since WebM is using the Matroska container format, which supports multi-track, it is possible to extend WebM for multi-track resources. I have seen multitrack Ogg, MP4 and Matroska files in the wild and most media players support their display.

The specification that has gone into HTML5 to support in-band multi-track looks as follows:

interface HTMLMediaElement : HTMLElement {
  [...]
  // tracks
  readonly attribute AudioTrackList audioTracks;
  readonly attribute VideoTrackList videoTracks;
};

interface AudioTrackList : EventTarget {
  readonly attribute unsigned long length;
  getter AudioTrack (unsigned long index);
  AudioTrack? getTrackById(DOMString id);

           attribute EventHandler onchange;
           attribute EventHandler onaddtrack;
           attribute EventHandler onremovetrack;
};

interface AudioTrack {
  readonly attribute DOMString id;
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;
           attribute boolean enabled;
};

interface VideoTrackList : EventTarget {
  readonly attribute unsigned long length;
  getter VideoTrack (unsigned long index);
  VideoTrack? getTrackById(DOMString id);
  readonly attribute long selectedIndex;

           attribute EventHandler onchange;
           attribute EventHandler onaddtrack;
           attribute EventHandler onremovetrack;
};

interface VideoTrack {
  readonly attribute DOMString id;
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;
           attribute boolean selected;
};

You will notice that every audio and video track gets an index to address them. You can enable and disable individual audio tracks (via the enabled attribute) and you can select a single video track for display (via the selectedIndex attribute). This means that one or more audio tracks can be active at the same time (e.g. main audio and audio description), but only one video track will be active at a time (e.g. main video or sign language).

Through the id, kind, label and language attributes you can find out more about what actual content is available in the individual tracks so as to activate/deactivate them correctly and display the right information about them.

kind identifies the type of content that the track exposes such as “description” (for audio description), “sign” (for sign language), “main” (for the default displayed track), “translation” (for a dubbed audio track), and “alternative” (for an alternative to the default track).

label provides a human readable string that describes the content of the track aiming to be used in a menu.

id provides a short machine-readable string that can be used to construct a media fragment URI for the track. The use case for this will be discussed later.

language provides a machine-readable language code to identify which language is spoken or signed in an audio or sign language video track.

Example 1:

The following uses a video file that has a main video track, a main audio track in English and French, and an audio description track in English and French. (It likely also has caption tracks, but we will ignore text tracks for now.) This code sample switches the French audio tracks on and all other audio tracks off.

<video id="v1" poster=“video.png” controls>
 <source src=“video.ogv” type=”video/ogg”>
 <source src=“video.mp4” type=”video/mp4”>
</video>

<script type="text/javascript">
video = document.getElementsByTagName("video")[0];

for (i=0; i < video.audioTracks.length; i++) {
  if (video.audioTracks[i].language.substring(0,2) === "fr") {
    video.audioTracks[i].enabled = true;
  } else {
    video.audioTracks[i].enabled = false;
  }
}
</script>

Example 2:

The following uses a audio file that has a main audio track in English, no main video track, but sign language video tracks in ASL (American Sign Language), BSL (British Sign Language), and ASF (Australian Sign Language). This code sample switches the Australian sign language track on and all other video tracks off.

<video id="a1" controls>
 <source src=“audio_sign.ogg” type=”video/ogg”>
 <source src=“audio_sign.mp4” type=”video/mp4”>
</video>

<script type="text/javascript">
video = document.getElementsByTagName("video")[0];

for (i=0; i< video.videoTracks.length; i++) {
  if (video.videoTracks[i].language === 'sgn-asf') {
    video.videoTracks[i].selected = true;
  } else {
    video.videoTracks[i].selected = false;
  }
}
</script>

If you have more tracks in both examples that conflict with your intentions, you may need to further filter your activation / deactivation code using the kind attribute.

2. Synchronized resources

Sometimes the production process of media creates not a single resource with multiple contained tracks, but multiple resources that all share the same timeline. This is particularly useful for the Web, because it means the user can download only the required resources, typically saving a substantial amount of bandwidth.

For this situation, an attribute called @mediagroup can be added in markup to slave multiple media elements together. This is administrated in the JavaScript API through a MediaController object, which provides events and attributes for the combined multi-track object.

The new IDL interfaces for HTMLMediaElement are as follows:

interface HTMLMediaElement : HTMLElement {
  [...]
  // media controller
           attribute DOMString mediaGroup;
           attribute MediaController? controller;
};

enum MediaControllerPlaybackState { "waiting", "playing", "ended" };
[Constructor]
interface MediaController : EventTarget {
  readonly attribute unsigned short readyState; // uses HTMLMediaElement.readyState's values

  readonly attribute TimeRanges buffered;
  readonly attribute TimeRanges seekable;
  readonly attribute unrestricted double duration;
           attribute double currentTime;

  readonly attribute boolean paused;
  readonly attribute MediaControllerPlaybackState playbackState;
  readonly attribute TimeRanges played;
  void pause();
  void unpause();
  void play(); // calls play() on all media elements as well

           attribute double defaultPlaybackRate;
           attribute double playbackRate;

           attribute double volume;
           attribute boolean muted;

           attribute EventHandler onemptied;
           attribute EventHandler onloadedmetadata;
           attribute EventHandler onloadeddata;
           attribute EventHandler oncanplay;
           attribute EventHandler oncanplaythrough;
           attribute EventHandler onplaying;
           attribute EventHandler onended;
           attribute EventHandler onwaiting;

           attribute EventHandler ondurationchange;
           attribute EventHandler ontimeupdate;
           attribute EventHandler onplay;
           attribute EventHandler onpause;
           attribute EventHandler onratechange;
           attribute EventHandler onvolumechange;
};

You will notice that the MediaController replicates some of the states and events of the slave media elements. In general the approach is that the attributes represent the summary state from all the elements and the writable attributes when set are handed through to all the slave elements.

Importantly, if the individual media elements have @controls activated, then the displayed controls interact with the MediaController thus allowing synchronized playback and interaction with the combined multi-track object.

Example 3:

The following uses a video file that has a main video track, a main audio track in English. There is another video file with the ASL sign language for the video, and an audio file with the audio description in English. This code sample creates controls on the first file, which then also control the audio description and the sign language video, neither of which have controls. Since the audio description doesn’t have controls, it doesn’t get visually displayed. The sign language video will just sit next to the main video without controls.

<video id="v1" poster=“video.png” controls mediagroup="a11y_vid">
 <source src=“video.webm” type=”video/webm”>
 <source src=“video.mp4” type=”video/mp4”>
</video>

<video id="v2" poster=“sign.png” mediagroup="a11y_vid">
 <source src=“sign.webm” type=”video/webm”>
 <source src=“sign.mp4” type=”video/mp4”>
</video>

<audio id="a1" mediagroup="a11y_vid">
 <source src=“audio.ogg” type=”audio/ogg”>
 <source src=“audio.mp3” type=”audio/mp3”>
</audio>

Example 4:

We now accompany a main video with three sign language video tracks in ASL, BSL and ASF. We could just do this in JavaScript and replace the currentSrc of a second video element with the links to BSL and ASF as required, but then we need to run our own media controls to list the available tracks. So, instead, we create a video element for each one of the tracks and use CSS to remove the inactive ones from the page layout. The code sample activates the ASF track and deactivates the other sign language tracks.

<style>
  video.inactive { display: none; }
</style>

<video id="v1" poster=“video.png” controls mediagroup="a11y_vid" class="inactive">
 <source src=“video.webm” type=”video/webm”>
 <source src=“video.mp4” type=”video/mp4”>
</video>

<video id="v2" poster=“sign_asl.png” mediagroup="a11y_vid" >
 <source src=“sign_asl.webm” type=”video/webm”>
 <source src=“sign_asl.mp4” type=”video/mp4”>
</video>

<video id="v3" poster=“sign_bsl.png” mediagroup="a11y_vid" class="inactive">
 <source src=“sign_bsl.webm” type=”video/webm”>
 <source src=“sign_bsl.mp4” type=”video/mp4”>
</video>

<video id="v4" poster=“sign_asf.png” mediagroup="a11y_vid" class="inactive">
 <source src=“sign_asf.webm” type=”video/webm”>
 <source src=“sign_asf.mp4” type=”video/mp4”>
</video>

<script type="text/javascript">
videos = document.getElementsByTagName("video");

for (i=0; i < videos.length; i++) {
  if (videos[i].currentSrc.match(/asf/g).length > 0) {
    videos[i].class = "";
  } else {
    videos[i].class = "active";
  }
}
</script>

Example 5:

In this final example we look at what to do when we have a in-band multi-track resource with multiple video tracks that should all be displayed on screen. This is not a simple problem to solve because a video element is only allowed to display a single video track at a time. Therefore for this problem you need to use both approaches: in-band and synchronized resources.

We take a in-band multitrack resource with a main video and audio track and three sign language tracks in ASL, BSL and ASF. The second resource will be made up from the URI of the first resource with a media fragment address of the sign language tracks. (If required, these can be discovered using the getID() function on the first resource.) The markup will look as follows:

<video id="v1" poster=“video.png” controls mediagroup="a11y_vid">
 <source src=“video.ogv#track=v_main&track=a_main” type=”video/ogv”>
 <source src=“video.mp4#track=v_main&track=a_main” type=”video/mp4”>
</video>

<video id="v2" poster=“sign.png” controls mediagroup="a11y_vid">
 <source src=“video.ogv#track=asl&track=bsl&track=asf” type=”video/ogv”>
 <source src=“video.mp4#track=asl&track=bsl&track=asf” type=”video/mp4”>
</video>

Note that with multiple video elements you can always style them in the way that you want them displayed on screen. E.g. if you want a picture-in-picture display, you scale the second video down and absolutely position it on top of the first one in the appropriate location. You can even grab the second video into a canvas, chroma-key your sign language speaker on a green or blue screen and remove that background through some canvas processing before popping it on top of the video.

The world is all yours!

HOWEVER: There is one big caveat on all these specs – while they have all found entry into the HTML5 specification, it would be expecting a bit much to have browser support already. 🙂

UPDATE 23 July 2014: I’ve just changed this to use the latest spec, which should also at least partially be implemented already.