ginger's thoughts

Silvia's blog

Tag: audio

HTML5 multi-track audio or video

In the last months, we’ve been working hard at the WHATWG and W3C to spec out new HTML markup and a JavaScript interface for dealing with audio or video content that has more than just one audio and video track.

This is particularly relevant when a Web page author wants to add a sign language track to a video or audio resource for deaf people, or an audio description track (i.e. a sound track in which a speaker explains the key things that can be seen on screen) for blind people. It is also relevant when a Web page author wants to publish a video with multiple audio tracks that are each a different language dub for the video and can be used for less common cases such as a director’s comment track, or making available different camera angles for an event.

Just to be clear: this is not a means to introduce video editing functionality into the Web browser. If you want to do edits, you’re better off with an application that will eventually render a new piece of content and includes fancy transitions etc. Similarly, this is not a means to introduce mixing functionality (as in what DJs do when they play with multiple audio recordings). You’re better off with an actual audio mixing or DJ application that will provide you all sorts of amazing effects and filters.

So, multi-track is squarely focused on synchronizing alternative or additional tracks to a single resource with a single timeline to which all tracks are slaved.

Two means of publishing such multi-track media content are possible:

  • In-band multi-track
  • Synchronized resources

1. In-band multi-track

In in-band multi-track, there is a single file that has all all the tracks inside it. For this single file, there is now an API in HTML5 that allows addressing and controlling these tracks.

Of the video file formats that Web browsers support, WebM is currently not defined to contain more than one audio or video track. However, since WebM is using the Matroska container format, which supports multi-track, it is possible to extend WebM for multi-track resources. I have seen multitrack Ogg, MP4 and Matroska files in the wild and most media players support their display.

The specification that has gone into HTML5 to support in-band multi-track looks as follows:

``` interface HTMLMediaElement : HTMLElement { [...] // tracks readonly attribute AudioTrackList audioTracks; readonly attribute VideoTrackList videoTracks; }; interface AudioTrackList : EventTarget { readonly attribute unsigned long length; getter AudioTrack (unsigned long index); AudioTrack? getTrackById(DOMString id); attribute EventHandler onchange; attribute EventHandler onaddtrack; attribute EventHandler onremovetrack; }; interface AudioTrack { readonly attribute DOMString id; readonly attribute DOMString kind; readonly attribute DOMString label; readonly attribute DOMString language; attribute boolean enabled; }; interface VideoTrackList : EventTarget { readonly attribute unsigned long length; getter VideoTrack (unsigned long index); VideoTrack? getTrackById(DOMString id); readonly attribute long selectedIndex; attribute EventHandler onchange; attribute EventHandler onaddtrack; attribute EventHandler onremovetrack; }; interface VideoTrack { readonly attribute DOMString id; readonly attribute DOMString kind; readonly attribute DOMString label; readonly attribute DOMString language; attribute boolean selected; }; ```

You will notice that every audio and video track gets an index to address them. You can enable and disable individual audio tracks (via the enabled attribute) and you can select a single video track for display (via the selectedIndex attribute). This means that one or more audio tracks can be active at the same time (e.g. main audio and audio description), but only one video track will be active at a time (e.g. main video or sign language).

Through the id, kind, label and language attributes you can find out more about what actual content is available in the individual tracks so as to activate/deactivate them correctly and display the right information about them.

kind identifies the type of content that the track exposes such as “description” (for audio description), “sign” (for sign language), “main” (for the default displayed track), “translation” (for a dubbed audio track), and “alternative” (for an alternative to the default track).

label provides a human readable string that describes the content of the track aiming to be used in a menu.

id provides a short machine-readable string that can be used to construct a media fragment URI for the track. The use case for this will be discussed later.

language provides a machine-readable language code to identify which language is spoken or signed in an audio or sign language video track.

Example 1:

The following uses a video file that has a main video track, a main audio track in English and French, and an audio description track in English and French. (It likely also has caption tracks, but we will ignore text tracks for now.) This code sample switches the French audio tracks on and all other audio tracks off.

``` <video id="v1" poster=“video.png” controls> <source src=“video.ogv” type=”video/ogg”> <source src=“video.mp4” type=”video/mp4”> </video> <script type="text/javascript"> video = document.getElementsByTagName("video")[0]; for (i=0; i < video.audioTracks.length; i++) { if (video.audioTracks[i].language.substring(0,2) === "fr") { video.audioTracks[i].enabled = true; } else { video.audioTracks[i].enabled = false; } } </script> ```

Example 2:

The following uses a audio file that has a main audio track in English, no main video track, but sign language video tracks in ASL (American Sign Language), BSL (British Sign Language), and ASF (Australian Sign Language). This code sample switches the Australian sign language track on and all other video tracks off.

``` <video id="a1" controls> <source src=“audio_sign.ogg” type=”video/ogg”> <source src=“audio_sign.mp4” type=”video/mp4”> </video> <script type="text/javascript"> video = document.getElementsByTagName("video")[0]; for (i=0; i< video.videoTracks.length; i++) { if (video.videoTracks[i].language === 'sgn-asf') { video.videoTracks[i].selected = true; } else { video.videoTracks[i].selected = false; } } </script> ```

If you have more tracks in both examples that conflict with your intentions, you may need to further filter your activation / deactivation code using the kind attribute.

2. Synchronized resources

Sometimes the production process of media creates not a single resource with multiple contained tracks, but multiple resources that all share the same timeline. This is particularly useful for the Web, because it means the user can download only the required resources, typically saving a substantial amount of bandwidth.

For this situation, an attribute called @mediagroup can be added in markup to slave multiple media elements together. This is administrated in the JavaScript API through a MediaController object, which provides events and attributes for the combined multi-track object.

The new IDL interfaces for HTMLMediaElement are as follows:

``` interface HTMLMediaElement : HTMLElement { [...] // media controller attribute DOMString mediaGroup; attribute MediaController? controller; }; enum MediaControllerPlaybackState { "waiting", "playing", "ended" }; [Constructor] interface MediaController : EventTarget { readonly attribute unsigned short readyState; // uses HTMLMediaElement.readyState's values readonly attribute TimeRanges buffered; readonly attribute TimeRanges seekable; readonly attribute unrestricted double duration; attribute double currentTime; readonly attribute boolean paused; readonly attribute MediaControllerPlaybackState playbackState; readonly attribute TimeRanges played; void pause(); void unpause(); void play(); // calls play() on all media elements as well attribute double defaultPlaybackRate; attribute double playbackRate; attribute double volume; attribute boolean muted; attribute EventHandler onemptied; attribute EventHandler onloadedmetadata; attribute EventHandler onloadeddata; attribute EventHandler oncanplay; attribute EventHandler oncanplaythrough; attribute EventHandler onplaying; attribute EventHandler onended; attribute EventHandler onwaiting; attribute EventHandler ondurationchange; attribute EventHandler ontimeupdate; attribute EventHandler onplay; attribute EventHandler onpause; attribute EventHandler onratechange; attribute EventHandler onvolumechange; }; ```

You will notice that the MediaController replicates some of the states and events of the slave media elements. In general the approach is that the attributes represent the summary state from all the elements and the writable attributes when set are handed through to all the slave elements.

Importantly, if the individual media elements have @controls activated, then the displayed controls interact with the MediaController thus allowing synchronized playback and interaction with the combined multi-track object.

Example 3:

The following uses a video file that has a main video track, a main audio track in English. There is another video file with the ASL sign language for the video, and an audio file with the audio description in English. This code sample creates controls on the first file, which then also control the audio description and the sign language video, neither of which have controls. Since the audio description doesn’t have controls, it doesn’t get visually displayed. The sign language video will just sit next to the main video without controls.

``` <video id="v1" poster=“video.png” controls mediagroup="a11y_vid"> <source src=“video.webm” type=”video/webm”> <source src=“video.mp4” type=”video/mp4”> </video> <video id="v2" poster=“sign.png” mediagroup="a11y_vid"> <source src=“sign.webm” type=”video/webm”> <source src=“sign.mp4” type=”video/mp4”> </video> <audio id="a1" mediagroup="a11y_vid"> <source src=“audio.ogg” type=”audio/ogg”> <source src=“audio.mp3” type=”audio/mp3”> </audio> ```

Example 4:

We now accompany a main video with three sign language video tracks in ASL, BSL and ASF. We could just do this in JavaScript and replace the currentSrc of a second video element with the links to BSL and ASF as required, but then we need to run our own media controls to list the available tracks. So, instead, we create a video element for each one of the tracks and use CSS to remove the inactive ones from the page layout. The code sample activates the ASF track and deactivates the other sign language tracks.

``` <style> video.inactive { display: none; } </style> <video id="v1" poster=“video.png” controls mediagroup="a11y_vid" class="inactive"> <source src=“video.webm” type=”video/webm”> <source src=“video.mp4” type=”video/mp4”> </video> <video id="v2" poster=“sign_asl.png” mediagroup="a11y_vid" > <source src=“sign_asl.webm” type=”video/webm”> <source src=“sign_asl.mp4” type=”video/mp4”> </video> <video id="v3" poster=“sign_bsl.png” mediagroup="a11y_vid" class="inactive"> <source src=“sign_bsl.webm” type=”video/webm”> <source src=“sign_bsl.mp4” type=”video/mp4”> </video> <video id="v4" poster=“sign_asf.png” mediagroup="a11y_vid" class="inactive"> <source src=“sign_asf.webm” type=”video/webm”> <source src=“sign_asf.mp4” type=”video/mp4”> </video> <script type="text/javascript"> videos = document.getElementsByTagName("video"); for (i=0; i < videos.length; i++) { if (videos[i].currentSrc.match(/asf/g).length > 0) { videos[i].class = ""; } else { videos[i].class = "active"; } } </script> ```

Example 5:

In this final example we look at what to do when we have a in-band multi-track resource with multiple video tracks that should all be displayed on screen. This is not a simple problem to solve because a video element is only allowed to display a single video track at a time. Therefore for this problem you need to use both approaches: in-band and synchronized resources.

We take a in-band multitrack resource with a main video and audio track and three sign language tracks in ASL, BSL and ASF. The second resource will be made up from the URI of the first resource with a media fragment address of the sign language tracks. (If required, these can be discovered using the getID() function on the first resource.) The markup will look as follows:

``` <video id="v1" poster=“video.png” controls mediagroup="a11y_vid"> <source src=“video.ogv#track=v_main&track=a_main” type=”video/ogv”> <source src=“video.mp4#track=v_main&track=a_main” type=”video/mp4”> </video> <video id="v2" poster=“sign.png” controls mediagroup="a11y_vid"> <source src=“video.ogv#track=asl&track=bsl&track=asf” type=”video/ogv”> <source src=“video.mp4#track=asl&track=bsl&track=asf” type=”video/mp4”> </video> ```

Note that with multiple video elements you can always style them in the way that you want them displayed on screen. E.g. if you want a picture-in-picture display, you scale the second video down and absolutely position it on top of the first one in the appropriate location. You can even grab the second video into a canvas, chroma-key your sign language speaker on a green or blue screen and remove that background through some canvas processing before popping it on top of the video.

The world is all yours!

HOWEVER: There is one big caveat on all these specs - while they have all found entry into the HTML5 specification, it would be expecting a bit much to have browser support already. :-)

UPDATE 23 July 2014: I’ve just changed this to use the latest spec, which should also at least partially be implemented already.

Talk at Web Directions South, Sydney: HTML5 audio and video

On 14th October I gave a talk at Web Directions South on “HTML5 audio and video - using these exciting new elements in practice”.

I wanted to give people an introduction into how to use these elements while at the same time stirring their imagination as to the design possibilities now that these elements are available natively in browsers. I re-used some of the demos that I have put together for the book that I am currently writing, added some of the cool stuff that others have done and finished off with an outlook towards what new features will probably arrive next.

“Slides” are now available, which are really just a Web page with some demos that work in modern browsers.

Table of contents:

HTML5 Audio and Video

  1. Cross browser
  2. Cross browser
  3. Encoding
  4. Fallback considerations
  5. CSS and
  6. audio plans

Your metadata is not my metadata

Over the last two days we had the Open Subtitles Summit here in New York. It was very exciting to feel the energy in the room to make a change to media accessibility - I am sure we will see much development over the next 12 months. We spoke much about HTML5 video and standards and had many discussions about subtitles, captions, and other accessibility information.

On Wednesday we had a discussion about metadata and I quickly realized that “your metadata is not my metadata”: everyone used the word for something different. So, I suggested to have a metadata discussion on Thursday where we would put a structure onto all of this, identify what kinds of metadata we have and whether and how it should be supported in HTML5 standards.

Our basic findings are very simple and widely accepted. There are three fundamentally different types of metadata:

  • Technical metadata about video: information about the format of the resource - things that can be determined automatically and are non-controversial, such as the width, height, framerate, audio sample rate etc. This information can be used to, e.g. decide if a video is appropriate for a certain device.
  • Semantic metadata about video: semantic information about the video resource - e.g. license, author, publication date, version, attribution, title, description. This information is good for search and identification.
  • Timed semantic metadata: semantic information that is associated with time intervals of the video, not with the full video - e.g. active speaker, location, date-time, objects.

As we talked about this further, however, we identified subclasses of these generic types that are very important to identify because they will be handled differently.

We found that semantic metadata can be separated into universal metadata and domain-specific metadata. Universal metadata is semantic metadata that can basically be applied to any content. There is very little of that and the W3C Media Annotations WG has done a pretty good job in identifying it. Domain-specific metadata is such metadata that only applies to some content, e.g. all the videos about sports have metadata such as game scores, players, or type of sport.

As for adding such metadata into media resources, we discussed that it makes sense to have the universal metadata explicitly spelled out and to have a generic means to associate name-value pairs with resource. Of course it will all be stored in databases, but there was also a requirement to have it encoded into the media resource - and in our discussion case: into external captions or subtitle files.

As for timed metadata - it is possible to separate this into metadata that is only relevant as part of a subtitle or caption file, because the metadata relates to a certain word or a word sequence, and into independent timed metadata that can be stored in, e.g. JSON or some similar format.

Since we are particularly interested in subtitles and captions, the timed metadata that is associated with words or word sequences is particularly important. The most natural metadata that is useful as part of subtitles is of course speaker segmentation. We also identified that hyperlinks to related content are just as important, since it can enable applications such as popcorn.js.

Potentially there is a use for metadata association with any sequence of words in a caption or subtitle, which could be satisfied with the use of a generic markup element for a sequence of words, such that microdata or RDFa may get associated. A request for such a generic means of associating metadata was made. However, the need for it still has to be confirmed with good use cases - the breakout group was out of time as we came to this point. So, leave your ideas for use cases in the requirements - they will help shape standards.

Upcoming conferences / workshops

Lots is happening in open source multimedia land in the next few months.

Check out these cool upcoming conferences / workshops / miniconfs…

September 29th and 30th, New York Open Subtitles Design Summit October 1st and 2nd, New York Open Video Conference

October 3rd and 4th, New York Foundations of Open Media Software Developer Workshop

January 24/25th, Brisbane, Australia LCA Multimedia Miniconf

W3C Media Annotations API standard

Recently, I was asked to review the W3C Media Annotations specifications as they are about to go into Last Call (a state that comes before the request for implementations at the W3C).

The W3C Media Annotations group has defined a set of metadata that they believe is representative and common for media resources. The ontology consist of the following fields:

  • ma:identifier: a URI or string to identify a resource
  • ma:title: a string providing the title of the resource
  • ma:language: a language code describing the language used in the resource
  • ma:locator: the URI at which the resource can be accessed
  • ma:contributor: a URI or string identifying the contributor and the nature of the contribution
  • ma:creator: a URI or string identifying an author
  • ma:createDate: a date of creation or publication of the resource
  • ma:location: a string or geo code identifying where the resource has been shot/recorded
  • ma:description: a string describing the content of the resource
  • ma:keyword: a word or word combination providing a topic, keyword or tag representing the resource
  • ma:genre: a string providing the genre of the resource
  • ma:rating: rating value, including the rating scale
  • ma:relation: a URI and string identifying a related resource and the relationship
  • ma:collection: a URI or string providing the name of a collection to which the resource belongs
  • ma:copyright: a URI or string with the copyright statement.
  • ma:license: a string or URI with the usage license
  • ma:publisher: a string or URI with the publisher of the resource
  • ma:targetAudience: a URI and classification string providing the issuer of the classification and the classification value
  • ma:fragments: a list of string and URI values that identify media fragments and their type
  • ma:namedFragments: a list of string and URI values the provide names to media fragments
  • ma:frameSize: a width - height pair in pixels
  • ma:compression: a string providing the compression algorithm
  • ma:duration: a float to provide the resource duration in seconds
  • ma:format String: the mime type of the resource
  • ma:samplingrate: a float with the audio sampling rate
  • ma:framerate: a float with the video frame rate
  • ma:bitrate: a float providing the average bit rate in kbps
  • ma:numTracks: an int of the number of tracks

Note that some of these fields are not single values, but simple constructs of multiple values. Thus, they are actually more complex than name-value pairs that, e.g. are typically used in HTML meta headers or in Dublin Core. I regard this as an issue for implementations.

The fields were chosen as typical metadata being available about media resources. The media fragments fields are a bit dubious in this respect, but could be useful in future.

The metadata is determined either from within the resource itself or from a metadata collection about the resource. As such, the document maps several existing metadata and media resource formats to this interface, amongst them:

As they didn’t have a mapping table for Ogg content, I offered the following:

MAWGRelationOgg propertiesHow to do the mappingDatatype
Descriptive Properties (Core Set)
Identification
ma:identifierexactNameName field in skeleton header (new)String
ma:titleexactTitleTITLE field in vorbiscomment headerString
exactTitleTitle field in skeleton header (new)String
relatedAlbumALBUM title in vorbiscomment headerString
ma:languageexactLanguageLanguage field in skeleton header (new)language code
ma:locatorexactfile URI from systemURI
Creation
ma:contributorexactArtist, PerformerARTIST and PERFORMER vorbiscomment headersStrings
ma:creatorrelatedOrganizationORGANIZATION field in vorbiscomment header
ma:createDateexactDateDATE field in vorbiscomment headerISO date format
ma:locationexactLocationLOCATION field in vorbiscomment headerString
Content description
ma:descriptionexactDescriptionDESCRIPTION field in vorbiscomment headerString
ma:keywordN/A
ma:genreexactGenreGENRE field in vorbiscomment headerString
ma:ratingN/A
Relational
ma:relationrelatedVersion, TracknumberVERSION (version of a title), TRACKNUMBER (CD track) fields in vorbiscomment headerStrings
ma:collectionrelatedAlbumALBUM field of vorbiscomment headerString
Rights
ma:copyrightexactCopyrightCOPYRIGHT field of vorbiscomment headerString
ma:licenseexactLicenseLICENSE field of vorbiscomment headerString
Distribution
ma:publisherrelatedOrganizationORGNIZATION field of vorbiscomment headerString
ma:targetAudiencemore specificRoleRole field of Skeleton header (new)String
Fragments
ma:fragmentsN/A
ma:namedFragmentsN/A
Technical Properties
ma:frameSizeexactextract from binary header of video trackint, int (width x height)
ma:compressionexactContent-typeContent-type field of Skeleton headerMIME type
ma:durationexactcalculate as duration = last_sample_time - first_sample_time of OggIndex header of skeletonFloat (or rather: rational - rational)
ma:formatexactContent-typeContent-type field of Skeleton headerMIME type
ma:samplingrateexactcalculate as granulerate = granulerate_numerator / granulerate_denominator of Skeleton headerRational (or rather int / int)
ma:framerateexactcalculate as granulerate = granulerate_numerator / granulerate_denominator of Skeleton headerRational (or rather int / int)
ma:bitrateexactcalculate as bitrate = length_of_segment / duration from OggIndex headers of skeletonFloat
ma:numTracksexactTracknumberTRACKNUMBER field of vorbiscomment header (track number on album)Int

You will notice that the table mentions 4 fields in skeleton with a “new” marker - they are actually proposed fields in skeleton - a bit of coding will be necessary to introduce them into software. The space for these fields already exists in message header fields, so it won’t require a change of the skeleton format.

In the second specification of the Media Annotations WG, the group offers a standard API to access (i.e. read) the defined fields. They also intend to create an API to write the fields, but I doubt that will be easy because of the vast amount of file types they intend to support.

There is basically a single function that allows the extraction of metadata: MAObject[] getProperty(in DOMString propertyName, in optional DOMString sourceFormat, in optional DOMString subtype, in optional DOMString language, in optional DOMString fragment );

I proposed it may be possible to include this into HTML5 as follows: interface HTMLMediaElement : HTMLElement { ... getter MAObject getProperty(in DOMString propertyName, in optional unsigned long trackIndex); ... }

This would either extract the property for a particular track in a media resource or for the complete resource if no track index is given. The only problem I see is that the returned object is different depending on the requested property - the MAObject is only a parent class for the returned object types. I am not sure it is therefore possible to specify this easily in HTML5.

Overall I thought the specification was a nice piece of work. I am not sure I agree with all the chosen fields, but that is always an issue with metadata. The most important fields are there and that’s what matters.