Tag Archives: html5 media

AppRTC : Google’s WebRTC test app and its parameters

If you’ve been interested in WebRTC and haven’t lived under a rock, you will know about Google’s open source testing application for WebRTC: AppRTC.

When you go to the site, a new video conferencing room is automatically created for you and you can share the provided URL with somebody else and thus connect (make sure you’re using Google Chrome, Opera or Mozilla Firefox).

We’ve been using this application forever to check whether any issues with our own WebRTC applications are due to network connectivity issues, firewall issues, or browser bugs, in which case AppRTC breaks down, too. Otherwise we’re pretty sure to have to dig deeper into our own code.

Now, AppRTC creates a pretty poor quality video conference, because the browsers use a 640×480 resolution by default. However, there are many query parameters that can be added to the AppRTC URL through which the connection can be manipulated.

Here are my favourite parameters:

  • hd=true : turns on high definition, ie. minWidth=1280,minHeight=720
  • stereo=true : turns on stereo audio
  • debug=loopback : connect to yourself (great to check your own firewalls)
  • tt=60 : by default, the channel is closed after 30min – this gives you 60 (max 1440)

For example, here’s how a stereo, HD loopback test would look like: https://apprtc.appspot.com/?r=82313387&hd=true&stereo=true&debug=loopback .

This is not the limit of the available parameter, though. Here are some others that you may find interesting for some more in-depth geekery:

  • ss=[stunserver] : in case you want to test a different STUN server to the default Google ones
  • ts=[turnserver] : in case you want to test a different TURN server to the default Google ones
  • tp=[password] : password for the TURN server
  • audio=true&video=false : audio-only call
  • audio=false : video-only call
  • audio=googEchoCancellation=false,googAutoGainControl=true : disable echo cancellation and enable gain control
  • audio=googNoiseReduction=true : enable noise reduction (more Google-specific parameters)
  • asc=ISAC/16000 : preferred audio send codec is ISAC at 16kHz (use on Android)
  • arc=opus/48000 : preferred audio receive codec is opus at 48kHz
  • dtls=false : disable datagram transport layer security
  • dscp=true : enable DSCP
  • ipv6=true : enable IPv6

AppRTC’s source code is available here. And here is the file with the parameters (in case you want to check if they have changed).

Have fun playing with the main and always up-to-date WebRTC application: AppRTC.

UPDATE 12 May 2014

AppRTC now also supports the following bitrate controls:

  • arbr=[bitrate] : set audio receive bitrate
  • asbr=[bitrate] : set audio send bitrate
  • vsbr=[bitrate] : set video receive bitrate
  • vrbr=[bitrate] : set video send bitrate

Example usage: https://apprtc.appspot.com/?r=&asbr=128&vsbr=4096&hd=true

Video Conferencing in HTML5: WebRTC via Socket.io

Six months ago I experimented with Web sockets for WebRTC and the early implementations of PeerConnection in Chrome. Last week I gave a presentation about WebRTC at Linux.conf.au, so it was time to update that codebase.

I decided to use socket.io for the signalling following the idea of Luc, which made the server code even smaller and reduced it to a mere reflector:

 var app = require('http').createServer().listen(1337);
 var io = require('socket.io').listen(app);

 io.sockets.on('connection', function(socket) {
         socket.on('message', function(message) {
         socket.broadcast.emit('message', message);
     });
 });

Then I turned to the client code. I was surprised to see the massive changes that PeerConnection has gone through. Check out my slide deck to see the different components that are now necessary to create a PeerConnection.

I was particularly surprised to see the SDP object now fully exposed to JavaScript and thus the ability to manipulate it directly rather than through some API. This allows Web developers to manipulate the type of session that they are asking the browsers to set up. I can imaging e.g. if they have support for a video codec in JavaScript that the browser does not provide built-in, they can add that codec to the set of choices to be offered to the peer. While it is flexible, I am concerned if this might create more problems than it solves. I guess we’ll have to wait and see.

I was also surprised by the need to use ICE, even though in my experiment I got away with an empty list of ICE servers – the ICE messages just got exchanged through the socket.io server. I am not sure whether this is a bug, but I was very happy about it because it meant I could run the whole demo on a completely separate network from the Internet.

The most exciting news since my talk is that Mozilla and Google have managed to get a PeerConnection working between Firefox and Chrome – this is the first cross-browser video conference call without a plugin! The code differences are minor.

Since the specification of the WebRTC API and of the MediaStream API are now official Working Drafts at the W3C, I expect other browsers will follow. I am also looking forward to the possibilities of:

The best places to learn about the latest possibilities of WebRTC are webrtc.org and the W3C WebRTC WG. code.google.com has open source code that continues to be updated to the latest released and interoperable features in browsers.

The video of my talk is in the process of being published. There is a MP4 version on the Linux Australia mirror server, but I expect it will be published properly soon. I will update the blog post when that happens.

HTML5 Video Presentations at LCA 2011

Working in the WHAT WG and the W3C HTML WG, you sometimes forget that all the things that are being discussed so heatedly for standardization are actually leading to some really exciting new technologies that not many outside have really taken note of yet.

This week, during the Australian Linux Conference in Brisbane, I’ve been extremely lucky to be able to show off some awesome new features that browser vendors have implemented for the audio and video elements. The feedback that I got from people was uniformly plain surprise – nobody expected browser to have all these capabilities.

The examples that I showed off have mostly been the result of working on a book for almost 9 months of the past year and writing lots of examples of what can be achieved with existing implementations and specifications. They have been inspired by diverse demos that people made in the last years, so the book is linking to many more and many more amazing demos.

Incidentally, I promised to give a copy of the book away to the person with the best idea for a new Web application using HTML5 media. Since we ran out of time, please shoot me an email or a tweet (@silviapfeiffer) within the next 4 weeks and I will send another copy to the person with the best idea. The copy that I brought along was given to a student who wanted to use HTML5 video to display on surfaces of 3D moving objects.

So, let’s get to the talks.

On Monday, I gave a presentation on “Audio and Video processing in HTML5“, which had a strong focus on the Mozilla Audio API.

I further gave a brief lightning talk about “HTML5 Media Accessibility Update“. I am expecting lots to happen on this topic during this year.

Finally, I gave a presentation today on “The Latest and Coolest in HTML5 Media” with a strong focus on video, but also touching on audio and media accessibility.

The talks were streamed live – congrats to Ryan Verner for getting this working with support from Ben Hutchings from DebConf and the rest of the video team. The videos will apparently be available from http://linuxconfau.blip.tv/ in the near future.

UPDATE 4th Feb 2011: And here is my LCA talk …

with subtitles on YouTube:

WebSRT and HTML5 media accessibility

On 23rd July, Ian Hickson, the HTML5 editor, posted an update to the WHATWG mailing list introducing the first draft of a platform for accessibility for the HTML5 <video> element. The platform provides for captions, subtitles, audio descriptions, chapter markers and similar time-synchronized text both in-band (inside the video resource) and out-of-band (as external text files). Right now, the proposal only regards <video>, but I personally believe the same can be applied to the <audio> element, except we have to be a bit more flexible with the rendering approach. Anyway…

What I want to do here is to summarize what was introduced, together with the improvements that I and some others have proposed in follow-up emails, and list some of the media accessibility needs that we are not yet dealing with.

For those wanting to only selectively read some sections, here is a clickable table of contents of this rather long blog post:

THE WebSRT TIMED TEXT FORMAT

The first and to everyone probably most surprising part is the new file format that is being proposed to contain out-of-band time-synchronized text for video. A new format was necessary after the analysis of all relevant existing formats determined that they were either insufficient or hard to use in a Web environment.

The new format is called WebSRT and is an extension to the existing SRT SubRip format. It is actually also the part of the new specification that I am personally most uncomfortable with. Not that WebSRT is a bad format. It’s just not sufficient yet to provide all the functionality that a good time-synchronized text format for Web media should. Let’s look at some examples.

WebSRT is composed of a sequence of timed text cues (that’s what we’ve decided to call the pieces of text that are active during a certain time interval). Because of its ancestry of SRT, the text cues can optionally be numbered through. The content of the text cues is currently allowed to contain three different types of text: plain text, minimal markup, and anything at all (also called “metadata”).

In its most simple form, a WebSRT file is just an ordinary old SRT file with optional cue numbers and only plain text in cues:

  1
  00:00:15.00 --> 00:00:17.95
  At the left we can see...

  2
  00:00:18.16 --> 00:00:20.08
  At the right we can see the...

  3
  00:00:20.11 --> 00:00:21.96
  ...the head-snarlers

A bit of a more complex example results if we introduce minimal markup:

  00:00:15.00 --> 00:00:17.95 A:start
  Auf der <i>linken</i> Seite sehen wir...

  00:00:18.16 --> 00:00:20.08 A:end
  Auf der <b>rechten</b> Seite sehen wir die....

  00:00:20.11 --> 00:00:21.96 A:end
  <1>...die Enthaupter.

  00:00:21.99 --> 00:00:24.36 A:start
  <2>Alles ist sicher.
  Vollkommen <b>sicher</b>.

and add to this a CSS to provide for some colors and special formatting:

    ::cue { background: rgba(0,0,0,0.5); } 
    ::cue-part(1) { color: red; } 
    ::cue-part(2, b) { font-style: normal; text-decoration: underline; } 

Minimal markup accepts <i>, <b>, <ruby> and a timestamp in <>, providing for italics, bold, and ruby markup as well as karaoke timestamps. Any further styling can be done using the CSS pseudo-elements ::cue and ::cue-part, which accept the features ‘color’, ‘text-shadow’, ‘text-outline’, ‘background’, ‘outline’, and ‘font’.

Note that positioning requires some special notes at the end of the start/end timestamps which can provide for vertical text, line position, text position, size and alignment cue setting. Here is an example with vertically rendered Chinese text, right-aligned at 98% of the video frame:

  00:00:15.00 --> 00:00:17.95 A:start D:vertical L:98%
  在左边我们可以看到...

  00:00:18.16 --> 00:00:20.08 A:start D:vertical L:98%
  在右边我们可以看到...

  00:00:20.11 --> 00:00:21.96 A:start D:vertical L:98%
  ...捕蝇草械.

  00:00:21.99 --> 00:00:24.36 A:start D:vertical L:98%
  一切都安全.
  非常地安全.

Finally, WebSRT files can be authored with abstract metadata inside cues, which practically means anything at all. Here’s an example with HTML content:

  00:00:15.00 --> 00:00:17.95 A:start
  <img src="pic1.png"/>Auf der <i>linken</i> Seite sehen wir...

  00:00:18.16 --> 00:00:20.08 A:end
  <img src="pic2.png"/>Auf der <b>rechten</b> Seite sehen wir die....

  00:00:20.11 --> 00:00:21.96 A:end
  <img src="pic3.png"/>...die <a href="http://members.chello.nl/j.kassenaar/
elephantsdream/subtitles.html">Enthaupter</a>.

  00:00:21.99 --> 00:00:24.36 A:start
  <img src="pic4.png"/>Alles ist <mark>sicher</mark>.<br/>Vollkommen <b>sicher</b>.

Here is another example with JSON in the cues:

  00:00:00.00 --> 00:00:44.00
  {
    slide: intro.png,
    title: "Really Achieving Your Childhood Dreams" by Randy Pausch, 
             Carnegie Mellon University, Sept 18, 2007
  }

  00:00:44.00 --> 00:01:18.00
  {
    slide: elephant.png,
    title: The elephant in the room...
  }

  00:01:18.00 --> 00:02:05.00
  {
    slide: denial.png,
    title: I'm not in denial...
  }

What I like about WebSRT:

  1. it allows for all sorts of different content in the text cues – plain text is useful for texted audio descriptions, minimal markup is useful for subtitles, captions, karaoke and chapters, and “metadata” is useful for, well, any data.
  2. it can be easily encapsulated into media resources and thus turned into in-band tracks by regarding each cue as a data packet with time stamps.
  3. it is not verbose

Where I think WebSRT still needs improvements:

  1. break with the SRT history: since WebSRT and SRT files are so different, WebSRT should get its own MIME type, e.g. text/websrt, and file extensions, e.g. .wsrt; this will free WebSRT for changes that wouldn’t be possible by trying to keep conformant with SRT
  2. introduce some header fields into WebSRT: the format needs
    • file-wide name-value metadata, such as author, date, copyright, etc
    • language specification for the file as a hint for font selection and speech synthesis
    • a possibility for style sheet association in the file header
    • a means to identify which parser is required for the cues
    • a magic identifier and a version string of the format
  3. allow innerHTML as an additional format in the cues with the CSS pseudo-elements applying to all HTML elements
  4. allow full use of CSS instead of just the restricted features and also use it for positioning instead of the hard to understand positioning hints
  5. on the minimum markup, provide a neutral structuring element such as <span @id @class @lang> to associate specific styles or specific languages with a subpart of the cue

Note that I undertook some experiments with an alternative format that is XML-based and called WMML to gain most of these insights and determine the advantages/disadvantages of a xml-based format. The foremost advantage is that there is no automatism with newlines and displayed new lines, which can make the source text file more readable. The foremost disadvantages are verbosity and that there needs to be a simple encoding step to remove all encapsulating header-type content from around the timed text cues before encoding it into a binary media resource.

ASSOCIATING EXTERNAL TIMED TEXT RESOURCES WITH A VIDEO

Now that we have a timed text format, we need to be able to associate it with a media resource in HTML5. This is what the <track> element was introduced for. It associates the timestamps in the timed text cues with the timeline of the video resource. The browser is then expected to render these during the time interval in which the cues are expected to be active.

Here is an example for how to associate multiple subtitle tracks with a video:

  <video src="california.webm" controls>
    <track label="English" kind="subtitles" src="calif_eng.wsrt" srclang="en">
    <track label="German" kind="subtitles" src="calif_de.wsrt" srclang="de">
    <track label="Chinese" kind="subtitles" src="calif_zh.wsrt" srclang="zh">
  </video>

In this case, the UA is expected to provide a text menu with a subtitle entry with these three tracks and their label as part of the video controls. Thus, the user can interactively activate one of the tracks.

Here is an example for multiple tracks of different kinds:

  <video src="california.webm" controls>
    <track label="English" kind="subtitles" src="calif_eng.wsrt" srclang="en">
    <track label="German" kind="captions" src="calif_de.wsrt" srclang="de">
    <track label="French" kind="chapter" src="calif_fr.wsrt" srclang="fr">
    <track label="English" kind="metadata" src="calif_meta.wsrt" srclang="en">
    <track label="Chinese" kind="descriptions" src="calif_zh.wsrt" srclang="zh">
  </video>

In this case, the UA is expected to provide a text menu with a list of track kinds with one entry each for subtitles, captions and descriptions through the controls. The chapter tracks are expected to provide some sort of visual subdivision on the timeline and the metadata tracks are not exposed visually, but are only available through the JavaScript API.

Here are several ideas for improving the <track> specification:

  • <track> is currently only defined for WebSRT resources – it should be made generic and then browsers can compete on the formats for which they provide support. WebSRT could be the baseline format. A @type attribute could be added to hint at the MIME type of the provided resource.
  • <track> needs a means for authors to mark certain tracks as active, others as inactive. This can be overruled by browser settings e.g. on @srclang and by user interaction.
  • karaoke and lyrics are supported by WebSRT, but aren’t in the HTML5 spec as track kinds – they should be added and made visible like subtitles or captions.

EXPOSING A LIST OF TimedTracks TO JAVASCRIPT

This is where we take an extra step and move to a uniform handling of both in-band and out-of-band timed text tracks. Futher, a third type of timed text track has been introduced in the form of a MutableTimedTrack – i.e. one that can be authored and added through JavaScript alone.

The JavaScript API that is exposed for any of these track type is identical. A media element now has this additional IDL interface:

interface HTMLMediaElement : HTMLElement {
...
  readonly attribute TimedTrack[] tracks;
  MutableTimedTrack addTrack(in DOMString label, in DOMString kind, 
                                 in DOMString language);
};

A media element thus manages a list of TimedTracks and provides for adding TimedTracks through addTrack().

The timed tracks are associated with a media resource in the following order:

  1. The <track> element children of the media element, in tree order.
  2. Tracks created through the addTrack() method, in the order they were added, oldest first.
  3. In-band timed text tracks, in the order defined by the media resource’s format specification.

The IDL interface of a TimedTrack is as follows:

interface TimedTrack {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;
  readonly attribute unsigned short readyState;
           attribute unsigned short mode;
  readonly attribute TimedTrackCueList cues;
  readonly attribute TimedTrackCueList activeCues;
  readonly attribute Function onload;
  readonly attribute Function onerror;
  readonly attribute Function oncuechange;
};

The first three capture the value of the @kind, @label and @srclang attributes and are provided by the addTrack() function for MutableTimedTracks and exposed from metadata in the binary resource for in-band tracks.

The readyState captures whether the data is available and is one of “not loaded”, “loading”, “loaded”, “failed to load”. Data is only availalbe in “loaded” state.

The mode attribute captures whether the data is activate to be displayed and is one of “disabled”, “hidden” and “showing”. In the “disabled” mode, the UA doesn’t have to download the resource, allowing for some bandwidth management.

The cues and activeCues attributes provide the list of parsed cues for the given track and the subpart thereof that is currently active.

The onload, onerror, and oncuechange functions are event handlers for the load, error and cuechange events of the TimedTrack.

Individual cues expose the following IDL interface:

interface TimedTrackCue {
  readonly attribute TimedTrack track;
  readonly attribute DOMString id;
  readonly attribute float startTime;
  readonly attribute float endTime;
  DOMString getCueAsSource();
  DocumentFragment getCueAsHTML();
  readonly attribute boolean pauseOnExit;
  readonly attribute Function onenter;
  readonly attribute Function onexit;
  readonly attribute DOMString direction;
  readonly attribute boolean snapToLines;
  readonly attribute long linePosition;
  readonly attribute long textPosition;
  readonly attribute long size;
  readonly attribute DOMString alignment;
  readonly attribute DOMString voice;
};

The @track attribute links the cue to its TimedTrack.

The @id, @startTime, @endTime attributes expose a cue identifier and its associated time interval. The getCueAsSource() and getCueAsHTML() functions provide either an unparsed cue text content or a text content parsed into a HTML DOM subtree.

The @pauseOnExit attribute can be set to true/false and indicates whether at the end of the cue’s time interval the media playback should be paused and wait for user interaction to continue. This is particularly important as we are trying to support extended audio descriptions and extended captions.

The onenter and onexit functions are event handlers for the enter and exit events of the TimedTrackCue.

The @direction, @snapToLines, @linePosition, @textPosition, @size, @alignment and @voice attributes expose WebSRT positioning and semantic markup of the cue.

My only concerns with this part of the specification are:

  • The WebSRT-related attributes in the TimedTrackCue are in conflict with CSS attributes and really should not be introduced into HTML5, since they are WebSRT-specific. They will not exist in other types of in-band or out-of-band timed text tracks. As there is a mapping to do already, why not rely on already available CSS features.
  • There is no API to expose header-specific metadata from timed text tracks into JavaScript. This such as the copyright holder, the creation date and the usage rights of a timed text track would be useful to have available. I would propose to add a list of name-value metadata elements to the TimedTrack API.
  • In addition, I would propose to allow media fragment hyperlinks in a <video> @src attribute to point to the @id of a TimedTextCue, thus defining that the playback position should be moved to the time offset of that TimedTextCue. This is a useful feature and builds on bringing named media fragment URIs and TimedTracks together.

RENDERING TimedTracks

The third part of the timed track framework deals with how to render the timed text cues in a Web page. The rendering rules are explained in the HTML5 rendering section.

I’ve extracted the following rough steps from the rendering algorithm:

  1. All timed tracks of a media resource that are in “showing” mode are rendered together to avoid overlapping text from multiple tracks.
  2. The timed tracks cues that are to be rendered are collected from the active timed tracks and ordered by the timed track order first and by their start time second. Where there are identical start times, the cues are ordered by their end time, earliest first, or by their creation order if all else is identical.
  3. Each cue gets its own CSS box.
  4. The text in the CSS boxes is positioned and formated by interpreting the positioning and formatting instructions of WebSRT that are provided on the cues.
  5. An anonymous inline CSS box is created into which all the cue CSS boxes are wrapped.
  6. The wrapping CSS box gets the dimensions of the video viewport. The cue CSS boxes are positioned so they don’t overlap. The text inside the cue CSS boxes inside the wrapping CSS box is wrapped at the edges if necessary.

To overcome security concerns with this kind of direct rendering of a CSS box into the Web page where text comes potentially from a different and malicious Web site, it is required to have the cues come from the same origin as the Web page.

To allow application of a restricted set of CSS properties to the timed text cues, a set of pseudo-selectors was introduced. This is necessary since all the CSS boxes are anonymous and cannot be addressed from the Web page. The introduced pseudo-selectors are ::cue to address a complete cue CSS box, and ::cue-part to address a subpart of a cue CSS box based on a set of identifiers provided by WebSRT.

I have several issues with this approach:

  • I believe that it is not a good idea to only restrict rendering to same-origin files. This will disallow the use of external captioning services (or even just a separate caption server of the same company) to link to for providing the captions to a video. Henri Sivonen proposed a means to overcome this by parsing every cue basically as its own HTML document (well, the body of a document) and then rendering these in iFrame-manner into the Web page. This would overcome the same-origin restriction. It would also allow to do away with the new ::cue CSS selectors, thus simplifying the solution.
  • In general I am concerned about how tightly the rendering is tied to WebSRT. Step 4 should not be in the HTML5 specification, but only apply to WebSRT. Every external format should provide its own mapping to CSS. As it is specified right now, other formats, such as e.g. 3GPP in MPEG-4 or Kate in Ogg, are required to map their format and positioning information to WebSRT instructions. These are then converted again using the WebSRT to CSS mapping rules. That seems overkill.
  • I also find step 6 very limiting, since only the video viewport is regarded as a potential rendering area – this is also the reason why there is no rendering for audio elements. Instead, it would make a lot more sense if a CSS box was provided by the HTML page – the default being the video viewport, but it could be changed to any area on screen. This would allow to render music lyrics under or above an audio element, or render captions below a video element to avoid any overlap at all.

SUMMARY AND FURTHER NEEDS

We’ve made huge progress on accessibility features for HTML5 media elements with the specifications that Ian proposed. I think we can move it to a flexible and feature-rich framework as the improvements that Henri, myself and others have proposed are included.

This will meet most of the requirements that the W3C HTML Accessibility Task Force has collected for media elements where the requirements relate to accessibility functionality provided through alternative text resources.

However, we are not solving any of the accessibility needs that relate to alternative audio-visual tracks and resources. In particular there is no solution yet to deal with multi-track audio or video files that have e.g. sign language or audio description tracks in them – not to speak of the issues that can be introduced through dealing with separate media resources from several sites that need to be played back in sync. This latter may be a challenge for future versions of HTML5, since needs for such synchoronisation of multiple resources have to be explored further.

In a first instance, we will require an API to expose in-band tracks, a means to control their activation interactively in a UI, and a description of how they should be rendered. E.g. should a sign language track be rendered as pciture-in-picture? Clear audio and Sign translation are the two key accessibility needs that can be satisfied with such a multi-track solution.

Finally, another key requirement area for media accessibility is described in a section called “Content Navigation by Content Structure”. This describes the need for vision-impaired users to be able to navigate through a media resource based on semantic markup – think of it as similar to a navigation through a book by book chapters and paragraphs. The introduction of chapter markers goes some way towards satisfying this need, but chapter markers tend to address only big time intervals in a video and don’t let you navigate on a different level to subchapters and paragraphs. It is possible to provide that navigation through providing several chapter tracks at different resolution levels, but then they are not linked together and navigation cannot easily swap between resolution levels.

An alternative might be to include different resolution levels inside a single chapter track and somehow control the UI to manage them as different resolutions. This would only require an additional attribute on text cues and could be useful to other types of text tracks, too. For example, captions could be navigated based on scenes, shots, coversations, or individual captions. Some experimentation will be required here before we can introduce a sensible extension to the given media accessibility framework.

“HTML5 Audio And Video Accessibility, Internationalisation And Usability” talk at Mozilla Summit

For 2 months now, I have been quietly working along on a new Mozilla contract that I received to continue working on HTML5 media accessibility. Thanks Mozilla!

Lots has been happening – the W3C HTML5 accessibility task force published a requirements document, the Media Text Associations proposal made it into the HTML5 draft as a <track> element, and there are discussions about the advantages and disadvantages of the new WebSRT caption format that Ian Hickson created in the WHATWG HTML5 draft.

In attending the Mozilla Summit last week, I had a chance to present the current state of development of HTML5 media accessibility and some of the ongoing work. I focused on the following four current activities on the technical side of things, which are key to satisfying many of the collected media accessibility requirements:

  1. Multitrack Video Support
  2. External Text Tracks Markup in HTML5
  3. External Text Track File Format
  4. Direct Access to Media Fragments

The first three now already have first drafts in the HTML5 specification, though the details still need to be improved and an external text track file format agreed on. The last has had a major push ahead with the Media Fragments WG publishing a Last Call Working Draft. So, on the specification side of things, major progress has been made. On the implementation – even on the example implementation – side of things, we still fall down badly. This is where my focus will lie in the next few months.

Follow this link to read through my slides from the Mozilla 2010 summit.

HTML5 Media and Accessibility presentation

Today, I was invited to give a talk at my old workplace CSIRO about the HTML5 media elements and accessibility.

A lot of the things that have gone into Ogg and that are now being worked on in the W3C in different working groups – including the Media Fragments and HTML5 WGs – were also of concern in the Annodex project that I worked on while at CSIRO. So I was rather excited to be able to report back about the current status in HTML5 and where we’re at with accessibility features.

Check out the presentation here. It contains a good collection of links to exciting demos of what is possible with the new HTML5 media elements when combined with other HTML features.

I tried something now with this presentation: I wrote it in a tool called S5, which makes use only of HTML features for the presentation. It was quite a bit slower than I expected, e.g. reloading a page always included having to navigate to that page. Also, it’s not easily possible to do drawings, unless you are willing to code them all up in HTML. But otherwise I have found it very useful for, in particular, including all the used URLs and video element demos directly in the slides. I was inspired with using this tool by Chris Double’s slides from LCA about implementing HTML 5 video in Firefox.

Accessibility support in Ogg and liboggplay

At the recent FOMS/LCA in Wellington, New Zealand, we talked a lot about how Ogg could support accessibility. Technically, this means support for multiple text tracks (subtitles/captions), multiple audio tracks (audio descriptions parallel to main audio track), and multiple video tracks (sign language video parallel to main video track).

Creating multitrack Ogg files
The creation of multitrack Ogg files is already possible using one of the muxing applications, e.g. oggz-merge. For example, I have my own little collection of multitrack Ogg files at http://annodex.net/~silvia/itext/elephants_dream/multitrack/. But then you are stranded with files that no player will play back.

Multitrack Ogg in Players
As Ogg is now being used in multiple Web browsers in the new HTML5 media formats, there are in particular requirements for accessibility support for the hard-of-hearing and vision-impaired. Either multitrack Ogg needs to become more of a common case, or the association of external media files that provide synchronised accessibility data (captions, audio descriptions, sign language) to the main media file needs to become a standard in HTML5.

As it turn out, both these approaches are being considered and worked on in the W3C. Accessibility data that are audio or video tracks will in the near future have to come out of the media resource itself, but captions and other text tracks will also be available from external associated elements.

The availability of internal accessibility tracks in Ogg is a new use case – something Ogg has been ready to do, but has not gone into common usage. MPEG files on the other hand have for a long time been used with internal accessibility tracks and thus frameworks and players are in place to decode such tracks and do something sensible with them. This is not so much the case for Ogg.

For example, a current VLC build installed on Windows will display captions, because Ogg Kate support is activated. A current VLC build on any other platform, however, has Ogg Kate support deactivated in the build, so captions won’t display. This will hopefully change soon, but we have to look also beyond players and into media frameworks – in particular those that are being used by the browser vendors to provide Ogg support.

Multitrack Ogg in Browsers
Hopefully gstreamer (which is what Opera uses for Ogg support) and ffmpeg (which is what Chrome uses for Ogg support) will expose all available tracks to the browser so they can expose them to the user for turning on and off. Incidentally, a multitrack media JavaScript API is in development in the W3C HTML5 Accessibility Task Force for allowing such control.

The current version of Firefox uses liboggplay for Ogg support, but liboggplay’s multitrack support has been sketchy this far. So, Viktor Gal – the liboggplay maintainer – and I sat down at FOMS/LCA to discuss this and Viktor developed some patches to make the demo player in the liboggplay package, the glut-player, support the accessibility use cases.

I applied Viktor’s patch to my local copy of liboggplay and I am very excited to show you the screencast of glut-player playing back a video file with an audio description track and an English caption track all in sync:

elephants_dream_with_audiodescriptions_and_captions

Further developments
There are still important questions open: for example, how will a player know that an audio description track is to be played together with the main audio track, but a dub track (e.g. a German dub for an English video) is to be played as an alternative. Such metadata for the tracks is something that Ogg is still missing, but that Ogg can be extended with fairly easily through the use of the Skeleton track. It is something the Xiph community is now working on.

Summary
This is great progress towards accessibility support in Ogg and therefore in Web browsers. And there is more to come soon.

How to display seeked position for HTML5 video

Recently, I was asked for some help on coding with an HTML5 video element and its events. In particular the question was: how do I display the time position that somebody seeked to in a video?

Here is a code snipped that shows how to use the seeked event:


<video onseeked="writeVideoTime(this.currentTime);" src="video.ogv" controls></video>
<p>position:</p><div id="videotime"></div>
<script type="text/javascript">
// get video element
var video = document.getElementsByTagName("video")[0];
function writeVideoTime(t) {
document.getElementById("videotime").innerHTML=t;
}
</script>

Other events that can be used in a similar way are:

  • loadstart: UA requests the media data from the server
  • progress: UA is fetching media data from the server
  • suspend: UA is on purpose idling on the server connection mid-fetching
  • abort: UA aborts fetching media data from the server
  • error: UA aborts fetching media because of a network error
  • emptied: UA runs out of network buffered media data (I think)
  • stalled: UA is waiting for media data from the server
  • play: playback has begun after play() method returns
  • pause: playback has been paused after pause() method returns
  • loadedmetadata: UA has received all its setup information for the media resource, duration and dimensions and is ready to play
  • loadeddata: UA can render the media data at the current playback position for the first time
  • waiting: playback has stopped because the next frame is not available yet
  • playing: playback has started
  • canplay: playback can resume, but at risk of buffer underrun
  • canplaythrough: playback can resume without estimated risk of buffer underrun
  • seeking: seeking attribute changed to true (may be too short to catch)
  • seeked: seeking attribute changed to false
  • timeupdate: current playback position changed enough to report on it
  • ended: playback stopped at media resource end; ended attribute is true
  • ratechange: defaultPlaybackRate or playbackRate attribute have just changed
  • durationchange: duration attribute has changed
  • volumechange:volume attribute or the muted attribute has changed

Please refer to the actual event list in the specification for more details and more accurate information on the events.

Audio Track Accessibility for HTML5

I have talked a lot about synchronising multiple tracks of audio and video content recently. The reason was mainly that I foresee a need for more than two parallel audio and video tracks, such as audio descriptions for the vision-impaired or dub tracks for internationalisation, as well as sign language tracks for the hard-of-hearing.

It is almost impossible to introduce a good scheme to deliver the right video composition to a target audience. Common people will prefer bare a/v, vision-impaired would probably prefer only audio plus audio descriptions (but will probably take the video), and the hard-of-hearing will prefer video plus captions and possibly a sign language track . While it is possible to dynamically create files that contain such tracks on a server and then deliver the right composition, implementation of such a server method has not been very successful in the last years and it would likely take many years to roll out such new infrastructure.

So, the only other option we have is to synchronise completely separate media resource together as they are selected by the audience.

It is this need that this HTML5 accessibility demo is about: Check out the demo of multiple media resource synchronisation.

I created a Ogg video with only a video track (10m53s750). Then I created an audio track that is the original English audio track (10m53s696). Then I used a Spanish dub track that I found through BlenderNation as an alternative audio track (10m58s337). Lastly, I created an audio description track in the original language (10m53s706). This creates a video track with three optional audio tracks.

I took away all native controls from these elements when using the HTML5 audio and video tag and ran my own stop/play and seeking approaches, which handled all media elements in one go.

I was mostly interested in the quality of this experience. Would the different media files stay mostly in sync? They are normally decoded in different threads, so how big would the drift be?

The resulting page is the basis for such experiments with synchronisation.

The page prints the current playback position in all of the media files at a constant interval of 500ms. Note that when you pause and then play again, I am re-synching the audio tracks with the video track, but not when you just let the files play through.

I have let the files play through on my rather busy Macbook and have achieved the following interesting drift over the course of about 9 minutes:

Drift between multiple parallel played media elements

You will see that the video was the slowest, only doing roughly 540s, while the Spanish dub did 560s in the same time.

To fix such drifts, you can always include regular re-synchronisation points into the video playback. For example, you could set a timeout on the playback to re-sync every 500ms. Within such a short time, it is almost impossible to notice a drift. Don’t re-load the video, because it will lead to visual artifacts. But do use the video’s currentTime to re-set the others. (UPDATE: Actually, it depends on your situation, which track is the best choice as the main timeline. See also comments below.)

It is a workable way of associating random numbers of media tracks with videos, in particular in situations where the creation of merged files cannot easily be included in a workflow.

WebJam 2009 talk on video accessibility

On Wednesday evening I gave a 3 min presentation on video accessibility in HTML5 at the WebJam in Sydney. I used a video as my presentation means and explained things while playing it back. Here is the video, without my oral descriptions, but probably still useful to some. Note in particular how you can experience the issues of deaf (HoH), blind (VI) and foreign language users:

The Ogg version is here.