Tag Archives: HTML5 video

WebRTC books – a brief review

I just finished reading Rob Manson’s awesome book “Getting Started with WebRTC” and I can highly recommend it for any Web developer who is interested in WebRTC.

Rob explains very clearly how to create your first video, audio or data peer-connection using WebRTC in current Google Chrome or Firefox (I think it also now applies to Opera, though that wasn’t the case when his book was published). He makes available example code, so you can replicate it in your own Web application easily, including the setup of a signalling server. He also points out that you need a ICE (STUN/TURN) server to punch through firewalls and gives recommendations for what software is available, but stops short of explaining how to set them up.

Rob’s focus is very much on the features required in a typical Web application:

  • video calls
  • audio calls
  • text chats
  • file sharing

In fact, he provides the most in-depth demo of how to set up a good file sharing interface I have come across.

Rob then also extends his introduction to WebRTC to two key application areas: education and team communication. His recommendations are spot on and required reading for anyone developing applications in these spaces.

Before Rob’s book, I have also read Alan Johnson and Dan Burnett’s “WebRTC” book on APIs and RTCWEB protocols of the HTML5 Real-Time Web.

Alan and Dan’s book was written more than a year ago and explains that state of standardisation at that time. It’s probably a little out-dated now, but it still gives you good foundations on why some decisions were made the way they are and what are contentious issues (some of which still remain). If you really want to understand what happens behind the scenes when you call certain functions in the WebRTC APIs of browsers, then this is for you.

Alan and Dan’s book explains in more details than Rob’s book how IP addresses of communication partners are found, how firewall holepunching works, how sessions get negotiated, and how the standards process works. It’s probably less useful to a Web developer who just wants to implement video call functionality into their Web application, though if something goes wrong you may find yourself digging into the details of SDP, SRTP, DTLS, and other cryptic abbreviations of protocols that all need to work together to get a WebRTC call working.

Overall, both books are worthwhile and cover different aspects of WebRTC that you will stumble across if you are directly dealing with WebRTC code.

Summary Video Accessibility Talk

I’ve just got off a call to the UK Digital TV Group, for which I gave a talk on HTML5 video accessibility (slides best viewed in Google Chrome).

The slide provide a high-level summary of the accessibility features that we’ve developed in the W3C for HTML5, including:

  • Subtitles & Captions with WebVTT and the track element
  • Video Descriptions with WebVTT, the track element and speech synthesis
  • Chapters with WebVTT for semantic navigation
  • Audio Descriptions through synchronising an audio track with a video
  • Sign Language video synchronized with a main video

I received some excellent questions.

The obvious one was about why WebVTT and not TTML. While for anyone who has tried to implement TTML support, the advantages of WebVTT should be clear, for some the decision of the browsers to go with WebVTT still seems to be bothersome. The advantages of CSS over XSL-FO in a browser-context are obvious, but not as much outside browsers. So, the simplicity of WebVTT and the clear integration with HTML have to speak for themselves. Conversion between TTML and WebVTT was a feature that was being asked for.

I received a question about how to support ducking (reduce the volume of the main audio track) when using video descriptions. My reply was to either use video descriptions with WebVTT and do ducking during the times that a cue is active, or when using audio descriptions (i.e. actual audio tracks) to add an additional WebVTT file of kind=metadata to mark the intervals in which to do ducking. In both cases some JavaScript will be necessary.

I received another question about how to do clean audio, which I had almost forgotten was a requirement from our earlier media accessibility document. “Clean audio” consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated. I suggested using the mediagroup attribute to provide a main video element (without an audio track) and then the other channels as parallel audio tracks that can be turned on and off and attenuated individually. There is some JavaScript coding involved on top of the APIs that we have defined in HTML, but it can be implemented in browsers that support the mediagroup attribute.

Another question was about the possibilities to extend the list of @kind attribute values. I explained that right now we have a proposal for a new text track kind=”forced” so as to provide forced subtitles for sections of video with foreign language. These would be on when no other subtitle or caption tracks are activated. I also explained that if there is a need for application-specific text tracks, the kind=”metadata” would be the correct choice.

I received some further questions, in particular about how to apply styling to captions (e.g. color changes to text) and about how closely the browser are able to keep synchronization across multiple media elements. The earlier was easily answered with the ::cue pseudo-element, but the latter is a quality of implementation feature, so I had to defer to individual browsers.

Overall it was a good exercise to summarize the current state of HTML5 video accessibility and I was excited to show off support in Chrome for all the features that we designed into the standard.

My crazy linux.conf.au week

In January I attended the annual Australian Linux and Open Source conference (LCA). But since I was sick all of January and had a lot to catch up on, I never got around to sharing all the talks that I gave during that time.

Drupal Down Under

It started with a talk at Drupal Down Under, which happened the weekend before LCA. I gave a talk titled “HTML5 video specifications” (video, slides).

I spoke about the video and audio element in HTML5, how to provide fallback content, how to encode content, how to control them from JavaScript, and briefly about Drupal video modules, though the next presentation provided much more insight into those. I explained how to make the HTML5 media elements accessible, including accessible controls, captions, audio descriptions, and the new WebVTT file format. I ran out of time to introduce the last section of my slides which are on WebRTC.


On the first day of LCA I gave a talk both in the Multimedia Miniconf and the Browser Miniconf.

Browser Miniconf

In the Browser Miniconf I talked about “Web Standardisation – how browser vendors collaborate, or not” (slides). Maybe the most interesting part about this was that I tried out a new slide “deck” tool called impress.js. I’m not yet sure if I like it but it worked well for this talk, in which I explained how the HTML5 spec is authored and who has input.

I also sat on a panel of browser developers in the Browser Miniconf (more as a standards than as a browser developer, but that’s close enough). We were asked about all kinds of latest developments in HTML5, CSS3, and media standards in the browser.

Multimedia Miniconf

In the Multimedia Miniconf I gave a “HTML5 media accessibility update” (slides). I talked about the accessibility problems of Flash, how native HTML5 video players will be better, about accessible video controls, captions, navigation chapters, audio descriptions, and WebVTT. I also provided a demo of how to synchronize multiple video elements using a polyfill for the multitrack API.

I also provided an update on HTTP adaptive streaming APIs as a lightning talk in the Multimedia Miniconf. I used an extract of the Drupal conference slides for it.

Main conference

Finally, and most importantly, Alice Boxhall and myself gave a talk in the main linux.conf.au titled “Developing Accessible Web Apps – how hard can it be?” (video, slides). I spoke about a process that you can follow to make your Web applications accessible. I’m writing a separate blog post to explain this in more detail. In her part, Alice dug below the surface of browsers to explain how the accessibility markup that Web developers provide is transformed into data structures that are handed to accessibility technologies.

The new FOMS: Open Media Developers at OVC

Since 2007 I have organised the annual Foundations of Open Media Software (FOMS) developers workshop. Last year it was held for the first time in the northern hemisphere, in fact on the two days straight after the Open Video Conference (OVC).

This year I’m really excited to announce that the workshop will be an integral part of the Open Video Conference on 10-12 September 2011.

FOMS 2011 will take place as the Open Media Developers track at OVC and I would like to see as many if not more open media software developers attend as we had in last year’s FOMS.

Why should you go?

Well, firstly of course the people. As in previous years, we will have some of the key developers in open media software attend – not as celebrities, but to work with other key developers on hard problems and to make progress.

Then, secondly we believe we have some awesome sessions in preparation:

How we run it

I’m actually not quite satisfied with just these sessions. I’d like to be more flexible on how we make the three days a success for everyone. And this implies that there will continue to be room to add more sessions, even while at the conference, and create breakout groups to address really hard issues all the way through the conference.

I insist on this flexibility because I have seen in past years that the most productive outcomes are created by two or three people breaking away from the group, going into a corner and hacking up some demos or solutions to hard problems and taking that momentum away after the workshop.

To allow this to happen, we will have a plenary on the first day during which we will identify who is actually present at the workshop, what they are working on, what sessions they are planning on a attending, and what other topics they are keen to learn about during the conference that may not yet be addressed by existing sessions.

We’ll repeat this exercise on the Monday after all the rest of the conference is finished and we get a quieter day to just focus on being productive.

But is it worth the effort?

As in the past years, whether the workshop is a success for you depends on you and you alone. You have the power to direct what sessions and breakout groups are being created, and you have the possibility to find others at the workshop that share an interest and drag them away for some productive brainstorming or coding.

I’m going to make sure we have an adequate number of rooms available to actually achieve such an environment. I am very happy to have the support of OVC for this and I am assured we have the best location with plenty of space.

Trip sponsorships

As in previous FOMSes, we have again made sure that travel and conference sponsorship is available to community software developers that would otherwise not be able to attend FOMS. We have several such sponsorships and I encourage you to email the FOMS committee or OVC about it. Mention what you’re working on and what you’re interested to take away from OVC and we can give you free entry, hotel and flight sponsorship.

Oh, and don’t forget to Register for OVC!

Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post to make it more efficient than answering tweets, emails, and skype and other phone conference requests. It’s about time I do a proper post about it.


I’m starting with the last area, because it is the simplest to answer.

No, no browser has as yet shipped support for the <track> element and therefore there is no support for WebVTT in browsers yet. However, implementations are in progress. For example, Webkit has recently received first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but is still working on the element’s actual functionality.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator – a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub – a polyfill for track and SRT parsing
  • jscaptions – a polyfill for track and SRT parsing
  • LeanBack player – a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr – a video player that includes track and WebVTT parsing
  • MediaElementJS – a video player that includes track and SRT parsing
  • Kaltura’s video player – a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss with Ian, me and the WHATWG about their meaning. His review notes with visual description of how settings are to be interpreted and his demo will be most useful to authors and other developers.


Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months – and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the <track> element – the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer – I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic:
    Whereas previously the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span:
    The <b> element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors:
    The spec already allowed to use the names of tags, the classes of <c> tags, and the voice annotations of <v> tags as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used.
  • text-decoration support:
    The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as blinking text and bold.

Further to this, the email identifies the means in which WebVTT is extensible:

  • Header area:
    The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and two empty lines as an end. It is possible to add into this area file-wide information header information.
  • Cues:
    Cues are defined to start with an optional identifier, and then a start/end time specification with “–>” separator. They end with two empty lines. Cues that contain a “–>” separator but don’t parse as valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues:
    Finally, within cues, everything that is within a “tag”, i.e. between “”, and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so wen can introduce new markup tags in this way.

Given this background, the following V2 extensions have been discussed:

  • Metadata:
    Enter name-value pairs of metadata into the header area, e.g.

    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings:
    Default cue settings can come in a “cue” of their own, e.g.

    DEFAULTS --> D:vertical A:end
    00:00.000 --> 00:02.000
    This is vertical and end-aligned.
    00:02.500 --> 00:05.000
    As is this.
    DEFAULTS --> A:start
    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
  • Inline CSS:
    Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

      STYLE -->
      ::cue(v[voice=Bob]) { color: green; }
      ::cue(c.narration) { font-style: italic; }
      ::cue(c.narration i) { font-style: normal; }
      00:00.000 --> 00:02.000
      <v Bob>Welcome.
      00:02.500 --> 00:05.000
      <c .narration>To <i>WebVTT</i>.
  • Comments:
    Both, comments within cues and complete cues commented out are possible, e.g.

     COMMENT -->
     00:02.000 --> 00:03.000
     two; this is entirely
     commented out
     00:06.000 --> 00:07.000
     this part of the cue is visible
     <! this part isn't >
     <and neither is this>

Finally, I believe we still need to add the following features:

  • Language tags:
    I’d like to add a language tag that allows to mark up a subpart of cue text as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). But more importantly we will need this feature for cues that contain text descriptions rather than captions, such that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a <lang jp>xxx</lang> type of markup.
  • Roll-up captions:
    When we use timestamp objects and the future text is hidden, then is un-hidden upon reaching its time, we should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which TV live captions have been displayed and so users are acquainted with this display style.
  • Inline navigation:
    For chapter tracks the primary use of cues are for navigation. In other formats – in particular in DAISY-books for blind users – there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a <nav> tag that can group several timestamp objects for navigation.
  • Default caption width:
    At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overruled with the “S” cue setting. I think it should by default rather be the width (height) of the bounding box around all the text inside the cue.

Aside from these changes to WebVTT, there are also some things that can be improved on the <track> element. I personally support the introduction of the source element underneath the track element, because that allows us to provide different caption files for different devices through the @media media queries attribute and it allows support for more than just one default captioning format. This change needs to be made soon so we don’t run into trouble with the currently empty track element.

I further think a oncuelistchange event would be nice as well in cases where the number of tracks is somehow changed – in particular when coming from within a media file.

Other than this, I’m really very happy with the state that we have achieved this far.

WebVTT explained

On Wednesday, I gave a talk at Google about WebVTT, the Web Video Text Track file format that is under development at the WHATWG for solving time-aligned text challenges for video.

I started by explaining all the features that WebVTT supports for captions and subtitles, mentioned how WebVTT would be used for text audio descriptions and navigation/chapters, and explained how it is included into HTML5 markup, such that the browser provides some default rendering for these purposes. I also mentioned the metadata approach that allows any timed content to be included into cues.

The talk slides include a demo of how the <track> element works in the browser. I’ve actually used the Captionator polyfill for HTML5 to make this demo, which was developed by Chris Giffard and is available as open source from GitHub.

The talk was recorded and has been made available as a Google Tech talk with captions and also a separate version with extended audio descriptions.

The slides of the talk are also available (best to choose the black theme).

I’ve also created a full transcript of the described video.

Get the WebVTT specification from the WHATWG Website.

HTML5 Video Presentations at LCA 2011

Working in the WHAT WG and the W3C HTML WG, you sometimes forget that all the things that are being discussed so heatedly for standardization are actually leading to some really exciting new technologies that not many outside have really taken note of yet.

This week, during the Australian Linux Conference in Brisbane, I’ve been extremely lucky to be able to show off some awesome new features that browser vendors have implemented for the audio and video elements. The feedback that I got from people was uniformly plain surprise – nobody expected browser to have all these capabilities.

The examples that I showed off have mostly been the result of working on a book for almost 9 months of the past year and writing lots of examples of what can be achieved with existing implementations and specifications. They have been inspired by diverse demos that people made in the last years, so the book is linking to many more and many more amazing demos.

Incidentally, I promised to give a copy of the book away to the person with the best idea for a new Web application using HTML5 media. Since we ran out of time, please shoot me an email or a tweet (@silviapfeiffer) within the next 4 weeks and I will send another copy to the person with the best idea. The copy that I brought along was given to a student who wanted to use HTML5 video to display on surfaces of 3D moving objects.

So, let’s get to the talks.

On Monday, I gave a presentation on “Audio and Video processing in HTML5“, which had a strong focus on the Mozilla Audio API.

I further gave a brief lightning talk about “HTML5 Media Accessibility Update“. I am expecting lots to happen on this topic during this year.

Finally, I gave a presentation today on “The Latest and Coolest in HTML5 Media” with a strong focus on video, but also touching on audio and media accessibility.

The talks were streamed live – congrats to Ryan Verner for getting this working with support from Ben Hutchings from DebConf and the rest of the video team. The videos will apparently be available from http://linuxconfau.blip.tv/ in the near future.

UPDATE 4th Feb 2011: And here is my LCA talk …

with subtitles on YouTube:

Accessibility to Web video for the Vision-Impaired

In the past week, I was invited to an IBM workshop on audio/text descriptions for video in Japan. Geoff Freed and Trisha O’Connell from WGBH, and Michael Evans from BBC research were the other invited experts to speak about the current state of video accessibility around the world and where things are going in TV/digital TV and the Web.

The two day workshop was very productive. The first day was spent with presentations which were open to the public. A large vision-impaired community attended to understand where technology is going. It was very humbling to be part of an English-spoken workshop in Japan, where much of the audience is blind, but speaks English much better than my average experience with English in Japan. I met many very impressive and passionate people that are creating audio descriptions, adapting NVDA for the Japanese market, advocating to Broadcasters and Government to create more audio descriptions, and perform fundamental research for better tools to create audio descriptions. My own presentation was on “HTML5 Video Descriptions“.

On the second day, we only met with the IBM researchers and focused discussions on two topics:

  1. How to increase the amount of video descriptions
  2. HTML5 specifications for video descriptions

The first topic included concerns about guidelines for description authoring by beginners, how to raise awareness, who to lobby, and what production tools are required. I personally was more interested in the second topic and we moved into a smaller breakout group to focus on these discussions.

HTML5 specifications for video descriptions
Two topics were discussed related to video descriptions: text descriptions and audio descriptions. Text descriptions are descriptions authored as time-aligned text snippets and read out by a screen reader. Audio descriptions are audio recordings either of a human voice or even of a TTS (text-to-speech) synthesis – in either case, they are audio samples.

For a screen reader, the focus was actually largely on NVDA and people were very excited about the availability of this open source tool. There is a concern about how natural-sounding a screen reader can be made and IBM is doing much research there with some amazing results. In user experiment between WGBH and IBM they found that the more natural the voice sounds, the more people comprehend, but between a good screen reader and an actual human voice there is not much difference in the comprehension level. Broadcasters and other high-end producers are unlikely to accept TTS and will prefer the human voice, but for other materials – in particular for the large majority of content on the Web – TTS and screen readers can make a big difference.

An interesting lesson that I learnt was that video descriptions can be improved by 30% (i.e. 30% better comprehension) if we introduce extended descriptions, i.e. descriptions that can pause the main video to allow for a description be read for something that happens in the video, but where there is no obvious pause to read out the description. So, extended descriptions are one of the major challenges to get right.

We then looked at the path that we are currently progressing on in HTML5 with WebSRT, the TimedTrack API, the <track> elements and the new challenges around a multitrack API.

For text descriptions we identified a need for the following:

  • extension marker on cues: often it is very clear to the author of a description cue that there is no time for the cue to be read out in parallel to the main audio and the video needs to be paused. The proposal is for introduction of an extension marker on the cue to pause the video until the screen reader is finished. So, a speech-complete event from the screen reader API needs to be dealt with. To make this reliable, it might make sense to put a max duration on the cue so the video doesn’t end up waiting endlessly in case the screen reader event isn’t fired. The duration would be calculated based on a typical word speaking rate.
  • importance marker on cues: the duration of all text cues being read out by screen readers depends on the speed set-up of the screen reader. So, even when a cue has been created for a given audio break in the video, it may or may not fit into this break. For most cues it is important that they are read out completely before moving on, but for some it’s not. So, an importance maker could be introduced that determines whether a video stops at the end of the cue to allow the screen reader to finish, or whether the screen reader is silenced at that time no matter how far it has gotten.
  • ducking during cues: making the main audio track quieter in relation to the video description for the duration of a cue such as to allow the comprehension of the video description cue is important for comprehension
  • voice hints: an instruction at the beginning of the text description file for what voice to choose such that it won’t collide with e.g. the narrator voice of a video – typically the choice will be for a female voice when the narrator is male and the other way around – this will help initialize the screen reader appropriately
  • speed hints: an indicator at the beginning of a text description toward what word rate was used as the baseline for the timing of the cue durations such that a screen reader can be initialized with this
  • synthesis directives: while not a priority, eventually it will make for better quality synchronized text if it is possible to include some of the typical markers that speech synthesizers use (see e.g. SSML or speech CSS), including markers for speaker change, for emphasis, for pitch change and other prosody. It was, in fact, suggested that the CSS3’s speech module may be sufficient in particular since Opera already implements it.

This means we need to consider extending WebSRT cues with an “extension” marker and an “importance” marker. WebSRT further needs header-type metadata to include a voice and a speed hint for screen readers. The screen reader further needs to work more closely with the browser and exchange speech-complete events and hints for ducking. And finally we may need to allow for CSS3 speech styles on subparts of WebSRT cues, though I believe this latter one is not of high immediate importance.

For audio descriptions we identified a need for:

  • external/in-band descriptions: allowing external or in-band description tracks to be synchronized with the main video. It would be assumed in this case that the timeline of the description track is identical to the main video.
  • extended external descriptions: since it’s impossible to create in-band extended descriptions without changing the timeline of the main video, we can only properly solve the issue of extended audio descriptions through external resources. One idea that we came up with is to use a WebSRT file with links to short audio recordings as external extended audio descriptions. These can then be synchronized with the video and pause the video at the correct time etc through JavaScript. This is probably a sufficient solution for now. It supports both, sighted and vision-impaired users and does not extend the timeline of the original video. As an optimization, we can also do this through a single “virtual” resource that is a concatenation of the individual audio cues and is addressed through the WebSRT file with byte ranges.
  • ducking: making the main audio track quieter in relation to the video description for the duration of a cue such as to allow the comprehension of the video description cue is important for comprehension also with audio files, though it may be more difficult to realize
  • separate loudness control: making it possible for the viewer to separately turn the loudness of an audio description up/down in comparison to the main audio

For audio descriptions, we saw the need for introduction of a multitrack video API and markup to synchronize external audio description tracks with the main video. Extended audio descriptions should be solved through JavaScript and hooking up through the TimedTrack API, so mostly rolling it by hand at this stage. We will see how that develops in future. Ducking and separate loudness controls are equally needed here, but we do need more experiments in this space.

Finally, we discussed general needs to locate accessibility content such as audio descriptions by vision-impaired user:

  • the need for accessible user menus to turn on/off accessibility content
  • the introduction of dedicated and standardized keyboard short-cuts to turn on and manipulate the volume of audio descriptions (and captions)
  • the introduction of user preferences for automatically activating accessibility content; these could even learn from current usage, such that if a user activates descriptions for a video on one Website, the preferences pick this up; different user profiles are already introduced by ISO in “Access for all” and used in websites such as teachersdomain
  • means to generally locate accessibility content on the web, such as fields in search engines and RSS feeds
  • more generally there was a request to have caption on/off and description on/off buttons be introduced into remote controls of machines, which will become prevalent with the increasing amount of modern TV/Internet integrated devices

Overall, the workshop was a great success and I am keen to see more experimentation in this space. I also hope that some of the great work that was shown to us at IBM with extended descriptions and text descriptions will become available – if only as screencasts – so we can all learn from it to make better standards and technology.

adaptive HTTP streaming for open codecs

At this week’s FOMS in New York we had one over-arching topic that seemed to be of interest to every single participant: how to do adaptive bitrate streaming over HTTP for open codecs. On the first day, there was a general discussion about the advantages and disadvantages of adaptive HTTP streaming, while on the second day, we moved towards designing a solution for Ogg and WebM. While I didn’t attend all the discussions, I want to summarize the insights that I took out of the days in this blog post and the alternative implementation strategies that were came up with.

Use Cases for Adaptive HTTP Streaming

Streaming using RTP/RTSP has in the past been the main protocol to provide live video streams, either for broadcast or for real-time communication. It has been purpose-built for chunked video delivery and has features that many customers want, such as the ability to encrypt the stream, to tell players not to store the data, and to monitor the performance of the stream such that its bandwidth can be adapted. It has, however, also many disadvantages, not least that it goes over ports that normal firewalls block and thus is rather difficult to deploy, but also that it requires special server software, a client that speaks the protocol, and has a signalling overhead on the transport layer for adapting the stream.

RTP/RTSP has been invented to allow for high quality of service video consumption. In the last 10 years, however, it has become the norm to consume “canned” video (i.e. non-live video) over HTTP, making use of the byte-range request functionality of HTTP for seeking. While methods have been created to estimate the size of a pre-buffer before starting to play back in order to achieve continuous playback based on the bandwidth of your pipe at the beginning of downloading, not much can be done when one runs out of pre-buffer in the middle of playback or when the CPU on the machine doesn’t manage to catch up with decoding of the sheer amount of video data: your playback stops to go into re-buffering in the first case and starts to become choppy in the latter case.

An obvious approach to improving this situation is the scale the bandwidth of the video stream down, potentially even switch to a lower resolution video, right in the middle of playback. Apple’s HTTP live streaming, Microsoft’s Smooth Streaming, and Adobe’s Dynamic Streaming are all solutions in this space. Also, ISO/MPEG is working on DASH (Dynamic Adaptive Streaming over HTTP) is an effort to standardize the approach for MPEG media. No solution yets exist for the open formats within Ogg or WebM containers.

Some features of HTTP adaptive streaming are:

  • Enables adaptation of downloading to avoid continuing buffering when network or machine cannot cope.
  • Gapless switching between streams of different bitrate.
  • No special server software is required – any existing Web Server can be used to provide the streams.
  • The adaptation comes from the media player that actually knows what quality the user experiences rather than the network layer that knows nothing about the performance of the computer, and can only tell about the performance of the network.
  • Adaptation means that several versions of different bandwidth are made available on the server and the client switches between them based on knowledge it has about the video quality that the user experiences.
  • Bandwidth is not wasted by downloading video data that is not being consumed by the user, but rather content is pulled moments just before it is required, which works both for the live and canned content case and is particularly useful for long-form content.


In discussions at FOMS it was determined that mid-stream switching between different bitrate encoded audio files is possible. Just looking at the PCM domain, it requires stitching the waveform together at the switch-over point, but that is not a complex function. To be able to do that stitching with Vorbis-encoded files, there is no need for a overlap of data, because the encoded samples of the previous window in a different bitrate page can be used as input into the decoding of the current bitrate page, as long as the resulting PCM samples are stitched.

For video, mid-stream switching to a different bitrate encoded stream is also acceptable, as long as the switch-over point adheres to a keyframe, which can be independently decoded.

Thus, the preparation of the alternative bitstream videos requires temporal synchronisation of keyframes on video – the audio can deal with the switch-over at any point. A bit of intelligent encoding is thus necessary – requiring the encoding pipeline to provide regular keyframes at a certain rate would be sufficient. Then, the switch-over points are the keyframes.

Technical Realisation

With the solutions from Adobe, Microsoft and Apple, the technology has been created such there are special tools on the server that prepare the content for adaptive HTTP streaming and provide a manifest of the prepared content. Typically, the content is encoded in versions of different bitrates and the bandwidth versions are broken into chunks that can be decoded independently. These chunks are synchronised between the different bitrate versions such that there are defined switch-over points. The switch-over points as well as the file names of the different chunks are documented inside a manifest file. It is this manifest file that the player downloads instead of the resource at the beginning of streaming. This manifest file informs the player of the available resources and enables it to orchestrate the correct URL requests to the server as it progresses through the resource.

At FOMS, we took a step back from this approach and analysed what the general possibilities are for solving adaptive HTTP streaming. For example, it would be possible to not chunk the original media data, but instead perform range requests on the different bitrate versions of the resource. The following options were identified.


With Chunking, the original bitrate versions are chunked into smaller full resources with defined switch-over points. This implies creation of a header on each one of the chunks and thus introduces overhead. Assuming we use 10sec chunks and 6kBytes per chunk, that results in 5kBit/sec extra overhead. After chunking the files this way, we provide a manifest file (similar to Apple’s m3u8 file, or the SMIL-based manifest file of Microsoft, or Adobe’s Flash Media Manifest file). The manifest file informs the client about the chunks and the switch-over points and the client requests those different resources at the switch-over points.


  • Header overhead on the pipe.
  • Switch-over delay for decoding the header.
  • Possible problem with TCP slowstart on new files.
  • A piece of software is necessary on server to prepare the chunked files.
  • A large amount of files to manage on the server.
  • The client has to hide the switching between full resources.


  • Works for live streams, where increasing amounts of chunks are written.
  • Works well with CDNs, because mid-stream switching to another server is easy.
  • Chunks can be encoded such that there is no overlap in the data necessary on switch-over.
  • May work well with Web sockets.
  • Follows the way in which proprietary solutions are doing it, so may be easy to adopt.
  • If the chunks are concatenated on the client, you get chained Ogg files (similar concept in WebM?), which are planned to be supported by Web browsers and are thus legal files.

Chained Chunks

Alternatively to creating the large number of files, one could also just create the chained files. Then, the switch-over is not between different files, but between different byte ranges. The headers still have to be read and parsed. And a manifest file still has to exist, but it now points to byte ranges rather than different resources.

Advantages over Chunking:

  • No TCP-slowstart problem.
  • No large number of files on the server.

Disadvantages over Chunking:

  • Mid-stream switching to other servers is not easily possible – CDNs won’t like it.
  • Doesn’t work with Web sockets as easily.
  • New approach that vendors will have to grapple with.

Virtual Chunks

Since in Chained Chunks we are already doing byte-range requests, it is a short step towards simply dropping the repeating headers and just downloading them once at the beginning for all possible bitrate files. Then, as we seek to different positions in “the” file, the byte range of the bitrate version that makes sense to retrieve at that stage would be requested. This could even be done with media fragment URIs, through addressing with time ranges is less accurate than explicit byte ranges.

In contrast to the previous two options, this basically requires keeping n different encoding pipelines alive – one for every bitrate version. Then, the byte ranges of the chunks will be interpreted by the appropriate pipeline. The manifest now points to keyframes as switch-over points.

Advantage over Chained Chunking:

  • No header overhead.
  • No continuous re-initialisation of decoding pipelines.

Disadvantages over Chained Chunking:

  • Multiple decoding pipelines need to be maintained and byte ranges managed for each.

Unchunked Byte Ranges

We can even consider going all the way and not preparing the alternative bitrate resources for switching, i.e. not making sure that the keyframes align. This will then require the player to do the switching itself, determine when the next keyframe comes up in its current stream then seek to that position in the next stream, always making sure to go back to the last keyframe before that position and discard all data until it arrives at the same offset.


  • There will be an overlap in the timeline for download, which has to be managed from the buffering and alignment POV.
  • Overlap poses a challenge of downloading more data than necessary at exactly the time where one doesn’t have bandwidth to spare.
  • Requires seeking.
  • Messy.


  • No special authoring of resources on the server is needed.
  • Requires a very simple manifest file only with a list of alternative bitrate files.

Final concerns

At FOMS we weren’t able to make a final decision on how to achieve adaptive HTTP streaming for open codecs. Most agreed that moving forward with the first case would be the right thing to do, but the sheer number of files that can create is daunting and it would be nice to avoid that for users.

Other goals are to make it work in stand-alone players, which means they will need to support loading the manifest file. And finally we want to enable experimentation in the browser through JavaScript implementation, which means there needs to be an interface to provide the quality of decoding to JavaScript. Fortunately, a proposal for such a statistics API already exists. The number of received frames, the number of dropped frames, and the size of the video are the most important statistics required.

State of Media Accessibility in HTML5

Today I gave a talk at the Open Video Conference about the state of the specifications in HTML5 for media accessibility.

To be clear: at this exact moment, there is no actual specification text in the W3C version of HTML5 for media accessibility. There is, however, some text in the WHATWG version, providing a framework for text-based alternative content. Other alternative content still requires new specification text. Finally, there is no implementation in any browser yet for media accessibility, but we are getting closer. As browser vendors are moving towards implementing support for the WHATWG specifications of the <track> element, the TimedTrack JavaScript API, and the WebSRT format, video sites can also experiment with the provided specifications and contribute feedback to improve the specifications.

Attached are my slides from today’s talk. I went through some of the key requirements of accessibility users and showed how they are being met by the new specifications (in green) or could be met with some still-to-be-developed specifications (in blue). Note that the talk and slides focus on accessibility needs, but the developed technologies will be useful far beyond just accessibility needs and will also help satisfy other needs, such as the needs of internationalization (through subtitles), of exposing multitrack audio/video (through the JavaScript API), of providing timed metadata (through WebSRT), or even of supporting Karaoke (through WebSRT). In the tables on the last two pages I summarize the gaps in the specifications where we will be working on next and also show what is already possible with given specifications.