Tag Archives: accessibility

WebVTT Discussions at FOMS

At the recent FOMS (Foundations of Open Media Software and Standards) Developer Workshop, we had a massive focus on WebVTT and the state of its feature set. You will find links to summaries of the individual discussions in the FOMS Schedule page. Here are some of the key results I went away with.

1. WebVTT Regions

The key driving force for improvements to WebVTT continues to be the accurate representation of CEA608/708 captioning. As part of that drive, we’ve introduced regions (the CEA708 “window” concept) to WebVTT. WebVTT regions satisfy multiple requirements of CEA608/708 captions:

  1. support for rollup captions
  2. support for background color and border color on a group of cues independent of the background color of the individual cue
  3. possibility to move a group of cues from one location on screen to a different
  4. support to specify an anchor point and a growth direction for cues when their text size changes
  5. support for specifying a fixed number of lines to be rendered
  6. possibility to specify which region is rendered in front of which other one when regions overlap

While WebVTT regions enable us to satisfy all of the above points, the specification isn’t actually complete yet and some of the above needs aren’t satisfied yet.

We have an open bug to move a region elsewhere. A first discussion at FOMS seemed to to indicate that we’ll have to add syntax for updating a region at a particular time and thus give region definitions a way to be valid only for a certain time frame. I can imagine that the region definitions that we have in the header of the WebVTT file now would have an implicitly defined time frame from the start to the end of the file, but can be overruled by a re-definition anywhere within the WebVTT file. That redefinition needs to provide a start and end time.

We registered a bug to add specifying the width and height of regions (and possibly of cues) by em (i.e. by multiples of the largest character in a font). This should allow us to have the region grow/shrink around the region anchor point with a change of font size by script or a user. em specifications should also be applied to cues – that matches the column count of CEA708/608 better.

When regions overlap, the original region extension spec already suggested a “layer” cue setting. It will be easy to add it.

Another change that we will ultimately need is the “scroll” setting: we will need to introduce support for scrolling text down or from left-to-right or right-to-left, e.g. vertical scrolling text seems to be used in some Chinese caption use cases.

2. Unify Rendering Approach

The introduction of regions created a second code path in the rendering spec with some duplication. At FOMS we discussed if it was possible to unify that. The suggestion is to render all cues into a region. Those that are not part of a region would be rendered into an anonymous region that covers the complete viewport. There may be some consequences to this, e.g. cue settings should be usable across all cues, no matter whether or not part of a region, and avoiding cue overlap may need to be done within regions.

Here’s a rough outline of the path of the new rendering algorithm:

(1) Render the regions:

Specified Region Anonymous Region
Render values as given: Render following values:
  • width
  • lines
  • regionanchor
  • viewportanchor
  • scroll
  • 100%
  • videoheight/lineheight
  • 0,0
  • 0,0
  • none

(2) Render the cues:

  • Create a cue box and put it in its region (anonymous if none given).
  • Calculate position & size of cue box from cue settings (position, line, size).
  • Calculate position of cue text inside cue box from remaining cue settings (vertical, align).

3. Vertical Features

WebVTT includes vertical rendering, both right-to-left and left-to-right. However, regions are not defined for vertical. Eventually, we’re going to have to look at the vertical features of WebVTT with more details and figure out whether the spec is working for them and what real-world requirements we have missed. We hope we can get some help from users in countries where vertically rendered captions/subtitles are the norm.

4. Best Practices

Some of he WebVTT users at FOMS suggested it would be advantageous to start a list of “best practices” for how to author captions with WebVTT. Example recommendations are:

  • Use line numbers only to position cues from top or bottom of viewport. Don’t use otherwise.
  • Note that when the user increases the fontsize in rollup captions and thus introduces new line breaks, your cues will roll by faster because the number of lines of a rollup is fixed.
  • Make sure to use ‎ and ‏ UTF-8 markers to control the directionality of your text.

It would be nice if somebody started such a document.

5. Non-caption use cases

Instead of continuing to look back and improve our support of captions/subtitles in WebVTT, one session at FOMS also went ahead and looked forward to other use cases. The following requirements came out of this:

5.1 Preview Thumbnails

A common use case for timed data is the use of preview thumbnails on the navigation bar of videos. A native implementation of preview thumbnails would allow crawlers and search engines to have a standardised way of extracting timed images for media files, so introduction of a new @kind value “thumbnails” was suggested.

The content of a “thumbnails” cue could be any of:

  • an image URL
  • a sprite URL to a single image
  • a spatial & temporal media fragment URL to a media resource
  • base64 encoded image (data URI)
  • an iframe offset to the media resource

The suggestion is to allow anything that would work in a img @src attribute as value in a cue of @kind=”thumbnails”. Responsive images might also be useful for a track of @kind=”thumbnails”. It may even be possible to define an inband thumbnail track based on the track of @kind=”thumbnails”. Such cues should also work in the JavaScript track API.

5.2 Chapter markers

There is interest to put richer content than just a chapter title into chapter cues. Often, chapters consist of a title, text and and image. The text is not so important, but the image is used almost everywhere that chapters are used. There may be a need to extend chapter cue content with images, similar to what a @kind=”thumbnails” track offers.

The conclusion that we arrived at was that we need to make @kind=”thumbnails” work first and then look at using the learnings from that to extend @kind=”chapters”.

5.3 Inband tracks for live video

A difficult topic was opened with the question of how to transport text tracks in live video. In live captioning, end times are never created for cues, but are implied by the start time of the next cue. This is a use case that hasn’t been addressed in HTML5/WebVTT yet. An old proposal to allow a special end time value of “NEXT” was discussed and recommended for adoption. Also, there was support for the spec change that stops blocking loading VTT until all cues have been loaded.

5.4 Cross-domain VTT loading

A brief discussion centered around the fact that the spec disallows cross-domain loading of WebVTT files, but that no browser implements this. This needs to be discussion at the HTML WG level.

6. Regions in live captioning

The final topic that we discussed was how we could provide support for regions in live captioning.

  • The currently active region definitions will need to be come part of every header of every VTT file segment that HLS uses, so it’s available in case the cues in the segment file reference it.
  • “NEXT” in end time markers would make authoring of live captioned VTT files easier.
  • If the application wants to use 1 word at a time and doesn’t want to delay sending the word until the full cue is authored (e.g. in a Hangout type environment), we will need to introduce the concept of “cue continuation markers”, so we know that a cue could be extended with the next VTT file fragment.

This is an extensive and impressive amount of discussion around WebVTT and a lot of new work to be performed in the future. I’m very grateful for all the people who have contributed to these discussions at FOMS and will hopefully continue to help get the specifications right.

Summary Video Accessibility Talk

I’ve just got off a call to the UK Digital TV Group, for which I gave a talk on HTML5 video accessibility (slides best viewed in Google Chrome).

The slide provide a high-level summary of the accessibility features that we’ve developed in the W3C for HTML5, including:

  • Subtitles & Captions with WebVTT and the track element
  • Video Descriptions with WebVTT, the track element and speech synthesis
  • Chapters with WebVTT for semantic navigation
  • Audio Descriptions through synchronising an audio track with a video
  • Sign Language video synchronized with a main video

I received some excellent questions.

The obvious one was about why WebVTT and not TTML. While for anyone who has tried to implement TTML support, the advantages of WebVTT should be clear, for some the decision of the browsers to go with WebVTT still seems to be bothersome. The advantages of CSS over XSL-FO in a browser-context are obvious, but not as much outside browsers. So, the simplicity of WebVTT and the clear integration with HTML have to speak for themselves. Conversion between TTML and WebVTT was a feature that was being asked for.

I received a question about how to support ducking (reduce the volume of the main audio track) when using video descriptions. My reply was to either use video descriptions with WebVTT and do ducking during the times that a cue is active, or when using audio descriptions (i.e. actual audio tracks) to add an additional WebVTT file of kind=metadata to mark the intervals in which to do ducking. In both cases some JavaScript will be necessary.

I received another question about how to do clean audio, which I had almost forgotten was a requirement from our earlier media accessibility document. “Clean audio” consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated. I suggested using the mediagroup attribute to provide a main video element (without an audio track) and then the other channels as parallel audio tracks that can be turned on and off and attenuated individually. There is some JavaScript coding involved on top of the APIs that we have defined in HTML, but it can be implemented in browsers that support the mediagroup attribute.

Another question was about the possibilities to extend the list of @kind attribute values. I explained that right now we have a proposal for a new text track kind=”forced” so as to provide forced subtitles for sections of video with foreign language. These would be on when no other subtitle or caption tracks are activated. I also explained that if there is a need for application-specific text tracks, the kind=”metadata” would be the correct choice.

I received some further questions, in particular about how to apply styling to captions (e.g. color changes to text) and about how closely the browser are able to keep synchronization across multiple media elements. The earlier was easily answered with the ::cue pseudo-element, but the latter is a quality of implementation feature, so I had to defer to individual browsers.

Overall it was a good exercise to summarize the current state of HTML5 video accessibility and I was excited to show off support in Chrome for all the features that we designed into the standard.

The use cases for a element in HTML

The W3C HTML WG and the WHATWG are currently discussing the introduction of a <main> element into HTML.

The <main> element has been proposed by Steve Faulkner and is specified in a draft extension spec which is about to be accepted as a FPWD (first public working draft) by the W3C HTML WG. This implies that the W3C HTML WG will be looking for implementations and for feedback by implementers on this spec.

I am supportive of the introduction of a <main> element into HTML. However, I believe that the current spec and use case list don’t make a good enough case for its introduction. Here are my thoughts.

Main use case: accessibility

In my opinion, the main use case for the introduction of <main> is accessibility.

Like any other users, when blind users want to perceive a Web page/application, they need to have a quick means of grasping the content of a page. Since they cannot visually scan the layout and thus determine where the main content is, they use accessibility technology (AT) to find what is known as “landmarks”.

“Landmarks” tell the user what semantic content is on a page: a header (such as a banner), a search box, a navigation menu, some asides (also called complementary content), a footer, …. and the most important part: the main content of the page. It is this main content that a blind user most often wants to skip to directly.

In the days of HTML4, a hidden “skip to content” link at the beginning of the Web page was used as a means to help blind users access the main content.

In the days of ARIA, the aria @role=main enables authors to avoid a hidden link and instead mark the element where the main content begins to allow direct access to the main content. This attribute is supported by AT – in particular screen readers – by making it part of the landmarks that AT can directly skip to.

Both the hidden link and the ARIA @role=main approaches are, however, band aids: they are being used by those of us that make “finished” Web pages accessible by adding specific extra markup.

A world where ARIA is not necessary and where accessibility developers would be out of a job because the normal markup that everyone writes already creates accessible Web sites/applications would be much preferable over the current world of band-aids.

Therefore, to me, the primary use case for a <main> element is to achieve exactly this better world and not require specialized markup to tell a user (or a tool) where the main content on a page starts.

An immediate effect would be that pages that have a <main> element will expose a “main” landmark to blind and vision-impaired users that will enable them to directly access that main content on the page without having to wade through other text on the page. Without a <main> element, this functionality can currently only be provided using heuristics to skip other semantic and structural elements and is for this reason not typically implemented in AT.

Other use cases

The <main> element is a semantic element not unlike other new semantic elements such as <header>, <footer>, <aside>, <article>, <nav>, or <section>. Thus, it can also serve other uses where the main content on a Web page/Web application needs to be identified.

Data mining

For data mining of Web content, the identification of the main content is one of the key challenges. Many scholarly articles have been published on this topic. This stackoverflow article references and suggests a multitude of approaches, but the accepted answer says “there’s no way to do this that’s guaranteed to work”. This is because Web pages are inherently complex and many <div>, <p>, <iframe> and other elements are used to provide markup for styling, notifications, ads, analytics and other use cases that are necessary to make a Web page complete, but don’t contribute to what a user consumes as semantically rich content. A <main> element will allow authors to pro-actively direct data mining tools to the main content.

Search engines

One particularly important “data mining” tool are search engines. They, too, have a hard time to identify which sections of a Web page are more important than others and employ many heuristics to do so, see e.g. this ACM article. Yet, they still disappoint with poor results pointing to findings of keywords in little relevant sections of a page rather than ranking Web pages higher where the keywords turn up in the main content area. A <main> element would be able to help search engines give text in main content areas a higher weight and prefer them over other areas of the Web page. It would be able to rank different Web pages depending on where on the page the search words are found. The <main> element will be an additional hint that search engines will digest.

Visual focus

On small devices, the display of Web pages designed for Desktop often causes confusion as to where the main content can be found and read, in particular when the text ends up being too small to be readable. It would be nice if browsers on small devices had a functionality (maybe a default setting) where Web pages would start being displayed as zoomed in on the main content. This could alleviate some of the headaches of responsive Web design, where the recommendation is to show high priority content as the first content. Right now this problem is addressed through stylesheets that re-layout the page differently depending on device, but again this is a band-aid solution. Explicit semantic markup of the main content can solve this problem more elegantly.


Finally, naturally, <main> would also be used to style the main content differently from others. You can e.g. replace a semantically meaningless <div id=”main”> with a semantically meaningful <main> where their position is identical. My analysis below shows, that this is not always the case, since oftentimes <div id=”main”> is used to group everything together that is not the header – in particular where there are multiple columns. Thus, the ease of styling a <main> element is only a positive side effect and not actually a real use case. It does make it easier, however, to adapt the style of the main content e.g. with media queries.

Proposed alternative solutions

It has been proposed that existing markup serves to satisfy the use cases that <main> has been proposed for. Let’s analyse these on some of the most popular Web sites. First let’s list the propsed algorithms.

Proposed solution No 1: Scooby-Doo

On Sat, Nov 17, 2012 at 11:01 AM, Ian Hickson <ian@hixie.ch> wrote:
| The main content is whatever content isn't
| marked up as not being main content (anything not marked up with <header>,
| <aside>, <nav>, etc).

This implies that the first element that is not a <header>, <aside>, <nav>, or <footer> will be the element that we want to give to a blind user as the location where they should start reading. The algorithm is implemented in https://gist.github.com/4032962.

Proposed solution No 2: First article element

On Sat, Nov 17, 2012 at 8:01 AM, Ian Hickson  wrote:
| On Thu, 15 Nov 2012, Ian Yang wrote:
| >
| > That's a good idea. We really need an element to wrap all the <p>s,
| > <ul>s, <ol>s, <figure>s, <table>s ... etc of a blog post.
| That's called <article>.

This approach identifies the first <article> element on the page as containing the main content. Here’s the algorithm for this approach.

Proposed solution No 3: An example heuristic approach

The readability plugin has been developed to make Web pages readable by essentially removing all the non-main content from a page. An early source of readability is available. This demonstrates what a heuristic approach can perform.

Analysing alternative solutions


I’ve picked 4 typical Websites (top on Alexa) to analyse how these three different approaches fare. Ideally, I’d like to simply apply the above three scripts and compare pictures. However, since the semantic HTML5 elements <header>, <aside>, <nav>, and <footer> are not actually used by any of these Web sites, I don’t actually have this choice.

So, instead, I decided to make some assumptions of where these semantic elements would be used and what the outcome of applying the first two algorithms would be. I can then compare it to the third, which is a product so we can take screenshots.


http://google.com – search for “Scooby Doo”.

The search results page would likely be built with:

  • a <nav> menu for the Google bar
  • a <header> for the search bar
  • another <header> for the login section
  • another <nav> menu for the search types
  • a <div> to contain the rest of the page
  • a <div> for the app bar with the search number
  • a few <aside>s for the left and right column
  • a set of <article>s for the search results
“Scooby Doo” would find the first element after the headers as the “main content”. This is the element before the app bar in this case. Interestingly, there is a <div @id=main> already in the current Google results page, which “Scooby Doo” would likely also pick. However, there are a nav bar and two asides in this div, which clearly should not be part of the “main content”. Google actually placed a @role=main on a different element, namely the one that encapsulates all the search results.

“First Article” would find the first search result as the “main content”. While not quite the same as what Google intended – namely all search results – it is close enough to be useful.

The “readability” result is interesting, since it is not able to identify the main text on the page. It is actually aware of this problem and brings a warning before displaying this page:

Readability of google.com



A user page would likely be built with:

  • a <header> bar for the search and login bar
  • a <div> to contain the rest of the page
  • an <aside> for the left column
  • a <div> to contain the center and right column
  • an <aside> for the right column
  • a <header> to contain the center column “megaphone”
  • a <div> for the status posting
  • a set of <article>s for the home stream
“Scooby Doo” would find the first element after the headers as the “main content”. This is the element that contains all three columns. It’s actually a <div @id=content> already in the current Facebook user page, which “Scooby Doo” would likely also pick. However, Facebook selected a different element to place the @role=main : the center column.

“First Article” would find the first news item in the home stream. This is clearly not what Facebook intended, since they placed the @role=main on the center column, above the first blog post’s title. “First Article” would miss that title and the status posting.

The “readability” result again disappoints but warns that it failed:



A video page would likely be built with:

  • a <header> bar for the search and login bar
  • a <nav> for the menu
  • a <div> to contain the rest of the page
  • a <header> for the video title and channel links
  • a <div> to contain the video with controls
  • a <div> to contain the center and right column
  • an <aside> for the right column with an <article> per related video
  • an <aside> for the information below the video
  • a <article> per comment below the video
“Scooby Doo” would find the first element after the headers as the “main content”. This is the element that contains the rest of the page. It’s actually a <div @id=content> already in the current YouTube video page, which “Scooby Doo” would likely also pick. However, YouTube’s related videos and comments are unlikely to be what the user would regard as “main content” – it’s the video they are after, which generously has a <div id=watch-player>.

“First Article” would find the first related video or comment in the home stream. This is clearly not what YouTube intends.

The “readability” result is not quite as unusable, but still very bare:


http://wikipedia.com (“Overscan” page)

A Wikipedia page would likely be built with:

  • a <header> bar for the search, login and menu items
  • a <div> to contain the rest of the page
  • an &ls; article> with title and lots of text
  • <article> an <aside> with the table of contents
  • several <aside>s for the left column
Good news: “Scooby Doo” would find the first element after the headers as the “main content”. This is the element that contains the rest of the page. It’s actually a <div id=”content” role=”main”> element on Wikipedia, which “Scooby Doo” would likely also pick.

“First Article” would find the title and text of the main element on the page, but it would also include an <aside>.

The “readability” result is also in agreement.


In the following table we have summarised the results for the experiments:

Site Scooby-Doo First article Readability
Facebook.com FAIL FAIL FAIL

Clearly, Wikipedia is the prime example of a site where even the simple approaches find it easy to determine the main content on the page. WordPress blogs are similarly successful. Almost any other site, including news sites, social networks and search engine sites are petty hopeless with the proposed approaches, because there are too many elements that are used for layout or other purposes (notifications, hidden areas) such that the pre-determined list of semantic elements that are available simply don’t suffice to mark up a Web page/application completely.


It seems that in general it is impossible to determine which element(s) on a Web page should be the “main” piece of content that accessibility tools jump to when requested, that a search engine should put their focus on, or that should be highlighted to a general user to read. It would be very useful if the author of the Web page would provide a hint through a <main> element where that main content is to be found.

I think that the <main> element becomes particularly useful when combined with a default keyboard shortcut in browsers as proposed by Steve: we may actually find that non-accessibility users will also start making use of this shortcut, e.g. to get to videos on YouTube pages directly without having to tab over search boxes and other interactive elements, etc. Worthwhile markup indeed.

A systematic approach to making Web Applications accessible

With the latest developments in HTML5 and the still fairly new ARIA (Accessible Rich Interface Applications) attributes introduced by the W3C WAI (Web Accessibility Initiative), browsers have now implemented many features that allow you to make your JavaScript-heavy Web applications accessible.

Since I began working on making a complex web application accessible just over a year ago, I discovered that there was no step-by-step guide to approaching the changes necessary for creating an accessible Web application. Therefore, many people believe that it is still hard, if not impossible, to make Web applications accessible. In fact, it can be approached systematically, as this article will describe.

This post is based on a talk that Alice Boxhall and I gave at the recent Linux.conf.au titled “Developing accessible Web apps – how hard can it be?” (slides, video), which in turn was based on a Google Developer Day talk by Rachel Shearer (slides).

These talks, and this article, introduce a process that you can follow to make your Web applications accessible: each step will take you closer to having an application that can be accessed using a keyboard alone, and by users of screenreaders and other accessibility technology (AT).

The recommendations here only roughly conform to the requirements of WCAG (Web Content Accessibility Guidelines), which is the basis of legal accessibility requirements in many jurisdictions. The steps in this article may or may not be sufficient to meet a legal requirement. It is focused on the practical outcome of ensuring users with disabilities can use your Web application.

Step-by-step Approach

The steps to follow to make your Web apps accessible are as follows:

  1. Use native HTML tags wherever possible
  2. Make interactive elements keyboard accessible
  3. Provide extra markup for AT (accessibility technology)

If you are a total newcomer to accessibility, I highly recommend installing a screenreader and just trying to read/navigate some Web pages. On Windows you can install the free NVDA screenreader, on Mac you can activate the pre-installed VoiceOver screenreader, on Linux you can use Orca, and if you just want a browser plugin for Chrome try installing ChromeVox.

1. Use native HTML tags

As you implement your Web application with interactive controls, try to use as many native HTML tags as possible.

HTML5 provides a rich set of elements which can be used to both add functionality and provide semantic context to your page. HTML4 already included many useful interactive controls, like <a>, <button>, <input> and <select>, and semantic landmark elements like <h1>. HTML5 adds richer <input> controls, and a more sophisticated set of semantic markup elements like such as <time>, <progress>, <meter>, <nav>, <header>, <article> and <aside>. (Note: check browser support for browser support of the new tags).

Using as much of the rich HTML5 markup as possible means that you get all of the accessibility features which have been implemented in the browser for those elements, such as keyboard support, short-cut keys and accessibility metadata, for free. For generic tags you have to implement them completely from scratch.

What exactly do you miss out on when you use a generic tag such as <div> over a specific semantic one such as <button>?

  1. Generic tags are not focusable. That means you cannot reach them through using the [tab] on the keyboard.
  2. You cannot activate them with the space bar or enter key or perform any other keyboard interaction that would be regarded as typical with such a control.
  3. Since the role that the control represents is not specified in code but is only exposed through your custom visual styling, screenreaders cannot express to their users what type of control it is, e.g. button or link.
  4. Neither can screenreaders add the control to the list of controls on the page that are of a certain type, e.g. to navigate to all headers of a certain level on the page.
  5. And finally you need to manually style the element in order for it to look distinctive compared to other elements on the page; using a default control will allow the browser to provide the default style for the platform, which you can still override using CSS if you want.


Compare these two buttons. The first one is implemented using a <div> tag, the second one using a <button> tag. Try using a screenreader to experience the difference.

.custombutton {
cursor: pointer;
border: 1px solid #000;
background-color: #F6F6F6;
padding: 2px 5px;

 .custombutton {
  cursor: pointer;
  border: 1px solid #000;
  background-color: #F6F6F6;
  display: inline-block;
  padding: 2px 5px;
<div class="custombutton" onclick="alert('sent!')">
<button onclick="alert('sent!')">

2. Make interactive elements keyboard accessible

Many sophisticated web applications have some interactive controls that just have no appropriate HTML tag equivalent. In this case, you will have had to build an interactive element with JavaScript and <div> and/or <span> tags and lots of custom styling. The good news is, it’s possible to make even these custom controls accessible, and as a side benefit you will also make your application smoother to use for power users.

The first thing you can do to test usability of your control, or your Web app, is to unplug the mouse and try to use only the [TAB] and [ENTER] keys to interact with your application.

the tab key on the keyboardthe enter key on the keyboard

Try the following:

  • Can you reach all interactive elements with [TAB]?
  • Can you activate interactive elements with [ENTER] (or [SPACE])?
  • Are the elements in the right tab order?
  • After interaction: is the right element in focus?
  • Is there a keyboard shortcut that activates the element (accesskey)?

No? Let’s fix it.

2.1. Reaching interactive elements

If you have an element on your page that cannot be reached with [TAB], put a @tabindex attribute on it.


Here we have a <span> tag that works as a link (don’t do this – it’s just a simple example). The first one cannot be reached using [TAB] but the second one has a tabindex and is thus part of the tab order of the HTML page.

(Note: since we experiment lots with the tabindex in this article, to avoid confusion, click on some text in this paragraph and then hit the [TAB] key to see where it goes next. The click will set your keyboard focus in the DOM.)

.customlink {
text-decoration: underline;
cursor: pointer;


.customlink {
  text-decoration: underline;
  cursor: pointer;
<span class="customlink" onclick="alert('activated!')">
.customlink {
  text-decoration: underline;
  cursor: pointer;
<span class="customlink" onclick="alert('activated!')" tabindex="0">

You set @tabindex=0 to add an element into the native tab order of the page, which is the DOM order.

2.2. Activating interactive elements

Next, you typically want to be able to use the [ENTER] and [SPACE] keys to activate your custom control. To do so, you will need to implement an onkeydown event handler. Note that the keyCode for [ENTER] is 13 and for [SPACE] is 32.


Let’s add this functionality to the <span> tag from before. Try tabbing to it and hit the [ENTER] or [SPACE] key.

<span class="customlink" onclick="alert('activated!')" tabindex="0">

function handlekey(event) {
var target = event.target || event.srcElement;
if (event.keyCode == 13 || event.keyCode == 32) { target.onclick(); }


<span class="customlink" onclick="alert('activated!')" tabindex="0"
function handlekey(event) {
  var target = event.target || event.srcElement;
  if (event.keyCode == 13 || event.keyCode == 32) {

Note that there are some controls that might need support for keys other than [tab] or [enter] to be able to use them from the keyboard alone, for example a custom list box, menu or slider should respond to arrow keys.

2.3. Elements in the right tab order

Have you tried tabbing to all the elements on your page that you care about? If so, check if the order of tab stops seems right. The default order is given by the order in which interactive elements appear in the DOM. For example, if your page’s code has a right column that is coded before the main article, then the links in the right column will receive tab focus first before the links in the main article.

You could change this by re-ordering your DOM, but oftentimes this is not possible. So, instead give the elements that should be the first ones to receive tab focus a positive @tabindex. The tab access will start at the smallest non-zero @tabindex value. If multiple elements share the same @tabindex value, these controls receive tab focus in DOM order. After that, interactive elements and those with @tabindex=0 will receive tab focus in DOM order.


The one thing that always annoys me the most is if the tab order in forms that I am supposed to fill in is illogical. Here is an example where the first and last name are separated by the address because they are in a table. We could fix it by moving to a <div> based layout, but let’s use @tabindex to demonstrate the change.

.customtabs input {
width: 50px;

Firstname: Address:
Lastname: City:
<table class="customtabs">
      <input type="text" id="firstname">
      <input type="text" id="address">
      <input type="text" id="lastname">
      <input type="text" id="city">
Click here to test this form,
then [TAB]:

Firstname: Address:
Lastname: City:
<table class="customtabs">
      <input type="text" id="firstname" tabindex="10">
      <input type="text" id="address" tabindex="30">
      <input type="text" id="lastname" tabindex="20">
      <input type="text" id="city" tabindex="40">

Be very careful with using non-zero tabindex values. Since they change the tab order on the page, you may get side effects that you might not have intended, such as having to give other elements on the page a non-zero tabindex value to avoid skipping too many other elements as I would need to do here.

2.4. Focus on the right element

Some of the controls that you create may be rather complex and open elements on the page that were previously hidden. This is particularly the case for drop-downs, pop-ups, and menus in general. Oftentimes the hidden element is not defined in the DOM right after the interactive control, such that a [TAB] will not put your keyboard focus on the next element that you are interacting with.

The solution is to manage your keyboard focus from JavaScript using the .focus() method.


Here is a menu that is declared ahead of the menu button. If you tab onto the button and hit enter, the menu is revealed. But your tab focus is still on the menu button, so your next [TAB] will take you somewhere else. We fix it by setting the focus on the first menu item after opening the menu.

#custommenu {
padding: 3px;
border:1px solid #666;
.squarebuttons button {
border: 1px solid black;

function displayMenu(value) {

<div id="custommenu" style="display:none;">
  <button id="item1" onclick="displayMenu('none');">Menu item1</button>
  <button id="item2" onclick="displayMenu('none');">Menu item2</button>
<button onclick="displayMenu('block');">Menu</button>
function displayMenu(value) {

#custommenu2 {
padding: 3px;
border:1px solid #666;

function displayMenu2(value) {

<div id="custommenu" style="display:none;">
  <button id="item1" onclick="displayMenu('none');">Menu item1</button>
  <button id="item2" onclick="displayMenu('none');">Menu item2</button>
<button onclick="displayMenu('block');">Menu</button>
function displayMenu(value) {

You will notice that there are still some things you can improve on here. For example, after you close the menu again with one of the menu items, the focus does not move back onto the menu button.

Also, after opening the menu, you may prefer not to move the focus onto the first menu item but rather just onto the menu <div>. You can do so by giving that div a @tabindex and then calling .focus() on it. If you do not want to make the div part of the normal tabbing order, just give it a @tabindex=-1 value. This will allow your div to receive focus from script, but be exempt from accidental tabbing onto (though usually you just want to use @tabindex=0).

Bonus: If you want to help keyboard users even more, you can also put outlines on the element that is currently in focus using CSS”s outline property. If you want to avoid the outlines for mouse users, you can dynamically add a class that removes the outline in mouseover events but leaves it for :focus.

2.5. Provide sensible keyboard shortcuts

At this stage your application is actually keyboard accessible. Congratulations!

However, it’s still not very efficient: like power-users, screenreader users love keyboard shortcuts: can you imagine if you were forced to tab through an entire page, or navigate back to a menu tree at the top of the page, to reach each control you were interested in? And, obviously, anything which makes navigating the app via the keyboard more efficient for screenreader users will benefit all power users as well, like the ubiquitous keyboard shortcuts for cut, copy and paste.

HTML4 introduced so-called accesskeys for this. In HTML5 @accesskey is now allowed on all elements.

The @accesskey attribute takes the value of a keyboard key (e.g. @accesskey="x") and is activated through platform- and browser-specific activation keys. For example, on the Mac it’s generally the [Ctrl] key, in IE it’ the [Alt] key, in Firefox on Windows [Shift]-[Alt], and in Opera on Windows [Shift]-[ESC]. You press the activation key and the accesskey together which either activates or focuses the element with the @accesskey attribute.


var button = document.getElementById(‘accessbutton’);
if (button.accessKeyLabel) {
button.innerHTML += ‘ (‘ + button.accessKeyLabel + ‘)’;

<button id="accessbutton" onclick="alert('sent!')" accesskey="e">
  var button = document.getElementById('accessbutton');
  if (button.accessKeyLabel) {
    button.innerHTML += ' (' + button.accessKeyLabel + ')';

Now, the idea behind this is clever, but the execution is pretty poor. Firstly, the different activation keys between different platforms and browsers make it really hard for people to get used to the accesskeys. Secondly, the key combinations can conflict with browser and screenreader shortcut keys, the first of which will render browser shortcuts unusable and the second will effectively remove the accesskeys.

In the end it is up to the Web application developer whether to use the accesskey attribute or whether to implement explicit shortcut keys for the application through key event handlers on the window object. In either case, make sure to provide a help list for your shortcut keys.

Also note that a page with a really good hierarchical heading layout and use of ARIA landmarks can help to eliminate the need for accesskeys to jump around the page, since there are typically default navigations available in screen readers to jump directly to headings, hyperlinks, and ARIA landmarks.

3. Provide markup for AT

Having made the application keyboard accessible also has advantages for screenreaders, since they can now reach the controls individually and activate them. So, next we will use a screenreader and close our eyes to find out where we only provide visual cues to understand the necessary interaction.

Here are some of the issues to consider:

  • Role may need to get identified
  • States may need to be kept track of
  • Properties may need to be made explicit
  • Labels may need to be provided for elements

This is where the W3C’s ARIA (Accessible Rich Internet Applications) standard comes in. ARIA attributes provide semantic information to screen readers and other AT that is otherwise conveyed only visually.

Note that using ARIA does not automatically implement the standard widget behavior – you’ll still need to add focus management, keyboard navigation, and change aria attribute values in script.

3.1. ARIA roles

After implementing a custom interactive widget, you need to add a @role attribute to indicate what type of controls it is, e.g. that it is playing the role of a standard tag such as a button.


This menu button is implemented as a <div>, but with a role of “button” it is announced as a button by a screenreader.

<div tabindex="0" role="button">Menu</div>

ARIA roles also describe composite controls that do not have a native HTML equivalent.


This menu with menu items is implemented as a set of <div> tags, but with a role of “menu” and “menuitem” items.


<div role="menu">
  <div tabindex="0" role="menuitem">Cut</div>
  <div tabindex="0" role="menuitem">Copy</div>
  <div tabindex="0" role="menuitem">Paste</div>

3.2. ARIA states

Some interactive controls represent different states, e.g. a checkbox can be checked or unchecked, or a menu can be expanded or collapsed.


The following menu has states on the menu items, which are here not just used to give an aural indication through the screenreader, but also a visual one through CSS.

.custombutton:before {
content: “”;
.custombutton[aria-checked=true]:before {
content: “2713 “;


.custombutton[aria-checked=true]:before {
   content:  "2713 ";
<div role="menu">
  <div tabindex="0" role="menuitem" aria-checked="true">Left</div>
  <div tabindex="0" role="menuitem" aria-checked="false">Center</div>
  <div tabindex="0" role="menuitem" aria-checked="false">Right</div>

3.3. ARIA properties

Some of the functionality of interactive controls cannot be captured by the role attribute alone. We have ARIA properties to add features that the screenreader needs to announce, such as aria-label, aria-haspopup, aria-activedescendant, or aria-live.


The following drop-down menu uses aria-haspopup to tell the screenreader that there is a popup hidden behind the menu button together with an ARIA state of aria-expanded to track whether it’s open or closed.

.menu {
border: 1px solid black;
.menuitem:hover {
background: grey;
.menuitem[aria-checked=true]:before {
content: “2713 “;


var button = document.getElementById(“button”);
var menu = document.getElementById(“menu”);
var items = document.getElementsByClassName(“menuitem”);
var focused = 0;
function showMenu(evt) {
menu.style.visibility = ‘visible’;
focused = getSelected();
function hideMenu(evt) {
menu.style.visibility = ‘hidden’;
function getSelected() {
for (var i=0; i < items.length; i++) {
if (items[i].getAttribute('aria-checked') == 'true') {
return i;
function setSelected(elem) {
var curSelected = getSelected();
items[curSelected].setAttribute('aria-checked', 'false');
elem.setAttribute('aria-checked', 'true');
function selectItem(evt) {
function getPrevItem(index) {
var prev = index – 1;
if (prev < 0) {
prev = items.length – 1;
return prev;
function getNextItem(index) {
var next = index + 1;
if (next == items.length) {
next = 0;
return next;
function handleButtonKeys(evt) {
var key = evt.keyCode;
switch(key) {
case (13): /* ENTER */
case (32): /* SPACE */
function handleMenuKeys(evt) {
var key = evt.keyCode;
switch(key) {
case (38): /* UP */
focused = getPrevItem(focused);
case (40): /* DOWN */
focused = getNextItem(focused);
case (13): /* ENTER */
case (32): /* SPACE */
case (27): /* ESC */
button.addEventListener('click', showMenu, false);
button.addEventListener('keydown', handleButtonKeys, false);
for (var i = 0; i < items.length; i++) {
items[i].addEventListener('click', selectItem, false);
items[i].addEventListener('keydown', handleMenuKeys, false);

<div class="custombutton" id="button" tabindex="0" role="button"
   aria-expanded="false" aria-haspopup="true">
<div role="menu"  class="menu" id="menu" style="display: none;">
  <div tabindex="0" role="menuitem" class="menuitem" aria-checked="true">
  <div tabindex="0" role="menuitem" class="menuitem" aria-checked="false">
  <div tabindex="0" role="menuitem" class="menuitem" aria-checked="false">
[CSS and JavaScript for example omitted]

3.4. Labelling

The main issue that people know about accessibility seems to be that they have to put alt text onto images. This is only one means to provide labels to screenreaders for page content. Labels are short informative pieces of text that provide a name to a control.

There are actually several ways of providing labels for controls:

  • on img elements use @alt
  • on input elements use the label element
  • use @aria-labelledby if there is another element that contains the label
  • use @title if you also want a label to be used as a tooltip
  • otherwise use @aria-label

I’ll provide examples for the first two use cases – the other use cases are simple to deduce.


The following two images show the rough concept for providing alt text for images: images that provide information should be transcribed, images that are just decorative should receive an empty @alt attribute.

shocked lolcat titled 'HTML cannot do that!
Image by Noah Sussman
<img src="texture.jpg" alt="">
<img src="lolcat.jpg"
alt="shocked lolcat titled 'HTML cannot do that!">
<img src="texture.jpg" alt="">

When marking up decorative images with an empty @alt attribute, the image is actually completely removed from the accessibility tree and does not confuse the blind user. This is a desired effect, so do remember to mark up all your images with @alt attributes, even those that don’t contain anything of interest to AT.


In the example form above in Section 2.3, when tabbing directly on the input elements, the screen reader will only say “edit text” without announcing what meaning that text has. That’s not very useful. So let’s introduce a label element for the input elements. We’ll also add checkboxes with a label.

<label>Doctor title:</label>
  <input type="checkbox" id="doctor"/>
  <input type="text" id="firstname2"/>

<label for="lastname2">Lastname:</label>
  <input type="text" id="lastname2"/>

  <input type="text" id="address2">
<label for="city2">City:
  <input type="text" id="city2">
<label for="remember">Remember me:</label>
  <input type="checkbox" id="remember">

In this example we use several different approaches to show what a different it makes to use the <label> element to mark up input boxes.

The first two fields just have a <label> element next to a <input> element. When using a screenreader you will not notice a difference between this and not using the <label> element because there is no connection between the <label> and the <input> element.

In the third field we use the @for attribute to create that link. Now the input field isn’t just announced as “edit text”, but rather as “Lastname edit text”, which is much more useful. Also, the screenreader can now skip the labels and get straight on the input element.

In the fourth and fifth field we actually encapsulate the <input> element inside the <label> element, thus avoiding the need for a @for attribute, though it doesn’t hurt to explicity add it.

Finally we look at the checkbox. By including a referenced <label> element with the checkbox, we change the screenreaders announcement from just “checkbox not checked” to “Remember me checkbox not checked”. Also notice that the click target now includes the label, making the checkbox not only more usable to screenreaders, but also for mouse users.

4. Conclusions

This article introduced a process that you can follow to make your Web applications accessible. As you do that, you will noticed that there are other things that you may need to do in order to give the best experience to a power user on a keyboard, a blind user using a screenreader, or a vision-impaired user using a screen magnifier. But once you’ve made a start, you will notice that it’s not all black magic and a lot can be achieved with just a little markup.

You will find more markup in the WAI ARIA specification and many more resources at Mozilla’s ARIA portal. Now go and change the world!

Many thanks to Alice Boxhall and Dominic Mazzoni for their proof-reading and suggested changes that really helped improve the article!

My crazy linux.conf.au week

In January I attended the annual Australian Linux and Open Source conference (LCA). But since I was sick all of January and had a lot to catch up on, I never got around to sharing all the talks that I gave during that time.

Drupal Down Under

It started with a talk at Drupal Down Under, which happened the weekend before LCA. I gave a talk titled “HTML5 video specifications” (video, slides).

I spoke about the video and audio element in HTML5, how to provide fallback content, how to encode content, how to control them from JavaScript, and briefly about Drupal video modules, though the next presentation provided much more insight into those. I explained how to make the HTML5 media elements accessible, including accessible controls, captions, audio descriptions, and the new WebVTT file format. I ran out of time to introduce the last section of my slides which are on WebRTC.


On the first day of LCA I gave a talk both in the Multimedia Miniconf and the Browser Miniconf.

Browser Miniconf

In the Browser Miniconf I talked about “Web Standardisation – how browser vendors collaborate, or not” (slides). Maybe the most interesting part about this was that I tried out a new slide “deck” tool called impress.js. I’m not yet sure if I like it but it worked well for this talk, in which I explained how the HTML5 spec is authored and who has input.

I also sat on a panel of browser developers in the Browser Miniconf (more as a standards than as a browser developer, but that’s close enough). We were asked about all kinds of latest developments in HTML5, CSS3, and media standards in the browser.

Multimedia Miniconf

In the Multimedia Miniconf I gave a “HTML5 media accessibility update” (slides). I talked about the accessibility problems of Flash, how native HTML5 video players will be better, about accessible video controls, captions, navigation chapters, audio descriptions, and WebVTT. I also provided a demo of how to synchronize multiple video elements using a polyfill for the multitrack API.

I also provided an update on HTTP adaptive streaming APIs as a lightning talk in the Multimedia Miniconf. I used an extract of the Drupal conference slides for it.

Main conference

Finally, and most importantly, Alice Boxhall and myself gave a talk in the main linux.conf.au titled “Developing Accessible Web Apps – how hard can it be?” (video, slides). I spoke about a process that you can follow to make your Web applications accessible. I’m writing a separate blog post to explain this in more detail. In her part, Alice dug below the surface of browsers to explain how the accessibility markup that Web developers provide is transformed into data structures that are handed to accessibility technologies.

Open Media Developers Track at OVC 2011

The Open Video Conference that took place on 10-12 September was so overwhelming, I’ve still not been able to catch my breath! It was a dense three days for me, even though I only focused on the technology sessions of the conference and utterly missed out on all the policy and content discussions.

Roughly 60 people participated in the Open Media Software (OMS) developers track. This was an amazing group of people capable and willing to shape the future of video technology on the Web:

  • HTML5 video developers from Apple, Google, Opera, and Mozilla (though we missed the NZ folks),
  • codec developers from WebM, Xiph, and MPEG,
  • Web video developers from YouTube, JWPlayer, Kaltura, VideoJS, PopcornJS, etc.,
  • content publishers from Wikipedia, Internet Archive, YouTube, Netflix, etc.,
  • open source tool developers from FFmpeg, gstreamer, flumotion, VideoLAN, PiTiVi, etc,
  • and many more.

To provide a summary of all the discussions would be impossible, so I just want to share the key take-aways that I had from the main sessions.

WebRTC: Realtime Communications and HTML5

Tim Terriberry (Mozilla), Serge Lachapelle (Google) and Ethan Hugg (CISCO) moderated this session together (slides). There are activities both at the W3C and at IETF – the ones at IETF are supposed to focus on protocols, while the W3C ones on HTML5 extensions.

The current proposal of a PeerConnection API has been implemented in WebKit/Chrome as open source. It is expected that Firefox will have an add-on by Q1 next year. It enables video conferencing, including media capture, media encoding, signal processing (echo cancellation etc), secure transmission, and a data stream exchange.

Current discussions are around the signalling protocol and whether SIP needs to be required by the standard. Further, the codec question is under discussion with a question whether to mandate VP8 and Opus, since transcoding gateways are not desirable. Another question is how to measure the quality of the connection and how to report errors so as to allow adaptation.

What always amazes me around RTC is the sheer number of specialised protocols that seem to be required to implement this. WebRTC does not disappoint: in fact, the question was asked whether there could be a lighter alternative than to re-use dozens of years of protocol development – is it over-engineered? Can desktop players connect to a WebRTC session?

We are already in a second or third revision of this part of the HTML5 specification and yet it seems the requirements are still being collected. I’m quietly confident that everything is done to make the lives of the Web developer easier, but it sure looks like a huge task.

The Missing Link: Flash to HTML5

Zohar Babin (Kaltura) and myself moderated this session and I must admit that this session was the biggest eye-opener for me amongst all the sessions. There was a large number of Flash developers present in the room and that was great, because sometimes we just don’t listen enough to lessons learnt in the past.

This session gave me one of those aha-moments: it the form of the Flash appendBytes() API function.

The appendBytes() function allows a Flash developer to take a byteArray out of a connected video resource and do something with it – such as feed it to a video for display. When I heard that Web developers want that functionality for JavaScript and the video element, too, I instinctively rejected the idea wondering why on earth would a Web developer want to touch encoded video bytes – why not leave that to the browser.

But as it turns out, this is actually a really powerful enabler of functionality. For example, you can use it to:

  • display mid-roll video ads as part of the same video element,
  • sequence playlists of videos into the same video element,
  • implement DVR functionality (high-speed seeking),
  • do mash-ups,
  • do video editing,
  • adaptive streaming.

This totally blew my mind and I am now completely supportive of having such a function in HTML5. Together with media fragment URIs you could even leave all the header download management for resources to the Web browser and just request time ranges from a video through an appendBytes() function. This would be easier on the Web developer than having to deal with byte ranges and making sure that appropriate decoding pipelines are set up.

Standards for Video Accessibility

Philip Jagenstedt (Opera) and myself moderated this session. We focused on the HTML5 track element and the WebVTT file format. Many issues were identified that will still require work.

One particular topic was to find a standard means of rendering the UI for caption, subtitle, und description selection. For example, what icons should be used to indicate that subtitles or captions are available. While this is not part of the HTML5 specification, it’s still important to get this right across browsers since otherwise users will get confused with diverging interfaces.

Chaptering was discussed and a particular need to allow URLs to directly point at chapters was expressed. I suggested the use of named Media Fragment URLs.

The use of WebVTT for descriptions for the blind was also discussed. A suggestion was made to use the voice tag <v> to allow for “styling” (i.e. selection) of the screen reader voice.

Finally, multitrack audio or video resources were also discussed and the @mediagroup attribute was explained. A question about how to identify the language used in different alternative dubs was asked. This is an issue because @srclang is not on audio or video, only on text, so it’s a missing feature for the multitrack API.

Beyond this session, there was also a breakout session on WebVTT and the track element. As a consequence, a number of bugs were registered in the W3C bug tracker.

WebM: Testing, Metrics and New features

This session was moderated by John Luther and John Koleszar, both of the WebM Project. They started off with a presentation on current work on WebM, which includes quality testing and improvements, and encoder speed improvement. Then they moved on to questions about how to involve the community more.

The community criticised that communication of what is happening around WebM is very scarce. More sharing of information was requested, including a move to using open Google+ hangouts instead of Google internal video conferences. More use of the public bug tracker can also help include the community better.

Another pain point of the community was that code is introduced and removed without much feedback. It was requested to introduce a peer review process. Also it was requested that example code snippets are published when new features are announced so others can replicate the claims.

This all indicates to me that the WebM project is increasingly more open, but that there is still a lot to learn.

Standards for HTTP Adaptive Streaming

This session was moderated by Frank Galligan and Aaron Colwell (Google), and Mark Watson (Netflix).

Mark started off by giving us an introduction to MPEG DASH, the MPEG file format for HTTP adaptive streaming. MPEG has just finalized the format and he was able to show us some examples. DASH is XML-based and thus rather verbose. It is covering all eventualities of what parameters could be switched during transmissions, which makes it very broad. These include trick modes e.g. for fast forwarding, 3D, multi-view and multitrack content.

MPEG have defined profiles – one for live streaming which requires chunking of the files on the server, and one for on-demand which requires keyframe alignment of the files. There are clear specifications for how to do these with MPEG. Such profiles would need to be created for WebM and Ogg Theora, too, to make DASH universally applicable.

Further, the Web case needs a more restrictive adaptation approach, since the video element’s API is already accounting for some of the features that DASH provides for desktop applications. So, a Web-specific profile of DASH would be required.

Then Aaron introduced us to the MediaSource API and in particular the webkitSourceAppend() extension that he has been experimenting with. It is essentially an implementation of the appendBytes() function of Flash, which the Web developers had been asking for just a few sessions earlier. This was likely the biggest announcement of OVC, alas a quiet and technically-focused one.

Aaron explained that he had been trying to find a way to implement HTTP adaptive streaming into WebKit in a way in which it could be standardised. While doing so, he also came across other requirements around such chunked video handling, in particular around dynamic ad insertion, live streaming, DVR functionality (fast forward), constraint video editing, and mashups. While trying to sort out all these requirements, it became clear that it would be very difficult to implement strategies for stream switching, buffering and delivery of video chunks into the browser when so many different and likely contradictory requirements exist. Also, once an approach is implemented and specified for the browser, it becomes very difficult to innovate on it.

Instead, the easiest way to solve it right now and learn about what would be necessary to implement into the browser would be to actually allow Web developers to queue up a chunk of encoded video into a video element for decoding and display. Thus, the webkitSourceAppend() function was born (specification).

The proposed extension to the HTMLMediaElement is as follows:

partial interface HTMLMediaElement {
  // URL passed to src attribute to enable the media source logic.
  readonly attribute [URL] DOMString webkitMediaSourceURL;

  bool webkitSourceAppend(in Uint8Array data);

  // end of stream status codes.
  const unsigned short EOS_NO_ERROR = 0;
  const unsigned short EOS_NETWORK_ERR = 1;
  const unsigned short EOS_DECODE_ERR = 2;

  void webkitSourceEndOfStream(in unsigned short status);

  // states
  const unsigned short SOURCE_CLOSED = 0;
  const unsigned short SOURCE_OPEN = 1;
  const unsigned short SOURCE_ENDED = 2;

  readonly attribute unsigned short webkitSourceState;

The code is already checked into WebKit, but commented out behind a command-line compiler flag.

Frank then stepped forward to show how webkitSourceAppend() can be used to implement HTTP adaptive streaming. His example uses WebM – there are no examples with MPEG or Ogg yet.

The chunks that Frank’s demo used were 150 video frames long (6.25s) and 5s long audio. Stream switching only switched video, since audio data is much lower bandwidth and more important to retain at high quality. Switching was done on multiplexed files.

Every chunk requires an XHR range request – this could be optimised if the connections were kept open per adaptation. Seeking works, too, but since decoding requires download of a whole chunk, seeking latency is determined by the time it takes to download and decode that chunk.

Similar to DASH, when using this approach for live streaming, the server has to produce one file per chunk, since byte range requests are not possible on a continuously growing file.

Frank did not use DASH as the manifest format for his HTTP adaptive streaming demo, but instead used a hacked-up custom XML format. It would be possible to use JSON or any other format, too.

After this session, I was actually completely blown away by the possibilities that such a simple API extension allows. If I wasn’t sold on the idea of a appendBytes() function in the earlier session, this one completely changed my mind. While I still believe we need to standardise a HTTP adaptive streaming file format that all browsers will support for all codecs, and I still believe that a native implementation for support of such a file format is necessary, I also believe that this approach of webkitSourceAppend() is what HTML needs – and maybe it needs it faster than native HTTP adaptive streaming support.

Standards for Browser Video Playback Metrics

This session was moderated by Zachary Ozer and Pablo Schklowsky (JWPlayer). Their motivation for the topic was, in fact, also HTTP adaptive streaming. Once you leave the decisions about when to do stream switching to JavaScript (through a function such a wekitSourceAppend()), you have to expose stream metrics to the JS developer so they can make informed decisions. The other use cases is, of course, monitoring of the quality of video delivery for reporting to the provider, who may then decide to change their delivery environment.

The discussion found that we really care about metrics on three different levels:

  • measuring the network performance (bandwidth)
  • measuring the decoding pipeline performance
  • measuring the display quality

In the end, it seemed that work previously done by Steve Lacey on a proposal for video metrics was generally acceptable, except for the playbackJitter metric, which may be too aggregate to mean much.

Device Inputs / A/V in the Browser

I didn’t actually attend this session held by Anant Narayanan (Mozilla), but from what I heard, the discussion focused on how to manage permission of access to video camera, microphone and screen, e.g. when multiple applications (tabs) want access or when the same site wants access in a different session. This may apply to real-time communication with screen sharing, but also to photo sharing, video upload, or canvas access to devices e.g. for time lapse photography.

Open Video Editors

This was another session that I wasn’t able to attend, but I believe the creation of good open source video editing software and similar video creation software is really crucial to giving video a broader user appeal.

Jeff Fortin (PiTiVi) moderated this session and I was fascinated to later see his analysis of the lifecycle of open source video editors. It is shocking to see how many people/projects have tried to create an open source video editor and how many have stopped their project. It is likely that the creation of a video editor is such a complex challenge that it requires a larger and more committed open source project – single people will just run out of steam too quickly. This may be comparable to the creation of a Web browser (see the size of the Mozilla project) or a text processing system (see the size of the OpenOffice project).

Jeff also mentioned the need to create open video editor standards around playlist file formats etc. Possibly the Open Video Alliance could help. In any case, something has to be done in this space – maybe this would be a good topic to focus next year’s OVC on?

Monday’s Breakout Groups

The conference ended officially on Sunday night, but we had a third day of discussions / hackday at the wonderful New York Lawschool venue. We had collected issues of interest during the two previous days and organised the breakout groups on the morning (Schedule).

In the Content Protection/DRM session, Mark Watson from Netflix explained how their API works and that they believe that all we need in browsers is a secure way to exchange keys and an indicator of protection scheme is used – the actual protection scheme would not be implemented by the browser, but be provided by the underlying system (media framework/operating system). I think that until somebody actually implements something in a browser fork and shows how this can be done, we won’t have much progress. In my understanding, we may also need to disable part of the video API for encrypted content, because otherwise you can always e.g. grab frames from the video element into canvas and save them from there.

In the Playlists and Gapless Playback session, there was massive brainstorming about what new cool things can be done with the video element in browsers if playback between snippets can be made seamless. Further discussions were about a standard playlist file formats (such as XSPF, MRSS or M3U), media fragment URIs in playlists for mashups, and the need to expose track metadata for HTML5 media elements.

What more can I say? It was an amazing three days and the complexity of problems that we’re dealing with is a tribute to how far HTML5 and open video has already come and exciting news for the kind of applications that will be possible (both professional and community) once we’ve solved the problems of today. It will be exciting to see what progress we will have made by next year’s conference.

Thanks go to Google for sponsoring my trip to OVC.

UPDATE: We actually have a mailing list for open media developers who are interested in these and similar topics – do join at http://lists.annodex.net/cgi-bin/mailman/listinfo/foms.

WebVTT at W3C

Today we started a community group (CG) at the W3C for “Web Media Text Tracks”: http://www.w3.org/community/texttracks/.

The group has been created to work on many aspects of video text tracks of which captioning and the WebVTT format are key parts.

The main reason behind creating this group is to create a forum at the W3C for working on WebVTT to allow all browsers to support this format and be involved in its development.

We’ve not gone the full way to creating a Working Group, although that was the initial intention. We had objections from W3C members for going down that path, so are using the CG path for now.

This is actually a good thing because CGs are open for anyone to join, while WGs are only open to W3C members. The key difference is that specs coming out of WGs can become RECs (“standards”), while CG’s specs cannot.

If we eventually see a need to move WebVTT to a REC, that move will be straight forward, since there is a clear path for work to transition from a CG to a WG.

3rd W3C Web and TV Workshop, Hollywood

Curious about any new requirements that the TV community may have for HTML5 video, I attended the W3C Web and TV Workshop in Hollywood last week. It’s already the third of its kind and was also the largest to date showing an increasing interest of the TV community to converge with the Web community.

The Workshop Aim

I went into the Workshop not quite knowing what to expect. My previous contact with members of this community was restricted to email exchanges on the W3C Web and TV Interest Group (IG) mailing list. I knew there was some interest in video accessibility (well: particularly captions) and little knowledge of existing HTML5 specifications around text tracks and why the browsers were going with WebVTT. So I had decided to attend the workshop to get a better understanding of the community, it’s background, needs, and issues, and to hopefully teach some of the ways of HTML5. For that reason I had also submitted a WebVTT presentation/demo.

As it turned out, the workshop had as its key target the facilitation of communication between the TV and the HTML5 community. The aim was to identify features that need to be added to the HTML5 video element to satisfy the needs of the TV community. I obviously came to the right workshop.

The process that is being used by the W3C in the Interest Group is to have TV community members express their needs, then have HTML5 experts express how these needs can be satisfied with existing HTML5 features, then make trial implementations and identify any shortcomings, then move forward to progress these through HTML5 or HTML.next. This workshop clearly focused on the first step: expressing needs.

Often times it was painful for me to watch presenters defending their requirements and trying to impress on the audience how important a certain feature is to them when that features actually already has a HTML5 specification, but just not yet a browser implementations. That there were so few HTML5 video experts present and that they were given very little space to directly reply to the expressed needs and actually explain what is already possible (or specified to be possible) was probably one of the biggest drawbacks of the workshop.

To be fair, detailed technical discussions were not possible in a room with 150 attendees with a panel sitting at the front discussing topics and taking questions. Solving a use case with existing HTML5 markup and identifying the gaps requires smaller break-out groups of a maximum of maybe 20 people and sufficient HTML5 knowledge in the room. Ultimately they require a single person to try to implement it using JavaScript alone, and, failing that, writing browser extensions. Only such code actually proves that a feature is missing.

Now, the video features of HTML5 are still continuing to change almost on a daily basis. Much development is, for example, happening around real-time communication features and around the track element as we speak. So, focusing on further requirements finding around HTML5 video for now is probably a good thing.

The TV Community Approach

Before I move on to some of the topics covered by the workshop, I have to express some concern about the behaviour that I observed with lots of the TV community folks. Many people tried pushing existing solutions from other spaces into the Web unchanged with a claim of not re-inventing the wheel and following paved cowpaths, which are some of the underlying design principles for HTML5. I can understand where such behaviour originates thinking that having solved the same problems elsewhere before, those solutions should apply here, too. But I would like to warn people of this approach.

If we blindly apply solutions that were not developed for HTML5 into HTML we will end up with suboptimal solutions that will hurt us further down the track. The principles of not re-inventing the wheel and following paved cowpaths were introduced for features that were already implemented by browsers or in de-facto standard use by JavaScript libraries. They were not created for new features in HTML. The video element is a completely new feature in HTML thus everything around it is new.

I would therefore like to see some more respect given to HTML5 and the complexities involved in finding the best possible technical solutions for the Web given that the video element does not stand alone in HTML5, but is part of a much larger picture of technical capabilities on the Web where many of the requested features for TV applications may already be solved by existing HTML markup that is not part of the video element.

Also, HTML5 is not just about the HTML markup, but also about CSS and JavaScript and HTTP. There are several layers of technology involved in creating a Web application: not only a separation of work between client and servers, but also between the Operating System, the media framework, the browser, browser plugins, and JavaScript has to be balanced. To get this balance right is a fine art that will take many discussion, many experiments and sometimes several design approaches. We need patience and calm to work through this, not a rushed adoption of existing solutions from other spaces.

New Requirements

Now let’s get to the take-aways I had from the workshop’s sessions:

Session 1 / Content Provider and Consumer Perspective:

The sessions participants postulate that we will see the creation of application stores for TV applications similar to how we have experienced this for mobile phones and tablets. People enjoy collecting apps like they collect badges. Right now, the app store domain is dominated by native apps and now Web apps. The reason is that we haven’t got a standard platform for setting up Web app stores with Web apps that work in all browsers on all operating systems. Thus, developers have to re-deploy their app for many environments.

While essentially an orthogonal need to HTML standardisation, this seems to be one of the key issues that keep Web apps back from making big market inroads and W3C may do well in setting up a new WG to define a standard Web app manifest format and JS APIs.

Session 2+3 / Multi-screen TV in the Home Network:

Several technologies of hybrid TV broadcast and set-top-box Web content delivery were being pointed out, including the European HbbTV and the Japanese Hybridcast, the latter of which gave an in-depth demo.

Web purists would probably say that it would be simpler to just deliver all content over the Web and not have to worry about any further technical challenges encountered by having to synchronize content received via two vastly different delivery mechanisms. I personally believe this development is one of business models: we don’t yet know exactly how to earn money from TV content delivered over the Internet, but we do know how to do so with TV content. So, hybrids allow the continuation of existing income streams while allowing the features to be augmented with those people enjoy from the Internet.

Should requirements that emerge from such a use case for HTML5 video be taken seriously? I think they absolutely should. What I see happening is that a new way of using the Web is starting to emerge. The new way is video-focused rather than text-focused. We receive our Web content by watching video programming online – video channels, not Web pages are the core content that we consume in the living room. Video channels are where we start our browsing experience from. Search may still be our first point of call, but it will be search for video content or a video-centric app rather than search for a Web site.

And it will be a matter of many interconnected devices in the house that contribute to the experience: the 5.1 stereos that are spread all over the house and should receive our video’s sound, the different screens in the different areas of our house between which we move around, and remote controls, laptops or tablets that function as remote controls and preview stations and are used to determine our viewing experience and provide a back-channel to the publishers.

We have barely begun to identify how such interconnected devices within a home fit within the server-client-based view of the Web world, and the new Web Sockets functionality. The Home Networking Task Force of the Web and TV IG is looking at the issues and analysing existing protocols and standards that solve this picture. But I have a gnawing feeling that the best solution will be something new that is more Web-specific and fits better with the technology layers of the Web.

Session 4 / Synchronized Metadata:

The TV environment offers many data services, some of which have been legally prescribed. This session analysed TV needs and how they can be satisfied with current HTML5.

Subtitles and closed captioning support are one of the key requirements that have been legally prescribed to allow for equal access of non-native speakers, and blind and vision-impaired users to TV content. After demonstration of some key features defined into the HTML5 track element and the WebVTT format, it was generally accepted that HTML5 is making big progress in this space, in particular that browsers are in the process of implementing support for the track element. A concern still exists for complete coverage of all the CEA-608/708 features in WebVTT.

Further concern was raised for support of audio descriptions and audio translations, in particular since no browser has as yet committed to implementing the HTML5’s media multitrack API with the @mediagroup attribute. In this context I am excited to see first JavaScript polyfills emerge (see captionator.js & mediagroup.js).

Another concern was that many captions are actually delivered as raster images (in particular DVD captions) and how that would work in the Web context. The proposal was to use WebVTT and encode the raster images as data-URIs included in timed cues, then render them by JavaScript as an overlay. This is something to explore further.

Demos were shown using WebVTT to synchronize ads with videos, to display related metadata from a user’s life log with videos, to display thumbnails along a video’s timeline, and to show the rendering of text descriptions through screen readers. General agreement by the panel was that WebVTT offers many opportunities and that this area will continue to need further development and that we will see new capabilities on the Web around metadata that were not previously possible on TV.

Session 5 / Content Format and Codecs: DASH and Codec standards

The introduction of HTTP adaptive streaming into HTML5 was one of the core issues that kept returning in the discussions. This panel focused on MPEG DASH, but also mentioned the need for programmatic implementation of adaptive streaming functionality.

The work around MPEG DASH would require specifications of how to use DASH with WebM and Ogg Theora, as well as a specification of a HTML5 profile for DASH, which would limit the functionality possible in DASH files to the ones needed in a HTML5 video element. One criticism of DASH was its verbosity. Another was its unclear patent position. Panel attendees with included Qualcomm, Apple and Microsoft made very clear that their position is pro a royalty-free use of DASH.

The work around a programmatic implementation for adaptive streaming would require at least a JavaScript API to measure the quality of service of a presented video element and a JavaScript API to feed the video element with chunks of (encrypted) video content on the fly. Interestingly enough, there are existing experiments both around Video metrics and MediaSource extensions, so we can expect some progress in this space, even if these are not yet a strong focus of the HTML WG.

I would personally support the creation of Community Group at the W3C around HTTP adaptive streaming and DASH. I think it would work towards alleviating the perceived patent issues around DASH and allow the right members of the community to participate in preparing a specification for HTML5 without requiring them to become W3C members.

Session 6 / Content Protection and DRM

A core concern of the TV community is around content protection. The requirements in this space seem, however, very confused.

The key assumption here is that Web browsers should support the decoding of DRM-protected content in the HTML5 video element because the video element provides a desirable JavaScript API, accessibility features (the track element), default controls, and the possibility to synchronize multiple media elements. However, at the same time, the video element is part of the core content of a Web page and thus allows direct access to the image content in a canvas etc, so some of its functionality is not desirable.

The picture is further confused by requests for authentication, authorization, encryption, obfuscation, same-origin, secure transmission, secure decryption key delivery, unique content identification and other “content protection” techniques without a clear understanding of what is already possible on the Web and what requirements to content publishers actually have for delivering their content on the Web. This is further complicated by the fact that there are many competing solutions for DRM systems in the market with no clear standard that all browsers could support.

A thorough analysis of the technologies and solutions available in this space as well as an analysis of the needs for HTML5 is required before it becomes clear what solution HTML5 browsers may need to support. There seemed to be agreement in the group, though, that browsers would not need to implement DRM solutions, but rather only hand through the functionality of the platform on which they are running (including the media frameworks and operating system functionalities). How this is supposed to work was, however, unclear.

Session 7 / Web & TV: Additional Device & User Requirements

This was a catch-all session for topics that had not been addressed in other sessions. Among the topics addressed in this group were:

  • Parental Guidance: how to deal with ratings in an internationally inconsistent ratings landscape, how to deliver the ratings with the content, and how to enforce the viewing restrictions
  • Emergency Notifications: how to replicate on the Web the emergency notification functionality of TV by providing text overlays to alert users
  • TV channels: how to detect what channels of programming are available to users

Overall, the workshop was a worthwhile experience. It seems there is a lot of work still ahead for making HTML5 video the best it can be on the Web.

Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post to make it more efficient than answering tweets, emails, and skype and other phone conference requests. It’s about time I do a proper post about it.


I’m starting with the last area, because it is the simplest to answer.

No, no browser has as yet shipped support for the <track> element and therefore there is no support for WebVTT in browsers yet. However, implementations are in progress. For example, Webkit has recently received first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but is still working on the element’s actual functionality.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator – a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub – a polyfill for track and SRT parsing
  • jscaptions – a polyfill for track and SRT parsing
  • LeanBack player – a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr – a video player that includes track and WebVTT parsing
  • MediaElementJS – a video player that includes track and SRT parsing
  • Kaltura’s video player – a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss with Ian, me and the WHATWG about their meaning. His review notes with visual description of how settings are to be interpreted and his demo will be most useful to authors and other developers.


Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months – and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the <track> element – the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer – I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic:
    Whereas previously the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span:
    The <b> element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors:
    The spec already allowed to use the names of tags, the classes of <c> tags, and the voice annotations of <v> tags as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used.
  • text-decoration support:
    The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as blinking text and bold.

Further to this, the email identifies the means in which WebVTT is extensible:

  • Header area:
    The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and two empty lines as an end. It is possible to add into this area file-wide information header information.
  • Cues:
    Cues are defined to start with an optional identifier, and then a start/end time specification with “–>” separator. They end with two empty lines. Cues that contain a “–>” separator but don’t parse as valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues:
    Finally, within cues, everything that is within a “tag”, i.e. between “”, and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so wen can introduce new markup tags in this way.

Given this background, the following V2 extensions have been discussed:

  • Metadata:
    Enter name-value pairs of metadata into the header area, e.g.

    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings:
    Default cue settings can come in a “cue” of their own, e.g.

    DEFAULTS --> D:vertical A:end
    00:00.000 --> 00:02.000
    This is vertical and end-aligned.
    00:02.500 --> 00:05.000
    As is this.
    DEFAULTS --> A:start
    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
  • Inline CSS:
    Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

      STYLE -->
      ::cue(v[voice=Bob]) { color: green; }
      ::cue(c.narration) { font-style: italic; }
      ::cue(c.narration i) { font-style: normal; }
      00:00.000 --> 00:02.000
      <v Bob>Welcome.
      00:02.500 --> 00:05.000
      <c .narration>To <i>WebVTT</i>.
  • Comments:
    Both, comments within cues and complete cues commented out are possible, e.g.

     COMMENT -->
     00:02.000 --> 00:03.000
     two; this is entirely
     commented out
     00:06.000 --> 00:07.000
     this part of the cue is visible
     <! this part isn't >
     <and neither is this>

Finally, I believe we still need to add the following features:

  • Language tags:
    I’d like to add a language tag that allows to mark up a subpart of cue text as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). But more importantly we will need this feature for cues that contain text descriptions rather than captions, such that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a <lang jp>xxx</lang> type of markup.
  • Roll-up captions:
    When we use timestamp objects and the future text is hidden, then is un-hidden upon reaching its time, we should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which TV live captions have been displayed and so users are acquainted with this display style.
  • Inline navigation:
    For chapter tracks the primary use of cues are for navigation. In other formats – in particular in DAISY-books for blind users – there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a <nav> tag that can group several timestamp objects for navigation.
  • Default caption width:
    At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overruled with the “S” cue setting. I think it should by default rather be the width (height) of the bounding box around all the text inside the cue.

Aside from these changes to WebVTT, there are also some things that can be improved on the <track> element. I personally support the introduction of the source element underneath the track element, because that allows us to provide different caption files for different devices through the @media media queries attribute and it allows support for more than just one default captioning format. This change needs to be made soon so we don’t run into trouble with the currently empty track element.

I further think a oncuelistchange event would be nice as well in cases where the number of tracks is somehow changed – in particular when coming from within a media file.

Other than this, I’m really very happy with the state that we have achieved this far.