Tag Archives: HTML5

HTML5 Video element discussions at TPAC meetings

Last week’s TPAC (2009 W3C Technical Plenary / Advisory Committee) meetings were my second time at a TPAC and I found myself becoming highly involved with the progress on accessibility on the HTML5 video element. There were in particular two meetings of high relevanct: the Video Accessibility workshop and Friday’s HTML5 breakout group on the video element.

HTML5 Video Accessibility Workshop

The week started on Sunday with the “HTML5 Video Accessibility workshop” at Stanford University, organised by John Foliot and Dave Singer. They brought together a substantial number of people all representing a variety of interest groups. Everyone got their chance to present their viewpoint – check out the minutes of the meeting for a complete transcript.

The list of people and their discussion topics were as follows:

Accessibility Experts

  • Janina Sajka, chair of WAI Protocols and Formats: represented the vision-impaired community and expressed requirements for a deeply controllable access interface to audio-visual content, preferably in a structured manner similar to DAISY.
  • Sally Cain, RNIB, Member of W3C PF group: expressed a deep need for audio descriptions, which are often overlooked besides captions.
  • Ken Harrenstien, Google: has worked on captioning support for video.google and YouTube and shared his experiences, e.g. http://www.youtube.com/watch?v=QRS8MkLhQmM, and automated translation.
  • Victor Tsaran, Yahoo! Accessibility Manager: joined for a short time out of interest.


  • John Foliot, professor at Stanford Uni: showed a captioning service that he set up at Stanford University to enable lecturers to publish more accessible video – it uses humans for transcription, but automated tools to time-align, and provides a Web interface to the staff.
  • Matt May, Adobe: shared what Adobe learnt about accessibility in Flash – in particular that an instream-only approach to captions was a naive approach and that external captions are much more flexible, extensible, and can fit into current workflows.
  • Frank Olivier, Microsoft: attended to listen and learn.


  • Pierre-Antoine Champin from Liris (France), who was not able to attend, sent a video about their research work on media accessibility using automatic and manual annotation.
  • Hironobu Takagi, IBM Labs Tokyo, general chair for W4A: demonstrated a text-based audio description system combined with a high-quality, almost human-sounding speech synthesizer.
  • Dick Bulterman, Researcher at CWI in Amsterdam, co-chair of SYMM (group at W3C doing SMIL): reported on 14 years of experience with multimedia presentations and SMIL (slides) and the need to make temporal and spatial synchronisation explicit to be able to do the complex things.
  • Joakim S

W3C Workshop/Barcamp on HTML5 Video Accessibility

Web accessibility veteran John Foliot of Stanford University and Apple’s QuickTime EcoSystem Manager Dave Singer are organising a W3C Workshop/Barcamp on Video Accessibility on the Sunday before the W3C’s annual combined technical plenary meeting TPAC.

The workshop will take place on 1st November at Stanford University – see details on the Workshop.

If you read the announcement, you will see that this is about understanding all the issues around video (and audio) accessibility, understanding existing approaches, and trying to find solutions for HTML5 that all browser vendors will be able to support.

The workshop is run under the W3C Hypertext Coordination Group and registration is required.

W3C membership is not required in order to participate in the gathering. However, you are required to contribute your knowledge actively and constructively to the Workshop. You must come prepared to present on one of the questions in this document to help inform the discussion and make progress on proposing solutions.

I am very excited about this workshop because I think it is high time to move things forward.

If I can get my travel sorted, I will present my results on the video accessibility work that I did for Mozilla. It will cover both: out-of-band accessibility data for video elements, as well as in-line accessibility data and how to expose a common API in the Web browser for them. I have recently experimented with encoding srt and lrc files in Ogg and displaying them in Firefox by using the patches that were contributed by OggK and Felipe into Firefox. More about this soon.

HTML5 audio element accessibility

As part of my experiments in video accessibility I am also looking at the audio element. I have just finished a proof of concept for parsing Lyrics files for music in lrc format.

The demo uses Tay Zonday’s “Chocolate Rain” song both as a video with subtitles and as an audio file with lyrics. Fortunately, he published these all under a creative commons license, so I was able to use this music file. BTW: I found it really difficult to find a openly licensed music file with lyrics.

While I was at it, I also cleaned up all the old demos and now have a nice list of all demos in a central file.

ARIA – A Brief Introduction

Since working on video accessibility, I have felt rather inadequate not knowing exactly how general Web accessibility works, in particular ARIA. I have been pointed at the W3C WAI-ARIA primer, best practices, and WD specification, but found them almost impossible to read.

If you are looking for a document that gets right to the point, I can recommend Opera’s Introduction to WAI ARIA. It tells you what attributes there are and how to use them. More in-depth information is available in the W3C WAI-ARIA best practices. Here’s my little summary of what I learnt.

Getting straight to the point: ARIA mostly cares about giving screen control to the keyboard (away from the mouse) and about exposing semantic information, such that vision-impaired people have a way to interact with Web content and screen readers can read out useful information.

Basic keys
The basic keys in use for accessibility are the tab/shift+tab, arrow, enter, space and escape keys.

Keyboard Focus: tabbing
Normal tabbing includes form controls and anchors. This can be overruled with the tabindex attribute.

Adding a tabindex=0 to an element adds the element to the tab order in which it appears in the document. Adding a tabindex out of [1;32767] you can place any element into a desired order – lowest numbers first.

Adding a tabindex=-1 to an element removes it from tabbing order, but you can still get keyboard focus onto it through javascript, e.g. for the subelements of a menu. The aria-activedescendant attribute can tell which is active in a list of descendants.

Navigation Landmarks: roles
Screenreaders have a problem with expressing what the functionality of elements is – normally they can only read out the name of the element.

This is where the role attribute comes in. It provides semantic meaning, e.g. “slider” instead of “input” element.

ARIA has a large number of pre-defined roles. They are listed in the spec – each role has additional attributes to provide more assistive information – mostly state information on the particular element.

Live updated content: aria-live
When data is updated somewhere on screen, often assistive technology doesn’t get to know about it.

Regions that are marked with the aria-live attribute will be read out even if the user is focused on another part of the screen at that point.

Form input: aria-reqired
For screen readers it is not obvious if a form element’s entry is a required or optional entry. Add an aria-required attribute to the form entry element and your screen reader will tell you.

Labels and descriptions: aria-labelledby / aria-describedby
Most often the description or label for a page area sits already elsewhere on screen, but with no obivous relationship to an element other than visible neighbourship.

A screenreader can be told about the relationship by using the aria-labelledby / aria-describedby attributes, which allow to link to such an area through that area’s id attribute.

Is that all?
Yes, I think that’s essentially all. It’s not particularly difficult, but it has a high impact on accessibility. I hope your take-away is as big as mine!

BTW: WAI ARIA is written for good old HTML4, not HTML5. However, there are synchronisation activities under way and WAI ARIA attributes will still be relevant to HTML5. Some of the roles will become unnecessary with the new elements available in HTML5 – see a draft mapping of HTML5 elements to ARIA implicit roles in Henry’s excellent document, but it seems the tabbing order, live regions, and the role attribute are here to stay.

Amusement at WHATWG

This is not a technical post, but it made my day, so I thought I should share it.

For two years, the WHATWG has had an open twitter account: anyone who wanted to post a status message on WHATWG could just got to http://www.whatwg.org/#updater and update the twitter status.

For two years, the script kiddies didn’t find the account.

They discovered it about 12 hours ago. Check it out at http://twitter.com/WHATWG before twitter’s history eliminates the posts again.

Here are some of the “jewels” posted:

“WHATWG: We’re only half as evil as we seem.”

“The HXTML 2.0 spec has been finalized with only one tag which is <text>.”

“W3C issues announcement: Internet Explorer to be made obsolete. From fall onwards, IE6 and IE7 will be blocked from browsing the internet”

“I hope the script kiddies realizes that no one cares what is posted to the WHATWG twitter account”


“Our whole team of security experts was just fired.”

“i want <isitfriday> tag…” (me too!!)


“WHATWG announce working group on emoticons. Homer says (_8(|) ~doh!”

“WHATWG to start work on “Bible5″ http://bit.ly/TwZcX” (this is actually old, but still golden)


A review of the W3C Timed Text Authoring Format

UPDATE: The best demo I have seen so far of many of DFXP’s features is at http://www.w3.org/2009/02/ThisIsCoffee.html.

The W3C has published a third last call for the draft specification of DFXP, the Distribution Format Exchange Profile for the Timed Text Authoring Format – or short: for their new standard format for captions. Comments are due by the 30th June, so rush if you want to give any feedback. Here is what came to my mind as I was reading the 183 pages long document.

Please note: This review looks at DFXP from a Web view, i.e. how compatible is it with existing Web technologies, since my main use case will be on the Web, even if advocates will say that that’s not it’s main purpose, strangely enough, for a standard coming out of the W3C.

The state of affairs with caption formats

When it comes to caption and subtitles, there is no lack of formats. It seems, because it is an easy challenge to define a data format for something as simple as a piece of text and some timing information, every new project that wanted to deal with captions – or more generally timed text – created their own format. I am no exception to the rule. 🙂

Thus, the current state of affairs wrt timed text is that there are many different textual file formats to store such data, there are also many different video container formats each with their own data format (or even formats) for embedding timed text into them, and there is a lot of software that will deal with many input, output and encapsulation formats.

The problem with this situation is that the formats are all different in their complexity. The simple “piece of text and timing information” problem can be turned into as complex a problem as you desire. By adding layout information, styling information, animation functionality, metadata about the video and about the content, and possibly hyperlinks, we have ended up in a large mess of incompatible formats.

The aim of W3C Timed Text

The W3C Timed Text working group was chartered in January 2003 to attack this issue. It was supposed to become the super-format of all possible functionalities for timed text formats and therefore a perfect interchange format between applications (see requirements document). Its focus was for use on the Web and with SMIL and to make use of existing W3C technologies where possible

However, the history of captioning is TV and the scope of Timed Text is beyond mere use on the Web, so while W3C Timed Text took a lot of inspiration from other Web standards, it has become a stand-alone standard that does not rely on, e.g. the availability of a CSS engine, and it has no in-built hyperlinking functionality (see what requirements it fulfills).

Dissecting DFXP

So. let’s look into some of what DFXP provides.

Here is an example file taken straight from the draft – check the presentation here:

<tt xml:lang="" xmlns="http://www.w3.org/2006/10/ttaf1">
    <metadata xmlns:ttm="http://www.w3.org/2006/10/ttaf1#metadata">
      <ttm:title>Timed Text DFXP Example</ttm:title>
      <ttm:copyright>The Authors (c) 2006</ttm:copyright>

    <styling xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <!-- s1 specifies default color, font, and text alignment -->
      <style xml:id="s1"
                 tts:textAlign="center" />
      <!-- alternative using yellow text but otherwise the same as style s1 -->
      <style xml:id="s2" style="s1" tts:color="yellow"/>
      <!-- a style based on s1 but justified to the right -->
      <style xml:id="s1Right" style="s1" tts:textAlign="end" />     
      <!-- a style based on s2 but justified to the left -->
      <style xml:id="s2Left" style="s2" tts:textAlign="start" />

    <layout xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <region xml:id="subtitleArea"
                   tts:extent="560px 62px"
                   tts:padding="5px 3px"
                   tts:displayAlign="after" />
  <body region="subtitleArea">
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      <p xml:id="subtitle3" begin="10.0s" end="16.0s" style="s2">
        It is puzzling, why is it<br/>
        we do not see things upside-down?
      <p xml:id="subtitle4" begin="17.2s" end="23.0s">
        You have never heard the Theory,<br/>
        then, that the Brain also is inverted?
      <p xml:id="subtitle5" begin="23.0s" end="27.0s" style="s2">
        No indeed! What a beautiful fact!
      <p xml:id="subtitle6a" begin="28.0s" end="34.6s" style="s2Left">
        But how is it proved?
      <p xml:id="subtitle6b" begin="28.0s" end="34.6s" style="s1Right">
        Thus: what we call
      <p xml:id="subtitle7" begin="34.6s" end="45.0s" style="s1Right">
        the vertex of the Brain<br/>
        is really its base
      <p xml:id="subtitle8" begin="45.0s" end="52.0s" style="s1Right">
        and what we call its base<br/>
        is really its vertex,
      <p xml:id="subtitle9a" begin="53.5s" end="58.7s">
        it is simply a question of nomenclature.
      <p xml:id="subtitle9b" begin="53.5s" end="58.7s" style="s2">
        How truly delightful!

I’m going to look at each of the different functionalities separately and discuss their strengths and weaknesses.


Let’s begin with the body of the DFXP document and what elements are defined for this area.

Firstly, <body> comes with optional begin, end, and dur attributes. As is the case for all time specifications in DFXP, there are both “end” and “dur” attributes. Why this over-specification? There is not even an explanation which of the two has higher priority when in conflict. This is plainly asking for trouble – why not simplify the spec?

The “region” and “style” attributes refer to a previously defined region and styles that are applied to the body. “id” and “lang” attributes allow to associate a name and a language with the body.

The “timeContainer” attribute enables the author to specify whether the elements in the body are all to be regarded as temporally parallel or in sequence, the default being parallel. This means that all text elements specified inside the body can render over the top of each other – a situation that is solved by giving them specific start and end times.

The containing elements of body are a sequence of <div> tags. The div element functions as a logical container and a temporal structuring element for a sequence of textual content units. div elements like body elements are allowed a “start”, “end” and “dur” attribute and generally everything that the body element also has, except that their children can be more div or p. Again, the children of the div element are all regarded as being temporally parallel.

The p element is basically the inner-most element that contains the actual text, including new-lines (br) and spans to associate further styling, metadata, or animations. The children of the p or span element are also all regarded as being temporally parallel, unless otherwise specified.

The structuring of text into div, p, and span elements seems to make sense and provide sufficient (if not even excessive) flexibility for any required timed text needs.


Once the text is specified and structured, the next question is where it should be positioned.

The extent attribute of the <tt> root element specifies the width and height of the root container, if not specified by the external authoring context.

Inside the root container, regions are defined through explicit <region> elements. The origin of placement for a region is the top left corner. Regions can define their “origin” offset, their “width” and “height”, the alignment of text within them through the “textAlign” and “displayAlign” styles, and whether text that “overflows” a region should be visible or hidden.

The way in which DFXP defines regions and placement of text within regions is very different to the way in which HTML and CSS work. By default, elements in HTML flow one after another in the same order as they appear in the source. CSS attributes applied to the elements can control their positioning through giving coordinates, or relative placements in relation to other elements. In DFXP elements are placed inside regions that are styled, making it incompatible with HTML.


The styling attributes available for DFXP are limited, but sufficient for timed text purposes. The way in which style associations to elements are resolved is quite diverse. Styles can be associated with regions, with individual elements, individually and as a group, through layouts and through parent elements. Compared to CSS, it feels complicated and potentially full of contradictions.


Further to styling, DFXP defines animations, which are discrete changes to some style parameter value that applies over some time interval. This is relevant for example to implement karaoke style colouring of text over time.


The <metadata> element serves as a generic container for grouping metadata information. It can be associated virtually with any element – which seems somewhat over-flexible, but provides for interesting meta data information such as meta data for styles or for a <br>.

In addition, metadata is actually limited to a set number of elements: title, desc, copyright, agent, name, and actor. These are strange fields – in particular if you compare them to the flexibility of HTML meta data, which consists of free-form name-value pairs, bringing us domain-specific schemes such as the Dublin Core. This is not easily possible here, but instead one has to define extensions to allow for such flexible meta data.

Other features

DFXP provides other features such as information that describes the related video file, e.g. frameRate, subFrameRate, frameRateMultiplier, pixelAspectRatio, smpteMode, timeBase, and tickRate. Such information will help at the point in time when DFXP is supposed to be multiplexed into a binary media file together with audio and video tracks. These attributes can provide information required for the multiplexing process. I am not sure that justifies their existence though.

Other, minor features are available too. Check out the full specification to get a complete picture.


Part of the publication of this draft is also a test suite. Several of the defined features are still not represented in the test suite, which to me raises the question if they are really required. It might do wonders to the draft size to remove them.


DFXP is a standard for timed text that is firmly grounded in past captioning specifications, but written in XML, and borrowing ideas from Web technologies. It is unfortunately not re-using existing Web infrastructure to implement its more complex features: no use of CSS for styling and layout, no use of hyperlinks. Also, the use of namespaces seems excessive and won’t make it easy to author this format, in particular since the defined namespaces do not map into the defined profiles.

DFXP is, however, simple to transcode to something that a Web Browser can deal with through its existing engines, because it has borrowed from other Web standards. It is thus easier to work with on the Web than most other formats. It should be relatively easy to map to HTML, CSS and javascript, as already started in the test suite with the HTML5 video element.

DFXP is witten in such a way that it is possible to put together a new profile with extensions that are more appropriate for specific needs, e.g. that fit better into existing Web infrastructure. Currently, DFXP has three defined profiles: one focused on transformation, one focused on presentation, and one that contains everything.

I think it’s time for a html5 profile of DFXP that at minimum extends DFXP with hyperlinks, making it a real timed text Web format.

Attaching subtitles to HTML5 video

During the last week, I made a proposal to the HTML5 working group about how to support out-of-band time-aligned text in HTML5. What I mean by that is basically: how to link a subtitle file to a video tag in HTML5. This would mirror the way in which in desktop-players you can load separate subtitle files by hand to go alongside a video.

My suggestion is best explained by an example:

<video src="http://example.com/video.ogv" controls>
<text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
<text category="SUB" lang="de" type="application/ttaf+xml" src="german.dfxp"></text>
<text category="SUB" lang="jp" type="application/smil" src="japanese.smil"></text>
<text category="SUB" lang="fr" type="text/x-srt" src="translation_webservice/fr/caption.srt"></text>

  • “text” elements are subelements of the “video” element and therefore clearly related to one video (even if it comes in different formats).
  • the “category” tag allows us to specify what text category we are dealing with and allows the web browser to determine how to display it. The idea is that there would be default display for the different categories and css would allow to override these.
  • the “lang” tag allows the specification of alternative resources based on language, which allows the browser to select one by default based on browser preferences, and also to turn those tracks on by default that a particular user requires (e.g. because they are blind and have preset the browser accordingly).
  • the “type” tag allows specification of what actual time-aligned text format is being used in this instance; again, it will allow the browser to determine whether it is able to decode the file and thus make it available through an interface or not.
  • the “src” attribute obviously points to the time-aligned text resource. This could be a file, a script that extracts data from a database, or even a web service that dynamically creates the data
    based on some input.

This proposal provides for a lot of flexibility and is somewhat independent of the media file format, while still enabling the Web browser to deal with the text (as long as it can decode it). Also note that this is not meant as the only way in which time-aligned text would be delivered to the Web browser – we are continuing to investigate how to embed text inside Ogg as a more persistent means of keeping your text with your media.

Of course you are now aching to see this in action – and this is where the awesomeness starts. There are already three implementations.

First, Jan Gerber independently thought out a way to provide support for srt files that would be conformant with the existing HTML5 tags. His solution is at http://v2v.cc/~j/jquery.srt/. He is using javascript to load and parse the srt file and map it into HTML and thus onto the screen. Jan’s syntax looks like this:

<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="jquery.srt.js"></script>

<video src="http://example.com/video.ogv" id="video" controls>
<div class="srt"
data-srt="http://example.com/video.srt" />

Then, Michael Dale decided to use my suggested HTML5 syntax and add it to mv_embed. The example can be seen here – it’s the bottom of the two videos. You will need to click on the “CC” button on the player and click on “select transcripts” to see the different subtitles in English and Spanish. If you click onto a text element, the video will play from that offset. Michael’s syntax looks like this:

<video src="sample_fish.ogg" poster="sample_fish.jpg" duration="26">
<text category="SUB" lang="en" type="text/x-srt" default="true"
title="english SRT subtitles" src="sample_fish_text_en.srt">
<text category="SUB" lang="es" type="text/x-srt"
title="spanish SRT subtitles" src="sample_fish_text_es.srt">

Then, after a little conversation with the W3C Timed Text working group, Philippe Le Hegaret extended the current DFXP test suite to demonstrate use of the proposed syntax with DFXP and Ogg video inside the browser. To see the result, you’ll need Firefox 3.1. If you select the “HTML5 DFXP player prototype” as test player, you can click on the tests on the left and it will load the DFXP content. Philippe actually adapted Jan’s javascript file for this. And his syntax looks like this:

<video src="example.ogv" id="video" controls>
<text lang='en' type="application/ttaf+xml" src="testsuite/Content/Br001.xml"></text>

The cool thing about these implementations is that they all work by mapping the time-aligned text to HTML – and for DFXP the styling attributes are mapped to CSS. In this way, the data can be made part of the browser window and displayed through traditional means.

For time-aligned text that is multiplexed into a media file, we just have to do the same and we will be able to achieve the same functionality. Video accessibility in HTML5 – we’re getting there!

Demo of new HTML5 features

Ian Hickson, the main editor of the new HTML5 specification, gave a talk about some of the cool new features in HTML5 and some of the early implementations of these features in different browsers.

It’s pretty long demo with 1:25 hrs but he types in all the code manually, so you can re-do all of the demos yourself. The script of the talk with code examples is here.

The first 5 minutes are about the new video element and really worth watching.

Also, at 1:11 hrs Ian is asked about the choice of baseline codecs, in case you want to hear him speak what he has publicly written elsewhere.

I can’t wait to marry the video features with:

  1. the new media fragment addressing schemes in development at the W3C
  2. captions, subtitles and other timed text annotations for videos.

These will allow us search for specific topics directly inside the video (such as “form controls” in Ian’s video) and to hyperlink straight into these time offsets. A completely new world is coming!