Tag Archives: standards

WebVTT as a W3C Recommendation

Three weeks ago I attended TPAC, the annual meeting of W3C Working Groups. One of the meetings was of the Timed Text Working Group (TT-WG), which has been specifying TTML, the Timed Text Markup Language. It is now proposed that WebVTT also be standardised through the same Working Group.

How did that happen, you may ask, in particular since WebVTT and TTML have in the past been portrayed as rival caption formats? How will the WebVTT spec that is currently under development in the Text Track Community Group (TT-CG) move through a Working Group process?

I’ll explain first why there is a need for WebVTT to become a W3C Recommendation, and then how this is proposed to be part of the Timed Text Working Group deliverables, and finally how I can see this working between the TT-CG and the TT-WG.

Advantages of a W3C Recommendation

TTML is an XML-based markup format for captions developed during the time that XML was all the hotness. It has become a W3C standard (a so-called “Recommendation”) despite not having been implemented in any browsers (if you ask me: that’s actually a flaw of the W3C standardisation process: it requires only two interoperable implementations of any kind – and that could be anyone’s JavaScript library or Flash demonstrator – it doesn’t actually require browser implementations. But I digress…). To be fair, a subset of TTML is by now implemented in Internet Explorer, but all the other major browsers have thus far rejected proposals to implement it.

Because of its Recommendation status, TTML has become the basis for several other caption standards that other SDOs have picked: the SMPTE’s SMPTE-TT format, the EBU’s EBU-TT format, and the DASH Industry Forum’s use of SMPTE-TT. SMPTE-TT has also become the “safe harbour” format for the US legislation on captioning as decided by the FCC. (Note that the FCC requirements for captions on the Web are actually based on a list of features rather than requiring a specific format. But that will be the topic of a different blog post…)

WebVTT is much younger than TTML. TTML was developed as an interchange format among caption authoring systems. WebVTT was built for rendering in Web browsers and with HTML5 in mind. It meets the requirements of the <track> element and supports more than just captions/subtitles. WebVTT is popular with browser developers and has already been implemented in all major browsers (Firefox Nightly is the last to implement it – all others have support already released).

As we can see and as has been proven by the HTML spec and multiple other specs: browsers don’t wait for specifications to have W3C Recommendation status before they implement them. Nor do they really care about the status of a spec – what they care about is whether a spec makes sense for the Web developer and user communities and whether it fits in the Web platform. WebVTT has obviously achieved this status, even with an evolving spec. (Note that the spec tries very hard not to break backwards compatibility, thus all past implementations will at least be compatible with the more basic features of the spec.)

Given that Web browsers don’t need WebVTT to become a W3C standard, why then should we spend effort in moving the spec through the W3C process to become a W3C Recommendation?

The modern Web is now much bigger than just Web browsers. Web specifications are being used in all kinds of devices including TV set-top boxes, phone and tablet apps, and even unexpected devices such as white goods. Videos are increasingly omnipresent, thus exposing deaf and hard-of-hearing users to ever-growing challenges in interacting with content on diverse devices. Some of these devices will not use auto-updating software but fixed versions, so they can’t easily adapt to new features. Thus, caption producers (both commercial and community) need to be able to author captions (and other video accessibility content as defined by the HTML5 <track> element) towards a feature set that is clearly defined to be supported by such non-updating devices.

Understandably, device vendors in this space have a need to build their technology on standardised specifications. SDOs for such device technologies like to reference fixed specifications so the feature set is not continually updating. To reference WebVTT, they could use a snapshot of the specification at any time and reference that, but that’s not how SDOs work. They prefer referencing an officially sanctioned and tested version of a specification – for a W3C specification that means creating a W3C Recommendation of the WebVTT spec.

Taking WebVTT on a W3C recommendation track is actually advantageous for browsers, too, because a test suite will have to be developed that proves that features are implemented in an interoperable manner. In summary, I can see the advantages and personally support the effort to take WebVTT through to a W3C Recommendation.

Choice of Working Group

AFAIK this is the first time that a specification developed in a Community Group is being moved onto the recommendation track. This is something that has been expected since the W3C created CGs, but it does not have an established process yet.

The first question of course is which WG would take it through to Recommendation? Would we create a new Working Group or find an existing one to move the specification through? Since WGs involve a lot of overhead, the preference was to add WebVTT to the charter of an existing WG. The two obvious candidates were the HTML WG and the TT-WG – the first because it’s where WebVTT originated and the latter because it’s the closest thematically.

Adding a deliverable to a WG is a major undertaking. The TT-WG is currently in the process of re-chartering and thus a suggestion was made to add WebVTT to the milestones of this WG. TBH that was not my first choice. Since I’m already an editor in the HTML WG and WebVTT is very closely related to HTML and can be tested extensively as part of HTML, I preferred the HTML WG. However, adding WebVTT to the TT-WG has some advantages, too.

Since TTML is an exchange format, lots of captions that will be created (at least professionally) will be in TTML and TTML-related formats. It makes sense to create a mapping from TTML to WebVTT for rendering in browsers. The expertise of both TTML and WebVTT experts is required to develop a good mapping – as has been shown when we developed the mapping from CEA608/708 to WebVTT. Also, captioning experts are already in the TT-WG, so it helps to get a second set of eyes onto WebVTT.

A disadvantage of moving a specification out of a CG into a WG is, however, that you potentially lose a lot of the expertise that is already involved in the development of the spec. People don’t easily re-subscribe to additional mailing lists or want the additional complexity of involving another community (see e.g. this email).

So, a good process needs to be developed to allow everyone to contribute to the spec in the best way possible without requiring duplicate work. How can we do that?

The forthcoming process

At TPAC the TT-WG discussed for several hours what the next steps are in taking WebVTT through the TT-WG to recommendation status (agenda with slides). I won’t bore you with the different views – if you are keen, you can read the minutes.

What I came away with is the following process:

  1. Fix a few more bugs in the CG until we’re happy with the feature set in the CG. This should match the feature set that we realistically expect devices to implement for a first version of the WebVTT spec.
  2. Make a FSA (Final Specification Agreement) in the CG to create a stable reference and a clean IPR position.
  3. Assuming that the TT-WG’s charter has been approved with WebVTT as a milestone, we would next bring the FSA specification into the TT-WG as FPWD (First Public Working Draft) and immediately do a Last Call which effectively freezes the feature set (this is possible because there has already been wide community review of the WebVTT spec); in parallel, the CG can continue to develop the next version of the WebVTT spec with new features (just like it is happening with the HTML5 and HTML5.1 specifications).
  4. Develop a test suite and address any issues in the Last Call document (of course, also fix these issues in the CG version of the spec).
  5. As per W3C process, substantive and minor changes to Last Call documents have to be reported and raised issues addressed before the spec can progress to the next level: Candidate Recommendation status.
  6. For the next step – Proposed Recommendation status – an implementation report is necessary, and thus the test suite needs to be finalized for the given feature set. The feature set may also be reduced at this stage to just the ones implemented interoperably, leaving any other features for the next version of the spec.
  7. The final step is Recommendation status, which simply requires sufficient support and endorsement by W3C members.

The first version of the WebVTT spec naturally has a focus on captioning (and subtitling), since this has been the dominant use case that we have focused on thus far and it’s the part that is the most compatibly implemented feature set of WebVTT in browsers. It’s my expectation that the next version of WebVTT will have a lot more features related to audio descriptions, chapters and metadata. Thus, this seems a good time for a first version feature freeze.

There are still several obstacles towards progressing WebVTT as a milestone of the TT-WG. Apart from the need to get buy-in from the TT-WG, the TT-CG, and the AC (Advisory Committee, who have to approve the new charter), we’re also looking at the license of the specification document.

The CG specification has an open license that allows creating derivative work as long as there is attribution, while the W3C document license for documents on the recommendation track does not allow the creation of derivative work unless given explicit exceptions. This is an issue that is currently being discussed in the W3C with a proposal for a CC-BY license on the Recommendation track. However, my view is that it’s probably ok to use the different document licenses: the TT-WG will work on WebVTT 1.0 and give it a W3C document license, while the CG starts working on the next WebVTT version under the open CG license. It probably actually makes sense to have a less open license on a frozen spec.

Making the best of a complicated world

WebVTT is now proposed as part of the recharter of the TT-WG. I have no idea how complicated the process will become to achieve a W3C WebVTT 1.0 Recommendation, but I am hoping that what is outlined above will be workable in such a way that all of us get to focus on progressing the technology.

At TPAC I got the impression that the TT-WG is committed to progressing WebVTT to Recommendation status. I know that the TT-CG is committed to continue developing WebVTT to its full potential for all kinds of media-time aligned content with new kinds already discussed at FOMS. Let’s enable both groups to achieve their goals. As a consequence, we will allow the two formats to excel where they do: TTML as an interchange format and WebVTT as a browser rendering format.

The use cases for a <main> element in HTML

The W3C HTML WG and the WHATWG are currently discussing the introduction of a <main> element into HTML.

The <main> element has been proposed by Steve Faulkner and is specified in a draft extension spec which is about to be accepted as a FPWD (first public working draft) by the W3C HTML WG. This implies that the W3C HTML WG will be looking for implementations and for feedback by implementers on this spec.

I am supportive of the introduction of a <main> element into HTML. However, I believe that the current spec and use case list don’t make a good enough case for its introduction. Here are my thoughts.

Main use case: accessibility

In my opinion, the main use case for the introduction of <main> is accessibility.

Like any other users, when blind users want to perceive a Web page/application, they need to have a quick means of grasping the content of a page. Since they cannot visually scan the layout and thus determine where the main content is, they use accessibility technology (AT) to find what is known as “landmarks”.

“Landmarks” tell the user what semantic content is on a page: a header (such as a banner), a search box, a navigation menu, some asides (also called complementary content), a footer, …. and the most important part: the main content of the page. It is this main content that a blind user most often wants to skip to directly.

In the days of HTML4, a hidden “skip to content” link at the beginning of the Web page was used as a means to help blind users access the main content.

In the days of ARIA, the ARIA @role=main attribute enables authors to avoid a hidden link and instead mark the element where the main content begins, allowing direct access to it. This attribute is supported by AT – in particular screen readers – by making it part of the landmarks that AT can directly skip to.

Both the hidden link and the ARIA @role=main approaches are, however, band aids: they are being used by those of us that make “finished” Web pages accessible by adding specific extra markup.

A world where ARIA is not necessary and where accessibility developers would be out of a job because the normal markup that everyone writes already creates accessible Web sites/applications would be much preferable over the current world of band-aids.

Therefore, to me, the primary use case for a <main> element is to achieve exactly this better world and not require specialized markup to tell a user (or a tool) where the main content on a page starts.

An immediate effect would be that pages that have a <main> element will expose a “main” landmark to blind and vision-impaired users that will enable them to directly access that main content on the page without having to wade through other text on the page. Without a <main> element, this functionality can currently only be provided using heuristics to skip other semantic and structural elements and is for this reason not typically implemented in AT.
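As a rough illustration of how a tool might locate the “main” landmark, here is a minimal Python sketch using only the standard library’s html.parser. It accepts either a <main> element or an element with role="main". The class and function names are made up for this example, and real AT works on the browser’s accessibility tree rather than on raw markup.

```python
from html.parser import HTMLParser


class MainLandmarkFinder(HTMLParser):
    """Records the first element that is <main> or carries role="main"."""

    def __init__(self):
        super().__init__()
        self.landmark = None  # (tag, attributes) of the first match, if any

    def handle_starttag(self, tag, attrs):
        if self.landmark is not None:
            return
        attributes = dict(attrs)
        if tag == "main" or attributes.get("role") == "main":
            self.landmark = (tag, attributes)


def find_main_landmark(html):
    """Returns (tag, attributes) for the main landmark, or None."""
    finder = MainLandmarkFinder()
    finder.feed(html)
    return finder.landmark


page = '<header>Site</header><div id="content" role="main"><p>Story</p></div>'
print(find_main_landmark(page))
```

A page without either marker yields None, which is exactly the case where a tool has to fall back on heuristics.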

Other use cases

The <main> element is a semantic element not unlike other new semantic elements such as <header>, <footer>, <aside>, <article>, <nav>, or <section>. Thus, it can also serve other uses where the main content on a Web page/Web application needs to be identified.

Data mining

For data mining of Web content, the identification of the main content is one of the key challenges. Many scholarly articles have been published on this topic. This stackoverflow article references and suggests a multitude of approaches, but the accepted answer says “there’s no way to do this that’s guaranteed to work”. This is because Web pages are inherently complex and many <div>, <p>, <iframe> and other elements are used to provide markup for styling, notifications, ads, analytics and other use cases that are necessary to make a Web page complete, but don’t contribute to what a user consumes as semantically rich content. A <main> element will allow authors to pro-actively direct data mining tools to the main content.
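To make this concrete, a data mining tool that trusts the author’s markup could collect only the text inside <main> and ignore the rest. The following is a minimal sketch under that assumption, using Python’s standard html.parser; the names are hypothetical and it makes no attempt to handle scripts or malformed markup.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects text that appears inside a <main> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth of open <main> elements
        self.chunks = []  # text fragments found inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def main_text(html):
    """Returns the text inside <main>, or an empty string if there is none."""
    extractor = MainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


page = ("<nav>Home About</nav>"
        "<main><h1>Title</h1><p>Main story text.</p></main>"
        "<footer>Contact</footer>")
print(main_text(page))
```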

Search engines

One particularly important class of “data mining” tools is search engines. They, too, have a hard time identifying which sections of a Web page are more important than others and employ many heuristics to do so, see e.g. this ACM article. Yet, they still disappoint with poor results that point to keyword matches in less relevant sections of a page rather than ranking Web pages higher where the keywords turn up in the main content area. A <main> element would help search engines give text in main content areas a higher weight and prefer it over other areas of the Web page. They would be able to rank different Web pages depending on where on the page the search words are found. The <main> element will be an additional hint that search engines will digest.
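The weighting idea can be sketched in a few lines of Python. This is only a toy model: the weight of 3.0 is an arbitrary assumption, the function names are invented for the example, and no real search engine is claimed to rank this way.

```python
import re


def term_count(term, text):
    """Counts whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def weighted_score(term, main_text, other_text, main_weight=3.0):
    """Scores a page, counting hits in the main content more heavily."""
    return main_weight * term_count(term, main_text) + term_count(term, other_text)


# Page A mentions the term once in its main content,
# page B mentions it twice, but only in peripheral content.
page_a = weighted_score("scooby", "Scooby Doo episode guide", "Home News Sport")
page_b = weighted_score("scooby", "Local weather forecast",
                        "Scooby Doo toys. Scooby Doo DVDs.")
print(page_a, page_b)
```

With these inputs page A outranks page B even though page B contains the keyword more often, which is the behaviour the <main> hint is meant to enable.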

Visual focus

On small devices, the display of Web pages designed for Desktop often causes confusion as to where the main content can be found and read, in particular when the text ends up being too small to be readable. It would be nice if browsers on small devices had a functionality (maybe a default setting) where Web pages would start being displayed as zoomed in on the main content. This could alleviate some of the headaches of responsive Web design, where the recommendation is to show high priority content as the first content. Right now this problem is addressed through stylesheets that re-layout the page differently depending on device, but again this is a band-aid solution. Explicit semantic markup of the main content can solve this problem more elegantly.


Finally, naturally, <main> would also be used to style the main content differently from other content. You can e.g. replace a semantically meaningless <div id="main"> with a semantically meaningful <main> where their position is identical. My analysis below shows that this is not always the case, since oftentimes <div id="main"> is used to group everything together that is not the header – in particular where there are multiple columns. Thus, the ease of styling a <main> element is only a positive side effect and not actually a real use case. It does make it easier, however, to adapt the style of the main content e.g. with media queries.

Proposed alternative solutions

It has been proposed that existing markup serves to satisfy the use cases that <main> has been proposed for. Let’s analyse these on some of the most popular Web sites. First let’s list the proposed algorithms.

Proposed solution No 1: Scooby-Doo

On Sat, Nov 17, 2012 at 11:01 AM, Ian Hickson <ian@hixie.ch> wrote:
| The main content is whatever content isn't
| marked up as not being main content (anything not marked up with <header>,
| <aside>, <nav>, etc).

This implies that the first element that is not a <header>, <aside>, <nav>, or <footer> will be the element that we want to give to a blind user as the location where they should start reading. The algorithm is implemented in https://gist.github.com/4032962.
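For readers who want to play with the idea outside a browser, here is a minimal Python sketch of the same rule. It is not the code from the gist: it assumes the input is the content of <body>, looks only at top-level elements, and uses invented names throughout.

```python
from html.parser import HTMLParser

NOT_MAIN = {"header", "aside", "nav", "footer"}
# Void elements have no end tag, so they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ScoobyDoo(HTMLParser):
    """Finds the first top-level element that is not marked as non-main."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.result = None  # (tag, attributes) of the chosen element

    def handle_starttag(self, tag, attrs):
        if self.depth == 0 and self.result is None and tag not in NOT_MAIN:
            self.result = (tag, dict(attrs))
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1


def scooby_doo(body_html):
    """Returns (tag, attributes) of the presumed main content, or None."""
    parser = ScoobyDoo()
    parser.feed(body_html)
    return parser.result


body = ('<header>Search</header><nav>Menu</nav>'
        '<div id="content"><aside>Left</aside><article>Post</article></div>'
        '<footer>About</footer>')
print(scooby_doo(body))
```

Note that in this example the chosen <div> still contains an <aside>, which is the kind of over-inclusive result discussed in the analysis further down.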

Proposed solution No 2: First article element

On Sat, Nov 17, 2012 at 8:01 AM, Ian Hickson wrote:
| On Thu, 15 Nov 2012, Ian Yang wrote:
| >
| > That's a good idea. We really need an element to wrap all the <p>s,
| > <ul>s, <ol>s, <figure>s, <table>s ... etc of a blog post.
| That's called <article>.

This approach identifies the first <article> element on the page as containing the main content. Here’s the algorithm for this approach.
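A stand-alone Python sketch of the “first article” rule might look as follows. It is an illustration under simple assumptions (well-formed markup, invented names), not the linked implementation.

```python
from html.parser import HTMLParser


class FirstArticle(HTMLParser):
    """Collects the text of the first <article> element on the page."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the first <article>
        self.found = False  # True once any <article> has been opened
        self.done = False   # True once the first <article> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == "article":
            self.found = True
            self.depth += 1  # nested <article>s count towards the depth

    def handle_endtag(self, tag):
        if self.done or tag != "article" or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def first_article_text(html):
    """Returns the text of the first <article>, or None if there is none."""
    parser = FirstArticle()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks) if parser.found else None


page = ("<aside>Ads</aside>"
        "<article><h1>Result one</h1></article>"
        "<article><h1>Result two</h1></article>")
print(first_article_text(page))
```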

Proposed solution No 3: An example heuristic approach

The readability plugin has been developed to make Web pages readable by essentially removing all the non-main content from a page. An early source of readability is available. This demonstrates what a heuristic approach can perform.
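To give a feel for what “heuristic” means here, the following Python sketch picks the element that directly contains the most paragraph text. This is a toy heuristic written for this post, not Readability’s actual algorithm; it assumes well-formed markup in which every non-void element is closed, and it ignores text nested inside inline elements.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ParagraphScorer(HTMLParser):
    """Credits the text of each <p> to the element that directly contains it."""

    def __init__(self):
        super().__init__()
        self.open = []  # stack of entries for currently open elements
        self.seen = []  # every element entry, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        ident = dict(attrs).get("id")
        entry = {"tag": tag,
                 "label": f"{tag}#{ident}" if ident else tag,
                 "score": 0}
        self.seen.append(entry)
        self.open.append(entry)

    def handle_endtag(self, tag):
        if tag not in VOID and self.open:
            self.open.pop()

    def handle_data(self, data):
        # Text directly inside a <p> counts towards the <p>'s parent.
        if len(self.open) >= 2 and self.open[-1]["tag"] == "p":
            self.open[-2]["score"] += len(data.strip())


def densest_container(html):
    """Returns a label for the element with the most paragraph text, or None."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    best = max(scorer.seen, key=lambda entry: entry["score"], default=None)
    return best["label"] if best and best["score"] > 0 else None


page = ('<div id="sidebar"><p>Ad</p></div>'
        '<div id="story"><p>A long paragraph of article text.</p>'
        '<p>Another paragraph.</p></div>')
print(densest_container(page))
```

Heuristics like this work well on text-heavy pages and poorly on pages whose main content is a video, a stream of short items, or a list of search results.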

Analysing alternative solutions


I’ve picked 4 typical Websites (top on Alexa) to analyse how these three different approaches fare. Ideally, I’d like to simply apply the above three scripts and compare pictures. However, since the semantic HTML5 elements <header>, <aside>, <nav>, and <footer> are not actually used by any of these Web sites, I don’t actually have this choice.

So, instead, I decided to make some assumptions of where these semantic elements would be used and what the outcome of applying the first two algorithms would be. I can then compare it to the third, which is a product so we can take screenshots.


http://google.com – search for “Scooby Doo”.

The search results page would likely be built with:

  • a <nav> menu for the Google bar
  • a <header> for the search bar
  • another <header> for the login section
  • another <nav> menu for the search types
  • a <div> to contain the rest of the page
  • a <div> for the app bar with the search number
  • a few <aside>s for the left and right column
  • a set of <article>s for the search results

“Scooby Doo” would find the first element after the headers as the “main content”. This is the element before the app bar in this case. Interestingly, there is a <div @id=main> already in the current Google results page, which “Scooby Doo” would likely also pick. However, there is a nav bar and two asides in this div, which clearly should not be part of the “main content”. Google actually placed a @role=main on a different element, namely the one that encapsulates all the search results.

“First Article” would find the first search result as the “main content”. While not quite the same as what Google intended – namely all search results – it is close enough to be useful.

The “readability” result is interesting, since it is not able to identify the main text on the page. It is actually aware of this problem and brings a warning before displaying this page:

[Screenshot: Readability of google.com]



A Facebook user page would likely be built with:

  • a <header> bar for the search and login bar
  • a <div> to contain the rest of the page
  • an <aside> for the left column
  • a <div> to contain the center and right column
  • an <aside> for the right column
  • a <header> to contain the center column “megaphone”
  • a <div> for the status posting
  • a set of <article>s for the home stream

“Scooby Doo” would find the first element after the headers as the “main content”. This is the element that contains all three columns. It’s actually a <div @id=content> already in the current Facebook user page, which “Scooby Doo” would likely also pick. However, Facebook selected a different element to place the @role=main : the center column.

“First Article” would find the first news item in the home stream. This is clearly not what Facebook intended, since they placed the @role=main on the center column, above the first blog post’s title. “First Article” would miss that title and the status posting.

The “readability” result again disappoints but warns that it failed:



A YouTube video page would likely be built with:

  • a <header> bar for the search and login bar
  • a <nav> for the menu
  • a <div> to contain the rest of the page
  • a <header> for the video title and channel links
  • a <div> to contain the video with controls
  • a <div> to contain the center and right column
  • an <aside> for the right column with an <article> per related video
  • an <aside> for the information below the video
  • a <article> per comment below the video

“Scooby Doo” would find the first element after the headers as the “main content”. This is the element that contains the rest of the page. It’s actually a <div @id=content> already in the current YouTube video page, which “Scooby Doo” would likely also pick. However, YouTube’s related videos and comments are unlikely to be what the user would regard as “main content” – it’s the video they are after, which generously has a <div id=watch-player>.

“First Article” would find the first related video or comment in the home stream. This is clearly not what YouTube intends.

The “readability” result is not quite as unusable, but still very bare:


http://wikipedia.com (“Overscan” page)

A Wikipedia page would likely be built with:

  • a <header> bar for the search, login and menu items
  • a <div> to contain the rest of the page
  • an <article> with title and lots of text
  • inside the <article>, an <aside> with the table of contents
  • several <aside>s for the left column

Good news: “Scooby Doo” would find the first element after the headers as the “main content”. This is the element that contains the rest of the page. It’s actually a <div id="content" role="main"> element on Wikipedia, which “Scooby Doo” would likely also pick.

“First Article” would find the title and text of the main element on the page, but it would also include an <aside>.

The “readability” result is also in agreement.


In the following table we have summarised the results for the experiments:

Site          | Scooby-Doo | First article | Readability
Facebook.com  | FAIL       | FAIL          | FAIL

Clearly, Wikipedia is the prime example of a site where even the simple approaches find it easy to determine the main content on the page. WordPress blogs are similarly successful. Almost any other site, including news sites, social networks and search engine sites, is pretty hopeless with the proposed approaches, because too many elements are used for layout or other purposes (notifications, hidden areas), such that the pre-determined list of available semantic elements simply doesn’t suffice to mark up a Web page/application completely.


It seems that in general it is impossible to determine which element(s) on a Web page should be the “main” piece of content that accessibility tools jump to when requested, that a search engine should put their focus on, or that should be highlighted to a general user to read. It would be very useful if the author of the Web page would provide a hint through a <main> element where that main content is to be found.

I think that the <main> element becomes particularly useful when combined with a default keyboard shortcut in browsers as proposed by Steve: we may actually find that non-accessibility users will also start making use of this shortcut, e.g. to get to videos on YouTube pages directly without having to tab over search boxes and other interactive elements, etc. Worthwhile markup indeed.

Accessibility to Web video for the Vision-Impaired

In the past week, I was invited to an IBM workshop on audio/text descriptions for video in Japan. Geoff Freed and Trisha O’Connell from WGBH, and Michael Evans from BBC research were the other invited experts to speak about the current state of video accessibility around the world and where things are going in TV/digital TV and the Web.

The two-day workshop was very productive. The first day was spent with presentations which were open to the public. A large vision-impaired community attended to understand where technology is going. It was very humbling to be part of an English-language workshop in Japan, where much of the audience is blind, but speaks English much better than my average experience with English in Japan. I met many very impressive and passionate people who are creating audio descriptions, adapting NVDA for the Japanese market, advocating to Broadcasters and Government to create more audio descriptions, and performing fundamental research for better tools to create audio descriptions. My own presentation was on “HTML5 Video Descriptions”.

On the second day, we only met with the IBM researchers and focused discussions on two topics:

  1. How to increase the amount of video descriptions
  2. HTML5 specifications for video descriptions

The first topic included concerns about guidelines for description authoring by beginners, how to raise awareness, who to lobby, and what production tools are required. I personally was more interested in the second topic and we moved into a smaller breakout group to focus on these discussions.

HTML5 specifications for video descriptions

Two topics were discussed related to video descriptions: text descriptions and audio descriptions. Text descriptions are descriptions authored as time-aligned text snippets and read out by a screen reader. Audio descriptions are audio recordings either of a human voice or even of a TTS (text-to-speech) synthesis – in either case, they are audio samples.

For a screen reader, the focus was actually largely on NVDA and people were very excited about the availability of this open source tool. There is a concern about how natural-sounding a screen reader can be made and IBM is doing much research there with some amazing results. In a user experiment conducted by WGBH and IBM, they found that the more natural the voice sounds, the more people comprehend, but between a good screen reader and an actual human voice there is not much difference in the comprehension level. Broadcasters and other high-end producers are unlikely to accept TTS and will prefer the human voice, but for other materials – in particular for the large majority of content on the Web – TTS and screen readers can make a big difference.

An interesting lesson that I learnt was that video descriptions can be improved by 30% (i.e. 30% better comprehension) if we introduce extended descriptions, i.e. descriptions that can pause the main video to allow a description to be read for something that happens in the video, but where there is no obvious pause in which to read out the description. So, extended descriptions are one of the major challenges to get right.

We then looked at the path that we are currently progressing on in HTML5 with WebSRT, the TimedTrack API, the <track> elements and the new challenges around a multitrack API.

For text descriptions we identified a need for the following:

  • extension marker on cues: often it is very clear to the author of a description cue that there is no time for the cue to be read out in parallel to the main audio and the video needs to be paused. The proposal is for introduction of an extension marker on the cue to pause the video until the screen reader is finished. So, a speech-complete event from the screen reader API needs to be dealt with. To make this reliable, it might make sense to put a max duration on the cue so the video doesn’t end up waiting endlessly in case the screen reader event isn’t fired. The duration would be calculated based on a typical word speaking rate.
  • importance marker on cues: the duration of all text cues being read out by screen readers depends on the speed set-up of the screen reader. So, even when a cue has been created for a given audio break in the video, it may or may not fit into this break. For most cues it is important that they are read out completely before moving on, but for some it’s not. So, an importance marker could be introduced that determines whether a video stops at the end of the cue to allow the screen reader to finish, or whether the screen reader is silenced at that time no matter how far it has gotten.
  • ducking during cues: making the main audio track quieter relative to the video description for the duration of a cue, so that the description cue can be understood
  • voice hints: an instruction at the beginning of the text description file for what voice to choose such that it won’t collide with e.g. the narrator voice of a video – typically the choice will be for a female voice when the narrator is male and the other way around – this will help initialize the screen reader appropriately
  • speed hints: an indicator at the beginning of a text description toward what word rate was used as the baseline for the timing of the cue durations such that a screen reader can be initialized with this
  • synthesis directives: while not a priority, eventually it will make for better quality synchronized text if it is possible to include some of the typical markers that speech synthesizers use (see e.g. SSML or speech CSS), including markers for speaker change, for emphasis, for pitch change and other prosody. It was, in fact, suggested that the CSS3’s speech module may be sufficient in particular since Opera already implements it.

This means we need to consider extending WebSRT cues with an “extension” marker and an “importance” marker. WebSRT further needs header-type metadata to include a voice and a speed hint for screen readers. The screen reader further needs to work more closely with the browser and exchange speech-complete events and hints for ducking. And finally we may need to allow for CSS3 speech styles on subparts of WebSRT cues, though I believe this latter one is not of high immediate importance.
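The timing logic behind the “extension” and “importance” markers can be sketched as follows. Everything here is an assumption made for illustration: the default of 180 words per minute, the safety factor of 1.5, and the function names are not part of any spec or screen reader API.

```python
def estimated_speech_seconds(text, words_per_minute=180.0):
    """Estimates how long a screen reader needs to speak the cue text."""
    return len(text.split()) * 60.0 / words_per_minute


def max_pause_seconds(text, words_per_minute=180.0, safety_factor=1.5):
    """Upper bound on how long the video waits for a speech-complete event."""
    return estimated_speech_seconds(text, words_per_minute) * safety_factor


def end_of_cue_action(text, cue_start, cue_end, important,
                      words_per_minute=180.0):
    """Decides what happens when the cue's time slot runs out."""
    needed = estimated_speech_seconds(text, words_per_minute)
    if needed <= cue_end - cue_start:
        return "continue"        # the description fits into the gap
    if important:
        return "pause-video"     # let the screen reader finish
    return "silence-reader"      # cut the description off


cue = "A man in a red coat enters the room and looks around"
print(estimated_speech_seconds(cue))            # 12 words at 180 wpm
print(end_of_cue_action(cue, 10.0, 13.0, True))
print(end_of_cue_action(cue, 10.0, 13.0, False))
print(end_of_cue_action(cue, 10.0, 13.0, True, words_per_minute=240.0))
```

The last call shows why a speed hint matters: the same cue fits into its three-second slot at 240 words per minute but not at 180.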

For audio descriptions we identified a need for:

  • external/in-band descriptions: allowing external or in-band description tracks to be synchronized with the main video. It would be assumed in this case that the timeline of the description track is identical to the main video.
  • extended external descriptions: since it’s impossible to create in-band extended descriptions without changing the timeline of the main video, we can only properly solve the issue of extended audio descriptions through external resources. One idea that we came up with is to use a WebSRT file with links to short audio recordings as external extended audio descriptions. These can then be synchronized with the video through JavaScript, which also pauses the video at the correct time. This is probably a sufficient solution for now. It supports both sighted and vision-impaired users and does not extend the timeline of the original video. As an optimization, we can also do this through a single “virtual” resource that is a concatenation of the individual audio cues and is addressed through the WebSRT file with byte ranges.
  • ducking: making the main audio track quieter relative to the video description for the duration of a cue, so that the description can be understood, matters for audio files too, though it may be more difficult to realize
  • separate loudness control: making it possible for the viewer to separately turn the loudness of an audio description up/down in comparison to the main audio
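
The byte-range optimization mentioned in the list above can be sketched as follows. This is only an illustration under assumptions: the clip names and sizes are invented, the helper names are mine, and it presumes that each clip is still decodable when fetched on its own from the concatenated file, which depends on the audio format. The only thing taken from an existing standard is the form of the HTTP `Range` header value, which uses inclusive byte positions.

```python
from typing import Dict, List, Tuple


def byte_ranges(clip_sizes: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each clip name to its inclusive (first_byte, last_byte) range
    inside a file built by concatenating the clips in the given order."""
    ranges: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name, size in clip_sizes:
        ranges[name] = (offset, offset + size - 1)
        offset += size
    return ranges


def range_header(first: int, last: int) -> str:
    """Format the value of an HTTP Range request header for one clip."""
    return f"bytes={first}-{last}"
```

Each cue in the description file would then carry the range of its clip instead of a separate URL, and the player would request just those bytes when the cue becomes active.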

For audio descriptions, we saw the need for introduction of a multitrack video API and markup to synchronize external audio description tracks with the main video. Extended audio descriptions should be solved through JavaScript and hooking up through the TimedTrack API, so mostly rolling it by hand at this stage. We will see how that develops in future. Ducking and separate loudness controls are equally needed here, but we do need more experiments in this space.
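
As a starting point for such experiments, ducking combined with a separate loudness control could look roughly like the sketch below. It is an assumption-laden illustration, not a proposal: the function names and the default ducking factor of 0.25 are arbitrary, and a real player would fade the gain rather than switch it abruptly.

```python
from typing import List, Tuple


def main_audio_gain(t: float, cues: List[Tuple[float, float]],
                    ducked_gain: float = 0.25) -> float:
    """Gain for the main audio at time t: ducked while any description
    cue (start, end) is active, full level otherwise."""
    for start, end in cues:
        if start <= t < end:
            return ducked_gain
    return 1.0


def mix_levels(t: float, cues: List[Tuple[float, float]],
               main_volume: float, description_volume: float,
               ducked_gain: float = 0.25) -> Tuple[float, float]:
    """Combine the viewer's two volume settings with ducking.

    Returns (main level, description level); the description level is
    controlled by the viewer alone and is never ducked."""
    return (main_volume * main_audio_gain(t, cues, ducked_gain),
            description_volume)
```

Keeping the two volumes as separate inputs is what gives the viewer independent control: ducking only scales the main track while a description is playing.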

Finally, we discussed general needs of vision-impaired users in locating accessibility content such as audio descriptions:

  • the need for accessible user menus to turn on/off accessibility content
  • the introduction of dedicated and standardized keyboard short-cuts to turn on and manipulate the volume of audio descriptions (and captions)
  • the introduction of user preferences for automatically activating accessibility content; these could even learn from current usage, such that if a user activates descriptions for a video on one Website, the preferences pick this up; different user profiles are already introduced by ISO in “Access for all” and used in websites such as teachersdomain
  • means to generally locate accessibility content on the web, such as fields in search engines and RSS feeds
  • more generally, there was a request to have caption on/off and description on/off buttons introduced on the remote controls of devices, which will become more relevant with the increasing number of integrated TV/Internet devices

Overall, the workshop was a great success and I am keen to see more experimentation in this space. I also hope that some of the great work that was shown to us at IBM with extended descriptions and text descriptions will become available – if only as screencasts – so we can all learn from it to make better standards and technology.

HTML5 Video element discussions at TPAC meetings

Last week’s TPAC (2009 W3C Technical Plenary / Advisory Committee) meetings were my second time at a TPAC and I found myself becoming highly involved with the progress on accessibility of the HTML5 video element. There were in particular two meetings of high relevance: the Video Accessibility workshop and Friday’s HTML5 breakout group on the video element.

HTML5 Video Accessibility Workshop

The week started on Sunday with the “HTML5 Video Accessibility workshop” at Stanford University, organised by John Foliot and Dave Singer. They brought together a substantial number of people all representing a variety of interest groups. Everyone got their chance to present their viewpoint – check out the minutes of the meeting for a complete transcript.

The list of people and their discussion topics were as follows:

Accessibility Experts

  • Janina Sajka, chair of WAI Protocols and Formats: represented the vision-impaired community and expressed requirements for a deeply controllable access interface to audio-visual content, preferably in a structured manner similar to DAISY.
  • Sally Cain, RNIB, Member of W3C PF group: expressed a deep need for audio descriptions, which are often overlooked next to captions.
  • Ken Harrenstien, Google: has worked on captioning support for video.google and YouTube and shared his experiences, e.g. http://www.youtube.com/watch?v=QRS8MkLhQmM, and automated translation.
  • Victor Tsaran, Yahoo! Accessibility Manager: joined for a short time out of interest.


  • John Foliot, professor at Stanford Uni: showed a captioning service that he set up at Stanford University to enable lecturers to publish more accessible video – it uses humans for transcription, but automated tools to time-align, and provides a Web interface to the staff.
  • Matt May, Adobe: shared what Adobe learnt about accessibility in Flash – in particular that an instream-only approach to captions was a naive approach and that external captions are much more flexible, extensible, and can fit into current workflows.
  • Frank Olivier, Microsoft: attended to listen and learn.


  • Pierre-Antoine Champin from Liris (France), who was not able to attend, sent a video about their research work on media accessibility using automatic and manual annotation.
  • Hironobu Takagi, IBM Labs Tokyo, general chair for W4A: demonstrated a text-based audio description system combined with a high-quality, almost human-sounding speech synthesizer.
  • Dick Bulterman, Researcher at CWI in Amsterdam, co-chair of SYMM (group at W3C doing SMIL): reported on 14 years of experience with multimedia presentations and SMIL (slides) and the need to make temporal and spatial synchronisation explicit to be able to do the complex things.
  • Joakim S