Tag Archives: captions

WebVTT Discussions at FOMS

At the recent FOMS (Foundations of Open Media Software and Standards) Developer Workshop, we had a massive focus on WebVTT and the state of its feature set. You will find links to summaries of the individual discussions in the FOMS Schedule page. Here are some of the key results I went away with.

1. WebVTT Regions

The key driving force for improvements to WebVTT continues to be the accurate representation of CEA608/708 captioning. As part of that drive, we’ve introduced regions (the CEA708 “window” concept) to WebVTT. WebVTT regions satisfy multiple requirements of CEA608/708 captions:

  1. support for rollup captions
  2. support for background color and border color on a group of cues independent of the background color of the individual cue
  3. possibility to move a group of cues from one location on screen to a different one
  4. support to specify an anchor point and a growth direction for cues when their text size changes
  5. support for specifying a fixed number of lines to be rendered
  6. possibility to specify which region is rendered in front of which other one when regions overlap

While WebVTT regions are designed to satisfy all of the above points, the specification isn’t complete yet and some of these needs are not yet met.

We have an open bug to move a region elsewhere. A first discussion at FOMS seemed to indicate that we’ll have to add syntax for updating a region at a particular time and thus give region definitions a way to be valid only for a certain time frame. I can imagine that the region definitions that we have in the header of the WebVTT file now would have an implicitly defined time frame from the start to the end of the file, but can be overruled by a re-definition anywhere within the WebVTT file. That redefinition needs to provide a start and end time.

We registered a bug to add specifying the width and height of regions (and possibly of cues) by em (i.e. by multiples of the largest character in a font). This should allow us to have the region grow/shrink around the region anchor point with a change of font size by script or a user. em specifications should also be applied to cues – that matches the column count of CEA708/608 better.

When regions overlap, the original region extension spec already suggested a “layer” cue setting. It will be easy to add it.

Another change that we will ultimately need concerns the “scroll” setting: we will need to introduce support for scrolling text down, from left to right, or from right to left. For example, vertically scrolling text seems to be used in some Chinese caption use cases.

2. Unify Rendering Approach

The introduction of regions created a second code path in the rendering spec with some duplication. At FOMS we discussed if it was possible to unify that. The suggestion is to render all cues into a region. Those that are not part of a region would be rendered into an anonymous region that covers the complete viewport. There may be some consequences to this, e.g. cue settings should be usable across all cues, no matter whether or not part of a region, and avoiding cue overlap may need to be done within regions.

Here’s a rough outline of the path of the new rendering algorithm:

(1) Render the regions:

Setting          Specified Region    Anonymous Region
                 (render as given)   (render these values)
width            as given            100%
lines            as given            videoheight/lineheight
regionanchor     as given            0,0
viewportanchor   as given            0,0
scroll           as given            none

(2) Render the cues:

  • Create a cue box and put it in its region (anonymous if none given).
  • Calculate position & size of cue box from cue settings (position, line, size).
  • Calculate position of cue text inside cue box from remaining cue settings (vertical, align).
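
The outline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rendering algorithm from the spec: the class and function names (Region, Cue, anonymous_region, assign_cues) are hypothetical, the anonymous region's line count is taken as an integer division of video height by line height, and a cue that names an unknown region is assumed to fall back to the anonymous region.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Region:
    # Field names mirror the region settings listed above; the types are assumptions.
    width: str
    lines: int
    regionanchor: tuple
    viewportanchor: tuple
    scroll: str
    cues: List["Cue"] = field(default_factory=list)


@dataclass
class Cue:
    text: str
    region_id: Optional[str] = None  # None means "not part of any region"


def anonymous_region(video_height: int, line_height: int) -> Region:
    """Build the implicit region that covers the complete viewport."""
    return Region(
        width="100%",
        lines=video_height // line_height,  # videoheight/lineheight
        regionanchor=(0, 0),
        viewportanchor=(0, 0),
        scroll="none",
    )


def assign_cues(regions: Dict[str, Region], cues: List[Cue],
                video_height: int, line_height: int) -> Region:
    """Put every cue into its region, or into the anonymous region.

    Returns the anonymous region so the caller can render it like any other.
    """
    anon = anonymous_region(video_height, line_height)
    for cue in cues:
        # Assumption: an unknown region id falls back to the anonymous region.
        target = regions.get(cue.region_id, anon) if cue.region_id else anon
        target.cues.append(cue)
    return anon
```

With this shape there is a single code path: positioning and overlap avoidance would then run per region, whether the region was specified or anonymous.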

3. Vertical Features

WebVTT includes vertical rendering, both right-to-left and left-to-right. However, regions are not defined for vertical. Eventually, we’re going to have to look at the vertical features of WebVTT with more details and figure out whether the spec is working for them and what real-world requirements we have missed. We hope we can get some help from users in countries where vertically rendered captions/subtitles are the norm.

4. Best Practices

Some of the WebVTT users at FOMS suggested it would be advantageous to start a list of “best practices” for how to author captions with WebVTT. Example recommendations are:

  • Use line numbers only to position cues from top or bottom of viewport. Don’t use otherwise.
  • Note that when the user increases the fontsize in rollup captions and thus introduces new line breaks, your cues will roll by faster because the number of lines of a rollup is fixed.
  • Make sure to use the U+200E (left-to-right mark, &lrm;) and U+200F (right-to-left mark, &rlm;) Unicode markers to control the directionality of your text.

It would be nice if somebody started such a document.

5. Non-caption use cases

Instead of continuing to look back and improve our support of captions/subtitles in WebVTT, one session at FOMS also went ahead and looked forward to other use cases. The following requirements came out of this:

5.1 Preview Thumbnails

A common use case for timed data is the use of preview thumbnails on the navigation bar of videos. A native implementation of preview thumbnails would allow crawlers and search engines to have a standardised way of extracting timed images for media files, so introduction of a new @kind value “thumbnails” was suggested.

The content of a “thumbnails” cue could be any of:

  • an image URL
  • a sprite URL to a single image
  • a spatial & temporal media fragment URL to a media resource
  • base64 encoded image (data URI)
  • an iframe offset to the media resource

The suggestion is to allow anything that would work in an img @src attribute as value in a cue of @kind=“thumbnails”. Responsive images might also be useful for a track of @kind=“thumbnails”. It may even be possible to define an inband thumbnail track based on the track of @kind=“thumbnails”. Such cues should also work in the JavaScript track API.
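
To make the sprite idea concrete, here is a minimal Python sketch that writes such a track, with each cue pointing into a single sprite image through a spatial media fragment (#xywh=x,y,w,h). Everything specific here is an assumption for illustration: the function names, the row-by-row sprite layout, and the fixed interval between thumbnails. A “thumbnails” kind was only a suggestion at FOMS, so this describes a possible file, not a supported feature.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def thumbnail_track(sprite_url: str, count: int, interval: float,
                    columns: int, width: int, height: int) -> str:
    """Emit one cue per thumbnail, each pointing into one sprite sheet.

    Assumes thumbnails are laid out row by row, `columns` per row,
    and that each one covers `interval` seconds of the video.
    """
    blocks = ["WEBVTT"]
    for i in range(count):
        x = (i % columns) * width
        y = (i // columns) * height
        start = format_timestamp(i * interval)
        end = format_timestamp((i + 1) * interval)
        blocks.append(
            f"{start} --> {end}\n{sprite_url}#xywh={x},{y},{width},{height}"
        )
    # Blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

A call such as thumbnail_track("sprite.jpg", 3, 10.0, 2, 160, 90) yields three cues of ten seconds each, which a script could read through the track API and use as the image source for a hover preview.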

5.2 Chapter markers

There is interest in putting richer content than just a chapter title into chapter cues. Often, chapters consist of a title, text and an image. The text is not so important, but the image is used almost everywhere that chapters are used. There may be a need to extend chapter cue content with images, similar to what a @kind=“thumbnails” track offers.

The conclusion that we arrived at was that we need to make @kind=“thumbnails” work first and then look at using the learnings from that to extend @kind=“chapters”.

5.3 Inband tracks for live video

A difficult topic was opened with the question of how to transport text tracks in live video. In live captioning, end times are never created for cues, but are implied by the start time of the next cue. This is a use case that hasn’t been addressed in HTML5/WebVTT yet. An old proposal to allow a special end time value of “NEXT” was discussed and recommended for adoption. Also, there was support for the spec change that stops blocking loading VTT until all cues have been loaded.
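
The semantics of the proposed “NEXT” value can be sketched as follows. This is a hypothetical Python illustration, not spec text: cues are modelled as (start, end, text) tuples sorted by start time, None stands in for “NEXT”, and the function name and the optional stream_end parameter are my own.

```python
from typing import List, Optional, Tuple

# (start seconds, end seconds or None for "NEXT", cue text)
Cue = Tuple[float, Optional[float], str]


def resolve_next_end_times(cues: List[Cue],
                           stream_end: Optional[float] = None) -> List[Cue]:
    """Replace each open end time with the start time of the following cue.

    Assumes `cues` is sorted by start time. The last cue has no successor,
    so it ends at `stream_end`, or stays open (None) if that is unknown.
    """
    resolved = []
    for index, (start, end, text) in enumerate(cues):
        if end is None:
            if index + 1 < len(cues):
                end = cues[index + 1][0]
            else:
                end = stream_end
        resolved.append((start, end, text))
    return resolved
```

In a live stream the last cue stays open until the next one arrives, which is why the sketch leaves its end time as None rather than guessing a duration.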

5.4 Cross-domain VTT loading

A brief discussion centered around the fact that the spec disallows cross-domain loading of WebVTT files, but that no browser implements this. This needs to be discussed at the HTML WG level.

6. Regions in live captioning

The final topic that we discussed was how we could provide support for regions in live captioning.

  • The currently active region definitions will need to become part of every header of every VTT file segment that HLS uses, so it’s available in case the cues in the segment file reference it.
  • “NEXT” in end time markers would make authoring of live captioned VTT files easier.
  • If the application wants to use 1 word at a time and doesn’t want to delay sending the word until the full cue is authored (e.g. in a Hangout type environment), we will need to introduce the concept of “cue continuation markers”, so we know that a cue could be extended with the next VTT file fragment.
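
The first point can be illustrated with a small Python sketch that repeats the active region definitions in the header of every segment. The function name and parameters are hypothetical, the region definitions are passed through as opaque header lines because their syntax was still under discussion, and segments are cut by cue count here purely to keep the example short (a real segmenter would cut by time).

```python
from typing import List


def build_segments(region_header_lines: List[str], cue_blocks: List[str],
                   cues_per_segment: int) -> List[str]:
    """Split cue blocks into segment files that each repeat the region header.

    `region_header_lines` are opaque strings holding the currently active
    region definitions; `cue_blocks` are already formatted cues.
    """
    header = "\n".join(["WEBVTT", *region_header_lines])
    segments = []
    for i in range(0, len(cue_blocks), cues_per_segment):
        chunk = cue_blocks[i:i + cues_per_segment]
        # Header, blank line, then the cues separated by blank lines.
        segments.append(header + "\n\n" + "\n\n".join(chunk) + "\n")
    return segments
```

Because every segment carries the definitions, a player that joins the stream at any segment can still resolve the regions its cues refer to.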

This is an extensive and impressive amount of discussion around WebVTT and a lot of new work to be performed in the future. I’m very grateful for all the people who have contributed to these discussions at FOMS and will hopefully continue to help get the specifications right.

WebVTT as a W3C Recommendation

Three weeks ago I attended TPAC, the annual meeting of W3C Working Groups. One of the meetings was of the Timed Text Working Group (TT-WG), that has been specifying TTML, the Timed Text Markup Language. It is now proposed that WebVTT be also standardised through the same Working Group.

How did that happen, you may ask, in particular since WebVTT and TTML have in the past been portrayed as rival caption formats? How will the WebVTT spec that is currently under development in the Text Track Community Group (TT-CG) move through a Working Group process?

I’ll explain first why there is a need for WebVTT to become a W3C Recommendation, and then how this is proposed to be part of the Timed Text Working Group deliverables, and finally how I can see this working between the TT-CG and the TT-WG.

Advantages of a W3C Recommendation

TTML is an XML-based markup format for captions developed during the time that XML was all the hotness. It has become a W3C standard (a so-called “Recommendation”) despite not having been implemented in any browsers (if you ask me: that’s actually a flaw of the W3C standardisation process: it requires only two interoperable implementations of any kind – and that could be anyone’s JavaScript library or Flash demonstrator – it doesn’t actually require browser implementations. But I digress…). To be fair, a subpart of TTML is by now implemented in Internet Explorer, but all the other major browsers have thus far rejected proposals of implementation.

Because of its Recommendation status, TTML has become the basis for several other caption standards that other SDOs have picked: the SMPTE’s SMPTE-TT format, the EBU’s EBU-TT format, and the DASH Industry Forum’s use of SMPTE-TT. SMPTE-TT has also become the “safe harbour” format for the US legislation on captioning as decided by the FCC. (Note that the FCC requirements for captions on the Web are actually based on a list of features rather than requiring a specific format. But that will be the topic of a different blog post…)

WebVTT is much younger than TTML. TTML was developed as an interchange format among caption authoring systems. WebVTT was built for rendering in Web browsers and with HTML5 in mind. It meets the requirements of the <track> element and supports more than just captions/subtitles. WebVTT is popular with browser developers and has already been implemented in all major browsers (Firefox Nightly is the last to implement it – all others have support already released).

As we can see and as has been proven by the HTML spec and multiple other specs: browsers don’t wait for specifications to have W3C Recommendation status before they implement them. Nor do they really care about the status of a spec – what they care about is whether a spec makes sense for the Web developer and user communities and whether it fits in the Web platform. WebVTT has obviously achieved this status, even with an evolving spec. (Note that the spec tries very hard not to break backwards compatibility, thus all past implementations will at least be compatible with the more basic features of the spec.)

Given that Web browsers don’t need WebVTT to become a W3C standard, why then should we spend effort in moving the spec through the W3C process to become a W3C Recommendation?

The modern Web is now much bigger than just Web browsers. Web specifications are being used in all kinds of devices including TV set-top boxes, phone and tablet apps, and even unexpected devices such as white goods. Videos are increasingly omnipresent, thus exposing deaf and hard-of-hearing users to ever-growing challenges in interacting with content on diverse devices. Some of these devices will not use auto-updating software but fixed versions, so they can’t easily adapt to new features. Thus, caption producers (both commercial and community) need to be able to author captions (and other video accessibility content as defined by the HTML5 <track> element) towards a feature set that is clearly defined to be supported by such non-updating devices.

Understandably, device vendors in this space have a need to build their technology on standardised specifications. SDOs for such device technologies like to reference fixed specifications so the feature set is not continually updating. To reference WebVTT, they could use a snapshot of the specification at any time and reference that, but that’s not how SDOs work. They prefer referencing an officially sanctioned and tested version of a specification – for a W3C specification that means creating a W3C Recommendation of the WebVTT spec.

Taking WebVTT on a W3C recommendation track is actually advantageous for browsers, too, because a test suite will have to be developed that proves that features are implemented in an interoperable manner. In summary, I can see the advantages and personally support the effort to take WebVTT through to a W3C Recommendation.

Choice of Working Group

AFAIK this is the first time that a specification developed in a Community Group is being moved into the recommendation track. This is something that has been expected since the W3C created CGs, but not something that has an established process yet.

The first question of course is which WG would take it through to Recommendation? Would we create a new Working Group or find an existing one to move the specification through? Since WGs involve a lot of overhead, the preference was to add WebVTT to the charter of an existing WG. The two obvious candidates were the HTML WG and the TT-WG – the first because it’s where WebVTT originated and the latter because it’s the closest thematically.

Adding a deliverable to a WG is a major undertaking. The TT-WG is currently in the process of re-chartering and thus a suggestion was made to add WebVTT to the milestones of this WG. TBH that was not my first choice. Since I’m already an editor in the HTML WG and WebVTT is very closely related to HTML and can be tested extensively as part of HTML, I preferred the HTML WG. However, adding WebVTT to the TT-WG has some advantages, too.

Since TTML is an exchange format, lots of captions that will be created (at least professionally) will be in TTML and TTML-related formats. It makes sense to create a mapping from TTML to WebVTT for rendering in browsers. The expertise of both TTML and WebVTT experts is required to develop a good mapping – as has been shown when we developed the mapping from CEA608/708 to WebVTT. Also, captioning experts are already in the TT-WG, so it helps to get a second set of eyes onto WebVTT.

A disadvantage of moving a specification out of a CG into a WG is, however, that you potentially lose a lot of the expertise that is already involved in the development of the spec. People don’t easily re-subscribe to additional mailing lists or want the additional complexity of involving another community (see e.g. this email).

So, a good process needs to be developed to allow everyone to contribute to the spec in the best way possible without requiring duplicate work. How can we do that?

The forthcoming process

At TPAC the TT-WG discussed for several hours what the next steps are in taking WebVTT through the TT-WG to recommendation status (agenda with slides). I won’t bore you with the different views – if you are keen, you can read the minutes.

What I came away with is the following process:

  1. Fix a few more bugs in the CG until we’re happy with the feature set in the CG. This should match the feature set that we realistically expect devices to implement for a first version of the WebVTT spec.
  2. Make a FSA (Final Specification Agreement) in the CG to create a stable reference and a clean IPR position.
  3. Assuming that the TT-WG’s charter has been approved with WebVTT as a milestone, we would next bring the FSA specification into the TT-WG as FPWD (First Public Working Draft) and immediately do a Last Call which effectively freezes the feature set (this is possible because there has already been wide community review of the WebVTT spec); in parallel, the CG can continue to develop the next version of the WebVTT spec with new features (just like it is happening with the HTML5 and HTML5.1 specifications).
  4. Develop a test suite and address any issues in the Last Call document (of course, also fix these issues in the CG version of the spec).
  5. As per W3C process, substantive and minor changes to Last Call documents have to be reported and raised issues addressed before the spec can progress to the next level: Candidate Recommendation status.
  6. For the next step – Proposed Recommendation status – an implementation report is necessary, and thus the test suite needs to be finalized for the given feature set. The feature set may also be reduced at this stage to just the ones implemented interoperably, leaving any other features for the next version of the spec.
  7. The final step is Recommendation status, which simply requires sufficient support and endorsement by W3C members.

The first version of the WebVTT spec naturally has a focus on captioning (and subtitling), since this has been the dominant use case that we have focused on thus far and it’s the part that is the most compatibly implemented feature set of WebVTT in browsers. It’s my expectation that the next version of WebVTT will have a lot more features related to audio descriptions, chapters and metadata. Thus, this seems a good time for a first version feature freeze.

There are still several obstacles towards progressing WebVTT as a milestone of the TT-WG. Apart from the need to get buy-in from the TT-WG, the TT-CG, and the AC (Advisory Committee, who have to approve the new charter), we’re also looking at the license of the specification document.

The CG specification has an open license that allows creating derivative work as long as there is attribution, while the W3C document license for documents on the recommendation track does not allow the creation of derivative work unless given explicit exceptions. This is an issue that is currently being discussed in the W3C with a proposal for a CC-BY license on the Recommendation track. However, my view is that it’s probably ok to use the different document licenses: the TT-WG will work on WebVTT 1.0 and give it a W3C document license, while the CG starts working on the next WebVTT version under the open CG license. It probably actually makes sense to have a less open license on a frozen spec.

Making the best of a complicated world

WebVTT is now proposed as part of the recharter of the TT-WG. I have no idea how complicated the process will become to achieve a W3C WebVTT 1.0 Recommendation, but I am hoping that what is outlined above will be workable in such a way that all of us get to focus on progressing the technology.

At TPAC I got the impression that the TT-WG is committed to progressing WebVTT to Recommendation status. I know that the TT-CG is committed to continue developing WebVTT to its full potential for all kinds of media-time aligned content with new kinds already discussed at FOMS. Let’s enable both groups to achieve their goals. As a consequence, we will allow the two formats to excel where they do: TTML as an interchange format and WebVTT as a browser rendering format.

WebVTT at W3C

Today we started a community group (CG) at the W3C for “Web Media Text Tracks”: http://www.w3.org/community/texttracks/.

The group has been created to work on many aspects of video text tracks of which captioning and the WebVTT format are key parts.

The main reason behind creating this group is to create a forum at the W3C for working on WebVTT to allow all browsers to support this format and be involved in its development.

We’ve not gone the full way to creating a Working Group, although that was the initial intention. We had objections from W3C members for going down that path, so are using the CG path for now.

This is actually a good thing because CGs are open for anyone to join, while WGs are only open to W3C members. The key difference is that specs coming out of WGs can become RECs (“standards”), while CG’s specs cannot.

If we eventually see a need to move WebVTT to a REC, that move will be straightforward, since there is a clear path for work to transition from a CG to a WG.

Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post to make it more efficient than answering tweets, emails, and skype and other phone conference requests. It’s about time I do a proper post about it.


I’m starting with the last area, because it is the simplest to answer.

No, no browser has as yet shipped support for the <track> element and therefore there is no support for WebVTT in browsers yet. However, implementations are in progress. For example, WebKit has recently received first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but is still working on the element’s actual functionality.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator – a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub – a polyfill for track and SRT parsing
  • jscaptions – a polyfill for track and SRT parsing
  • LeanBack player – a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr – a video player that includes track and WebVTT parsing
  • MediaElementJS – a video player that includes track and SRT parsing
  • Kaltura’s video player – a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss with Ian, me and the WHATWG about their meaning. His review notes with visual description of how settings are to be interpreted and his demo will be most useful to authors and other developers.


Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months – and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the <track> element – the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer – I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic:
    Previously, the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span:
    The <b> element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors:
    The spec already allowed using the names of tags, the classes of <c> tags, and the voice annotations of <v> tags as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used.
  • text-decoration support:
    The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as blinking and underlined text.
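
As an illustration of the simpler file magic, here is a minimal Python sketch of a signature check. The function name is hypothetical, and the check is deliberately lenient: it assumes that a first line of “WEBVTT” may be followed by a space or tab and further text, which also lets older files starting with “WEBVTT FILE” pass.

```python
import re


def has_webvtt_signature(text: str) -> bool:
    """Return True if the first line is "WEBVTT", optionally followed by
    a space or tab and arbitrary trailing text."""
    # Ignore a leading byte order mark, if present.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Take everything up to the first line terminator (CRLF, CR or LF).
    first_line = re.split(r"\r\n|\r|\n", text, maxsplit=1)[0]
    if first_line == "WEBVTT":
        return True
    return first_line.startswith(("WEBVTT ", "WEBVTT\t"))
```

A check like this is a cheap way for a player or polyfill to tell WebVTT apart from SRT, whose first line is a cue number rather than a signature.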

Further to this, the email identifies the means in which WebVTT is extensible:

  • Header area:
    The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and a blank line (two consecutive line terminators) as an end. It is possible to add file-wide header information into this area.
  • Cues:
    Cues are defined to start with an optional identifier, and then a start/end time specification with a “-->” separator. They end with a blank line (two consecutive line terminators). Cues that contain a “-->” separator but don’t parse as valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues:
    Finally, within cues, everything that is within a “tag”, i.e. between “<” and “>”, and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so we can introduce new markup tags in this way.
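
The skipping behaviour for cues can be sketched in Python as follows. This is a simplified illustration, not the parser algorithm from the spec: the function names and the three labels are hypothetical, and the timestamp pattern only covers the basic [hh:]mm:ss.ttt form.

```python
import re

# Simplified timestamp: optional hours, then mm:ss.ttt
TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
TIMING_LINE = re.compile(
    rf"^({TIMESTAMP})[ \t]+-->[ \t]+({TIMESTAMP})(?:[ \t]+(.*))?$"
)


def split_blocks(text):
    """Split a file into blocks (lists of lines) separated by blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [b.split("\n") for b in re.split(r"\n{2,}", text.strip("\n")) if b]


def classify(block):
    """Label a block as 'cue', 'skipped' or 'other'.

    'skipped' means the block contains "-->" but no valid start/end time,
    which is the hook that extension blocks could use.
    """
    # The timing line is the first line, or the second after an identifier.
    for line in block[:2]:
        if "-->" in line:
            return "cue" if TIMING_LINE.match(line) else "skipped"
    return "other"
```

A parser built this way ignores a block such as “STYLE -->” today, which is what makes it possible to give such blocks a meaning later without breaking existing implementations.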

Given this background, the following V2 extensions have been discussed:

  • Metadata:
    Enter name-value pairs of metadata into the header area, e.g.

    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings:
    Default cue settings can come in a “cue” of their own, e.g.

    DEFAULTS --> D:vertical A:end

    00:00.000 --> 00:02.000
    This is vertical and end-aligned.

    00:02.500 --> 00:05.000
    As is this.

    DEFAULTS --> A:start

    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
  • Inline CSS:
    Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

      STYLE -->
      ::cue(v[voice=Bob]) { color: green; }
      ::cue(c.narration) { font-style: italic; }
      ::cue(c.narration i) { font-style: normal; }

      00:00.000 --> 00:02.000
      <v Bob>Welcome.

      00:02.500 --> 00:05.000
      <c .narration>To <i>WebVTT</i>.
  • Comments:
    Both comments within cues and completely commented-out cues are possible, e.g.

     COMMENT -->
     00:02.000 --> 00:03.000
     two; this is entirely
     commented out

     00:06.000 --> 00:07.000
     this part of the cue is visible
     <! this part isn't >
     <and neither is this>

Finally, I believe we still need to add the following features:

  • Language tags:
    I’d like to add a language tag that allows marking up a subpart of cue text as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). But more importantly we will need this feature for cues that contain text descriptions rather than captions, such that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a <lang jp>xxx</lang> type of markup.
  • Roll-up captions:
    When we use timestamp objects and the future text is hidden, then is un-hidden upon reaching its time, we should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which TV live captions have been displayed and so users are acquainted with this display style.
  • Inline navigation:
    For chapter tracks the primary use of cues is for navigation. In other formats – in particular in DAISY-books for blind users – there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a <nav> tag that can group several timestamp objects for navigation.
  • Default caption width:
    At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overruled with the “S” cue setting. I think it should by default rather be the width (height) of the bounding box around all the text inside the cue.

Aside from these changes to WebVTT, there are also some things that can be improved on the <track> element. I personally support the introduction of the source element underneath the track element, because that allows us to provide different caption files for different devices through the @media media queries attribute and it allows support for more than just one default captioning format. This change needs to be made soon so we don’t run into trouble with the currently empty track element.

I further think a oncuelistchange event would be nice as well in cases where the number of tracks is somehow changed – in particular when coming from within a media file.

Other than this, I’m really very happy with the state that we have achieved thus far.

WebVTT explained

On Wednesday, I gave a talk at Google about WebVTT, the Web Video Text Track file format that is under development at the WHATWG for solving time-aligned text challenges for video.

I started by explaining all the features that WebVTT supports for captions and subtitles, mentioned how WebVTT would be used for text audio descriptions and navigation/chapters, and explained how it is included into HTML5 markup, such that the browser provides some default rendering for these purposes. I also mentioned the metadata approach that allows any timed content to be included into cues.

The talk slides include a demo of how the <track> element works in the browser. I’ve actually used the Captionator polyfill for HTML5 to make this demo, which was developed by Chris Giffard and is available as open source from GitHub.

The talk was recorded and has been made available as a Google Tech talk with captions and also a separate version with extended audio descriptions.

The slides of the talk are also available (best to choose the black theme).

I’ve also created a full transcript of the described video.

Get the WebVTT specification from the WHATWG Website.

State of Media Accessibility in HTML5

Today I gave a talk at the Open Video Conference about the state of the specifications in HTML5 for media accessibility.

To be clear: at this exact moment, there is no actual specification text in the W3C version of HTML5 for media accessibility. There is, however, some text in the WHATWG version, providing a framework for text-based alternative content. Other alternative content still requires new specification text. Finally, there is no implementation in any browser yet for media accessibility, but we are getting closer. As browser vendors are moving towards implementing support for the WHATWG specifications of the <track> element, the TimedTrack JavaScript API, and the WebSRT format, video sites can also experiment with the provided specifications and contribute feedback to improve the specifications.

Attached are my slides from today’s talk. I went through some of the key requirements of accessibility users and showed how they are being met by the new specifications (in green) or could be met with some still-to-be-developed specifications (in blue). Note that the talk and slides focus on accessibility needs, but the developed technologies will be useful far beyond accessibility and will also help satisfy other needs, such as internationalization (through subtitles), exposing multitrack audio/video (through the JavaScript API), providing timed metadata (through WebSRT), or even supporting Karaoke (through WebSRT). In the tables on the last two pages I summarize the gaps in the specifications that we will be working on next and also show what is already possible with the given specifications.

Introducing media accessibility into HTML5

In recent months, people in the W3C HTML5 Accessibility Task Force developed two proposals for introducing caption, subtitle, and more generally time-aligned text support into HTML5 audio and video.

These time-aligned text files can either come as external files that are associated with the timeline of the media resource, or they come as part of the media resource in a binary track.

For both cases we now have proposals to extend the HTML5 specification.

Firstly, let’s look at time-aligned text in external files. The change proposal introduces markup to associate such external files as a kind of “virtual track” with a media resource. Here is an example:

<video src="video.ogv">
<track src="video_cc.ttml" type="application/ttaf+xml" language="en" role="caption"></track>
<track src="video_tad.srt" type="text/srt" language="en" role="textaudesc"></track>
<trackgroup role="subtitle">
<track src="video_sub_en.srt" type="text/srt; charset='Windows-1252'" language="en"></track>
<track src="video_sub_de.srt" type="text/srt; charset='ISO-8859-1'" language="de"></track>
<track src="video_sub_ja.srt" type="text/srt; charset='EUC-JP'" language="ja"></track>
</trackgroup>
</video>

The video resource is “video.ogv”. Associated with it are five timed text resources.

The first one is written in TTML (which is the new name for DFXP), is a caption track and in English. TTML is particularly useful when you want to provide more than just an unformatted piece of text to the viewers. Hearing-impaired users appreciate any visual help they can be provided with to absorb the caption text more quickly. This includes colour coding of speakers, positioning of text close to the speaking person on screen, or even animated musical notes to signify music. Thus, a format like TTML that allows for formatting and positioning information is an appropriate format to specify captions.

All other timed text resources are provided in SRT format, which is a simpler format than TTML with only plain text in the text cues.

The second text track is a textual audio description track. A textual audio description is in fact targeted at the vision-impaired and contains text that is expected to be read out by a screen reader or routed to a braille device. Thus, as the video plays, a vision-impaired user receives additional information about the visual content of the scene through their screen reader or braille device. The SRT format is particularly useful for providing textual audio descriptions since it only provides plain text, which can easily be handed on to assistive technology. When authoring such textual audio descriptions, it is very important to pick time intervals in the original media resource where no other significant audio cue is provided, such that the vision-impaired user is able to listen to the screen reader during that time.

The last three text tracks are subtitle tracks. They are grouped into a trackgroup element, which is not strictly necessary, but enables the author to say that these tracks are supposed to be alternatives. Thus, a Web Browser can create a menu with all the available tracks and put the tracks in the trackgroup into a menu of their own where only one option is selectable (similar to how radiobuttons work). Incidentally, the trackgroup element also makes it possible to avoid repeating the role attribute on all the contained tracks. It is expected that these menus will be added to the default media controls and will thus be visible if the media element has a controls attribute.

With the role, type and language attributes, it is easy for a Web Browser to understand what the different tracks have to offer. A Web Browser can even decide to offer new functionality that is helpful to certain user groups. For example, an addition to a Web Browser’s default settings could be to allow users to instruct a Web Browser to always turn on captions or subtitles if they are available in the user’s main language. Or to always turn on textual audio descriptions. In this way, a user can customise their default experience of a media resource over and on top of what a Web page author decides to expose.

Incidentally, the choice of “track” as a name for relating external text resources to a media element has a deeper meaning. It is easily possible in future to extend “track” elements to not just point to dependent text resources, but also to dependent audio or video resources. For example, an actual audio description that is a recording of a human voice rather than a rendered text description could be associated in the same way. Right now, such an implementation is not envisaged by the Browser vendors, but it will be something to work towards in the future.

Now, with such functionality available, there is naturally a desire to be able to control activation or de-activation of text tracks through JavaScript, not just through user interaction. A Web Developer may for example want to override the default controls provided by a Web Browser and run their own JavaScript-based controls, which requires them to create their own selection menu for the tracks.

This is actually also an issue more generally and applies to all track types, including such tracks that come inside an existing media resource. In the current specification such tracks are not exposed and can therefore not be activated.

This is where the second specification that the W3C Accessibility Task Force has worked towards comes in: the media multitrack JavaScript API.

This specification introduces a read-only JavaScript interface to the audio and video elements to allow Web Developers to find out about the tracks (including the virtual tracks) that a media resource offers. The only action that the interface currently provides is to enable or disable tracks.
Here is an example use to turn on a French subtitle track:

if (video.tracks[2].role == "subtitle" && video.tracks[2].language == "fr") video.tracks[2].enabled = true;
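The one-liner above assumes the French subtitle track sits at index 2. A minimal sketch of a more general lookup follows, assuming only the role, language and enabled properties used above. The helper name enableTrack and the plain array standing in for video.tracks are hypothetical and not part of the proposal; they are there so the snippet runs outside a browser.

```javascript
// Hypothetical helper, not part of the proposed API.
// Enables the first track matching role and language.
// Returns the index of that track, or -1 if none matches.
function enableTrack(tracks, role, language) {
  for (let i = 0; i < tracks.length; i++) {
    const track = tracks[i];
    if (track.role === role && track.language === language) {
      track.enabled = true;
      return i;
    }
  }
  return -1;
}

// Stand-in for video.tracks (an assumption for illustration only).
const tracks = [
  { role: "caption", language: "en", enabled: false },
  { role: "subtitle", language: "en", enabled: false },
  { role: "subtitle", language: "fr", enabled: false },
];

const index = enableTrack(tracks, "subtitle", "fr");
```

Searching by role and language keeps the script working when the order or number of tracks changes.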

There is still a need to introduce a means to actually expose the text cues as they relate to the currentTime of the media resource. This has not yet been specified in the given proposals.

The text cues could be exposed in several ways. They could be exposed through introducing an event, i.e. every time a new text cue becomes active, a callback is called which is given the active text cue (if such a callback had been registered previously). Another option is to simply write the text cues into a specified div-element in the DOM and thus expose them directly in the Browser. A third idea could be to expose the text cues in an iframe-like element to avoid any cross-site security issues. And a fourth idea that we have discussed is to expose the text cues in an attribute of the track.
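To make the first option more concrete, here is a rough sketch of what event-style exposure could look like from a script author’s point of view. Nothing in it comes from the proposals: createCueNotifier, the cue objects with start, end and text fields, and the update function are all hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: none of these names come from the proposals.
// Cues are plain objects { start, end, text } with times in seconds.
function createCueNotifier(cues, onCueActive) {
  const active = new Set();
  // Call the returned function whenever the media's currentTime changes.
  return function update(currentTime) {
    for (const cue of cues) {
      const isActive = currentTime >= cue.start && currentTime < cue.end;
      if (isActive && !active.has(cue)) {
        active.add(cue);
        onCueActive(cue); // fires once, when the cue becomes active
      } else if (!isActive) {
        active.delete(cue);
      }
    }
  };
}

// Usage: collect the text of each cue as it becomes active.
const seen = [];
const update = createCueNotifier(
  [
    { start: 0, end: 2, text: "Hello" },
    { start: 2, end: 4, text: "World" },
  ],
  (cue) => seen.push(cue.text)
);
[0.5, 1.0, 2.5, 3.0].forEach((time) => update(time));
```

In a browser, update would be driven by the media element’s playback position rather than by a fixed list of times.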

All of this obviously also relates to how to actually render the text cues and whether to render them in a shadow DOM so as to make the JavaScript reading separate from the rendering and address security and copyright issues. I’d be curious to hear opinions here on how it should be done.

Embedding time-aligned text into Ogg

As part of my accessibility work for Mozilla and Xiph, it is necessary to define how time-aligned text, such as subtitles, captions, or annotations, is encapsulated into Ogg. In the fansubber community this is called “hard subtitles” as opposed to “soft subtitles” which are subtitles that stay in a text file and are loaded separately to the video file into a media player and synchronised with the video by the media player. (As per comment below, all text annotations are “soft” – or also “closed”.)

I can hear you ask: so how do I do subtitles/captions with Ogg now? Well, it would have been possible to simply choose one subtitling format and map that into Ogg, then ask everyone to just use that one format and be done. But which one to choose? And why prefer a simpler one over a more complex one? And why just do subtitles and not any other time-aligned text?

So, instead, I analysed what types of time-aligned text “codecs” I have come across. Each one would have a multitude of text formats to capture the text data, because it is easy to invent a new format and standardisation hasn’t really happened in this space yet.

I have come up with the following list of typical time-aligned text codecs:

  • CC: closed captions (for the deaf)
  • SUB: subtitles
  • TAD: textual audio descriptions (for the blind – to be transferred to braille or TTS)
  • KTV: karaoke
  • TIK: ticker text
  • AR: active regions
  • NB: metadata & semantic annotations
  • TRX: transcripts / scripts
  • LRC: lyrics
  • LIN: linguistic markup
  • CUE: cue points, DVD style chapter markers and similar navigational landmarks

Let me know if you can think of any other classes of video/audio-related time-aligned text.

All of these texts can be represented in text files with some kind of time marker, and possibly some header information to set up the interpretation environment. So, the simplest way of creating a representation of these inside Ogg was to define a generic mapping for time-aligned text into Ogg.

The Xiph wiki holds the current draft specification for mapping text codecs into Ogg. For anyone wanting to map a text codec into Ogg, this should provide the framework. The idea is to separate the text codec’s data into header data and into timed text segments (which can have all sorts of styling and other information with it). Then, the mapping is simple. An example for srt is described on the wiki page.
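As an illustration of the separation step only, here is a small sketch that splits srt text into timed text segments. It is not the mapping from the wiki page: the function names and the shape of the returned objects are assumptions made for this example. srt has no header section, so only the timed segments appear.

```javascript
// Hypothetical sketch; the actual Ogg mapping is defined on the Xiph wiki.
// Converts an srt timestamp "HH:MM:SS,mmm" into seconds.
function srtTimeToSeconds(stamp) {
  const match = /^(\d+):(\d{2}):(\d{2}),(\d{3})$/.exec(stamp.trim());
  if (!match) throw new Error("Bad srt timestamp: " + stamp);
  const [, h, m, s, ms] = match.map(Number);
  return h * 3600 + m * 60 + s + ms / 1000;
}

// Splits srt text into segments of the form { start, end, text }.
function splitSrt(text) {
  const blocks = text.replace(/\r\n/g, "\n").trim().split(/\n{2,}/);
  return blocks.map((block) => {
    const lines = block.split("\n");
    // lines[0] is the counter, lines[1] the timing line, the rest is cue text.
    const [start, end] = lines[1]
      .split("-->")
      .map((stamp) => srtTimeToSeconds(stamp));
    return { start, end, text: lines.slice(2).join("\n") };
  });
}

const sample =
  "1\n00:00:01,000 --> 00:00:03,500\nHello\n\n" +
  "2\n00:00:04,000 --> 00:00:06,000\nSecond cue\nwith two lines\n";
const segments = splitSrt(sample);
```

Each resulting segment carries its own start and end time, which is the information a mapping needs in order to place the text on the media timeline.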

The specification is still in draft status, because we’re still expecting feedback. In fact, what we now need is people trying an implementation and providing fixes to the specification.

To map your text codec of choice into Ogg, you will probably require further mapping specifications. Depending on how complex your text codec of choice is, these additional mapping specifications may be rather simple or quite complicated. In the case of srt, it should be trivial. Considering the massive amount of srt already freely available online, the srt mapping may well have a really large impact. Enough hints. Let me know if you’re coding up something!

My next duty is to look for a representation that is generic enough to provide representations for any of the above listed text codecs. This representation is what will need to be available to a Web Browser when working with a Web video that has related text. Current contenders are OggKate and W3C TimedText, but I am not sure whether either is too restrictive. I am indeed looking for the next generation of captioning technology that will be able to provide any type of time-aligned text that relates to audio/video.