2011, Sep 19th, W3C Web and TV Workshop: “WebVTT in HTML5”


Position Statement
“WebVTT in HTML5 for video accessibility and beyond”

The purpose of this statement is to introduce WebVTT [1] as a basis for discussing the requirements that TV use cases place on time-synchronized Web applications.

HTML5 offers a generic means to associate time-aligned text or metadata with audio and video resources through the new <track> element [2]. In theory, <track> can accept any number of file formats as input, just as <img>, <video> and <audio> can in theory accept any image, video or audio resource. In practice, however, the choice of file format is restricted to what the browser vendors decide to support.

One format that has been custom-developed for HTML requirements, and that most modern desktop browsers are implementing or plan to implement, is the WebVTT file format. WebVTT is short for Web Video Text Tracks. It is a line-based file format that supplies data to the audio or video element by time interval within the timeline of the media resource. The time intervals are called “cues”. WebVTT thus provides a generic platform for time-synchronized application use cases around HTML5 audio and video.
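To make the idea of a line-based file of timed cues concrete, here is a minimal sketch in Python that reads WebVTT-style text into a list of cues. It is an illustration only, not the parsing algorithm from the specification: the function name `parse_cues` and the sample text are made up for this example, and cue settings, comments and error handling are ignored.

```python
import re

# Timestamp: optional hours, then mm:ss.ttt
_TS = r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
_TIMING = re.compile("^" + _TS + r"\s+-->\s+" + _TS + r"(.*)$")


def _seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(text):
    """Return a list of (start, end, payload) tuples from WebVTT-style text."""
    cues = []
    # Blocks are separated by one or more blank lines.
    blocks = re.split(r"\n{2,}", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.match(line)
            if match:
                parts = match.groups()
                start = _seconds(*parts[0:4])
                end = _seconds(*parts[4:8])
                # Everything after the timing line is the cue payload.
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return cues


# Hypothetical sample file: a header, one plain cue, one cue with an identifier.
SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, world.

intro
00:00:05.500 --> 00:00:08.250
Second cue,
on two lines.
"""

cues = parse_cues(SAMPLE)
```

Each resulting tuple holds the start and end of the interval in seconds plus the raw payload, which is all an application needs to decide what is active at a given playback time.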

The main use cases that motivated the creation of <track> and WebVTT are in accessibility: providing text alternatives along the media timeline.

The use of WebVTT for captions and subtitles has been described in detail. WebVTT’s functionality is comparable to that of modern TV caption formats. Positioning and cue size are specified through cue settings. A subset of CSS has been specified as applicable to WebVTT cue styling. In this way, WebVTT can also be used outside Web browsers by applications that do not support a full CSS engine but can implement the small number of styling commands specified.
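As a rough illustration of cue settings, the following Python sketch splits a settings string into name/value pairs. The setting names used here (`line`, `position`, `size`, `align`) follow later WebVTT drafts and are an assumption; the syntax in the draft current at the time of this statement may differ. The function name `parse_cue_settings` is hypothetical.

```python
def parse_cue_settings(settings):
    """Split a settings string such as 'line:10% align:start' into a dict."""
    result = {}
    for token in settings.split():
        name, sep, value = token.partition(":")
        # Ignore tokens that are not of the form name:value.
        if sep and name and value:
            result[name] = value
    return result


# Settings as they might follow the timing line of a cue (assumed names).
example = parse_cue_settings("line:10% position:50% size:40% align:start")
```

A renderer without a full CSS engine could map such a dictionary directly onto its own layout primitives.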

The specification of how to use WebVTT for DVD-style chapters has been detailed recently. It allows for a hierarchy of chapters with arbitrary depth, which is very useful for navigation purposes. When made keyboard accessible, this hierarchical access will satisfy the navigation needs of blind users, and is equally useful to any user.
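One way to picture a chapter hierarchy is to treat a chapter cue whose interval lies entirely inside another cue's interval as a sub-chapter. The Python sketch below builds such a tree under that assumption; it is not taken from the specification text, and the function name `build_chapter_tree` and the sample chapters are invented for illustration.

```python
def build_chapter_tree(cues):
    """Nest chapter cues: each cue becomes a child of the innermost cue containing it."""
    root = {"title": None, "start": float("-inf"), "end": float("inf"), "children": []}
    stack = [root]
    # Sort by start time; for equal starts, the longer (outer) cue comes first.
    for start, end, title in sorted(cues, key=lambda c: (c[0], -c[1])):
        # Pop chapters that do not fully contain this cue (root always does).
        while not (stack[-1]["start"] <= start and end <= stack[-1]["end"]):
            stack.pop()
        node = {"title": title, "start": start, "end": end, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Hypothetical chapter cues as (start, end, title) in seconds.
chapters = [
    (0, 100, "Part 1"),
    (0, 40, "Scene 1"),
    (40, 100, "Scene 2"),
    (100, 200, "Part 2"),
]

tree = build_chapter_tree(chapters)
```

Here "Scene 1" and "Scene 2" end up as children of "Part 1", while "Part 2" is a top-level chapter with no children, which is the kind of structure a keyboard-accessible navigation menu could walk.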

WebVTT can also be used to provide text descriptions for media resources. While this has not been specified in depth yet, it is expected that, in the first instance, WebVTT cues for text descriptions will be purely textual without markup. These can be rendered through screen readers using browser accessibility APIs, in a similar manner to how ARIA live regions are rendered. Further developments are possible here, e.g. to provide prosody for voicing, even though that is not yet a typical feature of screen readers.

The <track> element and WebVTT have also been developed with use cases in mind that go beyond these mainly accessibility-motivated requirements. For these use cases there is a catch-all kind of track, called “metadata”. It can be used for any type of timed metadata or timed text use case. Examples include timed and positioned annotations (similar to how YouTube’s annotations work), timed geo-coordinates, or timed and positioned hyperlinks. The rendering has to be provided through JavaScript, so the way in which the data is specified is custom and can take any form, including JSON or XML.
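The logic a page script would apply to a “metadata” track can be sketched as follows: find the cues whose interval covers the current playback time and decode their payloads. The sketch is written in Python purely for illustration (in a browser this would be done in JavaScript against the track's cues), and the function name `active_metadata`, the JSON field names and the coordinates are all hypothetical.

```python
import json


def active_metadata(cues, current_time):
    """Decode the JSON payload of every cue whose interval covers current_time."""
    return [
        json.loads(payload)
        for start, end, payload in cues
        if start <= current_time < end
    ]


# Hypothetical timed geo-coordinates as (start, end, JSON payload) in seconds.
geo_cues = [
    (0.0, 10.0, '{"lat": -33.87, "lon": 151.21}'),
    (10.0, 20.0, '{"lat": -37.81, "lon": 144.96}'),
]

position = active_metadata(geo_cues, 12.5)
```

Because the payload is opaque to the format, the same lookup works unchanged whether the cues carry coordinates, annotation rectangles or hyperlinks; only the decoding and rendering steps are application-specific.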

A WebVTT working group is currently being created at the W3C. Given this understanding of the existing capabilities of HTML5 for time-synchronized data, this position paper would like to explore what further standardisation needs we may be able to foresee.


[1] WebVTT specification:

[2] Track element:

