Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post, which is more efficient than answering tweets, emails, Skype calls, and other conference call requests one by one. It’s about time I wrote a proper post about it.

Implementations

I’m starting with the last area, because it is the simplest to answer.

No, no browser has yet shipped support for the <track> element, and therefore there is no native support for WebVTT in browsers yet. However, implementations are in progress. For example, WebKit has recently received its first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but work on the element’s actual functionality is still under way.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator – a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub – a polyfill for track and SRT parsing
  • jscaptions – a polyfill for track and SRT parsing
  • LeanBack player – a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr – a video player that includes track and WebVTT parsing
  • MediaElementJS – a video player that includes track and SRT parsing
  • Kaltura’s video player – a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss their meaning with Ian, me, and the WHATWG. His review notes, which include a visual description of how the settings are to be interpreted, and his demo will be most useful to authors and other developers.

Standardisation

Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months – and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the <track> element – the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer – I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic:
    Previously, the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span:
    The <b> element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors:
    The spec already allowed the names of tags, the classes of <c> tags, and the voice annotations of <v> tags to be used as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used as the ID.
  • text-decoration support:
    The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as blinking text and underlining.
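
For tool authors, the simplified file magic is easy to check. The following Python sketch is only an illustration: `is_webvtt_signature` and its `accept_legacy` flag are hypothetical names, and whether a parser should keep accepting the older “WEBVTT FILE” form is an assumption made here, not something this post prescribes.

```python
def is_webvtt_signature(first_line, accept_legacy=True):
    """Return True if first_line looks like a WebVTT file signature.

    Assumption: the signature is the whole first line, optionally
    preceded by a byte order mark.
    """
    line = first_line.lstrip("\ufeff").rstrip("\r\n")
    if line == "WEBVTT":
        return True
    # Hypothetical leniency: also accept the previous "WEBVTT FILE" form.
    return accept_legacy and line == "WEBVTT FILE"
```

A strict checker would call it with `accept_legacy=False` so that only the new single-word form passes.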

Further to this, the email identifies the ways in which WebVTT is extensible:

  • Header area:
    The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and two empty lines as an end. It is possible to add file-wide header information into this area.
  • Cues:
    Cues are defined to start with an optional identifier, followed by a start/end time specification with a “-->” separator. They end with two empty lines. Cues that contain a “-->” separator but don’t parse as a valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues:
    Finally, within cues, everything that is within a “tag”, i.e. between “<” and “>”, and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so we can introduce new markup tags in this way.
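
To make the cue-level extension point concrete, here is a minimal Python sketch of how a parser might separate real cues from “cues” whose “-->” line does not parse as a timing. All names (`split_blocks`, `classify_block`, the dictionary keys) are hypothetical, the timestamp pattern is a simplification, and this is not the parsing algorithm from the spec.

```python
import re

# Simplified timestamp: [hh:]mm:ss.ttt (an assumption, not the spec grammar).
_TS = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
_TIMING = re.compile(rf"^\s*({_TS})\s+-->\s+({_TS})(?:\s+(.*))?$")


def split_blocks(text):
    """Split a file into blocks separated by one or more blank lines."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    return [b.split("\n") for b in re.split(r"\n(?:[ \t]*\n)+", normalized) if b]


def classify_block(lines):
    """Label a block as a cue, a skipped '-->' block, or something else."""
    # The timing line is either the first line or follows an identifier.
    for index, line in enumerate(lines[:2]):
        if "-->" in line:
            break
    else:
        return {"kind": "other", "lines": lines}
    match = _TIMING.match(line)
    if match is None:
        # Has the separator but no valid times: a candidate command block.
        return {"kind": "skipped", "marker": line.strip(), "body": lines[index + 1:]}
    return {
        "kind": "cue",
        "id": lines[0] if index == 1 else None,
        "start": match.group(1),
        "end": match.group(2),
        "settings": match.group(3) or "",
        "text": lines[index + 1:],
    }
```

A tool that understands a particular command block could inspect the `marker` of skipped blocks, while a player that does not would simply drop them.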

Given this background, the following V2 extensions have been discussed:

  • Metadata:
    Enter name-value pairs of metadata into the header area, e.g.

    WEBVTT
    Language=zh
    Kind=Caption
    Version=V1_ABC
    License=CC-BY-SA
    
    1
    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings:
    Default cue settings can come in a “cue” of their own, e.g.

    WEBVTT
    
    DEFAULTS --> D:vertical A:end
    
    00:00.000 --> 00:02.000
    This is vertical and end-aligned.
    
    00:02.500 --> 00:05.000
    As is this.
    
    DEFAULTS --> A:start
    
    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
    
  • Inline CSS:
    Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

    WEBVTT
    
    STYLE -->
    ::cue(v[voice=Bob]) { color: green; }
    ::cue(c.narration) { font-style: italic; }
    ::cue(c.narration i) { font-style: normal; }
    
    00:00.000 --> 00:02.000
    <v Bob>Welcome.
    
    00:02.500 --> 00:05.000
    <c.narration>To <i>WebVTT</i>.
    
  • Comments:
    Both comments within cues and complete cues that are commented out are possible, e.g.

    WEBVTT
    
    COMMENT -->
    00:02.000 --> 00:03.000
    two; this is entirely
    commented out
    
    00:06.000 --> 00:07.000
    this part of the cue is visible
    <! this part isn't >
    <and neither is this>
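
As a rough illustration of how a tool could consume the proposed metadata header and DEFAULTS blocks, here is a Python sketch. None of this syntax is final, and every name in it (`parse_metadata`, `parse_settings`, `resolve_cues`) is hypothetical. It assumes that a new DEFAULTS block replaces the previous one entirely, which is what the example above implies, and that settings on a cue’s own timing line override the defaults.

```python
import re

# Simplified timestamp: [hh:]mm:ss.ttt (an assumption, not the spec grammar).
_TS = r"(?:\d+:)?\d{2}:\d{2}\.\d{3}"
_TIMING = re.compile(rf"^({_TS})\s+-->\s+({_TS})\s*(.*)$")
_DEFAULTS = "DEFAULTS -->"


def parse_metadata(header_block):
    """Collect Name=Value pairs from the lines after the WEBVTT line."""
    meta = {}
    for line in header_block.splitlines()[1:]:
        name, sep, value = line.partition("=")
        if sep and name.strip():
            meta[name.strip()] = value.strip()
    return meta


def parse_settings(text):
    """Turn 'D:vertical A:end' into {'D': 'vertical', 'A': 'end'}."""
    return dict(item.split(":", 1) for item in text.split() if ":" in item)


def resolve_cues(blocks):
    """Return cues with their effective settings, carrying defaults forward."""
    defaults = {}
    cues = []
    for block in blocks:
        lines = [line.strip() for line in block.strip().splitlines()]
        if not lines:
            continue
        if lines[0].startswith(_DEFAULTS):
            # Assumption: a new DEFAULTS block replaces the old one.
            defaults = parse_settings(lines[0][len(_DEFAULTS):])
            continue
        if "-->" not in lines[0] and len(lines) > 1:
            lines = lines[1:]  # skip the optional cue identifier
        match = _TIMING.match(lines[0])
        if match is None:
            continue  # STYLE, COMMENT, or anything else unrecognised
        settings = {**defaults, **parse_settings(match.group(3))}
        cues.append({"start": match.group(1), "end": match.group(2),
                     "settings": settings, "text": "\n".join(lines[1:])})
    return cues
```

Note that a block starting with “COMMENT -->” falls through the timing check and is dropped, which matches the intent of commenting out a complete cue.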
    

Finally, I believe we still need to add the following features:

  • Language tags:
    I’d like to add a language tag that allows a subpart of cue text to be marked up as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). More importantly, we will need this feature for cues that contain text descriptions rather than captions, so that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a <lang jp>xxx</lang> type of markup.
  • Roll-up captions:
    When we use timestamp objects, the future text is hidden and is then un-hidden upon reaching its time. We should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which live TV captions have been displayed, so users are acquainted with this display style.
  • Inline navigation:
    For chapter tracks, the primary use of cues is navigation. In other formats (in particular in DAISY books for blind users) there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a <nav> tag that can group several timestamp objects for navigation.
  • Default caption width:
    At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overridden with the “S” cue setting. I think the default should rather be the width (height) of the bounding box around all the text inside the cue.
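
The roll-up idea above can be sketched in a few lines of Python. This is only an illustration of the behaviour I have in mind, not a proposed algorithm: `revealed_text` and `roll_up` are hypothetical names, the timestamp objects are assumed to appear in increasing order, and `max_rows` (assumed to be a positive number) stands in for whatever would limit the number of visible caption lines.

```python
import re

# Inline timestamp object such as <00:02.000> or <01:00:02.000>.
_STAMP = re.compile(r"<(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})>")


def _seconds(match):
    """Convert a matched timestamp object to seconds."""
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def revealed_text(cue_text, now):
    """Return the cue text up to the first timestamp not yet reached."""
    parts = []
    position = 0
    for match in _STAMP.finditer(cue_text):
        parts.append(cue_text[position:match.start()])
        if _seconds(match) > now:
            return "".join(parts)  # everything after this stays hidden
        position = match.end()
    parts.append(cue_text[position:])
    return "".join(parts)


def roll_up(cue_text, now, max_rows=2):
    """Return the visible lines, keeping only the newest max_rows lines."""
    lines = [line for line in revealed_text(cue_text, now).split("\n") if line.strip()]
    return lines[-max_rows:]
```

With a three-line cue whose second and third lines are preceded by timestamp objects, the first line scrolls out of view once the third line is revealed and `max_rows` is 2.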

Aside from these changes to WebVTT, there are also some things that can be improved in the <track> element. I personally support the introduction of the source element underneath the track element, because it allows us to provide different caption files for different devices through the @media media queries attribute, and it allows support for more than just one default captioning format. This change needs to be made soon so that we don’t run into trouble with the currently empty track element.

I also think an oncuelistchange event would be nice for cases where the number of tracks is somehow changed, in particular when the tracks come from within a media file.

Other than this, I’m really very happy with the state that we have achieved so far.