Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post, which is more efficient than answering tweets, emails, and Skype and other phone conference requests one by one. It’s about time I did a proper post about it.


I’m starting with the last area, because it is the simplest to answer.

No, no browser has yet shipped support for the <track> element, and therefore there is no support for WebVTT in browsers yet. However, implementations are in progress. For example, WebKit has recently received its first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but the element’s actual functionality is still being worked on.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator – a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub – a polyfill for track and SRT parsing
  • jscaptions – a polyfill for track and SRT parsing
  • LeanBack player – a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr – a video player that includes track and WebVTT parsing
  • MediaElementJS – a video player that includes track and SRT parsing
  • Kaltura’s video player – a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss their meaning with Ian, me, and the WHATWG. His review notes, with visual descriptions of how settings are to be interpreted, and his demo will be most useful to authors and other developers.


Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months – and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the <track> element – the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer – I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic:
    Previously, the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span:
    The <b> element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors:
    The spec already allowed using the names of tags, the classes of <c> tags, and the voice annotations of <v> tags as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used.
  • text-decoration support:
    The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as underlined and blinking text.
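
As a rough illustration of the file magic change, a tool that reads both old and new files might check the first line along these lines. This is only a minimal Python sketch: the function name `has_webvtt_magic`, the `accept_legacy` flag, and the choice to tolerate a byte order mark and trailing whitespace are assumptions made for illustration, not something taken from the spec.

```python
def has_webvtt_magic(text: str, accept_legacy: bool = True) -> bool:
    """Return True if the first line of `text` is the WebVTT file magic."""
    # Assumption: drop an optional byte order mark before the first line.
    if text.startswith("\ufeff"):
        text = text[1:]
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line == "WEBVTT":
        return True
    # Assumption: files written with the older "WEBVTT FILE" magic are
    # still tolerated unless the caller asks for strict checking.
    return accept_legacy and first_line == "WEBVTT FILE"
```

Passing `accept_legacy=False` gives the stricter behaviour for tools that only want files using the new identifier.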

Further to this, the email identifies the ways in which WebVTT is extensible:

  • Header area:
    The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and two empty lines as an end. It is possible to add file-wide header information into this area.
  • Cues:
    Cues are defined to start with an optional identifier, and then a start/end time specification with a “-->” separator. They end with two empty lines. Cues that contain a “-->” separator but don’t parse as a valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues:
    Finally, within cues, everything that is within a “tag”, i.e. between “<” and “>”, and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so we can introduce new markup tags in this way.
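
The two cue-level extension points above can be sketched in a few lines of Python. This is a hedged illustration rather than the spec's parsing algorithm: the names `classify_block` and `split_cue_text` are hypothetical, blocks are assumed to have been split apart already, and the set of “defined” tags is limited to the ones mentioned in this post.

```python
import re
from typing import List, Tuple

# Timestamps as they appear in the examples in this post: [hh:]mm:ss.ttt
TIMESTAMP = re.compile(r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}")
# Assumption: only the tags mentioned in this post count as "defined".
KNOWN_TAGS = {"b", "c", "i", "v"}
# Group 1 is an optional slash, group 2 the tag name up to a space or dot.
TAG = re.compile(r"<(/?)([^\s>.]*)[^>]*>")


def classify_block(block: str) -> str:
    """Label one block of lines as 'cue', 'skipped', or 'other'."""
    for line in block.splitlines():
        if "-->" not in line:
            continue  # e.g. a cue identifier or a header line
        start, _, rest = line.partition("-->")
        tokens = rest.split()
        end = tokens[0] if tokens else ""
        if TIMESTAMP.fullmatch(start.strip()) and TIMESTAMP.fullmatch(end):
            return "cue"
        # Has the separator but no valid timing: a parser skips it, which
        # leaves room for command blocks such as DEFAULTS or STYLE.
        return "skipped"
    return "other"


def split_cue_text(cue_text: str) -> Tuple[str, List[str]]:
    """Return the visible text and the list of tags that were ignored."""
    ignored: List[str] = []

    def drop(match: re.Match) -> str:
        inner = match.group(0)[1:-1]
        is_timestamp = TIMESTAMP.fullmatch(inner) is not None
        if match.group(2) not in KNOWN_TAGS and not is_timestamp:
            ignored.append(match.group(0))
        return ""  # tags never show up in the visible text themselves

    return TAG.sub(drop, cue_text), ignored
```

For example, `classify_block("DEFAULTS --> D:vertical A:end")` is reported as skipped, while text wrapped in an unknown `<x>…</x>` pair stays visible and only the two tags end up in the ignored list.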

Given this background, the following V2 extensions have been discussed:

  • Metadata:
    Enter name-value pairs of metadata into the header area, e.g.

    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings:
    Default cue settings can come in a “cue” of their own, e.g.

    DEFAULTS --> D:vertical A:end
    00:00.000 --> 00:02.000
    This is vertical and end-aligned.
    00:02.500 --> 00:05.000
    As is this.
    DEFAULTS --> A:start
    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
  • Inline CSS:
    Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

      STYLE -->
      ::cue(v[voice=Bob]) { color: green; }
      ::cue(c.narration) { font-style: italic; }
      ::cue(c.narration i) { font-style: normal; }
      00:00.000 --> 00:02.000
      <v Bob>Welcome.
      00:02.500 --> 00:05.000
      <c .narration>To <i>WebVTT</i>.
  • Comments:
    Both comments within cues and completely commented-out cues are possible, e.g.

     COMMENT -->
     00:02.000 --> 00:03.000
     two; this is entirely
     commented out
     00:06.000 --> 00:07.000
     this part of the cue is visible
     <! this part isn't >
     <and neither is this>
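
To make the proposed default cue settings a little more concrete, here is a minimal Python sketch of how a reader could fold DEFAULTS blocks into the cues that follow them. None of this is in the spec: the function names `parse_settings` and `apply_defaults` are hypothetical, the input is processed line by line in the layout used in the example above, cue identifiers and the other proposed blocks (STYLE, COMMENT) are not handled, and the rule that a new DEFAULTS block replaces the previous one entirely is inferred from the example, where “A:start” also returns the text to horizontal.

```python
import re
from typing import Dict, List, Tuple

TIMESTAMP = re.compile(r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}")
Cue = Tuple[str, Dict[str, str], List[str]]  # (timing, settings, text lines)


def parse_settings(tokens: List[str]) -> Dict[str, str]:
    """Turn tokens such as "A:end" into {"A": "end"}."""
    return dict(token.split(":", 1) for token in tokens if ":" in token)


def apply_defaults(lines: List[str]) -> List[Cue]:
    """Collect cues, merging the current DEFAULTS into each cue's settings."""
    defaults: Dict[str, str] = {}
    cues: List[Cue] = []
    in_cue = False
    for raw in lines:
        line = raw.strip()
        if "-->" in line:
            left, _, right = line.partition("-->")
            left, tokens = left.strip(), right.split()
            in_cue = False
            if left == "DEFAULTS":
                # Assumption: a new DEFAULTS block replaces the old one.
                defaults = parse_settings(tokens)
            elif (TIMESTAMP.fullmatch(left) and tokens
                    and TIMESTAMP.fullmatch(tokens[0])):
                # Settings on the cue itself win over the defaults.
                settings = {**defaults, **parse_settings(tokens[1:])}
                cues.append((f"{left} --> {tokens[0]}", settings, []))
                in_cue = True
        elif line and in_cue:
            cues[-1][2].append(line)
    return cues
```

Run over the default cue settings example above, the first two cues end up with `{"D": "vertical", "A": "end"}` and the third with just `{"A": "start"}`.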

Finally, I believe we still need to add the following features:

  • Language tags:
    I’d like to add a language tag that allows marking up a subpart of cue text as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). But more importantly, we will need this feature for cues that contain text descriptions rather than captions, so that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a <lang jp>xxx</lang> type of markup.
  • Roll-up captions:
    When we use timestamp objects, the future text is hidden and then un-hidden upon reaching its time. We should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which live TV captions have been displayed, so users are familiar with this display style.
  • Inline navigation:
    For chapter tracks, the primary use of cues is for navigation. In other formats – in particular in DAISY-books for blind users – there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a <nav> tag that can group several timestamp objects for navigation.
  • Default caption width:
    At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overridden with the “S” cue setting. I think the default should instead be the width (height) of the bounding box around all the text inside the cue.

Aside from these changes to WebVTT, there are also some things that can be improved on the <track> element. I personally support the introduction of the source element underneath the track element, because that allows us to provide different caption files for different devices through the @media media queries attribute and it allows support for more than just one default captioning format. This change needs to be made soon so we don’t run into trouble with the currently empty track element.

I further think a oncuelistchange event would be nice as well in cases where the number of tracks is somehow changed – in particular when coming from within a media file.

Other than this, I’m really very happy with the state that we have achieved thus far.

30 thoughts on “Recent developments around WebVTT”

  1. Excellent roundup and good to hear that everything is moving forward with this. The sooner browsers start adopting it, the better.

    That said, there’s no reason for people not to start using it, given the number of JS libraries available. LeanbackPlayer is looking particularly good.

  2. Again we see the perils of letting foreigners without lived experience write captioning specifications, especially when they don’t have access to the full standards documents.

    708 manifestly does not use Unicode.

    Toggles for colour, underlining, blink, and italics insert a space because they take up one. Certain preset combinations of toggles are available so that multiple spaces will not show up.

    You haven’t addressed transparent-space positioning.

    And, most galling of all, you perpetuate the phenomenally awful and user-hostile choice of fonts that made the 708 spec a laughingstock.

  3. @Joe WebVTT is a file format for the Web, while 608 and 708 are formats for displaying captions character by character on a TV. They will never be fully compatible and that is also not the idea. However, meaningful transcoding between them is still possible. (Note that I have not touched 708 yet, but only 608 and it’s a draft document).

    I’d be more than happy if you helped with figuring out the details about transcoding to/from 608/708. Given your experience, we will likely be able to fix the holes that remain. The document in the wiki is but a first draft.

  4. Actually, people can’t use WebVTT if it keeps changing.

    As far as I know, none of the JS libraries support the new functionality–do they? Hopefully people can use WebVTT as it was previously defined (WEBVTT FILE and all) and get some functionality.

    1. @Shelley It will continue changing just like HTML is continuing to change: features will be added. That’s just the way of the world. However, what I am saying is that we have now decided on the fundamental makeup of WebVTT and the ways in which it can change, so that any further additions will now be backwards compatible. That’s why I am saying that we’re ready for implementations.

  5. PS The work is good, and you’ve done a great job. But I just finished covering WebVTT for a book. Sigh.

    Oh well, time for a quick edit and insert 😉

  6. Well, here is how things typically work in the world. A branch of the specification is created. In that branch there will eventually be a feature freeze and the focus shifts to fixing bugs in the already existing features. This also makes it less of a moving target for people who are going to implement the specification and tech writers, like Shelley, who write about it. Eventually a stable specification is released and focus shifts to the trunk again (the main branch, this could have continued parallel with the branch with a feature freeze). Then the cycle repeats itself again.

  7. I did want to say that I think the inline CSS is not a good idea. People should be able to use class names within the WebVTT, which they can do now. Then they should be able to apply a stylesheet to their web pages, and expect to see the changes applied.

    Adding a unique way to add CSS to WebVTT is like adding a unique way to add CSS to a JavaScript library. Of course it doesn’t make sense to do so. If developers of JS want a look and feel for any UI, they use CSS. The same with people creating unique WebVTT files (though this is a little strange to even contemplate). They can use markup, or they can apply class names.

    Now, I can see perhaps adding more pseudo classes to CSS, to handle the new capabilities of WebVTT, but the more complicated you make WebVTT the more inconsistent the resulting playback to the people who need subtitles and captions, and the more people will either muck up WebVTT, or look fondly at using SRT, instead.

  8. A second thing — do we need the metadata? The information in the metadata is included in the track element.

    It’s not as if WebVTT is going to exist independent of HTML5 track use.

    I do like the defaults.

  9. I must admit I agree with Shelley here. Also making it over complicated to use is going to encourage people to shy away from using it and we’ll be back to square one.

  10. @Shelley both of these features – metadata and inline CSS – are necessary if you start using WebVTT outside Web browsers. Think about loading a WebVTT and the related video file into the Quicktime player or into VLC. Neither of these will want to implement full CSS, but they may need the CSS features for styling. Thus, the limited amount of CSS that’s applicable to WebVTT should be allowed to be included in-band.

    Similarly, the metadata is important to maintain as part of the actual content, since otherwise playback without the accompanying Web page will result in a different (and probably wrong) presentation.

    Also note that these V2 extensions have not been included in the specification yet, so I would refrain from mentioning them in a book or.

  11. @Ian These are optional features. None of them are required. In the simplest case, people will just write SRT files with dots instead of commas as millisec separator and with an additional WEBVTT file magic. All of WebVTT’s features can be learnt and included incrementally from there as needed.

  12. My concerns about including the metadata and CSS within the file is we’re getting beyond the scope of WebVTT.

    Right now, we have a subcaption/caption format that should hopefully work with HTML5 video track elements. If we expand the scope, we’re adding complexity to the format that will eventually undermine the original scope for the format: for the caption files for use in HTML5 video.

    You mentioned about applications wanting to support WebVTT styles but not fully support CSS. Well, then we have to add constraints into the style section so that authors and developers know what subset of CSS is, or is not, going to be supported. We’ll then be adding yet more complexity, which is, again, just going to make SRT look even more attractive.

    It may sound trite, but doesn’t make it any less true: sometimes less is more.

    I think it is more important to solidify a good, solid, version of WebVTT and get the browser companies to agree to support it (and/or build up a good set of tools that can be used in the meantime). Then, later, if there’s a real world demand, add more features.

  13. @Shelley These are fair enough concerns. That’s why none of this is in the spec yet, but only being discussed. Note that I said “the following V2 extensions have been discussed”. None of those are in the spec yet.

  14. That’s cool, Silvia.

    I’m looking forward to your new W3C group on WebVTT. And that the W3C moves on this relatively quickly and that you’re made editor.

  15. I think the key for getting traction with WebVTT is mobile browser support (Android, iOS). If these implement track and WebVTT support, it will be the only way to display closed captions and subtitles on mobile devices. You should be all over these guys to make sure they get it done ;).

    On the desktop, it is always possible to do a polyfill using whatever file format. Formats like SRT and DFXP are embedded in workflows, tools and existing libraries, which makes the case for moving to WebVTT really difficult.

  16. @JeroenW I think you’re right – I hadn’t really thought about it in this way. IIUC Webkit is the basis for the default browser on Android and also for Safari on iOS, so with the Webkit implementation we are on the best track. Should talk to the other mobile browser developers though…

  17. Since people are talking about WebVTT v2, does it mean that v1 is already finished ?

    Looking at JavaScript libraries out there, i’d like to find one (under a free licence) that would display WebVTT captions in a browser that already supports the video element, but would remain inactive in a browser that supports both the video and track elements.
    The reason behind my quest is that Planets often cut off all the JavaScript stuff, therefore i would like to be sure that captions would be displayed first using native browser technology if available, and the library would start only as a fallback if the browser supports only the video element and not the track element.


  18. @antistree: there are further webkit bugs still about the track element – in particular there isn’t yet a native display. So, the implementation isn’t quite complete yet.

    Also, most polyfill libraries are implemented in such a way that they only kick in when the browser doesn’t support the tag natively. I suppose they will all start making sure that they work with the new webkit support. If that’s not the case, register bugs on them.

    As for using the polyfill in planets, you will have to make sure that the JS library is linked in every blog post, because planets strip off the normal page layout and only take the content of posts during aggregation. That’s not a problem of planets or browsers, but just the way in which aggregation works.

  19. @silvia : thanks. i can’t wait to try a browser implementation of the track element 🙂

    (concerning Planets, i guess it depends which Planet. The one which repeats my articles displays an alternative message that says that you have to read the article on the original blog to see the JavaScript element that has been stripped off. The Planet team confirmed to me that it’s done for security reasons.)

  20. @silvia : Captionator.js seems to be the thing i need
    It’s free software and Christopher Giffard confirmed to me that it tests for native support when loading – if support is already present in the browser, it steps down and lets the browser do its thing.

  21. Hi,
    I have been working on creating a WebVTT encoder, but intended solely for iPads, not websites. I would like to know if anyone has been able to get inline CSS to work, and thus have colors. I used the STYLE --> keyword as mentioned above at the beginning of a WebVTT file, but that does not seem to make a difference. Here is a sample of my WebVTT file

    Language = ENG
    Kind = Caption
    STYLE -->
    ::cue (c.rouge) {color:red;}

    00:00:23.923 --> 00:00:25.791 align:start
    ♪ Where it’ll land… ♪

    00:00:25.892 --> 00:00:27.393 align:end
    ♪ well, I know. ♪

    00:00:27.493 --> 00:00:28.728 align:start
    ♪ Flick of a wrist; ♪

    00:00:28.828 --> 00:00:30.496 align:end
    ♪ you know I can’t miss. ♪

    00:00:30.596 --> 00:00:33.466 align:start
    ♪ The bottle’s gonna stop
    on the one ♪

    Any help would be appreciated. Thanks in advance

  22. This is better
    00:00:23.923 --> 00:00:25.791 align:start
    ♪ Where it’ll land… ♪

    00:00:25.892 --> 00:00:27.393 align:end
    ♪ well, I know. ♪

    00:00:27.493 --> 00:00:28.728 align:start
    ♪ Flick of a wrist; ♪

Comments are closed.