WebVTT explained

On Wednesday, I gave a talk at Google about WebVTT, the Web Video Text Track file format that is under development at the WHATWG for solving time-aligned text challenges for video.

I started by explaining all the features that WebVTT supports for captions and subtitles, mentioned how WebVTT would be used for text audio descriptions and navigation/chapters, and explained how it is included in HTML5 markup so that the browser provides some default rendering for these purposes. I also mentioned the metadata approach that allows any timed content to be included in cues.
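To make the cue structure a little more concrete, here is a minimal sketch in Python of pulling cues out of WebVTT text. It assumes timing lines of the form `HH:MM:SS.mmm --> HH:MM:SS.mmm` (hours optional) and ignores cue settings, cue identifiers, and everything else a conforming parser has to handle. The names `parse_cues` and `SAMPLE` are made up for this illustration and are not part of any specification or library.

```python
import re

# Timing line: optional hours, then MM:SS.mmm on each side of "-->".
TIMING = re.compile(
    r"^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts to integer milliseconds."""
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_cues(text):
    """Return a list of (start_ms, end_ms, payload) tuples (hypothetical helper)."""
    cues = []
    # Cues are separated by blank lines; normalise line endings first.
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                parts = match.groups()
                payload = "\n".join(lines[index + 1:])
                cues.append((_to_ms(*parts[:4]), _to_ms(*parts[4:]), payload))
                break  # blocks without a timing line (e.g. the header) are skipped
    return cues


SAMPLE = """WEBVTT

00:00:17.556 --> 00:00:20.631
Can you hear it?
The noise, the drumbeat?
"""

cues = parse_cues(SAMPLE)
```

Each tuple holds the start and end time in milliseconds plus the raw cue text, which is enough to drive a simple caption display or to build a transcript.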

The talk slides include a demo of how the <track> element works in the browser. I built this demo with the Captionator polyfill for HTML5, which was developed by Chris Giffard and is available as open source on GitHub.

The talk was recorded and has been made available as a Google Tech talk with captions and also a separate version with extended audio descriptions.

The slides of the talk are also available (best to choose the black theme).

I’ve also created a full transcript of the described video.

Get the WebVTT specification from the WHATWG Website.

30 thoughts on “WebVTT explained”

    1. @Philip The audio descriptions are part of the file. YouTube doesn’t yet have a means to provide text descriptions or extended text descriptions as would be necessary in this example.

      @Laura Let me see if I can get a transcript together. I’ll add a link to the post when done.

      @John Great to see more polyfills in this space. I can see from your code that you support SRT. But do you also parse WebVTT properly?

  1. @Laura – there is now a transcript available in the form of a WebVTT file. The timing markers in the WebVTT file follow a change that Google proposed, so you will notice that they differ from the current WebVTT spec.

  2. Thank you very much Silvia for posting the transcript.

    Is there any way that transcripts could be extracted and offered to users automatically as a link in a video’s chrome? Authors wouldn’t need to remember to post separate transcript links if this were possible.

    Dialup is often my only choice. So a video is useless without a transcript.

    Thanks, again. Your hard work is much appreciated.

  3. @Laura On YouTube, the transcript is available through the transcript button which sits underneath the video. This is where I got the transcript from. I think you should be able to get to it even if you are on dial-up.

    If you are asking about what we should do in the HTML5 spec, then indeed, I am waiting to see what happens to the @longdesc attribute for images, which I think could fulfill your need. Other than that, Web page authors could be encouraged to produce timed transcripts from the captions and text descriptions, as the YouTube pages do. They are actually useful for all viewers.

  4. Thank you Silvia for this interesting post.

    As every .vtt file describes one specific video stream, would it be better to add a way to specify the canonical URL of this video in the header of the VTT file?
    With this URL specified, search engines could index just the VTT files without downloading the video file itself.

    The second question: is there a way to specify a link into a video stream so that it starts playing from a specified time mark?
    For example, I want to post a link to a fragment of someone’s lecture or long interview (as a citation) and don’t want users to seek to this position manually. Can I specify a URL like this:
    http://somesite.com/video/tubepage.html#object_id
    meaning that only the fragment from 61.5 sec to 68.033 sec should be played?

    With the canonical URL of the video stream specified in the header of the .vtt file, we could have links like this:
    http://somesite.com/videovtt/tubepage.vtt#
    The browser then knows where to get the video file and which fragment of it to play. If the video file becomes unavailable, the browser can show only the cue content for the specified fragment of the tube.

  5. @Maxime

    Q1: I’m trying to get a section into WebVTT where metadata can be stored. Your proposal of putting a link to the video into the header could then be added to this section. I already want @kind, @language, and @label to be specifiable there, too, along with other metadata that people may find useful.

    Q2: This is not WebVTT related, but more generally a media fragment URI question. Yes, there is now a specification, and we are trying to get it added to the HTML standard and to browsers, too. See http://www.w3.org/TR/media-frags/ . Whether Web sites accept such a fragment on the Web page URL and pass it on to the video element is a different question. Several sites are doing that already, but you have to implement that for your own Web page URLs yourself.

    Q3: Not sure that’s such a useful approach compared to reading the full WebVTT file.
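As a rough illustration of the temporal media fragment syntax from the specification linked in Q2, the following Python sketch builds and reads a `#t=start,end` fragment. It only handles plain seconds (the specification also allows other time formats), and the URL `http://example.com/video.webm` and both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_time_fragment(url, start, end=None):
    """Return url with a temporal fragment such as #t=61.5,68.033 (hypothetical helper)."""
    fragment = f"t={start}" if end is None else f"t={start},{end}"
    return urlunsplit(urlsplit(url)._replace(fragment=fragment))


def parse_time_fragment(url):
    """Return (start, end) in seconds from a #t= fragment, or None if there is none.

    Only plain seconds are understood here; end is None when it is omitted.
    """
    for pair in urlsplit(url).fragment.split("&"):
        name, _, value = pair.partition("=")
        if name != "t":
            continue
        if value.startswith("npt:"):
            value = value[len("npt:"):]  # optional "normal play time" prefix
        start_text, _, end_text = value.partition(",")
        start = float(start_text) if start_text else 0.0
        end = float(end_text) if end_text else None
        return (start, end)
    return None


# Hypothetical media URL, using the time range from the question above.
link = with_time_fragment("http://example.com/video.webm", 61.5, 68.033)
```

A page that receives such a link on its own URL still has to hand the time range on to the video element itself, which is the part that sites currently implement on their own.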

  6. Great presentation. I have a question about serialization and combination of different kinds of subtitles.
    I created a script that analyses WebVTT files (embedded with the <track> tag) and adds the cue texts as paragraphs to the document depending on the current time of the video. With ARIA, a screen reader can be informed about newly added cue texts. The user may use either audio descriptions or a braille device.
    If the user wants information from more than one kind of subtitles (e.g. captions and descriptions, or subtitles and chapters), the data shouldn’t (and can’t) be displayed in parallel but needs to be presented one after the other. Are there any thoughts about serializing subtitles? In the video of your presentation, the video stops and a speaker describes the current action. Is this the preferred way? Even so, how long should the video be interrupted? Usually you don’t get feedback from the screen reader, and the speed of screen readers can differ.

    1. @Dirk we certainly haven’t talked about how to serialize captions and descriptions other than somebody doing it in JavaScript. For audio descriptions you will have material that either has sufficient gaps or doesn’t. In the case of presentations, most of the time you will not have enough space, so you will need to extend the timeline and create extended audio descriptions. Right now we haven’t got an automatic solution for creating them and the best way is indeed to create a separate video. For an automated solution where breaks are created in the original video, we don’t really know yet how to go about it.

      As for aria-live being used for reading out cue text, I’ve got a demo at http://www.annodex.net/~silvia/itext/elephant_no_skin.html and a recording of the effect using NVDA in Firefox at: http://www.youtube.com/watch?v=MYlunLChzqw . You do need to work with an average reading rate for such text descriptions.
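To illustrate what working with an average reading rate could look like, here is a small Python sketch that estimates how long a text description takes to speak and how much extra pause the video would need if the cue is too short. The default of 180 words per minute is an assumed figure chosen only for the example, and both function names are hypothetical.

```python
def speaking_time(text, words_per_minute=180.0):
    """Estimate the seconds needed to speak text at an assumed average rate."""
    word_count = len(text.split())
    return word_count * 60.0 / words_per_minute


def extra_pause(text, cue_start, cue_end, words_per_minute=180.0):
    """Seconds the video would have to pause so the description can finish.

    Returns 0.0 when the description fits into the cue's own time range.
    """
    needed = speaking_time(text, words_per_minute)
    available = cue_end - cue_start
    return max(0.0, needed - available)


# Six words at 180 words per minute need 2 seconds; the cue offers only 1.
pause = extra_pause("one two three four five six", cue_start=0.0, cue_end=1.0)
```

Since screen reader speeds differ from user to user, the rate would realistically have to be a user setting rather than a fixed constant.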

  7. Hi Silvia,
    Very interesting article and talk.

    I’m the developer of Playr (http://www.delphiki.com/html5/playr/) and as I just made a first implementation of cue timestamps, one question popped into my mind.

    It’s about the :past pseudo-class: does it apply to the text displayed at the start of the cue time range?

    For example:
    00:00:17,556 --> 00:00:20,631
    Can you hear it?
    The noise, the drumbeat?

    Should :past be applied to “Can you hear it?” at the start?

    Thanks

  8. The tags were stripped; here is the cue with < and > replaced with [ and ]:
    00:00:17,556 --> 00:00:20,631
    Can you hear it?
    [00:00:18,556]The noise, [00:00:19,600]the drumbeat?

    1. @Julien you can also escape “<” with &lt; and the end tag is then automatically escaped.

      As for your question about :past – yes, I think it would also be applied to the first one, which would implicitly start at the cue's start time. Does that create any problems?
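The reading given in this reply can be sketched in Python as follows: the cue text is split at its inline timestamps, the text before the first timestamp implicitly starts at the cue's start time, and a segment counts as past once its start time has been reached. This follows the answer above rather than the exact wording of the specification, it assumes WebVTT-style `<HH:MM:SS.mmm>` timestamps with a dot (not the comma used in the example cue), and the function names are hypothetical.

```python
import re

# Inline cue timestamp such as <00:00:18.556>.
STAMP = re.compile(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>")


def to_ms(hours, minutes, seconds, millis):
    """Convert timestamp parts to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def split_timed_text(payload, cue_start_ms):
    """Split cue text into (start_ms, text) segments at inline timestamps."""
    segments = []
    start = cue_start_ms  # text before the first timestamp starts with the cue
    position = 0
    for match in STAMP.finditer(payload):
        text = payload[position:match.start()]
        if text:
            segments.append((start, text))
        start = to_ms(*match.groups())
        position = match.end()
    tail = payload[position:]
    if tail:
        segments.append((start, tail))
    return segments


def classify(segments, now_ms):
    """Label each segment "past" once its start time is reached, else "future"."""
    return [("past" if start <= now_ms else "future", text) for start, text in segments]


PAYLOAD = "Can you hear it?\n<00:00:18.556>The noise, <00:00:19.600>the drumbeat?"
segments = split_timed_text(PAYLOAD, cue_start_ms=17556)
states_at_cue_start = classify(segments, now_ms=17556)
```

Under this reading, "Can you hear it?" is already labelled past at the moment the cue starts, while the two timestamped segments are still in the future.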

  9. Thanks for the great presentation Silvia.

    How did you add line breaks in the captions though? I checked ‘a full transcript of the described video’ and it does not have LF where they show on the screen.

    Also, it looks like you can move the captions on screen with the mouse. How do you do that?

    Thanks again
    -Russell

    1. The full transcript was made by concatenating the caption text, so may not be representative. Also, YouTube does line breaks automatically when the lines get too long.

      In short: for the presentations, the line breaks were done manually with LF.
