WebVTT at W3C

Today we started a community group (CG) at the W3C for “Web Media Text Tracks”: http://www.w3.org/community/texttracks/.

The group has been created to work on many aspects of video text tracks, of which captioning and the WebVTT format are key parts.

The main reason for creating this group is to provide a forum at the W3C for working on WebVTT, so that all browsers can support the format and take part in its development.

We have not gone all the way to creating a Working Group (WG), although that was the initial intention. W3C members objected to going down that path, so we are using the CG path for now.

This is actually a good thing because CGs are open for anyone to join, while WGs are only open to W3C members. The key difference is that specs coming out of WGs can become RECs (“standards”), while CG specs cannot.

If we eventually see a need to move WebVTT to a REC, that move will be straightforward, since there is a clear path for work to transition from a CG to a WG.

9 thoughts on “WebVTT at W3C”

  1. There was strong support expressed at the Web and TV interest group meeting for harmonizing the WebVTT format with W3C TTML (and the SMPTE TT implementation of that for subtitles and captions).

    Is harmonization with the W3C Recommendation and SMPTE standard a goal of this Community Group?

    That discussion noted that television over Internet is one delivery path in a larger content ecosystem that is obligated to deliver specific functionality from content creator to consumer, similar to CEA-708, DVB Subtitles, Teletext, ARIB, etc. SMPTE TT carries the presentation information in the TTML and supports embedding in video streams, supports delivery and presentation of image subtitles used for the majority of world languages, signing, etc. WebVTT is optimized for out of band delivery, simple text capture, and relies on a suitably authored Web page to provide presentation style and features.

    Both approaches have advantages, but having two conflicting standards does not. It would be great if WebVTT could be specified as a profile or automatic derivation of SMPTE TT so that television content could deliver full render intent from content creator to consumer over broadcast, download, Internet streaming, etc. without the “author many, test many” problem created by unrelated subtitle formats.

  2. @Kilroy

    If you mean by “harmonization” the possibility to transcode losslessly between WebVTT and TTML (and other formats), then: yes, that is one of the goals of the Community Group.

    Note that WebVTT has actually been designed to be delivered both out-of-band and in-band. It is also much easier to deliver in-band than TTML, because it’s a line-based format. The XML-based TTML format, in contrast, has to be flattened before it can be interleaved with video files.

    As for image subtitles: at the workshop we discussed that you can include image subtitles by putting data-urls or base64-encoded images inside the WebVTT cues. A better solution would, however, be to convert them to text and thus enable Web features such as automatic translation or Web search on them. Text is also better for the accessibility community, since it is possible to change fonts, font size, font color, and background for text, but not for image subtitles.

    Part of the work of the Community Group is also to extend the current WebVTT specification to include more features such as styling directly in the WebVTT file and other features that offline caption and subtitling applications require.

    1. If we wanted an XML-based format, we could use TTML. However, there were several objections to TTML, including the use of XSL-FO and problems in linearising it for encapsulation in media. A line-based format like WebVTT is easier to author and encapsulate. You can always use XML or JSON inside the cues if you need it, which is, in fact, much more useful.

  3. This kind of argument makes me wonder… How does WebVTT make it “easier to author and encapsulate”? Are you seriously considering writing WebVTT by hand? Everything is generated through converters and editors, so how is authoring made easier by a fragile line-based format? The encapsulation argument is weak too: encapsulating any data in a stream implies wrapping it in some container, which surely has nothing to do with WebVTT. To take an analogy, WebVTT is itself an envelope wrapping some content. If you want to convert data to another format (another envelope), you do not put the letter plus envelope in the new one; you extract the content from the envelope and wrap it in the new one.

    1. @Olivier: yes, we also still consider hand authoring. But in addition, any file that can be hand authored easily is also more easily supported by automated tools. This applies in particular to the richer features: it is why you barely ever find a TTML file that does more than a plain old SRT file; the richer features are just too complicated to support. Finally, the question of wrapping: any XML-based format is by nature hierarchical. Flattening a hierarchical format into something that can be interleaved into a time-based container is not simple. In contrast, WebVTT is by nature time-linear because the cues have defined start and end times and are required to be ordered by start time. This makes it trivial to interleave them with other time-based data such as audio and video. There is no complicated extraction and flattening necessary.
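The interleaving claim in the reply above can be illustrated with a small sketch. It is not taken from any muxer or from the WebVTT specification: the `Sample` type, the `interleave` helper, and the example data are hypothetical. The sketch only assumes that each track, including the cue track, is already sorted by start time, in which case a streaming merge is enough.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Sample(NamedTuple):
    """Hypothetical stand-in for a timed unit of media data."""
    start: float   # presentation time in seconds
    kind: str      # "video", "audio", or "text"
    payload: str


def interleave(*tracks: Iterable[Sample]) -> Iterator[Sample]:
    """Lazily merge tracks that are each already sorted by start time.

    No buffering of a whole track is needed: heapq.merge only looks at
    the next pending sample of every track.
    """
    return heapq.merge(*tracks, key=lambda s: s.start)


# Illustrative data only: three video samples and two text cues.
video = [
    Sample(0.0, "video", "frame 0"),
    Sample(2.0, "video", "frame 1"),
    Sample(4.0, "video", "frame 2"),
]
cues = [
    Sample(1.0, "text", "Hello"),
    Sample(3.5, "text", "World"),
]

muxed = list(interleave(video, cues))
```

The merge relies entirely on the ordering constraint; a cue track that is not sorted by start time would first have to be sorted or buffered.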

  4. I do not want to feed the troll too much, but there are at least 2 points that seem questionable to me:

    * “any file that can be hand authored easily is also more easily supported by automated tools”: not necessarily, since hand-authored often also means rather lax in respecting the standards. Look at the mess that HTML has become, and the impossibility of returning to a more regular XHTML syntax, in part because hand-authored HTML introduced some de facto “standards” that we could not get away from. And writing a parser for a dedicated syntax like WebVTT is clearly not easier than using a standard parser for JSON or XML.

    * “any XML based format is by nature hierarchical”: just because it can be does not mean that it must be. You can perfectly well specify linear schemas. You can add a constraint to the spec that items have to be ordered by start time. It has nothing to do with the serialisation format.

    Do not get me wrong: I recognize the value of domain-specific languages, and like to use them when appropriate. The issue I have is with the choice of SRT as inspiration (use of --> to separate timecodes; timecodes sometimes with commas, sometimes with dots as separators, although I know that WebVTT consistently uses dots) and the random selection of features (quasi-but-not-quite HTML for content markup, quasi-but-not-quite CSS for cue settings, support for some things like ruby annotations but no generic way of defining) that look to me like they will introduce more confusion than simplification. Only the future will tell.

    But to wrap up, I admit it is more a matter of taste, and will be happy to see any reasonable standard emerge.
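The comma-versus-dot difference mentioned in the comment above is easy to see in a small conversion sketch. This is a minimal illustration, not a complete converter: the function name `srt_to_vtt` and the sample text are hypothetical, and the sketch assumes well-formed SRT input with `HH:MM:SS,mmm` timestamps. Anything following the end timestamp on a timing line is dropped.

```python
import re

# Matches an SRT or WebVTT timing line such as
# "00:00:01,000 --> 00:00:04,000" (comma or dot before the milliseconds).
TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2})[,.](\d{3})\s+-->\s+(\d{2,}:\d{2}:\d{2})[,.](\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT text (sketch under the assumptions above)."""
    out = ["WEBVTT", ""]           # WebVTT files start with this signature line
    for line in srt_text.splitlines():
        match = TIMING.match(line.strip())
        if match:
            start, start_ms, end, end_ms = match.groups()
            # Only timing lines are rewritten; commas in cue text stay as-is.
            line = f"{start}.{start_ms} --> {end}.{end_ms}"
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative input only.
srt = (
    "1\n"
    "00:00:01,000 --> 00:00:04,000\n"
    "Hello, world\n"
    "\n"
    "2\n"
    "00:00:05,500 --> 00:00:07,250\n"
    "Second cue\n"
)
vtt = srt_to_vtt(srt)
```

The numeric SRT counters are left in place, since WebVTT allows an optional cue identifier line before each timing line.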

  5. @Olivier

    Your argument is upside down: you seem to assume that lax hand authoring causes a specification to become complex. That, however, is not the cause of complex specifications. The cause is lax parsers and implementations that tolerate faulty files. Faulty files can be created both by automated tools and by hand.

    My argument still stands: simple specifications are more easily supported both by hand and by automated tools. In addition, they are also simpler to adhere to and thus cause fewer faulty files, which stops complexity from creeping into the specification.

    As for XML: absolutely all XML files are hierarchical, that is a fact that you cannot twist. There is always a root element that contains all the other elements, even if those elements are completely linear. Thus, at minimum you have to deal with the serialisation of the root element.

    I do agree, however, that whether you like the WebVTT language or not is really a matter of taste.

  6. Hi,
    Can anybody share sample clips for WebVTT attributes and CSS attributes?

    For example: WebVTT supports cue components, and CSS attributes are supported through them. It would be helpful if anybody could share sample files and clips with CSS attributes.

