New proposal for captions and other timed text for HTML5

The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback – probably because there are several demos available.

The feedback has encouraged me to develop a new specification that addresses the concerns raised and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in files separate from the video/audio file). A simple example of the new specification using srt files is this:

<video src="video.ogv" controls>
   <itextlist category="CC">
     <itext src="" lang="en"/>
     <itext src="" lang="de"/>
     <itext src="" lang="fr"/>
     <itext src="" lang="ja"/>
   </itextlist>
</video>

By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally a MIME type that still needs to be registered). Also by default, the browser is expected to select for display the track that matches the browser's default language. This worked well in the previous experiments.
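The defaults described above can be sketched in a few lines of JavaScript. This is only an illustration, not part of the specification: the helper names `withDefaults` and `selectDefaultTrack`, the `type` and `charset` property names, and the fallback to the primary language subtag are all assumptions made for the example.

```javascript
// Hypothetical helper: fill in the defaults named in the text
// (format text/srt, charset UTF-8) unless the itext overrides them.
// The property names "type" and "charset" are assumptions.
function withDefaults(itext) {
  return { type: "text/srt", charset: "UTF-8", ...itext };
}

// Hypothetical helper: pick the track to display by default.
// `browserLang` stands in for the browser's default language,
// e.g. the value of navigator.language in a real page.
function selectDefaultTrack(tracks, browserLang) {
  const wanted = browserLang.toLowerCase();
  // 1. Exact match, e.g. browser "de" and track "de".
  const exact = tracks.find((t) => t.lang.toLowerCase() === wanted);
  if (exact) return exact;
  // 2. Primary-subtag match, e.g. browser "de-AT" and track "de".
  const primary = wanted.split("-")[0];
  const loose = tracks.find(
    (t) => t.lang.toLowerCase().split("-")[0] === primary
  );
  // 3. No match: return null, i.e. show no track (an assumption).
  return loose || null;
}

const itexts = [{ lang: "en" }, { lang: "de" }, { lang: "fr" }].map(withDefaults);
console.log(selectDefaultTrack(itexts, "de-AT")); // the "de" track, with defaults filled in
```

What a browser should do when no track matches is not stated here, so returning `null` is just one possible choice.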

Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!

The itextlist element
You will have noticed that in comparison to the previous specification, this specification contains a grouping element called “itextlist”. This is necessary because we have to distinguish between alternative time-aligned text tracks and ones that can be additional, i.e. displayed at the same time. In the first specification this was done by inspecting each itext element’s category and grouping them together, but that resulted in much repetition and unreadable specifications.

Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.

The final advantage is that association of callbacks for entering and leaving text segments as extracted from the itext elements can now be controlled from the itextlist element in a uniform manner.
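To make the idea of enter and leave callbacks concrete, here is a rough JavaScript sketch that emulates them from plain time updates. It is not the API of the specification: `createSegmentTracker`, the `{ start, end, text }` segment shape, and the rule that a segment is active while `start <= time < end` are assumptions for illustration.

```javascript
// Hypothetical sketch: fire onenter/onleave callbacks as the playback
// time moves into and out of timed text segments (times in seconds).
function createSegmentTracker(segments, { onenter, onleave } = {}) {
  const active = new Set(); // segments currently being displayed
  return function update(currentTime) {
    for (const seg of segments) {
      const inside = currentTime >= seg.start && currentTime < seg.end;
      if (inside && !active.has(seg)) {
        active.add(seg);
        if (onenter) onenter(seg);
      } else if (!inside && active.has(seg)) {
        active.delete(seg);
        if (onleave) onleave(seg);
      }
    }
  };
}

// Usage: the playback times are simulated here. In a page, update()
// could be called from a "timeupdate" listener on the video element.
const log = [];
const update = createSegmentTracker(
  [
    { start: 1, end: 3, text: "Hello" },
    { start: 3, end: 5, text: "World" },
  ],
  {
    onenter: (seg) => log.push("enter " + seg.text),
    onleave: (seg) => log.push("leave " + seg.text),
  }
);
[0, 1.5, 3.2, 6].forEach((t) => update(t));
console.log(log);
// [ 'enter Hello', 'leave Hello', 'enter World', 'leave World' ]
```

A tracker like this checks every segment on every update, which is one reason callbacks provided by the browser itself would be preferable to polling from script.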

This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.
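As a rough illustration of how directly the grouping maps onto a menu, the following JavaScript sketch turns a list of itextlist descriptions into a nested menu structure. The input shape, the `buildMenu` name, the file names, and the choice to label a group by its `name` and fall back to its `category` are assumptions for the example, not something the specification prescribes.

```javascript
// Hypothetical sketch: derive a two-level menu from itextlist groups.
// Each group becomes a submenu; each itext becomes one entry in it.
function buildMenu(itextlists) {
  return itextlists.map((list) => ({
    // Use the author-supplied name if present, else the category.
    label: list.name || list.category,
    items: list.itexts.map((itext) => ({
      label: itext.lang,
      src: itext.src,
    })),
  }));
}

// Usage with made-up file names.
const menu = buildMenu([
  {
    category: "CC",
    name: "Captions",
    itexts: [
      { src: "captions_en.srt", lang: "en" },
      { src: "captions_de.srt", lang: "de" },
    ],
  },
  {
    category: "LRC",
    itexts: [{ src: "lyrics_en.srt", lang: "en" }],
  },
]);
console.log(JSON.stringify(menu, null, 2));
```

Because each itextlist is already one group, no inspection or regrouping of individual itext elements is needed.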

Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corr

14 thoughts on “New proposal for captions and other timed text for HTML5”

  1. Silvia,
    I get the impression that all the functionality you need is already available in various specs, and could be re-used easily without inventing new syntax.

    The category and name functionality could be picked up from XHTML role, if I’m not mistaken. The itext/itextlist is SMIL par and text (or ref). Both of these specs are modularized, so you should be able to pick up just the pieces you need. In the namespaced XML world then you would be done, in the HTML world you would need a bit of extra work to import things into your spec.

  2. Pingback: Recopilaci
  3. Hello;

    A good proposal after all, but I don’t believe in additional data structures imposed through hypertext mark-up, which is already structured data in itself.

    For this type of business my recommendation is scripting an extensible subtitle language in XML, where we could set all the elements and attributes as they are needed:

    [group name="fall of the chopper"]

    [scene order="122" start="15:25" end="15:40" /]


    [sub type="ambiance" scene="122"]sounds of the chopper breaking-down[/sub]
    [sub scene="122" source="6" imperatives="true" index="1"]Hey Julian, WATCH OUT![/sub]
    [sub scene="122" source="0" index="-1"]Mom, are we gonna be all right?[/sub]
    [sub scene="123" source="6" index="1"]Are you OK?[/sub]

    In my book, this example is “marking up”, not the one in yours. And for this kind of mark-up, HTML is definitely not the most suitable environment; we need to have our own set of rules, elements and attributes.

    To support and validate the above example, an appropriate DTD could be defined accordingly.

    Yours, on the other hand, links to a different set of information, not information that the HTML document itself is supposed to represent. It is like a movie file that is linked from the HTML but not marked up or embedded as part of its native structure, which is what we tried, and failed at, with base64 images.

    In my opinion, this is not an HTML matter but necessarily an XML one; it could therefore be linked to an HTML document, like RDF, as metadata inside the HEAD.

    Thanks a lot for this well-written article so we could think about it and write something 🙂

    best regards

    p.s. MODERATOR: this is final – I promise 😀

  4. @Jack thanks for the comments – and you are right: there are plenty of existing syntax elements in other specifications that could be tweaked, adapted and possibly re-used. However, none of them really fit.

    “category” is very different to “role” – it is the category of time-aligned text we are talking about, and there is a limited list of categories that is part of the spec.

    “name” could be replaced by “title” or something else – I am not particularly fussed about this though I needed it as an attribute rather than as content model, which would have been more obvious.

    I am also consciously refraining from re-implementing SMIL. I do not want the full complexity of the “seq” and “par” elements. Also, the “text” or “ref” elements do not compare to “itext” which references a particular type of interactive text files similar to how “img” references particular types of image files.

    Further, HTML doesn’t do namespaces, so every adoption from another standard would need to be replicated into HTML anyway. And since there is not an exact match between the needs that itext and itextlist express and those provided by other specs, I’d rather avoid that complexity.

    The important thing here, though, is that we have looked at existing syntaxes and have learnt from them, so even though there is no direct re-use, there is indeed conceptual re-use and learning.

  5. @kunter There is no need to merge the markup of subtitles (or other time-aligned text for that matter) with HTML directly, just as there is no need to base64 encode images and include them directly in HTML. The itext proposal replicates for subtitles what we do for images and thus it follows completely along the HTML philosophy. DFXP is more than enough mark-up for subtitles, and so is srt or any of the other millions of formats that people have come up with over the years.

    As for linking to subtitles in a HTML head element: that won’t work when you have multiple video elements on the web page. You really do need a solution that clearly associates with a particular video element.

  6. Is the intent here that the name attribute of the itextlist element be localized as it’s sent down the wire? Just wondering if the idea is that it is a well-defined string that the browser localizes or it’s something that should be localized before the UA sees it.

    1. @blizzard The “name” attribute could indeed be localized before the UA sees it, since it will get displayed in the menu. The idea is however to allow the page author to influence the name of the menu. This may be a bad idea, I don’t know – I’m happy for suggestions there.

  7. >> Callbacks on timed text segments <<

    >> It is possible to achieve this effect simply through adding a timeupdate event listener, but proper callbacks like these are much more efficient. <<

    >> I am also consciously refraining from re-implementing SMIL. I do not want the full complexity of the

  8. @Sam

    Thanks for the extensive feedback – I’ve added it to .

    Re SMIL: yes, it is oriented towards multi-media presentations where the timeline is in control – that is not how Web pages work, which are essentially static text content enriched with interactive and media elements. Thus the poor fit.

    Re live timed events: I think it is possible to point the video src URL to a live broadcast, which then gets updated continuously. It might make sense to turn off the controls for such an element. I also don’t see a problem in attaching a continuously updated subtitle file in an itext element to such a live video source. The text could continue to be pushed/pulled. With JavaScript, it would also be possible to continue pulling other content, such as images or other text.

    Re timed events: I assume you are saying that you like the callback methods that were introduced on the timed text segments, since they allow you to do such timed updates? I can see how that would be possible – nice ideas!

  9. >> Re timed events: I assume you are saying that you like the callback methods that were introduced on the timed text segments, since they allow you to do such timed updates? <<

    That's right. I was also trying to say (probably not very clearly!) that it would be good to be able to listen for enter and leave events as well as being able to attach event handler callbacks to onenter and onleave — if only to encourage coders to move JavaScript out of HTML.

  10. It would be fantastic if there was a common video standard to embed text in a video file that every browser could just access the same way as the video and audio streams inside.

    Your idea is a hack around the stupid patent driven reality.

    Your idea is well designed but I am afraid it also reinforces the reality. Why should people care to use an open standard for video with text embedding abilities if they can just use your hack?

    People saving the video will end up with a useless file not containing the text they might need. And no indication of that before they save it. The text will mysteriously disappear.

  11. @Doris

    There are use cases for both, text inside a video/audio file, and text outside, but related. For most Web developers, outside is in fact a lot more sensible since it’s easier to update such files.

    I am neither trying to avoid a patent reality nor trying to hack around issues. Even if there existed only a single format in which we encoded and encapsulated audio and video, I would still propose to use both: in-stream and external (out-of-band) time-aligned text.

    For Web pages we already have the reality that they consist of many files that together create a consistent presentation. It’s not a problem – we have zip/tar files and many other means to solve these issues.

    In fact, I think it will be even less of a problem for video, since a server can provide the additional service of embedding a text file (or, in fact, text from a database) inside a video file upon download, should that be desirable. Also, it is possible to download the video and the associated text files as a package. If the text “mysteriously disappears”, it’s a feature/bug of the Website rather than a fundamental design issue.
