Embedding time-aligned text into Ogg

As part of my accessibility work for Mozilla and Xiph, it is necessary to define how time-aligned text, such as subtitles, captions, or annotations, is encapsulated into Ogg. In the fansubber community this is called “hard subtitles”, as opposed to “soft subtitles”, which are subtitles that stay in a text file, are loaded into a media player separately from the video file, and are synchronised with the video by the media player. (As per the comments below, this is not accurate: all text annotations are “soft”, or also “closed”.)

I can hear you ask: so how do I do subtitles/captions with Ogg now? Well, it would have been possible to simply choose one subtitling format and map that into Ogg, then ask everyone to just use that one format and be done. But which one to choose? And why prefer a simpler one over a more complex one? And why just do subtitles and not any other time-aligned text?

So, instead, I analysed what types of time-aligned text “codecs” I have come across. Each one would have a multitude of text formats to capture the text data, because it is easy to invent a new format and standardisation hasn’t really happened in this space yet.

I have come up with the following list of typical time-aligned text codecs:

  • CC: closed captions (for the deaf)
  • SUB: subtitles
  • TAD: textual audio descriptions (for the blind – to be transferred to braille or TTS)
  • KTV: karaoke
  • TIK: ticker text
  • AR: active regions
  • NB: metadata & semantic annotations
  • TRX: transcripts / scripts
  • LRC: lyrics
  • LIN: linguistic markup
  • CUE: cue points, DVD style chapter markers and similar navigational landmarks

Let me know if you can think of any other classes of video/audio-related time-aligned text.

All of these texts can be represented in text files with some kind of time marker, and possibly some header information to set up the interpretation environment. So, the simplest way of creating a representation of these inside Ogg was to define a generic mapping for time-aligned text into Ogg.

The Xiph wiki holds the current draft specification for mapping text codecs into Ogg. For anyone wanting to map a text codec into Ogg, this should provide the framework. The idea is to separate the text codec’s data into header data and into timed text segments (which can have all sorts of styling and other information with it). Then, the mapping is simple. An example for srt is described on the wiki page.
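
To make the header/segment split a little more concrete, here is a rough Python sketch for srt. It is only an illustration under my own assumptions: the names `split_srt` and `TimedSegment` are made up, the srt mapping on the wiki may differ in its details, and nothing here writes actual Ogg pages. Since srt carries no setup information, the header simply comes out empty.

```python
import re
from dataclasses import dataclass

# Hypothetical helper names; none of this is taken from the draft specification.
_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class TimedSegment:
    start_ms: int  # start time in milliseconds
    end_ms: int    # end time in milliseconds
    text: str      # the text payload of this segment


def _to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def split_srt(srt_text):
    """Split an srt document into (header, list of timed segments)."""
    header = ""  # assumption: srt has no header data to carry over
    segments = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # srt cues are separated by blank lines
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIME.search(line)
            if match:
                fields = match.groups()
                text = "\n".join(lines[i + 1:]).strip()
                segments.append(
                    TimedSegment(_to_ms(*fields[:4]), _to_ms(*fields[4:]), text)
                )
                break  # everything after the timing line is cue text
    return header, segments
```

Each `TimedSegment` would then become the payload of one data packet, with its start time deciding where it sits in the stream. How exactly that timing is expressed in Ogg is what the draft specification is for.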

The specification is still in draft status, because we’re still expecting feedback. In fact, what we now need is people trying an implementation and providing fixes to the specification.

To map your text codec of choice into Ogg, you will probably require further mapping specifications. Depending on how complex your text codec of choice is, these additional mapping specifications may be rather simple or quite complicated. In the case of srt, it should be trivial. Considering the massive amount of srt already freely available online, the srt mapping may well have a really large impact. Enough hints. Let me know if you’re coding up something!

My next duty is to look for a representation that is generic enough to cover any of the text codecs listed above. This representation is what will need to be available to a Web browser when working with a Web video that has related text. Current contenders are OggKate and W3C TimedText, but I am not sure whether either is too restrictive. I am indeed looking for the next generation of captioning technology, one that will be able to provide any type of time-aligned text that relates to audio/video.

9 thoughts on “Embedding time-aligned text into Ogg”

  1. Hi Silvia,

    The other type of text annotation I’ve seen recently is a director’s commentary or behind-the-scenes information. In “Who Framed Roger Rabbit” they have a ‘Toontown Confidential’ version of the playback where text gets superimposed on the movie at given points to describe extra information about the scene – such as the number of different effects that a particular scene used or information about particular characters. I think this might fit into your ‘NB’ category but it’s not strictly metadata about the movie _file_ – I think it fits more into a ‘movie information’ (e.g. ‘NFO’) or ‘trivia’ (‘TRV’) tag.

    Hope this helps,


  2. I would agree with it being an NB category. NB is not about file information, but about semantic comments and things. This includes movie information and trivia IMHO.

    In the end it really depends on how it’s being presented, because there will be default displays for the different types of text annotations. For example, captions and subtitles will be at the bottom of the screen in a certain area, probably in black with a white boundary. However, we need input by designers, GUI experts and the like for what defaults make sense. It’s not a trivial exercise overall. 🙂

  3. I suspect that your distinction between hard and soft subtitles isn’t accurate. As I understand it, hard subtitles are subtitles that are rendered into the images that make up the video during encoding, while soft subtitles are actually rendered at playback-time based on text input and timing tags.

    DVD overlay subtitles are I guess a corner-case, since they’re pre-rendered but those prerenders are triggered by timing values. But I’m pretty sure they’re hard subs because they’re not in a textual format.

    On the other hand, I believe the MKV format has a way of including a soft sub stream in the actual container (rather than as an external file), but I don’t know how standard that is. I think mplayer supports it…

    Anyway, as I understand it, what you’re talking about here are all soft subs. Hard subs are handled by the encoder rendering the text into the images at encoding time.

  4. TBBle,

    I have just read it up on http://en.wikipedia.org/wiki/Fansub and apparently you are right. Strangely enough, this relates to “closed” and “open” captions – open captions are hardcoded into the imagery and cannot be removed, while closed captions are in a separate data stream that can be turned off.

    Thanks for noticing.

  5. I’m not sure I understand how the codec types would be used, or, at least, they don’t seem like codec types to me, more like a semantic annotation of what the track is about.

    There doesn’t seem to be a straight mapping between your codec types and encodings (neither 1-to-many nor many-to-1). For example, DFXP or smilText or realText could be used for any of subtitles, captions, karaoke and tickertext, and probably some of the others as well. So, only the document author would be able to set the codec type, based on the content of the track.

    This makes the use of these codec types completely different (from a client application point of view) from other codec types like those for video. For the latter, the codec type signals (to the app) whether it’ll be able to decode and display the track, nothing about its semantics. For these text codecs it’s exactly the reverse: the client app only gets semantic info from the codec type, and cannot determine whether it’ll be able to decode the stream…

  6. Hi Jack,

    The codec types are what a user of, e.g., a Web browser would choose to have displayed, if that type is available and if the browser understands the format of the content type, i.e. the format of the description.

    For example, DFXP, smilText and realText can be used to describe any of these. But one instance document should only represent one text codec type. I.e. one instance DFXP document should be either a caption or a tickertext or a karaoke file.

    When encapsulating this into Ogg, you want to know which it is. You want to know this at the level of encapsulation because the server might need to do content adaptation and throw away some of the tracks. Also, it is easier for the client to know this in a quick way and at a defined location in the bitstream, such that it can present to its user the ones that the user prefers.

    So, it doesn’t matter if you have a format like DFXP which is able to specify multiple of these types, or a format like srt which is only for subtitles (and maybe captions). The author will specify which format it is and which codec type as it gets multiplexed into Ogg and you’re sorted.

    As for your comparison with video codec types: you are right that video codec types, in particular the “codecs” parameter in a MIME type, have a very different meaning. I may have chosen a bad name to describe what I mean. If anyone has a better suggestion for what to call them (I think one person suggested codec category), let me know.

    BTW: the actual MIME type of the text codec is also encapsulated into Ogg, so you don’t need to worry about that missing.

  7. Not sure if this is an addition, or if it is already considered part of, say, the NB metadata codec, but what about camera position, either relative (some sort of studio XY coordinate) or absolute (long/lat/altitude), and in fact pan, tilt, zoom or other camera-related data? This could even apply to audio, I guess. Either way, one would hope that this metadata has a standard format that could be included.

  8. Hi Martin,

    Hmm… there seems to be a lot going under NB. The director’s comments that Paul suggested would be displayed on screen, whereas these more metadata-like comments would probably only be machine-readable. It may well be better to create another type for this. How about META?
