What is the raw format of time-aligned text?

My grant with Mozilla on exploring the state and possible ways forward for video accessibility on the Web is over. I have posted a detailed report in the Mozilla wiki, which you should read if you care about the details. It has been a very instructive time of analysis and I have learnt a lot about the needs and requirements for time-aligned text.

For example, I learnt that for many deaf people, our spoken language is just their second language while their first language is actually sign language, thus making it very important to allow for picture-in-picture display of sign-language video tracks in Web video.

I also learnt about more technical challenges, e.g. how difficult it may be to simply map the content of a linked Web resource into a current Web page when one cannot be certain about the security implications of this mapping. This will be of importance as we synchronise out-of-band time-aligned text (e.g. a subtitle file on a server) to a video file that is included in a Web page through a HTML5 <video> tag.

There are two large work items that need to be undertaken next in my opinion.

Firstly we have to create a collection of example files that explain the different categories of time-aligned text that were identified and their specific requirements. For example, the requirements of simple subtitle files are clear. But what about karaoke files? Or ticker-text? We need pseudo-code example files that explain the details of what people may want to display with such files.

I regard the DFXP test suite as one source of such files – the guys at the W3C TimedText working group have done a tremendous job at collecting the types of timed text documents that they want to be able to display.

Another source will be the examples directory in libkate, which implements OggKate, a format that I can envisage as the default encoding format for time-aligned text inside Ogg, because of its existing extensive code base and the principles with which OggKate was developed.

The second work item is more challenging and more fundamental to time-aligned text in Web browsers. We have to create a specification of how to represent time-aligned text inside Web browsers – basically the DOM and the API, but also what processing needs to be done to get the text there. I have a proposal on using a <text> element inside the <video> element to reference out-of-band time-aligned text resources. However, the question is what to do with them then.

The more I thought about this, the more the question is reduced to finding the “raw format” of time-aligned text: When a Web browser decodes a time-aligend text file, what is its internal representation of it, its “raw” state. This will map to HTML, CSS, javascript, and other existing Web technology. But what is this minimal, “raw” representation? Text and graphics with positioning information, style information, timing information, state information, and potentially hyperlinks? is that all?

These are the questions that I think need to be explored next.

In parallel we should start with an implementation of support for the simplest type of time-aligned text: plain SRT. The raw format for this is simple: just a series of text with start and end times. Even though this is simple, it has no straightforward mapping into HTML since HTML does not understand time, so it can only be dealt with in javascript or through SVG. It may be time to include a simple concept of time into HTML. Let’s just avoid making it as complex as HTML+Time!

A basic support of SRT in Firefox would create a first step towards a framework for more complicated time-aligned text. It would also create access to video for a large number of hearing-impaired and non-native viewers and get us a large step towards comprehensive media accessibility on the Web. I hope we can address this in 2009.