
What is “interoperable TTML”?

I’ve just tried to come to terms with the latest state of TTML, the Timed Text Markup Language.

TTML was specified by the W3C Timed Text Working Group and released as a Recommendation (v1.0) in November 2010. Since then, several organisations have tried to adopt it as their caption file format. This includes the SMPTE, the EBU (European Broadcasting Union), and Microsoft.

Both Microsoft and the EBU actually looked at TTML in detail and decided that, to make it usable for their use cases, its functionality needed to be restricted.


The EBU released EBU-TT, which restricts the set of valid attributes and features. “The EBU-TT format is intended to constrain the features provided by TTML, especially to make EBU-TT more suitable for the use with broadcast video and web video applications.” (see EBU-TT).

In addition, EBU-specific namespaces were introduced to extend TTML with EBU-specific data types, e.g. ebuttdt:frameRateMultiplierType or ebuttdt:smpteTimingType. Similarly, a bunch of metadata elements were introduced, e.g. ebuttm:documentMetadata, ebuttm:documentEbuttVersion, or ebuttm:documentIdentifier.

The use of namespaces as an extensibility mechanism ensures that EBU-TT files continue to be valid TTML files. However, a vanilla TTML parser will not know what to do with these custom extensions and will drop them on the floor.
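To make the "drop them on the floor" behaviour concrete, here is a minimal Python sketch of a namespace-aware reader that only picks up elements in the TTML namespace. The sample document is made up for illustration, and the EBU-TT metadata namespace URI and the version string are assumptions that should be checked against the EBU-TT specification.

```python
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"   # TTML 1.0 namespace
# Assumed URI for the "ebuttm:" prefix; verify against the EBU-TT spec.
EBUTTM_NS = "urn:ebu:tt:metadata"

# Hypothetical sample file: plain TTML plus EBU-specific metadata elements.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:ebuttm="urn:ebu:tt:metadata" xml:lang="en">
  <head>
    <metadata>
      <ebuttm:documentMetadata>
        <ebuttm:documentEbuttVersion>v1.0</ebuttm:documentEbuttVersion>
      </ebuttm:documentMetadata>
    </metadata>
  </head>
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.000">Hello world</p>
    </div>
  </body>
</tt>"""


def extract_cues(ttml_text):
    """Return (begin, end, text) for every TTML <p>, ignoring foreign namespaces."""
    root = ET.fromstring(ttml_text)
    cues = []
    for p in root.iter("{%s}p" % TT_NS):
        text = "".join(p.itertext()).strip()
        cues.append((p.get("begin"), p.get("end"), text))
    return cues


def foreign_elements(ttml_text):
    """List tags outside the TTML namespaces: what a vanilla parser would skip."""
    root = ET.fromstring(ttml_text)
    # The prefix test also accepts TTML's own sub-namespaces such as ...ttml#metadata.
    return [el.tag for el in root.iter() if not el.tag.startswith("{" + TT_NS)]


if __name__ == "__main__":
    print(extract_cues(SAMPLE))
    print(foreign_elements(SAMPLE))
```

The captions come through untouched, while the two ebuttm elements are only visible to a reader that explicitly looks for them.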

Simple Delivery Profile

With the intention to make TTML ready for “internet delivery of Captions originated in the United States”, Microsoft proposed a “Simple Delivery Profile for Closed Captions (US)” (see Simple Profile). The Simple Profile is also a restriction of TTML.

Unfortunately, the Microsoft profile is not the same as the EBU-TT profile: for example, it contains the “set” element, which is not conformant in EBU-TT. Similarly, the supported style features are different, e.g. Simple Profile supports “display-region”, while EBU-TT does not. On the other hand, EBU-TT supports monospace, sans-serif and serif fonts, while the Simple profile does not.

Thus files created for the Simple Delivery Profile will not work on players that expect EBU-TT, and vice versa.

Fortunately, the Simple Delivery Profile does not introduce any new namespaces and new features, so at least it is an explicit subpart of TTML and not both a restriction and extension like EBU-TT.


SMPTE also created a version of the TTML standard called SMPTE-TT. SMPTE did not decide on a subset of TTML for their purposes – it was simply adopted as a complete set. “This Standard provides a framework for timed text to be supported for content delivered via broadband means,…” (see SMPTE-TT).

However, SMPTE extended TTML in SMPTE-TT with an ability to store a binary blob with captions in another format. This allows using SMPTE-TT as a transport format for any caption format and is deemed to help with “backwards compatibility”.

Now, instead of specifying a profile, SMPTE decided to define how to convert CEA-608 captions to SMPTE-TT. Even if it’s not called a “profile”, that’s actually what it is. It even has its own namespace: “m608:”.


With all these different versions of TTML, I ask myself what a video player that claims support for TTML will do to get something working. The only chance it has is to implement all the extensions defined in all the different profiles. I pity the player that has to deal with a SMPTE-TT file that has a binary blob in it and is expected to be able to decode this.

Now, what is a caption author supposed to do when creating TTML? They obviously cannot expect all players to be able to play back all TTML versions. Should they create different files depending on what platform they are targeting, i.e. an EBU-TT version, a SMPTE-TT version, a vanilla TTML version, and a Simple Delivery Profile version? Should they throw all the features of all the versions into one TTML file and hope that the players will pick out the things that they require and drop the rest on the floor?

Maybe the best way to progress would be to make a list of the “safe” features: those features that every TTML profile supports. That may be the best way to get an “interoperable TTML” file. Here’s me hoping that this minimal set of features doesn’t just end up being the usual (starttime, endtime, text) triple.
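Finding the “safe” features amounts to a set intersection over the profiles. The following Python sketch shows the idea. The feature sets are illustrative only: they encode just the differences mentioned in this post, plus a hypothetical "binary-data" label for the SMPTE-TT blob, and are not a complete reading of any of the specs.

```python
# Illustrative feature sets only; a real comparison needs each spec read in full.
PROFILE_FEATURES = {
    "EBU-TT": {"begin", "end", "text", "font-family-generic"},
    "Simple Delivery Profile": {"begin", "end", "text", "set", "display-region"},
    "SMPTE-TT": {
        "begin", "end", "text", "set", "display-region",
        "font-family-generic", "binary-data",  # "binary-data" is a made-up label
    },
}


def safe_features(profiles):
    """Features supported by every profile (set intersection)."""
    feature_sets = list(profiles.values())
    return set.intersection(*feature_sets) if feature_sets else set()


def unsafe_features(profiles):
    """Features supported by at least one profile, but not by all of them."""
    everything = set().union(*profiles.values())
    return everything - safe_features(profiles)


if __name__ == "__main__":
    print(sorted(safe_features(PROFILE_FEATURES)))
    print(sorted(unsafe_features(PROFILE_FEATURES)))
```

With these toy inputs, the intersection is exactly the (starttime, endtime, text) triple feared above, which is why a careful feature-by-feature inventory of the real specs would be worth doing.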


I just found out that UltraViolet have their own profile of SMPTE-TT called CFF-TT (see UltraViolet FAQ and spec). They are making some SMPTE-TT fields optional, but introduce a new @forcedDisplayMode attribute under their own namespace “cff:”.

Best economy flight evva!

Over the years, I have flown a lot – mainly between Sydney and Frankfurt or Sydney and San Francisco. Today, for the first time in a long time, I had a flight with Qantas from Sydney to San Francisco. And I must say: it was the most productive and most comfortable economy flight I had in a long time.

This is gonna feel awkward, since it’s not one of my usual technical posts. But I just have to say “Thank you” to Qantas. When I fly to the US, I tend to catch a US airline because they usually turn up as the cheapest. This time, Qantas was the second cheapest, so I decided to spend the extra hundred bucks on getting a modern airline. Yes, get that US airlines: no matter which of you I take, I always feel like I am thrown back into the last century. Legspace is rare, seats are uncomfortable, food is crap, service is poor, oh … and have you ever heard of personal entertainment screens? Yes, I know, your planes are from the last century. But honestly: I had a personal entertainment screen on my Singapore Airlines flight when coming to Australia for the first time in 1998! Couldn’t you at least upgrade the inside of your planes?

Anyway, back to this flight. It all started with the question: would you like to sit in the centre aisle in front of the baby bassinet? Oh, I usually take a window seat to get some peace and quiet – but hey, I’m not going to say “no” to space! And, man, did I use it!

I settled in with a good book and a little nap until the first meal and after that felt strengthened and awake enough to start hacking. With my new MacBook Pro, I was bound to get a few hours in before the battery would die on me. Not the 7 hours that Apple claims, but that’s because I was going to do lots of compiles of Firefox. Anyway – without a seat in front of me, without the personal entertainment screen pulled out, and with the nice thick cushion that Qantas supply on my lap, protecting me from the laptop heat, I almost felt like I was back home in my living room.

On top of that – and unfortunately for Qantas, but fortunately for me – the plane was only two thirds full, so I had the middle seat on my left empty, which I immediately used to extend my table space. I had continuing catering service for the next 4-5 hours of compiling, applying OggK patches to the new Chris Double Firefox codebase, and fixing compile errors (all configuration based – I have yet to get to writing actual code). Ongoing catering service, no need to cook for myself, uninterrupted coding time, good music from the inflight entertainment service – I think I’ll move my office into a Qantas plane! Not been this productive in ages!

Everywhere around me the lights were out, people were watching movies, but I was working and really enjoying it. And then, the battery was empty, halfway into the flight. Bummer! But I didn’t give up this easily. Thought it’d be worth asking if there was a way to recharge without occupying a toilet for two hours. And as with everything else, Qantas inflight personnel made an extra effort to please: they found me an empty seat in business class and hooked up the laptop for an hour to recharge. Totally, utterly awesome! I got it back after another nice reading break – cannot start watching movies, since that makes the brain go to mush. I got another few hours of compiling in before my body forced me to catch a few hours of sleep.

Now, I’m about an hour away from San Fran and the laptop claims 40min of power left. Funnily, that number seems to go up rather than down, so I’m sure it will last until arrival (uh! It’s now at 1:24min – oh, compilation just finished!). Hopefully I will be able to find out why some of the Ogg Theora/Vorbis/Kate videos that I created using kateenc and oggz-merge don’t play in the patched Firefox. After all, it would be awesome to be able to show it off in the upcoming HTML5 Video Accessibility workshop!

MySQL, Snow Leopard and ruby

I got a shiny new MacBook Pro on the weekend, yay! After months of complaining about the slowness of my old MacBook and the heat radiating from it, I’m finally off to better grounds.

But then there was the annoying task of setting up the machine with all the software that I’m using. MySQL and ruby turned out to be particular problems. I installed MySQL for 10.5, since MySQL haven’t published one for OS 10.6 yet. I ran “gem install mysql”. And then the pain started.

I got all the errors that were reported elsewhere: “uninitialized constant MysqlCompat::MysqlRes” and “undefined method `real_connect’ for Mysql:Class (NoMethodError)”. I tried all the suggestions, including:
sudo env ARCHFLAGS="-arch x86" gem install mysql -- --with-mysql-config=/usr/local/mysql-5.1.39-osx10.5-x86/bin/mysql_config -V --debug
but just couldn’t get there.

My laptop reports in the System Software Overview: “64-bit Kernel and Extensions: No”, so I assumed I had to use the 32 bit versions. However, that was a wrong assumption. Even though my kernel seems to be 32 bit, applications seem to be 64 bit.

So, eventually I re-installed MySQL for Mac OS X 10.5 (x86_64) and ran the correct gem install command:
sudo env ARCHFLAGS="-arch x86_64" gem install mysql -- --with-mysql-config=/usr/local/mysql-5.1.39-osx10.5-x86
and things were fine.
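Since the whole problem came down to an architecture mismatch between binaries, a quick check of what a binary was actually built for can save some guessing. Below is a rough Python sketch that classifies a file by the Mach-O magic number in its first four bytes. It is only a first-pass check: for a universal binary you would still need a tool such as `file` or `lipo -info` to list the contained architectures, and the path in the usage comment is just an example.

```python
# First four bytes of a Mach-O file, as they appear on disk.
MACHO_MAGIC = {
    bytes.fromhex("feedface"): "32-bit (big-endian)",
    bytes.fromhex("cefaedfe"): "32-bit (little-endian)",
    bytes.fromhex("feedfacf"): "64-bit (big-endian)",
    bytes.fromhex("cffaedfe"): "64-bit (little-endian)",
    bytes.fromhex("cafebabe"): "universal (fat) binary, several architectures",
}


def classify_header(first_four_bytes):
    """Map the leading magic number to a rough architecture description."""
    return MACHO_MAGIC.get(bytes(first_four_bytes[:4]), "not a Mach-O file")


def classify_file(path):
    """Read the magic number from a file on disk and classify it."""
    with open(path, "rb") as f:
        return classify_header(f.read(4))


# Example usage (path is illustrative; adjust to your installation):
#   print(classify_file("/usr/local/mysql/bin/mysql_config"))
```

If the Ruby interpreter and the MySQL client library report different word sizes, the gem will build but fail at load time, which matches the errors above.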

Additionally, there was some fighting with the PrefPane and re-starting mysql. I had to kill it manually and I had to install the updated PrefPane of Swoon dot net to make it work.

Hope this helps somebody avoid the same pain!

Amusement at WHATWG

This is not a technical post, but it made my day, so I thought I should share it.

For two years, the WHATWG has had an open twitter account: anyone who wanted to post a status message on WHATWG could just go there and update the twitter status.

For two years, the script kiddies didn’t find the account.

They discovered it about 12 hours ago. Check it out before twitter’s history eliminates the posts again.

Here are some of the “jewels” posted:

“WHATWG: We’re only half as evil as we seem.”

“The HXTML 2.0 spec has been finalized with only one tag which is <text>.”

“W3C issues announcement: Internet Explorer to be made obsolete. From fall onwards, IE6 and IE7 will be blocked from browsing the internet”

“I hope the script kiddies realizes that no one cares what is posted to the WHATWG twitter account”


“Our whole team of security experts was just fired.”

“i want <isitfriday> tag…” (me too!!)


“WHATWG announce working group on emoticons. Homer says (_8(|) ~doh!”

“WHATWG to start work on “Bible5″” (this is actually old, but still golden)


Mario is dead and dismantled

Beloved Shuttle Box

Six weeks ago, on a fateful Saturday, both my washing machine and cute little Mario died on the same day. The washing machine was quickly repaired, but there was no hope for Mario, as the burnt smell of electronics indicated. It wasn’t going to start up again.

Mario had been the first server to run the code developed at Vquence. It was our development and testing server for more than 8 months until we moved to a server at The Planet – later to Voxel and now ultimately to Amazon.

After it was relieved of Vquence duty, Mario became what it was originally bought to become: a media server. Running Linux and MythTV, it was the beloved center of our living room for the last 2 years. But it seems the heavy duty VCR work as well as running Linux exhausted him.

Well, it is now replaced by an ordinary HP machine – I will miss the cute little shuttle.

If anyone wants the remains, let me know.

Video as an enabler for broadband applications

Last week, I gave a brief statement on the importance of video as an enabler for broadband applications at the Public Sphere event of Senator Kate Lundy.

I found it really difficult to summarize all the things that I find important about video technology in a modern distributed online world in a 10 min speech. Therefore, I’d like to expand on some of the key points that I was trying to make in this blog post.

Video provides presence

One of the biggest problems we have with the online world is that it mostly still revolves around text. To exchange information with others, to publish, to chat (email, irc or twitter) or do our work, we mostly still rely on the written word as a means of communication. However, we all know how restrictive this is – everyone who has ever seen a flame war develop on a mailing list, a friendship break over a badly formulated email, a host of negative comments posted on a mis-formulated blog post, or a twitter storm explode over a misunderstanding knows that text is very hard to get right. Lacking any sort of personal expression supporting the expressed words (other than the occasional emoticon), sentences can be read or interpreted in the wrong way.

A phone call (or skype call) is better than text: how often have you exchanged 10 or even 20 emails with a friend to e.g. arrange to meet for a beer, when a simple phone call would have solved it within seconds? But even a phone call provides a reduced set of communication channels in comparison to a personal meeting: gesture, posture, facial expression and motion are there to enrich communication channels and help us understand the other better. Just think about the cognitive challenges in a phone conference in comparison to the ease of speaking to people when you see them.

With communication that uses video, we have a much higher communication “bandwidth” between people, i.e. a lot less has to be actually said in words so we can understand each other, because gesture, posture, facial expression and motion speak for us, too. While we cannot touch each other in a video communication, e.g. for shaking hands or kissing cheeks, video provides all these other channels of communication, giving a much higher perceived feeling of “presence” of the remote person or people. When my son speaks over skype with my family in Germany, and we cannot turn on the web cam because the bandwidth and latency are too poor, he loses interest very quickly in speaking to these “soul-less” voices.

The availability of bandwidth will make it possible for humans to communicate with each other at a more natural level, feeling more engaged and involved. This has implications not just on immediate communications, such as person-to-person calls or video conferences, but on any application that requires the interaction of people.

Video requirements are the block to creating new applications

Bandwidth requirements for most online applications are pretty low. Consider for example a remote surgery where a surgical expert on one end operates on a patient at a remote location with surgical staff and operating equipment. The actual amount of data that needs to be exchanged between the surgeon and the operating machines is fairly low – it is mostly command-control data that has to be delivered at high accuracy and low delay, but does not require high bandwidth. What turns such a remote surgery scenario into a challenge with existing networks are the requirements for multiple video channels – the surgeon needs to be visible to the staff and probably to the patient – in turn, the surgeon needs to see the staff, needs to see the patient from multiple angles to gain the full picture, needs to see the supporting documents such as X-rays, schedules, blood analysis etc, and of course he needs to see the video coming from the operating equipment, possibly from within the patient, that gives him feedback on the actual operation.

As you can see, it is video that creates the need for high bandwidth.

This is not restricted to medical applications. Almost all new remote applications that we create end up having a huge visual requirement with multiple video streams. This is natural, since almost all remote applications involve more than one person and each person has the capability to look in different directions. Thus, the presence of each person has to be replicated and the representation of the environment has to be replicated.

Even in a simple scenario such as a video conference, a single camera and microphone are very restrictive: they do not let every participant interact with any of the other people present, but restrict them to the person/group that the camera is currently focused on. Back channels such as affirmative side chats or exchanges of opinion through facial expressions are lost. Multiple video channels can make up for this.

In my experience from the many projects I have been somewhat involved with over the years that tried to develop new remote applications – teleteaching at Mannheim University or the CeNTIE project at CSIRO – video is the bandwidth-needy channel, but video is not the main purpose of the application. Rather, the needs for information for the involved people are what drives the setup of the data and communication channels for a particular application.

Immediately, applications in the following areas come to mind that will be enabled through broadband:

  • education: remote lectures, remote seminars, remote tutoring, remote access to research text/data
  • health: remote surgery, remote expert visits, remote patient monitoring
  • business: remote workplace, remote person-to-person collaboration with data sharing and visualisation, remote water-cooler conversations, remote team presence
  • entertainment: remote theatre/concert/opera visit, home cinema, high-quality video-on-demand

But ultimately, there is impact into all aspects of our lives: consider e.g. the new possibilities for citizen involvement in politics with remote video technology, or collaborative remote video editing in video production, or in sports for data collection. Simply ask yourself “what would I do differently if I had unlimited bandwidth?” and I’m sure you will come up with at least another 2 or 3 new applications in your field of expertise that have not been mentioned before.

Technical challenges

Video (with audio) is an inherently volatile data stream that is highly sensitive to specific kinds of networking issues.

End-to-end delays such as are typical with satellite-based connections destroy the feeling of presence and create at best awkward communications, at worst destructive feedback-loops in live operations. Unfortunately, there is a natural limit to the speed at which data can flow between two points. Given that the largest distance between two points on earth is approx 20,000 km and the speed of light is approx 300,000 km/s, a roundtrip must take at least 133ms. Considering that humans can detect a delay as small as 10ms in a remote communication and are really put off by a delay of 100ms, this is a technical challenge that we will find hard to overcome. It shows, however, that it is a technical requirement to minimize end-to-end delays as much as possible.
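The 133ms figure is simple arithmetic, sketched here in Python for anyone who wants to play with the numbers. The fibre speed of roughly 200,000 km/s is an added assumption (light travels at about two thirds of its vacuum speed in glass) and shows that real networks start from an even worse floor.

```python
SPEED_OF_LIGHT_KM_PER_S = 300_000   # approximate vacuum speed, as used in the text
SPEED_IN_FIBRE_KM_PER_S = 200_000   # assumption: about 2/3 of c in optical fibre
MAX_SURFACE_DISTANCE_KM = 20_000    # roughly half of the Earth's circumference


def min_round_trip_ms(distance_km, speed_km_per_s=SPEED_OF_LIGHT_KM_PER_S):
    """Lower bound on the round-trip time: there and back at the given speed."""
    return 2 * distance_km / speed_km_per_s * 1000


if __name__ == "__main__":
    print(round(min_round_trip_ms(MAX_SURFACE_DISTANCE_KM)))  # 133
    print(round(min_round_trip_ms(MAX_SURFACE_DISTANCE_KM, SPEED_IN_FIBRE_KM_PER_S)))  # 200
```

These are propagation delays only; routing, queuing, encoding and decoding all add to them.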

Packet jitter is another challenge that video deals with badly. In networks, packets cannot easily be guaranteed to arrive at a certain required rate. For example, video needs to play back at a fixed picture rate (typically 25 frames per second) for humans to be able to view it as smooth motion. Whether video is transferred live or from a file, video packets are required to arrive at a certain rate such that the pictures can be decoded and displayed at the expected rate. The variance in delay of packets arriving because of network congestion is called packet jitter. If packet jitter is high, the video will either have to stop and buffer packets until enough video frames have arrived for it to display again, or it will have to drop packets and therefore video frames to keep in sync with a live stream. Typically the biggest problem of dropping packets is the drop-out of audio – while we can tolerate some drop-outs in video, audio drop-outs are unacceptable to maintain a conversation.
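A toy Python sketch may help to illustrate the trade-off between buffering and dropping. It measures jitter as the standard deviation of the per-frame delays (one of several possible definitions, chosen here for simplicity) and uses made-up arrival times for a 25 frames per second stream. The playout model is deliberately simplistic: playback is delayed by the smallest observed delay plus a fixed buffer, and any frame that takes longer than that is late.

```python
from statistics import pstdev

FRAME_INTERVAL_MS = 40  # 25 frames per second


def delays_ms(send_ms, recv_ms):
    """Per-frame network delay: arrival time minus send time."""
    return [r - s for s, r in zip(send_ms, recv_ms)]


def jitter_ms(delays):
    """Jitter measured as the population standard deviation of the delays."""
    return pstdev(delays)


def late_frames(delays, buffer_ms):
    """Indices of frames that miss their playout deadline.

    Playback is offset by min(delays) + buffer_ms, so a frame whose delay
    exceeds that offset arrives after it should have been shown.
    """
    deadline = min(delays) + buffer_ms
    return [i for i, d in enumerate(delays) if d > deadline]


# Made-up example: six frames sent every 40 ms, one of them badly delayed.
send = [i * FRAME_INTERVAL_MS for i in range(6)]
recv = [50, 95, 130, 230, 215, 255]
delays = delays_ms(send, recv)

if __name__ == "__main__":
    print(delays)                    # [50, 55, 50, 110, 55, 55]
    print(late_frames(delays, 20))   # [3]: a small buffer drops the delayed frame
    print(late_frames(delays, 80))   # []: a larger buffer hides it, at the cost of delay
```

A larger buffer rescues the delayed frame but adds to the end-to-end delay for every frame, which is exactly what a live conversation cannot afford.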

In most of the application scenarios, there is a varying need for video quality.

For example, a head shot of a person that is required for communication doesn’t need high-quality video – it is sufficient if the person can be seen and the communication can be held. The audio resolution can be telephone quality (i.e. 8kHz audio sampling rate) and the video can be highly compressed and at a smallish resolution (e.g. 320×240 px) giving standard skype quality video which requires about 400Kbps in bandwidth.

At the other end of the scale are e.g. medical and large-screen applications, where high sound quality is required, e.g. to hear heart beats properly (i.e. 48-96kHz audio sampling rate), and the video can’t be compressed much so as not to introduce artifacts. At a high HDTV resolution of e.g. 1920x1080px, this gives bandwidth requirements of about 30Mbps compressed; fully uncompressed video at that resolution would be far higher, on the order of 1Gbps.

So, depending on the tolerance of the application to picture size, compression artifacts, and the number of parallel video streams required, bandwidth requirements for video can be relatively low or really high.
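To get a feeling for how quickly parallel streams add up, here is a small Python sketch that sums per-stream bitrates using the two figures quoted above (400Kbps for a head shot, 30Mbps for compressed HD). The stream counts for the two scenarios are hypothetical.

```python
HEADSHOT_KBPS = 400   # skype-quality head shot, figure from the text
HD_KBPS = 30_000      # compressed 1920x1080 video, figure from the text


def total_mbps(streams):
    """Sum the bandwidth of a list of (count, kbps_per_stream) pairs, in Mbps."""
    return sum(count * kbps for count, kbps in streams) / 1000


# Hypothetical setups: the stream counts are illustrative only.
video_call = [(2, HEADSHOT_KBPS)]
remote_surgery = [(4, HD_KBPS), (2, HEADSHOT_KBPS)]

if __name__ == "__main__":
    print(total_mbps(video_call))      # 0.8
    print(total_mbps(remote_surgery))  # 120.8
```

Under these assumptions the two scenarios differ by more than two orders of magnitude, even though both are “just video”.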

Further technical issues around video are that online video can be handled differently to analog video. The video can have all sorts of metadata associated with it – it can have hyperlinks to other content – it can be accompanied by advertising in more flexible ways – and it can be automatically personalised towards the needs of the individual viewers, just to name a few rich functions of online video. It is here where a lot of new ideas for monetisation will evolve.

Non-technical challenges

Apart from technical challenges, the use of video also creates issues in other dimensions.

People are worried about their behaviour as it is always potentially recorded and thus may not perform their duties with the same focus and concentration as is necessary.

People are worried about video connections always being potentially enabled and thus having potentially a remote listener/viewer that is unwanted.

On top of such privacy issues come issues in data security as increasingly data is distributed remotely.

We should also not forget that there are people that have varying requirements for their communication. A large challenge for such new applications will be to make them accessible. For example the automated creation of captions for remote video communication may well turn out to be a major challenge, but also an opportunity for later archiving and search.

When looking at the expected move of professional video content from TV to online, there are more issues about copyrighted content and usage rights – mostly this has to do with legacy content where online use was not considered in licensing agreements. This is a large inhibitor e.g. for Australia in creating a Hulu-like service.

In fact, monetisation is a huge issue, since video is not cheap: there is a cost in the development of applications, there is a cost in bandwidth, in storage, and a cost in content production that has to be covered somewhere. Simply expecting the user to pay for being online and then to pay again for each separate application, potentially subscribing to a multitude of services, may not be the best way to cope with the cost. Advertising will certainly play a big role in the monetisation mix and new forms of advertising will emerge, such as personalised permission-based advertising based on the information available about a person e.g. through their Google searches.

In this context, the measurement of the use of video in bandwidth, storage and as part of an application will be a big enabler towards figuring out how to pay for all the involved expenditure and what new monetisation models to come up with.

Further, in the context of cost and monetisation, it should be added that the use of open source software, in particular open source video technology such as open codecs, can help bring down cost while at the same time creating more interoperability. For example, if Skype used an open codec and open protocols rather than their proprietary technology, other applications could be built using the skype infrastructure and user base.

Approach to developing good new applications

These are just the challenges for video streams themselves. However, in new applications, video streams will just be a tool for creating an integral application, ultimately driven by the processes and data needs of the application. The creation of all the other parts of the application – the machinery, control panels, the data pools, the processes, the human interface, security and privacy measures etc – are what make up the product challenge. A product ultimately has to function in a way that makes it a usable tool in achieving a certain outcome. Unless the use of the product becomes natural and the distance disappears from the minds of the people involved, a remote application does not succeed.

In the CeNTIE project, the approach towards the development of new remote applications was to assume no limits on available bandwidth. Then a challenge would be identified in an application area, e.g. in the medical space, and a prototype would be built with lots of input from the domain experts. Then the prototype would actually be deployed into a real working situation and tested. The feedback from the domain experts would be used to improve the application with further technology and improved processes. Ultimately, a usable setup would emerge, which was then ready to be turned into a product for commercialisation.

We have the capabilities here in Australia to develop world-class new applications on high-bandwidth networks. We need to support this further with bandwidth – hopefully the NBN will achieve this. But we also need to support this further with commercialisation support – unfortunately most of the applications that I saw being developed at the CSIRO never made it past the successful prototype. But this is fodder for another blog post at a different time.

Finally, I’d like to point out that we also have a large challenge in overcoming tradition. Most of us would be challenged to trust a doctor and his equipment for doing a surgical operation on our body from a remote location. There are issues of trust and culture involved that may take us a while to deal with and accept.

UPDATE (11/6/09): It seems that Cisco’s latest report, which predicts global IP traffic to increase 5-fold over the next 3 years, agrees with the analysis that most of this increase will be caused by video.

Google video: 2.5 years later, my predictions come true

When Google bought YouTube in October 2006, I wrote a blog entry about how Google video is a hosting site and that with the purchase of YouTube, Google has the opportunity to turn the Google brand back to video search.

Well, today, that prediction has come true and Google video has stopped hosting videos for users. So, things are now clear: YouTube is a video publishing site and Google video is a search engine.

Hold on: not so fast.

According to ComScore’s U.S. search engine rankings for August 2008, YouTube is the second largest search engine on the Web, ahead of Yahoo. At Vquence, we explain to customers that many people now use YouTube search as their entry point into the Web. Video is their Web. And when it comes to video, it’s all about YouTube.

Because people search for videos on YouTube, most videos that get published will have a copy on YouTube. Thus, YouTube is the dominant place to find video – not Google video. Also, YouTube is turning more and more into a search engine like Google: just this week they published “featured search results“, making a YouTube search result page look almost identical to a Google search result page: there is some featured content on top of the actual search results and there are some paid-for ads on the right.

Since it has taken Google such a long time to move Google video from hosting service to search service, I wonder if it’s not too late for Google video already. It feels now just like an add-on to YouTube – a place you go when all other searches fail.

Yahoo video search was once the best video search around. Then came Truveo and blinkx and a whole bunch more. Now, nobody writes about them any more – everybody just goes to YouTube itself or to Google Universal Search to go and find a video.

It would be nice if Google video search stayed around – if only as a discovery tool for when Web video goes directly onto our TVs. But I doubt Google will find a good way to monetize it. YouTube’s search will be monetized quicker and more effectively.

Xiph / Annodex part of GSoC this year

Yup, we made it again! And because we want to create some awesome code, I’m posting our call for student applications here.

To all students: Please apply to Xiph GSoC projects and help us make open media technology rock the world!

Xiph and Annodex are seeking Summer of Code student applications!

2009 is an important year for free codecs: Ogg Vorbis on every Android device, Ogg Theora support in development for Mozilla Firefox 3.5, and expanded Ogg hosting by the Internet Archive and Wikimedia. Xiph and Annodex, who develop free codecs (Ogg Vorbis, Theora, Dirac, Speex, CELT, FLAC) and web video support for them, have been selected as a mentoring organization for Google Summer of Code 2009.

We are actively seeking student projects for Summer of Code.
A list of project suggestions is at:

Students should feel free to select one of these, develop a variation, or propose their own ideas! Some examples:

  • Develop a conference bridge or reference SIP client for CELT, the new, ultra-low delay audio codec that bridges the gap between Vorbis and Speex for applications where both high quality audio and low delay are desired. If you enjoy hacking on networks, you’ll have fun with these CELT projects.
  • Develop components to support all Ogg codecs for OpenMAX IL, the media plugins used in Maemo, Android and LIMO mobile devices. This touches on many interesting projects, and is perfect for anyone with an interest in mobile and embedded systems who wants a broad introduction to multimedia codecs.
  • Write a JavaScript Library for Subtitles, Captions and other time-aligned text. The main focus of this project is around enabling video accessibility for Ogg in Firefox. The project requires a student with experience in JavaScript development, HTML and CSS, but also with some understanding of C for liboggplay and libkate, and of C++ for Firefox.
  • Make a Proof of Concept for HTML5 Ogg Video support in the Chromium Browser, using liboggplay (our Ogg Theora playback library, as used in Mozilla Firefox). Full support for HTML5 <video> is a lot of work, but let’s get the ball rolling with a proof of concept for Theora frame decoding and rendering.
  • Add support for import and export of XSPF playlists to Songbird, the Mozilla-powered open music player. This project requires good XML foo, the opportunity to work with cross-platform XUL and JavaScript, and perhaps some C++.


The student application period starts on Monday (March 23) and runs for a little under 2 weeks, until Friday April 3.

Details of our application process are at:

Interested students *must* get involved with the project development community, on project mailing lists and IRC, before the application deadline. When selecting projects, preference will be given to students who have submitted at least one patch to a Xiph or Annodex project before the application deadline.

Students will receive a grant from Google for successful work on their GSoC projects. Hacking on free multimedia projects is fun and can have a big impact. We need students who love to hack, to help put support for free codecs into more applications, browsers and networks.

What is the raw format of time-aligned text?

My grant with Mozilla on exploring the state and possible ways forward for video accessibility on the Web is over. I have posted a detailed report in the Mozilla wiki, which you should read if you care about the details. It has been a very instructive time of analysis and I have learnt a lot about the needs and requirements for time-aligned text.

For example, I learnt that for many deaf people, our spoken language is just their second language while their first language is actually sign language, thus making it very important to allow for picture-in-picture display of sign-language video tracks in Web video.

I also learnt about more technical challenges, e.g. how difficult it may be to simply map the content of a linked Web resource into a current Web page when one cannot be certain about the security implications of this mapping. This will be of importance as we synchronise out-of-band time-aligned text (e.g. a subtitle file on a server) to a video file that is included in a Web page through an HTML5 <video> tag.

There are two large work items that need to be undertaken next in my opinion.

Firstly we have to create a collection of example files that explain the different categories of time-aligned text that were identified and their specific requirements. For example, the requirements of simple subtitle files are clear. But what about karaoke files? Or ticker-text? We need pseudo-code example files that explain the details of what people may want to display with such files.

I regard the DFXP test suite as one source of such files – the guys at the W3C TimedText working group have done a tremendous job at collecting the types of timed text documents that they want to be able to display.

Another source will be the examples directory in libkate, which implements OggKate, a format that I can envisage as the default encoding format for time-aligned text inside Ogg, because of its existing extensive code base and the principles with which OggKate was developed.

The second work item is more challenging and more fundamental to time-aligned text in Web browsers. We have to create a specification of how to represent time-aligned text inside Web browsers – basically the DOM and the API, but also what processing needs to be done to get the text there. I have a proposal on using a <text> element inside the <video> element to reference out-of-band time-aligned text resources. However, the question is what to do with them then.

The more I thought about this, the more the question reduced to finding the “raw format” of time-aligned text: when a Web browser decodes a time-aligned text file, what is its internal representation of it, its “raw” state? This will map to HTML, CSS, JavaScript, and other existing Web technology. But what is this minimal, “raw” representation? Text and graphics with positioning information, style information, timing information, state information, and potentially hyperlinks? Is that all?

These are the questions that I think need to be explored next.

In parallel we should start with an implementation of support for the simplest type of time-aligned text: plain SRT. The raw format for this is simple: just a series of text blocks, each with a start and an end time. Even though this is simple, it has no straightforward mapping into HTML since HTML does not understand time, so it can only be dealt with in JavaScript or through SVG. It may be time to include a simple concept of time into HTML. Let’s just avoid making it as complex as HTML+TIME!
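
To make the “series of text with start and end times” idea concrete, here is a minimal sketch in Python of what decoding SRT into such a raw form could look like. It is an illustration only, not what any browser does: the names `Cue`, `parse_srt` and `active_cues` are made up for this example, and the parser assumes well-formed SRT blocks (an optional counter line, a timing line, then text) separated by blank lines.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """Hypothetical 'raw' unit: a piece of text with a start and end time."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


# Timing line, e.g. "00:00:01,000 --> 00:00:04,000"
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert hour/minute/second/millisecond fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(data: str) -> List[Cue]:
    """Parse SRT text into a list of cues, skipping blocks without a timing line."""
    cues = []
    normalised = data.replace("\r\n", "\n").strip()
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.split("\n")
        # The timing line is the first line, or the second if a counter precedes it.
        for i, line in enumerate(lines[:2]):
            match = _TIMING.search(line)
            if match:
                fields = match.groups()
                cues.append(Cue(
                    start=_seconds(*fields[:4]),
                    end=_seconds(*fields[4:]),
                    text="\n".join(lines[i + 1:]),
                ))
                break
    return cues


def active_cues(cues: List[Cue], t: float) -> List[Cue]:
    """Return the cues that should be visible at playback time t (in seconds)."""
    return [cue for cue in cues if cue.start <= t < cue.end]
```

A player would call something like `active_cues` on every time update of the video and render whatever text comes back. The missing piece is exactly the one discussed above: HTML has no native notion of “show this text from second 1 to second 4”.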

Basic support for SRT in Firefox would be a first step towards a framework for more complicated time-aligned text. It would also give a large number of hearing-impaired and non-native viewers access to video, and take us a large step towards comprehensive media accessibility on the Web. I hope we can address this in 2009.