Category Archives: Digital Media

Attaching subtitles to HTML5 video

During the last week, I made a proposal to the HTML5 working group about how to support out-of-band time-aligned text in HTML5. What I mean by that is basically: how to link a subtitle file to a video tag in HTML5. This would mirror the way in which in desktop-players you can load separate subtitle files by hand to go alongside a video.

My suggestion is best explained by an example:


<video src="http://example.com/video.ogv" controls>
<text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
<text category="SUB" lang="de" type="application/ttaf+xml" src="german.dfxp"></text>
<text category="SUB" lang="jp" type="application/smil" src="japanese.smil"></text>
<text category="SUB" lang="fr" type="text/x-srt" src="translation_webservice/fr/caption.srt"></text>
</video>

  • “text” elements are subelements of the “video” element and therefore clearly related to one video (even if it comes in different formats).
  • the “category” tag allows us to specify what text category we are dealing with and allows the web browser to determine how to display it. The idea is that there would be default display for the different categories and css would allow to override these.
  • the “lang” tag allows the specification of alternative resources based on language, which allows the browser to select one by default based on browser preferences, and also to turn those tracks on by default that a particular user requires (e.g. because they are blind and have preset the browser accordingly).
  • the “type” tag allows specification of what actual time-aligned text format is being used in this instance; again, it will allow the browser to determine whether it is able to decode the file and thus make it available through an interface or not.
  • the “src” attribute obviously points to the time-aligned text resource. This could be a file, a script that extracts data from a database, or even a web service that dynamically creates the data
    based on some input.

This proposal provides for a lot of flexibility and is somewhat independent of the media file format, while still enabling the Web browser to deal with the text (as long as it can decode it). Also note that this is not meant as the only way in which time-aligned text would be delivered to the Web browser – we are continuing to investigate how to embed text inside Ogg as a more persistent means of keeping your text with your media.

Of course you are now aching to see this in action – and this is where the awesomeness starts. There are already three implementations.

First, Jan Gerber independently thought out a way to provide support for srt files that would be conformant with the existing HTML5 tags. His solution is at http://v2v.cc/~j/jquery.srt/. He is using javascript to load and parse the srt file and map it into HTML and thus onto the screen. Jan’s syntax looks like this:


<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="jquery.srt.js"></script>

<video src="http://example.com/video.ogv" id="video" controls>
<div class="srt"
data-video="video"
data-srt="http://example.com/video.srt" />

Then, Michael Dale decided to use my suggested HTML5 syntax and add it to mv_embed. The example can be seen here – it’s the bottom of the two videos. You will need to click on the “CC” button on the player and click on “select transcripts” to see the different subtitles in English and Spanish. If you click onto a text element, the video will play from that offset. Michael’s syntax looks like this:


<video src="sample_fish.ogg" poster="sample_fish.jpg" duration="26">
<text category="SUB" lang="en" type="text/x-srt" default="true"
title="english SRT subtitles" src="sample_fish_text_en.srt">
</text>
<text category="SUB" lang="es" type="text/x-srt"
title="spanish SRT subtitles" src="sample_fish_text_es.srt">
</text>
</video>

Then, after a little conversation with the W3C Timed Text working group, Philippe Le Hegaret extended the current DFXP test suite to demonstrate use of the proposed syntax with DFXP and Ogg video inside the browser. To see the result, you’ll need Firefox 3.1. If you select the “HTML5 DFXP player prototype” as test player, you can click on the tests on the left and it will load the DFXP content. Philippe actually adapted Jan’s javascript file for this. And his syntax looks like this:


<video src="example.ogv" id="video" controls>
<text lang='en' type="application/ttaf+xml" src="testsuite/Content/Br001.xml"></text>
</video>

The cool thing about these implementations is that they all work by mapping the time-aligned text to HTML – and for DFXP the styling attributes are mapped to CSS. In this way, the data can be made part of the browser window and displayed through traditional means.

For time-aligned text that is multiplexed into a media file, we just have to do the same and we will be able to achieve the same functionality. Video accessibility in HTML5 – we’re getting there!

Embedding time-aligned text into Ogg

As part of my accessibility work for Mozilla and Xiph, it is necessary to define how time-aligned text such as subtitles, captions, or annotations, are encapsulated into Ogg. In the fansubber community this is called “hard subtitles” as opposed to “soft subtitles” which are subtitles that stay in a text file and are loaded separately to the video file into a media player and synchronised with the video by the media player. (as per comment below, all text annotations are “soft” – or also “closed”.)

I can hear you ask: so how do I do subtitles/captions with Ogg now? Well, it would have been possible to simply choose one subtitling format and map that into Ogg, then ask everyone to just use that one format and be done. But which one to choose? And why prefer a simpler one over a more complex one? And why just do subtitles and not any other time-aligned text?

So, instead, I analysed what types of time-aligned text “codecs” I have come across. Each one would have a multitude of text formats to capture the text data, because it is easy to invent a new format and standardisation hasn’t really happened in this space yet.

I have come up with the following list of typical time-aligned text codecs:

  • CC: closed captions (for the deaf)
  • SUB: subtitles
  • TAD: textual audio descriptions (for the blind – to be transferred to braille or TTS)
  • KTV: karaoke
  • TIK: ticker text
  • AR: active regions
  • NB: metadata & semantic annotations
  • TRX: transcripts / scripts
  • LRC: lyrics
  • LIN: linguistic markup
  • CUE: cue points, DVD style chapter markers and similar navigational landmarks

Let me know if you can think of any other classes of video/audio-related time-aligned text.

All of these texts can be represented in text files with some kind of time marker, and possibly some header information to set up the interpretation environment. So, the simplest way of creating a representation of these inside Ogg was to define a generic mapping for time-aligned text into Ogg.

The Xiph wiki holds the current draft specification for mapping text codecs into Ogg. For anyone wanting to map a text codec into Ogg, this should provide the framework. The idea is to separate the text codec’s data into header data and into timed text segments (which can have all sorts of styling and other information with it). Then, the mapping is simple. An example for srt is described on the wiki page.

The specification is still in draft status, because we’re still expecting feedback. In fact, what we now need is people trying an implementation and providing fixes to the specification.

To map your text codec of choice into Ogg, you will probably requrie further mapping specifications. Dependent on how complex your text codec of choice is, these additional mapping specifications may be rather simple or quite complicated. In the case of srt, it should be trivial. Considering the massive amount of srt already freely available online, the srt mapping may well have a really large impact. Enough hits. Let me know if you’re coding up something!

My next duty is to look for a representation that is generic enough to provide representations for any of the above listed text codecs. This representation is what will need to be available to a Web Browser when working with a Web video that has related text. Current contenders are OggKate and W3C TimedText, but I am not sure if either are too restrictive. I am indeed looking for the next generation of captioning technology that will be able to provide any type of time-aligned text that relates to audio/video.

“Sorry, this video is no longer available”

Recently, I noticed an increasing number of videos on YouTube were no longer available – even if they had just been shared through a blog post by friends or even if they were the main video on a producer’s YouTube page, such as QuantumOfSolace.

I was suspicious for a while that there was something wrong with my browser, but when my colleague was able to play the video from the same network and I wasn’t, something had to be done.

I am running Firefox 3.0.4 on OS X 10.5.5 with the Flash 10.0 d26 plugin. First we thought it might be blocked for au.youtube.com and not for www.youtube.com, but there was no difference. Still the same “Sorry, this video is no longer available”.

Finally, an installation of the latest Flash plugin 10.0 r12 fixed the issue. So, if a large number of videos on YouTube isn’t available to you for no apparent reason, you might want to upgrade your Flash plugin.

News from the open media world

Today, there were so many news that I can only summarise them in a short post.

The guys from Collabora have announced that they are going to support the development of PiTiVi – one of the best open source video editors around. They are even looking to hire people to help Christian Schaller, the author of PiTiVi. The plan is to have a feature-rich video editor ready by April next year that is comparable in quality to basic proprietary video editors.

The BBC Dirac team have today announced a ffmpeg2dirac software package, which is built along the same lines as the commonly used ffmpeg2theora and of course transcodes any media stream to Ogg Dirac/Vorbis. With Ogg Dirac/Vorbis playback already available in vlc and mplayer, this covers the much needed creation side of Ogg Dirac/Vorbis files. Dirac is an open source, non-patent-encumbered video codec developed by the BBC. It creates higher quality video than Theora at comparable bitrates.

The FOMSFoundations of Open Media Software hacker workshop for open media software announced today the current list of confirmed participants for the January Workshop. It seems that this year we have a big focus on open video codecs, on browser support of media, on open Flash software, and on media frameworks. It is still possible to take part in the workshop – check out the CFP page.

Finally an important security message: Mozilla has decided to put a security measure around the HTML5 audio and video elements that will stop them from being exploited by cross-site scripting exploits. Chris Double explains the changes that are necessary to your setup to enable your published audio or video to be displayed on domains that are different to the domain on which these files are hosted.

Media fragment URI addressing

In the media fragment working group at the W3C, we are introducing a standard means to address fragments of media resources through URIs. The idea is to define URIs such as http://example.com/video.ogv#t=24m16s-30m12s, which would only retrieve the subpart of video.ogv that is of interest to the user and thus save bandwidth. This is particularly important for mobile devices, but also for pointing out highlights in videos on the Web, bookmarking, and other use cases.

I’d like to give a brief look into the state of discussion from a technical viewpoint here.

Let’s start by considering the protocols for which such a scheme could be defined. We are currently focusing on HTTP and RTSP, since they are open protocols for media delivery. P2P protocols are also under consideration, however, most of them are proprietary. Also, p2p protocols are mostly used to transfer complete large files, so fragment addressing may not be desired. RSTP already has a mechanism to address temporal fragments of media resources through a range parameter of the play request as part of the protocol parameters. Yet, there is no URI addressing scheme for this. Our key focus however is HTTP, since most video content nowadays is transferred over HTTP, e.g. YouTube.

Another topic that needs discussion are the types of fragmentation for which we will specify addressing schemes. At the moment, we are considering temporal fragmentation, spatial fragmentation, and fragmentation by tracks. In temporal fragmentation, a request asks for a time interval that is a subpart of the media resource (e.g. audio or video). In spatial fragmentation, the request is for an image region (e.g. in an image or a video). Track fragmentation addresses the issue where, e.g. a blind person would not require to receive the actual video data for a video and thus a user agent could request only the data tracks from the resource that are really required for the user.

Another concern is the syntax of URI addressing. URI fragments (“#”) have been invented to created URIs that point at so-called “secondary” resources. Per definition, a secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations. It is therefore the perfect syntax for media fragment URIs.

The only issue is that URI fragments (“#”) are not expected to be transferred from the client to the server (e.g. Apache strips it off the URI if it receives it). Therefore, in the temporal URI specification of Annodex we decided to use the query (“?”) parameter instead. This is however not necessary. The W3C working group is proposing to have the user agent strip off the URI fragment specification and transform it into a protocol parameter. For HTTP, the idea is to introduce new range units for the types of fragmentation that we will define. Then, the Range and Content-Range headers can be used to request and deliver the information about the fragmentation.

The most complicated issue that we are dealing with is the issue of caching in Web proxies. Existing Web proxies will not be able to understand new range units and will therefore not cache such requests. This is unfortunate and we are trying to devise two schemes – one for existing Web proxies and one for future more intelligent Web proxies – to enable proxy caching. This discussion has many dimensions – such as e.g. the ability to uniquely map time to bytes for any codec format, the ability to recompose new fragment requests from existing combined fragment requests, or the need and abilities for partial re-encoding. Mostly we are dealing with the complexities and restrictions of different codecs and encapsulation formats. Possibly, the idea of recomposition of ranges in Web proxies is too complex to realise and caching is best done by regarding each fragment as its own cacheable resource, but this hasn’t been decided yet.

We now have experts from the squid community, from YouTube/Google, HTTP experts, Web accessibility experts, SMIL experts, me from Annodex/Xiph, and a more people with diverse media backgrounds in the team. It’s a great group and we are covering the issues from all aspects. The brief update above is given from my perspective, and only lists the key issues superficially, while the discussions that we’re having on the mailing list and in meetings are much more in-depth.

I am not quite expecting us to meet the deadline of having a first working draft before the end of this month, but certainly before Christmas.

Theora 1.0 released!

While the open source codec “Theora” has been available since 2004 in a stable format, the open source community is very careful about giving any piece of software the “1.0” stamp of quality and libtheora has been put under scrutiny for years.

Today, libtheora 1.0 was finally released – rejoice and go ahead using it in production!

More hard-core improvements to libtheora are also in the pipeline under a version nick-named “Thusnelda”, improving mostly on quality and bit-rate.

W3C Technical Plenary / Advisory Committee Meetings Week 2008

I spent last week in France, near Cannes, at the W3C TPAC meeting. This is the one big meeting that the W3C has every year to bring together all (or most) of the technical working groups and other active groups at the W3C.

It was not my first time at a standards body meeting – I have been part of ISO/MPEG before and also of IETF, and spoken with people at IEEE and SMPTE. However, this time was different. I felt like I was with people that spoke my language. I also felt like my experience was valued and will help solving some of the future challenges for the Web. I am very excited to be an invited expert on the Media Fragments and Media Annotations working groups and be able to provide input into HTML5.

In the Media Fragments working group we are developing a URI addressing scheme that enables direct linking to media fragments, in particular temporal and spatial segments. Experience from our earlier temporal URI scheme is one of the inputs to the scheme. Currently it looks likely that we will choose a scheme that has “#” in it and then require changes to browsers, Web proxys, and servers to enable delivery of media fragments.

In the Media Annotations working group we are deciding upon an ontology to generically describe media resources – something based on Dublin Core but more extended and more appropriate for audio and video. We are currently looking at Adobe’s XMP specification.

As for HTML5 – there was not much of a discussion at the TPAC meeting about the audio and video elements (unless I missed it by attending the other groups). However, from some of the discussions it became clear to me that they are still in very early stage of specification and much can be done to help define the general architecture of how to publish video on the Web and its metadata, help define javascript APIs and DOM models, and help define accessibility.

I actually gave a lightning talk about the next challenges of HTML5 video at TPAC (see my “video slides“) which points out the need for standard definitions of video structure and annotations together with an API to reach them. I had lots of discussions with people afterwards and also learnt a lot more about how to do accessibility for Web video. I should really write it up in an article…

Of course, I also met a lot of cool people at TPAC, amongst them Larry Masinter, Ian Hickson, and Tim Berners-Lee – past and new heros of Web standards. 🙂 It was totally awesome and I am very grateful to Mozilla for sending me there and enabling me to learn more about the greater picture of video accessibility and the role it plays on the Web.

Demo of new HTML5 features

Ian Hickson, the main editor of the new HTML5 specification, gave a talk about some of the cool new features in HTML5 and some of the early implementations of these features in different browsers.

It’s pretty long demo with 1:25 hrs but he types in all the code manually, so you can re-do all of the demos yourself. The script of the talk with code examples is here.

The first 5 minutes are about the new video element and really worth watching.

Also, at 1:11 hrs Ian is asked about the choice of baseline codecs, in case you want to hear him speak what he has publicly written elsewhere.

I can’t wait to marry the video features with:

  1. the new media fragment addressing schemes in development at the W3C
  2. captions, subtitles and other timed text annotations for videos.

These will allow us search for specific topics directly inside the video (such as “form controls” in Ian’s video) and to hyperlink straight into these time offsets. A completely new world is coming!

Video Accessibility for Firefox

Ogg has struggled for the last few years to recommend the best format to provide caption and subtitle support for Ogg Theora. The OGM fork had a firm focus on using subtitles in SRT, SSA or VobSub format. However, in Ogg we have always found these too simplistic and wanted a more comprehensive solution. The main aim was to have timed text included into the video stream in a time-aligned fashion. Writ, CMML, and now Kate all do this. And yet, we have still not defined which is the one format that we want everybody to support as the caption/subtitle format.

With Ogg Theora having been chosen by Mozilla as the baseline video codec for Firefox and the HTML5 <video> tag, Mozilla is looking to solve this problem in a community fashion: the solution needs to be acceptable to Xiph, supported by Opera who are also experimenting with Ogg Theora, and ultimately provide a proposal to the W3C and WHATWG that can sensibly be included into HTML5.

As a first step in this direction, Mozilla have contracted me to analyse the situation and propose a way forward.

The contract goes beyond simple captions and subtitles though: it analyses all accessibility requirements for video, which includes audio annotations for the blind, sign language video tracks, and also transcripts, karaoke, and metadata tracks as more generic timed text example tracks. The analysis will thus be about how to enable a framework for creating a timed text track in Ogg and which concrete formats should be supported for each of the required functionalities.

While I can do much of the analysis myself, a decision on how to move forward can only be made with lots of community input. The whole process of this analysis will therefore be an open one with information being collected on the Mozilla Wiki, see https://wiki.mozilla.org/Accessibility/Video_Accessibility .

An open mailing list is also set up at Xiph to create a discussion forum for video accessibility: accessibility@lists.xiph.org. Join there if you’d like to provide input. I am particularly keen for people with disabilities to join because we need to get it right for them!

I am very excited about this project and feel honoured for being supported to help solve accessibility issues for Ogg and Firefox! Let’s get it right!