Category Archives: video accessibility

Demo of deep hyperlinking into HTML5 video

In an effort to demonstrate some of the capabilities of the W3C Media Fragment WG specification, I implemented an HTML5 page with a video element that reacts to fragment offset changes in both the URL bar and the <video> element.

Demo Features

The demo can be found on the Annodex Web server. It has the following features:

If you simply load that Web page, you will see the video jump to an offset because its video source is referenced as “elephants_dream/elephant.ogv#t=20”.

If you change or add a temporal fragment in the URL bar, the video jumps to this time offset and overrules the video’s fragment addressing. (This only works in Firefox 3.6, see below – in older Firefoxes you actually have to reload the page for this to happen.) This functionality is similar to a time linking functionality that YouTube also provides.

When you hit the “play” button on the video and let it play a bit before hitting “pause” again, the second at which you hit “pause” is displayed in the page’s URL bar. In Firefox, this even adds an entry to the browser’s history, so you can jump back to the previous pause position.

Three input boxes allow for experimentation with different functionality.

  • The first one contains a link to the current Web page with the media fragment for the current video playback position. This text is displayed for cut-and-paste purposes, e.g. to send it in an email to friends.
  • The second one is an entry box which accepts float values as time offsets. Once entered, the video will jump to the given time offset. The URL of the video and the page URL will be updated.
  • The third one is an entry box which accepts a video URL that replaces the <video> element’s @src attribute value. It is meant for experimentation with different temporal media fragment URLs as they get loaded into the <video> element.

Javascript Hacks

You can look at the source code of the page – all the javascript in use is actually at the bottom of the page. Here are some of the juicy bits of what I’ve done:

Since Web browsers do not yet natively parse and react to media fragment URIs, I implemented this in javascript. Once the video is loaded, i.e. the “loadedmetadata” event fires on the video, I parse the video’s @currentSrc attribute and jump to a time offset if one is given. I use @currentSrc because it is the URL that the video element ends up using after having parsed the @src attribute and all the contained <source> elements (if they exist). This function is also called when the video’s @src attribute is changed through javascript.
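
Here is a minimal sketch of that approach, assuming a helper I call parseTimeFragment that only handles the plain “#t=<seconds>” case:

var video = document.getElementsByTagName("video")[0];

// extract a start time in seconds from a URL containing "#t=<seconds>"
function parseTimeFragment(url) {
  var match = /#t=([\d.]+)/.exec(url);
  return match ? parseFloat(match[1]) : null;
}

video.addEventListener("loadedmetadata", function() {
  // @currentSrc is the URL the browser settled on after <source> selection
  var offset = parseTimeFragment(video.currentSrc);
  if (offset !== null) {
    video.currentTime = offset;
  }
}, false);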

This is the only bit from the demo that the browsers should do natively. The remaining functionality hooks up the temporal addressing for the video with the browser’s URL bar.

To display a URL in the URL bar that people can cut and paste to send to their friends, I hooked up the video’s “pause” event with an update to the URL bar. If you are jumping around through javascript calls to video.currentTime, you will also have to make these changes to the URL bar.

Finally, I am capturing the window’s “hashchange” event, which is new in HTML5 and only implemented in Firefox 3.6. This means that if you change the temporal offset on the page’s URL, the browser will parse it and jump the video to the offset time.
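
A sketch of these two hooks, reusing the parseTimeFragment helper from above:

video.addEventListener("pause", function() {
  // write the current playback position into the URL bar;
  // in Firefox this also adds an entry to the browser history
  window.location.hash = "#t=" + video.currentTime;
}, false);

// "hashchange" currently only fires in Firefox 3.6
window.addEventListener("hashchange", function() {
  var offset = parseTimeFragment(window.location.href);
  if (offset !== null) {
    video.currentTime = offset;
  }
}, false);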

Optimisation

Jumping around in a video like this can be very slow when the seeking happens on the remote server. Firefox actually implements seeking over the network, which in the case of Ogg can require multiple jumps back and forth on the remote video file with byte range requests to locate the correct offset.

To minimise the seeking effort Firefox has to make, I followed Mozilla’s very useful help page on speeding up video. It recommends delivering the X-Content-Duration HTTP header from your Web server. For Ogg media, this can be provided through the oggz-chop CGI. Since I didn’t want to install it on my Apache server, I hard-coded X-Content-Duration in a .htaccess file in the directory that serves the media file. The .htaccess file looks as follows:

<Files "elephant.ogv">
Header set X-Content-Duration "653.791"
</Files>

This helps Firefox avoid the extra seek that is otherwise necessary to determine the video’s duration, and lets it display the transport bar faster.

I also added the @autobuffer attribute to the <video> element, which should make the complete video file available to the browser and thus speed up seeking enormously, since seeking can then happen on the local copy without further network requests.
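
For completeness, the video element in the demo then looks something like this:

<video src="elephants_dream/elephant.ogv#t=20" controls autobuffer></video>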

ToDos

This is only a first and very simple demo of media fragments and video. I have not made an effort to capture any errors or to parse a URL that is more complicated than simply containing “#t=”. Feel free to report any bugs to me in the comments or send me patches.

Also, I have not made an effort to use time ranges, which are part of the W3C Media Fragment spec. This should be simple to add, since it just requires stopping video playback at the given end time.

Also, I have only implemented parsing of the simplest default time spec, given in plain seconds. None of the more complicated npt, smpte, or clock specifications have been implemented yet.

The possibilities for deeper access to video and for improved video accessibility with these URLs are vast. Just imagine hooking up the caption elements of e.g. an srt file with temporal hyperlinks and you can provide deep interaction between the video content and the captions. You could even drive this to the extreme and jump between single words if you mark up each with its time relationship. Happy experimenting!

UPDATE: I forgot to mention that it is really annoying that the video has to be re-loaded when the @src attribute is changed, even if only the hash changes. As support for media fragments gets implemented in <video> and <audio> elements, it would be advantageous if the “load()” function checked whether only the hash changed and did not re-load the full resource in that case.

Thanks go to Chris Double and Chris Pearce from Mozilla for their feedback and suggestions for improvement on an early version of this.

Media Fragment addressing into a live stream

A few months back, Thomas reported on a cool flumotion experiment that he hacked together which allows jumping back in time on a live video stream.

Thomas used a URI scheme with a negative offset to do the jumping back on the http stream:
http://localhost:8800?offset=-120

John left a comment pointing to current work being done in the W3C on Media Fragment addressing, but had to note that despite Annodex’s temporal URIs having a live stream addressing feature, the new W3C draft didn’t accommodate such a use case.

We got to work in the working group and I am very happy to announce that as of today there is now a draft specification for addressing time offsets by wall-clock time.

Say you are watching Thomas’ live stream from above at http://localhost:8800 and you want to jump back by 2 min. Your player would grab the current streaming time, e.g. 2009-08-26T12:34:04Z, and subtract the two minutes, giving 2009-08-26T12:32:04Z. The player would then use this to tell your streaming server to jump back by two minutes using this URL:
http://localhost:8800#t=clock:2009-08-26T12:32:04Z.
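
In javascript, the calculation might look as follows – a minimal sketch that approximates the current streaming time with the client clock:

var streamUrl = "http://localhost:8800";

// current streaming time, e.g. 2009-08-26T12:34:04Z
var now = new Date();

// subtract the two minutes
var target = new Date(now.getTime() - 2 * 60 * 1000);

// format as an ISO 8601 UTC timestamp without milliseconds
var clock = target.toISOString().replace(/\.\d{3}Z$/, "Z");

var fragmentUrl = streamUrl + "#t=clock:" + clock;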

Or another example would be: you had a stream running all day from a conference and you want to go back to a particular session. You know that it was between 10am and 11am German time (UTC+2 right now). Then your URL would be as follows:
http://conference:8800#t=clock:2009-08-26T10:00+02:00,2009-08-26T11:00+02:00

Now if only there was an implementation… 🙂

ARIA – A Brief Introduction

Since working on video accessibility, I have felt rather inadequate not knowing exactly how general Web accessibility works, in particular ARIA. I have been pointed at the W3C WAI-ARIA primer, best practices, and WD specification, but found them almost impossible to read.

If you are looking for a document that gets right to the point, I can recommend Opera’s Introduction to WAI ARIA. It tells you what attributes there are and how to use them. More in-depth information is available in the W3C WAI-ARIA best practices. Here’s my little summary of what I learnt.

Getting straight to the point: ARIA mostly cares about giving screen control to the keyboard (away from the mouse) and about exposing semantic information, such that vision-impaired people have a way to interact with Web content and screen readers can read out useful information.

Basic keys
The basic keys in use for accessibility are the tab/shift+tab, arrow, enter, space and escape keys.

Keyboard Focus: tabbing
Normal tabbing includes form controls and anchors. This can be overruled with the tabindex attribute.

Adding a tabindex=0 to an element adds it to the tab order at the position in which it appears in the document. With a tabindex in the range [1,32767] you can place any element into a desired tab order – lowest numbers first.

Adding a tabindex=-1 to an element removes it from the tabbing order, but you can still give it keyboard focus through javascript, e.g. for the subelements of a menu. The aria-activedescendant attribute tells assistive technology which of a list of descendants is the active one.
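
A small markup sketch of these tabbing attributes (the ids are made up for illustration):

<!-- part of the natural tab order, at its document position -->
<div tabindex="0">Custom widget</div>

<!-- removed from tabbing, but focusable through javascript -->
<ul id="menu" tabindex="-1" aria-activedescendant="item2">
  <li id="item1">Open</li>
  <li id="item2">Save</li>
</ul>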

Navigation Landmarks: roles
Screenreaders have a problem with expressing what the functionality of elements is – normally they can only read out the name of the element.

This is where the role attribute comes in. It provides semantic meaning, e.g. “slider” instead of “input” element.

ARIA has a large number of pre-defined roles. They are listed in the spec – each role has additional attributes to provide more assistive information – mostly state information on the particular element.
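
For example, a custom slider control might be marked up as follows – a sketch in which the aria-value* attributes carry the state information mentioned above:

<div role="slider" tabindex="0"
     aria-valuemin="0" aria-valuemax="100" aria-valuenow="40">
</div>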

Live updated content: aria-live
When data is updated somewhere on screen, often assistive technology doesn’t get to know about it.

Regions that are marked with the aria-live attribute will be read out even if the user is focused on another part of the screen at that point.
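
A sketch of such a region – “polite” waits for a pause in the screen reader’s output, while “assertive” interrupts it:

<div aria-live="polite">
  Latest score: 2 - 1
</div>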

Form input: aria-required
For screen readers it is not obvious if a form element’s entry is a required or optional entry. Add an aria-required attribute to the form entry element and your screen reader will tell you.
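
For example:

<label for="email">Email address</label>
<input type="text" id="email" aria-required="true">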

Labels and descriptions: aria-labelledby / aria-describedby
Most often the description or label for a page area already sits elsewhere on screen, but with no obvious relationship to the element other than visual proximity.

A screenreader can be told about the relationship through the aria-labelledby / aria-describedby attributes, which link an element to such an area through that area’s id attribute.
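
A sketch, with made-up ids:

<h2 id="newsHeading">News</h2>
<p id="newsSummary">The latest headlines, updated hourly.</p>
<div role="region" aria-labelledby="newsHeading" aria-describedby="newsSummary">
  ...
</div>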

Is that all?
Yes, I think that’s essentially all. It’s not particularly difficult, but it has a high impact on accessibility. I hope your take-away is as big as mine!

BTW: WAI ARIA is written for good old HTML4, not HTML5. However, there are synchronisation activities under way and WAI ARIA attributes will still be relevant to HTML5. Some of the roles will become unnecessary with the new elements available in HTML5 – see a draft mapping of HTML5 elements to ARIA implicit roles in Henry’s excellent document, but it seems the tabbing order, live regions, and the role attribute are here to stay.

Jumping to time offsets in HTML5 video

For many years now I have been pushing for a deeper view of video on the Web than just a binary blob. We need direct access to time offsets and sections of videos.

Such direct access can be achieved either by providing a javascript interface through which a video’s playback position can be controlled, or by using URLs that directly communicate with the Web server about controlling the playback position. I will explain the approaches that can be applied to the HTML5 <video> tag for such deep video interaction.

Controlling a video’s playback with javascript

currentTime

Right now, you can use the video element’s “currentTime” property to read and set the current playback position of a video resource. This is very useful to directly jump between different sections in the video, such as exemplified in the BBC’s recent R&D TV demo. To jump to a time offset in a video, all you have to do in javascript is:


var video = document.getElementsByTagName("video")[0];
video.currentTime = starttimeoffset;

timeupdate

Further, if you want to stop playback at a certain time point, you can use another functionality of the HTML5 <video> tag: the “timeupdate” event:


video.addEventListener("timeupdate", function() {
if (video.currentTime >= endtimeoffset) {
video.pause();
}}, false);

When the “timeupdate” event fires, which is supposed to happen at intervals of at most 250ms during playback, you can catch the end of your desired interval fairly accurately.

setTimeout / setInterval

Alternatively to using the “timeupdate” event that is provided by the <video> tag, there is always the possibility of using the javascript “setTimeout” or “setInterval” functions:


// pass a function rather than the result of calling video.pause()
setTimeout(function() { video.pause(); }, (endtimeoffset - starttimeoffset)*1000);

The “setTimeout” function is used to call a function or evaluate an expression after a specified number of milliseconds. So, you’d have to call this straight after starting the playback at the given starttimeoffset.

If instead you wanted something to happen at a frequent rate in parallel to the video playback (such as check if you need to display a new ad or a new subtitle), you could use the javascript setInterval function:


setInterval( function() {displaySubtitle(video.currentTime);}, 100);

The “setInterval” function is used to call a function or evaluate an expression at the specified interval. So, in the given example, every 100ms the function tests whether a new subtitle needs to be displayed for the video’s current playback time.

Note that for subtitles it makes a lot more sense to use the existing “timeupdate” event of the video rather than setting up a frequent setInterval interrupt, since setInterval will continue calling the function until clearInterval() is called or the window is closed. Also, the BBC found in experiments with Firefox that “timeupdate” is more accurate than polling “currentTime” regularly.
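
Here is the subtitle hook rewritten on top of “timeupdate”, with displaySubtitle being the same hypothetical helper as above:

video.addEventListener("timeupdate", function() {
  // fires regularly during playback, at intervals of at most 250ms
  displaySubtitle(video.currentTime);
}, false);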

Controlling a video’s playback through a URL

There are some existing example implementations that control a video’s playback time through a URL.

In 2001, in the Annodex project, we proposed temporal URIs and implemented the spec for Ogg content. This is now successfully in use at Metavid.org, where it is very useful since Metavid handles very long videos for which direct access to subsections is critical. A URL such as http://metavid.org/wiki/Stream:Senate_proceeding_02-13-09/0:05:40/0:47:29 works well to directly view that segment.

More recently, YouTube rolled out a URI scheme to directly jump to an offset in a YouTube video, e.g. http://www.youtube.com/watch?v=PjDw3azfZWI#t=31m09s. While most YouTube content is short form, and such direct access may not make much sense for a video of less than 2 min duration, some YouTube content is long enough to make this a very useful feature.

You may have noticed that the YouTube use of URIs for jumping to offsets is slightly different to the one used by Metavid. The YouTube video will be displayed as always, but the playback position in the video player changes based on the time offset. The Metavid video in contrast will not display a transport bar for the full video, but instead only present the requested part of the video with an appropriate localised keyframe.

Having realised the need for such URLs, the W3C created a Media Fragments working group.

Proposed Time schemes

For temporal addressing, it currently proposes the following schemes:


t=10,20
t=npt:10,20

t=120s,121.5s
t=npt:120,0:02:01.5

t=smpte-30:0:02:00,0:02:01:15
t=smpte-25:0:02:00:00,0:02:01:12.1

t=clock:20090726T111901Z,20090726T121901Z

If there is no time scheme given, it defaults to “npt”, which stands for “normal playback time”. It is basically a time offset given in seconds, but can be provided in a few different formats.

If a “smpte” scheme is given, the time code is provided in the way in which DVRs display time codes, namely according to the SMPTE timecode standard.

Finally, a “clock” time scheme can be given. This is relevant in particular to live streaming applications, which would like to provide a URL under which a live video is provided, but also allow the user to jump back in time to previously streamed data.

Fragments and Queries

Further, the W3C Media Fragment Working Group is discussing the use of both URI addressing schemes for time offsets: fragments (“#”) and queries (“?”).

The important difference is that queries produce a new resource, while fragments provide a sub-resource.

This means that if you load a URI such as http://www.example.org/video.ogv?t=60,100 , the resulting resource is a video of duration 40s. Since the URL relates to the full resource, one could expect the user agent (i.e. the web browser) to display a timeline of 60-100 rather than 0-40 – after all, the browser could just get this out of the URL. However, it is essentially a new resource and could therefore just be regarded as a different video.

If instead you load a URI such as http://www.example.org/video.ogv#t=60,100, the user agent recognises http://www.example.org/video.ogv as the resource and knows that it is supposed to display the 40s extract of that resource. Without any special server support, the browser could implement this using the currentTime property and the “timeupdate” event.
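
A sketch of such a client-side fragment implementation, again limited to the plain “#t=<start>,<end>” case:

var video = document.getElementsByTagName("video")[0];
var start = null, end = null;

video.addEventListener("loadedmetadata", function() {
  // extract start and end times from the media fragment
  var match = /#t=([\d.]+),([\d.]+)/.exec(video.currentSrc);
  if (match) {
    start = parseFloat(match[1]);
    end = parseFloat(match[2]);
    video.currentTime = start;   // jump to the fragment start
  }
}, false);

video.addEventListener("timeupdate", function() {
  if (end !== null && video.currentTime >= end) {
    video.pause();               // stop at the fragment end
  }
}, false);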

An optimisation should, however, be made on this latter fragment delivery so that a user does not have to wait for the beginning of the resource to download before playback starts: Web servers should be expected to implement a server extension that can deal with such offsets and then deliver from the time offset rather than from the beginning of the file.

How this is communicated to the server – what extra headers or http communication mechanisms should be used – is currently under discussion at the W3C Media Fragments working group.

The different aspects of video accessibility

In the last week, I have received many emails replying to my request for feedback on the video accessibility demo. Thanks very much to everyone who took the time.

Interestingly, I got very little feedback on the subtitle and textual audio annotation aspects of my demo, even though they were the key aspects of my analysis. It’s my own fault, however, because I chose a good-looking video player skin over an accessible one.

This is where I need to take a step back and explain about the status of HTML5 video and its general accessibility aspects. Some of this is a repetition of an email that I sent to the W3C WAI-XTECH mailing list.

Browser support of HTML5 video

The HTML5 video tag is still a rather new tag that has not been implemented in all browsers yet – and not all browsers support the Ogg Theora video codec that my demo uses. Only the latest Firefox 3.5 release will support my demo out of the box. For Chrome and Opera you will have to use the latest nightly build (and I am not even sure those are publicly available). IE does not support it at all. For Safari/Webkit you will need the latest release and the XiphQT QuickTime component installed to provide support for the codec.

My recommendation is clearly to use Firefox 3.5 to try this demo.

Standardisation status of HTML5 video

The standardisation of the HTML5 video tag is still in progress. Some of the attributes have not been validated through implementations, some of the use cases have not been turned into specifications, and – most importantly for the topic of interest here – there has been very little experimentation with accessibility around the HTML5 video tag.

Accessibility of video controls

Most of the comments that I received on my demo were concerned with the accessibility of the video controls.

In HTML5 video, there is an attribute called @controls. If it is present, the browser is expected to display default controls on top of the video. Here is what the current specification says:

“This user interface should include features to begin playback, pause playback, seek to an arbitrary position in the content (if the content supports arbitrary seeking), change the volume, and show the media content in manners more suitable to the user (e.g. full-screen video or in an independent resizable window).”

In Firefox 3.5, the controls attribute currently creates the following controls:

  • play/pause button (toggles between the two)
  • slider for current playback position and seeking (also displays how much of the video has currently been downloaded)
  • duration display
  • roll-over button for volume on/off and to display slider for volume
  • AFAIK fullscreen is not currently implemented

Further, the HTML5 specification prescribes that if the @controls attribute is not available, “user agents may provide controls to affect playback of the media resource (e.g. play, pause, seeking, and volume controls), but such features should not interfere with the page’s normal rendering. For example, such features could be exposed in the media element’s context menu.”

In Firefox 3.5, this has been implemented with a right-click context menu, which contains:

  • play/pause toggle
  • mute/unmute toggle
  • show/hide controls toggle

When the controls are being displayed, there are keyboard shortcuts to control them:

  • space bar toggles between play and pause
  • left/right arrow winds video forward/back by 5 sec
  • CTRL+left/right arrow winds video forward/back by 60sec
  • HOME+left/right jumps to beginning/end of video
  • when focused on the volume button, up/down arrow increases/decreases volume

As for the exposure of these controls to screen readers, Mozilla implemented this in June – see Marco Zehe’s blog post on it. It implies having to use focus mode for now, so if you haven’t been able to use the keyboard to control the video element yet, that may be the reason.

New video accessibility work

My work is actually meant to take video accessibility a step further and explore how to deal with what I call time-aligned text files for video and audio. For the purposes of accessibility, I am mainly concerned with subtitles, captions, and audio descriptions that come in textual form and should be read out by a screen reader or made available to braille devices.

I am exploring both time-aligned text that comes within a video file and text that is available as an external Web resource and just associated with the video through HTML. It is this latter use case that my demo explored.

To create a nice looking demo, I used a skin for the video player that was developed by somebody else. Now, I didn’t pay attention to whether that skin was actually accessible and this is the source of most of the problems that have been mentioned to me thus far.

A new, simpler demo

I have now developed a new demo that uses the default player controls, which should be accessible as described above. I hope that the extra button that I implemented for the menu with all the text tracks is now accessible through a screen reader, too.

UPDATE: Note that there is currently a bug in Firefox that prevents tabbing to the video element from working. This will be possible in future.

First experiments with itext

My accessibility work for Mozilla is showing first results.

I have now implemented a demo for the previously proposed <itext> element. During the development process, the specification became more concrete.

I’m sure you’re keen to check out the demo.

Please note the following features of the demo:

  • It experiments with four different types of time-aligned text: subtitles, captions, chapters, and textual audio annotations.
  • It extends the video controls by a menu button for the time-aligned text tracks. This enables the user to switch between different languages for the different tracks.
  • The textual audio annotations are mapped into an aria-live activated div element, such that they are indeed read out by screen readers; this div sits behind the video, invisible to everyone else (see the sketch after this list).
  • The chapters are displayed as text on top of the video.
  • The subtitles and captions are displayed as overlays at the bottom of the video.
  • The display styles and positions are supposed to be default display mechanisms for these kinds of tracks, that could be overwritten by the stylesheet of a Web developer, who intends to place the text elsewhere on screen.
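
The aria-live hookup is roughly sketched below; the element names, the styling and the getAnnotationFor lookup are illustrative rather than the demo’s actual code:

<!-- visually behind the video, but announced by screen readers -->
<div id="annotations" aria-live="assertive"
     style="position: absolute; z-index: -1;"></div>

<script>
  var video = document.getElementsByTagName("video")[0];
  var annotations = document.getElementById("annotations");

  video.addEventListener("timeupdate", function() {
    // getAnnotationFor() is a hypothetical lookup into the
    // time-aligned text track for the current playback time
    var text = getAnnotationFor(video.currentTime);
    if (text) {
      annotations.textContent = text;  // updating the div triggers the live region
    }
  }, false);
</script>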

In order to “hear” the textual audio annotations work, you will need to install a screen reader such as JAWS, NVDA, or the firevox plugin on the Mac.

As far as I am aware, this is the first demo of HTML5 video accessibility that includes support for the vision-impaired, hearing-impaired, and also for foreign language speakers.

There have been initial discussions about this proposal, the results of which are captured in the wiki page. I expect a lot more heated discussion will happen on the WHATWG mailing list when I post it soon. I am well aware that probably most of the javascript API will need to be changed, and also some of the HTML.

Also please note that there are still some bugs left in the software, which should not inhibit the discussion at this stage. We will definitely develop a newer and better version.

I am particularly proud that I was able to make this work in the experimental builds of Opera and Chrome, as well as in Safari with XiphQT installed, and of course in Firefox 3.5.

Screenshot of the first itext video player experiment

More video accessibility work

It’s already old news, but I am really excited about having started a new part-time contract with Mozilla to continue pushing the HTML5 video and audio elements towards accessibility.

My aim is two-fold: firstly to improve the HTML5 audio and video tags with textual representations, and secondly to hook up the Ogg file format with these accessibility features through an Ogg-internal text codec.

The textual representation that I am after is closely based on the itext elements I have been proposing for a while. They are meant to be a simple way to associate external subtitle/caption files with the HTML5 video and audio tags. I am initially looking at srt and DFXP formats, because I think they are extremes of a spectrum of time-aligned text formats from simple to complex. I am preparing a specification and javascript demonstration of the itext feature and will then be looking for constructive criticism from accessibility, captioning, Web, video and any other expert who cares to provide input. My hope is to move the caption discussion forward on the WHATWG and ultimately achieve a cross-browser standard means for associating time-aligned text with media streams.

The Ogg-internal solution for subtitles – and more generally for time-aligned text – is then a logical next step towards solving accessibility. From the many discussions I have had on the topic of how best to associate subtitles with video, I have learnt that there is a need for both: external text files with subtitles, as well as subtitles that are multiplexed with the media into a single binary file. Here, I am particularly looking at the Kate codec as a means of multiplexing srt and DFXP into Ogg.

Eventually, the idea is to have a seamless interface in the Web Browser for dealing with subtitles, captions, karaoke, timed metadata, and similar time-aligned text. The user interaction should be identical no matter whether the text comes from within a binary media file or from a secondary Web resource. Once this seamless interface exists, hooking up accessibility tools such as screen readers or braille devices to the data should in theory be simple.

Javascript libraries for support

Now that Firefox 3.5 is released with native HTML5 <video> tag support, it seems that there is a new javascript library every day that provides fallback mechanisms for older browsers or those that do not support Ogg Theora.

This blog post collects the javascript libraries that I have found thus far, grouped by purpose, so you can pick the one most appropriate for you. Be aware that the list is probably already outdated as I post this article, so if you could help me keep it up-to-date with comments, that would be great. 🙂

Before I dig into the libraries, let me explain how fallback works with <video>.

Generally, if you’re using the HTML5 <video> element, your fallback mechanism for browsers that do not support <video> is the HTML code you write inside the <video> element. A browser that supports the <video> element will not display that content, while all other browsers will:


<video src="video.ogv" controls>
Your browser does not support the HTML5 video element.
</video>

To do more than just text, you can provide a video fallback. There are basically two options: you can fall back to a Flash solution:


<video src="video.ogv" controls>
<object width="320" height="240">
<param name="movie" value="video.swf">
<embed src="video.swf" width="320" height="240">
</embed>
</object>
</video>

or, if you are using Ogg Theora and don’t want to create a video in a different format, you can fall back to using the Java player called Cortado:


<video src="video.ogv" controls width="320" height="240">
<applet code="com.fluendo.player.Cortado.class" archive="http://theora.org/cortado.jar" width="320" height="240">
<param name="url" value="video.ogv"/>
</applet>
</video>

Now, even if your browser supports the <video> element, it may not be able to play the video format of your choice. For example, Firefox and Opera only support Ogg Theora, while Safari/Webkit supports MPEG4 and other codecs that the QuickTime framework supports, and Chrome supports both Ogg Theora and MPEG4. For this situation, the <video> element has an in-built selection mechanism: you do not put a “src” attribute onto the <video> element, but rather include <source> elements inside <video>, which the browser will try one after the other until it finds one it can play:


<video controls width="320" height="240">
<source src="video.ogv" type="video/ogg" />
<source src="video.mp4" type="video/mp4" />
</video>

You can of course combine all the methods above to optimise the experience for your users, which is what has been done in this and this (Video For Everybody) example without the use of javascript. I actually like these approaches best and you may want to check them out before you consider using a javascript library.
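
If you do go the javascript route, the core of what most of these libraries do is feature detection. A minimal sketch, in which the two fallback functions are hypothetical placeholders:

var probe = document.createElement("video");

if (!probe.canPlayType) {
  // no <video> support at all - fall back to Flash or Cortado
  useFallbackPlayer();
} else if (probe.canPlayType("video/ogg") === "") {
  // <video> exists, but Ogg Theora is not supported
  // (canPlayType returns "", "maybe" or "probably")
  useMp4Source();
}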

But now, let’s look at the promised list of javascript libraries.

Firstly, let’s look at some libraries that let you support more than just one codec format. These allow you to provide video in the format most preferable for the given browser/media framework/OS combination. Note that you will need to encode and provide your videos in multiple formats for these to work.

  • mv_embed: this is probably the library that has been around the longest to provide <video> fallback mechanisms. It has evolved heaps over the last years and now supports Ogg Theora and Flash fallbacks.
  • several posts that demonstrate how to play flv files in a <video> tag.
  • html5flash: in addition to Ogg Theora and MPEG4 codec support, this provides Flash support in the HTML5 video element through a chromeless Flash video player. It also exposes the <video> element’s javascript API to Flash content.
  • foxyvideo: provides a fallback flash player and a JavaScript library for HTML5 video controls that also includes a nearly identical ActionScript implementation.

Finally, let’s look at some libraries that are only focused around Ogg Theora support in browsers:

  • Celt’s javascript: a minimal javascript that checks for native Ogg Theora <video> support and the VLC plugin, and falls back to Cortado if nothing else works.
  • stealthisfilm’s javascript: checks for native support, VLC, liboggplay, Totem, any other Ogg Theora player, and cortado as fallback.
  • Wikimedia’s javascript: checks for QuickTime, VLC, native, Totem, KMPlayer, Kaffeine and Mplayer support before falling back to Cortado support.

Open Video Conference Working Group: HTML5 and <video>

At the recent Open Video Conference, I was asked to chair a working group on HTML5 and the <video> tag. Since the conference had attracted a large number of open media software developers as well as HTML5 <video> tag developers, it was a great group of people that were on the panel with me: Philip Jagenstedt from Opera, Jan Gerber from Xiph, Viktor Gal from Annodex, Michael Dale from Metavid, and Eric Carlson from Apple. This meant we had three browser vendors and their <video> tag developers present, as well as two javascript library developers representing some of the largest content sites that are already using Ogg Theora/Vorbis with the <video> tag, plus myself looking into accessibility for <video>.

The biggest topic around the <video> tag is of course the question of baseline codec: which codec can and should become the required codec for anyone implementing <video> tag support. Fortunately, this discussion was held during the panel just ahead of ours. Thus, our panel was able to focus on the achievements of the HTML5 video tag and implementations of it, as well as the challenges still ahead.

Unfortunately, the panel was cut short at the conference to only 30 min, so we ended up doing mostly demos of HTML5 video working in different browsers and doing cool things such as working with SVG.

The challenges that we identified and that are still ahead to solve are:

  • annotation support: closed captions, subtitles, time-aligned metadata, and their DOM exposure
  • track selection: how to select between alternate audio tracks, alternate annotation tracks, based on e.g. language, or accessibility requirements; what would the content negotiation protocol look like
  • how to support live streaming
  • how to support in-browser a/v capture
  • how to support live video communication (skype-style)
  • how to support video playlists
  • how to support basic video editing functionality
  • what would a decent media server for html5 video look like; what capabilities would it have

Here are the slides we made for the working group.

Download PDF: Open Video Conference: HTML5 and video Panel

Video: Video of the session at archive.org

A review of the W3C Timed Text Authoring Format

UPDATE: The best demo I have seen so far of many of DFXP’s features is at http://www.w3.org/2009/02/ThisIsCoffee.html.

The W3C has published a third last call for the draft specification of DFXP, the Distribution Format Exchange Profile for the Timed Text Authoring Format – or, for short: their new standard format for captions. Comments are due by the 30th June, so rush if you want to give any feedback. Here is what came to my mind as I was reading the 183-page document.

Please note: this review looks at DFXP from a Web view, i.e. how compatible it is with existing Web technologies, since my main use case will be on the Web – even if advocates will say that that’s not its main purpose, which is strange for a standard coming out of the W3C.

The state of affairs with caption formats

When it comes to captions and subtitles, there is no lack of formats. It seems that, because it is an easy challenge to define a data format for something as simple as a piece of text and some timing information, every new project that wanted to deal with captions – or more generally timed text – created its own format. I am no exception to the rule. 🙂

Thus, the current state of affairs wrt timed text is this: there are many different textual file formats to store such data; there are many different video container formats, each with their own data format (or even formats) for embedding timed text; and there is a lot of software that deals with the many input, output and encapsulation formats.

The problem with this situation is that the formats all differ in their complexity. The simple “piece of text and timing information” problem can be turned into as complex a problem as you desire. By adding layout information, styling information, animation functionality, metadata about the video and about the content, and possibly hyperlinks, we have ended up with a large mess of incompatible formats.

The aim of W3C Timed Text

The W3C Timed Text working group was chartered in January 2003 to attack this issue. It was supposed to become the super-format of all possible functionalities for timed text formats and therefore a perfect interchange format between applications (see requirements document). Its focus was on use on the Web and with SMIL, making use of existing W3C technologies where possible.

However, the history of captioning is in TV, and the scope of Timed Text goes beyond mere use on the Web. So while W3C Timed Text took a lot of inspiration from other Web standards, it has become a stand-alone standard that does not rely on, e.g., the availability of a CSS engine, and it has no in-built hyperlinking functionality (see what requirements it fulfills).

Dissecting DFXP

So, let’s look into some of what DFXP provides.

Here is an example file taken straight from the draft – check the presentation here:

<tt xml:lang="" xmlns="http://www.w3.org/2006/10/ttaf1">
  <head>
    <metadata xmlns:ttm="http://www.w3.org/2006/10/ttaf1#metadata">
      <ttm:title>Timed Text DFXP Example</ttm:title>
      <ttm:copyright>The Authors (c) 2006</ttm:copyright>
    </metadata>

    <styling xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <!-- s1 specifies default color, font, and text alignment -->
      <style xml:id="s1"
                 tts:color="white"
                 tts:fontFamily="proportionalSansSerif"
                 tts:fontSize="22px"
                 tts:textAlign="center" />
      <!-- alternative using yellow text but otherwise the same as style s1 -->
      <style xml:id="s2" style="s1" tts:color="yellow"/>
      <!-- a style based on s1 but justified to the right -->
      <style xml:id="s1Right" style="s1" tts:textAlign="end" />     
      <!-- a style based on s2 but justified to the left -->
      <style xml:id="s2Left" style="s2" tts:textAlign="start" />
    </styling>

    <layout xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <region xml:id="subtitleArea"
                   style="s1"
                   tts:extent="560px 62px"
                   tts:padding="5px 3px"
                   tts:backgroundColor="black"
                   tts:displayAlign="after" />
    </layout> 
  </head>
  <body region="subtitleArea">
    <div>
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      </p>
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      </p>
      <p xml:id="subtitle3" begin="10.0s" end="16.0s" style="s2">
        It is puzzling, why is it<br/>
        we do not see things upside-down?
      </p>
      <p xml:id="subtitle4" begin="17.2s" end="23.0s">
        You have never heard the Theory,<br/>
        then, that the Brain also is inverted?
      </p>
      <p xml:id="subtitle5" begin="23.0s" end="27.0s" style="s2">
        No indeed! What a beautiful fact!
      </p>
      <p xml:id="subtitle6a" begin="28.0s" end="34.6s" style="s2Left">
        But how is it proved?
      </p>
      <p xml:id="subtitle6b" begin="28.0s" end="34.6s" style="s1Right">
        Thus: what we call
      </p>
      <p xml:id="subtitle7" begin="34.6s" end="45.0s" style="s1Right">
        the vertex of the Brain<br/>
        is really its base
      </p>
      <p xml:id="subtitle8" begin="45.0s" end="52.0s" style="s1Right">
        and what we call its base<br/>
        is really its vertex,
      </p>
      <p xml:id="subtitle9a" begin="53.5s" end="58.7s">
        it is simply a question of nomenclature.
      </p>
      <p xml:id="subtitle9b" begin="53.5s" end="58.7s" style="s2">
        How truly delightful!
      </p>
    </div>    
  </body>
</tt>

I’m going to look at each of the different functionalities separately and discuss their strengths and weaknesses.

Content

Let’s begin with the body of the DFXP document and what elements are defined for this area.

Firstly, <body> comes with optional begin, end, and dur attributes. As is the case for all time specifications in DFXP, there are both “end” and “dur” attributes. Why this over-specification? There is not even an explanation of which of the two has higher priority when they conflict. This is plainly asking for trouble – why not simplify the spec?

The “region” and “style” attributes refer to previously defined regions and styles that are applied to the body. The “id” and “lang” attributes allow associating a name and a language with the body.

The “timeContainer” attribute enables the author to specify whether the elements in the body are all to be regarded as temporally parallel or in sequence, the default being parallel. This means that all text elements specified inside the body can render over the top of each other – a situation that is solved by giving them specific start and end times.

The containing elements of body are a sequence of <div> tags. The div element functions as a logical container and a temporal structuring element for a sequence of textual content units. div elements, like body elements, are allowed “begin”, “end” and “dur” attributes and generally everything that the body element has, except that their children can be further div or p elements. Again, the children of a div element are all regarded as being temporally parallel.

The p element is basically the inner-most element that contains the actual text, including new-lines (br) and spans to associate further styling, metadata, or animations. The children of the p or span element are also all regarded as being temporally parallel, unless otherwise specified.

The structuring of text into div, p, and span elements seems to make sense and provide sufficient (if not even excessive) flexibility for any required timed text needs.

Layout

Once the text is specified and structured, the next question is where it should be positioned.

The extent attribute of the <tt> root element specifies the width and height of the root container, if not specified by the external authoring context.

Inside the root container, regions are defined through explicit <region> elements. The origin of placement for a region is the top left corner. Regions can define their “origin” offset, their “width” and “height”, the alignment of text within them through the “textAlign” and “displayAlign” styles, and whether text that “overflows” a region should be visible or hidden.

The way in which DFXP defines regions and placement of text within regions is very different to the way in which HTML and CSS work. By default, elements in HTML flow one after another in the same order as they appear in the source. CSS attributes applied to the elements can control their positioning through giving coordinates, or relative placements in relation to other elements. In DFXP elements are placed inside regions that are styled, making it incompatible with HTML.

Styling

The styling attributes available for DFXP are limited, but sufficient for timed text purposes. The way in which style associations to elements are resolved is quite diverse. Styles can be associated with regions, with individual elements, individually and as a group, through layouts and through parent elements. Compared to CSS, it feels complicated and potentially full of contradictions.

Animation

Further to styling, DFXP defines animations, which are discrete changes to some style parameter value that applies over some time interval. This is relevant for example to implement karaoke style colouring of text over time.

Metadata

The <metadata> element serves as a generic container for grouping metadata information. It can be associated with virtually any element – which seems somewhat over-flexible, but provides for interesting metadata such as metadata for styles or for a <br>.

In addition, metadata is actually limited to a set number of elements: title, desc, copyright, agent, name, and actor. These are strange fields – in particular if you compare them to the flexibility of HTML metadata, which consists of free-form name-value pairs and has brought us domain-specific schemes such as Dublin Core. This is not easily possible here; instead one has to define extensions to allow for such flexible metadata.

Other features

DFXP provides other features such as information that describes the related video file, e.g. frameRate, subFrameRate, frameRateMultiplier, pixelAspectRatio, smpteMode, timeBase, and tickRate. Such information will help at the point in time when DFXP is supposed to be multiplexed into a binary media file together with audio and video tracks. These attributes can provide information required for the multiplexing process. I am not sure that justifies their existence though.

Other, minor features are available too. Check out the full specification to get a complete picture.

Examples

Part of the publication of this draft is also a test suite. Several of the defined features are still not represented in the test suite, which to me raises the question of whether they are really required. It might do wonders for the draft size to remove them.

Summary

DFXP is a standard for timed text that is firmly grounded in past captioning specifications, but written in XML, and borrowing ideas from Web technologies. It is unfortunately not re-using existing Web infrastructure to implement its more complex features: no use of CSS for styling and layout, no use of hyperlinks. Also, the use of namespaces seems excessive and won’t make it easy to author this format, in particular since the defined namespaces do not map into the defined profiles.

DFXP is, however, simple to transcode to something that a Web Browser can deal with through its existing engines, because it has borrowed from other Web standards. It is thus easier to work with on the Web than most other formats. It should be relatively easy to map to HTML, CSS and javascript, as already started in the test suite with the HTML5 video element.
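
To illustrate, here is a hedged sketch of such a transcode: it renders the timed <p> elements of a DFXP document on top of an HTML5 video via the “timeupdate” event, ignoring styling and regions; parseDfxpCues and the captionOverlay element are hypothetical:

// cues extracted from the DFXP <p> elements, e.g.
// { begin: 0.76, end: 3.45, text: "It seems a paradox, does it not," }
var cues = parseDfxpCues(dfxpDocument);

var video = document.getElementsByTagName("video")[0];
var overlay = document.getElementById("captionOverlay");

video.addEventListener("timeupdate", function() {
  var now = video.currentTime;
  var text = "";
  for (var i = 0; i < cues.length; i++) {
    if (cues[i].begin <= now && now < cues[i].end) {
      text += cues[i].text + "\n";
    }
  }
  overlay.textContent = text;
}, false);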

DFXP is written in such a way that it is possible to put together a new profile with extensions that are more appropriate for specific needs, e.g. one that fits better into existing Web infrastructure. Currently, DFXP has three defined profiles: one focused on transformation, one focused on presentation, and one that contains everything.

I think it’s time for an HTML5 profile of DFXP that at minimum extends DFXP with hyperlinks, making it a real timed text Web format.