Category Archives: Digital Media

Web Directions South 2009 talk on HTML5 video

Yesterday, I gave a talk on the HTML5 video element at Web Directions South.

The title was “Taking HTML5 <video> a step further”, and the abstract provided was as follows:

This talk focuses on the efforts engaged by W3C to improve the new HTML 5 media elements with mechanisms to allow people to access multimedia content, including audio and video. Such developments are also useful beyond accessibility needs and will lead to a general improvement of the usability of media, making media discoverable and generally a prime citizen on the Web.

Silvia will discuss what is currently technically possible with the HTML5 media elements, and what is still missing. She will describe a general framework of accessibility for HTML5 media elements and present her work for the Mozilla Corporation that includes captions, subtitles, textual audio annotations, timed metadata, and other time-aligned text with the HTML5 media elements. Silvia will also discuss work of the W3C Media Fragments group to further enhance video usability and accessibility by making it possible to directly address temporal offsets in video, as well as spatial areas and tracks.

Here are my slides:

Download the pdf from here.

There was also a video recording and I will add that here as soon as it is published.

UPDATE:
The video is available on Tinyvid:

I’m not going to try and upload this 50 min long video to YouTube – with its 10 min limit, I won’t get very far.

WebJam 2009 talk on video accessibility

On Wednesday evening I gave a 3 min presentation on video accessibility in HTML5 at the WebJam in Sydney. I used a video as my presentation medium and explained things while playing it back. Here is the video, without my oral descriptions, but probably still useful to some. Note in particular how you can experience the issues of deaf (HoH), blind (VI) and foreign language users:

The Ogg version is here.

New proposal for captions and other timed text for HTML5

The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback – probably because there are several demos available.

The feedback has encouraged me to develop a new specification that addresses these concerns and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in files separate from the video/audio file). A simple example of the new specification using srt files is this:

<video src="video.ogv" controls>
   <itextlist category="CC">
     <itext src="caption_en.srt" lang="en"/>
     <itext src="caption_de.srt" lang="de"/>
     <itext src="caption_fr.srt" lang="fr"/>
     <itext src="caption_jp.srt" lang="jp"/>
   </itextlist>
</video>

By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally a mime type that still needs to be registered). Also by default, the browser is expected to select for display the track that matches the browser’s set default language. This has proven to work well in the previous experiments.
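
As an illustration of how that default selection could work (a sketch only: the itext elements are not standard, so the function name and the simple two-letter language matching are my own assumptions):

// Pick the itext track whose @lang best matches the browser's UI language;
// fall back to the first track if there is no match.
function pickDefaultTrack(itextlist) {
  var tracks = itextlist.getElementsByTagName("itext");
  var browserLang = (navigator.language || "en").substring(0, 2);
  for (var i = 0; i < tracks.length; i++) {
    if (tracks[i].getAttribute("lang") === browserLang) {
      return tracks[i];
    }
  }
  return tracks.length ? tracks[0] : null;
}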

Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!

The itextlist element
You will have noticed that, in comparison to the previous specification, this one contains a grouping element called “itextlist”. This is necessary because we have to distinguish between time-aligned text tracks that are alternatives to each other and ones that are additional, i.e. can be displayed at the same time. In the first specification this was done by inspecting each itext element’s category and grouping them together, but that resulted in much repetition and rather unreadable markup.

Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.

The final advantage is that callbacks for entering and leaving the text segments extracted from the itext elements can now be associated with the itextlist element in a uniform manner.

This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.
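
To illustrate how simple that becomes, a script could derive the menu structure with a plain walk over the markup (again only a sketch against the proposed elements, not an existing API):

// Build a description of the menu: one group per itextlist (keyed by its
// @category), listing the languages of the itext tracks it contains.
function buildMenuStructure(video) {
  var menu = [];
  var lists = video.getElementsByTagName("itextlist");
  for (var i = 0; i < lists.length; i++) {
    var entry = { category: lists[i].getAttribute("category"), languages: [] };
    var tracks = lists[i].getElementsByTagName("itext");
    for (var j = 0; j < tracks.length; j++) {
      entry.languages.push(tracks[j].getAttribute("lang"));
    }
    menu.push(entry);
  }
  return menu;
}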

Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corrêa provided.

W3C Workshop/Barcamp on HTML5 Video Accessibility

Web accessibility veteran John Foliot of Stanford University and Apple’s QuickTime EcoSystem Manager Dave Singer are organising a W3C Workshop/Barcamp on Video Accessibility on the Sunday before the W3C’s annual combined technical plenary meeting TPAC.

The workshop will take place on 1st November at Stanford University – see the Workshop page for details.

If you read the announcement, you will see that this is about understanding all the issues around video (and audio) accessibility, understanding existing approaches, and trying to find solutions for HTML5 that all browser vendors will be able to support.

The workshop is run under the W3C Hypertext Coordination Group and registration is required.

W3C membership is not required in order to participate in the gathering. However, you are required to contribute your knowledge actively and constructively to the Workshop. You must come prepared to present on one of the questions in this document to help inform the discussion and make progress on proposing solutions.

I am very excited about this workshop because I think it is high time to move things forward.

If I can get my travel sorted, I will present my results on the video accessibility work that I did for Mozilla. It will cover both out-of-band and in-line accessibility data for video elements, and how to expose a common API for them in the Web browser. I have recently experimented with encoding srt and lrc files in Ogg and displaying them in Firefox by using the patches that OggK and Felipe contributed to Firefox. More about this soon.

Tracking Status of Video Accessibility Work

Just a brief note to let everyone know about a new wikipage I created for my Mozilla work about video accessibility, where I want to track the status and outcomes of my work. You can find it at https://wiki.mozilla.org/Accessibility/Video_a11y_Aug09. It lists the following sections: Test File Collection, Specifications, Demo implementations using JavaScript, Related open bugs in Mozilla, and Publications.

HTML5 audio element accessibility

As part of my experiments in video accessibility I am also looking at the audio element. I have just finished a proof of concept for parsing Lyrics files for music in lrc format.
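
The lrc format itself is simple: every lyrics line is prefixed with one or more [mm:ss.xx] timestamps. A minimal parsing sketch (not the actual demo code) could look like this:

// Parse lrc lines of the form "[mm:ss.xx]lyric text" into {time, text} cues.
function parseLrc(lrcText) {
  var cues = [];
  var lines = lrcText.split("\n");
  var pattern = /^\[(\d+):(\d+(?:\.\d+)?)\](.*)$/;
  for (var i = 0; i < lines.length; i++) {
    var match = pattern.exec(lines[i]);
    if (match) {
      var seconds = parseInt(match[1], 10) * 60 + parseFloat(match[2]);
      cues.push({ time: seconds, text: match[3] });
    }
  }
  // Metadata tags such as [ar:Artist] don't match the pattern and are skipped.
  return cues;
}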

The demo uses Tay Zonday’s “Chocolate Rain” song both as a video with subtitles and as an audio file with lyrics. Fortunately, he published all of these under a Creative Commons license, so I was able to use this music file. BTW: I found it really difficult to find an openly licensed music file with lyrics.

While I was at it, I also cleaned up all the old demos and now have a nice list of all demos in a central file.

Open Standards for Sign Languages

Looking at accessibility for video includes sign language. It is a most fascinating area to get into and an area that still leaves a lot to formalise and standardise. A lot has happened in recent years and a lot still needs to be done.

Sign languages are different languages from spoken languages: they emerged in parallel to spoken languages in communities whose boundaries may not overlap with the boundaries of spoken languages. However, most sign language communities developed means to translate spoken language artifacts (i.e. letters) into sign language artifacts (i.e. signs). So, a typical signer will speak/write at least 3-4 “languages”: the spoken language of their hearing peers, lip reading of that spoken language, letter signs of the spoken language, and finally the native sign language of the community they live in.

Encoding sign language in the computer is a real challenge. Firstly, there is the problem of enumerating all available languages. Then there is the challenge to find an alphabet to represent all “characters” that can be used in sign across many (preferably all) sign languages. Then there is the need to encode these characters in a way that computers can deal with. And finally, there is the need to find a screen representation of the characters. In this blog post, I want to describe the status for all of these.

Currently, sign language can only be represented as a video track by recording sign speakers. Once a sign character list together with an encoding and representation means for them and a specification of the different sign languages is available, it is possible to encode sign sentences in computer-readable form. Further, programs can be written that can present sign sentences on screen, that translate between different sign languages, and between sign and spoken languages. Also, avatars can be programmed that actually present animated sign sentences.

Imagine a computer that, instead of presenting letters in your spoken language, uses sign language characters and has keys with signs on them instead of letters. To a sign speaker this would be a lot more natural, since for most of them sign is their mother tongue.

Listing all existing sign languages
It was a challenge to create codes for all existing spoken languages – the current list of language codes was only finalised in 1998.

Until the 1980s, scientists assumed that it is impossible to develop as rich a language with signs as with writing and speaking. Thus, the native languages of deaf people were often regarded as inferior to spoken languages. In many countries it was even prohibited to teach the language in schools for the deaf and instead they were taught to speak an oral language and read lips. In France this prohibition was only lifted in 1991! Only in about 1985 was it proven that sign languages are indeed as rich as spoken languages and deserve the right to be called a “language” and be treated as a fully capable means of communication.

So, there hasn’t actually been much time to map out a list of all sign languages. The best list I was able to find is in Wikipedia. It lists 28 N/S American, 38 European, 34 Asia-Pacific-AU/NZ, 30 African, and 13 Middle Eastern sign languages – 143 sign languages in total. Another list I found contains 177 sign languages.

Interestingly, there is also a new International Sign Language in development, called Gestuno, which is in use at international events (Olympics, conferences, etc.) but has only a limited vocabulary.

In 1999 the Irish National Body, Deaf Action Committee for SignWriting, proposed the addition of sign language codes to ISO-639-2. Instead, a single code entered the list: sgn for sign language. In 2001, this led to the development of IETF language extension codes in RFC 3066 for 22 sign languages. In September 2006, this standard was replaced by RFC 4646, which defines 135 subtags for sign languages, including one for the International Sign Language and a generic “sgn” one.

While not complete, the current IANA subtag language registry now regards sign languages as valid derivatives of a country’s languages and therefore handles them identically to spoken languages. It’s also extensible such that any sign language not yet registered can still be specified.
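
To give an idea of what such tags look like (illustrative examples only; the IANA registry holds the authoritative entries):

sgn      generic sign language
sgn-US   American Sign Language (RFC 3066-style registered tag)
ase      American Sign Language (ISO 639-3 code)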

Characters for sign languages
The written word is very powerful for preserving and sharing information. For a very long time there has been no written representation of sign languages. This is not surprising, considering that there are still indigenous spoken languages that have no written representation. Also, the written representation of the spoken language surrounding a sign language community would have served that community sufficiently for most purposes – except for the accurate capture of their thoughts and sign communications. It would always remain a foreign language.

To move sign languages into the 20th century, the invention of characters for signs was necessary.

It is relatively easy to map the alphabets of spoken languages to signs (e.g. the American (ASL) manual alphabet, the British, Australian and NZ (AUSLAN) manual alphabet, or the German manual finger alphabet; also see fingerspelling). Interestingly, the AUSLAN manual alphabet is a two-handed one, while the ASL one is single-handed.

Fonts are available for these alphabets, too, e.g. British Sign Font, American Sign Font, French Sign Font and more.

The real challenge lies in capturing the proper signs deaf people use to communicate amongst themselves.

This is rather challenging, since sign languages use the hands, head and body, with constantly changing movements and orientations, for communication. Thus, while spoken language only has one dimension (sound) over time, sign languages have “three dimensions”, and capturing this in characters is difficult. To this date, many sign languages, e.g. AUSLAN, don’t have a widely used written form. Mostly in use nowadays are sequences of photos or videos – which of course cannot easily be processed by computers.

Two main writing systems have been developed: the phonemic Stokoe notation and the iconic SignWriting.

Stokoe notation was created by William Stokoe for ASL in 1960, with Latin letters and numbers used for the shapes they have in fingerspelling, and iconic glyphs to transcribe the position, movement, and orientation of the hands. Adaptations were made for other sign languages to include further phonemes not found in ASL. Stokoe notation is written left-to-right on a page and can be typed with the proper font installed. It has a Unicode/ASCII mapping, but does not easily apply to sign languages other than ASL since it does not capture all possible signs. It has no representation for facial and body expressions and is therefore a relatively poor representation of sign.

SignWriting was created in 1974 by Valerie Sutton, a dancer who had two years earlier developed DanceWriting and later developed MimeWriting, SportsWriting, and ScienceWriting. SignWriting is a writing system which uses visual symbols to represent the handshapes, movements, and facial expressions of sign languages. It is a generic sign alphabet with a list of symbols that can be used to write any sign language in the world.

SignWriting can be easily learnt by signers and is more popular now than Stokoe. Signers compose the symbols together in a spatial way to represent their signs. They then write the composed symbols from top to bottom on a page, similar to other iconic character sets. SignWriting currently supports 73 different sign languages, whose dictionaries and encyclopedias are captured in SignPuddle. This will eventually allow the creation of complete corpora for all sign languages.

Unicode encoding of SignWriting and visual representation
Because of the unique challenge of having to cover spatial combinations of symbols as new symbols, rather than just sequential combinations, it took a while to get a Unicode representation of SignWriting.

About a year ago, on 19th September 2008, Valerie Sutton released the International SignWriting Alphabet (ISWA 2008).

A binary representation of SignWriting is defined in ISWA 2008. It is based on representing 639 base symbols and their potential 6 fill and 16 rotation variants in 61,343 code points, which completely cover the subset of 35,023 valid symbol codes. The spatial aspect of SignWriting is encoded in a 2-dimensional coordinate system whose dimensions go from -1919 to 1919 and place the top left corner of each symbol.

SignWriting base symbols are encoded in plane 4 of Unicode, which provides 65,536 code points, easily covering the defined 61,343 Binary SignWriting code points. Further special control and number characters are used to encode the spatial layout.

Visual Representation of SignWriting
Valerie Sutton created over 35k individual PNG images for ISWA 2008, which have been reformatted for standard color & reduced file size, and renamed to the character code. They are a font used to represent the signs. The images can be accessed on Valerie’s server.

Closing
After learning all this today, I have to say that Valerie Sutton has just turned into a new idol of mine. The achievements with SignWriting and the possibilities it will enable are massive.

Now I just have to figure out what to do when we hit a sign language track that has been encoded in SignWriting and represents captions. Maybe it is possible to display sign as an overlay, but on the left side of the video. This would be similar to some other languages that go from top to bottom rather than left to right.

Updated video accessibility demo

Just a brief note to share that I have updated the video accessibility demo at http://www.annodex.net/~silvia/itext/elephant_no_skin.html.

It should now support ARIA and tab access to the menu, which I have simply put next to the video. I implemented the menu by learning from YUI. My Firefox 3.5.3 actually doesn’t tab through it, but then it also doesn’t tab through the YUI example, which I think is correct. Go figure.

Also, the textual audio descriptions are improved and should now work better with screenreaders.

I have also just prepared a recorded audio description of “Elephants Dream” (German accent warning).

You can also download the multitrack Ogg Theora video file that contains the original audio and video track plus the audio description as an extra track, created using oggz-merge.

As soon as some kind soul donates a sign language track for “Elephants Dream”, I will have a pretty complete set of video accessibility tracks for that video. This will certainly become the basis for more video a11y work!

URI fragments vs URI queries for media fragment addressing

In the W3C Media Fragment Working Group (MFWG) we have had long discussions about the use of the URI query (“?”) versus the URI fragment (“#”) addressing approach for addressing directly into media fragments, and about the diverse new HTTP headers required to serve such URI requests, considering side conditions such as the stripping of fragment parameters from a URI by Web browsers, or the existence of caching Web proxies.

As explained earlier, URI queries request (primary) resources, while URI fragments address secondary resources, which have a relationship to their primary resource. So, in the strictest sense of their specifications, to address segments in media resources without losing the context of the primary resource, we can only use URI fragments.
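
To make the difference concrete (illustrative URIs only, using the temporal dimension syntax the group is working on):

http://example.com/video.ogv?t=20,40   a query: the server delivers a new (primary) resource
http://example.com/video.ogv#t=20,40   a fragment: a segment (secondary resource) of the same video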

Browser-supported Media Fragment URIs

For this reason, URI fragments are also the way in which my last media fragment addressing demo has been implemented.

Demo of deep hyperlinking into HTML5 video

In an effort to give a demo of some of the W3C Media Fragment WG specification capabilities, I implemented an HTML5 page with a video element that reacts to fragment offset changes in the URL bar and in the <video> element’s source.

Demo Features

The demo can be found on the Annodex Web server. It has the following features:

If you simply load that Web page, you will see the video jump to an offset because it is referred to as “elephants_dream/elephant.ogv#t=20”.

If you change or add a temporal fragment in the URL bar, the video jumps to this time offset and overrules the video’s fragment addressing. (This only works in Firefox 3.6, see below – in older Firefoxes you actually have to reload the page for this to happen.) This functionality is similar to a time linking functionality that YouTube also provides.

When you hit the “play” button on the video and let it play a bit before hitting “pause” again, the second at which you hit “pause” is displayed in the page’s URL bar. In Firefox, this even leads to an addition to the browser’s history, so you can jump back to the previous pause position.

Three input boxes allow for experimentation with different functionality.

  • The first one contains a link to the current Web page with the media fragment for the current video playback position. This text is displayed for cut-and-paste purposes, e.g. to send it in an email to friends.
  • The second one is an entry box which accepts float values as time offsets. Once entered, the video will jump to the given time offset. The URL of the video and the page URL will be updated.
  • The third one is an entry box which accepts a video URL that replaces the <video> element’s @src attribute value. It is meant for experimentation with different temporal media fragment URLs as they get loaded into the <video> element.

Javascript Hacks

You can look at the source code of the page – all the javascript in use is actually at the bottom of the page. Here are some of the juicy bits of what I’ve done:

Since Web browsers do not support parsing and reacting to media fragment URIs, I implemented this in javascript. Once the video is loaded, i.e. the “loadedmetadata” event is called on the video, I parse the video’s @currentSrc attribute and jump to a time offset if one is given. I use @currentSrc because it is the URL that the video element actually uses after having parsed the @src attribute and all the contained <source> elements (if they exist). This function is also called when the video’s @src attribute is changed through javascript.
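
In rough outline (a sketch only, assuming a single video element on the page and only the plain “#t=seconds” form), that hack looks like this:

var video = document.getElementsByTagName("video")[0];

// Once metadata is available, parse a "#t=" offset off the effective source URL
// and seek to it; this is the part a browser would ideally do natively.
function seekToFragment() {
  var src = video.currentSrc;
  var pos = src.indexOf("#t=");
  if (pos >= 0) {
    var offset = parseFloat(src.substring(pos + 3));
    if (!isNaN(offset)) {
      video.currentTime = offset;
    }
  }
}
video.addEventListener("loadedmetadata", seekToFragment, false);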

This is the only bit from the demo that the browsers should do natively. The remaining functionality hooks up the temporal addressing for the video with the browser’s URL bar.

To display a URL in the URL bar that people can cut and paste to send to their friends, I hooked up the video’s “pause” event with an update to the URL bar. If you are jumping around through javascript calls to video.currentTime, you will also have to make these changes to the URL bar.
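
A sketch of that hook (reusing the video variable from the snippet above):

// On "pause", reflect the current playback position in the page URL so it can be
// cut and pasted; setting location.hash also creates a browser history entry.
video.addEventListener("pause", function() {
  window.location.hash = "t=" + video.currentTime.toFixed(2);
}, false);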

Finally, I am capturing the window’s “hashchange” event, which is new in HTML5 and only implemented in Firefox 3.6. This means that if you change the temporal offset on the page’s URL, the browser will parse it and jump the video to the offset time.
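
Again as a sketch, the “hashchange” handler only has to parse the new hash and seek:

// React to manual changes of the "#t=" offset in the URL bar (Firefox 3.6+).
window.addEventListener("hashchange", function() {
  var hash = window.location.hash;
  if (hash.indexOf("#t=") === 0) {
    var offset = parseFloat(hash.substring(3));
    if (!isNaN(offset)) {
      video.currentTime = offset;
    }
  }
}, false);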

Optimisation

Doing these kinds of jumps around on video can be very slow when the seeking is happening on the remote server. Firefox actually implements seeking over the network, which in the case of Ogg can require multiple jumps back and forth on the remote video file with byte range requests to locate the correct offset location.

To reduce as much as possible the effort that Firefox has to make with seeking, I consulted Mozilla’s very useful help page on speeding up video. It recommends delivering the X-Content-Duration HTTP header from your Web server. For Ogg media, this can be provided through the oggz-chop CGI. Since I didn’t want to install it on my Apache server, I hard-coded X-Content-Duration in a .htaccess file in the directory that serves the media file. The .htaccess file looks as follows:

<Files "elephant.ogv">
Header set X-Content-Duration "653.791"
</Files>

This should now help Firefox to avoid the extra seek necessary to determine the video’s duration and display the transport bar faster.

I also added the @autobuffer attribute to the <video> element, which should make the complete video file available to the browser and thus speed up seeking enormously since it will not need to do any network requests and can just do it on the local file.

ToDos

This is only a first and very simple demo of media fragments and video. I have not made an effort to capture any errors or to parse a URL that is more complicated than simply containing “#t=”. Feel free to report any bugs to me in the comments or send me patches.

Also, I have not made an effort to use time ranges, which are part of the W3C Media Fragment spec. This should be simple to add, since it just requires stopping the video playback at the given end time.

Also, I have only implemented parsing of the simplest default time spec, given in seconds (and fractions thereof). None of the more complicated npt, smpte, or clock specifications have been implemented yet.

The possibilities for deeper access to video and for improved video accessibility with these URLs are vast. Just imagine hooking up the caption elements of e.g. an srt file with temporal hyperlinks and you can provide deep interaction between the video content and the captions. You could even drive this to the extreme and jump between single words if you mark up each with its time relationship. Happy experimenting!
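
As a hypothetical sketch of that idea (assuming the srt cues have already been parsed into objects with a start time in seconds and a text), each caption could simply be rendered as a temporal hyperlink into the current page:

// Render parsed srt cues as deep links that seek the page's video via "#t=".
function renderCaptionLinks(cues, container) {
  for (var i = 0; i < cues.length; i++) {
    var a = document.createElement("a");
    a.href = "#t=" + cues[i].start;
    a.textContent = cues[i].text;
    container.appendChild(a);
    container.appendChild(document.createElement("br"));
  }
}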

UPDATE: I forgot to mention that it is really annoying that the video has to be re-loaded when the @src attribute is changed, even if only the hash changes. As support for media fragments is implemented in <video> and <audio> elements, it would be advantageous if the “load()” function checked whether only the hash changed and did not re-load the full resource in that case.

Thanks go to Chris Double and Chris Pearce from Mozilla for their feedback and suggestions for improvement on an early version of this.