Tag Archives: audio element

Introducing media accessibility into HTML5

In recent months, people in the W3C HTML5 Accessibility Task Force developed two proposals for introducing caption, subtitle, and more generally time-aligned text support into HTML5 audio and video.

These time-aligned text files can either come as external files that are associated with the timeline of the media resource, or they come as part of the media resource in a binary track.

For both cases we now have proposals to extend the HTML5 specification.

Firstly, let’s look at time-aligned text in external files. The change proposal introduces markup to associate such external files as a kind of “virtual track” with a media resource. Here is an example:

<video src="video.ogv">
<track src="video_cc.ttml" type="application/ttaf+xml" language="en" role="caption"></track>
<track src="video_tad.srt" type="text/srt" language="en" role="textaudesc"></track>
<trackgroup role="subtitle">
<track src="video_sub_en.srt" type="text/srt; charset='Windows-1252'" language="en"></track>
<track src="video_sub_de.srt" type="text/srt; charset='ISO-8859-1'" language="de"></track>
<track src="video_sub_ja.srt" type="text/srt; charset='EUC-JP'" language="ja"></track>

The video resource is “video.ogv”. Associated with it are five timed text resources.

The first one is written in TTML (which is the new name for DFXP), is a caption track and in English. TTML is particularly useful when you want to provide more than just an unformatted piece of text to the viewers. Hearing-impaired users appreciate any visual help they can be provided with to absorb the caption text more quickly. This includes colour coding of speakers, positioning of text close to the speaking person on screen, or even animated musical notes to signify music. Thus, a format like TTML that allows for formatting and positioning information is an appropriate format to specify captions.

All other timed text resources are provided in SRT format, which is a simpler format that TTML with only plain text in the text cues.

The second text track is a textual audio description track. A textual audio description is in fact targeted at the vision-impaired and contains text that is expected to be read out by a screen reader or routed to a braille device. Thus, as the video plays, a vision-impaired user receives additional information about the visual content of the scene through their screen reader or braille device. The SRT format is particularly useful for providing textual audio descriptions since it only provides plain text, which can easily be handed on to assistive technology. When authoring such textual audio descriptions, it is very important to pick time intervals in the original media resource where no other significant audio cue is provided, such that the vision-impaired user is able to listen to the screen reader during that time.

The last three text tracks are subtitle tracks. They are grouped into a trackgroup element, which is not strictly necessary, but enables the author to say that these tracks are supposed to be alternatives. Thus, a Web Browser can create a menu with all the available tracks and put the tracks in the trackgroup into a menu of their own where only one option is selectable (similar to how radiobuttons work). Incidentally, the trackgroup element also allows to avoid having to repeat the role attribute in all the containing tracks. It is expected that these menus will be added to the default media controls and will thus be visible if the media element has a controls attribute.

With the role, type and language attributes, it is easy for a Web Browser to understand what the different tracks have to offer. A Web Browser can even decide to offer new functionality that is helpful to certain user groups. For example, an addition to a Web Browser’s default settings could be to allow users to instruct a Web Browser to always turn on captions or subtitles if they are available in the user’s main language. Or to always turn on textual audio descriptions. In this way, a user can customise their default experience of a media resource over and on top of what a Web page author decides to expose.

Incidentally, the choice of “track” as a name for relating external text resources to a media element has a deeper meaning. It is easily possible in future to extend “track” elements to not just point to dependent text resources, but also to dependent audio or video resources. For example, an actual audio description that is a recording of a human voice rather than a rendered text description could be association in the same way. Right now, such an implementation is not envisaged by the Browser vendors, but it will be something to work towards in the future.

Now, with such functionality available, there is naturally a desire to be able to control activation or de-activation of text tracks through JavaScript, not just through user interaction. A Web Developer may for example want to override the default controls provided by a Web Browser and run their own JavaScript-based controls, thus requiring to create their own selection menu for the tracks.

This is actually also an issue more generally and applies to all track types, including such tracks that come inside an existing media resource. In the current specification such tracks are not exposed and can therefore not be activated.

This is where the second specification that the W3C Accessibility Task Force has worked towards comes in: the media multitrack JavaScript API.

This specification introduces a read-only JavaScript interface to the audio and video elements to allow Web Developers to find out about the tracks (including the virtual tracks) that a media resource offers. The only action that the interface currently provides is to enable or disable tracks.
Here is an example use to turn on a french subtitle track:

if (video.tracks[2].role == "subtitle" && video.tracks[2].language == "fr") video.tracks[2].enabled = true;

There is still a need to introduce a means to actually expose the text cues as they relate to the currentTime of the media resource. This has not yet been specified in the given proposals.

The text cues could be exposed in several ways. They could be exposed through introducing an event, i.e. every time a new text cue becomes active, a callback is called which is given the active text cue (if such a callback had been registered previously). Another option is to simply write the text cues into a specified div-element in the DOM and thus expose them directly in the Browser. A third idea could be to expose the text cues in an iframe-like element to avoid any cross-site security issues. And a fourth idea that we have discussed is to expose the text cues in an attribute of the track.

All of this obviously also relates to how to actually render the text cues and whether to render them in a shadow DOM so as to make the JavaScript reading separate from the rendering and address security and copyright issues. I’d be curious in opinions here on how it should be done.

How to display seeked position for HTML5 video

Recently, I was asked for some help on coding with an HTML5 video element and its events. In particular the question was: how do I display the time position that somebody seeked to in a video?

Here is a code snipped that shows how to use the seeked event:

<video onseeked="writeVideoTime(this.currentTime);" src="video.ogv" controls></video>
<p>position:</p><div id="videotime"></div>
<script type="text/javascript">
// get video element
var video = document.getElementsByTagName("video")[0];
function writeVideoTime(t) {

Other events that can be used in a similar way are:

  • loadstart: UA requests the media data from the server
  • progress: UA is fetching media data from the server
  • suspend: UA is on purpose idling on the server connection mid-fetching
  • abort: UA aborts fetching media data from the server
  • error: UA aborts fetching media because of a network error
  • emptied: UA runs out of network buffered media data (I think)
  • stalled: UA is waiting for media data from the server
  • play: playback has begun after play() method returns
  • pause: playback has been paused after pause() method returns
  • loadedmetadata: UA has received all its setup information for the media resource, duration and dimensions and is ready to play
  • loadeddata: UA can render the media data at the current playback position for the first time
  • waiting: playback has stopped because the next frame is not available yet
  • playing: playback has started
  • canplay: playback can resume, but at risk of buffer underrun
  • canplaythrough: playback can resume without estimated risk of buffer underrun
  • seeking: seeking attribute changed to true (may be too short to catch)
  • seeked: seeking attribute changed to false
  • timeupdate: current playback position changed enough to report on it
  • ended: playback stopped at media resource end; ended attribute is true
  • ratechange: defaultPlaybackRate or playbackRate attribute have just changed
  • durationchange: duration attribute has changed
  • volumechange:volume attribute or the muted attribute has changed

Please refer to the actual event list in the specification for more details and more accurate information on the events.

Audio Track Accessibility for HTML5

I have talked a lot about synchronising multiple tracks of audio and video content recently. The reason was mainly that I foresee a need for more than two parallel audio and video tracks, such as audio descriptions for the vision-impaired or dub tracks for internationalisation, as well as sign language tracks for the hard-of-hearing.

It is almost impossible to introduce a good scheme to deliver the right video composition to a target audience. Common people will prefer bare a/v, vision-impaired would probably prefer only audio plus audio descriptions (but will probably take the video), and the hard-of-hearing will prefer video plus captions and possibly a sign language track . While it is possible to dynamically create files that contain such tracks on a server and then deliver the right composition, implementation of such a server method has not been very successful in the last years and it would likely take many years to roll out such new infrastructure.

So, the only other option we have is to synchronise completely separate media resource together as they are selected by the audience.

It is this need that this HTML5 accessibility demo is about: Check out the demo of multiple media resource synchronisation.

I created a Ogg video with only a video track (10m53s750). Then I created an audio track that is the original English audio track (10m53s696). Then I used a Spanish dub track that I found through BlenderNation as an alternative audio track (10m58s337). Lastly, I created an audio description track in the original language (10m53s706). This creates a video track with three optional audio tracks.

I took away all native controls from these elements when using the HTML5 audio and video tag and ran my own stop/play and seeking approaches, which handled all media elements in one go.

I was mostly interested in the quality of this experience. Would the different media files stay mostly in sync? They are normally decoded in different threads, so how big would the drift be?

The resulting page is the basis for such experiments with synchronisation.

The page prints the current playback position in all of the media files at a constant interval of 500ms. Note that when you pause and then play again, I am re-synching the audio tracks with the video track, but not when you just let the files play through.

I have let the files play through on my rather busy Macbook and have achieved the following interesting drift over the course of about 9 minutes:

Drift between multiple parallel played media elements

You will see that the video was the slowest, only doing roughly 540s, while the Spanish dub did 560s in the same time.

To fix such drifts, you can always include regular re-synchronisation points into the video playback. For example, you could set a timeout on the playback to re-sync every 500ms. Within such a short time, it is almost impossible to notice a drift. Don’t re-load the video, because it will lead to visual artifacts. But do use the video’s currentTime to re-set the others. (UPDATE: Actually, it depends on your situation, which track is the best choice as the main timeline. See also comments below.)

It is a workable way of associating random numbers of media tracks with videos, in particular in situations where the creation of merged files cannot easily be included in a workflow.

The model of a time-linear media resource for HTML5

HTML5 has been criticised for not having a timing model of the media resource in its new media elements. This article spells it out and builds a framework of how we should think about HTML5 media resources. Note: these are my thoughts and nothing offical from HTML5 – just conclusions I have drawn from the specs and from discussions I had.

What is a time-linear media resource?

In HTML5 and also in the Media Fragment URI specification we deal only with audio and video resources that represent a single timeline exclusively. Let’s call such Web resources a time-linear media resource.

The Media Fragment requirements document actually has a very nice picture to describe such resources – replicated here for your convenience:

Model of a Media Resource

The resource can potentially consist of any number of audio, video, text, image or other time-aligned data tracks. All these tracks adhere to a single timeline, which tends to be defined by the main audio or video track, while other tracks have been created to synchronise with these main tracks.

This model matches with the world view of video on YouTube and any other video hosting service. It also matches with video used on any video streaming service.

Background on the choice of “time-linear”

I’ve deliberately chosen the word “time-linear” because we are talking about a single, gap-free, linear timeline here and not multiple timelines that represent the single resource.

The word “linear” is, however, somewhat over-used, since the introduction of digital systems into the world of analog film introduced what is now known as “non-linear video editing”. This term originates from the fact that non-linear video editing systems don’t have to linearly spool through film material to get to a edit point, but can directly access any frame in the footage as easily as any other.

When talking about a time-linear media resource, we are referring to a digital resource and therefore direct access to any frame in the footage is possible. So, a time-linear media resource will still be usable within a non-linear editing process.

As a Web resource, a time-linear media resource is not addressed as a sequence of frames or samples, since these are encoding specific. Rather, the resource is handled abstractly as an object that has track and time dimensions – and possibly spatial dimensions where image or video tracks are concerned. The framerate encoding of the resource itself does not matter and could, in fact, be changed without changing the resource’s time, track and spatial dimensions and thus without changing the resource’s address.

Interactive Multimedia

The term “time-linear” is used to specify the difference between a media resource that follows a single timeline, in contrast to one that deals with multiple timelines, linked together based on conditions, events, user interactions, or other disruptions to make a fully interactive multi-media experience. Thus, media resources in HTML5 and Media Fragments do not qualify as interactive multimedia themselves because they are not regarded as a graph of interlinked media resources, but simply as a single time-linear resource.

In this respect, time-linear media resources are also different from the kind of interactive mult-media experiences that an Adobe Shockwave Flash, Silverlight, or a SMIL file can create. These can go far beyond what current typical video publishing and communication applications on the Web require and go far beyond what the HTML5 media elements were created for. If your application has a need for multiple timelines, it may be necessary to use SMIL, Silverlight, or Adobe Flash to create it.

Note that the fact that the HTML5 media elements are part of the Web, and therefore expose states and integrate with JavaScript, provides Web developers with a certain control over the playback order of a time-linear media resource. The simple functions pause(), play(), and the currentTime attribute allow JavaScript developers to control the current playback offset and whether to stop or start playback. Thus, it is possible to interrupt a playback and present, e.g. a overlay text with a hyperlink, or an additional media resource, or anything else a Web developer can imagine right in the middle of playing back a media resource.

In this way, time-linear media resources can contribute towards an interactive multi-media experience, created by a Web developer through a combination of multiple media resources, image resources, text resources and Web pages. The limitations of this approach are not yet clear at this stage – how far will such a constructed multi-media experience be able to take us and where does it become more complicated than an Adobe Flash, Silverlight, or SMIL experience. The answer to this question will, I believe, become clearer through the next few years of HTML5 usage and further extensions to HTML5 media may well be necessary then.

Proper handling of time-linear media resources in HTML5

At this stage, however, we have already determined several limitations of the existing HTML5 media elements that require resolution without changing the time-linear nature of the resource.

1. Expose structure

Above all, there is a need to expose the above painted structure of a time-linear media resource to the Web page. Right now, when the <video> element links to a video file, it only accesses the main audio and video tracks, decodes them and displays them. The media framework that sits underneath the user agent (UA) and does the actual decoding for the UA might know about other tracks and might even decode, e.g. a caption track and display it by default, but the UA has no means of knowing this happens and controlling this.

We need a means to expose the available tracks inside a time-linear media resource and allow the UA some control over it – e.g. to choose whether to turn on/off a caption track, to choose which video track to display, or to choose which dubbed audio track to display.

I’ll discuss in another article different approaches on how to expose the structure. Suffice for now that we recognise the need to expose the tracks.

2. Separate the media resource concept from actual files

A HTML page is a sequence of HTML tags delivered over HTTP to a UA. A HTML page is a Web resource. It can be created dynamically and contain links to other Web resources such as images which complete its presentation.

We have to move to a similar “virtual” view of a media resource. Typically, a video is a single file with a video and an audio track. But also typically, caption and subtitle tracks for such a video file are stored in other files, possibly even on other servers. The caption or subtitle tracks are still in sync with the video file and therefore are actual tracks of that time-linear media resource. There is no reason to treat this differently to when the caption or subtitle track is inside the media file.

When we separate the media resource concept from actual files, we will find it easier to deal with time-linear media resources in HTML5.

3. Track activation and Display styling

A time-linear media resource, when regarded completely abstractly, can contain all sorts of alternative and additional tracks.

For example, the existing <source> elements inside a video or audio element are currently mostly being used to link to alternative encodings of the main media resource – e.g. either in mpeg4 or ogg format. We can regard these as alternative tracks within the same (virtual) time-linear media resource.

Similarly, the <source> elements have also been suggested to be used for alternate encodings, such as for mobile and Web. Again, these can be regarded as alternative tracks of the same time-linear media resource.

Another example are subtitle tracks for a main media resource, which are currently discussed to be referenced using the <itext> element. These are in principle alternative tracks amongst themselves, but additional to the main media resource. Also, some people are actually interested in displaying two subtitle tracks at the same time to learn translations.

Another example are sign language tracks, which are video tracks that can be regarded as an alternative to the audio tracks for hard-of-hearing users. They are then additional video tracks to the original video track and it is not clear how to display more than one video track. Typically, sign language tracks are displayed as picture-in-picture, but on the Web, where video is usually displayed in a small area, this may not be optimal.

As you can see, when deciding which tracks need to be displayed one needs to analyse the relationships between the tracks. Further, user preferences need to come into play when activating tracks. Finally, the user should be able to interactively activate tracks as well.

Once it is clear, what tracks need displaying, there is still the challenge of how to display them. It should be possible to provide default displays for typical track types, and allow Web authors to override these default display styles since they know what actual tracks their resource is dealing with.

While the default display seems to be typically an issue left to the UA to solve, the display overrides are typically dealt with on the Web through CSS approaches. How we solve this is for another time – right now we can just state the need for algorithms for track activiation and for default and override styling.


To make media resources a prime citizens on the Web, we have to go beyond simply replicating digital media files. The Web is based on hyperlinks between Web resources, and that includes hyperlinking out of resources (e.g. from any word within a Web page) as well as hyperlinking into resources (e.g. fragment URIs into Web pages).

To turn video and audio into hypervideo and hyperaudio, we need to enable hyperlinking into and out of them.

Hyperlinking into media resources is fortunately already being addressed by the W3C Media Fragments working group, which also regards media resources in the same way as HTML5. The addressing schemes under consideration are the following:

  • temporal fragment URI addressing: address a time offset/region of a media resource
  • spatial fragment URI addressing: address a rectangular region of a media resource (where available)
  • track fragment URI addressing: address one or more tracks of a media resource
  • named fragment URI addressing: address a named region of a media resource
  • a combination of the above addressing schemes

With such addressing schemes available, there is still a need to hook up the addressing with the resource. For the temporal and the spatial dimension, resolving the addressing into actual byte ranges is relatively obvious across any media type. However, track addressing and named addressing need to be resolved. Track addressing will become easier when we solve the above stated requirement of exposing the track structure of a media resource. The name definition requires association of an id or name with temporal offsets, spatial areas, or tracks. The addressing scheme will be available soon – whether our media resources can support them is another challenge to solve.

Finally, hyperlinking out of media resources is something that is not generally supported at this stage. Certainly, some types of media resources – QuickTime, Flash, MPEG4, Ogg – support the definition of tracks that can contain HTML marked-up text and thus can also contain hyperlinks. But standardisation in this space has not really happened yet. It seems to be clear that hyperlinks out of media files will come from some type of textual track. But a standard format for such time-aligned text tracks doesn’t yet exist. This is a challenge to be addressed in the near future.


The Web has always tried to deal with new extensions in the simplest possible manner, providing support for the majority of current use cases and allowing for the few extraordinary use cases to be satisfied by use of JavaScript or embedding of external, more complex objects.

With the new media elements in HTML5, this is no different. So far, the most basic need has been satisfied: that of including simple video and audio files into Web pages. However, many basic requirements are not being satisfied yet: accessibility needs, codec choice, device-independence needs are just some of the core requirements that make it important to extend our view of <audio> and <video> to a broader view of a Web media resource without changing the basic understanding of an audio and video resource.

This post has created the concept of a “media resource”, where we keep the simplicity of a single timeline. At the same time, it has tried to classify the list of shortcomings of the current media elements in a way that will help us address these shortcomings in a Web-conformant means.

If we accept the need to expose the structure of a media resource, the need to separate the media resource concept from actual files, the need for an approach to track activation, and the need to deal with styling of displayed tracks, we can take the next steps and propose solutions for these.

Further, understanding the structure of a media resources allows us to start addressing the harder questions of how to associate events with a media resource, how to associate a navigable structure with a media resource, or how to turn media resources into hypermedia.

HTML5 audio element accessibility

As part of my experiments in video accessibility I am also looking at the audio element. I have just finished a proof of concept for parsing Lyrics files for music in lrc format.

The demo uses Tay Zonday’s “Chocolate Rain” song both as a video with subtitles and as an audio file with lyrics. Fortunately, he published these all under a creative commons license, so I was able to use this music file. BTW: I found it really difficult to find a openly licensed music file with lyrics.

While I was at it, I also cleaned up all the old demos and now have a nice list of all demos in a central file.