The report of the recent W3C Video on the Web workshop has come out and includes recommendations to form a Video Metadata Working Group, or even more generally a Web Video Working Group.
I had some discussions with people who have a keen interest in this space, and we have come up with a list of topics that a W3C Video Working Group should look into. I want to share this list here. It goes into somewhat more detail than the topics raised at the W3C Video on the Web workshop. Feel free to add any further concerns or suggestions you have in the comments – I’d be curious to get feedback.
First, there are the fundamental issues:
- Choice of royalty-free baseline codecs for audio and video
- Choice of encapsulation format for multi-track media delivery
Both of these really require the generation of a list of requirements and use cases, then an analysis of existing formats with respect to these requirements, and finally a decision on which ones to use.
Requirements for codecs would encompass, amongst others, the need to cover different delivery and receiving devices – from mobile phones on 3G bandwidth, through Web video, to full-screen TV video over ADSL.
Here are some requirements for an encapsulation format:
- usable for live streaming and for canned delivery,
- the ability to easily decode from any offset in a media file,
- support for temporal and spatial hyperlinking and the partial delivery that these require,
- the ability to dynamically create multi-track media streams on a server and to deliver requested tracks only,
- the ability to compose valid streams out of segments from different servers based on a (play)list of temporal hyperlinks (see the sketch after this list),
- the ability to cache segments in the network,
- and the ability to easily add a different “codec” track into the encapsulation (as a means of preparing for future improved codecs or other codec plugins).
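To illustrate the last few requirements, here is a minimal sketch of how a client could compose one stream out of segments from different servers, driven by a playlist of temporal hyperlinks. The `#t=start,end` fragment syntax and the server names are made up for this sketch:

```typescript
// Hypothetical playlist of temporal hyperlinks into remote media files.
// The "#t=start,end" fragment syntax is invented for this sketch.
const playlist: string[] = [
  "http://serverA.example/interview.ogv#t=60,120",
  "http://serverB.example/keynote.ogv#t=0,30",
];

// Split each entry into the resource to fetch and the time range wanted.
function splitSegment(link: string): { resource: string; start: number; end: number } {
  const [resource, fragment] = link.split("#t=");
  const [start, end] = fragment.split(",").map(Number);
  return { resource, start, end };
}

// A conforming encapsulation format would let the client (or a network
// cache) concatenate the partial deliveries into one valid stream.
for (const link of playlist) {
  const { resource, start, end } = splitSegment(link);
  console.log(`fetch ${resource}, seconds ${start} to ${end}`);
}
```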
The decisions on an encapsulation format and on a/v codecs may require a further specification of how to map specific codecs into the chosen encapsulation format.
Then we have the “Web” requirements:
The technologies that have created what is known as the World Wide Web are fundamentally a hypertext markup language (HTML), a hypertext transfer protocol (HTTP) and a resource addressing scheme (URIs). Together they define the distributed nature of the Web. We need to build an infrastructure for hypermedia that builds on the existing Web technologies so we can make video a first-class citizen on the Web.
- Create a URI-compatible means of temporal hyperlinking directly into time offsets of media files (a sketch of such fragment addressing follows this list).
- Create a URI-compatible means of spatial hyperlinking directly into picture areas of video files.
- Create an HTTP-compatible protocol for negotiating and transferring video content between a Web server and a Web client. This also includes a definition of how video can be cached in HTTP network proxies and the like.
- Create a markup language for video that also enables hyperlinks from any time and region in a video to any other Web resource. Time-aligned annotations and metadata need to be part of this, just like HTML annotates text.
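To make the first two points more concrete, here is a minimal sketch of what such fragment addressing could look like. The `#t=start,end` and `#area=x,y,width,height` syntaxes are purely hypothetical stand-ins for whatever scheme a working group would standardise:

```typescript
// Hypothetical URI fragments: "#t=start,end" addresses a time interval,
// "#area=x,y,width,height" addresses a picture region. Both syntaxes are
// invented here; the real schemes would be defined by the working group.

interface TemporalFragment { start: number; end?: number }
interface SpatialFragment { x: number; y: number; width: number; height: number }

function parseFragment(uri: string): TemporalFragment | SpatialFragment | null {
  const hash = new URL(uri).hash;
  let m = hash.match(/^#t=([\d.]+)(?:,([\d.]+))?$/);
  if (m) {
    return { start: Number(m[1]), end: m[2] ? Number(m[2]) : undefined };
  }
  m = hash.match(/^#area=(\d+),(\d+),(\d+),(\d+)$/);
  if (m) {
    return { x: Number(m[1]), y: Number(m[2]), width: Number(m[3]), height: Number(m[4]) };
  }
  return null;
}

// Link straight into seconds 12.5 to 21 of a video:
console.log(parseFragment("http://example.com/talk.ogv#t=12.5,21"));
// Link to a 320x240 region at offset (10,20) in the picture:
console.log(parseFragment("http://example.com/talk.ogv#area=10,20,320,240"));
```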
All of these measures together will turn ordinary media into hypermedia, ready for distributed usage on the Web.
In addition to these fundamental Web technologies, integration into modern Web environments would require:
- a standard definition of a JavaScript API to interact with the media data (a sketch follows this list),
- an event model,
- a DOM integration of the textual markup,
- and possibly the use of CSS or SVG to define layout, effects, transitions and other presentation issues.
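As a rough illustration of the first two points, here is a sketch of what such a script API and event model might look like. All interface and event names below are invented and do not presume any particular browser API:

```typescript
// Invented event set and media interface, for illustration only.
type MediaEventMap = {
  timeupdate: { currentTime: number };
  ended: Record<string, never>;
};

class ScriptableMedia {
  currentTime = 0; // seek by assigning a new offset (seconds)
  private listeners: { [K in keyof MediaEventMap]?: Array<(e: MediaEventMap[K]) => void> } = {};

  addEventListener<K extends keyof MediaEventMap>(type: K, fn: (e: MediaEventMap[K]) => void): void {
    const list = this.listeners[type] ?? [];
    list.push(fn);
    this.listeners[type] = list;
  }

  // Stand-in for the user agent advancing playback.
  tick(seconds: number): void {
    this.currentTime += seconds;
    this.listeners.timeupdate?.forEach((fn) => fn({ currentTime: this.currentTime }));
  }
}

// Scripts could then drive playback and react to time-aligned events:
const video = new ScriptableMedia();
video.addEventListener("timeupdate", (e) => {
  console.log(`playback position: ${e.currentTime}s`);
});
video.currentTime = 12.5; // temporal seek from script
video.tick(0.04);         // user agent fires a timeupdate
```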
Then there are the Metadata requirements:
We all know that videos have a massive amount of metadata – i.e. data about the video. There are different types of metadata and they need to be handled differently.
- Time-aligned text, such as captions, subtitles, transcripts, karaoke and similar text.
- Header-type metadata, such as the ID3 tags for mp3 files, or the vorbiscomments for Ogg files.
- Manifest-type descriptions of the relationships between different media file tracks, similar to what SMIL enables, such as the ROE format recently in development at Xiph.
The time-aligned text should actually be regarded as a codec, because it is time-aligned just like audio or video data. If we want to be able to do live streaming of annotated media content and receive all the data as a multiplexed stream through one connection, we need to be able to multiplex the text codec into the binary stream just like we do with audio and video. Thus, the definition of time-aligned text codecs has to ensure the ability to multiplex.
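As a sketch of this idea (with invented field names): a caption becomes just another packet that carries presentation times, so a multiplexer can interleave it with audio and video packets by time:

```typescript
// Invented packet structure: time-aligned text shares the timing fields
// of audio and video packets, so it can be interleaved into one stream.
interface StreamPacket {
  trackId: string;
  startTime: number; // presentation time in seconds
  endTime: number;
  payload: Uint8Array;
}

function captionPacket(start: number, end: number, text: string): StreamPacket {
  return {
    trackId: "caption-en",
    startTime: start,
    endTime: end,
    payload: new TextEncoder().encode(text),
  };
}

// A multiplexer interleaves packets from all tracks by presentation time.
function mux(...tracks: StreamPacket[][]): StreamPacket[] {
  return tracks.flat().sort((a, b) => a.startTime - b.startTime);
}
```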
Header-type metadata should be machine-accessible and available for human consumption as required. It can be used to manage copyright and other rights-related information.
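For illustration, header-type metadata boils down to simple field/value pairs; the field names below follow common vorbiscomment usage, while the values are made up:

```typescript
// Field/value pairs in the spirit of vorbiscomment or ID3 (sample values).
const header: Record<string, string> = {
  TITLE: "Interview on Video and the Web",
  DATE: "2008-01-15",
  COPYRIGHT: "2008 Example Productions",
  LICENSE: "http://creativecommons.org/licenses/by/3.0/",
};
```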
The manifest is important for dynamically creating multi-track media files as required through a client-server interaction, such as a request for an audio track in a specific language rather than the default.
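Here is a loose sketch of the information such a manifest expresses – the track list, track kinds and languages, and the defaults a client can override in its request. The field names are invented (ROE itself is an XML format):

```typescript
// Invented data model for a manifest describing the tracks of a resource.
interface Track {
  id: string;
  kind: "audio" | "video" | "caption";
  language?: string;
  isDefault: boolean; // delivered unless the client requests otherwise
}

const manifest: Track[] = [
  { id: "v", kind: "video", isDefault: true },
  { id: "a-en", kind: "audio", language: "en", isDefault: true },
  { id: "a-de", kind: "audio", language: "de", isDefault: false },
  { id: "cc-en", kind: "caption", language: "en", isDefault: false },
];

// A client asks for German audio instead of the default English track;
// the server multiplexes only the selected tracks into the delivery.
function selectTracks(tracks: Track[], audioLanguage: string): Track[] {
  return tracks.filter(
    (t) => (t.kind === "audio" ? t.language === audioLanguage : t.isDefault)
  );
}

console.log(selectTracks(manifest, "de").map((t) => t.id)); // [ "v", "a-de" ]
```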
Other topics of interest:
There are two more topics I would like to point out that require activity.
- “DRM”: The real need here has to be analysed. Is it a need to encrypt the media file such that it can only be read by specific recipients? Maybe an encryption scheme with public and private keys could provide this functionality (see the sketch after this list)? Or is it a need to retain copyright and licensing information with the media data? Then the encapsulation of metadata inside the media files may already be a good solution, since this information stays with the media file when it is delivered or copied.
- Accessibility: It needs to be ascertained that captions, sign language, video descriptions and the like can be associated with the video in a time-aligned fashion within the chosen encapsulation format. A standard time-aligned format for specifying sign language would be needed.
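On the encryption side of the “DRM” question, a standard hybrid scheme could cover the need: encrypt the media with a fresh symmetric content key and wrap that key with the recipient’s public key. Here is a minimal sketch using Node’s built-in crypto module, with all key management simplified away:

```typescript
import {
  generateKeyPairSync, publicEncrypt, privateDecrypt,
  randomBytes, createCipheriv, createDecipheriv,
} from "crypto";

// Recipient's key pair; in practice the public key would be distributed
// ahead of time and the private key never leave the recipient.
const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });

// Sender: encrypt the media bytes with a fresh symmetric content key...
const media = Buffer.from("pretend these are media bytes");
const contentKey = randomBytes(32);
const iv = randomBytes(12);
const cipher = createCipheriv("aes-256-gcm", contentKey, iv);
const encryptedMedia = Buffer.concat([cipher.update(media), cipher.final()]);
const authTag = cipher.getAuthTag();

// ...and wrap the content key with the recipient's public key.
const wrappedKey = publicEncrypt(publicKey, contentKey);

// Recipient: unwrap the content key and decrypt the media.
const key = privateDecrypt(privateKey, wrappedKey);
const decipher = createDecipheriv("aes-256-gcm", key, iv);
decipher.setAuthTag(authTag);
const decrypted = Buffer.concat([decipher.update(encryptedMedia), decipher.final()]);
console.log(decrypted.toString()); // "pretend these are media bytes"
```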
This list of required technologies has been built through years of experience with the seamless integration of video into the World Wide Web in the Annodex project, and through recent discussions at the W3C Video on the Web workshop and elsewhere.
This list just provides a structure for what needs to be addressed to make video a first-class citizen on the Web. There are many difficult detail problems to solve in each of these areas. It is a challenge even to understand the complexity of the problem, but I hope this structure can help break down some of that complexity and help us start attacking the issues.