WebRTC predictions for 2016

I wrote these predictions in the first week of January and meant to publish them as encouragement to think about where WebRTC still needs some work. I’d like to be able to compare the state of WebRTC in the browser a year from now. Therefore, without further ado, here are my thoughts.

WebRTC Browser support

I’m quite optimistic when it comes to browser support for WebRTC. We have seen Edge bring in initial support last year and Apple looking to hire engineers to implement WebRTC. My prediction is that we will see the following developments in 2016:

  • Edge will become interoperable with Chrome and Firefox, i.e. it will publish VP8/VP9 and H.264/H.265 support
  • Firefox of course continues to support both VP8/VP9 and H.264/H.265
  • Chrome will follow the spec and implement H.264/H.265 support (to add to their already existing VP8/VP9 support)
  • Safari will enter the WebRTC space but only with H.264/H.265 support

Codec Observations

With Edge and Safari entering the WebRTC space, there will be a larger focus on H.264/H.265. It will help with creating interoperability between the browsers.

However, since there are so many flavours of H.264/H.265, I expect that when different browsers are used at different endpoints, we will get poor quality video calls because of having to negotiate a common denominator. Certainly, baseline will work interoperably, but better encoding quality and lower bandwidth will only be achieved if all endpoints use the same browser.

Thus, we will get to the funny situation where we buy ourselves interoperability at the cost of video quality and bandwidth. I’d call that a “degree of interoperability” and not the best possible outcome.

I’m going to go out on a limb and say that at this stage, Google is going to consider strongly to improve the case of VP8/VP9 by improving its bandwidth adaptability: I think they will buy themselves some SVC capability and make VP9 the best quality codec for live video conferencing. Thus, when Safari eventually follows the standard and also implements VP8/VP9 support, the interoperability win of H.264/H.265 will become only temporary overshadowed by a vastly better video quality when using VP9.

The Enterprise Boundary

Like all video conferencing technology, WebRTC is having a hard time dealing with the corporate boundary: firewalls and proxies get in the way of setting up video connections from within an enterprise to people outside.

The telco world has come up with the concept of SBCs (session border controller). SBCs come packed with functionality to deal with security, signalling protocol translation, Quality of Service policing, regulatory requirements, statistics, billing, and even media service like transcoding.

SBCs are a total overkill for a world where a large number of Web applications simply want to add a WebRTC feature – probably mostly to provide a video or audio customer support service, but it could be a live training session with call-in, or an interest group conference all.

We cannot install a custom SBC solution for every WebRTC service provider in every enterprise. That’s like saying we need a custom Web proxy for every Web server. It doesn’t scale.

Cloud services thrive on their ability to sell directly to an individual in an organisation on their credit card without that individual having to ask their IT department to put special rules in place. WebRTC will not make progress in the corporate environment unless this is fixed.

We need a solution that allows all WebRTC services to get through an enterprise firewall and enterprise proxy. I think the WebRTC standards have done pretty well with firewalls and connecting to a TURN server on port 443 will do the trick most of the time. But enterprise proxies are the next frontier.

What it takes is some kind of media packet forwarding service that sits on the firewall or in a proxy and allows WebRTC media packets through – maybe with some configuration that is necessary in the browsers or the Web app to add this service as another type of TURN server.

I don’t have a full understanding of the problems involved, but I think such a solution is vital before WebRTC can go mainstream. I expect that this year we will see some clever people coming up with a solution for this and a new type of product will be born and rolled out to enterprises around the world.


So these are my predictions. In summary, they address the key areas where I think WebRTC still has to make progress: interoperability between browsers, video quality at low bitrates, and the enterprise boundary. I’m really curious to see where we stand with these a year from now.

It’s worth mentioning Philipp Hancke’s tweet reply to my post:

— we saw some clever people come up with a solution already. Now it needs to be implemented :-)

My journey to Coviu

My new startup just released our MVP – this is the story of what got me here.

I love creating new applications that let people do their work better or in a manner that wasn’t possible before.

German building and loan socityMy first such passion was as a student intern when I built a system for a building and loan association’s monthly customer magazine. The group I worked with was managing their advertiser contacts through a set of paper cards and I wrote a dBase based system (yes, that long ago) that would manage their customer relationships. They loved it – until it got replaced by an SAP system that cost 100 times what I cost them, had really poor UX, and only gave them half the functionality. It was a corporate system with ongoing support, which made all the difference to them.

Dr Scholz und Partner GmbHThe story repeated itself with a CRM for my Uncle’s construction company, and with a resume and quotation management system for Accenture right after Uni, both of which I left behind when I decided to go into research.

Even as a PhD student, I never lost sight of challenges that people were facing and wanted to develop technology to overcome problems. The aim of my PhD thesis was to prepare for the oncoming onslaught of audio and video on the Internet (yes, this was 1994!) by developing algorithms to automatically extract and locate information in such files, which would enable users to structure, index and search such content.

Many of the use cases that we explored are now part of products or continue to be challenges: finding music that matches your preferences, identifying music or video pieces e.g. to count ads on the radio or to mark copyright infringement, or the automated creation of video summaries such as trailers.


This continued when I joined the CSIRO in Australia – I was working on segmenting speech into words or talk spurts since that would simplify captioning & subtitling, and on MPEG-7 which was a (slightly over-engineered) standard to structure metadata about audio and video.

In 2001 I had the idea of replicating the Web for videos: i.e. creating hyperlinked and searchable video-only experiences. We called it “Annodex” for annotated and indexed video and it needed full-screen hyperlinked video in browsers – man were we ahead of our time! It was my first step into standards, got several IETF RFCs to my name, and started my involvement with open codecs through Xiph.

vquence logoAround the time that YouTube was founded in 2006, I founded Vquence – originally a video search company for the Web, but pivoted to a video metadata mining company. Vquence still exists and continues to sell its data to channel partners, but it lacks the user impact that has always driven my work.

As the video element started being developed for HTML5, I had to get involved. I contributed many use cases to the W3C, became a co-editor of the HTML5 spec and focused on video captioning with WebVTT while contracting to Mozilla and later to Google. We made huge progress and today the technology exists to publish video on the Web with captions, making the Web more inclusive for everybody. I contributed code to YouTube and Google Chrome, but was keen to make a bigger impact again.

NICTA logoThe opportunity came when a couple of former CSIRO colleagues who now worked for NICTA approached me to get me interested in addressing new use cases for video conferencing in the context of WebRTC. We worked on a kiosk-style solution to service delivery for large service organisations, particularly targeting government. The emerging WebRTC standard posed many technical challenges that we addressed by building rtc.io , by contributing to the standards, and registering bugs on the browsers.

Fast-forward through the development of a few further custom solutions for customers in health and education and we are starting to see patterns of need emerge. The core learning that we’ve come away with is that to get things done, you have to go beyond “talking heads” in a video call. It’s not just about seeing the other person, but much more about having a shared view of the things that need to be worked on and a shared way of interacting with them. Also, we learnt that the things that are being worked on are quite varied and may include multiple input cameras, digital documents, Web pages, applications, device data, controls, forms.

Coviu logoSo we set out to build a solution that would enable productive remote collaboration to take place. It would need to provide an excellent user experience, it would need to be simple to work with, provide for the standard use cases out of the box, yet be architected to be extensible for specialised data sharing needs that we knew some of our customers had. It would need to be usable directly on Coviu.com, but also able to integrate with specialised applications that some of our customers were already using, such as the applications that they spend most of their time in (CRMs, practice management systems, learning management systems, team chat systems). It would need to require our customers to sign up, yet their clients to join a call without sign-up.

Collaboration is a big problem. People are continuing to get more comfortable with technology and are less and less inclined to travel distances just to get a service done. In a country as large as Australia, where 12% of the population lives in rural and remote areas, people may not even be able to travel distances, particularly to receive or provide recurring or specialised services, or to achieve work/life balance. To make the world a global village, we need to be able to work together better remotely.

The need for collaboration is being recognised by specialised Web applications already, such as the LiveShare feature of Invision for Designers, Codassium for pair programming, or the recently announced Dropbox Paper. Few go all the way to video – WebRTC is still regarded as a complicated feature to support.

Coviu in action

With Coviu, we’d like to offer a collaboration feature to every Web app. We now have a Web app that provides a modern and beautifully designed collaboration interface. To enable other Web apps to integrate it, we are now developing an API. Integration may entail customisation of the data sharing part of Coviu – something Coviu has been designed for. How to replicate the data and keep it consistent when people collaborate remotely – that is where Coviu makes a difference.

We have started our journey and have just launched free signup to the Coviu base product, which allows individuals to own their own “room” (i.e. a fixed URL) in which to collaborate with others. A huge shout out goes to everyone in the Coviu team – a pretty amazing group of people – who have turned the app from an idea to reality. You are all awesome!

With Coviu you can share and annotate:

  • images (show your mum photos of your last holidays, or get feedback on an architecture diagram from a customer),
  • pdf files (give a presentation remotely, or walk a customer through a contract),
  • whiteboards (brainstorm with a colleague), and
  • share an application window (watch a YouTube video together, or work through your task list with your colleagues).

All of these are regarded as “shared documents” in Coviu and thus have zooming and annotations features and are listed in a document tray for ease of navigation.

This is just the beginning of how we want to make working together online more productive. Give it a go and let us know what you think.


WebVTT Audio Descriptions for Elephants Dream

When I set out to improve accessibility on the Web and we started developing WebSRT – later to be renamed to WebVTT – I needed an example video to demonstrate captions / subtitles, audio descriptions, transcripts, navigation markers and sign language.

I needed a freely available video with spoken text that either already had such data available or that I could create it for. Naturally I chose “Elephants Dream” by the Orange Open Movie Project , because it was created under the Creative Commons Attribution 2.5 license.

As it turned out, the Blender Foundation had already created a collection of SRT files that would represent the English original as well as the translated languages. I was able to reuse them by merely adding a WEBVTT header.

Then there was a need for a textual audio description. I read up on the plot online and finally wrote up a time-alignd audio description. I’m hereby making that file available under the Create Commons Attribution 4.0 license. I’ve added a few lines to the medadata headers so it doesn’t confuse players. Feel free to reuse at will – I know there are others out there that have a similar need to demonstrate accessibility features.

Progress with rtc.io

At the end of July, I gave a presentation about WebRTC and rtc.io at the WDCNZ Web Dev Conference in beautiful Wellington, NZ.


Putting that talk together reminded me about how far we have come in the last year both with the progress of WebRTC, its standards and browser implementations, as well as with our own small team at NICTA and our rtc.io WebRTC toolbox.

WDCNZ presentation page5

One of the most exciting opportunities is still under-exploited: the data channel. When I talked about the above slide and pointed out Bananabread, PeerCDN, Copay, PubNub and also later WebTorrent, that’s where I really started to get Web Developers excited about WebRTC. They can totally see the shift in paradigm to peer-to-peer applications away from the Server-based architecture of the current Web.

Many were also excited to learn more about rtc.io, our own npm nodules based approach to a JavaScript API for WebRTC.


We believe that the World of JavaScript has reached a critical stage where we can no longer code by copy-and-paste of JavaScript snippets from all over the Web universe. We need a more structured module reuse approach to JavaScript. Node with JavaScript on the back end really only motivated this development. However, we’ve needed it for a long time on the front end, too. One big library (jquery anyone?) that does everything that anyone could ever need on the front-end isn’t going to work any longer with the amount of functionality that we now expect Web applications to support. Just look at the insane growth of npm compared to other module collections:

Packages per day across popular platforms (Shamelessly copied from: http://blog.nodejitsu.com/npm-innovation-through-modularity/)

For those that – like myself – found it difficult to understand how to tap into the sheer power of npm modules as a font end developer, simply use browserify. npm modules are prepared following the CommonJS module definition spec. Browserify works natively with that and “compiles” all the dependencies of a npm modules into a single bundle.js file that you can use on the front end through a script tag as you would in plain HTML. You can learn more about browserify and module definitions and how to use browserify.

For those of you not quite ready to dive in with browserify we have prepared prepared the rtc module, which exposes the most commonly used packages of rtc.io through an “RTC” object from a browserified JavaScript file. You can also directly download the JavaScript file from GitHub.

Using rtc.io rtc JS library
Using rtc.io rtc JS library

So, I hope you enjoy rtc.io and I hope you enjoy my slides and large collection of interesting links inside the deck, and of course: enjoy WebRTC! Thanks to Damon, JEeff, Cathy, Pete and Nathan – you’re an awesome team!

On a side note, I was really excited to meet the author of browserify, James Halliday (@substack) at WDCNZ, whose talk on “building your own tools” seemed to take me back to the times where everything was done on the command-line. I think James is using Node and the Web in a way that would appeal to a Linux Kernel developer. Fascinating!!

AppRTC : Google’s WebRTC test app and its parameters

If you’ve been interested in WebRTC and haven’t lived under a rock, you will know about Google’s open source testing application for WebRTC: AppRTC.

When you go to the site, a new video conferencing room is automatically created for you and you can share the provided URL with somebody else and thus connect (make sure you’re using Google Chrome, Opera or Mozilla Firefox).

We’ve been using this application forever to check whether any issues with our own WebRTC applications are due to network connectivity issues, firewall issues, or browser bugs, in which case AppRTC breaks down, too. Otherwise we’re pretty sure to have to dig deeper into our own code.

Now, AppRTC creates a pretty poor quality video conference, because the browsers use a 640×480 resolution by default. However, there are many query parameters that can be added to the AppRTC URL through which the connection can be manipulated.

Here are my favourite parameters:

  • hd=true : turns on high definition, ie. minWidth=1280,minHeight=720
  • stereo=true : turns on stereo audio
  • debug=loopback : connect to yourself (great to check your own firewalls)
  • tt=60 : by default, the channel is closed after 30min – this gives you 60 (max 1440)

For example, here’s how a stereo, HD loopback test would look like: https://apprtc.appspot.com/?r=82313387&hd=true&stereo=true&debug=loopback .

This is not the limit of the available parameter, though. Here are some others that you may find interesting for some more in-depth geekery:

  • ss=[stunserver] : in case you want to test a different STUN server to the default Google ones
  • ts=[turnserver] : in case you want to test a different TURN server to the default Google ones
  • tp=[password] : password for the TURN server
  • audio=true&video=false : audio-only call
  • audio=false : video-only call
  • audio=googEchoCancellation=false,googAutoGainControl=true : disable echo cancellation and enable gain control
  • audio=googNoiseReduction=true : enable noise reduction (more Google-specific parameters)
  • asc=ISAC/16000 : preferred audio send codec is ISAC at 16kHz (use on Android)
  • arc=opus/48000 : preferred audio receive codec is opus at 48kHz
  • dtls=false : disable datagram transport layer security
  • dscp=true : enable DSCP
  • ipv6=true : enable IPv6

AppRTC’s source code is available here. And here is the file with the parameters (in case you want to check if they have changed).

Have fun playing with the main and always up-to-date WebRTC application: AppRTC.

UPDATE 12 May 2014

AppRTC now also supports the following bitrate controls:

  • arbr=[bitrate] : set audio receive bitrate
  • asbr=[bitrate] : set audio send bitrate
  • vsbr=[bitrate] : set video receive bitrate
  • vrbr=[bitrate] : set video send bitrate

Example usage: https://apprtc.appspot.com/?r=&asbr=128&vsbr=4096&hd=true

Use deck.js as a remote presentation tool

deck.js is one of the new HTML5-based presentation tools. It’s simple to use, in particular for your basic, every-day presentation needs. You can also create more complex slides with animations etc. if you know your HTML and CSS.

Yesterday at linux.conf.au (LCA), I gave a presentation using deck.js. But I didn’t give it from the lectern in the room in Perth where LCA is being held – instead I gave it from the comfort of my home office at the other end of the country.

I used my laptop with in-built webcam and my Chrome browser to give this presentation. Beforehand, I had uploaded the presentation to a Web server and shared the link with the organiser of my speaker track, who was on site in Perth and had set up his laptop in the same fashion as myself. His screen was projecting the Chrome tab in which my slides were loaded and he had hooked up the audio output of his laptop to the room speaker system. His camera was pointed at the audience so I could see their reaction.

I loaded a slide master URL:
and the room loaded the URL without query string:

Then I gave my talk exactly as I would if I was in the same room. Yes, it felt exactly as though I was there, including nervousness and audience feedback.

How did we do that? WebRTC (Web Real-time Communication) to the rescue, of course!

We used one of the modules of the rtc.io project called rtc-glue to add the video conferencing functionality and the slide navigation to deck.js. It was actually really really simple!

Here are the few things we added to deck.js to make it work:

  • Code added to index.html to make the video connection work:
    <meta name="rtc-signalhost" content="http://rtc.io/switchboard/">
    <meta name="rtc-room" content="lca2014">
    <video id="localV" rtc-capture="camera" muted></video>
    <video id="peerV" rtc-peer rtc-stream="localV"></video>
    <script src="glue.js"></script>
    glue.config.iceServers = [{ url: 'stun:stun.l.google.com:19302' }];

    The iceServers config is required to punch through firewalls – you may also need a TURN server. Note that you need a signalling server – in our case we used http://rtc.io/switchboard/, which runs the code from rtc-switchboard.

  • Added glue.js library to deck.js:

    Downloaded from https://raw.github.com/rtc-io/rtc-glue/master/dist/glue.js into the source directory of deck.js.

  • Code added to index.html to synchronize slide navigation:
    glue.events.once('connected', function(signaller) {
      if (location.search.slice(1) !== '') {
        $(document).bind('deck.change', function(evt, from, to) {
          signaller.send('/slide', {
            idx: to,
            sender: signaller.id
      signaller.on('slide', function(data) {
        console.log('received notification to change to slide: ', data.idx);
        $.deck('go', data.idx);

    This simply registers a callback on the slide master end to send a slide position message to the room end, and a callback on the room end that initiates the slide navigation.

And that’s it!

You can find my slide deck on GitHub.

Feel free to write your own slides in this manner – I would love to have more users of this approach. It should also be fairly simple to extend this to share pointer positions, so you can actually use the mouse pointer to point to things on your slides remotely. Would love to hear your experiences!

Note that the slides are actually a talk about the rtc.io project, so if you want to find out more about these modules and what other things you can do, read the slide deck or watch the talk when it has been published by LCA.

Many thanks to Damon Oehlman for his help in getting this working.

BTW: somebody should really fix that print style sheet for deck.js – I’m only ever getting the one slide that is currently showing. 😉

WebRTC books – a brief review

I just finished reading Rob Manson’s awesome book “Getting Started with WebRTC” and I can highly recommend it for any Web developer who is interested in WebRTC.

Rob explains very clearly how to create your first video, audio or data peer-connection using WebRTC in current Google Chrome or Firefox (I think it also now applies to Opera, though that wasn’t the case when his book was published). He makes available example code, so you can replicate it in your own Web application easily, including the setup of a signalling server. He also points out that you need a ICE (STUN/TURN) server to punch through firewalls and gives recommendations for what software is available, but stops short of explaining how to set them up.

Rob’s focus is very much on the features required in a typical Web application:

  • video calls
  • audio calls
  • text chats
  • file sharing

In fact, he provides the most in-depth demo of how to set up a good file sharing interface I have come across.

Rob then also extends his introduction to WebRTC to two key application areas: education and team communication. His recommendations are spot on and required reading for anyone developing applications in these spaces.

Before Rob’s book, I have also read Alan Johnson and Dan Burnett’s “WebRTC” book on APIs and RTCWEB protocols of the HTML5 Real-Time Web.

Alan and Dan’s book was written more than a year ago and explains that state of standardisation at that time. It’s probably a little out-dated now, but it still gives you good foundations on why some decisions were made the way they are and what are contentious issues (some of which still remain). If you really want to understand what happens behind the scenes when you call certain functions in the WebRTC APIs of browsers, then this is for you.

Alan and Dan’s book explains in more details than Rob’s book how IP addresses of communication partners are found, how firewall holepunching works, how sessions get negotiated, and how the standards process works. It’s probably less useful to a Web developer who just wants to implement video call functionality into their Web application, though if something goes wrong you may find yourself digging into the details of SDP, SRTP, DTLS, and other cryptic abbreviations of protocols that all need to work together to get a WebRTC call working.

Overall, both books are worthwhile and cover different aspects of WebRTC that you will stumble across if you are directly dealing with WebRTC code.

WebVTT Discussions at FOMS

At the recent FOMS (Foundations of Open Media Software and Standards) Developer Workshop, we had a massive focus on WebVTT and the state of its feature set. You will find links to summaries of the individual discussions in the FOMS Schedule page. Here are some of the key results I went away with.

1. WebVTT Regions

The key driving force for improvements to WebVTT continues to be the accurate representation of CEA608/708 captioning. As part of that drive, we’ve introduced regions (the CEA708 “window” concept) to WebVTT. WebVTT regions satisfy multiple requirements of CEA608/708 captions:

  1. support for rollup captions
  2. support for background color and border color on a group of cues independent of the background color of the individual cue
  3. possibility to move a group of cues from one location on screen to a different
  4. support to specify an anchor point and a growth direction for cues when their text size changes
  5. support for specifying a fixed number of lines to be rendered
  6. possibility to specify which region is rendered in front of which other one when regions overlap

While WebVTT regions enable us to satisfy all of the above points, the specification isn’t actually complete yet and some of the above needs aren’t satisfied yet.

We have an open bug to move a region elsewhere. A first discussion at FOMS seemed to to indicate that we’ll have to add syntax for updating a region at a particular time and thus give region definitions a way to be valid only for a certain time frame. I can imagine that the region definitions that we have in the header of the WebVTT file now would have an implicitly defined time frame from the start to the end of the file, but can be overruled by a re-definition anywhere within the WebVTT file. That redefinition needs to provide a start and end time.

We registered a bug to add specifying the width and height of regions (and possibly of cues) by em (i.e. by multiples of the largest character in a font). This should allow us to have the region grow/shrink around the region anchor point with a change of font size by script or a user. em specifications should also be applied to cues – that matches the column count of CEA708/608 better.

When regions overlap, the original region extension spec already suggested a “layer” cue setting. It will be easy to add it.

Another change that we will ultimately need is the “scroll” setting: we will need to introduce support for scrolling text down or from left-to-right or right-to-left, e.g. vertical scrolling text seems to be used in some Chinese caption use cases.

2. Unify Rendering Approach

The introduction of regions created a second code path in the rendering spec with some duplication. At FOMS we discussed if it was possible to unify that. The suggestion is to render all cues into a region. Those that are not part of a region would be rendered into an anonymous region that covers the complete viewport. There may be some consequences to this, e.g. cue settings should be usable across all cues, no matter whether or not part of a region, and avoiding cue overlap may need to be done within regions.

Here’s a rough outline of the path of the new rendering algorithm:

(1) Render the regions:

Specified Region Anonymous Region
Render values as given: Render following values:
  • width
  • lines
  • regionanchor
  • viewportanchor
  • scroll
  • 100%
  • videoheight/lineheight
  • 0,0
  • 0,0
  • none

(2) Render the cues:

  • Create a cue box and put it in its region (anonymous if none given).
  • Calculate position & size of cue box from cue settings (position, line, size).
  • Calculate position of cue text inside cue box from remaining cue settings (vertical, align).

3. Vertical Features

WebVTT includes vertical rendering, both right-to-left and left-to-right. However, regions are not defined for vertical. Eventually, we’re going to have to look at the vertical features of WebVTT with more details and figure out whether the spec is working for them and what real-world requirements we have missed. We hope we can get some help from users in countries where vertically rendered captions/subtitles are the norm.

4. Best Practices

Some of he WebVTT users at FOMS suggested it would be advantageous to start a list of “best practices” for how to author captions with WebVTT. Example recommendations are:

  • Use line numbers only to position cues from top or bottom of viewport. Don’t use otherwise.
  • Note that when the user increases the fontsize in rollup captions and thus introduces new line breaks, your cues will roll by faster because the number of lines of a rollup is fixed.
  • Make sure to use &lrm; and &rlm; UTF-8 markers to control the directionality of your text.

It would be nice if somebody started such a document.

5. Non-caption use cases

Instead of continuing to look back and improve our support of captions/subtitles in WebVTT, one session at FOMS also went ahead and looked forward to other use cases. The following requirements came out of this:

5.1 Preview Thumbnails

A common use case for timed data is the use of preview thumbnails on the navigation bar of videos. A native implementation of preview thumbnails would allow crawlers and search engines to have a standardised way of extracting timed images for media files, so introduction of a new @kind value “thumbnails” was suggested.

The content of a “thumbnails” cue could be any of:

  • an image URL
  • a sprite URL to a single image
  • a spatial & temporal media fragment URL to a media resource
  • base64 encoded image (data URI)
  • an iframe offset to the media resource

The suggestion is to allow anything that would work in a img @src attribute as value in a cue of @kind=”thumbnails”. Responsive images might also be useful for a track of @kind=”thumbnails”. It may even be possible to define an inband thumbnail track based on the track of @kind=”thumbnails”. Such cues should also work in the JavaScript track API.

5.2 Chapter markers

There is interest to put richer content than just a chapter title into chapter cues. Often, chapters consist of a title, text and and image. The text is not so important, but the image is used almost everywhere that chapters are used. There may be a need to extend chapter cue content with images, similar to what a @kind=”thumbnails” track offers.

The conclusion that we arrived at was that we need to make @kind=”thumbnails” work first and then look at using the learnings from that to extend @kind=”chapters”.

5.3 Inband tracks for live video

A difficult topic was opened with the question of how to transport text tracks in live video. In live captioning, end times are never created for cues, but are implied by the start time of the next cue. This is a use case that hasn’t been addressed in HTML5/WebVTT yet. An old proposal to allow a special end time value of “NEXT” was discussed and recommended for adoption. Also, there was support for the spec change that stops blocking loading VTT until all cues have been loaded.

5.4 Cross-domain VTT loading

A brief discussion centered around the fact that the spec disallows cross-domain loading of WebVTT files, but that no browser implements this. This needs to be discussion at the HTML WG level.

6. Regions in live captioning

The final topic that we discussed was how we could provide support for regions in live captioning.

  • The currently active region definitions will need to be come part of every header of every VTT file segment that HLS uses, so it’s available in case the cues in the segment file reference it.
  • “NEXT” in end time markers would make authoring of live captioned VTT files easier.
  • If the application wants to use 1 word at a time and doesn’t want to delay sending the word until the full cue is authored (e.g. in a Hangout type environment), we will need to introduce the concept of “cue continuation markers”, so we know that a cue could be extended with the next VTT file fragment.

This is an extensive and impressive amount of discussion around WebVTT and a lot of new work to be performed in the future. I’m very grateful for all the people who have contributed to these discussions at FOMS and will hopefully continue to help get the specifications right.

WebVTT as a W3C Recommendation

Three weeks ago I attended TPAC, the annual meeting of W3C Working Groups. One of the meetings was of the Timed Text Working Group (TT-WG), that has been specifying TTML, the Timed Text Markup Language. It is now proposed that WebVTT be also standardised through the same Working Group.

How did that happen, you may ask, in particular since WebVTT and TTML have in the past been portrayed as rival caption formats? How will the WebVTT spec that is currently under development in the Text Track Community Group (TT-CG) move through a Working Group process?

I’ll explain first why there is a need for WebVTT to become a W3C Recommendation, and then how this is proposed to be part of the Timed Text Working Group deliverables, and finally how I can see this working between the TT-CG and the TT-WG.

Advantages of a W3C Recommendation

TTML is a XML-based markup format for captions developed during the time that XML was all the hotness. It has become a W3C standard (a so-called “Recommendation”) despite not having been implemented in any browsers (if you ask me: that’s actually a flaw of the W3C standardisation process: it requires only two interoperable implementations of any kind – and that could be anyone’s JavaScript library or Flash demonstrator – it doesn’t actually require browser implementations. But I digress…). To be fair, a subpart of TTML is by now implemented in Internet Explorer, but all the other major browsers have thus far rejected proposals of implementation.

Because of its Recommendation status, TTML has become the basis for several other caption standards that other SDOs have picked: the SMPTE’s SMPTE-TT format, the EBU’s EBU-TT format, and the DASH Industry Forum’s use of SMPTE-TT. SMPTE-TT has also become the “safe harbour” format for the US legislation on captioning as decided by the FCC. (Note that the FCC requirements for captions on the Web are actually based on a list of features rather than requiring a specific format. But that will be the topic of a different blog post…)

WebVTT is much younger than TTML. TTML was developed as an interchange format among caption authoring systems. WebVTT was built for rendering in Web browsers and with HTML5 in mind. It meets the requirements of the <track> element and supports more than just captions/subtitles. WebVTT is popular with browser developers and has already been implemented in all major browsers (Firefox Nightly is the last to implement it – all others have support already released).

As we can see and as has been proven by the HTML spec and multiple other specs: browsers don’t wait for specifications to have W3C Recommendation status before they implement them. Nor do they really care about the status of a spec – what they care about is whether a spec makes sense for the Web developer and user communities and whether it fits in the Web platform. WebVTT has obviously achieved this status, even with an evolving spec. (Note that the spec tries very hard not to break backwards compatibility, thus all past implementations will at least be compatible with the more basic features of the spec.)

Given that Web browsers don’t need WebVTT to become a W3C standard, why then should we spend effort in moving the spec through the W3C process to become a W3C Recommendation?

The modern Web is now much bigger than just Web browsers. Web specifications are being used in all kinds of devices including TV set-top boxes, phone and tablet apps, and even unexpected devices such as white goods. Videos are increasingly omnipresent thus exposing deaf and hard-of-hearing users to ever-growing challenges in interacting with content on diverse devices. Some of these devices will not use auto-updating software but fixed versions so can’t easily adapt to new features. Thus, caption producers (both commercial and community) need to be able to author captions (and other video accessibility content as defined by the HTML5 element) towards a feature set that is clearly defined to be supported by such non-updating devices.

Understandably, device vendors in this space have a need to build their technology on standardised specifications. SDOs for such device technologies like to reference fixed specifications so the feature set is not continually updating. To reference WebVTT, they could use a snapshot of the specification at any time and reference that, but that’s not how SDOs work. They prefer referencing an officially sanctioned and tested version of a specification – for a W3C specification that means creating a W3C Recommendation of the WebVTT spec.

Taking WebVTT on a W3C recommendation track is actually advantageous for browsers, too, because a test suite will have to be developed that proves that features are implemented in an interoperable manner. In summary, I can see the advantages and personally support the effort to take WebVTT through to a W3C Recommendation.

Choice of Working Group

FAIK this is the first time that a specification developed in a Community Group is being moved into the recommendation track. This is something that has been expected when the W3C created CGs, but not something that has an established process yet.

The first question of course is which WG would take it through to Recommendation? Would we create a new Working Group or find an existing one to move the specification through? Since WGs involve a lot of overhead, the preference was to add WebVTT to the charter of an existing WG. The two obvious candidates were the HTML WG and the TT-WG – the first because it’s where WebVTT originated and the latter because it’s the closest thematically.

Adding a deliverable to a WG is a major undertaking. The TT-WG is currently in the process of re-chartering and thus a suggestion was made to add WebVTT to the milestones of this WG. TBH that was not my first choice. Since I’m already an editor in the HTML WG and WebVTT is very closely related to HTML and can be tested extensively as part of HTML, I preferred the HTML WG. However, adding WebVTT to the TT-WG has some advantages, too.

Since TTML is an exchange format, lots of captions that will be created (at least professionally) will be in TTML and TTML-related formats. It makes sense to create a mapping from TTML to WebVTT for rendering in browsers. The expertise of both, TTML and WebVTT experts is required to develop a good mapping – as has been shown when we developed the mapping from CEA608/708 to WebVTT. Also, captioning experts are already in the TT-WG, so it helps to get a second set of eyes onto WebVTT.

A disadvantage of moving a specification out of a CG into a WG is, however, that you potentially lose a lot of the expertise that is already involved in the development of the spec. People don’t easily re-subscribe to additional mailing lists or want the additional complexity of involving another community (see e.g. this email).

So, a good process needs to be developed to allow everyone to contribute to the spec in the best way possible without requiring duplicate work. How can we do that?

The forthcoming process

At TPAC the TT-WG discussed for several hours what the next steps are in taking WebVTT through the TT-WG to recommendation status (agenda with slides). I won’t bore you with the different views – if you are keen, you can read the minutes.

What I came away with is the following process:

  1. Fix a few more bugs in the CG until we’re happy with the feature set in the CG. This should match the feature set that we realistically expect devices to implement for a first version of the WebVTT spec.
  2. Make a FSA (Final Specification Agreement) in the CG to create a stable reference and a clean IPR position.
  3. Assuming that the TT-WG’s charter has been approved with WebVTT as a milestone, we would next bring the FSA specification into the TT-WG as FPWD (First Public Working Draft) and immediately do a Last Call which effectively freezes the feature set (this is possible because there has already been wide community review of the WebVTT spec); in parallel, the CG can continue to develop the next version of the WebVTT spec with new features (just like it is happening with the HTML5 and HTML5.1 specifications).
  4. Develop a test suite and address any issues in the Last Call document (of course, also fix these issues in the CG version of the spec).
  5. As per W3C process, substantive and minor changes to Last Call documents have to be reported and raised issues addressed before the spec can progress to the next level: Candidate Recommendation status.
  6. For the next step – Proposed Recommendation status – an implementation report is necessary, and thus the test suite needs to be finalized for the given feature set. The feature set may also be reduced at this stage to just the ones implemented interoperably, leaving any other features for the next version of the spec.
  7. The final step is Recommendation status, which simply requires sufficient support and endorsement by W3C members.

The first version of the WebVTT spec naturally has a focus on captioning (and subtitling), since this has been the dominant use case that we have focused on this far and it’s the part that is the most compatibly implemented feature set of WebVTT in browsers. It’s my expectation that the next version of WebVTT will have a lot more features related to audio descriptions, chapters and metadata. Thus, this seems a good time for a first version feature freeze.

There are still several obstacles towards progressing WebVTT as a milestone of the TT-WG. Apart from the need to get buy-in from the TT-WG, the TT-CG, and the AC (Adivisory Committee who have to approve the new charter), we’re also looking at the license of the specification document.

The CG specification has an open license that allows creating derivative work as long as there is attribution, while the W3C document license for documents on the recommendation track does not allow the creation of derivative work unless given explicit exceptions. This is an issue that is currently being discussed in the W3C with a proposal for a CC-BY license on the Recommendation track. However, my view is that it’s probably ok to use the different document licenses: the TT-WG will work on WebVTT 1.0 and give it a W3C document license, while the CG starts working on the next WebVTT version under the open CG license. It probably actually makes sense to have a less open license on a frozen spec.

Making the best of a complicated world

WebVTT is now proposed as part of the recharter of the TT-WG. I have no idea how complicated the process will become to achieve a W3C WebVTT 1.0 Recommendation, but I am hoping that what is outlined above will be workable in such a way that all of us get to focus on progressing the technology.

At TPAC I got the impression that the TT-WG is committed to progressing WebVTT to Recommendation status. I know that the TT-CG is committed to continue developing WebVTT to its full potential for all kinds of media-time aligned content with new kinds already discussed at FOMS. Let’s enable both groups to achieve their goals. As a consequence, we will allow the two formats to excel where they do: TTML as an interchange format and WebVTT as a browser rendering format.

Summary Video Accessibility Talk

I’ve just got off a call to the UK Digital TV Group, for which I gave a talk on HTML5 video accessibility (slides best viewed in Google Chrome).

The slide provide a high-level summary of the accessibility features that we’ve developed in the W3C for HTML5, including:

  • Subtitles & Captions with WebVTT and the track element
  • Video Descriptions with WebVTT, the track element and speech synthesis
  • Chapters with WebVTT for semantic navigation
  • Audio Descriptions through synchronising an audio track with a video
  • Sign Language video synchronized with a main video

I received some excellent questions.

The obvious one was about why WebVTT and not TTML. While for anyone who has tried to implement TTML support, the advantages of WebVTT should be clear, for some the decision of the browsers to go with WebVTT still seems to be bothersome. The advantages of CSS over XSL-FO in a browser-context are obvious, but not as much outside browsers. So, the simplicity of WebVTT and the clear integration with HTML have to speak for themselves. Conversion between TTML and WebVTT was a feature that was being asked for.

I received a question about how to support ducking (reduce the volume of the main audio track) when using video descriptions. My reply was to either use video descriptions with WebVTT and do ducking during the times that a cue is active, or when using audio descriptions (i.e. actual audio tracks) to add an additional WebVTT file of kind=metadata to mark the intervals in which to do ducking. In both cases some JavaScript will be necessary.

I received another question about how to do clean audio, which I had almost forgotten was a requirement from our earlier media accessibility document. “Clean audio” consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated. I suggested using the mediagroup attribute to provide a main video element (without an audio track) and then the other channels as parallel audio tracks that can be turned on and off and attenuated individually. There is some JavaScript coding involved on top of the APIs that we have defined in HTML, but it can be implemented in browsers that support the mediagroup attribute.

Another question was about the possibilities to extend the list of @kind attribute values. I explained that right now we have a proposal for a new text track kind=”forced” so as to provide forced subtitles for sections of video with foreign language. These would be on when no other subtitle or caption tracks are activated. I also explained that if there is a need for application-specific text tracks, the kind=”metadata” would be the correct choice.

I received some further questions, in particular about how to apply styling to captions (e.g. color changes to text) and about how closely the browser are able to keep synchronization across multiple media elements. The earlier was easily answered with the ::cue pseudo-element, but the latter is a quality of implementation feature, so I had to defer to individual browsers.

Overall it was a good exercise to summarize the current state of HTML5 video accessibility and I was excited to show off support in Chrome for all the features that we designed into the standard.