View counts on YouTube contradictory
UPDATE (6th February 2010): YouTube have just reacted to my bug and it seems there are some gData links that are more up-to-date than others. You need to go with the “uploads” gData APIs rather than the search or user ones to get accurate data. Glad YouTube told me and it’s documented now!
I am an avid user of YouTube Insight, the metrics tool that YouTube provides freely to everyone who publishes videos through them. YouTube Insight provides graphs on video views, the countries they originate in, demographics of the viewership, how the videos are discovered, engagement metrics, and hotspot analysis. It is a great tool to analyse the success of your videos, determine when to upload the next one, find out what works and what doesn’t.
However, you cannot rely on the accuracy of the numbers that YouTube Insight displays. In fact, YouTube provides three different means to find out what the current views (and other statistics, but let’s focus on the views) are for your videos:
- the view count displayed on the video’s watch page
- the view count displayed in YouTube Insight
- the view count given in the gData API feed
The shocking reality is: for all videos I have looked at that are less than about a month old and keep getting views, all three numbers are different.
Sometimes they are just off by one or two, which is tolerable and understandable, since the data must be served from a number of load balanced servers or even server clusters and it would be difficult to keep all of these clusters at identical numbers all of the time.
However, for more than 50% of the videos I have looked at, the numbers are off by a substantial amount.
I have undertaken an analysis with random videos, where I have collected the gData views and the watch page views. The Insight data tends to be between these two numbers, but I cannot generally reach that data, so I have left it out of this analysis.
Here are the stats for 36 randomly picked videos in the 9 view-count classes defined by TubeMogul and by how much they are off at the time that I looked at them:
Class | Video | watch page | gData API | age | diff | percentage |
---|---|---|---|---|---|---|
>1M | 1 | 7,187,174 | 6,082,419 | 2 weeks | 1,104,755 | 15.37% |
>1M | 2 | 3,196,690 | 3,080,415 | 3 weeks | 116,275 | 3.64% |
>1M | 3 | 2,247,064 | 1,992,844 | 1 week | 254,220 | 11.31% |
>1M | 4 | 1,054,278 | 1,040,591 | 1 month | 13,687 | 1.30% |
100K-500K | 5 | 476,838 | 148,681 | 11 days | 328,157 | 68.82% |
100K-500K | 6 | 356,561 | 294,309 | 2 weeks | 62,252 | 17.46% |
100K-500K | 7 | 225,951 | 195,159 | 2 weeks | 30,792 | 13.63% |
100K-500K | 8 | 113,521 | 62,241 | 1 week | 51,280 | 45.17% |
10K-100K | 9 | 86,964 | 46 | 4 days | 86,918 | 99.95% |
10K-100K | 10 | 52,922 | 43,548 | 3 weeks | 9,374 | 17.71% |
10K-100K | 11 | 34,001 | 33,045 | 1 month | 956 | 2.81% |
10K-100K | 12 | 15,704 | 13,653 | 2 weeks | 2,051 | 13.06% |
5K-10K | 13 | 9,144 | 8,967 | 1 month | 177 | 1.94% |
5K-10K | 14 | 7,265 | 5,409 | 1 month | 1,856 | 25.55% |
5K-10K | 15 | 6,640 | 5,896 | 2 weeks | 744 | 11.20% |
5K-10K | 16 | 5,092 | 3,518 | 6 days | 1,574 | 30.91% |
2.5K-5K | 17 | 4,955 | 4,928 | 3 weeks | 27 | 0.54% |
2.5K-5K | 18 | 4,341 | 4,044 | 4 days | 297 | 6.84% |
2.5K-5K | 19 | 3,377 | 3,306 | 3 weeks | 71 | 2.10% |
2.5K-5K | 20 | 2,734 | 2,714 | 1 month | 20 | 0.73% |
1K-2.5K | 21 | 2,208 | 2,169 | 3 weeks | 39 | 1.77% |
1K-2.5K | 22 | 1,851 | 1,747 | 2 weeks | 104 | 5.62% |
1K-2.5K | 23 | 1,281 | 1,244 | 1 week | 37 | 2.89% |
1K-2.5K | 24 | 1,034 | 984 | 2 weeks | 50 | 4.84% |
500-1K | 25 | 999 | 844 | 6 days | 155 | 15.52% |
500-1K | 26 | 891 | 790 | 6 days | 101 | 11.34% |
500-1K | 27 | 861 | 600 | 3 days | 261 | 30.31% |
500-1K | 28 | 645 | 482 | 4 days | 163 | 25.27% |
100-500 | 29 | 460 | 436 | 10 days | 24 | 5.22% |
100-500 | 30 | 291 | 285 | 4 days | 6 | 2.06% |
100-500 | 31 | 256 | 198 | 3 days | 58 | 22.66% |
100-500 | 32 | 196 | 175 | 11 days | 21 | 10.71% |
0-100 | 33 | 88 | 74 | 10 days | 14 | 15.91% |
0-100 | 34 | 64 | 49 | 12 days | 15 | 23.44% |
0-100 | 35 | 46 | 21 | 5 days | 25 | 54.35% |
0-100 | 36 | 31 | 25 | 3 days | 6 | 19.35% |
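The diff and percentage columns are derived directly from the two reported counts; as a quick sketch of the calculation (the function name is mine, for illustration only):

```javascript
// Difference between watch-page and gData view counts, expressed
// as a percentage of the watch-page count (assumed to be correct).
function viewDiff(watchPage, gData) {
  var diff = watchPage - gData;
  var percentage = (100 * diff / watchPage).toFixed(2) + '%';
  return { diff: diff, percentage: percentage };
}

// Video 1 from the table above:
console.log(viewDiff(7187174, 6082419)); // { diff: 1104755, percentage: '15.37%' }
```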
The videos were chosen to be older than a couple of days but no more than a month old. For videos older than about a month, the view counts had generally stopped increasing and the metrics had caught up, except where views were still increasing rapidly, which is an unusual case.
Generally, it seems that the watch page has the correct views. In contrast, the gData interface seems to be updated only about once a week. Further, from looking at YouTube channels where I have access to Insight, it seems Insight is updated about every 4 days and receives corrected data for the days in which it hadn't caught up.
Further, it seems that YouTube make no differentiation between channels of partners and general users’ channels – both can have a massive difference between the watch page and gData. Most videos differ by less than 20%, but some have exceptionally high differences above 50% and even up to 99.95%.
The difference is particularly pronounced for videos that show a steep increase in views – the first few days tend to have massive differences. Since these are the days that are particularly interesting to monitor for publishers, having the gData interface lag behind this much is shocking.
Further, videos with a low number of views, in particular less than 100, also show a particularly high percentage in difference – sometimes an increase in view count isn’t reported at all in the gData API for weeks. It seems that YouTube treats the long tail worse than the rest of YouTube. For every video in this class, the absolute difference will be small – obviously less than 100 views. With almost 30% of videos being such videos, it is somewhat understandable that YouTube are not making the effort to update their views regularly. OTOH, these views may be particularly important to their publishers.
It seems to me that YouTube need to change their approach to updating statistics across the watch pages, Insight and gData.
Firstly, it is important to have the watch page, Insight and gData in sync – otherwise, what number would you use in a report? If the gData API for YouTube statistics lags behind the watch page and Insight by even 24 hours, it is useless for indicating trends or for use in reports, and people have to go back to screen-scraping to gain information on the actual views of their videos.
Secondly, it would be good to update the statistics daily during the first 3-4 weeks, or as long as the videos are gaining views heavily. This is the important time to track the success of videos and if neither Insight nor gData are up to date in this time, and can even be almost 100% off, the statistics are actually useless.
Lastly, one has to wonder how accurate the success calculations are for YouTube partners, who rely on YouTube reporting to gain payment for advertising. Since the analysis showed that the inaccuracies extend also into partner channels, one has to hope that the data that is eventually reported through Insight is actually accurate, even if intermittently there are large differences.
Finally, I must say that I was rather disappointed with the way in which this issue has so far been dealt with in the YouTube Forums. The issue of wrongly reported view counts was first raised more than a year ago and has been reported regularly since by various people. Some of the reports were really unfriendly in their demands. Still, I would have expected a serious reply from a YouTube employee about why there are issues and how, or indeed whether, they are going to be fixed. Instead, all I found was a more than 9 month old mention that YouTube seems to be aware of the issue and working on it – no news since.
Also, I found no other blog posts analysing this issue, so here we are. Please, YouTube, let us know what is going on with Insight, why are the numbers off by this much, and what are you doing to fix it?
NB: I just posted a bug on gData, since we were unable to find any concrete bugs relating to this issue there. I’m actually surprised about this, since so many people reported it in the YouTube Forums!
Manifests for exposing the structure of a Composite Media Resource
In the previous post I explained that there is a need to expose the tracks of a time-linear media resource to the user agent (UA). Here, I want to look in more detail at different possibilities of how to do so, their advantages and disadvantages.
Note: A lot of this has come out of discussions I had at the recent W3C TPAC and is still in flux, so I am writing this to start discussions and brainstorm.
Declarative Syntax vs JavaScript API
We can expose a media resource’s tracks either through a JavaScript function that can loop through the tracks and provide access to the tracks and their features, or we can do this through declarative syntax.
Using declarative syntax has the advantage of being available even if JavaScript is disabled in a UA. The markup can be parsed easily and default displays can be prepared without having to actually decode the media file(s).
OTOH, it has the disadvantage that it may not necessarily represent what is actually in the binary resource, but instead what the Web developer assumed was in the resource (or what he forgot to update). This may lead to a situation where a “404” may need to be given on a media track.
A further disadvantage is that when somebody copies the media element onto another Web page, together with all the track descriptions, and the original media resource is later changed (e.g. a subtitle track is added), the change does not propagate to the other Web page, so the copy no longer describes the resource correctly.
For these reasons, I thought that a JavaScript interface was preferable over declarative syntax.
However, recent discussions, in particular with some accessibility experts, have convinced me that declarative syntax is preferable, because it allows the creation of a menu for turning tracks on/off without even having to load the media file. Further, declarative syntax allows multiple files and “native tracks” of a virtual media resource to be treated in an identical manner.
Extending Existing Declarative Syntax
The HTML5 media elements already have declarative syntax to specify multiple source media files for media elements. The <source> element is typically used to list video in mpeg4 and ogg format for support in different browsers, but has also been envisaged for different screensize and bandwidth encodings.
The <source> elements are generally meant to list different resources that contribute towards the media element. In that respect, let’s try using it for declaring a manifest of tracks of the virtual media resource on an example:
<video>
  <source id='av1' src='video.3gp' type='video/mp4' media='mobile' lang='en' role='media' >
  <source id='av2' src='video.mp4' type='video/mp4' media='desktop' lang='en' role='media' >
  <source id='av3' src='video.ogv' type='video/ogg' media='desktop' lang='en' role='media' >
  <source id='dub1' src='video.ogv?track=audio[de]' type='audio/ogg' lang='de' role='dub' >
  <source id='dub2' src='audio_ja.oga' type='audio/ogg' lang='ja' role='dub' >
  <source id='ad1' src='video.ogv?track=auddesc[en]' type='audio/ogg' lang='en' role='auddesc' >
  <source id='ad2' src='audiodesc_de.oga' type='audio/ogg' lang='de' role='auddesc' >
  <source id='cc1' src='video.mp4?track=caption[en]' type='application/ttaf+xml' lang='en' role='caption' >
  <source id='cc2' src='video.ogv?track=caption[de]' type='text/srt; charset="ISO-8859-1"' lang='de' role='caption' >
  <source id='cc3' src='caption_ja.ttaf' type='application/ttaf+xml' lang='ja' role='caption' >
  <source id='sign1' src='signvid_ase.ogv' type='video/ogg; codecs="theora"' media='desktop' lang='ase' role='sign' >
  <source id='sign2' src='signvid_gsg.ogv' type='video/ogg; codecs="theora"' media='desktop' lang='gsg' role='sign' >
  <source id='sign3' src='signvid_sfs.ogv' type='video/ogg; codecs="theora"' media='desktop' lang='sfs' role='sign' >
  <source id='tad1' src='tad_en.srt' type='text/srt; charset="ISO-8859-1"' lang='en' role='tad' >
  <source id='tad2' src='video.ogv?track=tad[de]' type='text/srt; charset="ISO-8859-1"' lang='de' role='tad' >
  <source id='tad3' src='tad_ja.srt' type='text/srt; charset="EUC-JP"' lang='ja' role='tad' >
</video>
Note that this somewhat ignores my previously proposed special itext tag for handling text tracks. I am doing this here to experiment with a more integrative approach with the virtual media resource idea from the previous post. This may well be a better solution than a specific new text-related element. Most of the attributes of the itext element are, incidentally, covered.
You will also notice that some of the tracks are references to tracks inside binary media files using the Media Fragment URI specification while others link to full files. An example is video.ogv?track=auddesc[en]. So, this is a uniform means of exposing all the tracks that are part of a (virtual) media resource to the UA, no matter whether in-band or in external files. It actually relies on the UA or server being able to resolve these URLs.
“type” attribute
“media” and “type” are existing attributes of the <source> element in HTML5 and meant to help the UA determine what to do with the referenced resource. The current spec states:
The “type” attribute gives the type of the media resource, to help the user agent determine if it can play this media resource before fetching it.
The word “play” might need to be replaced with “decode” to cover several different MIME types.
The “type” attribute was also extended with the possibility to add the “charset” MIME parameter of a linked text resource – this is particularly important for SRT files, which don’t handle charsets very well. It avoids having to add an additional attribute and is analogous to the “codecs” MIME parameter used by audio and video resources.
“media” attribute
Further, the spec states:
The “media” attribute gives the intended media type of the media resource, to help the user agent determine if this media resource is useful to the user before fetching it. Its value must be a valid media query.
The “mobile” and “desktop” values are hints that I’ve used for simplicity reasons. They could be improved by giving appropriate bandwidth limits and width/height values, etc. Other values could be different camera angles such as topview, frontview, backview. The media query aspect has to be looked into in more depth.
“lang” attribute
The above example further uses “lang” and “role” attributes:
The “lang” attribute is an existing global attribute of HTML5, which typically indicates the language of the data inside the element. Here, it is used to indicate the language of the referenced resource. This is possibly not quite the best name choice and should maybe be called “hreflang”, which is already used in multiple other elements to signify the language of the referenced resource.
“role” attribute
The “role” attribute is also an existing attribute in HTML5, included from ARIA. It currently doesn’t cover media resources, but could be extended. The suggestion here is to specify the roles of the different media tracks – the ones I have used here are:
- “media”: a main media resource – typically contains audio and video and possibly more
- “dub”: an audio track that provides an alternative dubbed language track
- “auddesc”: an audio track that provides an additional audio description track
- “caption”: a text track that provides captions
- “sign”: a video-only track that provides an additional sign language video track
- “tad”: a text track that provides textual audio descriptions to be read by a screen reader or a braille device
Further roles could be “music”, “speech”, “sfx” for audio tracks, “subtitle”, “lyrics”, “annotation”, “chapters”, “overlay” for text tracks, and “alternate” for an alternate main media resource, e.g. a different camera angle.
Track activation
The given attributes help the UA decide what to display.
It will firstly find out from the “type” attribute if it is capable of decoding the track.
Then, the UA will find out from the “media” query, “role”, and “lang” attributes whether a track is relevant to its user. This will require checking the capabilities of the device, network, and the user preferences.
Further, it could be possible for Web authors to influence whether a track is displayed or not through CSS parameters on the <source> element: “display: none” or “visibility: hidden/visible”.
Examples for track activation that a UA would undertake using the example above:
Given a desktop computer with Firefox, German language preferences, captions and sign language activated, the UA will fetch the original video at video.ogv (for Firefox), the German caption track at video.ogv?track=caption[de], and the German sign language track at signvid_gsg.ogv (maybe also the German dubbed audio track at video.ogv?track=audio[de], which would then replace the original one).
Given a desktop computer with Safari, English language preferences and audio descriptions activated, the UA will fetch the original video at video.mp4 (for Safari) and the textual audio description at tad_en.srt to be displayed through the screen reader, since it cannot decode the Ogg audio description track at video.ogv?track=auddesc[en].
Also, all decodable tracks could be exposed in a right-click menu and added on demand.
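To make the selection logic concrete, here is a hypothetical sketch of such an activation algorithm in JavaScript. None of these functions or preference structures are real browser APIs; they merely mirror the attribute checks described above:

```javascript
// Hypothetical sketch of the activation steps above – none of these
// functions or preference structures are real browser APIs.
function activateTracks(sources, prefs) {
  return sources.filter(function (s) {
    return prefs.canDecode(s.type) &&                      // step 1: "type"
      (!s.media || s.media === prefs.media) &&             // step 2: "media" query
      prefs.roles.indexOf(s.role) !== -1 &&                // step 2: "role" enabled by user
      (s.role === 'media' || prefs.langs.indexOf(s.lang) !== -1); // "lang" preference
  });
}

// A few of the <source> entries from the example above:
var sources = [
  { id: 'av2',   type: 'video/mp4',                  media: 'desktop', lang: 'en',  role: 'media' },
  { id: 'av3',   type: 'video/ogg',                  media: 'desktop', lang: 'en',  role: 'media' },
  { id: 'cc2',   type: 'text/srt; charset="ISO-8859-1"',               lang: 'de',  role: 'caption' },
  { id: 'sign2', type: 'video/ogg; codecs="theora"', media: 'desktop', lang: 'gsg', role: 'sign' }
];

// Firefox on a desktop, German language preferences, captions and sign language on:
var prefs = {
  canDecode: function (t) { return t.indexOf('video/ogg') === 0 || t.indexOf('text/srt') === 0; },
  media: 'desktop',
  langs: ['de', 'gsg'],
  roles: ['media', 'caption', 'sign']
};

console.log(activateTracks(sources, prefs).map(function (s) { return s.id; }));
// → [ 'av3', 'cc2', 'sign2' ]
```

This reproduces the first scenario above: the Ogg video, the German captions and the German sign language track are activated, while the MP4 alternative is rejected at the decode step.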
Display styling
Default styling of these tracks could be:
- video or alternate video in the video display area,
- sign language probably as picture-in-picture (making it useless on a mobile and only of limited use on the desktop),
- captions/subtitles/lyrics as overlays on the bottom of the video display area (or whatever the caption format prescribes),
- textual audio descriptions as ARIA live regions hidden behind the video or off-screen.
Multiple audio tracks can always be played at the same time.
The Web author could also define the display area for a track through CSS styling and the UA would then render the data into that area at the rate that is required by the track.
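Purely as a hypothetical illustration of such overrides (no UA currently supports CSS styling of <source> elements; the selectors refer to ids from the earlier example):

```css
/* Hypothetical: render the German caption track in a dedicated bar
   below the video display area */
source#cc2 { display: block; width: 480px; height: 60px; }

/* Hypothetical: suppress the sign language track by default */
source#sign2 { display: none; }
```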
How good is this approach?
The advantage of this new proposal is that it builds basically on existing HTML5 components with minimal additions to satisfy requirements for content selection and accessibility of media elements. It is a declarative approach to the multi-track media resource challenge.
However, it leaves most of the decision on what tracks are alternatives of/additions to each other and which tracks should be displayed to the UA. The UA makes an informed decision because it gets a lot of information through the attributes, but it still has to make decisions that may become rather complex. Maybe there needs to be a grouping level for alternative tracks and additional tracks – similar to what I did with the second itext proposal, or similar to the <switch> and <par> elements of SMIL.
A further issue is one that is currently being discussed within the Media Fragments WG: how can you discover the track composition and the track naming/uses of a particular media resource? How, e.g., can a Web author on another Web site know how to address the tracks inside your binary media resource? A HTML specification like the above can help. But what if that doesn’t exist? And what if the file is being used offline?
Alternative Manifest descriptions
The need to manifest the track composition of a media resource is not a new one. Many other formats and applications had to deal with these challenges before – some have defined and published their format.
I am going to list a few of these formats here with examples. They could inspire a next version of the above proposal with grouping elements.
Microsoft ISM files (SMIL subpart)
With the release of IIS7, Microsoft introduced “Smooth Streaming”, which uses chunking on files on the server to deliver adaptive streaming to Silverlight clients over HTTP. To inform a smooth streaming client of the tracks available for a media resource, Microsoft defined ism files: IIS Smooth Streaming Server Manifest files.
This is a short example – a longer one can be found here:
<?xml version=
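As a rough illustrative sketch, such a server manifest is a SMIL 2.0 document along these lines (file names, bitrates and metadata are invented here; the authoritative syntax is defined in Microsoft's IIS Smooth Streaming documentation):

```xml
<?xml version="1.0" encoding="utf-8"?>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta name="title" content="Example presentation" />
  </head>
  <body>
    <switch>
      <!-- alternative video encodings at different bitrates -->
      <video src="video_1500kbps.ismv" systemBitrate="1500000" />
      <video src="video_500kbps.ismv" systemBitrate="500000" />
      <!-- the audio track -->
      <audio src="video_audio.isma" systemBitrate="64000" />
    </switch>
  </body>
</smil>
```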
2009, Nov 25th, Panel: The Future of ICT Education
In this podcast, Mark Jones interviews Pia Waugh, ICT Policy Advisor for Senator Lundy; Senator Kate Lundy; Matt Barrie, CEO and founder, Freelancer.com; and Silvia Pfeiffer, CEO and co-founder, Vquence about the ICT skills shortage and ways of addressing it. Education and tax incentives are two topics under discussion.
Talks and Interviews
Today I am starting a new collection – recordings of interviews, talks I have given, and slides of the talks. There are many that I’ve missed, sorry. You may find some slides also on Slideshare.
- 2009, Oct 9th, Web Directions South: Taking HTML5 a step further
- 2009, Nov 25th, Panel: The Future of ICT Education
- 2010, Oct 14th, Web Directions South, Sydney: “HTML5 Audio and Video”
- 2011, Jan 24th, LCA Multimedia Mini-conf: “Audio and Video processing in HTML5”
- 2011, Feb 25th, SLUG: “The latest on HTML5 media”
- 2011, March 23rd, Google Tech Talk: “HTML5 video accessibility and the WebVTT file format”
- 2011, April 18th, ZDNet Interview: “Geek Culture”
- 2011, July 21st, Web Standards Group WSG: “The latest on HTML5 media”
- 2011, Sep 19th, W3C Web and TV Workshop: “WebVTT in HTML5”
- 2011, Nov 11th, Google Developer Day Sydney: “Making Your Web Apps Accessible Using HTML5 and ChromeVox”
- 2011, Dec 1st, OZeWAI Conference: “HTML5 Video Accessibility”
- 2012, Jan 15th, Drupal Down Under: “HTML5 Video Specifications”
- 2012, Jan 16th, LCA Browser Miniconf: “Web Standardisation – how browser vendors collaborate, or not”
- 2012, Jan 16th, LCA Multimedia Miniconf: “HTML5 Video Accessibility Update”
- 2012, Jan 16th, Linux.conf.au: “Developing accessible web applications – how hard can it be?”
- 2012, Feb 11th, Test the Web Forward: “How to Read a Spec”
- 2012, May 23rd, Web Directions Code: “Implementing Video Conferencing in HTML5”
- 2012, Jul 4th, Web & TV Workshop Berlin: Keynote: “New HTML5 video technologies for the future of TV”
- 2012, Jul 31st, Geek Girls Dinner: “Implementing Video Conferencing in HTML5 Web browsers”
- 2012, Sep 1st, VideoLAN Developer Days: “WebVTT – The Web Video Text Track Format”
- 2012, Oct 19th, Web Directions South: “WebVTT and video accessibility”
- 2013, Jan 28th, LCA Multimedia Miniconf: “Browsers and HTML5 video accessibility”
- 2013, Jan 29th, LCA Browser Miniconf: Panel
- 2013, Jan 30th, Linux.conf.au: “Code up your own video conference in HTML5”
- 2013, April 22nd, Digital TV Group UK: Technical Webinar: “Accessibility in HTML5 media”
- 2013, May 2nd, Web Directions Code: “HTML5 Multi-party video conferencing”
- 2014, Jan 7th, LCA Multimedia Miniconf: rtc.io: A node.js toolbox for WebRTC
- 2014, July 30th, WDC NZ: “Secure peer-to-peer video and data in browsers”
- 2014, Rural Medicine Conference: Remote speech pathology assessments: making language assessments more accessible to children living in rural NSW
- 2015, Nov 4th, WebRTC Summit: WebRTC beyond Audio and Video
- 2016, April 6th, CRN Pipeline Event: Bespoke Video Conferencing for Integrators
- 2017, November 11th, AVCAL: “What’s the next big thing in deep tech innovation?”
- 2018, Apr 11th, Australian Telehealth Conference, Sydney: “Co-Designing Speech Pathology Telepractice Applications”
- 2019, May, International Women’s Forum: Reinventing Telehealth with Artificial Intelligence
- 2020, Oct, UTS: Commercialising Telehealth
- 2021, November, Successes and Failures in Telehealth Conference: Expanding footprint and improving access to care with video telehealth
- 2022, August, Rural Online Conference for GP: Unified Phone and Video Telehealth
- 2022, October, APA Focus Conference: Technology, AI, and the future of Physiotherapy Practice
- 2023, April, National Telehealth Conference: In the aftermath of COVID, have we hit peak Telehealth and what is the future of virtual care?
The model of a time-linear media resource for HTML5
HTML5 has been criticised for not having a timing model of the media resource in its new media elements. This article spells it out and builds a framework of how we should think about HTML5 media resources. Note: these are my thoughts and nothing official from HTML5 – just conclusions I have drawn from the specs and from discussions I had.
What is a time-linear media resource?
In HTML5, and also in the Media Fragment URI specification, we deal exclusively with audio and video resources that represent a single timeline. Let’s call such a Web resource a time-linear media resource.
The Media Fragment requirements document actually has a very nice picture that describes such resources.
The resource can potentially consist of any number of audio, video, text, image or other time-aligned data tracks. All these tracks adhere to a single timeline, which tends to be defined by the main audio or video track, while other tracks have been created to synchronise with these main tracks.
This model matches with the world view of video on YouTube and any other video hosting service. It also matches with video used on any video streaming service.
Background on the choice of “time-linear”
I’ve deliberately chosen the word “time-linear” because we are talking about a single, gap-free, linear timeline here and not multiple timelines that represent the single resource.
The word “linear” is, however, somewhat over-used, since the introduction of digital systems into the world of analog film introduced what is now known as “non-linear video editing”. This term originates from the fact that non-linear video editing systems don’t have to linearly spool through film material to get to an edit point, but can directly access any frame in the footage as easily as any other.
When talking about a time-linear media resource, we are referring to a digital resource and therefore direct access to any frame in the footage is possible. So, a time-linear media resource will still be usable within a non-linear editing process.
As a Web resource, a time-linear media resource is not addressed as a sequence of frames or samples, since these are encoding specific. Rather, the resource is handled abstractly as an object that has track and time dimensions – and possibly spatial dimensions where image or video tracks are concerned. The framerate encoding of the resource itself does not matter and could, in fact, be changed without changing the resource’s time, track and spatial dimensions and thus without changing the resource’s address.
Interactive Multimedia
The term “time-linear” is used to specify the difference between a media resource that follows a single timeline, in contrast to one that deals with multiple timelines, linked together based on conditions, events, user interactions, or other disruptions to make a fully interactive multi-media experience. Thus, media resources in HTML5 and Media Fragments do not qualify as interactive multimedia themselves because they are not regarded as a graph of interlinked media resources, but simply as a single time-linear resource.
In this respect, time-linear media resources are also different from the kind of interactive multi-media experiences that an Adobe Shockwave Flash, Silverlight, or SMIL file can create. These can go far beyond what current typical video publishing and communication applications on the Web require, and far beyond what the HTML5 media elements were created for. If your application needs multiple timelines, it may be necessary to use SMIL, Silverlight, or Adobe Flash to create it.
Note that the fact that the HTML5 media elements are part of the Web, and therefore expose states and integrate with JavaScript, provides Web developers with a certain control over the playback order of a time-linear media resource. The simple functions pause(), play(), and the currentTime attribute allow JavaScript developers to control the current playback offset and whether to stop or start playback. Thus, it is possible to interrupt playback and present, e.g., an overlay text with a hyperlink, or an additional media resource, or anything else a Web developer can imagine right in the middle of playing back a media resource.
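As a sketch of what such scripted control can look like (the element ids, the URL and the 10-second cue point are all invented for illustration):

```html
<!-- Minimal sketch: pause at a cue point and present an overlay with a
     hyperlink. The ids, the URL and the 10-second cue time are invented. -->
<video id="v" src="video.ogv" controls></video>
<div id="overlay" style="display: none;">
  <a href="http://example.com/more" id="resume">Read more, then click to resume</a>
</div>
<script>
  var video = document.getElementById('v');
  var overlay = document.getElementById('overlay');
  var shown = false;
  video.addEventListener('timeupdate', function () {
    if (!shown && video.currentTime >= 10) {
      shown = true;
      video.pause();                    // interrupt the time-linear playback
      overlay.style.display = 'block';  // present the overlay text
    }
  }, false);
  document.getElementById('resume').addEventListener('click', function () {
    overlay.style.display = 'none';
    video.play();                       // resume the single timeline
  }, false);
</script>
```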
In this way, time-linear media resources can contribute towards an interactive multi-media experience, created by a Web developer through a combination of multiple media resources, image resources, text resources and Web pages. The limitations of this approach are not yet clear at this stage – how far will such a constructed multi-media experience be able to take us and where does it become more complicated than an Adobe Flash, Silverlight, or SMIL experience. The answer to this question will, I believe, become clearer through the next few years of HTML5 usage and further extensions to HTML5 media may well be necessary then.
Proper handling of time-linear media resources in HTML5
At this stage, however, we have already determined several limitations of the existing HTML5 media elements that require resolution without changing the time-linear nature of the resource.
1. Expose structure
Above all, there is a need to expose the structure outlined above of a time-linear media resource to the Web page. Right now, when the <video> element links to a video file, it only accesses the main audio and video tracks, decodes them and displays them. The media framework that sits underneath the user agent (UA) and does the actual decoding for the UA might know about other tracks and might even decode, e.g., a caption track and display it by default, but the UA has no means of knowing this happens and no way of controlling it.
We need a means to expose the available tracks inside a time-linear media resource and allow the UA some control over it – e.g. to choose whether to turn on/off a caption track, to choose which video track to display, or to choose which dubbed audio track to display.
I’ll discuss in another article different approaches on how to expose the structure. Suffice for now that we recognise the need to expose the tracks.
2. Separate the media resource concept from actual files
A HTML page is a sequence of HTML tags delivered over HTTP to a UA. A HTML page is a Web resource. It can be created dynamically and contain links to other Web resources such as images which complete its presentation.
We have to move to a similar “virtual” view of a media resource. Typically, a video is a single file with a video and an audio track. But also typically, caption and subtitle tracks for such a video file are stored in other files, possibly even on other servers. The caption or subtitle tracks are still in sync with the video file and therefore are actual tracks of that time-linear media resource. There is no reason to treat this differently to when the caption or subtitle track is inside the media file.
When we separate the media resource concept from actual files, we will find it easier to deal with time-linear media resources in HTML5.
3. Track activation and Display styling
A time-linear media resource, when regarded completely abstractly, can contain all sorts of alternative and additional tracks.
For example, the existing <source> elements inside a video or audio element are currently mostly being used to link to alternative encodings of the main media resource – e.g. either in mpeg4 or ogg format. We can regard these as alternative tracks within the same (virtual) time-linear media resource.
Similarly, the <source> elements have also been suggested to be used for alternate encodings, such as for mobile and Web. Again, these can be regarded as alternative tracks of the same time-linear media resource.
Another example is subtitle tracks for a main media resource, which are currently discussed to be referenced using the <itext> element. These are in principle alternative tracks amongst themselves, but additional to the main media resource. Also, some people are interested in displaying two subtitle tracks at the same time, e.g. to learn a language through translations.
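A sketch of what such markup could look like follows below. Note that <itext> is only a proposal at this stage, not a standardised element, and the attribute names shown here are illustrative and may well change:

```html
<!-- Sketch of the proposed (not standardised) <itext> element for
     referencing external subtitle and caption tracks; attribute names
     are illustrative only. -->
<video src="video.ogv" controls>
  <itext src="captions_en.srt" lang="en" category="CC"></itext>
  <itext src="subtitles_de.srt" lang="de" category="SUB"></itext>
</video>
```

The two <itext> tracks are alternatives to each other, but additions to the video and audio tracks of video.ogv – exactly the alternative/additional distinction discussed above.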
Another example is sign language tracks, which are video tracks that can be regarded as an alternative to the audio tracks for hard-of-hearing users. They are then additional video tracks to the original video track, and it is not clear how to display more than one video track. Typically, sign language tracks are displayed as picture-in-picture, but on the Web, where video is usually displayed in a small area, this may not be optimal.
As you can see, when deciding which tracks need to be displayed one needs to analyse the relationships between the tracks. Further, user preferences need to come into play when activating tracks. Finally, the user should be able to interactively activate tracks as well.
Once it is clear which tracks need displaying, there is still the challenge of how to display them. It should be possible to provide default displays for typical track types, and to allow Web authors to override these default display styles, since they know what actual tracks their resource is dealing with.
While the default display is typically an issue left to the UA to solve, display overrides are typically dealt with on the Web through CSS approaches. How we solve this is for another time – right now we can just state the need for algorithms for track activation and for default and override styling.
Hypermedia
To make media resources first-class citizens on the Web, we have to go beyond simply replicating digital media files. The Web is based on hyperlinks between Web resources, and that includes hyperlinking out of resources (e.g. from any word within a Web page) as well as hyperlinking into resources (e.g. fragment URIs into Web pages).
To turn video and audio into hypervideo and hyperaudio, we need to enable hyperlinking into and out of them.
Hyperlinking into media resources is fortunately already being addressed by the W3C Media Fragments working group, which also regards media resources in the same way as HTML5. The addressing schemes under consideration are the following:
- temporal fragment URI addressing: address a time offset/region of a media resource
- spatial fragment URI addressing: address a rectangular region of a media resource (where available)
- track fragment URI addressing: address one or more tracks of a media resource
- named fragment URI addressing: address a named region of a media resource
- a combination of the above addressing schemes
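For illustration, the URI syntax under discussion in the Media Fragments working group looks roughly as follows; the exact syntax was still in flux at the time of writing and may differ in the final specification:

```
http://example.com/video.ogv#t=10,20               temporal: seconds 10 to 20
http://example.com/video.ogv#xywh=160,120,320,240  spatial: 320x240 region at (160,120)
http://example.com/video.ogv#track=audio           track: the audio track only
http://example.com/video.ogv#id=chapter-1          named: a region named "chapter-1"
http://example.com/video.ogv#t=10,20&track=audio   combination of dimensions
```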
With such addressing schemes available, there is still a need to hook up the addressing with the resource. For the temporal and the spatial dimension, resolving the addressing into actual byte ranges is relatively straightforward across media types. Track addressing and named addressing, however, need extra information to be resolved: track addressing will become easier once we solve the above-stated requirement of exposing the track structure of a media resource, and name definition requires associating an id or name with temporal offsets, spatial areas, or tracks. The addressing schemes will be available soon – whether our media resources can support them is another challenge to solve.
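As a sketch of the temporal case, a UA would first have to parse the fragment into time offsets before it can map them to byte ranges. The helper below is hypothetical, not a standard API, and it ignores the npt/smpte time-format prefixes that were also under discussion:

```javascript
// Sketch only: resolve the temporal dimension of a media fragment URI
// such as "#t=10,20" into start/end offsets in seconds. Hypothetical
// helper, not a standard API; npt/smpte prefixes are not handled.
function parseTemporalFragment(url) {
  const hash = url.split('#')[1] || '';
  for (const pair of hash.split('&')) {
    const [name, value] = pair.split('=');
    if (name !== 't' || !value) continue;
    const [start, end] = value.split(',');
    return {
      start: start ? parseFloat(start) : 0,                // "#t=,20" means "from 0"
      end: end !== undefined ? parseFloat(end) : Infinity  // "#t=10" means "to the end"
    };
  }
  return null; // this fragment has no temporal dimension
}
```

Resolving these offsets into byte ranges of the actual media file is then a job for the UA or the server, and depends on the media format.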
Finally, hyperlinking out of media resources is something that is not generally supported at this stage. Certainly, some types of media resources – QuickTime, Flash, MPEG-4, Ogg – support the definition of tracks that can contain HTML-marked-up text and can thus also contain hyperlinks. But standardisation in this space has not really happened yet. It seems clear that hyperlinks out of media files will come from some type of textual track, but a standard format for such time-aligned text tracks doesn’t yet exist. This is a challenge to be addressed in the near future.
Summary
The Web has always tried to deal with new extensions in the simplest possible manner, providing support for the majority of current use cases and allowing for the few extraordinary use cases to be satisfied by use of JavaScript or embedding of external, more complex objects.
With the new media elements in HTML5, this is no different. So far, the most basic need has been satisfied: that of including simple video and audio files into Web pages. However, many basic requirements are not being satisfied yet: accessibility needs, codec choice, device-independence needs are just some of the core requirements that make it important to extend our view of <audio> and <video> to a broader view of a Web media resource without changing the basic understanding of an audio and video resource.
This post has created the concept of a “media resource”, where we keep the simplicity of a single timeline. At the same time, it has tried to classify the list of shortcomings of the current media elements in a way that will help us address these shortcomings in a Web-conformant manner.
If we accept the need to expose the structure of a media resource, the need to separate the media resource concept from actual files, the need for an approach to track activation, and the need to deal with styling of displayed tracks, we can take the next steps and propose solutions for these.
Further, understanding the structure of a media resource allows us to start addressing the harder questions of how to associate events with a media resource, how to associate a navigable structure with it, or how to turn media resources into hypermedia.
HTML5 Video element discussions at TPAC meetings
Last week’s TPAC (2009 W3C Technical Plenary / Advisory Committee) was my second TPAC, and I found myself becoming highly involved with the progress on accessibility for the HTML5 video element. There were in particular two meetings of high relevance: the Video Accessibility workshop and Friday’s HTML5 breakout group on the video element.
HTML5 Video Accessibility Workshop
The week started on Sunday with the “HTML5 Video Accessibility workshop” at Stanford University, organised by John Foliot and Dave Singer. They brought together a substantial number of people all representing a variety of interest groups. Everyone got their chance to present their viewpoint – check out the minutes of the meeting for a complete transcript.
The list of people and their discussion topics were as follows:
Accessibility Experts
- Janina Sajka, chair of WAI Protocols and Formats: represented the vision-impaired community and expressed requirements for a deeply controllable access interface to audio-visual content, preferably in a structured manner similar to DAISY.
- Sally Cain, RNIB, Member of W3C PF group: expressed a deep need for audio descriptions, which are often overlooked while captions get most of the attention.
- Ken Harrenstien, Google: has worked on captioning support for video.google and YouTube and shared his experiences, e.g. http://www.youtube.com/watch?v=QRS8MkLhQmM, and automated translation.
- Victor Tsaran, Yahoo! Accessibility Manager: joined for a short time out of interest.
Practitioners
- John Foliot, professor at Stanford Uni: showed a captioning service that he set up at Stanford University to enable lecturers to publish more accessible video – it uses humans for transcription, but automated tools to time-align, and provides a Web interface to the staff.
- Matt May, Adobe: shared what Adobe learnt about accessibility in Flash – in particular that an instream-only approach to captions was a naive approach and that external captions are much more flexible, extensible, and can fit into current workflows.
- Frank Olivier, Microsoft: attended to listen and learn.
Technologists
- Pierre-Antoine Champin from Liris (France), who was not able to attend, sent a video about their research work on media accessibility using automatic and manual annotation.
- Hironobu Takagi, IBM Labs Tokyo, general chair for W4A: demonstrated a text-based audio description system combined with a high-quality, almost human-sounding speech synthesizer.
- Dick Bulterman, Researcher at CWI in Amsterdam, co-chair of SYMM (group at W3C doing SMIL): reported on 14 years of experience with multimedia presentations and SMIL (slides) and the need to make temporal and spatial synchronisation explicit to be able to do the complex things.
- Joakim S
FOMS and LCA Multimedia Miniconf
If you haven’t proposed a presentation yet, go ahead and register yourself for:
FOMS (Foundations of Open Media Software workshop) at
http://www.foms-workshop.org/foms2010/pmwiki.php/Main/CFP
LCA Multimedia Miniconf at
http://www.annodex.org/events/lca2010_mmm/pmwiki.php/Main/CallForP
It’s already November and there’s only Christmas between now and the conferences!
I’m personally hoping for many discussions about HTML5 <video> and <audio>, including what to do with multitrack files, with cue ranges, and captions. These should also be relevant to other open media frameworks – e.g. how should we all handle multitrack sign language tracks?
But there are heaps of other topics to discuss, and anyone doing any work with open media software will find fruitful discussions at FOMS.
2009: Taking HTML5 a step further
Silvia Pfeiffer, “Taking HTML5 <video> a step further”, Web Directions South Conference 2009, W3C Standards Track, Sydney Convention Centre, October 2009.
2006: Real or Virtual? — Design Reflections on a Remote Collaborative Information Space for Creative People.
C. Schremmer, S. Pfeiffer, A. Krumm-Heller, F. Mueller, “Real or Virtual? — Design Reflections on a Remote Collaborative Information Space for Creative People”, CSIRO Technical Report 06/136.