In the media fragment working group at the W3C, we are introducing a standard means to address fragments of media resources through URIs. The idea is to define URIs such as http://example.com/video.ogv#t=24m16s-30m12s, which would only retrieve the subpart of video.ogv that is of interest to the user and thus save bandwidth. This is particularly important for mobile devices, but also for pointing out highlights in videos on the Web, bookmarking, and other use cases.
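For illustration, here is a minimal Python sketch of how a user agent might parse such a temporal fragment into seconds. The function names and the exact time syntax are my own for this example, not taken from the (still unfinished) specification:

```python
import re

def parse_npt(token):
    """Parse a '24m16s'-style time token into seconds (hypothetical syntax)."""
    m = re.fullmatch(r'(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?', token)
    if not m or not any(m.groups()):
        raise ValueError("bad time token: %r" % token)
    h, mins, s = (float(g) if g else 0.0 for g in m.groups())
    return h * 3600 + mins * 60 + s

def parse_time_fragment(fragment):
    """Split a 't=<start>-<end>' fragment into (start, end) in seconds."""
    if not fragment.startswith('t='):
        raise ValueError("not a temporal fragment")
    start, _, end = fragment[2:].partition('-')
    return parse_npt(start), parse_npt(end)
```

For the URI above, `parse_time_fragment('t=24m16s-30m12s')` yields `(1456.0, 1812.0)`, i.e. the interval in seconds that the user agent would then have to negotiate with the server.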
I’d like to give a brief overview of the state of the discussion from a technical viewpoint here.
Let’s start by considering the protocols for which such a scheme could be defined. We are currently focusing on HTTP and RTSP, since they are open protocols for media delivery. P2P protocols are also under consideration; however, most of them are proprietary. Also, P2P protocols are mostly used to transfer complete large files, so fragment addressing may not be desired. RTSP already has a mechanism to address temporal fragments of media resources through a range parameter of the PLAY request as part of the protocol parameters. Yet, there is no URI addressing scheme for this. Our key focus, however, is HTTP, since most video content nowadays is transferred over HTTP, e.g. on YouTube.
Another topic that needs discussion is the types of fragmentation for which we will specify addressing schemes. At the moment, we are considering temporal fragmentation, spatial fragmentation, and fragmentation by tracks. In temporal fragmentation, a request asks for a time interval that is a subpart of the media resource (e.g. audio or video). In spatial fragmentation, the request is for an image region (e.g. in an image or a video). Track fragmentation addresses the case where, e.g., a blind person does not need to receive the actual video data of a video, so a user agent could request only those tracks of the resource that the user really requires.
Another concern is the syntax of URI addressing. URI fragments (“#”) were invented to create URIs that point at so-called “secondary” resources. By definition, a secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations. It is therefore the perfect syntax for media fragment URIs.
The only issue is that URI fragments (“#”) are not expected to be transferred from the client to the server (e.g. Apache strips the fragment off the URI if it receives one). Therefore, in the temporal URI specification of Annodex we decided to use the query (“?”) parameter instead. This is, however, not necessary. The W3C working group is proposing to have the user agent strip off the URI fragment specification and transform it into a protocol parameter. For HTTP, the idea is to introduce new range units for the types of fragmentation that we will define. Then, the Range and Content-Range headers can be used to request and deliver the information about the fragmentation.
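On the wire, the idea might look roughly like this. Note that the unit name “t”, the header syntax, and the numbers are purely illustrative – the working group has not settled on any of them:

```
GET /video.ogv HTTP/1.1
Host: example.com
Range: t=1456-1812

HTTP/1.1 206 Partial Content
Content-Type: video/ogg
Content-Range: t 1404-1812/3600
```

As in byte-range requests, the Content-Range in the response tells the user agent which part of the resource it actually received, which may be larger than what it asked for.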
The most complicated issue that we are dealing with is caching in Web proxies. Existing Web proxies will not understand new range units and will therefore not cache such requests. This is unfortunate, and we are trying to devise two schemes – one for existing Web proxies and one for future, more intelligent Web proxies – to enable proxy caching. This discussion has many dimensions – such as the ability to uniquely map time to bytes for any codec format, the ability to recompose new fragment requests from existing combined fragment requests, or the need for and feasibility of partial re-encoding. Mostly we are dealing with the complexities and restrictions of different codecs and encapsulation formats. Possibly the idea of recomposing ranges in Web proxies is too complex to realise, and caching is best done by regarding each fragment as its own cacheable resource, but this hasn’t been decided yet.
We now have experts from the squid community, from YouTube/Google, HTTP experts, Web accessibility experts, SMIL experts, me from Annodex/Xiph, and more people with diverse media backgrounds in the team. It’s a great group and we are covering the issues from all aspects. The brief update above is given from my perspective and only touches on the key issues, while the discussions that we’re having on the mailing list and in meetings are much more in-depth.
I am not quite expecting us to meet the deadline of having a first working draft before the end of this month, but certainly before Christmas.
9 thoughts on “Media fragment URI addressing”
I am wondering how a web browser will know whether a URI such as http://example.com/foooooo#t=24m16s-30m12s means jumping to anchor “t=24m16s-30m12s” on a page “foooooo” or downloading a segment of a file “foooooo”. Also, how can users predict this? I might be missing bits – can you shed some light or point me to somewhere with answers?
The discussion at the W3C media fragments working group centers around two approaches to this issue:
1) The browser will simply add the time range request to *every* URI that it requests over HTTP. The server will either ignore that parameter or, if the resource is indeed a media resource and the server is able to provide the time range, reply to it appropriately. In that case it adds some other HTTP header to the reply, and from this the user agent will know whether it has received the subpart of a media file and has to start displaying an audio or video file – or whether it is a totally different page and it needs to, e.g., jump to an anchor of that name.
2) Alternatively, we discussed whether we should include in the URI scheme a means to give the user agent a hint that this is expected to be a video and that the time range request therefore has to be added. This can be achieved, e.g., by adding an identifying character directly behind the “#” character. In particular, if we choose a character that is not allowed in an element id, we will never clash with anchor offsets of HTML pages. Example characters are “;”, “:”, “@”, “!”, “$” and many more (you can find them by comparing the characters allowed in a URI fragment at http://www.ietf.org/rfc/rfc3986.txt with the ID and NAME token restrictions at http://www.w3.org/TR/html4/types.html).
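A tiny sketch of what such a check could look like on the client side. The marker set and the function are hypothetical – they just illustrate the idea that a first character outside the HTML4 ID/NAME token alphabet can never collide with an anchor name:

```python
# HTML4 ID/NAME tokens must begin with a letter, so a fragment starting
# with any of these characters cannot be a valid anchor reference.
MEDIA_MARKERS = set(';:@!$')  # illustrative candidates from the list above

def looks_like_media_fragment(fragment):
    """Treat the fragment as a media fragment iff it starts with one of
    the proposed marker characters (a hypothetical detection scheme)."""
    return bool(fragment) and fragment[0] in MEDIA_MARKERS
```

Under this scheme, `;t=24m16s-30m12s` would trigger a time range request, while `section-2` would be treated as an ordinary anchor.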
Either way should really work because the HTTP protocol prescribes that parameters that cannot be understood should be ignored.
It’s great to hear about some of the discussion around this!
I really like the idea of the client turning the fragment request into something more efficient. This leaves user agents the freedom to just download the whole file and then present a portion, the way fragment references to html anchors work. More sophisticated agents can use whatever tricks they know for the protocol in question to negotiate transfer of just the region of interest.
What surprised me is that you’re considering alternate units for HTTP Range requests as a mechanism for this. I looked at this a few years ago, trying to solve the seek-latency problem but abandoned it because the semantics just aren’t clear.
There are a couple of issues here. The simplest is that byte range requests commute and time range requests do not. What I mean is, if you request bytes 0-999 of a file, then bytes 1000-2000, and concatenate the two, you have exactly what you’d get if you requested bytes 0-2000. Multimedia files can’t be done this way. Digital media is discrete, and of course the playback engine can round the requested point on the timeline to the nearest sample and start playback there, but it is not stored in sample units. Generally samples are grouped into blocks and the blocks are compressed with prediction from previous blocks. So to begin playback at 24m16s, the server might need to start returning data from 23m24s onward, and the user agent would have to feed all that data to the decoder, discarding the output until it got to 24m16s.
But that means requesting 5 seconds at 24m16s and five more at 24m21s gives you two files, one of which covers 23m24s-24m21s and one that covers 23m24s-24m26s. You can’t meaningfully concatenate them; you can only merge them, which requires detailed knowledge of the media format.
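The effect can be illustrated with a toy model (the keyframe positions are invented to match the numbers above, 23m24s being 1404 seconds): because the server must extend each request back to the preceding keyframe, adjacent time-range requests overlap instead of abutting:

```python
import bisect

# Hypothetical keyframe timeline, in seconds.
KEYFRAMES = [0, 702, 1404, 1480]

def served_range(start, end):
    """Extend a requested time range back to the preceding keyframe,
    as a server would have to so that the decoder can restart."""
    i = bisect.bisect_right(KEYFRAMES, start) - 1
    return (KEYFRAMES[i], end)

# Request 5 s at 24m16s (1456 s), then 5 s at 24m21s (1461 s):
first = served_range(1456, 1461)   # -> (1404, 1461)
second = served_range(1461, 1466)  # -> (1404, 1466)
# The responses overlap on 1404-1461, so naive concatenation duplicates data.
```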
A cache can’t do as well here unless it understands enough of the file format to re-merge the temporal ranges and serve new temporal Range requests out of that.
If the server is just returning subsets of a static file though, what about answering a Range request in time units with a Content-Range in byte units? Then caches could still do something useful. Unfortunately I don’t see how this can work either. The problem is again that the server has to “round” the Range request to include a safe restart point for the decoder. If the client gets back a byte range set instead, then it knows how to reassemble the file, but it no longer knows what temporal range it’s received. In some cases it might be able to work out where the requested range lies based on internal timestamps, but that isn’t possible in general. There may be no absolute timestamps, or the requested interval may need to be encoded in the stream itself, in which case the returned result isn’t a simple byte Range request result anyway.
I’d love to hear if there’s a way around these issues. As far as I got it seemed like this is a format-specific extension, which didn’t justify a new Range unit. I’m having trouble seeing how this is better than just using a query uri like in the annodex proposal.
Oh, one other point.
The examples above all reference fragments only to the nearest second. That’s insufficient accuracy. If I want to pick out a particular line in a song, or the start of a scene in a movie, one-second resolution means there would be a few extra words, or a few frames of the previous scene, at the beginning – quite distracting.
For video, something like SMPTE timecodes, which can refer to frames or milliseconds, would be good enough. I don’t know if milliseconds would be good enough for audio people. Certainly, for other time-based data I can imagine wanting microseconds for something like network event logging, which means nanoseconds wouldn’t be a bad option.
I share your concerns about a time range request and the possibilities of getting it accurate. We’ve had a massive amount of discussion in the working group on exactly this topic. There are currently two approaches under discussion.
The first one is similar to the temporal URI specification that we in the Annodex community proposed a long time ago: http://annodex.net/TR/draft-pfeiffer-temporal-fragments-03.html#anchor8. It goes along the lines of having two roundtrips, where the first one tells the server which time range the client is looking for and it returns the right byte range for that, so in the second round trip, the client just does a byte range request for the correct data.
In the working group we are however trying to avoid more than one roundtrip to get the first data. Thus, an alternative proposal is to create time range requests, where the client asks the server for a time range, and the server replies with the data and the actual time range it is providing. Then the client can decode the data and play back only the actual time range that was requested.
The concatenation problem that you are referring to in your comment is then solved through the adapted time ranges that the server replies with. It will only have a finite set of actual time points for which it can return data (start/end). This is because compressed codecs divide time up into packets. The expectation is that because of this “discretisation” of time, concatenation will work.
Using your example: the first request for 5 seconds at 24m16s would return 23m24s-24m21s, the second request for five more at 24m21s would give you 24m21s-24m26s, which you can safely concatenate.
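Here is a toy sketch of that idea (the boundary positions are invented to match the numbers above): since the server snaps both endpoints of every request to the same fixed grid of decodable time points, the end of one adapted range coincides with the start of the next:

```python
import bisect

# Hypothetical boundaries (seconds) at which the server can actually
# start or stop returning data; values invented for the example.
BOUNDARIES = [0, 702, 1404, 1461, 1466, 1480]

def adapted_range(start, end):
    """Snap a requested time range outward to the nearest boundaries and
    report the actual range served, as in the single-roundtrip proposal."""
    lo = BOUNDARIES[bisect.bisect_right(BOUNDARIES, start) - 1]
    hi = BOUNDARIES[bisect.bisect_left(BOUNDARIES, end)]
    return (lo, hi)

# 5 s at 24m16s, then 5 s at 24m21s:
first = adapted_range(1456, 1461)   # -> (1404, 1461)
second = adapted_range(1461, 1466)  # -> (1461, 1466)
# first ends exactly where second starts, so the data can be concatenated.
```

Because the snapping is deterministic, two clients (or a proxy) asking for adjacent intervals always meet at the same boundary.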
Do you think that can work?
Also, re accuracy of time resolution: thanks for the input – point taken.
So the idea is to chop the file up into byte ranges, but report the corresponding time range with the returned piece? In this scheme the boundaries would start with a video keyframe and, presumably, go to the next?
A number of points remain unclear to me. How do I get the headers so I can decode the stream? Do I first do a magic Range: time -0m0s request to get those? What about Apple .caf where the metadata can be at the end? What if it’s a chained Ogg stream, how do I get the headers for the current segment if they’re not at the start of the timeline?
Also, this cannot be sample-accurate: the preroll required by the Vorbis codec, or the overlap of open-GOP structures in Dirac and MPEG video, means I won’t be able to start or finish decoding exactly at the time-segment boundaries, although complete playback is possible after adjacent segments are concatenated. Is that acceptable?
What about files with a large mux granularity, like MXF or (badly constructed) avi? What do I do if the smallest byte range I can return which contains the requested temporal range is the whole file?
For these reasons, I thought it made more sense to accept the overhead of always sending a decodable stream. This is just what ogg-chop currently does, by pre/post-pending the required extra headers and data. By marking the target temporal segment in the skeleton header it’s possible to concatenate these things and have playback be identical to the original timeline. But the resource so constructed is not the same stream as the original resource.
For formats without Ogg’s legal concatenation, I’d have to know how to merge adjacent requests. More likely, I’d just make a playlist with all the pieces. That’s fairly simple to implement at least; I’d like to compare that to any proposals that would let me reconstruct a standalone file by byte range concatenation.
Re: file chopping
The file chopping is of course only virtual – there aren’t different pieces of content on the server. But otherwise, yes, that’s how it works.
That is a very good question and hasn’t been resolved yet. There may be a special additional HTTP request for just the headers that relate to the requested time range, for file formats that require it. As for chained Ogg files – these are really multiple files, and I am not sure how to handle them properly, because in theory time also resets at the segment boundaries. What do you suggest?
The idea is to return as much data as is necessary to decode the requested time range properly. If that requires more packets from before and after the current time range, then that’s the way it goes. It should definitely be sample-accurate.
Re: mux granularity
If the files were not created for streaming in the first place, then you cannot return byte ranges but have to return the full file. That’s just how it goes.
Re: decodable stream
That is also a very good question that we have not fully explored yet. The answer depends on the use cases that we want to support. If it’s just about buffering strategies inside media players, then it doesn’t need to be a fully decodable stream. If it’s about sharing and reusing the media fragment, then it should be a fully decodable stream.
In Annodex, we assumed the need for fully decodable stream, which is why oggzchop and mod_annodex do it in this way. But the current Firefox plugin deals with byte range requests without bothering to get a corrected header – the data is for internal buffering purposes only. If as mentioned above the header data is requested in a separate roundtrip, then it could be left up to the user agent to request it if needed, or not to. The reconstructed stream should in any case be data-equivalent, but not necessarily be header-equivalent to the original resource.
Re: byte range reconstruction
Byte range reconstruction was only discussed as part of the requirements for Web proxies. It is indeed planned not to allow specification of more than one time range in a media fragment URI (any subsequent ones will be ignored). If the user agent requires several temporal fragments, it should indeed use playlists.
Re: file chopping, virtual
Ok good, that’s what I was thinking of too.
I don’t think of chain segments as separate files. Rather, there are two clocks: one for the complete stream, and one within each segment. I would suggest that the units in the range request measure time relative to the start of the resource and cover everything available in that resource, regardless of how it is constructed – and that the server do the time-shifting regardless, since, at least for Ogg files, the internal timestamps don’t necessarily start anywhere near zero. Conceptually the same addressing should apply equally well to a playlist, e.g. http://example.com/latest.xspf#t=1h23m16.12s. Of course, I don’t know how you’d implement that today with cross-faded mp3 playback.
Re: sample-accurate returns, this is my point. If you return byte ranges which can be concatenated to duplicate the origin’s resource, the time ranges are not contiguous. If the time ranges are contiguous, the byte ranges overlap. I don’t see a way to resolve this, though I would be happy if it is possible.
Given the potential of this proposed effort, I’m surprised to see so few comments.
Anyway, I’m not a media specialist, but wanted to point out to your readers (it’s probably obvious to you already) that this notion of using URI fragments and/or HTTP Range headers to address portions of a resource will have implications outside the media realm.
If ‘smart’ browsers start converting well-defined URI fragments into alternate-units Range headers (and ‘smart’ proxies start caching accordingly) then the gate is opened for this same technique to be used for addressing sub-portions of JSON or XML resources in web services.
Here’s one proposal that uses URI fragments and Range/Content-Range/etc. (albeit separately) in JSON resources:
Best regards to you and the WG.