adaptive HTTP streaming for open codecs

October 9, 2010 | Categories: Digital Media, FOMS, open codecs, standards | Tags: adaptive HTTP streaming, bitrate switching, Firefox, FOMS, HTML5 video, media fragments URI, Ogg Theora/Vorbis, open codecs, video element, webm | Author: silvia

At this week’s FOMS in New York we had one over-arching topic that seemed to be of interest to every single participant: how to do adaptive bitrate streaming over HTTP for open codecs. On the first day, there was a general discussion about the advantages and disadvantages of adaptive HTTP streaming, while on the second day, we moved towards designing a solution for Ogg and WebM. While I didn’t attend all the discussions, I want to summarize in this blog post the insights that I took away from those days and the alternative implementation strategies that we came up with.

Use Cases for Adaptive HTTP Streaming

RTP/RTSP has in the past been the main protocol for providing live video streams, either for broadcast or for real-time communication. It has been purpose-built for chunked video delivery and has features that many customers want, such as the ability to encrypt the stream, to tell players not to store the data, and to monitor the performance of the stream such that its bandwidth can be adapted. It also has many disadvantages, however: it goes over ports that normal firewalls block and is thus rather difficult to deploy, it requires special server software and a client that speaks the protocol, and it has a signalling overhead on the transport layer for adapting the stream.

RTP/RTSP was invented to allow for high quality of service video consumption. In the last 10 years, however, it has become the norm to consume “canned” video (i.e. non-live video) over HTTP, making use of the byte-range request functionality of HTTP for seeking. Methods have been created to estimate, from the bandwidth of your pipe at the beginning of downloading, how large a pre-buffer is needed before starting playback in order to achieve continuous playback. However, not much can be done when one runs out of pre-buffer in the middle of playback or when the CPU on the machine can’t keep up with decoding the sheer amount of video data: playback stops to go into re-buffering in the first case and becomes choppy in the latter case.

An obvious approach to improving this situation is to scale the bandwidth of the video stream down, potentially even switching to a lower resolution video, right in the middle of playback. Apple’s HTTP Live Streaming, Microsoft’s Smooth Streaming, and Adobe’s Dynamic Streaming are all solutions in this space. Also, ISO/MPEG is working on DASH (Dynamic Adaptive Streaming over HTTP), an effort to standardize the approach for MPEG media. No solutions yet exist for the open formats within Ogg or WebM containers.

Some features of HTTP adaptive streaming are:

Enables adaptation of downloading to avoid continual re-buffering when the network or machine cannot cope.
Gapless switching between streams of different bitrate.
No special server software is required – any existing Web Server can be used to provide the streams.
The adaptation comes from the media player that actually knows what quality the user experiences rather than the network layer that knows nothing about the performance of the computer, and can only tell about the performance of the network.
Adaptation means that several versions of different bandwidth are made available on the server and the client switches between them based on knowledge it has about the video quality that the user experiences.
Bandwidth is not wasted by downloading video data that is not being consumed by the user; rather, content is pulled just moments before it is required, which works both for the live and canned content case and is particularly useful for long-form content.
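
To make the client-driven adaptation above concrete, here is a minimal sketch in Python of how a player might pick among the bitrate versions on offer. The function name, the headroom factor and the example bitrates are assumptions for illustration only; nothing like this was specified at FOMS.

```python
def choose_version(bitrates_kbps, measured_kbps, headroom=0.8):
    """Pick the highest bitrate that fits within a fraction of the measured throughput.

    headroom is a hypothetical safety factor: only this share of the
    measured bandwidth is assumed to be reliably available.
    """
    usable = measured_kbps * headroom
    candidates = [b for b in sorted(bitrates_kbps) if b <= usable]
    # If even the lowest version does not fit, fall back to it anyway.
    return candidates[-1] if candidates else min(bitrates_kbps)


versions = [250, 500, 1000, 2000]      # hypothetical encodes, in kbit/s
print(choose_version(versions, 1500))  # 1000
print(choose_version(versions, 200))   # 250
```

A real player would also feed decoding statistics (e.g. dropped frames) into this decision, not just network throughput.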

Viability

In discussions at FOMS it was determined that mid-stream switching between different bitrate encoded audio files is possible. Just looking at the PCM domain, it requires stitching the waveform together at the switch-over point, but that is not a complex function. To be able to do that stitching with Vorbis-encoded files, there is no need for an overlap of data, because the encoded samples of the previous window in a different bitrate page can be used as input into the decoding of the current bitrate page, as long as the resulting PCM samples are stitched.

For video, mid-stream switching to a different bitrate encoded stream is also acceptable, as long as the switch-over point adheres to a keyframe, which can be independently decoded.

Thus, the preparation of the alternative bitstream videos requires temporal synchronisation of keyframes on video – the audio can deal with the switch-over at any point. A bit of intelligent encoding is thus necessary – requiring the encoding pipeline to provide regular keyframes at a certain rate would be sufficient. Then, the switch-over points are the keyframes.
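
As a rough illustration of the keyframe synchronisation requirement, the following Python sketch computes the timestamps at which every bitrate version has a keyframe; only those are safe switch-over points. The keyframe lists are made-up data and the function name is hypothetical.

```python
def common_switch_points(keyframes_by_version):
    """Return the sorted timestamps (ms) at which every version has a keyframe."""
    sets = [set(times) for times in keyframes_by_version.values()]
    return sorted(set.intersection(*sets))


# Hypothetical keyframe timestamps in milliseconds for three encodes.
keyframes = {
    "low": [0, 2000, 4000, 6000, 8000],
    "mid": [0, 2000, 4000, 6000, 8000],
    "high": [0, 4000, 8000],
}
print(common_switch_points(keyframes))  # [0, 4000, 8000]
```

If the encoding pipeline forces keyframes at the same fixed interval in every version, this intersection is simply the full keyframe list.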

Technical Realisation

With the solutions from Adobe, Microsoft and Apple, the technology has been created such that there are special tools on the server that prepare the content for adaptive HTTP streaming and provide a manifest of the prepared content. Typically, the content is encoded in versions of different bitrates and the bandwidth versions are broken into chunks that can be decoded independently. These chunks are synchronised between the different bitrate versions such that there are defined switch-over points. The switch-over points as well as the file names of the different chunks are documented inside a manifest file. It is this manifest file that the player downloads instead of the resource at the beginning of streaming. This manifest file informs the player of the available resources and enables it to orchestrate the correct URL requests to the server as it progresses through the resource.
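
A minimal sketch of what such a manifest lets a player do, written in Python. The class, the fixed chunk duration and the file naming scheme are all assumptions for illustration; the real manifests from Apple, Microsoft and Adobe each have their own formats.

```python
from dataclasses import dataclass


@dataclass
class Manifest:
    """Hypothetical in-memory manifest: fixed-length chunks per bitrate version."""
    chunk_seconds: int
    url_templates: dict  # bitrate in kbit/s -> URL template containing {index}

    def chunk_url(self, bitrate_kbps, position_seconds):
        # Chunks are synchronised across versions, so the chunk index
        # depends only on the playback position, not on the bitrate.
        index = int(position_seconds // self.chunk_seconds)
        return self.url_templates[bitrate_kbps].format(index=index)


manifest = Manifest(
    chunk_seconds=10,
    url_templates={
        500: "video_500k_{index:04d}.ogv",    # made-up file names
        1000: "video_1000k_{index:04d}.ogv",
    },
)
print(manifest.chunk_url(500, 0))    # video_500k_0000.ogv
print(manifest.chunk_url(1000, 25))  # video_1000k_0002.ogv
```

Because the index is shared, switching bitrate at a chunk boundary is just a matter of asking for the next index from a different template.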

At FOMS, we took a step back from this approach and analysed what the general possibilities are for solving adaptive HTTP streaming. For example, it would be possible to not chunk the original media data, but instead perform range requests on the different bitrate versions of the resource. The following options were identified.

Chunking

With Chunking, the original bitrate versions are chunked into smaller full resources with defined switch-over points. This implies creation of a header on each one of the chunks and thus introduces overhead. Assuming we use 10 sec chunks and 6 kBytes of header per chunk, that results in roughly 5 kBit/sec of extra overhead. After chunking the files this way, we provide a manifest file (similar to Apple’s m3u8 file, or the SMIL-based manifest file of Microsoft, or Adobe’s Flash Media Manifest file). The manifest file informs the client about the chunks and the switch-over points, and the client requests those different resources at the switch-over points.
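
The overhead figure can be checked with a line of arithmetic. The Python sketch below reads the 6 kBytes as the size of the header repeated in each chunk, which is an interpretation of the numbers above; the function name is hypothetical.

```python
def header_overhead_kbit_per_s(header_kbytes, chunk_seconds):
    """Extra bitrate caused by repeating a header of the given size in every chunk."""
    return header_kbytes * 8 / chunk_seconds  # kBytes -> kbit, spread over the chunk


print(header_overhead_kbit_per_s(6, 10))  # 4.8
```

So the overhead is 4.8 kBit/sec, i.e. about 5 kBit/sec. Halving the chunk duration to 5 sec would double it.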

Disadvantages:

Header overhead on the pipe.
Switch-over delay for decoding the header.
Possible problem with TCP slowstart on new files.
A piece of software is necessary on server to prepare the chunked files.
A large amount of files to manage on the server.
The client has to hide the switching between full resources.

Advantages:

Works for live streams, where increasing amounts of chunks are written.
Works well with CDNs, because mid-stream switching to another server is easy.
Chunks can be encoded such that there is no overlap in the data necessary on switch-over.
May work well with Web sockets.
Follows the way in which proprietary solutions are doing it, so may be easy to adopt.
If the chunks are concatenated on the client, you get chained Ogg files (similar concept in WebM?), which are planned to be supported by Web browsers and are thus legal files.

Chained Chunks

Alternatively to creating the large number of files, one could also just create the chained files. Then, the switch-over is not between different files, but between different byte ranges. The headers still have to be read and parsed. And a manifest file still has to exist, but it now points to byte ranges rather than different resources.
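
As a sketch of what a byte-range manifest could look like on the client side, the Python below maps a bitrate and segment number to an HTTP Range header. The segment table is made-up data and the function names are hypothetical; only the `bytes=first-last` header syntax (with an inclusive last byte) is standard HTTP.

```python
def range_header(first_byte, last_byte):
    """Build an HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}


# Hypothetical manifest content: for each bitrate version (kbit/s), the
# inclusive byte range of every chained segment inside its single file.
segments = {
    500: [(0, 624_999), (625_000, 1_249_999)],
    1000: [(0, 1_249_999), (1_250_000, 2_499_999)],
}


def request_for(bitrate_kbps, segment_index):
    first_byte, last_byte = segments[bitrate_kbps][segment_index]
    return range_header(first_byte, last_byte)


print(request_for(1000, 1))  # {'Range': 'bytes=1250000-2499999'}
```

Each requested range still begins with its own header, so the decoder is re-initialised on every switch just as with separate chunk files.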

Advantages over Chunking:

No TCP-slowstart problem.
No large number of files on the server.

Disadvantages over Chunking:

Mid-stream switching to other servers is not easily possible – CDNs won’t like it.
Doesn’t work with Web sockets as easily.
New approach that vendors will have to grapple with.

Virtual Chunks

Since in Chained Chunks we are already doing byte-range requests, it is a short step towards simply dropping the repeating headers and just downloading them once at the beginning for all possible bitrate files. Then, as we seek to different positions in “the” file, the byte range of the bitrate version that makes sense to retrieve at that stage would be requested. This could even be done with media fragment URIs, though addressing with time ranges is less accurate than explicit byte ranges.

In contrast to the previous two options, this basically requires keeping n different decoding pipelines alive – one for every bitrate version. Then, the byte ranges of the chunks will be interpreted by the appropriate pipeline. The manifest now points to keyframes as switch-over points.
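
To illustrate a manifest that points to keyframes, here is a Python sketch that looks up the byte range of the virtual chunk containing a given playback time. The index layout (a sorted list of keyframe time and byte offset pairs per bitrate version) and all numbers are assumptions, not a format agreed at FOMS.

```python
from bisect import bisect_right


def byte_range_at(index, time_ms, file_size):
    """Return the inclusive (first, last) byte range of the virtual chunk at time_ms.

    index is a list of (keyframe_time_ms, byte_offset) pairs sorted by time;
    a virtual chunk runs from one keyframe to just before the next one.
    """
    times = [t for t, _ in index]
    i = bisect_right(times, time_ms) - 1
    if i < 0:
        raise ValueError("time lies before the first keyframe")
    first = index[i][1]
    # The last chunk extends to the end of the file.
    last = index[i + 1][1] - 1 if i + 1 < len(index) else file_size - 1
    return first, last


# Hypothetical keyframe index of one bitrate version; data starts after a 4000-byte header.
index = [(0, 4000), (2000, 260000), (4000, 515000)]
print(byte_range_at(index, 2500, 800000))  # (260000, 514999)
print(byte_range_at(index, 4000, 800000))  # (515000, 799999)
```

The player would hold one such index per bitrate version and use the same playback time to look up the range in whichever version it wants next.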

Advantages over Chained Chunking:

No header overhead.
No continuous re-initialisation of decoding pipelines.

Disadvantages over Chained Chunking:

Multiple decoding pipelines need to be maintained and byte ranges managed for each.

Unchunked Byte Ranges

We can even consider going all the way and not preparing the alternative bitrate resources for switching, i.e. not making sure that the keyframes align. This will then require the player to do the switching itself: determine when the next keyframe comes up in its current stream, then seek to that position in the next stream, always making sure to go back to the last keyframe before that position and discard all data until it arrives at the same offset.
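
The switching procedure just described can be sketched in Python as follows. The function name and the keyframe lists are hypothetical, and a real player would work on decoded data rather than timestamps alone.

```python
from bisect import bisect_right


def plan_switch(current_keyframes_ms, target_keyframes_ms, now_ms):
    """Plan a switch between two streams whose keyframes are not aligned.

    Returns (switch_ms, fetch_from_ms, discard_ms), or None if no switch
    is possible. Both keyframe lists must be sorted.
    """
    # Next keyframe in the current stream after the playback position.
    i = bisect_right(current_keyframes_ms, now_ms)
    if i == len(current_keyframes_ms):
        return None
    switch_ms = current_keyframes_ms[i]
    # Last keyframe in the target stream at or before the switch point.
    j = bisect_right(target_keyframes_ms, switch_ms) - 1
    if j < 0:
        return None
    fetch_from_ms = target_keyframes_ms[j]
    # Everything decoded between fetch_from_ms and switch_ms is thrown away.
    return switch_ms, fetch_from_ms, switch_ms - fetch_from_ms


current = [0, 3000, 6000, 9000]  # made-up keyframe times (ms)
target = [0, 2500, 5000, 7500]
print(plan_switch(current, target, 3500))  # (6000, 5000, 1000)
```

In this example one second of the target stream is downloaded and decoded only to be discarded, which is the overlap cost of this option.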

Disadvantages:

There will be an overlap in the timeline for download, which has to be managed from the buffering and alignment POV.
Overlap poses a challenge of downloading more data than necessary at exactly the time where one doesn’t have bandwidth to spare.
Requires seeking.
Messy.

Advantages:

No special authoring of resources on the server is needed.
Requires a very simple manifest file only with a list of alternative bitrate files.

Final concerns

At FOMS we weren’t able to make a final decision on how to achieve adaptive HTTP streaming for open codecs. Most agreed that moving forward with the first case would be the right thing to do, but the sheer number of files that it can create is daunting and it would be nice to avoid that for users.

Other goals are to make it work in stand-alone players, which means they will need to support loading the manifest file. And finally we want to enable experimentation in the browser through JavaScript implementation, which means there needs to be an interface to provide the quality of decoding to JavaScript. Fortunately, a proposal for such a statistics API already exists. The number of received frames, the number of dropped frames, and the size of the video are the most important statistics required.
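
As a sketch of how the statistics mentioned above might drive a switching decision, here is a small Python stand-in for what would be JavaScript in the browser. The function name and the 10% threshold are assumptions, and the arguments do not correspond to the property names of the proposed statistics API.

```python
def should_switch_down(received_frames, dropped_frames, max_drop_ratio=0.1):
    """Suggest a lower bitrate when too many received frames are being dropped.

    max_drop_ratio is a hypothetical threshold, not a value from the proposal.
    """
    if received_frames == 0:
        return False  # nothing measured yet
    return dropped_frames / received_frames > max_drop_ratio


print(should_switch_down(300, 45))  # True  (15% dropped)
print(should_switch_down(300, 6))   # False (2% dropped)
```

The size of the video would come into play when choosing which lower version to switch to, for example by preferring an encode whose resolution matches the displayed size.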

23 thoughts on “adaptive HTTP streaming for open codecs”

Pierre-Yves Kerembellec says:

October 10, 2010 at 12:08 am

Very interesting indeed Silvia.

We (at Dailymotion) are taking the 3rd approach (with a new scalable chunk delivery server, which I hope we will open-source in the future). This saves us from remuxing millions of videos (in F4F, or any other chunked MP4 container), because the server will generate virtual chunks from the full original files.

Plus, we keep the bytes-ranges requests capability on all content.

This chunked delivery mechanism has definitely a huge potential in terms of cache-ability, firewalls live streaming acceptance, and potentially P2P distribution (where any client may become a chunk server itself, all clients being coordinated through a global low-overhead tracker).

Keep up the good work!

Pierre-Yves
Marc-Andre Lureau says:

October 10, 2010 at 3:30 am

You should also mention the MPEG DASH effort which has already reached the prototype level in VLC: http://www.youtube.com/watch?v=Yt1F0ULsA1w. Work is ongoing, and we can follow it in the public DASH mailing list: http://lists.uni-klu.ac.at/mailman/listinfo/dash
Michael Dale says:

October 10, 2010 at 5:32 am

I recommend the 3rd option as well with fallback to the 4th option. Ie adaptive streaming should be as simple as listing out multiple bit rate / codec versions in the video tag child source elements.

It should support a javascript api as to give the host control over the stream selection.

If the user does not encode with aligned keyframes then it should fall back to non-aligned ranged requests. The possibility of duplicate data in cases where people don’t encode with aligned keyframes is a small price to pay for simplicity of usage.
silvia says:

October 10, 2010 at 7:19 am

@Marc-Andre is the spec for DASH somewhere? Which approach is it following? Is it patent-encumbered? Would it apply to non-MPEG content?
Jeremy says:

October 10, 2010 at 10:10 am

The main hindrance to RTP/RTSP is not the presence of firewalls — it is the presence of NAT. Personally I believe RTP/RTSP does not deserve to be replaced by a technology that is essentially a hack in comparison.

The real fix is to adopt IPv6 and retain the push perspective. As I write this, I am watching a NASA TV feed as an RTP stream being pushed (not pulled) directly to my box over IPv6.
silvia says:

October 10, 2010 at 2:11 pm

@Jeremy to be fair, RTP/RTSP only has advantages for live streaming – everywhere else, HTTP has replaced it for delivering video content. The bitrate adaptation is thus not really a hack to replace RTP/RTSP, but to improve the performance of the most widely deployed video delivery technology.
Jeremy says:

October 10, 2010 at 4:31 pm

“RTP/RTSP only has advantages for live streaming”

Hang on, I thought we were talking about live streaming. Hence why the standard is called HTTP Live Streaming.

And even if we weren’t talking about live streaming, then HTTP Live Streaming is a little overkill for video-on-demand.
silvia says:

October 10, 2010 at 8:35 pm

@Jeremy “live” streaming is just a term that Apple picked for their adaptive HTTP streaming approach. We are talking about bitrate adaptation much more than we are talking about live-only video transmission. And: no, adaptive HTTP streaming is not actually overkill for video-on-demand – it is in fact in many ways much more lightweight, as discussed in the article.

Adaptive HTTP streaming is just not useful for real-time communication, such as video conferencing, where RTP/RTSP is still the going standard.
Jeremy says:

October 10, 2010 at 10:39 pm

Now with that clarification I guess I can agree on those fronts. I’m glad that we concur that RTP is still the best for videoconferencing. 🙂

As for adaptive bitrates on video-on-demand, I guess there are two camps of people: those that want to watch it now (i.e. on demand on demand) with whatever quality their crappy connection can sustain, and those that wouldn’t mind waiting one or two minutes to buffer a little longer in order to get a high quality picture all the way through (such as me).

Are those two camps of people better served the KDE style by giving them the option of switching between HTTP Live Streaming and plain jane radio-button HTTP, or should it be GNOME style where one will eventually win over the other?
JeroenW says:

October 10, 2010 at 11:43 pm

Some additional disadvantages of option 4:

*) There’s a “slow start” issue, in that the full keyframe index of all bitrate levels needs to be loaded in advance in order to determine possible switching points. For example, if a UA on level 1 wants to switch to level 3, it already needs to know the byterange to use for a chunk with the right timing. For long videos, the index of each level may be 1MB+.

*) It seems like this solution won’t work for live (dvr) streaming, since there’s simply no communication mechanism available. In all other cases, the UA is kept up-to-date by the server through the manifest file.

In JW Player, we do support bitrate switching using the 4th option. After some testing we made the choice to not attempt “seamless” switching (hard to get done with Flash or HTML5 scripting). Instead, we switch (based upon bandwidth / droppedframes / screen width) at moments where smooth playback is interrupted anyway: on startup, on seek and on fullscreen switching. I believe this would be the proper fallback method for HTML5 video as well. It’s basically the range-request seeking functionality that is already available in browsers, but then with the awareness of multiple quality levels.
Ralph Giles says:

October 11, 2010 at 3:14 am

I think your introduction is misleading in that we were talking about adaptive streaming, not http streaming in general, and I can see why Jeremy was wondering about RTP.

The Ogg container has supported HTTP streaming with free codecs, even of live sources, since the beginning; this was a major design goal of the format. Matroska also supports streaming live sources, and so WebM can too if the browsers choose to support it. The ‘chunking’ approach is used in part for other containers which can’t be streamed continuously because they rely on an index table, and to deal with server infrastructure which can’t stream a file while it is still being appended to.

For adaptive streaming, where the client switches between different encodings to match available network and playback resources, we just need a way to signal the alternatives. Of course all this can be made seamless with a smart server, however the chunking mechanism does have a certain simplicity, especially in interacting with “dumb” servers and mirror systems. I expect both the chunking and the byte-range methods will get implemented; they both have advantages and they both dovetail with other features (playlists and seeking) which are valuable in and of themselves.
silvia says:

October 11, 2010 at 7:10 am

@Ralph you are so right.

I have changed the article’s introduction and also incorporated DASH.

Incidentally: thanks to everyone for all the really useful comments!
Ralph Giles says:

October 11, 2010 at 2:20 pm

BTW, for those (like me) who’d never heard of the “chunked” approach before, the current draft is here for Apple’s Live HTTP streaming, which we used as a model during the discussion.

The draft uses an .m3u playlist to describe the segments (chunks) to the media player, a sequence of urls just like a normal playlist. In the case of live streams, special comments flag how often the player should re-request the m3u to get information about new segments. The server can also remove older urls from the playlist as it adds new ones to limit the number of segments available at any given time. The window can be as small as three chunks, but if many are listed, the player can just pick a point and start playback there, allow seeking within the current range and so on.

The alternative encodes are signaled using a meta-m3u whose entries are themselves chunked stream m3u files. In this case, each entry is prefixed with a special comment describing the encoding parameters: bitrate, codecs, resolution, and programme ids so multiple channels can be described simultaneously. The player then selects a range of m3u playlists it thinks will be relevant and maintains up-to-date copies of each of them. In this way it can choose to download the next segment it needs from a lower or higher quality playlist than the one currently active.

The draft specifies that the media files must be mpeg-ts streams, but otherwise, everything there would work just fine for other containers, so this is a reasonable approach for the signalling format. For non-chunked streams (adaptation with byte-range requests) the meta-m3u could just list the media uri’s directly.
JeroenW says:

October 14, 2010 at 12:30 am

Note there is a 2b (or 3b) approach as well:

* The server stores the original files.
* These files have synced byteranges.
* The server serves virtual chunks to the client.
* However, neither the whole index is pushed in advance, nor the files are stored chunked-chained on the server. Instead, the server injects header data (some EBML in the case of WebM) on the fly to turn a chunk into a valid file.
* From the user-agent point of view, there’s no difference between this approach and approach 1. Hence, no multiple-scenario logic needs building.

This approach is basically the “smart server” variation of option 1, resolving its serverside file management and required transcoding/chunking issues. It introduces a drawback: the reliance on a serverside module.

For Apple’s HTTP streaming formats, various smart server modules that do exactly this (using MP4 input files) are already available (e.g. Code-shop.com or Wowzamedia.com).

This approach (injecting header in range from original on the fly) is also the de facto standard for “pseudostreaming” to Flash – used in thousands of sites (e.g. Youtube) and offered by all CDNs. Example:

http://content.bitsontherun.com/videos/nPripu9l.mp4
http://content.bitsontherun.com/videos/nPripu9l.mp4?starttime=15&endtime=20
Pierre-Yves Kerembellec says:

October 14, 2010 at 5:04 pm

@JeroenW This is the approach we had for our video delivery system: on-the-fly re-synthesis of FLV, MP4 and OGG headers, so each time/bytes-range fragment looks like an independent and valid file (with a little additional information to help the player fit the fragment in the context of the whole video clip, i.e. proper seekbar display).

As for the initial “hit” due to huge indexes on long videos (especially for containers structured this way, like MP4), we decided to re-mux the A/V samples into pure streamable containers, depending on the client capabilities. For instance, H264 and AAC samples from an MP4 get remuxed on the fly to FLV for Flash clients, and to MPEG2-TS for Apple “iDevice” clients (HTTP Live Streaming). This removes the initial index transfer and parsing time entirely.
Philip Jägenstedt says:

October 14, 2010 at 7:59 pm

Typo: VP8 in “Ogg or VP8 containers” should have been WebM, right?
kl says:

October 14, 2010 at 8:13 pm

As an author, I really really want “Unchunked Byte Ranges” option. This could simply be done with multiple source elements for the video tag!
1. silvia says:
  
  October 14, 2010 at 9:10 pm
  
  @kl Unfortunately no, you cannot do this through multiple source elements – they have a completely different meaning. The way in which they work is that the first appropriate resource is chosen from top to bottom and then the resource selection stops. This algorithm is complex enough as it is and really cannot cope with several bitrate-alternative resources, when <source> is meant to distinguish between codec-alternatives.
Steve Mortman says:

October 14, 2010 at 8:16 pm

Those of you who live in countries with software patents should be aware of the usual difficulties
http://www.movenetworks.com/press/2010-09-15.html
silvia says:

October 14, 2010 at 9:11 pm

@philip correct – fixed. 🙂
David R says:

October 27, 2010 at 7:13 am

I have been involved in the development of adaptive video streaming (AS) on a wide range of platforms including PC/MAC, CE devices, and even BD-Live. I have a few thoughts:

1) Virtual chunking is the best implementation model. Chunked files (also known as the “billion file” model) have scalability and asset management issues. There are CDN services that use a single file and serve chunks to clients, but this presents a cost/scalability issue. An HTTP range request on a single file can be supported by any CDN, and is the cheapest and most scalable way to deliver video bytes. In addition, virtual chunking allows smaller chunks. Chunk size correlates to start-up time and re-buffer rates.

2) Alignment of video across chunks is important to provide simple seamless bitrate switches, and should be a requirement. My rule is to never impose complexity costs on the client (or servers, CDN’s, etc.), that can be easily handled once at the encode step. For a given video source, providing n bitrate encodes that are properly aligned is a fairly simple and highly scalable task. For a popular streaming application that we delivered recently, we encoded > 30,000 video sources (average view time of 1 hour) at multiple bitrates in < 60 days.

In the original blog, a disadvantage for virtual chunking is “Multiple decoding pipelines need to be maintained and byte ranges managed for each”. If each video stream has aligned chunks, and has a sequence start per chunk, then only one decoding pipeline is needed.

3) It is best not to switch audio except at rebuffers or seeks.

4) Given #3, demuxed A/V streams are preferred. This also allows for alternate audio tracks.

5) IMO 'HTTP Live Streaming' (which uses chunked muxed AV in M2TS container), while useful for live content, is very troublesome for streaming of content libraries (see 1-4 above).

Live streaming and streaming of library content have very different requirements, and if the objective is to have a great platform for streaming movies, then the design decisions should be different than if the objective is an adaptive video conferencing, or live broadcast system.

Cheers, David
Vittorio Palmisano says:

October 30, 2010 at 6:09 pm

Hi all,
we are experimenting with the virtual chunks approach (http://www.quavlive.com/streamingdemo) but the stream switching is performed by the webserver, so the client sees a continuous stream and it doesn’t need special additional features.
We have a demo website that supports both H.264 muxed into FLV format and VP8 muxed into WEBM format here: http://193.204.59.68:8024/

WEBM demo works well with Chrome Browser v. 6.0. With the latest version, there are some issues in video scaling, we have reported this to chromium developers.
Roberto says:

December 15, 2010 at 8:25 am

Question on Option 4 “Unchunked Byte Ranges”

Recently I have been tasked at work to test with a file format/adaptive technology that seems to follow this format in VoD terms. My findings are the following.

Once a byte-range is done and the index within the file wrapper picks the most appropriate byte-range(quality level), it then stays on that provided the network does not fluctuate and starts a massive progressive download.

Only when the network starts fluctuating badly will it have a chance to react, or in this case adapt. But guess when this happens: when the client cache has run out of data from the progressive download. This could be never, as the user may have downloaded the full file. Therefore, this is not adaptive at all; the adaptation only happens if the user has not got a lot of downloaded data. With ever increasing speeds, this will be a rare occurrence unless they are also downloading torrents or sharing a router with 3-4 other video users.

Secondly, will it be the case that option 4 will be progressive download after the initial byte-range connection? Again, with ever increasing bit rates, this is starting to become a really annoying option to CDN’s. We keep sending data to clients “ahead” of time, running the risk that they may never watch it. Therefore I vote for the chunked approach and just send the data to the client as they need it, which ultimately is the goal of STREAMING.

Any thoughts are very welcome.

Cheers

Roberto