Making your video discoverable

Videos will be everywhere on the web! Yes, cope with it: soon the majority of videos won’t be with some hosting site like youtube, but it will reside on our private servers, on company servers, actually on any and all web servers. And there will be interesting stuff, but it will be hard to find.

Yes, history will repeat itself again and finding those videos on the Web that satisfy our need – be it for information or entertainment – will be a nightmare. Why? Because google’s pagerank (and many other ranking algorithms) rely on Web pages pointing to the videos to give them a higher rank. However, the way in which videos are currently published is through embedding them into Web pages (let’s call such a page the “embedding page”). Thus, the link analysis will actually return the pagerank for the embedding page – but not for the video itself!

Now, if the embedding page can actually be seen as representative for the video because the only reason that the webpage exists is to publish the video and its annotations, then the pagerank for the embedding page is actually the same as the pagerank for the video. This is the case for google video and for youtube and for many other hosting sites.

However, you and I mostly publish our videos in blogs or on Web pages that describe more than just the video – some will even have several videos embedded. This is where the chaos for a Web search engine for videos begins. And this is where the discoverability of your videos through video search engines ends.

Here is the solution.

Just as we do with normal Web pages, we have to introduce SEO (search engine optimisation) for videos. That means, we have to make it easier for the search engines to find out information about our videos, i.e. to index and rank them.

Because videos are binary data, a common Web search engine cannot extract information about this Web resource directly from it (let’s ignore signal analysis and automatic content analysis approaches for the moment). We have to help the search engine.

The solution is to have a text file sitting “next” to the actual video file which contains indexable text about the video. It will have all the annotations, meta data, tags, copyright information and other textual meta information that search engines require to index and rank it better. This text file is an indexable textual representation of the video.

So, whenever a video search engine reaches a video in a crawl, it will check out this text file for its indexing work. If this text file is HTML, then people may link directly to it and it will be included in the pagerank calculations again. If it is a XML file, there should be a simple way to transcode it to HTML, e.g. via a xslt script, so links can go there directly again.

So much for the theory: here comes the practice.

For every video file (and incidentally it would work for audio, too), you should start writing a CMML file and publish it on your Web server together with the original. Here is a xslt script that you can use to transcode CMML to HTML. If you actually use Ogg Theora as your Video publishing format, you can even publish Annodex videos and make direct access to the clips that you defined in CMML and to time offsets possible by using the Apache Annodex module. Try using it in your blog with the external embedding of the Annodex Firefox extension.

When we’ve done this, all that remains is to encourage the video search engines to exploit the CMML data in their crawls. 🙂