All posts by silvia

1998: Automatic Trailer Production

Lienhart, R.; Pfeiffer, S.; Effelsberg, W. “Automatic Trailer Production”, in: B. Furht (Ed.), “Handbook of Multimedia Computing”, CRC Press, pp. 361-378, 1998.

Abstract

We present an algorithm for the automatic production of a video abstract of a feature film, similar to a movie trailer. It selects clips from the original video based on the detection of special events such as dialogs, shots, explosions and text occurrences, and on general action indicators applied to scenes. An editing model is used to assemble these clips into a movie trailer. Such movie abstracts are useful for multimedia archives, for movie marketing, or for home entertainment.

1998: The MoCA Project – Movie Content Analysis Research at the University of Mannheim

Pfeiffer, S.; Lienhart, R.; Kuehne, G.; Effelsberg, W. “The MoCA Project – Movie Content Analysis Research at the University of Mannheim” in: J. Dassow, R. Kruse (Eds.) “Informatik ’98: Informatik zwischen Bild und Sprache”, 28. Jahrestagung der Gesellschaft fuer Informatik, Magdeburg, 21.-25. September 1998, Springer Verlag, Berlin, Heidelberg, pp. 329-338, 1998.

Abstract

In 1994, an ambitious project in the multimedia domain was started at the University of Mannheim under the guidance of Prof. Dr. W. Effelsberg. We realized that multimedia applications using continuous media such as video and audio data absolutely require access to the semantic content of these media types, similar to what is available for textual and numerical data. Imagine the existence of a large digital collection of textual data such as books, articles, etc. without anyone being able to search it for pertinent keywords. Content analysis of continuous data, especially of video data, currently relies mainly on manual annotation. This reduces the searchable content to the annotated content, which usually does not contain the required information. The MoCA project therefore aims to extract the structural and semantic content of videos automatically.

2004: Automated Annodexing of Meeting Recordings

Pfeiffer, S.; Schremmer, C. “Automated Annodexing of Meeting Recordings”, in: R. Raieli and P. Innocenti (Eds.) “MultiMedia Information Retrieval – Metodologie ed esperienze internazionali di content-based retrieval per l’informazione e la documentazione”, AIDA Associazione Italiana per la Documentazione Avanzata, Rome, pp. 347-368, 2004.

1999: Information Retrieval aus digitalisierten Audiospuren von Filmen

Pfeiffer, S. “Information Retrieval aus digitalisierten Audiospuren von Filmen (Information Retrieval from Digitized Audio Tracks of Films)”, Dissertation, Mannheim University, published by Shaker Verlag, Aachen, March 1999.

Abstract
(Excerpts from the first chapter, “Motivation”)

In the age of digitization, ever more multimedia material is being created, in particular on the WWW and in multimedia databases. Since human information processing capacities are limited, people are completely overwhelmed by the task of capturing and processing the mass of available data and extracting interesting information from it. For textual data, relatively good mechanisms for structuring, indexing and searching already exist. But in addition to text, the flood of data increasingly includes digital image, audio and film material. Mechanisms for structuring, indexing, searching and “skimming” such data are therefore becoming ever more important.

Such problems are already acute at television broadcasters, where huge amounts of image and sound material sit in archives and are difficult to reuse. These data are annotated manually according to particular schemes in order to guarantee at least rudimentary access to their content. Creating such annotations is, however, extremely costly in comparison to the benefit gained from them. Moreover, not all of the information contained in image and sound can be taken into account when the annotations are created, since information that is important from one point of view may be completely unimportant from another.

Consider, as an example, footage of a reception for a German delegation in the former Soviet Union, in which a then still unknown official named Gerhard Schroeder took part. It is unlikely that a professional annotator knew the names of all participants, and even less likely that he entered them all into the system. As a consequence, an automatic extraction of all material about Gerhard Schroeder from the archive is impossible. One realizes that it is simply impossible to know, at the time an annotation is created, which aspects of the images and which audio excerpts contain information that will be needed in the future.

Similar problems arise in media studies research. If, for example, certain properties of films are to be examined, all films under investigation have to be screened by humans for the relevant features, even though the questions at hand may be relatively simple ones such as the cut rate or the proportion of music.

The MoCA (Movie Content Analysis) project at the University of Mannheim aims to investigate the possibilities of automatic analysis of image sequences and their accompanying audio tracks. Internationally, it is one of the few projects that analyses image and audio information in combination. Structures and content are determined by algorithms that extract defined features from individual frames, frame sequences and audio sequences. These can ultimately be used for applications of the kind described above.

This thesis sets itself the goal of developing methods for extracting information from the digitized audio tracks of films. This encompasses three general tasks: indexing, query formulation and retrieval.

Indexing aims to create information tracks that enable content-based access. To this end, the audio is analysed for particular features which, directly or in combination, contain the answers to user queries.

Besides indexing, the formulation of queries is an important task of information retrieval. Queries about the content of audio can be posed either via the keyboard or auditorily (i.e. via the microphone). Keyboard queries refer to perceivable properties of audio that humans can describe, e.g. “give me all passages with dark music”, “… with hectic and loud sound” or “… with silence”. Auditory queries, in contrast, are queries that do not involve a change of medium: such a query is itself a piece of audio. Examples are sung queries, recorded noises, or longer pieces of audio such as a commercial that is to be retrieved.

The final subtask of information retrieval is the determination of the excerpts that are to be returned to the user in response to a query. In the simplest case, these follow the query results strictly and cover exactly the passage that answers the query. With films, however, such passages are often mere fragments of content that lack context. Retrieval therefore requires the determination of a self-contained context for a query: a shot or a scene.

A different approach to conveying the content of a particular film to the user of a film database is to present a summary of the film, with which the user can quickly gain an overview. Traditional examples of this are cinema trailers announcing a new film, previews of series on television, or short overviews of the topics of a news broadcast.

This dissertation deals both with the three tasks of film information retrieval listed above and with the automatic creation of film summaries. It begins, however, with a detailed description of the interdisciplinary scientific background of audio analysis. The aim of this part is to summarize the foundations of computer-based processing of audio data, as well as expert knowledge about the perception of audio from psychological, medical and film-technical perspectives.

Following this, the development environment used in and implemented for this work is presented: the so-called MoCA workbench and an object-oriented class library named aulib++, within which the audio analysis algorithms are made available. Both serve the goal of easing the development of content analysis algorithms and their integration into applications.

The core of the thesis begins with a chapter that examines the formulation of auditory queries and the suitability of indices computed from such queries for retrieval. Subsequently, the indexing of audio via perceivable properties is described; these are better suited than general transform-based indices to capturing content aspects of audio. Finally, a method for automatically determining scenes as contextual units is examined. This method combines video and audio indicators, as do the systems for the automatic production of film summaries that are presented in a separate chapter.

The dissertation concludes with a summary of the experience gained in the course of this research.

1993: Communication protocols for a parallel computer

Silvia Pfeiffer “Codegenerierung aus Estelle-Spezifikationen fuer einen Parallelrechner (Code generation from Estelle specifications for a parallel computer)”, Master’s Thesis, Department of Mathematics and Computer Science, Mannheim University, Mannheim, Germany, 1993, 86 pages.

Publications

Having completed a PhD in computer science and worked for the Australian research organisation CSIRO for 7 years, I naturally have a lengthy list of publications. Here’s the start of a list.

Australian Startup Carnival

Vquence was today presented on the “Australian Startup Carnival” site – go, check it out.

There are 28 participants in the startup carnival, and each one of them is being introduced through an interview that was conducted electronically. The questions for this interview were rather varied and detailed. They covered technical and system backgrounds, as well as asking about our use of open source software.

All the questions you have always wanted to ask about Vquence, and a few more. 😉

UPDATE: The Startup Carnival has announced the prizes and they are amazing – first prize being an exhibition package at CeBIT. Good luck to us all!!

Vquence: Measuring Internet Video

I have been so busy with my work as CEO of Vquence since the end of last year that I’ve neglected blogging about Vquence. It’s on my list of things to improve on this year.

I get asked frequently what it is that we actually do at Vquence. So here’s an update.

Let me start by providing a bit of history. At the beginning of 2007, Vquence was totally focused on building a social video aggregation site. The site now lives at http://www.vqslices.com/ and is useful, but it lacks some of the key features that we had envisaged would make it a breakthrough.

As the year went on and we tried to build a corporate business and an income from our video aggregation, search and publication technology, we discovered that we had something of much higher value than the video handling technology: our aggregated metadata contained quantitative usage information about videos on social video sites. In addition, our “crawling” algorithms are able to supply up-to-date quantitative data instantly.

In fact, I should not simply call our data acquisition technology a “crawler”, because in the strict sense of the word it’s not. Bill Burnham describes in his blog post about SkyGrid the difference between the crawlers of traditional search engines and the newer “flow-based” approach that is based on RSS/ping servers. At Vquence we are embracing the new “flow-based” approach and extending it by using REST APIs where available. A limitation of the flow-based approach is that only a very small part of the Web is accessible through RSS and REST APIs. We therefore complement flow-based search with our own new types of data-discovery algorithms (or “crawlers”) as we see fit. In particular, locating the long tail of videos stored on YouTube is a challenge that we have mastered.
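
To illustrate what flow-based collection looks like in practice, here is a minimal Python sketch that polls a feed and records timestamped metadata snapshots. The feed URL, field names and polling interval are assumptions for illustration only and do not describe Vquence’s actual pipeline.

    # Minimal sketch of "flow-based" metadata collection: rather than crawling
    # pages, poll a feed (or a REST API) and record a timestamped snapshot of
    # each entry. The feed URL and field names are hypothetical.
    import time
    import feedparser  # third-party library: pip install feedparser

    FEED_URL = "http://example.com/videos/recent.rss"  # hypothetical feed

    def poll_feed(snapshots):
        """Fetch the feed once and append one snapshot per entry."""
        feed = feedparser.parse(FEED_URL)
        now = time.time()
        for entry in feed.entries:
            snapshots.append({
                "time": now,
                "video_id": entry.get("id") or entry.get("link"),
                "title": entry.get("title", ""),
            })

    if __name__ == "__main__":
        snapshots = []
        for _ in range(3):       # a real collector would run continuously
            poll_feed(snapshots)
            time.sleep(60)       # the polling interval is an arbitrary choice
        print(len(snapshots), "snapshots collected")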

But I digress…

So we have all this quantitative data about social videos, which we update frequently. With it, we can create graphs of the development of view counts, comment counts, video replies and the like. See, for example, the image below for a graph that compares the aggregate view count of the videos published by the main political parties in Australia during last year’s federal election. The graph shows the development of the view count over the last 2.5 months before the 2007 election.

[Image: Aggregate view count graph, Australian federal election]

At first you will notice that Labor started far above everyone else. Unfortunately we didn’t start recording view counts that early, but we assume this is due to the Kevin07 website, which was launched on 7th August. In the graph you will notice a first increase in the Coalition’s view count on 2nd September: that’s when Howard published the video for the APEC meeting of 2-9 September 2007. Then there’s another bend on 14th September, when Google launched its federal election site and we saw the first videos of the Nationals going up on YouTube. The dip in the Nationals’ curve a little after that is due to a software bug. Then on 14th October the federal election was actually announced, and you can see the massive increase in view counts for all parties from there on, ending with a huge advantage for Labor over everybody else. Interestingly enough, this also mirrors the actual outcome of the election.
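
As a rough illustration of how such an aggregate view-count curve can be computed from repeated metadata snapshots, here is a small Python sketch. The snapshot format, party labels and numbers are made up for illustration; this is not Vquence’s actual code.

    # Sketch: turn repeated (date, party, video_id, view_count) snapshots into
    # an aggregate view-count time series per party, suitable for plotting.
    # The data below is invented for illustration.
    from collections import defaultdict

    snapshots = [
        ("2007-09-01", "Labor",     "a1", 12000),
        ("2007-09-01", "Coalition", "b1",  3000),
        ("2007-09-02", "Labor",     "a1", 12500),
        ("2007-09-02", "Coalition", "b1",  4200),
        ("2007-09-02", "Coalition", "b2",   800),
    ]

    def aggregate_view_counts(snaps):
        """Sum the latest known view count of each video, per party, at each date."""
        series = defaultdict(dict)   # party -> {date: aggregate view count}
        latest = {}                  # (party, video_id) -> last seen view count
        for date, party, vid, views in sorted(snaps):
            latest[(party, vid)] = views
            series[party][date] = sum(v for (p, _), v in latest.items() if p == party)
        return dict(series)

    print(aggregate_view_counts(snapshots))
    # {'Coalition': {'2007-09-01': 3000, '2007-09-02': 5000},
    #  'Labor': {'2007-09-01': 12000, '2007-09-02': 12500}}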

So, this is the kind of information that we are now collecting at Vquence and focusing our business around.

Against that background, check out a recent blog post by Judah Phillips, “Thinking about Measuring Internet Video?”. It is a wonderful description of the kinds of things we are either offering or working on.

Using his vocabulary: we can currently provide a mix of instream and outstream KPIs to the video advertising market. Our larger aim is to provide exceptional outstream audience metrics, and we know how to get them regardless of where a video travels on the Internet. Our technology plan centers on a mix of a panel-based approach (through a browser plugin), a census-based approach (through a social network plugin for Facebook et al., also using OpenID), and video duplicate identification.

This information isn’t published on our corporate website yet, which still mostly focuses on our capabilities in video aggregation, search and publication. But we have a replacement in the making. Watch this space… 🙂

Activities for a possible Web Video Working Group

The report of the recent W3C Video on the Web workshop has come out, and it recommends forming a Video Metadata Working Group or, even more generally, a Web Video Working Group.

I have had some discussions with people who have a keen interest in this space, and we have come up with a list of topics that a W3C Video Working Group should look into. I want to share this list here. It goes into somewhat more detail than the topics that the W3C Video on the Web workshop raised. Feel free to add any further concerns or suggestions that you have in the comments; I’d be curious to get feedback.

First, there are the fundamental issues:

  • Choice of royalty-free baseline codecs for audio and video
  • Choice of encapsulation format for multi-track media delivery

Both of these really require generating a list of requirements and use cases, then analysing existing formats against these requirements, and finally deciding which ones to use.

Requirements for codecs would encompass, amongst others, the need to cover different delivery channels and receiving devices: from mobile phones on 3G bandwidth, through Web video, to full-screen TV video over ADSL.

Here are some requirements for an encapsulation format:

  • usable for live streaming and for canned delivery,
  • the ability to easily decode from any offset in a media file,
  • support for temporal and spatial hyperlinking, and the partial delivery that these require,
  • the ability to dynamically create multi-track media streams on a server and to deliver requested tracks only,
  • the ability to compose valid streams by composing segments from different servers based on a (play)list of temporal hyperlinks,
  • the ability to cache segments in the network,
  • and the ability to easily add a different “codec” track into the encapsulation (as a means of preparing for future improved codecs or other codec plugins).

The decisions for an encapsulation format and for a/v codecs may potentially require a further specification of how to map specific codecs into the chosen encapsulation format.

Then we have the “Web” requirements:

The technologies that have created what is known as the World Wide Web are fundamentally a hypertext markup language (HTML), a hypertext transfer protocol (HTTP) and a resource addressing scheme (URIs). Together they define the distributed nature of the Web. We need to build an infrastructure for hypermedia that builds on the existing Web technologies so we can make video a first-class citizen on the Web.

  • Create a URI-compatible means of temporal hyperlinking directly into time offsets of media files.
  • Create a URI-compatible means of spatial hyperlinking directly into picture areas of video files.
  • Create an HTTP-compatible protocol for negotiating and transferring video content between a Web server and a Web client. This also includes a definition of how video can be cached in HTTP network proxies and the like.
  • Create a markup language for video that also enables hyperlinks from any time and region in a video to any other Web resource. Time-aligned annotations and metadata need to be part of this, just like HTML annotates text.

All of these measures together will turn ordinary media into hypermedia, ready for distributed usage on the Web.
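
To make the first two items above a little more tangible, here is a small Python sketch of how a server or proxy might interpret a temporal hyperlink. The “#t=start,end” fragment syntax used here is a hypothetical illustration, not an agreed standard.

    # Sketch: parse a hypothetical temporal fragment of the form
    #   http://example.com/video.ogv#t=12.5,21
    # into a (start, end) pair in seconds, which a server or proxy could use to
    # deliver only the requested time range. The syntax is illustrative only.
    from urllib.parse import urlparse

    def parse_temporal_fragment(uri):
        """Return (start, end) in seconds, or None if there is no temporal fragment."""
        fragment = urlparse(uri).fragment
        if not fragment.startswith("t="):
            return None
        start_str, _, end_str = fragment[2:].partition(",")
        start = float(start_str) if start_str else 0.0
        end = float(end_str) if end_str else None   # None means "until the end"
        return start, end

    print(parse_temporal_fragment("http://example.com/video.ogv#t=12.5,21"))  # (12.5, 21.0)
    print(parse_temporal_fragment("http://example.com/video.ogv#t=60"))       # (60.0, None)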

In addition to these fundamental Web technologies, to integrate into modern Web environments, there would need to be:

  • a standard definition of a javascript API to interact with the media data,
  • an event model,
  • a DOM integration of the textual markup,
  • and possibly the use of CSS or SVG to define layout, effects, transitions and other presentation issues.

Then there are the Metadata requirements:

We all know that videos have a massive amount of metadata – i.e. data about the video. There are different types of metadata and they need to be handled differently.

  • Time-aligned text, such as captions, subtitles, transcripts, karaoke and similar text.
  • Header-type metadata, such as the ID3 tags for mp3 files, or the vorbiscomments for Ogg files.
  • Manifest-type description of the relationships between different media file tracks, similar to what SMIL enables, like the recent ROE format in development with Xiph.

Time-aligned text should actually be regarded as a codec, because it is time-aligned just like audio or video data. If we want to be able to live-stream annotated media content and receive all the data as a multiplexed stream through one connection, we need to be able to multiplex the text codec into the binary stream just like we do with audio and video. Thus, the definitions of time-aligned text codecs have to ensure that they can be multiplexed.
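
As a rough sketch of what multiplexing a time-aligned text track means, consider the following Python fragment, which interleaves text cues with audio and video packets by presentation time. The packet structure is invented for illustration and does not correspond to any particular container format.

    # Sketch: interleave time-aligned text cues with audio and video packets by
    # their start time; this is essentially what multiplexing a text "codec"
    # into a container means. The packet structure is invented for illustration.
    from dataclasses import dataclass
    from heapq import merge

    @dataclass
    class Packet:
        start: float    # presentation time in seconds
        track: str      # "video", "audio" or "text"
        payload: bytes

    video = [Packet(0.00, "video", b"frame0"), Packet(0.04, "video", b"frame1")]
    audio = [Packet(0.00, "audio", b"a0"), Packet(0.02, "audio", b"a1")]
    text  = [Packet(0.00, "text",  b"Caption: opening scene")]

    # Interleaving by start time yields a single stream that can be sent over
    # one connection and decoded from the front, as live streaming requires.
    muxed = list(merge(video, audio, text, key=lambda p: p.start))
    for p in muxed:
        print(f"{p.start:5.2f} {p.track:5s} {p.payload!r}")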

Header-type metadata should be machine-accessible and available for human consumption as required. It can be used to manage copyright and other rights-related information.

The manifest is important for dynamically creating multi-track media files as required through a client-server interaction, such as the request for a specific language audio track with the video rather than the default.
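
To make the manifest idea concrete, here is a small Python sketch of server-side track selection against a manifest. The manifest structure is hypothetical and deliberately simplified; it is not the ROE format.

    # Sketch: a simplified, hypothetical manifest describing the tracks of a
    # media resource, and a server-side selection of the tracks a client asked
    # for. This is not the ROE format, just an illustration of the idea.
    manifest = {
        "video": [{"id": "v1", "codec": "theora"}],
        "audio": [{"id": "a-en", "codec": "vorbis", "language": "en"},
                  {"id": "a-de", "codec": "vorbis", "language": "de"}],
        "text":  [{"id": "t-en", "codec": "cmml", "language": "en"}],
    }

    def select_tracks(manifest, audio_language="en", with_captions=False):
        """Pick the video track, the audio track in the requested language,
        and optionally the caption tracks, ready for dynamic multiplexing."""
        tracks = list(manifest["video"])
        tracks += [t for t in manifest["audio"] if t["language"] == audio_language]
        if with_captions:
            tracks += manifest["text"]
        return [t["id"] for t in tracks]

    print(select_tracks(manifest, audio_language="de", with_captions=True))
    # ['v1', 'a-de', 't-en']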

Other topics of interest:

There are two more topics that I would like to point out that require activities.

  • “DRM”: The real need here has to be analysed. Is it a need to encrypt the media file such that it can only be read by specific recipients? Maybe an encryption scheme with public and private keys could provide this functionality. Or is it a need to retain copyright and licensing information with the media data? In that case, encapsulating metadata inside the media files may already be a good solution, since this information stays with the media file after a delivery or copy act.
  • Accessibility: It needs to be ascertained that captions, sign language, video descriptions and the like can be associated with the video in a time-aligned fashion within the chosen encapsulation format. A standard time-aligned format for specifying sign language would also be needed.

This list of required technologies has been built through years of experience experimenting with the seamless integration of video into the World Wide Web in the Annodex project, and through recent discussions at the W3C Video on the Web workshop and elsewhere.

This list merely provides a structure for what needs to be addressed to make video a first-class citizen on the Web. There are many difficult detail problems to solve in each of these areas. It is a challenge to understand the complexity of the problem, but I hope this structure can help break down some of that complexity and help us start attacking the issues.