2002: Scene Determination Using Auditive Segmentation Models of Edited Video

Pfeiffer, S.; Srinivasan, U. “Scene Determination Using Auditive Segmentation Models of Edited Video” in: C. Dorai and S. Venkatesh (Eds.) “Computational Media Aesthetics”, Kluwer Academic Publishers, pp. 105-130, 2002.

Abstract: This chapter describes different approaches that use audio features to determine scenes in edited video. It focuses on analysing the sound track of videos to extract high-level video structure. We define a scene in a video as a temporal interval that is semantically coherent. The semantic coherence of a scene is often constructed during cinematic editing of a video. An example is the use of music to join several shots into a scene that describes a lengthy passage of time, such as the journey of a character. Some semantic coherence is also inherent to the unedited video material, such as the sound ambience at a specific setting or the pattern of speaker changes in a dialogue. Another kind of semantic coherence is constructed from the textual content of the sound track, revealing, for example, the different stories contained in a news broadcast or documentary. This chapter explains, from a film art perspective, the types of scenes that can be constructed via audio cues. It continues with a discussion of the feasibility of automatically extracting these scene types and finally presents existing approaches.