High Level Video Summaries, Cross-Referencing and Editing
A system and method are presented for constructing a highly compact hierarchical representation of video. The representation is tree-like: the bottom level is composed of frames, and the highest level represents a non-temporal segmentation of the video. It is suitable for well-structured video genres. The video is first temporally segmented into shots, and the resulting shot sequence is then temporally segmented into scenes. We then demonstrate the benefits of using mosaics to represent shots. A novel method for mosaic alignment and comparison is proposed and shown to be both efficient and effective. A scene distance measure based on mosaic comparison is then defined and used to cluster the scenes into a higher abstraction of video content, the physical settings. The hierarchical representation is demonstrated on situation comedies, in which this abstraction carries strong semantic meaning for summarizing videos. By comparing physical settings across different episodes of the same situation comedy, we determine the main plots of each episode. In another example, we apply our mosaic comparison method to detect significant events in basketball games, enabling fast forwarding from one event to the next. A more recent application produces edited video from raw, unedited footage, and we focus on generating wedding videos. We use the existing wedding photo album as an abstract and produce an edited wedding video from it. The photo album is used to determine the importance of the raw shots, as well as their style and order.
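As a rough illustration of the top level of the hierarchy, the sketch below groups scenes into physical settings from a precomputed scene distance matrix. It is only a minimal sketch under stated assumptions: the function name `cluster_scenes`, the threshold value, and the distance values are hypothetical, and threshold-based single-linkage grouping stands in for whatever clustering procedure is actually used. The mosaic-based scene distance itself is not reproduced here.

```python
from typing import Dict, List


def cluster_scenes(dist: List[List[float]], threshold: float) -> List[List[int]]:
    """Group scene indices into candidate physical settings.

    Two scenes end up in the same group when they are linked by a chain of
    pairwise distances below `threshold` (single linkage). This is an
    illustrative stand-in, not the clustering method of the text.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest, one node per scene

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the root for stable output.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical distances between four scenes (symmetric, zero diagonal).
scene_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]

# The threshold is an assumed value chosen for this toy matrix.
settings = cluster_scenes(scene_dist, threshold=0.5)
print(settings)  # [[0, 1], [2, 3]]
```

With these toy values, scenes 0 and 1 fall into one setting and scenes 2 and 3 into another. In practice the threshold, or the number of settings, would have to be chosen for the video genre at hand.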