ACOUSTICALLY DERIVED DISCOURSE STRUCTURE FOR
INTERACTIVE SPEECH SKIMMING

Chris Schmandt

Media Laboratory
Massachusetts Institute of Technology

CONTACT INFORMATION

Chris Schmandt
MIT Media Lab
E15-327
20 Ames St.
Cambridge, MA 02139

Email: geek@media.mit.edu
Phone: +1-617-253-5156
Fax: +1-617-258-6264

WWW PAGE

http://geek.mit.edu/people/geek/home.html

PROGRAM AREA

Speech and natural language understanding.

Also strongly related to Adaptive User Interfaces.

KEYWORDS

Discourse structure, intonation, digitized speech, speech skimming

PROJECT SUMMARY

Our goal is to make possible enhanced listening to recorded audio lectures. The proposed approach has two components. First, we will develop an approach for deriving the structure inherent in a recording, based on a discourse-oriented analysis and using acoustical techniques. Second, we will provide a user interface and interaction techniques to allow a listener to exploit that structure in real time. We will evaluate not only the performance of the acoustic analysis of discourse structure, but also the usability of the audio skimming user interface, which will be designed iteratively. Both components will build upon existing research.

Previous work has used the acoustic cues of pauses and pitch, usually in isolation, to attempt to identify emphasized or salient portions of a monologue. Well-developed theories of discourse suggest a much richer underlying structure, however. Discourse structure will be exploited to provide a hierarchical "view" of the recording, allowing listening at different levels of detail. Automatic detection of discourse structure will require correlating multiple acoustic cues, both pitch and duration. The results will then be evaluated against a marked corpus. We do not expect perfect results; we hope to obtain results that are good enough to serve as cues for the speech skimming user interface.
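As an illustration of what correlating pitch and duration cues could look like, the following is a minimal sketch, not the project's method. It assumes the recording has already been split into pause-delimited phrases with a mean pitch estimate for each, and it assumes that a longer pause and a rise in pitch relative to the preceding phrase both make a discourse boundary more likely. The names `Phrase` and `boundary_scores`, the equal weights, and the 2-second pause cap are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phrase:
    """One pause-delimited stretch of speech (hypothetical representation)."""
    start: float          # seconds from the start of the recording
    end: float            # seconds from the start of the recording
    mean_pitch_hz: float  # average fundamental frequency over the phrase


def boundary_scores(phrases: List[Phrase],
                    pause_weight: float = 0.5,
                    pitch_weight: float = 0.5,
                    max_pause: float = 2.0) -> List[float]:
    """Score the gap before each phrase except the first.

    Each score is a weighted sum of two normalized cues:
    the pause length (duration cue) and the upward pitch change
    relative to the previous phrase (pitch cue). The weights and
    the pause cap are illustrative assumptions, not tuned values.
    """
    if len(phrases) < 2:
        return []
    pitches = [p.mean_pitch_hz for p in phrases]
    # Avoid division by zero when every phrase has the same pitch.
    pitch_range = (max(pitches) - min(pitches)) or 1.0
    scores = []
    for prev, cur in zip(phrases, phrases[1:]):
        # Duration cue: silence between phrases, clipped to [0, max_pause].
        pause = min(max(cur.start - prev.end, 0.0), max_pause) / max_pause
        # Pitch cue: only an upward jump counts; a fall contributes nothing.
        reset = max(cur.mean_pitch_hz - prev.mean_pitch_hz, 0.0) / pitch_range
        scores.append(pause_weight * pause + pitch_weight * reset)
    return scores


if __name__ == "__main__":
    demo = [
        Phrase(0.0, 2.0, 180.0),
        Phrase(2.2, 4.0, 150.0),
        Phrase(5.5, 8.0, 200.0),
    ]
    for phrase, score in zip(demo[1:], boundary_scores(demo)):
        print(f"gap before {phrase.start:.1f}s: score {score:.3f}")
```

Gaps whose score exceeds some threshold could then be proposed as segment boundaries and compared against the hand-marked corpus; the choice of threshold is left open here.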

The user interface will be built around a physical interaction device, not a traditional graphical user interface. Our goal is to prototype a device which could be built into appropriate portable hardware and allow one to listen in a variety of situations where conventional interfaces are inappropriate, such as while commuting or exercising. Determining the physical form of the interface is in part the subject of this research, but we propose that it support a model of "audio zooming," facilitating rapid scanning of the stored material at different levels of detail. The interface will include time scaling, skimming at various granularities, and acoustic feedback to convey the discourse structure and skipped portions. The interface will be designed iteratively with continuous user evaluation.
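One way to picture "audio zooming" is as a function from a zoom level to a list of spans to play. The sketch below is only an illustration under stated assumptions: the recording is assumed to be already divided into non-overlapping segments, each tagged with a depth in the discourse hierarchy (0 for a top-level topic, larger numbers for finer detail). The names `Segment` and `skim_plan`, the optional preview length, and the way time scaling is folded into the estimate are hypothetical and do not describe the eventual interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """A stretch of the recording with its depth in the hierarchy (hypothetical)."""
    start: float  # seconds
    end: float    # seconds
    depth: int    # 0 = top-level topic, larger = finer detail


def skim_plan(segments: List[Segment],
              zoom: int,
              preview: Optional[float] = None,
              speed: float = 1.0) -> Tuple[List[Tuple[float, float]], float]:
    """Return the spans to play at a zoom level and the estimated listening time.

    zoom    -- segments deeper than this level are skipped
    preview -- if given, play at most this many seconds of each kept segment
    speed   -- time-scaling factor (2.0 means twice as fast)
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    spans = []
    for seg in segments:
        if seg.depth > zoom:
            continue  # too detailed for the current zoom level
        end = seg.end if preview is None else min(seg.end, seg.start + preview)
        spans.append((seg.start, end))
    listening_time = sum(end - start for start, end in spans) / speed
    return spans, listening_time


if __name__ == "__main__":
    lecture = [
        Segment(0.0, 10.0, 0),
        Segment(10.0, 25.0, 1),
        Segment(25.0, 40.0, 2),
        Segment(40.0, 50.0, 0),
    ]
    for zoom in range(3):
        spans, seconds = skim_plan(lecture, zoom, preview=5.0, speed=2.0)
        print(f"zoom {zoom}: {spans} (about {seconds:.1f}s of listening)")
```

Zooming in raises the level and adds detail; zooming out lowers it and shortens the listening time. The gaps between consecutive spans are where acoustic feedback about skipped material could be inserted.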

PROJECT REFERENCES

This project has only just begun. These references are to related projects by the same research group.

Stifelman, L. J. A Discourse Analysis Approach to Structured Speech. In Proceedings of the AAAI 95 Spring Symposium Series. Empirical Methods in Discourse Interpretation and Generation, 1995.

Arons, B. Speech Skimmer: Interactively Skimming Recorded Speech. In Proceedings, UIST, ACM, 1993.

Schmandt, C. and Mullins, A. AudioStreamer: Exploiting Simultaneity for Listening. In Proceedings of CHI 95 Short Papers, ACM SIGCHI, 1995.

AREA BACKGROUND

This project is about automatically detecting discourse structure and using this knowledge to facilitate interactive retrieval of spoken documents. There is a substantial literature on discourse structure and, more recently, on acoustical cues for discourse structure. Note that this work differs from much of what has been done before in that it assumes no lexical knowledge -- we are not attempting to recognize words, or content, but rather structure.

The other related background area is the traditional human-computer interaction field, in which there have been a small number of projects (many from the same Lab at MIT) on various approaches for capturing and later interacting with digital recordings of speech.

AREA REFERENCES

Passonneau, R. J. and Litman, D. J. Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 1993.

Grosz, B. and Hirschberg, J. Some Intonational Characteristics of Discourse Structure. In Proceedings of the International Conference on Spoken Language Processing, pages 429-432. 1992.

Grosz, B. and Sidner, C. Attention, Intentions, and the Structure of Discourse. Computational Linguistics, 12(3):175-204, 1986.

RELATED PROGRAM AREAS

Adaptive User Interfaces

Usability and User-Centered Design