EVALUATING THE USE OF PROSODIC INFORMATION
IN SPEECH RECOGNITION AND UNDERSTANDING

Mari Ostendorf

with co-PIs Patti Price, SRI International, and Stefanie Shattuck-Hufnagel, MIT

Electrical, Computer and Systems Engineering
Boston University
Boston, MA 02215

CONTACT INFORMATION

Electrical, Computer and Systems Engineering, Boston University, 44 Cummington St, Boston, MA 02215

mo@raven.bu.edu
617-353-5430 (phone)
617-353-8437 (fax)

WWW PAGE

http://raven.bu.edu/

PROGRAM AREA

2. Speech and Natural Language Understanding.

KEYWORDS

Prosody, speech understanding, speech recognition

PROJECT SUMMARY

Prosodic patterns mark information structure in speech, and this research aims to detect those patterns and use them for improved speech understanding. To accomplish these goals, the effort involves analysis of acoustic cues to prosodic patterns and disfluencies, development of algorithms for detecting them, and exploration of architectures for using the detected patterns to improve speech recognition and understanding. Important aspects of the approach include integration of linguistic theory and statistical modeling and evaluation of algorithms with existing speech systems. Prosody is used as a supplementary knowledge source, providing information not available from the words alone for evaluating language interpretation hypotheses.

PROJECT REFERENCES

Selected publications supported all or in part by this grant:

N. Veilleux and M. Ostendorf, 1993a. ``Probabilistic parse scoring with prosodic information,'' Proc. ICASSP, Vol. II, pp. 51-55.

N. Veilleux and M. Ostendorf, 1993b. ``Prosody/parse scoring and its application in ATIS,'' Proc. ARPA HLT Workshop, pp. 335-340.

C. W. Wightman and M. Ostendorf, 1994. ``Automatic labeling of prosodic patterns,'' IEEE Trans. Speech and Audio Processing, Vol. 2, No. 4, pp. 469-481.

S. Shattuck-Hufnagel, M. Ostendorf and K. Ross, 1995. ``Pitch accent placement within lexical items in American English,'' J. Phonetics, Vol. 22, pp. 357-388.

M. Swerts and M. Ostendorf, 1995. ``Discourse prosody in human-machine interactions,'' Proc. of the ESCA Tutorial and Research Workshop on Spoken Dialogue Systems - Theories and Applications.

K. Ross and M. Ostendorf, 1995. ``A dynamical system model for recognizing intonation patterns,'' Proc. Eurospeech, Vol. 2, pp. 993-996.

E. Shriberg, 1994. ``Preliminaries to a theory of speech disfluencies,'' UC Berkeley Ph.D. thesis.

M. Ostendorf, P. J. Price and S. Shattuck-Hufnagel, 1995. ``The Boston University radio news corpus,'' Boston University Technical Report No. ECS-95-001. (available by anonymous ftp from raven.bu.edu)

P. Price and M. Ostendorf, 1996. ``Combining linguistic with statistical methods in modeling prosody,'' in J. L. Morgan and K. Demuth (Eds.), Signal to syntax: Bootstrapping from speech to grammar in early acquisition, Hillsdale, NJ: Lawrence Erlbaum Associates.

AREA BACKGROUND

Spoken language is the primary mode of communication between humans for interactive problem solving, and therefore spoken language understanding is a vital technology for making computers accessible to a broad range of the population. As human listeners, we bring many sources of information to bear on the problem of interpreting an utterance, including syntax, semantics, our knowledge of the world and conversational context, as well as prosody. Prosodic phrase structure and prominence patterns often provide the link between acoustic realization and linguistic interpretation of a word, giving clues as to how to parse a word string, which element is in focus, whether a point is in question, and whether there has been a change in topic. Although prosody provides such important information, it has been little used so far in spoken language understanding systems. One reason is simply that prosody modeling is a difficult problem, with acoustic cues depending on prosodic structures operating at many different time scales. A second reason may be that current speech systems handle only constrained domains, where the information provided in prosody is often redundant with semantic cues in speech understanding systems or is less useful because the speech recognition task involves read speech. However, as systems move towards less constrained and more natural interaction, the additional information provided by prosody will become increasingly important.

The key problems that must be solved to make effective use of prosody in human-computer communication include: identification of the important abstract prosodic patterns and their acoustic correlates, automatic prosodic pattern recognition, analysis and computational modeling of the mapping between prosodic patterns and linguistic constructs, and integration of the detected prosodic information with existing speech recognition and understanding systems. Solving these problems, which span both technical and linguistic disciplines, benefits from a multi-disciplinary approach. Progress has been made in all of these areas. Through a multi-site collaborative effort, a standard prosodic labeling system has been proposed for American English (Pitrelli et al, 1994), which makes it possible for different researchers to share data and results. Much of the automatic prosody recognition algorithm development has built on statistical models from speech recognition (e.g., Wightman and Ostendorf, 1994; Ross and Ostendorf, 1995). There have been numerous studies of the mappings between acoustic correlates, abstract prosodic structure and linguistic structure. For prosody and discourse structure in particular, a good summary can be found in (Hirschberg, 1993). A variety of architectures have been explored for integrating prosody into speech understanding systems (Bear and Price, 1990; Veilleux and Ostendorf, 1993a; Kompe et al, 1993). In speech recognition studies, there have been several efforts at improving on HMM duration models, and a more extensive use of prosody is explored in (Kompe et al, 1995). Although more work is still needed on all fronts, much of the groundwork has been laid in the field, opening up numerous possibilities for application of prosody modeling in actual speech understanding systems.

AREA REFERENCES

(Excluding those listed above)

J. Pitrelli, M. Beckman, and J. Hirschberg, 1994. ``Evaluation of prosodic transcription labeling reliability in the ToBI framework,'' Proc. ICSLP, pp. 123-126.

J. Hirschberg, 1993. ``Studies of intonation and discourse,'' Proc. ESCA Workshop on Prosody, pp. 90-95.

J. Bear and P. J. Price, 1990. ``Prosody, syntax and parsing,'' Proc. ACL Conference.

R. Kompe et al, 1993. ``Prosody takes over: a prosodically guided dialog system,'' Proc. Eurospeech, pp. 2003-2006.

R. Kompe et al, 1995. ``Prosodic scoring of word hypotheses graphs,'' Proc. Eurospeech, pp. 1333-1336.

RELATED PROGRAM AREAS

Speech recognition and understanding is relevant to all the other program areas, and by extension so is the use of prosody recognition to improve accuracy of speech processing. However, areas that I could envision more specifically taking advantage of prosody include: (3) Other Communication Modalities; (4) Adaptive Human Interfaces; and (6) Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

3. Other Communication Modalities:

a) If we are to integrate speech input with other modalities, then the problem of ambiguous references will be even more difficult than in spoken language human/computer dialogs. Prosodic cues may help define discourse units and indicate interpretation errors, keeping the human/computer interaction on track.

b) By improving models of prosody for recognition, we can directly improve the synthesis models needed for computer voice response, and vice versa. Indirectly, improved prosodic label recognition can be used to automatically annotate large corpora, which can in turn be used to further our understanding of the mapping from prosody to linguistic structure.

4. Adaptive Human Interfaces.

Several researchers have proposed using dynamic models of language in speech understanding systems, adapting the probability of words as a human user interacts with a system. In addition to using semantic knowledge of the domain, adaptation could be dependent on prosody cues that indicate focused information or a change of topic.

6. Intelligent Interactive Systems for Persons with Disabilities.

As mentioned in (3) above, improving models of prosody recognition can lead to improved speech synthesis, and synthesis has long been established as an important aid to persons with disabilities. For recognition, it may be that prosody plays an especially important role in the speech of persons who use their voice both to control their environment and to communicate. If speech must contain more information, then the information in prosody is likely to be less redundant with the syntactic/semantic content. Since the particular strategies used by individuals may differ, prosodic pattern recognition algorithms are needed to automatically train speaker-specific communication/control interfaces.
