with co-PIs Patti Price, SRI International,
and Stefanie Shattuck-Hufnagel, MIT
Electrical, Computer and Systems Engineering
Boston University
Boston, MA 02215
mo@raven.bu.edu
617-353-5430 (phone)
617-353-8437 (fax)
N. Veilleux and M. Ostendorf, 1993a. ``Probabilistic parse scoring with prosodic information,'' Proc. ICASSP, Vol. II, pp. 51-55.
N. Veilleux and M. Ostendorf, 1993b. ``Prosody/parse scoring and its application in ATIS,'' Proc. ARPA HLT Workshop, pp. 335-340.
C. W. Wightman and M. Ostendorf, 1994. ``Automatic labeling of prosodic patterns,'' IEEE Trans. Speech and Audio Processing, Vol. 2, No. 4, pp. 469-481.
S. Shattuck-Hufnagel, M. Ostendorf and K. Ross, 1995. ``Pitch accent placement within lexical items in American English,'' J. Phonetics, Vol. 22, pp. 357-388.
M. Swerts and M. Ostendorf, 1995. ``Discourse prosody in human-machine interactions,'' Proc. of the ESCA Tutorial and Research Workshop on Spoken Dialogue Systems - Theories and Applications.
K. Ross and M. Ostendorf, 1995. ``A dynamical system model for recognizing intonation patterns,'' Proc. Eurospeech, Vol. 2, pp. 993-996.
E. Shriberg, 1994. ``Preliminaries to a theory of speech disfluencies,'' UC Berkeley Ph.D. thesis.
M. Ostendorf, P. J. Price and S. Shattuck-Hufnagel, 1995. ``The Boston University radio news corpus,'' Boston University Technical Report No. ECS-95-001. (available by anonymous ftp from raven.bu.edu)
P. Price and M. Ostendorf, 1996. ``Combining linguistic with statistical methods in modeling prosody,'' in J. L. Morgan and K. Demuth (Eds.), Signal to syntax: Bootstrapping from speech to grammar in early acquisition, Hillsdale, NJ: Lawrence Erlbaum Associates.
The key problems that must be solved to make effective use of prosody in human-computer communication include: identifying the important abstract prosodic patterns and their acoustic correlates, recognizing prosodic patterns automatically, analyzing and computationally modeling the mapping between prosodic patterns and linguistic constructs, and integrating the detected prosodic information with existing speech recognition and understanding systems. Because these problems span both technical and linguistic disciplines, solving them benefits from a multi-disciplinary approach. Progress has been made in all of these areas. Through a multi-site collaborative effort, a standard prosodic labeling system has been proposed for American English (Pitrelli et al., 1994), which makes it possible for different researchers to share data and results. Much of the development of automatic prosody recognition algorithms has built on statistical models from speech recognition (e.g., Wightman and Ostendorf, 1994; Ross and Ostendorf, 1995). There have been numerous studies of the mappings between acoustic correlates, abstract prosodic structure and linguistic structure; for prosody and discourse structure in particular, a good summary can be found in (Hirschberg, 1993). A variety of architectures have been explored for integrating prosody into speech understanding systems (Bear and Price, 1990; Veilleux and Ostendorf, 1993a; Kompe et al., 1993). In speech recognition, there have been several efforts to improve on HMM duration models, and a more extensive use of prosody is explored in (Kompe et al., 1995). Although more work is still needed on all fronts, much of the groundwork has been laid, opening up numerous possibilities for applying prosody modeling in actual speech understanding systems.
J. Pitrelli, M. Beckman, and J. Hirschberg, 1994. ``Evaluation of prosodic transcription labeling reliability in the ToBI framework,'' Proc. ICSLP, pp. 123-126.
J. Hirschberg, 1993. ``Studies of intonation and discourse,'' Proc. ESCA Workshop on Prosody, pp. 90-95.
J. Bear and P. J. Price, 1990. ``Prosody, syntax and parsing,'' Proc. ACL Conference.
R. Kompe et al., 1993. ``Prosody takes over: a prosodically guided dialog system,'' Proc. Eurospeech, pp. 2003-2006.
R. Kompe et al., 1995. ``Prosodic scoring of word hypotheses graphs,'' Proc. Eurospeech, pp. 1333-1336.
a) If we are to integrate speech input with other modalities, then the problem of ambiguous references will be even more difficult than in spoken language human/computer dialogs. Prosodic cues may help define discourse units and indicate interpretation errors, keeping the human/computer interaction on track.
b) Improving models of prosody for recognition can directly improve the synthesis models needed for computer voice response, and vice versa. Indirectly, improved prosodic label recognition can be used to annotate large corpora automatically, and those corpora can in turn further our understanding of the mapping from prosody to linguistic structure.
4. Adaptive Human Interfaces.
Several researchers have proposed using dynamic models of language in speech understanding systems, adapting the probability of words as a human user interacts with a system. In addition to using semantic knowledge of the domain, adaptation could depend on prosodic cues that indicate focused information or a change of topic.
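As a rough illustration of how such adaptation might work, the sketch below interpolates a fixed unigram distribution with a cache of recently heard words. It is only a sketch under stated assumptions, not a description of any existing system: the class name, the interpolation weight, the focus boost, and the choice to clear the cache at a topic change are all hypothetical, and the prosodic cues (which words are focused, whether the topic changed) are assumed to be supplied by a separate prosody recognizer.

```python
from collections import defaultdict


class ProsodyAdaptiveUnigram:
    """Hypothetical cache-adapted unigram model driven by prosodic cues."""

    def __init__(self, base_probs, cache_weight=0.2, focus_boost=3.0):
        # base_probs: static word probabilities (assumed to sum to 1).
        self.base = dict(base_probs)
        # cache_weight: assumed interpolation weight for the adaptive part.
        self.cache_weight = cache_weight
        # focus_boost: assumed count given to a prosodically focused word.
        self.focus_boost = focus_boost
        self.cache = defaultdict(float)

    def observe(self, words, focused=(), topic_change=False):
        """Update the cache from one recognized utterance.

        focused: words the (assumed) prosody recognizer marked as focused.
        topic_change: True if prosody signalled a new topic.
        """
        if topic_change:
            # Discard history that belongs to the previous topic.
            self.cache.clear()
        focused = set(focused)
        for w in words:
            self.cache[w] += self.focus_boost if w in focused else 1.0

    def prob(self, word):
        """Interpolate the static probability with the cache estimate."""
        base_p = self.base.get(word, 0.0)
        total = sum(self.cache.values())
        if total == 0:
            # Nothing observed yet: fall back to the static model.
            return base_p
        cache_p = self.cache.get(word, 0.0) / total
        return (1 - self.cache_weight) * base_p + self.cache_weight * cache_p


# Toy usage with a made-up three-word vocabulary.
model = ProsodyAdaptiveUnigram({"boston": 0.1, "denver": 0.1, "flight": 0.8})
model.observe(["flight", "boston"], focused=["boston"])
p_boston = model.prob("boston")   # raised above its static value of 0.1
p_denver = model.prob("denver")   # lowered, since it was not mentioned
model.observe([], topic_change=True)
p_reset = model.prob("boston")    # back to the static value
```

In this toy setting a focused word gains probability faster than an unfocused one, and a topic-change cue returns the model to its static distribution. A real system would more likely adapt an n-gram model and decay the cache gradually rather than clearing it.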
6. Intelligent Interactive Systems for Persons with Disabilities.
As mentioned in (3) above, improving models of prosody recognition can lead to improved speech synthesis, and synthesis has long been established as an important aid to persons with disabilities. For recognition, it may be that prosody plays an especially important role in the speech of persons who use their voice both to control their environment and to communicate. If speech must carry more information, then the information in prosody is likely to be less redundant with the syntactic/semantic content. Since the particular strategies used by individuals may differ, prosodic pattern recognition algorithms are needed to automatically train speaker-specific communication/control interfaces.