COMPUTATIONAL MODELS FOR SPEECH GENERATION

James Flanagan

Center for Computer Aids for Industrial Productivity (CAIP)
Rutgers University

CONTACT INFORMATION

CAIP Center, Rutgers University, CoRE Building, Busch Campus, Piscataway, NJ 08855-1390

email: jlf@caip.rutgers.edu

tel: (908) 445 3443

fax: (908) 445 0547

WWW PAGE

http://www.caip.rutgers.edu/

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Speech generation and production, Speech synthesis, Computational fluid dynamics (CFD), Articulatory models, Acoustics, Navier-Stokes equations

PROJECT SUMMARY

Natural spoken language is a preferred means of communication among humans. Because of advances in automatic speech recognition, in speech synthesis, and in the computational economy of microelectronics, natural spoken language is emerging as an effective means for human/machine communication. As yet, however, the fundamental understanding supports application only in narrow, specific tasks. This limitation is due in part to a lack of detailed knowledge about how speech signals are generated and how they can be described quantitatively. It is also due to inadequate computational models of language. In both instances, research progress has been hampered by the unavailability of the high-performance computing needed to support analysis of the basic physics of speech generation. As a result, speech recognizers are fragile in performance and unduly susceptible to interference, and speech synthesizers produce a signal quality that falls far short of human naturalness. New understanding that can diminish these limitations can now be gained through emerging computing capabilities and through accumulated knowledge of speech production.

The objective of this research is to seek a compact and more accurate description of information in the speech signal. The approach uses articulatory parameters to create a new, realistic computer model for the generation of human speech from first principles of fluid flow. The research characterizes the generation of speech signals in terms of (a) an articulatory description of the vocal system, and (b) a fluid-dynamic solution to the generation, propagation, and radiation of audible sound produced by the acoustic system. Included is a computation of the speech signal from first principles, using the Navier-Stokes description of fluid flow. Preliminary research has already demonstrated the feasibility of the approach. Computational methods are within reach for realistically characterizing the non-linearities involved in voiced-sound generation by the vocal cords, voiceless-fricative generation from turbulent flow at constrictions, and resonance and radiation effects conditioned by sound travel in a non-uniform, lossy, yielding-wall conduit (i.e., the human vocal tract). These factors have been particularly difficult to quantify with conventional linear models of the speech production system.
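For reference, one common way to write the Navier-Stokes description mentioned above, for a compressible Newtonian fluid with constant viscosity coefficients, is sketched below. The summary does not state which form the project uses, so the symbols (density \rho, velocity \mathbf{u}, pressure p, shear viscosity \mu, bulk viscosity \zeta) and the constant-viscosity assumption are illustrative only. An energy equation and an equation of state would also be needed to close the system.

```latex
\begin{align}
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) &= 0, \\
\rho \left( \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} \right)
  &= -\nabla p + \mu \nabla^{2} \mathbf{u}
     + \left( \zeta + \tfrac{1}{3}\mu \right) \nabla (\nabla \cdot \mathbf{u}).
\end{align}
```

The convective term (\mathbf{u} \cdot \nabla)\mathbf{u} is the nonlinearity that a linearized acoustic model discards.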

First indications are that replication of the basic physics of sound generation leads to synthetic speech of improved naturalness, and hence to the possibility of ``machine voices'' of higher quality. Preliminary indications are also that articulatory descriptions of speech are more robust to variability, and may lead to speech recognizers that are more immune to interference.

This research initially formulates software for computing sound pressures and velocities in a two-dimensional vocal tract, via Navier-Stokes solutions on a dense time-space grid. The articulatory shape is prescribed and, in initial studies, is non-time-varying. The results permit characterizing acoustic interaction between sound sources and the resonator system, and permit identifying the dominant parameters that condition turbulent flow and chaotic pressure generation. The research subsequently moves to three-dimensional time-varying articulatory shapes. Throughout, interactive auditory assessment of synthesized signals is made in quantitative listening tests. Parsimonious modeling of articulatory shape and dynamics will be sought as a totally new parameterization of speech signals.
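A full Navier-Stokes solver is beyond a short example. As a much simpler stand-in, the sketch below steps the linearized, lossless acoustic equations (not the Navier-Stokes equations) on a two-dimensional staggered grid for a fixed rectangular duct with rigid walls. It is meant only to illustrate what computing pressures and velocities on a time-space grid looks like. The function name `simulate_duct`, the grid sizes, the air constants, and the Gaussian starting pulse are all assumptions made for illustration; none come from the project.

```python
import math

def simulate_duct(nx=60, ny=8, steps=100, dx=1e-3, c=350.0, rho=1.2):
    """Step linear acoustics on a 2D staggered grid (hypothetical setup).

    p  : pressure at cell centres, nx x ny
    ux : x-velocity on vertical cell faces, (nx + 1) x ny
    uy : y-velocity on horizontal cell faces, nx x (ny + 1)
    Boundary faces are never updated, so every wall is rigid (zero normal velocity).
    """
    # Time step at half the 2D stability (CFL) limit dx / (c * sqrt(2)).
    dt = 0.5 * dx / (c * math.sqrt(2.0))

    # Assumed initial condition: a Gaussian pressure pulse near the left end,
    # uniform across the duct height.
    p = [[math.exp(-((i - 10) / 3.0) ** 2) for _ in range(ny)] for i in range(nx)]
    ux = [[0.0] * ny for _ in range(nx + 1)]
    uy = [[0.0] * (ny + 1) for _ in range(nx)]

    a = dt / (rho * dx)          # coefficient in  rho du/dt = -grad p
    b = dt * rho * c * c / dx    # coefficient in  dp/dt = -rho c^2 div u

    for _ in range(steps):
        # Momentum update on interior faces only.
        for i in range(1, nx):
            for j in range(ny):
                ux[i][j] -= a * (p[i][j] - p[i - 1][j])
        for i in range(nx):
            for j in range(1, ny):
                uy[i][j] -= a * (p[i][j] - p[i][j - 1])
        # Continuity update at cell centres.
        for i in range(nx):
            for j in range(ny):
                div = (ux[i + 1][j] - ux[i][j]) + (uy[i][j + 1] - uy[i][j])
                p[i][j] -= b * div
    return p, ux, uy, dt

if __name__ == "__main__":
    p, ux, uy, dt = simulate_duct()
    print("time step (s):", dt)
    print("peak pressure after run:", max(max(row) for row in p))
```

Because all walls are rigid in this sketch, the sum of the pressure field over the grid stays constant up to rounding error, which gives a quick sanity check on the update loop. Turbulence, viscosity, wall yielding, and radiation at the lips, which are the effects the project targets, are all absent here.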

This research is expected to result in a totally new parameterization of speech information, one that will directly support new methods for speech synthesis, low bit-rate speech coding, and less-fragile speech and speaker recognition. An ancillary result is speech synthesis of quality and naturalness surpassing what has yet been achieved.

PROJECT REFERENCES

Daily, J. W. and Harleman, D. F. (1966): Fluid Dynamics, Addison-Wesley Pub. Company, Inc., Reading, Mass.

Fant, G. (1970): Acoustic Theory of Speech Production, Mouton, The Hague.

Flanagan, J. L. (1972): Speech Analysis, Synthesis and Perception, Springer-Verlag, New York.

Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1975): ``Synthesis of Speech from a Dynamic Model of the Vocal Cords and Vocal Tract,'' Bell Sys. Tech. J. 54, 485-506.

Hegerl, G. C., and Hoge, H. (1991): ``Numerical Simulation of the Glottal Flow by a Model Based on the Compressible Navier-Stokes Equations,'' Proc. ICASSP, 477-480.

Richard, G., Liu, M., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., and Slimon, S. (1995): ``Numerical simulations of fluid flow in the vocal tract,'' Proc. of 1995 Eurospeech, Madrid, Spain, September 18-21, 1995.

Richard, G., Liu, M., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., and Slimon, S. (1995): ``Vocal tract simulations based on fluid dynamic analysis,'' J. Acoust. Soc. Am., Vol. 97 (5S) Part 2, p. 3245.

Wagner, C. A. (1989): ``Computations of a Horseshoe Vortex Around an Endwall Mounted Cylinder,'' Ph. D. Thesis, University of Connecticut.

Baer, T., Gore, J. C., Gracco, L. C., and Nye, P. W. (1991): ``Analysis of Vocal Tract Shape and Dimension Using Magnetic Resonance Imaging: Vowels,'' J. Acoust. Soc. Am., 90(2), 799-828.

Sondhi, M. M. and Schroeter, J. (1987): ``A hybrid time-frequency domain articulatory speech synthesizer,'' IEEE Trans. ASSP-35, pp. 955-967.

Stevens, K. N. (1971): ``Airflow and turbulence noise for fricative and stop consonants: Static considerations,'' J. Acoust. Soc. Am. 50, pp. 1180-1192.

AREA BACKGROUND

The area of Speech and Natural Language Understanding includes two major components of human/machine communication. One is automatic speech recognition, through which humans communicate with computers via natural spoken language; the other is text-to-speech synthesis, through which retrieved information and/or processed results are spoken. The combination of the two technologies enables information retrieval via telephone networks (or a displayless communication mode).

The dominant methods for speech recognition are based on statistical modeling of speech, in which spectral or cepstral coefficients serve as the measured speech features. Speech synthesis, meanwhile, relies mainly on the linear acoustic theory of speech production. That theory assumes plane-wave propagation inside the vocal tract and neglects nonlinear terms. The present project aims to study speech sound generation and propagation from first principles of physics. It then aims to use a new articulatory-based parameterization to develop natural-sounding text-to-speech synthesizers and less-fragile speech recognizers.
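To make the linear plane-wave theory concrete, the sketch below computes the resonances of its simplest textbook case: a uniform, lossless tube closed at the glottis and open at the lips, which resonates at odd multiples of the quarter-wavelength frequency, F_n = (2n - 1) c / (4 L). The 17 cm tract length, the 340 m/s sound speed, and the function name `uniform_tube_formants` are assumptions chosen for illustration, not values taken from the project.

```python
def uniform_tube_formants(length_m=0.17, sound_speed=340.0, count=3):
    """Resonances (Hz) of a uniform lossless tube, closed at one end and open at the other.

    Plane-wave, linear-acoustics idealization: F_n = (2n - 1) * c / (4 * L).
    Default length and sound speed are assumed, illustrative values.
    """
    if length_m <= 0 or sound_speed <= 0 or count < 0:
        raise ValueError("length and sound speed must be positive, count non-negative")
    return [(2 * n - 1) * sound_speed / (4.0 * length_m) for n in range(1, count + 1)]

if __name__ == "__main__":
    for n, f in enumerate(uniform_tube_formants(), start=1):
        print(f"F{n} is about {f:.0f} Hz")
```

With the assumed defaults the first three resonances fall near 500, 1500, and 2500 Hz, the familiar neutral-vowel pattern. Losses, wall yielding, turbulence, and non-planar flow are exactly what this idealization leaves out.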

AREA REFERENCES

Flanagan, J. L. (1972): Speech Analysis, Synthesis and Perception, Springer-Verlag, New York.

Flanagan, J. (1994): ``Technologies for multimedia communications,'' Proceedings IEEE, Vol. 82(4), pp. 590-603.

Furui, S., and Sondhi, M. M. (1991): Advances in Speech Signal Processing, Marcel Dekker, New York.

Rabiner, L., and Schafer, R. (1978): Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, New Jersey.

Rabiner, L. and Juang, B.-H. (1993): Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey.

Roe, D. and Wilpon, J. (Eds.) (1994): Voice Communication between Humans and Machines, National Academy Press, Washington, D.C.

RELATED PROGRAM AREAS

Other Communication Modalities

Adaptive Human Interfaces

Intelligent Interactive Systems for Persons with Disabilities

POTENTIAL RELATED PROJECTS

Speech recognition, Speech synthesizers for persons with verbal disabilities, Speech coding.