Center for Computer Aids for Industrial Productivity (CAIP)
Rutgers University
email: jlf@caip.rutgers.edu
tel: (908) 445 3443
fax: (908) 445 0547
The objective of this research is to seek a compact and more accurate description of information in the speech signal. The approach uses articulatory parameters to create a new, realistic computer model for the generation of human speech from first principles of fluid flow. The research characterizes the generation of speech signals in terms of (a) an articulatory description of the vocal system, and (b) a fluid-dynamic solution to the generation, propagation, and radiation of audible sound produced by the acoustic system. Included is a computation of the speech signal from first principles, using the Navier-Stokes description of fluid flow. Preliminary research has already demonstrated the feasibility of the approach. Computational methods are within reach for realistically characterizing the non-linearities involved in voiced-sound generation by the vocal cords, voiceless-fricative generation from turbulent flow at constrictions, and resonance and radiation effects conditioned by sound travel in a non-uniform, lossy, yielding-wall conduit (i.e., the human vocal tract). These factors have been particularly difficult to quantify with conventional linear models of the speech production system.
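For orientation, the Navier-Stokes description referred to above can be written in one common form for a compressible Newtonian fluid. This is a sketch under stated assumptions, not necessarily the exact formulation used in the project: constant viscosity $\mu$ and zero bulk viscosity (Stokes' hypothesis) are assumed here, and the symbols $\rho$ (density), $\mathbf{u}$ (velocity), and $p$ (pressure) are notation introduced for illustration.

```latex
\begin{align}
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\,\mathbf{u}) &= 0, \\
\rho \left( \frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u} \cdot \nabla)\,\mathbf{u} \right)
  &= -\nabla p + \mu\,\nabla^{2}\mathbf{u}
  + \frac{\mu}{3}\,\nabla(\nabla \cdot \mathbf{u}).
\end{align}
```

An energy equation and an equation of state are also needed to close the system. The convective term $(\mathbf{u} \cdot \nabla)\,\mathbf{u}$ is one of the nonlinear terms that a linear acoustic model drops.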
First indications are that replication of the basic physics of sound generation leads to synthetic speech of improved naturalness, and hence to the possibility of ``machine voices'' of higher quality. Preliminary indications are also that articulatory descriptions of speech are more robust to variability, and may lead to speech recognizers that are more immune to interference.
This research initially formulates software for computing sound pressures and velocities in a two-dimensional vocal tract, via Navier-Stokes solutions on a dense time-space grid. The articulatory shape is prescribed and, in initial studies, is non-time-varying. The results permit characterizing acoustic interaction between the sound sources and the resonator system, and permit identifying the dominant parameters that condition turbulent flow and chaotic pressure generation. The research subsequently moves to three-dimensional time-varying articulatory shapes. Throughout, interactive auditory assessments of synthesized signals are made in quantitative listening tests. Parsimonious modeling of articulatory shape and dynamics will be sought as a totally new parameterization of speech signals.
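To make the idea of stepping pressures and velocities forward on a time-space grid concrete, the following is a minimal sketch that deliberately swaps in the much simpler linearized acoustic equations rather than the Navier-Stokes solver described above. The rectangular rigid-walled duct, the grid sizes, the Gaussian source pulse, and the function name `simulate_duct` are all hypothetical choices made for illustration. Turbulence, wall yielding, and losses are not modeled.

```python
import numpy as np

def simulate_duct(nx=60, ny=12, steps=200, dx=0.003, c=350.0, rho=1.2):
    """Linear 2-D acoustics in a rigid-walled rectangular duct (illustrative only)."""
    dt = 0.5 * dx / c                 # Courant number 0.5, below the 2-D limit 1/sqrt(2)
    p = np.zeros((nx, ny))            # pressure at cell centres
    ux = np.zeros((nx + 1, ny))       # axial velocity on cell faces
    uy = np.zeros((nx, ny + 1))       # transverse velocity on cell faces
    sigma, t0 = 20 * dt, 60 * dt      # hypothetical Gaussian source pulse
    trace = np.empty(steps)
    for n in range(steps):
        # Linearized momentum equation on interior faces; the untouched
        # boundary faces keep zero normal velocity (rigid walls).
        ux[1:-1, :] -= dt / (rho * dx) * (p[1:, :] - p[:-1, :])
        uy[:, 1:-1] -= dt / (rho * dx) * (p[:, 1:] - p[:, :-1])
        # Prescribed velocity pulse at the left ("glottal") end.
        ux[0, :] = np.exp(-(((n * dt) - t0) / sigma) ** 2)
        # Linearized continuity equation updates the pressure.
        div = (ux[1:, :] - ux[:-1, :]) + (uy[:, 1:] - uy[:, :-1])
        p -= rho * c**2 * dt / dx * div
        p[-1, :] = 0.0                # crude pressure-release (open) right end
        trace[n] = p[nx // 2, ny // 2]
    return trace
```

Calling `simulate_duct()` returns the pressure history at a probe point in the middle of the duct. A Navier-Stokes computation of the kind described in the text would add the nonlinear convective and viscous terms and a realistic tract shape on top of this basic grid-update pattern.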
This research is expected to result in a totally new parameterization of speech information - one that will directly support new methods for speech synthesis, low bit-rate speech coding, and less-fragile speech and speaker recognition. An ancillary result is speech synthesis of a quality and naturalness surpassing any achieved to date.
Fant, G. (1970): Acoustic Theory of Speech Production, Mouton, The Hague.
Flanagan, J. L. (1972): Speech Analysis, Synthesis and Perception, Springer-Verlag, New York.
Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1975): ``Synthesis of Speech from a Dynamic Model of the Vocal Cords and Vocal Tract,'' Bell Sys. Tech. J. 54(3), 485-506.
Hegerl, G. C., and Hoge, H. (1991): ``Numerical Simulation of the Glottal Flow by a Model Based on the Compressible Navier-Stokes Equations,'' Proc. ICASSP, 477-480.
Richard, G., Liu, M., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., and Slimon, S. (1995): ``Numerical simulations of fluid flow in the vocal tract,'' Proc. of 1995 Eurospeech, Madrid, Spain, September 18-21, 1995.
Richard, G., Liu, M., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., and Slimon, S. (1995): ``Vocal tract simulations based on fluid dynamic analysis,'' J. Acoust. Soc. Am., Vol. 97 (5S), Part 2, p. 3245.
Wagner, C. A. (1989): ``Computations of a Horseshoe Vortex Around an Endwall Mounted Cylinder,'' Ph.D. Thesis, University of Connecticut.
Baer, T., Gore, J. C., Gracco, L. C., and Nye, P. W. (1991): ``Analysis of Vocal Tract Shape and Dimension Using Magnetic Resonance Imaging: Vowels,'' J. Acoust. Soc. Am., 90(2), 799-828.
Sondhi, M. M., and Schroeter, J. (1987): ``A hybrid time-frequency domain articulatory speech synthesizer,'' IEEE Trans. ASSP-35, pp. 955-967.
Stevens, K. N. (1971): ``Airflow and turbulence noise for fricative and stop consonants: Static considerations,'' J. Acoust. Soc. Am. 50, pp. 1180-1192.
The dominant methods for speech recognition are based on statistical modeling of speech, in which spectral or cepstral coefficients serve as the measured speech features. Speech synthesis, meanwhile, relies mainly on the linear acoustic theory of speech production, which assumes plane-wave propagation inside the vocal tract and neglects nonlinear terms. The present project aims to study speech sound generation and propagation from first principles of physics. It then aims to use a new articulatory-based parameterization to develop natural-sounding text-to-speech synthesizers and less-fragile speech recognizers.
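As a point of comparison, the linear plane-wave theory has a classic textbook consequence: a uniform, lossless tube that is closed at the glottis and open at the lips resonates at odd quarter-wavelength frequencies, F_n = (2n - 1) c / (4 L). The sketch below evaluates that formula. The function name and the default values (a 17.5 cm tract and a sound speed of 350 m/s) are illustrative assumptions, not figures taken from this project.

```python
def uniform_tube_formants(length_m=0.175, sound_speed=350.0, count=3):
    """Resonances (Hz) of a lossless uniform tube, closed at one end, open at the other.

    Plane-wave, linear-acoustics idealization: F_n = (2n - 1) * c / (4 * L).
    """
    if length_m <= 0 or sound_speed <= 0:
        raise ValueError("length and sound speed must be positive")
    return [(2 * n - 1) * sound_speed / (4.0 * length_m)
            for n in range(1, count + 1)]


if __name__ == "__main__":
    for n, f in enumerate(uniform_tube_formants(), start=1):
        print(f"F{n} is about {f:.0f} Hz")
```

With the assumed defaults the first three resonances fall near 500, 1500, and 2500 Hz, the familiar neutral-vowel pattern. Effects such as turbulence at constrictions and nonlinear source-tract interaction lie outside what this idealization can represent.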
Flanagan, J. (1994): ``Technologies for multimedia communications,'' Proceedings IEEE, Vol. 82(4), pp. 590-603.
Furui, S., and Sondhi, M. M. (1991): Advances in Speech Signal Processing, Marcel Dekker, New York.
Rabiner, L., and Schafer, R. (1978): Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, New Jersey.
Rabiner, L. and Juang, B.-H. (1993): Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey.
Roe, D., and Wilpon, J. (Eds.) (1994): Voice Communication between Humans and Machines, National Academy Press, Washington, D.C.
Adaptive Human Interfaces
Intelligent Interactive Systems for Persons with Disabilities