From MRI and Acoustic Data to Articulatory Synthesis
IRI-9503089 (Career Development Award)

Abeer Alwan

Department of Electrical Engineering, UCLA, 405 Hilgard Ave., Los Angeles, CA 90095.

CONTACT INFORMATION

email: alwan@icsl.ucla.edu, phone: (310) 206-2231, fax: (310) 206-4685.

WWW PAGE

http://www.ee.ucla.edu/faculty/Alwan.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

MRI, EPG, Sensors, Articulatory Synthesis, and Speech Production Models.

PROJECT SUMMARY

Quantitative models of the human speech production system are needed for a better understanding of our cognitive abilities and for the development of high-quality speech synthesizers and automatic speech recognition systems. In the proposed research, speech production models for fricative consonants are developed. Modeling fricative consonants has always been a challenging problem because of the complex production mechanisms and the lack of sufficient articulatory and aerodynamic data for these sounds.

In this study, articulatory data are obtained from Magnetic Resonance Images (MRI) and Dynamic Electropalatography (EPG), and aerodynamic data are obtained from flow and pressure measurements. MRI reveals the 3D geometry of the vocal tract, while EPG is important for studying articulatory dynamics. Aerodynamic data are crucial for studying turbulence generation during fricative production. The modeling approach is based on estimation theory, acoustics, and signal-processing techniques, and uses the data obtained from the unified set of measurements described above.

Although the study focuses on fricatives, the proposed novel approach can be used to study all speech sounds. In addition, the study fosters cross-disciplinary activities in Electrical Engineering, Radiology, and Linguistics. By developing parametric speech production models and by inferring articulatory dynamics from the acoustic waveform, better performance of speech synthesizers and speech recognition systems can be achieved.

PROJECT REFERENCES

A. Alwan, S. Narayanan, B. Strope, and A. Shen, ``Speech Production and Perception Models and their Applications to Synthesis, Recognition, and Coding,'' To appear in the Intl. Symp. Sig., Sys., and Elec. (ISSSE) Proc., Oct. 1995.

S. Narayanan, A. Alwan, and K. Haker, ``An Articulatory Study of Liquid Consonants in American English,'' Int. Cong. Phon. Sci. (ICPhS) Proc., Stockholm, Sweden, August 1995, Vol. 3, 576-579.

S. Narayanan, A. Alwan, and K. Haker, ``An Articulatory Study of Fricative Consonants using MRI,'' Jour. Acoust. Soc. Amer. (JASA), September 1995, 1325-1364.

AREA BACKGROUND

Articulatory models of speech production mechanisms aim at modeling the physical and physiological functioning of the vocal apparatus. Such models have many advantages: because they attempt to directly incorporate the dynamics of the articulators, they provide a more natural setting for the analysis and synthesis of speech. One of the main challenges in achieving a physically based mimic of speech production is obtaining sufficient articulatory and aerodynamic data for different speech sounds.

In previous studies, information regarding the vocal tract geometry during speech production has been mainly derived from lateral x-ray data. The main limitations of x-rays include radiation risks and difficulty in accurately deducing the cross-sectional morphology from mid-sagittal profiles. Other techniques, such as ultrasound, static palatography, and dynamic electropalatography, can also provide important articulatory information. None of these methods, however, provides a description of the entire vocal tract during the production of speech sounds.

Magnetic resonance imaging (MRI) is a powerful tool for obtaining the vocal-tract geometry and does not involve any known radiation risks. The images have a good signal-to-noise ratio, are amenable to computerized 3-D modeling, and provide excellent structural differentiation. In addition, the tract (airway) area and volume can be directly calculated. The low image sampling rate, however, has restricted MRI use to the study of sustained speech sounds, corresponding to `static' tract shapes. In addition, the high expense associated with using MRI equipment has restricted its use in speech research. Previous MRI studies have been mostly limited to vowels and nasal consonants.

We have gained access to the Medical Imaging Facilities at Cedars Sinai Hospital. In our study, MR images in the sagittal, coronal, and axial planes were collected using a GE 1.5 Tesla SIGNA machine (about 3.2 sec/image) with an image slice thickness of 3 mm and no interscan spacing. Four phonetically trained, native American English speakers [2 males (MI, SC) and 2 females (AK, PK)] served as subjects. During scanning, the speakers were asked to sustain each of the eight fricatives of English in a VC context with a neutral vowel. The data were then processed on an ISG-Allegro workstation. Both automatic and manual (with the aid of anatomy atlases and dental casts) thresholding procedures were used in image segmentation. Following segmentation, three-dimensional reconstructions of the entire vocal tract or of specific regions were made. Coronal and axial scans were used to obtain area functions of the front and back regions, respectively, while sagittal scans were used for length measurements. 3D models were used to measure the volumes of the sublingual cavities, piriform sinuses, and the entire vocal tract.
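The step from segmented slices to area functions and volumes can be illustrated with a minimal sketch. It is not the ISG-Allegro procedure used in the study; it assumes that segmentation has already produced one boolean airway mask per slice, and the function names and the in-plane pixel area are hypothetical (the pixel size is not stated here). Only the 3 mm slice thickness with no interscan spacing is taken from the description above.

```python
import numpy as np

# From the text: 3 mm slices with no interscan spacing.
SLICE_THICKNESS_MM = 3.0


def area_function(masks, pixel_area_mm2):
    """Return the airway cross-sectional area (mm^2) of each slice.

    masks: boolean array of shape (n_slices, rows, cols); True marks
           pixels segmented as airway (hypothetical input format).
    pixel_area_mm2: in-plane area of one pixel (assumed, scanner dependent).
    """
    masks = np.asarray(masks, dtype=bool)
    pixels_per_slice = masks.reshape(masks.shape[0], -1).sum(axis=1)
    return pixels_per_slice * pixel_area_mm2


def volume_mm3(areas_mm2, thickness_mm=SLICE_THICKNESS_MM):
    """Approximate a volume by summing slice area times slice thickness.

    Valid for contiguous slices (no gap between slices).
    """
    return float(np.sum(areas_mm2) * thickness_mm)


if __name__ == "__main__":
    # Synthetic three-slice example; the numbers are illustrative only.
    demo = np.zeros((3, 4, 4), dtype=bool)
    demo[0, :2, :2] = True
    demo[1, :3, :3] = True
    areas = area_function(demo, pixel_area_mm2=0.5)
    print(areas, volume_mm3(areas))
```

Summing area times thickness is the simplest volume estimate for contiguous slices; a measurement made on a reconstructed 3D surface, as in the study, may differ somewhat from this slice-wise approximation.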

In addition to obtaining valuable estimates of the area functions and volumes, the MRI data illustrated inter-speaker differences in tongue shapes; such differences have important acoustic consequences. For example, the tongue shapes for /s/ and /z/ were characterized by post-constriction tongue concavity, the degree of which is speaker dependent. The concavity for one subject (MI) is the most striking and is enhanced by a significant medial grooving of the front tongue body; the maximum depth reaches 11.5 mm. In contrast, the front tongue body for another subject (PK) is concave but with no grooving, and the maximum medial depth reaches only 3.9 mm. Postalveolars, on the other hand, do not exhibit any concavity. As a result, a more abrupt post-constriction area function is evidenced in the alveolars. The pressure drop due to losses at the contraction and expansion in the constriction region, on which the sound pressure level (SPL) of the turbulence source depends, is predicted to be smaller for smooth transitions than for more abrupt ones. We also found that asymmetries in tongue shapes and linguopalatal contacts are predominantly in the posterior tongue region and are subject dependent; subject PK demonstrated the most striking asymmetry.
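The dependence of the pressure drop on the abruptness of the transition can be illustrated with a commonly used orifice-type approximation, in which the drop equals a loss coefficient times the dynamic pressure at the constriction. This is a sketch under stated assumptions, not the model developed in this project: the function name, the air density, the loss-coefficient values, and the flow and area in the example are all illustrative assumptions rather than measured data.

```python
# Approximate density of warm, moist air in the vocal tract (assumed value).
AIR_DENSITY_KG_M3 = 1.14


def constriction_pressure_drop(flow_m3_s, area_m2, loss_coeff,
                               rho=AIR_DENSITY_KG_M3):
    """Pressure drop (Pa) across a constriction, orifice-type approximation.

    delta_p = loss_coeff * 0.5 * rho * (flow / area) ** 2

    loss_coeff is an empirical, dimensionless coefficient (assumed here);
    smoother contractions and expansions correspond to smaller values.
    """
    if area_m2 <= 0:
        raise ValueError("constriction area must be positive")
    velocity = flow_m3_s / area_m2  # mean particle velocity in m/s
    return loss_coeff * 0.5 * rho * velocity ** 2


if __name__ == "__main__":
    # Illustrative values only: 400 cm^3/s flow through a 0.1 cm^2 constriction.
    flow, area = 4e-4, 1e-5
    abrupt = constriction_pressure_drop(flow, area, loss_coeff=0.9)
    smooth = constriction_pressure_drop(flow, area, loss_coeff=0.5)
    print(f"abrupt: {abrupt:.1f} Pa, smooth: {smooth:.1f} Pa")
```

For the same flow and constriction area, the smaller coefficient gives the smaller pressure drop, which is the qualitative prediction stated above for smooth versus abrupt transitions.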

Inter-speaker differences in area functions are found to be greater in the pharyngeal cavity than in the buccal cavity with the nonstrident fricatives exhibiting greater differences than the strident ones. It was also found that the front region tract shapes across subjects were, in general, similar for the voiceless and voiced fricatives that share the same place of articulation. Voiced fricatives, however, tend to show larger pharyngeal volumes than the unvoiced fricatives due to tongue-root advancement.

Among the fricatives, the labiodentals exhibited the most variability across speakers. The variations in the labiodentals are not surprising: the tongue, which is the principal articulator for the other fricatives, is relatively unrestricted for the labiodentals. In fact, the acoustical characteristics of labiodentals are greatly influenced by the vocalic environment. Such coarticulatory effects are expected to play a significant role in the overall tongue shapes assumed in a labiodental articulation.

Our measurements also show that cross-sectional tongue shapes cannot be deduced from midsagittal profiles. Speech researchers have attempted in the past to deduce these shapes from midsagittal x-rays, but our study points to the pitfalls of that approach.

The significance of an accurate description of the 3D geometry in understanding and modeling fricative source mechanisms has been detailed by Shadle (1991). The availability of fairly accurate dimensions and cross-sectional shapes may be exploited in specifying more realistic production models. For example, acoustic coupling between the front and back cavities can be modeled accurately. The acoustic significance of the sublingual cavities and piriform sinuses could be studied in greater detail. With the availability of data from four subjects, one can begin to investigate inter-subject variabilities in the articulatory domain.

Our work on using MRI to study fricatives is the first published work in this area. Recently, we have extended our MRI study to liquids (such as /l/ and /r/). We are also using the articulatory data to build mathematical models of consonant production.

The observed tongue shapes correspond to sustained consonants produced in a neutral vowel context. It would be of great importance to examine how these shapes, and, hence, area functions, change in different vocalic environments.

AREA REFERENCES

Baer, T., Gore, J.C., Gracco, L.C., and Nye, P.W., ``Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels,'' JASA, 90(2), 799-828, 1991.

Dang, J., Honda, K., and Suzuki, H., ``MRI measurements and acoustic investigation of the nasal and paranasal cavities,'' JASA, 94(3), Pt. 2, Sep. 1993.

Greenwood, A.R., Goodyear, C.C., and Martin, P.A., ``Measurement of vocal tract shapes using magnetic resonance imaging,'' IEE Proc. I (Commun., Speech Vision)(UK), 139 (6), 553-560, 1992.

Shadle, C.H., ``The effect of geometry on source mechanisms of fricative consonants,'' J. Phonetics, 19, 409-424, 1991.

RELATED PROGRAM AREAS

Programs 1 and 3 (Virtual Environments and Other Communication Modalities): Visualization studies and computer facial animation studies can both benefit and add to our MRI study of speech production.

POTENTIAL RELATED PROJECTS

Building mathematical models, both finite-element and finite time-difference, using the 3D geometry obtained. Creating animations from the obtained data. Using faster MRI machines to explore articulatory dynamics. Mounting micro-sensor arrays on pseudopalates and using them in vivo to obtain pressure information at the contact location. Pressure sensors should be small enough to be fitted on a palate and should cause no discomfort to the subjects. Microsensors and microtransducers can be used to measure lingual contact force as well as mean velocity and pressure fluctuations in the vocal tract.