Carol Y. Espy-Wilson

Boston University
Electrical, Computer and Systems Engineering Department
44 Cummington St.
Boston, MA 02215


email: espy@formant.bu.edu
Phone: 617 353-6521
FAX: 617 353-8437
mail: 44 Cummington St., Boston, MA 02215




Speech and Natural Language Understanding


Variability, phonetic features, acoustic properties, lexical access, speech recognition


Through the development of a feature-based recognition system, the aim of this research is to (1) study variability in the speech signal, (2) develop acoustic parameters that capture the acoustic properties for phonetic features, (3) investigate a recognition strategy based on acoustic events and (4) explore lexical access methods that can directly match lexical items, represented in terms of phonetic features, with the extracted acoustic properties while accounting for variability.

This work is divided into several components. First, we study how to automatically extract the acoustic properties for features in a speaker-independent way via relative measures. Techniques such as the Fisher criterion are being explored to help find an optimal set of parameters. The automatically extracted acoustic properties are developed and evaluated using the TIMIT database, where the phonetic transcriptions are converted to canonical feature specifications. Analysis of the acoustic parameters for phonetic features often points to variability in the way speakers produce speech, and is therefore itself a study of speech variability. Our findings thus far suggest that much of the variability is systematic and agrees with phonological changes discussed in the literature. A subset of the phonetically-based acoustic parameters (those related to just the manner features) has been developed and compared, in an HMM-based recognition system, to a signal representation based on Mel-cepstral coefficients (used in most state-of-the-art recognition systems) for the broad classes ``vowel'', ``sonorant consonant'' (nasals and semivowels), ``fricative'' and ``noncontinuant'' (stops and affricates). Recognition results indicate that the phonetically-based acoustic parameters better target the linguistic information and are more speaker-independent.
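The Fisher criterion mentioned above can be sketched in a few lines. The sketch below is illustrative only (the actual parameters and class definitions are assumptions, not the system's): it ranks a scalar acoustic parameter for a binary feature by the ratio of between-class separation to within-class spread.

```python
import numpy as np

def fisher_ratio(x_pos, x_neg):
    """Fisher discriminant ratio for one scalar parameter:
    squared difference of class means over the sum of class variances.
    Higher values mean the parameter separates the two classes better."""
    m1, m2 = x_pos.mean(), x_neg.mean()
    v1, v2 = x_pos.var(), x_neg.var()
    return (m1 - m2) ** 2 / (v1 + v2)

# Hypothetical frame-level values of one parameter (e.g., a
# low-frequency energy ratio) for [+sonorant] vs. [-sonorant] frames.
rng = np.random.default_rng(0)
sonorant = rng.normal(2.0, 0.5, 500)
nonsonorant = rng.normal(0.5, 0.5, 500)
print(fisher_ratio(sonorant, nonsonorant))
```

Computing this ratio for each candidate parameter and keeping the top-ranked ones is one simple way to search for an optimal parameter set.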

The second component of this research explores an event-oriented recognition strategy in which acoustic properties for features are extracted in a hierarchical fashion. The manner features (such as sonorant and consonantal), which relate to the degree of opening and closing in the vocal tract, are extracted first in an event-oriented manner to define regions and/or specific landmarks within the speech signal. These landmarks are then used in the extraction of information related to other features (such as the place-of-articulation features). In this type of control strategy, recognition is neither frame-based nor segment-based, and there is no assumption that speech sounds are nonoverlapping and juxtaposed. As a result, recognition of completely or partially coarticulated sounds should be possible.
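The two-pass control strategy can be illustrated with a toy sketch, under stated assumptions: the manner parameter, threshold, and place-related measurement below are all hypothetical stand-ins, not the system's actual detectors.

```python
import numpy as np

def find_landmarks(manner_param, threshold):
    """First pass (event-oriented): return the frame indices where a
    manner-related parameter (e.g., a sonorancy measure) crosses a
    threshold. These crossings serve as landmarks, rather than a
    frame-by-frame labeling of the signal."""
    above = manner_param > threshold
    edges = np.flatnonzero(np.diff(above.astype(int)))
    return edges + 1  # index of the frame just after each transition

def measure_at_landmarks(place_param, landmarks, width=1):
    """Second pass: extract a (hypothetical) place-related measurement
    in a small window centered on each landmark."""
    return [place_param[max(0, lm - width): lm + width + 1].mean()
            for lm in landmarks]

# Toy manner parameter: silence -> sonorant region -> silence.
manner = np.array([0.1, 0.1, 0.9, 0.95, 0.9, 0.2, 0.1])
print(find_landmarks(manner, threshold=0.5))  # prints [2 5]
```

Because later measurements are anchored to landmarks rather than to presumed segment boundaries, overlapping gestures do not have to be forced into non-overlapping segments.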

Finally, lexical access experiments are being conducted to determine how to directly match the extracted acoustic properties with lexical items represented in terms of phonetic features. Three main issues are addressed. First, should variability be encoded directly in the lexicon, or should it be accounted for by the lexical access process? Second, if variability is encoded in the lexicon, should the features that can change be (1) marked as modifiable, (2) marked as modifiable along with how they can change, or (3) left unspecified, so that only the invariant features are included in the lexical representation? Finally, what is the optimal method for partitioning the search space of the lexicon via a preliminary cohort selection based on a subset of the extracted acoustic properties?
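Option (3) above, underspecification, can be sketched as follows. The feature names, values, and the tiny lexicon are illustrative assumptions only: an entry constrains only the features it specifies, so systematic variability in the unspecified features never blocks a match.

```python
def matches(observed, entry):
    """Direct feature matching with underspecification: every feature
    the lexical entry specifies must agree with the observed feature
    bundle; features absent from the entry are unconstrained."""
    return all(observed.get(f) == v for f, v in entry.items())

# Hypothetical, tiny lexicon of feature bundles.
lexicon = {
    # /n/: nasality specified, but place left unspecified so that
    # place-assimilated realizations (e.g., [m]-like before labials)
    # still match.
    "n": {"sonorant": True, "nasal": True},
    "d": {"sonorant": False, "continuant": False, "voiced": True},
}

# Observed bundle for an assimilated nasal before a labial.
observed = {"sonorant": True, "nasal": True, "labial": True}
cohort = [seg for seg, entry in lexicon.items() if matches(observed, entry)]
print(cohort)  # prints ['n']
```

The same `matches` predicate, applied with a subset of the extracted properties, also gives a simple form of preliminary cohort selection: only entries consistent with the coarse properties survive to the detailed match.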


C.Y. Espy-Wilson and S. Boyce (in revision). ``Coarticulatory Stability in American English /r/s'', Journal of the Acoustical Society of America.

C.Y. Espy-Wilson (1994). ``A feature-based semivowel recognition system,'' Journal of the Acoustical Society of America, vol. 96, pp. 65-72.

N. Bitar and C.Y. Espy-Wilson (1995), ``A Signal Representation of Speech Based on Phonetic Features,'' Proc. of the 1995 IEEE Dual-Use Technologies and Applications Conference, May 22-25, SUNY Inst. of Tech., Utica/Rome.

N. Bitar and C.Y. Espy-Wilson (1995), ``Speech Parameterization Based on Phonetic Features: application to speech recognition,'' Eurospeech-95, Madrid, Spain, September.

D. Paneras, N. Bitar and C.Y. Espy-Wilson (1995), ``Speech Variability in the TIMIT Database'', 130th Meeting of the Acoustical Society of America, November.

N. Bitar and C.Y. Espy-Wilson (1995), ``Knowledge-Based vs. Cepstral-Based Parameters for Broad-Class HMM Speech Recognition,'' IEEE Workshop on Speech Recognition, Snowbird, Utah, December.

C.Y. Espy-Wilson and N. Bitar (1996), ``A Knowledge-Based Signal Representation for Speech Recognition,'' ICASSP '96, to be held May 7-10, Atlanta, GA.


Linguists have proposed a set of 20 or so binary distinctive features which comprise a phonological description of the speech sounds in all languages. Use of phonetic features for recognition is motivated by (1) spectrogram reading experiments which show that phonetic information is represented in the speech signal, (2) cognition studies which assert that the human lexicon is organized in terms of phonetic features, (3) psychoacoustic studies which show that phonetic features play an important role in human perception and (4) variability studies which show that large acoustic changes are often the result of a change in only one or two phonetic features.

Research into recognition systems based on the explicit extraction of linguistic information presently lags behind probabilistic approaches such as HMMs. However, a variety of efforts have been made over the past several years to combine speech knowledge and probabilistic frameworks, and several recent advances suggest that another look at feature-based recognition is warranted. These advances include an improved understanding of the phonetic features and the relations between them, a better idea of the acoustic correlates of the features, and the development of theories of hierarchical structures for the representation of lexical items in terms of phonetic features.

In addition to these recent gains, the use of phonetic features for recognition is desirable because it provides a framework for handling and understanding variability, the major stumbling block in the development of recognition systems that achieve human performance. Variability is usually due to (1) coarticulation, where the articulatory gestures for adjacent speech sounds overlap, so that the acoustic properties associated with a particular feature are either masked or manifested differently depending on the acoustic properties associated with other phonetic features implemented at the same time, or (2) lenition, whereby speakers do not fully achieve their articulatory targets, so that the phonetic features for a particular sound can be manifested with varying degrees of strength.


Boyce, S. E., R. A. Krakow, F. Bell-Berti, and C. Gelfer. (1990). ``Converging sources of evidence for dissecting articulation into core gestures,'' Journal of Phonetics, vol.18, pp. 173-188.

Chomsky, N., and Halle, M. (1968). The Sound Pattern of English, New York: Harper and Row.

Clements, G.N. (1985). ``The geometry of phonological features,'' Phonological Yearbook, vol 2., pp. 225-252.

Fant, G. (1960). Acoustic Theory of Speech Production, The Hague: Mouton.

Jakobson, R., Fant, G., and Halle, M. (1952). ``Preliminaries to speech analysis,'' MIT Acoustics Lab. Tech. Rep. No. 13.

Lahiri, A. and Marslen-Wilson, W. (1991). ``The mental representation of lexical form: A phonological approach to the recognition lexicon,'' Cognition, vol. 38, pp. 245-294.

Stevens, K. (1995). ``Applying phonetic knowledge to lexical access,'' Proc. of Eurospeech'95, vol. 1, pp. 3-10.

Stevens, K.N. and Keyser, S.J. (1989). ``Primary features and their enhancement in consonants,'' Language, vol. 65, pp. 81-106.

Zue, V. (1985). ``The use of speech knowledge in automatic speech recognition,'' Proc. IEEE vol. 73, pp. 1602-1615.


Adaptive Human Interfaces

Intelligent Interactive Systems for Persons with Disabilities


Given the mappings between phonetic features, acoustics and articulation, the algorithms for automatic extraction of the phonetic features can be used to help identify articulatory problems in the speech of people with hearing impairments. They can also be used to help nonnative English speakers learn to produce English sounds. Finally, the acoustic properties for features can be used to design stimuli to help assess a person's hearing loss.