BINARY-PAIR PARTITIONED NEURAL NETWORKS
FOR SPEAKER AND DIALECT RECOGNITION

Stephen A. Zahorian

Department of Electrical and Computer Engineering
Old Dominion University
Norfolk, VA 23529

CONTACT INFORMATION

Email: saz100f@eefs01.ee.odu.edu

Phone: (804) 683-4570

Fax: (804) 683-3220

WWW PAGE

www.ee.odu.edu/zahorian.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Speaker recognition, speaker identification, neural networks, pattern classification, acoustic features, dialect identification

PROJECT SUMMARY

A new method for speaker and dialect identification is under investigation. This method, referred to as binary-pair partitioning (BPP), partitions an N-way classification task among N * (N-1)/2 "elemental" classifiers, each of which is highly tuned to discriminate a particular pair of categories. Each elemental classifier consists of a multi-layer feedforward neural network. Since the binary-pair partitioned approach scales up in complexity on the order of N^2 rather than exponentially, the method is well suited to talker identification for very large numbers of speakers (> 100). The approach has been used to develop talker identification systems for over 200 speakers. The system performs with 100% accuracy in a text-independent mode when trained with about 9 to 14 seconds of speech and tested with 6 seconds of speech (using the TIMIT database for testing). BPP has also been extended to speaker verification and found to perform with over 99.7% accuracy when tested with 630 speakers. The BPP approach is also well suited to other classification problems in which multiple elemental decisions can be combined to form an overall decision, such as dialect or language identification and phonetic recognition. A major advantage of this partitioned classification approach is the ability to easily refine the features and classifier for each two-way decision. Several refinements and extensions to the basic method are under investigation in the research project. These include: (1) enhancements to the input features; (2) the effects of noise and channel distortions on performance; (3) the use of BPP for phonetic classification; (4) the use of BPP for sex and dialect identification; (5) the use of speaker identification to improve automatic speech recognition; and (6) training individual BPP networks to minimize the probability of error rather than to minimize mean square error.
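The partitioning idea can be illustrated with a minimal Python sketch. It is not the project's implementation. Two points are assumptions made only for illustration: each elemental classifier is a nearest-class-mean rule standing in for the multi-layer feedforward network, and pairwise decisions are combined by simple frame-level vote counting. All function names are hypothetical.

```python
from itertools import combinations


def num_elemental_classifiers(n):
    """N * (N - 1) / 2 two-way classifiers are needed for an N-way task."""
    return n * (n - 1) // 2


def mean_vector(frames):
    """Component-wise mean of a list of equal-length feature frames."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def sq_dist(x, y):
    """Squared Euclidean distance between two feature frames."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def train_pair(frames_a, frames_b):
    """Stand-in elemental classifier (assumption: nearest class mean,
    NOT the feedforward network used in the project)."""
    mean_a, mean_b = mean_vector(frames_a), mean_vector(frames_b)
    # The returned function is True when a frame is closer to speaker A.
    return lambda frame: sq_dist(frame, mean_a) <= sq_dist(frame, mean_b)


def train_bpp(training_data):
    """training_data maps speaker label -> list of feature frames.
    One elemental classifier is trained for every pair of speakers."""
    return {
        (a, b): train_pair(training_data[a], training_data[b])
        for a, b in combinations(sorted(training_data), 2)
    }


def identify(classifiers, test_frames):
    """Combine all pairwise decisions. Each frame-level decision counts
    as one vote (an assumed combination rule) and the top scorer wins."""
    votes = {}
    for (a, b), prefers_a in classifiers.items():
        votes.setdefault(a, 0)
        votes.setdefault(b, 0)
        for frame in test_frames:
            votes[a if prefers_a(frame) else b] += 1
    return max(votes, key=votes.get)


# Toy two-dimensional "features" for three hypothetical speakers.
training = {
    "s1": [[0.0, 0.0], [0.2, 0.1]],
    "s2": [[5.0, 5.0], [5.1, 4.9]],
    "s3": [[0.0, 5.0], [0.1, 5.2]],
}
classifiers = train_bpp(training)
print(len(classifiers))
print(identify(classifiers, [[4.8, 5.1], [5.2, 5.0]]))
```

Because each pair has its own classifier, the features or model for any single two-way decision can be changed without retraining the others. By the same formula, a 200-speaker system would need 200 * 199 / 2 = 19,900 elemental classifiers.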

PROJECT REFERENCES

C. A. Norton, III, and S. A. Zahorian, "Speaker Verification based on Speaker Position in a Multidimensional Speaker Identification Space," Intelligent Engineering Systems through Artificial Neural Networks, Vol. 5, St. Louis, MO, November 1995.

S. A. Zahorian, A. Zhou, and N. Correal, "Comparison of Minimum Misclassification (MME) Networks with Least Mean Square Error (LMS) Networks," Intelligent Engineering Systems through Artificial Neural Networks, Vol. 4, pp. 191-198, St. Louis, MO, November 1994.

C. A. Norton, III, S. A. Zahorian, and Z. B. Nossair, "The Application of Binary-Pair Partitioned Neural Networks to the Speaker Verification Task," Intelligent Engineering Systems through Artificial Neural Networks, Vol. 4, pp. 441-446, St. Louis, MO, November 1994.

S. A. Zahorian and Z. B. Nossair, "A Neural Network Clustering Technique for Text-Independent Speaker Identification," Intelligent Engineering Systems through Artificial Neural Networks, Vol. 4, pp. 453-460, St. Louis, MO, November 1994.

P. L. Silsbee, S. A. Zahorian, and Z. B. Nossair, "A Warped Time-Frequency Expansion for Speech Signal Representation," Proc. IEEE-SP Symposium on Time-Frequency and Time-Scale Analysis, pp. 636-639, Philadelphia, PA, October 1994.

L. Rudasi and S. A. Zahorian, "Text Independent Talker Identification Using Binary-Pair Partitioned Neural Networks," IJCNN-92, Vol. 4, pp. 679-684, Baltimore, MD, June 1992.

L. Rudasi and S. A. Zahorian, "Text-Independent Talker Identification with Neural Networks," ICASSP 91, pp. 389-392.

AREA BACKGROUND

Automatic speaker recognition is a pattern recognition problem with two basic aspects: feature selection and pattern classification. Pattern classification in turn has two components: training speaker-specific models from features, and choosing the model that best "matches" a test utterance. Much previous work in speaker recognition has been devoted to both feature selection and algorithm development. Features that represent the average magnitude spectrum, and thus model vocal tract characteristics, have generally been found to yield the best discrimination ability (Furui, 1981; O'Shaughnessy, 1986; Soong and Rosenberg, 1988).

Speaker recognition systems are generally divided into two categories: speaker identification and speaker verification. For speaker identification, each test utterance is to be attributed to the correct one of N speakers in a closed set. For speaker verification, each "customer" presents a claimed identity along with a sample of speech, and the system must either verify the claim or reject the speaker as an "imposter." For speaker identification, accuracy is equal to the percentage of correct identifications. For speaker verification, error rates are typically reported at the operating point where the rates of "false acceptances" and "false rejections" are equal. Another basic issue is whether the system operates in text-dependent or text-independent mode. In the text-dependent case, decisions are based on a fixed text rather than a random selection. Because much of the phonetic variability is removed, text-dependent systems usually achieve higher accuracy than text-independent systems. Another general consideration with a major effect on performance is the amount of speech available for training speaker models and the amount available for testing. Performance improves as the amount of data increases, at least up to a certain point.
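The equal error rate used to report verification performance can be estimated from two sets of scores, as in the following minimal Python sketch. The score lists, the function name, and the convention that a claim is accepted when its score is at or above the threshold are all assumptions made for illustration; they are not taken from the project.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (threshold, error_rate) at the candidate threshold where the
    false acceptance and false rejection rates are closest to equal."""
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        # Assumed rule: accept the claimed identity when score >= threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]


# Hypothetical verification scores: true claims and impostor claims.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
threshold, eer = equal_error_rate(genuine, impostor)
print(threshold, eer)
```

With only a few scores the two rates may never be exactly equal, which is why the sketch reports the midpoint at the closest threshold.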

AREA REFERENCES

Doddington, G. (1985). "Speaker recognition--identifying people by their voices," Proc. IEEE 73, 1651-1664.

Furui, S. (1981). "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. ASSP, ASSP-29, 254-272.

O'Shaughnessy, D. (1986). "Speaker recognition," IEEE ASSP Magazine, 3-4, 4-17.

Soong, F. K. and Rosenberg, A. E. (1988). "On the use of instantaneous and transitional spectral information in speaker recognition," IEEE Trans. ASSP 36-6, 871-879.

Tishby, N. Z. (1991). "On the application of mixture AR Hidden Markov Models to text independent speaker recognition," IEEE Trans. ASSP, 39-3, 563-570.

RELATED PROGRAM AREAS

4. Adaptive Human Interfaces. 6. Intelligent Interactive Systems for Persons with Disabilities

POTENTIAL RELATED PROJECTS

The work in this project can be integrated with any project that requires automatic pattern classification. More specifically, within Speech and Natural Language Understanding (item 2 of ISP), the techniques used in this project for speaker identification can be used to enhance acoustic-phonetic performance in speech recognition systems. Speaker identification can also be combined with speaker-dependent speech recognition to form an overall speaker-independent speech recognition system. Thus a computer could automatically adapt to a particular speaker (item 4) to form a more robust speech interface. The accurate phonetic recognition possible with the partitioned neural network classifiers can also be used to develop computer-based phonetic training exercises for the hearing impaired, thus improving the speech capabilities of this population (related to item 6). The BPP classifier could also be used to improve isolated word recognition for speech control of computers by those unable to use a keyboard.
