BINARY-PAIR PARTITIONED NEURAL NETWORKS
FOR SPEAKER AND DIALECT RECOGNITION
Stephen A. Zahorian
Department of Electrical and Computer Engineering
Old Dominion University
Norfolk, VA 23529
CONTACT INFORMATION
Email: saz100f@eefs01.ee.odu.edu
Phone: (804) 683-4570
Fax: (804) 683-3220
WWW PAGE
www.ee.odu.edu/zahorian.html
PROGRAM AREA
Speech and Natural Language Understanding.
KEYWORDS
Speaker recognition, speaker identification, neural networks, pattern classification, acoustic
features, dialect identification
PROJECT SUMMARY
A new method for speaker and dialect identification is under investigation. This method,
which is referred to as binary-pair partitioning (BPP), partitions an N-way classification task with
N * (N-1)/2 "elemental" classifiers, each of which is highly tuned to discriminate a particular pair
of categories. Each elemental classifier consists of a multi-layer feedforward neural network.
Since the binary-pair partitioned approach scales up in complexity on the order of N^2 rather than
exponentially, the method is well-suited to talker identification for very large numbers of speakers
(> 100). The approach has been used to develop talker identification systems for over 200
speakers. The system performs with 100% accuracy in a text-independent mode when trained
with about 9 to 14 seconds of speech and tested with 6 seconds of speech (using the TIMIT
database for testing). The BPP approach has also been extended to speaker verification and found
to perform with over 99.7% accuracy when tested with 630 speakers. The BPP approach is also well suited
to other classification problems for which multiple elemental decisions can be combined to form
an overall decision, such as dialect or language identification, and phonetic recognition. A major
advantage of this partitioned classification approach is the ability to easily refine the features and
classifier for each two-way decision. Several refinements and extensions to the basic method are
under investigation in the research project. These include: (1) enhancements to the input
features; (2) the effects of noise and channel distortions on performance; (3) the use of BPP
for phonetic classification; (4) the use of BPP for sex and dialect identification; (5) the use of
speaker identification to improve automatic speech recognition; and (6) training individual BPP
networks to minimize the probability of error rather than to minimize mean square error.
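The pairwise partitioning described above can be illustrated with a minimal Python sketch. It is not the project's implementation: each elemental classifier here is a simple nearest-mean rule standing in for a trained feedforward network, and the elemental decisions are combined by majority vote, which is an assumption, since the summary does not state the combination rule. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def num_pair_classifiers(n):
    """Number of elemental two-way classifiers needed for an n-way task."""
    return n * (n - 1) // 2


def train_pair_classifier(samples_a, samples_b):
    """Stand-in elemental classifier (assumption): nearest class mean.

    The project uses a feedforward neural network for each pair; a
    nearest-mean rule is used here only to keep the sketch short.
    """
    mean_a = [sum(col) / len(samples_a) for col in zip(*samples_a)]
    mean_b = [sum(col) / len(samples_b) for col in zip(*samples_b)]

    def decide(x):
        dist_a = sum((xi - mi) ** 2 for xi, mi in zip(x, mean_a))
        dist_b = sum((xi - mi) ** 2 for xi, mi in zip(x, mean_b))
        return 0 if dist_a <= dist_b else 1  # 0 -> first class, 1 -> second

    return decide


def train_bpp(training_data):
    """Train one elemental classifier per unordered pair of categories.

    training_data maps each category label to a list of feature vectors.
    """
    return {
        (a, b): train_pair_classifier(training_data[a], training_data[b])
        for a, b in combinations(sorted(training_data), 2)
    }


def identify(classifiers, x):
    """Combine the pairwise decisions by majority vote (assumed rule)."""
    votes = Counter()
    for (a, b), decide in classifiers.items():
        votes[a if decide(x) == 0 else b] += 1
    return votes.most_common(1)[0][0]


# Toy data: three hypothetical speakers with two-dimensional features.
data = {
    "spk1": [[0.0, 0.0], [0.1, 0.2]],
    "spk2": [[5.0, 5.0], [5.1, 4.9]],
    "spk3": [[0.0, 5.0], [0.2, 5.1]],
}
classifiers = train_bpp(data)
winner = identify(classifiers, [4.8, 5.2])
```

With 3 categories the sketch builds 3 elemental classifiers; for 200 speakers the same formula gives 19,900, which reflects the N^2 growth noted above.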
PROJECT REFERENCES
C. A. Norton, III and S. A. Zahorian, "Speaker Verification based on Speaker Position in a
Multidimensional Speaker Identification Space," Intelligent Engineering Systems through
Artificial Neural Networks, Vol. 5, St. Louis, MO, November 1995.
S. A. Zahorian, A. Zhou, and N. Correal, "Comparison of Minimum Misclassification (MME)
Networks with Least Mean Square Error (LMS) Networks," Intelligent Engineering Systems
through Artificial Neural Networks, Vol. 4, pp. 191-198, St. Louis, MO, November 1994.
C. A. Norton, III, S. A. Zahorian, and Z. B. Nossair, "The Application of Binary-Pair Partitioned
Neural Networks to the Speaker Verification Task," Intelligent Engineering Systems through
Artificial Neural Networks, Vol. 4, pp. 441-446, St. Louis, MO, November 1994.
S. A. Zahorian and Z. B. Nossair, "A Neural Network Clustering Technique for Text-Independent
Speaker Identification," Intelligent Engineering Systems through Artificial Neural Networks, Vol.
4, pp. 453-460, St. Louis, MO, November 1994.
P. L. Silsbee, S. A. Zahorian, and Z. B. Nossair, "A Warped Time-Frequency Expansion for
Speech Signal Representation," Proc. IEEE-SP Symposium on Time-Frequency and Time-Scale
Analysis, pp. 636-639, Philadelphia, PA, October 1994.
L. Rudasi and S. A. Zahorian, "Text Independent Talker Identification Using Binary-Pair
Partitioned Neural Networks," IJCNN-92, Vol. 4, pp. 679-684, Baltimore, MD, June 1992.
L. Rudasi and S. A. Zahorian, "Text-Independent Talker Identification with Neural
Networks," ICASSP-91, pp. 389-392.
AREA BACKGROUND
Automatic speaker recognition is a pattern recognition problem. There are two basic
aspects--feature selection and pattern classification. The pattern classification algorithm also has
two components--training speaker-specific models from features, and an algorithm for choosing
the model that best "matches" a test utterance. Much previous work in speaker recognition has
been devoted to both feature selection and algorithm development. Features that represent the average
magnitude spectrum, and thus model vocal tract characteristics, have generally been found to yield the best
discrimination ability (Furui, 1981; O'Shaughnessy, 1986; Soong and Rosenberg, 1988).
There are generally considered to be two categories of speaker recognition systems--
speaker identification and speaker verification. For the first type, each test utterance is to be
attributed to the correct one of N speakers in a closed set. For speaker verification, each "customer"
presents a claimed identity along with a sample of speech and the system must either verify the
claim or reject the speaker as an "imposter." For the case of speaker identification, accuracy is
equal to the percent correct identification. For speaker verification, performance is typically given
as the equal error rate, the operating point at which the rates of "false acceptances" and "false
rejections" are equal. Another basic issue is whether
the system operates in text-dependent or text-independent mode. For the text-dependent case,
decisions are based on a fixed text, rather than a random selection. Because much of the
phonetic variability is removed, text-dependent systems usually achieve higher accuracy than
text-independent systems. Another general consideration that has a major effect on performance is
the amount of speech available for training speaker models and the amount available for testing.
Performance improves as the amount of data increases--at least up to a certain point.
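The equal error rate used to report verification performance can be estimated from two sets of scores, as in the following Python sketch. It assumes that higher scores mean a better match and that a claim is accepted when the score is at or above a threshold; the function names and the toy scores are hypothetical and are not results from this project.

```python
def error_rates(genuine, imposter, threshold):
    """Return (false-acceptance rate, false-rejection rate) at one threshold.

    Assumption: a claim is accepted when its score is >= threshold.
    """
    false_accept = sum(s >= threshold for s in imposter) / len(imposter)
    false_reject = sum(s < threshold for s in genuine) / len(genuine)
    return false_accept, false_reject


def equal_error_rate(genuine, imposter):
    """Scan the observed scores as candidate thresholds and return
    (threshold, far, frr) where the two error rates are closest."""
    best = None
    for t in sorted(set(genuine) | set(imposter)):
        far, frr = error_rates(genuine, imposter, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, far, frr)
    _, threshold, far, frr = best
    return threshold, far, frr


# Toy scores (illustrative only): true-claim scores and imposter scores.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
imposter_scores = [0.1, 0.2, 0.3, 0.6]
threshold, far, frr = equal_error_rate(genuine_scores, imposter_scores)
```

On real data the two rates rarely coincide exactly at an observed score, so the point with the smallest gap serves as an approximation.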
AREA REFERENCES
Doddington, G. (1985). "Speaker recognition--identifying people by their voices," Proc. IEEE
73, 1651-1664.
Furui, S. (1981). "Cepstral analysis techniques for automatic speaker verification," IEEE Trans.
ASSP, ASSP-29, 254-272.
O'Shaughnessy, D. (1986). "Speaker recognition," IEEE ASSP Magazine, 3-4, 4-17.
Soong, F. K. and Rosenberg, A. E. (1988). "On the use of instantaneous and transitional spectral
information in speaker recognition," IEEE Trans. ASSP 36-6, 871-879.
Tishby, N. Z. (1991). "On the application of mixture AR Hidden Markov Models to text
independent speaker recognition," IEEE Trans. ASSP 39-3, 563-570.
RELATED PROGRAM AREAS
4. Adaptive Human Interfaces.
6. Intelligent Interactive Systems for Persons with Disabilities.
POTENTIAL RELATED PROJECTS
The work in this project can be integrated with any project which requires automatic
pattern classification. More specifically, within Speech and Natural Language Understanding
(item 2 of ISP), the techniques used in this project for speaker identification can be used to
enhance acoustic-phonetic performance in speech recognition systems. Speaker identification can
also be combined with speaker-dependent speech recognition to form an overall
speaker-independent speech recognition system. Thus a computer could automatically adapt to a
particular speaker (item 4), to form a more robust speech interface. The accurate phonetic
recognition possible with the partitioned neural network classifiers can also be used to develop
computer-based phonetic training exercises for the hearing impaired, thus improving the speech
capabilities of this population (related to item 6). The BPP classifier could also be used to
improve isolated word recognition for speech control of computers for those unable to use a
keyboard.