ADAPTIVE AND ROBUST AUTOMATIC SPEECH RECOGNITION IN HUMAN-COMPUTER INTERACTION

Yunxin Zhao

Beckman Institute and Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
405 N. Mathews Ave., Urbana, IL 61801

CONTACT INFORMATION

Mail: 405 N. Mathews Ave., Urbana, IL 61801

Email: yxz@ifp.uiuc.edu

Phone: (217) 333-2012

Fax: (217) 244-8371

WWW PAGE

http://www.beckman.uiuc.edu/groups/IFP/people/Zhao.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

automatic speech recognition (ASR), speaker-independent continuous speech recognition (SICSR), speaker adaptation, self-learning, environment adaptation, two-channel speech processing.

PROJECT SUMMARY

The goal of the proposed research is to develop highly effective adaptation mechanisms for speaker-independent continuous speech recognition (SICSR) in order to enhance its robustness under a wide range of speaker and environment conditions. The proposed speaker-adaptation technique offers the prospect of SICSR systems with self-directed learning capabilities, eliminating the need for user enrollment entirely while continuously improving recognition accuracy. The proposed environment-adaptation technique applies two-channel speech processing to the problem of automatic speech recognition under adverse conditions, in which ambient noise, mismatched channels, and active interference signals (including "cocktail party" speech) coexist with the desired speech signal.

In implementing the plan for speaker adaptation, an initial effort will concentrate on the convergence properties of the proposed feedback self-learning mechanism under various recognition conditions. The study will address both the experimental aspect of recognition-accuracy improvement and the theoretical aspect of hidden Markov model convergence. The next effort will focus on a detailed study of the identification and modeling of the individual attributes of speaker characteristics that cause inter-speaker speech variations, so that these variations can be effectively overcome through speaker adaptation. In the final phase of the research, the adaptation algorithm will be fully integrated into one or more SICSR systems, and computational efficiency will be improved toward real-time adaptive SICSR systems.
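To make the feedback self-learning idea concrete, the following toy sketch (an illustration only, not the proposal's actual algorithm) shows an unsupervised adaptation loop: the current models decode the new speaker's data, and the hypothesized labels are fed back to re-estimate the model parameters. Gaussian class means stand in for the HMM observation densities, and the interpolation weight `rho` is a hypothetical smoothing parameter.

```python
import numpy as np

def recognize(frames, means):
    """Assign each frame to its nearest class mean (a stand-in for HMM decoding)."""
    dists = np.linalg.norm(frames[:, None, :] - means[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def self_learning_adaptation(frames, means, n_iter=5, rho=0.5):
    """Feedback loop: decode, then re-estimate means from the hypothesized labels.

    rho interpolates between the speaker-independent means and the new
    estimates, so the models drift toward the new speaker gradually.
    """
    means = means.copy()
    for _ in range(n_iter):
        labels = recognize(frames, means)           # hypothesized transcription
        for k in range(len(means)):
            sel = frames[labels == k]
            if len(sel):                            # re-estimate from own hypotheses
                means[k] = rho * means[k] + (1 - rho) * sel.mean(axis=0)
    return means
```

Under favorable initial conditions (the decoding hypotheses are mostly correct), each pass moves the means closer to the new speaker's data, which is the convergence behavior the first phase of the study would examine.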

In carrying out the plan for environment adaptation, an initial effort will study two-channel configurations and the resulting signal characteristics; parameter estimation for speech, interference, and time-varying channel characteristics will also be implemented and studied. A subsequent effort will focus on combining recognition classification with signal parameter estimation, in order to effectively enhance speech spectral features or adapt speech models. In the final phase of the research, the adaptation algorithm will be integrated with an SICSR system, and the system will be evaluated in human-computer interaction environments at the Beckman Institute of UIUC, such as the 3D problem-solving environment for structural biology or the virtual reality environment CAVE, where the ASR techniques will facilitate applications requiring hands-free, eyes-free natural interaction between people and the environments.
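As a simple illustration of what a second channel provides (a classic textbook stand-in, not the proposed blockwise decorrelation algorithm), adaptive noise cancellation uses a primary microphone carrying speech plus interference and a reference microphone carrying interference only; an LMS-adapted FIR filter learns the interference path and subtracts its contribution:

```python
import numpy as np

def lms_cancel(primary, reference, taps=8, mu=0.01):
    """Two-channel interference cancellation via the LMS algorithm.

    An FIR filter on the reference (interference-only) channel is adapted so
    that its output matches the interference component of the primary channel;
    the residual is an estimate of the desired speech.
    """
    w = np.zeros(taps)
    out = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]      # most recent reference samples
        e = primary[n] - w @ x               # cancellation residual
        w += mu * e * x                      # LMS weight update
        out[n] = e
    return out
```

Since the speech is uncorrelated with the reference channel, the filter converges toward the interference path and leaves the speech largely intact, which a single-channel method cannot achieve for nonstationary interference.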

During the award period, the PI also plans to initiate education activities in speech processing and recognition, including offering new courses, setting up a laboratory for research and teaching, supervising graduate students' thesis work, participating in multidisciplinary research, and establishing collaborative relations with industry. Both the research and education plans are currently being carried out; some of the results are reflected in the project references listed below. A successful execution of the project plans is expected to strengthen the related research and education programs at UIUC.

PROJECT REFERENCES

Y. Zhao, "Self-learning Speaker/Channel Adaptation Based on Spectral Variation Source Decomposition," to appear in Speech Communication.

K. Yen and Y. Zhao, "Co-Channel Speech Separation and Recognition Based on Blockwise Decorrelation and Filtering," in preparation.

Y. Zhao, "Robust Speaker Characterization," to appear in Proc. of IEEE 1995 Workshop on Automatic Speech Recognition, Snowbird, Utah, Dec. 1995 (invited).

Y. Zhao, "Hierarchical Mixture Densities and Phonological Rules in Open Vocabulary Speech Recognition," Proc. of the 4th EuroSpeech, pp. 1587--1590, Madrid, Spain, Sept. 1995.

Y. Zhao, T. H. Applebaum, and B. A. Hanson, "Acoustic Normalization of Microphone-Channel Characteristics," Proc. of ICA, Trondheim, Norway, June 1995 (invited).

Y. Zhao, "Iterative Self-Learning Speaker Adaptation Under Various Initial Conditions," Proc. ICASSP, pp. 712--715, Detroit, MI, May 1995.

Y. Zhao, "An Acoustic-Phonetic Based Speaker-Adaptation Technique for Improving Speaker-Independent Continuous Speech Recognition," IEEE Trans. on Speech and Audio Processing, Vol. 2, No. 3, pp. 380--394, July 1994.

Y. Zhao, "Self-learning Speaker Adaptation Based on Spectral Variation Source Decomposition," Proc. of the 3rd EuroSpeech, pp. 359--362, Berlin, Germany, Sept. 1993.

Y. Zhao, "A Speaker-Independent Continuous Speech Recognition System Using Continuous Mixture Gaussian Density HMM of Phoneme-Sized Units," IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 3, pp. 345--361, July 1993.

AREA BACKGROUND

Automatic speech recognition (ASR) is one of the important components of intelligent human-computer interaction. State-of-the-art ASR technology draws on diverse disciplines in engineering and science, including signal processing (feature analysis, filtering), statistical modeling (feature distributions, hidden Markov models), communication theory (optimal decoding of speech into symbolic strings of words or phonemes), speech and linguistic sciences (auditory models, acoustic-phonetics, syntactic grammars), computer science (software implementation of ASR systems), etc. [1]. Although significant progress has been made in this field, the technology remains rather fragile under a wide range of speaker and environmental conditions [2]. Therefore, many research efforts are currently directed towards the development of speaker-robust and environment-robust ASR techniques.

Speaker adaptation is a technique for speaker robustness that uses a new speaker's speech data to reduce mismatches between the trained models (acoustic and phonological) and the new speaker's characteristics, potentially raising the recognition accuracy of an ASR system to that of a system trained specifically for the speaker. Many approaches to speaker adaptation have been proposed; they can be categorized as supervised vs. unsupervised adaptation, off-line vs. on-line adaptation, data or model transformation vs. model adaptation, data-specific vs. relational adaptation, etc. A brief overview of the adaptation techniques can be found in [3]. Efforts on environment robustness have focused on stationary noise with single-channel speech processing; the approaches include noise filtering, noise-robust distance measures, model adaptation, model augmentation, etc. [4,5,6].
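One of the simplest data-transformation approaches to the channel-mismatch side of this problem is cepstral mean normalization: a fixed linear channel appears as an additive constant in the cepstral domain, so subtracting each utterance's cepstral mean removes it. The sketch below is a standard technique shown only as an illustration, not a method claimed by the proposal:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Remove a stationary convolutive channel from cepstral features.

    A fixed linear channel multiplies the spectrum, hence adds a constant
    offset per cepstral dimension; subtracting the per-utterance mean
    cancels that offset. cepstra: (T, D) array of cepstral frames.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

The same utterance recorded through two different fixed microphones yields identical features after normalization, which is why the technique helps with mismatched channels but not with additive noise or nonstationary interference.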

AREA REFERENCES

[1]. L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

[2]. R. Cole et al., "The Challenge of Spoken Language Systems: Research Directions for the Nineties," IEEE Trans. on Speech and Audio Processing, Vol. 3, No. 1, pp. 1--21, Jan. 1995.

[3]. Y. Zhao, "Robust Speaker Characterization," to appear in Proc. of IEEE 1995 Workshop on Automatic Speech Recognition, Snowbird, Utah, Dec. 1995.

[4]. A. Nadas, D. Nahamoo, and M. A. Picheny, "Speech Recognition Using Noise-Adaptive Prototypes," IEEE Trans. on ASSP, Vol. 37, pp. 1495--1503, Oct. 1989.

[5]. B.-H. Juang, "Speech Recognition in Adverse Environments," Computer Speech and Language, Vol. 5, pp. 275--294, 1991.

[6]. R. C. Rose, E. M. Hofstetter, and D. A. Reynolds, "Integrated Models of Signal and Background with Application to Speaker Identification in Noise," IEEE Trans. on Speech and Audio Processing, Vol. 2, No. 2, pp. 245--258, Apr. 1994.

RELATED PROGRAM AREAS

1. Virtual Environments.

3. Other Communication Modalities.

4. Adaptive Human Interfaces.

5. Usability and User-Centered Design.

6. Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

1. ASR is well suited to human-computer interaction. Research projects on virtual environment applications, such as virtual rapid prototyping in manufacturing, could benefit from incorporating ASR as a voice-interactive means of working with virtual environments.

3. Human-to-human communication commonly involves the simultaneous use of speech, hand gestures, facial expressions, etc. Research projects on the various modalities of human-computer interaction should therefore address between-modality interactions, in particular those between spoken language and other modalities, which could in turn lead to new techniques in the individual modalities.

4. Human spoken-language adaptation to a problem-solving environment is complementary to other human interactive behaviors. Research results in this area can be used to integrate dynamic models of speech and language into ASR systems.

5. Successful ASR applications have depended largely on judicious consideration of human factors. Research projects should include joint studies of human factors, interface design, and ASR-technique improvement.

6. ASR technology has wide potential for persons with disabilities. The most common chronic disability in America is hearing impairment; ASR can provide improved means of speech processing to enhance both the intelligibility and the quality of speech in the design of hearing aids.