Dept. of Electrical and Computer Engineering
Old Dominion University
Norfolk, VA 23529
A multispeaker database is being collected. It will eventually contain at least 100 speakers, and it will be extensible so that more speakers can be added after the conclusion of the project.
Several forms of multiresolution, motion-constrained deformable templates have been investigated for extracting lip shape information. The deformable model parameters provide a coarse physical description of the talker's mouth.
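To make the idea concrete, the sketch below fits a crude two-parabola lip template to an edge map of the mouth region by minimizing an energy function. The parametrization, energy, and function names are illustrative rather than the actual model used in this work; the multiresolution and motion constraints mentioned above would be layered on top (coarse-to-fine fitting over an image pyramid, and a penalty on large frame-to-frame parameter changes).

    import numpy as np
    from scipy.optimize import minimize

    def lip_template(params, n=40):
        """Sample points along a two-parabola lip template.

        params = (xc, yc, w, h_up, h_lo): mouth center, half-width,
        and heights of the upper and lower lip parabolas.
        (Illustrative parametrization, not this project's model.)
        """
        xc, yc, w, h_up, h_lo = params
        t = np.linspace(-1.0, 1.0, n)
        x = xc + w * t
        upper = np.stack([x, yc - h_up * (1.0 - t**2)], axis=1)
        lower = np.stack([x, yc + h_lo * (1.0 - t**2)], axis=1)
        return np.vstack([upper, lower])

    def energy(params, edge_map):
        """Negative mean edge strength under the template curves; strong
        image edges along the hypothesized lip contours lower the energy."""
        pts = lip_template(params)
        h, w = edge_map.shape
        cols = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
        rows = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
        return -edge_map[rows, cols].mean()

    def fit_template(edge_map, init):
        """Deform the template to the image by minimizing the energy."""
        res = minimize(energy, np.asarray(init, float), args=(edge_map,),
                       method="Nelder-Mead")
        return res.x  # coarse description: center, width, lip heights

The fitted parameter vector (mouth center, width, and lip heights) is exactly the kind of coarse physical description referred to above.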
Another issue is the integration, or fusion, of the two information streams. A major focus of this portion of the investigation will be strategies that adapt to the speech, the acoustic environment, and the quality of the visual information. Currently, HMM-based integration strategies are being explored. These have the desirable property that the auditory and visual information are combined in a way that allows intermodal timing cues to be exploited.
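One simple version of such a strategy is sketched below, under the assumption that per-frame emission log-likelihoods for each stream are already available: the two streams are blended inside the Viterbi decoder with a reliability weight lam. Because the combination happens frame by frame during decoding rather than after it, the alignment can respond to timing information in both streams. The function name and the fixed weight are illustrative; in an adaptive system, lam would be set from the estimated acoustic SNR and the measured quality of the lip tracking, and richer composite-HMM structures would relax the strict frame-level synchrony assumed here.

    import numpy as np

    def viterbi_av(log_init, log_trans, log_emit_a, log_emit_v, lam=0.7):
        """Decode one HMM whose per-frame emission scores blend the
        audio and visual streams: lam weights audio, 1 - lam video.

        log_trans[i, j] = log P(state j | state i); log_emit_a[t, j]
        and log_emit_v[t, j] are per-frame state log-likelihoods for
        the audio and visual observations. Returns the best state path.
        """
        T, n = log_emit_a.shape
        log_emit = lam * log_emit_a + (1.0 - lam) * log_emit_v
        delta = log_init + log_emit[0]           # best score ending in each state
        back = np.zeros((T, n), dtype=int)       # best predecessor of each state
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # indexed (from_state, to_state)
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):            # trace predecessors backward
            path.append(int(back[t, path[-1]]))
        return path[::-1]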
Although many studies have shown that lipreading can improve speech recognition under some conditions, the degree of improvement and the range of conditions over which it occurs are still unknown.
P. L. Silsbee. "Sensory Integration in Audiovisual Automatic Speech Recognition," in 28th Annual Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 561-565. 1994.
P. L. Silsbee. "Motion in Deformable Templates," in First IEEE International Conference on Image Processing, vol. 1, pp. 323-327. 1994.
P. L. Silsbee, Computer Lipreading for Improved Accuracy in Automatic Speech Recognition, Ph.D. Thesis, University of Texas at Austin. 1993.
The most difficult problem currently seems to be the extraction of reliable oral cavity features while allowing the user comfort and freedom of movement. The field also needs a large standard database so that systems can be compared in a meaningful fashion.
I have listed below six references from the many available. Eric Petajan was the first person to report a working lipreading system; an improved version of that system is reported in [1]. Chris Bregler, in [2], discusses the only audiovisual ASR system to date that is based on a top-notch audio ASR system. Jordi Robert-Ribes has performed some interesting analysis of the integration problem [3]. Dom Massaro has reported extensively on the "Fuzzy Logical Model of Perception" [4], which has been successful in describing the integration of many different knowledge sources in human perception. Finally, David Stork has organized a NATO workshop [5] on speechreading in humans and machines; when the proceedings are published next year, they will provide a snapshot of most of the current research in the field. In the meantime, Michael Cohen maintains a web page [6] with pointers to many articles on the subject.
[1] E. D. Petajan, B. Bischoff, D. Bodoff, and N. M. Brooke. "An Improved Automatic Lipreading System to Enhance Speech Recognition," in Proc. ACM CHI '88, pp. 19-25. 1988.
[2] C. Bregler, S. M. Omohundro, and Y. Konig. "A Hybrid Approach to Bimodal Speech Recognition," in 28th Annual Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 556-560. 1994.
[3] J. Robert-Ribes, J.-L. Schwartz, and P. Escudier. "A Comparison of Models for Fusion of the Auditory and Visual Sensors in Speech Perception," AI Review, vol. 9. 1995.
[4] D. W. Massaro. Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Lawrence Erlbaum Associates, Hillsdale, NJ. 1987.
[5] D. Stork, ed. Speechreading in Humans and Machines: Models, Methods, and Applications, NATO-ASI held Aug.-Sep. 1995, proceedings to be published by Springer-Verlag, 1996.
[6] http://mambo.ucsc.edu/psl/lipr.html
Given that such an interface can be constructed, it would also be desirable for it to adapt its strategy seamlessly to different users. A talker who barely moves his or her mouth will not provide useful visual speech information; one whose arm is in a cast will not provide useful gesture information.
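One hypothetical way to realize this kind of adaptation, sketched below, is to weight each input stream by how far its recognition accuracy for a given user, measured on a small enrollment set, rises above chance. The function name, the chance and floor parameters, and the stream labels are all assumptions for illustration, not a proposed design.

    def modality_weights(held_out_accuracy, chance=0.1, floor=1e-3):
        """Hypothetical per-user stream weights, proportional to how far
        each stream's enrollment accuracy rises above chance level."""
        raw = {k: max(a - chance, floor) for k, a in held_out_accuracy.items()}
        total = sum(raw.values())
        return {k: v / total for k, v in raw.items()}

    # For a talker whose lips barely move, the visual stream sits near
    # chance and its weight collapses toward the floor:
    print(modality_weights({"audio": 0.85, "lips": 0.11, "gesture": 0.60}))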