Peter L. Silsbee

Dept. of Electrical and Computer Engineering
Old Dominion University
Norfolk, VA 23529


email: silsbee@ee.odu.edu, phone: (804) 683-4671, fax: (804) 683-3220




Speech and Natural Language Understanding.


Lipreading, Speechreading, Robust Speech Recognition


An automatic speech recognition (ASR) system is being developed which uses lipreading to supplement standard acoustic signal based techniques. This should result in improved immunity to poor acoustic environments. The project has four main goals: to advance the art of automatic lipreading, to investigate methods of integrating audio and visual information, to develop a natural-speech database for lipreading research, and to demonstrate that audiovisual ASR is more robust than audio ASR over a wide range of conditions.

A multispeaker database is being collected. It will eventually contain at least 100 speakers, and it will be extensible so that more speakers can be added after the project concludes.

Several forms of multiresolution, motion-constrained deformable templates have been investigated for extracting lip shape information. The deformable model parameters form a coarse physical description of the talker's mouth.
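As a minimal sketch of what such a coarse parameterization might look like, the outer lip contour below is modeled as two parabolic arcs meeting at the mouth corners. The function name, the parabolic form, and the parameter set (width, upper and lower lip heights, mouth center) are illustrative assumptions, not the project's actual template.

```python
def lip_template(width, upper_h, lower_h, cx=0.0, cy=0.0, n=20):
    """Generate outer-lip contour points from a coarse parametric template.

    The mouth is modeled as two parabolic arcs (upper and lower lip)
    meeting at the corners (+-width/2, 0); (width, upper_h, lower_h)
    play the role of the coarse physical parameters described in the
    text. In a full system these parameters would be fitted to the
    image, e.g. by minimizing an edge-based energy over the contour.
    """
    half = width / 2.0
    upper, lower = [], []
    for i in range(n + 1):
        x = -half + width * i / n
        # Parabola through the corners with apex height upper_h / lower_h.
        y_up = upper_h * (1.0 - (x / half) ** 2)
        y_lo = -lower_h * (1.0 - (x / half) ** 2)
        upper.append((cx + x, cy + y_up))
        lower.append((cx + x, cy + y_lo))
    return upper, lower
```

A tracker would re-fit these few parameters frame by frame, so the per-frame feature vector stays small even though the image data are high-dimensional.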

Another issue is the integration, or fusion, of the two information streams. A major focus of this portion of the investigation will be strategies which adapt to the speech, the acoustic environment, and the quality of the visual information. Currently, HMM integration strategies are being explored. These have the desirable property that the auditory and visual information are combined in a way that allows intermodal timing cues to be exploited.
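One simple family of such strategies combines the per-state observation log-likelihoods of the two streams with a weight that can be adapted to the estimated acoustic and visual quality. The sketch below shows only this fusion step, under the assumption of a weighted-product (log-linear) combination; the function name and the scalar weight are illustrative, not the project's specific formulation.

```python
def fused_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Weighted-product fusion of per-state HMM observation likelihoods.

    For each HMM state j:
        log b_j(o) = w * log b_j(o_audio) + (1 - w) * log b_j(o_visual)

    The weight w in [0, 1] would be lowered in noisy acoustic
    environments (trusting the visual stream more) and raised when the
    visual features are unreliable. The fused scores can then be used
    directly in a standard Viterbi decoder, so intermodal timing is
    handled by the shared state sequence.
    """
    w = audio_weight
    return [w * a + (1.0 - w) * v
            for a, v in zip(log_b_audio, log_b_visual)]
```

With w = 1 this reduces to audio-only recognition and with w = 0 to visual-only recognition, so the audio-only system is a special case of the fused one.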

Although many studies have shown that lipreading can improve speech recognition under some conditions, the degree of improvement and the range of conditions over which it occurs are still unknown.


P. L. Silsbee and Q. Su. "Audiovisual Sensory Integration Using Hidden Markov Models," in D. Stork, ed., Speechreading by Humans and Machines: Models, Systems, and Applications. NATO-ASI. To appear, 1996.

P. L. Silsbee. "Sensory Integration in Audiovisual Automatic Speech Recognition," in 28th Annual Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 561-565. 1994.

P. L. Silsbee. "Motion in Deformable Templates," in First IEEE International Conference on Image Processing, vol. 1, pp. 323-327. 1994.

P. L. Silsbee, Computer Lipreading for Improved Accuracy in Automatic Speech Recognition, Ph.D. Thesis, University of Texas at Austin. 1993.


Speechreading has been proposed as a method for substantially improving the robustness of ASR systems. The motion of certain articulators, especially those in the front of the mouth (lips, teeth, tongue tip), has a significant effect on the higher frequencies of the speech spectrum [1]. It is these frequencies which are most easily masked by noise. Visual processing provides a separate channel to recover this information in noisy acoustic environments. The information gained visually is complementary to that obtained auditorily, in the sense that many phoneme classes which are highly confusable in one modality are easily distinguished in the other.
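This complementarity can be illustrated with a naive-Bayes combination of per-phoneme posteriors from the two modalities, assuming the modalities are conditionally independent given the phoneme. The function and the numbers below are a toy example of my own, not data from the cited studies.

```python
def combine_posteriors(audio_post, visual_post):
    """Naive-Bayes fusion of per-phoneme posterior distributions.

    Multiplies the two posteriors phoneme by phoneme and renormalizes.
    Assumes conditional independence of the two modalities given the
    phoneme (the usual simplifying assumption in this kind of model).
    """
    joint = {p: audio_post[p] * visual_post[p] for p in audio_post}
    z = sum(joint.values())
    return {p: v / z for p, v in joint.items()}

# Toy example: in noise, the acoustic channel cannot separate /b/
# from /d/, but the visual channel can (bilabial vs. alveolar closure).
audio = {"b": 0.5, "d": 0.5}    # acoustically confusable
visual = {"b": 0.9, "d": 0.1}   # visually distinct
combined = combine_posteriors(audio, visual)
```

In this example the fused posterior simply inherits the visual channel's discrimination, which is the intuition behind the robustness claim.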

The most difficult problem currently seems to be the extraction of reliable oral cavity features while allowing the user comfort and freedom of movement. The field also needs a large standard database so that systems can be compared in a meaningful fashion.

Below I have listed six references out of the many available. Eric Petajan was the first person to report a working lipreading system; an improved version of that system is reported in [1]. Chris Bregler, in [2], discusses the only audiovisual ASR system to date which is based on a state-of-the-art audio ASR system. Jordi Robert-Ribes has performed some interesting analysis on the integration problem [3]. Dom Massaro has reported extensively on the "Fuzzy Logical Model of Perception" [4], which has been successful in describing the integration of many different knowledge sources in human perception. Finally, David Stork has organized a NATO workshop [5] on the subject of speechreading in humans and machines; when the proceedings are published next year they will provide a snapshot of most of the current research in the field. In the meantime, Michael Cohen maintains a web page [6] with pointers to many articles on the subject.


[1] E. D. Petajan, B. J. Bischoff, D. A. Bodoff, and N. M. Brooke. An Improved Automatic Lipreading System to Enhance Speech Recognition. Bell Labs Tech. Report TM 11251-871012-11. 1987.

[2] C. Bregler, S. M. Omohundro, and Y. Konig. "A Hybrid Approach to Bimodal Speech Recognition," in 28th Annual Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 556-560. 1994.

[3] J. Robert-Ribes, J.-L. Schwartz, and P. Escudier. "A Comparison of Models for Fusion of the Auditory and Visual Sensors in Speech Perception," AI Review, vol. 9. 1995.

[4] D. W. Massaro. Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Lawrence Erlbaum Associates, Hillsdale, NJ. 1987.

[5] D. Stork, ed. Speechreading by Humans and Machines: Models, Systems, and Applications, NATO-ASI held Aug.-Sep. 1995, proceedings to be published by Springer-Verlag, 1996.

[6] http://mambo.ucsc.edu/psl/lipr.html


Other Communication Modalities, Adaptive Human Interfaces.


A "general" input interface would allow many different input modalities, including but not limited to auditory speech, visual speech and gesture. There could be significant interplay between the different modalities. Examples would be the use of lip motion information to determine which of several people in the field of view is talking; the use of gesture to indicate to the computer who is talking (or to "pass off" control of the system); or the use of gesture to issue commands to a computer during dictation, without interrupting the flow of dictation. Without suggesting a specific project, I believe that there are many avenues which could be explored in this general area.

Given that such an interface can be constructed, it would also be desirable for that interface to seamlessly adapt its strategy to different users. A talker who barely moves his or her mouth will not provide useful visual speech information; one whose arm is in a cast will not provide useful gesture information.