Modeling Speech Perception in Noise
IRI 9309418 (Research Initiation Award)

Abeer Alwan

Department of Electrical Engineering, UCLA, 405 Hilgard Ave., Los Angeles, CA 90095.


email:, phone: (310) 206-2231, fax: (310) 206-4685.



Speech and Natural Language Understanding.


Masking, Speech-in-Noise, Auditory Models.


We almost always listen to speech which is degraded by the addition of competing speech and non-speech signals. Fortunately, we are remarkably adept at isolating a specific speech signal from the background noise and understanding what is said. The purpose of this study is to contribute to a broad research program whose aim is to understand and model human perception of speech in noise.

Developing quantitative models of speech perception in noise is important for providing insights into our cognitive abilities and into the perceptual mechanisms of the hearing impaired. People with hearing loss often have the greatest difficulty understanding speech in noisy environments. Quantitative models of why normal-hearing listeners manage so well in noise are therefore essential to the design of hearing aids that begin to restore the noise robustness lost with hearing impairment.

The study will also be useful in the development of robust automatic speech recognition and coding algorithms. Currently, the performance of such systems deteriorates significantly even at signal-to-noise ratios at which humans hear and understand speech perfectly.

Work progresses on two fronts: [1] developing fully-parameterized models which predict human perception in noisy environments, and [2] incorporating these models into automatic speech recognition systems and speech coders, to improve the systems' performance in noise.


A. Alwan, S. Narayanan, B. Strope, and A. Shen, ``Speech Production and Perception Models and their Applications to Synthesis, Recognition, and Coding,'' to appear in the Proc. of the Int. Symp. Sig. Sys. and Elec. (ISSSE), Oct. 1995.

B. Strope and A. Alwan, ``A First-Order Model of Dynamic Auditory Perception,'' Proc. NIH Hearing Aid Research and Development Workshop, September 1995.

B. Strope, ``A Model of Dynamic Auditory Perception and its Application to Robust Speech Recognition,'' M.S. thesis, Department of Electrical Engineering, UCLA, June 1995.

B. Strope and A. Alwan, ``A Novel Structure to Compensate for Frequency-Dependent Loudness Recruitment of Sensorineural Hearing Loss,'' Proc. ICASSP, Vol. V, 3539-3542, May 1995.

A. Shen, B. Tang, A. Alwan, and G. Pottie, ``A Robust and Variable-Rate Speech Coder,'' Proc. ICASSP, Vol. I, 249-252, May 1995.

A. Shen, ``Perceptually-Based Subband Coding,'' M.S. thesis, Department of Electrical Engineering, UCLA, June 1994.

A. Alwan, ``A Perceptual Metric for Masking,'' Proc. IEEE ICASSP, Vol. 2, 712-715, April 1993.


Speech sounds in all languages are thought to be realizations of a small number of constituents or features. Theories of these discrete, rather than continuous, representations of speech are based either on articulatory, acoustic, and perceptual considerations together, or primarily on production mechanisms, with less emphasis on the perceptual and acoustic dimensions.

The mapping of these features to the acoustic domain is not necessarily one-to-one, but rather one-to-many. The perceptual importance of each cue is typically assessed through perceptual experiments in which one acoustic property is manipulated at a time.

This project examines the perceptual importance of acoustic correlates of certain phonological features under conditions where the speech signals are corrupted by noise. Although noise is very frequently the limiting factor in normal communication, most previous studies which examined the perceptual importance of acoustic cues have been based on experiments conducted in quiet. The goal here is to develop parameterized models which predict human perception in noisy environments. The research attempts to bridge the gap between psychoacoustics and speech science by integrating knowledge of psychoacoustic capabilities with that of the acoustic properties of the speech signal when developing models of speech perception in noise.

Modeling auditory processes has several practical applications. For example, simplified auditory models have been used successfully to optimize the performance of digital speech and audio coders. These simplified models view speech as a sequence of unrelated static segments and exploit, predominantly, static masking effects. A more complete auditory model, especially one which takes into account dynamic spectral distortions, will undoubtedly have an important impact in speech coding research.

Another application is in developing automatic speech recognition (ASR) systems that are robust in the presence of noise. By using mel-frequency cepstral coefficients (MFCCs) and their temporal derivatives to represent the acoustic observation, current recognizers incorporate a suitable model of frequency selectivity and a rough approximation of short-term auditory adaptation. Further gains can be expected, however, from acoustic observation sequences that are more consistent with human perception.
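As a concrete illustration of the temporal derivatives mentioned above, delta coefficients are conventionally computed from a cepstral sequence with a regression over neighboring frames. The sketch below is a generic Python illustration of that standard formula, not code from this project:

```python
def delta(frames, N=2):
    """Delta (first temporal derivative) coefficients via the standard
    regression formula: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2).
    `frames` is a list of per-frame coefficient lists; edge frames are
    padded by repetition."""
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = []
    for t in range(T):
        acc = [0.0] * len(frames[0])
        for n in range(1, N + 1):
            fwd = frames[min(t + n, T - 1)]   # frame t+n, clamped at the end
            bwd = frames[max(t - n, 0)]       # frame t-n, clamped at the start
            for k in range(len(acc)):
                acc[k] += n * (fwd[k] - bwd[k])
        deltas.append([a / denom for a in acc])
    return deltas
```

A constant cepstral trajectory yields zero deltas, while a linearly rising one yields a constant delta equal to its slope, which is the sense in which the deltas approximate a temporal derivative.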

The project addresses three issues:

1) Developing a noise-in-noise masking model: Most masking studies have examined the masking of tones in noise. A model is needed to predict the level of a noise masker required to mask a noise burst, aspiration noise, or frication-like noise. This kind of masking is particularly important when studying the perception of plosives and fricatives in noise. We recently completed a set of perceptual experiments to quantify masked thresholds of signals within noise as a function of signal center frequency, duration, bandwidth, and signal type. The experimental data show that all four parameters affect auditory-filter shapes, implying that perceptually-based analysis of speech in noise must take them into account. The data were then used to develop a noise-in-noise masking model which successfully predicted the masked thresholds of plosive bursts in noise.
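A useful baseline for such a model is the classical power-spectrum model of masking (see Moore, in the references), in which a signal is detected once its power exceeds, by an efficiency constant K, the masker power passed by the auditory filter at the signal frequency. The sketch below uses the Glasberg and Moore equivalent-rectangular-bandwidth (ERB) formula and an illustrative value of K; it deliberately omits exactly the duration, bandwidth, and signal-type dependence that the experiments described above quantified:

```python
import math

def erb_hz(fc_hz):
    # Glasberg & Moore ERB of the auditory filter centered at fc_hz.
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def masked_threshold_db(masker_spectrum_level_db, fc_hz, k_db=-3.0):
    """Power-spectrum-model estimate of a masked threshold: masker
    spectrum level (dB/Hz) integrated over one ERB, plus the detection
    efficiency K (k_db is illustrative, not a fitted value)."""
    return masker_spectrum_level_db + 10.0 * math.log10(erb_hz(fc_hz)) + k_db
```

Because the ERB grows with center frequency, this baseline already predicts higher masked thresholds at high frequencies for a flat-spectrum masker; the project's model refines such a baseline with the measured parameter dependences.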

2) Modeling dynamic auditory perception: Most dynamic auditory models are physiologically based; unfortunately, their extreme computational burden currently makes them unattractive for most applications.

We parameterize a time-varying active auditory model from perceptual experiments. Our approach differs from detailed physiological models in that we are able to 'close the loop' with observations of top-level functionality. The first-order non-linear model of dynamic auditory perception consists of a linear filter bank with carefully-parameterized logarithmic additive adaptation after each filter output. An extensive series of perceptual forward-masking experiments determines the model's dynamic parameters. Pure-tone forward-masking experiments with varying masker levels and probe delays provide measurements of upward adaptation (post-adaptation recovery), while forward-masking experiments with varying masker durations provide measurements of downward adaptation (attack).

The model, which is consistent to a first order with underlying physiological processes, predicts the saliency of different parts of changing sounds (such as onsets and spectral transitions), providing useful approximations of the perception of non-stationary speech.
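A minimal sketch of per-channel additive adaptation in the log domain might look like the following. The time constants here are made up for illustration, not the experimentally fitted parameters, and a real front end would apply this after each filter of the bank:

```python
import math

def adapt_channel(levels_db, dt, tau_attack=0.005, tau_recover=0.05):
    """First-order additive adaptation in the log (dB) domain: the output
    is the input level minus an adaptation state that tracks the input
    quickly when the level rises (attack / downward adaptation) and
    slowly when it falls (recovery / upward adaptation)."""
    state = 0.0
    out = []
    for x in levels_db:
        tau = tau_attack if x > state else tau_recover
        alpha = math.exp(-dt / tau)          # one-pole smoothing coefficient
        state = alpha * state + (1.0 - alpha) * x
        out.append(x - state)
    return out
```

For a step in level, the output is large at the onset and decays toward zero as the state adapts, which is how a model of this form emphasizes onsets and spectral transitions over steady-state portions.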

3) Incorporating auditory models into speech recognition and speech coding schemes: An initial evaluation of our dynamic model as a front end for a simple word recognition system shows an improvement in robustness to background noise when compared to MFCC and LPCC front ends. The evaluation uses a dynamic-programming approach, and we are currently evaluating this representation with an HMM system.
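The dynamic-programming alignment at the core of a template-based word-recognition evaluation of this kind is commonly dynamic time warping; the following is a generic sketch of that algorithm, not the project's code:

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping: minimum cumulative frame-distance between
    two sequences over all monotonic alignments, computed by dynamic
    programming over an (n+1) x (m+1) cost table."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            # Allow a step from the left, below, or the diagonal.
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In a recognizer, `a` and `b` would be sequences of feature vectors (with a vector distance for `dist`), and the test word is assigned to the template with the smallest warped distance.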

We have used a simplified auditory model to develop a perceptually-based variable-rate speech and audio coder. The perceptual metric ensures that encoding is optimized for the human listener and is based on calculating the signal-to-mask ratio in short-time frames of the input signal. An adaptive bit allocation scheme is employed and the subband energies are then quantized. Subjective listening tests confirm the noise robustness of the coder.
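The adaptive bit allocation step can be sketched as a greedy loop driven by the per-band signal-to-mask ratio, using the rule of thumb that each allocated bit reduces quantization noise by roughly 6 dB. This is an illustrative sketch of the general technique, not the coder's actual algorithm:

```python
def allocate_bits(smr_db, total_bits, max_bits_per_band=16):
    """Greedy allocation: repeatedly give one bit to the band whose
    quantization noise still exceeds its masked threshold by the most,
    i.e. the band with the highest remaining signal-to-mask ratio."""
    need = list(smr_db)          # remaining dB of noise above the mask
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        k = max(range(len(need)), key=lambda i: need[i])
        if need[k] <= 0.0 or bits[k] >= max_bits_per_band:
            break                # every band's noise is below the mask
        bits[k] += 1
        need[k] -= 6.0           # ~6 dB of noise reduction per bit
    return bits
```

Bands whose quantization noise is already below the masked threshold receive no bits, which is what makes the resulting rate both variable and perceptually optimized.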

In the near future, we will integrate aspects of the static masking model into the dynamic model and continue to evaluate the effect of incorporating our auditory models into speech recognition and coding algorithms.

It is clear that our understanding of speech perception mechanisms has improved tremendously in the past few decades. Further multidisciplinary research in signal processing, psychoacoustics, linguistics, imaging, and auditory physiology is needed to better model these mechanisms.


Schroeder, M. R., Atal, B. S., and Hall, J. L., ``Optimizing digital speech coders by exploiting masking properties of the human ear,'' JASA, 66, 1647-1652, 1979.

Moore, B. An Introduction to the Psychology of Hearing. Academic Press, London, 1982.

Ghitza, O., ``Auditory nerve representation as a front-end for speech recognition in a noisy environment,'' Computer Speech and Language, 1(2):109-130, 1986.

Seneff, S., ``A joint synchrony/mean-rate model of auditory speech processing,'' J. Phonetics, vol. 16, no. 1, pp. 55-76, 1988.

Johnston, J. D., ``Transform coding of audio signals using perceptual noise criteria,'' IEEE JSAC, Vol. 6, No. 2, 1988.

Hermansky, H., and N. Morgan, ``RASTA processing of speech,'' IEEE Trans. Speech and Audio Proc., vol. 2, no. 4, pp. 578-589, 1994.


Programs 3 (Other Communication Modalities) and 6 (Intelligent Interactive Systems for Persons with Disabilities).


Incorporating our auditory models into large-vocabulary speech recognition systems and using these models to design hearing-aid devices.