The Importance of Speech Intelligibility

Mary C. Gruszka

NEW YORK -- Speech intelligibility isn’t just a concern for sound reinforcement or emergency evacuation systems. It’s important for audio for video as well. No matter what type of program, if the spoken word or dialog is unintelligible, much of the content is lost. The viewer’s frustration level can rise, even to the point of changing the channel.

A BBC study a couple of years ago found that 60 percent of the viewers surveyed were not able to “hear” what was being said. I would suspect that most if not all of the viewers actually heard that someone was speaking, but could not make out the words.

Human speech is interesting in that the vowels, or voiced sounds from the vocal cords, produce most of the sound power compared with the consonants. But consonants, while lower in energy, carry the information needed to distinguish similar-sounding words from each other. (This is the case for many languages, including English and Spanish.)

Consonants, many of them unvoiced, are formed in the mouth, with the tongue, teeth, lips, and cheeks all contributing. Notice how you form certain consonants, for example, P, T, and B; C and Z; S and F; or M and N. Also notice how loudly you can speak and hold a vowel sound compared with a consonant.

In addition to the difference in energy, vowels and consonants differ in frequency. The vowel sounds are mostly lower in frequency. For sound system design we generally think of them in the 250 Hz to 500 Hz range, but one source I read said that vowel energy is grouped in formants, clusters of frequencies around 400, 1,200, and 2,000 Hz.

On the other hand, the more fricative consonants produce higher frequencies, from about 2,000 Hz up to 7,000 Hz or 8,000 Hz, with some even higher.

For speech to be intelligible, a listener needs to be able to clearly distinguish the different consonant sounds. The key is a good signal-to-noise ratio. The signal in this case is the spoken words. They must have sufficient level and be delivered through a channel with a wide enough frequency response. Problems on the signal end can include quiet, muffled speakers and poorly articulated words. Rapid speech and accents can also contribute to poor intelligibility.
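To put a number on signal-to-noise ratio, here is a minimal Python sketch. It assumes the speech and the noise are available as separate lists of samples, which is rarely the case outside a test setup, and the two sine tones are hypothetical stand-ins for real speech and noise. The function names are mine, chosen for illustration.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB, from separate speech and noise samples."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical test signals: one second each at 48 kHz.
# A 1 kHz tone stands in for speech, a 3 kHz tone at one-tenth the amplitude for noise.
rate = 48000
speech = [0.5 * math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate)]
noise = [0.05 * math.sin(2 * math.pi * 3000 * n / rate) for n in range(rate)]

print(round(snr_db(speech, noise), 1))  # 20.0
```

A ten-to-one amplitude ratio works out to 20 dB. This is a broadband figure only; it says nothing about whether the noise sits in the consonant frequency range, which is what matters most for intelligibility.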

Noise can take on many forms. There’s noise like you’d expect from building HVAC systems, traffic on the street, sirens or percussive sounds like jackhammers and natural sounds like wind, rain or ocean waves. But noise that interferes with speech intelligibility can also take the form of many people talking at the same time, reverberation in a space, production elements like laugh tracks and music, and equipment problems like hums, buzzes and distortion. No matter what form the noise takes, it is especially troublesome to intelligibility when it falls in the same frequency range as the consonants of speech. At a sufficient level, noise will tend to mask those consonants, making it difficult for a listener to distinguish words.

One measure of speech intelligibility is the percentage articulation loss of consonants, or %ALcons. For a sound system this can be calculated from various room and loudspeaker parameters, measured in situ with certain test devices, or assessed with word lists given to a group of listeners.

In a sound system, the worst-case %ALcons one would want is 15 percent with a 25 dB signal-to-noise ratio. But as one who has participated in intelligibility listening tests, I would say a %ALcons of 10 percent would be a better target, realizing that in some situations that’s not attainable. And audio for TV? I haven’t found any studies that give target numbers, but perhaps the 5 percent %ALcons typically used in designing educational classrooms would be a good start.

With this brief background on speech intelligibility, the results of the BBC study are not surprising. In most cases, the problems listed by viewers related to a poor signal-to-noise ratio.

First of all, viewers complained that the speech itself wasn’t clearly uttered, was muffled, was spoken too fast, or trailed off at the end of sentences. Problems also came up when the speaker would turn away from the camera. Turning off axis from the mic can reduce not only the sound level but the high-frequency response as well.

And there’s another factor at play. When a viewer can’t see the person who is speaking, visual cues, like lip movements and facial expressions that aid intelligibility, are lost.

When more than one person on screen talked at the same time and talked over other people, viewers reported difficulty understanding what was being said. When listening to a TV program, viewers don’t have the same binaural cues to differentiate speakers or conversations as they would if they were in the same room with a group of people.

Dialog is generally mixed to the center channel of a 5.1 mix, or in equal amounts to the left and right channels of a stereo mix to form a phantom center channel. While good multichannel mixing can position talkers to some degree to help differentiate who is speaking, not every viewer is listening to a multichannel mix.

Other complaints involved background noise and music. Percussive music and music with lyrics playing over the dialog were especially troublesome. The BBC suggests reducing the overall music level by 4 dB after editing to improve speech intelligibility. For the most part, they feel this won’t adversely affect the creative intent.
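To see what a 4 dB trim means in signal terms, here is a minimal Python sketch. The function names and sample values are hypothetical, and it assumes the dialog and music stems are available as separate lists of samples before the final mix.

```python
def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

def mix(dialog, music, music_trim_db=-4.0):
    """Sum dialog and music sample by sample after trimming the music level."""
    gain = db_to_gain(music_trim_db)
    return [d + gain * m for d, m in zip(dialog, music)]

# Hypothetical sample values, for illustration only.
dialog = [0.20, -0.10, 0.05]
music = [0.50, 0.50, -0.50]

print(round(db_to_gain(-4.0), 3))  # 0.631
mixed = mix(dialog, music)
```

A 4 dB cut scales the music amplitude to about 63 percent of its original value, which raises the dialog-to-music ratio by the same 4 dB while leaving the dialog untouched.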

An awareness of the factors that can degrade speech intelligibility can lead to production practices to avoid them.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.