Mary C. Gruszka / Audio By Design
04.03.2013 11:00 AM
The Importance of Speech Intelligibility

NEW YORK -- Speech intelligibility isn’t just a concern for sound reinforcement or emergency evacuation systems. It’s important for audio for video as well. No matter what type of program, if the spoken word or dialog is unintelligible, much of the content is lost. The viewer’s frustration level can rise, even to the point of changing the channel.

A BBC study a couple of years ago found that 60 percent of the viewers surveyed were not able to “hear” what was being said. I would suspect that most if not all of the viewers actually heard that someone was speaking, but could not make out the words.

Human speech is interesting in that the vowels, or voiced sounds from the vocal cords, produce most of the sound power compared with the consonants. But consonants, while lower in energy, carry the information needed to distinguish similar-sounding words from each other. (This is the case for many languages, including English and Spanish.)

Consonants, many of them unvoiced, are formed in the mouth, with the tongue, teeth, lips, and cheeks all contributing. Notice how you form certain consonants, for example P, T and B; C and Z; S and F; or M and N. Also notice how loudly you can speak and hold a vowel sound compared with a consonant.

In addition to the difference in energy, vowels and consonants differ in frequency. The vowel sounds are mostly lower in frequency. For sound system design, we generally think of them as falling in the 250 Hz to 500 Hz range, but one source I read said that vowels are grouped in formants, clusters of frequencies around 400, 1,200, and 2,000 Hz.

On the other hand, the more fricative consonants produce higher frequencies, from about 2,000 Hz up to 7,000 Hz or 8,000 Hz, with some even higher.

For speech to be intelligible, a listener needs to be able to clearly distinguish the different consonant sounds. The key is a good signal-to-noise ratio. The signal in this case is the spoken words. They must have sufficient level and be delivered through a channel with a wide enough frequency response. Problems on the signal end can include quiet, muffled speakers and poorly articulated words. Rapid speech and accents can also contribute to poor intelligibility.

Noise can take on many forms. There’s noise like you’d expect from building HVAC systems, traffic on the street, sirens or percussive sounds like jackhammers, and natural sounds like wind, rain or ocean waves. But noise that interferes with speech intelligibility can also take the form of many people talking at the same time, reverberation in a space, production elements like laugh tracks and music, and equipment problems like hums, buzzes and distortion. No matter what form the noise takes, it is especially troublesome to intelligibility when it falls in the same frequency range as the consonants of speech. At a sufficient level, noise will tend to mask those consonants, making it difficult for a listener to distinguish words.
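
To make the signal-to-noise idea concrete, here is a minimal Python sketch that compares the RMS level of a speech-only segment with that of a noise-only segment and expresses the ratio in decibels. The function names and sample values are hypothetical, chosen only for illustration. A real measurement would use recorded audio and, given the point about consonants, would likely be restricted to roughly the 2,000 Hz to 8,000 Hz range.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from speech-only and noise-only samples."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical sample values: the "speech" is ten times the amplitude
# of the "noise", which corresponds to a 20 dB signal-to-noise ratio.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
print(f"SNR = {snr_db(speech, noise):.1f} dB")
```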

One measure of speech intelligibility is the percentage articulation loss of consonants, or %ALcons. For a sound system, this can be calculated from various room and loudspeaker parameters, measured in situ with certain test devices, or assessed with word lists given to a group of listeners.
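
As a rough illustration of the calculated approach, the Python sketch below uses the simplified metric form of the Peutz approximation commonly cited in sound system design texts: %ALcons is about 200·D²·RT60² / (V·Q), leveling off at about 9·RT60 at large distances. This formula is not taken from this column, and it omits correction terms that fuller versions include, so treat the numbers as ballpark only. The function name and the room values are hypothetical.

```python
def alcons_percent(distance_m, rt60_s, volume_m3, q):
    """Estimate %ALcons with the simplified metric Peutz approximation.

    distance_m: loudspeaker-to-listener distance in meters
    rt60_s:     reverberation time in seconds
    volume_m3:  room volume in cubic meters
    q:          loudspeaker directivity factor (dimensionless)
    """
    if min(distance_m, rt60_s, volume_m3, q) <= 0:
        raise ValueError("all parameters must be positive")
    # Loss grows with the square of distance and of reverberation time.
    near = 200.0 * distance_m ** 2 * rt60_s ** 2 / (volume_m3 * q)
    # Far from the loudspeaker the loss levels off at roughly 9 * RT60.
    far_limit = 9.0 * rt60_s
    return min(near, far_limit)

# Hypothetical room: 2,000 cubic meters, RT60 of 1.5 s, loudspeaker Q of 5.
for d in (5.0, 10.0, 30.0):
    print(f"{d:4.0f} m: {alcons_percent(d, 1.5, 2000.0, 5.0):.1f} %ALcons")
```

In this example the estimate at 10 m works out to 4.5 percent, while at 30 m the 9·RT60 limit of 13.5 percent applies.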

In a sound system, the worst-case %ALcons one would want is 15 percent with a 25 dB signal-to-noise ratio. But as one who has participated in intelligibility listening tests, I would say a %ALcons of 10 percent would be a better target, realizing that in some situations that’s not attainable. And audio for TV? I haven’t found any studies that give target numbers, but perhaps the 5 percent %ALcons typically used in designing educational classrooms would be a good start.

With this brief background on speech intelligibility, the results of the BBC study are not surprising. In most cases, the problems listed by viewers related to a poor signal-to-noise ratio.

First of all, viewers complained that the speech itself wasn’t clearly uttered, was muffled, was spoken too fast, or trailed off at the end of sentences. Problems also came up when the speaker would turn away from the camera. Turning off axis from the mic can reduce not only the sound level but the high-frequency response as well.

And there’s another factor at play. When a viewer can’t see the person who is speaking, visual cues, like lip movements and facial expressions that aid intelligibility, are lost.

When more than one person on screen talked at the same time or talked over other people, viewers reported difficulty understanding what was being said. When listening to a TV program, viewers don’t have the same binaural cues to differentiate speakers or conversations that they would have if they were in the same room with a group of people.

Dialog is generally mixed to the center channel of a 5.1 mix, or in equal amounts to the left and right channels of a stereo mix to form a phantom center channel. While good multichannel mixing can position talkers to some degree to help differentiate who is speaking, not every viewer is listening to a multichannel mix.

Other complaints involved background noise and music. Percussive music and music with lyrics playing over the dialog were especially troublesome. The BBC suggests reducing the overall music level by 4 dB after editing to improve speech intelligibility. For the most part, they feel this won’t adversely affect the creative intent.
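
For a sense of scale, a level change in decibels converts to a linear amplitude gain of 10^(dB/20). The short Python sketch below applies that standard conversion to the 4 dB reduction mentioned above. The function name and sample values are hypothetical, and this is not a description of how the BBC or any particular mixing console implements the change.

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)

music_gain = db_to_gain(-4.0)   # about 0.63 of the original amplitude

# Hypothetical music samples, scaled down by 4 dB before mixing under dialog.
music = [0.8, -0.4, 0.2]
reduced = [s * music_gain for s in music]
print(f"gain = {music_gain:.3f}, reduced = {[round(s, 3) for s in reduced]}")
```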

An awareness of the factors that can degrade speech intelligibility can lead to production practices to avoid them.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.

Posted by: Anonymous
Thu, 12-26-2013 - 9:45AM
The CALM Act has nothing to do with fidelity.
Posted by: Anonymous
Wed, 05-08-2013 - 4:32PM
This issue is of big concern to me! I used to be a TV mixer but had to quit when I realized I could not hear the higher frequencies any more. I have had my hearing tested and it drops off pretty rapidly at 2K. I should be OK with speech but not always. We leave the captions on all the time. Not always because of me but my wife and sometimes my daughter can't understand words that are spoken. Sometimes we have to go back several times with the captions on to figure out what they said. They talk too fast and drop off at the end of a sentence. I especially have problems with British actors. I often feel that the dialogue is buried in the mix. This article really caught my attention because I really struggle as a viewer with this issue. Thanks for writing this article.
Posted by: Anonymous
Tue, 04-23-2013 - 3:23AM
Begging to differ, but intelligibility problems here in the USA are all up to the poor fidelity coming from those crummy lavalier microphones ... where they're clipped on the suit ... and/or a badly adjusted compressor. When you place the mic near the throat, that's exactly what you pick up: Guttural sounds, and something so unintelligible, it's pathetic. And it seems that just about all of the "Talking Heads" in various news shows have a 5-10 dB peak around 500 Hz, with steep rolloff in both directions soon after that. In short, it sounds more like an old telephone. And I, for one, am sick of having to boost my treble control just to make the damn program intelligible, only to have my ears cleaned out from the perfect fidelity and high frequency content of the commercial. And I'm having to deal with this even after the CALM Act. ---Duke & Banner