Speech intelligibility isn’t just a
concern for sound reinforcement
or emergency evacuation
systems. It’s important for audio for
video as well. No matter what type of
program, if the spoken word or dialog
is unintelligible, much of the content is
lost. The viewer’s frustration level can
rise, even to the point of changing the channel.
A BBC study a couple of years ago found that 60 percent
of the viewers surveyed were not able to “hear”
what was being said. I would suspect that most if not all
of the viewers actually heard that someone was speaking,
but could not make out the words.
Human speech is interesting in that the vowels or
voiced sounds from the vocal cords produce most of the
sound power compared with the consonants. But consonants,
while lower in energy, carry the necessary information
needed to distinguish similar sounding words from
each other. (This is the case for many languages, including
English and Spanish.)
Consonants, many of them non-voiced, are formed in the
mouth with the tongue, teeth, and cheeks all contributing.
Notice how you form certain consonants, for example, P, T, and B, C and Z, S and F, or M and N. Also notice
how loudly you can speak and hold a vowel
sound compared with a consonant.
In addition to the difference in energy, vowels
and consonants differ in frequency. The vowel
sounds are mostly lower in frequency. For sound
system design we generally think of them in
the 250 Hz to 500 Hz range, but one source I
read said that vowels are grouped in formants, a
grouping of frequencies around 400, 1,200, and
On the other hand, the more fricative consonants
produce higher frequencies, from about 2,000 Hz up to 7,000 Hz or 8,000 Hz,
with some even higher.
For speech to be intelligible, a listener
needs to be able to clearly distinguish the
different consonant sounds. The key is a
good signal-to-noise ratio. The signal in this
case is the spoken words. They must have
sufficient level and be delivered through
a channel with a wide enough frequency response. Problems with the signal end
can include quiet, muffled speakers and
poorly articulated words. Rapid speech
and accents can also contribute to poor intelligibility.
Noise can take on many forms. There’s
noise like you’d expect from building
HVAC systems, traffic on the street, sirens,
percussive sounds like jackhammers,
and natural sounds like wind, rain or
ocean waves. But noise that interferes with
speech intelligibility can also take on the
form of many people talking at the same
time, reverberations in a space, production
elements like laugh tracks and music, and
equipment problems like hums and buzzes
and distortion. No matter what form the
noise takes, it is especially troublesome to
intelligibility when it falls in the same frequency
range as the consonants of speech.
At a sufficient level, noise will tend
to mask those consonants, making it difficult
for a listener to distinguish words.
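To put a number on the signal-to-noise ratio described above, here is a minimal Python sketch. It assumes you have separate sample sequences for the speech and the noise, which is rarely the case in a finished mix, and it compares overall RMS levels rather than levels in the consonant frequency range. The function names and sample values are hypothetical, chosen only for illustration.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from speech and noise sample sequences."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical data: speech at ten times the RMS amplitude of the noise
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
print(round(snr_db(speech, noise), 1))  # 20.0
```

A more useful version would filter both sequences to the consonant range first, since that is where masking does the most damage.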
One measure of speech intelligibility
is the percentage articulation loss of consonants,
or %ALcons. For a sound system
this can be calculated from various room
and loudspeaker parameters, measured in
situ with certain test devices, or determined with word
lists given to a group of listeners.
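As a rough illustration of the calculation route, the sketch below uses one commonly cited metric form of the Peutz equation, %ALcons = 200 x D^2 x RT60^2 / (V x Q). That formula is my assumption here, not something taken from the BBC study, and it is generally applied only out to a limiting distance from the loudspeaker, which this sketch does not model. The function name and the room values are hypothetical.

```python
def alcons_percent(distance_m, rt60_s, volume_m3, q):
    """Estimate %ALcons from room and loudspeaker parameters.

    Assumed metric Peutz form: 200 * D^2 * RT60^2 / (V * Q), where
    D is the listener distance in meters, RT60 the reverberation time
    in seconds, V the room volume in cubic meters, and Q the
    loudspeaker directivity factor. The limiting distance beyond
    which this form no longer applies is NOT handled here.
    """
    return 200.0 * distance_m ** 2 * rt60_s ** 2 / (volume_m3 * q)

# Hypothetical room: listener 10 m away, RT60 of 1.5 s,
# 2,000 cubic meters, loudspeaker Q of 5
print(alcons_percent(10.0, 1.5, 2000.0, 5.0))  # 4.5
```

Note how the result rises with the square of both distance and reverberation time, which is why long, reverberant rooms are so hard on consonants.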
In a sound system, the worst-case %ALcons
one would want is 15 percent with a
25 dB signal-to-noise ratio. But as one who
has participated in intelligibility listening
tests, I would say a %ALcons of 10 percent
would be a better target, realizing that in
some situations that’s not attainable. And
audio for TV? I haven’t found any studies
that give target numbers, but perhaps the 5
percent %ALcons typically used in designing educational classrooms would be a
CAN YOU HEAR ME NOW?
With this brief background on speech
intelligibility, the results of the BBC study
are not surprising. In most cases the problems
listed by viewers related to a poor
signal-to-noise ratio.
First of all, viewers complained that the
speech itself wasn’t clearly uttered, was
muffled, spoken too fast, or trailed off at the end of sentences. Problems came up when
the speaker would turn away from the camera.
Turning off axis from the mic can reduce
not only the sound level but the
high-frequency response as well.
And there’s another factor at play.
When a viewer can’t see the person who
is speaking, visual cues that aid intelligibility, like lip movements
and facial expressions, are lost.
When more than one person on screen
talked at the same time and talked over
other people, viewers reported difficulty
understanding what was being said. When
listening to a TV program, viewers don’t
have the same binaural cues to differentiate
speakers or conversations that they
would have if they were in the same room with a
group of people.
Dialog is generally mixed to the center
channel of a 5.1 mix or in equal amounts
to the left and right channels of a stereo mix to form a phantom center
channel. While good multichannel mixing
can position talkers to some degree to help
differentiate who is speaking, not every
viewer is listening to a multichannel mix.
Other complaints involved background
noise and music. Percussive music and music
with lyrics playing over the dialog were
especially troublesome. The BBC suggests
reducing the overall music level by 4 dB
after editing to improve speech intelligibility.
For the most part, they feel this won’t
adversely affect the creative intent.
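For a sense of what a 4 dB reduction means in practice, the Python sketch below converts decibels to a linear gain factor and applies it to a few music samples. The function names and sample values are hypothetical, and this is only the arithmetic, not a description of how the BBC or any particular mixing tool implements it.

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

def attenuate(samples, db):
    """Scale a sequence of samples by a level change given in dB."""
    gain = db_to_gain(db)
    return [s * gain for s in samples]

# Hypothetical music samples, lowered by 4 dB
music = [1.0, -0.5, 0.25]
quieter = attenuate(music, -4.0)
print(round(db_to_gain(-4.0), 3))  # 0.631
```

In other words, a 4 dB cut leaves the music at a bit under two-thirds of its original amplitude.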
An awareness of the factors that can
degrade speech intelligibility can lead to
production practices to avoid them.
Mary C. Gruszka is a systems design
engineer, project manager, consultant and
writer based in the New York metro area.
She can be reached via TV Technology.