The elements controlling loudness

Controlling program loudness begins with an accurate estimate of that loudness on either a continuous (short-term) or an overall (long-term) basis. The overall loudness is best expressed by a single value that represents the loudness of the entire program. In a Dolby Digital, or AC-3, bit stream, this value is known as the dialogue normalization, or dialnorm, value.

The term “loudness” can be defined as the attribute of auditory sensation in which sounds can be placed on a scale extending from quiet to loud. It is a highly subjective quantity that involves psychoacoustic, physiological and other factors. This often results in substantial differences in loudness perception between listeners, making a single measurement solution that considers all of the factors, for all individuals, incredibly complex. This is borne out by real-world experience, as there is often no single loudness level that satisfies all listeners (or even a single listener) all of the time.

At best, we can only approximate the loudness of sounds by artificial means. One study Dolby performed showed that even when a group of people normalize a program by ear (still the best loudness estimation device we all have), the normalized program will satisfy a different group of listeners only about 86 percent of the time. Given this level of uncertainty, how can we estimate the loudness of programming for level-control purposes and satisfy the largest portion of the listening audience? And how can we do it without having to develop an overly complex system?

Loudness perception

Before answering this question, let's briefly recap what we know about the science behind loudness perception. First, the human auditory system is nonlinear with respect to frequency. Perceived loudness is dependent on the frequency content of a sound.

For example, a person with normal hearing would perceive a low-frequency sound, such as a 20Hz tone at 40dB SPL, to be quieter than a 1kHz tone at 40dB SPL. If the level of the low-frequency tone is raised until it sounds as loud as the 1kHz tone, and this process is repeated for various frequencies (with the 1kHz tone still fixed at 40dB SPL), a 40-phon equal-loudness contour is created (where phon is defined as a unit of loudness level). In other words, if a given sound is perceived to be as loud as a 40dB SPL sound at 1000Hz, then it is said to have a loudness level of 40 phons.

You may be familiar with the equal-loudness contours that were first developed by Fletcher and Munson in 1933. Approximations of these contours have been used in sound level meters for decades and are commonly referred to as frequency weighting networks (e.g., an Leq(A) meter used for setting the dialnorm value in AC-3).

In such a network, the intensity of each frequency is weighted according to the shape of the equal-loudness contour for a particular loudness level in phons before the energy is summed across the entire frequency range. A-weighting, for example, approximates the sensitivity of human hearing along the 40-phon loudness contour. Devices of this type are good at estimating the relative loudness of signals with similar spectra, such as dialogue.
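To make the weight-then-sum idea concrete, here is a minimal Python sketch. It assumes the standard analog A-weighting response published in IEC 61672 (the formula is not taken from this article), and the function names and band levels are hypothetical, chosen only for illustration. It is not a description of how any particular meter is implemented.

```python
import math


def a_weighting_db(freq_hz: float) -> float:
    """A-weighting gain in dB at freq_hz (analog response from IEC 61672)."""
    f2 = freq_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.00 dB offset normalizes the curve to 0 dB at 1 kHz.
    return 20.0 * math.log10(numerator / denominator) + 2.00


def a_weighted_level_db(band_levels_db: dict[float, float]) -> float:
    """Combine per-band levels (center frequency in Hz -> level in dB) into one dBA value."""
    # Weight each band, convert to energy, sum, and convert back to dB.
    total_energy = sum(
        10.0 ** ((level + a_weighting_db(freq)) / 10.0)
        for freq, level in band_levels_db.items()
    )
    return 10.0 * math.log10(total_energy)


# Hypothetical band levels, for illustration only.
example_bands = {100.0: 60.0, 1000.0: 60.0, 4000.0: 60.0}
for freq in example_bands:
    print(f"{freq:6.0f} Hz weight: {a_weighting_db(freq):6.1f} dB")
print(f"A-weighted total: {a_weighted_level_db(example_bands):.1f} dBA")
```

Although the three bands carry equal energy, the 100Hz band contributes very little to the weighted total, which is how the network reflects the ear's reduced sensitivity at low frequencies.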

Calculating the loudness of more complex groupings of sounds, such as those with heterogeneous spectra, however, requires further thought, as something called the critical bandwidth comes into the picture. Critical bandwidth is a measure of the frequency resolution of the ear.

For example, if two sounds that are equally loud when sounded separately are close together in pitch (narrowband), then their combined loudness when sounded together will be perceived as only slightly louder than one of them alone. This is because they probably fall within the same critical band, where they compete for the same nerve endings on the basilar membrane of the inner ear.

However, if the two sounds are widely separated in pitch (wideband), the perceived loudness of the combined tones will be considerably greater because they don't compete for the same nerve endings. Third-octave frequency bands can be, and have been, used as an approximation to the critical bands in some standardized methods of calculating loudness (namely ISO 532-1975 Method B).

As a side note, the critical band is about 90Hz wide below 200Hz and increases to about 900Hz for frequencies around 5kHz. Because loudness perception is dependent on whether the signal is wideband or narrowband, it becomes challenging to design a measurement system that detects and applies a specific loudness measurement function for each of these signal types on a continuous basis.
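For a rough feel for how critical bandwidth grows with frequency, and how closely third-octave bands track it, consider the sketch below. It assumes the widely quoted Zwicker and Terhardt analytical approximation, which is not part of this article and yields roughly 100Hz at low frequencies rather than the 90Hz figure above. The function names are hypothetical.

```python
def critical_bandwidth_hz(center_hz: float) -> float:
    """Approximate critical bandwidth (Zwicker-Terhardt formula; an assumption here)."""
    khz = center_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69


def third_octave_bandwidth_hz(center_hz: float) -> float:
    """Width of a third-octave band: upper band edge minus lower band edge."""
    return center_hz * (2.0 ** (1.0 / 6.0) - 2.0 ** (-1.0 / 6.0))


for freq in (100.0, 500.0, 1000.0, 5000.0):
    print(
        f"{freq:6.0f} Hz  critical band ~{critical_bandwidth_hz(freq):5.0f} Hz  "
        f"third-octave band ~{third_octave_bandwidth_hz(freq):5.0f} Hz"
    )
```

Under this approximation the two bandwidths are of similar size at mid and high frequencies, while at low frequencies a third-octave band is much narrower than a critical band, so the match is only approximate.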

The human ear is also not particularly sensitive to instantaneous peaks in signal level. While peaks of short duration may be present in a signal, the perceived loudness of the overall signal is typically not significantly affected. This is why a peak program meter (PPM) is less effective in indicating loudness. Psychoacoustic experiments show that, for brief sounds, loudness increases with duration, but beyond some duration, somewhere between 100ms and 200ms, making a sound longer doesn't make it any louder.
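The gap between a peak reading and an averaged reading is easy to demonstrate numerically. The sketch below uses a hypothetical test signal (a quiet 1kHz tone with a 0.5ms full-scale spike) and compares the change in peak level with the change in RMS level over one second. RMS is used here only as a simple stand-in for an energy average; it is not a model of perceived loudness.

```python
import math

SAMPLE_RATE = 48_000  # samples per second (assumed for this example)


def peak_dbfs(samples):
    """Highest absolute sample value, in dB relative to full scale."""
    return 20.0 * math.log10(max(abs(s) for s in samples))


def rms_dbfs(samples):
    """Root-mean-square level over all samples, in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


# One second of a 1 kHz tone at amplitude 0.1 (hypothetical test signal).
tone = [
    0.1 * math.sin(2.0 * math.pi * 1000.0 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]

# The same tone with a 0.5 ms full-scale spike (24 samples at 48 kHz).
spiked = list(tone)
for n in range(1000, 1024):
    spiked[n] = 1.0

peak_change_db = peak_dbfs(spiked) - peak_dbfs(tone)
rms_change_db = rms_dbfs(spiked) - rms_dbfs(tone)
print(f"Peak reading rises by {peak_change_db:.1f} dB")
print(f"RMS reading rises by {rms_change_db:.2f} dB")
```

The peak reading jumps by a large amount while the averaged reading barely moves, which mirrors the point that brief peaks say little about the loudness of the overall signal.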

What about volume-unit (VU) meters? The VU meter has considerably slower ballistics than the PPM and will indicate somewhere between the average and peak values of a complex waveform. Moreover, the VU meter only approximates momentary loudness changes in program material and can indicate moment-to-moment level differences that are greater than what our ears perceive. The VU meter also has a relatively flat frequency response over the entire audio spectrum and therefore does not address the nonlinear nature of the human auditory system. This can result in very large meter deflections that do not correlate well with a change in perceived loudness. Perhaps most important, these types of devices can lead to subjective interpretation errors among operators.

Given all this, you can see how the development of a measurement system that factors in even these few characteristics of human hearing (as well as numerous others not described here) would be quite complex, and yet it still wouldn't provide a measurement method perfect for every individual! Considering this, is there a way of simplifying the measurement of loudness of broadcast programming without sacrificing accuracy or listener satisfaction?

Taking a new look

If we take a step back and look at the problem from a different angle, we know that we already have a proven and standardized method of estimating the relative loudness of signals, the A-weighted (Leq) measurement, that performs quite well, particularly with signals that have similar spectra. Furthermore, a significant portion of broadcast content contains dialogue, and any one sample of dialogue is spectrally similar to another. So why not use a classification system that intelligently selects, for measurement, only those portions of the signal that are strongly similar across most broadcast programming? This approach would better address the limitations of a basic frequency-weighted measure by measuring only the dialogue portions of the signal.
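In outline, a dialogue-gated measurement could look like the Python sketch below. Everything here is an assumption made for illustration: the program is presumed to be already split into short blocks, each with an A-weighted level and a speech/non-speech flag supplied by some classifier that is not shown. It does not describe how any actual product classifies or measures dialogue.

```python
import math
from typing import Sequence, Tuple


def energy_average_db(levels_db: Sequence[float]) -> float:
    """Leq-style average: mean of the block energies, expressed in dB."""
    mean_energy = sum(10.0 ** (level / 10.0) for level in levels_db) / len(levels_db)
    return 10.0 * math.log10(mean_energy)


def dialogue_level_db(blocks: Sequence[Tuple[float, bool]]) -> float:
    """Average only the blocks flagged as speech.

    Each block is (A-weighted level in dB, is_speech flag from a hypothetical classifier).
    """
    speech_levels = [level for level, is_speech in blocks if is_speech]
    if not speech_levels:
        raise ValueError("no blocks were classified as speech")
    return energy_average_db(speech_levels)


# Hypothetical block data: quiet dialogue with one loud effect and one near-silent block.
blocks = [(-27.0, True), (-26.0, True), (-10.0, False), (-28.0, True), (-45.0, False)]

all_blocks_db = energy_average_db([level for level, _ in blocks])
speech_only_db = dialogue_level_db(blocks)
print(f"All blocks:  {all_blocks_db:.1f} dB")
print(f"Speech only: {speech_only_db:.1f} dB")
```

In this made-up example the single loud effect dominates the ungated average, while the gated figure stays close to the level of the dialogue itself.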

Consider the following: Evidence suggests that television listeners make adjustments to their volume controls in an effort to create consistent (perhaps conversational) speech levels from program to program (or channel to channel). Simply stated, as viewers, we use the television volume control to normalize the dialogue level to our own individual taste for each program, from scene to scene, between commercials and so on. For most television programs, speech can be considered the most important portion of the audio signal, because it carries the information describing the pictures we are viewing.

Television viewers in a living room environment prefer the dialogue level at a mean sound pressure level of 60.5dBA.1 Speech levels during ordinary conversation range from 55dBA SPL to 66dBA SPL. Television viewers choose to set the listening levels such that the program is, in a sense, speaking to them at a normal conversational level.

Considering the research results above, estimating the level of dialogue seems beneficial, and perhaps even a shortcut, to developing a more accurate loudness estimation for television programs. In a study that supports this claim, 21 listeners evaluated two samples of programming compared with a reference.2 During this test, each of the listeners leveled (by ear) each sample to match the reference. One of the samples contained dialogue, and the other contained a portion of a program in which someone was walking down a hallway and only the sound of footsteps was heard.

Figure 1. The level of agreement among 21 listeners, leveling speech and footstep signals by ear.

Figure 1 shows the correlation histogram of listener results. There was general agreement among the listeners when they leveled the dialogue item. Nineteen out of the 21 listeners agreed with each other within 1dB.

By contrast, there was pronounced disagreement within the group when they attempted to level the footsteps piece to the reference. One person indicated a need to adjust the footsteps up by 3dB, while another indicated a decrease of 9dB to make it match the reference.

Listeners agree more closely on relative loudness when comparing two dialogue segments than when comparing arbitrary audio signals. Thus, a device that successfully predicts the average perceived level of dialogue (for example, the Dolby LM100 broadcast loudness meter) will be in close agreement (within approximately 1dB) with a significant portion of your listeners.

Jeff Riedmiller is product manager for Dolby Laboratories.

1 Benjamin, Eric, Audio Engineering Society 117th Convention, Paper 6233.

2 Riedmiller et al., Audio Engineering Society 115th Convention, Paper 5900.