Working toward consistency in program loudness

Within the past few years, we have entered an age in which many television viewers receive programming from both the analog and digital domains, in some cases without knowing it. This situation has often produced audio level discrepancies that far exceed what most of the television viewing population considers acceptable. Research indicates that there are significant level differences within and across both types of service (that is, digital and analog). When the two were combined, the average A-weighted dialogue levels varied over a 16dB range, well outside the range that listeners accepted during the research.

Furthermore, many broadcasters (including cable programmers and television personnel) do not realize that the digital set-top box (STB) has been designed around several assumptions regarding analog NTSC broadcast levels (specifically, the speech levels), digital decoder operating mode, the validity of the digital audio metadata and the level relationships within the STB itself. Let’s look at three of the conditions that must be met in order for digital programming to match analog programming at the STB’s channel 3/4 RF output:

Digital STBs assume that, while tuned to any analog service (whether over the air or cable), the average dialogue level is 17dB Leq(A) below 100 percent modulation.
When tuned to a digital service, the transmitted dialogue normalization (dialnorm) value is assumed to be correct for that program.
When using the RF output of the STB, the Dolby Digital decoder defaults to the RF operating mode.

Each of these assumptions is referenced in a bulletin issued by the Electronic Industries Association and the Consumer Electronics Association, entitled EIA/CEA-CEB-11 NTSC/ATSC Loudness Matching. This document provides guidance to digital STB manufacturers on how to maintain uniform audio loudness between existing NTSC services and digital television services while preserving the dynamic range capability of the digital services. The bulletin also addresses optimal output specification, gain structure and the capabilities of consumer broadcast products to match loudness from the viewer’s perspective.

Even if a digital service is provisioned correctly (dialnorm set correctly, STB decoder operating in the correct mode, and so on) and the digital speech level emerges from the RF output of the STB at 17dB below 100 percent modulation, there is still no guarantee that digital and analog programming will match. The speech levels on analog channels may not sit consistently at 17dB below 100 percent modulation, which is the level the STB expects of analog service dialogue in order for a properly provisioned digital service to match it.
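The arithmetic behind this mismatch is simple, and a minimal sketch may help. It assumes speech levels are already expressed in dB relative to 100 percent modulation; the function name and the channel measurements are hypothetical and are not data from the research.

```python
# Speech level the STB assumes for analog services, in dB relative to
# 100 percent modulation (17 dB below, per the text).
ANALOG_TARGET_DB = -17.0


def level_mismatch_db(measured_speech_db, target_db=ANALOG_TARGET_DB):
    """Return how far a measured speech level sits from the target.

    A positive result means the channel's dialogue is louder than the
    STB expects; a negative result means it is quieter.
    """
    return measured_speech_db - target_db


# Hypothetical analog channel measurements, for illustration only.
analog_channels = {"channel A": -17.0, "channel B": -12.5, "channel C": -23.0}

for name, level in analog_channels.items():
    mismatch = level_mismatch_db(level)
    print(f"{name}: {mismatch:+.1f} dB versus a correctly provisioned digital service")
```

Only the first hypothetical channel would match a properly provisioned digital service; the other two would sound louder or quieter after a channel change.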

Many broadcasters achieve self-consistent levels on their own channel by using dynamic range processing. This practice does not guarantee success in all cases, because inconsistencies are just as likely when that channel is compared with another. Hence, broadcasters must begin to employ measurement methods that allow all programming to meet the level requirements of downstream equipment (that is, the STB) if the overall listening experience is to be improved.

Figure 1. Agreement among listeners when evaluating speech items versus other signal types, such as music or effects.

The importance of speech to the listener

During research into subjective versus objective loudness estimations, it became apparent that listeners agreed with each other more consistently when evaluating content comprising primarily speech. On the other hand, when the listeners evaluated other types of program content, results became far less consistent. An example of this is shown in Figure 1, which compares the results of 21 listeners evaluating the level of one audio program containing speech and another containing only the sound of footsteps (a sound effect from a drama), compared to a reference.

The results show that 19 out of 21 listeners agreed with each other to within 1dB when evaluating a speech item. However, when the same 21 listeners evaluated the footsteps item, they disagreed with each other by up to 12dB. One listener indicated that the footsteps item was 3dB too quiet, while another indicated that it was 9dB too loud. Based on this evidence, we concluded that a loudness estimation based on only the speech portions of programming (the most subjectively consistent portion of the signal) will lead to greater listener satisfaction.

Assuming the non-speech elements of any given program are appropriately “balanced” around the speech elements, listeners will not be annoyed by the natural changes in loudness that occur during programs if the speech elements fall within a “comfort zone.” To the best of our knowledge, the exact magnitude of this comfort zone has never been determined. Therefore, we performed a series of experiments in an effort to determine the range of loudness levels we could use to define the zone.

Our method was to present listeners with a reference program segment and let each listener adjust the level to suit his or her taste. We then asked each listener to adjust the level of a test segment to each of the six points shown in Figure 2, relative to the reference segment.

Figure 2. Relative loudness (in dB) of the listening levels investigated, with 95 percent confidence intervals.

The results make it quite clear why television broadcasters (and others) have been plagued by complaints of “loud ads” for so long. An increase of 2dB to 3dB in subjective loudness proved enough to move a program out of the typical listener’s comfort zone and toward the point at which he or she would like to turn the volume down. There is much more latitude available on the softer side of the optimal volume (shown here as “0”). In summary, this comfort zone gives us a good idea of how accurate any loudness measure must be.
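As a rough illustration, the asymmetry of the comfort zone could be expressed as a simple tolerance check. In the sketch below, the louder-side limit of 2dB comes from the lower end of the 2dB to 3dB figure above. The softer-side limit of -5dB is purely a placeholder assumption, since the text says only that there is more latitude on that side. The function name is hypothetical.

```python
def in_comfort_zone(offset_db, louder_limit_db=2.0, softer_limit_db=-5.0):
    """Check whether a loudness offset from the listener's preferred level
    (0 dB) falls inside an asymmetric comfort zone.

    louder_limit_db: taken from the 2dB to 3dB figure in the text.
    softer_limit_db: ASSUMED placeholder; the text gives no number.
    """
    return softer_limit_db <= offset_db <= louder_limit_db


# A program 3 dB louder than preferred falls outside the zone, while one
# 3 dB softer is still inside it under the assumed softer-side limit.
print(in_comfort_zone(3.0))
print(in_comfort_zone(-3.0))
```

A loudness measure used to enforce such a zone would need to be accurate to well within the louder-side limit, which is the point made above.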

Measurement time scales

During our investigation, it also became apparent that estimating loudness across several broadcast applications requires multiple time scales. The long-term, or “infinite,” time scale is necessary for determining the loudness of an entire program expressed as a single numeric loudness value. This single value can be used for normalization purposes at the program ingest point. It can be calculated in real time while the content is being transferred, or in an offline process on existing files. It can then be used to normalize the program to a desired level, or stored and carried along with the file as part of its metadata, allowing downstream equipment to take advantage of it upon playback. If the single loudness value considered only the loudness of speech over an entire program, it could also be used to provision the dialnorm value in an AC-3 emission encoder.

On the other hand, there are several applications in which a short-term loudness indication is useful. Level shifts at program boundaries can be identified more quickly with a short-term method, which is useful in identifying and addressing the loud commercial problem.
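The long-term measurement and dialnorm provisioning described above can be sketched as follows. This is a minimal illustration under several assumptions, not the measurement method used in the research. It computes an unweighted Leq over samples normalized to a full scale of 1.0 and omits the A-weighting filter. It also assumes the samples passed in are already the speech portions of the program. The function names are hypothetical. The only AC-3 detail relied on is that dialnorm is an integer from 1 to 31 representing a dialogue level of -1dB to -31dB relative to full scale.

```python
import math


def leq_db(samples):
    """Long-term equivalent level in dB relative to full scale (1.0).

    Unweighted: a real Leq(A) meter would apply an A-weighting filter
    to the samples first.
    """
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def dialnorm_from_leq(speech_leq_db):
    """Map a speech level in dB re full scale to an AC-3 dialnorm value.

    dialnorm is an integer 1..31 standing for -1..-31 dB, so the level
    is negated, clamped to that range and rounded.
    """
    clamped = min(31.0, max(1.0, -speech_leq_db))
    return int(round(clamped))


# Stand-in for isolated speech: one second of a 1 kHz tone at amplitude 0.1,
# sampled at 48 kHz. Its mean square is 0.005, i.e. about -23.0 dB.
tone = [0.1 * math.sin(2.0 * math.pi * 1000.0 * n / 48000.0) for n in range(48000)]

level = leq_db(tone)
print(f"long-term level: {level:.1f} dB, dialnorm: {dialnorm_from_leq(level)}")
```

Because the calculation is a running mean of squared samples, it could equally be accumulated in real time during ingest or run offline over an existing file.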

In addition, measurements performed in short-term mode allow the operator to see short-term variations in loudness within a program, which can be particularly useful during live events. Skilled audio operators may also prefer the short-term measurement, as they find the information on near-term dynamics useful when mixing or producing a program. Short-term mode is also useful for measuring and logging the “dynamic” loudness history of a given program during QC or post-production, or of a particular television service or channel in a cable head-end facility.
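A short-term measurement of this kind can be sketched as a series of windowed level readings, with a simple check for upward jumps between adjacent windows. The example is a minimal illustration only. It uses unweighted levels, non-overlapping windows and a hypothetical 2dB jump threshold chosen to echo the comfort-zone finding. The function names and the test signal are assumptions.

```python
import math


def short_term_levels(samples, window):
    """Level in dB re full scale (1.0) for consecutive non-overlapping
    windows of `window` samples. A trailing partial window is ignored."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        block = samples[start:start + window]
        mean_square = sum(s * s for s in block) / window
        levels.append(10.0 * math.log10(mean_square) if mean_square > 0.0
                      else float("-inf"))
    return levels


def find_level_shifts(levels, threshold_db=2.0):
    """Return the indices of windows that are more than threshold_db
    louder than the window immediately before them."""
    return [i for i in range(1, len(levels))
            if levels[i] - levels[i - 1] > threshold_db]


# Hypothetical signal: a "program" at a constant amplitude of 0.1 followed
# by a "commercial" at 0.2, which is about 6 dB louder.
signal = [0.1] * 4000 + [0.2] * 4000

history = short_term_levels(signal, window=1000)
print([round(level, 1) for level in history])
print("level shifts at windows:", find_level_shifts(history))
```

The list of windowed levels doubles as a loudness history that could be logged for a program or a channel, while the jump check flags the program boundary where the louder material begins.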

Satisfying the viewer

The need for better loudness management in content creation and broadcast is well recognized. The broadcast industry must continue to develop the methods and tools necessary to accurately assess the speech levels within broadcast programming. In this article, we have discussed the accuracy we believe is needed to maximize listener satisfaction. The best results are achieved by measuring and leveling the dialogue portions of programming using an objective loudness measurement (Leq(A)) based on a combination of long- and short-term loudness averaging.

The end result of all this is something we can all look forward to — a unified loudness level for all broadcast programming. Once we accomplish this, we will be able to put down our remotes and simply enjoy the show.

Note: A more detailed explanation of the issues and research discussed here can be found in the AES 115th convention paper “Intelligent Program Loudness Measurement and Control: What Satisfies Listeners?”

Jeff Riedmiller is broadcast product manager at Dolby Labs.