ATSC tackles audio loudness

The ATSC has published a new Recommended Practice (RP) that addresses the large variation in loudness among programs, commercials and other interstitial elements. The new document is A/85, “Techniques for establishing and maintaining audio loudness for digital television.” A/85 covers all facets of the audio delivery system, from implementation of the key ATSC standards to mix room monitoring and the consumer experience. It also includes “Quick Reference Guides” to get operators and content creators up to speed on critical information, as well as links to audio test signals that can be used for monitoring environment setup.

Loudness variations

Despite the conclusion of the DTV transition, many broadcasters and the production community have been slow to effectively adapt to the changes required to transition from analog NTSC audio techniques to contemporary digital audio practices. With digital television's expanded aural dynamic range (over 100dB) comes the opportunity for excessive variation in content when DTV loudness is not managed properly.

Consumers do not expect large changes in audio loudness from program to interstitials and from channel to channel. Inappropriate use of the available wide dynamic range has led to consumer complaints, which eventually reached Congress.

The NTSC analog TV system uses conventional audio dynamic range processing at various stages of the signal path to manage audio loudness for broadcasts. This practice compensates for limitations in the dynamic range of analog equipment and controls the various loudness levels of audio received from suppliers. It also helps smooth the loudness of program-to-interstitial transitions. Though simple and effective, this practice permanently reduces dynamic range and changes the audio before it reaches the audience. It modifies the characteristics of the original sound, altering it from what the program provider intended to fit within the limitations of the analog system.

The AC-3 audio system defined in the ATSC digital television standard uses metadata, or data about the data, to control loudness and other audio parameters more effectively without permanently altering the dynamic range of the content. The content provider or DTV operator encodes metadata along with the audio. From the audience's perspective, the dialog normalization (dialnorm) metadata parameter sets different content to a uniform loudness transparently. It achieves results similar to a viewer using a remote control to set a comfortable volume between disparate TV programs, commercials and channel-changing transitions. The dialnorm and other metadata parameters are integral to the AC-3 audio bit stream.

It is important for the digital television system to provide uniform subjective loudness for all audio content. Consumers find it annoying when audio levels vary between channels and on a single channel. Dialog, the spoken word, has been identified as the element that audiences typically adjust their volume to. Achieving an approximate match for average dialog level from all content is a desirable goal. While the AC-3 audio specifications in ATSC Standard A/52, “Digital Audio Compression (AC-3, E-AC3) Standard,” provide syntax that makes this goal achievable, system implementation in the real world has proven more difficult than expected.

Addressing the loudness issue encompasses several elements, which include mixing; monitoring; and proper encoding of local and network programs, commercials, promos and other content. The S6-3 study group explored all facets of DTV loudness, with a goal to identify problem areas and recommend practical solutions.

The industry has recognized that a new proficiency in loudness measurement, production monitoring, metadata usage and contemporary dynamic range practices is critical for meeting the expectations of the content supplier, the broadcaster, the audience and governing bodies.

The AC-3 audio system

The ATSC AC-3 audio system is intended to deliver a reproduction of the original (unprocessed) content at the output of the AC-3 decoder in a receiver, normalized to a uniform loudness. It also lets broadcasters give each listener the freedom to exert some control over the degree of dynamic range reduction, if any, that best suits his or her listening conditions.

The metadata parameter dialnorm is transmitted to the AC-3 decoder along with the encoded audio. The value of the dialnorm parameter indicates the loudness of the anchor element of the content. The dialnorm value of a very loud program might be 15, and of a soft one, 27; a larger value indicates quieter content. An attenuator at the output of the AC-3 decoder uses this value to apply the appropriate attenuation, so that all content is normalized to the same level without compromising dynamic range.
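The normalization step can be illustrated with a minimal sketch. It assumes the commonly described AC-3 decoder behavior in which a dialnorm of 31 means no attenuation and the decoder attenuates by (31 - dialnorm) dB; the function and constant names are hypothetical and are not taken from A/85.

```python
# Assumption: a dialnorm of 31 is the decoder reference (no attenuation applied).
REFERENCE_DIALNORM = 31


def decoder_attenuation_db(dialnorm: int) -> int:
    """Return the attenuation in dB a decoder would apply for a dialnorm value.

    dialnorm is the unsigned value 1..31 carried in the bit stream; a larger
    value describes quieter content, so it receives less attenuation.
    """
    if not 1 <= dialnorm <= 31:
        raise ValueError("dialnorm must be in the range 1..31")
    return REFERENCE_DIALNORM - dialnorm


if __name__ == "__main__":
    for name, dialnorm in [("very loud program", 15), ("soft program", 27)]:
        attenuation = decoder_attenuation_db(dialnorm)
        # Anchor loudness after attenuation, in dB relative to full scale.
        normalized = -dialnorm - attenuation
        print(f"{name}: dialnorm {dialnorm}, attenuate {attenuation}dB, "
              f"anchor lands at {normalized}dB")
```

Under these assumptions, both the loud and the soft program end up with their anchor element at the same level, which is why the listener does not need to touch the volume control.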

If the dialnorm metadata parameter accurately reflects the overall loudness of the content, then listeners will be able to set their volume controls to their preferred listening (loudness) level and will not have to change the volume when the audio changes from program to advertisement and back again. If all broadcasters use the system properly, the loudness will also be consistent across channels.

There are three methods of using audio metadata: fixed, preset and agile. Any one of these approaches will deliver consistent loudness to the listeners. A broadcaster should use the method that best suits its operational practices. Whichever approach is selected, the system depends on transmitting a value of dialnorm that correctly represents the loudness of the content, which depends in turn on accurate measurements.

Loudness measurement

Because loudness is a subjective phenomenon, human hearing is the best judge of loudness. Working in a known mixing environment, experienced audio mixers can use their sense of hearing to produce a program with remarkably consistent loudness. If all programs and commercials are produced with consistent loudness, and if the loudness of the mix is preserved through the production, distribution and delivery chain, listeners will not be subjected to annoying changes in loudness within and between programs.

When measuring audio signals, there are two key parameters of interest: the true peak level of the signal and its loudness. The true peak measurement enables the mixer to protect the program from clipping, and the loudness measurement allows the mixer to protect the listener from annoying variations in loudness. Although the mixer balances a mix using his or her hearing, an objective loudness measurement helps to maintain consistent loudness within and between programs.

The familiar VU and PPM meters measure neither the loudness nor the true peak levels of the signal. The characteristics of many of the common electronic meters available are unknown, and contribute to the inconsistent and confusing situation found in practice today.

The A/85 RP provides guidance that, if followed, will result in consistency in loudness and avoidance of signal clipping. The specified measurement techniques are based on the loudness and true peak measurements defined by ITU-R Recommendation BS.1770, “Algorithms to measure audio program loudness and true peak audio level.”

Loudness is measured by integrating the weighted power of the audio signals in all channels over the duration of the content. The general structure of the algorithm is shown in Figure 1.
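As a rough illustration of the integration step, the sketch below sums the weighted mean-square power of each channel and converts the result to a logarithmic value. It assumes the samples have already passed through the K-weighting pre-filter (the filter itself is not implemented here), uses the channel weights and the -0.691 offset published in BS.1770, and omits the gating that later revisions of the recommendation add. The function names are hypothetical.

```python
import math

# Channel weights as published in BS.1770: front channels 1.0, surrounds 1.41.
# The LFE channel is not included in the measurement.
CHANNEL_WEIGHTS = {"L": 1.0, "R": 1.0, "C": 1.0, "Ls": 1.41, "Rs": 1.41}


def mean_square(samples):
    """Average power of one channel over the whole measurement interval."""
    return sum(s * s for s in samples) / len(samples)


def loudness_lkfs(k_weighted_channels):
    """Integrate weighted power over all channels and express it in LKFS.

    k_weighted_channels maps a channel name to samples that are ASSUMED to be
    K-weighted already, with full scale equal to 1.0.
    """
    total = sum(
        CHANNEL_WEIGHTS[name] * mean_square(samples)
        for name, samples in k_weighted_channels.items()
    )
    if total <= 0.0:
        return float("-inf")  # digital silence has no finite loudness
    return -0.691 + 10.0 * math.log10(total)


if __name__ == "__main__":
    rate = 48000
    # One second of a full-scale 1kHz tone, standing in for filtered audio.
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / rate) for n in range(rate)]
    print(f"{loudness_lkfs({'C': tone}):.2f} LKFS")
```

Because the pre-filter is skipped, the numbers from this sketch will not match a compliant meter on real program material; it only shows how power is weighted, summed and converted.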

The BS.1770 method was validated in listening tests by comparing its results to the relative subjective loudness of mono, stereo and multichannel program material. Measured loudness is reported as Loudness K-weighted Full Scale (LKFS). A unit of LKFS is the same measure as a decibel. A -15 LKFS program can be made to match the loudness of a quieter -22 LKFS program by attenuating it by 7dB.

The loudness of the anchor element of the mix is used as the proxy for the overall loudness of the content. For long-form programs, the anchor element is often dialog; for short-form content such as commercials, a global measure of the whole mix is used. Accurate measurement of the loudness of the anchor element is necessary to allow operators to deliver content to listeners at consistent loudness levels.

Target loudness

With input from members representing various disciplines and following considerable discussion by the S6-3 committee, it was decided that for delivery or exchange of content without metadata (and where there is no prior arrangement by the parties regarding loudness), the target loudness value should be -24 LKFS. Minor measurement variations of up to approximately ±2 dB of this value are anticipated due to measurement uncertainty and are acceptable. Content loudness should not be targeted to the high or low side of this range.
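The target and its tolerance lend themselves to a simple check. The sketch below is illustrative only: it takes the -24 LKFS target and the approximately ±2 dB measurement allowance from the text, and the function names are hypothetical.

```python
TARGET_LKFS = -24.0   # target loudness for content exchanged without metadata
TOLERANCE_DB = 2.0    # approximate allowance for measurement uncertainty


def correction_gain_db(measured_lkfs, target_lkfs=TARGET_LKFS):
    """Gain in dB that would bring the measured loudness to the target.

    A negative result means the content must be attenuated.
    """
    return target_lkfs - measured_lkfs


def within_tolerance(measured_lkfs, target_lkfs=TARGET_LKFS,
                     tolerance_db=TOLERANCE_DB):
    """True if the measurement falls inside the accepted window."""
    return abs(measured_lkfs - target_lkfs) <= tolerance_db


if __name__ == "__main__":
    for measured in (-15.0, -23.2, -27.5):
        status = "ok" if within_tolerance(measured) else "needs correction"
        print(f"{measured} LKFS: {status}, "
              f"gain to target {correction_gain_db(measured):+.1f}dB")
```

The tolerance exists to absorb measurement uncertainty, so a production workflow would still aim at the target itself rather than at either edge of the window.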

Metadata management considerations

An AC-3 encoder allows users to set up to 28 metadata parameters concerning the characteristics of the accompanying audio in the bit stream. The parameters can be classified in three groups:

  • Informational metadata: This includes seven optional parameters that can be used to describe the encoded audio. These parameters do not affect encoding or the decoded listening experience in the home.
  • Basic control metadata: This includes 19 parameters that determine the dynamic range compression, downmixing, matrix decoding and filtering used in certain operating modes of the professional encoder and consumer decoder. Optimizing the setting of these parameters for each program may enhance the listening experience under varying listening conditions and with certain content types. However, default values may be used without detriment to the listening experience.
  • Critical control metadata: This includes two parameters that are critical for proper encoding and decoding. The first is channel mode (acmod), which should be chosen correctly to engage the channel formatting in the decoder that matches the content. Improper use of this parameter may alter a transmission and cause the loss of dialog, for example when a 5.1-channel soundtrack is encoded with 2/0 metadata. The second is dialog level (dialnorm), which the DTV Standard (A/53) requires to be set correctly to prevent potentially severe loudness variation during content transitions on a channel and when changing channels across the DTV dial. Incorrect dialnorm values can lead to a variation in loudness as large as 30dB.

The requirement for accurate dialnorm, channel mode (acmod) and other metadata can be met in three different ways, at the discretion of the operator:

  • Fixed metadata: The AC-3 encoder dialog level is fixed to a single value, and the content dialog levels are conformed to that setting.
  • Preset metadata: AC-3 encoder presets are programmed, each with different dialnorm values and engaged via a general purpose interface (GPI) or other control interface.
  • Agile metadata: The AC-3 encoder is configured to receive external metadata. An upstream agile dialnorm metadata system may be used to deliver dynamically changing dialnorm values to the encoder, corresponding to the changing loudness at the content boundaries.

When managed properly, all three methods provide a compliant and acceptable end result for the consumer. It is also possible for the operator to apply a hybrid approach, choosing one of the methods for loudness management and a different method for the remainder of the metadata. For example, the user might maintain a fixed dialnorm value but switch the channel mode as required.
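The difference between the three methods comes down to where the dialnorm value originates. The following sketch is a hypothetical model, not an encoder API: the enum, the function, its parameters and the preset names are all invented for illustration, and the default fixed value of 24 is simply chosen to correspond to the -24 LKFS target.

```python
from enum import Enum


class MetadataMode(Enum):
    FIXED = "fixed"    # one value; content is conformed to it
    PRESET = "preset"  # value chosen from programmed encoder presets
    AGILE = "agile"    # value delivered by an upstream metadata system


def select_dialnorm(mode, *, fixed_value=24, presets=None, preset_id=None,
                    external_value=None):
    """Return the dialnorm (1..31) the encoder would transmit in each mode."""
    if mode is MetadataMode.FIXED:
        value = fixed_value
    elif mode is MetadataMode.PRESET:
        if presets is None or preset_id not in presets:
            raise ValueError("unknown or missing preset")
        value = presets[preset_id]
    elif mode is MetadataMode.AGILE:
        if external_value is None:
            raise ValueError("agile mode needs an external dialnorm value")
        value = external_value
    else:
        raise ValueError("unsupported metadata mode")
    if not 1 <= value <= 31:
        raise ValueError("dialnorm must be in the range 1..31")
    return value


if __name__ == "__main__":
    presets = {"network": 24, "local_news": 22}  # hypothetical preset table
    print(select_dialnorm(MetadataMode.FIXED))
    print(select_dialnorm(MetadataMode.PRESET, presets=presets,
                          preset_id="local_news"))
    print(select_dialnorm(MetadataMode.AGILE, external_value=27))
```

In every mode the transmitted value is only useful if it matches the measured loudness of the content that accompanies it, which is the point the three methods have in common.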

Controlling program-to-interstitial loudness

The AC-3 audio system incorporates the necessary technology to mitigate variations in loudness during program-to-interstitial transitions. Effective solutions are listed in Table 1.

There are notable conditions that may adversely impact program-to-interstitial transitions at content boundaries:

  • Content suppliers often increase dramatic impact by using program dynamics and manipulating loudness to achieve a desired audience effect. This is sometimes done at the end of program segments going into a commercial break.
  • An extreme variation outside of the comfort zone may cause a listener to adjust the volume to compensate for the large, temporary change in loudness. When a scheduled commercial or promo plays going into or out of breaks, the listener may need to re-adjust the volume yet again to achieve an acceptable setting for the short-form content. This has proven to be an annoyance to the audience.

Dynamic range management

The DTV audio system is capable of delivering wide dynamic range (the range between the softest and loudest sounds). Content producers often take advantage of dynamic range as one of the methods to convey artistic intent.

However, there could be a conflict between the desire of the content producer to deliver content with wide dynamic range and audience members who cannot, or choose not to, enjoy the wider dynamic range. This could be caused by the inability of the viewer's equipment to reproduce the desired range of sounds, or the lack of an environment suitable to the enjoyment of the wide dynamic range. Thus, the goals of preserving the original dynamic range of the content and satisfying viewers can often be at odds.

A goal of the AC-3 system is to provide content producers with the greatest freedom and flexibility in the choice of dynamic range control (DRC) when producing content. The AC-3 system conveys these DRC options to the viewer, where the DRC system will interact with the viewer's input in a known and repeatable fashion.

There are several methods for controlling dynamic range. Some methods apply prior to audio encoding, some apply after decoding, and some span both domains. One approach is traditional compression and/or limiting, where gain control is applied to the audio prior to encoding. Another approach is to use the AC-3 coding system, which generates gain control words during encoding but does not apply the gain control to the audio until after decoding. This allows a user to optionally choose how much dynamic range they desire.

The primary difference between the two approaches is that the AC-3 approach is reversible, and the other approach is not. A hybrid of the two methods is also possible, allowing for some permanent and some reversible processing to be combined in a balance determined by the broadcaster.
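The reversibility of the metadata approach can be shown with a toy model. The sketch assumes the encoder computes a gain word per audio block and the decoder applies some fraction of it chosen by the listener. The compression curve, the -31dB reference and the 0-to-1 scale factor are hypothetical simplifications; they are not the actual AC-3 compression profiles.

```python
def gain_word_db(block_level_db, reference_db=-31.0, ratio=2.0):
    """Encoder side: compute a gain word without touching the audio.

    Hypothetical profile: pull each block toward the reference level by the
    given compression ratio. Loud blocks get negative gain, soft blocks
    get positive gain.
    """
    return (reference_db - block_level_db) * (1.0 - 1.0 / ratio)


def decoded_level_db(block_level_db, gain_db, drc_scale):
    """Decoder side: apply a listener-chosen fraction of the gain word.

    drc_scale 0.0 reproduces the original dynamics; 1.0 applies the full
    dynamic range reduction requested by the encoder.
    """
    if not 0.0 <= drc_scale <= 1.0:
        raise ValueError("drc_scale must be between 0.0 and 1.0")
    return block_level_db + drc_scale * gain_db


if __name__ == "__main__":
    loud_block = -11.0  # block level in dB relative to full scale
    gain = gain_word_db(loud_block)
    for scale in (0.0, 0.5, 1.0):
        print(f"scale {scale}: {decoded_level_db(loud_block, gain, scale)}dB")
```

Because the audio itself is transmitted unmodified, a scale of zero returns the original dynamics, whereas compression applied before encoding cannot be undone at the receiver.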

Awareness and education

The ATSC recognized that developing the Recommended Practice on audio loudness is an important step, but that ongoing education on the recommendations contained in the document will be critical to effectively addressing the loudness problem. Accordingly, the ATSC has embarked on an aggressive educational effort. This article is one element of this work; educational seminars are another. The ATSC has held two events so far, one on Nov. 4 in Washington, D.C., and another on Feb. 16 in Rancho Mirage, CA. The February seminar was timed to be adjacent to the 2010 Hollywood Post Alliance Technology Retreat. At the NAB convention next month, the ATSC will set up a demonstration area where loudness control techniques can be observed.

Jim Starzynski is principal engineer and audio architect at NBC Universal, and J. Patrick Waddell is the technical marketing manager, Standards & Regulatory, Harmonic.

Table 1. Large loudness variation during transitions can be effectively managed by ensuring dialnorm properly reflects the dialog level of all content.

For operators using a fixed dialnorm system

  1. Ensure that all content meets the target loudness and that long-term loudness matches the dialnorm value; and/or
  2. Employ a file-based scaling device to match long-term loudness of nonconformant file-based content to the target value; and/or
  3. Employ a real-time loudness-processing device to match the loudness of nonconformant real-time content to the target value.

For operators using an agile dialnorm system

  1. Ensure that during program production, post production or ingest, content is measured and labeled with the correct dialnorm value matching the actual loudness of the specific content; and/or
  2. Employ a file-based measurement and authoring device to set dialnorm to the average loudness of the specific content; and/or
  3. Employ a real-time processing device to match content to a specific loudness, and apply a dialnorm value that matches the loudness of all content processed by this device.