Here’s the conclusion to the tests of the ITU BS.1770-2 and CBS algorithms.
This article is the second and concluding part of an article in our Broadcast Forum focusing on an examination of loudness and measurement. Part 1 may be found in the August issue.
We continue our discussion with an examination of the results after automatic loudness control. Figures 1 and 2 summarize that data. (To present the data with optimum graphic resolution, the loudness scales are narrower than in last month’s graphs.)
Both the loudness vs. time graphs and the histograms show that the Orban 8685 controls loudness well, although the two meters’ indications differ in detail. Both the BS.1770 and CBS measurements indicate that most of the data points are in a ±1dB/LK window.
The peak CBS readings fit within a ±2dB window. The BS.1770 readings also fit within a ±2 LK window, except for four short intervals, which appear as low-probability outliers in the left side of the histogram. These intervals correspond to dialog without background music and, in the author’s opinion, illustrate a weakness in BS.1770-2. Based on our extensive listening tests, we have concluded that the meter does not effectively lock onto the A/85 “anchor element” (almost entirely dialog in the test material used to prepare this paper). Instead, it indicates that loudness increases when dialog level is held constant while underscoring or effects are added to the mix.10
Problems with low peak-to-RMS ratio material
In the subjective testing to validate the BS.1770 meter, there were outliers as large as 6dB (i.e., the meter disagreed with human subjective perception by as much as 6dB11). The subjective testing to validate the CBS meter found outliers up to 3dB, although fewer items were used in this testing. We hypothesize that the worst-case error of the BS.1770 meter was substantially larger than that of the CBS meter because the BS.1770 meter does not model loudness summation or the loudness integration time constants of human hearing. BS.1770-2 states:
It should be noted that while this algorithm has been shown to be effective for use on audio programs that are typical of broadcast content, the algorithm is not, in general, suitable for use to estimate the subjective loudness of pure tones.
We have noted that the meter tends to over-indicate the loudness of program material that has been subjected to large amounts of “artistic” dynamic compression, as is often done for commercials and promotional material. In other words, the meter over-indicates the loudness of program material having an unusually low peak-to-average ratio, which, at the limit, approaches the peak-to-average ratio of a pure tone.
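To make the “peak-to-average ratio of a pure tone” concrete, here is a minimal Python sketch that computes the peak-to-RMS ratio (crest factor) of a sine wave and of a sparser, burst-like signal. The test signals and the function name `crest_factor_db` are illustrative assumptions, not material from the tests described in this article.

```python
import math

def crest_factor_db(samples):
    """Peak-to-RMS ratio of a sequence of samples, in dB."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(peak / rms)

N = 4800  # hypothetical block length, in samples

# Pure tone: 10 full cycles of a unit-amplitude sine.
sine = [math.sin(2.0 * math.pi * 10.0 * n / N) for n in range(N)]

# Burst-like signal: the same tone, present for only the first quarter
# of the block and silent for the rest (a crude stand-in for dynamic material).
burst = [s if n < N // 4 else 0.0 for n, s in enumerate(sine)]

sine_cf = crest_factor_db(sine)    # about 3.01 dB
burst_cf = crest_factor_db(burst)  # about 9.03 dB
```

The sine sits at roughly 3dB, the floor that heavily compressed material approaches, while the burst signal has the same peak but a crest factor about 6dB higher.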
We have encountered heated complaints by mixers12 and producers who stated that such material, when “matched” to the loudness of the surrounding program material via the BS.1770 meter, is considerably quieter in subjective terms. In turn, this has constrained the ability of producers to specify the type of audio processing they had previously used to give this material excitement and punch. We hypothesize that this problem is related to the fact that BS.1770 does not accurately indicate the loudness of pure tones.
Some studies have indicated that when people are asked to assess the loudness of a given piece of material, they state that it sounds louder when underscoring or effects are added to constant-level dialog. The EBU has used these studies to justify the position taken in R 128 that a listener’s impression of total loudness is more important than dialog level13. In our opinion, this misses the point. A more relevant question is whether viewers would want to turn down their volume controls to make dialog quieter when underscoring and effects appear. (In other words, whether effective TV commercial loudness control requires nothing more than applying gain control to commercials such that the BS.1770-2 “short-term” loudness14 is always limited to 0 LK.)
Orban and Dolby Labs hold similar views. We believe that dialog is the most important element in most television audio and that listeners do not want to turn down their volume controls every time that underscoring or effects appear under the dialog. The popular Dolby LM100 loudness meter15 in its current revision uses the same Leq(RLB) algorithm as BS.1770 but adds gating to eliminate non-speech material, including silence. The author has used the Dolby LM100 to measure the output of the Orban 8685 with a wide variety of speech material, and has observed that this material is almost always controlled within a ±1dB window as measured on the LM100.
This demonstrates the benefits of a dialog-centric measurement. Moreover, the author believes it is unwise to rely on a BS.1770 measurement to set the on-air loudness of unadorned dialog because this can cause the dialog to be too loud with respect to other material. The author has experimented with “inverse short-term BS.1770 loudness control” and believes that it sounds unnatural, pumping dialog loudness up and down in a subtly inartistic way as underscoring and effects come and go.16
Studies indicating that BS.1770 is inaccurate at very low frequencies
Another weakness of BS.1770 is that, unlike the CBS loudness controller and meter as implemented in Orban products, the BS.1770 algorithm does not take into account the loudness contributed by the LFE channel. There is a reason behind the omission. Norcross and Lavoie17 tried to extend the BS.1770 algorithm to include the LFE channel by summing the K-weighted LFE channel’s power into the current BS.1770 algorithm, with the gain weighted to account for the fact that the LFE channel receives a 10dB gain boost on playback, per Dolby’s standards.
This modified BS.1770 algorithm failed to agree with the judgments of a subjective listening panel unless a 10dB attenuation “fudge factor” was applied to the LFE channel prior to its power summation with the other channels. Norcross and Lavoie concluded:
A problem exists, however, should ITU-R BS.1770 be modified to simply include an attenuated version of the LFE channel. Because the LFE channel receives a 10dB boost on playback, the low-frequencies on this channel would contribute differently to a loudness measure if they were moved to one of the other main channels, even though the perceived loudness would not appreciably change. This suggests that while LFE content does contribute to the perceived loudness, Equation (2)18 does not sufficiently predict how that content should be included.
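The channel summation under discussion can be sketched in a few lines of Python. The channel weights and the -0.691 offset are those of the BS.1770 loudness formula; the `lfe_net_gain_db` parameter and the power values are hypothetical, introduced only to show how an LFE term would enter the sum. This is not the exact procedure of the cited paper, and how its playback boost and attenuation combine is simply represented here as one net gain.

```python
import math

# Channel weights G_i from BS.1770; the standard omits the LFE channel.
BS1770_WEIGHTS = {"L": 1.0, "R": 1.0, "C": 1.0, "Ls": 1.41, "Rs": 1.41}

def loudness_lk(mean_square, lfe_net_gain_db=None):
    """Loudness from K-weighted mean-square power per channel.

    mean_square: dict mapping channel name to K-weighted mean-square power.
    lfe_net_gain_db: None reproduces the standard sum (no LFE). A number adds
        the LFE power scaled by that net gain (hypothetical extension).
    """
    total = sum(g * mean_square[ch] for ch, g in BS1770_WEIGHTS.items())
    if lfe_net_gain_db is not None:
        total += 10.0 ** (lfe_net_gain_db / 10.0) * mean_square["LFE"]
    return -0.691 + 10.0 * math.log10(total)

# Hypothetical K-weighted mean-square powers for one measurement interval.
powers = {"L": 0.01, "R": 0.01, "C": 0.02, "Ls": 0.005, "Rs": 0.005, "LFE": 0.01}

standard = loudness_lk(powers)                              # LFE ignored
with_boost = loudness_lk(powers, lfe_net_gain_db=10.0)      # +10dB playback boost only
boost_and_cut = loudness_lk(powers, lfe_net_gain_db=0.0)    # boost offset by 10dB attenuation
```

With these made-up numbers, the reading moves by several dB depending on the gain chosen for the LFE term, which is why the choice of that gain matters so much to agreement with listeners.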
An Australian study may shed light on the failure of BS.1770 when program material contains considerable energy at very low frequencies.19 The authors used octave-band noise in subjective listening tests with the goal of verifying the K-weighting curve used in BS.1770. The authors state:
Comparison of the test results with an image of the filter curve currently specified in ITU-R Recommendation BS.1770 shows good agreement at 250Hz and above 500Hz, reasonable agreement at 500Hz, but marked difference in the bottom two octaves.
The relatively good performance of the BS.1770 algorithm in ITU trials suggests that, in partial loudness terms, there was probably not much test content in the 125Hz band or below. While the existing BS.1770 filter curve is probably a good choice in applications where the program is dominated by speech, and it is certainly an improvement on the A and B curves in that application, it is likely to give significant errors in measuring the loudness of other programs with more partial loudness in the lower frequencies, such as movie soundtracks and popular music. It is, therefore, desirable to improve on this filter for more general measurement of program loudness.
Discussion and conclusions
Several studies have shown that the loudness “comfort range” for typical television listening is +2, -5dB20. Beyond this range, a viewer is likely to become annoyed, eventually reaching for the remote control to change volume (or worse, from the broadcaster’s point of view, to mute a commercial). Whether measured via the CBS or BS.1770 algorithms, the CBS loudness controller algorithm in Orban’s current products effectively controls subjective loudness to much better than this +2, -5dB window.
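As an illustration of how a set of meter readings might be checked against such a window, the following Python sketch counts the fraction of readings inside an asymmetric range. The readings are invented for the example and are not data from these tests; `fraction_within` is a hypothetical helper name.

```python
def fraction_within(deviations_db, lower_db, upper_db):
    """Fraction of readings (dB relative to target) inside [lower_db, upper_db]."""
    inside = sum(1 for d in deviations_db if lower_db <= d <= upper_db)
    return inside / len(deviations_db)

# Hypothetical loudness readings, expressed as deviation from target in dB.
readings = [-0.8, 0.4, 1.1, -2.5, 0.0, 2.6, -5.5, 0.9]

comfort = fraction_within(readings, lower_db=-5.0, upper_db=2.0)  # +2, -5dB window
tight = fraction_within(readings, lower_db=-1.0, upper_db=1.0)    # ±1dB window
```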
In the original version of this paper, we had assumed that results using BS.1770 metering would be more consistent if that algorithm employed gating to prevent unadorned dialog from reading low compared to music and dialog with substantial background music or effects. However, this did not prove to be true with the program material we used for testing; the results from the BS.1770-1 (ungated) and BS.1770-2 (gated) measurements were similar when measuring material that had been processed by the CBS loudness controller. It is likely that the loudness-controlled material seldom caused the gate to act. (The CBS algorithm does not need silence gating because it is a “short-term” loudness measurement that incorporates cascaded models of the “instantaneous” and “short-term” loudness time constants of human hearing21, which the BS.1770 algorithm does not.)
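A rough Python sketch of two-stage gating helps show why tightly controlled material may seldom trigger the gate. It assumes the BS.1770-2 thresholds are an absolute gate at -70 LKFS and a relative gate 10 LU below the absolute-gated loudness, and it takes per-block loudness values as input rather than audio. The block values are hypothetical, and the constants should be checked against the text of the standard.

```python
import math

ABS_GATE_LK = -70.0  # assumed absolute threshold
REL_GATE_LU = -10.0  # assumed relative threshold offset

def lk_to_power(lk):
    return 10.0 ** ((lk + 0.691) / 10.0)

def power_to_lk(power):
    return -0.691 + 10.0 * math.log10(power)

def ungated_loudness(block_lk):
    """Power average over all blocks, with no gating."""
    powers = [lk_to_power(lk) for lk in block_lk]
    return power_to_lk(sum(powers) / len(powers))

def gated_loudness(block_lk):
    """Power average over blocks that pass both gates.

    Assumes at least one block exceeds the absolute gate.
    """
    stage1 = [lk for lk in block_lk if lk > ABS_GATE_LK]
    rel_gate = ungated_loudness(stage1) + REL_GATE_LU
    stage2 = [lk for lk in stage1 if lk > rel_gate]
    return ungated_loudness(stage2)

# Hypothetical block loudness values (LK).
controlled = [-24.5, -23.8, -24.2, -25.0, -23.5]    # after loudness control
uncontrolled = [-20.0, -22.0, -60.0, -75.0, -21.0]  # with pauses and silence

controlled_diff = gated_loudness(controlled) - ungated_loudness(controlled)
uncontrolled_diff = gated_loudness(uncontrolled) - ungated_loudness(uncontrolled)
```

For the controlled list, every block clears both thresholds, so the gated and ungated results coincide. For the uncontrolled list, the quiet blocks are discarded and the gated reading comes out about 2dB higher.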
Controlling loudness to a standard such as BS.1770 says nothing about the subjective acceptability of the loudness controller’s action. We have found that a simple loudness controller that uses the inverse of the BS.1770 short-term meter’s output to control loudness by gain reduction can cause unnatural-sounding gain pumping of dialog when underscoring and effects appear under the dialog.
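The behavior of such a simple controller can be sketched as follows. This is a bare-bones Python illustration with hypothetical readings, not the controller that was actually tested: the gain is just the inverse of however far the short-term reading exceeds the target.

```python
def inverse_loudness_gain_db(short_term_lk, target_lk=0.0):
    """Gain in dB (never positive) that pulls a short-term reading down to target."""
    return min(0.0, target_lk - short_term_lk)

# Hypothetical short-term readings (LK relative to target) for constant-level
# dialog: alone at first, then with underscoring added, then alone again.
readings = [-1.0, -1.0, 3.0, 3.0, -1.0]

gains = [inverse_loudness_gain_db(r) for r in readings]
# gains == [0.0, 0.0, -3.0, -3.0, 0.0]
```

Because the gain is applied to the whole mix, the dialog itself drops by 3dB whenever the underscoring enters and recovers when it leaves, even though the dialog level in the source never changed. That is the pumping described above.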
More complex automatic loudness controllers can produce all of the well-known artifacts of dynamics processing. Improperly designed multiband compressors can reduce dialog intelligibility22. This is why it is important to carefully assess the audio quality and side effects that an automatic loudness controller produces so that one can choose a device that controls loudness effectively without producing objectionable and unnatural artifacts that can fatigue audiences. Different loudness controllers do not provide equally good subjective results even if they produce identical measurements on a loudness meter.
Based on extensive experimentation with typical broadcast material, we believe that the CBS loudness meter locks onto dialog more effectively than does BS.1770, particularly when the dialog is accompanied by underscoring and/or effects. Unlike the BS.1770 meter, the CBS technology does not unnaturally penalize material having a low peak-to-RMS ratio, so it allows mixers and producers to freely use “artistic compression”23 and other well-established production techniques with the knowledge that such material will be neither too loud nor too quiet when compared to the surrounding program.
10 In the first published version of the paper, we observed similar dips in the BS.1770-1 (ungated) loudness and hypothesized that they were caused by lack of gating on silence and low-level material. For this reason, we were surprised that BS.1770-2 gating made little difference in the measurements of this material.
11 Refer to the scatter plots in Figs. 11, 12 and 13 of the ITU-R BS.1770-2 standard.
12 For example: “I did a -24 [LKFS] piece for Fox that was wall to wall singing and music for two minutes. Because of the overall loudness and continued full audio signal, I had to bring it down and when it aired, it was 3db too quiet even though it matched the magic LKFS number. I have no problem using these meters or meeting specs, but they are faulty.”—“wheresmyfroggy,” AVID board, 3-28-2011
13 Dash, Ian; Bassett, Mark; Cabrera, Densil, “Relative Importance of Speech and Non-Speech Components in Program Loudness Assessment,” AES Convention Paper 8043, 128th AES Convention (May 2010).
14 EBU R 128 specifies short-term loudness as a BS.1770-1 (ungated) measurement with a three-second integration time.
16 See Begnert, Fabian; Ekman, Håkan; Berg, Jan, “Difference between the EBU R-128 Meter Recommendation and Human Subjective Loudness Perception,” AES Convention Paper 8489, 131st AES Convention, (October 2011). This paper states, “These loudness-equalized signals gave rise to a perceived maximum loudness difference of 2.8dB.” This is very close to the 3dB number that has come up in other discussions. While the authors of this paper consider 3dB to be insignificant, others do not necessarily share this view, particularly advertisers who hear their expensive commercials aired 3dB quieter than surrounding program material!
17 Norcross, Scott G.; Lavoie, Michel C., “Investigations on the Inclusion of the LFE Channel in the ITU-R BS.1770-1 Loudness Algorithm,” AES Convention Paper 7829, 127th AES Convention (October 2009).
19 Cabrera, Densil; Dash, Ian; Miranda, Luis, “Multichannel Loudness Listening Test,” AES Convention Paper 7451, 124th AES Convention (May 2008).
20 ATSC A/85:2009 Annex E, “Loudness Ranges”
21 For example, see Glasberg, B.R. & Moore, B.C.J. (2002) “A Model of Loudness Applicable to Time-Varying Sounds,” J.AES, vol.50:5, pp.331-342, May 2002.
22 Stone, Michael A.; Moore, Brian C. J.; Füllgrabe, Christian; Hinton, Andrew C., “Multichannel Fast-Acting Dynamic Range Compression Hinders Performance by Young, Normal-Hearing Listeners in a Two-Talker Separation Task,” J. AES Volume 57 Issue 7/8 pp. 532-546; July 2009.
23 It appears that the group that created R 128 may be biased against this style of production: “Again, this does NOT mean that within a program the loudness level has to be constant, on the contrary! It also does NOT mean that individual components of a program (for example, pre-mixes or stem-mixes, a Music & Effects version or an isolated voice-over track) have all to be at the same loudness level! Loudness variation is an artistic tool, and the concept of loudness normalization according to R 128 actually encourages more dynamic mixing!” EBU TECH 3343, op. cit., p. 17.
—Robert Orban is chief engineer, Orban.