SAN FRANCISCO—Today, many broadcasters and cable programmers are under increasing pressure both to prepare and to distribute their content to viewers over more diverse pathways. These next-generation distribution paths often have limited bandwidth. This requires broadcasters, service providers and operators to use audio and video coding systems that deliver broadcast streams more efficiently while simultaneously maintaining a predictable level of quality. The overall goal of using these new coding systems is to maintain the pre-established benchmark of broadcast quality.
Today, some networks simultaneously prepare and deliver content in the format that their existing and primary viewers require, as well as in the new formats that several of the next-generation service providers (e.g., IPTV operators or download services) require. This simulcast of content is ideal because the next-generation service providers can simply pass through the precompressed low bit rate content without having to perform a quality-altering process on the way through their systems.
However, there are many cases today in which these next-generation service providers simply take a network's primary signal and transcode it (i.e., convert it in real time from one format to another) into a format that will yield the lower bit rates required for their network. Unfortunately, making this process work typically requires a full decode of both the audio and video before re-encoding into one of the future audio and video formats, a process that often degrades quality.
This article series will focus on the most important and often overlooked factors when considering a next-generation audio codec in applications where transcoding (from one format to another) cannot be avoided. In particular, it will examine what to expect in terms of quality when two different audio coding systems are used in tandem and include a brief explanation of the standardized methods used to test the quality of audio coding systems. In addition, it will explore how to interpret test results and most importantly, define broadcast quality.
The term quality is a key concept in audio coding, yet it is also quite challenging to describe or measure in objective terms. Traditional measures, such as signal-to-noise ratio (SNR), are often of little use in quantifying the perceived audio quality of an audio coding system because they do not consider psychoacoustic principles.
In the late 1980s, researchers Brandenburg and Johnston from Bell Labs presented an interesting case supporting the need for something other than a simple objective measure. Referred to as the 13dB miracle, the researchers presented two processed audio signals, each having a measured 13dB SNR. In one of the signals, they introduced white noise, while the other was injected with perceptually shaped noise.
Even though the SNR was identical for both signals, the perceived quality was quite different to the listener. The signal with the injected white noise had an annoying background hiss. The signal with the shaped noise was characterized as having good quality because the noise distortion was being partially or completely masked by the signal itself.
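To make the limitation of SNR concrete, here is a minimal Python sketch of how the figure is computed. The function names, the 1kHz test tone and the crude low-pass "shaping" are illustrative assumptions, not the signals or the perceptual shaping used in the original demonstration. It simply shows that two noise signals with different spectral character can be scaled to exactly the same SNR, so the number alone cannot tell them apart.

```python
import math
import random

def power(samples):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(power(signal) / power(noise))

def scale_to_snr(signal, noise, target_db):
    """Return a scaled copy of noise so that snr_db(signal, result) equals target_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (target_db / 10.0)))
    return [gain * v for v in noise]

random.seed(0)
fs = 48000   # sample rate in Hz (assumed)
n = 4800     # 100 ms of audio
tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]

# White noise: roughly flat spectrum.
white = [random.gauss(0.0, 1.0) for _ in range(n)]

# Stand-in for "shaped" noise: white noise through a one-pole low-pass filter.
# This is NOT a psychoacoustic model; it only changes the noise spectrum.
shaped = []
prev = 0.0
for w in white:
    prev = 0.9 * prev + 0.1 * w
    shaped.append(prev)

white_13 = scale_to_snr(tone, white, 13.0)
shaped_13 = scale_to_snr(tone, shaped, 13.0)

# Both lines report the same SNR even though the noise spectra differ.
print(round(snr_db(tone, white_13), 2))
print(round(snr_db(tone, shaped_13), 2))
```

The measurement sees only total noise power; where that power sits in frequency relative to the signal, which is what determines masking, never enters the calculation.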
Based on the case just described, it is clear that human listeners need to be involved at some level to assess the quality of an audio coding system. Fortunately, there are two standardized methods for determining the perceived quality of audio coding systems that involve human test subjects. They are both standardized by the International Telecommunication Union (ITU) and referred to as:
ITU-R BS.1116-1 (Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems); and
ITU-R BS.1534-1 (Methods for the Subjective Assessment of Intermediate Quality Levels of Coding Systems), also known as MUSHRA, which stands for Multiple Stimuli with Hidden Reference and Anchor.
It can safely be stated that the perceived difference between a coded audio signal and the source (that is, the reference) has a direct correlation with quality, and the term impairment (often used in this discipline) can be thought of as the difference between the two. Another important term used when assessing audio coding systems is transparency.
Transparency is used to describe coded audio signals where the coding system under consideration is operating at a data rate such that listeners cannot reliably distinguish between the source (the reference) and the coded signal itself (where the source audio is encoded and then decoded). Therefore, the goal of any subjective listening test is to use a variety of test signals and then identify and grade how annoying audio impairments are when the codec is operating both within and below its region of transparency.
The ITU-R BS.1116-1 method is the more critical subjective listening test methodology of the two listed previously. It is typically used to assess audio coding systems that introduce impairments small enough to be undetectable without strict control of experimental conditions and proper statistical analysis. The grading scale is based on a (continuous) five-grade impairment scale, as shown on the right side of Table 1. A grade of 5.0 is considered to be transparent, while a grade of 1.0 is reserved for very annoying impairments. The ITU-R BS.1116-1 grading scale also has a relationship to a standardized quality scale, which is defined in ITU-R BS.1284-1 and shown in Table 1.
The BS.1116-1 method itself is a double-blind, triple-stimulus with hidden reference type of test. Each test session is made up of trials. A trial begins with the presentation of a set of stimuli (the reference and two test items) and finishes with the test subject grading each of the test items.
For each trial, the listener is presented with three signals or stimuli. One signal is the uncompressed reference signal (which is always known to the test subject), and the remaining two are the test signals, one of which is identical to the reference and the other of which is the same signal coded at a particular bit rate of interest.
Listeners are asked to assess and then grade the impairments between each of the test signals compared with the known reference. (They can freely switch between any of them.) Since one of the test signals is actually the (hidden) reference signal, the listeners should be grading it as a 5.0, and the remaining test signal should receive a grade based on the listener's subjective assessment of the degradation. If listeners are unable to reliably perceive any differences between the test signals, the audio coding system and the tested bit rate are said to be in the particular codec's region of transparency.
Tests like this often include signals coded at multiple bit rates to quantify the coding margin of the coding systems in question. Coding margin is the difference between the masking threshold and the quantization noise produced by a given codec/bit rate combination. Operating a codec at a higher data rate will typically yield an increase in coding margin.
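As a rough illustration of coding margin, the following Python sketch subtracts a quantization noise level from a masking threshold in each of three frequency bands. The sign convention (positive margin means the noise sits below the threshold), the function names and all of the dB values are assumptions invented for this example; they are not measurements from any codec.

```python
def coding_margin_db(masking_threshold_db, quantization_noise_db):
    """Margin in dB; positive means the noise is below the masking threshold."""
    return masking_threshold_db - quantization_noise_db

def worst_case_margin(thresholds_db, noise_db):
    """Smallest per-band margin, i.e. the band closest to (or past) audibility."""
    return min(coding_margin_db(t, q) for t, q in zip(thresholds_db, noise_db))

# Hypothetical per-band levels in dB, for illustration only.
masking_threshold = [40.0, 35.0, 30.0]
noise_low_rate = [38.0, 37.0, 33.0]    # more quantization noise at a low bit rate
noise_high_rate = [30.0, 29.0, 25.0]   # less quantization noise at a high bit rate

print(worst_case_margin(masking_threshold, noise_low_rate))   # -3.0
print(worst_case_margin(masking_threshold, noise_high_rate))  # 5.0
```

In this toy case the lower rate leaves a negative margin in two bands, meaning the noise would exceed the masking threshold there, while the higher rate keeps every band masked with room to spare.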
What about test material? BS.1116-1 requires that only critical material be used to expose differences among all of the audio coding schemes being tested. Critical material stresses the audio coding system in question and must be investigated and sought out for each system that's tested.
It is not uncommon to find several of the same audio test sequences among different subjective evaluation tests. However, there is not a universal set of audio test material that can be used to assess all audio coding systems for all conditions. This key aspect of the testing process is absolutely crucial because failing to find truly critical test sequences for each audio coding system will result in inconclusive test results.
Briefly, the MUSHRA method was designed for (and is more appropriate for) evaluating audio coding systems that are known and/or expected (a priori) to provide intermediate audio quality. Hence, this approach is tailored to evaluate systems with medium and large impairments (which is quite different from the goal of evaluating systems that introduce small impairments as in BS.1116-1).
In a MUSHRA test, it is a given that the listening panel will have no difficulty in detecting impairments. This approach also uses a high-quality uncompressed reference signal (also used as the hidden reference) and one additional signal called the anchor, which is a low-pass-filtered version of the high-quality reference signal. The bandwidth of the anchor signal should be 3.5kHz and is meant to aid us in weighting the relative annoyance of coder artifacts.
One of the main differences is that MUSHRA provides the listener a means of directly comparing the impairments of all the coding systems (and data rates) being evaluated in each trial. Hence, the listener can switch at will between any of the systems under test, the hidden reference, the anchor and the known reference. The listener will be grading a particular coding system by comparing that system directly with the reference signal and relative to the other systems being tested for each trial throughout the test.
As this overview suggests, both the ITU-R BS.1116-1 and ITU-R BS.1534-1 test methodologies can and often do take a significant amount of time to perform correctly, often several months of effort. Given this, there are only a handful of facilities and personnel throughout the world qualified to properly administer this type of test. For more details on these ITU test methodologies, please request a copy of each document from the ITU.
INTERPRETING TEST RESULTS
Figures 1 and 2 depict the results from a formal ITU-R BS.1116-1 listening test performed by Dolby Laboratories in 2001. Figure 1 shows the individual and mean results for two-channel AC-3 tested at 192Kb/s with and without a cascade of Dolby E preceding it. (The Dolby E included eight decode/encode cycles.) The x-axis indicates the critical items used in the test as well as the mean value of all items. The y-axis indicates something called the “diffgrade.” Diffgrade is equal to the subjective rating given to the coded test item minus the rating given to the hidden reference. Hence, a diffgrade of -4.0 indicates poor quality, and a diffgrade close to 0.0 can be considered very high quality.
Diffgrade is used as the unit of measurement because listeners do sometimes misidentify the hidden reference. Analyzing only the grade given to the actual coded item would discard the information contained in the grade given to the hidden reference. Diffgrade scores that are all negative tell us that listeners correctly identified the coded item during the test.
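The diffgrade arithmetic can be sketched in a few lines of Python. The three trials below are hypothetical grades made up for illustration; only the formula (coded grade minus hidden reference grade) comes from the description above.

```python
import statistics

def diffgrade(coded_grade, hidden_reference_grade):
    """Grade given to the coded item minus the grade given to the hidden reference."""
    return coded_grade - hidden_reference_grade

# Each tuple is (grade for coded item, grade for hidden reference).
# Hypothetical values, not data from the test discussed in this article.
trials = [
    (4.0, 5.0),  # listener correctly heard an impairment in the coded item
    (4.5, 5.0),
    (5.0, 4.5),  # listener mistook the hidden reference for the coded item
]

diffgrades = [diffgrade(coded, ref) for coded, ref in trials]
print(diffgrades)  # [-1.0, -0.5, 0.5]
print(statistics.mean(diffgrades))
```

The positive value in the third trial is exactly the information that would be lost if only the coded item's grade were analyzed.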
In Figure 1, the overall mean score across all items in the test was -0.58 for AC-3 at 192Kb/s and -0.60 for AC-3 at 192Kb/s with the Dolby E cascade described previously. Relative to a hidden-reference grade of 5.0, both of these scores fall in the middle of the “perceptible, but not annoying” range of the BS.1116-1 scale, which is between 4.0 and 5.0.
It is also worth noting the importance of the error bars (representing the 95 percent confidence intervals) in test results like these. As a rule of thumb, two data points can be considered statistically different only if their error bars do not overlap. In Figure 1, the error bars on the mean scores overlap for the codec with and without Dolby E in cascade, indicating that the two results are statistically indistinguishable. If a set of test results is presented without error bars, be sure to request a version that includes them, as the information they contain can be quite revealing.
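The error-bar comparison can be sketched as follows. This is a minimal example under stated assumptions: the diffgrade samples are hypothetical, the helper names are invented, and the interval uses a normal approximation (a multiplier of about 1.96), whereas a formal analysis of a small listening panel would more likely use a Student t multiplier.

```python
import math
import statistics

def mean_ci95(samples):
    """Return (mean, lower, upper) for an approximate 95 percent confidence interval."""
    m = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.975)                 # about 1.96
    return m, m - z * sem, m + z * sem

def intervals_overlap(a, b):
    """True if the two (mean, lower, upper) intervals share any values."""
    _, lo_a, hi_a = a
    _, lo_b, hi_b = b
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical per-listener diffgrades, for illustration only.
codec_alone = [-0.4, -0.7, -0.5, -0.8, -0.6, -0.5]
codec_with_cascade = [-0.5, -0.7, -0.6, -0.8, -0.6, -0.4]
other_codec = [-1.4, -1.3, -1.5, -1.2, -1.4, -1.3]

ci_alone = mean_ci95(codec_alone)
ci_cascade = mean_ci95(codec_with_cascade)
ci_other = mean_ci95(other_codec)

print(intervals_overlap(ci_alone, ci_cascade))  # True: not distinguishable
print(intervals_overlap(ci_alone, ci_other))    # False: clearly different
```

Overlap is a conservative visual check rather than a formal significance test, which is one more reason to ask for the underlying intervals when they are not plotted.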
Figure 2 shows the results of an MPEG-1 LII audio codec operating at a data rate of 192Kb/s with and without a cascade of Dolby E preceding it. (Dolby E included eight decode/encode cycles.) In this case, the mean scores were lower, with mean diffgrades of -1.35 and -1.45, respectively. The results place the MPEG-1 LII at 192Kb/s with and without Dolby E in cascade in the middle of the “slightly annoying” range of the BS.1116-1 scale.
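The way these mean diffgrades map onto the impairment scale can be sketched in Python. The mapping assumes the hidden reference sits at 5.0, so that grade = 5.0 + diffgrade, and it treats each whole-number interval as one band, following the reading used in this article; the function name and exact boundary handling are illustrative assumptions rather than part of the ITU text.

```python
def impairment_band(mean_diffgrade):
    """Map a mean diffgrade to a BS.1116-1 impairment band (assumed boundaries)."""
    grade = 5.0 + mean_diffgrade  # assumes the hidden reference is graded 5.0
    if grade >= 5.0:
        return "imperceptible"
    if grade >= 4.0:
        return "perceptible but not annoying"
    if grade >= 3.0:
        return "slightly annoying"
    if grade >= 2.0:
        return "annoying"
    return "very annoying"

# Mean diffgrades quoted in the text for Figures 1 and 2.
for score in (-0.58, -0.60, -1.35, -1.45):
    print(score, impairment_band(score))
```

With these inputs the first two scores land in the "perceptible but not annoying" band and the last two in "slightly annoying", matching the interpretation given for the two figures.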
Contrasting the results in Figure 1 and Figure 2, it is important to look for consistent behavior among all the test items. In Figure 1, the behavior among all the critical test items was consistent (from item to item), whereas the codec behavior in Figure 2 exhibits quite a large range of variability across all of the critical listening items. This type of behavior is important to keep an eye on and can indicate how well a particular codec/bit rate combination will perform over a variety of test material.
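One simple way to put a number on item-to-item consistency is to look at the spread of the per-item mean diffgrades. The sketch below uses range and standard deviation; the item names and scores are hypothetical and are not read from Figure 1 or Figure 2, and other spread measures would serve equally well.

```python
import statistics

def spread(item_means):
    """Range and standard deviation of per-item mean diffgrades."""
    values = list(item_means.values())
    return {
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),
    }

# Hypothetical per-item mean diffgrades, for illustration only.
consistent_codec = {"item A": -0.5, "item B": -0.6, "item C": -0.7, "item D": -0.5}
variable_codec = {"item A": -0.3, "item B": -2.6, "item C": -0.9, "item D": -1.7}

print(spread(consistent_codec))
print(spread(variable_codec))
```

A codec with a small spread behaves predictably across material, while a large spread suggests that some content will fare noticeably worse than the overall mean implies.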
MORE TO COME
In a future Production Clips article, we'll tackle part two of this audio coding series, defining broadcast quality, as well as tandem coding losses and their effect on perceived quality.
Jeffrey C. Riedmiller is senior broadcast product manager for Dolby Laboratories.
Table 1. ITU-R BS.1116-1's grading scale compared with ITU-R BS.1284-1's scale

ITU-R BS.1284-1 (quality)    ITU-R BS.1116-1 (impairment)
5  Excellent                 5  Imperceptible
4  Good                      4  Perceptible but not annoying
3  Fair                      3  Slightly annoying
2  Poor                      2  Annoying
1  Bad                       1  Very annoying