Examining Audio Bitstreams

Welcome to the next installment of our investigation into how Dolby Digital (AC-3), AAC, MP3, PAC, WMA and other audio data compression schemes get all that audio into such a small package and keep it sounding good. This time we will see precisely how the data rate is reduced and describe what certain artifacts sound like and what might cause them.

DISCARD VS. IGNORE

Last time we saw that audio signals can be viewed in both the time and frequency domains. To view audio signals in the frequency domain, some sort of transformation must occur. The most common is the Fast Fourier Transform, or FFT, and to cut right to the chase, its output is the frequency domain representation of its input. Once transformed, the audio can be separated into groups of frequencies, or bands, and the data rate reduced on a band-by-band basis.
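To make the band-by-band idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular codec does it: it uses a naive DFT rather than an optimized FFT, and the equal-width bands, the 8 kHz sample rate and the 1 kHz test tone are all assumptions chosen for the example. Real coders use their own filterbanks and band layouts.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(N^2) DFT. Returns the magnitudes of bins 0 .. N/2 - 1."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags

def band_energies(mags, num_bands):
    """Group adjacent bins into equal-width bands and sum the energy in each."""
    width = len(mags) // num_bands
    return [sum(m * m for m in mags[b * width:(b + 1) * width])
            for b in range(num_bands)]

# Assumed test signal: a 1 kHz sine sampled at 8 kHz, 64 samples long.
SAMPLE_RATE = 8000
N = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]

mags = dft_magnitudes(tone)          # 32 bins, each 125 Hz wide
energies = band_energies(mags, 4)    # 4 bands, each 1 kHz wide
loudest_band = max(range(4), key=lambda b: energies[b])
print("Band energies:", [round(e, 3) for e in energies])
print("Loudest band index:", loudest_band)
```

The 1 kHz tone lands in bin 8, which falls in the second band (1 kHz to 2 kHz). Once the energy is sorted into bands like this, each band can be treated separately.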

When looking at sounds in the frequency domain, we see that many sounds are present but inaudible because they fall below the threshold of human hearing. One misconception is that audio coders throw away audio information. This is not exactly true. What in fact happens is that the audio information that falls below the hearing threshold is ignored by decreasing the resolution in those areas. In this way, "bits" are not wasted on the non-essential and non-audible audio information and are instead available to improve the accuracy of the audible part.

Fig. 1: Spectrum of a primary and secondary signal, the resulting masking threshold (in red), and the quantization noise underneath.

We know that in a linear PCM system each bit of resolution gives us approximately an additional 6 dB of dynamic range by lowering the noise floor by that same amount. As an example, the theoretical noise floor of a 16-bit system is -96 dB (16 bits x 6 dB per bit). Lowering the resolution, i.e., using fewer bits, increases the quantization noise and therefore raises the noise floor. If we selectively decrease the resolution on a band-by-band basis, the area under the auditory masking threshold curve can be ignored, the net result of which is that this area is turned into inaudible noise. Fig. 1 shows what this might look like.

You can see that the quantization noise has been raised just enough that the area under the masking threshold is filled in but not exceeded. You can also see the small gap between the quantization noise and the masking threshold, known as coding margin.
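The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model, not any codec's actual bit allocation routine. It assumes that each bit lowers the quantization noise by 20 x log10(2), about 6.02 dB, relative to the signal level in that band, and the band levels, masking thresholds and 3 dB margin below are made-up illustration values.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB per bit

def bits_for_band(signal_db, mask_db, margin_db=3.0):
    """Smallest bit count that keeps quantization noise margin_db below the mask.

    Assumption: noise sits bits * DB_PER_BIT below the band's signal level.
    """
    target_noise_db = mask_db - margin_db
    needed_snr_db = signal_db - target_noise_db
    if needed_snr_db <= 0:
        return 0  # the whole band is already under the masking threshold
    return math.ceil(needed_snr_db / DB_PER_BIT)

# Hypothetical bands: (signal level in dB, masking threshold in dB)
bands = [(-10.0, -40.0), (-30.0, -35.0), (-60.0, -50.0)]
allocation = [bits_for_band(sig, mask) for sig, mask in bands]

print("16-bit PCM dynamic range: %.2f dB" % (16 * DB_PER_BIT))
for (sig, mask), bits in zip(bands, allocation):
    print("signal %6.1f dB, mask %6.1f dB -> %d bits" % (sig, mask, bits))
```

In this example the third band needs no bits at all because its signal is already below the masking threshold, while the first band gets the most because its signal stands far above it. The total is far smaller than giving every band 16 bits.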

STEREO VS. MONO

There is some efficiency to be gained from coding a stereo signal as such and not as two separate monaural signals. This is usually accomplished by generating sum (L+R) and difference (L-R) signals representing the M(id) and S(ide) information, respectively. In this manner, the encoder can switch between coding the L and R signals or the M and S signals, depending on which yields better coding gain. This switching can occur very quickly and is signaled to the decoder so that it can turn the L/R and M/S signals back into a seamless Left and Right output. There are some psychoacoustic caveats to doing this, however. As Brian Moore points out in his classic text "An Introduction to the Psychology of Hearing," binaural masking-level differences can cause sounds that would be masked if presented to only one ear to become audible when presented to both ears. Luckily, this can be detected and will be taken into account by a smart coder when it calculates the masking threshold.
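The sum and difference idea can be sketched as follows. This is a toy illustration under stated assumptions, not a real encoder: the 1/sqrt(2) scaling is chosen here so that total energy is preserved, and the rule used to pick a mode (prefer the pair whose energy is more unevenly split, measured by the product of the two energies) is only a rough stand-in for a true coding gain calculation. The function names are hypothetical.

```python
import math

SCALE = 1 / math.sqrt(2)  # assumed scaling so that total energy is unchanged

def to_mid_side(left, right):
    """Sum and difference signals: M from L+R, S from L-R."""
    mid = [(l + r) * SCALE for l, r in zip(left, right)]
    side = [(l - r) * SCALE for l, r in zip(left, right)]
    return mid, side

def to_left_right(mid, side):
    """Inverse transform, as a decoder would apply it."""
    left = [(m + s) * SCALE for m, s in zip(mid, side)]
    right = [(m - s) * SCALE for m, s in zip(mid, side)]
    return left, right

def energy(x):
    return sum(v * v for v in x)

def choose_mode(left, right):
    """Crude proxy for coding gain: pick the pair with the more lopsided energy."""
    mid, side = to_mid_side(left, right)
    if energy(mid) * energy(side) < energy(left) * energy(right):
        return "MS"
    return "LR"

# Nearly identical channels: almost everything ends up in M.
near_mono_l = [0.5, 0.4, -0.3, 0.2]
near_mono_r = [0.5, 0.39, -0.31, 0.2]
# Hard-panned source: L/R is already the lopsided representation.
panned_l = [0.5, 0.4, -0.3, 0.2]
panned_r = [0.0, 0.0, 0.0, 0.0]

print(choose_mode(near_mono_l, near_mono_r))
print(choose_mode(panned_l, panned_r))
```

A real encoder would make this choice per block and often per band, and would signal it in the bitstream so the decoder knows which inverse to apply.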

A further and not so obvious benefit of L/R and M/S switching during coding is that signals such as matrix surround encoded audio (LtRt) can be passed and protected from revealing coding artifacts upon matrix decoding. As the surround channels of such systems rely on phase differences between the Lt and Rt signals, separately coded channels could cause artifacts that differ in phase and would therefore be very effectively decoded into the surround channels.

Stereo (and multichannel) signals can also exploit the human ear's relative inability to localize high frequencies and, through a technique called "coupling," can send only a representation of the power of the high-frequency content of both channels. As the data rate is lowered, the frequency at which this coupling occurs can be lowered to save additional bits; however, there are certainly audible limits to this technique as the ear does eventually become more discriminating.
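A heavily simplified sketch of the coupling idea follows. It is an assumption-laden toy, not the scheme used by any specific codec: the high-frequency coefficients of both channels are averaged into one shared signal, and each channel keeps only a single scale factor that restores its original power in that band. The coefficient values and function names are hypothetical.

```python
import math

def energy(x):
    return sum(v * v for v in x)

def couple(left_hf, right_hf):
    """Encoder side: one shared signal plus one power scale factor per channel."""
    shared = [(l + r) / 2 for l, r in zip(left_hf, right_hf)]
    shared_energy = energy(shared)
    if shared_energy == 0:
        return shared, 0.0, 0.0  # nothing to scale (e.g. channels cancel)
    left_scale = math.sqrt(energy(left_hf) / shared_energy)
    right_scale = math.sqrt(energy(right_hf) / shared_energy)
    return shared, left_scale, right_scale

def decouple(shared, left_scale, right_scale):
    """Decoder side: rebuild each channel from the shared signal."""
    left = [v * left_scale for v in shared]
    right = [v * right_scale for v in shared]
    return left, right

# Hypothetical high-frequency coefficients for one band.
left_hf = [0.5, -0.2, 0.1, 0.4]
right_hf = [0.3, -0.1, 0.2, 0.1]

shared, ls, rs = couple(left_hf, right_hf)
left_out, right_out = decouple(shared, ls, rs)
print("Original energies:     ", energy(left_hf), energy(right_hf))
print("Reconstructed energies:", energy(left_out), energy(right_out))
```

The reconstructed channels carry the same power as the originals in this band, but they now share one waveform, so any phase difference between left and right at these frequencies is lost. That is the trade the ear tolerates at high frequencies and notices as the coupling frequency is pushed lower.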

ARTIFACTS

At a given bitrate, either there will be enough bits to convey the audio accurately, or there will not, and the codec will have to make a best guess as to where to use the available bits. In a high-quality codec operating at a conservative bitrate (i.e., on the high data rate side), these decisions will usually be correct and the audio will sound fine. Unfortunately, the world not being a perfect place, there is the possibility that a codec can make the wrong decision and reveal itself. This can be due to many factors, including a data rate that is simply too low to support the desired audio quality, or even audio that is difficult to encode due to noise or other bit-wasting content.

Arguably the most common artifact is described as a "swishy" or "watery" sound in the upper midrange and high-frequency areas. According to Brett Crockett of Dolby Laboratories, it is likely due to the codec running out of bits and having to make tough choices in the frequency domain. Sometimes certain frequency bands are simply undercoded and the result when transformed back to the time domain is the swishy sound.

Another complaint is a gritty or grainy quality to the audio. This can sometimes be attributed to too much resolution being taken away in an effort to save bits, causing the quantization noise to rise and become audible. This can also be the result of passing an audio signal through multiple generations of audio coding.

Looking back at Fig. 1, the small gap between the masking threshold and the quantization noise that we previously called coding margin will start to become filled in after another pass through an audio coder. If the coding margin was tight to begin with, the next pass could rapidly cause audible side effects as the coder will have a difficult time discriminating between the quantization noise added by the previous coder and the useful content. If you know that the audio will be passed through a subsequent coding generation, make sure to keep the data rates as high as possible at all stages.
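A back-of-the-envelope model shows how quickly coding margin can disappear. The sketch below rests on two assumptions that real coders will not follow exactly: every generation adds the same amount of quantization noise, and the noise from each pass is uncorrelated, so the noise powers simply add. The function name and margin values are hypothetical.

```python
import math

def remaining_margin_db(initial_margin_db, generations):
    """Coding margin left after a number of passes.

    Assumes each pass adds equal, uncorrelated noise power, so the total
    noise rises by 10 * log10(generations) dB over a single pass.
    """
    return initial_margin_db - 10 * math.log10(generations)

for margin in (3.0, 10.0):
    for gens in (1, 2, 4, 10):
        left = remaining_margin_db(margin, gens)
        status = "inaudible" if left > 0 else "noise reaches the mask"
        print("margin %4.1f dB, %2d generations -> %6.2f dB left (%s)"
              % (margin, gens, left, status))
```

Under these assumptions a 3 dB margin is used up by the second generation, while a 10 dB margin lasts until the tenth. This is why a wide coding margin, which costs data rate, matters so much when further coding passes are expected.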

This has a very practical application in digital television. As we have explored most of the network distribution strategies and have seen either high rate Dolby Digital (AC-3) or Dolby E chosen, it should be readily apparent why those choices were made. Dolby E was designed to have a wide coding margin and therefore withstand many generations with no audible degradation, and Dolby Digital (AC-3) at 640 kbps also has a relatively wide coding margin and will easily survive a subsequent pass for emission.

It may not be widely known or remembered that the ATSC allows Dolby Digital (AC-3) to be run at 448 kbps for a complete main program. This small increase buys a bit more coding margin, and can really help those networks that are not able to use Dolby E or Dolby Digital (AC-3) at a high data rate (and you know who you are!).

The next Audio Notes will wrap up our discussion of audio coding, and take a look at an amazing technology introduced at the New York AES convention that allows the pitch and time of an audio signal to be shifted and actually sound good afterwards. This is not an easy task and not a simple solution, but it is an interesting one.

Special thanks again to Dr. Deepen Sinha, and also to Brett Crockett of Dolby Laboratories for describing what might be behind some of the weird sounds we sometimes hear from coding systems. If you are interested in finding out more, an excellent group of papers is available at: http://www.dolby.com/tech/. I specifically recommend "The AC-3 Multichannel Coder" and "AC-3: Perceptual Coding for Audio Storage and Transmission."