Audio compression and noise

In a previous article, we discussed analog audio SNR concepts. As mentioned, the dynamic range is restricted at the top by clipping (THD ≤1 percent) and at the bottom by the thermal noise combined with the ambient acoustical noise picked up by the microphone. This article will discuss some aspects of digital audio noise and digital audio compression-related noise.

Digital audio noise concepts

Digital audio noise is the result of A/D conversion limitations, which introduce quantizing errors. Essentially, the analog-to-digital conversion quality is limited on the one hand by the sampling frequency (Fs) and the maximum audio frequency (Fmax), and on the other hand by the number of bits per sample (n). The digital audio SNR is given by:

SNR(dB) = 6.02n + 1.76 + 10log10[Fs/(2Fmax)]

where

n = Number of bits per sample

Fs = Sampling frequency

Fmax = Maximum audio frequency

Assuming n=20, Fs=48kHz and Fmax=20kHz, the resulting SNR=122.95dB. Quite impressive!
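As a quick check, the formula is easy to evaluate directly. The following Python sketch (the function name is ours, for illustration) reproduces the 122.95dB figure:

    import math

    def digital_audio_snr(n_bits, fs, f_max):
        # SNR(dB) = 6.02n + 1.76 + 10log10[Fs/(2Fmax)]
        return 6.02 * n_bits + 1.76 + 10 * math.log10(fs / (2 * f_max))

    print(round(digital_audio_snr(20, 48e3, 20e3), 2))  # 122.95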

The perception of quantizing error is signal-dependent. At high audio signal levels, approaching 0dBFS, the Human Auditory System (HAS) perceives the quantizing errors as random noise. At lower audio signal levels, the HAS perceives the quantizing errors as harmonic and intermodulation distortions.
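This behavior can be observed numerically. The sketch below (our illustration, assuming an ideal uniform quantizer with no dither) quantizes a 1kHz sine to 16 bits near full scale and at -60dBFS; the measured SNR drops by roughly 60dB, and at the low level the error is no longer random but correlated with the signal:

    import numpy as np

    def quantize(x, n_bits):
        # Ideal uniform quantizer for signals in the range [-1.0, 1.0]
        steps = 2 ** (n_bits - 1)
        return np.round(x * steps) / steps

    fs = 48_000
    t = np.arange(fs) / fs
    for level in (1.0, 1e-3):               # ~0dBFS and -60dBFS
        x = level * np.sin(2 * np.pi * 1000 * t)
        err = quantize(x, 16) - x
        snr = 10 * np.log10(np.mean(x**2) / np.mean(err**2))
        print(f"level {level}: SNR = {snr:.1f} dB")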

Consider a digital audio data stream with an 18-bit resolution and a sampling frequency of 48kHz. The essential bit rate is equal to:

18 bits per sample × 48kHz = 864kb/s

The six channels of a 5.1 audio program at 18 bits per sample, as in ATSC, would result in a bit rate of 5.184Mb/s. It is obvious that compression is necessary for transmission purposes.
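The same arithmetic extends directly to a multichannel program; a small sketch (our illustration):

    bits_per_sample = 18
    fs = 48_000                              # samples per second
    channels = 6                             # a 5.1 program

    per_channel = bits_per_sample * fs       # 864,000 b/s
    program = per_channel * channels         # 5,184,000 b/s
    print(per_channel / 1e3, "kb/s per channel")
    print(program / 1e6, "Mb/s for the 5.1 program")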

Audio compression methods

Audio compression methods rely on the characteristics and limitations of human psychoacoustics to remove perceptually irrelevant digital audio data. Such a coder is thus best described as a perceptual coder, as opposed to a waveform coder. In a perceptual compression process, the codec (coder/decoder pair) does not attempt to recreate the input signal waveform. Its goal is to ensure that the recreated signal sounds natural to a human listener. The HAS has certain characteristics that are exploited by audio compression systems:

  • The spectral response: The HAS behaves like a spectrum analyzer. It separates the audible sound spectrum into 25 frequency bands called critical bands. The bandwidth of the critical bands is proportional to the center frequency and varies from 100Hz, below 500Hz, to 3500Hz at 13,500Hz; a common numerical approximation of this scale is sketched after this list.
  • The frequency response: The sensitivity of the ear decreases at low and high frequencies and is dependent on the sound pressure level (SPL), being relatively flat (±10dB) at 120dB SPL.
  • The masking effects: The HAS suppresses some sounds in the presence of other sounds, a process called auditory masking. A weaker sound is masked if it is made inaudible by the presence of a louder sound. There are two types of masking: temporal masking and frequency-dependent masking.
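The critical-band scale mentioned in the first item is commonly approximated by Zwicker's Bark formula. The sketch below (our illustration; the formula comes from the psychoacoustics literature, not from this article) maps a frequency in hertz to a critical-band number:

    import math

    def bark(f_hz):
        # Zwicker's approximation of the critical-band (Bark) scale
        return 13 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500) ** 2)

    for f in (100, 500, 1000, 5000, 13500):
        print(f"{f} Hz -> critical band {bark(f):.1f}")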

As shown in Figure 1, temporal masking occurs both shortly before the onset of a louder sound (premasking) and, with a slow decay, for some time after it stops (postmasking). While the louder sound is sustained, other sounds of lower amplitude are masked.

Figure 2 shows that the threshold of hearing is also frequency-dependent. A sound close in frequency to another sound is more easily masked than one far apart in frequency. The continuous curve represents the HAS threshold of hearing; sounds at various frequencies with SPL levels below this curve are inaudible. The dashed curve shows how a 1kHz sound raises the threshold of hearing and effectively masks lower-amplitude sounds of neighboring frequencies. Simultaneous frequency-domain masking thus raises the perception threshold of sounds whose frequencies are in the vicinity of a higher-amplitude sound.

In the presence of a complex audio spectrum, such as music, the threshold is raised at all frequencies. The beneficial effect is the masking of background noise during the reproduction of music. The parts of the signal that are masked, and hence inaudible, are referred to as irrelevant. The statistically predictable parts of the signal, which a source encoder can remove and restore without loss, are referred to as redundant. In order to remove the irrelevancies in the audio signal, the encoder contains a psychoacoustic model. It analyzes the input signal in consecutive time blocks and determines, for each block, the spectral components of the input signal by applying a frequency transform.
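The threshold-in-quiet curve of Figure 2 is often approximated in psychoacoustic models by Terhardt's formula. The following sketch (our illustration, not taken from this article) uses it to test whether a pure tone would be audible in quiet:

    import math

    def threshold_in_quiet(f_hz):
        # Terhardt's approximation of the absolute threshold of hearing (dB SPL)
        khz = f_hz / 1000
        return (3.64 * khz ** -0.8
                - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
                + 1e-3 * khz ** 4)

    for f, spl in ((50, 20), (3000, 0), (16000, 20)):
        verdict = "audible" if spl > threshold_in_quiet(f) else "inaudible"
        print(f"{f} Hz at {spl} dB SPL: {verdict}")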

The psychoacoustic model provides for high-quality lossy signal compression by describing which parts of a given audio signal can be removed, or aggressively compressed, without a significant loss in the quality of the sound. Essentially, low-amplitude signals are either suppressed, if they are below the threshold of audibility, or allocated only a few bits, because the resulting quantizing noise remains below the perception level. Such compression is a feature of all modern audio compression formats.
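A minimal sketch of this allocation idea (our illustration, not the actual algorithm of any standard): give each band just enough bits that its quantizing noise, improving by roughly 6dB per bit, falls below the masking threshold, and give fully masked bands no bits at all:

    import math

    def allocate_bits(signal_db, mask_db):
        # Signal-to-mask ratio; each bit buys ~6.02dB of quantizing SNR
        smr = signal_db - mask_db
        return 0 if smr <= 0 else math.ceil(smr / 6.02)

    # Per-band (signal level, masking threshold) in dB (illustrative values)
    bands = [(60, 30), (45, 50), (70, 40), (20, 35)]
    print([allocate_bits(s, m) for s, m in bands])    # [5, 0, 5, 0]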

The MPEG approach

Figure 3 shows a simplified block diagram of an MPEG encoder:

  • Filter bank: A filter bank splits the input signal into 32 subbands in an essentially lossless and reversible manner similar to the HAS process. Bands in which there is little energy result in small signal amplitudes that can be transmitted with short word lengths (few bits per sample). Each band thus yields variable-length samples, but the sum of all the sample word lengths is less than that of the initial PCM, so a coding gain can be obtained.
  • Fast Fourier transform (FFT): An FFT of the audio is used as the input to a masking threshold algorithm to determine what scale factor and quantizing level to use.
  • Scaler: A scaler boosts low-amplitude signals as far above the quantizing noise floor as possible.
  • Quantizer: A quantizer allocates the available number of bits in a way that meets both the bit-rate and the masking requirements. The information on how the bits are distributed over the spectrum is carried in the bit stream as side information; a toy sketch of the scale-and-quantize step follows this list.
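Here is that toy version of the scale-and-quantize step described in the last two items (our illustration, not MPEG's actual syntax): the block is normalized by a scale factor that travels as side information, then requantized with the allocated number of bits:

    import numpy as np

    def encode_block(samples, n_bits):
        # Scaler: bring the block's peak near full scale; the scale factor
        # is transmitted as side information alongside the coded samples.
        scale = float(np.max(np.abs(samples))) or 1.0
        codes = np.round(samples / scale * 2 ** (n_bits - 1)).astype(int)
        return codes, scale

    def decode_block(codes, scale, n_bits):
        return codes / 2 ** (n_bits - 1) * scale

    block = np.array([0.001, -0.0004, 0.0008, -0.0002])   # a quiet subband block
    codes, scale = encode_block(block, 4)                  # only 4 bits allocated
    print(decode_block(codes, scale, 4))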

The decoder is less complex because it requires neither a psychoacoustic model nor a bit-allocation procedure.

ATSC applications

Figure 4 shows a typical ATSC application. In this example, a 5.1-channel audio program is converted from a PCM representation requiring more than 5Mb/s (six channels × 48kHz × 18 bits = 5.184Mb/s) into a 384kb/s serial bit stream by an AC-3 encoder, a reduction of about 13.5:1. All this is achieved by transforming irrelevant audio information into inaudible quantizing noise.
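The compression ratio is easy to verify (our arithmetic):

    pcm = 6 * 48_000 * 18      # 5.184 Mb/s PCM input
    ac3 = 384_000              # 384 kb/s AC-3 output
    print(f"compression ratio {pcm / ac3:.1f}:1")   # 13.5:1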

Michael Robin, a fellow of the SMPTE and former engineer with the Canadian Broadcasting Corp.'s engineering headquarters, is an independent broadcast consultant located in Montreal, Canada. He is co-author of “Digital Television Fundamentals,” published by McGraw-Hill and translated into Chinese and Japanese.