Digital Audio Sample Rates: The 48 kHz Question

A regular reader of this column would correctly surmise that the author considers audio to be a very important part of television. We are well into the digital age in television audio and video now, but let's take an interesting trip back a couple of decades, to the early years of digital audio, as we answer the question, "How did 48 kHz emerge as the professional digital audio sample rate?" As we will see, television had a great influence on this number.

It is not really difficult to understand the origin of the composite NTSC digital video sample rate of about 14.318 MHz, as used in D-2 and D-3 recorders: It is derived by multiplying the NTSC subcarrier frequency by 4. The origin of the component digital video sample rate of 13.5 MHz, as found in ITU-R Recommendation 601, is a little less accessible, but also clear: 4.5 MHz is a common frequency from which both 525/60 and 625/50 line and field rates may be derived by frequency division (see Table 1 for further details), and 13.5 MHz is derived by multiplying 4.5 MHz by 3.

TOO MANY SAMPLE RATES

In the late 1970s, digital audio was being increasingly used in both television and audio-only applications, and, as there was not yet a standard, the industry found itself dealing with a large number of proprietary digital audio sample rates; these ranged from a low of 32 kHz used by the BBC on aural studio-to-transmitter links and standardized by the European Broadcasting Union for use on communications circuits, to 50 kHz, as used on the two preeminent U.S. digital audio recorders then available, Soundstream and 3M Mincom.

As the use of digital audio increased, it became apparent that standardization on a single sample rate was needed, and particular progress toward this goal was achieved in 1981. In the selection of a sample rate, a number of factors had to be considered. An underlying assumption was that an audio bandwidth of 20 kHz had to be accommodated for professional use. Nyquist dictates that in order to perfectly reconstruct a sampled waveform without aliasing, at least two complete samples must be taken per cycle of the highest frequency present. Another way to say this is that the sample rate used must be at least double the highest frequency sampled. As the filters that are required to ensure that the bandwidth of the sampled audio signals remains within the proper bounds have finite slopes, the sample rate had to be well above 40 kHz. It was also generally agreed at that time that quantization should be 16-bit linear, with potential future expansion to 18 or more bits. At that time, state-of-the-art 16-bit digital-to-analog and analog-to-digital converters could operate at sample rates up to 60 kHz, and the sample rates under consideration therefore ranged from about 45 to 60 kHz. For applications concerned solely with audio, any frequency in this range could have been selected, but in order to synchronize digital audio with television and film, all the frequencies listed in Table 1, Television and Film Frequencies in use in 1981, had to be taken into consideration. Note that at this time, the only HDTV format that was in commercial use was the (analog) 1125/60 Hi-Vision system.
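The Nyquist reasoning above can be put into numbers with a short sketch. It assumes the 20 kHz audio bandwidth stated in the text and simply reports how much room is left between the top of the audio band and half the sample rate, which is the space available for the filter slope. The function name and the list of rates are illustrative choices, not part of any standard.

```python
AUDIO_BANDWIDTH_HZ = 20_000  # professional audio bandwidth assumed in the text

def transition_band_hz(sample_rate_hz):
    """Room between the top of the audio band and half the sample rate.

    A result of zero means the rate only just meets the Nyquist limit and
    would need an infinitely steep filter; larger values allow gentler slopes.
    """
    return sample_rate_hz / 2 - AUDIO_BANDWIDTH_HZ

# Illustrative rates only: the bare Nyquist minimum, then some rates
# mentioned elsewhere in the article.
for fs in (40_000, 44_100, 48_000, 50_000, 60_000):
    print(f"{fs / 1000:5.1f} kHz -> {transition_band_hz(fs):7.1f} Hz for the filter slope")
```

At 40 kHz there is no room at all, which is why the candidates started at about 45 kHz.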

LEAP FRAMES

When audio-video editing and processing are considered, the number of digital audio samples per video frame is an important quantity. If the sample rate is a multiple of 600 Hz, an integral number of digital audio samples per video frame (single-frame periodicity) is obtained in systems with 24, 25 and 30 frames per second, as 600 is evenly divisible by those frame rates. However, if single-frame periodicity is to be achieved for those frame rates and also the frame rate of the NTSC color system (30/1.001 or about 29.97 frames per second), the sample rate must be a multiple of 30 kHz. Thus, in the sample frequency range under consideration, 60 kHz was the only frequency at which each frame would contain the same integer number of audio samples, a common element in all television and motion picture systems. All other possible sample frequencies had either the "leap-frame problem" or the "split-frequency problem."
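The arithmetic behind single-frame periodicity can be checked with exact fractions. The sketch below is only an illustration: the dictionary of frame rates and the function names are assumptions made for this example, and the search steps through multiples of 600 Hz between 45 and 60 kHz, the range given in the text.

```python
from fractions import Fraction

# Frame rates in frames per second. NTSC color is exactly 30/1.001,
# which is 30000/1001 as a fraction.
FRAME_RATES = {
    "film 24": Fraction(24),
    "625/50": Fraction(25),
    "525/60 monochrome": Fraction(30),
    "NTSC color": Fraction(30000, 1001),
}

def samples_per_frame(sample_rate_hz, frame_rate):
    """Exact number of audio samples in one video frame."""
    return Fraction(sample_rate_hz) / frame_rate

def has_single_frame_periodicity(sample_rate_hz):
    """True if every frame rate above gets a whole number of samples per frame."""
    return all(
        samples_per_frame(sample_rate_hz, rate).denominator == 1
        for rate in FRAME_RATES.values()
    )

# Candidate rates from 45 kHz to 60 kHz in steps of 600 Hz.
candidates = range(45_000, 60_001, 600)
print([fs for fs in candidates if has_single_frame_periodicity(fs)])  # [60000]
```

Because 1001 shares no factor with 30000, a whole number of samples per NTSC frame is possible only when the sample rate itself is a multiple of 30 kHz, which leaves 60 kHz as the sole survivor in this range.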

When the ratio of digital audio sample rate to video frame rate is not an integer, it is necessary to accommodate a different number of samples in some video frames than in others. These are called "leap frames," in analogy to "leap years," as opposed to "regular frames." A pattern of leap frames and regular frames repeats with a periodicity of n frames. Leap frames must of course be identified and distinguished from regular frames by some means such as a digital flag.
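As a rough sketch of how such a pattern can arise, the code below computes the periodicity n as the denominator of the exact samples-per-frame ratio, then spreads the samples by rounding the running total down at each frame boundary. This is one possible distribution chosen for illustration; the order in which real equipment places its leap frames is not specified in this article and may differ. The function name is hypothetical.

```python
import math
from fractions import Fraction

def leap_frame_pattern(sample_rate_hz, frame_rate):
    """Return (n, counts): the periodicity in frames and samples per frame.

    The counts come from flooring the cumulative sample total at each
    frame boundary, which is only one of several valid distributions.
    """
    ratio = Fraction(sample_rate_hz) / Fraction(frame_rate)
    period = ratio.denominator  # frames after which the pattern repeats
    counts = []
    previous_total = 0
    for frame in range(1, period + 1):
        total = math.floor(ratio * frame)  # whole samples elapsed so far
        counts.append(total - previous_total)
        previous_total = total
    return period, counts

NTSC = Fraction(30000, 1001)  # exactly 30/1.001 frames per second

print(leap_frame_pattern(48_000, NTSC))  # (5, [1601, 1602, 1601, 1602, 1602])
print(leap_frame_pattern(50_000, NTSC))  # (3, [1668, 1668, 1669])
print(leap_frame_pattern(48_000, 25))    # (1, [1920])
```

In each case the counts add up to a whole number of samples over n frames, for example 8,008 samples over five NTSC frames at 48 kHz.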

In the split-frequency approach, when working in applications related to NTSC, the nominal sample rate is reduced by the same factor as the NTSC color frame rate, that is, divided by 1.001. There were proposals for two pairs of split-frequency sample rates: 44.1 and 44.1/1.001 kHz, and 50.4 and 50.4/1.001 kHz. The use of split frequencies would have led to ambiguity and to uncertainty about which sample rate was to be used in a given situation. Further, use of the incorrect rate would produce a pitch change of one part in 1,000, considered unacceptable for professional use, and would quickly generate a lip-sync problem in television.
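To give a feel for how quickly the lip-sync error builds up, here is a minimal sketch. It assumes material recorded at the nominal rate is played back at the nominal rate divided by 1.001, so that it runs longer by exactly one part in 1,000. The function names and the choice of one NTSC frame as the yardstick are illustrative.

```python
from fractions import Fraction

NTSC_FACTOR = Fraction(1001, 1000)        # 1.001 exactly
NTSC_FRAME_SECONDS = Fraction(1001, 30000)  # duration of one NTSC color frame

def drift_seconds(programme_seconds):
    """Timing error after playing `programme_seconds` of material at the
    wrong member of a split-frequency pair (playback stretched by 1.001)."""
    return Fraction(programme_seconds) * (NTSC_FACTOR - 1)

def seconds_until_drift(target_drift_seconds):
    """Programme time needed for the error to reach `target_drift_seconds`."""
    return Fraction(target_drift_seconds) / (NTSC_FACTOR - 1)

print(float(drift_seconds(3600)))                      # 3.6 seconds after one hour
print(float(seconds_until_drift(NTSC_FRAME_SECONDS)))  # a bit over 33 seconds
```

Under these assumptions the audio slips by a full video frame in roughly half a minute.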

It is apparent that 60 kHz would have been the ideal sample rate for film and video use owing to the complete absence of leap frames, but from the perspective of professional audio-only recording it was considered wastefully high: 50 percent higher than the 40 kHz minimum rate required for a 20 kHz audio bandwidth. This left the leap-frame frequencies of 45, 48, 50, 52.5 and 54 kHz, all of which have periodicities of five video frames or less. In 1981, a large amount of extant hardware and software used sample rates of 50 kHz and 48 kHz. In the U.S., Soundstream and 3M were leaders in digital audio recording, and both had hardware using the 50 kHz sample rate, which of course had generated a considerable quantity of software using that sample rate. 3M had proposed the distribution of 5,005 digital audio samples at 50 kHz over six video fields (three frames) for NTSC television. 50 kHz may be derived by frequency division from 4.5 MHz, and it consequently synchronized with all television timing signals. It had the disadvantage that it required leap frames in 24 Hz and 525/60 monochrome television as well as NTSC television.
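The periodicities of the five leap-frame candidates can be tabulated with the same exact-fraction approach. The sketch below is illustrative only; the frame-rate labels and variable names are assumptions made for this example. A periodicity of 1 means no leap frames are needed in that system.

```python
from fractions import Fraction

FRAME_RATES = {
    "24": Fraction(24),
    "25": Fraction(25),
    "30": Fraction(30),
    "NTSC": Fraction(30000, 1001),  # exactly 30/1.001
}

CANDIDATES_HZ = [45_000, 48_000, 50_000, 52_500, 54_000]

def periodicity(sample_rate_hz, frame_rate):
    """Frames needed before a whole number of samples has elapsed."""
    return (Fraction(sample_rate_hz) / frame_rate).denominator

table = {
    fs: {name: periodicity(fs, rate) for name, rate in FRAME_RATES.items()}
    for fs in CANDIDATES_HZ
}

print("rate (kHz)  " + "  ".join(f"{name:>4}" for name in FRAME_RATES))
for fs, row in table.items():
    print(f"{fs / 1000:10.1f}  " + "  ".join(f"{n:>4}" for n in row.values()))
```

The table reproduces the figures in the text: 50 kHz repeats every three frames at 24, 30 and NTSC frame rates, 48 kHz needs leap frames only with NTSC, and no candidate exceeds a five-frame periodicity.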

48 kHz was a favorite frequency in Europe at that time because it related by a simple 3:2 ratio to the 32 kHz sample rate, and because it caused leap frames in only one system, NTSC television (conveniently, not the system used in Europe), where 8008 digital audio samples must be divided over five video frames. Decca had by that time produced a large amount of software using the 48 kHz sample rate. 48 kHz had the disadvantage of not being a submultiple of 4.5 MHz or of 13.5 MHz.

PSEUDO-VIDEO

In 1981, a published standard in Japan for recording digital audio as "pseudo-video" on consumer VCRs used the split-frequency pair 44.1 and 44.1/1.001 (about 44.0559) kHz, and Philips and Sony had proposed 44.1 kHz exactly for compact discs. Those frequencies were slightly low to accommodate a professional 20 kHz audio bandwidth, and thus the split frequencies 50.4 and 50.4/1.001 kHz were proposed for professional use; they are related to the 44.1 kHz pair by a factor of 8/7. Split frequencies, as mentioned earlier, create ambiguity, confusion and potential pitch errors, and they did not relate simply to the 32 kHz BBC/EBU sample rate. For these reasons, they fell out of consideration.

Many in the U.S. television industry liked 60 kHz as a standard sample rate because it was free of leap frames and split frequencies, and it synchronized readily with all timing signals used in 60 Hz and 50 Hz television systems, 24 Hz film and the 13.5 MHz component digital video sample rate. The professional audio industry, however, considered it wastefully high, and there was a quantity of 48 kHz software extant in Europe. Leap-frame frequencies did not appear to present any constraints on editing, mixing or switching, but some additional hardware was required to keep track of leap frames. The choice boiled down to two leap-frame sample rates, 50 kHz and 48 kHz. 50 kHz caused a three-frame periodicity with NTSC video, as opposed to the five-frame periodicity of 48 kHz, but it also caused a three-frame periodicity in 24 Hz and 30 Hz systems. 48 kHz caused leap frames only in NTSC. 48 kHz was readily derived by frequency division from standard input frequencies that are used to derive television frequencies, and it readily synchronized with all video signals. It further bore a simple relationship with the 32 kHz BBC/EBU sample rate, and it enjoyed widespread use in Europe. Its selection as the professional digital audio sample rate involved some compromises, such as the requirements for some buffers and digital housekeeping when used with NTSC video. However, after two decades it has caused no serious problems in the NTSC world.