Audio synchronization
Analog audio is difficult to handle. There are frequency response problems, distortion problems (harmonic and intermodulation), noise problems, and wow and flutter. Then there is the signal level monitoring with two opposing concepts: the VU meter and the PPM. Not surprisingly, analog video is also difficult to handle. There are linear distortion problems (poor frequency response, chrominance to luminance delay and gain inequality to name a few), nonlinear distortion problems (including luminance nonlinearity, differential gain and differential phase) and noise problems.
In the NTSC world, picture information is transmitted in a synchronized manner. Each picture requires precisely the same amount of time to be transmitted. The analog audio accompanying the NTSC video signal is continuous, and, like the analog video, is transmitted in real time.
Early television production had no lip-sync problems. These started appearing with the introduction of video frame synchronizers, which delay the video by one frame (33.3 msec) or more with respect to the accompanying audio. A delay of this size is barely noticed by the viewer and was ignored by broadcasters. Apart from lip-sync, there were no audio synchronizing concerns.
Digital signal processing, recording and distribution eliminate many of the cumulative distortions affecting analog audio signals. 20-bit digital audio equipment provides a 120dB dynamic range, guaranteeing excellent SNR and 20dB of headroom, which allows the use of any level indicator, including the infamous VU meter, while avoiding clipping. 48kHz sampling guarantees a 20kHz bandwidth without aliasing.
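The figures quoted above can be checked with a little arithmetic. The following is a minimal sketch, assuming the usual rule of thumb of roughly 6.02dB of dynamic range per bit and the Nyquist limit of half the sampling frequency. The function names are hypothetical and chosen only for illustration.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in dB (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** bits)


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented without aliasing."""
    return sample_rate_hz / 2


if __name__ == "__main__":
    print(f"20-bit dynamic range: {dynamic_range_db(20):.1f} dB")
    print(f"Nyquist limit at 48 kHz: {nyquist_hz(48_000):.0f} Hz")
```

The 24kHz Nyquist limit leaves a 4kHz margin above the 20kHz audio band for the anti-aliasing filter to roll off.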
Digital audio equipment can consist of an assembly of digital black boxes connected using analog I/O ports. In this case, there is no need for synchronization of the digital equipment. This analog approach, however, results in multiple conversion artifacts and should be avoided.
The AES/EBU digital audio interconnect standard eliminates the multiple conversion problems and ensures faultless, secure and reliable equipment interconnections, especially when 75Ω coaxial cable is used. Digital interconnection of digital audio equipment in an audio studio, including the digital audio mixer, requires that all audio equipment be synchronized to a common reference. With signal sources using sampling frequencies other than the standard 48kHz, audio standards converters are required. External digital audio signal sources need to be passed through an audio frame synchronizer locked to the local reference signal before further processing. These requirements are relatively easy to satisfy and problems are not normally encountered, other than learning the basics of synchronization, a topic unheard of in analog audio environments.
SDTV digital audio/video studios
The SDTV (525/59.94) 10-bit CCIR 601 4:2:2 component digital video format is a mature and cost-effective technology. A wide choice of production equipment is available on the market. The SMPTE 259M bit-serial interconnect standard ensures secure and reliable digital video equipment interconnections.
Using digital audio production facilities in a digital video studio requires that the 48kHz audio sampling frequency be coherent with the 27MHz 4:2:2 time division multiplexing frequency, i.e. derived from a common reference. This is required to allow the embedding of the digital audio into the digital video datastream. Beyond synchronizing the audio and video sampling frequencies, a further problem arises in North America: at the 29.97 frames-per-second rate, an integer number of audio samples (8008) occurs only once every five video frames. Ideally, all digital audio sources have to be synchronous and timed according to the five-frame sequence. (See Figure 1.) This poses some problems when embedded audio signal sources have to be switched "live" using an embedded routing switcher. Because the five-frame timing sequence cannot be easily and inexpensively controlled, the live switching of non-timed embedded audio/video signal sources is often accompanied by audio clicks. The problem can be solved either by using a V-fade type switch or by routing video and audio digital signals separately.
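The five-frame sequence follows directly from the ratio of the two rates. The sketch below assumes the exact NTSC frame rate of 30000/1001 frames per second and uses exact fractions to avoid rounding error. The function names are hypothetical, and the per-frame distribution it prints is only one possible way of spreading the samples, not necessarily the one an embedding standard prescribes.

```python
import math
from fractions import Fraction

AUDIO_RATE = 48_000                    # audio samples per second
FRAME_RATE = Fraction(30_000, 1_001)   # exact NTSC frame rate (about 29.97)


def samples_per_frame() -> Fraction:
    """Exact (non-integer) number of audio samples in one video frame."""
    return Fraction(AUDIO_RATE) / FRAME_RATE


def frames_per_sequence() -> int:
    """Smallest number of frames containing a whole number of samples."""
    return samples_per_frame().denominator


def sample_distribution(n_frames: int) -> list[int]:
    """Integer sample counts per frame whose running total tracks the exact ratio."""
    spf = samples_per_frame()
    counts, previous_total = [], 0
    for k in range(1, n_frames + 1):
        total = math.floor(spf * k)
        counts.append(total - previous_total)
        previous_total = total
    return counts


if __name__ == "__main__":
    print(float(samples_per_frame()))      # 1601.6 samples per frame
    print(frames_per_sequence())           # 5 frames per sequence
    print(sample_distribution(5), sum(sample_distribution(5)))
```

Because 1601.6 is not an integer, individual frames must carry either 1601 or 1602 samples, and two sources whose sequences are out of step will disagree on the sample count at the switching point.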
Along with synchronizing and audio/video timing considerations, video equipment latency also has to be considered. Digital video production switchers, especially in combination with a DVE, introduce video signal delays known as video latency. A concatenation of several digital-processing elements can introduce considerable delays, which manifest themselves as lip-sync loss. For instance, a frame-synchronizer-processed external video source passing through a DVE can acquire a video latency on the order of 66.6 msec. Currently, the lip-sync problem is treated in one of three ways:
- The signals pass uncorrected. This approach is used quite frequently and leaves significant video latencies uncompensated.
- A fixed correction is applied, such as delaying the audio signal to match the video delay. This method can be used when the video signal path is unchanging and the delay can be measured and is generally known. Alternatively, when operational configurations change frequently and produce a variety of video latencies, a fixed audio delay may offer a compromise solution.
- The audio delay is caused to track the video delay. In this case, the audio delay tracks the difference in the timing of the input and the output video signals across an item, such as a frame synchronizer or a standards converter. Several manufacturers offer video frame synchronizers and standards converters with slaved audio delay units.
Clearly, the first approach is inadequate. The other two require the installation of many audio delay units. Unfortunately, these methods only compensate for the locally introduced video latencies and cannot correct for video latencies existing in the incoming signal. When the incoming signals exhibit time-varying lip-sync problems, operators may be assigned to manually adjust the audio delay. This is time-consuming and costly.
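For the fixed-correction approach, the audio delay needed to match a known video latency is simple to compute. The following is a minimal sketch, assuming 48kHz audio and the exact NTSC frame rate of 30000/1001 frames per second. The function name and parameters are hypothetical.

```python
from fractions import Fraction

NTSC_FRAME_RATE = Fraction(30_000, 1_001)  # exact NTSC frame rate (about 29.97)


def audio_delay_samples(video_delay_frames: int,
                        frame_rate: Fraction = NTSC_FRAME_RATE,
                        audio_rate: int = 48_000) -> int:
    """Audio delay, in whole samples, that matches a video delay given in frames."""
    delay_seconds = Fraction(video_delay_frames) / frame_rate
    return round(delay_seconds * audio_rate)


if __name__ == "__main__":
    # Example: a frame synchronizer followed by a DVE, one frame of delay each.
    frames = 2
    samples = audio_delay_samples(frames)
    print(f"{frames} frames -> {samples} samples "
          f"({1000 * samples / 48_000:.2f} msec)")
```

A tracking delay unit would perform the same conversion continuously, using the measured difference between input and output video timing in place of a fixed frame count.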
MPEG-2 and lip-sync problems
In the compressed digital world, the amount of data transmitted to represent I, P and B pictures is variable, depending on a large number of factors. The compressed digital television world lacks the concept of synchronism between display and transmission. To address this problem, MPEG-2 provides for the transmission of decoder timing reference information in the adaptation headers of selected packets. The system clock is 27MHz, samples of which are transmitted in the Program Clock Reference (PCR) field. The decoder audio and video sample clocks are derived from the system clock, which is itself recovered from the PCR.
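To make the PCR mechanism concrete, the sketch below decodes a PCR sample into 27MHz clock ticks. It assumes the 48-bit field layout used in the MPEG-2 transport stream adaptation field: a 33-bit base counted at 90kHz, six reserved bits, and a 9-bit extension counting 0 to 299 at 27MHz. The helper names are hypothetical, and the encoder is included only so the example can be exercised without a real transport stream.

```python
SYSTEM_CLOCK_HZ = 27_000_000  # MPEG-2 system clock


def decode_pcr(field: bytes) -> int:
    """Return the PCR value in 27 MHz ticks from the 6-byte PCR field."""
    if len(field) != 6:
        raise ValueError("PCR field must be exactly 6 bytes")
    value = int.from_bytes(field, "big")
    base = value >> 15            # top 33 bits, 90 kHz units
    extension = value & 0x1FF     # bottom 9 bits, 27 MHz units (0..299)
    return base * 300 + extension


def encode_pcr(ticks: int) -> bytes:
    """Pack 27 MHz ticks into a 6-byte PCR field (reserved bits set to 1)."""
    base, extension = divmod(ticks, 300)
    base &= (1 << 33) - 1         # the base wraps at 33 bits
    value = (base << 15) | (0x3F << 9) | extension
    return value.to_bytes(6, "big")


def pcr_seconds(ticks: int) -> float:
    """Convert 27 MHz ticks to seconds."""
    return ticks / SYSTEM_CLOCK_HZ


if __name__ == "__main__":
    field = encode_pcr(27_000_123)
    ticks = decode_pcr(field)
    print(ticks, pcr_seconds(ticks))
```

A decoder compares successive decoded values against its local 27MHz oscillator and steers that oscillator to follow the encoder's clock.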
MPEG-2 features a timing model that guarantees the accumulated delay from the MPEG-2 encoder to the MPEG-2 decoder is kept constant. The decoder can thus be designed to compensate for this delay. The contributing factors are:
- the encoding process (DCT, VLC, RLC);
- encoder buffering;
- decoder buffering;
- the decoding process; and
DTV contributes to the lip-sync problem because early DTV implementations require numerous video format conversions. The various types of format conversions may not always be predictable. Among the various scenarios will be HDTV to SDTV and SDTV to HDTV conversions including a variety of aspect ratio conversions. As each of these conversions will generally require a frame memory, the resulting accumulated video latencies, if not eliminated or at least reduced, will prove to be unacceptable to the viewing public.
Additionally, the type of format conversion and the equipment used will vary from location to location. It is expected that network origination centers will use a limited and predictable number of format conversions and will thus be able to predict and control lip sync. The operational configurations and equipment used by network affiliates vary, so each location will have to apply specific means of lip-sync control. When you realize that a great many signal sources and destinations will still be analog for the foreseeable future, requiring a large number of ADCs and DACs, it is evident that DTV will increase the occurrence of lip-sync problems.
The DTV standards provide for the transmission of six audio channels (5.1). Current HDTV VTRs can handle only four discrete audio channels (two AES/EBU bitstreams). Handling six audio channels is quite a challenge. One possibility is to use compression to increase the carrying capability of one AES/EBU datastream. Another possibility is using a multichannel external digital audio tape recorder. This audio tape recorder will have to be slaved to the VTR using timecode. Consider the problems associated with locking a timecode generator based on NTSC 59.94 interlaced fields per second to a 1920x1080 HDTV VTR operating at 60 interlaced fields per second. It should be clear that it is quite impractical to operate a single teleproduction center simultaneously with 59.94 and 60 fields per second. Don't forget that we also have the choice of using the 1280x720 format featuring 60 progressive frames per second. Undoubtedly solutions will be found and this scenario will fade into oblivion. But, in the meantime, we will have to train our ear/brain mechanism to accept ever increasing lip-sync problems.