A/V Synchronization: How Bad Is Bad?

There have recently been a number of complaints registered in this publication and others about bad audio/video synchronization, also known as bad lip-sync. (It is the writer's belief, by the way, that the first such complaints to show up in this publication appeared in this very column.)

These artifacts of the digital age have been with us for some time now, but the potential for them to become more severe is growing apace, as we subject television audio and video to increasingly long chains of digital processing. Each digital process adds its own delay, and the amounts of delay added to video and audio signals respectively are usually different. If the difference in audio and video delays is not compensated, A/V synchronization will be noticeably compromised. In this day of end-to-end digital acquisition, storage, processing and transmission, each step in the long, meandering trail from camera and microphone to living room harbors the potential to make its own contribution to the composite synchronization error.

THINGS CHANGE

Let us begin with what we know for certain about this problem. We know that in the early days of live television, lip-sync errors were not a problem, as long as the microphone was sufficiently close to the performer. All the systems used to carry the audio and video signals were analog, and none had the potential to impose a significant differential A/V delay. But things have changed since Uncle Miltie's day. The introduction of digital devices in television, beginning with VTR time base correctors and digital frame synchronizers, made it more likely that significant differential A/V timing errors would occur, and the situation has become progressively more perilous.

We know that because the speed of sound in air is far lower than the speed of light, it is natural to observe audio lagging video. The farther away from the observer a sound is made, the longer it takes the sound to reach the observer, while the accompanying visual stimulus will always appear to be instantaneous, as long as we are considering events that happen here on Earth. We learned about this in high school science, illustrated with the example of a distant hammerer: The farther the observer from the hammerer, the longer the delay between seeing the blow and hearing it. Conversely, it is completely unnatural to hear the sound of an event before we see it happen.

We thus have a greater tolerance for audio lagging video than for audio leading video. This is unfortunate, as we also know that the delays imposed by digital systems on video are typically greater than those imposed on audio, and this results in a tendency for audio signals to precede associated video signals in digital television systems.
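
The distant-hammerer effect described above is easy to put numbers on. The following is a minimal sketch, not anything from the studies cited in this column; it assumes a speed of sound of about 343 meters per second (dry air at roughly 20 degrees C) and treats light as arriving instantaneously. The function name is purely illustrative.

```python
# Assumption: speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0


def acoustic_lag_ms(distance_m: float) -> float:
    """Milliseconds by which sound trails sight for a source distance_m away.

    Light travel time is ignored, which is reasonable at earthly distances.
    """
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


if __name__ == "__main__":
    # Print the natural audio lag at a few illustrative distances.
    for distance_m in (1, 10, 34.3, 100):
        print(f"{distance_m:>6} m -> {acoustic_lag_ms(distance_m):6.1f} ms")
```

Under that assumption, a source about 10 meters away is heard roughly 29 milliseconds after it is seen, and one about 34 meters away roughly 100 milliseconds after, which gives a feel for why a modest audio lag seems natural.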

The foregoing points are generally known and agreed upon. There is less general agreement, however, on how much A/V sync error is too much. Those who served on the audiovisual squad that projected films back in school (in the days when they showed films in school) worked by the rule of thumb that to avoid complaints, the film sound could not lead the picture by more than one frame or lag it by more than two frames. These were 24 Hz frames, so the tolerance in that rule of thumb was about +42/-83 milliseconds. An article in a 1951 SMPTE Journal describes some A/V synchronization testing done in the 1940s by Bell Laboratories. This study concluded that the threshold of detectability for A/V "simultaneity" errors was reached when audio led video by more than 35 milliseconds, or lagged video by more than 100 milliseconds. This is pretty close to +1/-3 NTSC frames, and in good agreement with the audiovisual squad's rule of thumb, particularly when considering that the audiovisual squad had to work in increments of an entire film frame.
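
The frame-to-millisecond conversions in this paragraph can be checked with a few lines of arithmetic. This is only a sketch; the helper names are illustrative, and the NTSC frame rate is taken as 30000/1001 (about 29.97) frames per second.

```python
FILM_HZ = 24.0
NTSC_HZ = 30000.0 / 1001.0  # about 29.97 frames per second


def frames_to_ms(frames: float, frame_rate_hz: float) -> float:
    """Duration of a number of frames, in milliseconds."""
    return frames * 1000.0 / frame_rate_hz


def ms_to_frames(duration_ms: float, frame_rate_hz: float) -> float:
    """Number of frames spanned by a duration given in milliseconds."""
    return duration_ms * frame_rate_hz / 1000.0


if __name__ == "__main__":
    # Audiovisual squad rule of thumb: lead of 1 film frame, lag of 2 film frames.
    print(f"film lead limit: {frames_to_ms(1, FILM_HZ):.1f} ms")
    print(f"film lag limit:  {frames_to_ms(2, FILM_HZ):.1f} ms")
    # Bell Laboratories thresholds (35 ms lead, 100 ms lag) in NTSC frames.
    print(f"35 ms  = {ms_to_frames(35, NTSC_HZ):.2f} NTSC frames")
    print(f"100 ms = {ms_to_frames(100, NTSC_HZ):.2f} NTSC frames")
```

Unrounded, one and two film frames come to roughly 41.7 and 83.3 milliseconds, and the Bell Laboratories thresholds work out to just over one and just under three NTSC frames.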

In early 2001, the Implementation Subcommittee of the ATSC made a finding that described the potential for A/V sync errors to accumulate in the end-to-end DTV production, post production, distribution, broadcast and reception chain. It stated that the audio signal at the input terminals to the DTV emission encoder should not lead the video signal by more than 15 milliseconds, or lag it by more than 45 milliseconds. The rationale was that the composite, end-to-end system error should be no worse than +30/-90 milliseconds, with half of that error budgeted to the part of the system preceding the encoder (where presentation time stamps are added to the audio and video frames) and the other half to everything following that point, all the way to the DTV viewer. The end-to-end figure is in reasonable agreement with the threshold figure established by the Bell Laboratories study and with the audiovisual squad's rule of thumb.
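
The budgeting described above amounts to splitting an end-to-end tolerance window at the encoder. The sketch below illustrates that split; the type and function names are hypothetical, and it assumes that worst-case errors in the two halves of the chain simply add, which is a simplification rather than anything stated in the Finding.

```python
from typing import NamedTuple, Tuple


class SyncWindow(NamedTuple):
    """Tolerance window in milliseconds: how far audio may lead or lag video."""
    max_lead_ms: float
    max_lag_ms: float


def split_budget(
    total: SyncWindow, upstream_share: float = 0.5
) -> Tuple[SyncWindow, SyncWindow]:
    """Divide an end-to-end window between the chain before and after the encoder.

    Assumes worst-case errors add linearly, so the two parts sum to the total.
    """
    upstream = SyncWindow(
        total.max_lead_ms * upstream_share,
        total.max_lag_ms * upstream_share,
    )
    downstream = SyncWindow(
        total.max_lead_ms - upstream.max_lead_ms,
        total.max_lag_ms - upstream.max_lag_ms,
    )
    return upstream, downstream


if __name__ == "__main__":
    # End-to-end figure quoted in the text: +30/-90 ms.
    before_encoder, after_encoder = split_budget(SyncWindow(30.0, 90.0))
    print("before encoder:", before_encoder)
    print("after encoder: ", after_encoder)
```

With an even split, each half of the chain gets a +15/-45 millisecond allowance, matching the figure given for the encoder input.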

Fig. 1: Simplified reference chain for television sound/vision timing. (Courtesy ITU-R)
Now, let's look at an international standard. The International Telecommunication Union Radiocommunication Sector, known as ITU-R, has a recommendation in force: ITU-R BT.1359-1 (1998), Relative Timing of Sound and Vision for Broadcasting. BT.1359-1 contains a coarse block diagram (see Fig. 1) of the camera/microphone-to-viewer chain, and recommends tolerances at some points along the chain. It sets as the timing zero, or reference point, the output of the "final programme source selection element," which we might recognize as the master control switcher. It recommends that for the end-to-end path from source origination point to viewer, audio should neither lead video by more than 90 milliseconds nor lag video by more than 185 milliseconds. It further recommends that the A/V timing error between the image source (camera/microphone) and the zero reference point should be no worse than +25/-100 milliseconds, and that the A/V timing difference in the path from the output of the final program source selection element to the input to the transmitter for emission should be no worse than +22.5/-30 milliseconds.
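
For anyone measuring a plant against these numbers, the check is a simple range test. The sketch below uses only the tolerances quoted in this paragraph; the segment names and function name are illustrative labels, not terms from the Recommendation. It follows the sign convention used in this column: positive means audio leads video, negative means audio lags.

```python
# Tolerances as quoted above, in milliseconds: (maximum lead, maximum lag).
# The dictionary keys are illustrative labels, not ITU terminology.
BT1359_TOLERANCES_MS = {
    "source_to_reference": (25.0, 100.0),
    "reference_to_transmitter": (22.5, 30.0),
    "end_to_end": (90.0, 185.0),
}


def within_tolerance(offset_ms: float, segment: str) -> bool:
    """Return True if a measured A/V offset falls inside the segment's window.

    offset_ms > 0 means audio leads video; offset_ms < 0 means audio lags.
    """
    max_lead_ms, max_lag_ms = BT1359_TOLERANCES_MS[segment]
    return -max_lag_ms <= offset_ms <= max_lead_ms


if __name__ == "__main__":
    # Hypothetical measurements, chosen only to exercise the check.
    print(within_tolerance(20.0, "source_to_reference"))
    print(within_tolerance(-40.0, "reference_to_transmitter"))
    print(within_tolerance(-150.0, "end_to_end"))
```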

It is noted that the recommended tolerance figure between the source and the zero reference point, +25/-100 milliseconds, is close to the overall composite detection threshold figure indicated by Bell Laboratories in the 1940s, the audiovisual squad's rule of thumb and the Implementation Subcommittee's Finding; 1359's overall system composite tolerance figure, at +90/-185 milliseconds, is appreciably wider.

SLOW GOING

This 1998 recommendation is the result of ITU's research that began in the early 1990s (when it was still known as the CCIR), lending credence to one person's observation that in the ITU, work begins slowly and decelerates from that point. ITU recommendations must take into consideration contributions, opinions and outright pressures from groups the world over, some of which are carrying what might be characterized as their own agendas.

Appendix 1 to Rec. 1359-1 is an "Explanation for the selection of the recommended value for sound/vision timing difference." This appendix begins by stating that it is known from many years of experience that the relative timing between picture and sound in film projection is very important, and that there is an identifiable point at which the timing error becomes objectionable to the viewer. It then cites ITU-R BR.265 (Standards for the International Exchange of Programmes on Film for Television Use), stating that Rec. 265 "...indicates that the precision of accuracy of location of sound and picture information should be within +/- half a frame. For 24 fps film, this is an acceptable variation of about +/-22 ms." But, "Subjective evaluations undertaken in Japan, Switzerland and Australia show a high degree of similarity in the sensitivity of viewers to errors in sound/vision timing in television material for NTSC and PAL systems. Tests conducted have shown that the thresholds of detectability are about +45 ms to -125 ms and thresholds of acceptability are about +90 ms to -185 ms on the average."

So Appendix 1 indicates that in film for use in television, A/V sync should be very tight, at +/-22 milliseconds. It then cites several studies that place the threshold of error detectability at +45/-125 milliseconds, a figure in reasonable agreement with the audiovisual squad, the 1940s Bell Laboratories study and the overall tolerance figure in the Implementation Subcommittee Finding of 2001. However, the Recommendation itself sets the composite system tolerance just within the range of "acceptability...on average." It may safely be said that some feel this places the bar too low, and that the Rec. 1359 number should be tightened. But at the pace at which the ITU operates, do not expect this to happen any time soon.
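
The gap between "detectable" and "acceptable" is the crux of the complaint, and it can be illustrated by sorting a measured offset into one of three bands. This is a sketch built on the Appendix 1 threshold figures quoted above; the band labels and function name are this writer's own shorthand, and the thresholds are averages, so individual viewers will differ. Positive offsets mean audio leads video.

```python
# Threshold figures quoted from Appendix 1, in ms: (maximum lead, maximum lag).
DETECTABILITY_MS = (45.0, 125.0)
ACCEPTABILITY_MS = (90.0, 185.0)


def classify_offset(offset_ms: float) -> str:
    """Sort an A/V offset into a band; the labels are illustrative shorthand.

    offset_ms > 0 means audio leads video; offset_ms < 0 means audio lags.
    """

    def inside(window: tuple) -> bool:
        max_lead_ms, max_lag_ms = window
        return -max_lag_ms <= offset_ms <= max_lead_ms

    if inside(DETECTABILITY_MS):
        return "undetectable"
    if inside(ACCEPTABILITY_MS):
        return "detectable but acceptable"
    return "unacceptable"


if __name__ == "__main__":
    # Hypothetical offsets, chosen to land in each band.
    for offset_ms in (30.0, 60.0, -150.0, -200.0):
        print(f"{offset_ms:+7.1f} ms: {classify_offset(offset_ms)}")
```

Anything that lands in the middle band passes the Recommendation's end-to-end tolerance while still being noticeable to the average viewer in those tests, which is the objection raised here.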

The ITU documents cited above may be obtained from the International Telecommunication Union: ITU-R Rec. BT.1359-1 (1998), Relative Timing of Sound and Vision for Broadcasting (the quotations are from Appendix 1 to this document); and ITU-R Rec. BR.265-8 (1997), Standards for the International Exchange of Programmes on Film for Television Use.