Added Video Effects Aggravate Lip-Sync

Eleven years ago, the world of television networking and broadcasting was very different from the one we work in today. DTV broadcasting was under development but not yet implemented. Network and local station technical plants were largely analog.

Although many pieces of digital equipment were used, and digital stages were found within equipment that was fundamentally analog in nature, our plant signal infrastructures were analog, as were the network-to-affiliate distribution systems.

Today, our technical plant infrastructures are well on their way to becoming completely digital. Going digital has solved a number of the technical problems we had in the analog television era, but it has generated a few of its own.

READ MY LIP-SYNCS

One technical problem we had in 1993 is arguably more severe today than it was then, and that is audio-video synchronization. Our plants have become largely digital--digital equipment interconnected using digital interfaces. Distribution from network to affiliate is increasingly becoming digital as well. It is instructive to see where we were 11 years ago, and to consider where we are today, on the audio-video synchronization issue.

The following is a paper on A/V synchronization written in 1993. Note that although the name of the International Radio Consultative Committee (CCIR) had already been changed to International Telecommunication Union, Radiocommunication Sector (ITU-R), its documents were at that time still known as CCIR documents.

CCIR REPORT 412-4

The introduction to CCIR Report 412-4, "Transmission Time Differences Between the Sound and Vision Components of a Television Signal," states "In the early days of long-distance transmission, when the video and sound were transmitted via different facilities, there occasionally existed a perceptible lack of simultaneity between sound and vision.... Today, even though the sound and vision of a television program may be transmitted via different facilities, the transmission velocities used are such that little or no lack of simultaneity between sound and vision is experienced."

It is true that the diplexed systems typically used to transmit television video and audio over analog telco circuits, satellites and the like do not generate significant relative timing differences between audio and video. But if you've watched any television recently, you probably realize that there are ways to subvert relative A/V timing.

Consider what we glibly refer to as "analog video." Virtually every videotape recorder newer than a two-inch quad machine employs a digital time-base corrector. Digital frame synchronizers are used routinely to time externally produced analog and digital signals to the television plant, and a given video signal is frequently subjected to a cascade of two or more such frame synchronizers. Digital video effects units are used to put pictures within pictures. All such devices have one thing in common--some finite amount of time delay caused by analog/digital conversions and digital processing.

PROCESSING = DELAY

The result of the extensive digital processing of virtually every video signal, analog or digital, is delay relative to its associated audio signal. Digital processing may also be applied to the audio signal, but the delays associated with digital audio processing are typically significantly shorter than those that are imposed on the video signal.

The most straightforward solution to this problem is to convert the audio signal to the digital domain, and then delay it until it is time-coincident with the video signal; however, this is not always done.

Human perception of A/V timing disparities unfortunately works against the television engineer. Our sensory systems find it much easier to accept sound delay with respect to picture than they do picture delay with respect to sound. Up to a point, it is natural for sound to lag behind vision, the lag increasing with distance between stimulus and observer, because sound travels much more slowly than does light. We all remember the example we learned in high school of someone hammering at a distance, in which we see the hammer strike well before we hear the impact, or the other classic example, lightning followed by thunder.

There is no natural situation in which the sound of a physical event precedes its visual perception, so the human sensory system has a much lower tolerance for audio preceding than for audio following associated video. Unfortunately, the audio-video world in which we currently work is far more likely to produce video delay greater than audio delay.

As things in our television world become more digital in nature, both audio and video are increasingly subjected to digital processing and other actions that generate delays. In these cases, audio and video are both delayed but usually by different amounts. Fortunately, having both audio and video in the digital domain makes it relatively easy to put them back into synchronism.

There is, of course, general recognition of this problem. The ITU-R has work in progress to specify tolerances for delay between video and its accompanying audio. Those familiar with the work of this international standards-setting body know that, being both international and bureaucratic, its wheels turn rather slowly. Let's review some of the organization's work on audio-video timing.

RECOMMENDED LIMITS

CCIR Report 1081 refers to studies that have been carried out by the European Broadcasting Union and in Australia suggesting that delay of sound with respect to picture on the order of 40 ms, corresponding to a path distance of about 40 feet from the listener, is often detectable.
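The relationship between sound delay and path distance can be checked with a short calculation. The sketch below is illustrative only: the speed of sound (343 m/s, roughly dry air at 20 degrees C) and the function name are assumptions, not figures from the CCIR report. Under that assumption, 40 ms works out to a path in the neighborhood of 40 to 45 feet, in line with the rough figure quoted above.

```python
# Assumed speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0
FEET_PER_METER = 3.28084


def sound_path_feet(delay_ms: float) -> float:
    """Distance in feet that sound travels in delay_ms milliseconds."""
    return SPEED_OF_SOUND_M_PER_S * (delay_ms / 1000.0) * FEET_PER_METER


# Print the equivalent path length for a few of the delays discussed here.
for delay_ms in (20, 40, 160):
    print(f"{delay_ms} ms of sound delay is about {sound_path_feet(delay_ms):.0f} ft of path")
```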

The Australian CCIR input document cited a 20 ms sound advance with respect to picture, and a 40 ms sound delay with respect to picture as "detectable" in material with frequent aural and visual cues, such as speech, and a sound advance of 40 ms or a sound delay of 160 ms as "subjectively annoying," even with relatively uncritical material.

The EBU recommends tolerances for A/V outputs that are intended to feed broadcasting transmitters of 40 ms sound advance and 60 ms sound delay, with respect to picture.

The digital processing delays referred to are cumulative, and although any single delay may be small, when cascaded, their combined effect can be devastating. While no ITU-R recommendation yet exists for end-to-end A/V synchronism, there is one for so-called "connections," or point-to-point links.
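To see how quickly cascaded delays add up, the sketch below sums a chain of video and audio delays and labels the result using the Australian figures quoted earlier. The per-device delays (33.4 ms of video delay per stage, 5 ms of audio delay) are hypothetical values chosen for illustration, the function names are invented here, and treating each quoted figure as a hard threshold is a simplifying assumption.

```python
# Detectability figures (ms) from the Australian input document cited above.
DETECTABLE_ADVANCE_MS = 20.0
DETECTABLE_DELAY_MS = 40.0
ANNOYING_ADVANCE_MS = 40.0
ANNOYING_DELAY_MS = 160.0


def net_sound_offset_ms(video_delays_ms, audio_delays_ms):
    """Net timing of sound relative to picture after a cascade of devices.

    Positive means sound lags the picture; negative means sound leads it.
    """
    return sum(audio_delays_ms) - sum(video_delays_ms)


def classify(offset_ms):
    """Label an offset, treating the cited figures as hard thresholds."""
    if offset_ms < 0:
        amount, detectable, annoying = -offset_ms, DETECTABLE_ADVANCE_MS, ANNOYING_ADVANCE_MS
    else:
        amount, detectable, annoying = offset_ms, DETECTABLE_DELAY_MS, ANNOYING_DELAY_MS
    if amount >= annoying:
        return "subjectively annoying"
    if amount >= detectable:
        return "detectable"
    return "below cited thresholds"


def compensating_audio_delay_ms(offset_ms):
    """Extra audio delay needed to restore coincidence when sound leads."""
    return max(0.0, -offset_ms)


# Hypothetical per-device delays in ms, chosen only for illustration.
one_stage = net_sound_offset_ms([33.4], [5.0])
three_stages = net_sound_offset_ms([33.4, 33.4, 33.4], [5.0])
for label, offset in (("one stage", one_stage), ("three stages", three_stages)):
    print(label, round(offset, 1), classify(offset),
          round(compensating_audio_delay_ms(offset), 1))
```

With these assumed numbers, a single video processing stage already puts the sound far enough ahead of the picture to be detectable, and three stages in cascade push it well into the annoying region unless the audio is delayed to match.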

CCIR Recommendation 717 states "...that for any connection used for the international exchange of television signals, the time difference between the sound and vision components should not exceed 20 ms if the sound is advanced with respect to the picture or 40 ms if the sound is delayed with respect to the picture."

Adherence to this recommendation would, according to the Australian study, keep the relative A/V timing within the detectable, but not subjectively annoying, region.

A draft recommendation has been produced by ITU-R Working Party 11C [now absorbed into Study Group 6] for end-to-end relative picture-sound timing. It specifies some curious numbers, in that the tolerances are twice as tight for 50 Hz television systems as for 60 Hz systems. The history of the disparity between tolerances in that draft is unknown to the author, but it seems reasonable to expect that the human sensory system's sensitivity to relative picture and sound timing discrepancies is pretty much the same in areas that use 60 Hz television systems as it is in those that use 50 Hz systems.

The draft recommends tolerances for 50 Hz television systems be set at 20 ms for a sound-leading picture and 40 ms for a sound-lagging picture. This conforms to Recommendation 717 and corresponds to +1, -2, 625/50 video fields. The same draft recommends tolerances of 33 ms when sound leads the picture and 66 ms when sound lags the picture for 525/60 systems. This corresponds to +2, -4, 525/60 video fields, or a tolerance half as tight as that for 625/50 systems. It also pushes the tolerance for sound-leading picture rather close to the "subjectively annoying" point cited by the Australian study.

Seemingly, a single set of tolerance numbers, or at least two sets that correspond to the same number of fields within each video system, would make more sense than the numbers cited above. If the 625/50 tolerances are translated into 525/60 numbers, they become 16 ms when sound leads the picture and 33 ms when sound lags the picture, corresponding to +1, -2, 525/60 fields.
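The field arithmetic behind these numbers is easy to reproduce. The following sketch converts between field counts and milliseconds for the two systems; it assumes the nominal 59.94 Hz field rate (60000/1001) for 525/60, and the names used are illustrative. The 16, 33 and 66 ms figures quoted in the text are these results with the fraction dropped.

```python
# Nominal field rates in Hz. The 59.94 Hz value for 525/60 is an assumption.
FIELD_RATE_HZ = {
    "625/50": 50.0,
    "525/60": 60000.0 / 1001.0,
}


def fields_to_ms(system: str, fields: float) -> float:
    """Duration in milliseconds of the given number of fields."""
    return fields * 1000.0 / FIELD_RATE_HZ[system]


def ms_to_fields(system: str, ms: float) -> float:
    """Number of fields spanned by the given duration in milliseconds."""
    return ms * FIELD_RATE_HZ[system] / 1000.0


# Sound-lead and sound-lag tolerances in fields: the draft values, then the
# 625/50 field counts applied to 525/60 as suggested in the text.
cases = (
    ("625/50", 1, 2),
    ("525/60", 2, 4),
    ("525/60", 1, 2),
)
for system, lead_fields, lag_fields in cases:
    print(f"{system}: +{lead_fields}/-{lag_fields} fields = "
          f"{fields_to_ms(system, lead_fields):.1f} ms lead, "
          f"{fields_to_ms(system, lag_fields):.1f} ms lag")
```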

There seems to be no technical reason to expose NTSC viewers to subjectively annoying sound delays, and even less reason to institutionalize it in a worldwide standard.

This situation has gotten worse, in that the A/V sync recommendation by the ITU became even looser than it was when the above was written. Further, we often work in fully digital milieus today, and the potential for A/V sync errors is much greater than it was in 1993. The ATSC Implementation Subcommittee has generated a recommendation to fix this rather substantial problem in DTV and has brought it to the attention of the ITU. Let us hope that the problem is seriously addressed by the industry.