Maintaining lip sync

The problem of lip sync, or A/V synchronization, still daunts even the most sophisticated video operators. While various technical working groups grapple with the issue, new tools and processors keep emerging that can help the situation.

Causes and targets known

With digital processing now ubiquitous, video is delayed as it passes through different equipment, often by as much as several frames. While audio delay is usually compensated within this equipment, differences in the routing of audio and video signals sometimes go unaccounted for and create a problem. Also, some encode/decode processes do not produce a deterministic delay, which means that a correction has to be applied after the fact to maintain lip sync.

The limits of proper A/V sync are well established, if not uniquely specified. For film, the accuracy of the location of the sound record and corresponding picture should be within ±0.5 film frames, or about ±22ms. For video, ITU and others have suggested that the thresholds of timing detectability are about +45ms to -125ms, and the thresholds of acceptability are about +90ms to -185ms, where positive values mean the audio leads the video and negative values mean it lags. The ATSC recommends that the sound program never lead the video program by more than 15ms and never lag it by more than 45ms (±15ms).
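As a rough illustration, the windows quoted above can be turned into a simple check on a measured offset. This is only a sketch: the sign convention (positive means audio leads video), the function names and the label strings are assumptions made for the example, not part of any ITU or ATSC document.

```python
# Assumed sign convention: positive offset = audio leads video,
# negative offset = audio lags video. All values in milliseconds.

ATSC_MS = (-45.0, 15.0)             # lead of at most 15 ms, lag of at most 45 ms
DETECTABILITY_MS = (-125.0, 45.0)   # suggested thresholds of detectability
ACCEPTABILITY_MS = (-185.0, 90.0)   # suggested thresholds of acceptability


def within(offset_ms, window):
    """True if the offset falls inside the (low, high) window, inclusive."""
    low, high = window
    return low <= offset_ms <= high


def classify(offset_ms):
    """Return a coarse, illustrative label for a measured A/V offset."""
    if within(offset_ms, ATSC_MS):
        return "within ATSC recommendation"
    if within(offset_ms, DETECTABILITY_MS):
        return "below detectability threshold"
    if within(offset_ms, ACCEPTABILITY_MS):
        return "detectable but acceptable"
    return "unacceptable"
```

Because the ATSC window sits inside the detectability window, which in turn sits inside the acceptability window, testing the tightest window first gives the most specific label.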

Detection is half the battle

To maintain tight A/V synchronization, two actions must be taken: measurement and correction. Measurement can be done visually with a test pattern or automatically by various means. A correcting delay is then introduced in either the audio or video to compensate. Dynamically varying delays present a more difficult problem that can only be corrected automatically. This is usually done in the audio, where a change in delay is less perceptible.

Visual measurement typically involves an electronic version of the old clapperboard used in cinematography, with an easily identifiable strobed video frame, together with an audio tone burst, click or similar impulsive sound. The better of these will provide a continuous visual presentation of the in-sync point, together with leading and trailing video elements, such as rotating or oscillating visual objects. The test patterns can be hardware generated or can be contained within transport streams. When using such a scheme, the operator must manually change a compensating delay somewhere in the system.
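The clapperboard idea can also be expressed numerically. The sketch below assumes a captured test sequence is already available as a list of per-frame brightness values and a list of audio samples, both normalized so that 1.0 is full scale. The function names and threshold values are hypothetical, chosen only for illustration.

```python
def first_index_above(values, threshold):
    """Index of the first value whose magnitude reaches the threshold, or None."""
    for i, v in enumerate(values):
        if abs(v) >= threshold:
            return i
    return None


def av_offset_ms(frame_luma, audio, fps, sample_rate,
                 luma_threshold=0.8, audio_threshold=0.5):
    """Estimate the A/V offset from one flash frame and one click.

    Returns a positive number when the click arrives before the flash
    (audio leads video), negative when it arrives after, and None when
    either event is missing from the capture.
    """
    flash = first_index_above(frame_luma, luma_threshold)
    click = first_index_above(audio, audio_threshold)
    if flash is None or click is None:
        return None
    video_ms = flash * 1000.0 / fps          # time of the strobed frame
    audio_ms = click * 1000.0 / sample_rate  # time of the click onset
    return video_ms - audio_ms
```

Note that the flash can only be located to the nearest frame, so a single measurement of this kind is no finer than one frame period (40ms at 25 frames per second).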

Because these indicators are visible and audible, all of this must be done offline. One promising nonintrusive technology detects a face in the video and then compares selected sounds in the audio with the mouth shapes that create them in the video. The relative timing of these sounds and the corresponding mouth movements is analyzed to produce a measurement of the lip sync error. Currently, this would only work with talking heads, so its use as a continuous monitoring device is limited. Also, the reliability of this scheme may fall short of 100 percent, precluding its faithful use for automatic correction.

While a visual technique can give an empirical indication of the absolute delay in video frames, the accuracy depends on the skill of the observer. Automatic measurement is the next step toward a precise and self-correcting system. By tagging the A/V program at an upstream point, a downstream processor can detect any differential changes. One offline method is to use a visible time stamp consisting of a short-duration black cross on a stationary or quasi-stationary background test pattern. An automatic detecting unit can then sense the identifying signal and measure the timing of video signals in a plant or even at separate locations.

Delay compensation makes it work

By using the output of a nonintrusive delay measurement system, a downstream adjustable delay device can automatically provide an opposite correcting delay and ensure accurate lip sync. One such system used watermarking technology to embed timing data within the video itself. In that unit, a nonvisible watermark carries an audio timing reference, embedded at an early point in the video distribution chain. A processor at another point then detects the watermark and associates it with the audio signal. A variable audio delay then corrects the added delay. Unfortunately, the device is no longer made.
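The correcting element itself is conceptually simple: a buffer that holds audio back by the amount the video has been delayed. The following is a minimal sketch of a fixed sample delay, not a description of any particular product. The class and function names are hypothetical, and a real variable delay would also need to change its length smoothly to avoid audible clicks.

```python
from collections import deque


class AudioDelay:
    """Delay a stream of samples by a fixed number of samples."""

    def __init__(self, delay_samples):
        # Pre-fill with silence so the first outputs are zeros.
        self._buf = deque([0.0] * delay_samples)

    def process(self, sample):
        """Push one input sample and return the delayed output sample."""
        self._buf.append(sample)
        return self._buf.popleft()


def delay_samples_for(video_delay_ms, sample_rate):
    """Number of audio samples that matches a given video delay."""
    return round(video_delay_ms * sample_rate / 1000.0)
```

For example, a measured video delay of 40ms at a 48kHz sample rate corresponds to 1920 samples of audio delay.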

Another nonintrusive technology uses a digital signature that represents each frame of video together with the associated audio. A pair of upstream and downstream devices generates these signatures, and then the two signatures are correlated using an IP connection. While radical changes in the video (such as bug insertion) may prevent an accurate comparison from being made, quality metrics may be used to sound an alarm.

The user can mask off portions of the image to get a valid detection. Such a system is capable of operating over long distances, provided the audio-video signals and the IP connection can operate relatively synchronously. (See Figure 1.)
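The signature approach can be sketched in a few lines, under heavy simplification. Here each signature is assumed to be one number per video frame (for instance, an average brightness or an audio level), and the downstream sequence is searched for the lag at which it best matches the upstream one. The matching cost, the function names and the idea of using the residual cost as a quality metric are assumptions for illustration; commercial systems use their own proprietary signatures.

```python
def best_lag(upstream, downstream, max_lag):
    """Find the lag (in frames) at which downstream best matches upstream.

    Returns (lag, cost), where cost is the mean absolute difference at
    that lag. A high cost suggests the match is unreliable.
    """
    best, best_cost = None, None
    for lag in range(max_lag + 1):
        n = min(len(upstream), len(downstream) - lag)
        if n <= 0:
            break
        cost = sum(abs(upstream[i] - downstream[i + lag]) for i in range(n)) / n
        if best_cost is None or cost < best_cost:
            best, best_cost = lag, cost
    return best, best_cost


def av_error_ms(video_lag, audio_lag, fps):
    """Differential delay in ms; positive means audio now leads video."""
    return (video_lag - audio_lag) * 1000.0 / fps
```

Running the search separately on the video and audio signatures gives two path delays, and their difference is the lip sync error that a downstream audio delay would need to correct.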

Unfortunately, very few products are able to work automatically in this fashion. One manufacturer claims to have a proprietary method, most likely a watermarking technology, that nonintrusively corrects delays. The system consists of a preprocessor unit at the transmission origination point and a postprocessor at the reception point.

A variant of this technology uses a control input from a compatible video frame synchronizer. This automatically corrects independent variable delay sources by sampling the video at two points in a system and then providing a control signal to an audio delay unit. Such a unit does not actually correct A/V delay, but rather relative video delay. As such, the unit cannot account for an arbitrary transmission path that changes the video and audio delays differentially.

Work remains to be done

Several standardization groups continue to work on the A/V sync problem. SMPTE has created a special committee to address the matter (S22-Lip Sync Evaluation Committee). It issued a call for submissions in 2005 and intends to produce guideline documents. The ATSC Specialist Group on Video and Audio Coding (TSG/S6) has also established two working groups to gather implementation data and report back with recommendations. Unfortunately, neither group has publicly released any details of its work, and it's not clear when useful results will be announced.

This comes as one indicator of the enormity of the problem, which also extends into the consumer electronics realm. Earlier, the ATSC's Implementation Subcommittee released a recommendation that the additional delay budget needed in a receiver should be ±15ms from that indicated by the transport stream timing parameters.

To ensure broadcasters' efforts won't be undone, TV receiver manufacturers must be held accountable to this (or a tighter) standard. What manufacturers currently don't have, and may need, is an incentive to fix the problem.

Aldo Cugnini is a consultant in the digital television industry.

Send questions and comments to: aldo.cugnini@penton.com