Correcting lip sync errors

Audio/video synchronization error, better known as lip sync error, is a frustrating problem for both broadcasters and viewers. The complexity of the broadcast plant allows no simple solution, but practical remedies seem to be within our grasp.

The heart of the problem

First, a recap of the perceptual aspects of the problem. Subjective testing has shown that the detectability of an A/V sync mismatch spans a small window in time, on the order of tens of milliseconds, and is asymmetric. This appears to be a consequence of human acclimation to the laws of physics, which set the speeds of light and sound widely apart. We are conditioned to expect light and sound from a nearby object to reach us virtually simultaneously, and the sound to be delayed when the object emitting it is at a distance. For this reason, humans are accustomed to conditions where the A/V delay is zero, or skewed in the direction where the sound lags the image. (See Figure 1.)
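
To put numbers on this, a quick calculation (assuming sound travels at roughly 343m/s in air) shows how distance alone produces skews on the scale we tolerate:

```python
# Back-of-the-envelope: how far sound lags light, by distance.
# Assumes sound at ~343m/s in air; light arrives effectively instantly.
SPEED_OF_SOUND_M_S = 343.0

for distance_m in (1, 10, 34):
    delay_ms = distance_m / SPEED_OF_SOUND_M_S * 1000.0
    print(f"{distance_m:>3}m -> sound lags light by ~{delay_ms:.0f}ms")
```

Even a source 10m away puts the sound nearly 30ms behind the light, which is why we tolerate skew in that direction far better than the reverse.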

Digital processing equipment requires different processing times for video and audio content, primarily because of the disparate bandwidths of the digital data. With baseband SD video typically sampled at 13.5MHz and audio at 48kHz, any equipment processing the two signals will delay them by different amounts. Process HD video and throw in compression too, and the problem gets worse. While good equipment design can minimize the problem, what is also needed is a good way to measure and correct the error — ideally back to zero — once it occurs.
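
As a simplified illustration of how this happens, suppose a device buffers one full frame of video while processing audio in fixed-size blocks. The buffer sizes below are hypothetical, but the mechanism is general:

```python
# Hypothetical buffering in a single piece of equipment; real gear varies.
VIDEO_FRAME_RATE_HZ = 30000 / 1001   # ~29.97fps
AUDIO_SAMPLE_RATE_HZ = 48000
AUDIO_BLOCK_SAMPLES = 1024           # an assumed DSP block size

video_delay_ms = 1000.0 / VIDEO_FRAME_RATE_HZ                       # one frame buffered
audio_delay_ms = 1000.0 * AUDIO_BLOCK_SAMPLES / AUDIO_SAMPLE_RATE_HZ

print(f"video path: {video_delay_ms:.1f}ms, audio path: {audio_delay_ms:.1f}ms")
print(f"skew introduced: {video_delay_ms - audio_delay_ms:.1f}ms")  # ~12ms
```

Chain several such devices along a broadcast path and the error accumulates, by a different amount on every path.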

A feasible solution

Numerous tools are available to measure and correct A/V sync error offline, but what is really needed is an unobtrusive way to manage it on live signal paths. One solution is to apply watermarking to the audio and/or video. An encoder analyzes the envelope of the audio signal and from this generates a watermark that embeds timing information into the corresponding video. At a downstream point, the video and audio are analyzed, the watermark is decoded, and a measure of any A/V sync error introduced along the way can be produced. Of course, the encoding must be done at a point with known (preferably ideal) A/V sync.
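
To make the mechanism concrete, here is a minimal sketch in Python. It is not any vendor's actual algorithm; the envelope measure and the low-bit embedding are invented purely for illustration:

```python
import numpy as np

def envelope_mark(audio_block: np.ndarray, levels: int = 16) -> int:
    """Reduce an audio block (assumed floats in [-1, 1]) to a coarse
    envelope value in 0..levels-1."""
    rms = np.sqrt(np.mean(audio_block.astype(np.float64) ** 2))
    return min(int(rms * levels), levels - 1)

def embed_mark(frame: np.ndarray, mark: int) -> np.ndarray:
    """Toy embedding: hide the 4-bit mark in the low bits of the first pixels."""
    marked = frame.copy()
    for i in range(4):
        bit = (mark >> i) & 1
        marked.flat[i] = (int(marked.flat[i]) & 0xFE) | bit
    return marked

def extract_mark(frame: np.ndarray) -> int:
    """Recover the 4-bit mark from a (possibly delayed) video frame."""
    return sum((int(frame.flat[i]) & 1) << i for i in range(4))

# Downstream, the sequence of marks recovered from the video is compared
# against marks freshly computed from the accompanying audio; a consistent
# offset between the two sequences is the A/V sync error.
```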

Some years back, Tektronix sold just such a system. Unfortunately, the product was discontinued, in part due to an insufficient market. But there were technical challenges too. Two key requirements for a successful watermarking system are the invisibility of the watermark and its robustness through downstream signal processing. Video processing can damage a watermark, and audio processing can alter the signal envelope. Anecdotal experience with watermarking systems is mixed; noise reduction (denoising) is known to be particularly damaging because it can obliterate a watermark.

An alternative to watermarking is a fingerprinting technique. As with watermarking, the video and audio are processed at an upstream reference point, but in this case signatures (or fingerprints) are generated from the video and from the audio at the same sampling point and time. (See Figure 2.) These signatures are then combined into an A/V signature that can be sent asynchronously on a different path (or even at a different time) than the program content; offline storage of the video, audio and signatures is also possible. A downstream device extracts a similar pair of video and audio signatures; comparing the upstream and downstream signatures produces a measure of any A/V sync error introduced in between.
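
In sketch form, the idea looks like this. The signature functions below are deliberately crude stand-ins (real fingerprints are more robust and, as noted below, proprietary), but the comparison step is representative:

```python
import numpy as np

def video_signature(frames: np.ndarray) -> np.ndarray:
    """One value per frame: mean luminance, a crude stand-in for a real fingerprint."""
    return frames.reshape(len(frames), -1).mean(axis=1)

def audio_signature(audio: np.ndarray, samples_per_frame: int) -> np.ndarray:
    """One value per frame period: the RMS envelope of the audio."""
    n = len(audio) // samples_per_frame
    blocks = audio[: n * samples_per_frame].reshape(n, samples_per_frame)
    return np.sqrt((blocks ** 2).mean(axis=1))

def estimate_lag_frames(upstream: np.ndarray, downstream: np.ndarray) -> int:
    """Frame lag at which the downstream signature best matches the upstream one."""
    up = (upstream - upstream.mean()) / (upstream.std() + 1e-9)
    down = (downstream - downstream.mean()) / (downstream.std() + 1e-9)
    corr = np.correlate(down, up, mode="full")
    return int(np.argmax(corr)) - (len(up) - 1)

# Estimate the lag separately for the video and the audio signatures;
# the difference between the two lags is the A/V sync error, in frames.
```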

Some constraints on the use of fingerprinting

In order for such a system to work, the signature generation algorithm must reliably produce the correct signature despite video and audio processing, including bit rate compression, time compression, noise reduction and resolution conversion (resampling). Typically, the signatures are generated using hash functions, similar in concept to those used in cryptographic systems to process and manage encryption keys. There is an important difference, however: a cryptographic hash is designed so that any change to the input scrambles the output, while a fingerprinting hash must be perceptual, tolerating modest changes to the signal.

A key characteristic of a hash function is that it can reduce a large amount of data to a small data set. By using an appropriate hash function, small changes to the video and audio signals, of the kind expected from signal processing and compression, will not change the hashed result. The robustness of the system will ultimately be a function of the amount of intermediate signal processing, as well as the complexity of the A/V signature. In principle, an acceptable level of robustness should be achievable with the processing typically done after post production, at a data rate of less than 4kb/s for the signature data.
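
One published technique with exactly this property is the "average hash": downscale the image to a small grid of block means and set one bit per block by thresholding at the overall mean. The sketch below (an illustration, not necessarily what any shipping product uses) shows that mild noise flips at most a few of the 64 bits, where a cryptographic hash would change completely:

```python
import numpy as np

def average_hash(frame: np.ndarray, grid: int = 8) -> int:
    """Downscale to grid x grid block means, threshold at the overall mean,
    pack the resulting bits into one 64-bit integer."""
    h, w = frame.shape
    crop = frame[: h - h % grid, : w - w % grid]
    means = crop.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    bits = (means > means.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, (480, 720)).astype(np.float64)
noisy = frame + rng.normal(0, 2, frame.shape)   # mild processing noise

# Only a few of the 64 hash bits differ despite the added noise.
print(hamming(average_hash(frame), average_hash(noisy)))
```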

Systems based on fingerprinting are available or have been demonstrated by several different manufacturers. Although fingerprinting systems have been shown to work reliably, the makeup of the fingerprint has so far been proprietary, without a common definition. The SMPTE 22TV Lip Sync Ad Hoc Group, chaired by Graham Jones of NAB, is now investigating the possibility of producing a SMPTE standard for the fingerprint signal and/or the methods of metadata carriage. Readers are encouraged to contact Jones at gjones@nab.org for more information, or to participate.

Decoding equipment needs attention too

While a fingerprinting system can maintain accurate A/V sync, it can only do so at the point where fingerprint decoding equipment is deployed. In order to solve the A/V sync problem over the complete extent of the content distribution chain, attention must be paid to equipment, including consumer electronics. It is known that some implementations of MPEG decoders do not accurately maintain A/V synchronization, because mere compliance with the MPEG and ATSC standards does not by itself assure perfect A/V sync. Professional decoders, of course, should handle synchronization to the best state of the art. Consumer electronics, however, are often developed with performance limitations driven by cost. To address the issue, the CEA last year published CEA-CEB-20, “A/V Synchronization Processing Recommended Practice,” which outlines the steps that an MPEG decoder should take to ensure and maintain audio/video synchronization.

Among the recommendations made in CEA-CEB-20 are the continuous monitoring and processing of all Program Clock References (PCR), Presentation Time Stamps (PTS) and Decoding Time Stamps (DTS) available in the stream. Also described are the various timing mechanisms that should be present in a well-designed decoder, including startup, adjustment and steady-state conditions. Of particular interest is the discussion of packet timing, which can aid in the understanding of similar mechanisms in professional encoders and multiplexers — but that's another topic.
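
The steady-state arithmetic itself is simple. The sketch below assumes the standard 90kHz MPEG timestamp clock; the PTS values are purely illustrative:

```python
# A/V skew check a decoder might perform, using the 90kHz MPEG clock.
MPEG_CLOCK_HZ = 90000

def av_skew_ms(video_pts: int, audio_pts: int) -> float:
    """Positive result: the audio is presented later than the matching video.
    (A real decoder must also handle 33-bit PTS wraparound.)"""
    return (audio_pts - video_pts) * 1000.0 / MPEG_CLOCK_HZ

# Presentation times recovered from the stream for units meant to play together:
print(av_skew_ms(video_pts=900_000, audio_pts=903_600))   # 40.0ms of audio delay
```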

Aldo Cugnini is a consultant in the digital television industry.

Send questions and comments to: aldo.cugnini@penton.com