Managing lip sync

This is not the first time that the subject of A/V sync, or lip sync, has been covered in this column, nor will it be the last. While some industry organizations continue to study the issue, and a handful of products exist that either measure or control A/V sync, progress is slow in combating the problem. This month, we'll look at some of the lesser-understood technical factors contributing to the problem.

To recap the issue, correct A/V sync is necessary for program delivery so that the presentation retains a natural appearance. Studies have shown that a mismatch is detectable when the sound leads the video by more than 45ms or lags the video by more than 125ms. Various recommendations exist that put tighter bounds on acceptable performance. The ATSC, for example, recommends that the sound program should never lead the video program by more than 15ms and should never lag the video program by more than 45ms. But state-of-the-art systems and products are not yet at the point where this recommendation is always met.
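As a rough illustration, the two sets of limits quoted above can be expressed as a simple check. This is only a sketch: the function name, the category labels and the sign convention (positive means audio leads video) are assumptions made for the example, not part of any standard.

```python
# Sign convention (an assumption for this sketch):
# positive offset = audio leads video, negative offset = audio lags video.
DETECTABLE_LEAD_MS = 45   # detectability threshold quoted above
DETECTABLE_LAG_MS = 125
ATSC_MAX_LEAD_MS = 15     # ATSC recommendation quoted above
ATSC_MAX_LAG_MS = 45


def classify_av_offset(offset_ms: float) -> str:
    """Classify a measured A/V offset, in milliseconds, against both limits."""
    lead = max(offset_ms, 0.0)
    lag = max(-offset_ms, 0.0)
    if lead > DETECTABLE_LEAD_MS or lag > DETECTABLE_LAG_MS:
        return "detectable"
    if lead > ATSC_MAX_LEAD_MS or lag > ATSC_MAX_LAG_MS:
        return "outside ATSC recommendation"
    return "within ATSC recommendation"
```

Note the asymmetry: a 30ms audio lag passes the ATSC bound, while a 30ms audio lead does not.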

Compression complicates the problem

Audio and video will be differentially delayed when passing through different equipment (or improperly designed equipment). These differences in routing audio and video signals can create an A/V sync problem, especially when the delays change over time.

In addition to the problem of independent signal paths and processing, compression adds another variable to A/V sync mismatch. Not only are video and audio signals compressed using different algorithms, but more importantly, the differential delay between the compression paths is not constant in parts of the system. This is illustrated in Figure 1, together with the program clock reference (PCR) synchronizing element.

MPEG video compression, like most compression systems, uses different types of frames, resulting in different amounts of data for each frame in the coded bit stream. While the overall bit rate for such a system is constant (when using constant bit rate encoding), the number of bits used to code each frame varies around a target and is smoothed by a buffer.

However, the compressed audio does have a constant number of bits per second in most transmission systems. This means that the boundaries of the coded video and audio frames do not, in general, line up, so the system must rely on a time-stamping mechanism to reproduce the correct A/V sync. MPEG provides the PCR for this purpose; it is a sample of the master clock used in the compression system. By generating the video and audio clocks from this master clock, and then transmitting the PCR at frequent intervals, the encoder allows the decoder to regenerate the clocks necessary to maintain synchronization.
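To make the clock-recovery idea concrete, here is a minimal sketch of how a decoder might compare two received PCR samples with its own free-running counter. It relies on the MPEG-2 Systems convention that the PCR counts a 27MHz clock and is carried as a 33-bit base (in 90kHz units) plus an extension that counts from 0 to 299. The function names and the two-sample estimate are assumptions for illustration; a practical decoder would filter many samples in a phase-locked loop.

```python
SYSTEM_CLOCK_HZ = 27_000_000       # MPEG-2 system clock frequency
PCR_MODULUS = (1 << 33) * 300      # the PCR wraps when the 33-bit base wraps


def pcr_to_ticks(base: int, extension: int) -> int:
    """Combine the 90 kHz base and the 0..299 extension into 27 MHz ticks."""
    return base * 300 + extension


def clock_error_ppm(pcr_first: int, pcr_second: int,
                    local_first: int, local_second: int) -> float:
    """Estimate local clock error from two PCR samples (hypothetical helper).

    pcr_* are received PCR values in 27 MHz ticks; local_* are the values of
    the decoder's own 27 MHz counter at the moments those samples arrived.
    A positive result means the local clock is running fast.
    """
    sent_ticks = (pcr_second - pcr_first) % PCR_MODULUS   # handles wraparound
    local_ticks = local_second - local_first
    return (local_ticks - sent_ticks) / sent_ticks * 1_000_000
```

For example, if the encoder's clock advanced by exactly one second between two samples while the local counter advanced by 27,000,270 ticks, the estimate is about 10 parts per million fast, and the decoder would slow its oscillator accordingly.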

The video and audio streams each contain a recurring presentation time stamp (PTS) that indicates when each video and audio “presentation unit” should be presented by the decoder. With a fixed decoding time for each process, this then establishes the correct presentation time of video and audio to the viewer/listener.
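The relationship between time stamps and A/V sync can be sketched in a few lines. The PTS is a 33-bit count of a 90kHz clock, so a decoder (or a test instrument) can compare the recovered clock value at the moment a unit is actually output with that unit's PTS. The function names below are hypothetical, and the sign convention (positive means audio leads video) is an assumption chosen for the example.

```python
PTS_HZ = 90_000          # PTS values count a 90 kHz clock
PTS_MODULUS = 1 << 33    # PTS is a 33-bit field and wraps around


def signed_diff(a: int, b: int) -> int:
    """Return a - b in 90 kHz ticks, folded into the signed half-range."""
    d = (a - b) % PTS_MODULUS
    if d >= PTS_MODULUS // 2:
        d -= PTS_MODULUS
    return d


def presentation_error_ms(clock_at_output: int, pts: int) -> float:
    """How late (positive) or early (negative) a unit was presented."""
    return signed_diff(clock_at_output, pts) * 1000 / PTS_HZ


def av_offset_ms(video_clock: int, video_pts: int,
                 audio_clock: int, audio_pts: int) -> float:
    """A/V offset in ms; positive means the audio leads the video."""
    return (presentation_error_ms(video_clock, video_pts)
            - presentation_error_ms(audio_clock, audio_pts))
```

If a video unit is output 2700 ticks after its PTS while the audio is output exactly on its PTS, the audio leads by 30ms, which already exceeds the 15ms lead that the ATSC recommends.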

However, it is possible that receivers (decoders) do not process these time stamps correctly, depending on how the video decoder buffer is managed. As we saw previously, the number of coded bits varies from frame to frame. This requires a buffer in order to properly decode the video, and an appropriate algorithm to manage the buffer. In MPEG, this is known as the video buffering verifier (VBV), a model that is used in the encoder to ensure that there is never an overflow or underflow condition.

This is shown in Figure 2 for a fictitious seven-frame stream, with the fullness of the decoding video buffer as a function of time. Bits enter the buffer and then are removed (decoded) starting at frame #0 in the graph. From that point forward, bits must be removed at the correct frame rate to ensure proper video display. (The model assumes that all bits from each frame are removed instantaneously. This is valid for the sake of buffer management, given actual hardware architectures and the fact that any practical delay is inconsequential to the action of the buffer.) If the buffer should overflow or underflow, the video would either freeze or jump ahead, causing a noticeable disruption.
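The buffer model described above is simple enough to simulate. The sketch below assumes constant bit rate delivery and instantaneous removal of each frame's bits at the frame rate, as in the model. The function name, its parameters and the frame sizes in the usage example are invented for illustration and are not taken from Figure 2.

```python
def simulate_vbv(frame_bits, bit_rate, frame_rate, buffer_size, vbv_delay_s):
    """Trace decoder buffer fullness (in bits) for a constant-bit-rate stream.

    Returns (trace, status). Each trace entry is the fullness just before and
    just after one frame's bits are removed. Hypothetical helper for this sketch.
    """
    fullness = bit_rate * vbv_delay_s        # bits that arrive before frame 0
    bits_per_interval = bit_rate / frame_rate
    trace = []
    for n, bits in enumerate(frame_bits):
        if fullness > buffer_size:
            return trace, f"overflow at frame {n}"
        if bits > fullness:
            # The frame has not fully arrived by its decode time.
            return trace, f"underflow at frame {n}"
        trace.append((fullness, fullness - bits))    # instantaneous removal
        fullness = fullness - bits + bits_per_interval
    return trace, "ok"


# Usage: a made-up seven-frame stream with one large frame followed by smaller ones.
frames = [200_000, 40_000, 40_000, 100_000, 40_000, 40_000, 100_000]
trace, status = simulate_vbv(frames, bit_rate=2_400_000, frame_rate=30,
                             buffer_size=400_000, vbv_delay_s=0.125)
```

In this example the buffer starts at 300,000 bits and stays within its bounds throughout. Starting with the same delay but a first frame larger than 300,000 bits would report an underflow, which is the condition the encoder's model is meant to prevent.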

The parameter VBV delay specifies the length of time that the first byte of coded video data remains in the video buffer (to the left of zero in this example) while the buffer initially fills. While this parameter can be specified in the bit stream, most decoders ignore it and regenerate the buffer timing from the PCR and PTS data, and herein lies the potential for problems.

Decoders vary in how often they recheck the PCR and PTS elements for synchronization, which can cause a problem if data is corrupted or missing. For instance, a simple decoder could be constructed that fills the buffer to some arbitrary point, and then proceeds to decode pictures without referring back to the PTS on an ongoing basis. Assuming all other data is correctly received, and the decoding frame rate is correct, the decoder could run indefinitely and appear to produce correct pictures and sound. But if there is an error in the timing algorithm, or if some data is lost in transmission, the playback timing could be far enough off to produce an A/V sync error that persists indefinitely.
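The difference between a free-running decoder and one that keeps checking the PTS can be shown with a toy model. The sketch assumes a hypothetical audio stream with 32ms frames (2880 ticks of the 90kHz PTS clock) in which one frame is lost in transmission. The function name and the resynchronization rule are assumptions for illustration only.

```python
TICKS_PER_SECOND = 90_000   # PTS clock rate


def sync_errors_ms(received_pts, frame_ticks, recheck_pts):
    """Return the presentation error (ms) for each received unit.

    A negative error means the unit is output earlier than its PTS calls for.
    With recheck_pts=False the decoder aligns once, then simply outputs one
    unit per frame period without looking at the time stamps again.
    """
    errors = []
    next_output = received_pts[0]            # start aligned to the first PTS
    for pts in received_pts:
        if recheck_pts and next_output != pts:
            next_output = pts                # resynchronize to the time stamp
        errors.append((next_output - pts) * 1000 / TICKS_PER_SECOND)
        next_output += frame_ticks
    return errors


# The unit stamped 8640 never arrives.
received = [0, 2880, 5760, 11520, 14400]
free_running = sync_errors_ms(received, 2880, recheck_pts=False)
checking = sync_errors_ms(received, 2880, recheck_pts=True)
```

Here the free-running decoder outputs every unit after the loss 32ms early, and the error never corrects itself, while the decoder that rechecks the PTS shows no error. An audio lead of 32ms is more than twice the 15ms that the ATSC recommends.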

The problem with any such product is that there is no formal requirement that the decoding should work properly 100 percent of the time, other than that of product quality control. (And receiver manufacturers are loath to accept imposed requirements, as well.) In reality, any viewer encountering lip sync issues will almost certainly blame it on the program provider and not on the product. Any activity aimed at improving the situation would have to be from a cross-industry collaboration between broadcasters and consumer electronics manufacturers.

Few solutions at this time

In an earlier column, we took a look at some of the technologies that measure or control A/V sync at the broadcast plant. Part of the problem with their effectiveness is that the simplest test equipment requires the interruption of normal programming. Automatic online measurement and compensation could alternatively provide a precise and self-correcting system. Technical committees are continuing to work on the problem, but the work is difficult.

IEC is working on standards relating to assessment, measurements and methods for A/V synchronization, but the results may not provide specifics for the broadcaster. The HDMI v1.3 and IEEE-1394 standards have features that help consumer equipment, but not in systems already installed.

CEA is working on a recommended practice, to be known as CEB-20, for DTV receiver implementers and developers, that relates to DTV receiver/decoder processing affecting A/V sync. Expected completion is mid-2009, after which ATSC will continue its own efforts.

More visibility needed

Unfortunately, A/V sync is the kind of problem that everyone knows about, but not all broadcasters and program distributors are willing or able to spend sufficient time or money on solving it, perhaps in part because its actual effect on revenue is difficult to determine. Perhaps therein lies an opportunity for manufacturers to develop solutions that are inexpensive and straightforward to implement.

Aldo Cugnini is a consultant in the digital television industry.

Send questions and comments to: aldo.cugnini@penton.com