Lip-sync and MPEG delay

We've had digital broadcasting throughout the world for more than 10 years, and, unfortunately, the problem of A/V sync (lip sync) still exists. There are many reasons for this (confusion about the causes, apathy about the effects, few or limited solutions), but the upshot is that an issue that had limited negative effects in the analog world is now pervasive in the digital one.

No simple solution

While we've known about many of the causes of A/V sync error, simple solutions have been elusive. These causes stem from any of several processes: servers, audio and video processors, encoders, multiplexers, decoders, and so forth. Across all of these, the two major culprits are differential audio and video processing, and incorrect handling of system clocks. The first of these often involves a static delay in the processing of linear (i.e., uncompressed) signals; the second largely concerns the incorrect handling of clock regeneration for compressed signals.

Recently, methods have been developed to monitor and maintain correct A/V sync. If components within a signal chain modify the A/V sync, then these components can be viewed as an “input/output” system, and the resulting A/V sync can be “repaired” by comparing an upstream reference signal with a downstream sample signal and then applying an appropriate compensating audio or video delay. (See Figure 1.) But such a solution really only fixes a problem caused elsewhere. When considering static delays, close attention to the overall system design is imperative. In the case of dynamically varying delays, the components responsible, assuming they can be identified, should be considered for replacement. More on that later.


A new technique creates a data component that uniquely identifies each frame of video with its associated sample of audio. This “fingerprint” consists of a small data word, with the running data forming its own stream. This stream can be packaged as metadata and stored with content as files, streamed concurrently with the content, or even sent over an entirely separate path. Downstream, a similar fingerprint can be generated, and a delay processor can use the two fingerprints to regenerate an A/V signal with the original (correct) timing relationship.
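To make the idea concrete, here is a minimal Python sketch of how a delay processor might compare upstream and downstream fingerprints. It is a toy model, not the patented technique or any vendor's algorithm: it assumes each frame's fingerprint is a single integer, that audio and video fingerprints are carried as separate per-frame lists, and that a simple mean-absolute-difference search is good enough to find the delay. The function names (best_lag, av_sync_error_ms) and parameters are hypothetical.

```python
from typing import Sequence


def best_lag(reference: Sequence[int], sample: Sequence[int], max_lag: int) -> int:
    """Return the delay, in frames, at which `sample` best matches `reference`.

    Assumes `sample` is a (possibly delayed) copy of `reference`.
    The match score is the mean absolute difference over the overlap.
    """
    best = 0
    best_score = float("inf")
    for lag in range(max_lag + 1):
        # Compare reference[i] with sample[i + lag] over the overlapping region.
        overlap = min(len(reference), len(sample) - lag)
        if overlap <= 0:
            break
        score = sum(abs(reference[i] - sample[i + lag]) for i in range(overlap)) / overlap
        if score < best_score:
            best, best_score = lag, score
    return best


def av_sync_error_ms(ref_video: Sequence[int], ref_audio: Sequence[int],
                     down_video: Sequence[int], down_audio: Sequence[int],
                     frame_rate: float = 25.0, max_lag: int = 8) -> float:
    """Estimate the A/V sync error introduced between the two measurement points.

    A positive result means the audio was delayed more than the video.
    """
    video_delay = best_lag(ref_video, down_video, max_lag)
    audio_delay = best_lag(ref_audio, down_audio, max_lag)
    return (audio_delay - video_delay) * 1000.0 / frame_rate


# Illustrative data only: video delayed by 2 frames, audio by 5 frames.
ref_video = [5, 9, 2, 7, 1, 8, 3, 6, 4, 0, 11, 13]
ref_audio = [3, 14, 15, 9, 2, 6, 5, 35, 8, 97, 93, 23]
down_video = [0, 0] + ref_video
down_audio = [0, 0, 0, 0, 0] + ref_audio

error_ms = av_sync_error_ms(ref_video, ref_audio, down_video, down_audio)
```

In this example the audio lags the video by three frames, so a compensating video delay of that amount would restore the original relationship. A real system would need fingerprints and a matching metric that tolerate compression, logo insertion and similar processing.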

Manufacturers are developing products that use this fingerprint technology. The technique has been shown to work even when the video has been subjected to intervening compression, black bars, logo insertion, etc. The technique also survives various forms of audio compression and processing. Only a few fingerprint data bytes per frame are needed for reliable identification of the A/V signals; increasing the number of fingerprint data bits per frame provides higher accuracy and faster processing. While some of the technology is covered by patents, offers have been made to freely license the intellectual property, in the spirit of forming an open standard that supports rapid deployment.

At the same time, the SMPTE 22TV Lip Sync Ad Hoc Group (AHG) has been working to develop a standard for in-service A/V timing error measurement. The group is considering a number of requirements for such a system, including a specification for fingerprinting. Overall, the group wants to specify a system that enables automatic detection and measurement of A/V sync errors, works through complex distribution systems, and enables detection of errors at multiple points in the chain and over multiple distribution paths. The SMPTE work also suggests that such a system be interoperable across multiple vendors of fingerprint generators and detectors.

Although there is progress, not all broadcasters are directly active in SMPTE. The group therefore encourages more input from broadcasters on their requirements for such a system.

CE sheds light on critical issues

While the subject of consumer decoders might appear to be unrelated to broadcasting solutions, the reality is that a great many professional products that handle compressed signals — especially professional decoders — use consumer chips to handle clock generation, and some of those chips simply do not do a good job of it. In fact, many of these chips will synchronize only upon signal acquisition (or system initialization) and then “flywheel” thereafter, with the result that the audio and video clocks will drift over time. Some consumer chips don't even properly set up A/V sync at startup.
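The size of this drift is easy to estimate. The short Python sketch below assumes that the audio and video clocks free-run with a constant relative frequency offset, expressed in parts per million (ppm). The 30 ppm figure is borrowed from the MPEG-2 Systems tolerance on the 27 MHz system clock and is used here only as an illustration, as is the 40 ms threshold (one frame at 25 frames per second). The function names are hypothetical.

```python
def drift_ms(ppm_offset: float, elapsed_s: float) -> float:
    """Timing error, in milliseconds, accumulated between two free-running
    clocks whose rates differ by `ppm_offset` parts per million."""
    return ppm_offset * 1e-6 * elapsed_s * 1000.0


def seconds_until_drift(threshold_ms: float, ppm_offset: float) -> float:
    """Time, in seconds, for the accumulated drift to reach `threshold_ms`."""
    return (threshold_ms / 1000.0) / (ppm_offset * 1e-6)


# Illustrative values: a 30 ppm relative offset, observed for one hour.
one_hour_drift = drift_ms(30.0, 3600.0)
time_to_one_frame = seconds_until_drift(40.0, 30.0)
```

With these assumed numbers the error grows by about 108 ms per hour and passes one frame in roughly 22 minutes, which is why a decoder that synchronizes only at acquisition cannot hold lip sync over a long program.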

It should therefore be of great interest to broadcasters to better understand the internal workings of their hardware, including those components that use consumer chips. The Consumer Electronics Association (CEA) has published CEA-CEB-20, “A/V Synchronization Processing Recommended Practice,” which should be studied by broadcasters and manufacturers alike. It provides recommendations on the steps that an MPEG decoder should take to ensure and maintain A/V sync, and describes issues affecting the appropriate processing of various MPEG stream elements, including program clock references (PCR), presentation time stamps (PTS) and decoding time stamps (DTS). All of these elements are typically used to regenerate decoder clocks and guarantee appropriate timing signals, which directly affect A/V sync. (Several other documents also describe the recommended handling of timing-related components of streams. Among these are ATSC A/54A, A/78A, and ETSI TR 101 290.)
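For readers less familiar with these stream elements, the following Python sketch shows the arithmetic that relates them. It relies on the MPEG-2 Systems definitions: the PCR is carried as a 33-bit base counting at 90 kHz plus a 9-bit extension counting 0 to 299 at 27 MHz, and PTS and DTS are 33-bit values in 90 kHz units. The function names, and the idea of comparing a PTS directly against the PCR base, are simplifications for illustration and are not taken from CEB-20.

```python
PCR_CLOCK_HZ = 27_000_000   # system clock frequency
PTS_CLOCK_HZ = 90_000       # PTS, DTS and PCR base tick rate
TIMESTAMP_WRAP = 1 << 33    # 33-bit counters wrap roughly every 26.5 hours


def pcr_to_seconds(base: int, extension: int) -> float:
    """Convert a PCR (90 kHz base plus 27 MHz extension) to seconds."""
    return (base * 300 + extension) / PCR_CLOCK_HZ


def timestamp_diff(a: int, b: int) -> int:
    """Signed difference a - b of two 33-bit timestamps, handling wraparound."""
    d = (a - b) % TIMESTAMP_WRAP
    if d >= TIMESTAMP_WRAP // 2:
        d -= TIMESTAMP_WRAP
    return d


def ms_until_presentation(pts: int, clock_base: int) -> float:
    """Milliseconds until a unit stamped `pts` is due, given the decoder's
    regenerated clock expressed in 90 kHz ticks."""
    return timestamp_diff(pts, clock_base) * 1000.0 / PTS_CLOCK_HZ


def av_pts_offset_ms(audio_pts: int, video_pts: int) -> float:
    """Offset between audio and video timestamps, in milliseconds."""
    return timestamp_diff(audio_pts, video_pts) * 1000.0 / PTS_CLOCK_HZ
```

Because every timestamp is interpreted against the regenerated clock, any error in recovering that clock from the PCRs shifts the presentation time of both audio and video, and any difference in how the two paths honor their timestamps appears directly as lip-sync error.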

Also described in CEB-20 are various timing mechanisms that should be present in a well-designed decoder, including startup, adjustment and steady-state conditions. The discussion of packet timing is similarly relevant when considering timing mechanisms in professional encoders and decoders. One key issue here is that PCR packets must be remapped properly when multiple program transport streams are demultiplexed. Otherwise, buffer requirements could be violated, leading to decoder crashes and abrupt timing changes.
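One common way to think about PCR remapping is as a correction for the variable delay a packet experiences inside the equipment. The Python sketch below is a simplified illustration under stated assumptions, not the method prescribed by CEB-20 or any standard: it assumes the device timestamps each PCR-bearing packet on arrival and departure with a local 27 MHz counter, and that it is designed around a fixed nominal delay. All names are hypothetical.

```python
PCR_MODULUS = (1 << 33) * 300   # full PCR range in 27 MHz ticks


def restamp_pcr(pcr_in: int, arrival_ticks: int, departure_ticks: int,
                nominal_delay_ticks: int) -> int:
    """Return a corrected PCR for a packet that moved in time.

    pcr_in               -- original PCR in 27 MHz ticks (base * 300 + extension)
    arrival_ticks        -- local 27 MHz counter value when the packet arrived
    departure_ticks      -- local 27 MHz counter value when the packet leaves
    nominal_delay_ticks  -- the constant delay the device is designed to add
    """
    actual_delay = departure_ticks - arrival_ticks
    # Only the deviation from the nominal delay is jitter that must be
    # corrected; a constant delay shifts the whole stream uniformly.
    correction = actual_delay - nominal_delay_ticks
    return (pcr_in + correction) % PCR_MODULUS


def split_pcr(pcr_ticks: int) -> tuple:
    """Split a 27 MHz PCR value into its (base, extension) fields."""
    return pcr_ticks // 300, pcr_ticks % 300
```

If the correction is skipped, the decoder sees PCR values that no longer match their arrival times, which it interprets as clock jitter or as a jump in the time base.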

Many of these issues are directly tied to the algorithms buried in silicon. That means chip manufacturers should similarly be aware of these issues, because their designs don't always respect them. Because silicon is constantly being upgraded and re-engineered, providing access to the necessary timing functions should be a manageable aspect of the design process going forward. However, new products with the appropriate performance could still be 12 to 18 months away.

Know your equipment and your vendors

It is important that broadcasters make a detailed functional inventory of their plants and ask their vendors the right questions concerning new product purchases. Do the products use legacy consumer chips? How new are the designs? Do the products follow the recommendations of CEB-20 and other specs? Can compliance be verified over a long-term period, e.g., several weeks? Dependable answers to these questions could lead to better lip sync in broadcast plants.

Aldo Cugnini is a consultant in the digital television industry.
