A/V timing errors

Are you the person who fields engineering office phone calls? If you are, then you’ve probably heard more than a few that started with these classic words: “There’s something wrong with your station.” This is usually followed by some type of reception complaint. As the number of viewers watching DTV on large flat screens grows, you may be hearing more complaints about lip-sync issues than you used to.

Lip-sync errors don’t always start at the station. At the other end of the chain are the viewer and his or her TV system, and complaints about lip sync may not always reflect a problem with your station or signal.

The next time you’re in a big TV set retailer or box house, step back and observe as many flat-screen HDTVs simultaneously as possible. Typically, most of the screens are connected to the same video source. As scenes change, watch closely with your engineering eye and you may discover that the TVs don’t all display scene changes at precisely the same time. All flat screens process video, and some process it with more latency than others. Some LCD and plasma displays can add up to three frames of delay through their digital processing, which, if the audio isn’t delayed to match, can create a noticeable lip-sync problem.

Lip-sync problems become more noticeable on larger screens, and they may be exacerbated by the viewer’s audio wiring. For example, a flat-screen display may be connected with a DVI or HDMI cable from a cable or satellite set-top box (STB), while the analog audio output of the STB is connected directly to a stereo system. If there is any video-processing delay in either the STB or display, the audio will move ahead of the video. Other home theater gizmos such as scalers or signal converters can also add video delay.

Another variable is the program delivery method the viewer chooses. Although there is a trend toward more DTV off-air reception, there are still far more people watching local channels on cable and satellite than with antennas. Does your station regularly monitor all the major delivery systems viewers are most likely to use to receive your signal? Not all cable and satellite providers pass every signal precisely as it was transmitted at its source.

The quality your station transmits probably means a great deal more to you and your station’s advertisers than it does to a program provider dealing with hundreds of channels. It’s not unusual, particularly with basic-tier channels such as local stations, for a program provider to employ data reduction on the basic tier to allocate more bandwidth for premium channels. This process can sometimes introduce more video delay. You should routinely confirm that all program providers retransmitting your signal are doing so with the highest quality and accuracy, including lip sync.

What A/V timing errors are acceptable? Written in 2003, the ATSC IS-191 committee report stated, “Under all operational situations, at the inputs to the DTV encoding devices, the sound program should be tightly synchronized to the video program. The sound program should never lead the video program by more than 15 milliseconds and should never lag the video program by more than 45 milliseconds.” More recently, the AES, SMPTE and IEEE have all formed study groups and committees to investigate various standards for lip-sync error detection and evaluation systems. Incidentally, the lip-sync standard in the film industry, sometimes called “lip flap,” is plus or minus 22ms, which is a lead or lag of approximately one-half a frame at 24fps.

What commonly is referred to as a 60i DTV frame lasts 33.37ms (at a 59.94 field rate). A field is half that, or 16.68ms. Thus, if the video lags the audio by a single field, it exceeds the recommended limit. The maximum acceptable audio lag is slightly less than three fields. As an aside for your reference, the speed of sound is approximately 330m/s, or about 36ft per frame. Based on the ATSC committee conclusion and film industry standards, people may begin to perceive a sound delay when a sound-producing physical event, such as a firecracker or the crack of a baseball bat, is observed at a distance of as little as 18ft (roughly 17ms of acoustic delay). Perhaps that’s why people can tolerate more audio lag than lead.
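
To keep those numbers straight, here is a minimal Python sketch that converts fields and frames at 59.94 to milliseconds and checks an A/V offset against the IS-191 window quoted above. The sign convention (positive for audio lag, negative for audio lead) is simply a choice made for the example.

```python
# A/V offset sanity check against the ATSC IS-191 tolerances quoted above.
# Convention: positive offset = audio lags video, negative = audio leads video.

FIELD_RATE = 59.94                 # 60i DTV field rate
FIELD_MS = 1000.0 / FIELD_RATE     # ~16.68 ms per field
FRAME_MS = 2 * FIELD_MS            # ~33.37 ms per frame

MAX_AUDIO_LEAD_MS = 15.0           # audio should never lead video by more than 15 ms
MAX_AUDIO_LAG_MS = 45.0            # audio should never lag video by more than 45 ms

def within_is191(offset_ms: float) -> bool:
    """True if the A/V offset is inside the IS-191 window (-15 ms to +45 ms)."""
    return -MAX_AUDIO_LEAD_MS <= offset_ms <= MAX_AUDIO_LAG_MS

if __name__ == "__main__":
    for label, offset in [("audio leads by one field", -FIELD_MS),
                          ("audio lags by one frame", FRAME_MS),
                          ("audio lags by three fields", 3 * FIELD_MS)]:
        status = "OK" if within_is191(offset) else "out of tolerance"
        print(f"{label}: {offset:+.2f} ms -> {status}")
```

Running it confirms the arithmetic above: a one-field audio lead (-16.68ms) is already out of tolerance, one frame of audio lag (33.37ms) is acceptable, and three fields of lag (50.05ms) just exceed the 45ms limit.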

Several reputable institutions have researched this same delay-perception question, and all reached similar conclusions. As opposed to the physical world, where sound delay is commonplace, in DTV the problem is usually the other way around.

There are many unintended ways a broadcaster, production or post house can push video behind audio before the signal ever enters a recorder, DTV encoder or a fiber feed to a headend, which brings us to where the signal originates.

Video from a typical live CCD camera starts with at least one field of built-in video delay as the image is captured, stored and digitally processed before it ever leaves the camera or is recorded. So like driving 70mph in a 65 zone, you can usually get away with it. Plug that camera into a switcher and run it through a DVE, or punch it up on a switcher with frame-synched inputs, and you’ll add at least one frame of delay to the field you already had. Now, you are driving 85mph in a 65 and folks are going to notice.

The relationship between video and audio can be compromised every time the signal is encoded or decoded, or when digital signal-processing occurs. The problem builds as signals cascade through a station or production facility and as more processing, conversion and distribution devices are switched in or out of the signal chain.
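
As a back-of-the-envelope illustration of how that buildup works, the sketch below totals the video delay of a hypothetical chain and reports the audio delay needed to compensate. The device names and delay values are illustrative only, not measurements of any real product; measure your own chain.

```python
# Illustrative video-delay budget for a hypothetical signal chain.
# Device names and delay values are examples only; measure your own chain.

FIELD_MS = 1000.0 / 59.94   # ~16.68 ms per field at 59.94 fields/s

chain = [
    ("CCD camera (capture/processing)", 1 * FIELD_MS),   # at least one field, per the text
    ("Production switcher frame sync",  2 * FIELD_MS),   # one frame
    ("DVE pass",                        2 * FIELD_MS),   # one frame
    ("SD-to-HD up-converter",           2 * FIELD_MS),   # one frame (illustrative)
]

total_video_delay = sum(ms for _, ms in chain)

print("Video delay budget:")
for name, ms in chain:
    print(f"  {name:38s} {ms:6.2f} ms")
print(f"  {'Total':38s} {total_video_delay:6.2f} ms")

# If the audio path adds no matching delay, audio leads video by the total.
print(f"Audio delay required to restore sync: {total_video_delay:.2f} ms")
```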

There are several ways to address this issue. As with nearly everything in the world of broadcast engineering, the method that works best for you and your facility will primarily depend on the urgency of the problem and size of your budget. Some facilities take the shotgun approach and simply insert an audio delay unit downstream from master control, dial in a couple of frames of delay and call it a day. More professional methods to verify and control lip-sync issues are divided into two groups: out-of-service and in-service.

In the out-of-service group, equipment must be offline for testing. It can be as basic as using an A/V source showing the clap and sound of a clapperboard, and using your eyes and ears to zero in A/V sync manually with an audio delay device. This simple method is as subjective as it is primitive, but it has worked for about 90 years, and sometimes it’s all a budget can afford. It may also be the only test instrument available at the time.

Many other out-of-service tools designed to match audio delays to video delays are available to the broadcast industry. Typically, these products are found in the backbone of a modern facility’s infrastructure, such as A/D conversion, audio processing, distribution, embedded audio, SD-HD conversion, synchronizers and video processors. Out-of-service methods are most often used in mobile production vehicles, where distances and delays can vary with nearly every new setup but remain stable once set.

The in-service group of devices is designed to occasionally or continuously detect end-to-end lip-sync errors without interruption. One early idea was to use watermarking. Watermarks are hidden visual and audio cues added to the original material, intended to secure content and protect it from unauthorized copying. The problem with watermarking is that the marks can be lost in digital-processing equipment, size or aspect-ratio modifications, or digital encoding and decoding.

A much more clever and robust in-service lip-sync monitoring idea has recently emerged, commonly referred to as fingerprinting. The fingerprinting technique goes by several names, including digital signatures, A/V signatures, descriptors, data correlations, feature vectors, short digests and robust hashes. Generally, but not always, fingerprints are level-based and taken over a period of time.
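
As a toy illustration of what “level-based over a period of time” can mean (this is not any vendor’s algorithm), a fingerprint generator might reduce each video frame to its mean luma and each frame-length block of audio to its RMS level, yielding one compact signature pair per frame:

```python
import numpy as np

# Toy level-based fingerprint: one (video, audio) feature pair per frame.
# Not any vendor's algorithm; real systems use more robust features.

FRAME_RATE = 29.97
AUDIO_RATE = 48000
SAMPLES_PER_FRAME = int(AUDIO_RATE / FRAME_RATE)

def video_fingerprint(frames: np.ndarray) -> np.ndarray:
    """frames: (n, h, w) luma planes -> mean luma per frame."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)

def audio_fingerprint(samples: np.ndarray) -> np.ndarray:
    """samples: mono PCM -> RMS level per frame-sized block."""
    n_frames = len(samples) // SAMPLES_PER_FRAME
    blocks = samples[:n_frames * SAMPLES_PER_FRAME].reshape(n_frames, SAMPLES_PER_FRAME)
    return np.sqrt((blocks.astype(np.float64) ** 2).mean(axis=1))
```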

Fingerprinting also has many other uses, including content management, content verification, ad-insertion verification, media monitoring, cataloging, and security and piracy management. YouTube, for instance, uses fingerprinting technology to avoid lawsuits by protecting content owners from unauthorized uploads. In addition to all these applications, fingerprinting also has the ability to provide powerful, nonintrusive, online, in-service solutions for lip-sync verification and management.

A comparison server can be located at each remote site to compare the local source with the signal sent to it, or a centralized comparison server can handle larger amounts of data and compare various outputs simultaneously. Each end requires a fingerprint-generation system. A comparison server can collect fingerprint data over IP networks to perform lip-sync error detection and accurately measure delays over long distances.
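
The heart of a comparison server can be sketched in a few lines: normalize the two fingerprint sequences and find the lag, in frames, at which they correlate best. This is a simplified illustration under the toy-fingerprint assumption above; real systems add windowing, confidence scoring and continuous tracking.

```python
import numpy as np

def estimate_offset_frames(reference: np.ndarray, downstream: np.ndarray,
                           max_lag: int = 150) -> int:
    """Return the lag (in frames) that best aligns downstream to reference.

    Positive result: downstream is delayed relative to the reference.
    Simplified sketch; real comparison servers are far more robust.
    """
    ref = (reference - reference.mean()) / (reference.std() + 1e-9)
    dwn = (downstream - downstream.mean()) / (downstream.std() + 1e-9)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = ref[:len(ref) - lag], dwn[lag:]
        else:
            a, b = ref[-lag:], dwn[:len(dwn) + lag]
        n = min(len(a), len(b))
        if n < 10:
            continue
        score = float(np.dot(a[:n], b[:n])) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Run once against the video fingerprints and once against the audio fingerprints, the difference between the two measured lags is the lip-sync error, which converts to milliseconds at roughly 33.37ms per frame.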

While several manufacturers sell addressable processing equipment that can remotely adjust delays to fix identified lip-sync errors, facilities can also resolve problems with stand-alone, locally controlled audio delay devices added to the audio output of each device that delays the video. In the case of larger production switchers with multiple DVEs, with a little engineering, the tally system can be used to control an addressable audio delay unit to match the varying video delay.
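
To make the tally idea concrete, here is a hedged sketch. The switcher query and the delay-unit command are invented for illustration; a real installation would use the switcher’s actual tally protocol and the delay unit’s actual control interface.

```python
# Hypothetical tally-driven audio delay control.
# The tally inputs and the set_audio_delay_ms callback below are invented
# for illustration; substitute your switcher's tally protocol and your
# delay unit's real control interface.

FRAME_MS = 1000.0 / 29.97   # ~33.37 ms per frame

def frames_of_video_delay(dve_passes_active: int, frame_sync_in_path: bool) -> int:
    """Count frame-delaying stages currently in the program path."""
    return dve_passes_active + (1 if frame_sync_in_path else 0)

def on_tally_change(dve_passes_active: int, frame_sync_in_path: bool,
                    set_audio_delay_ms) -> None:
    """Whenever the tally state changes, retune the audio delay to match."""
    delay_ms = frames_of_video_delay(dve_passes_active, frame_sync_in_path) * FRAME_MS
    set_audio_delay_ms(delay_ms)   # send the new setting to the delay unit

# Example: two DVE passes plus a frame-synced source -> ~100 ms of audio delay.
on_tally_change(2, True, lambda ms: print(f"audio delay -> {ms:.1f} ms"))
```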

One of the most impressive benefits of using nonintrusive fingerprint technology is that it can be used end to end in the real world. You can measure A/V delays in your facility from your studio to your digital encoder with a high degree of accuracy. It can even be used to monitor your channel and measure delay at the output of a satellite STB, cable STB or demodulator with similar accuracy. Then, when you are asked “What’s the matter with you people?” you can respond with a confident, “The signal is leaving here OK.”

There is no question the industry needs to establish an interoperable fingerprint standard. It’s not the use of fingerprints that needs standardization but, rather, the method used to generate a fingerprint. The SMPTE 22TV Lip Sync Ad Hoc Group is presently investigating the possibility of producing a SMPTE standard for the fingerprint signal and/or the methods of metadata carriage.

That being said, the sooner the industry can agree on a universal, open-source algorithm for A/V fingerprints, the better.

Martin Jolicoeur, product manager at Miranda Technologies, and Norman Rouse, marketing development manager at Snell, contributed to this article.