We have come to expect much of television over the last decade, including increased channel capacity and improvements in video and audio quality due in large part to new digital processing capabilities. Yet considering the attention that crowd-pleasing special effects receive from today’s audiences, it is surprising that the highest-ranking determinant of perceived program quality has proven to be proper, consistent lip-sync timing. A new method for testing lip sync based on digital video watermarking offers true “in-service” monitoring and correction capability.
Human vision and aural perception
Multi-sensory studies have shown that when audio is advanced or delayed with respect to video, a considerable reduction in speech intelligibility is observed. Studies also show a bias of tolerance toward the delay of audio relative to video: humans are conditioned to expect to see something happen before hearing it, possibly because light travels faster than sound. Some tests have found the threshold for detecting audio advanced relative to video to be 30 ms (~2 NTSC fields) and for audio delayed relative to video to be 85 ms (~5 NTSC fields). When the advance or delay exceeds these just-noticeable thresholds, the effect becomes irritating. Audio-to-video timing errors in broadcast material are more common today and often far exceed these thresholds, so viewers are more aware of them and more likely to complain.
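The asymmetric thresholds above can be expressed as a simple tolerance check. This is a minimal sketch using the values cited in the studies; the sign convention (negative = audio leads video) and the function name are illustrative assumptions.

```python
# Thresholds from the perceptual studies cited above (milliseconds).
# Sign convention is an assumption: negative = audio advanced (leads video),
# positive = audio delayed (lags video).
AUDIO_LEAD_THRESHOLD_MS = -30   # audio leading by more than 30 ms is noticeable
AUDIO_LAG_THRESHOLD_MS = 85     # audio lagging by more than 85 ms is noticeable

def av_offset_detectable(offset_ms: float) -> bool:
    """Return True if the A/V offset exceeds the just-noticeable thresholds."""
    return offset_ms < AUDIO_LEAD_THRESHOLD_MS or offset_ms > AUDIO_LAG_THRESHOLD_MS

print(av_offset_detectable(-20))   # audio leads slightly: within tolerance
print(av_offset_detectable(100))   # audio lags ~6 fields: detectable
```

Note the asymmetry: a 50 ms audio lag passes unnoticed, while a 50 ms audio lead does not.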
Because the A/V delay rules were informal and not consistently applied by all video distributors, in 1994 the International Telecommunication Union (ITU) established a group to investigate A/V timing errors. Subjective testing with skilled and non-skilled test subjects was performed and led to the 1998 recommendation ITU-R BT.1359-1. Table 1 lists viewer tolerances ITU-R BT.1359-1 assigned to different distribution and contribution chains within a television network and/or plant.
Watermarks for A/V delay correction
Watermarking, a method of hiding data or information within still images, video or audio, emerged as an area of research around 1990. In the video context, it can be thought of as a digital data communication channel that uses the image or image sequence as a carrier. There are two basic types of watermarks -- visible and invisible. A visible watermark is a secondary image overlaid on a primary image. Examples include a corporate seal “burned” into an advertisement, or a logo or “bug” identifying the organization that holds the rights to the image. The secondary image is visible enough to allow clear determination of the image owner, but translucent enough for the primary image to be viewed with minimal visual impairment. An invisible watermark is a secondary overlaid image that cannot be seen, but that can be detected algorithmically.
Most of the research done on digital watermarking has been directed toward ownership authentication for copyright protection. There, it is desirable for the watermark data to be difficult to remove, since there is presumably a motive for hostile attempts to jam or strip the watermark data.
Digital watermarking of video signals involves modulating areas of the images with a small signal that has little or no visible impact on the image or video quality. As long as the signal can be detected in the active video images, the watermarked video can serve as a subliminal communications channel for transmitting additional digital data. Watermark detection is possible even when blanking and sync have been removed, as in MPEG transmission. It is this characteristic that makes watermarking a viable method for A/V synchronization: enough of the audio envelope information can be compressed to fit into a robust watermarking data payload to allow timing correlation with the actual received audio signal envelope.
For the purpose of A/V delay signaling, the watermark data need not be robust to hostile attacks as with copyright applications. However, it must be robust enough to be separated from the active video signal images even if the video signal has changed signal format or undergone analog or digital conversion, scaling or MPEG compression. By encoding the associated audio envelope variation into the video as an invisible watermark at the start of a signal path, the relative timing between the received audio at the end of the signal path and the extracted watermark audio data can be measured. The measured value can then be used to control a variable audio delay to correct for A/V delay errors.
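The timing-correlation step described above can be sketched as a cross-correlation between the envelope carried in the watermark and the envelope of the audio actually received. This is a toy illustration of the principle only; the envelope sample rate, function names and data are assumptions, not a description of any particular product.

```python
# Sketch: estimate A/V delay by cross-correlating the audio envelope
# extracted from the video watermark with the received audio envelope.
# A positive lag means the received audio is delayed relative to video.

def estimate_delay(watermark_env, received_env, max_lag):
    """Return the lag (in envelope samples) that best aligns the
    received envelope with the watermark envelope."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, w in enumerate(watermark_env):
            j = i + lag
            if 0 <= j < len(received_env):
                score += w * received_env[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Toy example: envelope sampled at 100 Hz (10 ms per sample);
# the received audio arrives 3 samples (30 ms) late.
env = [0, 1, 4, 9, 4, 1, 0, 0, 0, 0]
received = [0, 0, 0] + env[:-3]
lag = estimate_delay(env, received, max_lag=5)
print(lag * 10)  # prints 30 (estimated delay in ms)
```

The measured lag would then drive the variable audio delay mentioned above to restore lip sync.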
Potential A/V-delay problem areas
Lip-sync errors are certainly common in ENG productions. Video frame synchronizers used to synchronize ENG feeds to the in-plant reference can introduce a variable A/V delay of up to about four fields (66 ms). The period of the variation depends on the frequency difference between the ENG source and the in-plant or studio timing reference. Ideally, a compensating audio delay should track the variable video delay created by the frame synchronizer. However, a fixed delay is often used, allowing periodic lip-sync errors. For example, a fixed audio delay of 50 ms reduces the A/V delay error on average, but over time the residual lip-sync error can grow to more than one field.
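The fixed-compensation arithmetic above can be worked through in a few lines. This sketch assumes the synchronizer delay wanders between zero and four NTSC fields, as in the example; the numbers are illustrative.

```python
# Residual lip-sync error when a fixed 50 ms audio delay compensates a
# frame synchronizer whose video delay varies from 0 to 4 NTSC fields.
FIELD_MS = 1000 / 59.94          # one NTSC field, ~16.7 ms
FIXED_AUDIO_DELAY_MS = 50        # fixed compensation, as in the example above

# Residual error at each possible synchronizer delay (0..4 fields):
residuals = [f * FIELD_MS - FIXED_AUDIO_DELAY_MS for f in range(5)]
worst_ms = max(abs(r) for r in residuals)
print(f"worst residual lip-sync error: {worst_ms:.1f} ms")  # prints 50.0 ms
```

The worst case occurs when the synchronizer delay is near zero: the audio then lags the video by the full 50 ms, about three fields.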
Another source of A/V delay is the wireless cameras common in ENG productions. These are sometimes sub-switched with wired cameras. The wireless camera uses a compression coder/decoder that will add video delay relative to the audio and wired camera video. Since a separate microphone is not part of the wireless camera, there is often an additional A/V delay when the wireless camera is switched in place of the wired camera.
Television engineers need to understand the problems that cause A/V delay, but the causes are seldom easy to identify. Digital video effects (DVE) machines add a generally predictable fixed delay of an integer number of frames. If the DVE delay is two frames, for example, a fixed audio delay of two frames must be added to compensate. However, if master control dynamically inserts and removes the DVE from the video path, the audio delay must also be switched or adjusted to maintain lip sync. Another problem is that A/V delay often accumulates through a network in small increments that may be barely detectable, if at all; often no single device is the cause. Most devices that process video can add from one field to several frames of latency. Color correctors, noise reducers, frame synchronizers, compression equipment and a variety of other editing and video-processing equipment are commonly used throughout the television network. Even at the source, i.e. the video camera, CCD elements can add several fields of audio-to-video delay.
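The way small per-device latencies accumulate can be illustrated with a short budget calculation. The device list and field counts below are hypothetical examples, not measurements of real equipment.

```python
# Illustrative latency budget: individually small video delays through a
# hypothetical plant add up to a lip-sync error well past the 85 ms
# perceptual threshold unless matching audio delay is applied.
FIELD_MS = 1000 / 59.94  # one NTSC field, ~16.7 ms

chain_fields = {          # hypothetical per-device video latency, in fields
    "camera CCD processing": 2,
    "frame synchronizer": 2,
    "DVE": 4,             # i.e., two frames
    "noise reducer": 1,
}

total_ms = sum(chain_fields.values()) * FIELD_MS
print(round(total_ms))    # prints 150: cumulative video delay in ms
```

No single device here exceeds the perceptual threshold on its own, yet the chain as a whole does, which is why end-to-end measurement (such as the watermark method) matters more than per-device inspection.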
Delay errors are increasingly evident in broadcast material and represent a key element of program quality. Monitoring audio-to-video delay becomes feasible with watermarking, as does automatic correction, providing an innovative solution to an old problem.
Tom Tucker is a product marketing manager in the video business unit of Tektronix.