Audio processing for HDTV, Part 2

Last month we began this discussion on audio for HDTV by looking at key areas that can affect your viewers' audio experience. That article can be found on page 33 of the April issue.

With those points covered, let's continue with a look at metadata and at lip sync, the most common audio problem for today's broadcaster, which is really caused by a video issue.


Metadata can be static or dynamic. Typically the metadata is set at the end of the chain just as the audio is being encoded into Dolby Digital AC-3 as either stereo (2.0) or surround sound (5.1). This is the static approach to metadata processing.

Using dynamic metadata (see Figure 1), engineers can pass data about the audio signals from ingest all the way to master control. In this way, the source data is decoded from the Dolby E signal or from the uncompressed material with either program-provided metadata or generated metadata.

In addition, metadata might be generated when upconverting sources for the stereo audio program. The audio — whether it is stereo or 5.1 — and the metadata are both embedded. The signals are then output, ingested, stored, edited and played back through master control. The program audio is de-embedded ahead of the Dolby Digital AC-3 encoder along with metadata. The audio (2.0 or 5.1) is compressed, and the appropriate metadata is then used to control the downstream decoder in the home.

However, there are issues to consider when using this approach. The metadata might not be easy to review or monitor. And the metadata could be wrong. If it's wrong, the engineer needs to decide whether to fix the audio or fix the metadata. Unfortunately, some of today's home theater receivers mute the audio when there is a transition between 2.0 and 5.1. So, audio engineers need to carefully consider the choices they make when attempting to repair a signal because the fix may be worse than the original problem.
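One way to anticipate the muting problem is to scan the playout schedule for points where the channel mode changes. The following is a minimal sketch only: the playlist structure, the field names and the program names are hypothetical and do not come from any particular automation system.

```python
def find_channel_mode_transitions(playlist):
    """Return (previous, next) program-name pairs where the channel mode changes."""
    transitions = []
    # Compare each program with the one that follows it.
    for previous, current in zip(playlist, playlist[1:]):
        if previous["channel_mode"] != current["channel_mode"]:
            transitions.append((previous["name"], current["name"]))
    return transitions


# Hypothetical schedule; "channel_mode" is an assumed field name.
playlist = [
    {"name": "network drama", "channel_mode": "5.1"},
    {"name": "local spot", "channel_mode": "2.0"},
    {"name": "local news", "channel_mode": "2.0"},
    {"name": "network movie", "channel_mode": "5.1"},
]

flagged = find_channel_mode_transitions(playlist)
for before, after in flagged:
    print(f"Channel mode changes between '{before}' and '{after}'")
```

Each flagged boundary is a place where some home receivers may mute, so it is a candidate for review before air.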

Maintaining lip sync

Video processing is notorious for adding delay. While audio can be quickly decoded and then recoded, it takes time to perform video recoding, often many frames. When this happens, the audio is ready to go before the video gets out of the starting gate. If they are then output simultaneously, the audio will lead the video. The viewer hears what a person will say before his or her lips move. This is quite disturbing to viewers. Video delays also vary over time, so just because things are in time today doesn't mean someone can't move a patch, causing the signal to have lip sync problems. That is why it is important to be able to continually adjust the audio to match any resulting video delay.
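The arithmetic behind matching audio to a video delay is simple. The sketch below converts a video delay in frames into the equivalent audio delay, assuming a 48kHz audio sample rate and a 30000/1001 (29.97) frame rate. Both values are assumptions for illustration, so substitute the rates used in your facility.

```python
from fractions import Fraction

AUDIO_SAMPLE_RATE = 48_000  # Hz, assumed for this example


def audio_delay_for_video(frames_of_video_delay, frame_rate=Fraction(30000, 1001)):
    """Return (milliseconds, samples) of audio delay that matches a video delay."""
    # Exact rational arithmetic avoids rounding drift with 29.97-style rates.
    seconds = Fraction(frames_of_video_delay) / frame_rate
    samples = round(seconds * AUDIO_SAMPLE_RATE)
    milliseconds = float(seconds * 1000)
    return milliseconds, samples


ms, samples = audio_delay_for_video(3)
print(f"3 frames of video delay = {ms:.1f} ms = {samples} audio samples")
```

Because video delay can change whenever the signal path changes, a calculation like this has to be repeated each time the path is altered.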

As long as the video and audio are close enough, perhaps within two frames of each other, viewers may not perceive any lip sync errors. Unfortunately, once a viewer does notice a lip sync error, they become sensitized to it. Now, even smaller differences become noticeable.

Engineers must quickly correct any lip sync problems so viewers don't suddenly become experts and find fault with your signal. The goal should be to prevent viewers from ever detecting lip sync errors on your signal. Point-to-point techniques are used to measure and correct lip sync over links from ingest to master control, or from location to location over satellite. Typically, these techniques are used offline (when the program is not running). Other techniques look at and listen to the picture content and provide a measurement that an operator can use to adjust video delay to maintain the lip sync.

Again, remember that lip sync isn't something you can set and forget. Always monitor it.

Embedded audio

Today's TV facilities often rely on a technique called embedded audio as a convenient way to move audio with video signals. It is used in both SD and HD infrastructures. Embedding the audio onto a digital video stream simplifies lip sync issues because it precisely locks the audio to the video in time.

However, moving embedded audio requires AES interfacing on the support equipment. This may include audio processing, de-embedders and embedders, and console inputs/outputs. So, while it can work well, the engineer has to carefully consider all the points where the audio needs to be removed from the digital video signal, processed in some way, and then reinserted into the video path. Each point where audio is de-embedded and re-embedded is a potential source of lip sync errors.

Loudness control

Many techniques have been tried to maintain consistent audio levels. However, most do not take into account the actual perceived loudness of the audio signal.

Loudness control in television is extremely important because of the variety of mixes, programs and dynamic ranges. Existing approaches have made it difficult to manage loudness effectively, but some new methods of measuring audio loudness now make it possible to better control the output loudness, keeping it consistent, while leaving the dynamic range intact. An example system block diagram is shown in Figure 2.

Detecting types of audio

Because audio can arrive in a variety of formats, operator intervention is typically required to sort out what's provided at ingest. Equipment can detect the varying types of audio by using the C-bit in the AES stream. PCM and non-PCM (compressed audio) can be identified so that a Dolby decoder can be routed into the signal path for decoding as necessary.
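As an illustration of the C-bit check, the sketch below tests bit 1 of channel status byte 0, which AES3 uses to flag non-audio (non-PCM) data. It assumes the channel status block has already been extracted by the receiving hardware and stored with bit 0 as the least significant bit of each byte. The function and path names are hypothetical.

```python
NON_AUDIO_MASK = 0x02  # byte 0, bit 1: 0 = linear PCM, 1 = non-audio data


def is_non_pcm(channel_status: bytes) -> bool:
    """Return True when the channel status block flags non-PCM content."""
    if not channel_status:
        raise ValueError("channel status block is empty")
    return bool(channel_status[0] & NON_AUDIO_MASK)


def route_for(channel_status: bytes) -> str:
    """Pick a hypothetical signal path based on the detected audio type."""
    return "dolby_decoder" if is_non_pcm(channel_status) else "pcm_path"


# Byte 0 = 0x01: professional use, linear PCM.
print(route_for(bytes([0x01])))
# Byte 0 = 0x03: professional use, non-audio (compressed) data.
print(route_for(bytes([0x03])))
```

The check is only as reliable as the upstream equipment that sets the bit, which is one reason operator confirmation is still common at ingest.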

For an upmixing process, detecting 2.0 and 5.1 is important so that 5.1 passes through unaltered and the 2.0 signal is immediately upmixed. Current techniques also include the use of the automation system to signal the processing equipment using GPI.
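The pass-or-upmix decision can be sketched as follows. This is an assumption-laden example: the format names, the path names and the rule that a GPI from automation overrides local detection are all hypothetical choices made for illustration.

```python
from enum import Enum


class Format(Enum):
    STEREO = "2.0"
    SURROUND = "5.1"


def choose_path(detected, gpi_format=None):
    """Return the processing path for the incoming audio.

    detected   -- format found by the processing equipment itself
    gpi_format -- format signaled by automation over GPI, if any (assumed
                  here to take priority over local detection)
    """
    effective = gpi_format if gpi_format is not None else detected
    # 5.1 passes through unaltered; 2.0 is sent to the upmixer.
    return "pass_through" if effective is Format.SURROUND else "upmix"


print(choose_path(Format.STEREO))
print(choose_path(Format.SURROUND))
print(choose_path(Format.STEREO, gpi_format=Format.SURROUND))
```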

Implementing solutions

There are several possible approaches to implementing these new processes in a television facility. One approach is to build a hybrid stereo audio and 5.1 surround sound audio infrastructure. If metadata processing is included, then placing the Dolby Digital AC-3 encoder at the end of the chain is possible. Another approach is to upmix stereo audio sources to 5.1 just before the emission path. Then the facility needs to pass through any 5.1 surround and set the encoder to 5.1 metadata.

A third approach is to leave the stereo audio infrastructure intact and downmix all 5.1 signals at ingest to a 2.0 signal. At the end of the chain, the signal is then upmixed back to 5.1. Any 2.0 signals are then upmixed as well. The encoder is set for 5.1 metadata, and 5.1 is then sent into the home at all times. This approach offers the benefit of reduced design and implementation costs.
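To show what the downmix step at ingest involves, here is a minimal sketch that folds one 5.1 sample into a left/right pair. The -3dB (0.707) gains on the center and surround channels, and the omission of the LFE channel, are commonly used choices assumed for illustration. A given downmixer may use other coefficients.

```python
import math

CENTER_GAIN = 1 / math.sqrt(2)    # about -3 dB, assumed
SURROUND_GAIN = 1 / math.sqrt(2)  # about -3 dB, assumed


def downmix_sample(left, right, center, lfe, left_surround, right_surround):
    """Fold one 5.1 sample into a stereo (left, right) pair.

    The LFE channel is accepted but not used in this sketch.
    """
    left_out = left + CENTER_GAIN * center + SURROUND_GAIN * left_surround
    right_out = right + CENTER_GAIN * center + SURROUND_GAIN * right_surround
    return left_out, right_out


# A center-only signal lands equally in both stereo channels.
lo, ro = downmix_sample(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
print(lo, ro)
```

Summed channels can exceed full scale, so a practical downmixer also needs headroom or limiting, which this sketch leaves out.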

The most modern approach is to build a full 5.1 infrastructure and upmix all 2.0 sources at ingest. This method allows 5.1 monitoring across the infrastructure, as well as all audio editing and sweetening to be performed in 5.1. The ideal 5.1 infrastructure also includes metadata management. Loudness control is then added at the end of the signal chain. A loudness level is chosen, and the input audio is matched to the output loudness level.
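Matching the input to a chosen loudness level comes down to a gain offset. The sketch below assumes the measured loudness is supplied by a separate loudness meter and that both values are expressed in the same decibel-based unit. The example figures are hypothetical.

```python
def matching_gain_db(measured_loudness, target_loudness):
    """Gain in dB that moves the measured loudness to the target."""
    return target_loudness - measured_loudness


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


# Hypothetical values: program measured at -20, target level of -24.
gain_db = matching_gain_db(-20.0, -24.0)
multiplier = db_to_linear(gain_db)
print(f"Apply {gain_db:+.1f} dB (x{multiplier:.3f})")
```

Because a single gain offset is applied to the whole program, the dynamic range of the mix is left intact.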

Another option is to alter the metadata as the input audio changes. However, as we've discussed, this may not be ideal, as varying metadata at the home receiver might have unexpected and undesirable results.

How to interface

New audio devices typically have AES inputs and outputs, but most television infrastructures still rely on embedded audio transportation. When interfacing these AES devices, along with the de-embedding and embedding requirements, sufficient video delay must be added to maintain lip sync. All audio devices must maintain a fixed delay to simplify lip sync management throughout the facility. Metadata generators, monitoring and management must be easy to use. Otherwise, operators will simply become confused and ignore the solutions, which are actually at their fingertips.

Action plan

First, be sure you understand the basics of audio interfacing. Recognize the differences between balanced and unbalanced AES lines. Know how analog and embedded audio can be moved in SD-SDI and HD-SDI streams.

Second, create a facility timing diagram. You need to understand the propagation of signals through your facility's video and audio equipment so that lip sync can be more easily predicted and maintained.
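A timing diagram can be reduced to a table of per-device delays. The sketch below totals the video and audio delays along one path and reports the correction needed. The device names and delay values are hypothetical, so use measured figures from your own equipment.

```python
# Hypothetical signal path: (device, video delay in ms, audio delay in ms).
SIGNAL_PATH = [
    ("frame synchronizer", 33.4, 0.0),
    ("de-embedder and audio processor", 0.0, 5.0),
    ("video format converter", 66.7, 0.0),
    ("embedder", 0.0, 1.0),
]


def path_offset_ms(path):
    """Video delay minus audio delay; positive means the audio leads."""
    video_total = sum(video for _, video, _ in path)
    audio_total = sum(audio for _, _, audio in path)
    return video_total - audio_total


def correction(offset_ms):
    """Return (action, amount in ms) needed to bring the path back into sync."""
    if offset_ms > 0:
        return ("delay audio", offset_ms)
    if offset_ms < 0:
        return ("delay video", -offset_ms)
    return ("none", 0.0)


offset = path_offset_ms(SIGNAL_PATH)
action, amount = correction(offset)
print(f"{action} by {amount:.1f} ms")
```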

Third, there are many approaches to mapping audio into a TV facility. It is important to understand that a stereo mix, a 5.1 mix, natural sound and descriptive video could all be associated with a program path. Also, because going beyond eight channels is easy for today's audio processing, it's necessary to design a system capable of handling 16 audio channels. Audio processors that provide easy routing or mapping of audio for different productions become important.

Finally, the difficulty of presenting consistent audio in the home is one of the biggest challenges facing broadcasters. Fortunately, there are techniques and equipment available to address the issues. Properly applied, good audio processing can improve the audio experience for your viewers. The key is to understand the issues, design the required infrastructure, budget for the purchase and then carefully commission and train your staff.

Randy Conrod is product manager of digital products for Harris.