Audio distribution

When digital television formats were initially created, the audio systems were carefully designed to support mono, stereo and 5.1 (and beyond, in some cases). These formats, such as Dolby Digital (AC-3) and DTS Coherent Acoustics (CA), ably encompassed the path from the broadcaster to the consumer. The only problem was that little thought was given to the lengthy path leading up to transmission: How do you get 5.1 channels (or more) to the transmitter?

The path prior to transmission was loaded with VTRs and routing gear that supported only four channels of audio, which made carrying 5.1 channels impossible. Worse, what happens when multiple languages or services need to be carried? Worse still, what if these additional programs need to be in surround? Luckily, some new solutions look far beyond the initial goal of delivering a single 5.1-channel program and a mono alternate-language channel.

History

In 1999, Dolby Laboratories released the Dolby E mezzanine audio compression system to answer the problem. Meant to wrap around equipment that supported only four digital audio channels, the system carries eight high-quality channels of audio plus audio metadata in a single 20-bit, 48kHz AES pair (i.e. 1.92Mb/s). The system can also carry six channels of audio and metadata in a 16-bit, 48kHz AES pair (i.e. 1.536Mb/s).
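The two data rates quoted above follow directly from the word length, the 48kHz sample rate and the two channels of an AES pair. A minimal sketch of that arithmetic is shown here; the helper name is made up for illustration, and it counts audio sample bits only, not the preamble and status bits of the full AES frame.

```python
def aes_pair_payload_bps(bits_per_sample: int, sample_rate_hz: int = 48_000) -> int:
    """Audio payload of one AES pair in bits per second.

    Hypothetical helper: counts only the audio sample bits of the
    two channels in the pair.
    """
    channels_per_pair = 2
    return bits_per_sample * sample_rate_hz * channels_per_pair


print(aes_pair_payload_bps(20) / 1e6)  # 20-bit pair, in Mb/s
print(aes_pair_payload_bps(16) / 1e6)  # 16-bit pair, in Mb/s
```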

This means that a VTR with two AES pairs could devote one pair to normal PCM, while the remaining pair could carry eight additional channels of audio encoded in Dolby E format. The system outputs compressed audio in packets equal in duration to reference video frames, so a plant operating at 25fps will use Dolby E at 25fps. This allows video frame-based editing and switching to be performed on the compressed signal without affecting the final decoded audio.

Several issues have arisen since the release of the technology. One is that the advantages of being tied to a video frame rate also have a downside: Content that is mastered at one frame rate and then converted to another will usually require decoding and re-encoding at the new frame rate. This will not cause audio issues per se, but it may cause metadata and lip-sync mistakes and definitely will introduce additional delay. The downside of simply ignoring this need for transcoding is that the tape or bitstream cannot be edited or switched without causing audio problems.

The top half of Figure 1 shows a simplified example of a PAL video frame compared to a PAL Dolby E frame. Note that everything lines up. If a format conversion is done to change PAL to NTSC and Dolby E is not decoded and re-encoded, then the bottom half of Figure 1 shows how the video switch points will corrupt the Dolby E frames.
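The misalignment described for the bottom half of Figure 1 can be reproduced with simple arithmetic. The sketch below is an illustration only. It assumes Dolby E frames start every 1/25 second, that NTSC video switch points fall every 1001/30000 second (29.97fps), and that both start aligned at time zero. The function name is hypothetical, and any guard interval a real bitstream provides around the switch point is ignored.

```python
from fractions import Fraction

PAL_FPS = Fraction(25)
NTSC_FPS = Fraction(30000, 1001)  # 29.97fps


def misaligned_switch_points(audio_fps: Fraction, video_fps: Fraction,
                             n_video_frames: int) -> list[int]:
    """Indices of video frame boundaries that fall inside an audio frame.

    A boundary is "aligned" only when it lands exactly on the start of a
    frame-locked audio packet (exact rational arithmetic, no rounding).
    """
    audio_period = 1 / audio_fps
    video_period = 1 / video_fps
    misaligned = []
    for k in range(n_video_frames):
        switch_time = k * video_period
        if switch_time % audio_period != 0:
            misaligned.append(k)
    return misaligned


# PAL video with PAL-rate audio frames: every switch point lines up.
print(misaligned_switch_points(PAL_FPS, PAL_FPS, 10))
# NTSC video with PAL-rate audio frames: after frame 0, none line up.
print(misaligned_switch_points(PAL_FPS, NTSC_FPS, 10))
```

Under these assumptions the two rates realign only once every 1,200 NTSC frames, which is why the compressed signal has to be decoded and re-encoded at the new rate before it can be switched safely.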

Another issue is that eight channels per AES pair may simply not be sufficient to carry all of the audio channels necessary for multiple languages and services. Finally, like all other compression systems, Dolby E incurs delay in both the encoder and the decoder, so a matching delay must be applied to the video.

PCM rules

Since 1999, the audio channel count in professional VTRs has grown. The original Panasonic HD-D5 and Sony HDCAM formats supported only four audio channels. The latest HD-D5 can handle eight channels, and the new HDCAM-SR format has 12 audio channels. Most video servers can handle 16 or more channels of audio, and hard disk space is far less expensive than adding compression systems everywhere. A further benefit is that all of these systems manage audio/video timing internally with no need for expensive external delays.

Facility routing and switching capacity has also grown with the adoption of SMPTE 299M, which carries up to 16 channels of audio in the horizontal ancillary (HANC) space of an HD-SDI signal. Audio metadata can also be carried in the vertical ancillary (VANC) space of the same HD-SDI signal, providing video, audio and metadata in a single tightly synchronized, routable, recordable package. Not all VTRs or servers can handle the metadata portion of the HD-SDI signal, but the number that can is growing, thanks to requests from broadcasters. Metering and monitoring devices that handle only uncompressed PCM audio are also less expensive than models that must also decode compressed audio. All known editing and production gear that handles digital audio also handles uncompressed PCM.

The drive towards file-based production is also speeding up the transition from compressed to uncompressed audio. When files are stored using standard PCM, anyone anywhere can accept that file on a hard drive, on optical media, or via FTP or e-mail and play it back without external decoding gear.

It is now possible and practical to keep audio as baseband PCM throughout most parts of a facility. This allows compression systems to be limited to those areas with bandwidth restrictions, such as satellite backhauls and other RF or telco distribution paths.

Maximizing bits

The areas where mezzanine compression systems are still required demand that these systems have the flexibility to allow a larger number of channels to be carried in the same space as legacy systems.

With this in mind, a new format called e2 (e-squared) has been developed. Based on the Coherent Acoustics (CA) algorithm from DTS, it solves several longstanding issues.

The system accepts up to 16 audio channels plus audio metadata and encodes them into a single 20-bit, 48kHz AES pair, effectively doubling the eight-channel capacity of Dolby E in the same space. Up to 12 audio channels plus audio metadata can be carried in a 16-bit, 48kHz AES pair. The system can be configured to carry only 5.1 channels of audio in this same 16-bit, 48kHz pair if desired. Compensating HD/SD-SDI video delays are built into both the encoder and decoder. With built-in embedding and de-embedding, these signals can also be used for PCM audio inputs and outputs.
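To put these channel counts in perspective, the average share of the AES pair available to each channel can be estimated by dividing the pair's audio payload by the number of channels. This is a rough upper bound offered as a sketch. It ignores metadata and framing overhead, which the article does not quantify, and the helper name is made up for illustration.

```python
def kbps_per_channel(bits_per_sample: int, channels: int,
                     sample_rate_hz: int = 48_000) -> float:
    """Average payload per coded channel in kb/s (upper bound).

    Hypothetical helper: assumes the whole two-channel AES payload is
    shared equally among the coded channels, with no overhead.
    """
    payload_bps = bits_per_sample * sample_rate_hz * 2
    return payload_bps / channels / 1000


configurations = [
    ("16 channels in a 20-bit pair", 20, 16),
    ("12 channels in a 16-bit pair", 16, 12),
    ("5.1 (6 channels) in a 16-bit pair", 16, 6),
    ("8 channels in a 20-bit pair (Dolby E figure)", 20, 8),
]
for label, bits, channels in configurations:
    print(f"{label}: {kbps_per_channel(bits, channels):.0f} kb/s per channel")
```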

Thanks to the CA coding system, the e2 format is at heart an advanced ADPCM system, and this brings several advantages in handling transmission errors. ADPCM systems can spread errors across several decoded frames instead of producing a loud burst of near-full-scale noise. Further, the e2 system was designed from the start to be synchronized to a 48kHz audio reference, which allows it to be used with any video frame rate without the need for a decode/re-encode cycle. The bitstream can also be edited or switched on AES frame boundaries, allowing video frame and field edits, and it can be routed through normal AES routers.

Figure 2 shows how the e2 format lines up with video frames and AES frames. For clarity, the AES frames are magnified many times and in reality are much finer-spaced than shown. Switching is possible on any of these boundaries.
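Because an AES frame lasts one sample period, a 48kHz stream offers 48,000 possible switch points per second regardless of the video rate. The sketch below, with hypothetical function names, computes how many AES frames span one video frame and which AES frame boundary lies nearest to a given video frame boundary, assuming both start aligned at time zero.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 48_000  # one AES frame per sample period
PAL_FPS = Fraction(25)
NTSC_FPS = Fraction(30000, 1001)  # 29.97fps


def aes_frames_per_video_frame(video_fps: Fraction) -> Fraction:
    """Exact number of AES frames spanning one video frame."""
    return Fraction(SAMPLE_RATE_HZ) / video_fps


def nearest_aes_boundary(video_frame_index: int, video_fps: Fraction) -> int:
    """Index of the AES frame boundary closest to a video frame boundary."""
    exact_position = video_frame_index * aes_frames_per_video_frame(video_fps)
    return round(exact_position)


print(aes_frames_per_video_frame(PAL_FPS))          # whole number at 25fps
print(float(aes_frames_per_video_frame(NTSC_FPS)))  # fractional at 29.97fps
print(nearest_aes_boundary(1, NTSC_FPS))
```

At 29.97fps the count is not a whole number, so video and AES frame boundaries coincide only every fifth video frame. The nearest AES boundary is never more than half a sample period away, which is the sense in which the figure shows switching on any of these boundaries.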

Remotely controlled audio metadata generation is built into an e2 encoder along with a metadata frame synchronizer to tame externally supplied metadata. A matching generator and frame synchronizer are present in the decoder to ensure that any external transmission errors do not negatively affect metadata output.

Another solution

Consumers expect 5.1-channel audio in their entertainment, and it is up to broadcasters to meet the demand. It is now possible to produce this content using today's familiar workflows by keeping the audio in the PCM domain whenever possible. If the audio reaches a bottleneck, new formats such as e2 provide an efficient way to maintain high-quality audio, protect audio metadata and manage latency to prevent lip-sync errors.

Tim Carroll is president of Linear Acoustic.
