Breaking Down Embedded Audio Part 1

Fortunately for the audio world, the SD and HD digital video formats contain unused space, called the horizontal and vertical ancillary data space, places between the end and start of active video data which are devoid of any bits representing the video signal. It's said that nature abhors a void, so it didn't take long for other types of data to fill this space, digital audio being the most common.


Digital audio is a good fit, at least conceptually. Having audio and video in the same bitstream makes sense for distribution: it simplifies routing switcher systems and saves on cable runs, to name two positives. However, digital audio isn't a good fit bit for bit, since a digital audio subframe is 32 bits long, while an ancillary data word is only 10 bits. Some finessing is required.

This time we'll look at how the AES3 signal is formatted for the horizontal ancillary data space (HANC) for standard-definition component digital video, per SMPTE 272M, "Formatting AES Audio and Auxiliary Data into Digital Video Ancillary Data Space." This applies to digital video standards SMPTE 259M "10-Bit 4:2:2 Component and 4fsc NTSC Composite Digital Signals—Serial Digital Interface" and SMPTE 344M "540 Mbps Serial Digital Interface."

Digital audio is embedded in the HANC which, as might be expected, has a packet format different from AES audio, and, as mentioned, has a word length of 10 bits.

Fig. 1: How 23 bits of an AES sample word (one subframe) are divided into three 10-bit HANC words, X, X+1, and X+2, prior to packetizing, per SMPTE 272M.

Obviously a 32-bit digital audio word isn't going to fit directly into a 10-bit slot. The digital audio word needs to be broken down into smaller chunks and then properly HANC packetized. First, the four AES3 Aux bits are separated from the rest of the digital audio data and inserted into a special packet. (Recall that one function of the Aux bits is to provide up to 24 bits of resolution per audio sample, so this process is required when 24-bit digital audio is present.)

Next, the 20 bits that make up an AES sample word, along with its corresponding validity bit (V), user data bit (U), and audio channel status bit (C), are divided into three contiguous 10-bit ancillary data words, called X, X+1 and X+2.


Let's start with the "X word." Bit zero, when set to logic 1, represents the AES "Z" bit, which indicates the start of a new AES channel status block and corresponds to AES frame zero. In the AES bitstream, the Z bit occurs at AES frame zero, subframe A only. However, for embedded audio, it's good practice to set the Z bit in the X word for frame zero of each of the two channels (subframes) in an AES pair.

Bits 1 and 2 of the X word indicate the embedded audio channel number within an audio group, as shown in Fig. 2. The HANC can contain up to 16 channels of embedded audio, in four groups of four channels (two AES pairs) each. Each group is identified by the data identification word in the header of the HANC packet, as discussed below.
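The group and channel arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and because the Fig. 2 code table is not reproduced in the text, the sketch assumes the 2-bit channel code simply counts up from 0 to 3 within each group.

```python
CHANNELS_PER_GROUP = 4   # two AES pairs per group
MAX_CHANNELS = 16        # four groups of four channels


def locate_channel(channel: int) -> tuple[int, int]:
    """Map an embedded channel number (1-16) to (group 1-4, 2-bit code).

    Assumption: the code counts 0..3 within the group; check it
    against the Fig. 2 table before relying on it.
    """
    if not 1 <= channel <= MAX_CHANNELS:
        raise ValueError("channel must be in 1..16")
    index = channel - 1
    group = index // CHANNELS_PER_GROUP + 1
    code = index % CHANNELS_PER_GROUP
    return group, code


def channel_bits_in_x_word(code: int) -> int:
    """Place the 2-bit channel code in bits 1 and 2 of the X word."""
    return (code & 0b11) << 1
```

For example, under these assumptions `locate_channel(6)` places channel 6 in group 2 with channel code 1.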

Bits 3 through 8 of the X word, referred to as Aud 0 through Aud 5, are the first six bits of the two's complement, linearly formatted AES3 digital audio sample word, with Aud 0 being the least significant bit.

The embedded audio "X+1 word" contains audio data bits 6 through 14, and the "X+2 word" contains the remaining audio bits 15 through 19, with "Aud 19" being the most significant audio data bit. The X+2 word also contains the AES3 validity, user, and audio channel status bits.

Fig. 2: SMPTE 272M embedded audio "X word" channel (Ch) code.

Bit 8 of the X+2 word contains a parity bit, but this is not the same as the AES parity bit. The AES parity bit is not coded in embedded audio. The embedded audio P bit is set so that there is even parity for bits zero through 8 of the "X and X+1 words" plus bits zero through 7 of the X+2 word (26 bits in total).

Bit 9 of each of the embedded audio words, X, X+1 and X+2, is the complement of bit 8 of the same word. If bit 8 is one, then bit 9 is zero, and vice versa. This is done to make sure that the embedded audio words don't contain data reserved for specific purposes in the digital video standard.
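To make the bit layout concrete, here is a minimal Python sketch of packing one 20-bit sample into the three 10-bit words. The function name and argument order are hypothetical. The layout assumed is: X carries Z in bit 0, the channel code in bits 1 and 2, and Aud 0 through Aud 5 in bits 3 through 8; X+1 carries Aud 6 through Aud 14 in bits 0 through 8; X+2 carries Aud 15 through Aud 19 in bits 0 through 4, then V, U and C in bits 5, 6 and 7, and P in bit 8.

```python
def pack_subframe(sample, z=0, ch=0, v=0, u=0, c=0):
    """Return the (X, X+1, X+2) 10-bit words for one 20-bit audio sample."""
    sample &= 0xFFFFF  # keep 20 bits; negative two's complement values wrap

    # X word: Z bit, 2-bit channel code, then Aud 0-5.
    x = (z & 1) | ((ch & 0b11) << 1) | ((sample & 0x3F) << 3)
    # X+1 word: Aud 6-14 in the nine low bits.
    x1 = (sample >> 6) & 0x1FF
    # X+2 word: Aud 15-19, then V, U and C.
    x2 = ((sample >> 15) & 0x1F) | ((v & 1) << 5) | ((u & 1) << 6) | ((c & 1) << 7)

    # P makes the parity even over the 26 bits counted so far
    # (9 from X, 9 from X+1, 8 from X+2).
    ones = bin(x).count("1") + bin(x1).count("1") + bin(x2).count("1")
    x2 |= (ones & 1) << 8

    def with_bit9(word):
        # Bit 9 is the complement of bit 8 of the same word.
        return word | (((word >> 8) & 1) ^ 1) << 9

    return with_bit9(x), with_bit9(x1), with_bit9(x2)
```

As a check, `pack_subframe(1, z=1)` gives `(0x209, 0x200, 0x200)`: only Z and Aud 0 are set in the X word, parity is already even so P stays 0, and bit 9 is set in all three words because bit 8 is 0 in each.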

After each AES word is mapped to the three 10-bit HANC words, they are then strung together in a proper HANC packet, per SMPTE 291M "Ancillary Data Packet and Space Formatting."

Fig. 3: Horizontal ancillary data packet format containing four channels of embedded digital audio (comprising one audio group) in a standard-definition component video bit stream per SMPTE 272M.

Fig. 3 shows the format for an HANC data packet that contains, in this example, one group of four channels of digital audio (two AES pairs).

Each HANC packet contains a header, which includes the ancillary data flag (ADF), taking up three words for component standard-definition digital video, plus a word each for data identification (DID), data block number (DBN), and data count (DC). Next follow the user data words (UDW), which in the case of embedded audio consist of a series of X, X+1 and X+2 words for the different audio samples of the different audio channels. The packet ends with a checksum (CS) word. Each data packet can contain up to 255 user data words, and more than one data packet can be multiplexed into an individual data space.

The packet starts with the ancillary data flag (ADF), which for component video consists of three specific words with values of 000h, 3FFh, 3FFh. Next follows the data identification word (DID), which indicates which group the audio samples belong to. The data block number is a sequential count of data blocks with the same data identification. The data count word indicates the number of user data words that come next. The last word in the packet is the checksum, which takes into account the DID, DBN, DC and UDW words.
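The packet assembly can be sketched as follows. This goes a little beyond what the article spells out, so treat the details as assumptions to verify against SMPTE 291M: the sketch assumes DBN and DC carry an 8-bit value with even parity in bit 8 and its complement in bit 9, and that the checksum is the sum of the nine least significant bits of the DID through the last UDW, kept to nine bits, with bit 9 again the complement of bit 8. The function names are hypothetical, and the DID is passed in as a ready-made 10-bit word.

```python
ADF = [0x000, 0x3FF, 0x3FF]  # ancillary data flag for component video


def protect8(value):
    """Assumed DBN/DC coding: 8-bit value, even parity in b8, b9 = not b8."""
    value &= 0xFF
    b8 = bin(value).count("1") & 1
    return value | (b8 << 8) | ((b8 ^ 1) << 9)


def checksum(words):
    """Assumed CS coding: 9-bit sum of the 9 LSBs of each word, b9 = not b8."""
    total = sum(w & 0x1FF for w in words) & 0x1FF
    b8 = (total >> 8) & 1
    return total | ((b8 ^ 1) << 9)


def build_packet(did_word, dbn, udw):
    """Assemble ADF, DID, DBN, DC, the user data words and CS."""
    if len(udw) > 255:
        raise ValueError("a packet holds at most 255 user data words")
    body = [did_word & 0x3FF, protect8(dbn), protect8(len(udw))] + list(udw)
    return ADF + body + [checksum(body)]
```

A call such as `build_packet(0x2FF, 1, [0x209, 0x200, 0x200])` returns a 10-word packet ending in the checksum word. The DID value `0x2FF` is used here only as a placeholder for an audio group identifier; take the actual values from the standard.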

As noted above, if 24-bit audio is present, the AES auxiliary bits need to be packetized separately in the HANC. The embedded audio standard also allows for an audio control packet. These will be discussed next time.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.