
Embedded Audio for SD Component Video

Because the AES3 digital audio standard allows word lengths of up to 24 bits as well as a variety of sampling rates, SMPTE 272M provides the means for embedding these signals. In addition, SMPTE 272M allows for the indication of audio frame numbers, the amount of audio processing delay, and active audio channels. Two specialized packets handle these chores: the extended data packet and the audio control packet.


As discussed previously, embedded 20-bit digital audio (the bits located in AES3 timeslots 8 to 27) is carried in the main horizontal ancillary data packets. The four least significant bits of 24-bit digital audio (those in AES3 timeslots 4 to 7) are handled differently. To embed these four bits into a standard-definition component digital video signal, they are first separated from the rest of the AES3 bitstream and placed in another packet, called the extended data packet.
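The separation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the 24-bit sample is held as an unsigned integer whose lowest four bits correspond to AES3 timeslots 4 to 7; the function names are hypothetical and do not come from SMPTE 272M.

```python
from typing import Tuple


def split_24bit_sample(sample: int) -> Tuple[int, int]:
    """Split a 24-bit sample into its 20-bit main part and 4-bit auxiliary part.

    Assumption: `sample` is an unsigned 24-bit integer whose four lowest
    bits are the ones carried in AES3 timeslots 4 to 7.
    """
    if not 0 <= sample < (1 << 24):
        raise ValueError("sample must fit in 24 bits")
    main_20 = sample >> 4   # upper 20 bits: go in the main audio data packet
    aux_4 = sample & 0xF    # lower 4 bits: go in the extended data packet
    return main_20, aux_4


def join_24bit_sample(main_20: int, aux_4: int) -> int:
    """Rebuild the 24-bit sample at the receiving end."""
    return ((main_20 & 0xFFFFF) << 4) | (aux_4 & 0xF)
```

A receiver that ignores the extended data packet still recovers a valid 20-bit sample from the main packet alone.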

Fig. 1: Extended data packet structure for embedded audio in a standard-definition component video bit stream per SMPTE 272M. (Reference for all figures: SMPTE 272M.)

According to SMPTE 272M, this extended data packet contains "two 4-bit groups of auxiliary data per ancillary data word… One extended data word will be transmitted for each corresponding sample pair."

Fig. 1 shows the configuration of an extended data packet, and Fig. 2 shows the details of each aux word. An extended data packet, if present in the embedded audio signal, follows the main audio data packet it's associated with and is located on the same video line as the main packet.
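To make the "two 4-bit groups per word" idea concrete, here is a minimal Python sketch that packs the auxiliary bits of a sample pair into the low eight bits of one word. The placement of the first sample's bits in the lower nibble and the second sample's in the upper nibble is an assumption made for illustration, and the remaining bits of the 10-bit ancillary word are left out entirely; Fig. 2 and the standard itself are the authority on the real layout.

```python
from typing import Tuple


def pack_aux_word(x_aux: int, y_aux: int) -> int:
    """Pack two 4-bit auxiliary groups into one 8-bit value.

    Assumption (illustrative only): x_aux, from the first sample of the
    pair, occupies bits 0-3 and y_aux, from the second sample, bits 4-7.
    """
    return ((y_aux & 0xF) << 4) | (x_aux & 0xF)


def unpack_aux_word(word: int) -> Tuple[int, int]:
    """Recover (x_aux, y_aux) from a value built by pack_aux_word."""
    return word & 0xF, (word >> 4) & 0xF
```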

According to SMPTE 272M, "all of the audio and auxiliary data from one audio group shall be transmitted together before data from another group is transmitted."

The audio control packet contains bits indicating audio frame numbers, audio sampling rate, active audio channels, and audio processing delay. According to SMPTE 272M, the audio control packet is transmitted once per field in an interlaced system or once per frame in a progressive system.

Fig. 2: Extended audio packet data structure

The audio control packet must be present in cases where digital audio has a sampling rate other than the default case of 48 kHz locked to video. Each audio group has its own control packet. In the default case, the audio control packet is optional. However, if the audio control packet is not transmitted, other parameters contained in the packet are undefined. Fig. 3 shows the audio control packet structure.

Let's look at each of the parameters it contains.

According to SMPTE 272M, "audio frame numbers provide a sequential ordering of video frames to indicate where they fall in the progression of non-integer number of samples per video frame (audio frame sequence) inherent in 30/1.001 frame/s video systems. The first number in the sequence is always one and the final number is equal to the length of the audio frame sequence. A value of all zeros indicates no frame numbering is available."

The rate word in the audio control packet indicates the sampling rate, as shown in Fig. 4.
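The audio frame sequence mentioned in the quotation follows from simple arithmetic: 48,000 samples per second divided by 30/1.001 frames per second gives 1601.6 samples per frame, so a whole number of samples only comes around every five frames. The Python sketch below works this out with exact fractions; the function name and return shape are hypothetical, chosen for illustration.

```python
from fractions import Fraction
from typing import Tuple


def audio_frame_sequence(sample_rate_hz: int,
                         frame_rate: Fraction) -> Tuple[Fraction, int, int]:
    """Return (samples per frame, sequence length in frames, samples per sequence).

    The sequence length is the smallest number of video frames that
    contains a whole number of audio samples.
    """
    samples_per_frame = Fraction(sample_rate_hz) / frame_rate
    length = samples_per_frame.denominator        # frames until samples align
    total = int(samples_per_frame * length)       # whole samples in that span
    return samples_per_frame, length, total


# 48 kHz audio with 30/1.001 frame/s video
per_frame, length, total = audio_frame_sequence(48000, Fraction(30000, 1001))
print(float(per_frame), length, total)  # 1601.6 5 8008
```

With 25 frame/s video the same calculation gives exactly 1920 samples per frame and a sequence length of one, which is why the quotation singles out 30/1.001 frame/s systems.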

Fig. 3: Structure of audio control packet

THE DEAL WITH DELAY

The act word of the audio control packet indicates whether each audio channel in a four-channel group is active. The first four bits of the act word correspond to the four audio channels, respectively. When one of these bits is set to one, the corresponding channel is active.
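Reading the act word amounts to testing four bits. The Python sketch below assumes that bit 0 of the word maps to channel 1, bit 1 to channel 2, and so on; that ordering is an assumption for illustration and should be checked against Fig. 3 and the standard.

```python
from typing import List


def active_channels(act_word: int) -> List[int]:
    """List the active channels (1-4) in an audio group.

    Assumption: bit 0 of the act word corresponds to channel 1,
    bit 1 to channel 2, bit 2 to channel 3 and bit 3 to channel 4.
    """
    return [ch for ch in range(1, 5) if (act_word >> (ch - 1)) & 1]


print(active_channels(0b0101))  # [1, 3]
```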

The delay words indicate, according to SMPTE 272M, "the amount of accumulated audio processing delay relative to video, measured in audio sample intervals, for each of the channels… The delay words are referenced to the point where the AES data are input to the formatter. The delay words represent the average delay value, inherent in the formatting process, over a period no less than the length of the audio frame sequence… plus any preexisting audio delay. Positive values indicate that the video leads the audio."

Delay can be indicated for channel pairs or for each of the four channels individually, depending on whether the first bit of each DELAY-0 word is zero or one, respectively. If delay is indicated for individual channels, DELAY-A is for channel 1, DELAY-B is for channel 3, DELAY-C is for channel 2, and DELAY-D is for channel 4. If delay is indicated for channel pairs, then DELAY-A is for channels 1 and 2, and DELAY-B is for channels 3 and 4. In the channel-pair case, DELAY-C and DELAY-D have no meaning.
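Each delay value is a 26-bit two's complement number counted in audio sample intervals, so a receiver has to sign-extend it before use. The Python sketch below starts from an already assembled 26-bit integer (how the three delay words combine into that integer is left to the standard) and converts the result to milliseconds; the function names and the 48 kHz default are assumptions for illustration.

```python
def decode_delay_samples(raw_26: int) -> int:
    """Interpret a 26-bit two's complement field as a signed sample count.

    Positive results mean the video leads the audio, per the SMPTE 272M
    wording quoted in the article.
    """
    raw_26 &= (1 << 26) - 1          # keep only the 26 delay bits
    if raw_26 & (1 << 25):           # sign bit set: value is negative
        raw_26 -= 1 << 26
    return raw_26


def delay_ms(samples: int, sample_rate_hz: int = 48000) -> float:
    """Convert a delay in audio sample intervals to milliseconds."""
    return 1000 * samples / sample_rate_hz


print(decode_delay_samples(0x3FFFFFF))        # -1
print(delay_ms(decode_delay_samples(480)))    # 10.0
```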

Fig. 4: Audio Control Packet, Rate Word structure

The amount of delay is indicated as a 26-bit two's complement number formed by each set of three delay words.

This completes the highlights of embedded digital audio for component standard-definition video. Stay tuned for the high-definition version.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.