Audio multiplexing


Figure 1. Ancillary data packet structure for the 4:2:2 525/59.94 format.

Audio and video data are captured and processed separately for delivery to the end user. Analog approaches require separate distribution media for audio and video: in a teleproduction studio this usually means two separate cables, and in transmission to homes separate carriers are used, resulting in a frequency-division multiplex (FDM). Digital video allows a more efficient distribution on a single cable or carrier using a time-division multiplex (TDM).

Figure 1 shows details of the 4:2:2 525/59.94 horizontal blanking structure. The component digital standards do not provide for the sampling of the analog sync pulses. Two timing reference signals (TRS) are multiplexed into the data stream on every line immediately preceding and following the active line data.

Of the 276 data words in the horizontal blanking interval, eight are reserved for the transmission of the TRS. Words 1440, 1441, 1442 and 1443 carry the end of active video (EAV) TRS message, and words 1712, 1713, 1714 and 1715 carry the start of active video (SAV) TRS message. Each TRS consists of a four-word sequence. Using a 10-bit hexadecimal notation, these words are represented as follows:

3FF 000 000 XYZ

The first three words are a fixed preamble that unambiguously identifies the EAV and SAV information. XYZ is a variable word that defines the field identification (F), the state of vertical blanking (V) and the state of horizontal blanking (H).
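
As an illustration, the short sketch below locates a TRS preamble in a stream of 10-bit words and decodes the XYZ word. The bit positions used for F, V and H (bits 8, 7 and 6) follow the usual component digital interface convention; treat them, and the function names, as assumptions made for this example rather than quotations from the standard.

    # Sketch: finding a TRS (3FF 000 000 XYZ) and decoding the XYZ word.
    def find_trs(words, start=0):
        """Return the index of the XYZ word of the next TRS at or after 'start'."""
        for i in range(start, len(words) - 3):
            if words[i] == 0x3FF and words[i + 1] == 0x000 and words[i + 2] == 0x000:
                return i + 3
        return None

    def decode_xyz(xyz):
        """Decode the field, vertical-blanking and horizontal-blanking flags."""
        return {
            "F": (xyz >> 8) & 1,   # field identification
            "V": (xyz >> 7) & 1,   # 1 during vertical blanking
            "H": (xyz >> 6) & 1,   # 1 for EAV, 0 for SAV
        }

    print(decode_xyz(0x274))   # EAV on an active-picture line of field 1: F=0, V=0, H=1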

Space for ancillary data

In the horizontal blanking interval, 268 words (1444 through 1711) can be used to transmit ancillary data. During the vertical blanking interval, large blocks of data, up to 1440 words per line, can be transmitted in the space between the end of SAV and the start of the following EAV, the area occupied by active video on picture lines. Only eight-bit words can be used in the vertical blanking interval. Certain restrictions apply to the lines that can be used: only lines one through 19 and 265 through 282 are available. To prevent switching clicks, lines 10 (the vertical interval switching instant) and 11 are not used. Lines nine (fields I and III) and 272 (fields II and IV) are reserved for error detection and handling (EDH) signals.
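
The line restrictions just listed can be collected into a small helper, sketched below. The function name is hypothetical; the line numbers come directly from the text.

    # Sketch: may a given 525/59.94 line carry vertical ancillary data packets?
    EDH_LINES = {9, 272}          # reserved for error detection and handling
    SWITCHING_LINES = {10, 11}    # avoided to prevent switching clicks

    def vanc_line_available(line):
        in_vbi = 1 <= line <= 19 or 265 <= line <= 282
        return in_vbi and line not in EDH_LINES and line not in SWITCHING_LINES

    print([l for l in range(1, 21) if vanc_line_available(l)])
    # [1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19]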


Table 1. 4:2:2 525/59.94 ancillary data space.

Audio data multiplexing

Table 1 summarizes the ancillary data space available with the ITU-R BT.601 4:2:2 format. The horizontal ancillary (HANC) capability, listed in the upper row of the table, indicates the bit rate available for insertion of ancillary data in the horizontal blanking interval. The vertical ancillary (VANC) capability, listed in row two, indicates the bit rate available for insertion of ancillary data in the vertical blanking interval. The total ancillary data space, listed in row three, is the nominal sum of the HANC and VANC capabilities; this value may be reduced by 10 percent to 20 percent by the data formatting used. The essential video bit rate required by the standard is shown in row four. It results from the elimination of nonessential samples in the horizontal and vertical blanking intervals. Ancillary data may include digital audio, time code, EDH, or user and control data.
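
The rough derivation below shows how figures of this kind follow from the word counts given earlier; it is not a reproduction of Table 1. The count of 33 usable VANC lines per frame (lines one through 19 and 265 through 282, minus lines 9, 10, 11 and 272) is an assumption based on the restrictions described above, and the raw 10-bit word capacity is counted even though VANC data are limited to eight-bit words.

    # Sketch: order-of-magnitude ancillary capacity for 4:2:2 525/59.94.
    FRAME_RATE = 30000 / 1001        # 29.97 frames per second
    WORD_BITS = 10

    hanc_bps = 268 * WORD_BITS * 525 * FRAME_RATE    # 268 words on every line
    vanc_bps = 1440 * WORD_BITS * 33 * FRAME_RATE    # 1440 words on usable VBI lines

    print(f"HANC  ~ {hanc_bps / 1e6:.1f} Mbits/s")   # roughly 42 Mbits/s
    print(f"VANC  ~ {vanc_bps / 1e6:.1f} Mbits/s")   # roughly 14 Mbits/s
    print(f"Total ~ {(hanc_bps + vanc_bps) / 1e6:.1f} Mbits/s")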

The most important use of the ancillary data space is the insertion of audio signals accompanying the video signal. The 4:2:2 component digital standards have a considerable amount of overhead: they can easily accommodate eight AES/EBU signals (eight stereo pairs, or 16 individual audio channels) while still leaving room for other uses.
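
As a quick plausibility check of that claim, the sketch below compares the payload of 16 embedded audio channels with the HANC capacity estimated above. Each audio sample is assumed to occupy three 10-bit words, as described later; packet headers and checksums are ignored.

    # Sketch: payload of 16 embedded audio channels versus HANC capacity.
    channels = 16                      # eight AES/EBU pairs
    sample_rate = 48_000               # samples per second per channel
    bits_per_sample = 3 * 10           # each sample carried in three 10-bit words

    audio_bps = channels * sample_rate * bits_per_sample
    print(f"Embedded audio payload ~ {audio_bps / 1e6:.1f} Mbits/s")
    # About 23 Mbits/s, well under the roughly 42 Mbits/s of HANC capacity.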

The ANSI/SMPTE 272M document defines the manner in which AES/EBU digital audio data, AES/EBU auxiliary data and associated control information are embedded into the ancillary data space of the bit-serial digital video conforming to the ANSI/SMPTE 259M standard. As mentioned above, the 4:2:2 525/59.94 component digital signal can accommodate 268 ancillary data words in the unused data space between the end of active video (EAV) timing reference and start of active video (SAV) timing reference.

Figure 1 shows the ancillary data packet structure for the 4:2:2 component digital interface. Each packet can carry a maximum of 262 10-bit parallel words. A six-word header precedes the ancillary data and contains:

  • A three-word ancillary data flag (ADF) marking the beginning of the ancillary data packet. Word values are 000, 3FF and 3FF, respectively.
  • A data identification (DID) word identifying the type of data carried.
  • An optional data block number (DBN) word.
  • A data count (DC) word.

A variable number of data words, not exceeding 255, follows. The packet is closed by a checksum (CS) word allowing the receiver to determine the validity of the packet. Multiple, contiguous ancillary data packets may be inserted in any ancillary data space. The first packet must follow immediately after the EAV (for HANC) or the SAV (for VANC), its ADF indicating the presence of ancillary data and the start of a packet. If there is no ADF in the first three words of an ancillary data space, it is assumed that no ancillary data packets are present.
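
The packet structure just described can be sketched as follows. The DID and DBN values are arbitrary placeholders, and the checksum rule used here (a nine-bit sum of the words following the ADF, with b9 set to the complement of b8) is an assumption made for illustration, not a quotation from the standard.

    # Sketch: assembling an ancillary data packet (ADF, DID, DBN, DC, data, CS).
    ADF = [0x000, 0x3FF, 0x3FF]                              # ancillary data flag

    def make_packet(did, dbn, user_words):
        if len(user_words) > 255:
            raise ValueError("a packet carries at most 255 data words")
        body = [did, dbn, len(user_words)] + list(user_words)  # DID, DBN, DC, data
        cs = sum(w & 0x1FF for w in body) & 0x1FF              # nine-bit sum
        cs |= (((cs >> 8) & 1) ^ 1) << 9                       # b9 = not b8
        return ADF + body + [cs]

    packet = make_packet(did=0x250, dbn=0x200, user_words=[0x123, 0x045, 0x167])
    print(len(packet), [hex(w) for w in packet])               # 10 words in all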

SMPTE 272M proposes two modes of operation for embedding digital audio into a digital video data stream. The minimum implementation is characterized by 20-bit resolution, 48kHz sampling, audio synchronous with video, only one group of four audio channels and a receiver buffer size of 48 audio samples. The full implementation is characterized by 24-bit resolution; sampling frequencies of 32kHz, 44.1kHz or 48kHz; audio synchronous or asynchronous with video; up to four groups of four audio channels; a receiver buffer size of 64 audio samples; and indication of relative time delay between any audio channel and the video data signal.
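
For reference, the two operating points can be restated as simple parameter sets; the field names below are my own, while the values are those given above.

    MINIMUM_IMPLEMENTATION = {
        "resolution_bits": 20,
        "sampling_rates_khz": (48,),
        "audio_video_sync": "synchronous only",
        "channel_groups": 1,               # one group of four channels
        "receiver_buffer_samples": 48,
    }

    FULL_IMPLEMENTATION = {
        "resolution_bits": 24,
        "sampling_rates_khz": (32, 44.1, 48),
        "audio_video_sync": "synchronous or asynchronous",
        "channel_groups": 4,               # up to four groups of four channels
        "receiver_buffer_samples": 64,
        "audio_video_delay_indication": True,
    }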


Figure 2. Audio data packet formatting from two AES/EBU data streams.

Figure 2 shows an example of the minimum implementation in which two data streams (AES/EBU data stream one and AES/EBU data stream two) are formatted for embedding into a 4:2:2 525/59.94 component digital signal.

  • A six-word header starts the audio data packet.
  • To begin the embedding sequence, frame zero of AES/EBU data stream one provides data from its subframes one and two. Each of these subframes is stripped of the four sync bits, the four auxiliary bits and the P bit. The remaining 20 audio bits and the V, U and C bits (a total of 23 bits of subframe one) are mapped into three consecutive 10-bit words identified as X, X+1 and X+2 of AES1/CH1 (see the sketch following this list).
  • The 23 bits of subframe 2 are similarly mapped into three consecutive 10-bit words identified as X, X+1 and X+2 of AES1/CH2.
  • AES1/CH1 and AES1/CH2 form a sample pair.
  • To continue the embedding sequence, frame zero of AES/EBU data stream two provides data from its subframes one and two. These data are similarly reduced to 23 bits and result in sample pairs AES2/CH1 and AES2/CH2.
  • The two consecutive sample pairs form an audio group.
  • The 19-word audio data packet closes with a CS word.
  • Subsequent horizontal blanking intervals will accommodate frame one of AES/EBU data streams one and two, frame two of AES/EBU data streams one and two, and so on, until the 192 frames (together constituting one AES/EBU block) of each of the two AES/EBU data streams are embedded.
  • Then a new block of 192 frames coming from the two AES/EBU data streams will be embedded, and the process will continue.
  • At the receiving end, the packets are extracted and fill the receiver buffer (64 samples in the full implementation, 48 in the minimum implementation), from which the original data are read out at a constant rate and then reformatted.
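
The stripping step in the walkthrough above can be sketched as follows. The subframe is assumed to be held as a 32-bit integer in AES3 time-slot order, bit 0 first: bits 0 through 3 the preamble, bits 4 through 7 the auxiliary data, bits 8 through 27 the 20 audio bits, bit 28 = V, bit 29 = U, bit 30 = C and bit 31 = P; the function name is hypothetical.

    # Sketch: reducing one AES/EBU subframe to the 23 bits kept for embedding.
    def strip_subframe(subframe):
        """Return (audio, v, u, c): the 20 audio bits plus V, U and C."""
        audio = (subframe >> 8) & 0xFFFFF      # preamble, aux and P are dropped
        v = (subframe >> 28) & 1
        u = (subframe >> 29) & 1
        c = (subframe >> 30) & 1
        return audio, v, u, c

The 23 bits returned here are then packed into the three-word structure of Table 2; a companion sketch follows that table.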


Table 2. Formatted audio data structure.

Table 2 shows the audio data structure represented by the three 10-bit data words. Two bits indicate the channel number, and a parity bit is calculated over the 26 data bits, excluding the b9 bits of the three words.
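
A companion sketch of that packing is given below. The exact bit positions are illustrative only (Table 2 defines the real layout), and the "z" flag marking the first frame of an AES/EBU block is an assumption; what the sketch preserves from the text is that the 23 retained bits plus two channel-number bits are spread over three 10-bit words, that parity is computed over the 26 data bits excluding the b9 bits, and that b9 of every word is the complement of b8.

    # Sketch: packing one audio sample into the three words X, X+1 and X+2.
    def pack_sample(audio, v, u, c, channel, z=0):
        bits = [z, channel & 1, (channel >> 1) & 1]        # block flag + channel number
        bits += [(audio >> i) & 1 for i in range(20)]      # 20 audio bits, LSB first
        bits += [v, u, c]                                  # 26 data bits in all
        bits.append(sum(bits) & 1)                         # parity over those 26 bits

        words = []
        for group in (bits[0:9], bits[9:18], bits[18:27]): # nine data bits per word
            w = 0
            for i, b in enumerate(group):
                w |= b << i
            w |= (((w >> 8) & 1) ^ 1) << 9                 # b9 = not b8
            words.append(w)
        return words                                       # [X, X+1, X+2]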

Some afterthoughts

The distribution of digital audio and video signals using a single coaxial cable is advantageous if the multiplexed signal does not have to be processed separately — that is, if the product is ready for distribution or transmission. However, if the video signal has to feed a production switcher for further processing, the audio has to be demultiplexed and processed separately, which may prove awkward and costly. To embed or not to embed is a decision that requires a clear understanding of predictable and unpredictable operational requirements.

Michael Robin, a fellow of the SMPTE and former engineer with the Canadian Broadcasting Corp.'s engineering headquarters, is an independent broadcast consultant located in Montreal, Canada. He is co-author of Digital Television Fundamentals, published by McGraw-Hill and translated into Chinese and Japanese.

Send questions and comments to: michael_robin@primediabusiness.com