The original development of analog television was based on the concept of direct, real-time, on-air transmission of live programs and, occasionally, films. Today's television production is unthinkable without the editing of original picture sequences. With the exception of sports and news, few programs now go to air live.
Generally speaking, there are several types of operations that fall under the category of editing, including live switching, cross fading (mixing), various types of DVE operations, caption insertions and nonlinear editing. The transmission-oriented analog television standards are quite inadequate for editing and multigeneration signal processing.
Originally, video was edited by physically cutting tape, a process that today we would call linear editing. The later introduction of electronic linear editing depended on copying rather than cutting the material. Early VTR editing techniques often resulted in unacceptable chroma shifts at the splicing point. The appearance of digital time-base correctors removed some of the splicing irregularities but produced horizontal picture shifts at the splice point, resulting from the time base corrector trying to maintain chrominance subcarrier continuity in a four-field sequence (NTSC) or eight-field sequence (PAL).
In North America, the problem was identified and solved by updating the NTSC standard to SMPTE 170M and introducing the SCH concept. For a while, everything seemed to work just fine. Along the way, the old 2in QUAD and 1in Helical VTRs were replaced by component analog VTRs, such as Betacam and MII. This helped remove certain tape editing difficulties as no subcarrier was recorded, but the multigeneration accumulation of impairments remained.
Beginning the transition
In the 1980s, the component digital standard known as 4:2:2 made inroads, and the first, and most expensive, digital VTR, the D1, appeared on the market. Peripheral digital equipment proliferated with the standardization of the 270Mb/s serial digital interface (SDI) for bit-serial signal distribution.
Competitively priced digital studio production equipment gradually replaced analog video production equipment. Studio-type digital equipment can operate at the full bit rate of 270Mb/s with few, if any, constraints. Full bit-rate (D5 format) or mildly compressed (Digital Betacam) VTRs became entrenched and provided high-quality editing and virtually transparent multigeneration recording.
In this environment, compression is a choice, not a necessity. There are, however, two cases where compression needs to be considered:
- Distributing or transmitting the signals out of the studio. Here, spectrum availability and cost impose constraints that can only be addressed by using compression.
- Using portable newsgathering equipment. Portability and miniaturization requirements dictate the use of compression.
The same technology that made possible the digital processing of full-quality video also made compression techniques practical and affordable. The MPEG-2 compression concept, with its toolkit approach, is the answer to these constraints.
It uses interframe encoding for the high compression ratio required by the reduced transmission bandwidth or low compression with excellent signal quality for post production. Like its early predecessors, the analog transmission standards, MPEG-2 is transmission-oriented, which means that it was designed with a single-pass process in mind and not for multigeneration processing and editing.
The nature of the problem
MPEG data streams are characterized by three types of pictures:
- Intra-frame encoded (I). I frames are independent and need no information from other pictures. They contain all the information necessary to reconstruct the picture.
- Predicted (P). P frames contain the difference between the current frame and a previous reference I or P frame. If the earlier reference frame is removed as a consequence of editing, the P frame cannot be decoded. P frames contain about half the information of an I frame.
- Bidirectionally predicted (B). B frames use differences between the current frame and earlier and later I or P reference frames. B frames contain about one-fourth the information of an I frame.
Bidirectional coding requires that frames be sent out of display sequence so that the future reference frames a B frame depends on arrive at the decoder before the B frame itself. For display, the IPB sequence has to be rearranged in the decoder. Figure 1 on page 22 shows the relative timing of the IPB frames making up a group of pictures (GOP) at the input of the encoder, the output of the encoder and the output of the decoder. This reordering causes a delay.
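As a rough illustration, the encoder's reordering can be mimicked by holding each B frame back until the future I or P reference frame that follows it has been emitted. This is a sketch only; the frame labels and the function name are illustrative, not part of any broadcast implementation:

```python
def display_to_coded(gop):
    """Reorder a GOP from display order to coded (transmission) order.

    B frames are buffered until the future I/P reference that follows
    them in display order has been emitted, mirroring the decoder's
    need to have both references before reconstructing a B frame.
    """
    coded = []
    pending_b = []          # B frames waiting for their future reference
    for frame in gop:
        if frame.startswith("B"):
            pending_b.append(frame)
        else:               # I or P: emit it, then the B frames it anchors
            coded.append(frame)
            coded.extend(pending_b)
            pending_b = []
    coded.extend(pending_b) # trailing B frames of an open GOP
    return coded

display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(display_to_coded(display))
# -> ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

The gap between a B frame's position in display order and in coded order is the source of the delay the article mentions.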
Different applications use different GOP structures to achieve the desired compression ratio. The longer the GOP, the higher the compression ratio; hence, long GOPs are found in MPEG-2 applications for transmission and distribution. In practice, GOP length is limited to about 15 frames. IPB GOPs end with a B frame, which has the previous P frame and the future I frame as references. IPB GOPs are, therefore, referred to as open GOPs.
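Using the rough per-frame sizes quoted above (a P frame about half an I frame, a B frame about one-fourth), a back-of-the-envelope calculation shows why longer GOPs compress better. The GOP patterns and the exact size ratios here are illustrative assumptions:

```python
# Relative frame sizes from the article: P ~ 1/2 of I, B ~ 1/4 of I.
SIZE = {"I": 1.0, "P": 0.5, "B": 0.25}

def avg_frame_cost(pattern):
    """Average data per frame, relative to one I frame, for a GOP pattern."""
    return sum(SIZE[f] for f in pattern) / len(pattern)

for pattern in ["I", "IBIB", "IBBPBBPBBPBBPBB"]:
    print(f"{pattern:16s} avg = {avg_frame_cost(pattern):.3f} of an I frame")
# I-only        -> 1.000
# IBIB          -> 0.625
# 15-frame GOP  -> 0.367
```

The I-only stream pays full price for every frame, while the 15-frame IBBP pattern averages roughly a third of an I frame per picture, which is why long GOPs dominate in transmission.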
Signal manipulations of MPEG transmission streams are generally limited to switching of signal sources in a master control room and are referred to as splicing. VTR or disk server handling of MPEG production streams is referred to as editing. Editing consists of replacing a recorded sequence on tape with another sequence (clip) coming from an alternate source. The new sequence is inserted, starting with its own reference I frame substituting the original I frame. This creates a problem with IPB open GOPs. Because the B frame is the result of a forward as well as a backward prediction, substituting the I frame with a new I frame unrelated with the B frames disrupts the sequence.
Two simple solutions
Seamless frame-accurate editing of compressed video is most easily accomplished with the use of short and closed GOP structures. A closed GOP does not contain frames that make reference to frames in the preceding GOP. Longer GOP structures can be edited by decoding and re-encoding or by transcoding to shorter GOP structures. There are two relatively simple solutions to the MPEG editing problem: naive cascading and restricted MPEG-2.
The naive cascading process consists of decoding the MPEG-2 compressed signal to the ITU-R BT.601 level, performing the required operation in the uncompressed domain and subsequently re-encoding back to MPEG-2. The intermediate processing might be a switch or some other effect. The frame has to be fully decoded in order to have access to the basic pixels.
Figure 2 on page 22 shows the conceptual block diagram of naive editing as used in some disk-based servers. Each output channel has two MPEG decoders, each with its own buffer. While Clip 1 is being played out of Decoder A, Clip 2 is being decoded by Decoder B and stored, ready to be played on demand. The switching of clips is made in the digital video domain, resulting in seamless cuts.
While switching problems are eliminated, the decoding and encoding process introduces a certain amount of picture degradation. In a typical operational configuration, there are likely to be several cascaded decoding and encoding processes resulting in a concatenation effect.
With restricted MPEG-2, the concatenation problem is avoided by restricting the compression process to a limited subset of MPEG-2 so frame-accurate editing can be performed. A typical case is the Sony SX system, which records an IBIBIB GOP structure. Each B frame is the result of a forward prediction (from the previous I frame) and a backward prediction (from the next I frame). It is, therefore, dependent on both surrounding I frames. If, as a result of editing, one of the reference I frames is substituted by a new I frame, the B frame cannot be completely reconstructed.
To avoid this effect, the B frame immediately preceding the newly inserted I frame, at the edit point, is reconstructed using only the information from the previous I frame. It effectively becomes a P frame, which Sony calls a BU (unidirectional) frame.
The original open GOP is effectively replaced by a closed GOP. This is achieved by using a pre-read playback head whose output is decoded and used to generate a BU frame that is switch selected and recorded on tape, replacing the originally recorded B frame.
After the BU frame is inserted, the switch returns to the input video source. The result is a seamless edit. Figure 3 on page 22 shows the conceptual block diagram of the insert editing process. Figure 4 shows the original IBIBIB sequence recorded on tape. Figure 5 shows the edited sequence. The newly created frame is labeled B′U3, and all new frames are identified with a prime sign.
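The edit-point fix described above can be sketched in a few lines: the B frame immediately preceding the newly inserted I frame loses its backward (future) reference and is rewritten as a forward-only BU frame, closing the GOP. This is a hypothetical model of the behavior, not Sony's implementation; the function name and frame labels are invented for illustration:

```python
def close_gop_at_edit(frames, edit_index):
    """Return a copy of the frame list with the GOP closed at edit_index.

    frames[edit_index] is the newly inserted I frame. The B frame
    immediately before it is rewritten as a BU (forward-predicted only)
    frame so it no longer depends on the replaced reference.
    """
    fixed = list(frames)
    if edit_index > 0 and fixed[edit_index - 1].startswith("B"):
        fixed[edit_index - 1] = "BU" + fixed[edit_index - 1][1:]
    return fixed

tape = ["I0", "B1", "I2", "B3", "I4", "B5"]
# A new clip's I frame replaces I4; B3 must become unidirectional.
print(close_gop_at_edit(tape, 4))
# -> ['I0', 'B1', 'I2', 'BU3', 'I4', 'B5']
```

Everything before the edit point now decodes without reference to the replaced material, which is exactly what makes the insert seamless.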
Michael Robin, a fellow of the SMPTE and former engineer with the Canadian Broadcasting's engineering headquarters, is an independent broadcast consultant located in Montreal. He is co-author of “Digital Television Fundamentals,” published by McGraw-Hill and translated into Chinese and Japanese.
Send questions and comments to: email@example.com