MPEG editing
North American (FCC) and international (ITU) spectrum management concerns shaped the on-air channel allocations and transmitted bandwidths of analog television, with regional variations (NTSC/6MHz, SECAM/7MHz and PAL/8MHz). The resulting transmission constraints led to the frequency-division multiplexing of luminance and chrominance, aimed at making the best use of the available transmission channels.
The original development of analog television was based on the concept of direct, real-time, on-air transmission of live programs, including occasional films. Most of today's television production is unthinkable without editing the original picture sequences. With the exception of sports, there are very few live programs on-air. Generally speaking, several types of operations fall under the category of editing, such as live switching, cross fading (mixing), various types of DVE operations, caption insertions and nonlinear editing. The transmission-oriented analog television standards are quite inadequate for editing and multigeneration signal processing.
Originally, video was edited by physically cutting tape, a process which today we would call nonlinear editing. The later introduction of electronic linear editing depended on copying rather than cutting the material. Early VTR editing techniques very often resulted in unacceptable chroma shifts at the splice point. Digital timebase correctors removed some of the splicing irregularities but resulted in horizontal picture shifts at the splice point. This was because the timebase corrector tried to maintain chrominance subcarrier continuity in a four-field (NTSC) or eight-field (PAL) sequence. In North America, the problem was identified and solved by updating the NTSC standard to SMPTE 170M and introducing the SCH (subcarrier-to-horizontal phase) concept. For a while, everything seemed to work just fine. Along the way, the old 2" quad and 1" helical VTRs were replaced by component analog VTRs, such as Betacam and MII. This removed certain tape editing difficulties, as no subcarrier was recorded, but the multigeneration accumulation of impairments remained.
In the 1980s the component digital standard known as 4:2:2 made inroads. At this time the first digital VTR, the D1, appeared on the market. Peripheral digital equipment proliferated with the standardization of the 270Mb/s bit-serial digital signal distribution known as SDI. Today, competitively priced digital studio production equipment is replacing analog video production equipment. Studio-type digital equipment can operate at the full bit rate of 270Mb/s with few, if any, constraints. Full bit-rate (D5 format) or lightly compressed (Digital Betacam) VTRs are entrenched and provide high-quality editing and virtually transparent multigeneration recording. In this environment, compression is a choice, not a necessity. There are, however, two cases where compression needs to be considered.
- Getting these signals out of the studio through distribution or transmission links: Here spectrum availability and cost impose constraints that can only be addressed by using compression.
- Portable newsgathering equipment: Portability and miniaturization requirements dictate the use of compression.
The same technology that made digital processing of full-quality video possible also made compression techniques practical and affordable. The MPEG-2 compression system, with its toolkit approach, is an answer to these constraints. MPEG-2 uses interframe encoding to reach the high compression ratios required by reduced transmission bandwidths. It also offers low compression ratios with excellent signal quality for post-production. However, like its early predecessors, the analog transmission standards, MPEG-2 is transmission oriented. It was designed with a single-pass process in mind and not for multigeneration processing and editing.
The nature of the problem
MPEG data streams are characterized by three types of pictures:
- I (intraframe encoded): I frames are independent and need no information from other pictures. They contain all the information necessary to reconstruct the picture.
- P (predicted): P frames contain the difference between the current frame and a previous reference I or P frame. If the earlier reference frame is removed as a consequence of editing, the P frame cannot be decoded. P frames contain about half the information of an I frame.
- B (bidirectionally predicted): B frames use differences between the current frame and earlier and later I and/or P reference frames. B frames contain about 1/4 the information of an I frame. Bidirectional coding requires that frames be sent out of display sequence: the later reference frame is transmitted ahead of the B frames that depend on it, so that the decoder can reconstruct them. For display, the sequence has to be rearranged in the decoder.
Figure 1 shows the relative timing of the I, P and B frames making up a group of pictures (GOP) at the input of the encoder, the output of the encoder and the output of the decoder. Because B frames cannot be coded or decoded until their future reference frame is available, the reordering introduces a delay.
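The frame reordering described above can be illustrated with a minimal Python sketch. The frame labels (such as "I1" or "B2") and the function names are hypothetical and chosen only for this illustration; the sketch assumes that the sequence ends on an I or P frame and that each B frame depends on the nearest reference frame on either side.

```python
def display_to_coded(frames):
    """Reorder frames from display order to coded (transmission) order.

    Each run of B frames is held back until the following I or P
    reference frame has been emitted, since the decoder needs that
    reference first. Assumes the list ends on an I or P frame.
    """
    coded, pending_b = [], []
    for frame in frames:
        if frame.startswith("B"):
            pending_b.append(frame)      # wait for the future reference
        else:
            coded.append(frame)          # reference frame goes out first
            coded.extend(pending_b)      # then the B frames that needed it
            pending_b = []
    coded.extend(pending_b)              # only non-empty if the assumption is violated
    return coded


def coded_to_display(frames):
    """Restore display order at the decoder output.

    A reference frame is held until the next reference frame arrives;
    the B frames received in between are displayed before it.
    """
    display, held = [], None
    for frame in frames:
        if frame.startswith("B"):
            display.append(frame)
        else:
            if held is not None:
                display.append(held)
            held = frame
    if held is not None:
        display.append(held)
    return display


display_order = ["I1", "B2", "B3", "P4", "B5", "B6", "P7"]
coded_order = display_to_coded(display_order)
print(coded_order)
print(coded_to_display(coded_order))
```

Holding one reference frame in the decoder is what produces the reordering delay: a reference frame cannot be displayed until the B frames that precede it in display order have been shown.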
Different applications use different GOP structures to achieve the desired compression ratio. Typically, the longer the GOP, the higher the compression ratio. Long GOPs are found in MPEG-2 applications for transmission and distribution. In practice, GOP lengths are commonly limited to about 15 frames. In display order, I,P,B GOPs end with a B frame, which has two reference frames: a previous P frame and the I frame of the following GOP. I,P,B GOPs are therefore referred to as open GOPs.
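The link between GOP length and compression ratio can be made concrete with a little arithmetic. The sketch below uses only the rough figures quoted earlier (a P frame carries about half, and a B frame about a quarter, of the information of an I frame); these are rules of thumb, not measured values, and the GOP strings and the function name are hypothetical.

```python
# Rough relative information content per frame type, taken from the
# approximate ratios quoted in the text (I = 1, P = 1/2, B = 1/4).
WEIGHTS = {"I": 1.0, "P": 0.5, "B": 0.25}


def relative_size(gop):
    """Average data per frame, relative to an I-frame-only stream."""
    return sum(WEIGHTS[frame_type] for frame_type in gop) / len(gop)


for gop in ("I", "IB", "IBBP", "IBBPBBPBBPBBPBB"):
    print(f"{gop:>15}: {relative_size(gop):.3f}")
```

Under these assumed ratios, a 15-frame I,P,B GOP needs a little over a third of the data of an I-only stream at comparable quality, whereas a two-frame IB GOP needs about five eighths.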
Signal manipulations of MPEG transmission streams are generally limited to switching of signal sources in a master control room and are referred to as splicing. VTR or disk server handling of MPEG production streams is referred to as editing. Editing consists of replacing a recorded sequence on tape with another sequence (clip) coming from an alternate source. The new sequence is inserted starting with its own reference I frame, which substitutes for the original I frame. This creates a problem with I,P,B open GOPs. Since the last B frame of the preceding GOP is the result of a forward as well as a backward prediction, replacing its future reference I frame with a new, unrelated I frame leaves that B frame without a valid reference, and it can no longer be decoded correctly.
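The open-GOP problem can be shown with a small dependency check. The following Python sketch is a simplified model, not an MPEG parser: it works on a string of frame types in display order, assumes that each P frame references the nearest earlier I or P frame and that each B frame references the nearest I or P frame on either side, and uses hypothetical function names.

```python
def references(types, i):
    """Indices of the reference frames used by frame i (display order)."""
    if types[i] == "I":
        return []
    prev_ref = next((j for j in range(i - 1, -1, -1) if types[j] in "IP"), None)
    if types[i] == "P":
        return [prev_ref] if prev_ref is not None else []
    # B frame: nearest earlier and nearest later reference frame
    next_ref = next((j for j in range(i + 1, len(types)) if types[j] in "IP"), None)
    return [r for r in (prev_ref, next_ref) if r is not None]


def broken_by_splice(types, splice_at):
    """Frames kept from the original stream (index < splice_at) that
    depend on a frame replaced by the edit (index >= splice_at)."""
    return [
        i for i in range(splice_at)
        if any(ref >= splice_at for ref in references(types, i))
    ]


# Open GOP: the first GOP ends with two B frames that need the next I frame.
print(broken_by_splice("IBBPBBPBBIBBP", 9))
# Closed GOP: the first GOP ends on a P frame, so nothing crosses the edit.
print(broken_by_splice("IBBPBBPIBBP", 7))
```

In the open-GOP case the two trailing B frames are reported as broken, because their backward prediction reference is the I frame that the edit replaces. In the closed-GOP case no frame references across the edit point.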
Two simple solutions
Seamless frame-accurate editing of compressed video is most easily accomplished with the use of short and closed GOP structures. A closed GOP does not contain frames that make reference to frames in the preceding GOP. Longer GOP structures can be edited by decoding and re-encoding or by transcoding to shorter GOP structures. There are two relatively simple solutions to the MPEG editing problem:
- Naive cascading: Naive cascading consists of decoding the MPEG-2 compressed signal to the ITU-R BT.601 level, performing the required operation in the uncompressed domain and then re-encoding to MPEG-2. The intermediate processing might be a switch or some other effect. The frame must be fully decoded to give access to the individual pixels. Figure 2 shows the conceptual block diagram of naive editing as used in some disk-based servers. Each output channel has two MPEG decoders, each with its own buffer. While Clip 1 is being played out of decoder A, Clip 2 is being decoded by decoder B and stored, ready to be played on demand. Clips are switched in the digital video domain, resulting in seamless cuts. While switching problems are eliminated, each decoding/encoding pass introduces a certain amount of picture degradation. In a typical operational configuration there are likely to be several cascaded decoding/encoding processes, and their impairments accumulate in a concatenation effect.
- Restricted MPEG-2: Here the concatenation problem is avoided by restricting the compression process to a limited subset of MPEG-2 so that frame-accurate editing can be performed. A typical case is the Sony SX system. SX records an IBIBIB GOP structure. Each B frame is the result of a forward prediction (from the previous I frame) and a backward prediction (from the next I frame). It is therefore dependent on both surrounding I frames. If one of the reference I frames is substituted by a new I frame, as a result of editing, the B frame cannot be completely reconstructed. To avoid this, the B frame immediately preceding the newly inserted I frame, at the edit point, is reconstructed using only the information from the previous I frame. It effectively becomes a P frame, which Sony calls a BU (unidirectional) frame. The original open GOP is effectively replaced by a closed GOP. This is achieved by using a pre-read playback head whose output is decoded and used to generate a BU frame which is switch-selected and recorded on tape, replacing the originally recorded B frame. After the BU frame is inserted, the switch returns to the input video source. The result is a seamless edit. Figure 3 shows the conceptual block diagram of the insert editing process. Figure 4a shows the original IBIBIB sequence recorded on tape. Figure 4b shows the edited sequence. The newly created frame is referred to as the B'U3 and all new frames are identified with a prime sign.
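The bookkeeping behind the restricted IBIBIB approach can be sketched in a few lines of Python. This is only a model of the frame labels, not of Sony's implementation: the function name, the frame numbering and the labelling scheme (a prime for new frames, "U" for the unidirectional frame) are assumptions made for illustration, and only the in-point of the edit is modeled.

```python
def sx_insert_edit(tape, clip, in_point):
    """Return the frame labels on tape after an insert edit.

    tape, clip: lists of labels such as "I2", "B3" in IBIB... order.
    in_point:   index in `tape` of the I frame where the new clip starts.
    The clip is assumed to replace everything from the in-point onward;
    the out-point of the edit is not modeled.
    """
    if not tape[in_point].startswith("I") or not clip[0].startswith("I"):
        raise ValueError("the edit must start on an I frame")

    edited = list(tape[:in_point])
    if edited and edited[-1].startswith("B"):
        # The B frame just before the edit loses its backward reference,
        # so it is rebuilt from the previous I frame only (a BU frame)
        # and re-recorded; the prime marks it as a newly created frame.
        edited[-1] = "B'U" + edited[-1][1:]

    # All frames of the incoming clip are new, so they get a prime sign.
    edited.extend(frame[0] + "'" + frame[1:] for frame in clip)
    return edited


original = ["I2", "B3", "I4", "B5", "I6", "B7"]
new_clip = ["I4", "B5", "I6", "B7"]
print(sx_insert_edit(original, new_clip, in_point=2))
```

The only frame of the original recording that has to be rewritten is the B frame immediately before the in-point, which is why a pre-read head and a single re-encoded frame are enough to close the GOP at the edit.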