H.264/AVC I- and P-slice encoding

One of the many characteristics of H.264/AVC that make it difficult to understand is its use of terms similar to those used when discussing MPEG-2, such as “I,” “P” and “B,” with somewhat different meanings. H.264 introduces a new concept called slices: groups of macroblocks that make up part, or all, of a picture. An H.264 I-slice is a portion of a picture composed of macroblocks, all of which are predicted only from macroblocks within the same picture. Just as there are I-slices, there are P- and B-slices. P- and B-slices are portions of a picture whose macroblocks can also be predicted from macroblocks in other pictures.

H.264 encoding begins with chroma downsampling to 4:2:0. Next, each incoming picture is divided into macroblocks. (When interlaced video is encoded, both fields can be compressed together.) Many of the same techniques used to compress an MPEG-2 I-frame are used to compress the macroblocks making up an I-slice. Each 16 × 16 pixel macroblock can be further partitioned into four 8 × 8 submacroblocks. (See Figure 2.) The encoder can switch between working with 16 × 16 blocks and 8 × 8 blocks.
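The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2 × 2 averaging filter is one common way to downsample chroma (the choice of filter is left to the encoder), and the function names subsample_420 and split_blocks are hypothetical, not part of any H.264 library.

```python
def subsample_420(chroma):
    """Halve a chroma plane in both directions by averaging each 2x2 group.

    Assumption: simple rounded averaging; real encoders may use other filters.
    """
    height, width = len(chroma), len(chroma[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            total = (chroma[y][x] + chroma[y][x + 1]
                     + chroma[y + 1][x] + chroma[y + 1][x + 1])
            row.append((total + 2) // 4)  # rounded average of four samples
        out.append(row)
    return out


def split_blocks(plane, size):
    """Yield (top, left, block) for each size x size tile of a plane.

    Assumes the plane dimensions are multiples of `size`.
    """
    for top in range(0, len(plane), size):
        for left in range(0, len(plane[0]), size):
            block = [row[left:left + size] for row in plane[top:top + size]]
            yield top, left, block


# A 32 x 32 luma plane holds four 16 x 16 macroblocks,
# and each macroblock holds four 8 x 8 submacroblocks.
luma = [[(x + y) % 256 for x in range(32)] for y in range(32)]
macroblocks = list(split_blocks(luma, 16))
sub_blocks = list(split_blocks(macroblocks[0][2], 8))
```

Here macroblocks[0][2] is the pixel data of the first macroblock, which is then split again into its 8 × 8 submacroblocks.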

Blocks, of course, are located next to other blocks. For example, the Current Block (yellow) in the Figure 2 frame to be encoded has a block to the left (green) and a block above (blue). The latter two blocks are Previous Blocks. Reference Pixels lie along the boundaries between the Previous Blocks and the Current Block: the rightmost column of the block to the left (dark green) and the bottom row of the block above (dark blue). Four prediction methods (modes) are used with 16 × 16 macroblocks. (See Figure 3.)

When predictions are made for 8 × 8 submacroblocks, nine modes are used. (See Figure 4.)

In all cases, the mode that best predicts the content of the Current Block is selected as the Current Prediction Mode. The Current Prediction Mode is linked to the Current Block. The Predicted Block, generated from the column and row of Reference Pixels, is subtracted from the Current Block, thereby generating a Residual (difference) Block. Each Residual Block is compressed, linked to the Current Block, and used during decoding as a picture “correction” block.
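The selection of a prediction mode and the generation of a Residual Block can be sketched as follows. This is a minimal illustration, not the encoder's actual method: it implements only three simple modes (vertical, horizontal and DC), omits the plane mode and the directional modes, and assumes the sum of absolute differences (SAD) as the measure of "best," whereas real encoders often use a rate-distortion cost. All function names are hypothetical, and a 4 × 4 block is used only to keep the example small.

```python
def predict_vertical(above, left, size):
    """Copy the row of Reference Pixels above the block down every row."""
    return [list(above) for _ in range(size)]


def predict_horizontal(above, left, size):
    """Copy the column of Reference Pixels at the left across every column."""
    return [[left[y]] * size for y in range(size)]


def predict_dc(above, left, size):
    """Fill the block with the rounded mean of all Reference Pixels."""
    mean = (sum(above) + sum(left) + size) // (2 * size)
    return [[mean] * size for _ in range(size)]


# Assumption: only three illustrative modes are modeled here.
MODES = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def choose_mode(current, above, left):
    """Return (mode name, Predicted Block, Residual Block) with the lowest SAD."""
    size = len(current)
    best = None
    for name, predictor in MODES.items():
        predicted = predictor(above, left, size)
        cost = sad(current, predicted)
        if best is None or cost < best[1]:
            best = (name, cost, predicted)
    name, _, predicted = best
    residual = [[c - p for c, p in zip(rc, rp)]
                for rc, rp in zip(current, predicted)]
    return name, predicted, residual


def reconstruct(predicted, residual):
    """Decoder side: add the 'correction' block back to the prediction."""
    return [[p + r for p, r in zip(rp, rr)]
            for rp, rr in zip(predicted, residual)]


# A block of vertical stripes that continues the pixels above it.
above = [10, 20, 30, 40]
left = [10, 10, 10, 10]
current = [[10, 20, 30, 40] for _ in range(4)]
mode, predicted, residual = choose_mode(current, above, left)
```

For this striped block the vertical mode predicts the content exactly, so the Residual Block is all zeros. The transform, quantization and entropy coding that compress the residual are not shown.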

Once an I-slice has been encoded, P-slices are encoded. Motion estimation is performed methodically: macroblocks in other pictures are searched for the best match to the contents of the Current Block. The H.264 standard allows up to 16 reference frames, depending on level and picture size, and these may come before or after the current picture in display order. (AVCHD supports searching within four pictures.) The greater the number of reference pictures used, the more memory an encoder must have. For this reason, AVCHD cameras typically support only one or two reference frames.

The block with the best measured content match becomes a Reference Block. The displacement between the Current Block and the Reference Block defines a motion vector, and a P-reference is generated when only a single motion vector is defined. Each motion vector and each compressed P-slice Residual Block are linked to a P-slice.
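The search for a Reference Block can be sketched as an exhaustive block-matching search over one or more reference pictures. This is a simplified illustration under several assumptions: it uses SAD as the matching measure, searches whole-pixel positions only, and ignores the sub-pixel interpolation and fast search strategies that practical encoders rely on. The function names and the search_range parameter are hypothetical.

```python
def block_at(frame, top, left, size):
    """Extract a size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def motion_search(current_frame, reference_frames, top, left, size, search_range):
    """Return (reference index, (dy, dx) motion vector, Residual Block)."""
    current = block_at(current_frame, top, left, size)
    best = None
    for ref_index, ref in enumerate(reference_frames):
        height, width = len(ref), len(ref[0])
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                y, x = top + dy, left + dx
                # Skip candidates that fall outside the reference picture.
                if y < 0 or x < 0 or y + size > height or x + size > width:
                    continue
                candidate = block_at(ref, y, x, size)
                cost = sad(current, candidate)
                if best is None or cost < best[0]:
                    best = (cost, ref_index, (dy, dx), candidate)
    _, ref_index, vector, reference_block = best
    residual = [[c - r for c, r in zip(rc, rr)]
                for rc, rr in zip(current, reference_block)]
    return ref_index, vector, residual


# Two reference pictures: a blank one and a ramp of unique pixel values.
blank = [[0] * 16 for _ in range(16)]
ramp = [[y * 16 + x for x in range(16)] for y in range(16)]
# The current picture is the ramp moved one pixel down and two pixels right.
moved = [[ramp[(y - 1) % 16][(x - 2) % 16] for x in range(16)] for y in range(16)]

ref_index, vector, residual = motion_search(moved, [blank, ramp], 4, 4, 4, 3)
```

In this example the best match is found in the second reference picture, one pixel up and two pixels to the left of the Current Block's position, and the Residual Block is all zeros because the match is exact. Searching more reference pictures means holding more pictures in memory, which is the cost noted above.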