Consider the progress made since the early days of digital video, when MPEG-1 was introduced. Video codecs for digital broadcast and distribution all descend from that common heritage, and technical decisions made 15 years ago have proved remarkably far-sighted. The state of the art is now H.264 coding for HDTV services at bit rates that were inconceivable just a few years ago.
How it works
Video encoders output a serialized bit stream that modulates a carrier signal for broadcast or networking. Producing an economically coded bit stream with no unnecessary duplication is challenging. The receiver reconstructs a sequence of moving pictures from this stream.
Only the player is standardized. Encoders are free to adopt smarter techniques as they evolve, and this is not a problem provided they produce a compliant output. It also means bit rates will keep improving without changing the installed base of decoders. Sound is processed independently and delivered with reference to the same timeline.
Video plays for hours at a time but is actually compressed in short sequences. The length depends on the video format and target platform. Fifteen frames for a group of pictures (GOP) is typical. There are three kinds of frame in a GOP:
- Intra-frames (I-frames) at the start;
- Predicted frames (P-frames) at the end; and
- Bidirectionally coded frames (B-frames) in-between.
The I-frame is coded first, just like a still photograph. The image is divided into 16 × 16 pixel macroblocks. Macroblocks are grouped into horizontal slices that help with dropout reconstruction. Some bit rate saving results immediately from culling similar macroblocks and only buffering unique blocks. (See Figure 1.)
Then P-frame content is analyzed. Only new blocks not present in the I-frame are retained. The collection of macroblocks describes the frames at each end of the GOP. Now, the intervening B-frames can be coded more efficiently.
B-frame macroblocks are discarded if they duplicate any I- and P-frame blocks already collected. The buffer maintains these unique macroblocks that are referred to by different frames.
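The culling described above can be sketched as a simple deduplication buffer. In this toy model (an illustrative assumption: each macroblock is treated as an opaque byte string, ignoring slices and residuals), only blocks not already in the buffer are stored, and every frame refers to blocks by their buffer index:

```python
# Toy sketch of macroblock culling across a GOP. Frames are lists of
# macroblocks; each block is modeled as a hashable bytes object.
def cull_macroblocks(frames):
    """Store only unseen blocks; return the buffer and per-frame references."""
    buffer = {}   # unique blocks, keyed by content -> buffer id
    coded = []    # per-frame lists of (block_id, is_new)
    for frame in frames:
        frame_refs = []
        for block in frame:
            if block not in buffer:
                buffer[block] = len(buffer)   # assign the next free id
                frame_refs.append((buffer[block], True))
            else:
                frame_refs.append((buffer[block], False))
        coded.append(frame_refs)
    return buffer, coded

# Three tiny "frames" sharing content: only 3 unique blocks survive.
frames = [[b"sky", b"sky", b"road"],   # I-frame
          [b"sky", b"car", b"road"],   # P-frame contributes one new block
          [b"sky", b"car", b"road"]]   # B-frame contributes nothing
buffer, coded = cull_macroblocks(frames)
print(len(buffer))   # 3 unique blocks stored for 9 block positions
```

Real encoders do far more (matching within search windows, coding residuals), but the bookkeeping principle is the same.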
The last frame of the GOP must be delivered earlier than it is presented for display so that the B-frames can be reconstructed. This frame reordering introduces coding latency that grows with the GOP length. If latency is a problem (perhaps for video conferencing), use shorter GOPs or omit B-frames altogether.
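In the simplified GOP model used here (one I-frame at the start, B-frames in between, one P-frame at the end), the reordering can be sketched as follows. This is only an illustration; real encoders interleave anchors and B-frames per sub-GOP rather than sorting a whole GOP at once:

```python
# Sketch of transmission reordering for the simplified GOP model:
# anchor frames (I and P) must be sent before the B-frames that
# are predicted from them.
def transmission_order(display_order):
    """Reorder a GOP so both anchors precede the dependent B-frames."""
    anchors = [f for f in display_order if f in ("I", "P")]
    bframes = [f for f in display_order if f == "B"]
    return anchors + bframes

gop = ["I", "B", "B", "B", "P"]          # display order
print(transmission_order(gop))           # ['I', 'P', 'B', 'B', 'B']
```

The decoder must hold the P-frame until its B-frames arrive and are presented, which is exactly where the latency comes from.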
Motion JPEG encodes I-frames only. It won't achieve compression ratios as high as MPEG, but it does produce editable content.
An I-frame might encode to as little as 40KB, so a GOP of 15 I-frames would occupy 600KB. A single P-frame might save 35KB. That saving doesn't amount to much on its own, but the rest of the GOP then encodes as 1KB B-frames, and the whole GOP fits in less than 60KB. So, P- and B-frames yield a useful 10:1 compression if we can tolerate the latency. (See Figure 2.)
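The arithmetic behind that worked example is easy to check (the sizes in kilobytes are the illustrative figures quoted above, not measurements):

```python
# Byte budget for a 15-frame GOP: one I-frame, one P-frame, 13 B-frames.
i_frame = 40               # KB, an I-frame coded like a still photograph
p_frame = 40 - 35          # KB, the P-frame saves 35KB against a full frame
b_frames = 13 * 1          # KB, the remaining frames encode as 1KB B-frames
gop_size = i_frame + p_frame + b_frames
all_intra = 15 * i_frame   # the same GOP coded as I-frames only

print(gop_size)                         # 58 -> "less than 60KB"
print(round(all_intra / gop_size, 1))   # 10.3 -> the quoted 10:1 ratio
```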
MPEG-1 provided simple match and discard techniques for macroblock reduction. Later codec designs find macroblocks that are similar but not identical and encode the residual differences. If the blocks are not identical, we could eliminate a few more at the expense of image reconstruction accuracy.
Some details in the macroblocks might have moved fractionally from one frame to another. MPEG-2 allows pixels to be shifted along a motion vector before working out the residuals. H.264 enhances this by allowing the distance to be less than a whole pixel. Motion vectors are computationally challenging but reduce the amount of data that needs to be encoded.
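A whole-pixel motion search can be sketched using the sum of absolute differences (SAD), a common block-matching metric. This is a sketch under assumptions, not any codec's actual search: frames are plain 2-D lists of luma values, the block is 4 × 4 rather than a full macroblock, and the search window is a hypothetical ±2 pixels:

```python
# Exhaustive whole-pixel motion search using the sum of absolute
# differences (SAD) as the matching metric.
def sad(ref, cur, dy, dx, y, x, n=4):
    """SAD between an n x n block of cur at (y, x) and ref at (y+dy, x+dx)."""
    return sum(abs(cur[y + i][x + j] - ref[y + dy + i][x + dx + j])
               for i in range(n) for j in range(n))

def best_motion_vector(ref, cur, y, x, n=4, radius=2):
    """Try every vector in a (2*radius+1)^2 window; keep the lowest SAD."""
    candidates = [(dy, dx)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)
                  if 0 <= y + dy and y + dy + n <= len(ref)
                  and 0 <= x + dx and x + dx + n <= len(ref[0])]
    return min(candidates, key=lambda v: sad(ref, cur, v[0], v[1], y, x, n))

# A reference frame with distinct values, and a current frame whose
# content matches the reference shifted by one pixel down and right.
ref = [[8 * r + c for c in range(8)] for r in range(8)]
cur = [[ref[min(r + 1, 7)][min(c + 1, 7)] for c in range(8)] for r in range(8)]
print(best_motion_vector(ref, cur, 2, 2))   # (1, 1): a perfect match
```

H.264's sub-pixel refinement interpolates between reference pixels before computing the same kind of difference, which is why it is so much more expensive.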
MPEG-2 and H.264 also widen the range of their search for duplication. MPEG-1 looks only within the same slice, MPEG-2 within the same GOP, and H.264 can look outside the GOP. Longer reach leads to better compression ratios.
Modern codecs provide many tools to eliminate macroblocks. H.264 implements a superset of all the tools supported by its competing codecs. Because it was worked on by a consortium of experts from several standards bodies and technically reviewed by hundreds of engineers, it should outperform the other codecs.
Encoding macroblocks directly into the output bit stream would not yield enough compression. We need a general-purpose reduction that is easy to apply and simple to reverse for the player.
A single macroblock is represented as luma at full 16 × 16 resolution and two chroma difference blocks at 8 × 8 resolution. The eye is less sensitive to color information. Compressing from 10-bit RGB to 8-bit Y'CbCr and reducing the chroma to 8 × 8 pixels gains a further 2:1 compression. This is still not enough.
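Counting samples per macroblock confirms the figure (assuming the 4:2:0 layout described above):

```python
# Per-macroblock sample counts before and after chroma sub-sampling.
rgb_samples   = 16 * 16 * 3           # R, G, B all at full resolution
ycbcr_samples = 16 * 16 + 2 * 8 * 8   # full-res luma + two quarter-res chroma
print(rgb_samples / ycbcr_samples)    # 2.0 -> the quoted 2:1 from sub-sampling
# Cutting 10-bit samples down to 8 bits trims a further 20 percent on top.
```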
Using frequencies rather than pixels is more efficient. Applying a frequency transform (in practice the discrete cosine transform, a close relative of the fast Fourier transform) delivers coefficients that describe how much each spatial frequency contributes to the image.
Discrete cosine transform computation is quite straightforward but compute intensive. The DCT formula is shown in Figure 3. The algorithm visits every pixel in the macroblock, accumulating the frequency coefficients and storing them in a grid. Luma is transformed as four 8 × 8 pixel blocks. (See Figure 4.)
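The standard 8 × 8 forward DCT can be written directly from the formula in Figure 3. This naive double loop is slow compared with the fast factorizations real encoders use, but it is easy to follow:

```python
import math

# Direct 8x8 forward DCT (the 2-D DCT-II used by JPEG and MPEG),
# implemented straight from the textbook formula.
def dct_8x8(block):
    def c(k):                              # normalization factor
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):                     # horizontal frequency
        for v in range(8):                 # vertical frequency
            s = sum(block[y][x]
                    * math.cos((2 * x + 1) * u * math.pi / 16)
                    * math.cos((2 * y + 1) * v * math.pi / 16)
                    for y in range(8) for x in range(8))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

flat = [[128] * 8 for _ in range(8)]   # a flat grey block...
coeffs = dct_8x8(flat)
print(round(coeffs[0][0]))             # 1024: all the energy is in the DC term
print(round(coeffs[0][1], 6))          # ~0.0: no AC detail to encode
```

A flat block collapses to a single DC value, which is exactly why the transform pays off: natural images are mostly smooth, so most coefficients end up near zero.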
The first value is a DC offset (average value). The rest are frequency perturbations that modify it. Frequency (and hence detail) increases to the right and towards the bottom. Fine detail is in the lower right of the grid with coefficients decreasing in magnitude for higher frequencies.
Starting at the top left, walk in a zigzag fashion down towards the lower right to order the coefficient values for transmission. (See Figure 5.) The values decrease towards zero, where the walk is truncated (entropy coding). Up to this point, the encoder is lossless. Discarding earlier, non-zero coefficients saves further bits at the risk of visible artifacts, yet even at high compression ratios the picture can still look good.
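The zigzag walk and the truncation of the near-zero tail can be sketched as follows (a simplified model: real entropy coders run-length and variable-length code the sequence rather than simply dropping the tail):

```python
# Zigzag scan of an 8x8 coefficient grid, then truncation of the
# trailing run of near-zero values.
def zigzag_indices(n=8):
    """Visit (row, col) pairs along anti-diagonals, alternating direction."""
    order = []
    for d in range(2 * n - 1):
        cells = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        order.extend(cells if d % 2 else reversed(cells))
    return order

def scan_and_truncate(grid, threshold=1):
    """Serialize the grid in zigzag order and drop the near-zero tail."""
    seq = [grid[r][c] for r, c in zigzag_indices(len(grid))]
    while seq and abs(seq[-1]) < threshold:
        seq.pop()
    return seq

grid = [[0] * 8 for _ in range(8)]
grid[0][0], grid[0][1], grid[1][0] = 96, 12, 7   # DC plus two low frequencies
print(scan_and_truncate(grid))                   # [96, 12, 7]
```

Sixty-four coefficients collapse to three values, because the zigzag order front-loads the low frequencies where the energy lives.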
Now, the original macroblock is represented by just a few coefficients ready to be coded into the bit stream. Buffer size feedback controls the entropy coding truncation to throttle the bit rate.
What compression ratios are possible? Say the video content can cull 25 percent of the macroblocks in an I-frame. Each subsequent stage contributes towards the result:
- I-frame culling — 75 percent
- B- and P-frames — 10 percent
- Sub-sampling — 50 percent
- DCT/Entropy — 50 percent
That is about a 50:1 compression factor and in the right ballpark for a well-tuned compression system. Reducing picture size and frame rate for Internet streaming will improve the performance.
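Multiplying out the per-stage survival fractions from the list above confirms the ballpark:

```python
# Each stage keeps only a fraction of the data left by the previous one.
stages = {"I-frame culling": 0.75,
          "B- and P-frames": 0.10,
          "Sub-sampling":    0.50,
          "DCT/Entropy":     0.50}
remaining = 1.0
for fraction in stages.values():
    remaining *= fraction                 # fractions compound, not add
print(round(1 / remaining, 1))            # 53.3 -> "about 50:1"
```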
Certainly, encoders will get better. H.264 will achieve impressive compression ratios, especially for HD. It is already very attractive for SD, and even more so for the emerging Interactive TV 2.0 concepts being developed around the MPEG-4 BIFS and LASeR standards. The journey isn't finished yet.
Cliff Wootton was the technical systems architect for BBC News Interactive TV and is now writing and developing advanced interactive TV content systems.
From MPEG-1 to H.264
The progression of the MPEG format:
- MPEG-1 delivers basic capabilities.
- MPEG-2 adds interlacing support for broadcast TV and DVD.
- MPEG-4 part 2 adds more sophisticated coding tools and alpha channels.
- MPEG-4 part 10 (aka AVC and H.264) adds more efficient DCT computation and better macroblock culling.
The important steps in video compression for each sequence are:
- Locate the I-frame.
- Define the slices.
- Store unique macroblocks.
- Locate and analyze the P-frame.
- Append its unique macroblocks.
- Analyze the remaining B-frames, saving new unique macroblocks.
- DCT the macroblocks into frequency plots.
- Entropy code to remove unnecessary fine detail.
- Assemble into a usable bit stream, taking care of buffer overflows.