Coming so soon after MPEG-1 and MPEG-2, MPEG-4 raises a number of questions. Will it replace MPEG-2? Does it make existing equipment obsolete? How does it affect the broadcast industry? To answer those questions, this article explores MPEG-4's predecessors and puts MPEG-4 into perspective.
Envivio’s Video Lounge allows users to select and watch their favorite music videos and buy CDs and concert tickets. The Video Lounge screens consist of multiple MPEG-4 objects that are composed using the MPEG-4 Binary Format for Scenes (BIFS) composition tools.
All compression-decompression schemes (codecs), including the MPEG variety, are concerned with using fewer data or less bandwidth to store or transmit pictures and sound. Compression certainly isn't new to broadcasting.
For example, interlacing scan lines is a way of halving the video bandwidth and is a crude compression technique. Systems like NTSC fit color into the same bandwidth as monochrome and are also classified as compression techniques.
Effective compression relies on two devices: an encoder at the signal source that packs the information more efficiently, and a corresponding decoder at the destination that unpacks it. Unless these two devices are compatible, the system won't work. Without standards to provide this compatibility, there would be chaos.
Flexible and upgradeable
The problems facing the MPEG designers were many, but the most important ones were how to make compression available to a wide range of applications and how to allow future enhancements to prevent obsolescence.
The designers wanted to make MPEG available to applications from large-screen, high-quality video systems to small, black-and-white security systems. Obviously, an electronic-cinema compression system designed to work on a giant screen must have more powerful hardware and more memory than a system designed for a security camera.
The designers addressed this problem by defining “Levels” and “Profiles” in the system. Levels set limits to the amounts of processing power and memory needed to decode the signal. Profiles set limits to the complexity of the encoding and decoding.
Figure 1. MPEG-2 did not obsolete MPEG-1, but rather augmented it. MPEG-2 can be thought of as a larger toolbox that includes all of the MPEG-1 tools. Likewise, the MPEG-4 toolbox is massive, but it still contains all of the MPEG-1 and MPEG-2 tools.
MPEG designers addressed the problem of avoiding obsolescence by adopting two steps. The first was to define the signal between the encoder and decoder devices, instead of defining the devices themselves. The second was to make improvements that were backward compatible.
MPEG defines the syntax and protocol of the signal between the encoder and the decoder. This is, effectively, a kind of language. A compliant encoder is one that can speak the language, even if it only has a limited vocabulary. A compliant decoder must be able to understand the whole vocabulary at a particular profile, just in case an encoder chooses to use some of the obscure terms.
MPEG works by making available a set of tools, each of which may result in compression under different circumstances. One of these is the discrete cosine transform (DCT), which turns an eight-row-by-eight-column block of pixels into a set of coefficients. When typical images are subjected to DCT, many of the resulting coefficients are small, or zero, and thus can be eliminated from the data stream. Another MPEG tool is the ability to send only the picture differences needed to convert another picture into the present picture. An intelligent encoder will choose the most appropriate tools for the type of incoming material.
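The effect of the DCT on typical picture material can be illustrated with a short sketch. It is not taken from any MPEG reference encoder: the input block is a made-up brightness ramp, and the threshold that stands in for a quantizer is a hypothetical value chosen only for illustration.

```python
import math

N = 8  # MPEG block size: eight rows by eight columns

def dct_2d(block):
    """Return the 2-D DCT-II coefficients of an N x N block of pixel values."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for v in range(N):          # vertical spatial frequency
        for u in range(N):      # horizontal spatial frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[v][u] = scale(u) * scale(v) * total
    return coeffs

# Illustrative input: a gentle horizontal brightness ramp (values are made up).
ramp = [[100 + 2 * x for x in range(N)] for _ in range(N)]
coeffs = dct_2d(ramp)

# Count the coefficients that a coarse quantizer would plausibly discard.
THRESHOLD = 1.0  # hypothetical cut-off, not taken from any MPEG quantizer table
small = sum(1 for row in coeffs for c in row if abs(c) < THRESHOLD)
print(f"{small} of {N * N} coefficients are below {THRESHOLD}")
```

Because every row of the ramp is identical, all of the coefficients with a non-zero vertical frequency come out as zero (to within rounding error), which is the kind of redundancy the DCT exposes.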
Figure 2. A comparison between MPEG-2 and MPEG-4. MPEG-4 brings extra tools to image coding.
Developing compression tools is rather like an arms race. Engineers build a simple encoder that works well, but occasionally it encounters a particular type of picture that it can't compress. So the engineers develop a new tool to fix that. The result is that the encoder encounters uncompressible pictures less often, but it still happens. So the engineers keep developing new tools to keep up with compression difficulties. They can't develop these tools overnight, and they have to start somewhere. MPEG-1 was the starting point for compressing moving pictures. It contained a useful set of compression tools (some of which were developed for JPEG), but it was by no means the definitive system.
Figure 1 shows that MPEG-2 did not obsolete MPEG-1, but rather augmented it. In a sense, MPEG-2 is a larger toolbox that includes all of the MPEG-1 tools. A compliant MPEG-2 decoder therefore contains an MPEG-1 decoder, and would interpret an MPEG-1 bit stream as a valid MPEG-2 bit stream that did not explore all of the coding possibilities. However, an MPEG-1 decoder would not understand the coding tools that MPEG-2 introduced. This relationship of backward compatibility within the MPEG family holds for MPEG-4 as well. The MPEG-4 toolbox is massive, but still contains all of the MPEG-1 and MPEG-2 tools.
Let's consider the extra tools MPEG-4 brings to image coding. Figure 2 shows a comparison between MPEG-2 and MPEG-4 in this respect. In all of the MPEGs, the DCT creates coefficients. It does this on a block-by-block basis, which gives a certain degree of compression. However, in some picture areas, the coefficients in one block may be similar to those in the next. In these cases, the codec can obtain better performance by predicting coefficients from an earlier block rather than just sending new ones whole. MPEG-2 does this using the so-called DC coefficient. The author cannot see how an array of pixels can result in a direct current, and it might be more accurate to refer to this parameter as the zero-spatial-frequency coefficient, which effectively conveys the average brightness of the pixel block.
Clearly, in an image containing a large plain area, the average brightness of several blocks might be the same. MPEG-2 takes advantage of this by arranging blocks in a horizontal picture strip called a slice.
Within a slice, the first block will have an absolute value for the DC coefficient, whereas the subsequent blocks will have difference values that must be added to the previous block's value to create the value of the current block.
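The differential coding of DC coefficients within a slice can be sketched as follows. This is a minimal illustration that follows the description above (an absolute value for the first block, differences thereafter); the function names and the sample values are hypothetical, and the entropy coding that a real bit stream would apply to the differences is omitted.

```python
def encode_slice_dc(dc_values):
    """First block carries its absolute DC value; later blocks carry differences."""
    coded = []
    previous = None
    for dc in dc_values:
        coded.append(dc if previous is None else dc - previous)
        previous = dc
    return coded

def decode_slice_dc(coded):
    """Rebuild the absolute DC values by accumulating the differences."""
    decoded = []
    previous = None
    for value in coded:
        previous = value if previous is None else previous + value
        decoded.append(previous)
    return decoded

# Made-up DC values for a slice crossing a plain picture area.
slice_dc = [120, 120, 121, 121, 121, 119]
coded = encode_slice_dc(slice_dc)
print(coded)  # [120, 0, 1, 0, 0, -2]
```

In a plain area the differences cluster around zero, so they can be represented with far fewer bits than the absolute values.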
MPEG-4 goes further than this. Figure 3 shows that, in MPEG-4, one can predict the entire top row of coefficients, or the entire left column of coefficients, from an earlier block. Choosing between predicting the row coefficients or column coefficients would be based on the picture content. For example, consider an image containing a dominant vertical object such as a utility pole. Scanning horizontally across this image would result in large changes as the pole is encountered, whereas scanning vertically down the image would result in a column of blocks all containing the pole and all having similar coefficients whose similarity could be exploited to get better compression.
Figure 3. With MPEG-4, one can predict the entire top row of coefficients or the entire left column of coefficients from an earlier block.
To use MPEG-4 terminology, the utility pole would result in strong horizontal picture gradients that would indicate the use of vertical prediction. If the picture contained a dominant horizontal object such as a horizon, that object would create strong vertical picture gradients, indicating that horizontal prediction would be the better mode. Since the choice of mode is based upon picture gradients measured in blocks that have already been decoded, the decoder can determine which mode the encoder must have used simply by examining those blocks, so no extra data needs to be transmitted to specify the mode.
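The gradient-based mode decision can be sketched in a few lines. This is a minimal illustration, assuming the decision compares the DC coefficients of three already-decoded neighboring blocks (left, above-left and above). The function name, argument names and sample values are hypothetical, and the exact rule should be checked against the MPEG-4 specification.

```python
def choose_prediction(dc_left, dc_above_left, dc_above):
    """Pick a prediction direction from the DC values of decoded neighbors.

    Both encoder and decoder can run this on the same decoded data,
    so the chosen mode never has to be transmitted.
    """
    # Change going down the left-hand column of neighbors.
    vertical_gradient = abs(dc_left - dc_above_left)
    # Change going across the row of neighbors above.
    horizontal_gradient = abs(dc_above_left - dc_above)
    if vertical_gradient < horizontal_gradient:
        return "vertical"    # predict from the block above
    return "horizontal"      # predict from the block to the left

# Utility pole: brightness changes sharply from left to right, not top to bottom.
print(choose_prediction(dc_left=200, dc_above_left=200, dc_above=60))

# Horizon: brightness changes sharply from top to bottom, not left to right.
print(choose_prediction(dc_left=80, dc_above_left=200, dc_above=200))
```

The first call selects vertical prediction and the second selects horizontal prediction, matching the utility pole and horizon cases.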
MPEG-2 uses motion compensation, sending one vector per macroblock (a macroblock is a set of four luminance blocks together with their associated color-difference blocks). MPEG-2 uses this vector to bring pixels from another picture to the location in the present picture where they give the greatest similarity to the actual values. MPEG-2 predicts the vectors horizontally using slices (a slice is a series of macroblocks). MPEG-4 enhances this approach by allowing extended vector prediction. MPEG-4 can predict the vector for a given macroblock from those above or to the left, so it only sends the prediction error. As this reduces the amount of vector data, it then becomes possible to have one vector per DCT block instead of one per macroblock. With four times as many vectors, the motion prediction will be better, resulting in smaller prediction errors.
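A sketch of vector prediction follows. The text above says only that a vector is predicted from those above or to the left. The component-wise median of three neighbors used here is an assumption chosen for illustration, as are the function names and the sample vectors.

```python
def median3(a, b, c):
    """Return the middle value of three numbers."""
    return sorted((a, b, c))[1]

def predict_vector(left, above, above_right):
    """Predict a motion vector from three neighbors (assumed median rule)."""
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))

def vector_residual(actual, predicted):
    """Only this prediction error would need to be transmitted."""
    return (actual[0] - predicted[0], actual[1] - predicted[1])

# Made-up neighboring vectors for a smoothly moving area.
left, above, above_right = (4, -1), (5, -1), (5, 0)
actual = (5, -1)

predicted = predict_vector(left, above, above_right)
residual = vector_residual(actual, predicted)
print(predicted, residual)  # (5, -1) (0, 0)
```

Where neighboring blocks move together, the residual is zero or close to it, which is what makes it affordable to send four times as many vectors.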
Considering just picture compression, MPEG-4 does slightly better than MPEG-2, but not sufficiently better to warrant obsoleting the latter. MPEG-2 is well established in broadcast production and transmission as well as in DVDs, and is not under serious threat from MPEG-4.
However, whereas MPEG-1 and -2 work with pictures in their entirety, MPEG-4 goes far beyond that. It can work with picture information generated, captured or manipulated by computers, and it is in these areas that the potential of MPEG-4 resides.
Figure 4. MPEG-4 approaches graphics better than MPEG-2. As Figure 4a shows, an MPEG-2 encoder expects as an input a complete picture repeating at the frame rate. However, an MPEG-4 encoder can handle the graphic instructions directly so that the rendering engine is actually in the MPEG-4 decoder, as Figure 4b shows.
Figure 4 shows a comparison between MPEG-2 and MPEG-4. In Figure 4a, an MPEG-2 encoder expects as an input a complete picture repeating at the frame rate. Imagine that such a picture was the output of a graphics engine that was rendering images in real time. The graphics engine would compute the appearance of any virtual objects, from the selected viewpoint, using ray tracing. If the viewpoint or one of the objects were to move, each video frame would be different and the MPEG-2 encoder would use its coding tools to encode the image differences. However, the motion of a virtual object could be described by one vector. Figure 4b shows that an MPEG-4 encoder can handle the graphic instructions directly so that the rendering engine is actually in the MPEG-4 decoder. Once the appearance of objects is established in the decoder, animating them requires little more than transmitting a few vectors.
Figure 5 shows that MPEG-4 works with four types of objects. Objects may be encoded as two- or three-dimensional data. Two-dimensional objects are divided into video and still. A video object is a textured area of arbitrary shape that changes with time, whereas a still texture object does not change with time. Typically, a still texture object may be a background. Although it does not change with time, it may give the illusion of doing so. For example, if the background pixel array is much larger than the display, the display can pan across the background to give the impression of motion.
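The panning example can be sketched as a display window sliding across a larger stored background. This is an illustration only; the function name, the array sizes and the pixel values are hypothetical.

```python
def view_window(background, left, top, width, height):
    """Return the part of the background that the display currently shows."""
    return [row[left:left + width] for row in background[top:top + height]]

# A made-up still background, 12 pixels wide and 4 pixels high.
background = [[x + 100 * y for x in range(12)] for y in range(4)]

# The background is sent once; each frame then needs only a new window position.
pan_positions = [0, 4, 8]
frames = [view_window(background, left, 0, 4, 4) for left in pan_positions]

for left, frame in zip(pan_positions, frames):
    print(left, frame[0])
```

Each displayed frame differs from the last, yet the only new information per frame is the position of the window.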
Figure 5 also shows that MPEG-4 standardizes ways of transmitting the three-dimensional shape of a virtual object, known as a mesh object, along with the means to map its surface appearance, or texture, onto that object. Generally, it can handle any shape object. The decoder re-creates each object and renders it from the selected viewpoint. In parts of the picture where there is no object, the decoder keys in the background. It should be clear that if the decoder is aware of the shape and texture of all relevant objects, the encoder does not need to choose the viewpoint. In an interactive system, the viewer might choose the viewpoint.
For applications such as videophones and video conferencing, MPEG-4 supports a specialized type of mesh object that represents a human face alone or a human face and body.
At this point, we have left television and video far behind. Forget rasters and pixel arrays. Instead, consider the appearance of an object changing with time. A flexible object may actually change shape, whereas a rigid object would appear to change shape as the result of a change of perspective. MPEG-2 can only handle perspective shape changes very crudely, using motion-compensation vectors on individual macroblocks. MPEG-4 handles such changes using a technique called warping. It samples the surface of the object at a series of points. As the object changes, the points may move relative to one another. If a flexible object shrinks, all of the points get closer together, whereas if it turns in perspective, some points may get closer together and others may get farther apart. The points form a structure called a mesh. MPEG-4 can distort the mesh by sending vectors to move the points. As the points move, the texture is interpolated in the decoder so that it continues to fit the new shape. In this way, object movements that would result in considerable picture differences in the video domain can be coded by MPEG-4 with just a handful of vectors.
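The warping step can be sketched for a single triangle of a two-dimensional mesh. This is a minimal illustration, assuming the decoder uses barycentric interpolation to find where each point of the warped triangle came from in the original texture. The interpolation method, function names and coordinates are assumptions made for the example.

```python
def move_vertices(vertices, vectors):
    """Apply one transmitted vector to each mesh point."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(vertices, vectors)]

def barycentric(p, a, b, c):
    """Return weights (wa, wb, wc) such that p = wa*a + wb*b + wc*c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb

def source_position(p, warped, original):
    """Find the texture position in the original triangle that maps to p."""
    wa, wb, wc = barycentric(p, *warped)
    x = wa * original[0][0] + wb * original[1][0] + wc * original[2][0]
    y = wa * original[0][1] + wb * original[1][1] + wc * original[2][1]
    return (x, y)

# One triangle of the mesh, and the three vectors that stretch it horizontally.
original = [(0.0, 0.0), (8.0, 0.0), (0.0, 8.0)]
vectors = [(0.0, 0.0), (4.0, 0.0), (0.0, 0.0)]
warped = move_vertices(original, vectors)

# The texture for display point (6, 2) is fetched from (4, 2) in the original.
print(source_position((6.0, 2.0), warped, original))
```

Only the three vectors are new information; the texture itself is reused and simply resampled to fit the new shape.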
Figure 5. MPEG-4 works with four types of objects: video and still objects, which are two-dimensional, and mesh and face/body animation objects, which are three-dimensional.
In two-dimensional coding, the equivalent of a video frame is the video object plane (VOP). VOPs occur at the frame rate and intersect one or more video objects. MPEG-4 uses prediction between VOPs just as MPEG-2 uses prediction between pictures. There are groups of video object planes (GOVs) that contain I-VOPs, P-VOPs and B-VOPs. Figure 6 shows how a B-VOP can be bi-directionally coded from two other VOPs.
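Bi-directional coding can be sketched in simplified form. The example below assumes the prediction is a rounded average of an earlier and a later reference, and it omits motion compensation and shape information entirely. The function names, the rounding convention and the sample values are assumptions made for illustration.

```python
def predict_bidirectional(past, future):
    """Predict each sample as the rounded average of two reference samples."""
    return [(p + f + 1) // 2 for p, f in zip(past, future)]

def residual(actual, predicted):
    """The difference that remains to be coded after prediction."""
    return [a - p for a, p in zip(actual, predicted)]

# One made-up row of samples from an earlier reference, a later reference,
# and the VOP that lies between them in time.
past = [100, 102, 104, 106]
future = [110, 112, 114, 116]
actual = [105, 107, 110, 111]

predicted = predict_bidirectional(past, future)
print(predicted)                     # [105, 107, 109, 111]
print(residual(actual, predicted))   # [0, 0, 1, 0]
```

When the in-between VOP lies close to the average of its neighbors, the residual is nearly all zeros and costs very little to send.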
In three-dimensional coding, the meshes are not planar. Instead, each point or vertex has x, y and z coordinates. The encoder describes the mesh to the decoder using differential coding, where each vertex is coded as a spatial difference from the previous one. Each set of three points forms a triangle that can be filled with texture. By definition, a triangle has a flat surface. Based on this, it is possible to have a scalable three-dimensional mesh. The base-level mesh describes a body having reference planar triangles, but a subsequent layer could make the shape of the body more accurate by defining new points that are displaced with reference to the triangle surface. In this way, the decoder can produce an image that is the best possible for the allowable bit rate.
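The differential coding of mesh vertices can be sketched as follows. This is an illustration of the principle only: the function names and coordinates are hypothetical, and the quantization and entropy coding that a real encoder would apply are omitted.

```python
def encode_vertices(vertices):
    """Send the first vertex whole, then each vertex as an offset from the last."""
    if not vertices:
        return []
    coded = [vertices[0]]
    for previous, current in zip(vertices, vertices[1:]):
        coded.append(tuple(c - p for c, p in zip(current, previous)))
    return coded

def decode_vertices(coded):
    """Rebuild absolute x, y, z coordinates by accumulating the offsets."""
    if not coded:
        return []
    vertices = [coded[0]]
    for delta in coded[1:]:
        vertices.append(tuple(v + d for v, d in zip(vertices[-1], delta)))
    return vertices

# Made-up integer coordinates for three neighboring vertices.
mesh = [(100, 200, 50), (102, 201, 50), (103, 199, 52)]
coded_mesh = encode_vertices(mesh)
print(coded_mesh)  # [(100, 200, 50), (2, 1, 0), (1, -2, 2)]
```

Neighboring vertices lie close together, so the offsets are small numbers that need fewer bits than full coordinates.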
In face animation, MPEG-4 defines a neutral face. This is a mesh having a set of vertices corresponding to an average expressionless human face. All decoders know this neutral face, and so to create a real face it is only necessary to transmit the differences between the vertices of the neutral face and those of the real individual. Texture then covers the surface to obtain a realistic reproduction of the speaker in three dimensions. The decoder can then render the face for a certain viewpoint.
To animate the speaker, the encoder sends vectors that move the vertices of the facial mesh. However, these are not generic vectors, but are specifically designed vectors relating to the kind of expressions that humans use. As the decoder receives the vectors, it modifies the mesh to create the latest version of the shape of the face. It then maps existing texture onto the face. This is a very efficient process because facial texture only needs to be sent once. From then on, the facial expressions are obtained by warping the texture to fit the new mesh. In practice, the bit rate needed to send the mesh update is less than the bit rate needed for compressed speech.
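The two stages of face animation can be sketched as follows: a one-time personalization of the neutral face, then per-frame updates. This is a heavily simplified illustration. The three-vertex "face", the offsets and the function name are hypothetical, and the expression-specific vectors described above are reduced here to plain per-vertex displacements.

```python
# A stand-in for the neutral face that every decoder is assumed to hold.
NEUTRAL_FACE = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 2.0, 0.5)]

def add_offsets(mesh, offsets):
    """Displace every vertex of the mesh by its matching offset."""
    return [tuple(m + o for m, o in zip(vertex, offset))
            for vertex, offset in zip(mesh, offsets)]

# Sent once: how this speaker's face differs from the neutral face.
identity_offsets = [(0.0, 0.0, 0.0), (0.25, 0.0, 0.0), (0.0, -0.25, 0.125)]
speaker_face = add_offsets(NEUTRAL_FACE, identity_offsets)

# Sent per frame: small displacements that change the expression.
expression_vectors = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)]
current_face = add_offsets(speaker_face, expression_vectors)

print(speaker_face)
print(current_face)
```

The texture is not part of the per-frame update. The decoder warps the texture it already holds onto each new mesh.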
Figure 6. VOPs in MPEG-4 work like pictures in MPEG-2 and can be intra-, forward- or bi-directionally coded.
Given the massive power of MPEG-4, there isn't much it can't compress. One of the fundamentals of coding theory is that the complexity of the encoder and decoder must rise with the compression factor. MPEG-4 probably represents the practical limit in coding complexity. Although it can reduce a moving image to a few vectors, it does require the decoder to be a powerful graphics-rendering engine. And although the object-based tools of MPEG-4 are very efficient, they are easily applicable only to computer-generated images. In principle, an encoder could be built that would take in real video from a natural image and dissect it into objects, but this would be a very complex process.
So there we have it. Is MPEG-4 clever? Yes. Does it make MPEG-2 obsolete? No. MPEG-4 will find applications in videophones, video conferencing, Internet image transfer, interactive video games and virtual reality, but it won't replace MPEG-2 in DVB, DVD or television production.
John Watkinson is a high technology consultant and author of The MPEG Handbook (Focal Press).