The new flavors of MPEGs

Compression standards have become a matter of great importance in the television broadcast community. Most television industry activity revolves around Motion JPEG, MPEG-2 and DV compression, with MPEG-2 being used almost exclusively for final delivery to the consumer.

We hear of MPEG-4 and MPEGs beyond that, and some engineers and managers wonder if the new work will supplant MPEG-2 and require even more capital investment. Do broadcasters need to think about MPEG-4, MPEG-7 and MPEG-21? Will these developments mean that a digital conversion today will become obsolete within a few years? More optimistically, will the new systems mean new revenue opportunities?

The evolution of MPEG compression

The impact of compression on broadcast television started with JPEG or, more accurately, with various proprietary implementations of Motion JPEG systems that enabled nonlinear editing, broadcast-quality disk-based servers and more. Unfortunately there was no standard for adjusting JPEG compression to fit an image sequence within bandwidth limits, so each manufacturer's solution was different, and there was no possibility of interchange of compressed signals.

The first international compression standard for motion imagery, MPEG-1, was developed between 1988 and 1992 and included motion compensation for temporal compression. MPEG-1 represented a remarkable technical achievement, but had little direct impact on the television broadcast industry. It was, by design, limited to SIF picture size (352×240 pixels) and to a compressed data rate of approximately 1.5Mb/s, and it had no tools to handle interlaced images.

It should be noted that the tool set and syntax developed for MPEG-1 were vastly more powerful than the constraints of the standard suggest. MPEG-1 syntax was used for direct-to-home satellite broadcasting, and by one of the proponents to compress HDTV during the competitive phase of the Advanced Television Service proceedings.

MPEG-1 was also noteworthy for its approach to interoperability. The standard defines a tool set, the syntax of the bit stream and the operation of the decoder. It does not define the operation of the encoder — any encoder that produces a syntactically valid bit stream that can be decoded by a compliant decoder is fair game. This allows for the evolution of encoding technology without change to the standard, and without rendering existing decoders obsolete.

MPEG-1 was frozen (i.e., subsequent changes were allowed to be editorial only) in 1991. In the same year the MPEG-2 process was started, and MPEG-2 became a standard in 1995. The initial goals were simple — there was a need for a standard that would accommodate broadcast-quality video, including interlace.

In many ways, MPEG-2 represents the “coming-of-age” of MPEG. The greater flexibility of MPEG-2, combined with the increased availability of large-scale integrated circuits, meant that MPEG-2 could be used in a vast number of applications.

The success of MPEG-2 is best highlighted by the demise of MPEG-3, intended for high-definition television. MPEG-3 was soon abandoned when it became apparent that MPEG-2 embraced this application with ease. MPEG-2 is the basis for both the ATSC and DVB broadcast standards and the compression system used by DVD.

Perhaps the most fundamental change brought by MPEG-2 is the number of compliance points. MPEG-1 defined a single compliance point. Every MPEG-1 compliant decoder had to decode any MPEG-1 compliant bit stream. MPEG-2 capabilities include a vast range of image sizes, encoding tools and data rates, suitable for different applications. A decoder capable of handling every possible MPEG-2 bit stream would be enormously complex and expensive, and enforcing this generality would preclude the use of the standard in most environments.

The MPEG committee decided on a structure of profiles and levels. Profiles define the tools and syntactical elements that may be used; levels define the permissible ranges of parameters. Various combinations of profile and level are provided to allow practical subsets to be implemented in a standard manner, as shown in Figure 1. For standard definition television we generally use Main Profile at Main Level (MP@ML) or the studio 4:2:2 Profile at Main Level (4:2:2@ML). For high-definition television we use the same profiles at High Level (MP@HL and 4:2:2@HL).
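The idea of a compliance point can be sketched in a few lines of Python. This is only an illustration: the class and function names are invented for this example, the check covers just picture size, bit rate and chroma format, and the numeric limits are commonly quoted upper bounds that should be verified against the MPEG-2 video specification before being relied on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompliancePoint:
    """One profile/level combination (hypothetical structure for illustration)."""
    chroma_formats: tuple   # set by the profile (tools and syntax)
    max_width: int          # set by the level (parameter ranges)
    max_height: int
    max_bitrate_mbps: float


# Assumed, commonly quoted limits; check against the standard before use.
COMPLIANCE_POINTS = {
    "MP@ML": CompliancePoint(("4:2:0",), 720, 576, 15.0),
    "4:2:2@ML": CompliancePoint(("4:2:0", "4:2:2"), 720, 608, 50.0),
    "MP@HL": CompliancePoint(("4:2:0",), 1920, 1152, 80.0),
    "4:2:2@HL": CompliancePoint(("4:2:0", "4:2:2"), 1920, 1088, 300.0),
}


def fits(name, width, height, bitrate_mbps, chroma="4:2:0"):
    """Return True if a stream with these parameters lies within the named point."""
    p = COMPLIANCE_POINTS[name]
    return (
        chroma in p.chroma_formats
        and width <= p.max_width
        and height <= p.max_height
        and bitrate_mbps <= p.max_bitrate_mbps
    )


def first_fit(width, height, bitrate_mbps, chroma="4:2:0",
              candidates=("MP@ML", "4:2:2@ML", "MP@HL", "4:2:2@HL")):
    """Return the first candidate point that accommodates the stream, or None."""
    for name in candidates:
        if fits(name, width, height, bitrate_mbps, chroma):
            return name
    return None
```

Under these assumed limits, a 720×480 stream at 6Mb/s falls within MP@ML, while a 1920×1080 stream at 19.4Mb/s requires MP@HL. A decoder built for one compliance point needs to handle only the streams that fit inside it.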

MPEG-4

The wheels of international standardization grind slowly, and to ensure a standard is eventually achieved there are strict rules that prohibit substantive change after a certain point in the process. By the time a standard is officially adopted, there is often a backlog of desired enhancements and extensions — as it was with MPEG-2. As discussed above, MPEG-3 had been started and abandoned, so the next project became MPEG-4. Two versions of MPEG-4 are already complete and work is continuing on further extensions.

At first, the main focus of MPEG-4 was the encoding of video and audio at very low rates. In fact, the standard was explicitly optimized for three bit rate ranges:

Below 64kb/s
64kb/s to 384kb/s
384kb/s to 4Mb/s

Performance at low bit rates remained a major objective and some creative ideas contributed to this end. Great attention was also paid to error resilience, making MPEG-4 suitable for use in error-prone environments such as transmission to personal handheld devices. However, other profiles and levels use bit rates up to 38.4Mb/s, and work is still proceeding on studio-quality profiles and levels using data rates up to 1.2Gb/s.

More importantly, MPEG-4 evolved into a new concept of multimedia encoding with powerful tools for interactivity and a vast range of applications. This article can provide only the briefest of introductions to the system as the official “overview” of this standard spans 67 pages.

Object coding

The most significant departure from conventional transmission systems is the concept of objects. Different parts of the final scene can be coded and transmitted separately as video objects and audio objects to be brought together, or composited, by the decoder. Different object types may be coded with the tools most appropriate for the job. The objects may be generated independently, or a scene may be analyzed to separate, for example, foreground and background objects. In one interesting demonstration, video coverage of a soccer game was processed to separate the ball from the rest of the scene. The background (the scene without the ball) was transmitted as a “teaser” to attract a pay-per-view audience. Anyone could see the players and the field, but only those who paid could see the ball.

The object-oriented approach leads to three key characteristics of MPEG-4 streams:

Multiple objects may be encoded using different techniques and then composited at the decoder;
Objects may be of natural origin, such as scenes from a camera, or synthetic, such as text; and,
Instructions in the bit stream, and/or user choice, may enable several different presentations from the same bit stream.
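A toy model may help make these three characteristics concrete. The Python sketch below is not MPEG-4 syntax; the class, the field names and the entitlement flag are all invented for illustration. It shows only how one set of transmitted objects can yield different presentations at the decoder, as in the soccer demonstration.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """A separately coded object (hypothetical structure, not MPEG-4 syntax)."""
    name: str
    kind: str                      # "natural" (camera) or "synthetic" (text, graphics)
    layer: int                     # drawing order; lower layers are drawn first
    requires_entitlement: bool = False


def compose(objects, entitled=False, hidden=frozenset()):
    """Return the names of the objects to present, in drawing order.

    `entitled` models a stream-side restriction (pay-per-view);
    `hidden` models a viewer's choice to switch objects off.
    """
    visible = [
        o for o in objects
        if o.name not in hidden and (entitled or not o.requires_entitlement)
    ]
    return [o.name for o in sorted(visible, key=lambda o: o.layer)]


scene = [
    SceneObject("score_caption", "synthetic", layer=2),
    SceneObject("ball", "natural", layer=1, requires_entitlement=True),
    SceneObject("field_and_players", "natural", layer=0),
]
```

With this scene, an unentitled viewer is presented with the field, the players and the caption but no ball. An entitled viewer gets all three objects, and may still choose to hide the caption.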

The generalized system for object coding in MPEG-4 is shown in Figure 2. This diagram also emphasizes the opportunities for user interaction within MPEG-4 systems — a powerful feature, particularly for video game designers.

These capabilities do not have to be used. MPEG-4 provides traditional coding of video and audio and improves on MPEG-2 by offering enhanced efficiency and resilience to errors. However, the true power of MPEG-4 comes from the architecture described above. The coding of objects independently offers a number of advantages. Each object may be coded in the most efficient manner, and different spatial or temporal scaling (see below) may be used as appropriate.

Video coding

Many of the video coding tools in MPEG-4 are similar to those of MPEG-2, but are enhanced by better use of predictive coding and more efficient entropy coding. However, the application of the tools may differ significantly from earlier standards.

MPEG-4 codes video objects. In the simplest model a video is coded in much the same way as in MPEG-2, but it is described as a single video object with a rectangular shape. The representation of the image is known as texture coding. Where there is more than one video object, some may have irregular shapes, and generally all will be smaller than a full-screen background object. This means only the active area of the object needs to be coded, but the shape and position must also be represented. The standard includes tools for shape coding of rectangular and irregular objects, in either binary or gray-scale representations (similar to an alpha channel).
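The role of texture, shape and position at the decoder can be illustrated with a simple blend. The Python sketch below assumes 8-bit luma samples held in nested lists and an 8-bit shape mask; the function name and data layout are assumptions for this example and do not reflect how the standard represents shape or texture.

```python
def composite(background, texture, shape, top, left):
    """Blend a small object onto a background picture.

    background: 2D list of samples (0..255), the full-screen object
    texture:    2D list of samples for the object's active area
    shape:      2D list of the same size as texture; 0 is transparent and
                255 is opaque (a binary shape uses only these two values,
                a gray-scale shape uses the values in between)
    top, left:  position of the object within the background
    """
    out = [row[:] for row in background]          # leave the input untouched
    for y, (t_row, s_row) in enumerate(zip(texture, shape)):
        for x, (t, a) in enumerate(zip(t_row, s_row)):
            b = out[top + y][left + x]
            # Weighted mix of object and background, rounded to nearest.
            out[top + y][left + x] = (a * t + (255 - a) * b + 127) // 255
    return out
```

Only the samples inside the object's active area are touched. Where the shape value is 255 the object replaces the background, where it is 0 the background shows through, and intermediate values give a soft edge.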

Scalability

In the context of media compression, scalability means the ability to distribute content at more than one quality level within the same bit stream. MPEG-2 and MPEG-4 both provide scalable profiles using a conventional model; the encoder generates a base layer and one or more enhancement layers. The enhancement layer(s) may be discarded for transmission or decoding if insufficient resources are available. This approach works, but all decisions about quality levels have to be made at the time of encoding, and in practice the number of enhancement layers is severely limited (usually to one).
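The conventional layered model can be expressed as a short sketch. The Python function below is a hypothetical helper, and the rates in the usage note are invented. It assumes that layers are only useful in order: the base layer first, then each enhancement layer in turn.

```python
def layers_to_send(layer_rates_kbps, available_kbps):
    """Return how many layers, counted from the base layer, fit the channel.

    layer_rates_kbps lists the base layer rate first, followed by each
    enhancement layer. An enhancement layer is useless without the layers
    beneath it, so counting stops at the first layer that does not fit.
    """
    total = 0
    count = 0
    for rate in layer_rates_kbps:
        if total + rate > available_kbps:
            break
        total += rate
        count += 1
    return count
```

For a base layer of 300kb/s and one enhancement layer of 700kb/s, a 1200kb/s channel carries both layers, a 500kb/s channel carries the base layer only, and a 200kb/s channel carries nothing. The quality steps are fixed by the rates chosen at the encoder.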

Later versions of MPEG-4 include the fine grain scalability (FGS) profile. This technique generates a single bit stream representing the highest quality level, but allows for lower-quality versions to be extracted downstream. FGS uses bit-plane encoding. The quantized coefficients are “sliced” one bit at a time, starting with the most significant bit. This provides a coarse representation of the largest (and most significant) coefficient(s). Subsequent slices provide more accurate representations of the most-significant coefficients, and coarse approximations of the next most significant and so on.
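Bit-plane slicing is easy to demonstrate. The Python sketch below is a simplification that works on non-negative integer magnitudes only; the function names are invented here, and the sign handling and entropy coding of a real FGS coder are left out.

```python
def to_bit_planes(coeffs, num_planes):
    """Slice non-negative integers into bit planes, most significant first."""
    return [
        [(c >> p) & 1 for c in coeffs]
        for p in range(num_planes - 1, -1, -1)
    ]


def from_bit_planes(planes, num_planes):
    """Rebuild the coefficients from however many leading planes arrived.

    `planes` may be any prefix of the full list, which models a stream
    that was truncated downstream of the encoder.
    """
    values = [0] * len(planes[0]) if planes else []
    for i, plane in enumerate(planes):
        shift = num_planes - 1 - i
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values
```

With coefficients 13, 6, 1 and 0 sliced into four planes, the first plane alone reconstructs 8, 0, 0, 0. Two planes give 12, 4, 0, 0, and all four restore the original values. The stream can therefore be cut after any plane, and the decision about quality does not have to be made at the encoder.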

Spatial scaling, including FGS, may be combined with temporal scaling that permits the transmission and/or decoding of lower frame rates when resources are limited. As mentioned above, objects may be scaled differently. It may be appropriate to retain full temporal resolution for an important foreground object, but to update the background at a lower rate.
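Per-object temporal scaling amounts to giving each object its own update interval. The Python sketch below uses invented names and is only a schedule, not a decoder. It assumes each object is refreshed every n frames and simply held in between.

```python
def update_schedule(num_frames, update_interval):
    """List, for each frame time, the objects that are refreshed.

    update_interval maps an object name to n, meaning "refresh every n
    frames"; between refreshes the decoder would repeat the last version.
    """
    return [
        [name for name, n in update_interval.items() if t % n == 0]
        for t in range(num_frames)
    ]
```

With the foreground refreshed every frame and the background every second frame, the background is sent at half the frame rate while the foreground keeps full temporal resolution.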

Other aspects of MPEG-4

MPEG-4 is enormous, and the comments above touch on only a few of the many aspects of the standard. There are studio profiles for high-quality encoding that, in conjunction with object coding, will permit structured storage of all the separate elements of a video composite. Facial and body animation profiles will permit a stored face to “read” text in many languages. Further extensions of MPEG-4 may even provide solutions for digital cinema. Figure 3 shows the MPEG-4 profiles defined today.

Some describe MPEG-4 as the standard for video games, and certainly many of the constructs are ideally suited to that industry. However, even a cursory examination of the standard reveals such a wealth of capabilities, and such depth in every aspect, that the potential applications are endless.

MPEG-7

The first question about MPEG-7 is “why seven?” As mentioned above, the cancellation of MPEG-3 caused the sequence of real standards to be MPEG-1, MPEG-2, and MPEG-4. There were those on the committee who took the pragmatic approach and expected the next standard to be MPEG-5. The “binary-buffs” saw the historical 1-2-4 as the start of a pre-ordained binary sequence, and wanted the new work to be MPEG-8. Finally, it was concluded that any simple sequence would fail to signal the fundamental difference from the work of MPEG-1 through MPEG-4, and MPEG-7 was chosen.

MPEG-7 is not about compression; it is about metadata, also known as the “bits about the bits.” Metadata is digital information that describes the content of other digital data. In modern parlance, the program material or content (the actual image, video, audio or data objects that convey the information) is known as essence. The metadata tells the world all it needs to know about what is in the essence.

Anyone who has been involved with the storage of information, be it videotapes, books, music, whatever, knows the importance and the difficulty of accurate cataloging and indexing. Stored information is useful only if its existence is known, and if it can be retrieved in a timely manner when needed.

This problem has always been with us and is addressed in the analog domain by a combination of labels, catalogs, card indexes, etc. More recently, the computer industry has given us efficient, cost-effective, relational databases that permit powerful search engines to access stored information in remarkable ways — provided the information is present in a form the search engine can use.

The real problem is that the world is generating new media content at an enormous and ever-increasing rate. With the increasing quantity and decreasing cost of digital storage media, more and more content can be stored. Local and wide-area networks can make the content accessible and deliverable if it can be found. The search engines can find what we want and the databases can be linked to the material itself, but we need to get the necessary indexing information into the database in a form suitable for the search engine.

We might guess from knowledge of earlier standards that the MPEG committee would not concern itself unduly with mechanisms for generating data. MPEG rightly takes the view that if it creates a standardized structure, and if there is a market need, the technological gaps will be filled. In previous MPEG standards, the syntax and the decoder were specified by the standard. In MPEG-7, only the syntax is standardized. The generation of the metadata is unspecified, as are the applications that may use it. MPEG-7 specifies how metadata should be expressed. This means the fields that should go into a database are specified, and anyone designing a search engine knows what descriptive elements may be present and how they will be encoded.

MPEG-7 defines a structure of descriptors and description schemes that can characterize almost anything. In theory, primitive elements such as color histograms and shapes can be combined to represent complex entities such as individual faces. It may be possible to index material automatically such that the database can be searched for scenes that show, for example, George Burns and Ella Fitzgerald together.
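A primitive descriptor of this kind, and a search that uses it, can be sketched briefly. The Python below is not the MPEG-7 descriptor syntax. It is an assumed, much simplified color histogram with a plain distance measure, intended only to show how a descriptor lets a database be searched without examining the essence itself.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Return a normalized coarse RGB histogram.

    pixels is a non-empty list of (r, g, b) tuples with values 0..255.
    """
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        # Map each channel to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
    n = len(pixels)
    return [h / n for h in hist]


def l1_distance(a, b):
    """Sum of absolute differences; 0 means identical histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def search(query_hist, catalog):
    """Return catalog item names ordered from best to worst match.

    catalog maps an item name to its stored histogram (the metadata).
    """
    return sorted(catalog, key=lambda name: l1_distance(query_hist, catalog[name]))
```

A query built from a predominantly red image would rank a stored "sunset" item ahead of a stored "ocean" item, using only the histograms held in the catalog. How those histograms are generated, and what the search engine does with them, is exactly the part the standard leaves open.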

The constructs are not confined to images. To paraphrase the official FAQ, it should be possible to use a voice sample to search for recordings by, or images of, Pavarotti, or to play a few notes on a keyboard to find matching or similar melodies.

The rapid advance of storage and networking systems will enable access to vast quantities of digital content. As technology advances to satisfy the needs of MPEG-7, we will be able to index and retrieve items in ways unimaginable a few years ago. We will then need a system to control access, privacy and commercial transactions associated with this content. This is where MPEG-21 is targeted.

MPEG-21

MPEG-21 again differs from the earlier work of the committee. The basic concept is fairly simple — though wide reaching. MPEG-21 seeks to create a complete structure for the management and use of digital assets, including all the infrastructure support for the commercial transactions and rights management that must accompany this structure. The vision statement is “to enable transparent and augmented use of multimedia resources across a wide range of networks and devices.”

The work is at an early stage, but a committee draft is planned by December 2001. As with other MPEG projects, we can expect an initial standard with increased sophistication and flexibility contained in later amendments.

Conclusions

MPEG-2 is firmly established in the consumer market. Millions of MPEG-2 decoders already exist in DVD players and digital cable, satellite and terrestrial television receivers. In a sense, this simple fact answers one of the questions posed: The new MPEGs will not replace MPEG-2 as the definitive standard for delivery of video to consumers by these mechanisms.

However, MPEG-4 may still be important to broadcasters for a number of reasons. Digital broadcasting will include program-related data, and may be targeted at devices other than conventional television receivers. MPEG-4 constructs may use MPEG-2 compression and may be carried on MPEG-2 transport streams, so this could be a viable mechanism for data-enhanced programming.

Probably more important, however, is the use of MPEG-4 on the Internet. Most broadcasters now see streaming media as an essential part of their operations and are obliged to support popular streaming formats. The efficiency for low-bit-rate applications, documentation as an international standard and the flexibility of the FGS option all suggest MPEG-4 will be a major factor in the future of streaming media.

MPEG-7 and MPEG-21 will likely be of great importance to broadcasters in the future, but neither is a compression standard. The advent of these standards will impact many aspects of facility design and operational models, but not in ways that will devalue investments made today. On the contrary, if MPEG-7 and -21 are successful, they will together provide a tremendous boost to e-commerce, and greatly enhance the value of digital assets.

This article includes material extracted and adapted from Video Compression Demystified by Peter Symes (McGraw-Hill, 2001) by permission of the publisher. Details at www.symes.tv. Peter Symes is Manager, Advanced Technology, and a Fellow of Grass Valley Group.