Video encoding technology

Today's electronic imaging technology has come a long way from the first pre-WWII monochrome TV services, which began the long-running competition with the cinema.

The invention of pulse-code modulation allowed analog signals to be expressed as binary numbers, which inevitably and irrevocably forged a link between computing and audiovisual information. Once audio and pixel information are expressed as binary numbers, the resulting data are distinguished from other types of data, such as text, only by the fact that they need to be reproduced with the original time base. Computing, which we now call information technology (IT), is adept at processing, storing and networking data. With the advent of error-correcting techniques such as Reed-Solomon coding, such data can be preserved to arbitrary accuracy, although one result of using error correction in digital television broadcasts is that the compression artifacts are delivered accurately.
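As a minimal sketch of the idea (in Python, with invented sample values), the few lines below quantize analog voltages to 8-bit binary words, which is the essence of pulse-code modulation:

    # Pulse-code modulation sketch: the samples are assumed to be already
    # band-limited and sampled; only the quantization to binary is shown.
    samples = [0.00, 0.37, 0.82, 0.55, -0.20]     # hypothetical analog values, -1..+1

    def quantize_8bit(x):
        """Map an analog value in -1..+1 to an 8-bit unsigned code word."""
        code = round((x + 1.0) / 2.0 * 255)       # scale to 0..255
        return max(0, min(255, code))             # clip to the legal range

    for s in samples:
        code = quantize_8bit(s)
        print(f"{s:+.2f} -> {code:3d} -> {code:08b}")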

The spectacular growth of IT led to computers shrinking from the size of a house down to the size of a match head, along with a comparable reduction in price. As a result, computers today are essentially consumer products. One of the unfortunate consequences is the ubiquity of consumer-grade software that is quite unsuitable for anything important. Another consequence is that the television and cinema industries found the computer to be a double-edged sword because it helped them produce material more quickly and efficiently while at the same time presenting their audiences with an alternative medium in the shape of the Internet.

Compression techniques

Electronic images have always required a lot of bandwidth, and compression techniques have been used since the earliest days of television. The use of gamma allows the same perceived quality to be obtained at a lower signal-to-noise ratio. Color difference signals need less bandwidth than RGB. Interlace is a compression technique that results in well-known artifacts. Composite video, such as NTSC, allows color in the same bandwidth as monochrome.
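To make the color-difference saving concrete, here is a small Python sketch (the pixel values are invented; the luma weights are those of Rec. 601) that forms luma and two color-difference signals and then keeps only every other color-difference sample, in the manner of 4:2:2 sampling:

    # Why color-difference signals need less bandwidth than RGB.
    rgb_line = [(0.8, 0.4, 0.2), (0.7, 0.4, 0.2), (0.3, 0.6, 0.9), (0.3, 0.5, 0.9)]

    def to_luma_diff(r, g, b):
        y = 0.299 * r + 0.587 * g + 0.114 * b     # luma carries the fine detail
        return y, b - y, r - y                    # (Y, B-Y, R-Y)

    ys, bys, rys = zip(*(to_luma_diff(*p) for p in rgb_line))
    bys_sub, rys_sub = bys[::2], rys[::2]         # halve color-difference resolution
    print(len(rgb_line) * 3, "RGB samples become",
          len(ys) + len(bys_sub) + len(rys_sub), "samples")

Because the eye resolves color less finely than brightness, the halved color-difference resolution is not normally visible.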

Information theory tells us that the greater the compression factor, the more complex the processing. While composite video and interlace are easily performed in the analog domain, the adoption of digital techniques allows greater complexity at lower cost. While the IT industry has lossless codecs that deliver bit-accurate pixels, the possible compression factors are not considered high enough for television. As a result, TV codecs are lossy. The decoded signal is not as good as the original. Compression also increases the characteristic time span of the signal. The four-field sequence of NTSC and the group of pictures in MPEG are direct parallels.

Compression can take place within individual pictures by identifying plain areas like sky or repetitive patterning. This is called intracoding or spatial coding. Compression can also take place between successive pictures, and this is even more successful when combined with compensation for object motion. This is known as intercoding or temporal coding. A group of pictures starts with an anchor picture, which is then altered as the group moves forward. Some of the pictures are recreated by taking parts of earlier or later pictures, moving them across the screen to compensate for motion and only using new information to fill in the gaps. It could be likened to making a meal out of kitchen scraps.
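The kitchen-scraps idea can be sketched in a few lines of Python on a toy one-dimensional picture (block size, search range and pixel values are all invented): the coder finds where each block of the current picture came from in the previous one, sends that displacement as a motion vector, and only the residual has to be coded:

    # Toy motion-compensated (inter) coding on a 1-D "picture".
    # The bright run of 9s has moved one sample to the right.
    previous = [0, 0, 0, 0, 9, 9, 9, 0]
    current  = [0, 0, 0, 0, 0, 9, 9, 9]
    BLOCK = 4                                     # arbitrary block size

    def best_vector(cur_block, ref, start, search=2):
        """Find the shift of the reference that best predicts the current block."""
        best = None
        for v in range(-search, search + 1):
            s = start + v
            if 0 <= s and s + BLOCK <= len(ref):
                err = sum(abs(c - r) for c, r in zip(cur_block, ref[s:s + BLOCK]))
                if best is None or err < best[1]:
                    best = (v, err)
        return best

    for start in range(0, len(current), BLOCK):
        block = current[start:start + BLOCK]
        v, err = best_vector(block, previous, start)
        pred = previous[start + v:start + v + BLOCK]
        residual = [c - p for c, p in zip(block, pred)]
        print(f"block at {start}: vector {v:+d}, residual {residual}")

Both residuals come out as zeros: the moving object costs only two small vectors rather than a fresh set of pixels.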

Temporal coding is the more powerful of the two techniques, which is why delivery codecs run with long picture groups. Of course, long group coding makes production difficult. In order to perform any production step, the material first has to be decoded and then re-encoded. The problem occurs when a former “kitchen scraps” picture is encoded as an anchor. The generation loss is breathtaking.

So while long group compression is ideal for final delivery of video to the consumer, the generation loss due to temporal coding means you would only suggest it for production purposes if you had a serious conflict of interest. For questionability, it's right up there with using interlace for HD.

Moving image compression

It is disappointing that HDTV appears to be the same juddery old thing but with more pixels. The greatest technical shortcoming in television has always been the inadequate frame rates and the poor motion portrayal that results. The most tangible improvement in television comes not from increasing the static resolution, but from improving the dynamic resolution by increasing the frame rate. In a compressed delivery environment, given that temporal coding is more efficient than spatial, increasing the frame rate doesn't increase the bit rate much, whereas increasing static resolution drives the bit rate up dramatically without a corresponding quality increase.

At the time of writing this article, moving image compression seems to have settled into a number of basic applications. Digital cinema requires high pixel counts, and the contrast ratio possible in the cinema demands a greater number of bits in the pixel. On the other hand, digital cinema does not have a bandwidth problem. Cinemas can use fiber-optic networks or download data in non-real time to local file servers. Digital cinema exploits that freedom to use relatively mild compression techniques that produce pictures that are substantially free from compression artifacts. For production purposes, digital cinema recorders may use lossless or mild spatial coding.

Most TV viewing takes place with some ambient lighting, and as a result, the contrast ratio of television is much less than can be obtained in the cinema. This makes 8-bit resolution perfectly adequate. Broadcast television faces two bandwidth restrictions — one external and one self-made. First, the electromagnetic spectrum is needed for other purposes, and the spectacular growth of cellular telephones has made spectrum more valuable. Second, television broadcasters have decided that viewers want more channels, even though the constant amount of talent is thereby diluted. As a result, the compression factors used in digital broadcasting are high, and the level of artifacts is nothing to be proud of. For TV production purposes, intracoding gives editing freedom. Most videotape formats use intracoding for that reason.

Moving pictures viewed over the Internet tend to be downsampled and heavily compressed. This is a consequence of immediate and free access to an extremely wide range of material. Nevertheless, as the bandwidth available to Internet subscribers increases, the quality will improve.

One of the requirements for Internet use is a codec that allows the same material to be available in a range of qualities dependent on the bit rate available to the individual subscriber. Wavelet-based compression is usually superior in this respect.
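A level or two of the Haar wavelet, sketched below in Python with invented signal values, shows why: each pass splits the signal into a half-resolution approximation plus the detail needed to rebuild full resolution, so a server can simply stop sending detail layers when a subscriber's bit rate runs out:

    # Haar wavelet sketch: multiresolution from one set of data.
    signal = [4, 6, 10, 12, 8, 8, 2, 0]           # invented 1-D signal

    def haar_level(x):
        """Split into a half-resolution approximation and the matching detail."""
        approx = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
        detail = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
        return approx, detail

    half, detail1 = haar_level(signal)
    quarter, detail2 = haar_level(half)
    # A low-rate subscriber receives only 'quarter'; adding detail2 and then
    # detail1 rebuilds progressively higher resolutions of the same material.
    print("quarter resolution:", quarter)
    print("half resolution   :", half)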

Moving pictures by educated guesswork

There is no one ideal compression codec. The difficulty is figuring out how to make compression available to a wide range of applications and how to allow future developments to enhance the system without causing obsolescence. At one extreme, an electronic cinema compression system designed to work on a giant screen will need more powerful hardware and more memory than a system designed for a security camera. The way around this is to define levels and profiles in the system. Levels set limits on the amount of processing power and memory needed to decode the signal. Profiles set limits on the complexity of the encoding and decoding. Obsolescence is avoided in two ways. The first is to standardize the signal between the encoder and the decoder, not the encoder itself. The second is to make improvements in a way that is backward compatible.
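The Python sketch below, with entirely hypothetical limits, shows the kind of check this implies: a decoder declares the level it was built for and rejects streams beyond its bounds, so a security-camera decoder never needs giant-screen resources:

    # Hypothetical level limits -- the figures are invented for illustration,
    # not taken from any real standard.
    LEVEL_LIMITS = {
        "low":  {"max_width": 352,  "max_height": 288,  "max_mbps": 4},
        "main": {"max_width": 720,  "max_height": 576,  "max_mbps": 15},
        "high": {"max_width": 1920, "max_height": 1152, "max_mbps": 80},
    }

    def decoder_accepts(decoder_level, stream):
        """A decoder built for a given level rejects streams beyond its limits."""
        lim = LEVEL_LIMITS[decoder_level]
        return (stream["width"] <= lim["max_width"]
                and stream["height"] <= lim["max_height"]
                and stream["mbps"] <= lim["max_mbps"])

    sd_stream = {"width": 720, "height": 576, "mbps": 8}
    print(decoder_accepts("main", sd_stream))   # True: within the main-level bounds
    print(decoder_accepts("low", sd_stream))    # False: too big for a low-level decoder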

A good way of visualizing compression is to consider that the decoder is equipped with tools that allow it to make an educated guess about what is coming next based on what came before. Because the encoder contains a decoder of its own, it knows what the far-end decoder can predict and sends only what could not be predicted. MPEG is an acronym for Moving Pictures by Educated Guesswork.
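In code, the educated-guesswork loop might look like the Python sketch below; the previous-value predictor is just one simple choice made for the illustration, and the signal values are invented. Both ends run the same predictor, so only the prediction error crosses the channel:

    # Predictive coding sketch: encoder and decoder share the same predictor.
    def predict(history):
        """The 'educated guess': simply repeat the last decoded value."""
        return history[-1] if history else 0

    source = [10, 10, 11, 13, 13, 12]             # made-up signal to be coded

    # Encoder: keeps a copy of what the decoder will hold.
    history, channel = [], []
    for value in source:
        guess = predict(history)
        error = value - guess                     # what could NOT be predicted
        channel.append(error)                     # only this is transmitted
        history.append(guess + error)

    # Decoder: makes the same guess and adds the received error.
    decoded, history = [], []
    for error in channel:
        value = predict(history) + error
        decoded.append(value)
        history.append(value)

    print(channel)            # [10, 0, 1, 2, 0, -1] -- mostly small, easily compressed
    print(decoded == source)  # True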

Clearly, if the decoder is equipped with more tools or those tools are more highly refined, the guesswork will be better, and the amount of unpredictable content decreases. So the development path from MPEG-1, through MPEG-2 to MPEG-4 represents the process of increasing and refining the toolkit. As MPEG-4 contains additional tools and refinements of what went before, an MPEG-4 decoder automatically contains MPEG-2 and MPEG-1 decoders, and backward compatibility is achieved. If we compare like with like and look at the performance of MPEG-2 and MPEG-4 on conventional video inputs, we find that the extra predictive ability of MPEG-4 allows the same picture quality at a significantly reduced bit rate. H.264, also known as Advanced Video Coding (AVC), is the part of MPEG-4 that relates to conventional video inputs. This is likely to be a popular codec for delivery of HD.

Whereas MPEG-1 and MPEG-2 work with entire pictures, MPEG-4 goes far beyond that. (See Figure 1.) In Figure 1A, a video coder expects as an input a complete picture repeating at the frame rate. Imagine that such a picture was the output of a graphics engine that was rendering images in real time. The graphics engine would compute the appearance of any virtual objects from the selected viewpoint using ray tracing. If the viewpoint or one of the objects moves, each video frame will be different, and the MPEG-2 coder will use its coding tools to encode the image differences. However, the motion of a virtual object could be fully described by vectors. In Figure 1B, an MPEG-4 encoder can handle the graphic instructions directly so that the rendering engine is actually in the MPEG-4 decoder. Once the appearance of objects is established in the decoder, animating them requires little more than the transmission of a few vectors.
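The saving can be sketched in Python (the object, its starting position and the vectors are all invented): instead of coding every changed pixel, the encoder sends one small displacement per frame, and the decoder's rendering engine repositions an object it already holds:

    # MPEG-4-style object animation: appearance sent once, motion sent as vectors.
    object_sprite = {"name": "ball", "width": 8, "height": 8}   # sent once
    frame_vectors = [(2, 0), (2, 1), (3, 1)]                    # pixels per frame

    position = (100, 50)                                        # starting position
    for dx, dy in frame_vectors:
        position = (position[0] + dx, position[1] + dy)
        # The decoder re-renders the scene with the stored sprite at the new
        # position; nothing about its texture is retransmitted.
        print(f"render {object_sprite['name']} at {position}")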

MPEG-4 works with four types of objects. (See Figure 2.) Objects may be encoded as 2-D or 3-D data. 2-D objects are divided into video and still. A video object is a textured area of arbitrary shape that changes with time, whereas a still texture object does not change with time. Typically, a still texture object may be a background. Although it does not change with time, it may give the illusion of doing so. For example, if the background pixel array is much larger than the display, the display can pan across the background to give the impression of motion.
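For instance, panning a display window across a still background that is wider than the screen needs no new picture data at all, as the small Python sketch below suggests (the array sizes and pan step are arbitrary):

    # Panning across a still texture object larger than the display.
    background_width, display_width = 32, 8       # arbitrary sizes
    background = list(range(background_width))    # stand-in for one row of pixels

    pan_step = 4                                  # pixels moved per frame
    for frame, left in enumerate(range(0, background_width - display_width + 1, pan_step)):
        window = background[left:left + display_width]
        print(f"frame {frame}: shows columns {window[0]}..{window[-1]}")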

Figure 2 further shows that MPEG-4 standardizes ways of transmitting the 3-D shape of a virtual object, known as a mesh object, along with the means to map its surface appearance, or texture, onto that object. Generally, any shape of object can be handled. The decoder will recreate each object and render each one from the selected viewpoint. In parts of the picture where there is no object, the background will be keyed in. It should be clear that if the decoder is aware of the shape and texture of all relevant objects, the viewpoint does not need to be chosen at the encoder. The viewpoint might be chosen by the viewer in an interactive system such as a video game or a simulator. For applications such as video phones and video conferencing, MPEG-4 supports a specific type of mesh object that may be a human face or a human face and body.

Unlike the DCT-based transform coding of MPEG-2, the Dirac codec uses wavelets and so inherently works well in multiresolution applications. Dirac is available in intracoded versions for production purposes, as well as a temporally coded version for delivery. Developed by the BBC, it has the advantage of being royalty-free.

John Watkinson is a consultant in advanced technology. His most recent books are “The Art of Digital Video,” “The Art of the Helicopter” and “The MPEG Handbook,” available from Focal Press/Elsevier.