Aspects of format conversion have been known and appreciated for some time, due to the frequent need to convert between different video standards. Many of the same techniques that were developed for analog systems apply in the digital world. However, additional techniques are needed, some of which are not always appreciated. To understand the basis of digital conversion techniques, let's look at the analog situation first.
When converting between two analog television systems, several elements must be changed, including frame rate, line rate, scan method and signal encoding. In effect, the first three elements are changed by an interrelated form of sample rate conversion, which was the subject of last month's Transition to Digital column. This element trio essentially defines the pixel rate of the system. Therefore, in order to convert between two different standards, it's necessary to do the appropriate sample rate conversion both spatially and temporally, meaning within a field or frame and between the fields or frames.
Mathematically, the appropriate interpolations or decimations need to be done in both the spatial and temporal directions. In practice, the situation is more complex because of the use of interlaced scan.
For simplicity, let's first consider a fictional case, with two systems at 30Hz and 60Hz frame rates, both using progressive scanning. Let's also assume that both systems have the same number of lines per frame. Converting from the 30Hz system to the 60Hz system requires interpolating between the 30Hz frames to create new frames. In principle, this would seem to be the same as a change in spatial resolution, which involves interpolating between adjacent pixels to create new ones. However, the time dimension must account for objects in the picture that can move from one frame to the next. This creates a new requirement: the motion of these objects must be predicted in order to reproduce it faithfully in a new frame.
By illustration, let's assume an object is moving horizontally, such that it appears at the locations shown in Figure 1 at the times A and B. (The translation is exaggerated here for clarity.)
Logically, an interpolator should produce an image as shown in the intermediate frame X. However, a straight pixel-by-pixel temporal interpolation by averaging the successive frames would actually produce the result shown in Figure 2. Obviously, this is not the correct way to create the new frame.
In fact, using a more sophisticated filter to reconstruct the intermediate image, such as a (sin x)/x, creates an even worse situation. Such a filter requires many taps, or coefficients, in the frame-to-frame time direction. The result of such a filter is that the moving objects would become smeared over the same number of frames.
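To make the ghosting effect concrete, here is a minimal Python sketch of straight frame averaging on a single scan line. The line width, object size and pixel levels are arbitrary values chosen for illustration, not figures from any real system.

```python
# One scan line with a single bright object that moves to the right.
WIDTH = 12

def frame_with_object(position, size=2, level=100):
    """Return one scan line with a bright object starting at `position`."""
    line = [0] * WIDTH
    for x in range(position, position + size):
        line[x] = level
    return line

def average_frames(a, b):
    """Straight pixel-by-pixel temporal interpolation (no motion handling)."""
    return [(pa + pb) // 2 for pa, pb in zip(a, b)]

frame_a = frame_with_object(2)     # object at time A
frame_b = frame_with_object(8)     # object at time B
desired_x = frame_with_object(5)   # what the intermediate frame X should show
averaged_x = average_frames(frame_a, frame_b)

print("A       :", frame_a)
print("B       :", frame_b)
print("desired :", desired_x)
print("averaged:", averaged_x)
```

The averaged line holds two half-brightness copies of the object, at the positions it occupied at times A and B, and nothing at the midpoint where it belongs.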
An intelligent motion estimator is needed to determine the motion of objects within the picture and to create new pixels based on this motion. Before the advent of high-speed digital processing, this was impossible, and format conversion inevitably produced serious artifacts when rendering sequences of moving objects.
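As a rough illustration of the idea, the following Python sketch estimates a single displacement for an entire scan line by exhaustive search, then builds the intermediate picture by moving the object half that distance. This is a deliberate simplification: practical converters estimate motion per block or per object and must cope with occlusion, none of which is attempted here. The function names and the search range are assumptions made for this example.

```python
def sad(a, b):
    """Sum of absolute differences between two scan lines."""
    return sum(abs(pa - pb) for pa, pb in zip(a, b))

def shift(line, dx):
    """Shift a scan line by dx pixels, filling uncovered pixels with 0."""
    width = len(line)
    out = [0] * width
    for x, value in enumerate(line):
        if 0 <= x + dx < width:
            out[x + dx] = value
    return out

def estimate_motion(a, b, search_range=8):
    """Try every displacement in range and keep the one that best maps a onto b."""
    candidates = range(-search_range, search_range + 1)
    return min(candidates, key=lambda dx: sad(shift(a, dx), b))

def interpolate_midframe(a, b, search_range=8):
    """Build the picture halfway between a and b by moving a along half the vector."""
    dx = estimate_motion(a, b, search_range)
    return shift(a, dx // 2), dx   # integer-pixel motion only

frame_a = [0, 0, 100, 100, 0, 0, 0, 0, 0, 0, 0, 0]
frame_b = [0, 0, 0, 0, 0, 0, 0, 0, 100, 100, 0, 0]
frame_x, vector = interpolate_midframe(frame_a, frame_b)

print("vector :", vector)
print("frame X:", frame_x)
```

Here the estimator finds a displacement of six pixels, so the intermediate picture shows one full-brightness object three pixels along its path rather than two faint copies.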
The same principle extends to conversions between field rates of 50Hz and 60Hz (or 59.94Hz). The difference is that it may be necessary to alter every frame in order to transition smoothly from groups of five frames to groups of six. Conversions in the reverse direction can also involve motion estimation, because merely dropping one out of every six frames would result in jerky video.
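The five-to-six relationship can be worked out with a little arithmetic. The Python sketch below computes, for each output picture, which source picture precedes it and how far it lies toward the next one. The function name and the tuple layout are assumptions for illustration only.

```python
from fractions import Fraction

def output_phases(src_rate, dst_rate, count):
    """For each output picture, return (index of the preceding source picture,
    fractional position toward the next source picture)."""
    step = Fraction(src_rate, dst_rate)   # source pictures per output picture
    phases = []
    for k in range(count):
        t = k * step
        base = t.numerator // t.denominator   # whole source pictures elapsed
        phases.append((base, t - base))
    return phases

for k, (base, frac) in enumerate(output_phases(50, 60, 7)):
    print(f"output {k}: source {base} + {frac}")
```

Only one output picture in every six coincides with a source picture. The other five fall at fractional positions (5/6, 2/3, 1/2, 1/3 and 1/6), which is why nearly every output picture has to be synthesized.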
Adding interlaced scan to the situation further complicates matters. Interlace can be thought of as a vertical-temporal image sampling that alternates phase every field. (See Figure 3.) By further extension of the motion estimation technique, this sampling can be taken into account.
The difference is that objects having a vertical component to their translation should be processed using a different algorithm than objects with purely horizontal motion. De-interlacing, that is, conversion from interlaced to progressive scan, often entails maintaining the same image resolution. Fully generalized conversion, on the other hand, adds the element of spatial resolution to the process. Finally, the different analog standards require a conversion of the signal encoding techniques, such as luminance levels, chrominance encoding and blanking signals.
In the digital world, the conversion between different scanning systems and resolutions is based on the same analog conversion techniques. However, the overall encoding is quite different when considering compression. A full digital standards converter thus adds the burden of conversion between different compression systems.
With MPEG-2 a ubiquitous world standard, conversion between different compressed sources once seemed to be straightforward, or even trivial. But the introduction of new standards, such as MPEG-4, keeps things interesting.
The astute reader may have deduced that image compression may offer a shortcut to motion-compensated scan conversion, as MPEG encoding already performs motion estimation. However, this process in an encoder is aimed at lowering the energy in the frame-to-frame difference of images, and this is done on a block-by-block basis, without regard to visual objects in the image. (While certain parts of MPEG-4 actually do code visual objects within pictures, the more frequently used MPEG-4 Part 10, also called AVC or H.264, does not.)
Therefore, a motion-compensated scan converter cannot base its conversion exclusively on the MPEG motion vectors within the stream. But it can use these as a starting point to arrive more efficiently at the needed information.
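One way to picture this is a search that starts from a vector found in the stream and examines only its neighborhood. In the Python sketch below, the hint value is hypothetical and simply stands in for a vector recovered during decoding. How a given decoder exposes its vectors is not covered here, and real vectors apply per block rather than per scan line.

```python
def sad(a, b):
    """Sum of absolute differences between two scan lines."""
    return sum(abs(pa - pb) for pa, pb in zip(a, b))

def shift(line, dx):
    """Shift a scan line by dx pixels, filling uncovered pixels with 0."""
    width = len(line)
    out = [0] * width
    for x, value in enumerate(line):
        if 0 <= x + dx < width:
            out[x + dx] = value
    return out

def refine_vector(a, b, hint, radius=1):
    """Search only the displacements within `radius` of a hint vector."""
    candidates = range(hint - radius, hint + radius + 1)
    return min(candidates, key=lambda dx: sad(shift(a, dx), b))

frame_a = [0, 0, 100, 100, 0, 0, 0, 0, 0, 0, 0, 0]
frame_b = [0, 0, 0, 0, 0, 0, 0, 0, 100, 100, 0, 0]

stream_hint = 5   # hypothetical, slightly inaccurate vector from the stream
refined = refine_vector(frame_a, frame_b, stream_hint)
print("refined vector:", refined)
```

The refinement tests three candidates instead of every displacement in a wide search window, yet still arrives at the true motion of six pixels.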
DCT and quantization
In order to transcode between compressed signals, such as between MPEG-2 and MPEG-4, the brute force method is to completely decode the source and then re-encode the signal. However, this can often result in a substantial degradation of video quality, especially if the compression ratio is high.
A better conversion technique is to partially decode the source, and then re-encode from this point, while paying special attention to certain coding elements, namely the Discrete Cosine Transform (DCT) and quantization. (DCT is essentially a way of converting the spatial information in a block of pixels to an array of frequency information.) After the transform is performed, the resulting coefficients can be quantized, or lowered in amplitude resolution.
This step actually performs the signal compression by reducing the number of bits required to represent each block of pixels. It also creates visual artifacts and limits the number of successive encode/decode cycles that are tolerable.
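The two operations can be sketched in a few lines of Python. For brevity this example uses a one-dimensional, eight-sample transform and a single uniform step size, whereas MPEG coding applies a two-dimensional transform to blocks of pixels with a matrix of step sizes. The sample values and the step of 16 are arbitrary choices for illustration.

```python
import math

N = 8

def dct(samples):
    """Orthonormal 1-D DCT-II of N samples."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        total = sum(s * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                    for n, s in enumerate(samples))
        out.append(scale * total)
    return out

def idct(coeffs):
    """Inverse of dct(): rebuild the samples from the coefficients."""
    out = []
    for n in range(N):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * c * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(total)
    return out

def quantize(coeffs, step):
    """Map each coefficient to an integer level (the lossy step)."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Scale the integer levels back to coefficient amplitudes."""
    return [lv * step for lv in levels]

pixels = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct(pixels)
levels = quantize(coeffs, step=16)
restored = [round(p) for p in idct(dequantize(levels, 16))]

print("levels  :", levels)
print("restored:", restored)
```

The transform by itself is reversible. It is the rounding inside quantize() that discards information, and every further pass through that step can discard a little more.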
For this reason, successive recoding of pictures will result in fewer artifacts if the previous information on quantization is preserved as much as possible. In fact, even if this information is not used, a subsequent encoder will often process the images in the same manner as the previous one, provided all the pixel blocks line up exactly where they were in the previous encoding. This will cause fewer artifacts than a completely independent recoding.
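A small numerical experiment suggests why preserving the quantization matters. The Python sketch below requantizes a set of coefficients twice, once with the original step size and once with a different one. The coefficient values and step sizes are made up for illustration, and the example works purely in the coefficient domain, ignoring the pixel rounding and clipping that a full decode would add.

```python
def quantize(coeffs, step):
    """Map each coefficient to an integer level."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Scale the integer levels back to coefficient amplitudes."""
    return [lv * step for lv in levels]

def recode(coeffs, step):
    """One encode/decode generation in the coefficient domain."""
    return dequantize(quantize(coeffs, step), step)

def total_error(reference, decoded):
    """Sum of absolute coefficient errors against the reference."""
    return sum(abs(r - d) for r, d in zip(reference, decoded))

original = [177.5, -21.3, 9.8, -6.1, 3.4, -2.2, 1.1, 0.4]  # illustrative values

first_gen = recode(original, 8)    # first encoding
same_step = recode(first_gen, 8)   # second generation, quantization preserved
new_step = recode(first_gen, 6)    # second generation, independent step size

print("first generation error:", total_error(original, first_gen))
print("same step error       :", total_error(original, same_step))
print("new step error        :", total_error(original, new_step))
```

With the step size preserved, the second generation reproduces the first exactly and adds no further loss. With a different step size, the second generation moves further from the original, even though the new step is finer.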
Combining these effects, a well-designed integrated format converter should always yield better resultant video than the brute force method. Some of these same considerations apply to audio. For example, when converting between different perceptual coding systems, such as Dolby and MPEG, a better result should occur when the hardware takes into account the previously applied encoding and only partially decodes the signal.
Always expect artifacts to become more apparent when multiple generations of encoding and decoding are applied. A recent widely broadcast sports event unfortunately demonstrated the results of a poor concatenation of standards converters. Choosing appropriate equipment, based on knowledge of how these conversions work, can go a long way toward maintaining the highest quality video and audio.
Aldo Cugnini is a consultant in the digital television industry.
Send questions and comments to: email@example.com