Using motion estimation to convert HDTV frame rates

There's something in evolutionary theory that suggests that while things may develop along apparently disordered and diverse paths, the overall tendency is towards an ordered world and, hence, progress. We can only assume that the increased number of television standards and formats created by the move to HDTV is a journey down one or several such diverse paths. This isn't to suggest that such routes were unnecessary or taken without good reason. The original temporal field rates for television standards were established back in the 1930s, when the then “state-of-the-art” oscillator and power regulation circuits were based on the mains power frequency.

Further diversification in the number of scanning lines and color-coding schemes left us with the NTSC 59.94Hz, and PAL and SECAM 50Hz standards. The additional diversification introduced by HDTV has produced a total of nearly 30 different formats and standards using both progressive and interlaced scanning structures. The predominant HD formats for broadcast use are 720 lines for progressive and 1080 lines for interlaced pictures at either 59.94Hz or 50Hz temporal sampling rates. It is possible through all this, however, to see a path back to some semblance of order.

The move to digital has already allowed us to dispense with the NTSC, PAL and SECAM color-coding schemes and, if processing speeds continue to improve, we may ultimately see the new 1080-line progressive format (sometimes described as the Holy Grail) emerge as the unifying format. What is unlikely to emerge in the foreseeable future is a single temporal or picture frame rate to accompany this format.

At this point, it is important to note that in the past, the term standards conversion was applied only when temporal sampling rate conversion was part of the process. Conversion between systems with the same temporal sampling rate was referred to as transcoding. This practice should be maintained when encompassing the new HD systems.

Unfortunately, it now appears commonplace to describe a device that can convert 576i SD to 1080i HD at the same field/frame rate as a standards converter. The correct term for this process is upconversion, and the reverse process is downconversion. However, even if we find the Holy Grail, we can be sure that the international nature of television will mean that standards converters (in the true sense) will be part of the furniture for a long while yet.

HDTV at different frame rates

Now that Europe appears to have fully embraced the move to HDTV, a common question is why the availability and choice of high-quality HD standards converters is currently so limited. Surely this step was inevitable, and manufacturers should have been poised ready to take advantage of the predictable initial high demand. There are two main explanations for this situation.

First, the significant cost of designing and developing such devices necessitates a fairly immediate return on investment. Although this is true of most major project developments, initial sales in this case would be heavily dependent on adoption of HDTV by countries whose broadcast transmission standard(s) use 50Hz temporal sampling rates.

Without this step, countries such as the USA, with an established HD infrastructure operating at a 59.94Hz temporal rate, could import material simply by using their existing SD standards converters to implement the frame rate conversion and then upconvert the output to HD. Despite, for example, Australia initiating HD transmissions in 2004, it is the European arena that provides the overwhelming commercial justification for commencing product development. After seemingly lengthy deliberations, the number of European organizations planning and re-equipping for HD, and the pace at which they are doing so, has undoubtedly taken many people by surprise. Hence the restricted number of “high-end” products on offer ready to satisfy the demand.

Second, there has always been an underlying nervousness among the traditional standards converter manufacturers regarding emerging picture processing techniques for use on standard computer-based hardware platforms. A considerable slice of the standards converter market has been in the post-production environment, where real-time picture processing is not necessarily a prerequisite.

Would a large slice of this traditional market disappear to software products for installation on powerful off-the-shelf PCs? The truth is that from both a quality and economic perspective, these concerns have proven to be premature and will most likely remain so for some years to come. The bulk of theoretical work has been and continues to be focused on developing new and improved picture compression algorithms, which are significantly different from those required for full-resolution baseband standards conversion.

Standards conversion is a more demanding application than data compression. In addition, even if non-real-time conversion could deliver satisfactory quality, there are still operating costs to be considered. Non-real-time might better be described as “increased-time,” which means increased cost. Hourly rates for high-performance frame rate standards conversion have never been cheap. The business case for the design and development of a real-time HD standards converter on a dedicated hardware platform is as strong as, if not stronger than, ever before.

The primary function of a standards converter is to create a new stream of picture frames from an existing one but at a different rate defined by the output standard. Each new output frame is displaced by a varying offset time ΔT from the previous adjacent input frame, where T is the frame repetition period in the original stream. (See Figure 1.)
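As a rough illustration of this timing relationship, the short Python sketch below lists, for each output frame, the input frame that precedes it and the offset from that frame, expressed as a fraction of the input frame period. The 50Hz input rate, the 59.94Hz (60000/1001) output rate, the function name and the use of whole progressive frames are assumptions made for the example, not details of any particular converter.

```python
from fractions import Fraction


def output_frame_offsets(in_rate, out_rate, count):
    """Return (previous input frame index, offset as a fraction of the input period)
    for the first `count` output frames. Both streams are assumed to start together."""
    results = []
    for n in range(count):
        # Output frame n occurs at n / out_rate seconds; express that in input periods.
        position = Fraction(n) * in_rate / out_rate
        prev_index = position.numerator // position.denominator
        results.append((prev_index, position - prev_index))
    return results


IN_RATE = Fraction(50)            # assumed input rate, Hz
OUT_RATE = Fraction(60000, 1001)  # assumed output rate, 59.94Hz

offsets = output_frame_offsets(IN_RATE, OUT_RATE, 7)
for n, (index, offset) in enumerate(offsets):
    print(f"output {n}: after input {index}, offset {float(offset):.4f} of a period")
```

With these rates the position advances by 1001/1200 of an input period per output frame, so the offset is different for every output frame. Output frame 6 (counting from zero), for instance, falls only 1/200 of a period after input frame 5.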

Interpolation and temporal aliasing

Picture elements from the same relative positions in adjacent input frames are, in fact, temporal samples and can be treated as such for use in standard sampling theory. It would seem sensible, therefore, to create each new output frame by simply interpolating these samples with the relative position of the new samples being defined by ΔT1, ΔT2, etc.

The problem is that the temporal sampling rate in all television standards is not high enough to accurately depict anything other than very slow-moving objects. Moving objects will be in a different location on each successive input frame. Simple interpolation or averaging between four frames, for instance, will produce four images of the object in the output frame(s). Fast-moving objects will appear to judder and blur. (See Figure 2.) This is otherwise referred to as temporal aliasing. It's not a problem when viewing the original native input because the eye is able to track moving objects, making them stationary relative to the retina, so the temporal aliases are not seen. When the input signal passes through a simple linear standards converter, however, the temporal aliasing causes errors in the interpolation process.
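The multiple-image effect is easy to reproduce on a single scan line. The sketch below is a simplified illustration that blends only two input frames rather than four. The helper names, the 16-pixel line and the object positions are all made up for the example.

```python
def make_line(width, start, length, level=100):
    """Return one scan line: black background with a bright object at `start`."""
    return [level if start <= x < start + length else 0 for x in range(width)]


def blend(line_a, line_b, frac):
    """Linear temporal interpolation: weight line_b by frac and line_a by 1 - frac."""
    return [(1 - frac) * a + frac * b for a, b in zip(line_a, line_b)]


frame_a = make_line(16, start=2, length=3)   # object at pixels 2 to 4
frame_b = make_line(16, start=10, length=3)  # object has moved to pixels 10 to 12

# An output frame midway between the two inputs should show the object at 6 to 8.
halfway = blend(frame_a, frame_b, 0.5)
print(halfway)
```

Instead of one object at pixels 6 to 8, the blended line contains two half-brightness copies, one at the old position and one at the new, with nothing where the object should actually be.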

Vector frames

What is required is a method of modifying the operation of the converter to track the course of a moving object in the same way the eye does. To do this, the direction and speed of each region of movement need to be determined.

Each region or group of pixels with the same movement is then allocated a vector. The final result is a vector frame for each proposed new output picture frame. (See Figure 3.) The individual vectors in the frame are essentially pointers to tell the converter which part of its memory to address in order to retrieve the correct samples for creation of that part of the new output frame. A primary vector frame is created by the motion estimator part of the converter by comparison and analysis of adjacent input frames. This is then scaled by a factor proportional to ΔT to create the vector frame associated with the new output frame.
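The pointer role of the vectors can be sketched in one dimension. In the hypothetical example below, each output pixel carries a vector giving its displacement in pixels per input frame period. The vector is scaled by the fractional offset of the output frame and used to fetch samples from the two adjacent input lines. The function name, the sign convention, the nearest-pixel fetch and the simple case of a single vector shared by the whole line (as in a pan) are all assumptions made for illustration.

```python
def motion_compensated_line(line_a, line_b, vectors, frac):
    """Build an output line at fractional offset `frac` between line_a and line_b.

    vectors[x] is the assumed displacement, in pixels per input frame period,
    of whatever is visible at output position x.
    """
    width = len(line_a)
    out = []
    for x in range(width):
        v = vectors[x]
        # Scale the vector by the offset to find where this sample sat in each input.
        xa = min(max(round(x - v * frac), 0), width - 1)
        xb = min(max(round(x + v * (1 - frac)), 0), width - 1)
        out.append((1 - frac) * line_a[xa] + frac * line_b[xb])
    return out


WIDTH = 16
line_a = [100 if 2 <= x < 5 else 0 for x in range(WIDTH)]    # object at 2 to 4
line_b = [100 if 10 <= x < 13 else 0 for x in range(WIDTH)]  # object at 10 to 12

vector_line = [8] * WIDTH  # one region: everything moves 8 pixels per period
compensated = motion_compensated_line(line_a, line_b, vector_line, 0.5)
print(compensated)
```

Here the object appears once, at full brightness, at pixels 6 to 8. A scene with several regions moving differently would also need some handling of concealed and revealed areas, which this sketch does not attempt.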

At face value, this all appears straightforward enough until the inquisitive mind takes over. How does the motion estimator ensure that other moving objects in the same scene do not confuse it? What happens with concealed and revealed parts of the picture when an object has moved? Techniques used in motion estimators come in various guises, including hierarchical spatial correlation (sometimes referred to as block matching) and phase correlation. Both methods have their pros and cons and deal with these issues in different ways.
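To give a flavor of the block-matching approach, the sketch below searches one scan line for the displacement that minimizes the sum of absolute differences between a block in one frame and candidate blocks in the next. It is a single-level exhaustive search in one dimension, not the hierarchical scheme used in real converters, and the function name, block size and search range are illustrative assumptions.

```python
def estimate_shift(line_a, line_b, block_start, block_len, search_range):
    """Return the shift (in pixels) at which a block of line_a best matches line_b,
    judged by the smallest sum of absolute differences."""
    block = line_a[block_start:block_start + block_len]
    best_shift, best_cost = 0, float("inf")
    for shift in range(-search_range, search_range + 1):
        pos = block_start + shift
        if pos < 0 or pos + block_len > len(line_b):
            continue  # candidate block would fall outside the line
        candidate = line_b[pos:pos + block_len]
        cost = sum(abs(a - b) for a, b in zip(block, candidate))
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift


def place(width, start, pattern):
    """Return a black line of `width` pixels with `pattern` written at `start`."""
    line = [0] * width
    line[start:start + len(pattern)] = pattern
    return line


texture = [10, 80, 40]            # a small textured object
prev_line = place(16, 2, texture)  # object at pixels 2 to 4
next_line = place(16, 7, texture)  # object has moved 5 pixels to the right

shift = estimate_shift(prev_line, next_line, block_start=1, block_len=5, search_range=6)
print(shift)
```

The search recovers the 5-pixel displacement because the block contains distinctive detail. A block of flat background would match equally well at several positions, which is one reason practical estimators need more than a bare search.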

The truth is that such technologies have become so refined over the last few years that you have to concentrate very hard to spot the motion artifacts. At the end of the day, it's not the technique employed that matters, but rather the quality of the pictures.

HD issues

In terms of extending these techniques to handle HD as well as SD signals, the vastly increased speed of the latest data processing components permits more computation in a given time than is needed just to cater for the inherently higher resolution. The result is even higher resolution and greater accuracy of the motion vectors.

Changes in the architecture of the converter itself provide other enhancements. In SD converters, both the input and output standards were interlaced scanning formats, and the filtering and estimation processes were applied across four fields of the input and output signals. The high performance of adaptive de-interlacing algorithms these days means that this process can occur prior to the motion estimation and temporal rate conversion processes. This means that a 1080i input signal is first upconverted to 1080p and subsequently downconverted if the output signal required is an interlaced format. All the temporal rate processing is implemented on progressive scanned picture frames.
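The ordering of the stages can be sketched as follows. This is only a structural illustration: the de-interlacer shown fills in missing lines by averaging their vertical neighbors within a single field, which is far cruder than the adaptive algorithms referred to above, and the function names and tiny frame size are assumptions. The temporal rate conversion itself would sit between the two steps and operate on the progressive frames.

```python
def deinterlace_field(field_lines, top_field, height):
    """Build a progressive frame from one field by averaging neighboring field lines.

    `field_lines` must contain height // 2 lines; `height` is assumed to be even.
    """
    frame = [None] * height
    first = 0 if top_field else 1
    for i, line in enumerate(field_lines):
        frame[first + 2 * i] = list(line)
    for y in range(height):
        if frame[y] is None:
            # Neighbors of a missing line always belong to the field itself.
            above = frame[y - 1] if y - 1 >= 0 else frame[y + 1]
            below = frame[y + 1] if y + 1 < height else frame[y - 1]
            frame[y] = [(a + b) / 2 for a, b in zip(above, below)]
    return frame


def reinterlace(frame, top_field):
    """Keep only the lines that belong to the requested output field."""
    return frame[(0 if top_field else 1)::2]


top = [[10, 10], [30, 30], [50, 50]]               # three lines of a top field
progressive = deinterlace_field(top, True, height=6)
# ... motion estimation and temporal rate conversion would run here ...
field_out = reinterlace(progressive, top_field=True)
print(progressive)
print(field_out)
```

If the required output is progressive, the final step is simply skipped, which is the point the architecture makes: all the temporal processing sees only progressive frames.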

Has this been planned or is it just fortuitous? Could migration to the Holy Grail 1080p format be simply a matter of removing a bit of de-interlacing software?

Kim Francis is a product specialist at Pro-Bel.