Here we are going to look at some basic approaches to take when troubleshooting compression systems. How well your compression system works depends on the quality of the data fed into it, so we'll start with the data (you know, those video and audio streams) we feed into a compression system. Compression systems are basically embedded computer systems. They perform extremely complex digital signal processing (DSP) using microprocessors, application-specific ICs (ASICs), and lots and lots of RAM. Orchestrating this hardware is software and firmware that carries out the algorithms that determine what data can be thrown away and not be missed (too much). One of today's myths is that digital devices all process the digital data entrusted to them equally well and in the same manner. This is not true in the baseband digital arena, as we will see shortly, and it definitely isn't true in compression systems. The ISO MPEG standard defines only the PES and transport streams that emerge from compression systems, so that a user can recover the data. Each manufacturer of compression systems is left to develop its own methods to create those streams.
Most DTV stations are currently taking their NTSC video and analog audio outputs and converting them to digital streams immediately before compression. For video this process is known as decoding because the NTSC color information is decoded into component color. Going from analog to digital is not the hard part; separating the chroma from the luminance is. Another process that should be in the signal chain before compression is noise reduction, because noise generates random data that stresses compression systems. These two processes actually use similar techniques since decoding and noise reduction both involve filtering. Most of us know that 3D filtering is better than 2D, which is better than 1D, which in turn is better than linear filtering.
If the video is in the digital domain we can use 1D, 2D or 3D filtering techniques. This filtering operates in the time domain. The higher the number of dimensions, the more hardware is required, which results in increased cost. 1D filtering simply looks at the data that comes before and after the data point of interest. The number of data points before and after the point of interest is called the aperture of the filter. The wider the aperture, the more intelligent the decision that can be made as to how to act on the data point of interest. 1D filtering applied to television streams usually means that the filtering is done solely along a single horizontal line.
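As a rough illustration of what an aperture means, here is a minimal sketch of a 1D filter that averages each sample with its neighbors along one horizontal line. The function name `filter_1d`, the plain averaging rule, and the sample values are assumptions made for illustration; real decoders and noise reducers use far more elaborate filters.

```python
def filter_1d(line, aperture=1):
    """Average each sample with `aperture` neighbours on each side.

    Samples near the ends of the line use whatever neighbours exist,
    so the window is simply clipped at the edges.
    """
    out = []
    for i in range(len(line)):
        lo = max(0, i - aperture)
        hi = min(len(line), i + aperture + 1)
        window = line[lo:hi]
        out.append(sum(window) / len(window))
    return out


# One horizontal line of luminance samples with a single noise spike.
noisy_line = [16, 16, 16, 76, 16, 16, 16]
smoothed = filter_1d(noisy_line, aperture=1)
print(smoothed)
```

Widening `aperture` spreads the spike over more samples, which is the 1D version of "a more intelligent decision" at the cost of more stored data points.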
2D filtering is done on an array of data. A raster of video (many horizontal lines, one after another) is an array. 2D filtering uses not only data points along a single horizontal line but points on multiple lines as well. Thus the aperture extends vertically as well as horizontally. These are the two dimensions in 2D. Intuitively you can see that a more intelligent decision on a particular data point can be made via 2D than 1D. 2D filters are used as comb filters for separating chroma from luminance. 1D and 2D filters are known as spatial filters. A 3D filter extends 2D filtering across multiple fields (or frames) of video. This filter is known as a temporal filter since it acts across time as well as space. Much more intelligent decisions can obviously be made over multiple video frames than within a single frame.
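The step from spatial to temporal filtering can be sketched the same way. The code below is only a toy, assuming a simple mean over a square window (2D) and then a mean of those results across several frames (3D); the names `filter_2d` and `filter_3d` are hypothetical, and a practical temporal filter would also have to detect motion before averaging across frames.

```python
def filter_2d(frame, row, col, aperture=1):
    """Mean of the samples in a square window centred on (row, col)."""
    rows = range(max(0, row - aperture), min(len(frame), row + aperture + 1))
    cols = range(max(0, col - aperture), min(len(frame[0]), col + aperture + 1))
    window = [frame[r][c] for r in rows for c in cols]
    return sum(window) / len(window)


def filter_3d(frames, row, col, aperture=1):
    """Mean of the 2D results at the same position across several frames."""
    per_frame = [filter_2d(f, row, col, aperture) for f in frames]
    return sum(per_frame) / len(per_frame)


# Three tiny frames of a static scene; only the middle one has a noise spike.
flat = [[16] * 3 for _ in range(3)]
spike = [[16, 16, 16], [16, 106, 16], [16, 16, 16]]

spatial_only = filter_2d(spike, 1, 1)
spatial_temporal = filter_3d([flat, spike, flat], 1, 1)
print(spatial_only, spatial_temporal)
```

With the frames on either side available, the spike is pulled closer to the true value than the single-frame filter can manage.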
NTSC decoding and noise reduction are important precursors to successful compression. JPEG was the original algorithm for compression. It is spatial compression, as it works only on a single frame of video. Many digital VTRs still use JPEG, as do some video servers. When more severe compression was needed MPEG was adopted. MPEG is temporal compression, as it works across multiple video frames. STL and DTV transmissions rely on MPEG. JPEG compression ratios of 4:1 or less are generally considered transparent. Betacam quality is generally obtained from JPEG with compression ratios of 8:1. MPEG can usually increase the JPEG numbers by a factor of five. As compression ratios increase, some compression algorithms work better than others. A SMPTE259 data stream requires a 45:1 compression, or data reduction, ratio to produce a 6Mb/s PES stream.
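The ratio arithmetic is easy to check. This sketch assumes the ratios are applied to the full 270Mb/s SMPTE259 stream, which is a simplification (a real encoder discards blanking before compressing); the helper names are made up for the example.

```python
def compression_ratio(input_bps, output_bps):
    """Return the data reduction ratio as input rate divided by output rate."""
    return input_bps / output_bps


def output_rate(input_bps, ratio):
    """Return the bit rate left after compressing input_bps by ratio:1."""
    return input_bps / ratio


SMPTE259_BPS = 270_000_000  # serial digital component video

ratio_for_6mbps = compression_ratio(SMPTE259_BPS, 6_000_000)
print(f"270 Mb/s down to 6 Mb/s needs {ratio_for_6mbps:.0f}:1")

# JPEG figures from the text, and the same figures scaled by five for MPEG.
jpeg_ratios = {"transparent": 4, "Betacam quality": 8}
mpeg_ratios = {name: r * 5 for name, r in jpeg_ratios.items()}

for name, r in mpeg_ratios.items():
    rate = output_rate(SMPTE259_BPS, r) / 1e6
    print(f"MPEG {name}: {r}:1 -> {rate:.2f} Mb/s")
```

On these assumptions a 6Mb/s stream sits just beyond the 40:1 figure, which is why the quality of the encoder's algorithms starts to matter at that rate.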
One way to limit the amount of compression needed is to sub-sample the chroma. SMPTE259 is sub-sampled already. As most of us know, the 4:2:2 ratio applied to digital component video means that the chroma information is sampled one-half as often as the luminance information. If this were not done, the bit rate of SMPTE259 would be 405Mb/s instead of 270Mb/s. But many compression systems take the chroma sub-sampling even farther. Instead of sub-sampling chroma only in the horizontal direction, some compression systems also sub-sample in the vertical axis. This sub-sampling scheme is known as a 4:2:0 ratio. This method reduces the bit rate fed to the compression engine to 126Mb/s, which reduces the compression ratio required to obtain a desired bit rate out of the compression system. Another sub-sampling scheme is to take the 4:2:2 horizontal sub-sampling approach further and sample the chroma only one-fourth as often as the luminance. This scheme is known as 4:1:1. 4:1:1 produces the same bit rate as 4:2:0. Which is better?
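The sub-sampling arithmetic can be worked through in a few lines. The sketch below assumes 13.5MHz luminance sampling at 10 bits for the serial interface rate. The second function is a guess at where a figure near 126Mb/s could come from: counting only active picture (720 by 486 samples at 29.97 frames/s) at 8 bits. Those parameters and the function names are assumptions, not something the text states.

```python
LUMA_SAMPLE_RATE = 13_500_000  # luminance samples per second

# Chroma samples carried per luminance sample, both colour-difference
# channels combined.
CHROMA_PER_LUMA = {
    "4:4:4": 2.0,  # no sub-sampling
    "4:2:2": 1.0,  # half as often horizontally
    "4:2:0": 0.5,  # half horizontally and half vertically
    "4:1:1": 0.5,  # one quarter as often horizontally
}


def interface_rate(scheme, bits=10):
    """Bit rate in b/s for the full sample stream, blanking included."""
    return LUMA_SAMPLE_RATE * (1 + CHROMA_PER_LUMA[scheme]) * bits


def active_rate(scheme, width=720, lines=486, fps=30000 / 1001, bits=8):
    """Bit rate in b/s counting only active picture samples (assumed sizes)."""
    return width * lines * fps * (1 + CHROMA_PER_LUMA[scheme]) * bits


for scheme in CHROMA_PER_LUMA:
    print(f"{scheme}: interface {interface_rate(scheme) / 1e6:.1f} Mb/s, "
          f"active 8-bit {active_rate(scheme) / 1e6:.1f} Mb/s")
```

Whatever basis is used, 4:2:0 and 4:1:1 carry the same number of chroma samples, so the choice between them is about where the chroma resolution is lost rather than about bit rate.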
4:2:0 produces more horizontal chroma information, and its horizontal and vertical chroma resolutions are equal, but 4:1:1 is easier to implement. So is it better to increase the chroma sub-sampling or to increase the compression ratio required? A few years ago the EBU and the CBC conducted a test. They found that 4:2:2 sub-sampling produced marginally better video quality than 4:2:0 until the bit rates got extremely low. At that point 4:2:0 had the advantage. But another interesting finding of the study was that when multiple compression and decompression cycles were encountered, 4:2:2 performed better. This illustrated that starting out with more was still better than starting out with less.
So now we can start to compress. There are three aspects of video that can be manipulated to reduce the bit rate. The first is spatial information, which is the dependence or similarity between neighboring pixels. The second is temporal information, or the dependence between neighboring frames in a video sequence. The third is coding redundancy, which is the likelihood that one data byte will be similar to another.
Audio compression uses both spectral (frequency) and temporal techniques to reduce bit rates. Spectral masking implements a threshold mask across the audio spectrum. This mask varies with frequency. If a sound is below the threshold it is not encoded. Temporal masking eliminates softer sounds that occur immediately before or after louder sounds. An interesting fact here is that if you starve an AC-3 (the Dolby compression standard used in ATSC DTV) bitstream that carries a sporting event with an announcer talking over crowd noise (for example, with the output set for 64kb/s when 384kb/s is normal), the crowd noise disappears. The same effect happens with a musical piece. The accompanying background instruments disappear and you end up hearing only the loudest voices or instruments.
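The spectral masking idea can be shown with a toy example. Everything here is hypothetical: the threshold curve, the levels and the function names are invented for illustration and are not taken from AC-3, which derives its masking from a psychoacoustic model of the signal itself.

```python
def apply_spectral_mask(components, threshold_db):
    """Keep only components whose level reaches the threshold at their frequency.

    components   -- list of (frequency_hz, level_db) pairs
    threshold_db -- function mapping frequency_hz to a threshold in dB
    """
    return [(f, lvl) for f, lvl in components if lvl >= threshold_db(f)]


def starved_threshold(frequency_hz):
    """Hypothetical mask that is higher above 4 kHz; not a real AC-3 curve."""
    return 40.0 if frequency_hz < 4000 else 50.0


# Made-up programme: two loud announcer components and two quieter
# crowd-noise components.
programme = [(300, 70.0), (1000, 65.0), (2500, 38.0), (6000, 45.0)]
kept = apply_spectral_mask(programme, starved_threshold)
print(kept)
```

Raising the threshold, as a starved encoder effectively does, removes the quieter components first, which matches the disappearing crowd noise described above.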
A compression system has a target bit rate to achieve that is based on the setup entered by the user. Most systems offer users some choice as to how to achieve that bit rate. Those choices are made via the MPEG toolkit, which is composed of levels and profiles. When we talk levels in MPEG we mean the (sub) sampling structure and bit rate. Sampling structure and bit rate have a great effect on how many times we can compress and decompress before the quality is unacceptable. MPEG profiles refer to the tools used for temporal compression. These tools are the types of frames used (I, B, P) and the ratio of each. The mix of B and P to I becomes important when multiple compression/decompression cycles are encountered.
But compression systems do have some parameters that become set after the user has decided on the bit rate out of the compression system and on the MPEG tools to use. If you decide to use only I frames (essentially a JPEG situation) the compression required to hit the target bit rate will be much higher than if P and B frames were also used. The I frame is a stand-alone, spatially compressed frame. Generating I frames only means there is more redundant information to throw away. The compression engine, which almost always uses the Discrete Cosine Transform (DCT), does this by ratcheting up the quantization value used to throw away low-level values in the frequency coefficient block or array. This array is generated by transforming and mapping small blocks of spatial video information into small blocks that represent frequency values describing the original video information. The frequency block is arranged so the coefficients representing high frequencies are mapped together. These coefficients are usually low in value and are therefore likely to be scaled to zero by the quantization value. The higher this value, the more coefficients become zero.
Manufacturers of compression systems usually employ proprietary algorithms to determine what the quantization value will be under different circumstances.
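The effect of the quantization value on the frequency coefficient block can be sketched with a plain 8x8 DCT. This is a simplified illustration, assuming a single quantization value applied to every coefficient and a synthetic test block; real encoders use per-coefficient quantization matrices and their own proprietary rate control, and the function names are hypothetical.

```python
import math

N = 8  # block size


def dct_2d(block):
    """Forward 8x8 DCT-II with orthonormal scaling."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, q):
    """Divide every coefficient by q and round; small values become zero."""
    return [[round(value / q) for value in row] for row in coeffs]


def count_zeros(block):
    """Count the coefficients that quantization has reduced to zero."""
    return sum(1 for row in block for value in row if value == 0)


# A gentle ramp: mostly low-frequency content, like a smooth picture area.
block = [[16 + 4 * col + row for col in range(N)] for row in range(N)]
coeffs = dct_2d(block)
zeros_fine = count_zeros(quantize(coeffs, 2))
zeros_coarse = count_zeros(quantize(coeffs, 16))
print(zeros_fine, zeros_coarse)
```

The coarser quantization value leaves more zeros in the block, which is exactly what the later lossless stages exploit.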
After the DCT process, which is lossy, a lossless compression technique is used. This is the coding redundancy aspect of compression mentioned earlier. As just mentioned, the coefficients that become zero are located such that they usually follow each other when read out of the frequency coefficient block. A process called run length encoding replaces long runs of the same value with data that indicates the redundant situation and how long it lasts. After this step Huffman coding is applied. This assigns short codes to values that are likely to occur and longer codes to unlikely values. Morse code is an example of this. The codes for likely letters (a, e, i) are short, while less-likely letters have longer codes. When this is done we have an I frame.
P frames use the information from preceding I and P frames. This greatly reduces the amount of data needed to build a P frame. Spatial differences between the preceding I or P frame and the newly generated P frame undergo I-frame-like spatial compression. Plus, areas of video information that don't change, but simply move, are encoded as vector change information. Thus in some ways a P frame merely describes the changes that have occurred. This greatly reduces the data needed compared with completely repainting a new frame each time. B frames can also be added. B frames use information from preceding and succeeding I and P frames. More P and B frames between I frames means that the I frame encoding is less severe. The tradeoff is that perceived video quality will be slightly lower with each additional P or B frame because video information is increasingly approximated in P and B frames until the next I anchor frame arrives. So the choice is between I frames that are more harshly compressed with few approximation frames (P/B), or I frames that are more lightly compressed with video quality that decreases over successive P or B frames until the next I frame.
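The two lossless steps described above, run length encoding followed by Huffman coding, can be sketched as follows. This is an illustration only: MPEG encoders use predefined variable-length code tables rather than building a code from the data as done here, and the coefficient values and function names are made up.

```python
import heapq
from collections import Counter


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Quantized coefficients as they might be read out of a block: a few
# non-zero values followed by a long run of zeros.
coefficients = [17, -5, -1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
runs = run_length_encode(coefficients)
print(runs)

# Frequent symbols get short codes, rare ones get long codes.
lengths = huffman_code_lengths("aaaaaaaabbbbccd")
print(lengths)
```

The long tail of zeros collapses into a single pair, and the most frequent symbol ends up with the shortest code, much as the most common letters do in Morse code.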
Some encoders provide a little help as to how hard the I frame compression engine is working by displaying a bar graph or a number that indicates the value of the quantization being used. Some multiplexers, which take the various PES streams coming from multiple MPEG encoders and weave them into a single transport stream, use statistical multiplexing to control the quantization values of the encoders. Encoders processing video with lots of changes — say a sporting event — receive lower quantization values and thus produce higher bit rates than encoders handling video that is fairly static.
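One simple way to picture statistical multiplexing is as a bit-rate budget shared in proportion to how busy each picture is. The sketch below is a hypothetical allocation rule, not any manufacturer's algorithm: the complexity scores, the 19Mb/s budget, the per-encoder floor and the name `statmux_allocate` are all assumptions. In a real system the multiplexer steers each encoder's quantization value rather than handing out rates directly.

```python
def statmux_allocate(total_bps, complexities, floor_bps=0):
    """Share total_bps between encoders in proportion to picture complexity.

    complexities -- {encoder name: positive complexity score}
    floor_bps    -- minimum rate guaranteed to every encoder
    """
    spare = total_bps - floor_bps * len(complexities)
    if spare < 0:
        raise ValueError("floor_bps is too high for the available bit rate")
    total_complexity = sum(complexities.values())
    return {
        name: floor_bps + spare * score / total_complexity
        for name, score in complexities.items()
    }


shares = statmux_allocate(
    19_000_000,
    {"sports": 6.0, "talk_show": 3.0, "weather_graphics": 1.0},
    floor_bps=1_000_000,
)
for name, bps in shares.items():
    print(f"{name}: {bps / 1e6:.1f} Mb/s")
```

The busy sports feed ends up with the largest share, which corresponds to the lower quantization values it would receive from the multiplexer.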
Setting up compression systems offers many tradeoffs. Most installations can rely only on their own subjective opinion of what constitutes good compressed video and audio. Bit rates, sampling and the MPEG tools used can vary widely and still produce the same overall subjective result. The best setting for one type of program might not be the best for the program that follows. Since most compression systems are not yet under automation control, your settings must represent a compromise that achieves the look you want. Remember, though, that what you feed into the compression system must be as clean as possible. The bottom line here is that DTV's cause will not be helped if viewers merely end up trading a set of NTSC artifacts for a new set of ATSC artifacts.
Jim Boston is director of emerging technology for The Evers Group.