Video and audio sampling

One of the fundamental processes involving digital signals is sampling. When sampling video and audio, the question inevitably comes up as to what is optimum sampling. Just how much resolution do we need, anyway? The question is difficult to answer because there are many factors that contribute, including the subjective characteristics of human vision and hearing.

Years ago, there was a research project conducted at RCA Labs called “Light-to-light,” which set out to derive an overall transfer function describing a complete camera-encoding-transmission-reception-decoding-display video system. Hardware was built to simulate the effect of changes to parts of the system on the final images. Eventually, real-time digital signal processing and simulation systems like the Princeton Engine overtook the project. But all audio and video distribution systems can be looked at the same way — as a cascade of signal processing elements. First, let's go over the basics of signal sampling.

Sampling is fundamental to digital signals

The Nyquist sampling theorem states that a signal must be sampled at a rate at least twice the frequency of its highest frequency component; otherwise, aliasing will result, and no amount of processing will retrieve the complete original signal. In the frequency domain, the sampling process creates repeat spectra: copies of the original signal spectrum centered at multiples of the sampling rate. Thus, the input signal must be band-limited so that the repeat spectra do not overlap.
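As a rough illustration of how an undersampled tone folds back into the baseband, the following Python sketch computes the frequency at which a single tone appears after sampling. The function name and the 48kHz and 30kHz figures are illustrative assumptions, not values taken from any particular system.

```python
def alias_frequency(f_signal, f_sample):
    """Return the apparent frequency of a pure tone after sampling.

    Sampling repeats the spectrum at multiples of f_sample, so any tone
    ends up somewhere in the baseband between 0 and f_sample / 2.
    """
    # Position of the tone relative to the nearest lower multiple of f_sample.
    folded = f_signal % f_sample
    # Tones in the upper half of that interval mirror back into the lower half.
    return min(folded, f_sample - folded)


if __name__ == "__main__":
    # Hypothetical example: a 30kHz tone sampled at 48kHz (Nyquist limit 24kHz).
    print(alias_frequency(30_000, 48_000))  # 18000
    # A 10kHz tone is below the Nyquist limit and is unchanged.
    print(alias_frequency(10_000, 48_000))  # 10000
```

Once the 30kHz tone has been sampled, it is indistinguishable from a genuine 18kHz tone, which is why the band-limiting has to happen before sampling.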

Most color video systems incorporate chrominance subsampling; the RGB pixels are converted into a YUV color space, and then the U and V chrominance components are spatially subsampled to save transmission bandwidth. Whereas U and V were historically used for analog systems, and the designations CB and CR (for “component-blue difference” and “component-red difference”) grew from the use of digital encoding, the latter two are also often used for analog signals derived from a digital decoder. This color encoding gives rise to the commonly used 4:2:2 and 4:2:0 subsampling grids, where 4:2:2 provides full vertical resolution and one-half horizontal resolution of the color components compared to the luminance signal, and 4:2:0 provides one-half vertical and one-half horizontal resolution of the color difference components. (Full horizontal and vertical color resolution is given by 4:4:4 encoding.)

This J:a:b notation stems from the notion of a reference sampling block with four pixels horizontally and two pixels vertically. J is the number of luminance samples in the top row, and a and b are the numbers of chrominance samples in the top row and bottom row, respectively. Figure 1 illustrates this notation.
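The notation can be turned into resolution ratios with a few lines of Python. This is a minimal sketch, assuming the two-row reference block described above; the function name is hypothetical, and only the common cases where b is either zero or equal to a are handled.

```python
from fractions import Fraction


def chroma_resolution(notation):
    """Return (horizontal, vertical) chroma resolution relative to luminance.

    `notation` is a J:a:b string such as "4:2:2". Only b == 0 and b == a
    are supported in this sketch.
    """
    j, a, b = (int(part) for part in notation.split(":"))
    if b not in (0, a):
        raise ValueError("this sketch handles only b == 0 or b == a")
    # a chrominance samples for every J luminance samples in the top row.
    horizontal = Fraction(a, j)
    # b == 0: the bottom row carries no chrominance samples of its own,
    # so vertical chrominance resolution is halved.
    vertical = Fraction(1, 2) if b == 0 else Fraction(1)
    return horizontal, vertical


if __name__ == "__main__":
    for grid in ("4:4:4", "4:2:2", "4:2:0"):
        print(grid, chroma_resolution(grid))
```

For 4:2:0, the function returns one-half in both directions, so each chrominance component carries one-quarter as many samples as the luminance signal.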

Because chrominance subsampling can generate aliasing of the chrominance components that is different from that of the luminance component (which is further complicated by interlace), contribution and distribution signals preferentially use 4:2:2 sampling, especially when processing or editing video, whereas transmission signals most often use 4:2:0 sampling to save bandwidth.

Audio sampling

We mentioned earlier that sampling must be performed at an appropriate rate, lest aliasing occur. We can see this graphically in the spectrum plots of the undersampled signal shown in Figure 2, which include the repeat spectra inherently formed by the sampling process. With audio, this will result in an artifact that sounds like intermodulation distortion.

A critically sampled signal, where the sampling rate is exactly twice the highest frequency component, will avoid aliasing, but it will require an extremely steep analog low-pass filter when converting back to analog. For this reason, the signal is often upsampled to a higher sample rate, where a more economical digital filter can remove the repeat spectra, and a simpler analog filter with a gentler roll-off can be used to reconstruct the analog signal.
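To get a feel for how much oversampling relaxes the analog filter, the sketch below computes the band over which the reconstruction filter has to roll off. The 20kHz audio bandwidth, 44.1kHz sample rate and 4x oversampling factor are assumed here purely for illustration, as are the function names; the sketch also assumes an ideal digital filter that removes every repeat spectrum below the first one at the oversampled rate.

```python
import math


def transition_band(f_max, f_sample, oversampling=1):
    """Return (start, stop) of the analog reconstruction filter's roll-off.

    The filter must pass everything up to f_max and reject the first
    repeat spectrum, whose lower edge sits at (effective rate - f_max).
    """
    effective_rate = f_sample * oversampling
    return f_max, effective_rate - f_max


def width_in_octaves(band):
    """Express the roll-off region as a number of octaves."""
    start, stop = band
    return math.log2(stop / start)


if __name__ == "__main__":
    for factor in (1, 4):
        band = transition_band(20_000, 44_100, oversampling=factor)
        print(factor, band, round(width_in_octaves(band), 2))
```

Under these assumptions, the filter has only about a quarter of an octave to work with at the original rate, but almost three octaves after 4x oversampling, which is what allows the gentler roll-off.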

Of course, the aliased signal in the example above could be post-filtered to remove the artifact, but this would result in a decrease in the high-frequency response of the system — perhaps an acceptable trade-off in certain (e.g., low-cost and low-complexity) situations.

How much is too much?

A key consideration in the design of any video system is the response of the final receptor: the human visual system. When trying to determine the resolution capability of the eye, we can start by measuring visual acuity, i.e., the spatial resolution performance of the human visual system. The term “20/20 vision” is defined by a Snellen chart as the ability to just distinguish features that subtend one arc minute of angle (one-sixtieth of a degree). The standard test symbol developed by the chart's eponymous inventor is the optotype, such as one of the well-known letters of the chart. The distinguishing features of the optotypes on the 20/20 line, such as the cross-arm of the E, are spaced at 60 features per degree, or 30 cycles per degree.

Simple trigonometry produces the result shown in Figure 3: the optimum distance from which to observe a 1080-line display is 3.16 times the picture height, where the vertical viewing angle is 18 degrees. Farther away than that, a person with 20/20 corrected vision can't resolve the smallest displayed details; closer than that, and you'll start to see individual pixels. Stated in screen diagonals, this works out to 1.55 times the diagonal measure of a 1920 × 1080 display.
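The trigonometry can be written out as follows. This is a sketch of one way to arrive at the figures, assuming that each of the N = 1080 lines subtends exactly one arc minute and that the display has a 16:9 aspect ratio; the symbols θ (vertical viewing angle), d (viewing distance), H (picture height) and D (screen diagonal) are introduced here for convenience.

```latex
\begin{align*}
\theta &= N \cdot \tfrac{1}{60}^{\circ} = \frac{1080}{60}^{\circ} = 18^{\circ} \\
\frac{d}{H} &= \frac{1}{2\tan(\theta/2)} = \frac{1}{2\tan 9^{\circ}} \approx 3.16 \\
\frac{D}{H} &= \sqrt{1 + \left(\tfrac{16}{9}\right)^{2}} \approx 2.04 \\
\frac{d}{D} &= \frac{d/H}{D/H} \approx \frac{3.16}{2.04} \approx 1.55
\end{align*}
```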

Thus, if you've got a 1080-line monitor with a 15in diagonal, the optimum viewing distance is just under 2ft; with a 42in display, it works out to about 5.5ft. Because most people view their TV from a larger distance of about 9ft (the so-called Lechner distance, named after TV researcher Bernie Lechner), the required optimum screen size grows proportionally, to roughly 70in for a 1080-line display.
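These figures are easy to reproduce or to rework for other screen sizes. The Python sketch below assumes a 16:9 display whose lines each subtend one arc minute at the optimum distance; the function names and default parameters are illustrative choices rather than part of any standard.

```python
import math

ARC_MINUTE_DEG = 1 / 60  # one arc minute, in degrees


def optimum_distance(diagonal, lines=1080, aspect=16 / 9):
    """Viewing distance at which each line subtends one arc minute.

    The result is in the same unit as `diagonal`.
    """
    # Picture height from the diagonal and the aspect ratio.
    height = diagonal / math.hypot(1, aspect)
    # Total vertical viewing angle covered by all the lines.
    angle = math.radians(lines * ARC_MINUTE_DEG)
    return height / (2 * math.tan(angle / 2))


def optimum_diagonal(distance, lines=1080, aspect=16 / 9):
    """Screen diagonal for which `distance` is the optimum viewing distance."""
    return distance / optimum_distance(1.0, lines, aspect)


if __name__ == "__main__":
    for inches in (15, 42):
        print(inches, "in display:", round(optimum_distance(inches) / 12, 2), "ft")
    # Screen size matched to a 9ft (108in) viewing distance.
    print("9ft distance:", round(optimum_diagonal(108)), "in display")
```

Run as written, the sketch gives a little over 1.9ft for the 15in monitor, about 5.4ft for the 42in display, and a diagonal of roughly 70in for a 9ft viewing distance.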

These calculations, however, assume that there are no other limiting conditions. In reality, the Kell factor, interlace, the interpixel grid, contrast and the sharp edges of the optotypes must all be taken into account. And in making the case for Ultra-HDTV, NHK researchers wrote in a 2008 paper that test subjects could distinguish between images with effective resolutions of 78 and 156 cycles per degree.

This suggests that some people can tell the difference between a 42in display with 1080 lines and one with 2160 lines, when viewed within the practical confines of a living room. Perhaps the era of a complete video wall in the home is not that far off!

Aldo Cugnini is a consultant in the digital television industry.

Send questions and comments to: aldo.cugnini@penton.com