Predicting subjective video quality

For a number of years, work has been performed to replace the manual process of human viewer trials with machines that deliver results correlating highly with those obtained from ITU BT.500 trials. ITU J.144 attempted to save time and tedium by replacing the BT.500 human-based procedure with an algorithm. Although some success was achieved, more general applications remained unaddressed. The key challenges have been the complexity of the human vision system and its proper representation when an electrical video data stream is fed into the analysis engine. In recent years, a proliferation of encoding schemes, formats, resolutions and frame rates has further complicated this task.

The human vision system

Objects in our surroundings usually reflect light in its presence or emit light directly, as in the case of most video displays. (See Figure 1.) This light enters the human eye through the lens and pupil and triggers the receptors on the retina to produce a stimulus. The stimulus is passed through the optic nerve to the brain, where the visual cortex turns the stimuli of the entire retina into a picture. The information channel to the brain (the optic nerve) has a limited capacity, further restricting our ability to resolve detail and process movement of the objects we are observing. Once objects begin to move, acuity degrades, and the human vision system is limited to processing about 30 pictures per second. Research indicates, however, that we can resolve light stimuli of a much higher frequency (temporal frequency resolution), given the proper settings of other aspects of the light stimulus, such as a sufficiently high average luminance level.

The charging process of the receptors depends on the duration and intensity of the light stimulus. Charging and discharging cannot occur infinitely fast. Therefore, the human vision system produces visual illusions such as the dark spots in the Hermann grid. (See Figure 2.) In addition, in an area of the retina where a dramatic difference in the charge profile exists, charge drains away from the low-charge areas. This can be seen in the gray scales in Figure 3, where the horizontal gray bar in the combined picture seems to have brighter and darker areas where the contrast is high. In reality, the gray horizontal bars are identical.
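
This neighbor-to-neighbor interaction is often modelled as lateral inhibition: an excitatory center response minus an inhibitory surround response. The sketch below, with illustrative sigma values rather than calibrated ones, builds a small Hermann grid and shows that a difference-of-Gaussians response dips at the street intersections, exactly where the gray spots are perceived:

    import numpy as np

    def gaussian_blur(img, sigma):
        # Separable Gaussian blur via 1D convolution along rows, then columns.
        radius = int(3 * sigma)
        x = np.arange(-radius, radius + 1)
        kernel = np.exp(-x**2 / (2 * sigma**2))
        kernel /= kernel.sum()
        rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, img)
        return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, rows)

    def lateral_inhibition(image, sigma_center=1.5, sigma_surround=4.0):
        # Excitatory center minus inhibitory surround (difference of Gaussians).
        return gaussian_blur(image, sigma_center) - gaussian_blur(image, sigma_surround)

    # Hermann grid: black squares separated by white streets.
    grid = np.ones((80, 80))
    for y in range(0, 80, 20):
        for x in range(0, 80, 20):
            grid[y:y + 15, x:x + 15] = 0.0

    response = lateral_inhibition(grid)
    print("mid-street:  ", round(response[17, 7], 3))
    print("intersection:", round(response[17, 37], 3))  # lower: the dark spot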

In an amazing experiment, George Malcolm Stratton (1865-1957), a psychologist at the University of California, Berkeley, showed that the visual cortex can even compensate for reversing optics that turn the retinal image upside down. This illustrates the importance of adaptation and, as a consequence, the relativity of the human vision system. Recently, there have been breakthroughs in human vision models precisely because of this appreciation of the importance of adaptation. One example is CIECAM02, the international standard for color appearance models, ratified in 2004. Its primary purpose is to mimic adaptation effects for color in nonvideo (static) applications.

When modelling the human vision system mathematically, it is clear that a simple linear system will have unacceptable inaccuracies. Instead, the emphasis must be on accurately depicting the inherently nonlinear behavior of adaptation.

Modelling the human vision

One way to specify the desired model behavior is to build stimulus-response pairs taken from experiments in the human vision science literature. (See “References” on the next page.) These can then be modelled with filter functions and other processing to account for adaptation, masking and combined perceptual and cognitive effects. We can consider four classes of stimulus responses: transparent, linear, fixed (or stationary) nonlinear and adaptive (or dynamically nonlinear).

Transparent response class

In a transparent response system, usually no preprocessing occurs, and the measurements consist of actions like pixel-by-pixel subtraction. This process does not apply any weighting or filtering to emulate the physical limitations of the human vision system. PSNR measurements are a typical representative of this class and usually correlate only weakly with results obtained from testing with human viewers. Only if the video contains stimuli closely corresponding to the transparent class will the correlation provide meaningful results.
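
As an illustration, a minimal PSNR computation over two frames looks like this (the frame data here is synthetic):

    import numpy as np

    def psnr(reference, processed, max_value=255.0):
        """Peak signal-to-noise ratio: a transparent-class measurement built on
        plain pixel-by-pixel subtraction, with no human-vision weighting."""
        mse = np.mean((reference.astype(np.float64) - processed.astype(np.float64)) ** 2)
        if mse == 0:
            return float('inf')  # identical frames
        return 10.0 * np.log10(max_value ** 2 / mse)

    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, size=(480, 640), dtype=np.uint8)   # synthetic frame
    noisy = np.clip(ref.astype(np.int16) + rng.integers(-5, 6, size=ref.shape), 0, 255)
    print(f"PSNR: {psnr(ref, noisy):.2f} dB")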

Linear response class

A linear response corresponds to cases where fixed or stationary linear filters are applied to the stimulus. These filters can mimic the human vision response to the degree required. Again, stimuli that strongly correlate with linear behavior can be approximated well with this approach. However, linear filtering does not take into account the human vision system's ability to discern detail in light or dark patches when the ambient light is changing. Linear filtering usually does not take Weber's law into account either. Weber's law states that a just-noticeable visual difference depends on the magnitude of the original stimulus: the noticeable difference divided by the magnitude of the original stimulus is a constant.
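
A small sketch makes the consequence concrete. Assuming an illustrative Weber fraction of 2 percent (the exact value depends on viewing conditions), the absolute difference needed to be noticeable grows with the base luminance while the ratio stays fixed; a fixed linear filter would instead flag the same absolute error at any luminance:

    # Weber's law sketch: the just-noticeable difference (JND) grows in
    # proportion to the base stimulus, so delta_I / I stays constant.
    # The 2 percent Weber fraction is an illustrative ballpark value.
    WEBER_FRACTION = 0.02

    def just_noticeable_difference(intensity):
        return WEBER_FRACTION * intensity

    for luminance in (10.0, 100.0, 1000.0):  # arbitrary luminance levels
        jnd = just_noticeable_difference(luminance)
        print(f"I = {luminance:6.1f} -> JND = {jnd:5.1f} (ratio {jnd / luminance:.2f})")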

Fixed or stationary nonlinear response class

The fixed or stationary nonlinear response class corresponds to cases where the response to two overlaid, or more precisely superimposed, images is not equal to the sum of the responses to the individual stimuli. This means that the human vision system does not linearly superimpose certain stimuli in the response it passes to the optic nerve and the cortex. When two sources of light are added, the human vision response at each point (or pixel) is not equal to the sum of the responses to each image alone.
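
A compressive point nonlinearity is enough to show the effect. The power-law response below is purely illustrative, not a calibrated retinal model:

    import numpy as np

    def response(stimulus):
        # Compressive (stationary nonlinear) response; the exponent is illustrative.
        return np.power(stimulus, 0.4)

    a, b = 20.0, 50.0
    print(f"response(a + b)           = {response(a + b):.2f}")
    print(f"response(a) + response(b) = {response(a) + response(b):.2f}  (superposition fails)")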

Most advanced measurement techniques for picture quality combine linear and stationary nonlinear filtering to predict the responses of the human vision system. Yet even this combination of filters does not account for phenomena like flicker versus brightness, where light of a given intensity appears brighter when it is turned off and on rapidly. This class is also unable to reproduce the effect of perceived brightness under adaptation to varied luminance levels, nor other temporal aspects of visual illusions, such as the phantom third pulse seen when two pulses are used as a stimulus.
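
Structurally, such techniques amount to a cascade of a fixed linear filter and a static point nonlinearity, as in this sketch (taps and exponent are placeholders, not values from any shipping product):

    import numpy as np

    def linear_stage(signal, taps=(0.25, 0.5, 0.25)):
        # Fixed linear filter, e.g. a crude spatial low-pass along one video line.
        return np.convolve(signal, np.asarray(taps), mode='same')

    def static_nonlinearity(signal, exponent=0.5):
        # Stationary compressive nonlinearity applied point by point.
        return np.sign(signal) * np.abs(signal) ** exponent

    line = np.linspace(0.0, 1.0, 16)  # one video line, normalized luminance
    predicted = static_nonlinearity(linear_stage(line))
    print(predicted.round(3))

Because every coefficient in this cascade is fixed, it cannot follow the adaptation effects discussed next.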

Adaptive (dynamically nonlinear) response class

One additional element of human perception, whether in the vision system or any other sense, is perceptual contrast. In a comparative situation, humans have to make decisions, and they do so by contrasting one stimulus with another.

A good example of this behavior was identified by Sherif, Taub and Hovland in 1958. They asked people to lift different weights. When the test subjects had to lift a heavy weight first, they underestimated the lighter weights they lifted afterwards. Similar effects happen in the vision system as well. Some ITU standards address this effect by introducing training sequences before the actual testing to set a common baseline for the participants of human viewer trials.

Predicting subjective quality ratings must take into account that perceptual contrast occurs in the process of detecting the threshold of noticeable differences. It is essential to calibrate the system with human vision science data to support an adaptive filtering system.

For the general case, the human vision response to video stimuli is adaptive (dynamically nonlinear). Response sensitivities can change by more than an additional order of magnitude beyond the spatiotemporal dynamic range. Spatiotemporal dynamic range represents the human vision system's capability to identify differences in light stimuli over area (resolution) and time (how many different light stimuli occur in a given area). In a very simplified way, we can say that the adaptive nature of human vision acts like a magnifying glass for spatial as well as temporal aspects. However, the adaptation itself has time constants for it to work properly. Any measurement system for predicting subjective picture quality must take these effects into account.
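
The time-constant behavior can be sketched with a divisive gain control whose adaptation state tracks a running average of the stimulus. All constants here are illustrative placeholders:

    import numpy as np

    def adaptive_response(frames, time_constant=0.9, semi_saturation=10.0):
        # Dynamically nonlinear gain control: sensitivity is divisively
        # normalized by a running estimate of the average luminance.
        adapted_level = np.mean(frames[0])  # initial adaptation state
        responses = []
        for frame in frames:
            responses.append(frame / (semi_saturation + adapted_level))
            # The adaptation state follows the stimulus with a time constant,
            # so sensitivity lags sudden luminance changes.
            adapted_level = time_constant * adapted_level + (1 - time_constant) * np.mean(frame)
        return responses

    dark = [np.full((4, 4), 5.0)] * 30      # dark scene: gain settles high
    bright = [np.full((4, 4), 500.0)] * 30  # cut to a bright scene
    out = adaptive_response(dark + bright)
    print(f"last dark frame:    {out[29][0, 0]:.2f}")
    print(f"first bright frame: {out[30][0, 0]:.2f}  (overshoot before re-adapting)")
    print(f"last bright frame:  {out[59][0, 0]:.2f}")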

The approach of the vision science community is to measure stimulus-response pairs in a controlled environment. To determine the contrast sensitivity of the human vision system, for example, a whole series of tests needs to be conducted to take into account the adaptive capabilities of human vision.

First, the spatial content is varied (detail is increased, for example) at a given ambient luminance (low light, for example, to emulate cinema conditions), and multiple curves are traced for different rates of temporal change in the video. The human vision system adapts to the low light level and to each temporal frequency. Then the spatial content is varied again for a different fixed combination of the other two parameters, and the results are recorded. Only one parameter is changed at a time, and the effects on spatial contrast sensitivity are recorded again for each new set of parameters.
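
Generating such a test matrix is mechanical; the sketch below enumerates the one-curve-per-fixed-pair conditions. The value grids are illustrative, and real experiments use far more conditions:

    import itertools

    spatial_freqs_cpd = [0.5, 1, 2, 4, 8, 16]   # cycles per degree (swept)
    ambient_luminance = [0.1, 10.0, 100.0]      # cd/m^2 (fixed per curve)
    temporal_freqs_hz = [0.0, 2.0, 8.0, 30.0]   # Hz (fixed per curve)

    conditions = []
    for luminance, temporal in itertools.product(ambient_luminance, temporal_freqs_hz):
        # Each fixed (luminance, temporal) pair yields one sensitivity curve,
        # traced by sweeping spatial frequency while the observer stays adapted.
        for spatial in spatial_freqs_cpd:
            conditions.append((spatial, luminance, temporal))

    print(f"{len(conditions)} test conditions, {len(conditions) // len(spatial_freqs_cpd)} curves")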

This data can then be used to model the human vision system and serve as a parameter set for adaptation parameters determined dynamically in the measurement process. Why is this so important? Due to the adaptive nature of the vision system, contrast sensitivities can change by a factor of almost 100 depending on the values of the other parameters. As a consequence, meaningful results can only be obtained from a system that dynamically adapts its filter settings according to the surrounding conditions and the video stimuli observed.

The change in sensitivity with average luminance, such as light and dark adaptation, involves a nonlinearity that is consistent with many visual perception phenomena. One very obvious adaptation is the ability of the human vision system to adjust to ambient luminance. As a consequence, watching a movie in a cinema, at home or on a mobile device in bright sunlight differs not only in screen size but also in perceived quality, due to the adaptive nature of the human vision system. Brightness with flicker, changes in dynamic responses to step increases, afterimages, visual illusions and extreme sensitivity (i.e., photosensitive epilepsy) are consistent with the types of nonlinearity that account for most of the adaptation.

Key human vision stimulus-response data sets

Any analysis of perceived video quality has to identify the threshold at which differences from a reference become noticeable to the viewer. This is comparable to differentiating a trend from noise. The underlying measurements for predicting subjective quality ratings have to be calibrated with human vision science data so that they can detect differences at the smallest noticeable increment (the threshold) and also scale correctly above it (the supra-threshold responses of the human vision system to the stimuli applied). One effective way to do this is to calibrate the measurement system with stimulus-response data sets based on findings of the human vision science community.
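
One common way vision science expresses such threshold data is a psychometric function. The Weibull form below, with illustrative threshold and steepness values, maps a contrast difference to a detection probability and pins the threshold at the 50 percent point:

    import numpy as np

    def detection_probability(contrast, threshold=0.02, steepness=3.5):
        # Weibull psychometric function: P = 0.5 exactly at the threshold.
        return 1.0 - np.exp(-np.log(2.0) * np.power(contrast / threshold, steepness))

    for c in (0.005, 0.02, 0.08):
        print(f"contrast {c:.3f}: P(detect) = {detection_probability(c):.2f}")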

Predicting subjective video quality ratings

The human vision system only responds to light stimuli. Any measurement system must establish a transfer from (electrical) video data to light stimuli emitted from a display. Figure 4 shows the signal flow through the analysis engine. In the display model, a conversion from electrical signals to light stimuli is performed. The viewing model takes into account the viewing distance, ambient light and so forth.
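
A minimal sketch of those first two stages might look as follows. The gamma EOTF, the black and peak luminance levels, and the viewing-geometry helper are illustrative assumptions, not the actual display and viewing models:

    import numpy as np

    def display_model(code_values, gamma=2.4, black_cdm2=0.1, peak_cdm2=100.0):
        # Electrical code values -> light: a simple gamma EOTF with placeholder
        # black and peak luminance. A real display model also covers primaries,
        # scaling and processing inside the display.
        normalized = np.clip(code_values / 255.0, 0.0, 1.0)
        return black_cdm2 + (peak_cdm2 - black_cdm2) * normalized ** gamma

    def pixels_per_degree(viewing_distance_m, pixel_pitch_m):
        # Viewing model: distance sets how many pixels span one degree of
        # visual angle, which the vision model's filters depend on.
        return np.deg2rad(1.0) * viewing_distance_m / pixel_pitch_m

    print(display_model(np.array([16, 128, 235])))       # cd/m^2 per code value
    print(f"{pixels_per_degree(3.0, 0.0005):.0f} pixels per degree")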

The vision model provides adaptive filtering to effectively simulate the human vision system as described. The difference between the predicted human vision responses to the reference and to the processed video is then computed. Within the objective maps node, visible impairments are classified and measured objectively, with the ability to then sum each impairment with a corresponding relative annoyance or relative preference weighting. The summary node extracts single summary measures per frame and/or video sequence. The ITU BT.500 training equivalent, which maps the response summary measures to a difference mean opinion score (DMOS), is also included in the summary node.
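
The summary step can be sketched as pooling per-frame impairment values and mapping the result onto a DMOS scale. The mean pooling and the logistic curve with its constants below are illustrative stand-ins for the BT.500-trained mapping:

    import numpy as np

    def pool_impairments(per_frame_impairment):
        # Simple temporal mean pooling into one summary measure per sequence.
        return float(np.mean(per_frame_impairment))

    def to_dmos(summary, midpoint=1.0, slope=4.0, dmos_max=100.0):
        # Logistic mapping from summary measure to the DMOS scale.
        return dmos_max / (1.0 + np.exp(-slope * (summary - midpoint)))

    impairment = np.abs(np.random.default_rng(1).normal(1.0, 0.3, size=300))
    print(f"predicted DMOS: {to_dmos(pool_impairments(impairment)):.1f}")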

An adaptive integrator (see Figure 5) is used to filter in four spatial directions (right, left, up, down) and temporally. The result is a spatiotemporal filter that is tunable in each dimension. Consistent with previous models that take center and surround interaction into account, two spatiotemporal filters are used: one for the center and one for the surround. The surround spatiotemporal response is used both to subtract from and to tune the center spatiotemporal response. In addition, the surround response alters its own response via feedback to the frequency controls, but much more slowly than for the center, consistent with longer-term adaptation such as long-term light and dark adaptation, afterimages and other long-term effects.
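
Reduced to a single (temporal) dimension, the idea can be sketched with two leaky integrators whose difference forms the output and whose surround state also tunes the result. All coefficients are illustrative placeholders, not the calibrated filter described above:

    import numpy as np

    def center_surround(signal, a_center=0.5, a_surround=0.95, k=0.5):
        # Two leaky integrators: 'a' closer to 1 means a longer integration
        # time, so the surround adapts much more slowly than the center.
        center = surround = float(signal[0])
        out = []
        for x in signal:
            center = a_center * center + (1.0 - a_center) * x
            surround = a_surround * surround + (1.0 - a_surround) * x
            # The surround both subtracts from and divisively tunes the center,
            # a crude stand-in for the feedback to the frequency controls.
            out.append((center - surround) / (1.0 + k * abs(surround)))
        return np.asarray(out)

    step = np.concatenate([np.zeros(20), np.ones(60)])  # step increase in luminance
    response = center_surround(step)
    print("onset transient:", response[20:24].round(3))  # strong response at the step
    print("settled value:  ", response[-1].round(3))     # decays as adaptation completes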

Calibration

Extensive controls allow calibration of the direct threshold spatiotemporal response, of the horizontal and vertical dimensions, and of center and surround. A control for the baseline frequency cut-off (corresponding to integration time or area) is required. Other items requiring calibration are the frequency response adaptation sensitivities that control the transition between threshold and supra-threshold response (one for spatial and one for temporal). In addition to the adaptive spatiotemporal filter, other model components are used to take into account Weber's law, perceptual differences between correlated and uncorrelated images, and other behavior, including various types of masking.

Conclusion

Vision science has developed comprehensive sets of stimulus-response pairs to predict human vision system response and perception. These functional blocks can be implemented in a dynamically nonlinear, adaptive system that models the human vision system far more successfully than existing technical implementations according to ITU J.144. This implementation is largely agnostic to video systems in terms of resolution, frame rate or compression algorithm, and it is capable of simulating viewing conditions, display types and viewer skills. It ultimately calculates DMOS scores based on an adaptive filtering system emulating the human vision system. The objective is to help accelerate the optimization of encoding algorithms, improve bandwidth utilization in distribution systems and foster a better viewing experience for TV consumers by measuring and tracking optimum perceived video quality.

Kevin Ferguson is a principal engineer at Tektronix, responsible for mathematical modelling and algorithm development for automated video measurement and picture quality analysis. Winfried Schultz is EMEA video marketing manager for Tektronix' video product line.

References

  1. Ferguson, Kevin, “An Adaptable Human Vision Model for Subjective Video Quality Rating Prediction Among CIF, SD, HD and E-Cinema,” Tektronix Inc. white paper, June 1, 2007, Lit. no. 25W-21014-0.
  2. Ferguson, Kevin, “Predicting Subjective Quality Ratings of Video,” US Patent No. 6829005, Issued Dec. 7, 2004.
  3. W. H. Swanson, T. Ueno, V. C. Smith, J. Pokorny, “Temporal modulation sensitivity and pulse-detection thresholds for chromatic and luminance perturbations,” J. Opt. Soc. Am., Oct. 1987, Vol. 4, No. 10, pp. 1992-2005.
  4. D. Hubel, “Eye, Brain, and Vision,” Scientific American Library, NY, NY, 1995, pp. 33-136.
  5. Enroth-Cugell, “The World of Retinal Ganglion Cells,” from Shapley, R., Man-Kit Lam, D., ed., Contrast Sensitivity, MIT Press, 1993, pp. 155, 159.
  6. B. Levitan and G. Buchsbaum, “Signal sampling and propagation through multiple cell layers in the retina: modeling and analysis with multirate filtering,” J. Opt. Soc. Am., July 1993, Vol. 10, No. 7, pp. 1463-1480.