Quality control

Consumer picture quality expectations are higher than ever, creating intense pressure on video equipment manufacturers, broadcasters, network operators and content providers to verify that their devices, systems or processes have not introduced impairments in video content that will affect perceived picture quality. This has led to the development of new automated picture quality instruments capable of evaluating picture quality across the video supply chain without the need for slow and expensive human evaluators, while also improving repeatability.

Historically, organizations have used an informal method of subjective picture quality assessment that relies on one person or a small group of people who demonstrate an ability to detect video quality impairments. These are the organization's “golden eyes.” Subjective picture quality evaluations are fraught with error and expense, and often end up only approximating viewer opinion.

These factors have led organizations to more formal methods of subjective evaluation, such as those in the ITU-R BT.500 recommendation, which describes several methods along with requirements for selecting and configuring displays, determining reference and test video sequences, and selecting subjects for viewing audiences. However, such formal subjective picture quality assessments are also expensive and time-consuming.

Instead, engineering, maintenance and quality assurance teams are starting to turn to a new class of instruments that use full-reference objective picture quality measurements. Using these instruments, teams can make accurate, reliable and repeatable picture quality measurements more rapidly and cost effectively than testing with actual viewers.

The tests used by instruments include Difference Mean Opinion Score (DMOS), Picture Quality Rating (PQR) and traditional Peak Signal-to-Noise Ratio (PSNR) measurements as a quick check for picture quality problems. This article explores key concepts associated with these measurements. It also provides tips and guidance for how teams can use these techniques in a variety of settings for their greatest advantage.

Subjective assessment and objective picture quality measurement

If people perceived all changes in video content equally, assessing picture quality would be much easier. A measurement instrument could simply compute the pixel-by-pixel differences between the original video content (the reference video) and the content derived from this reference video (the test video). It could then compute the Mean Squared Error (MSE) of these differences over each video frame and the entire video sequence.
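As a rough illustration of this naive approach, the following Python sketch computes the MSE for each frame and for a whole sequence. The frame layout (nested lists of pixel values) and the function names `frame_mse` and `sequence_mse` are assumptions made for this example, not part of any instrument's interface.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: a frame is a list of rows, each row a list of pixel values.
Frame = Sequence[Sequence[float]]


def frame_mse(reference: Frame, test: Frame) -> float:
    """Mean squared error between two frames of identical size."""
    if len(reference) != len(test):
        raise ValueError("frames must have the same number of rows")
    total = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        if len(ref_row) != len(test_row):
            raise ValueError("rows must have the same number of pixels")
        for ref_px, test_px in zip(ref_row, test_row):
            diff = ref_px - test_px
            total += diff * diff
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one pixel")
    return total / count


def sequence_mse(
    reference_frames: Sequence[Frame], test_frames: Sequence[Frame]
) -> Tuple[List[float], float]:
    """Return the per-frame MSE values and their average over the sequence.

    Averaging the per-frame values equals the pooled MSE only when every
    frame has the same size, which this sketch assumes.
    """
    if len(reference_frames) != len(test_frames) or not reference_frames:
        raise ValueError("sequences must be non-empty and of equal length")
    per_frame = [frame_mse(r, t) for r, t in zip(reference_frames, test_frames)]
    return per_frame, sum(per_frame) / len(per_frame)


# Two tiny 2x2 frames; one pixel differs in each test frame.
reference_video = [[[10, 10], [10, 10]], [[20, 20], [20, 20]]]
test_video = [[[10, 12], [10, 10]], [[20, 20], [16, 20]]]
per_frame_mse, overall_mse = sequence_mse(reference_video, test_video)
```

The result depends only on the size of the pixel differences, not on where they fall in the frame or how they are arranged.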

However, people are not mechanical measuring devices. Many factors affect viewers' ability to perceive differences between the reference and test video. Figure 1 illustrates this situation. The video frame shown in Figure 1.1 has greater MSE with respect to the original reference video than the video frame in Figure 1.2.

However, the error in Figure 1.1 has high spatial frequency, while the error in Figure 1.2 consists of blocks containing much lower spatial frequencies. The human vision system has a stronger response to the lower spatial frequencies in Figure 1.2 and less response at the higher spatial frequencies in Figure 1.1. Subjectively, Figure 1.2 is worse than Figure 1.1, even though the MSE measurement would assess Figure 1.1 as the poorer image.

Objective picture quality measurements that only measure the noise difference between the reference and test video sequences, e.g. PSNR, will not accurately and consistently match viewers' subjective ratings. To match subjective assessments, objective picture quality measurements need to account for human visual perception.

The first of the two categories of full-reference objective picture quality measurements is shown in Figure 2. Noise-based measurements, such as PSNR, compute the noise, or error, in the test video compared to a reference video. This form of measurement is helpful in diagnosing defects in video processing hardware and software. Changes in PSNR values also give a general indication of changes in picture quality.

Alternative versions of the PSNR measurements adjust the base measurement result to account for perceptual factors and improve the match between the measurement results and subjective evaluations. Other noise-based picture quality measurements use different methods to determine noise and make perceptual adjustments.

A second category of full-reference objective picture quality measurements is illustrated in Figure 3. Perceptual-based measurements use human vision system models to determine the perceptual contrast of reference and test videos. Further processing accounts for several other perceptual characteristics. These include relationships between perceptual contrast and luminance and various masking behaviors in human vision. The measurement then computes the perceptual contrast difference between the reference and test videos rather than the noise difference. The perceptual contrast difference is used directly in making perceptual-based picture quality measurements. With an accurate human vision model, picture quality measurements based on perceptual contrast differences match viewers' subjective evaluations.

Picture quality rating measurements

Picture quality rating measurements convert the perceptual contrast difference between the reference and test videos to a value representing viewers' ability to notice these differences between the videos. Perceptual sensitivity experiments measure the viewer's ability to notice differences in terms of Just Noticeable Differences (JNDs).

The concept of JND dates to the 19th century and the work of E.H. Weber and Gustav Theodor Fechner on perceptual sensitivity. Measurements of perceptual sensitivity involve repeated trials with a single test subject. A 1 JND difference corresponds to approximately 0.1 percent perceptual contrast difference between the reference and test videos. With this perceptual contrast difference, most viewers can barely distinguish the test video from the reference video in a forced-choice pairwise comparison. At this, and at lower levels of perceptual contrast difference, viewers will perceive the test video as having essentially equal quality to the reference video.

Configuring PQR measurements

Like the actual human vision system, electronic human vision system models operate on light and must convert the data in the reference and test video files into light values. This conversion process introduces several factors that influence PQR and DMOS measurements.

In a subjective picture quality evaluation, the light reaching a viewer comes from a particular type of display. The display's properties affect the spatial, temporal and luminance characteristics of the video the viewer perceives. Viewing conditions also affect differences viewers perceive in a subjective evaluation. In particular, changes in the distance between the viewer and the display screen and changes in the ambient lighting conditions can affect test results. Taking this into account, instruments offer built-in models for a range of CRT, LCD and DLP technologies and viewing conditions, or can be custom-configured.

Interpreting PQR measurements

The PQR scale uses data from perceptual sensitivity experiments to ensure that 1 PQR corresponds to one JND and that measurements around this visibility threshold match the perceptual sensitivity data. The following scale offers guidance in interpreting PQR measurement results:

0: The reference and test image are identical. The perceptual contrast difference map is completely black.
<1: The perceptual contrast difference between the reference and test videos is less than 0.1 percent or less than one JND. Viewers cannot distinguish differences between videos. Video products or systems have some amount of video quality “headroom.” Viewers cannot distinguish subtle differences introduced by additional video processing, or by changes in display technology or viewing conditions. The amount of headroom decreases as the PQR value approaches 1.
1: The perceptual contrast difference between the reference and test videos equals approximately 0.1 percent or 1 JND. Viewers can barely distinguish differences between the videos. Video products or systems have no amount of video quality headroom. Viewers are likely to notice even slight differences introduced by additional video processing, or by changes in display technology or viewing conditions.
2-4: Viewers can distinguish differences between the reference and test videos. These are typical PQR values for high-bandwidth, high-quality MPEG encoders used in broadcast applications. This is generally recognized as excellent to good quality video.
5-9: Viewers can easily distinguish differences between the reference and test videos. These are typical PQR values for lower bandwidth MPEG encoders used in consumer-grade video devices. This is generally recognized as good to fair quality video.
10 and above: Obvious differences between reference and test videos. This is generally recognized as poor to bad quality video.
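For teams that log PQR results automatically, the scale above could be encoded along the following lines. This is a minimal sketch: the function name `interpret_pqr` and the label strings are hypothetical, and the handling of fractional values that fall between the listed bands (for example, 1.5 or 4.5) is an assumption, since the scale does not say where those boundaries lie.

```python
def interpret_pqr(pqr: float) -> str:
    """Map a PQR value to the guidance band in the scale above.

    Assumption: values between listed bands are assigned to the lower
    band (e.g. 1.5 is grouped with 1, and 4.5 with the 2-4 band).
    """
    if pqr < 0:
        raise ValueError("PQR values are expected to be non-negative")
    if pqr == 0:
        return "identical"
    if pqr < 1:
        return "below visibility threshold (headroom remains)"
    if pqr < 2:
        return "about 1 JND (barely distinguishable, no headroom)"
    if pqr < 5:
        return "excellent to good"
    if pqr < 10:
        return "good to fair"
    return "poor to bad"
```

A different boundary policy, such as rounding to the nearest whole PQR before the lookup, would be equally defensible.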

Difference Mean Opinion Score measurements

The perceptual contrast difference map produced by an instrument's human vision system model contains information on differences viewers will perceive between reference and test videos. As a result, devices can predict how viewers would score the test videos if they evaluated the video content using methods described in ITU-R BT.500. In particular, predicted Difference Mean Opinion Score (DMOS) values for test videos can be generated. And, unlike testing with people, results can be produced for each frame in the test video sequence as well as the overall sequence.

ITU-R BT.500-11 describes several methods for the subjective assessment of television picture quality. While they differ in the manner and order of presenting reference and test videos, they share characteristics for scoring video and analyzing results.

In methods that compare both reference and test videos, viewers grade the videos separately. A grading scale is used to collect opinion scores from each viewer participating in a test. Subjective evaluations typically involve groups of around two dozen viewers. These scores are averaged to create the Mean Opinion Score (MOS) for the evaluated videos. The difference between the MOS for the reference video sequence and the MOS for the test video sequence is then computed, generating a DMOS for each test sequence. The DMOS value for a particular test video sequence represents the subjective picture quality of the test video relative to the reference video used in the evaluation.
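The arithmetic behind a subjective DMOS can be sketched in a few lines of Python. The example assumes a 0-to-100 grading scale on which higher scores mean better quality, and it takes the difference as reference MOS minus test MOS so that larger DMOS values indicate greater degradation. Both the scale and the sign convention are assumptions made for this illustration, and the viewer scores are invented sample data.

```python
from statistics import mean
from typing import Sequence


def mean_opinion_score(scores: Sequence[float]) -> float:
    """Average the opinion scores collected from all viewers."""
    if not scores:
        raise ValueError("at least one viewer score is required")
    return float(mean(scores))


def difference_mos(
    reference_scores: Sequence[float], test_scores: Sequence[float]
) -> float:
    """DMOS under the assumed convention: reference MOS minus test MOS."""
    return mean_opinion_score(reference_scores) - mean_opinion_score(test_scores)


# Invented scores from four viewers (a real panel would be much larger).
reference_scores = [90.0, 85.0, 95.0, 90.0]  # MOS = 90
test_scores = [60.0, 70.0, 65.0, 65.0]       # MOS = 65
dmos = difference_mos(reference_scores, test_scores)
```

With these sample scores the DMOS is 25, meaning the test video was rated 25 points below its reference on the assumed scale.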

Before viewers evaluate any video, they are shown training video sequences that demonstrate the range and types of impairments they will assess in the test. ITU-R BT.500 recommends that these video sequences be different from the video sequences used in the test, but of comparable sensitivity. Without the training session, viewers' assessments would vary widely and change during the test as they saw videos of different quality.

Configuring predicted DMOS measurements

The considerations about display technologies and viewing conditions described for PQR measurements also apply to configuring DMOS measurements. In addition, DMOS measurements have a configuration parameter related to training sessions. The training session held before the actual subjective evaluations ensures consistent scoring by aligning viewers on the “best case” and “worst case” video quality, establishing the range of perceptual contrast differences viewers will see in the evaluation. The worst case training sequence response configuration parameter performs the same function in a DMOS measurement.

This parameter is a generalized mean of the perceptual contrast differences between the best case and worst case training video sequences associated with the DMOS measurement. This generalized mean, called the Minkowski metric or k-Minkowski metric, is calculated by performing a perceptual-based picture quality measurement, either PQR or DMOS, using the best case video sequence as the reference video and the worst case video sequence as the test video in the measurement.
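For readers unfamiliar with the term, a generalized (Minkowski) mean with exponent k can be sketched as follows. The article does not state which exponent an instrument uses or exactly which values are pooled, so the function name `minkowski_mean`, the parameter `k` and the sample values are all assumptions for illustration.

```python
from typing import Sequence


def minkowski_mean(values: Sequence[float], k: float) -> float:
    """Generalized mean: (average of |v|**k) ** (1/k), for k > 0."""
    if not values:
        raise ValueError("at least one value is required")
    if k <= 0:
        raise ValueError("k must be positive")
    return (sum(abs(v) ** k for v in values) / len(values)) ** (1.0 / k)


# Hypothetical perceptual contrast differences: mostly zero, one large value.
differences = [0.0, 0.0, 0.0, 4.0]
arithmetic_like = minkowski_mean(differences, 1)  # k = 1 is the mean of |v|
rms_like = minkowski_mean(differences, 2)         # k = 2 is the RMS value
```

Raising k gives more weight to the largest differences, so a single strongly visible impairment pulls the pooled value up more than it would in a plain average.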

Instruments offer preconfigured DMOS measurements that contain different values for the worst case training sequence response parameter, determined by using video sequences appropriate for the measurement. These serve as templates for creating custom measurements that more precisely address a specific application's characteristics and requirements for picture quality evaluation.

Interpreting DMOS measurements

Figure 4 shows a typical DMOS measurement. In the preconfigured DMOS measurements, values in the 0-20 range indicate test video that viewers would rate as excellent to good relative to the reference video. Results in the 21-40 range correspond to viewers' subjective ratings of fair to poor quality video. DMOS values above 40 indicate the test video has poor to bad quality relative to the reference video.
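These ranges translate directly into a simple lookup. In the sketch below, the function name `interpret_dmos` and the label strings are hypothetical, and assigning fractional values between 20 and 21 to the middle band is an assumption, since the article gives whole-number ranges only.

```python
def interpret_dmos(dmos: float) -> str:
    """Map a predicted DMOS value to the rating ranges described above.

    Assumption: values up to and including 20 fall in the first band,
    values up to and including 40 in the second, and the rest in the third.
    """
    if dmos <= 20:
        return "excellent to good"
    if dmos <= 40:
        return "fair to poor"
    return "poor to bad"
```

Because the DMOS scale is relative to the configured training sequences, a label produced this way is meaningful only alongside the measurement configuration that produced the value.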

DMOS measurements predict the DMOS values viewers would give the reference and test videos used in the measurement if they evaluated these videos in a subjective evaluation conducted according to procedures defined in ITU-R BT.500.

The same test videos can receive different DMOS values from different viewer audiences, depending on the video sequences used to train the viewers. Similarly, DMOS measurements configured with the same display technology and viewing conditions can produce different results if they are configured with different worst case training sequence responses.

In this sense, the DMOS measurement is a relative scale. The DMOS value depends on the worst case training sequence response used to configure the measurement, just as the results of the associated ITU-R BT.500 subjective evaluation depend on the video sequences used to train the viewing audience. When comparing DMOS measurement results, evaluators need to verify that the measurements use the same display technologies, viewing conditions and worst case training sequence response parameters.
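One way to enforce this check in an automated workflow is to store the configuration with each result and compare configurations before comparing values. The sketch below is only illustrative: the `DmosConfig` class, its field names and the sample values are hypothetical and do not correspond to any instrument's actual settings.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DmosConfig:
    """Hypothetical record of the settings that shape a DMOS result."""
    display_model: str
    viewing_conditions: str
    worst_case_response: float


def config_mismatches(a: DmosConfig, b: DmosConfig) -> List[str]:
    """List the settings that differ between two measurement configurations."""
    return [
        f.name
        for f in fields(DmosConfig)
        if getattr(a, f.name) != getattr(b, f.name)
    ]


# Invented example: same display and viewing setup, different training response.
lab_a = DmosConfig("LCD", "dim room", 30.0)
lab_b = DmosConfig("LCD", "dim room", 45.0)
mismatches = config_mismatches(lab_a, lab_b)
```

An empty list means the two DMOS results were produced under the same configuration and can be compared directly; here the list names the worst case response as the setting that differs.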

The DMOS measurement is an excellent choice for picture quality evaluation teams needing to understand and quantify how differences between a reference and test video degrade subjective video quality. The PQR measurement complements the DMOS measurement by helping these teams determine if viewers can notice this difference, especially near the visibility threshold.

PSNR measurements

To calculate a PSNR value, an instrument computes the root mean squared (RMS) difference between the reference and test video and divides the peak signal value by this difference; the resulting ratio is conventionally expressed in decibels. It computes the PSNR value for every frame in the test video and for the entire video sequence. In PSNR measurements, as the difference between the reference and test video increases, the PSNR measurement result decreases.
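The calculation can be sketched in Python using the conventional decibel form, 20 times the base-10 logarithm of the peak value divided by the RMS difference. The peak of 255 assumes 8-bit video, frames are represented as flat lists of pixel values, and the function names are hypothetical; an instrument may pool pixels or report units differently.

```python
import math
from typing import Sequence


def rms_difference(reference: Sequence[float], test: Sequence[float]) -> float:
    """Root mean squared pixel difference between two equal-length signals."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((r - t) ** 2 for r, t in zip(reference, test))
    return math.sqrt(squared / len(reference))


def psnr_db(
    reference: Sequence[float], test: Sequence[float], peak: float = 255.0
) -> float:
    """PSNR in decibels; infinite when the two signals are identical."""
    rms = rms_difference(reference, test)
    if rms == 0.0:
        return math.inf
    return 20.0 * math.log10(peak / rms)


def sequence_psnr_db(
    reference_frames: Sequence[Sequence[float]],
    test_frames: Sequence[Sequence[float]],
    peak: float = 255.0,
) -> float:
    """PSNR over a whole sequence, pooling the pixels of every frame."""
    ref_pixels = [p for frame in reference_frames for p in frame]
    test_pixels = [p for frame in test_frames for p in frame]
    return psnr_db(ref_pixels, test_pixels, peak)


# One flat four-pixel frame with a mild and a heavy impairment.
reference_frame = [100.0, 100.0, 100.0, 100.0]
mild_frame = [101.0, 99.0, 101.0, 99.0]    # RMS difference of 1
heavy_frame = [110.0, 90.0, 110.0, 90.0]   # RMS difference of 10
mild_psnr = psnr_db(reference_frame, mild_frame)
heavy_psnr = psnr_db(reference_frame, heavy_frame)
```

The heavily impaired frame yields the lower PSNR, which matches the inverse relationship described above.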

Combining PSNR measurements with the perceptual-based measurements offers unique insight into the impact of differences between the reference and test videos. Figure 5 shows a comparison of a PSNR measurement in Mean Absolute LSBs units (solid blue line) and a DMOS measurement (dotted magenta line). The PSNR measurement shows when differences occur between the two video sequences. The DMOS measurement shows the perceptual impact of these differences.

In these comparison graphs, evaluation teams can see how differences do, or do not, impact perceived quality. They can see how adaptation in the visual system affects viewers' perception. For example, a large transition in average luminance during a scene change can mask differences. Comparing the difference map created in the PSNR measurement and the perceptual contrast difference map created in a PQR or DMOS measurement can reveal problem regions within the video field or frame. These comparisons can help engineers more easily map visual problems to hardware or software faults.

Conclusion

Engineering and quality assurance teams need to perform frequent, repeated and accurate picture quality assessments to diagnose picture quality problems; optimize product designs; qualify video equipment; optimize video system performance; and produce, distribute and repurpose high-quality video content. They cannot afford the time and expense associated with recruiting viewers, configuring tests and conducting subjective viewer assessments. They need objective picture quality measurements that can make these assessments more quickly than subjective evaluation and at a lower cost. However, these objective measurements should match subjective evaluations as closely as possible.

Full-reference objective picture quality measurements address these requirements. The perceptual-based DMOS and PQR measurements offer results well matched to subjective evaluations. Over a wide range of impairments and conditions, DMOS measurements can help evaluation teams determine how differences between reference and test videos can affect subjective quality ratings. PQR measurements can help these teams determine to what extent viewers will notice these differences, especially for applications that place a premium on high-quality video.

Richard Duvall is the Americas video marketing manager for Tektronix.