FACE time

In video compression algorithms, the quantization parameter is typically adapted based on overall bit usage and the relative complexity of each region in the picture. However, such complexity-based rate control algorithms do not account for the fact that perceived video quality is more sensitive to certain objects, such as human faces, than to others. To improve overall perceptual quality, it is important to classify human faces as regions of interest (ROI) and preserve as much of their detail as possible. The challenge is to develop a reliable algorithm that works in real-time implementations. This article explores a low-complexity system that can run on a single-core DSP as part of an encoder implementation.

The proposed system is a low-complexity, color-based skin tone detector that classifies skin tone macroblocks (MBs; an MB is a 16 × 16 block of pixels) as ROI MBs and non-skin tone macroblocks as non-ROI MBs. The classification is based on empirical thresholds applied to the mean of the color components. The threshold values were defined after extensive training on material covering a wide range of ethnicities. With this classification and a modified rate control (RC) that can smoothly assign different levels of quality, we can increase visual quality in human faces. The new RC assigns a lower quantization parameter (QP) to ROI areas than to non-ROI areas, while maintaining the overall bit budget per frame.
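As a minimal sketch of the classification step, the following Python function labels a macroblock from the mean of its chroma components. The threshold values (Cb between 77 and 127, Cr between 133 and 173) are a range commonly cited in the skin-detection literature and serve only as placeholders; the empirically trained thresholds described above are not given in this article. The function names are hypothetical, and restricting the test to chroma is also an assumption.

```python
# Placeholder chroma thresholds (assumption, not the trained values
# described in the article).
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)


def block_mean(samples):
    """Integer mean of a flat list of 8-bit samples."""
    return sum(samples) // len(samples)


def is_skin_mb(cb_samples, cr_samples, cb_range=CB_RANGE, cr_range=CR_RANGE):
    """Return True when the mean Cb and mean Cr both fall inside their range."""
    cb_mean = block_mean(cb_samples)
    cr_mean = block_mean(cr_samples)
    return (cb_range[0] <= cb_mean <= cb_range[1]
            and cr_range[0] <= cr_mean <= cr_range[1])
```

Because only two means and four comparisons are needed per MB, a test of this kind stays cheap enough for a real-time encoder.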

To reduce false positives and missed MBs, a detection refinement is applied using erosion and dilation algorithms. These morphological algorithms use the classification of neighboring MBs to fill holes (missed MBs) and find isolated blocks (false positives). False-positive ROI MBs waste valuable bits, while missed ROI MBs cause visibly uneven quality within a region.

Erosion finds false positives and marks them as non-ROI. Dilation does the opposite: it finds holes in skin regions (such as the eyes or mouth in face regions) and marks them as ROI. (See Figure 1.)
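The refinement can be sketched on a per-MB boolean map as follows. This is a neighbor-count variant written for illustration, not a textbook morphological operator with a structuring element, and the two count thresholds (`min_neighbors=2` for erosion, `min_neighbors=6` for dilation) are assumptions rather than values from our implementation. All eight neighbors are used here.

```python
def count_roi_neighbors(roi, row, col):
    """Count ROI MBs among the (up to) eight neighbors of roi[row][col]."""
    rows, cols = len(roi), len(roi[0])
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols and roi[r][c]:
                total += 1
    return total


def erode(roi, min_neighbors=2):
    """Clear ROI MBs that have fewer than min_neighbors ROI neighbors."""
    return [[roi[r][c] and count_roi_neighbors(roi, r, c) >= min_neighbors
             for c in range(len(roi[0]))]
            for r in range(len(roi))]


def dilate(roi, min_neighbors=6):
    """Set non-ROI MBs that have at least min_neighbors ROI neighbors."""
    return [[roi[r][c] or count_roi_neighbors(roi, r, c) >= min_neighbors
             for c in range(len(roi[0]))]
            for r in range(len(roi))]


def refine(roi):
    """Remove isolated detections first, then fill holes."""
    return dilate(erode(roi))
```

With these placeholder thresholds, a lone detection far from any skin region is cleared, and a non-skin MB surrounded by skin MBs (an eye or mouth, for example) is filled in.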

Two versions of erosion and dilation were implemented: one for preprocessing, in which all the MBs of a frame are classified as ROI or non-ROI before the frame is encoded, and one merged into the encoder. In preprocessing, we use the skin information of all neighboring MBs to decide whether an MB's skin classification is a false positive, a hole or correct, and set the ROI/non-ROI classification accordingly. When the erosion and dilation algorithms are merged into the encoder, only the skin information of the top, left, top-left and top-right neighboring MBs is available for refinement decisions, but this version is suitable for low-latency applications. (See Figure 2.)
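The in-encoder variant can be illustrated with a decision that looks only at the four causal neighbors, which are the MBs already visited in raster order. The function name and the thresholds `keep_min` and `fill_min` are hypothetical, and using the raw (unrefined) classification of the neighbors is an assumption made to keep the sketch simple.

```python
# Offsets of the neighbors available in raster-scan order:
# top-left, top, top-right, left.
CAUSAL_OFFSETS = ((-1, -1), (-1, 0), (-1, 1), (0, -1))


def refine_causal(skin, row, col, keep_min=1, fill_min=3):
    """Refine the classification of one MB from its causal neighbors only.

    skin is a 2-D list of booleans holding the raw skin classification.
    A skin MB is kept if at least keep_min causal neighbors are skin;
    a non-skin MB is filled if at least fill_min causal neighbors are skin.
    """
    rows, cols = len(skin), len(skin[0])
    count = 0
    for dr, dc in CAUSAL_OFFSETS:
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols and skin[r][c]:
            count += 1
    if skin[row][col]:
        return count >= keep_min
    return count >= fill_min
```

In this sketch, the first skin MB of any region in raster order has no causal support and is dropped, which is one example of the accuracy given up in exchange for low latency.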

In addition to the erosion/dilation algorithms, an MB activity gradient threshold is used to reduce the number of false positives, especially when videos contain many small faces (e.g., faces in a crowd). Background faces in a crowd are not candidates for ROI treatment.

An implementation based on 8 × 8 pixel block detection for luma and 4 × 4 pixel block detection for chroma components (for the 4:2:0 video format) improves the algorithm's precision compared with 16 × 16 pixel block detection. The logic is that if two or more of the four blocks in an MB are classified as skin blocks, the complete MB is marked as a skin MB.
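The two-of-four vote can be sketched as follows for a 4:2:0 macroblock, where the 16 × 16 luma plane splits into four 8 × 8 blocks and each 8 × 8 chroma plane splits into four 4 × 4 blocks. The per-block test `is_skin_block` is a hypothetical callback supplied by the caller, since the actual thresholds are not given here.

```python
def quadrants(plane):
    """Split a square 2-D list into four equal quadrants in raster order."""
    half = len(plane) // 2
    blocks = []
    for r0 in (0, half):
        for c0 in (0, half):
            blocks.append([row[c0:c0 + half] for row in plane[r0:r0 + half]])
    return blocks


def mean(block):
    """Integer mean of all samples in a 2-D block."""
    values = [v for row in block for v in row]
    return sum(values) // len(values)


def mb_is_skin(y_mb, cb_mb, cr_mb, is_skin_block, min_votes=2):
    """Mark the MB as skin when at least min_votes of its four blocks are skin.

    y_mb is 16x16; cb_mb and cr_mb are 8x8 (4:2:0).
    is_skin_block(y_mean, cb_mean, cr_mean) returns True or False.
    """
    votes = 0
    for y, cb, cr in zip(quadrants(y_mb), quadrants(cb_mb), quadrants(cr_mb)):
        if is_skin_block(mean(y), mean(cb), mean(cr)):
            votes += 1
    return votes >= min_votes
```

Voting per quadrant lets an MB that is only half covered by a face still count as ROI, which a single 16 × 16 mean could miss.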

We also implemented a preprocessing decision that excludes frames with too many MBs marked as ROI, in which case frame bit redistribution is pointless. If more than 30 percent of a frame is detected as skin, all of its MBs are re-marked as non-ROI.
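A minimal sketch of this frame-level check, with a hypothetical function name and integer arithmetic to avoid rounding issues at the 30 percent boundary:

```python
def apply_frame_limit(roi, max_percent=30):
    """Return the ROI map, or an all-False map if too much of it is ROI.

    roi is a 2-D list of booleans, one entry per MB. The map is reset only
    when strictly more than max_percent of the MBs are marked.
    """
    total = sum(len(row) for row in roi)
    marked = sum(1 for row in roi for flag in row if flag)
    if marked * 100 > max_percent * total:
        return [[False] * len(row) for row in roi]
    return [list(row) for row in roi]
```

When most of the frame is skin, there are too few non-ROI MBs left to take bits from, so the frame falls back to ordinary rate control.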

Finally, to reduce processing cycles and increase channel density per core, a decimation process was implemented. In this process, we skip some pixel values when computing the mean of the color component blocks. For luma components, instead of taking the mean of 64 8-bit pixels, we decimate in steps of four and take the mean of only four 8-bit pixels per 8 × 8 block. For the 4 × 4 chroma component blocks, the decimation step used is four. Decimation decreases ROI classification precision, but it is a fair tradeoff when preprocessing multiple HD channels on a single core.
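The decimated mean can be illustrated as below. The sampling grid, which starts at the top-left pixel and advances by `step` in both directions, is an assumption; the exact sampling positions are not specified in this article.

```python
def decimated_mean(block, step):
    """Integer mean of the samples taken every `step` pixels in both directions."""
    samples = [block[r][c]
               for r in range(0, len(block), step)
               for c in range(0, len(block[0]), step)]
    return sum(samples) // len(samples)


# An 8x8 luma block sampled with step 4 uses pixels (0,0), (0,4), (4,0), (4,4).
luma = [[r * 8 + c for c in range(8)] for r in range(8)]
approx = decimated_mean(luma, 4)   # mean of 4 samples
exact = decimated_mean(luma, 1)    # mean of all 64 samples
```

For an 8 × 8 block this replaces 64 loads and additions with four, at the cost of a mean that can differ from the exact one. On this particular grid, a 4 × 4 block with a step of four reduces to a single sample.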

Figure 3 and Figure 4 show a visual quality (VQ) comparison with and without ROI detection as part of an H.264 encoder. Figure 3 was encoded at a low bit rate to emphasize the VQ difference. This sequence has many small moving objects (i.e., water drops), which drain the rate control's bit budget. This results in a noticeable visual degradation of the face in the foreground when the ROI RC modification is not applied. Figure 4 shows content similar to that found in video conferencing, in which the background is usually static and faces are the critical information to transmit.

For better user perception, the video market currently wants a low-complexity implementation of skin tone detection with highly accurate classification. ROI detection can be implemented in a video frame preprocessing stage or, with less accuracy, merged into a standard video codec. A low-complexity implementation offers fast decisions (fewer cycles) on whether an MB is part of an ROI area. These fast decisions permit real-time ROI detection on low-power processors in high channel density scenarios. Additionally, a low-complexity ROI implementation helps encoders improve overall video quality in numerous applications, such as video broadcast, video conferencing, video security and smart cameras.

Paula Carillo is a video applications engineer in the multicore and media infrastructure group, and Akira Osamoto is a software engineer specializing in video compression algorithms and video processing at Texas Instruments.