Latest from TV Technology in Fingerprint

A/V Fingerprinting—Transporting, Binding and Applications

Mary C. Gruszka — Sat, 25 Oct 2014 14:00:00 +0000

Audio and video fingerprints are bits of data derived from certain characteristics of each type of content. The algorithms for generating fingerprints, described in Part 1 (see “‘Book ’Em, Danno’: A/V Fingerprinting”), have been developed as part of a proposed SMPTE standard from the SMPTE Drafting Group (24TB-01 DG Lip-sync). As of this writing, the standard is going through the approval cycle.

Other parts of the proposed standard deal with the container for the A/V fingerprint data (which packetizes the data), as well as how to transport and bind it to various digital formats.

For each video frame (or field, in interlaced formats), there is one byte for a video fingerprint and several bytes for the audio fingerprint, the number depending upon the video frame rate. For each frame, the video and audio fingerprints associated with that frame are bundled together into a defined container, with one container per frame.

The container is additional data that wraps around the fingerprint data and contains information as to protocol version, sequence count, status bits, ID descriptor and checksum.
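As a rough illustration of the container idea, the Python sketch below wraps one frame's fingerprints with the fields named above. The field widths, their order, the `pack_container` and `checksum_ok` names and the simple modulo-256 checksum are all assumptions made for illustration; the actual layout is defined by the proposed SMPTE standard and is not reproduced here.

```python
import struct

def pack_container(version, sequence, status, id_descriptor, video_fp, audio_fp):
    """Wrap one frame's fingerprints in a hypothetical container.

    Assumed layout (NOT the SMPTE layout): version (1 byte), sequence
    count (1 byte), status bits (1 byte), ID descriptor (2 bytes),
    audio length (1 byte), video fingerprint (1 byte), the audio
    fingerprint bytes, then a 1-byte checksum.
    """
    header = struct.pack(">BBBHB", version, sequence & 0xFF, status,
                         id_descriptor, len(audio_fp))
    body = header + bytes([video_fp]) + bytes(audio_fp)
    checksum = sum(body) & 0xFF  # assumed: simple modulo-256 sum
    return body + bytes([checksum])

def checksum_ok(container):
    """Verify the assumed modulo-256 checksum on a packed container."""
    return sum(container[:-1]) & 0xFF == container[-1]

# One frame: a single video byte plus four (made-up) audio bytes.
packet = pack_container(version=1, sequence=7, status=0,
                        id_descriptor=0x0102, video_fp=120,
                        audio_fp=b"\x5a\xa5\x3c\xc3")
```

Because the fingerprints travel as an opaque payload inside the wrapper, the same bytes could be handed to any of the transports described below without being re-encoded.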

Once the fingerprint data is containerized, it can be bound to the audio and video and sent along with them to wherever they are going. The fingerprint by design is not part of the audio or video content itself. Rather, it’s a separate piece of data, and the audio and video content remains unchanged.

GENERIC MECHANISM
Currently the proposed SMPTE standard covers fingerprint binding in SDI, MPEG-2 transport stream and Internet UDP/IP. Work is continuing on binding in file-based formats.

For SDI transport, the fingerprint container would be carried in a standard ST291 vertical ancillary (VANC) packet.

A functional view of how A/V fingerprints can be used to determine lip sync errors. Courtesy of SMPTE.

“This is a generic mechanism to take a chunk of data and attach it to the stream,” said Paul Briscoe, chair of the SMPTE Drafting Group (24TB-01 DG Lip-sync). A common use for VANC is for embedded digital audio; the fingerprint data would be another packet in this space. The fingerprint ancillary stream will have its own unique registered data identifier (DID) and secondary data identifier (SDID).

For MPEG-2 transport streams, the fingerprint container would be inserted as private user data, as defined by the MPEG-2 TS standard. It would have its own unique packet identifier (PID), an MPEG-2 registration address.

With both SDI and MPEG-2, the fingerprint data is inherently bound to the video and audio essence, with MPEG-2 via maps. However, with IP transport, fingerprint data is indirectly bound to the essence.

For IP transport, the fingerprint data is put into a raw user datagram protocol (UDP) packet with an IP address and an ID descriptor. These relate to the audio and video streams so that a receiver can pull them out and re-associate them with their corresponding audio and video.

UDP is the simplest type of Internet packet, according to Briscoe. It’s generic, one-way, transmit-only, and is not acknowledged by a receiver. It’s very possible that fingerprint packets could get lost or be received out of order, but as we’ll see in a minute, that’s generally not a problem.
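Since UDP offers no acknowledgment, a receiver has to notice loss and reordering on its own. The Python sketch below shows one way a receiver might use the container's sequence count for that. The `order_and_find_gaps` name, the `(sequence, payload)` tuple shape and the assumption that the count does not wrap around within the batch are all hypothetical and chosen for illustration only.

```python
def order_and_find_gaps(received):
    """Sort received (sequence, payload) pairs and list missing sequence numbers.

    Assumes, for illustration only, that the sequence count does not wrap
    around within the batch being examined.
    """
    if not received:
        return [], []
    ordered = sorted(received, key=lambda item: item[0])
    seen = {seq for seq, _ in ordered}
    first, last = ordered[0][0], ordered[-1][0]
    missing = [seq for seq in range(first, last + 1) if seq not in seen]
    return ordered, missing

# Packet 5 arrives before packet 3; packet 4 never arrives.
arrived = [(2, b"b"), (5, b"e"), (3, b"c"), (6, b"f")]
ordered, missing = order_and_find_gaps(arrived)
```

A receiver built this way can simply carry on with the packets it does have and note the gaps.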

FILE BINDING
Binding fingerprint data to files is more involved. “The file binding aspect is about to become the primary focus of the work,” Briscoe said. “We have a proposal that everyone feels is viable and is based on the fingerprinting standard we’re completing now. It’s basically a definition of how to bind those fingerprints to files.”

It’s important to note that the same fingerprint container is used for all binding applications. “The fingerprint bucket [the container] looks the same whether in SDI or in a file,” Briscoe said. “The payload is the same, so you can go between media very easily. You don’t have to take it apart and decode it [when going between different media].”

Briscoe compared transporting fingerprint data to a letter (the fingerprint data) inside an envelope (the container). That letter can be transported by FedEx, UPS or USPS Express Mail, for example, each with its own outside envelope, unique to each carrier. That outside envelope is analogous to the specific packet wrapper for SDI VANC, MPEG-2 TS or UDP/IP.

FREEDOM OF CHOICE
How the audio and video fingerprints get used in practice is not part of the proposed SMPTE standard. Nor does it include any consumer standards for bringing fingerprint applications to the home end-user, although liaisons have been established with CEA, DVB and other organizations, Briscoe said.

“We are building a toolset that people can use to build applications and leaving the feature-space stuff to innovative designers,” Briscoe said. “Accordingly, we are standardizing the bits that need to be nailed down. There may ultimately be utility in creating an Engineering Guideline or Recommended Practice, which informs people further. Should it become evident that we need to standardize other aspects, we would do so.”

Applications will be left to developers and manufacturers, but here’s a typical scenario:

Content that’s known to be in A/V sync is fingerprinted. The audio and video must already be in sync when they are fingerprinted, because this technology will not fix sync problems that were present in the source.

Once the fingerprints are derived they are bound to the source. At the content’s destination, fingerprints are derived from the locally received content, a set for video and another set for audio. Video and audio fingerprints are correlated to the source fingerprints. This establishes delay times for each type of signal. The difference in delay between audio and video is the lip sync error.
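The measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each fingerprint is treated as one integer per frame, delays are assumed to be non-negative whole frames, and a mean absolute difference stands in for whatever correlation function a real application would use. The `best_lag` name and all sample values are made up.

```python
def best_lag(reference, local, max_lag):
    """Return the lag (in frames) at which `local` best matches `reference`.

    A positive lag means the local sequence is delayed relative to the
    reference. Matching uses the mean absolute difference over the overlap,
    an assumed stand-in for a real correlation function.
    """
    best, best_score = 0, float("inf")
    for lag in range(max_lag + 1):
        pairs = list(zip(reference, local[lag:]))
        if not pairs:
            continue
        score = sum(abs(r - l) for r, l in pairs) / len(pairs)
        if score < best_score:
            best, best_score = lag, score
    return best

# Source fingerprints (made-up values, one per frame).
video_ref = [10, 200, 35, 90, 150, 20, 75, 240, 5, 60, 130, 45]
audio_ref = [3, 17, 9, 28, 1, 22, 14, 30, 6, 25, 11, 19]

# Fingerprints derived at the destination: video arrives 2 frames late,
# audio arrives 5 frames late.
video_local = [0] * 2 + video_ref
audio_local = [0] * 5 + audio_ref

video_delay = best_lag(video_ref, video_local, max_lag=8)
audio_delay = best_lag(audio_ref, audio_local, max_lag=8)
lip_sync_error_frames = audio_delay - video_delay  # audio lags video by 3 frames
```

Neither delay matters much by itself; the difference between the two is what an application would report or correct.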

What happens at this point would be up to the application. It can automatically make corrections, a useful feature in a home viewing system. It can warn an operator that there is an error and indicate what it is, but make no automatic corrections. Or it can correct sync and re-fingerprint before the signal is sent down the line.

Fingerprinting technology is pretty robust, as stress tests during the working group’s activities showed. “The fingerprint algorithm that we’ve chosen uses specific measures of video and audio that are intended to be minimally impacted by various processing, such as geometrical scaling, loss of detail and so on,” Briscoe said.

And if some fingerprints get lost along the way? Lip sync doesn’t generally change over the course of a program, Briscoe said. As long as some fingerprints are received, the correlation can still happen, and adjustments can be made if necessary.

“You don’t need a fingerprint for every frame,” Briscoe said. “You can lose fingerprints, [and when they return] you can start measuring again to see if there’s a change.”

Once the SMPTE standards are approved as expected, it will be interesting to see what applications result. It sounds like we may finally have a handle on how to detect and correct lip sync errors.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.

‘Book ’Em, Danno’: A/V Fingerprinting

Mary C. Gruszka — Mon, 22 Sep 2014 04:35:00 +0000

Audio and video are about to get fingerprinted. Despite much success and many valiant efforts to keep them aligned, often enough these miscreants head off at their own speed, drift apart and cause lip-sync problems.

Now it’s time to “book ’em, Danno” as a character on a classic TV cop show once said. It’s time for audio and video to get fingerprinted. So that no matter where they roam or what path they take, there could always be “someone” watching and, if needed, rein them in to alignment.

Over the years a number of different solutions have arisen to deal with A/V sync issues. Some are out-of-service processes that check signal paths for A/V sync before sending any content down the line. As the name implies, these can’t be used with live content.

Then there are in-service processes where lip sync could be measured with actual content, but there has been no compatibility among various techniques or manufacturers.

That’s why SMPTE has stepped in. The SMPTE Drafting Group (24TB-01 AHG Lip-sync) is close to completing its work on some of the standards needed for A/V fingerprinting, and is continuing its work on others.

Fig. 1: Simplified block diagram for audio and video fingerprinting and lip sync correction, showing the extent of the SMPTE standardization work. Courtesy of SMPTE.

MEASUREMENT OBJECTIVES
According to Paul Briscoe, chair of this SMPTE Drafting Group, who has also presented a webinar on the subject, the goal of the committee was to establish a standardized in-service measurement (without precluding out-of-service use) that would be interoperable among different manufacturers; leave content unmodified; be usable on-air within live content at any time; and be medium-agnostic, not caring how the media is moved around, in plant or out, through just about all kinds of processing. It needed to have low data rates, to avoid using DSPs, and not be too complex, to keep implementation costs reasonable.

The SMPTE group chose fingerprinting technology to accomplish these goals.

“The fingerprint is a summary description of a frame of video and the audio associated therewith,” Briscoe said. “It is a very compact bit of metadata that is calculated directly from the audio and video.”

To derive the fingerprint, certain simple characteristics are measured as they change over time from frame to frame or field to field of video, and sample to sample of digital audio.

“For each frame, we generate one set of fingerprints, one each from audio and video,” Briscoe said. “These are bundled together for each frame and transmitted along with, but not within, the video and audio. In actual fact, the fingerprints will come after the frame for which they have been calculated, but this isn’t a problem as we are not interested in the relationship between a given frame and its fingerprint, but the relationship between video and audio that it represents. The correlation function at the receiving end will figure it out.”

The correlation function that Briscoe referred to is part of an application (not part of the SMPTE standard) that, at any point along a signal path, can take a look at the incoming audio and video, create local fingerprints and correlate them with the original reference fingerprints to determine if there is any lip sync error. We’ll go into this more later, but for now, let’s return to the metadata.

“For every video frame, we produce one video fingerprint byte, and a variable number (format dependent) of audio fingerprint bytes,” Briscoe said. “These are bundled together into a fingerprint container, and that’s the metadata that is sent along with the audio and video over SDI, MPEG or IP. The same containers will be used in the file-based system.”

DERIVING FINGERPRINTS
Let’s see how each of the fingerprints is derived.

Multichannel audio is first downmixed to mono, although individual channels, including multiple languages, could be fingerprinted as well. Up to 32 audio fingerprints could be associated with corresponding video. According to Briscoe, operating practice will determine which audio will be fingerprinted.

The audio fingerprinting process works on digital audio samples associated with a frame of video or field depending on format.

“We use 16-bit samples at 48 kHz sample rate,” Briscoe said. Any other sample rate is converted to 48 kHz, and any digital audio word greater than 16 bits gets truncated to 16.
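For readers who want to see what truncating a wider word looks like, here is a minimal Python sketch. It assumes signed integer PCM samples and that truncation means dropping the least significant bits; the `truncate_to_16_bits` name is hypothetical, and sample rate conversion is not shown.

```python
def truncate_to_16_bits(sample, source_bits):
    """Drop the least significant bits of a signed PCM sample wider than 16 bits."""
    if source_bits <= 16:
        return sample
    return sample >> (source_bits - 16)  # arithmetic shift keeps the sign

max_24 = 0x7FFFFF    # largest positive 24-bit sample
min_24 = -0x800000   # most negative 24-bit sample

max_16 = truncate_to_16_bits(max_24, 24)
min_16 = truncate_to_16_bits(min_24, 24)
```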

The samples are fed into two processing blocks—a mean detector and an envelope detector. “The mean detector is a long time-constant process, which outputs a value corresponding to the long-term average of the audio level,” Briscoe said. “The envelope detector is a short time-constant process, which outputs a value corresponding to the [near] instantaneous value of the audio level.”

For each sample of audio, the mean and envelope values are compared and a bit is produced: a “1” if the mean is greater than the envelope, and a “0” otherwise. All the ones and zeroes are accumulated for the entire duration of the video frame.
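The comparison step can be sketched as follows. This is a minimal illustration, not the algorithm in the proposed standard: both detectors are modelled here as one-pole smoothers of the absolute sample value, and the `comparator_bits` name and the two smoothing coefficients are assumptions picked only to make one time constant long and the other short.

```python
def comparator_bits(samples, mean_coeff=0.001, env_coeff=0.1):
    """Produce one bit per sample: 1 if the long-term mean exceeds the envelope.

    Both detectors are modelled as one-pole smoothers of the absolute sample
    value; the coefficients are illustrative assumptions, not standard values.
    """
    mean = 0.0
    envelope = 0.0
    bits = []
    for s in samples:
        level = abs(s)
        mean += mean_coeff * (level - mean)          # long time constant
        envelope += env_coeff * (level - envelope)   # short time constant
        bits.append(1 if mean > envelope else 0)
    return bits

# A loud burst followed by silence: the envelope tracks the burst quickly,
# then falls below the slowly decaying mean once the audio goes quiet.
samples = [10000] * 200 + [0] * 200
bits = comparator_bits(samples)
```

In this toy input the bits are zero throughout the burst and flip to one shortly after the silence begins, so the bit pattern follows the shape of the audio rather than its absolute level.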

“We tally up all of the ones, and then decimate them to a smaller number using an algorithm that considers the video frame rate, with the goal being to establish approximately 1 millisecond of accuracy,” Briscoe said.

The decimator reduces the data rate from 48,000 bits per second (one comparison bit per 48 kHz sample) to around 900 bits per second. This is the audio fingerprint. The reduced data rate allows it to be transported more easily and correlated downstream more quickly.
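To put numbers on this, the Python sketch below works through the arithmetic for an assumed 30-frames-per-second format and shows one possible decimator. The block-majority rule, the `decimate_bits` name and the 30 fps figure are assumptions for illustration; the article does not describe the decimation algorithm used in the proposed standard.

```python
def decimate_bits(bits, out_bits):
    """Reduce a frame's comparator bits to `out_bits` bits by block majority vote.

    The block-majority rule is an assumption for illustration; the actual
    decimation algorithm is not described in the article.
    """
    n = len(bits)
    result = []
    for k in range(out_bits):
        start = k * n // out_bits
        end = (k + 1) * n // out_bits
        block = bits[start:end]
        result.append(1 if 2 * sum(block) > len(block) else 0)
    return result

SAMPLE_RATE = 48_000      # samples per second, from the article
TARGET_BIT_RATE = 900     # approximate bits per second, from the article
FRAME_RATE = 30           # assumed example frame rate

samples_per_frame = SAMPLE_RATE // FRAME_RATE    # 1600 comparator bits per frame
bits_per_frame = TARGET_BIT_RATE // FRAME_RATE   # 30 fingerprint bits per frame
ms_per_bit = 1000 / TARGET_BIT_RATE              # roughly 1.11 ms per bit

# A frame whose first half is all zeroes and second half all ones.
frame_bits = [0] * 800 + [1] * 800
fingerprint_bits = decimate_bits(frame_bits, bits_per_frame)
```

At 900 bits per second each fingerprint bit spans about 1.1 milliseconds, which lines up with the accuracy target Briscoe mentions, and 30 bits per frame comes to about four bytes.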

On the video side, “we look at gross changes in the picture from frame to frame [for progressive, or field to field, if interlaced],” Briscoe said. “The fingerprint is generated by [simple horizontal] downsampling the video to a common low-resolution format, which is used for all formats of picture.”

The picture is scaled to a common SD resolution; then 960 specific points in a central window of the image are sampled for the luminance value. These 960 points are compared to 960 points from the prior frame to identify any that have changed by a value greater than 32.

“A count of those is kept, and the result [zero to 960], is divided by four to render it to a single byte for transport,” Briscoe said. “This is the video fingerprint.”
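The counting step is simple enough to sketch in Python. The version below assumes the scaling and sampling have already happened, so each frame arrives as a list of 960 luminance values. The `video_fingerprint` name and the use of integer division are assumptions, since the article does not say how the division by four is rounded.

```python
def video_fingerprint(current, previous, threshold=32):
    """Return a one-byte fingerprint from two lists of 960 luminance samples."""
    if len(current) != 960 or len(previous) != 960:
        raise ValueError("expected 960 luminance samples per frame")
    changed = sum(1 for c, p in zip(current, previous) if abs(c - p) > threshold)
    return changed // 4  # 0..960 maps to 0..240, which fits in one byte

# Half of the sampled points change by more than 32 between frames.
previous_frame = [16] * 960
current_frame = [16] * 480 + [100] * 480
fingerprint = video_fingerprint(current_frame, previous_frame)
```

A static picture yields a fingerprint of zero and a complete scene change yields 240, so the byte acts as a coarse measure of how much the picture moved from one frame to the next.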

Once we have the fingerprint metadata, how does it travel with the audio and video and not get lost along the way? And how is it used to correct lip sync errors? Tune in next time.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.