
The Zen Behind Audio Levels, Lip-Sync

It’s time to revisit the state of broadcast audio. In the Jan. 9, 2008, issue of TV Technology, I wrote about things that were getting better, including the growing number of channels broadcasting in HD and surround sound, and the significant reduction in the number of channels with serious audio problems. Some big improvements!

In the same issue, I also noted that audio levels were still not under control. And today I need to report, sadly, that lip-sync problems appear to be getting worse, which is to say I’m seeing more errors and they seem to be getting a little, er, longer.

Over the next several months, I’m going to dig more deeply into these problems, and I’ll take a look at some possible solutions. But before I do that, I want to discuss, briefly, why these things matter. To get things started, I’m going to let my evil twin, Rave Moulton, the studly student of audio, have the floor for a second.


Rave says, “So you like to whine about how the level changes when you switch channels? You find it a little annoying? So what? Change the volume on your remote and move on. It’s no big deal. Like Supreme Court Justice Antonin Scalia said about Bush in Florida, ‘Get over it.’ And most people can barely discern lip-sync error. We’re talking a frame here, or maybe two, max. We’re gonna quibble about time drift of maybe one-twentieth of a second? Get real! That’s AV for complainers, for Nader-Wimps!”

So Rave, in his crass way, thinks that none of this is a big deal. And, believe it or not, he’s actually got a point.

I was talking to Dave Gardiner about this stuff. Dave, for those of you with a short memory, is audio engineering crew chief at WCVB in Boston (the ABC affiliate), and we were busy doing the usual whine-session (which is what passes for journalistic research in my neighborhood) about how screwed up everything is all the time. Dave made a very good point worth considering here, which is that content trumps quality. While obvious, this point is also profound. The content, the meaning, the quality of the actual story, the game, the drama or whatever, eclipses the surface values of production quality, definition, resolution and sonic and visual accuracy. As audio guru Neil Muncy used to say, “Nobody ever bought a record for the signal-to-noise ratio.”

Given the axiomatic truth of that idea, Rave’s point of view seems pretty reasonable. The emphasis has to be given to content, and in the greater scheme of things, our complaints seem to be pretty trivial.

But that in turn raises a really good question—what is the value of really good surface production quality? Does HD with 5.1 surround sound make a difference, relative to content? How much does improved surface quality bring to the table of the sensory experience and impact of good television?


I’ve noted before that low-resolution AV is often used for the dramatic effect of enhanced authenticity. In ENG, low-res AV is often all we’ve got, and we’ve learned to accept that many of the most important events being reported are captured only in fleeting, fragmentary low-res glimpses. At the same time, I think we’ve even learned to distrust high-resolution representations of such news events as obviously staged, and therefore probably false.

The place where this gap between low-res authenticity and high-res sensory engagement is being bridged is professional sports. Here, all the high-res big-budget virtues of staging are shown to their best effect, with the added virtues of authenticity via handheld cameras roving the sidelines and locker rooms, with coaches and players wired up for authentic commentaries.

So, maybe such sports are the best place to study the relationship between the values of high surface quality and powerful or at least authentic content.

The issue of channel-to-channel levels doesn’t play a big role here, but the change in levels between program and commercials certainly does. Further, in the various auto-racing sports (I’m thinking particularly of NASCAR and Formula One here), levels and authenticity present a big problem. For instance, it is probably not wise, healthy or even legal to play back “what it really sounds like” over a television set (or even in a home theatre). Unfortunately, this has led to really wimpy audio for such sports, to a point where a great deal of the authenticity is lost (for me, at least!).

But for other sports, the enveloping wash of crowd noise, well-balanced with announcer voice levels and in-action sounds plus sideline chatter, can be compelling and engaging. And, in a high-res context, when that balance isn’t achieved, the result is a flat, enervating and disengaged surface quality that really cuts into our ability to “get into” the event.


Lip-sync is another can of worms altogether. We’ll discuss the causes more fully in a future article, but suffice it to say that musicians begin to grumble about timing errors of about 10 ms. A frame of error is about 33 ms, and it is pretty easily noticed by lay viewers, particularly when the audio leads the video (which never happens in nature). Informally, I personally perceive lip-sync error magnitudes in three ranges: (a) “something’s not right,” 10 to 50 ms of error; (b) half-word error, 50 to 150 ms; and (c) whole-word error, more than 150 ms.
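For readers who like to see the arithmetic, here is a minimal Python sketch of those informal ranges. It is an illustration only, not a measurement method: the function names are made up for this example, the frame rate of 30000/1001 (roughly 29.97 frames per second) is an assumption behind the "about 33 ms" figure, and the handling of the exact 50 ms and 150 ms boundaries is my own choice. Errors under 10 ms are not labeled in the ranges above, so the sketch simply reports them as below the ranges.

```python
# Assumed frame rate: NTSC-style video at about 29.97 frames per second.
# The column only says a frame is "about 33 ms"; this rate is an assumption.
FRAME_RATE_FPS = 30000 / 1001


def frames_to_ms(frames, fps=FRAME_RATE_FPS):
    """Convert a lip-sync error expressed in frames to milliseconds."""
    return frames * 1000.0 / fps


def classify_lip_sync_error(error_ms):
    """Map an error in milliseconds onto the column's informal ranges.

    The sign is ignored here, although audio leading video tends to be
    more noticeable than audio lagging. Boundary handling (50 ms counted
    as "something's not right", 150 ms as "half-word error") is an
    assumption made for this sketch.
    """
    magnitude = abs(error_ms)
    if magnitude < 10:
        return "below the informal ranges"
    if magnitude <= 50:
        return "something's not right"
    if magnitude <= 150:
        return "half-word error"
    return "whole-word error"


if __name__ == "__main__":
    for frames in (1, 2, 5):
        ms = frames_to_ms(frames)
        print(f"{frames} frame(s) = {ms:.1f} ms: {classify_lip_sync_error(ms)}")
```

Under these assumptions, a single frame of error already sits in the "something's not right" range, and two frames push into half-word territory, which is why "a frame here, or maybe two" is not as harmless as it sounds.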

Now, these errors really destroy the willing suspension of disbelief. In the “something’s not right” range, we are distracted from the message by a cognitive anomaly that we can’t quite put our finger on. In the half- and whole-word error ranges, it is clear that what we’re hearing is not being spoken by the talking head on the screen. Nasty stuff.

How nasty? Imagine a presidential debate where a 150 ms delay was put on the camera covering just one of the candidates. I suspect the resulting cognitive disruption could actually tip the vote in a close election.

Back to sports for a second. As I write this, my beloved Boston Celtics are playing the Los Angeles Lakers in the NBA finals, and I am happily surrendering myself to the joys of couch potato land. Everything is fine while they are playing, but it all falls apart as soon as we cut to the announcers on-camera. During Game 1 they had at least a whole-word error. In Game 2, it looked like a little more than a half-word error. Who are these people? What are they talking about? Why are they on-camera? All you have to do is close your eyes, and it’s all fine.

Enough said!

Thanks for listening!

Dave Moulton is hard at work on loudspeakers and composing. You can complain to him about anything at his Web site,