The Mystery of the Set-top Box

Alert readers may recall that in my last column, I described the production requirements for using/setting dialnorm metadata values. The basic point I made was that dialnorm is not so much a specification as an active level calibration protocol that establishes a reference decode level of -31 dBFS LeqA (Line Mode operation) for dialogue.

So, instead of calibrating using a traditional "nominal" (or alignment) level such as +4 dBu or -20 dBFS with a sine wave at 1 kHz, we need to use the long-term Leq (A-weighted) of dialogue as our reference "nominal" signal level, and we tie it to -31 dBFS LeqA. This is all fine, and it makes excellent sense, better in many respects than the time-honored sine-wave-based "nominal" level. This is because most people adjust their television volume controls in an effort to normalize the level of speech between programs and channels. And some studies have shown that most programs tend to be played back at a "nominal" sound pressure level of 65 dBA SPL, which is the level of normal acoustic speech. It is an acoustic level that we all instinctively "know," amateur and professional alike. It works. Good idea.

In our present distribution system, approximately 70 percent of all television viewing households receive cable (i.e., around 70 million), and approximately 20 million subscribe to the digital tier of service and use some sort of set-top box to do the processing needed to enable a television to play back the range of programs the cable provider has made available. This set-top box takes in a range of analog and digital channels, processes them, and sends them on to the television, a VCR and/or a home theater receiver, as called for by the consumer.

The typical digital set-top box has multiple A/V outputs: an RF cable output (75-ohm coaxial) that carries the remodulated TV signal including audio, a so-called "line output" that includes a stereo pair of audio signals plus a composite video output, and a digital audio output. My particular box (a Scientific Atlanta Explorer 3100) also provides for S-video output.

The various audio outputs have a variety of applications, and each has different capabilities as well. Their behaviors rest on differing assumptions about the intended performance and capability of the system, depending on which output is selected. This is a serious potential booby trap.

THE RF OUTPUT

The most important output to keep in mind is the 75-ohm coaxial remodulated RF output signal. The RF output and its capabilities are significantly different from the line-level output, in that it wasn't designed to carry wide dynamic range signals.

In order to feed an RF modulator, the Dolby Digital decoder should operate in RF mode. RF mode introduces 11 dB of gain within the decoder itself and then employs compression and limiting (controlled by the metadata word compr, which is calculated in the encoder) to ensure that the output peaks do not exceed full-scale digital when being decoded. With the 11 dB gain, dialogue normalized to -34 dBFS (in each channel for line mode operation) is boosted to -23 dBFS (in each channel), yielding a signal where peaks can only go 6 dB or so above normalized speech peaks.
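The level bookkeeping for the two decoder modes is easy to lose track of, so here is a minimal sketch of the arithmetic in Python. The function and constant names are hypothetical (they are not part of any Dolby API); the numbers come straight from the text, and the 3 dB per-channel offset is simply the difference between the -31 and -34 dBFS figures quoted for a two-channel decoder.

```python
# Illustrative arithmetic only. Names are made up for this sketch;
# the values are the ones quoted in the column.
LINE_MODE_DIALOGUE_DBFS = -31.0  # reference decode level, Leq(A), line mode
RF_MODE_GAIN_DB = 11.0           # gain the decoder adds in RF mode
PER_CHANNEL_OFFSET_DB = -3.0     # -34 vs. -31 dBFS for a two-channel decoder


def dialogue_level_dbfs(rf_mode: bool, per_channel: bool = False) -> float:
    """Return the normalized dialogue level for the chosen decoder mode."""
    level = LINE_MODE_DIALOGUE_DBFS
    if rf_mode:
        level += RF_MODE_GAIN_DB
    if per_channel:
        level += PER_CHANNEL_OFFSET_DB
    return level


for rf in (False, True):
    mode = "RF" if rf else "line"
    print(mode, dialogue_level_dbfs(rf), dialogue_level_dbfs(rf, per_channel=True))
```

Run as written, this lists the line-mode pair (-31 and -34 dBFS) followed by the RF-mode pair (-20 and -23 dBFS).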

The RF modulator should be fed from the digital-to-analog converters (DACs) with a gain such that a full-scale signal corresponds to 50 kHz peak deviation, i.e., 6 dB above the normal maximum analog transmitter peak deviation of 25 kHz (most if not all digital set-tops only support monophonic sources). Only when these conditions are met will speech be at a similar level to analog broadcasts (NTSC) on the remodulated RF output of the set-top box. And, most importantly, all of this assumes that the incoming dialnorm value for the program is correct as well. This has some important implications, particularly for the setting of dialnorm.

THE GOOD OLD DAYS

In the good old analog days (including today as well), measurements established that speech, as typically transmitted (whether being controlled by a human being or by a broadcast limiter), has peaks that are in the range 1 to 3 dB below 100 percent modulation and has an Leq(A) of 17 dB below 100 percent modulation. The peak-to-Leq(A) ratio of typical dialogue tends to be around 15 dB, and 1-3 dB of headroom seemed like a good idea at the time. Hence the -17 dB Leq(A) value for NTSC analog dialogue level has coalesced into a working reference level.
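The -17 dB figure can be cross-checked against the RF-mode numbers quoted earlier. The sketch below is only a sanity check, not a description of any modulator's actual gain structure: it assumes that digital full scale maps to 50 kHz deviation, that 100 percent modulation is 25 kHz, and that the per-channel RF-mode dialogue level of -23 dBFS is the one that reaches the modulator. The function names are hypothetical.

```python
import math

# Values quoted in the column; the names are hypothetical.
FULL_SCALE_DEVIATION_KHZ = 50.0  # deviation for a full-scale digital signal
MAX_ANALOG_DEVIATION_KHZ = 25.0  # 100 percent modulation
RF_DIALOGUE_DBFS = -23.0         # per-channel dialogue level in RF mode


def db_ratio(a: float, b: float) -> float:
    """Amplitude ratio a/b expressed in decibels."""
    return 20.0 * math.log10(a / b)


# How far digital full scale sits above 100 percent modulation (about 6 dB).
full_scale_offset_db = db_ratio(FULL_SCALE_DEVIATION_KHZ, MAX_ANALOG_DEVIATION_KHZ)


def dbfs_to_db_re_full_modulation(level_dbfs: float) -> float:
    """Convert a dBFS level to dB relative to 100 percent modulation."""
    return level_dbfs + full_scale_offset_db


print(round(full_scale_offset_db, 2))
print(round(dbfs_to_db_re_full_modulation(RF_DIALOGUE_DBFS), 2))
```

Under those assumptions, RF-mode dialogue lands within a few hundredths of a dB of 17 dB below 100 percent modulation, which is where analog NTSC dialogue sits.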

Such a signal is suitable for mono low-power TVs. Assuming that a signal level of 17 dB below 100 percent modulation yields a viewer listening level of 65 dBA SPL, we can expect maximum levels of approximately 82 dBA SPL in such a situation. For a one-watt, four-inch mono speaker in a cheap TV, this is about as much as we can reasonably expect.

Meanwhile, stereo TV and home theaters have arrived on the scene. With such systems, the peak levels available to the consumer climb from the low 80 dB range to somewhere around 115 dB SPL in a full-bore surround system.

It doesn't do to just turn up the volume. If we took advantage of the peak capability of such a home theater system and set 100 percent modulation to occur at 100 dBA SPL, it would put dialogue at 83 dBA SPL, which would probably be intolerable for more than a few minutes (as in: "Mel! MEL!!! Turn that #$%*@ thing down! I can't stand it! I don't care about realistic explosions!").

Hence the line-level output on the set-top box: When the listener is using baseband line outputs, the expectation is that the Dolby Digital decoder will operate in line mode and offer the viewer a choice on the amount of dynamic range he or she desires (depending, of course, on the features provided by the manufacturer of the box). In this mode the normalized dialogue level is -31 dBFS Leq(A) (-34 dBFS in each channel for a two-channel decoder), as opposed to the -20 dBFS (-23 dBFS in each channel) for RF mode operation.

However, there may be circumstances in which the listener may prefer to use RF mode (at the line-level outputs), for instance for late-night viewing or to avoid disturbing the neighbors. Since in RF mode the Dolby Digital decoder has an 11 dB boost applied (required for the RF modulator), normalized dialogue emerges at -23 dBFS, as discussed earlier.

Clearly it is very undesirable that the speech level should change by 11 dB when the operating mode is changed, so in RF mode, 11 dB of attenuation should be inserted in the feed to the line outputs only. The two modes would then deliver the same speech levels, but different amounts of dynamic range. In RF mode, the loudest and the quietest sounds are compressed toward the dialogue level. In our extreme situation described above, we have lowered our dialogue level by 11 dB, from 83 to 72 dBA SPL. It's still loud, but no longer intolerable or over the "Spousal Acceptance Threshold."
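Here is a small sketch of that compensation, in Python, with hypothetical names. It models only the level arithmetic described above (an 11 dB boost in RF mode and an optional 11 dB pad ahead of the line outputs), not how any particular set-top box is wired.

```python
# Illustrative only. Names are hypothetical; values are from the column.
LINE_MODE_DIALOGUE_DBFS = -31.0
RF_MODE_GAIN_DB = 11.0


def line_output_dialogue_dbfs(rf_mode: bool, pad_line_outputs: bool = True) -> float:
    """Dialogue level at the line outputs for a given decoder mode."""
    level = LINE_MODE_DIALOGUE_DBFS
    if rf_mode:
        level += RF_MODE_GAIN_DB       # boost applied inside the decoder
        if pad_line_outputs:
            level -= RF_MODE_GAIN_DB   # attenuation in the line-output feed only
    return level


# With the pad, switching modes leaves the speech level where it was.
print(line_output_dialogue_dbfs(rf_mode=False))
print(line_output_dialogue_dbfs(rf_mode=True))
# Without it, speech jumps by 11 dB when RF mode is selected.
print(line_output_dialogue_dbfs(rf_mode=True, pad_line_outputs=False))

# The same 11 dB, expressed acoustically for the loud home-theater example.
loud_dialogue_spl_dba = 83.0
print(loud_dialogue_spl_dba - RF_MODE_GAIN_DB)
```

The point of the pad is that only the dynamic range changes between modes; the dialogue anchor does not move.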

That's why we have the two different decoder operating modes and output interfaces: to accommodate different kinds of systems, each with its own capabilities: RF for the cheap low-level legacy mono TVs and line level (also digital multichannel) for comparatively high-quality stereo and multichannel systems.

When you think about it, it's sweetly reasonable, if complicated. The important thing to keep in mind is that if the baseband line-level outputs are set to reproduce dialogue at approximately 65 dBA SPL, then the RF output has a maximum level of approximately 82 dBA SPL and the line-level output has a maximum level of approximately 96 dBA SPL.
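Those two maxima follow from the headroom each output leaves above dialogue. In the sketch below the names are hypothetical, and the headroom figures are my reading of the numbers in this column: 17 dB for the RF output (dialogue sits 17 dB below 100 percent modulation) and 31 dB for the line output (dialogue sits at -31 dBFS, with full scale at 0 dBFS).

```python
# Illustrative only. Names are hypothetical; values are from the column.
DIALOGUE_SPL_DBA = 65.0  # typical listening level for speech

HEADROOM_ABOVE_DIALOGUE_DB = {
    "rf": 17.0,    # dialogue is 17 dB below 100 percent modulation
    "line": 31.0,  # dialogue is at -31 dBFS, full scale is 0 dBFS
}


def max_spl_dba(output: str, dialogue_spl_dba: float = DIALOGUE_SPL_DBA) -> float:
    """Loudest level an output can deliver when dialogue plays at dialogue_spl_dba."""
    return dialogue_spl_dba + HEADROOM_ABOVE_DIALOGUE_DB[output]


for name in ("rf", "line"):
    print(name, max_spl_dba(name))
```

The 14 dB gap between the two results is the extra dynamic range the line-level output makes available over the RF output.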

Naturally, life ain't all that easy! As I mentioned above, when the decoder is operating in RF mode, Dynamic Range Control information (carried via metadata) is applied to the decoded audio on the RF output, to save the listener from really crass overloads whenever he or she decides to play back yet another Star Wars epic or Guvinator XXX.

And here's where the problem comes in big time. What happens if the dialnorm isn't set correctly?

Just imagine for a sec... what happens if dialnorm is set to -31 dBFS and dialogue is actually at -20 dBFS Leq(A), and your beloved viewers (a lot of them, anyway) are hooked up to the RF outputs, and their decoder is defaulted to RF mode?

I can speak to this, being a couch potato; it ain't a pretty sound. Remember, the dynamic range control gain words calculated in the encoder take into account the current dialnorm setting, the compression profile selected and the target decode level for both line and RF mode operation. That is why the metadata carries two dynamic range control words in the bitstream: Dynrng (for line mode) and Compr (for RF mode). If the dialogue is actually encoded 11 dB above the dialnorm value the encoder assumed when calculating compr, then when the program is decoded in RF mode the peaks will quite often exceed full scale, forcing the decoder to apply overload protection limiting to keep the digital-to-analog converters from clipping. As a result, dialogue is turned up 11 dB and then brutally squashed.
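To put rough numbers on that, here is a minimal sketch with hypothetical names. It assumes the decoder normalizes by attenuating the signal by the difference between the dialnorm value and -31 dBFS (my reading of the calibration protocol described at the top of this column), and it borrows the 15 dB peak-to-Leq(A) ratio quoted earlier for typical dialogue. It ignores what compr and the limiter then do to the signal; it only shows how far the uncompressed peaks would overshoot.

```python
# Illustrative only. Names are hypothetical; the normalization rule is an assumption.
REFERENCE_DBFS = -31.0   # line-mode reference decode level
RF_MODE_GAIN_DB = 11.0   # boost applied in RF mode
PEAK_TO_LEQ_DB = 15.0    # typical dialogue peak-to-Leq(A) ratio from the column


def decoded_dialogue_dbfs(actual_dbfs: float, dialnorm_dbfs: float, rf_mode: bool) -> float:
    """Dialogue level after dialnorm normalization, before any compression."""
    attenuation = dialnorm_dbfs - REFERENCE_DBFS  # -31 gives 0 dB, -20 gives 11 dB
    level = actual_dbfs - attenuation
    if rf_mode:
        level += RF_MODE_GAIN_DB
    return level


def peak_overshoot_db(dialogue_dbfs: float) -> float:
    """How far typical dialogue peaks would exceed full scale (0 if they fit)."""
    return max(0.0, dialogue_dbfs + PEAK_TO_LEQ_DB)


correct = decoded_dialogue_dbfs(-20.0, dialnorm_dbfs=-20.0, rf_mode=True)
wrong = decoded_dialogue_dbfs(-20.0, dialnorm_dbfs=-31.0, rf_mode=True)
print(correct, peak_overshoot_db(correct))
print(wrong, peak_overshoot_db(wrong))
```

With dialnorm set correctly, the dialogue peaks fit under full scale. With dialnorm left at -31, the same dialogue comes out 11 dB hotter and its peaks land several dB over full scale, which is the material the overload protection has to flatten.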

One local station does a lot of sports shows this way. I feel like I'm being caned! The slow attack time constant lets all the announcer's consonants pop out front like little "thwacks" around the head and then jams down the level 10 dB for the vowels, popping back up as the release time constant kicks in, sucking up the crowd noise in anticipation of the next thwack! Ugh-leeee!!

WHAT IT ALL MEANS

Our distribution system is a multifaceted and complex one. One of the key problem points is the set-top box, which appears to be fairly innocuous but is actually quite complex, with diverse behaviors running essentially on autopilot. Dialnorm is a sophisticated and thoughtful protocol to help with the levels problem in a meaningful way. But, naturally, the devil is in the details, and we are getting tripped up a fair amount by them right now.

Next month, I'll take a look at the cable providers, set-top box manufacturers and TV manufacturers for an idiosyncratic case study of one poor benighted soul (me) trying to figure out how to make his system work.

Thanks for listening.

Dave Moulton would once again like to thank Jeffrey Riedmiller of Dolby Laboratories for his assistance with this article.