Producing Intelligible Speech for Mobile DTV

Mary C. Gruszka

It’s still the early days for mobile DTV, so it shouldn’t be a surprise that some audio issues are making themselves heard.

A good example was relayed to me by Tim Carroll, president of Linear Acoustic, in a recent interview. Carroll likes to check out mobile DTV broadcasts wherever he travels, and did just that in an undisclosed location in Washington, D.C. The place had a lot of background noise, and Carroll was hard pressed to discern what was being spoken as he attempted to listen and watch at the same time.

“I had to put the mobile device up to my ear [for the sound to be intelligible], but then I couldn’t see the picture. Or I could watch the video and not understand what was being said,” Carroll said.

When I heard that, I thought this is yet another example of poor speech intelligibility of audio for video. This time the cause is not necessarily poor production practices at the front end (examples of which were discussed in a previous column), but rather problems at the transmission and reception end. If there were also problems with production, though, they would only compound the speech intelligibility problem for the listener. These issues aren’t limited to mobile DTV; they can also affect internet and over-the-top content distribution.

TRANSMISSION AND RECEPTION

While home listening/viewing generally occurs in a fairly well-controlled environment, the same cannot be said for mobile devices. Crowd noise, traffic noise, subway or train noise, or any other kind of noise that surrounds the mobile listener can impair speech intelligibility.

Fig. 1: Comparison of loudness levels of TV/movie content (purple) and music files (blue) on mobile devices. The average loudness of the television content was significantly lower than that of music. Courtesy of Tim Carroll and Jeff Riedmiller. (From Wolters, Riedmiller, et al.; AES 128th Convention, May 2010)

First of all, if the background noise level exceeds that of the mobile device, the noise will dominate and mask the lower-level sound. Also if it’s too noisy, the normal response for the listener is to turn up the volume. But there’s a limit.

“Frequency response and distortion vary wildly and they impact speech intelligibility greatly,” Carroll said. “For example, if the background noise is loud, the tendency is to crank the gain of the portable device, often leading to distortion, which can interfere with the speech, and it spirals from there.”

That spiral often ends at unsafe listening levels in headphones or earbuds. And the listening level, no matter how high it goes, may still not be able to top the noise.

Even if the noise level is below listening level, it can still mask speech, especially if the spectral content of the noise falls within the speech-critical 500 Hz and 2000 Hz octave bands, or even higher for sibilant sounds (i.e. “hissing”).

Then there’s the audio quality and fidelity of the device itself, as well as any headphones or plug-in loudspeakers. With uneven or poor frequency response in those critical bands, intelligibility could suffer. The quality and operation of the audio decoder can also be a contributing factor in poor intelligibility.

“In mobile devices you have no idea about how the decoder is implemented,” Carroll said. “In many cases the people who are doing the implementation are more concerned about how the device looks as compared with putting in quality components and acoustics. The audio portion is not taken into account.”

But wait, there’s more. Unlike AC-3 audio encoding used in ATSC broadcasting, mobile DTV doesn’t employ metadata to control such listening parameters as loudness or dynamic range. Carroll found there is often too much dynamic range for mobile devices.

And yet broadcasters hope to reach this growing mobile audience and perhaps grab their attention away from the music player or games.

WHAT TO DO NOW

While various ATSC committees are addressing these issues, there are some practices that broadcasters can employ now, at little or no cost, as outlined in Carroll’s 2013 NAB Show presentation, “Practical Audio Issues for Mobile Digital Television.” In addition, there are some things that broadcasters should avoid.

Fig. 2: An example of an audio signal chain for mobile DTV showing the +10 dB level shifter followed by a peak limiter as external processor(s). ATSC mobile DTV audio levels that are boosted by 10 dB will more closely match the levels of audio that already exist on portable devices. Courtesy of Tim Carroll


First on the to-do list: Increase the audio loudness level for mobile DTV by 10 dB, from –24 LKFS (the target loudness level for regular TV programming) to –14 LKFS.

Right away, this increases the mobile device signal to background noise ratio and gives the listener a better chance of hearing what’s being broadcast. Not only that, a 10 dB increase in audio level is generally perceived as being twice as loud.
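As a rough illustration of the arithmetic, the sketch below converts the 10 dB shift into a linear amplitude multiplier and applies it to the –24 LKFS target. The function and constant names are illustrative assumptions, not taken from Carroll’s presentation. It also assumes that a static gain of N dB moves the measured loudness by N LKFS units.

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


def shifted_loudness(loudness_lkfs, gain_db):
    """Loudness after a static gain (assumes 1 dB of gain = 1 LKFS unit)."""
    return loudness_lkfs + gain_db


TV_TARGET_LKFS = -24.0   # target for regular TV programming (from the article)
MOBILE_SHIFT_DB = 10.0   # boost suggested for mobile DTV (from the article)

mobile_target = shifted_loudness(TV_TARGET_LKFS, MOBILE_SHIFT_DB)
amplitude_ratio = db_to_amplitude_ratio(MOBILE_SHIFT_DB)

print(f"Mobile target: {mobile_target:.0f} LKFS")
print(f"Amplitude multiplier: {amplitude_ratio:.2f}x")
```

In other words, sample values grow by a factor of a little over three. At the same volume setting, the program audio also sits 10 dB higher relative to whatever noise surrounds the listener.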

Increasing the loudness has another benefit, not directly related to speech intelligibility, but rather to listener enjoyment.

In research conducted by Dolby, it was discovered that the loudness of television content was, on average, 11 dB lower than music accessed or stored on a mobile device (Fig. 1). So raising the transmitted level of TV content by 10 dB comes closer to achieving parity with music levels.

Increasing the level, however, results in 10 dB less headroom, so a peak limiter following the level shifter in the signal chain is also required, Carroll said. This helps to avoid overloading the subsequent high efficiency (HE) AAC audio encoder.

According to Carroll, “the limiter should prevent peaks from exceeding –3 dBFS to allow for possible overshoots caused by the bit-rate reduction process.” (See Fig. 2.)
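The level shifter and limiter stages can be sketched in a few lines. This is a deliberately simple illustration, not the processing in any actual product. The limiter here uses an instant attack and an exponential release, and the `release` coefficient, function names and test tone are all assumptions made for the example. Only the +10 dB shift and the –3 dBFS ceiling come from the text above.

```python
import math


def db_to_linear(level_db):
    """Convert a level in dB to a linear amplitude value."""
    return 10 ** (level_db / 20)


def shift_and_limit(samples, gain_db=10.0, ceiling_dbfs=-3.0, release=0.999):
    """Apply a static gain, then hold sample peaks at or below the ceiling.

    `release` (hypothetical value) controls how slowly the limiter's gain
    reduction relaxes back toward unity after a peak.
    """
    gain = db_to_linear(gain_db)
    ceiling = db_to_linear(ceiling_dbfs)
    limiter_gain = 1.0
    output = []
    for sample in samples:
        boosted = sample * gain
        peak = abs(boosted)
        # Gain that would bring this sample exactly down to the ceiling.
        needed = ceiling / peak if peak > ceiling else 1.0
        # Let the previous gain reduction relax toward unity...
        recovered = 1.0 - (1.0 - limiter_gain) * release
        # ...but never let it exceed what this sample requires.
        limiter_gain = min(needed, recovered)
        output.append(boosted * limiter_gain)
    return output


# Test tone: 1 kHz at a 48 kHz sample rate, peaking at about -6 dBFS.
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / 48000) for n in range(480)]
limited = shift_and_limit(tone)
peak_dbfs = 20 * math.log10(max(abs(s) for s in limited))
print(f"Peak after shift and limit: {peak_dbfs:.1f} dBFS")
```

A production limiter would normally add look-ahead and measure true (inter-sample) peaks. The sketch only shows why the ceiling sits below full scale: it leaves margin for overshoots introduced later by the encoder.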

When creating a downmix from a 5.1 source for the mobile DTV audio feed, Carroll suggests boosting the center (dialog) channel to help increase intelligibility.

“The downmix is not intended to sound like the original as no one listens on mobile like they do at home,” Carroll said. “So give the dialog a little extra help. It’s easy to do and pretty much free.”
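One way to picture the center boost is as a change to a single downmix coefficient. The sketch below assumes a conventional stereo downmix in which center and surrounds are mixed in at –3 dB and the LFE channel is left out. The 3 dB default boost is a hypothetical value chosen for illustration, since the article does not specify an amount.

```python
def db_to_linear(level_db):
    """Convert a level in dB to a linear amplitude value."""
    return 10 ** (level_db / 20)


def downmix_to_stereo(left, right, center, left_sur, right_sur,
                      center_boost_db=3.0):
    """Downmix one 5.1 sample frame (LFE omitted) to a stereo pair.

    The center and surround channels start from an assumed -3 dB
    coefficient; `center_boost_db` (hypothetical default) raises the
    center, where dialog normally sits, above that starting point.
    """
    center_gain = db_to_linear(-3.0 + center_boost_db)
    surround_gain = db_to_linear(-3.0)
    out_left = left + center_gain * center + surround_gain * left_sur
    out_right = right + center_gain * center + surround_gain * right_sur
    return out_left, out_right


# Dialog-only frame: the signal is present in the center channel alone.
plain = downmix_to_stereo(0.0, 0.0, 0.5, 0.0, 0.0, center_boost_db=0.0)
boosted = downmix_to_stereo(0.0, 0.0, 0.5, 0.0, 0.0, center_boost_db=3.0)
print(f"Dialog in left output, no boost: {plain[0]:.3f}")
print(f"Dialog in left output, boosted:  {boosted[0]:.3f}")
```

Because the boost raises the summed level, the downmix would still need to feed the peak limiter described earlier.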

If the source material is already stereo or mono, try experimenting with frequency contouring, that is, careful equalization followed by peak limiting, Carroll said. He mentioned that new tools are coming down the pike that will address this further.

Notice that compression isn’t on Carroll’s to-do list. This may sound counterintuitive, but compression can actually decrease speech intelligibility. Here we’re talking about dynamic range compression, which controls and reduces the peak to low ratio in an audio signal, not bit-rate reduction.

Speech intelligibility, for most Western languages among others, relies on consonant sounds. Compression reduces the “formant to trough” ratio, or in other words, the level of fricatives (consonants) compared with vowels.

“Processing for radio is not appropriate for DTV, and processing for DTV is not appropriate for mobile TV,” Carroll said.

Wideband AGC systems “can create unintended audible spectral shifts, especially impacting high-frequency content,” he added. “Multiband techniques avoid the psychoacoustic spectral shift, but can damage peaks and valleys. Controls to prevent this are not found on typical audio processors, so stations cannot just drop an off-the-shelf unit in line and get desired results. It will be a fight between control and intelligibility.”

Listen to your station’s broadcast on one or more mobile devices. If speech isn’t as intelligible as it should be, give these ideas a try.

Some current mobile DTV audio processors, such as the Linear Acoustic AERO.mobile, already support these new developments, according to Carroll. But stay tuned. There could be some future processing technology coming our way.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.