HRTFs and Binaural Reproduction


In this HRTF measurement setup, a pivoting robotic arm is equipped with loudspeakers, and the mics are placed at the entrance of the subject’s blocked ear canals. The subject is seated in a chair with a neck rest, on a remotely controlled turntable. The reflective markers on the subject’s head are tracked by infrared motion-capture cameras so that the exact head position can be monitored during measurements. Credit: Olivier Jouinot (France Télévisions, Paris), Hélène Bahu (IRCAM, Paris).

NEW YORK—It may be easy to see how a listening room with 22 loudspeakers can provide an immersive audio experience, but to recreate realistic 3D audio with a pair of headphones requires head-related transfer functions. This involves large amounts of data, but data exchange should now be made easier with the recent ratification of the AES standard for file exchange—Spatial acoustic data file format (AES69-2015).

Before we delve into the AES standard and why it’s important, some background on binaural audio and HRTFs would be in order.

Binaural recordings can provide a more realistic and immersive sonic experience of being present at a performance or event, as if you were hearing it through your own ears. Binaural playback is most effective through headphones.

Binaural recordings aren’t new, but until recently they have been mostly experimental. Years ago, I played around with analog recordings using in-the-ear microphones made by the late Hellmuth Kolbe, a recording engineer and acoustical designer from Switzerland. Played back on headphones, these recordings gave me the impression that I was in the room or environment where they were made.

A colleague of mine on the west coast frequently brought in a “dummy head,” a specially designed mannequin of a torso and head with molded ears in which microphones were inserted, to make recordings of the various audio seminars we attended. Although these recordings were made with generic “ears,” they still had a more realistic ambient sound than recordings made with either a stereo mic or two mics placed for stereo pickup, though not as realistic as the recordings I made with my own in-the-ear mics.

Today there are multi-element microphone arrays that can be used for recording; with subsequent processing, those recordings can be played back binaurally. For example, mh acoustics, a Summit, N.J.-based provider of microphone and acoustics technology, has the 32-element em32 Eigenmike microphone array.

“Some research labs developed array prototypes that use way more capsules,” said Markus Noisternig from Ircam (Institute for Research and Coordination in Acoustics/Music, Paris). “At Ircam we are using 64 high-quality back-electret capsules for musical recordings, and are working towards a new prototype using 256 MEMS microphones. The more capsules, the higher the spatial resolution, the more precise the binaural transcoding.”
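To give a rough sense of the capsule-count/resolution trade-off Noisternig describes, spherical microphone arrays like these are commonly decoded into spherical harmonics (Ambisonics), where an order-L representation uses (L+1)² channels, so M capsules can support at most order ⌊√M⌋ − 1. The Ambisonics framing is an assumption for illustration; the article does not name the processing chain.

```python
import math

def max_ambisonic_order(num_capsules):
    """An order-L spherical-harmonic (Ambisonics) representation uses
    (L + 1)^2 channels, so M capsules support at most
    L = floor(sqrt(M)) - 1."""
    return math.isqrt(num_capsules) - 1

# 32 capsules (em32) -> order 4, 64 -> order 7, 256 -> order 15:
# each step up in order sharpens the spatial resolution available
# for binaural transcoding.
for m in (32, 64, 256):
    order = max_ambisonic_order(m)
    print(m, "capsules -> order", order, "->", (order + 1) ** 2, "channels")
```

This sketch only counts channels; a real decoder also has to weigh capsule placement and noise against the theoretical maximum order.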

This has great potential for content providers to add an immersive sonic experience to their programming, especially for users of smartphones and other portable devices where listening on headphones is the norm. These devices have enough processing power to implement decoders for binaural listening.

For example, the BiLi (binaural listening) Project, a collaborative research project funded by nine partners including France Télévisions, Radio France, Orange and Ircam, involves quality assessment of binaural listening, content production, HRTF acquisition and generation, multiplatform customizable binaural listening, and a common HRTF file format.

“[The BiLi Project] aims at improving the entire production chain—recording, streaming, decoding—for providing binaural 3D audio content to broadcasters,” Noisternig said.

LOCALIZING SOUND
Binaural processing takes advantage of the way we perceive and localize sound. And that’s where head-related transfer functions come in. The size and shape of our head, ears and upper body act as prefilters on the sound before it impinges on our eardrums, enters the inner ear and is processed by the brain. As with anything in acoustics, we’re dealing with absorption, reflection, diffusion and diffraction.

We use a variety of auditory cues to localize sound. One of the key cues is sound level, the interaural level difference (ILD): if a sound source is to the left of us, it will generally appear louder in the left ear than in the right. There is also the interaural time difference (ITD), the difference in sound arrival times between the two ears. Again in our example, sound coming from our left will arrive at our left ear before it reaches the right. These two localization cues work very effectively for sound sources in the same plane as our ears.
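For a rough sense of the numbers involved, the ITD can be approximated with Woodworth’s classic spherical-head formula, ITD = (a/c)(θ + sin θ). The head radius and speed of sound below are typical textbook values, not measurements from the article, and a real head is of course not a sphere.

```python
import math

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Approximate interaural time difference in seconds for a
    spherical head: ITD = (a / c) * (theta + sin(theta)), where
    theta is the source azimuth in radians (0 = straight ahead,
    90 = directly to one side)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

# A source directly to one side produces the largest ITD a head of
# this size can generate, roughly 0.65 milliseconds.
print(itd_woodworth(90))
```

A source straight ahead gives an ITD of zero, which is one reason front/back confusion arises: front and rear sources on the median plane produce the same interaural differences.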

Localizing sounds above or below that plane is more dependent on the effects that the pinna (outer ear) has on the sound entering the ear canal. Because of the dimensions of the pinna, this effect is more pronounced at higher frequencies. Think of the pinna as the “sound reflector” it is. Sound bouncing off of it mixes with the more direct sound entering the ear canal with the result that the frequency response will have peaks and dips as the two signals add in-phase and out-of-phase (commonly called comb filtering). Studies have found that a deep notch is formed at a certain frequency, and that this notch frequency changes as the sound elevation changes.
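The comb filtering described above is easy to sketch numerically: a direct sound summed with one equal-level reflection delayed by τ has the magnitude response |1 + e^(−j2πfτ)|, with deep notches at odd multiples of 1/(2τ). The 50-microsecond delay below is an illustrative assumption, corresponding to a path difference of roughly 1.7 cm, plausible for a pinna reflection.

```python
import cmath
import math

def comb_magnitude(f_hz, delay_s):
    """Magnitude response of a direct sound plus one equal-level
    reflection delayed by delay_s: |1 + e^(-j 2 pi f tau)|.
    Ranges from 2 (in phase) down to 0 (out of phase)."""
    return abs(1 + cmath.exp(-2j * math.pi * f_hz * delay_s))

tau = 50e-6                   # assumed 50 us pinna-reflection delay
first_notch = 1 / (2 * tau)   # 10 kHz: the two arrivals cancel
print("first notch at", first_notch, "Hz, magnitude",
      comb_magnitude(first_notch, tau))
```

Shortening or lengthening the reflection path moves the notch frequency, which is exactly the elevation-dependent notch behavior the studies mentioned above report.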

A seminal paper, “Pinna Transformations and Sound Reproduction,” by Carolyn A. Puddie Rodgers and published in the Journal of the Audio Engineering Society, described the effects of the pinna on sound localization and pointed out the problems of misaligned loudspeakers or early reflections in control rooms, which produced their own comb filtering. This tended to cause “localization confusion,” in that the pinna cues told the ear/brain system one thing, while loudspeaker misalignment or early reflection comb filtering provided conflicting and incorrect localization information.

Readers who’ve been following studio design columns in TV Technology will recognize that this is one of the reasons why we create reflection-free zones around the mixing position in control rooms, to preserve pinna localization cues.

The head and torso also come into play in localization, through the way they diffract and reflect sound. Then there are dynamic auditory cues, such as unconscious head movements, that help to resolve the front-to-back ambiguity in binaural hearing.

“It has been shown that head-tracking improves the localization accuracy, out-of-the-head localization and immersion when playing back binaural audio over headphones,” Noisternig said. “Thus virtual audio in general allows for integrating motion capture devices.”

There is, of course, processing in the inner ear and brain, which we won’t get into here.

A way to describe all the effects that the head, ears and torso have on sound before it reaches each eardrum is a head-related transfer function, one for each ear. As Noisternig said, “without HRTFs, there is no 3D audio over headphones.”

HRTFs can be directly measured or calculated, and this data is then used in binaural processing of incoming sound to produce a 3D listening experience. As mentioned, this is best achieved when using one’s own HRTFs.
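As a minimal sketch of that binaural processing: rendering a virtual source convolves a mono signal with a left-ear and a right-ear head-related impulse response (the time-domain form of the HRTF). The short “HRIRs” below are toy stand-ins for measured data, chosen only to show the one-sample delay (ITD) and attenuation (ILD) a real pair would encode for a source on the listener’s left.

```python
def convolve(x, h):
    """Direct-form convolution: y[n] = sum over k of h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def binaural_render(mono, hrir_left, hrir_right):
    """Filter one mono signal through an HRIR pair -> (left, right)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy HRIRs for a source to the listener's left: the right ear hears
# the sound one sample later (ITD) and attenuated (ILD, head shadow).
hrir_l = [1.0, 0.3]
hrir_r = [0.0, 0.5, 0.15]
left, right = binaural_render([1.0, 0.0, 0.0, 0.0], hrir_l, hrir_r)
```

Measured HRIRs run to hundreds of taps per ear and vary with direction, which is why practical renderers store a dense grid of them, the very data set AES69 was created to exchange.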

Next time we will look at ways that HRTFs can be determined and then delve into AES69.

Mary C. Gruszka is a systems design engineer, project manager, consultant and writer based in the New York metro area. She can be reached via TV Technology.