Blind Testing in the Audio Realm

Alert readers will recall that last month I began a discussion of how we try to determine what it is that we humans can really hear. I noted that people often report hearing "awesome" differences between small changes (like 16 to 20 bits, for instance), but that when they are tested in a "blind" way, they can't seem to make out the difference at all. How can this be? This month, we'll take a quick look at blind testing.

To begin with, we humans are quirky creatures. Our conscious and subconscious minds behave in unruly and pesky ways. Our perceptions of the world around us are multifaceted, complex, cross-integrated, extraordinarily rich in information, and also heavily "edited." In fact, our raw perceptual information is overwhelmingly complex and noisy, and some sort of perceptual "editing" is necessary to keep us from drowning in sensory noise.

With the editing that occurs at various neurological and precognitive stages of our perception, however, we also experience a loss of, well, objectivity. Our edited perceptions tell us quickly what we think we need to know about the stimulus in question, and not much more. Sometimes these edited perceptions get us into trouble.

THIS IS A TEST

To give you an example, let's go through the following exercise: Say, out loud, the following sentence: "Compared to Microphone B, Microphone A sounds really dull and lifeless."

Not very hard to say, is it? Anybody could say that, and mean it, too!

Now, let's try another sentence. [Disclaimer: Any mention of particular brand names is strictly for the purpose of hypothetical comparison.] Say this one loud and clear (I dare you!):

"Compared to a Shure microphone, a Neumann microphone sounds really dull and lifeless."

A little harder to say, isn't it? Especially if you have a knowledgeable colleague listening.

Why is this?

It is because we "know," as a matter of professional competence, that Neumann microphones are better-sounding than Shure microphones. (Ah, the power of brand identities ...)

How do we know? From our own experience with microphones (whether or not we've ever directly compared a Shure to a Neumann), gossip, group-think, perceived value, relative cost, mythology, etc. This sum total of "knowledge" helps us quickly make the acceptable professional decisions that allow us to survive.

This sum total of "knowledge" is also the basis for bias and prejudice. For better or worse, it is lacking in what we like to think of as scientific rigor and objectivity.

Welcome to blind testing.

ALL BEING EQUAL

When we wish to answer the question, "All other things being equal, which microphone sounds better to human listeners, a Shure BX58 or a Neumann Y49?," the "all other things being equal" part (seldom stated, but always implied) requires that we take steps to make sure all other things really are equal when we perform the comparison.

Because we already "know" that Neumanns are better than Shures, that knowledge colors our perceptions.

How strong is this effect? Overwhelming, apparently. In 1994, Floyd Toole published a report on this effect, which found, in a study of loudspeakers, that brand identity was the strongest force affecting perceived audio quality, far stronger than audio performance.

So, we conceal the identity of the devices under test, and rename them with neutral names such as Microphone A and Microphone B. This is what we mean by "blind testing."

The test listener should not know the identity of the items under test. Then, when he or she reports that, "compared to Microphone A, Microphone B sounds really dull and lifeless," we can reasonably assume that he or she is not drawing on other, prejudicial "experience" in making that judgment.
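If it helps to see the mechanics, here is a minimal sketch, in Python, of how an administrator might assign those neutral labels. The function and device names are my own illustration, not any standard test rig:

```python
import random

def blind_labels(devices, seed=None):
    """Assign neutral labels (A, B, ...) to the devices under test.
    Returns the listener-facing labels and a private key, held by
    the administrator, that maps labels back to real identities."""
    rng = random.Random(seed)
    shuffled = list(devices)
    rng.shuffle(shuffled)
    key = {chr(ord("A") + i): d for i, d in enumerate(shuffled)}
    return sorted(key), key

# Hypothetical comparison: the listener sees only "A" and "B".
labels, key = blind_labels(["Microphone X", "Microphone Y"])
print("Listener is told:", labels)   # ['A', 'B']
# The key stays with the administrator until all judgments are in.
```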

NOT ENOUGH

Sadly, this isn't enough. It turns out that we humans are pernicious and sneaky enough to screw things up anyway. If the person administering the listening test knows the identity of the two microphones, he or she may, whether inadvertently or deliberately, cause the blind listener to prefer one microphone over the other. So, if we are going to be thorough about this, we need to conceal the identity of the microphones from the person administering the test as well. This is what we mean by "double-blind testing."
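Picture the double-blind version as sealing that label key away from the administrator, too; a neutral third party (or the computer itself) holds it until scoring is finished. Again, a sketch of mine with made-up names, not a standard protocol implementation:

```python
import random

class SealedKey:
    """Holds the label-to-device mapping so that neither the listener
    nor the administrator can see it while the test is running."""
    def __init__(self, devices):
        shuffled = list(devices)
        random.shuffle(shuffled)
        self._key = {chr(ord("A") + i): d for i, d in enumerate(shuffled)}
        self.labels = sorted(self._key)   # all anyone sees during the test

    def reveal(self):
        """Unseal the mapping, once, after all responses are recorded."""
        return dict(self._key)

sealed = SealedKey(["Microphone X", "Microphone Y"])
print("Everyone in the room sees only:", sealed.labels)
# ... run the trials, collect the judgments ...
print("Unsealed afterward:", sealed.reveal())
```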

So, if we are going to rule out prejudices and biases, however well-founded they might be, we need to use double-blind testing, or so it seems. This adds a lot of complexity and expense, but at least we've found a truth that stands up "when all other things are equal," which is a cornerstone of the scientific method.

Unfortunately, all of this gets a little tougher when we start trying to measure very small differences, or trying to find out if there is an audible difference at all. When we compare two microphones, the differences are generally pretty large and at least reasonably obvious, both in terms of objective measurement and perceived sound quality.

REAL OR IMAGINED

When we go digging for the real or imagined differences between, for instance, 20 and 24 bits, or two audio cables, the measured objective differences, acoustical or electrical, are really very small. And with such small differences, the complexity of the blind test begins to actually affect the results.
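To put a number on "really very small": an ideal n-bit converter has a theoretical dynamic range of roughly 6.02n + 1.76 dB, so the 20-versus-24-bit question lives more than 120 dB below full scale. A quick back-of-the-envelope check (the formula is the standard quantization result, not something off my test bench):

```python
# Theoretical dynamic range of an ideal n-bit quantizer, in dB.
def dynamic_range_db(bits):
    return 6.02 * bits + 1.76

for bits in (16, 20, 24):
    print(f"{bits} bits: ~{dynamic_range_db(bits):.1f} dB")
# 16 bits: ~98.1 dB, 20 bits: ~122.2 dB, 24 bits: ~146.2 dB.
# The 20-vs-24-bit difference sits far below most playback noise floors.
```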

It boils down to this obvious but inescapable fact: It is harder to correctly answer questions whose answers we don't know than questions whose answers we do know. Setting aside the obvious issues of prejudice, bias and cheating for a moment, we will get "correct" answers more often when we "know" the answers than when we don't.
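There's a simple arithmetic reason to be careful here, too: a listener who hears nothing at all will still be "correct" about half the time on a two-choice question just by guessing, so a blind score only means something when it clearly beats that chance baseline. Here's a quick sketch of the check; the scoring scheme is my illustration, not part of any formal standard:

```python
from math import comb

def p_by_guessing(correct, trials, p=0.5):
    """Probability of scoring at least `correct` out of `trials`
    two-choice judgments by guesswork alone (binomial tail)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

# 12 right out of 16 sounds impressive; how impressive is it?
print(f"P(>=12/16 by guessing) = {p_by_guessing(12, 16):.3f}")  # ~0.038
```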

I've seen this effect a lot in my Golden Ears seminars (I publish a set of audio ear-training CDs called "Golden Ears," and often present ear-training seminars based on them). Listeners asked to identify the difference between two versions of the same recorded excerpt will have real trouble, at first, hearing that one version is 3 dB louder than the other. Once they are told and shown that such a difference exists, they find it "obvious."
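For perspective, 3 dB is not a subtle change on paper: it corresponds to a voltage ratio of about 1.41 and a doubling of power. The arithmetic, for the curious:

```python
# Convert a level difference in dB to amplitude and power ratios.
def amplitude_ratio(db):
    return 10 ** (db / 20)   # voltage / sound-pressure ratio

def power_ratio(db):
    return 10 ** (db / 10)

print(f"3 dB -> amplitude x{amplitude_ratio(3):.2f}, "
      f"power x{power_ratio(3):.2f}")
# 3 dB -> amplitude x1.41, power x2.00
```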

So, when we try to measure really small differences, we can reasonably expect to find (and do in fact find) that blind (and double-blind) tests yield more negative results than nonblind tests, due both to bias effects and also to the confidence effect of "knowing" the answer. The insight is that blind tests are "harder" than "sighted" tests.

Interestingly, this doesn't mean that blind tests are necessarily more rigorous than sighted tests. Because they present a comparatively more difficult testing context, and because that context differs from the end-use listening situation where we wish to apply our findings, the results from blind tests may not prove to be perfectly reproducible or relevant.

BIAS EFFECTS

What we can be sure of is that blind tests are not contaminated by bias effects. Usually, that benefit more than outweighs the small error caused by the "lack of confidence" effect.

So, for these very small differences, I personally take the pragmatic (and lazy) approach. I use blind testing, and figure any errors due to "loss of confidence" are so small that they aren't worth worrying about. It generally works pretty well and relieves me of having to account for the small error that effect introduces.

Such pragmatism is widely accepted in the testing community, because the alternative, sorting out the bias effects present in sighted tests, is prohibitively expensive and time-consuming, if it is even possible.

So, I recommend that you depend on blind (or better, double-blind) testing to find out answers to questions about the audibility of effects like 96 kHz sampling rates or 24-bit words.

In the next issue, we'll look at our curious choice of words like "amazing" to describe small differences that we can barely hear. Thanks for listening.

Dave Moulton