Fixed-point vs. floating-point numbers in audio processing

The traditional view is that the floating-point number format is superior to the fixed-point number format when it comes to representing sound digitally. In fact, while it may be counter-intuitive, there is a case to be made that floating-point numbers yield lower resolution than fixed-point numbers of the same word length.

Floating-point numbers defined

Floating-point numbers are like scientific notation on calculators: They have a mantissa, the number part, and an exponent, a multiplier used to scale the number part. For example, 1.414 × 10^3 is a floating-point number with a mantissa of 1.414 and an exponent of 3. The attraction of this form of notation is that it can be used to express numbers over a much larger range than would be possible if the same number of digits were used in a fixed-point (integer) number.
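As a small illustration of the mantissa-and-exponent idea, the Python sketch below splits the value 1414 in two ways: as decimal scientific notation, matching the example above, and as the binary mantissa and power-of-two exponent that computer hardware typically stores. The helper name split_binary is hypothetical and exists only for this example.

```python
import math

def split_binary(value: float) -> tuple[float, int]:
    """Return (mantissa, exponent) with value == mantissa * 2**exponent.

    math.frexp keeps the mantissa in the range [0.5, 1) for nonzero values.
    """
    return math.frexp(value)

# Decimal scientific notation: a 1.414 mantissa and an exponent of 3.
decimal_form = format(1414.0, ".3e")

# Binary form: the same value as a mantissa scaled by a power of two.
mantissa, exponent = split_binary(1414.0)

print(decimal_form)
print(mantissa, exponent)
```

Recombining the two parts with math.ldexp(mantissa, exponent) returns the original value, which shows that the exponent only scales the mantissa and adds no digits of its own.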

To understand the impact number format has on digital audio systems, we must consider two properties: resolution and dynamic range. Resolution or numerical precision is determined by word length. As it increases, resolution improves. Dynamic range is also determined by word length, but in the floating-point format, it can be dramatically extended by the choice of exponent. Compare, for example, a 24-bit fixed-point number that has a dynamic range of about 144dB, to a 24-bit floating-point number where eight bits are designated as an exponent. The latter has a dynamic range of more than 1500dB.
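The two dynamic range figures can be approximated with a short calculation. The sketch below assumes roughly 6.02dB per bit, and it assumes that the 24-bit floating-point word is split into a 16-bit mantissa and an 8-bit exponent that selects one of 256 power-of-two scale factors. Real formats differ in how they count the sign bit and normalization, so treat the result as an estimate rather than a specification. The function names are hypothetical.

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)  # about 6.02 dB for each doubling

def fixed_point_range_db(word_bits: int) -> float:
    """Approximate dynamic range of a fixed-point word, in dB."""
    return word_bits * DB_PER_BIT

def floating_point_range_db(mantissa_bits: int, exponent_bits: int) -> float:
    """Approximate dynamic range of a simplified floating-point word, in dB.

    Assumption: the exponent can shift the mantissa by up to
    2**exponent_bits - 1 binary places.
    """
    exponent_steps = 2 ** exponent_bits - 1
    return (mantissa_bits + exponent_steps) * DB_PER_BIT

fixed_db = fixed_point_range_db(24)
float_db = floating_point_range_db(mantissa_bits=16, exponent_bits=8)

print(f"24-bit fixed point: {fixed_db:.1f} dB")
print(f"16-bit mantissa + 8-bit exponent: {float_db:.1f} dB")
```

Note that the resolution of the floating-point word in this model comes only from its 16 mantissa bits, while the fixed-point word devotes all 24 bits to resolution.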

Choosing the right number format

To choose the right number format for a digital audio system, the dynamic range and resolution must be large enough to afford faithful representation of all audio signals that may be encountered. Sound pressure level (SPL) is a logarithmic measurement of sound levels where 0dB represents the threshold of human hearing. (See Table 1 for real-world examples.)

Real-world sounds                              Sound levels
Calm breathing, or gently rustling leaves      10dB
Normal conversation                            40dB to 60dB
Passenger car at 10m                           60dB to 80dB
Hearing damage (long-term exposure)            85dB
Vuvuzela                                       120dB
Level of sound that can cause physical pain    130dB
Jet engine at 30m                              150dB
M1 rifle at 1m                                 168dB
Stun grenades                                  170dB to 180dB

Table 1. Here are some examples of the dynamic range of sounds in the real world.

In general, music, sports and drama do not demand the full range of sounds — from jet engine to gently rustling leaves. Nonetheless, the 24-bit fixed-point format remains unsuitable for digital audio systems because audio processing can introduce errors that are manifested as audible noise unless additional resolution is provided. Note that it is resolution, not dynamic range, that is needed. To illustrate this, let's take a look at the most important audio processes: gain, mixing and equalization.

Gain, mixing and equalization

Applying gain to a digital audio signal means multiplying by a number greater than one for louder and a number less than one for quieter. The problem comes when the product is a number that doesn't fit neatly into the number of digits you have to represent it. There are usually extra low-order bits that you have to get rid of, a process called truncation. How you truncate affects sound quality. Simply rounding up or down introduces an unpleasant quantization noise, so a better idea is to add a random number to the leftover bits and then round up or down. This idea is known as dithering, and it makes low-amplitude signals sound much better. The downside is that the number format must carry additional resolution in the form of footroom bits to dither against. Here, floating point does not help because there is no need for extended dynamic range. In fact, any bits given over to carrying an exponent are extraneous and would be better deployed extending the resolution of the mantissa.
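The effect of dithering can be sketched in a few lines of Python. The example assumes a sample that carries eight footroom bits below the output word and uses triangular (TPDF) dither, which is one common choice. The article does not say which dither distribution is meant, so the choice of TPDF, the function name requantize and the constants are assumptions made for illustration.

```python
import random
from typing import Optional

def requantize(sample: int, drop_bits: int,
               rng: Optional[random.Random] = None) -> int:
    """Drop the lowest drop_bits bits of sample, rounding to nearest.

    If rng is given, triangular dither spanning about +/- 1 output LSB
    is added to the leftover bits before rounding.
    """
    step = 1 << drop_bits
    dither = 0
    if rng is not None:
        # The difference of two uniform values has a triangular distribution.
        dither = rng.randrange(step) - rng.randrange(step)
    return (sample + dither + step // 2) >> drop_bits

DROP_BITS = 8      # footroom bits to remove (assumed)
quiet_sample = 64  # one quarter of an output LSB, held in the footroom bits

# Plain rounding: the quiet signal disappears completely.
plain = requantize(quiet_sample, DROP_BITS)

# Dithered rounding: single outputs are noisy, but their average
# retains the level of the quiet signal.
rng = random.Random(1234)
dithered = [requantize(quiet_sample, DROP_BITS, rng) for _ in range(20000)]
average = sum(dithered) / len(dithered)

print(plain, round(average, 2))
```

The dither can only do its job if the footroom bits are still present when the word length is reduced, which is why the number format has to carry them through the calculation.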

If multiple signals are to be mixed (added), then it is a good idea to provide some additional headroom bits to extend the dynamic range during the calculation, but this does not have to be a large number. If a floating-point number were used to generate intermediate headroom, as the mantissa is scaled up, footroom bits would be lost, affecting any subsequent dither calculation and introducing noise. Hence, in mixing calculations, the floating-point format is in fact a liability.
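A rough sketch of why the number of headroom bits stays small: summing N full-scale signals can grow the result by at most a factor of N, so about log2(N) extra bits are enough. The function name headroom_bits and the 64-channel example are hypothetical.

```python
def headroom_bits(num_inputs: int) -> int:
    """Extra integer bits needed so that summing num_inputs
    full-scale signals cannot overflow (ceiling of log2)."""
    if num_inputs < 1:
        raise ValueError("num_inputs must be at least 1")
    return (num_inputs - 1).bit_length()

WORD_BITS = 24
NUM_CHANNELS = 64

# Worst case: every channel sits at negative full scale at the same time.
full_scale_negative = -(1 << (WORD_BITS - 1))
worst_case_sum = NUM_CHANNELS * full_scale_negative

extra = headroom_bits(NUM_CHANNELS)
accumulator_bits = WORD_BITS + extra
accumulator_min = -(1 << (accumulator_bits - 1))

print(extra, accumulator_bits, worst_case_sum >= accumulator_min)
```

In a fixed-point accumulator these bits are added above the signal, so the footroom bits below it stay intact for any later dither step.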

Equalization is a more complex case. To make it more convenient for processing, the ubiquitous biquad calculation used in digital filters may be arranged in different forms. For example, the popular direct form II is used to reduce the number of computations in systems where DSP cycles are at a premium. The tradeoff is that the resulting intermediate calculations have such a large dynamic range that floating-point format must be used, at least in a system constrained to a fixed word length such as a DSP chip. In other words, the need for using floating point is a consequence of cost-cutting rather than a consequence of the pursuit of high quality.
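The large intermediate values in direct form II can be seen in a small experiment. The sketch below is an assumption-laden illustration and not the implementation of any particular console or DSP: it uses the widely published Audio EQ Cookbook low-pass formulas for the coefficients, a hypothetical 100Hz filter at a 48kHz sample rate, and a unit step as the input. It runs the same biquad in direct form I and direct form II and records the peak of the internal state in the latter.

```python
import math

def lowpass_coeffs(freq_hz: float, sample_rate_hz: float, q: float):
    """Normalized low-pass biquad coefficients (b0, b1, b2, a1, a2)."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b1 = (1.0 - cos_w0) / a0
    b0 = b2 = b1 / 2.0
    return b0, b1, b2, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0

def direct_form_1(x, coeffs):
    """Biquad that stores past inputs and past outputs."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

def direct_form_2(x, coeffs):
    """Biquad with one shared delay line; also returns the peak state."""
    b0, b1, b2, a1, a2 = coeffs
    w1 = w2 = 0.0
    out = []
    peak_state = 0.0
    for xn in x:
        wn = xn - a1 * w1 - a2 * w2  # recursive part runs first
        out.append(b0 * wn + b1 * w1 + b2 * w2)
        w2, w1 = w1, wn
        peak_state = max(peak_state, abs(wn))
    return out, peak_state

coeffs = lowpass_coeffs(freq_hz=100.0, sample_rate_hz=48000.0, q=0.707)
step_input = [1.0] * 4800  # 100 ms of a full-scale step

y_df1 = direct_form_1(step_input, coeffs)
y_df2, peak_state = direct_form_2(step_input, coeffs)
max_difference = max(abs(a - b) for a, b in zip(y_df1, y_df2))

print(f"peak DF2 state: {peak_state:.0f} x input peak")
print(f"largest output difference: {max_difference:.2e}")
```

Both forms produce essentially the same output in double-precision arithmetic, but the direct form II state grows to thousands of times the input level for this low-frequency filter. A short fixed word has to make room for that growth, which is the dynamic range pressure described above.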

Limitations of a rigid floating-point format

The problem that arises from the use of a rigid floating-point format (for example, that found in ADSP SHARC chips) is that the resolution is fixed by DSP architecture, not by the requirements of the calculation. Most of the time, this is adequate, but there are certain filter configurations where it is not. The elevation of the noise floor that results from the resolution limit of the floating-point format is significant. (See Figure 1.)

If high-quality sound is the goal, the best approach is to first decide on the desirable level of performance and then select the number format to achieve it. In the case of EQ, a high level of resolution is needed in parts of the calculation to avoid generating the kind of noise evident in Figure 1. A flexible architecture allows word length to be increased so that it most precisely matches desired performance.

Using the same filter in an audio console that relies on fixed-point-based digital signal processing shows how the fixed-point approach, with its high word length, has reduced the noise floor of the filter to almost exactly that of the test set. (See Figure 2.)

Lack of resolution produces a secondary effect that also impairs filter performance. This is because you cannot accurately add very small numbers to very big ones. Let me demonstrate this with an extreme case. Imagine that you have adopted a seven-digit floating-point format with a four-digit mantissa and a three-digit exponent. You can represent the number 1 million by writing it as 1000 × 10^3. Now, add the number 999 to this. You should get 1,000,999, but since you have only four digits available and the excess digits are discarded, the result you end up with is 1000 × 10^3, the same number you started with. In other words, adding 999 has no effect on the result, an illustration of the limitations of the floating-point system's lack of resolution.
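The thought experiment can be reproduced with Python's decimal module, which lets the number of significant digits be limited to four. The sketch assumes that excess digits are simply dropped (ROUND_DOWN), which is what the example above implies. For comparison it also shows round-to-nearest, where 999 is large enough to nudge the result up by one mantissa step but an addend such as 499 still vanishes. The variable names are hypothetical.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN

# Four significant digits stand in for the four-digit mantissa.
truncating = Context(prec=4, rounding=ROUND_DOWN)
nearest = Context(prec=4, rounding=ROUND_HALF_EVEN)

million = Decimal(1_000_000)

# Excess digits dropped: adding 999 leaves the value unchanged.
truncated_sum = truncating.add(million, Decimal(999))

# Round to nearest: 999 moves the result up one step to 1001 x 10^3 ...
rounded_sum = nearest.add(million, Decimal(999))

# ... but anything below half a step is still lost.
lost_sum = nearest.add(million, Decimal(499))

print(truncated_sum, rounded_sum, lost_sum)
```

Either way, the smallest change that the four-digit mantissa can register at this magnitude is 1,000, so the contribution of a much smaller signal is lost or distorted.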

In conclusion, it is inevitable that floating-point numbers will introduce arithmetic errors when they are subjected to the mathematics of audio processing. Candidly, these errors are small and often irrelevant, especially when compared to the many other threats encountered by audio quality on the path between microphone and living room. Still, if numerical precision is our goal, as it should be, then we should strive for objective analysis of the virtues of the competing notational systems, floating point and fixed point, rather than falling prey either to intuitive explanations or the common wisdom.

Patrick Warrington is technical director at Calrec.
