
Fixed-Point Arithmetic

The PlayStation and other old computers did not have built-in support for floating-point math. Technically, floating-point calculations can be done in software, but that is often too slow and impractical for games, where speed is crucial.
  As you may have figured out, I'm very inspired by the PS1, and I always wanted to understand how those games from my childhood were made. I decided to make an entire game engine using nothing but fixed-point arithmetic in order to get comfortable with it. Ironically, doing so has also taught me to understand and appreciate the floating-point standard (IEEE 754) a lot more.

In this post I'll try to describe the fixed-point format and how to perform fixed-point arithmetic. A part 2 with more specific applications of it may appear later.

Basics

Fixed-point is a method of storing the whole and fractional parts of a real number in an integer, and fixed-point arithmetic is a way to perform arithmetic operations on numbers stored that way.

With fixed-point, N bits of an integer are used to represent the fractional part, while the rest are used to represent the whole integer part.

For example, if 8 bits are used for the integer as well as the fractional part:

IIIIIIII.FFFFFFFF

That's 8 I's and 8 F's. The higher significance bits (I) contain the integer part, and the lower significance bits (F) contain the fractional part.

You could denote this format as '8.8'. When commenting code, I often just type 'X.8', because the upper bits may not matter for the purpose of the comment. (It's not my invention by the way, just a thing I've picked up reading code.)

Here's the decimal number 1.25 as fixed-point in binary:
  00000001.01000000

The integer part, as a standalone integer, would have the value 1. Since it's left-shifted 8 bits, it has the raw value 256. If converted properly to a float, it would go back to having the value '1.0'.

The fractional part is a '1' left-shifted 6 bits, or 64, which is the product 256 * 0.25.

If you were to add up 256 and 64, you'd get 320. That looks like this in binary:
  00000001 01000000
And with the fixed-point dot at the 8th bit:
  00000001.01000000

So there's nothing strange or mysterious going on here. It's like expressing 0.9 meters by writing 90 centimeters, or writing 9 inches instead of 0.75 feet. Except here, your meter/foot unit would be some power of two of the cm/inch unit, not 100 or 12. A useful quirk.

Decimal   Integer (Raw)   Binary
1.0       256             0000 0001 . 0000 0000
2.0       512             0000 0010 . 0000 0000
0.5       128             0000 0000 . 1000 0000
2.25      576             0000 0010 . 0100 0000

For ease of reading, I'll stick to X.8 throughout this post, but you can use any number of fractional bits. There are no rules. Use what's practical. I use a wide range of formats in my games.
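
As a concrete sketch in C (the fx8 type and the macro names are my own, not any standard), converting to and from the X.8 format could look like this:

  #include <stdint.h>

  typedef int32_t fx8;    /* X.8 fixed-point stored in a plain integer */

  #define FX8_ONE 256                            /* 1.0 in X.8 */
  #define FX8_FROM_INT(i)   ((fx8)(i) * FX8_ONE)
  #define FX8_TO_INT(x)     ((x) / FX8_ONE)
  #define FX8_FROM_FLOAT(f) ((fx8)((f) * 256.0f))
  #define FX8_TO_FLOAT(x)   ((x) / 256.0f)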

Addition and Subtraction

Plus and minus work the same as with regular integers; nothing strange needs to happen. 2.0 + 0.5 translates to 512 + 128, and the result is 640. And indeed, 640.0 / 256.0 = 2.5.
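
In code, using the fx8 sketch from above, that's just ordinary integer addition and subtraction:

  fx8 a = 512;          /* 2.0 */
  fx8 b = 128;          /* 0.5 */
  fx8 sum  = a + b;     /* 640, i.e. 2.5 */
  fx8 diff = a - b;     /* 384, i.e. 1.5 */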

Multiplication

This is where it gets tricky, and even more so considering the risk of overflow (more on that later).

Because the integers are left-shifted, multiplying produces a much larger number than we want. 1.0 * 1.0 should produce 1.0 again, but 256 * 256 produces 65536, which is wrong.
  In fact, it's too large by a factor of 256, or 8 bits. We must divide or right-shift.

  256 = 256 * 256 / 256
  256 = 256 * 256 >> 8
  product = factor * factor / 256
  product = factor * factor >> 8

There are differences between right-shifting and dividing if the input is negative. For example, (-1)/256 will be 0, whereas (-1)>>8 will remain -1.

The safest bet is to always use division. When dividing by a literal like '256', the compiler will most likely figure out that the (expensive) division can be replaced by a shift, plus some additional (cheap) instructions if the input is signed.

I personally like to use >> when I know the input is going to be positive, or where small errors are negligible.
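
Putting that together, a multiply helper might look like this (just a sketch; the 64-bit intermediate assumes your target handles one, and fx8_mul is my own name):

  /* Multiply two X.8 numbers. The wide intermediate avoids overflow,
     and dividing instead of shifting rounds negative results toward
     zero, like ordinary integer division. */
  static fx8 fx8_mul(fx8 a, fx8 b)
  {
      return (fx8)(((int64_t)a * b) / 256);
  }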

Division

Performing division is similar to multiplication, just the other way around. 1.0/1.0 should be 1.0, but 256/256 is just 1, which converted back to decimal would be 1.0/256.0, or 0.00390625. So before performing the division, we either multiply the dividend by 256, or left-shift it by 8.

  ratio = dividend * 256 / divisor
  ratio = ( dividend << 8 ) / divisor

(There is again a risk of overflow if the dividend is so large that fewer than 8 leading bits remain.)
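
A division helper in the same style as fx8_mul above (again, just a sketch) could be:

  /* Divide two X.8 numbers. Widening the dividend before scaling it
     by 256 keeps the 8 extra leading bits from overflowing. */
  static fx8 fx8_div(fx8 a, fx8 b)
  {
      return (fx8)(((int64_t)a * 256) / b);
  }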

Overflow

A challenge when using fixed-point is to avoid integer overflow. It happens easily because of the potentially large left-shift, especially when performing multiplication.

For example, on a 16-bit CPU using the 8.8 fixed-point format, even multiplying 1.0 by 1.0 may cause trouble.

65536 = 256 * 256
65536 is '1 0000 0000 0000 0000' in binary.

That is 17 bits. Even though you right-shift by 8 after the operation, the overflow has already happened on a 16-bit CPU. It's something you must be aware of. (On some CPUs there are flags that will be set when overflow happens (CF and OF on x86), and the result may still be useful for continuing to work towards a desired outcome.)

On a 32 or 64-bit CPU, this case (256*256) wouldn't cause overflow as we have many bits to spare, but you need to watch out when using larger numbers, or perhaps more fractional bits.

One way to avoid overflow is to separate the integer and fractional parts and perform the operation in multiple steps:

  aI = a >> 8      /* integer part (arithmetic shift, two's complement) */
  aF = a & 255     /* fractional part */
  bI = b >> 8
  bF = b & 255

  r  = (aI * bI) << 8    /* integer * integer */
  r += aI * bF           /* cross terms are already X.8 */
  r += bI * aF
  r += (aF * bF) >> 8    /* fraction * fraction */

There are also checks that can be done ahead of an operation. One favorite of mine is to get the most significant set bit of a variable, which can be done cheaply on most CPUs with a single instruction.
 If there are 16 leading zeroes, you know how far you can left-shift something (in order to get the maximum precision when dividing, i.e. 1000/3 gives you more information than 10/3), or how large a number you can safely multiply it with.
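
On GCC and Clang, for instance, __builtin_clz gives you the leading-zero count directly (note that it's undefined for an input of 0):

  /* How many bits of headroom v has, i.e. how far it can be
     left-shifted before the top set bit falls off. Undefined for 0. */
  static int headroom(uint32_t v)
  {
      return __builtin_clz(v);
  }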

Typically though, I prefer to just know the possible range of values that will enter a function. With a well-designed program you rarely need to perform these checks. But use asserts in debug builds!
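
For instance (a trivial sketch; the range here is made up for illustration):

  #include <assert.h>

  static fx8 half_speed(fx8 speed)
  {
      /* Written for speeds in (-16.0, 16.0) X.8; catch misuse early. */
      assert(speed > -16 * 256 && speed < 16 * 256);
      return speed / 2;
  }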

Conclusion

In contrast to floating-point, where numbers are distributed with greater separation the further away from 1.0 you get, integers are uniformly spaced, and thus you will reach INT_MAX sooner than you'd reach FLT_MAX.

You can abstract away a lot of the misery, but in the end, fixed-point simply demands more awareness of the bits and the range of variables you use in expressions.

But there are some benefits too. There's no drop in precision as you go further from 1.0, which might make it preferable when you know the confines of a space or simulation.
  It's also nice to be able to use a lot of power-of-two bit-twiddling hacks directly on the format, as with all integers. Avoiding int-to-float conversions (and then back again) is quite nice too, when you're doing things like iterating colors, sound samples, or other situations where a low-bit integer format is expected.


Toodeloo!



