[Math] Calculate the largest possible floating-point value: formula

binary, floating point, number-systems

Suppose I have a 32-bit floating-point representation where 1 bit is the sign, 8 bits are the exponent, and 23 bits make up the mantissa. The exponent notation is Excess-127.

32 = 1 + 8 + 23

The leading "1." of the mantissa is implied and not stored in the binary sequence.
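For concreteness, here is a small Python sketch that unpacks those three fields from a 32-bit pattern (the helper name decode_float32 is made up for illustration):

```python
import struct

def decode_float32(x: float):
    """Split a float's 32-bit pattern into sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored in Excess-127
    mantissa = bits & 0x7FFFFF       # 23 bits; the leading "1." is not stored
    # Value of a normal number: (-1)^sign * 1.mantissa * 2^(exponent - 127)
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_float32(6.5))  # (0, 129, 5242880, 6.5): 6.5 = 1.625 * 2^2
```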

It makes sense to me to calculate the largest possible positive floating-point value using this formula:

(implicit "1." value) + (largest possible bit sequence) × (largest possible exponent)

… which in this case is:

(1) + ( (2^23 - 1) × 2^128 )

The textbook I'm studying from uses this formula:

+ (1 - 2^(-24)) × 2^128

The only part of this formula that makes sense to me is 2^128. I don't understand why they are using 2^(-24), nor why they are subtracting it from 1.

Is the formula correct? And if so, how does it work?
If anyone could point me to resources that explain this, I would be really grateful.

Best Answer

Your formula is slightly off. The expression 2^23 - 1 is the integer value of the all-ones mantissa, but you have forgotten that the mantissa is a fraction: you must scale it by 2^(-23) before adding the implicit 1. If you work out the math, you will see that it is then equivalent to the textbook formula.
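Worked out explicitly (taking the maximum exponent to be 127, per the next paragraph, so that the factor of 2 from the significand becomes the textbook's extra power of two):

```latex
\[
1 + \bigl(2^{23} - 1\bigr)\,2^{-23} \;=\; 2 - 2^{-23} \;=\; 2\bigl(1 - 2^{-24}\bigr),
\qquad\text{hence}\qquad
\bigl(2 - 2^{-23}\bigr)\,2^{127} \;=\; \bigl(1 - 2^{-24}\bigr)\,2^{128}.
\]
```

In other words, the textbook's 1 - 2^(-24) is the full 24-bit significand (implied bit plus 23 stored bits) read as a pure fraction 0.111…1, with the leftover factor of 2 folded into the exponent.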

I also believe you and the textbook are both wrong about the exponent: it should be at most 2^127 (assuming the 8-bit field is signed two's complement, which has the value range [-128..127]).
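For what it's worth, real IEEE-754 single precision reserves the all-ones exponent pattern for infinities and NaNs, so its largest usable exponent is indeed 127. A quick Python check (assuming NumPy is installed):

```python
import numpy as np

# Largest finite single: all-ones mantissa, biased exponent 254, i.e. 2^127.
largest = (2 - 2**-23) * 2.0**127
textbook = (1 - 2**-24) * 2.0**128   # the textbook's form of the same number
print(largest)                       # 3.4028234663852886e+38
assert largest == textbook
assert largest == float(np.finfo(np.float32).max)
```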