This isn't for any class or homework. As part of my personal study, I'm trying to better understand the IEEE754 representation of decimal floating-point numbers in binary. I'd like to add two numbers: $1.111$ and $2.222$, then compare the result by converting the IEEE754 representation of the sum back to decimal.
Per this online tool:
- $1.111 = 00111111100011100011010100111111$
- $2.222 = 01000000000011100011010100111111$
Summing these two together using signed binary addition, I get:
$0111 1111 1001 1100 0110 1010 0111 1110$
In hexadecimal, this is:
$7F9C6A7E$
And according to this other version of the tool, that corresponds to $NaN$.
What's going on here?
Best Answer
You cannot expect to use integer binary addition on two floating-point representations and get a meaningful result.
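To see concretely why the integer sum fails, here is a minimal Python sketch (the helper name `bits_to_float` is made up for illustration). It adds the two bit patterns from the question as plain integers and reinterprets the result as a single-precision value. The exponent fields, 127 and 128, get added along with everything else and give 255, the all-ones exponent that IEEE-754 reserves for infinities and NaNs.

```python
import math
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

a = int("00111111100011100011010100111111", 2)  # bit pattern for 1.111
b = int("01000000000011100011010100111111", 2)  # bit pattern for 2.222

int_sum = a + b                      # plain integer addition of the patterns
exponent_field = (int_sum >> 23) & 0xFF
mantissa_field = int_sum & 0x7FFFFF

print(f"{int_sum:08X}")              # 7F9C6A7E
print(exponent_field)                # 255 (127 + 128)
print(mantissa_field != 0)           # True: all-ones exponent, nonzero mantissa
print(math.isnan(bits_to_float(int_sum)))  # True
```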
First, $1.111$ cannot be represented exactly in binary floating point. Your
00111111100011100011010100111111
is actually the IEEE-754 single-precision representation of the number $$ 1.11099994182586669921875 $$ which is the closest representable number to $1.111$. This breaks up as sign bit $0$, exponent field $01111111_2 = 127$, and mantissa field $00011100011010100111111$, and stands for the number $$ 1.00011100011010100111111_2 \times 2^{127-127} $$
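The split into sign, exponent, and mantissa fields can be checked with a short sketch. The function `decode_single` is a hypothetical helper written for this answer, and it assumes a normal number (exponent field neither all zeros nor all ones).

```python
from decimal import Decimal
from fractions import Fraction

def decode_single(pattern: str):
    """Split a 32-bit pattern into its fields and exact value (normal numbers only)."""
    bits = int(pattern, 2)
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, implicit leading 1 not stored
    significand = Fraction((1 << 23) | fraction, 1 << 23)
    value = (-1) ** sign * significand * Fraction(2) ** (exponent - 127)
    return sign, exponent, fraction, value

sign, exponent, fraction, value = decode_single("00111111100011100011010100111111")
print(sign, exponent, f"{fraction:023b}")  # 0 127 00011100011010100111111
# A 24-bit significand fits exactly in a Python float, so this conversion is exact.
print(Decimal(float(value)))               # 1.11099994182586669921875
```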
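The representation of $2.222$ is twice that, with the same mantissa but the exponent one higher. When we add them we must position the mantissas correctly with respect to each other, so the mantissa of $2.222$ is shifted one place to the left: $$ \begin{array}{r} 1.00011100011010100111111_2 \\ {}+ 10.00111000110101001111110_2 \\ \hline 11.01010101001111110111101_2 \end{array} $$ Normalized, the sum is $1.101010101001111110111101_2 \times 2^{1}$, which has 24 bits after the binary point, one more than the mantissa field can hold. The extra bit is a $1$, so the exact sum lies halfway between two representable numbers, and round-to-even picks the one whose mantissa ends in $0$: $1.10101010100111111011110_2 \times 2^{1}$.
And the representation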
01000000010101010100111111011110
corresponds to the number $$ 3.332999706268310546875 $$ Note that this is not the closest representable number to $3.333$. That would be the next one up, $$ 3.3329999446868896484375 $$ The exact sum fell halfway between two representable numbers, and the round-to-even rule rounded it down. This compounded the error already present in the two inputs, each of which is slightly smaller than $1.111$ and $2.222$ respectively.
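The alignment and rounding can be reproduced with integer arithmetic on the significands. This is a minimal sketch for this specific pair of inputs, not a general floating-point adder. The helper names are hypothetical, and it assumes the platform's `struct` conversion to single precision rounds to nearest, ties to even, which is the usual default.

```python
import struct
from decimal import Decimal

def bits_to_float(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def float_to_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]

# 24-bit significand (implicit leading 1 restored) shared by both inputs.
m = (1 << 23) | (0x3F8E353F & 0x7FFFFF)

# In units of 2**-23: 1.111 contributes m, and 2.222 contributes 2*m
# because its exponent is one higher.
exact_sum = m + (m << 1)

# The exact sum needs 25 bits; dropping one bit renormalizes it to exponent 1.
quotient, dropped = divmod(exact_sum, 2)
is_tie = dropped == 1            # a single dropped 1 bit means exactly halfway
if is_tie and quotient % 2 == 1:
    quotient += 1                # round half to even (overflow to 2**24 not handled)

result_bits = (128 << 23) | (quotient & 0x7FFFFF)  # sign 0, biased exponent 127 + 1
print(is_tie)                                      # True
print(f"{result_bits:032b}")                       # 01000000010101010100111111011110
print(Decimal(bits_to_float(result_bits)))         # 3.332999706268310546875

# Cross-check: the double-precision sum is exact, and packing it rounds to single.
a = bits_to_float(0x3F8E353F)
b = bits_to_float(0x400E353F)
print(float_to_bits(a + b) == result_bits)         # True

# The single-precision value nearest to 3.333 is one step higher.
nearest_bits = float_to_bits(3.333)
print(nearest_bits - result_bits)                  # 1
print(Decimal(bits_to_float(nearest_bits)))        # 3.3329999446868896484375
```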