I want to convert the decimal number 0.1 to a 64-bit binary double. This is how I do it:
$$ 0.1_{10} = 0.00011001100110011001100110011001100110011001100110011001100110… \times 2^0 $$
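The expansion above can be reproduced by repeated doubling: double the remainder, emit a 1 whenever it reaches the denominator, subtract, and repeat. Here is a minimal C sketch of that idea, working on the exact fraction 1/10 rather than on a floating-point value. The function name `frac_bits` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Write the first n binary digits after the point of num/den into out.
   Illustrative helper: requires num < den and room for n + 1 chars. */
void frac_bits(unsigned num, unsigned den, char *out, size_t n)
{
    assert(den > 0 && num < den);
    unsigned long long r = num;      /* running remainder, always < den */
    for (size_t i = 0; i < n; i++) {
        r *= 2;                      /* shift one binary place to the left */
        if (r >= den) {              /* the integer part became 1 */
            out[i] = '1';
            r -= den;
        } else {
            out[i] = '0';
        }
    }
    out[n] = '\0';
}
```

With `num = 1`, `den = 10` and `n = 12` this fills the buffer with `000110011001`, the leading digits of the expansion above.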
Written in normalized scientific form:
$$ 1.1001100110011001100110011001100110011001100110011001100110… \times 2^{-4} $$
Now, a 64-bit IEEE 754 double allows 52 bits for the mantissa, so I need to round the fractional part to 52 bits:
$$ 1.\underbrace{1001100110011001100110011001100110011001100110011001}_{52\text{ bits}}100110… \times 2^{-4} $$
So I have to round to either:
the smaller number (truncated):
$$ 1.1001100110011001100110011001100110011001100110011001 $$
or the larger number (the truncated number plus one unit in the last place):
$$ 1.1001100110011001100110011001100110011001100110011010 $$
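Which of the two candidates is nearer can be checked with exact integer arithmetic. The sketch below assumes the IEEE 754 default rounding mode (round to nearest, ties to even) and uses the fact that 0.1 = 1.6 × 2^-4 with 1.6 = 8/5. The function name `round_significand` is made up for illustration, and the sketch ignores the case where rounding up carries into a 54th bit, which does not happen here.

```c
#include <assert.h>
#include <stdint.h>

/* Round (num/den) * 2^52 to the nearest integer, ties to even.
   For a significand in [1, 2) the result is a 53-bit integer whose
   top bit is the implicit leading 1. Illustrative helper only. */
uint64_t round_significand(uint64_t num, uint64_t den)
{
    assert(den > 0 && num < 4096);   /* keeps num << 52 inside 64 bits */
    uint64_t scaled = num << 52;
    uint64_t q = scaled / den;       /* truncated candidate */
    uint64_t r = scaled % den;       /* what truncation threw away */

    /* r > den - r means the discarded part is more than half a unit
       in the last place; r == den - r is an exact tie. */
    if (r > den - r || (r == den - r && (q & 1)))
        q += 1;                      /* pick the larger candidate */
    return q;
}
```

`round_significand(8, 5)` returns `0x1999999999999A`: the leading 1 followed by the 52 bits of the larger candidate.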
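Since the 53rd bit is 1 and the bits after it are not all zero, the exact value lies above the midpoint between the two candidates, so I'm rounding up to the larger number. That gives me the mantissa. Next I calculate the biased exponent (11 bits for the exponent):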
$$ \begin{aligned}
2^{11-1} - 1 &= 1023 \\
1023 - 4 &= 1019 \\
1019_{10} &= 1111111011_2
\end{aligned} $$
So the final representation should be:
$$ \underbrace{0}_{\text{sign}}\underbrace{01111111011}_{\text{exponent}}\underbrace{1001100110011001100110011001100110011001100110011010}_{\text{mantissa}} $$
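Putting the three fields together can be sketched in C as follows. The helper `pack_double` is a made-up name, and it only handles normal numbers (biased exponent between 1 and 2046), which is all this example needs.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble sign (1 bit), biased exponent (11 bits) and fraction
   (52 bits) into a 64-bit pattern. Illustrative helper only. */
uint64_t pack_double(unsigned sign, int exponent, uint64_t fraction)
{
    const int bias = (1 << (11 - 1)) - 1;      /* 2^(11-1) - 1 = 1023 */
    int biased = exponent + bias;              /* -4 + 1023 = 1019 */

    assert(sign <= 1);
    assert(biased >= 1 && biased <= 2046);     /* normal numbers only */
    assert(fraction < (UINT64_C(1) << 52));    /* must fit in 52 bits */

    return ((uint64_t)sign << 63)
         | ((uint64_t)biased << 52)
         | fraction;
}
```

With sign 0, exponent -4 and the rounded fraction written in hex as `0x999999999999A`, `pack_double` returns `0x3FB999999999999A`.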
Is this correct?
Best Answer
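A short C program that reinterprets the bits of the double 0.1 as a 64-bit integer and prints them in hexadecimal gives 3FB999999999999A, which is equivalent to:
$$ \underbrace{0}_{\text{sign}}\underbrace{01111111011}_{\text{exponent}}\underbrace{1001100110011001100110011001100110011001100110011010}_{\text{mantissa}} $$
This matches the representation derived in the question.
For the record, due to the strict aliasing rule, I cannot recommend this programming method.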
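One way to read the bit pattern without running into the strict aliasing rule is to copy the bytes with `memcpy`. The sketch below is not the code referred to above. It assumes that `double` is an IEEE 754 binary64 value with the same size and byte order as `uint64_t`, which holds on common platforms but is not guaranteed by the C standard. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw 64-bit pattern of x (assumes IEEE 754 binary64). */
uint64_t double_bits(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);    /* both must be 8 bytes */
    memcpy(&bits, &x, sizeof bits);     /* well-defined byte copy */
    return bits;
}

/* Print the pattern as 16 upper-case hex digits. */
void print_double_bits(double x)
{
    printf("%016llX\n", (unsigned long long)double_bits(x));
}
```

Under those assumptions, `print_double_bits(0.1)` prints `3FB999999999999A`.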