[Math] Converting 0.1 to binary 64 bit double


I want to convert the decimal number 0.1 to a binary 64-bit double. I do it like this:

$$ 0.1_{10} = 0.00011001100110011001100110011001100110011001100110011001100110… \times 2^0 $$
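To double-check this expansion, here is a minimal sketch that performs the base-2 long division of 1 by 10 with exact integer arithmetic, emitting one fraction bit per step:

#include <stdio.h>

int main(void) {
    int rem = 1;                          /* numerator of the remaining fraction rem/10 */
    printf("0.");
    for (int i = 0; i < 60; i++) {        /* first 60 fraction bits */
        rem *= 2;                         /* shift the fraction left by one bit */
        putchar(rem >= 10 ? '1' : '0');   /* the bit that moved past the binary point */
        if (rem >= 10) rem -= 10;
    }
    putchar('\n');                        /* prints 0.000110011001100110011... */
}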

Represent it in the scientific form:

$$ 1.1001100110011001100110011001100110011001100110011001100110… \times 2^{-4} $$

A 64-bit IEEE 754 double has 52 bits for the mantissa (the fraction field), so I need to round the fraction to 52 bits.

$$ 1.\underbrace{1001100110011001100110011001100110011001100110011001}_{\text{52 bits}}100110… \times 2^{-4} $$

So I have to round to either:

the smaller number (simple truncation):

$$ 1.1001100110011001100110011001100110011001100110011001 $$

the larger number (the truncated value plus 1 in the last bit):

$$ 1.1001100110011001100110011001100110011001100110011010 $$

Since the 53rd bit is 1 (and the remaining discarded bits are not all zero, so this is not a halfway case), round-to-nearest rounds up to the larger number. So I have the mantissa part ready.
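To confirm which candidate the conversion of the literal 0.1 actually produces, here is a minimal sketch: the two hexadecimal constants are the 53-bit significands of the candidates above written as integers, and ldexp applies the combined scaling of 2^-52 (fraction bits) and 2^-4 (exponent) exactly:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* 1.1001...1001 * 2^-4 (truncated) and 1.1001...1010 * 2^-4 (rounded up) */
    double truncated = ldexp((double)0x19999999999999ULL, -56);
    double roundedUp = ldexp((double)0x1999999999999AULL, -56);
    printf("0.1 == truncated : %d\n", 0.1 == truncated);  /* prints 0 */
    printf("0.1 == rounded up: %d\n", 0.1 == roundedUp);  /* prints 1 */
}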

Then I calculate the biased exponent (the exponent field is 11 bits wide):

$$ 2^{11-1} - 1 = 1023\\
1023 - 4 = 1019\\
1019_{10} = 1111111011_2 $$

Padded to the full 11-bit field, this is 01111111011.
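As a quick arithmetic check, a small sketch that computes the bias and prints the full 11-bit exponent field:

#include <stdio.h>

int main(void) {
    int bias = (1 << (11 - 1)) - 1;   /* 2^(11-1) - 1 = 1023 */
    int biased = bias + (-4);         /* exponent of 2^-4 gives 1019 */
    printf("biased exponent = %d = 0b", biased);
    for (int i = 10; i >= 0; i--)     /* all 11 bits of the exponent field */
        putchar('0' + ((biased >> i) & 1));
    putchar('\n');                    /* prints biased exponent = 1019 = 0b01111111011 */
}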

So the final representation should be:
$$ \underbrace{0}_{sign}\underbrace{01111111011}_{exponent}\underbrace{1001100110011001100110011001100110011001100110011010}_{mantissa} $$

Is this correct?

Best Answer

Short C code:

#include <stdio.h>
int main(void) {
    double x = 0.1;
    long long n = *(long long *)&x;  /* reinterpret the double's bits as an integer */
    printf("%llX\n", n);
}

Gives 3FB999999999999A, which is equivalent to:

0011 1111 1011 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010
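For comparison with the breakdown in the question, a small sketch that splits this bit pattern (taken here directly as a hexadecimal constant) back into its three fields:

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint64_t bits = 0x3FB999999999999AULL;
    uint64_t sign     = bits >> 63;                    /* 1 bit   */
    uint64_t exponent = (bits >> 52) & 0x7FF;          /* 11 bits */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;     /* 52 bits */
    printf("sign     = %" PRIu64 "\n", sign);                  /* 0 */
    printf("exponent = %" PRIu64 " biased, %d unbiased\n",
           exponent, (int)exponent - 1023);                    /* 1019, -4 */
    printf("mantissa = 0x%013" PRIX64 "\n", mantissa);         /* 0x999999999999A */
}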

For the record: because it violates the strict aliasing rule, I cannot recommend this method of reading the bits.
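A common aliasing-safe alternative is to copy the bytes with memcpy; a minimal sketch, assuming double and uint64_t are both 8 bytes (which holds on typical IEEE 754 platforms):

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    double x = 0.1;
    uint64_t n;
    memcpy(&n, &x, sizeof n);    /* copies the object representation; no aliasing violation */
    printf("%" PRIX64 "\n", n);  /* prints 3FB999999999999A */
}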