[Math] How to add IEEE 754 half precision numbers

floating point

I'm stuck on an exercise about adding two IEEE 754 half precision numbers. The numbers are:

$1110001000000001$
$0000001100001111$

I have tried to solve it using this procedure:

Half precision is:
$1$ sign bit, $5$ bits exponent and $10$ bits mantissa

So I rewrote the numbers in this way:

$1\hspace{1em}11000\hspace{0.5em}1000000001$
$0\hspace{1em}00000\hspace{0.5em}1100001111$

So the value of the first exponent is $24$; I subtract $15$ from $24$ and the result is $9$.

But I don't know how to continue, and I haven't found a good reference or tutorial on this. Can someone explain the procedure for solving this?

Best Answer

The information you need is all in the Wikipedia pages on floating point arithmetic and IEEE 754. I agree with the comment suggesting that you may do better on a computer stackexchange site, but let me sketch a few hints about the mathematics that may help you understand the Wikipedia pages.

The basic mathematical idea is that a number $x$ with sign bit $s \in \{0, 1\}$, mantissa $c \in \Bbb{N}$ and exponent $q \in \Bbb{Z}$ in base $2$ represents the number

$$x = (-1)^s \times c \times 2^q$$

(The Wikipedia pages also refer to the mantissa $c$ as the "significand" or "coefficient".) For each defined precision, there are rules for the allowable ranges of $c$ and $q$.

If you want to add to $x$ another IEEE 754 number $x'$, represented say by $s'$, $c'$ and $q'$, you first adjust the representations so that both use the smaller of the exponents $q$ and $q'$. E.g., if $q \le q'$, you represent $x'$ with exponent $q$ using the identity:

$$x' = (-1)^{s'} \times c' \times 2^{q'} = (-1)^{s'} \times (2^{q'-q}c') \times 2^q$$

Once you have made the exponents the same, you can meaningfully add or subtract the adjusted mantissas and calculate the sign of the result according to the sign bits $s$ and $s'$, giving say:

$$x + x' = (-1)^{s''} \times c'' \times 2^q$$

E.g., if $s = 0$, $s' = 1$ and $c < 2^{q'-q}c'$, this will be $(-1)^1 \times (2^{q'-q}c' - c) \times 2^q$.
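As a rough illustration of the alignment and addition steps, here is a minimal Python sketch. The function name `add_aligned` is made up for this answer, and Python's unbounded integers keep the sum exact, which real hardware does not do. The example values assume the two numbers in the question decode to $c = 1537$, $q = -1$ (implicit leading bit restored) and $c' = 783$, $q' = -24$ (denormalised).

```python
def add_aligned(s1, c1, q1, s2, c2, q2):
    """Exactly add (-1)^s1 * c1 * 2^q1 and (-1)^s2 * c2 * 2^q2.

    Returns (s, c, q) with q = min(q1, q2). No rounding happens here.
    """
    q = min(q1, q2)
    # Rescale each mantissa to the common (smaller) exponent: c_i * 2^(q_i - q).
    m1 = c1 << (q1 - q)
    m2 = c2 << (q2 - q)
    # Apply the signs, add as ordinary integers, then split off the sign again.
    total = (-m1 if s1 else m1) + (-m2 if s2 else m2)
    # Simplification: an exact zero is returned with sign 0.
    s = 1 if total < 0 else 0
    return s, abs(total), q


# Assumed decoding of the question's inputs: -1537 * 2^-1 and +783 * 2^-24.
s, c, q = add_aligned(1, 1537, -1, 0, 783, -24)
print(s, c, q)  # 1 12893289713 -24
```

The result has far more than 11 significant bits, which is why the rounding step described next is needed.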

You then "normalise" the result, i.e., you round and scale to make it conform to the rules for the required precision (if possible). With care, you can arrange for the intermediate results in all these calculations to involve just one or two additional bits.
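The normalisation step might look something like the following Python sketch. It assumes round-to-nearest with ties to even (the usual default mode), works on an exact integer mantissa rather than with a couple of guard bits, and does not handle NaNs. The name `round_to_half` is hypothetical, and the example input assumes the exact sum of the question's two numbers is $-12893289713 \times 2^{-24}$.

```python
def round_to_half(s, c, q):
    """Round (-1)^s * c * 2^q to half precision; return the 16-bit pattern."""
    if c == 0:
        return s << 15
    # Aim for an 11-bit mantissa (1024 <= c < 2048), but never let the
    # exponent drop below -24, the fixed exponent of denormalised numbers.
    shift = max(c.bit_length() - 11, -24 - q)
    if shift > 0:
        rem = c & ((1 << shift) - 1)   # bits that are about to be dropped
        half = 1 << (shift - 1)
        c >>= shift
        # Round to nearest; on an exact tie, round to the even mantissa.
        if rem > half or (rem == half and c & 1):
            c += 1
    else:
        c <<= -shift
    q += shift
    if c == 2048:                      # rounding carried into a 12th bit
        c, q = 1024, q + 1
    if c < 1024:                       # denormalised: exponent field 0
        return (s << 15) | c
    e = q + 25                         # biased exponent: q + 15 + 10
    if e >= 31:                        # too large: overflow to infinity
        return (s << 15) | 0x7C00
    return (s << 15) | (e << 10) | (c - 1024)  # drop the implicit top bit


print(format(round_to_half(1, 12893289713, -24), "016b"))  # 1110001000000001
```

Under these assumptions the sum rounds back to the first operand: the second number is much smaller than half the spacing between half precision values near $768$, so adding it changes nothing after rounding.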

As you will find from the Wikipedia pages cited above, there are various special cases and optimisations in the representation. E.g., as you seem to know, the exponent is represented by adding a bias value dependent on the precision ($15$ in the case of half precision). If a number can be normalised, you can infer that the top bit of the mantissa is $1$ and the representation omits it. Denormalised numbers are represented with the exponent bits all $0$ and do include all bits of the mantissa (this applies to one of the numbers in your example).
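To make these rules concrete, here is a small Python sketch that splits a half precision bit pattern into $s$, $c$ and $q$ with $c$ an integer, as in the formula $x = (-1)^s \times c \times 2^q$. The name `decode_half` is made up for illustration, and infinities and NaNs are deliberately left out. Because $c$ is treated as an integer that includes the $10$ fraction bits, $q$ is the stored exponent minus the bias $15$ minus a further $10$; this is why the first number gets $q = -1$ rather than the $9$ computed in the question.

```python
def decode_half(bits):
    """Split a 16-bit pattern into (s, c, q) with value (-1)^s * c * 2^q."""
    s = bits >> 15            # 1 sign bit
    e = (bits >> 10) & 0x1F   # 5 exponent bits, stored with bias 15
    f = bits & 0x3FF          # 10 stored mantissa bits
    if e == 31:
        raise ValueError("infinity or NaN: not handled in this sketch")
    if e == 0:
        # Denormalised: no implicit top bit, exponent fixed at 1 - 15 - 10.
        return s, f, -24
    # Normalised: restore the implicit top bit, exponent is e - 15 - 10.
    return s, f | 0x400, e - 25


print(decode_half(0b1110001000000001))  # (1, 1537, -1)
print(decode_half(0b0000001100001111))  # (0, 783, -24)
```

So the two numbers in the question are $-1537 \times 2^{-1} = -768.5$ and $783 \times 2^{-24}$, the second being denormalised.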

