[Math] Encoding a floating point value

binary, binary operations, floating point

Sincere salutations everyone. I would like to encode 7/8 as a binary floating point number. I know that 7/8 is .111 in binary. However, how would I go about finding the exponent and the mantissa for the value? I know the sign bit would be 0 (as it is positive). How many places would I move the binary point? Thanks!

Best Answer

Assuming the standard IEEE 754 32-bit (single precision) format, we have:

1 bit of sign
8 bits of exponent
23 bits of mantissa

Your expansion $0.111$ is correct; however, we must normalize it so that there is a $1$ in front of the binary point. In this case we multiply it by $2$, which (just as multiplying by $10$ does in base $10$) shifts the digits one place to the left.

Therefore $0.111 = 0.111 \cdot 2^1 \cdot 2^{-1} = (0.111 \cdot 2^1) \cdot 2^{-1} = 1.11 \cdot 2^{-1}$.

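Now you can extract the sign bit, which is $0$; the exponent, which is $-1$ and is stored with the bias of $+127$ as $-1 + 127 = 126 = 01111110$ in binary; and the mantissa, which is obtained by discarding the $1$ in front of the binary point and keeping only the fractional digits, padded with zeros to $23$ bits. In this case the mantissa is $11000000000000000000000$. Putting the three fields together, $7/8$ is encoded as $0\ 01111110\ 11000000000000000000000$.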
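The result can be checked with a short Python sketch. It only uses the standard `struct` module; the helper name `float32_fields` is made up for this illustration and is not part of any library.

```python
import struct

def float32_fields(x):
    """Split x (rounded to IEEE 754 single precision) into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits, true exponent + 127
    fraction = bits & 0x7FFFFF             # 23 bits, leading 1 not stored
    return sign, biased_exponent, fraction

sign, biased_exponent, fraction = float32_fields(7 / 8)
print(sign)                      # 0
print(f"{biased_exponent:08b}")  # 01111110, i.e. 126 = -1 + 127
print(f"{fraction:023b}")        # 11000000000000000000000
```

Since $7/8$ is exactly representable, no rounding takes place and the fields match the hand computation.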

Related Solutions

[Math] the maximum difference between two successive real numbers in the given floating point representation

Denote by $[p,q]$ the set of integers $k$ with $p\leq k\leq q$. Then $e\in[0,62]$, $\>m\in[0,511]$. The set $R$ of representable real numbers is therefore given by $$R=\left\{\pm\left(1+[0,511]\cdot 2^{-9}\right)2^{[0,62]-31}\right\}\cup\{0\}\ .$$ The smallest positive number in $R$ is $2^{-31}$, then we have $512$ jumps of size $2^{-40}$, bringing us to $2^{-30}$, and so on in jumps of ever doubling size, until we reach $2^{31}$. Then come $511$ jumps of size $2^{-9}\cdot 2^{31}=2^{22}$, bringing us to $(2-2^{-9})2^{31}$. The latter is the largest representable number in this system.

It follows that the largest occurring difference between numbers in $R$ is $2^{22}$.
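The count can be confirmed by brute force. The sketch below assumes the format described in this answer (a 9-bit mantissa $m\in[0,511]$, an exponent field $e\in[0,62]$ and a bias of $31$) and lists only the positive values, since the negative ones mirror them. Exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction

# All positive representable values (1 + m * 2**-9) * 2**(e - 31).
values = sorted(
    (1 + Fraction(m, 512)) * Fraction(2) ** (e - 31)
    for e in range(63)    # e in [0, 62]
    for m in range(512)   # m in [0, 511]
)

# Differences between neighbouring values.
gaps = [b - a for a, b in zip(values, values[1:])]
max_gap = max(gaps)

print(values[0] == Fraction(1, 2**31))  # True: smallest positive value
print(values[-1] == 2**32 - 2**22)      # True: (2 - 2**-9) * 2**31
print(max_gap == 2**22)                 # True
```

The gaps around $0$ (from $-2^{-31}$ to $0$ and from $0$ to $2^{-31}$) are not in the list, but they are only $2^{-31}$ wide and so do not affect the maximum.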

[Math] How to convert from floating point binary to decimal in half precision(16 bits)

You are right.

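You can check this automatically with Python and NumPy:

    import struct
    import numpy as np

    # Pack the 16-bit pattern as an unsigned short, then reinterpret the same bytes as a float16.
    raw = struct.pack("H", int("0101011101010000", 2))
    print(np.frombuffer(raw, dtype=np.float16)[0])

This prints 117.0.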
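The same value can also be obtained by decoding the fields by hand, which mirrors the pencil-and-paper conversion. This is a minimal sketch that assumes the IEEE 754 half precision layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits) and handles only normal numbers; the function name `decode_half` is made up for this illustration.

```python
def decode_half(bits):
    """Decode a 16-character bit string as a half precision number (normal numbers only)."""
    if len(bits) != 16:
        raise ValueError("expected exactly 16 bits")
    word = int(bits, 2)
    sign = word >> 15                # 1 bit
    exponent = (word >> 10) & 0x1F   # 5 bits, true exponent + 15
    fraction = word & 0x3FF          # 10 bits, leading 1 not stored
    if exponent in (0, 31):
        # Assumption of this sketch: zero, subnormals, infinities and NaN are left out.
        raise ValueError("only normal numbers are handled in this sketch")
    significand = 1 + fraction / 1024
    return (-1) ** sign * significand * 2.0 ** (exponent - 15)

print(decode_half("0101011101010000"))  # 117.0
```

Here the exponent field is $10101_2 = 21$, so the true exponent is $21 - 15 = 6$, and the fraction field is $1101010000_2 = 848$, giving $(1 + 848/1024)\cdot 2^6 = 117$.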
