[Math] Floating point representation in 8 bit

floating-point numerical-methods

A computer has 8 bits of memory for floating point representation.

The first is assigned for the sign, the next four bits for the exponent and the last three for the mantissa.

The computer has no representation for $\infty$, and 0 is represented as in IEEE 754. Assume that the mantissa starts with $\text{base}^{-1}$ and that to the left of the mantissa there is an implied 1 that does not consume a place value.

  1. What is the smallest positive number that can be represented?

  2. What is the machine epsilon?

  3. How many numbers in base 10 can be represented?

  1. In general the number of usable exponent values is $2^{(\text{bits})}-2$, so in this case we have $2^4-2=14$ and the exponent ranges from $-6$ to $7$. The smallest positive number is then $1.001 \times 2^{-6}=2^{-6}+2^{-9}=0.017578125$.

  2. To find the machine epsilon we take $\text{base}^{-(p-1)}$, where $p$ is the number of significant bits in the mantissa, which gives $2^{-(3-1)}=2^{-2}=0.25$.

How should I approach 3, and are my solutions to 1 and 2 correct?

Best Answer

Due to the finite precision of the computer, numbers used in calculations must conform to the format imposed by the machine. So only real numbers with a finite number of digits can be represented. A normalized floating point system $\mathbb{F}=F(\beta,p,e_{\text{min}},e_{\text{max}})$ consists of a set of real numbers written in normalized floating point form $x=\pm m \times \beta^{e}$, where $m$ is the mantissa of $x$ and $e$ is the exponent.

If $x \neq 0$ then the mantissa $m$ can be written as: \begin{equation} m = a_N +a_{N-1} \beta^{-1}+...+a_{-p} \beta^{-p-N} \end{equation} with $a_N \neq 0$ and $e_{\text{min}} \leq e \leq e_{\text{max}}$. If $x=0$ then the mantissa $m=0$ while the exponent $e$ can take any value.

In the above expressions, $p$ is the precision of the system, $\beta$ the base, and $[e_{\text{min}},e_{\text{max}}]$ the exponent range, with $e_{\text{min}}<0$, and $e_{\text{max}}=|e_{\text{min}}|+1$.
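To connect the form $x=\pm m \times \beta^{e}$ with the 8-bit layout in the question, here is a minimal Python sketch of a decoder. It rests on several assumptions that the question does not state: a bias of 7 (chosen so that stored exponents 1 to 14 map to the asker's range $-6$ to $7$), stored exponent 0 used only for zero, and stored exponent 15 left unused. The names `BIAS` and `decode` are hypothetical.

```python
BIAS = 7  # assumption: stored exponents 1..14 map to e = -6..7

def decode(byte):
    """Decode an 8-bit pattern: 1 sign bit, 4 exponent bits, 3 mantissa bits."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    stored_exp = (byte >> 3) & 0b1111
    frac_bits = byte & 0b111
    if stored_exp == 0 and frac_bits == 0:
        return sign * 0.0  # zero, encoded as in IEEE 754
    if stored_exp in (0, 15):
        # assumption: these patterns are reserved and not handled here
        raise ValueError("pattern not covered by this sketch")
    mantissa = 1 + frac_bits / 8  # implied leading 1 plus three fraction bits
    return sign * mantissa * 2.0 ** (stored_exp - BIAS)

print(decode(0b0_0111_000))  # 1.0
print(decode(0b0_0111_001))  # 1.125
print(decode(0b0_0001_000))  # 0.015625
print(decode(0b0_1110_111))  # 240.0
```

Under these assumptions the mantissa $1.f_1f_2f_3$ has $p=4$ significant bits, because the implied 1 counts towards the precision.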

According to the definition, the mantissa $m$ lies in the range $[1,\beta)$. The machine epsilon is $\beta^{1-p}$ and represents the difference between the mantissae of two successive positive numbers. A positive number $x \in \mathbb{F}$ belongs to the range $[x_{\text{min}}, x_{\text{max}}]$, where: \begin{equation} x_{\text{min}} = \beta^{e_{\text{min}}} \end{equation} and \begin{equation} x_{\text{max}} = (\beta-1)(1+\beta^{-1}+\beta^{-2}+\dots + \beta^{-(p-1)}) \beta^{e_{\text{max}}}< \beta^{e_{\text{max}}+1} \end{equation}

We now prove the statement above. The general representation of $x \in \mathbb{R}$ in base $\beta$ is: \begin{equation} x=\pm (a_N \beta^N+a_{N-1} \beta^{N-1}+\dots+a_1 \beta+a_0+a_{-1} \beta^{-1}+\dots+a_{-p} \beta^{-p})= \pm m \times \beta^{e} \end{equation} Factoring out $\beta^N$ we have: \begin{equation} x=\pm (a_N +a_{N-1} \beta^{-1}+\dots+a_1 \beta^{-N+1}+a_0 \beta^{-N}+a_{-1} \beta^{-1-N}+\dots+a_{-p} \beta^{-p-N}) \times \beta^N= \pm m \times \beta^{e} \end{equation} We can identify $N$ with $e$ ($N=e$). Then: \begin{equation} m=\sum_{i=-p}^N a_i \beta^{i-N} \end{equation}

The minimum value of $m$ is reached when the leading digit is $1$ and all the remaining digits are $0$. In this case $m=1$, and taking $e=e_{\text{min}}$ gives $x_{\text{min}} = \beta^{e_{\text{min}}}$. The maximum value of $m$ is obtained when each of the $p$ digits equals $\beta-1$, so that $m=(\beta-1)(1+\beta^{-1}+\dots+\beta^{-(p-1)})=\beta-\beta^{1-p}<\beta$; taking $e=e_{\text{max}}$ gives $x_{\text{max}}$.
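As a numerical check of the expressions for $x_{\text{min}}$ and $x_{\text{max}}$, the following Python sketch evaluates them with exact fractions. The function name `representable_range` is hypothetical, and the example system $F(2,4,-6,7)$ is an assumption: $p=4$ counts the implied leading 1 plus the three stored mantissa bits, and the exponent range is the one proposed in the question.

```python
from fractions import Fraction

def representable_range(beta, p, e_min, e_max):
    """Return (x_min, x_max) for the normalized system F(beta, p, e_min, e_max)."""
    beta = Fraction(beta)
    # Smallest mantissa is 1: leading digit 1, all other digits 0.
    x_min = beta ** e_min
    # Largest mantissa: all p digits equal to beta - 1.
    m_max = (beta - 1) * sum(beta ** (-k) for k in range(p))
    x_max = m_max * beta ** e_max
    return x_min, x_max

# Assumed system for the 8-bit format in the question.
x_min, x_max = representable_range(2, 4, -6, 7)
print(float(x_min), float(x_max))  # 0.015625 240.0
```

With these assumptions the smallest positive normalized number is $2^{-6}$, since the smallest mantissa is $1.000$ rather than $1.001$.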

The machine epsilon is defined as $\epsilon_M=\beta^{1-p}$. It is a measure of the precision of the system, since it is an upper bound on the relative distance between two consecutive numbers. A real number that does not fit the finite format imposed by the computer cannot be represented exactly in a normalized floating point system.
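The bound on the relative distance can be illustrated with a short Python sketch. The helper names are hypothetical, and the values $\beta=2$, $p=4$ are my reading of the 8-bit format in the question (three stored mantissa bits plus the implied 1), not something the question confirms.

```python
from fractions import Fraction

def machine_epsilon(beta, p):
    """epsilon_M = beta**(1 - p) as an exact fraction."""
    return Fraction(beta) ** (1 - p)

def relative_gap(m, beta, p):
    """Relative distance from a number with mantissa m to the next larger one.

    The factor beta**e cancels, so the exponent does not matter.
    """
    return machine_epsilon(beta, p) / m

eps = machine_epsilon(2, 4)
print(eps)                                  # 1/8
print(relative_gap(Fraction(1), 2, 4))      # 1/8  (largest, at m = 1)
print(relative_gap(Fraction(15, 8), 2, 4))  # 1/15 (smallest, at m = 1.111 in binary)
```

The relative gap equals $\epsilon_M$ only at $m=1$ and is smaller for every other mantissa, which is why $\epsilon_M$ acts as an upper bound.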

The total number of elements in $\mathbb{F}$ is given by the following expression: \begin{equation} 2 (\beta-1) \beta^{p-1} (e_{\text{max}}-e_{\text{min}}+1)+2 \end{equation}

Computers typically work in single or double precision. IEEE standard single-precision floating point numbers belong to the normalized floating point system $F(2, 24, -126, +127)$, while IEEE standard double-precision floating point numbers belong to the normalized floating point system $F(2, 53, -1022, +1023)$.
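For a system as small as the one in the question, the counting formula can be compared against a brute-force enumeration. The Python sketch below is illustrative only: the function names are hypothetical, and $F(2,4,-6,7)$ is an assumed reading of the 8-bit format (implied leading 1, exponent range taken from the question). The enumeration produces only nonzero normalized values; interpreting the final $+2$ in the formula as the two signed zeros is my assumption.

```python
from fractions import Fraction
from itertools import product

def enumerate_nonzero(beta, p, e_min, e_max):
    """Return the set of all nonzero normalized values in F(beta, p, e_min, e_max)."""
    values = set()
    for e in range(e_min, e_max + 1):
        for digits in product(range(beta), repeat=p):
            if digits[0] == 0:
                continue  # normalized form requires a nonzero leading digit
            # m = d_0 + d_1/beta + ... + d_{p-1}/beta**(p-1)
            m = sum(Fraction(d, beta ** k) for k, d in enumerate(digits))
            x = m * Fraction(beta) ** e
            values.add(x)
            values.add(-x)
    return values

def count_formula(beta, p, e_min, e_max):
    """Total number of elements according to the formula in the answer."""
    return 2 * (beta - 1) * beta ** (p - 1) * (e_max - e_min + 1) + 2

nonzero = enumerate_nonzero(2, 4, -6, 7)
print(len(nonzero))                 # 224
print(count_formula(2, 4, -6, 7))   # 226
```

The two counts differ by exactly the $+2$ term, so under these assumptions the 8-bit format would hold 224 nonzero values plus the representations of zero.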