Can I simulate 32-bit floating-point operations with smaller 8- or 16-bit floating point?

arithmetic, modular arithmetic, number theory

Suppose I want to simulate addition, subtraction, multiplication and, most importantly, division for a 32-bit number, but I only have 8-bit registers. Can I use several of them together to represent my number?

I can of course do carry addition/multiplication, but I think the carry operation will accumulate some errors, and also it's O(n^2). Is there a more clever way?
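For reference, here is a minimal sketch of the carry addition mentioned above, assuming the 32-bit value is stored as four little-endian 8-bit limbs. The helper names `to_limbs`, `from_limbs` and `add_limbs` are made up for this illustration. On integers each step is exact, and addition takes one pass over the limbs; the quadratic cost comes from schoolbook multiplication, not from the carry itself.

```python
def to_limbs(n, count=4):
    """Split a non-negative integer into `count` little-endian 8-bit limbs."""
    return [(n >> (8 * i)) & 0xFF for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 8-bit limbs into one integer."""
    return sum(limb << (8 * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Add two equal-length limb lists; returns (result_limbs, carry_out)."""
    result = []
    carry = 0
    for x, y in zip(a, b):
        s = x + y + carry        # at most 255 + 255 + 1 = 511, so 9 bits
        result.append(s & 0xFF)  # the low 8 bits stay in this limb
        carry = s >> 8           # the 9th bit moves on to the next limb
    return result, carry


limbs, carry = add_limbs(to_limbs(0x12345678), to_limbs(0x0FEDCBA9))
print(hex(from_limbs(limbs)), carry)
```

The final `carry` is the overflow out of the top limb, which is what a wider type would need to keep.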

For the integer case, I could use a residue number system (https://en.wikipedia.org/wiki/Residue_number_system), but this does not work for non-integer values (and division is also hard in an RNS).

If not possible for floating point, is it possible for fixed point?

Best Answer

Yes, you can. Here are some rules of thumb:

  1. Addition: to add two numbers in floating-point format, bit-shift the mantissas so that the exponents are aligned, then add the mantissas. Afterwards, adjust the exponent depending on whether you got a carry or on how many leading bits of the resulting mantissa are zeros (or ones, if the mantissa is stored in two's complement).

  2. Subtraction: the same as addition, but with the sign of one operand flipped.

  3. Multiplication: add the exponents and multiply the mantissas. You may then need to adjust the exponent slightly to renormalize the result.

  4. Division is trickier. You can use the same idea as above and split the computation into two parts: subtract the exponents and divide the mantissas. The mantissas can be divided with, for example, Euclidean (long) division, but that is slow and tedious. Even with hardware support for floating-point arithmetic, you want to avoid floating-point divisions because they are slow and can stall the pipeline.
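The four rules above can be sketched roughly as follows. This is only an illustration under several assumptions, not a full IEEE 754 implementation: a number is a hypothetical `(sign, mantissa, exponent)` tuple with value `(-1)**sign * mantissa * 2**exponent`, the mantissa is kept at `P = 24` bits, results are truncated rather than rounded, and there is no handling of infinities, NaNs or exponent overflow. Python integers stand in for the multi-byte mantissa you would keep in several 8-bit registers, and all function names are made up for this example.

```python
import math

P = 24  # mantissa width in bits (an assumption; same as IEEE 754 single)


def normalize(sign, mant, exp):
    """Bring mant into [2**(P-1), 2**P) and adjust exp to compensate."""
    if mant == 0:
        return (0, 0, 0)
    shift = mant.bit_length() - P
    if shift > 0:
        mant >>= shift   # too wide: drop low bits (truncation, no rounding)
    else:
        mant <<= -shift  # leading zeros: shift left
    return (sign, mant, exp + shift)


def from_float(x):
    if x == 0:
        return (0, 0, 0)
    frac, e = math.frexp(abs(x))  # abs(x) == frac * 2**e, 0.5 <= frac < 1
    return (1 if x < 0 else 0, int(frac * (1 << P)), e - P)


def to_float(v):
    sign, mant, exp = v
    return (-1.0 if sign else 1.0) * math.ldexp(mant, exp)


def fadd(a, b):
    if a[1] == 0:
        return b
    if b[1] == 0:
        return a
    if a[2] < b[2]:
        a, b = b, a  # make a the operand with the larger exponent
    (sa, ma, ea), (sb, mb, eb) = a, b
    mb >>= ea - eb   # align b to a's exponent (low bits of b are truncated)
    total = (-ma if sa else ma) + (-mb if sb else mb)
    return normalize(1 if total < 0 else 0, abs(total), ea)


def fsub(a, b):
    sign, mant, exp = b
    return fadd(a, (sign ^ 1 if mant else 0, mant, exp))  # a + (-b)


def fmul(a, b):
    (sa, ma, ea), (sb, mb, eb) = a, b
    return normalize(sa ^ sb, ma * mb, ea + eb)


def long_divide(num, den):
    """Binary long division (shift and subtract); returns num // den."""
    q, rem = 0, 0
    for i in reversed(range(num.bit_length())):
        rem = (rem << 1) | ((num >> i) & 1)  # bring down the next bit
        q <<= 1
        if rem >= den:
            rem -= den
            q |= 1
    return q


def fdiv(a, b):
    (sa, ma, ea), (sb, mb, eb) = a, b
    if mb == 0:
        raise ZeroDivisionError("division by zero")
    # Pre-shift the dividend by P bits so the quotient keeps P bits of precision.
    return normalize(sa ^ sb, long_divide(ma << P, mb), ea - eb - P)


x, y = from_float(1.5), from_float(2.25)
print(to_float(fadd(x, y)), to_float(fmul(x, y)))  # 3.75 3.375
```

Because the alignment shift in `fadd` and the final shift in `normalize` simply discard low bits, results can be off by about one unit in the last place; a more careful version would keep guard bits and round to nearest.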