Okay, so yesterday, I was thinking about the algorithms I use for computing logs and exponentials with my floating point numbers and I realized that I had the core of a fast division algorithm right there. Then I thought, "oh wait this is just

**Goldschmidt Division**." I was wrong though, now that I am more awake and looking at the algorithm. The biggest difference is that mine works on the same time complexity as long-division so I need 64 iterations for my 64 bits of precision versus 5.

But there is a

**but**. Those 5 iterations would require 2 full precision 64-bit multiplications, which I currently have down to about 9000 t-states each. My method would require 3*32 shifts+subtracts. On the surface, it looks like it would require 32 more, but I have tricks and magic on my side.

So how does it work? I decided to look at x/y as a vector {x,y} where I try to drive y towards 1 (I have been doing a lot of CORDIC stuff lately). The speed comes from how I drive y towards 1. In it's normal state, I have mantissas as 1.xxxxx, so x and y can be treated as numbers on [1,2). Now as a quick observation, if y>=4/3, then y*3/4 is less than y, but greater than or equal to 1. However, y*3/4=y-y/4 = y-(y>>2) where ">>2" refers to a right shift of 2. Similarly, I compare to 8/7, 16/15, et cetera.

That comparison step can get annoying, especially if you are using a lookup table of 64-bit numbers. It can also get slow. What I did was look at the binary expansions of these numbers:

4/3 = 1.01010101...

8/7 = 1.001001001...

16/15 = 1.000100010001...

32/31 = 1.000010000100001...

See the pattern? Basically, I can count the leading zeroes after the decimal point and if the result is 63, I can stop (denominator is 1) else if it is n, the denominator is guaranteed to be on

##[\frac{2^{n+2}}{2^{n+2}-1},\frac{2^{n+1}}{2^{n+1}-1}]##. From this, a quick pseudo-code example would be:

`Label:`

2+clz(y)→a

if a>m : Return

x-x>>a→x

y-y>>a→y

The code assumes clz will count the leading 0s after the first digit (which is always 1) and m is the desired accuracy. A speed up can occur by noticing that if y=1.000...001xxxxx in binary with n 0s, then y-y>>(n+1) isn't going to change the next sequence of n bits after the 1 (because you are subtracting 0s from xs). So if you have gone up to m bits, the next m bits are basically computed for free in y. This leads to a 2-part algorithm:

`Label1:`

2+clz(y)→a

if a>m : Goto Label2

x-x>>a→x

y-y>>a→y

Label2:

For n from m+1 to 2m-1

if bit n of y is 1: x-x>>n→x

Adding that extra bit of code gets 2M bits of precision.

I just thought I would share

This sucks on the Z80 because there isn't a CLZ instruction (but this can be simulated at decent speed), but those arbitrary shifts are what can't kills it.