 Author Topic: My Newest Trick For Division  (Read 2281 times)

0 Members and 1 Guest are viewing this topic. Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
•            • • Posts: 4687
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do. « on: March 25, 2014, 02:03:43 pm »
Okay, so yesterday, I was thinking about the algorithms I use for computing logs and exponentials with my floating point numbers and I realized that I had the core of a fast division algorithm right there. Then I thought, "oh wait this is just Goldschmidt Division." I was wrong though, now that I am more awake and looking at the algorithm. The biggest difference is that mine works on the same time complexity as long-division so I need 64 iterations for my 64 bits of precision versus 5.

But there is a but. Those 5 iterations would require 2 full precision 64-bit multiplications, which I currently have down to about 9000 t-states each. My method would require 3*32 shifts+subtracts. On the surface, it looks like it would require 32 more, but I have tricks and magic on my side.

So how does it work? I decided to look at x/y as a vector {x,y} where I try to drive y towards 1 (I have been doing a lot of CORDIC stuff lately). The speed comes from how I drive y towards 1. In it's normal state, I have mantissas as 1.xxxxx, so x and y can be treated as numbers on [1,2). Now as a quick observation, if y>=4/3, then y*3/4 is less than y, but greater than or equal to 1. However, y*3/4=y-y/4 = y-(y>>2) where ">>2" refers to a right shift of 2. Similarly, I compare to 8/7, 16/15, et cetera.

That comparison step can get annoying, especially if you are using a lookup table of 64-bit numbers. It can also get slow. What I did was look at the binary expansions of these numbers:
4/3 = 1.01010101...
8/7 = 1.001001001...
16/15 = 1.000100010001...
32/31 = 1.000010000100001...

See the pattern? Basically, I can count the leading zeroes after the decimal point and if the result is 63, I can stop (denominator is 1) else if it is n, the denominator is guaranteed to be on $[\frac{2^{n+2}}{2^{n+2}-1},\frac{2^{n+1}}{2^{n+1}-1}]$. From this, a quick pseudo-code example would be:
Code: [Select]
Label:
2+clz(y)→a
if a>m : Return
x-x>>a→x
y-y>>a→y
The code assumes clz will count the leading 0s after the first digit (which is always 1) and m is the desired accuracy. A speed up can occur by noticing that if y=1.000...001xxxxx in binary with n 0s, then y-y>>(n+1) isn't going to change the next sequence of n bits after the 1 (because you are subtracting 0s from xs). So if you have gone up to m bits, the next m bits are basically computed for free in y. This leads to a 2-part algorithm:
Code: [Select]
Label1:
2+clz(y)→a
if a>m : Goto Label2
x-x>>a→x
y-y>>a→y
Label2:
For n from m+1 to 2m-1
if bit n of y is 1: x-x>>n→x
Adding that extra bit of code gets 2M bits of precision.

I just thought I would share This sucks on the Z80 because there isn't a CLZ instruction (but this can be simulated at decent speed), but those arbitrary shifts are what can't kills it.
« Last Edit: March 25, 2014, 02:05:37 pm by Xeda112358 »