I decided to compute some square roots by hand to keep my tired self awake and alert at work. I opted to use Newton's Method this time and I noticed an optimization!

Instead of attaining 2N precision with a 2N-by-N->2N division, I split it up into an N-by-N multiply and a N-by-N->N division, which on the Z80 or similar processor can be nearly twice as fast.

The standard algorithm works as follows:

To find the square root of **c**, start with a guess called **x**.

Iterate the following: **(x+c/x)/2**

Each iteration doubles the digits of precision; try it on your calculator if you aren't familiar with it! To fin the square root of 21, guess '5'. Type 5, hit [Enter]. Then do `.5(Ans+21/Ans` and hit [Enter] a few times.

The way it works is, if the current **x** is bigger than the square root of **c**, then `c/x` is less, and vice-versa. So averaging the two gets a closer estimate to the actual square root. My observation was that if I had, say, 3 digits precision, then those digits stay the same in the next iteration. Through some observations, this would mean the first 3 digits of `c/x` will match that of x. In fact, for `c/x`, I'll only need to compute it to 6 digits, but if I can skip the first 3, then the division is easier and faster! So, lets optimize:

=>.5(x+c/x)

=>.5(x+x+(c/x-x))

=>x+.5(c-x*x)/x

So it's probably not so clear, but essentially for 2n-digits precision, I need to square an n-digit number and divide an n-digit number by n digits, to n digits precision. This replaces a 2n/2n division.

The former is faster, so let's try an example of finding the square root of 21:

c=21

Guess: x=5

(c-x*x) = -4

-4/5/2 = -.4

x-.4 = 4.6

(c-x*x) = 21-4.6*4.6 = -.16

-.16/4.6/2 = -.0173913... ~.02

x-.02 = 4.58

(c-x*x) = 21-4.58*4.58 = .0236

.0236/4.58/2 = .0025764... ~.0026

x+.0026 = 4.5826

(c-x*x) = 21-4.5826*4.5826 = -.00022276

-.00022276/4.5826/2 = -.000024304979... ~-.000024305

x-.000024305 = 4.582575695

In practice by hand, it actually doesn't save much time, but that's because by hand I usually can perform a 2n-by-2n division a little slower than half that of a 2n-by-2n and I can do multiplication and division in roughly the same speed. So overall This gets me about 20% faster.

On the Z80, a 2N-by-N->2N division takes roughly the time of 1.6 2N-by-2N->4N multiplies. As well, a 2N-by-2N->4N multiply takes roughly 3.5 times the time of an N-by-N->2N multiply using recursive Karatsuba multiplication.

So the standard algorithm takes roughly 5.6 N-by-N->2N multiplies worth of time.

The modified algorithm takes roughly 2.9 N-by-N->2N multiplies worth of time.