Author Topic: A faster Newton's Method Square Root (Read 7770 times)

Xeda112358 · « **on:** October 10, 2017, 02:04:50 pm »

EDIT 2022-06-29: The post below has a much needed improvement

I decided to compute some square roots by hand to keep my tired self awake and alert at work. I opted to use Newton's Method this time and I noticed an optimization!
Instead of attaining 2N precision with a 2N-by-N->2N division, I split it up into an N-by-N multiply and a N-by-N->N division, which on the Z80 or similar processor can be nearly twice as fast.

The standard algorithm works as follows:
To find the square root of c, start with a guess called x. Iterate the following: (x+c/x)/2

Each iteration doubles the digits of precision; try it on your calculator if you aren't familiar with it! To fin the square root of 21, guess '5'. Type 5, hit [Enter]. Then do .5(Ans+21/Ans and hit [Enter] a few times.

The way it works is, if the current x is bigger than the square root of c, then c/x is less, and vice-versa. So averaging the two gets a closer estimate to the actual square root. My observation was that if I had, say, 3 digits precision, then those digits stay the same in the next iteration. Through some observations, this would mean the first 3 digits of c/x will match that of x. In fact, for c/x, I'll only need to compute it to 6 digits, but if I can skip the first 3, then the division is easier and faster! So, lets optimize:

=>.5(x+c/x) =>.5(x+x+(c/x-x)) =>x+.5(c-x*x)/x
So it's probably not so clear, but essentially for 2n-digits precision, I need to square an n-digit number and divide an n-digit number by n digits, to n digits precision. This replaces a 2n/2n division.

The former is faster, so let's try an example of finding the square root of 21:
c=21 Guess: x=5 (c-x*x) = -4 -4/5/2 = -.4 x-.4 = 4.6 (c-x*x) = 21-4.6*4.6 = -.16 -.16/4.6/2 = -.0173913... ~.02 x-.02 = 4.58 (c-x*x) = 21-4.58*4.58 = .0236 .0236/4.58/2 = .0025764... ~.0026 x+.0026 = 4.5826 (c-x*x) = 21-4.5826*4.5826 = -.00022276 -.00022276/4.5826/2 = -.000024304979... ~-.000024305 x-.000024305 = 4.582575695

In practice by hand, it actually doesn't save much time, but that's because by hand I usually can perform a 2n-by-2n division a little slower than half that of a 2n-by-2n and I can do multiplication and division in roughly the same speed. So overall This gets me about 20% faster.

On the Z80, a 2N-by-N->2N division takes roughly the time of 1.6 2N-by-2N->4N multiplies. As well, a 2N-by-2N->4N multiply takes roughly 3.5 times the time of an N-by-N->2N multiply using recursive Karatsuba multiplication.

So the standard algorithm takes roughly 5.6 N-by-N->2N multiplies worth of time.
The modified algorithm takes roughly 2.9 N-by-N->2N multiplies worth of time.

Xeda112358 · « **Reply #1 on:** June 29, 2022, 07:30:26 pm »

I never followed up with this! A little while after, I refined this significantly and then a year or two ago, I implemented it in my z80float library (link is directly to sqrt32) and that's how I made square roots faster than multiplication.

Today I was excited to learn that this is how GMP does it, it just looks scarier there because they are optimizing the notation to make it easy to implement in code without necessarily understanding how it works (which is cool, they present the pseudocode in a very easy-to-digest manner). But that's not what I'm here for, I'm here to show how it works!

Here is the TL;DR if you don't want to go through the derivation:
##
\begin{array}{rcl}
x_{0} & \approx & \sqrt{c} \\
a_{0} &=& c - x_{0}^{2} \\
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
x_{n+1} &=& x_{n} + q_{n} \\
a_{n+1} &=& r_{n} - q_{n}^{2} \\
\end{array}
##
(The third line means "divide ##\frac{a_{n}}{2x_{n}}##, but truncate at some precision to get the quotient and the remainder")

Let's start. I like the "Babylonian Method" where you start with a guess ##x_{0}## as an approximation to the square root of ##c##, then you run this iteration a few times until you have the precision you want:
##
\begin{array}{l}
x_{n+1} &=& \frac{\left(x_{n}+\frac{c}{x_{n}}\right)}{2}
\end{array}
##
All you are doing is averaging the guess, with ##c## divided by that guess. If your guess was an overestimate, ##\frac{c}{x_{n}}## is going to be an underestimate, so their average will be closer to the actual answer.

And that's just a simplification of the Newton Iteration:
##
\begin{array}{l}
x_{n+1} &=& x_{n} + \frac{c-x_{n}^{2}}{2x_{n}}
\end{array}
##

We are going to build off of the Newton Iteration, but make it a little more complicated by turning ##c-x_{n}^{2}## into it's own recurrence:
##
\begin{array}{l}
a_{n} &=& c-x_{n}^{2} \\
x_{n+1} &=& x_{n} + \frac{a_{n}}{2x_{n}}
\end{array}
##

And actually, in practice we are going to truncate that division so that ##q_{n} + \frac{r_{n}}{2x_{n}} = \frac{a_{n}}{2x_{n}}## where ##q_{n}## is the (truncated) quotient, and ##r_{n}## is whatever the remainder was after dividing by ##2x_{n}##. So what we'd really be doing is:
##
\begin{array}{rcl}
a_{n} &=& c-x_{n}^{2} \\
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
x_{n+1} &=& x_{n} + q_{n}
\end{array}
##

Now that everything is more complicated, let's take a look at what happens to ##a_{n+1}##:
##
\begin{array}{l}
a_{n+1} &=& c-x_{n+1}^{2} \\
&=& c-(x_{n} + q_{n})^{2} \\
&=& c-(x_{n}^{2} + 2x_{n}q_{n} + q_{n}^{2}) \\
&=& c - x_{n}^{2} - 2x_{n}q_{n} - q_{n}^{2} \\
&=& a_{n} - 2x_{n}q_{n} - q_{n}^{2} \\
\end{array}
##

Already, we have an optimization. Those two multiplications are half the precision of a full ##x_{n+1}^{2}## if you do it right, and that is strictly faster. But we're not done! Remember that division? Let's re-write it:
##
\begin{array}{rcl}
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
2x_{n}q_{n} + r_{n} &=& a_{n} \\
r_{n} &=& a_{n} - 2x_{n}q_{n} \\
\end{array}
##

Well ain't that awfully convenient? Let's rewrite our ##a_{n+1}## recurrence:
##
\begin{array}{l}
a_{n+1} &=& a_{n} - 2x_{n}q_{n} - q_{n}^{2} \\
&=& r_{n} - q_{n}^{2} \\
\end{array}
##

So now we've significantly improved that step by replacing the squaring operation with one at half the precision! Let's look at the recurrence relations now:
##
\begin{array}{rcl}
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
x_{n+1} &=& x_{n} + q_{n} \\
a_{n+1} &=& r_{n} - q_{n}^{2} \\
\end{array}
##

Neat!

Now let's retry our example of finding the square-root of 21:
##
\begin{array}{rcl}
x_{0} & \approx & \sqrt{c} \\
&=& 4\\
a_{0} &=& c - x_{0}^{2} \\
&=& 21-16 \\
&=& 5 \\
q_{0} + \frac{r_{0}}{2x_{0}} &=& \frac{a_{0}}{2x_{0}} \\
&=&\frac{5}{8} \\
&=&.5 + \frac{1}{8} \\
x_{1} &=& x_{0} + q_{0} \\
&=& 4.5 \\
a_{1} &=& r_{0} - q_{0}^{2} \\
&=& 1 - .5^{2} \\
&=& .75 \\

q_{1} + \frac{r_{1}}{2x_{1}} &=& \frac{a_{1}}{2x_{1}} \\
&=&\frac{.75}{9} \\
&=&.083 + \frac{.003}{9} \\
x_{2} &=& x_{1} + q_{1} \\
&=& 4.583 \\
a_{2} &=& r_{1} - q_{1}^{2} \\
&=& .003 - .083^{2} \\
&=& -.003889 \\

q_{2} + \frac{r_{2}}{2x_{2}} &=& \frac{a_{2}}{2x_{2}} \\
&=&\frac{-.003889}{9.166} \\
&=&-.0004243 + \frac{.0000001338}{9.166} \\
x_{3} &=& x_{2} + q_{2} \\
&=& 4.5825757 \\

\end{array}
##

EDIT: updates z80float link to point to the routine that actually implements this

E37 · « **Reply #2 on:** June 29, 2022, 09:17:22 pm »

That is pretty neat! In all your dealing with floats did you come across a reason why ti-os uses 11 byte BCD floats? (with one whole byte for sign IIRC)
Is it just because BCD helps prevent any incremental errors?

Its a pity that I have (pretty much) finished my big z80 projects because there is several places I could have really used some floating points. I'm terrified by how close it's average of 1165 cycles is to a workaround I made a while ago using abs, min and max to calculate distance. I really need to set aside a weekend and add in floating points.

Testing gave me an average of 800 cycles for Axe's 16 bit integer square root. That's not much slower than your 32 bit floating point square root. Do floats just lend themselves to that kind of math?

_{_{Your link to GMP is broken...}}

Xeda112358 · « **Reply #3 on:** June 29, 2022, 10:11:37 pm »

I fixed the link, thanks

For TI Floats, the first byte is actually used to indicate variable type (lower 5 bits), and for floats, TI uses the upper bit to indicate sign. The next byte is used for the exponent, and then 7 bytes for the 14 digits, for a total of 9 bytes.

TI also uses extra precision for some intermediate operations, so sometimes the floats are actually larger (which is why the OPx spots in RAM are 11 bytes instead of 9).TI uses BCD because numbers are entered in base 10, and they want that to be handled "exactly," especially for financial usage. (BCD has the same rounding issues as binary floats, it just happens at different places).

I'm personally not a fan of BCD, as it is wasteful and slow, and binary floats can do the job. The only advantage with BCD floats is that with financial math, you don't have to be careful with comparisons.

As for the speed of that 32-bit square root routine, remember that it is fully unrolled, and relies on a fully unrolled 16-bit square root for a grand total of about 270 bytes. Also note: the routine I linked to is an integer square root routine. The floating-point routine is probably closer to 2000cc as it needs an extra 8 bits of precision in the result and some special handling for the exponent and "special values." That said, yes, there are some advantages to the floating-point format, especially with division and square roots.

EDIT: Actually, the original link to sqrtHLIX doesn't even use this technique, so I updated it to point to sqrt32 which does use this technique, and expects the upper two bits of the input to be non-zero (which is guaranteed with these binary floats)

E37 · « **Reply #4 on:** June 30, 2022, 08:54:36 pm »

Unless I am reading this wrong, float64's (or doubles as I will call them) get between 15 and 17 digits of precision compared to the 14 of a 9 byte BCD float. Wouldn't it be easier to use doubles then? Any 14 digit number the user enters can be represented perfectly. Sure, there will be some precision the user doesn't see but isn't that already the case since the OS only prints out 9 (10?) digits? I can't think of case where a double would cause a rounding error where a BCD wouldn't also. And that isn't taking into account that the OS uses 9 bytes and not 8.

Off topic note because I can't be bothered to find my Discord login: what emulator do you use? I have always used Wabbitemu but it tends to crash when I mess with flash.

Xeda112358 · « **Reply #5 on:** June 30, 2022, 11:29:30 pm »

Yup, I see no good reason for BCD floats. There is some usefulness to decimal floats, though-- they have the one advantage of BCD (don't have to be careful where base-10 representation is a must), but are nearly as efficient as binary floats in terms of storage. A 64-bit decimal64 has 16 digits of precision.

As for emulators, I use TIlem, but it does have glitches of it's own. One I found was it sometimes doesn't emulate the carry flag properly with add/adc/sbc on IX/IY, which was hell when testing some math routines

I believe the problem is the emulator stores them internally as int32 or int64 and doesn't handle 16-bit overflow.

Author Topic: A faster Newton's Method Square Root (Read 7770 times)

Xeda112358

A faster Newton's Method Square Root

Xeda112358

Re: A faster Newton's Method Square Root

E37

Re: A faster Newton's Method Square Root

Xeda112358

Re: A faster Newton's Method Square Root

E37

Re: A faster Newton's Method Square Root

Xeda112358

Re: A faster Newton's Method Square Root