Omnimaga

Calculator Community => TI Calculators => ASM => Topic started by: Streetwalrus on April 10, 2013, 03:50:37 pm

Title: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 10, 2013, 03:50:37 pm
I need a 32 bit by 16 bit division routine but I have absolutely no idea how to do this. I specifically need to check if the modulus is non zero. Can you give me some tips ? Asm in 28 days has 32/8 which is not enough for me.

Same for a 32 bit square root routine. I need the result as a 16 bit integer.

Note that I'm not asking for code but some help to create them myself and then optimization because I'm not Runer/ThePenguin/Calc84maniac and I want to learn. :P
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 11, 2013, 07:21:25 am
Is anyone willing to help ?
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Matrefeytontias on April 11, 2013, 10:52:16 am
I don't really know how to do that, but I think it's the same way as a 16 by 16 division. I know that in the process you loop as much times as bits that are in the first operand (idk if that's English :P), so you would do 32 loops I think. But I can't really help.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Hayleia on April 11, 2013, 11:38:56 am
I am not really sure that people trying to win the TI Planet contest will be willing to help you before the deadline :P
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Runer112 on April 11, 2013, 11:47:07 am
I think Z80 Bits (http://baze.au.com/misc/z80bits.html) is where everybody learned how bitwise math routines worked. It has code for multiplication, division, and square root routines of varying bit-width inputs, and pretty good explanations of the algorithms behind them. Unfortunately none of the routines go up to bit widths quite as high as you're looking for, probably because at that point you start running out of registers. I figure you can get 32-bit/16-bit by just using one of the existing routines and throwing in the use of ix to fit all the necessary bits. As for the restoring square root algorithm... that algorithm is a little mind-boggling. But I figure you'd need 16 bits to store the result instead of 8, the comparison step would have to compare 24 bits instead of 16, and you'd need to find a way to feed in 24 bits of the input (I think 8 bits always start in the "accumulator"). By the looks of it, you might be short one register, even if you use ix. Perhaps make sure interrupts are turned off and use shadow registers?

But as Hayleia said, since I'm pretty sure this is for the TI-Planet contest, I don't think anybody should really write your routines for you. But hopefully I've given you some good guidance to being able to take a stab at it yourself. :)
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 11, 2013, 11:48:14 am
Xeda helped me optimize my primality checker a bit. I think I have an idea for arbitrary precision division though and jacobly linked me an arbitrary precision sqrt routine. I will see what I can do with that. And everyone is not in the contest. :P Also :
Note that I'm not asking for code but some help to create them myself and then optimization because I'm not Runer/ThePenguin/Calc84maniac and I want to learn. :P

Thanks for the link Runer.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on April 11, 2013, 05:27:21 pm
I did? I don't remember that D: Or were you talking about the BASIC code/trial division stuff?
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 12, 2013, 02:01:54 am
Yeah that's what I was talking about but since I use the exact same thing in Axe... :P
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 12, 2013, 07:36:40 am
I can't get the 24/16 routine from z80 bits to work. :s
Here's what I'm running :
Code: [Select]
    di
    ld a, 244      ;dividend high byte
    ld bc, 9216   ;dividend low word (244*2^16+9216=16000000)
    ld de, 47288 ;divisor
    ld hl, 0
    exx
    ld b, 24
LOOP:
    exx
    slia c ; unroll 24 times
    rl b ; ...
    rla ; ...
    adc hl,hl ; ...
    sbc hl,de ; ...
    jr nc,$+4 ; ...
    add hl,de ; ...
    dec c ; ...
    exx
    djnz LOOP
    exx
    ;load the values of a, bc and hl to the first five bytes of appBackupScreen
    ei
What am I doing wrong ? ??? I tried it several times and I always get wrong results.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: jacobly on April 12, 2013, 07:52:33 am
You only have one exx in the loop, so every time through the loop uses different registers.  It should work if you add an exx at the beginning of the loop.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 12, 2013, 07:58:03 am
You only have one exx in the loop, so every time through the loop uses different registers.  It should work if you add an exx at the beginning of the loop.
I forgot to copy it but I actually have it.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: jacobly on April 12, 2013, 08:07:44 am
In that case you probably have an overflow problem.  Try either using a 48-bit temp value or putting
Code: [Select]
jr nc,$+7
or a
sbc hl,de
jr $+7
after adc hl,hl.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Streetwalrus on April 12, 2013, 08:18:20 am
Should I leave the extra sbc hl,de ?

Edit : It doesn't fix anything...
Edit 2 : My code actually worked I was just displaying without the small r in axe. :P
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 17, 2019, 09:53:03 am
I know this is old (and I'm sorry for sending you all a notification), but I happened to need a 32 div 16 routine today, so I made one:
This divides HLIX by BC
The result is stored in HLIX, the remainder in DE
BC is unmodified
A is 0
it doesn't use any other registers or RAM
Code: [Select]
div32_16:
 ld de,0 ; 10
 ld a,32 ; 7
div32_16loop:
 add ix,ix ; 15
 adc hl,hl ; 15
 ex de,hl ; 4
 adc hl,hl ; 15
 or a ; 4
 sbc hl,bc ; 15
 inc ix ; 10
 jr nc,cansub ; 12/7
  add hl,bc ; 11
  dec ix ; 10
cansub:
 ex de,hl ; 4
 dec a ; 4
 jr nz,div32_16loop ; 12/7
 ret ; 10
I'm not saying this is the fastest possible, but at least it does the job. This takes 4094 T-States in the worst case, 3582 in the best case, which is between .6 and .7 ms at 6 MHz.
I could add some comments to this, but it's more difficult to explain what is going on than to just program it.
Also, note that this doesn't check if BC is 0 so you will get junk as an answer in that case.
I might also write a 32-bit square root routine if it is still needed.

EDIT: an optimized version can be found further in this topic.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 17, 2019, 10:40:19 am
Thanks! I am on mobile, so not much help, but a few quick optimizations I see are changing those Inc/dec ix to use ixl instead (the bottom bit is 0 and 1 at those points, so no worry of overflow) and you don't need the `or a` before the sbc since the add before it is always pushing out a 0 bit (so leaving carry reset).

Do you prefer speed optimizations or size optimizations, btw?

Thanks for the work, I actually visit this thread semi-frequently!
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 17, 2019, 12:35:00 pm
Quote
a few quick optimizations I see are changing those Inc/dec ix to use ixl instead
Sure! Didn't see that one. The problem there is also that those are undocumented and don't run on some emulators.
Quote
you don't need the `or a` before the sbc
Yes, that's also right. Wasn't too smart of me. But hey, at least I wrote a 32 div 16 routine, okay?
Quote
Do you prefer speed optimizations or size optimizations, btw?
Usually, I prefer speed, but it shouldn't be to big either. When the program is done and it runs too slowly or is too big, I can optimize as I need.

The resulting routine would look like this:
Code: [Select]
div32_16:
 ld de,0 ; 10
 ld a,32 ; 7
div32_16loop:
 add ix,ix ; 15
 adc hl,hl ; 15
 ex de,hl ; 4
 adc hl,hl ; 15
 sbc hl,bc ; 15
 inc ixl ; 8
 jr nc,cansub ; 12/7
  add hl,bc ; 11
  dec ixl ; 8
cansub:
 ex de,hl ; 4
 dec a ; 4
 jr nz,div32_16loop ; 12/7
 ret ; 10
This takes between 3838 and 3390 T-States, which is .64 to .57 ms.

I also though that maybe we could invert BC so then we can use ADD HL,BC instead of SBC, which is 4 T-States faster, but this has the problem that we would need SBC further on. This would bring down the best case scenario time, but up the worst case.

About the square root routine: is it still needed? I already know how the sqrt algorithm works. I just have to implement it, but I don't have the time now.

EDIT: maybe we could put this on http://z80-heaven.wikidot.com/math, where many other math routines already reside?
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 17, 2019, 04:22:38 pm
Okay, when I get home I might have a chance to optimize this more. And yes, that's a good place for it. I have been working on-and-off on that page for a few months to revamp it. If it is locked due to a pending draft, let me know and I'll move the draft to another page altogether.

Also, this might be useful: https://github.com/Zeda/z80float/tree/master/extended/sqrt
The sqrt32 and sqrt64 are special-purpose routines (the top two bits shouldn't both be 0), but it can give you an idea. Also the division folder might inspire optimizations in this routine. It has a 32/16 routine but it is also special-purpose (top bit of denominator is always set).
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 17, 2019, 06:58:37 pm
I am making a double-post because this post is very distinct from the other.
Before I proceed, the routine doesn't take into account that the `adc hl,hl` could overflow on the 17th iteration and beyond if BC>32768. For example, 0x80000000/0x8001 will leave HL as 8000h on the 16th iteration, and then 0x0000 on the 17th.

Before we fix that, notice that you are using two `ex de,hl` each iteration, but here is a more efficient way:
Code: [Select]
div32_16:
 ex de,hl   ; 4
 ld hl,0    ; 10
 ld a,32    ; 7
div32_16loop:
 add ix,ix  ; 15
 rl e       ; 8
 rl d       ; 8
 adc hl,hl  ; 15
 sbc hl,bc  ; 15
 inc ixl    ; 8
 jr nc,cansub  ; 12/7
  add hl,bc ; 11
  dec ixl   ; 8
cansub:
 dec a      ; 4
 jr nz,div32_16loop ; 12/7
 ex de,hl   ; 4
 ret        ; 10
This ends up adding 2 bytes and 8cc to the overhead code, but saves 7cc per iteration, for a net savings of 216cc.

But we can replace those `rl e` with `rla`, we'll save 4cc per iteration.
Code: [Select]
div32_16:
 ex de,hl   ; 4
 ld hl,0    ; 10
 ld a,e     ; 4
 ld e,32    ; 7
div32_16loop:
 add ix,ix  ; 15
 rla        ; 4
 rl d       ; 8
 adc hl,hl  ; 15
 sbc hl,bc  ; 15
 inc ixl    ; 8
 jr nc,cansub  ; 12/7
  add hl,bc ; 11
  dec ixl   ; 8
cansub:
 dec e      ; 4
 jr nz,div32_16loop ; 12/7
 ex de,hl   ; 4
 ret        ; 10
This is actually a cheap optimization. It adds no extra bytes, but adds 4cc to overhead and saves 4cc per iteration, for a net savings of 124cc!

But in a similar vein, we can just split up the code into 4 parts with a common subroutine, reducing the shifting work:
Code: [Select]
div32_16:
;136+4*div32_16_sub8
;min: 2180cc
;max: 2788cc
;avg: 2484cc
;49 bytes
  ex de,hl   ; 4
  ld hl,0    ; 10

  ld a,d              ; 4
  call div32_16_sub8  ; 17
  ld d,a              ; 4

  ld a,e              ; 4
  call div32_16_sub8  ; 17
  ld e,a              ; 4

  ld a,ixh            ; 8
  call div32_16_sub8  ; 17
  ld ixh,a            ; 8

  ld a,ixl            ; 8
  call div32_16_sub8  ; 17
  ld ixl,a            ; 8

  ex de,hl   ; 4
  ret        ; 10

div32_16_sub8:
;119+8*div32_16_sub
;min: 511cc
;max: 663cc
;avg: 587cc
  call +_
_:
;17+2(17+2(div32_16_sub)))
  call +_
_:
;17+2(div32_16_sub)
  call div32_16_sub
div32_16_sub:
;min: 49cc
;max: 68cc
;avg: 58.5cc
 add a,a    ; 4
 adc hl,hl  ; 15
 sbc hl,bc  ; 15
 inc a      ; 4
 ret nc     ;11/5
 add hl,bc  ; 11
 dec a      ; 4
 ret        ; 10

The above routine works when BC<=0x8000, but to extend it, we'll need to take care of the overflow and that will slow it down:
Code: [Select]
div32_16:
;136+4*div32_16_sub8
;min: 2340cc
;max: 3012cc
;avg: 2676cc
;55 bytes
  ex de,hl   ; 4
  ld hl,0    ; 10

  ld a,d              ; 4
  call div32_16_sub8  ; 17
  ld d,a              ; 4

  ld a,e              ; 4
  call div32_16_sub8  ; 17
  ld e,a              ; 4

  ld a,ixh            ; 8
  call div32_16_sub8  ; 17
  ld ixh,a            ; 8

  ld a,ixl            ; 8
  call div32_16_sub8  ; 17
  ld ixl,a            ; 8

  ex de,hl   ; 4
  ret        ; 10

div32_16_sub8:
;119+8*div32_16_sub
;min: 551cc
;max: 719cc
;avg: 635cc
  call +_
_:
;17+2(17+2(div32_16_sub)))
  call +_
_:
;17+2(div32_16_sub)
  call div32_16_sub
div32_16_sub:
;54+{0,2+{0,19}}
;min: 54cc
;max: 75cc
;avg: 64.5cc
  add a,a    ; 4
  inc a      ; 4
  adc hl,hl  ; 15
  jr c,+_    ;12/7
  sbc hl,bc  ; 15
  ret nc     ;11/5
  add hl,bc  ; 11
  dec a      ; 4
  ret        ; 10
_:
  or a       ; 4
  sbc hl,bc  ; 15
  ret        ; 10
This adds 6 bytes, and 192cc on average, so it's not too bad.


Now we move in with your idea to negate BC upon entry to improve some cases. That allows some other optimizations too, like not needing to reset the carry flag in the last case of div32_16_sub:
Code: [Select]
div32_16:
;HLIX/BC -> HLIX remainder DE
;158+4*div32_16_sub8
;min: 2298cc
;max: 3034cc
;avg: 2546cc
;59 bytes
  ex de,hl   ; 4

; Negate BC to allow add instead of sbc
  xor a      ; 4
; Need to set HL to 0 anyways, so save 2cc and a byte
  ld h,a     ; 4
  ld l,a     ; 4
  sub c      ; 4
  ld c,a     ; 4
  sbc a,a    ; 4
  sub b      ; 4
  ld b,a     ; 4


  ld a,d              ; 4
  call div32_16_sub8  ; 17
  ld d,a              ; 4

  ld a,e              ; 4
  call div32_16_sub8  ; 17
  ld e,a              ; 4

  ld a,ixh            ; 8
  call div32_16_sub8  ; 17
  ld ixh,a            ; 8

  ld a,ixl            ; 8
  call div32_16_sub8  ; 17
  ld ixl,a            ; 8

  ex de,hl   ; 4
  ret        ; 10

div32_16_sub8:
;119+8*div32_16_sub
;min: 535cc
;max: 719cc
;avg: 597cc
  call +_
_:
;17+2(17+2(div32_16_sub)))
  call +_
_:
;17+2(div32_16_sub)
  call div32_16_sub
div32_16_sub:
52+{4,0+{0,23}}
;min: 52cc
;max: 75cc
;avg: 59.75cc
  add a,a    ; 4
  inc a      ; 4
  adc hl,hl  ; 15
  jr c,+_    ;12/7
  add hl,bc  ; 11
  ret c      ;11/5
  sbc hl,bc  ; 15
  dec a      ; 4
  ret        ; 10
_:
  add hl,bc  ; 11
  ret        ; 10
This adds a net of 4 bytes. It adds 22cc to the worst case, saves 42cc in the best case, but manages to bring the average case down by 130cc!

But now let's take advantage of the output flags in the sub routine. It turns out that except for that last case, the output is carry flag reset if the bit shifted in should be 0, and carry set if the bit should be 1.
Code: [Select]
div32_16:
;HLIX/BC -> HLIX remainder DE
;174+4*div32_16_sub8
;min: 2186cc
;max: 2794cc
;avg: 2466cc
;61 bytes
  ex de,hl   ; 4

; Negate BC to allow add instead of sbc
  xor a      ; 4
; Need to set HL to 0 anyways, so save 2cc and a byte
  ld h,a     ; 4
  ld l,a     ; 4
  sub c      ; 4
  ld c,a     ; 4
  sbc a,a    ; 4
  sub b      ; 4
  ld b,a     ; 4


  ld a,d              ; 4
  call div32_16_sub8  ; 17
  rla                 ; 4
  ld d,a              ; 4

  ld a,e              ; 4
  call div32_16_sub8  ; 17
  rla                 ; 4
  ld e,a              ; 4

  ld a,ixh            ; 8
  call div32_16_sub8  ; 17
  rla                 ; 4
  ld ixh,a            ; 8

  ld a,ixl            ; 8
  call div32_16_sub8  ; 17
  rla                 ; 4
  ld ixl,a            ; 8

  ex de,hl   ; 4
  ret        ; 10

div32_16_sub8:
;119+8*div32_16_sub
;min: 503cc
;max: 655cc
;avg: 573cc
  call +_
_:
;17+2(17+2(div32_16_sub)))
  call +_
_:
;17+2(div32_16_sub)
  call div32_16_sub
div32_16_sub:
;48+{8,0+{0,19}}
;min: 48cc
;max: 67cc
;avg: 56.75cc
  rla        ; 4
  adc hl,hl  ; 15
  jr c,+_    ;12/7
  add hl,bc  ; 11
  ret c      ;11/5
  sbc hl,bc  ; 15
  ret        ; 10
_:
  add hl,bc  ; 11
  scf        ; 4
  ret        ; 10
This costs two extra bytes, but saves 112cc in the best case, 240cc in the worst case, and 80cc on average !
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 17, 2019, 08:02:08 pm
I wonder, if you can make my routine twice as fast and if you've written complete support for floats, why didn't you write this routine in the first place? Good job finding that BC > 8000 bug btw.

Also, I was having trouble with getting INC/DEC IXL to work for some reason. Haven't had the time to investigate that though. It could be that I just typed in the wrong opcode because I have a bad assembler which doesn't support it apparently and it's too much of a hassle to get a new one on Linux.

As for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack.

Another, easy way of doing this would be by thinking of the 32 and 16 bit numbers as 4&2-digit base-256 numbers and using a routine for smaller numbers to do the actual division.

Quote
And yes, that's a good place for it. I have been working on-and-off on that page for a few months to revamp it. If it is locked due to a pending draft, let me know and I'll move the draft to another page altogether.
I think it would be better if you put it there.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 17, 2019, 08:47:07 pm
I wonder, if you can make my routine twice as fast and if you've written complete support for floats, why didn't you write this routine in the first place?
It was just because my float routines were special-purpose and writing a general-purpose routine was a daunting idea (which is why I'm grateful for your work). For the float routines, I can always guarantee the top 16 bits are smaller than BC, resulting in a routine that is under 900cc. That said, I should update my above routine with a trick I came up with for those (it reduces the problem so that BC is always less than 32768, )(correctly) correcting later so that there is no overflow to check for).

Also, I was having trouble with getting INC/DEC IXL to work for some reason. Haven't had the time to investigate that though. It could be that I just typed in the wrong opcode because I have a bad assembler which doesn't support it apparently and it's too much of a hassle to get a new one on Linux.
I use SPASM-ng (https://github.com/alberthdev/spasm-ng) and that has been great for me.

As for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack.
Hmm, I'll keep this in mind if I work more on these.

Another, easy way of doing this would be by thinking of the 32 and 16 bit numbers as 4&2-digit base-256 numbers and using a routine for smaller numbers to do the actual division.
I do this in the extended-precision float routines, and it requires two multiplications (one multiply if you don't need the remainder), but could come out faster. I might pursue this, too. In my float routines, because they were special-purpose, this method didn't pay off until the 32/32 divisions and up. However, general-purpose 32/16 might benefit.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 17, 2019, 10:06:07 pm
Quote
writing a general-purpose routine was a daunting idea
My trick: pen & paper & sleep

Quote
I use SPASM-ng and that has been great for me.
This is not the first time someone has recommended SPASM to me. It is, however, the first time someone provided me with a Github link and I thought it was only available for Windows until now. It would also mean that I have to transform all my previous programs (or at least include files) to this syntax.

Quote
Quote
As for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack.
Hmm, I'll keep this in mind if I work more on these.
I don't think that's too hard in this case: just djnz 8 times

I will try to make the sqrt routine asap, if I don't forget. It'll probably be for next weekend.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 17, 2019, 10:30:45 pm
Code: [Select]
[quote author=fghsgh link=topic=18691.msg406921#msg406921 date=1552874767]
My trick: pen & paper & sleep
[/quote]
Hey, that's the trick I use! I should clarify: without motivation, I am lazy.

[quote author=fghsgh link=topic=18691.msg406921#msg406921 date=1552874767]
This is not the first time someone has recommended SPASM to me. It is, however, the first time someone provided me with a Github link and I thought it was only available for Windows until now. It would also mean that I have to transform all my previous programs (or at least include files) to this syntax.
Glad I could help! I think with spasm you can do such things as #define TASM or something like that and it'll allow `equ` instead of `=` and whatnot.

I don't think that's too hard in this case: just djnz 8 times
True (except not using djnz since that would use B !), but you would have to call that routine which uses a stack anyways :P I imagine you mean a very limited stack as opposed to no stack whatsoever?

Quote
I will try to make the sqrt routine asap, if I don't forget. It'll probably be for next weekend.
Good luck! The pseudocode outlined here (http://www.retroprogramming.com/2017/07/a-fast-z80-integer-square-root.html?m=1) has yielded me better results than algorithm I learned as a kid. It's the same algorithm, just structured better for programming instead of computing by hand.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 18, 2019, 01:50:58 pm
Good luck! The pseudocode outlined here (http://www.retroprogramming.com/2017/07/a-fast-z80-integer-square-root.html?m=1) has yielded me better results than algorithm I learned as a kid. It's the same algorithm, just structured better for programming instead of computing by hand.
Well, I hadn't seen this, so here is a routine using the normal algorithm. I wasn't able to fit everything into the registers so it also uses one stack frame. It even uses IY, so don't run this with IM 1 enabled and restore it afterwards. It doesn't use shadow registers though, which means you can still use IM 2. This algorithm could also be better because it only uses left shifts, which are quicker for 16-bit numbers.
Code: [Select]
; calculate sqrt of 32-bit number
; input: IXIY = the number to calculate sqrt of
; output: DE = square root of IXIY, rounded down
; B = 0
; if CHL = 0, IXIY was a perfect square
; IXIY = 0

sqrt32:
 ld bc,$1000
 ld d,c
 ld e,c
 ld h,c
 ld l,c
sqrt32loop:
; IXIY << 2; carry into CHL
 add iy,iy
; adc ix,ix doesn't exist
 ld a,ixl
 rla
 ld ixl,a
 ld a,ixh
 rla
 ld ixh,a
 adc hl,hl
 rl c
; second time now
 add iy,iy
 ld a,ixl
 rla
 ld ixl,a
 ld a,ixh
 rla
 ld ixh,a
 adc hl,hl
 rl c

; get DE * 4 + 1 and store into AHL
 push hl
 xor a
 ld h,d
 ld l,e
 add hl,hl
 rla
 add hl,hl
 rla
 inc l

; if DE*4+1<=remainder -> add 1 to answer
 sub c
 jr c,nottoobigA ; A < C
 jr nz,toobigA ; A > C
 ex de,hl ; A = C
 ex (sp),hl
 sbc hl,de
 jr c,toobigDE ; HL < DE
 add a,c ; HL >= DE
 ex de,hl
 pop hl
 add hl,hl
 inc l
 ex de,hl
 djnz sqrt32loop
 ret

nottoobigA:
 pop hl
 ex de,hl
 add hl,hl
 inc l
 ex de,hl
 djnz sqrt32loop
 ret

toobigA:
 add a,c
 pop hl
 ex de,hl
 add hl,hl
 ex de,hl
 djnz sqrt32loop
 ret

toobigDE:
 add a,c
 add hl,de
 ex de,hl
 pop hl
 add hl,hl
 ex de,hl
 djnz sqrt32loop
 ret
A little explanation of what the registers mean:
B is a DJNZ counter for 32 / 2 = 16 (2 bits in IXIY are only one bit in DE)
CHL is the remainder
A(SP) is DE * 4 + 1
In between, the registers are swapped around a bit to be able to perform arithmetic on them.

Time: 4213 to 5247 cycles (.70 to .87 ms at 6 MHz), which is actually not that bad (although it seems that a 16-bit sqrt is faster than a 16-bit division, so maybe this should also be possible)
Size: 78 bytes

I'm curious how you'll optimize this one.

As for the other algorithm you sent, I will try that one later.

EDIT: I'm currently remaking this one but with shadow registers instead of IXIY and it seems promising so far.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 18, 2019, 02:56:10 pm
True (except not using djnz since that would use B !), but you would have to call that routine which uses a stack anyways :P I imagine you mean a very limited stack as opposed to no stack whatsoever?
About not using B: I thought that was evident so I left it out.
I meant copy & pasting the routine into the program, where it is needed:
Code: [Select]
; code code code
; set arguments for divisionn
; insert division routine here
; code code code
... or I would have to put page 0 back in before calling.

It's not that I often need division, especially not in a stackless environment, but it's handy to have either way.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 19, 2019, 04:56:25 pm
The routine using shadow registers is done. The previous one probably wasn't working correctly anyway. I've tested this one with 1000 different inputs, so I'm quite confident this time.
This uses an odd number of EXXs, so the BC input ends up in BC'.
Code: [Select]
; input: DEHL = number to calculate sqrt of
; output: DE = sqrt
; destroys:
;  BC, DE', HL' = 0
;  AHL = remainder (how much lower than DEHL the closest perfect square was)
;  BC' = previous BC
; enables interrupts
; time: 3087 to 3375 (.51 ms to .56 ms at 6 MHz)
; size: 56 bytes

sqrt32:
 di             ; 4
 exx            ; 4
 xor a          ; 4
 ld bc,$1000    ; 10
 ld d,a         ; 4
 ld e,a         ; 4
 ld h,a         ; 4
 ld l,a         ; 4
sqrt32loop:
 exx            ; 4
 add hl,hl      ; 11
 rl e           ; 8
 rl d           ; 8
 exx            ; 4
 adc hl,hl      ; 15
 rla            ; 4
 exx            ; 4
 add hl,hl      ; 11
 rl e           ; 8
 rl d           ; 8
 exx            ; 4
 adc hl,hl      ; 15
 rla            ; 4

; registers here:
;  AHL: remainder
;  DEHL': input
;  CDE: output * 2
;  B: djnz

 ex de,hl       ; 4
 add hl,hl      ; 11
 rl c           ; 8
 inc l          ; 4
 ex de,hl       ; 4

 sbc hl,de      ; 15
 sbc a,c        ; 4 
 jr nc,nottoobig; 12/7 
 add hl,de      ; 11
 adc a,c        ; 4
 dec e          ; 4
 dec e          ; 4 
nottoobig:
 inc e          ; 4 
 djnz sqrt32loop; 13/8 
 rr c           ; 8
 rr d           ; 8
 rr e           ; 8
 ei             ; 4
 ret            ; 10   
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 19, 2019, 04:58:24 pm
I haven't had much luck with this one, unfortunately. If you are open to using more stack space or external memory, you can take advantage of first working with 8-bit values, then 16-bit, then 32-bit.

Oh! I was just about to post and you posted. That looks like a much cleaner routine!

EDIT: One thing though: In you routine, A is 0 until the last iteration, so you can make the last iteration a special case and speed up the inner loop.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 19, 2019, 05:07:08 pm
One thing though: In you routine, A is 0 until the last iteration, so you can make the last iteration a special case and speed up the inner loop.
You mean C? A can get quite big, actually. While debugging, I've seen it go up to 7. That might have been because of a bug though (which is now fixed).

I'd rather go to sleep now, so it'll be for tomorrow.

EDIT: I tried, and it didn't pass the first of my 1000 tests.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 19, 2019, 05:56:10 pm
The remainder is no more than 2sqrt(x), so in this case at most 0x1FFFE, but on the 15th iteration it is at most 0xFFFE
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 20, 2019, 07:46:36 am
Actually, HL' is 0 from the 8th iteration onward, so no need to ADD HL,HL.
So I'd say:
first 8 iterations ADD HL,HL \ RL E \ RL D
then one EX DE,HL
then 4 iterations ADD HL,HL
then 4 iterations SLA H

or maybe even split up the first 8 iterations into
ADD HL,HL \ RL E \ RL D
one EX DE,HL
and SLA D \ ADC HL,HL

The remainder is no more than 2sqrt(x), so in this case at most 0x1FFFE, but on the 15th iteration it is at most 0xFFFE
It must've been a bug then.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 20, 2019, 08:58:40 am
Well,why not just do ex de,hl first and make the first 8 iterations add hl,hl and then another ex de,hl and the final iterations add hl,de.

In the version I'm working on, I have 3 groups of 4 iterations (shifting 1 byte) and then three iterations and then 1 special case. Something is wrong with it though, so I have to work more on it.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 23, 2019, 10:18:40 am
Here is my version. It's about 50 bytes larger, but averages 1889.75cc. It does use stack and shadow registers, but it could be made faster by totally unrolling (which would be about 300 bytes of code).
Code: [Select]
sqrt32:
;Input: HLDE
;Output: DE is the square root, AHL is the remainder
;Destroys: D'E', H'L'
;Speed: 248+{0,44}+3*sqrt32sub+sqrt32sub_2+sqrt32_iter15
;min: 1697cc
;max: 2086cc
;avg: 1889.75cc
;
;Python implementation:
;  remainder = 0
;  acc = 0
;  for k in range(16):
;    acc<<=1
;    x&=0xFFFFFFFF
;    x<<=2
;    y=x>>32
;    remainder<<=2
;    remainder+=y
;    if remainder>=acc*2+1:
;      remainder-=(acc*2+1)
;      acc+=1
;  return [acc,remainder]
;
  di
  exx 
  ld hl,0         ;remainder
  ld d,h \ ld e,h ;acc
  exx

  ld a,h \ call sqrt32sub \ exx
  ld a,l \ call sqrt32sub \ exx
  ld a,d \ call sqrt32sub \ exx
;Now we have four more iterations
;The first two are no problem
  ld a,e
  exx
  call sqrt32sub_2

;On the next iteration, HL might temporarily overflow by 1 bit
  call sqrt32_iter15

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,0
  inc e
  sbc hl,de \ sbc a,0
  ret p
  add hl,de
  adc a,0
  dec e
  add hl,de
  adc a,0
  ret


sqrt32sub:
;min: 391cc
;max: 483cc
;avg: 437cc
  exx
  call sqrt32sub_2

sqrt32sub_2:
;min: 185cc
;max: 231cc
;avg: 208cc
  call +_

_:
;min: 84cc
;max: 107cc
;avg: 95.5cc

  sll e \ rl d      ;sla e \ rl d \ inc e

  add a,a
  adc hl,hl
  add a,a
  adc hl,hl

  sbc hl,de
  inc e
  ret nc
  dec e
  add hl,de
  dec e
  ret

sqrt32_iter15:
;91+{8,0+{0,23}}
;min: 91cc
;max: 114cc
;avg: 100.75cc

  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  inc e
  ret nc
  dec e
  add hl,de
  dec e
  ret
sqrt32_iter15_br0:
  or a
  sbc hl,de
  inc e
  ret

EDIT:
Oh jeez, here is an even bigger version that uses less stack space and doesn't use shadow registers or index registers:
Code: [Select]
sqrt32:
;Input: HLDE
;speed: 238+{0,1}+{0,44}+sqrtHL+3*sqrt32sub_2+sqrt32_iter15
;min: 1260
;max: 1506
;avg: 1377.75

  push de
  call sqrtHL
  pop bc
  add a,a
  ld e,a
  jr nc,+_
  inc d
_:

  ld a,b
  call sqrt32sub_2
  call sqrt32sub_2
;Now we have four more iterations
;The first two are no problem
  ld a,c
  call sqrt32sub_2

;On the next iteration, HL might temporarily overflow by 1 bit
  call sqrt32_iter15

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,0
  inc e
  or a
  sbc hl,de \ sbc a,0
  ret p
  add hl,de
  adc a,0
  dec e
  add hl,de
  adc a,0
  ret

sqrt32sub_2:
;min: 185cc
;max: 231cc
;avg: 208cc
  call +_

_:
;min: 84cc
;max: 107cc
;avg: 95.5cc

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl

  sbc hl,de
  inc e
  ret nc
  dec e
  add hl,de
  dec e
  ret

sqrt32_iter15:
;91+{8,0+{0,23}}
;min: 91cc
;max: 114cc
;avg: 100.75cc

  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  inc e
  ret nc
  dec e
  add hl,de
  dec e
  ret
sqrt32_iter15_br0:
  or a
  sbc hl,de
  inc e
  ret
.echo $-sqrt32

sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 376cc
;max: 416cc
;avg: 393cc
  ld de,$5040
  ld a,h
  sub e
  jr nc,+_
  add a,e
  ld d,$10
_:
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 5,d
  res 4,d
  srl d

  set 2,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 3,d
  res 2,d
  srl d

  inc d
  sub d
  jr nc,+_
  add a,d
  dec d   ;this resets the low bit of D, so `srl d` resets carry.
  .db $06   ;start of ld b,* which is 7cc to skip the next byte.
_:
  inc d
  srl d
  ld h,a


  sbc hl,de
  ld a,e
  jr nc,+_
  add hl,de
_:
  ccf
  rra
  srl d
  rra
  ld e,a

  sbc hl,de
  jr nc,+_
  add hl,de
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  or %00100000
  xor %00011000
  srl d
  rra
  ld e,a


  sbc hl,de
  jr nc,+_
  add hl,de
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  or %00001000
  xor %00000110
  srl d
  rra
  ld e,a
  sbc hl,de
  jr nc,+_
  add hl,de
  srl d
  rra
  ret
_:
  inc a
  srl d
  rra
  ret
.echo $-sqrtHL
It does use the 16-bit square root routine here (https://github.com/Zeda/z80float/blob/master/extended/sqrt/sqrt16.z80) to take care of the first 16 bits :P Combined, it is 194 bytes.
EDIT2: Forgot that sqrtHL didn't preserve BC, fixed that. Now it seems the last bit of the remainder might be broken, so I have to fix that :(
EDIT3: Fixed the bug in the bottom bit :) I just needed to reset the carry flag before the second subtraction in the final iteration.

EDIT4: In a scenario where you don't have RAM for a stack, we can hardcode it! It even saves 54cc (but adds 20 bytes). 10cc of that 54cc is just due to not having an ending RET. I also switched input to HLIX instead of HLDE.
I reorganized the code so that it would be "obvious" that sqrt32 is an in-line routine, so I put it at the end (in practice, the subroutines would probably be toward the end of mem). The precomputed stack is inserted just before sqrt32.
Code: [Select]
sqrt32sub_2:
;min: 178cc
;max: 224cc
;avg: 201cc
  jp return4
return4:
;min: 84cc
;max: 107cc
;avg: 95.5cc

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl

  sbc hl,de
  inc e
  ret nc
  dec e
  add hl,de
  dec e
  ret

sqrt32_iter15:
;91+{8,0+{0,23}}
;min: 91cc
;max: 114cc
;avg: 100.75cc

  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  inc e
  ret nc
  dec e
  add hl,de
  dec e
  ret
sqrt32_iter15_br0:
  or a
  sbc hl,de
  inc e
  ret

sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 376cc
;max: 416cc
;avg: 393cc
  ld de,$5040
  ld a,h
  sub e
  jr nc,+_
  add a,e
  ld d,$10
_:
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 5,d
  res 4,d
  srl d

  set 2,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 3,d
  res 2,d
  srl d

  inc d
  sub d
  jr nc,+_
  add a,d
  dec d   ;this resets the low bit of D, so `srl d` resets carry.
  .db $06   ;start of ld b,* which is 7cc to skip the next byte.
_:
  inc d
  srl d
  ld h,a


  sbc hl,de
  ld a,e
  jr nc,+_
  add hl,de
_:
  ccf
  rra
  srl d
  rra
  ld e,a

  sbc hl,de
  jr nc,+_
  add hl,de
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  or %00100000
  xor %00011000
  srl d
  rra
  ld e,a


  sbc hl,de
  jr nc,+_
  add hl,de
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  or %00001000
  xor %00000110
  srl d
  rra
  ld e,a
  sbc hl,de
  jr nc,+_
  add hl,de
  srl d
  rra
  ret
_:
  inc a
  srl d
  rra
  ret
sqrt32_stack:
.dw return0
.dw return4   ;subroutine
.dw return1
.dw return4   ;subroutine
.dw return2
.dw return4   ;subroutine
.dw return3
.dw return5
sqrt32_stack_end:




sqrt32:
;Input: HLIX
;Output: DE is the sqrt, AHL is the remainder
;min: 1203
;max: 1455
;avg: 1323.75
  ld sp,sqrt32_stack
  jp sqrtHL
return0:
  add a,a
  ld e,a
  jr nc,+_
  inc d
_:

  ld a,ixh
  jp sqrt32sub_2
return1:
  jp sqrt32sub_2
return2:
;Now we have four more iterations
;The first two are no problem
  ld a,ixl
  jp sqrt32sub_2
return3:
;On the next iteration, HL might temporarily overflow by 1 bit
  jp sqrt32_iter15
return5:

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,0
  inc e
  or a
  sbc hl,de \ sbc a,0
  jp p,+_
  add hl,de
  adc a,0
  dec e
  add hl,de
  adc a,0
_:
  ;...
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: fghsgh on March 23, 2019, 12:51:01 pm
Okay, you win.
Title: Re: [z80] 32 bit by 16 bits division and 32 bit square root
Post by: Xeda112358 on March 23, 2019, 12:58:23 pm
I bet if the Z80 gods got in here they could do even better! As it is, I see a neat optimization saving 3 bytes and on average 5cc, but I think I'll wait to make more edits. Basically, in the last iteration instead of:
Code: [Select]
sqrt32_iter16:
  add a,a
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,0
  inc e
  or a
  sbc hl,de \ sbc a,0
  ret p
  add hl,de
  adc a,0
  dec e
  add hl,de
  adc a,0
  ret
It could be:
Code: [Select]
sqrt32_iter16:
  add a,a
  ld b,a  ;either 0x00 or 0x128
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

I'll have to test that and other optimizations after work (if I have time tonight).

EDIT:
Here is an unrolled version that includes the above optimization and works with input=HLIX. This code is 167 bytes, but including the sqrtHL routine posted earlier, the total size is 266 bytes.
Code: [Select]
sqrtHLIX:
;Input: HLIX
;Output: DE is the sqrt, AHL is the remainder
;speed: 754+{0,1}+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL
;min: 1130
;max: 1266
;avg: 1190.5
;167 bytes

  call sqrtHL
  add a,a
  ld e,a
  jr nc,+_
  inc d
_:

  ld a,ixh
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

;Now we have four more iterations
;The first two are no problem
  ld a,ixl
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

sqrt32_iter15:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrt32_iter16
sqrt32_iter15_br0:
  or a
  sbc hl,de
_:
  inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  ld b,a        ;either 0x00 or 0x80
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

Thank you so much for the inspiration to work on this routine! I was able to take the results here and make my float routines a little bit faster and I saved 222 bytes in all!