0 Members and 1 Guest are viewing this topic.

a few quick optimizations I see are changing those Inc/dec ix to use ixl instead

you don't need the `or a` before the sbc

Do you prefer speed optimizations or size optimizations, btw?

div32_16: ld de,0 ; 10 ld a,32 ; 7div32_16loop: add ix,ix ; 15 adc hl,hl ; 15 ex de,hl ; 4 adc hl,hl ; 15 sbc hl,bc ; 15 inc ixl ; 8 jr nc,cansub ; 12/7 add hl,bc ; 11 dec ixl ; 8cansub: ex de,hl ; 4 dec a ; 4 jr nz,div32_16loop ; 12/7 ret ; 10

div32_16: ex de,hl ; 4 ld hl,0 ; 10 ld a,32 ; 7div32_16loop: add ix,ix ; 15 rl e ; 8 rl d ; 8 adc hl,hl ; 15 sbc hl,bc ; 15 inc ixl ; 8 jr nc,cansub ; 12/7 add hl,bc ; 11 dec ixl ; 8cansub: dec a ; 4 jr nz,div32_16loop ; 12/7 ex de,hl ; 4 ret ; 10

div32_16: ex de,hl ; 4 ld hl,0 ; 10 ld a,e ; 4 ld e,32 ; 7div32_16loop: add ix,ix ; 15 rla ; 4 rl d ; 8 adc hl,hl ; 15 sbc hl,bc ; 15 inc ixl ; 8 jr nc,cansub ; 12/7 add hl,bc ; 11 dec ixl ; 8cansub: dec e ; 4 jr nz,div32_16loop ; 12/7 ex de,hl ; 4 ret ; 10

div32_16:;136+4*div32_16_sub8;min: 2180cc;max: 2788cc;avg: 2484cc;49 bytes ex de,hl ; 4 ld hl,0 ; 10 ld a,d ; 4 call div32_16_sub8 ; 17 ld d,a ; 4 ld a,e ; 4 call div32_16_sub8 ; 17 ld e,a ; 4 ld a,ixh ; 8 call div32_16_sub8 ; 17 ld ixh,a ; 8 ld a,ixl ; 8 call div32_16_sub8 ; 17 ld ixl,a ; 8 ex de,hl ; 4 ret ; 10div32_16_sub8:;119+8*div32_16_sub;min: 511cc;max: 663cc;avg: 587cc call +__:;17+2(17+2(div32_16_sub))) call +__:;17+2(div32_16_sub) call div32_16_subdiv32_16_sub:;min: 49cc;max: 68cc;avg: 58.5cc add a,a ; 4 adc hl,hl ; 15 sbc hl,bc ; 15 inc a ; 4 ret nc ;11/5 add hl,bc ; 11 dec a ; 4 ret ; 10

div32_16:;136+4*div32_16_sub8;min: 2340cc;max: 3012cc;avg: 2676cc;55 bytes ex de,hl ; 4 ld hl,0 ; 10 ld a,d ; 4 call div32_16_sub8 ; 17 ld d,a ; 4 ld a,e ; 4 call div32_16_sub8 ; 17 ld e,a ; 4 ld a,ixh ; 8 call div32_16_sub8 ; 17 ld ixh,a ; 8 ld a,ixl ; 8 call div32_16_sub8 ; 17 ld ixl,a ; 8 ex de,hl ; 4 ret ; 10div32_16_sub8:;119+8*div32_16_sub;min: 551cc;max: 719cc;avg: 635cc call +__:;17+2(17+2(div32_16_sub))) call +__:;17+2(div32_16_sub) call div32_16_subdiv32_16_sub:;54+{0,2+{0,19}};min: 54cc;max: 75cc;avg: 64.5cc add a,a ; 4 inc a ; 4 adc hl,hl ; 15 jr c,+_ ;12/7 sbc hl,bc ; 15 ret nc ;11/5 add hl,bc ; 11 dec a ; 4 ret ; 10_: or a ; 4 sbc hl,bc ; 15 ret ; 10

div32_16:;HLIX/BC -> HLIX remainder DE;158+4*div32_16_sub8;min: 2298cc;max: 3034cc;avg: 2546cc;59 bytes ex de,hl ; 4; Negate BC to allow add instead of sbc xor a ; 4; Need to set HL to 0 anyways, so save 2cc and a byte ld h,a ; 4 ld l,a ; 4 sub c ; 4 ld c,a ; 4 sbc a,a ; 4 sub b ; 4 ld b,a ; 4 ld a,d ; 4 call div32_16_sub8 ; 17 ld d,a ; 4 ld a,e ; 4 call div32_16_sub8 ; 17 ld e,a ; 4 ld a,ixh ; 8 call div32_16_sub8 ; 17 ld ixh,a ; 8 ld a,ixl ; 8 call div32_16_sub8 ; 17 ld ixl,a ; 8 ex de,hl ; 4 ret ; 10div32_16_sub8:;119+8*div32_16_sub;min: 535cc;max: 719cc;avg: 597cc call +__:;17+2(17+2(div32_16_sub))) call +__:;17+2(div32_16_sub) call div32_16_subdiv32_16_sub:52+{4,0+{0,23}};min: 52cc;max: 75cc;avg: 59.75cc add a,a ; 4 inc a ; 4 adc hl,hl ; 15 jr c,+_ ;12/7 add hl,bc ; 11 ret c ;11/5 sbc hl,bc ; 15 dec a ; 4 ret ; 10_: add hl,bc ; 11 ret ; 10

div32_16:;HLIX/BC -> HLIX remainder DE;174+4*div32_16_sub8;min: 2186cc;max: 2794cc;avg: 2466cc;61 bytes ex de,hl ; 4; Negate BC to allow add instead of sbc xor a ; 4; Need to set HL to 0 anyways, so save 2cc and a byte ld h,a ; 4 ld l,a ; 4 sub c ; 4 ld c,a ; 4 sbc a,a ; 4 sub b ; 4 ld b,a ; 4 ld a,d ; 4 call div32_16_sub8 ; 17 rla ; 4 ld d,a ; 4 ld a,e ; 4 call div32_16_sub8 ; 17 rla ; 4 ld e,a ; 4 ld a,ixh ; 8 call div32_16_sub8 ; 17 rla ; 4 ld ixh,a ; 8 ld a,ixl ; 8 call div32_16_sub8 ; 17 rla ; 4 ld ixl,a ; 8 ex de,hl ; 4 ret ; 10div32_16_sub8:;119+8*div32_16_sub;min: 503cc;max: 655cc;avg: 573cc call +__:;17+2(17+2(div32_16_sub))) call +__:;17+2(div32_16_sub) call div32_16_subdiv32_16_sub:;48+{8,0+{0,19}};min: 48cc;max: 67cc;avg: 56.75cc rla ; 4 adc hl,hl ; 15 jr c,+_ ;12/7 add hl,bc ; 11 ret c ;11/5 sbc hl,bc ; 15 ret ; 10_: add hl,bc ; 11 scf ; 4 ret ; 10

And yes, that's a good place for it. I have been working on-and-off on that page for a few months to revamp it. If it is locked due to a pending draft, let me know and I'll move the draft to another page altogether.

I wonder, if you can make my routine twice as fast and if you've written complete support for floats, why didn't you write this routine in the first place?

Also, I was having trouble with getting INC/DEC IXL to work for some reason. Haven't had the time to investigate that though. It could be that I just typed in the wrong opcode because I have a bad assembler which doesn't support it apparently and it's too much of a hassle to get a new one on Linux.

As for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack.

Another, easy way of doing this would be by thinking of the 32 and 16 bit numbers as 4&2-digit base-256 numbers and using a routine for smaller numbers to do the actual division.

writing a general-purpose routine was a daunting idea

I use SPASM-ng and that has been great for me.

QuoteAs for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack.Hmm, I'll keep this in mind if I work more on these.

[quote author=fghsgh link=topic=18691.msg406921#msg406921 date=1552874767]My trick: pen & paper & sleep[/quote]Hey, that's the trick I use! I should clarify: without motivation, I am lazy.[quote author=fghsgh link=topic=18691.msg406921#msg406921 date=1552874767]This is not the first time someone has recommended SPASM to me. It is, however, the first time someone provided me with a Github link and I thought it was only available for Windows until now. It would also mean that I have to transform all my previous programs (or at least include files) to this syntax.

I don't think that's too hard in this case: just djnz 8 times

I will try to make the sqrt routine asap, if I don't forget. It'll probably be for next weekend.

Good luck! The pseudocode outlined here has yielded me better results than algorithm I learned as a kid. It's the same algorithm, just structured better for programming instead of computing by hand.

; calculate sqrt of 32-bit number; input: IXIY = the number to calculate sqrt of; output: DE = square root of IXIY, rounded down; B = 0; if CHL = 0, IXIY was a perfect square; IXIY = 0sqrt32: ld bc,$1000 ld d,c ld e,c ld h,c ld l,csqrt32loop:; IXIY << 2; carry into CHL add iy,iy; adc ix,ix doesn't exist ld a,ixl rla ld ixl,a ld a,ixh rla ld ixh,a adc hl,hl rl c; second time now add iy,iy ld a,ixl rla ld ixl,a ld a,ixh rla ld ixh,a adc hl,hl rl c; get DE * 4 + 1 and store into AHL push hl xor a ld h,d ld l,e add hl,hl rla add hl,hl rla inc l; if DE*4+1<=remainder -> add 1 to answer sub c jr c,nottoobigA ; A < C jr nz,toobigA ; A > C ex de,hl ; A = C ex (sp),hl sbc hl,de jr c,toobigDE ; HL < DE add a,c ; HL >= DE ex de,hl pop hl add hl,hl inc l ex de,hl djnz sqrt32loop retnottoobigA: pop hl ex de,hl add hl,hl inc l ex de,hl djnz sqrt32loop rettoobigA: add a,c pop hl ex de,hl add hl,hl ex de,hl djnz sqrt32loop rettoobigDE: add a,c add hl,de ex de,hl pop hl add hl,hl ex de,hl djnz sqrt32loop ret

True (except not using djnz since that would use B !), but you would have to call that routine which uses a stack anyways I imagine you mean a very limited stack as opposed to no stack whatsoever?

; code code code; set arguments for divisionn; insert division routine here; code code code

; input: DEHL = number to calculate sqrt of; output: DE = sqrt; destroys:; BC, DE', HL' = 0; AHL = remainder (how much lower than DEHL the closest perfect square was); BC' = previous BC; enables interrupts; time: 3087 to 3375 (.51 ms to .56 ms at 6 MHz); size: 56 bytessqrt32: di ; 4 exx ; 4 xor a ; 4 ld bc,$1000 ; 10 ld d,a ; 4 ld e,a ; 4 ld h,a ; 4 ld l,a ; 4sqrt32loop: exx ; 4 add hl,hl ; 11 rl e ; 8 rl d ; 8 exx ; 4 adc hl,hl ; 15 rla ; 4 exx ; 4 add hl,hl ; 11 rl e ; 8 rl d ; 8 exx ; 4 adc hl,hl ; 15 rla ; 4; registers here:; AHL: remainder; DEHL': input; CDE: output * 2; B: djnz ex de,hl ; 4 add hl,hl ; 11 rl c ; 8 inc l ; 4 ex de,hl ; 4 sbc hl,de ; 15 sbc a,c ; 4 jr nc,nottoobig; 12/7 add hl,de ; 11 adc a,c ; 4 dec e ; 4 dec e ; 4 nottoobig: inc e ; 4 djnz sqrt32loop; 13/8 rr c ; 8 rr d ; 8 rr e ; 8 ei ; 4 ret ; 10

One thing though: In you routine, A is 0 until the last iteration, so you can make the last iteration a special case and speed up the inner loop.

The remainder is no more than 2sqrt(x), so in this case at most 0x1FFFE, but on the 15th iteration it is at most 0xFFFE