Author Topic: Streamlined Asm routines  (Read 7876 times)

0 Members and 1 Guest are viewing this topic.

Offline ACagliano

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 919
  • Rating: +32/-2
    • View Profile
    • ClrHome Productions
Re: Streamlined Asm routines
« Reply #15 on: October 11, 2011, 09:33:29 pm »
So if you want to delay for approximately 'a' seconds you can try this:
Code: [Select]
     ld b,107
     halt
     djnz $-1
     dec a
     jr nz,$-6
     ret
Does this help?
Remember to have interrupts enabled.  ;)

Thank you mucho. I don't actually disable them to begin with...lol.

Offline NanoWar

  • LV4 Regular (Next: 200)
  • ****
  • Posts: 140
  • Rating: +18/-6
    • View Profile
Re: Streamlined Asm routines
« Reply #16 on: December 08, 2011, 07:45:22 am »
If you don't want to look up the instruction bytes and are too lazy to write out labels, you could also use relative labels (with Spasm):

_ ld b,107
_ halt
  djnz -_
  dec a
  jr nz, --_
  ret

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: Streamlined Asm routines
« Reply #17 on: January 04, 2012, 02:21:31 pm »
Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is a 8x16 multiplication (it requires 4 more cycles to make it 8x8).
Code: [Select]
A_Times_DE:
;Input:
;     A,DE
;Outputs:
;     A is 0
;     BC is not changed
;     DE is not changed
;     HL is the result
;     z flag is set
;     c flag is set if the input A is not 0
;Notes:
;           If A is 0, 29 cycles
;Speed: 145+6n+21b cycles
;           n=floor(log(a)/log(2))
;           b is the number of bits in the number
;           Testing over all values of A from 1 to 255:
;           313.7058824 average cycles
;           Worst: 355
;           Best  : 166 (non trivial)
;Size: 25 bytes
     ld hl,0           ;10
     or a \ ret z     ;9
     cpl \ scf        ;8
     adc a,a         ;4
     jp nc,$+7       ;10         ;45
Loop:
     add a,a          ;4
     jp c,$-1         ;10         ;14(7-n)

     add hl,de        ;11         ;11         (the rest are counted below)
     add a,a          ;4           ;4b
     ret z              ;5|11      ;5b+6
     add hl,hl         ;11         ;11b-11
     jp p,$-4         ;21|20     ;20n+b
     jp $-7
So that code is about twice as large as the following more standard routine, but it is also on average about 52 cycles faster and the worst case scenario is 35 cycles better than the worst case for the next routine. The advantage with the following code is that the speed is not all over the place:
Code: [Select]
DE_Times_A:
;Inputs:
;     DE and A are factors
;Outputs:
;     A is not changed
;     B is 0
;     C is not changed
;     DE is not changed
;     HL is the product
;Speed:
;     342+6x, x is the number of bits in A
;     Average: 366 cycles
;Size:
;     13 bytes
     ld b,8            ;7              7
     ld hl,0           ;10             10
       add hl,hl      ;11*8         88
       rlca            ;4*8           32
       jr nc,$+3     ;(12|18)*8  96+6x
         add hl,de   ;--             --
       djnz $-5      ;13*7+8      99
     ret               ;10             10

Another interesting note is that the first routine does not use a counter and so it preserves BC. For the best speed without an LUT, I still think unrolled routines are the best :)

Actually, another note-- the first routine doesn't really get a speed boost from unrolling :/

You can make a hybrid of the two codes that will cost the b register, but it will be 21 bytes and average a smidge over 320 cycles, too.

EDIT: Also, I posted here because the topic title had a nice name and I am not ready to submit it elsewhere without the scrutiny of the better asm coders, first :P

Offline FloppusMaximus

  • LV5 Advanced (Next: 300)
  • *****
  • Posts: 290
  • Rating: +57/-5
    • View Profile
Re: Streamlined Asm routines
« Reply #18 on: January 07, 2012, 03:53:28 pm »
Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is a 8x16 multiplication (it requires 4 more cycles to make it 8x8).

Unless I've made a mistake somewhere, this routine doesn't work as written, because at the time you're testing the sign flag, the bit you're interested in has already been shifted out.  But it's an interesting idea, so here's a version that works (but could probably be optimized more):
Code: [Select]
ld hl,0 ; 10
or a ; 4
ret z ; 5

scf ; 4
skip_zeroes:
adc a,a ; 4(9-n)
jr nc,skip_zeroes ; 12(9-n) - 5

jp loop_add0 ; 10

loop_add:
ret z ; 5k + 6
add hl,hl ; 11k
loop_add0:
add hl,de ; 11(k+1)
loop_noadd:
add a,a ; 4n
jr c, loop_add ; 7n + 5k
add hl,hl ; 11(n-k-1)
jp loop_noadd ; 10(n-k-1)
If I've worked it out correctly, this has a minimum (non-trivial) running time of 192, a maximum of 437, and average of ~368.57.

Offline DrDnar

  • LV7 Elite (Next: 700)
  • *******
  • Posts: 546
  • Rating: +97/-1
    • View Profile
Re: Streamlined Asm routines
« Reply #19 on: January 07, 2012, 05:15:59 pm »
For reference, here are the timings for the routines from Z80 Bits.

Rolled:
Code: [Select]
; Ref Worst Best
MultHbyE:
ld l, 0 ; 7 7 7
ld d, l ; 4 4 4
sla h ; 8 8 8
jr nc,$+3 ; 12/7 12 7
ld l,e ; 4 4
ld b, 7 ; 7 7 7
_: add hl,hl ; 11 11 11
jr nc,$+3 ; 12/7 7 12
add hl,de ; 11 11
djnz -_ ; 13/8 13 13
; -5 -5
; 327 284

MultAbyDE:
ld hl, 0 ; 10 10 10
ld c, l ; 4 4 4
add a,a ; 4 4 4
jr nc,$+4 ; 12/7 7 12
ld h,d ; 4 4
ld l,e ; 4 4
ld b, 7 ; 7 7 7
_: add hl,hl ; 11 11 11
rla ; 4 4 4
jr nc,$+4 ; 12/7 7 12
add hl,de ; 11 11
adc a,c ; 4 4
djnz -_ ; 13/8 13 13
; -5 -5
; 385 312

Unrolled:
Code: [Select]
MultHbyE:
; L and D must already be 0
sla h ; 8 8 8
jr nc,$+3 ; 12/7 12 7
ld l,e ; 4 4

add hl,hl ; 11 11 11
jr nc,$+3 ; 12/7 7 12
add hl,de ; 11 77
; *7 *7 *7
; 223 180

MultAbyDE:
; HL and C must already be 0
add a,a ; 4 4 4
jr nc,$+4 ; 12/7 7 12
ld h,d ; 4 4
ld l,e ; 4 4

add hl,hl ; 11 11 11
rla ; 4 4 4
jr nc,$+4 ; 12/7 7 12
add hl,de ; 11 11
adc a,c ; 4 4
; *7 *7 *7
; 278 205
« Last Edit: January 07, 2012, 05:22:58 pm by DrDnar »
"No tools will make a man a skilled workman, or master of defense, nor be of any use to him who has not learned how to handle them, and has never bestowed any attention upon them. . . . Yes, [] the tools which would teach men their own use would be beyond price."—Plato's The Republic, circa 380 BC

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: Streamlined Asm routines
« Reply #20 on: January 07, 2012, 07:18:36 pm »
This was to Floppus, but I lost internet connection for a while:
Dang, I must have messed up big time somewhere. You are right, that routine really does not work at all. Now I need to figure out where I went wrong because I was sure I had a working version :(

I think you over counted your cycles, too o.o What I did to make things easier was to remove the code that stripped the leading zeroes and that alone has a worst case of 404 cycles and best case of 327 cycles. pretty much, it is 316+11b where b is the number of bits in A (b=0 is the trivial case and I am not counting that). Your routine is still eluding me how to best analyse it, though, but I think it is a lot faster than you thought and better than the normal routine.