Omnimaga

Title: Streamlined Asm routines
Post by: ACagliano on October 10, 2011, 10:14:06 am
I need a couple asm routines. They are for a program I am working on. They are as follows. The data the routine reads will be pushed onto the stack. Assuming you pop into hl:

1. Render a circle onto the screen
hl = time to wait, after rendering element, before moving on
hl + 1 = circle center x coord
hl + 2 = circle center y coord
hl + 3 = radius

2. Render a white rectangle with black border onto screen
hl = time to wait, after rendering element, before moving on
hl + 1 = x of upper left corner
hl + 2 = y of upper left corner
hl + 3 = width
hl + 4 = height

3. Render text onto the screen
hl = time to wait...
hl + 1 = x to start display
hl + 2 = y to start display
hl + 3 = width of text display (in chars)
hl + 4 = zero-terminated string

4. Render sprite onto screen
hl = time to wait...
hl + 1 = x to start
hl + 2 = y to start
hl + 3 = width
hl + 4 = height
hl + 5 = sprite data

Any help would be great.
Title: Re: Streamlined Asm routines
Post by: thepenguin77 on October 10, 2011, 10:53:14 am
While I'm not necessarily going to write the routines for you (I've already written quite a few) here's how I would go about writing them:

1. I know my house might get bombed for this, but honestly, the only way I can see to do this would be to check out how Axe does it. Axe draws perfect circles very quickly and if I needed a circle routine, I would just copy Axe's.

2. This one has actually been done for you, bcall(_DrawRectBorderClear) (http://education.ti.com/calculators/downloads/US/Software/Download/en/177/6585/83psysroutines.pdf#page=142) draws a rectangle with a white inner section. (Page 142 if your browser doesn't redirect you)

3. If you mean just draw text, then bcall(_vPutS) is your routine. But, if you mean draw text within a certain boundary area, bcall(_SFont_Len) will tell you how long an individual letter is. From there, you can draw the letters 1 by 1 with bcall(_vPutMap) and start a new line whenever you run out of space. (bcall(_SFont_Len) is on page 47 of that pdf I linked above)

4. There are so many sprite rendering routines, there's no need to make a new one. Here (http://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:put8x8sprite) is a page for 8-bit wide sprites and here (http://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:put16xBsprite) is a page for 16-bit wide sprites. However, if you want just a single routine to do all of your sprites no matter how big they are, this (http://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:putLargeSprite) is your routine. Of course, picking one of the smaller ones would be faster than this though.


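As for the delays, your best bet is to make a single delay routine and call it. In all honesty, this routine right here is probably all you need:
Code: [Select]
;hl = number of loop iterations to wait (hl = 0 waits 65,536 iterations)
Delay:
dec hl
ld a, h
or l           ;z is set only when hl has reached 0
jr nz, Delay
ret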
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 10, 2011, 01:30:43 pm
Quote
1. I know my house might get bombed for this, but honestly, the only way I can see to do this would be to check out how Axe does it. Axe draws perfect circles very quickly and if I needed a circle routine, I would just copy Axe's.

Ok. Will do.

Quote
2. This one has actually been done for you, bcall(_DrawRectBorderClear) (http://education.ti.com/calculators/downloads/US/Software/Download/en/177/6585/83psysroutines.pdf#page=142) draws a rectangle with a white inner section. (Page 142 if your browser doesn't redirect you)

Awesome. Thanks.

Quote
3. If you mean just draw text, then bcall(_vPutS) is your routine. But, if you mean draw text within a certain boundary area, bcall(_SFont_Len) will tell you how long an individual letter is. From there, you can draw the letters 1 by 1 with bcall(_vPutMap) and start a new line whenever you run out of space. (bcall(_SFont_Len) is on page 47 of that pdf I linked above)

Yeah, I'll be drawing text of the small font within a boundary area.

Quote
4. There are so many sprite rendering routines, there's no need to make a new one. Here (http://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:put8x8sprite) is a page for 8-bit wide sprites and here (http://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:put16xBsprite) is a page for 16-bit wide sprites. However, if you want just a single routine to do all of your sprites no matter how big they are, this (http://wikiti.brandonw.net/index.php?title=Z80_Routines:Graphic:putLargeSprite) is your routine. Of course, picking one of the smaller ones would be faster than this though.

I need one routine to draw sprites of different sizes. Thanks.

Quote
As for the delays, your best bet is to make a single delay routine and call it.  In all honesty this routine right here is probably all you need:
Code: [Select]
Delay:
dec hl
ld a, h
or l
jr nz, Delay
ret

Yeah. I think I can handle that. How many 'nop' or 'halt' cycles would equal 1 second, in fast mode?
Title: Re: Streamlined Asm routines
Post by: thepenguin77 on October 10, 2011, 02:44:39 pm
Quote
Yeah. I think I can handle that. How many 'nop' or 'halt' cycles would equal 1 second, in fast mode?

For nops, you need a whole heck of a lot. Each nop takes 4 t-states, and the median calculator runs at about 15,500,000 t-states per second, so... roughly 3.9 million nops (15,500,000 / 4 = 3,875,000).

If you want to use instructions to slow down your program, here's a routine you can use:
Code: [Select]

;hl = milliseconds of delay

milliDelay:
ld b, 174 ;7
innerLoop:
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
djnz innerLoop   ;13*173+8 ;2257

dec hl ;6
ld a, h ;4
or l ;4
jr nz, milliDelay ;12 when taken
ret ;each pass through milliDelay totals 15,514 t-states, about 1 ms at 15.5 MHz

Halts work entirely differently, though. A halt waits for an interrupt, and the interrupts run at a constant speed. That speed is 118 Hz on an 83+ and 107.79 Hz on everything else. But since you said you're running in fast mode, clearly you are not on an 83+, so you would need 108 halts to wait for one second.


(I said "median" calculator up above because calculators run anywhere from 14.5 MHz to 17.0 MHz in fast mode)
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 10, 2011, 10:03:21 pm
Well, this may be the delay I utilize. Let me clarify...by "fast mode" I mean that I have opened the program with this...

Code: [Select]

in a,($2E)              ;initialize faster processing
push af
ld a,0
out ($2E),a
call Start              ;jump to main program. the main program will return here when 'ret' is called
pop af
out ($2E),a
ret
Title: Re: Streamlined Asm routines
Post by: thepenguin77 on October 10, 2011, 10:22:20 pm
Oh, well, that's not really fast mode. Doing that will get you a 15% speed increase in the best-case scenario, and that scenario is that you are running from flash in 15 MHz mode.

Port ($2E) is actually just a delay port that TI added. Zeroing it takes away the 1 t-state delay that TI added to each read from flash. Furthermore, its effects are only seen if you are running in 15 MHz mode; it has no effect in 6 MHz mode. (Unless you do some other stuff. Check WikiTI for the full interaction between ports 29, 2A, 2B, and 2E.)

This code however:
Code: [Select]
ld a, 3
out ($20), a

Will make the calculator run about 2.5 times as fast (6 MHz to 15 MHz). This is all it takes to put the calculator into 15 MHz mode. In all honesty, you really don't even need your code. The only reason you might need it is if you are running a very time-sensitive app. But, since you know about it, you might as well zero port ($2E) anyways. Also, you are 100% allowed to leave port ($20) at 03 (default 00) because the OS is going to throw it in 03 when you return anyways. Leaving port ($2E) at 00 should be fine as well ;)
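
A minimal sketch of making that speed switch safe on every model. The assumption here (from WikiTI's port documentation, not from this thread) is that bit 7 of port ($02) is reset on a plain 83+ and set on the SE/84+ models. SetFastMode is a made-up label:
Code: [Select]
;Assumption: bit 7 of port ($02) is 0 on a plain TI-83+, which has no 15 MHz mode
SetFastMode:
in a, ($02)
rla ;move bit 7 into the carry flag
ret nc ;plain 83+: leave the speed alone
ld a, 3
out ($20), a ;15 MHz mode, same write as above
ret

Calling SetFastMode once at startup should be enough; on a plain 83+ it simply returns without touching port ($20).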
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 11, 2011, 07:14:50 pm
Quote
Code: [Select]
;hl = milliseconds of delay

milliDelay:
ld b, 174 ;7
innerLoop:
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
djnz innerLoop   ;13*173+8 ;2257

dec hl ;6
ld a, h ;4
or l ;4
jr nz, milliDelay ;12 when taken
ret ;each pass through milliDelay totals 15,514 t-states, about 1 ms at 15.5 MHz

What would I do if hl is in seconds, rather than milliseconds? And, actually, my code holds the number of seconds to delay in 'a'.
Title: Re: Streamlined Asm routines
Post by: AngelFish on October 11, 2011, 07:18:16 pm
Multiply hl by 1000 to turn it into milliseconds. If you don't need all of the precision, leftshifting hl by 10 is approximately the same.
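
A rough sketch of both conversions, assuming the seconds count arrives in 'a' and is between 1 and 63 (the label names are hypothetical). Note that shifting left by 10 multiplies by 1024, so that version waits about 2.4% longer than asked:
Code: [Select]
;Exact: hl = a * 1000 by repeated addition (a must be 1 to 65, or hl overflows)
SecondsToMs:
ld b, a
ld hl, 0
ld de, 1000
secLoop:
add hl, de
djnz secLoop
ret

;Approximate: hl = a * 1024 (a must be 1 to 63, or hl overflows)
SecondsToMsFast:
ld h, a
ld l, 0 ;hl = a * 256
add hl, hl ;hl = a * 512
add hl, hl ;hl = a * 1024
ret

Either one leaves the millisecond count in hl, ready for a call to milliDelay. Neither handles a = 0.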
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 11, 2011, 07:20:50 pm
Quote
Multiply hl by 1000 to turn it into milliseconds. If you don't need all of the precision, leftshifting hl by 10 is approximately the same.

Can I use a for this, rather than hl? And what routines can I use to draw a line between two pixel coords?
Title: Re: Streamlined Asm routines
Post by: ralphdspam on October 11, 2011, 08:15:28 pm
Quote
Can I use a for this, rather than hl?

Code: [Select]
;a = milliseconds of delay
milliDelay:
ld b, 174 ;7 <-Increase this for larger delay increments.
innerLoop:
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
djnz innerLoop   ;13*173+8 ;2257

dec a ;4
nop ;4
nop ;4 (the nops are left here to keep similar timing)
jr nz, milliDelay ;12 when taken
ret ;each pass through milliDelay totals 15,512 t-states, about 1 ms at 15.5 MHz

Or you can use A as the high byte of HL
Code: [Select]
;a = high byte of milliseconds of delay (total delay is a*256 ms)
ld l, 0 ;7
ld h, a ;4
milliDelay:
ld b, 174 ;7
innerLoop:
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
ex (sp), hl    ;19*174 ;3306
djnz innerLoop   ;13*173+8 ;2257

dec hl ;6
ld a, h ;4
or l ;4
jr nz, milliDelay ;12 when taken
ret ;each pass through milliDelay totals 15,514 t-states, about 1 ms at 15.5 MHz
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 11, 2011, 08:21:42 pm
Ok, now, all I have to do is figure out...right now 'a' is in milliseconds. My 'a' is in seconds.
Title: Re: Streamlined Asm routines
Post by: Xeda112358 on October 11, 2011, 08:24:12 pm
So if you want to delay for approximately 'a' seconds you can try this:
Code: [Select]
     ld b,107
     halt
     djnz $-1
     dec a
     jr nz,$-6
     ret
Does this help?
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 11, 2011, 08:25:38 pm
What does $-6 mean? $-1?
Title: Re: Streamlined Asm routines
Post by: calc84maniac on October 11, 2011, 08:41:45 pm
$ refers to the address of the current instruction. We usually use that if we're too lazy to use label names. This code is equivalent:
Code: [Select]
delayLoop:
     ld b,107
haltLoop:
     halt
     djnz haltLoop
     dec a
     jr nz,delayLoop
     ret
Title: Re: Streamlined Asm routines
Post by: ralphdspam on October 11, 2011, 08:59:02 pm
Quote
So if you want to delay for approximately 'a' seconds you can try this:
Code: [Select]
     ld b,107
     halt
     djnz $-1
     dec a
     jr nz,$-6
     ret
Does this help?

Remember to have interrupts enabled.  ;)
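
A small sketch of that, assuming nothing else in the program needs interrupts left off (delaySeconds is a made-up name):
Code: [Select]
;a = seconds to wait (a = 0 waits 256 seconds)
delaySeconds:
     ei                ;halt never wakes up if interrupts are disabled
delayLoop:
     ld b,107
haltLoop:
     halt
     djnz haltLoop
     dec a
     jr nz,delayLoop
     ret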
Title: Re: Streamlined Asm routines
Post by: ACagliano on October 11, 2011, 09:33:29 pm
Thank you mucho. I don't actually disable them to begin with...lol.
Title: Re: Streamlined Asm routines
Post by: NanoWar on December 08, 2011, 07:45:22 am
If you don't want to look up the instruction bytes and are too lazy to write out labels, you could also use relative labels (with Spasm):

_ ld b,107
_ halt
  djnz -_
  dec a
  jr nz, --_
  ret
Title: Re: Streamlined Asm routines
Post by: Xeda112358 on January 04, 2012, 02:21:31 pm
Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is an 8x16 multiplication (it requires 4 more cycles to make it 8x8).
Code: [Select]
A_Times_DE:
;Input:
;     A,DE
;Outputs:
;     A is 0
;     BC is not changed
;     DE is not changed
;     HL is the result
;     z flag is set
;     c flag is set if the input A is not 0
;Notes:
;           If A is 0, 29 cycles
;Speed: 145+6n+21b cycles
;           n=floor(log(a)/log(2))
;           b is the number of bits in the number
;           Testing over all values of A from 1 to 255:
;           313.7058824 average cycles
;           Worst: 355
;           Best  : 166 (non trivial)
;Size: 25 bytes
     ld hl,0           ;10
     or a \ ret z     ;9
     cpl \ scf        ;8
     adc a,a         ;4
     jp nc,$+7       ;10         ;45
Loop:
     add a,a          ;4
     jp c,$-1         ;10         ;14(7-n)

     add hl,de        ;11         ;11         (the rest are counted below)
     add a,a          ;4           ;4b
     ret z              ;5|11      ;5b+6
     add hl,hl         ;11         ;11b-11
     jp p,$-4         ;21|20     ;20n+b
     jp $-7
So that code is about twice as large as the following more standard routine, but it is also on average about 52 cycles faster and the worst case scenario is 35 cycles better than the worst case for the next routine. The advantage with the following code is that the speed is not all over the place:
Code: [Select]
DE_Times_A:
;Inputs:
;     DE and A are factors
;Outputs:
;     A is not changed
;     B is 0
;     C is not changed
;     DE is not changed
;     HL is the product
;Speed:
;     342+6x, x is the number of bits in A
;     Average: 366 cycles
;Size:
;     13 bytes
     ld b,8            ;7              7
     ld hl,0           ;10             10
       add hl,hl      ;11*8         88
       rlca            ;4*8           32
       jr nc,$+3     ;(12|18)*8  96+6x
         add hl,de   ;--             --
       djnz $-5      ;13*7+8      99
     ret               ;10             10

Another interesting note is that the first routine does not use a counter and so it preserves BC. For the best speed without an LUT, I still think unrolled routines are the best :)

Actually, another note-- the first routine doesn't really get a speed boost from unrolling :/

You can make a hybrid of the two codes that will cost the b register, but it will be 21 bytes and average a smidge over 320 cycles, too.

EDIT: Also, I posted here because the topic title had a nice name and I am not ready to submit it elsewhere without the scrutiny of the better asm coders, first :P
Title: Re: Streamlined Asm routines
Post by: FloppusMaximus on January 07, 2012, 03:53:28 pm
Quote
Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is an 8x16 multiplication (it requires 4 more cycles to make it 8x8).

Unless I've made a mistake somewhere, this routine doesn't work as written, because at the time you're testing the sign flag, the bit you're interested in has already been shifted out.  But it's an interesting idea, so here's a version that works (but could probably be optimized more):
Code: [Select]
ld hl,0 ; 10
or a ; 4
ret z ; 5

scf ; 4
skip_zeroes:
adc a,a ; 4(9-n)
jr nc,skip_zeroes ; 12(9-n) - 5

jp loop_add0 ; 10

loop_add:
ret z ; 5k + 6
add hl,hl ; 11k
loop_add0:
add hl,de ; 11(k+1)
loop_noadd:
add a,a ; 4n
jr c, loop_add ; 7n + 5k
add hl,hl ; 11(n-k-1)
jp loop_noadd ; 10(n-k-1)
If I've worked it out correctly, this has a minimum (non-trivial) running time of 192, a maximum of 437, and average of ~368.57.
Title: Re: Streamlined Asm routines
Post by: DrDnar on January 07, 2012, 05:15:59 pm
For reference, here are the timings for the routines from Z80 Bits (http://baze.au.com/misc/z80bits.html#1.1).

Rolled:
Code: [Select]
;                      Ref     Worst   Best
MultHbyE:
     ld l, 0         ; 7       7       7
     ld d, l         ; 4       4       4
     sla h           ; 8       8       8
     jr nc,$+3       ; 12/7    12      7
     ld l,e          ; 4               4
     ld b, 7         ; 7       7       7
_:   add hl,hl       ; 11      11      11
     jr nc,$+3       ; 12/7    7       12
     add hl,de       ; 11      11
     djnz -_         ; 13/8    13      13
                     ;         -5      -5
                     ;         327     284

MultAbyDE:
     ld hl, 0        ; 10      10      10
     ld c, l         ; 4       4       4
     add a,a         ; 4       4       4
     jr nc,$+4       ; 12/7    7       12
     ld h,d          ; 4       4
     ld l,e          ; 4       4
     ld b, 7         ; 7       7       7
_:   add hl,hl       ; 11      11      11
     rla             ; 4       4       4
     jr nc,$+4       ; 12/7    7       12
     add hl,de       ; 11      11
     adc a,c         ; 4       4
     djnz -_         ; 13/8    13      13
                     ;         -5      -5
                     ;         385     312

Unrolled:
Code: [Select]
;                      Ref     Worst   Best
MultHbyE:
; L and D must already be 0
     sla h           ; 8       8       8
     jr nc,$+3       ; 12/7    12      7
     ld l,e          ; 4               4

; the next three instructions are repeated 7 times
     add hl,hl       ; 11      11      11
     jr nc,$+3       ; 12/7    7       12
     add hl,de       ; 11      11
                     ; *7      *7      *7
                     ;         223     180

MultAbyDE:
; HL and C must already be 0
     add a,a         ; 4       4       4
     jr nc,$+4       ; 12/7    7       12
     ld h,d          ; 4       4
     ld l,e          ; 4       4

; the next five instructions are repeated 7 times
     add hl,hl       ; 11      11      11
     rla             ; 4       4       4
     jr nc,$+4       ; 12/7    7       12
     add hl,de       ; 11      11
     adc a,c         ; 4       4
                     ; *7      *7      *7
                     ;         278     205
Title: Re: Streamlined Asm routines
Post by: Xeda112358 on January 07, 2012, 07:18:36 pm
This was to Floppus, but I lost internet connection for a while:
Dang, I must have messed up big time somewhere. You are right, that routine really does not work at all. Now I need to figure out where I went wrong because I was sure I had a working version :(

I think you overcounted your cycles, too o.o What I did to make things easier was to remove the code that stripped the leading zeroes, and that alone has a worst case of 404 cycles and a best case of 327 cycles. Pretty much, it is 316+11b, where b is the number of set bits in A (b=0 is the trivial case and I am not counting that). I still can't work out how best to analyse your routine, though, but I think it is a lot faster than you thought and better than the normal routine.