# Omnimaga

## Calculator Community => TI Calculators => ASM => Topic started by: Galandros on February 28, 2010, 07:27:53 am

Title: ASM Optimized routines
Post by: Galandros on February 28, 2010, 07:27:53 am
There are some cools optimized routines around. Calcmaniac is the recordist in z80, probably. At least in calculators z80 forums is.

On to the code:
Code: [Select]
`;calcmaniac84cpHLDE: or a sbc hl,de add hl,de ret;Important note: because the code is 3 bytes and a call is 3 bytes, just macro in:;SPASM, TASM and BRASS compatible, I guess#define cp_HLDE  or a \ sbc hl,de \ add hl,de;- Reverse a;input: Byte in A;output: Reversed byte in A;destroys B;Clock cycles: 66;Bytes: 18;author: calcmaniac84reversea: ld b,a rrca rrca xor b and %10101010 xor b ld b,a rrca rrca rrca rrca xor b and %01100110 xor b rrca ret;reverse hl;curiosity: a easy port of a common reverse A register is more efficient than tricky stuff;calcmaniac84;28 bytes and 104 cyclesld a,lrlarr hrlarr hrlarr hrlarr hrlarr hrlarr hrlarr hrlarr hrlarrcald l,aret;calc84maniac;in: a = ABCDEFGH;out: hl= AABBCCDDEEFFGGHHrrcarrarrald l,arrasra lrlarr lsra lrrarr lsra lrrcarrarrald h,arrasra hrlarr hsra hrrarr hsra hret`
Code: [Select]
`;Galandros optimized routines;try to beat me... maybe is possible...;Displays A register content on screen in decimal ASCII number, using no addition memoryDispA: ld c,-100 call Na1 ld c,-10 call Na1 ld c,-1Na1: ld b,'0'-1Na2: inc b add a,c jr c,Na2 sub c ;works as add 100/10/1 push af ;safer than ld c,a ld a,b ;char is in b CALL PUTCHAR ;plot a char. Replace with bcall(_PutC) or similar. pop af ;safer than ld a,c ret;Note the following one is optimized for RPGs menus and the such, it is quite flexible. I am going to use in Lost Legends I ^^;I started with one which used addition RAM for temporary storage (made by me, too), and optimized for size, speed and no extra memory use! ^.^;the inc's and dec's were trick to debug -.-", the registers b and c are like counters and flags;DispHL for games;input: hl=num, d=row,e=col, c=number of algarisms to skip;number of numbers' characters to display: 5 ; example: 65000;output: hl displayed, with algarisms skiped and spaces for initial zerosDispHL_games: inc c ld b,1 ;skip 0 flag ld (CurRow),de;Number in hl to decimal ASCII;Thanks to z80 Bits;inputs: hl = number to ASCII;example: hl=300 outputs '  300';destroys: af, hl, de used ld de,-10000 call Num1 ld de,-1000 call Num1 ld de,-100 call Num1 ld e,-10 call Num1 ld e,-1Num1: ld a,'0'-1Num2: inc a add hl,de jr c,Num2 sbc hl,de dec c ;c is skipping jr nz,skipnum inc c djnz notcharnumzero cp '0' jr nz,notcharnumzeroleadingzero: inc bskipnum: ld a,' 'notcharnumzero: push bc call PUTCHAR  ;bcall(_PutC) works, not sure if it preserves bc pop bc retPUTCHAR: bcall(_PutC) ret;Example usage of DispHL_games to understand what I meanTest2: ld hl,60003 ld de,\$0101 ld c,0 call DispHL_games ld hl,60003 ld de,\$0102 ld c,1 call DispHL_games ret`
Well, don't try to understand or optimize calcmaniac84 ones. j/k, trying to understand can be harsh (tip: have a good instruction set summary) but teaches some inner details of the z80 asm.
Title: Re: ASM Optimized routines
Post by: Quigibo on February 28, 2010, 05:21:57 pm
Here is a little optimization I use but haven't really seen around.  When you need a direct key press, you have to wait about 7 clock cycles between setting the port and reading it.  Most people just fill in the extra space with a waste instruction like this:

Code: [Select]
`ld a,xxout (1),ald a,(de)in a,(1)and yy`9 Bytes, 43 T-States.

You can actually use the waste instruction to do something useful.  It gives a slight speed increase.

Code: [Select]
`ld a,xxout (1),ald b,yyin a,(1)and b`9 Bytes, 40 T-States.
Title: Re: ASM Optimized routines
Post by: calc84maniac on February 28, 2010, 08:12:27 pm
Small and quick setup for IM 2 (this example sets up vector table at \$9900 and interrupt jump at \$9a9a, but values can be changed)
Code: [Select]
`dild a,\$99ld bc,\$0100ld h,ald d,ald l,cld e,bld i,ainc ald (hl),aldirld l,ald (hl),\$c3inc lld (hl),intvec & \$ffinc lld (hl),intvec >> 8im 2ei`
Title: Re: ASM Optimized routines
Post by: Galandros on April 24, 2010, 12:12:44 pm
I found this optimized routine around. It is as far optimized as z80 string copy can get.
Code: [Select]
`;author: calcmaniac84, I think;Copy zero terminated string at HL to DE.StrCopy: xor adocopystr: cp (hl) ldi jr nz,docopystr ret`
These are quite optimized. But may be is possible to optimize further. (speed and size) But it is not needed...
They shift a graphics buffer (optimized to 96x64) up or down by pixels passed in A register.
Code: [Select]
`scroll_up:#ifdef DEBUG cp 64+1 call c,ErrorOverFlow#endif add a,a add a,a ld l,a ld e,a ld h,0 ld d,h add hl,hl add hl,de ; hl=a*12 push hl ld de,768 ex de,hl; carry is never set here if input is correct; or a sbc hl,de ld b,h ld c,l ; bc=768-12*a ex de,hl ld de,plotsscreen add hl,de ldir;blank remaining area ld h,d ld l,e inc de ld (hl),\$00 pop bc dec bc ; bc=12*a-1 ldir ret;PSEUDO CODE; ld hl,plotsscreen+12*a; ld de,plotsscreen; ld bc,768-12*a; ldir; ld h,d; ld l,e; ld (hl),\$00; inc de; ld bc,12*a; dec bc; ldir; retscroll_down:#ifdef DEBUG cp 64+1 call c,ErrorOverFlow#endif; a can be from 1 to 63; a can be multiplied by 4 add a,a add a,a ; a*4 ld l,a ; hl = a*4 ld e,a xor a ld h,a ld d,a add hl,hl ; hl = a*8 add hl,de ; hl = a*12 ld e,a ; de = 0 push hl ; a*12 will needed later push hl ; 2 times ex de,hl;carry is never set here; or a sbc hl,de ; hl= -a*12, de=a*12 ld de,plotsscreen+767 add hl,de ; hl=plotsscreen+767-12*a pop bc push hl ld hl,768+1;carry always set; or a sbc hl,bc ld b,h ld c,l pop hl lddr;blank remaining area ld h,d ld l,e ld (hl),\$00 dec de pop bc dec bc lddr ret; ld hl,plotsscreen+767-12*a; ld de,plotsscreen+767; ld bc,768-12*a; lddr; or; ld (hl),\$00 ;; ld hl,plotsscreen; ld h,d ;; ld (hl),\$00; ld l,e ;; ld de,hl+1; dec de ;; ld bc,12*a-1; ld bc,12*a-1 ;; ldir; lddr ;; ret; ret`
Title: Re: ASM Optimized routines
Post by: mapar007 on April 25, 2010, 03:58:56 am
Very nice! I'll add these to my utils.z80 file that is included in all my app builds.

Anyone wanting to compile a stdlib.c and revive the tisdcc project? j/k
Title: Re: ASM Optimized routines
Post by: Galandros on April 25, 2010, 05:04:47 am
Very nice! I'll add these to my utils.z80 file that is included in all my app builds.

Anyone wanting to compile a stdlib.c and revive the tisdcc project? j/k
Actually I am working on something like that. I am hand writing C functions in z80 assembly just for fun. :P I will share them when I finish.
After seeing Axe Parser, it seems that is possible doing a good C compiler for z80. And we have documentation on how to optimize z80 assembly to do a optimizer, check the WikiTI topic: http://wikiti.brandonw.net/index.php?title=Z80_Optimization.
Title: Re: ASM Optimized routines
Post by: DJ Omnimaga on April 25, 2010, 12:19:53 pm
Very nice! I'll add these to my utils.z80 file that is included in all my app builds.

Anyone wanting to compile a stdlib.c and revive the tisdcc project? j/k
I think I remember this, it was Halifax from the old Omnimaga forums who worked on it, right? There was a thread about it somewhere
Title: Re: ASM Optimized routines
Post by: Quigibo on April 29, 2010, 05:59:58 pm
Quigibo's Challenge!

Can any of the following be done in 6 or fewer bytes?  The input and output must be HL.

• Multiply by 128?
• Signed division by any nontrivial constant, other than 2, including negative numbers?
• Modulus with any constant that is not a power of 2?

I'm rewriting my math engine almost from scratch so I decided I would just optimize everything I could possibly conceive of at the same time.  These are the ones I'm having trouble finding.
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 29, 2010, 06:31:16 pm
Seems pretty impossible to me.
Title: Re: ASM Optimized routines
Post by: Quigibo on April 29, 2010, 06:58:39 pm
Okay, that's good.  I spent hours trying to optimize some of these using all the tricks I know.  That reassures me it was a wild goose chase.
Title: Re: ASM Optimized routines
Post by: DJ Omnimaga on April 29, 2010, 07:01:08 pm
Seems pretty impossible to me.
O.O

No way!

You're calc84god, you can do everything, even the impossible! (see TI-Boy SE/Project M/F-Zero)

j/k I can't wait to see what kind of optimizations there will be in the next versions of Axe
Title: Re: ASM Optimized routines
Post by: Quigibo on April 29, 2010, 07:34:45 pm
It's nothing big.  Mostly it just extend multiplication, modulus, and addition to higher powers of 2.  The big optimizations won't come for a long time unfortunately.  Functionality is more important right now.

By the way, is there a better way to display hl at the coordinates (xx,yy) than this?
Code: [Select]
`B_CALL(_SetXXXXOP2)B_CALL(_Op2ToOP1)ld hl,\$yyxxld (PenCol),hlld a,5B_CALL(_DispOP1A)`
Its seems really roundabout to me.  Is there a bcall I don't know about that does this automatically?
Title: Re: ASM Optimized routines
Post by: calcdude84se on April 29, 2010, 07:57:10 pm
yeah, there's _DispHL
so you're code would be:
Code: [Select]
`push hlld hl,\$yyxxld (PenCol),hlpop hlB_CALL(_DispHL)`Just be aware it's right-justified in 5 spaces. (Since \$ffff is 5 decimal digits, 65535)
EDIT: oh, wait, that's pencol? so this code doesn't work then. Oops... :-[
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 29, 2010, 10:27:56 pm
He's talking about graph screen display.
Title: Re: ASM Optimized routines
Post by: Galandros on April 30, 2010, 09:21:30 am
Quigibo's Challenge!

Can any of the following be done in 6 or fewer bytes?  The input and output must be HL.

• Multiply by 128?
• Signed division by any nontrivial constant, other than 2, including negative numbers?
• Modulus with any constant that is not a power of 2?
Challenge accepted.

Answer to the multiplication by 128 in 6 bytes:

I started coding a routine that multiply A by 128:
Spoiler For Spoiler:
; The old trick to multiply by 256, by moving the low byte to high byte
ld h,a
xor a   ; resets carry
rr h     ; divide h by 2
rra      ; and pass bit 0 to a
ld l,a   ; store to l
; hl is a*128

After that, I very easily modified to (hl*128)%((2^16)-1). Unsigned version:
Spoiler For Spoiler:
ld h,l
xor a
rr h
rra
ld l,a
; 6 bytes and 24 clocks to multiply hl by 128, not bad O_o

I am very sure this routines works but I have not tested.
EDIT4: tested with a few values, it works.

EDIT3:
Multiply hl by 128, now signed. If I am right, to do signed, you only need to preserve the bit 7? If that's so:
Spoiler For Spoiler:
ld h,l
xor a
sra h
rra
ld l,a
; 6 bytes, 24 clocks, too

Now I will think about the others when I have more free time. Fun, fun, fun.
Give me some time, please. :)
EDIT: I am thinking in putting some of this challenges in WikiTI when we end the challenge. And maybe Axe's routines. If you have other routines/challenges of optimization share to see what I can do.
EDIT2: fixed a bug/typo and commented even more the code
Title: Re: ASM Optimized routines
Post by: DJ Omnimaga on April 30, 2010, 09:25:11 am
how much space would it take compared to the old routine?
Title: Re: ASM Optimized routines
Post by: Galandros on April 30, 2010, 01:15:29 pm

• Signed division by any nontrivial constant, other than 2, including negative numbers?
• Modulus with any constant that is not a power of 2?
Constants need to be 16-bit?
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 30, 2010, 01:54:53 pm
Those multiplications by 128 won't work. Multiplying 256*128 would give 0.
Title: Re: ASM Optimized routines
Post by: Galandros on April 30, 2010, 02:23:29 pm
Those multiplications by 128 won't work. Multiplying 256*128 would give 0.
True. I tested small numbers. Works up to 255.
At least gives a good a*128 routine.

EDIT:
I got one solution that works on all numbers that fits in hl, 7 bytes but faster than conventional way:
Code: [Select]
` xor a ; resets carry and register rr h rr l   ; divide hl by 2 rra ld h,l ld l,a ; multiply hl by 256 (moving low byte to high byte trick);8 bytes, 32 clocks;conventional way is 7 bytes and 77 clocks;it does in less than half the time with only 1 byte cost, substantial speed increase with only 1 byte cost`
Axe Parser could improve with a option to make it optimize for speed instead of the size default.
Title: Re: ASM Optimized routines
Post by: Quigibo on April 30, 2010, 03:47:57 pm
Axe Parser could improve with a option to make it optimize for speed instead of the size default.
Yeah, I'm planning to do that eventually.

Actually, I found out my optimized signed division by 2 doesn't work, so that's up in the air too now.  Also, would multiplying by negative 2 be do-able in 6 bytes?  It can certainly be done in 7 so it seems possible maybe using some trick?

EDIT: also, is the daa instruction ever useful for optimizations? I can't think of how I would use it other than floating point math.
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 30, 2010, 03:51:56 pm
Axe Parser could improve with a option to make it optimize for speed instead of the size default.
Yeah, I'm planning to do that eventually.

Actually, I found out my optimized signed division by 2 doesn't work, so that's up in the air too now.  Also, would multiplying by negative 2 be do-able in 6 bytes?  It can certainly be done in 7 so it seems possible maybe using some trick?

EDIT: also, is the daa instruction ever useful for optimizations? I can't think of how I would use it other than floating point math.
Signed division by 2 is just sra h \ rr l. And I have only found daa useful (other than its intended use) when converting between hex digits and ASCII, I think.
Title: Re: ASM Optimized routines
Post by: Galandros on April 30, 2010, 04:36:12 pm
And I have only found daa useful (other than its intended use) when converting between hex digits and ASCII, I think.
Yes, it is true. Strange how it works...

; this code is not documented...
cp 10
ccf
daa

;but this I know that converts the low nibble to ASCII char
and  \$0F
daa
daa
Title: Re: ASM Optimized routines
Post by: Quigibo on April 30, 2010, 08:31:55 pm
Signed division by 2 is just sra h \ rr l.

But what about -1/2?  Shouldn't that give 0?  Because if you do that operation to %11111111 11111111 it remains -1.  Or does this routine always round up in magnitude instead of down?  It might throw some people off since it would be inconsistent with my regular unoptimized signed division routine.
Title: Re: ASM Optimized routines
Post by: DJ Omnimaga on April 30, 2010, 08:56:24 pm
True, most people in BASIC are used to rounding stuff down with Int(), like for example 1.1 or 1.9999 will be rounded down to 1. The decimal part is completly truncated. It makes things much easier IMHO when programming.  (assuming this is what you mean?)
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 30, 2010, 09:12:31 pm
Signed division by 2 is just sra h \ rr l.

But what about -1/2?  Shouldn't that give 0?  Because if you do that operation to %11111111 11111111 it remains -1.  Or does this routine always round up in magnitude instead of down?  It might throw some people off since it would be inconsistent with my regular unoptimized signed division routine.
It rounds down, just like the int() command that everyone is used to. int(-1/2)=-1. That's how it works.
Title: Re: ASM Optimized routines
Post by: Quigibo on April 30, 2010, 09:32:43 pm
Ooooh, okay, then that means my regular signed division routine is wrong then x.x  I'll have to fix it somehow.

I'm just going to put a copy of it here in its current state if anyone sees any optimizations I can make.  I felt pretty ashamed when I made it because it feels like such a dirty work around.  There has to be some real algorithm that exists to perform signed division non-naively right?

Code: [Select]
`;Divides HL by DE and stores to HLSDiv: ld a,h add a,a jr nc,\$+9 xor a sub l ld l,a sbc a,a sub h ld h,a scf rra xor d push af bit 7,d jr z,\$+8 xor a sub e ld e,a sbc a,a sub d ld d,a call Divide pop af add a,a ret nc xor a sub l ld l,a sbc a,a sub h ld h,a ret`
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 30, 2010, 09:55:24 pm
I was just searching around and it looks like your method is the generally-used one.
Title: Re: ASM Optimized routines
Post by: Quigibo on April 30, 2010, 10:09:03 pm
Yeah, that's what the first 5 pages of Google said when I was looking, but I still have a feeling that there might be one that is rare or undiscovered that can do it faster.  I'm going to try experimenting.  Maybe I can use some SMC to sometimes add the quotient to the accumulator and sometimes subtract it depending on the sign of the numbers.

EDIT: Oooo! I just tried messing around with this!  It works!  I just need to work on a few details first.
Title: Re: ASM Optimized routines
Post by: calc84maniac on April 30, 2010, 10:37:57 pm
I thought you were trying to avoid SMC, though.
Title: Re: ASM Optimized routines
Post by: Quigibo on April 30, 2010, 10:41:52 pm
I am.  But I found I can actually use a bit from one of the registers instead to hold the state of the operation instead of using SMC or even an app flag if I run out of registers.
Title: Re: ASM Optimized routines
Post by: DJ Omnimaga on April 30, 2010, 11:01:50 pm
I don't understand the code above but I wish you good luck in this ^^
Title: Re: ASM Optimized routines
Post by: Quigibo on May 01, 2010, 03:19:23 am
Okay, turns out the condition testing was actually taking up more space then it was saving.  My brain hurts anyway, so I give up.

I did manage to improve my existing code by a couple bytes and it makes more sense now, but rounding is apparently still backwards.  Feel free to use it if anyone needs it.
Code: [Select]
`SignedDivision: ld a,h xor d push af bit 7,h jr z,\$+8 xor a sub l ld l,a sbc a,a sub h ld h,a bit 7,d jr z,\$+8 xor a sub e ld e,a sbc a,a sub d ld d,a call RegularDivision pop af add a,a ret nc xor a sub l ld l,a sbc a,a sub h ld h,a ret`
Title: Re: ASM Optimized routines
Post by: Galandros on June 25, 2010, 07:56:39 am
I still need to code the optimized arbitrary pixels to the left or right.
I am on one of coding graphics effects and display routines.

Until that here a cool one that does reverse video, reverse colour a "filled box" in the screen buffer.
Code: [Select]
`;input: a=x, e=y, b=height, c=width in bytes;complement box area at x,ycplbox: ld h,0 ld d,h; you can optimize here for speed with sla l and avoid some adds ld l,e add hl,de add hl,de add hl,hl add hl,hl ld e,a srl e srl e srl e add hl,de ld de,plotsscreen add hl,de ;get first position in buffer and 7 ld d,\$FF ld e,0 ;de is a mask jr z,cplalignedmaskrotateloop: srl d rr e dec a jr nz,maskrotateloopcplaligned: cplboxheightloop: push bc push hl ld b,ccplboxwifthtloop: ld a,(hl) ;\ xor d ;| ld (hl),a ;| inc hl ;| complement buffer ld a,(hl) ;| xor e ;| ld (hl),a ;/ djnz cplboxwifthtloop ; loop for width pop hl ld c,12 add hl,bc ; advance for next line pop bc djnz cplboxheightloop ; loop height ret`
I will convert width to pixels instead of bytes. Seems like I will need to use a temporary byte in ram.
Title: Re: ASM Optimized routines
Post by: calc84maniac on December 17, 2010, 10:29:55 pm
Optimized routine for HL=A-HL (the negate HL optimization can be derived from this by setting A=0 first):
Code: [Select]
`  sub l  ld l,a  sbc a,a  sub h  ld h,a`
Also, topic stickied.
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 11, 2011, 02:49:33 pm
Can these be optimised:
Code: [Select]
`C_Div_D:;Inputs:;     C is the numerator;     D is the denominator;Outputs:;     A is the remainder;     B is 0;     C is the result of C/D;     D,E,H,L are not changed;     ld b,8     xor a       sla c       rla       cp d       jr c,\$+4         inc c         sub d       djnz \$-8     ret`
Code: [Select]
`DE_Times_A:;Inputs:;     DE and A are factors;Outputs:;     A is not changed;     B is 0;     C is not changed;     DE is not changed;     HL is the product;     ld b,8     ld hl,0       add hl,hl       rlca       jr nc,\$+3         add hl,de       djnz \$-5     ret`
Code: [Select]
Code: [Select]
`DE_Times_BC:;Inputs:;     DE and BC are factors;Outputs:;     A is 0;     BC is not changed;     DE is 0;     HL is the product;       ld hl,0       ld a,16Mul_Loop_1:         add hl,hl         ex de,hl         add hl,hl         ex de,hl         jr nc,\$+3           add hl,bc         dec a         jr nz,Mul_Loop_1       ret`
Code: [Select]
`DEHL_Div_C:;Inputs:;     DEHL is a 32 bit value where DE is the upper 16 bits;     C is the value to divide DEHL by;Outputs:;    A is the remainder;    B is 0;    C is not changed;    DEHL is the result of the division;     ld b,32     xor a       add hl,hl       ex de,hl       adc hl,hl       ex de,hl       rla       cp c       jr c,\$+4         inc l         sub c       djnz \$-10     ret`
Code: [Select]
`;===============================================================DEHL_Times_A:;===============================================================;Inputs:;     DEHL is a 32 bit factor;     A is an 8 bit factor;Outputs:;     interrupts disabled;     BC is not changed;     AHLDE is the 40-bit result;     D'E' is the lower 16 bits of the input;     H'L' is the lower 16 bits of the output;     B' is 0;     C' is not changed;     A' is not changed;===============================================================     di     push hl     or a     sbc hl,hl     exx     pop de     sbc hl,hl     ld b,8mul32Loop:       add hl,hl       exx       adc hl,hl       exx       add a,a       jr nc,\$+8         add hl,de         exx         adc hl,de         inc a         exx       djnz mul32Loop       push hl       exx       pop de       ret`
Code: [Select]
`GCDHL_BC:;Inputs:;     HL is a number;     BC is a number;Outputs:;     A is 0;     BC is the GCD;     DE is 0;Destroys:;     HL;Size:  25 bytes;Speed: 30 to 49708 cycles;       -As slow as about 126 times per second at 6MHz;       -As fast as about 209715 times per second at 6MHz;Speed break down:;     If HL=BC, 30 cycles;     24+1552x;     If BC>HL, add 20 cycles;     *x is from 1 to at most 32 (because we use 2 16-bit numbers);     or a \ sbc hl,bc     ;B7ED42    19     ret z                ;C8        5|11     add hl,bc            ;09        11     jr nc,\$+8            ;3006      11|31       ld a,h             ;7C        --       ld h,b             ;60        --       ld b,a             ;47        --       ld a,l             ;7D        --       ld l,c             ;69        --       ld c,a             ;4F        --Loop:     call HL_Div_BC       ;CD****    1511     ld a,d \ or e        ;7AB2      8     ret z                ;C8        5|11     ld h,b \ ld l,c      ;6069      8     ld b,d \ ld c,e      ;424B      8     jr \$-10              ;18F8      12`
EDIT: 25-March-2015 This has been really in need of updating and optimizing. This version is 226cc to 322cc faster than the original for 2 bytes more.
Code: [Select]
`;===============================================================DE_Div_BC_round:;===============================================================;Performs DE/BC, rounded;Speed:   1172+6b cycles, 1268cc worst case;Size:    25 bytes;Inputs:;     DE is the numerator;     BC is the denominator;Outputs:;     DE is the quotient;     BC is divided by 2 (truncated);     A reflects the low bits of the quotient;Destroys: HL;===============================================================    ld a,d    ld hl,0    ld d,16     rl e    rla    adc hl,hl    sbc hl,bc    jr c,\$+3    add hl,bc    dec d    jr nz,\$-11    cpl    ld d,a    ld a,e    cpl    ld e,a    ret`
Code: [Select]
`HL_Div_C:;Inputs:;     HL is the numerator;     C is the denominator;Outputs:;     A is the remainder;     B is 0;     C is not changed;     DE is not changed;     HL is the quotient;       ld b,16       xor a         add hl,hl         rla         cp c         jr c,\$+4           inc l           sub c         djnz \$-7       ret`
Code: [Select]
`HLDE_Div_C:;Inputs:;     HLDE is a 32 bit value where HL is the upper 16 bits;     C is the value to divide HLDE by;Outputs:;    A is the remainder;    B is 0;    C is not changed;    HLDE is the result of the division;     ld b,32     xor a       ex de,hl       add hl,hl       ex de,hl       adc hl,hl       rla       cp c       jr c,\$+4         inc e         sub c       djnz \$-10     ret`
EDIT 16 Aug 2019: A less destructive nCr routine that isn't prone to overflow in intermediate calculations can be found here (https://www.omnimaga.org/asm-language/asm-optimized-routines/msg407160/#msg407160).
Code: [Select]
`;===============================================================nCrHL_DE:;===============================================================;Inputs:;     hl is "n";     de is "r";Outputs:;     interrupts off;     a is 0;     bc is an intermediate result;     de is "n";     hl is the result;     a' is not changed;     bc' is "r"+1;     de' is the same as bc;     hl' is "r" or the compliment, whichever is smaller;===============================================================     or a                     ;reset carry flag     sbc hl,de     ret c                    ;r should not be bigger than n     sbc hl,de \ add hl,de     jr nc,\$+3       ex de,hl                             ;hl is R     push de     ld bc,1                 ;A     exx     pop de                  ;N     ld bc,1                 ;C     ld h,b \ ld l,c         ;DnCrLoop:     push de     push hl     call DE_Times_BC     push hl \ exx \ pop de     push hl     call DE_Div_BC     pop de     push hl \ ex de,hl \ exx \ pop hl     ld b,h \ ld c,l     pop de \ add hl,de     pop de \ inc de     exx     inc bc     or a \ sbc hl,bc \ add hl,bc     exx     jr nc,nCrLoop     ret`
Code: [Select]
`RoundHL_Div_C:;Inputs:;     HL is the numerator;     C is the denominator;Outputs:;     A is twice the remainder of the unrounded value ;     B is 0;     C is not changed;     DE is not changed;     HL is the rounded quotient;     c flag set means no rounding was performed;            reset means the value was rounded;       ld b,16       xor a         add hl,hl         rla         cp c         jr c,\$+4           inc l           sub c         djnz \$-7       add a,a       cp c       jr c,\$+3         inc hl       ret`
Code: [Select]
`;===============================================================sqrtE:;===============================================================;Input:;     E is the value to find the square root of;Outputs:;     A is E-D^2;     B is 0;     D is the rounded result;     E is not changed;     HL is not changed;Destroys:;     C;        xor a               ;1      4         4        ld d,a              ;1      4         4        ld c,a              ;1      4         4        ld b,4              ;2      7         7sqrtELoop:        rlc d               ;2      8        32        ld c,d              ;1      4        16        scf                 ;1      4        16        rl c                ;2      8        32        rlc e               ;2      8        32        rla                 ;1      4        16        rlc e               ;2      8        32        rla                 ;1      4        16        cp c                ;1      4        16        jr c,\$+4            ;4    12|15      48+3x          inc d             ;--    --        --          sub c             ;--    --        --        djnz sqrtELoop      ;2    13|8       47        cp d                ;1      4         4        jr c,\$+3            ;3    12|11     12|11          inc d             ;--    --        --        ret                 ;1     10        10;===============================================================;Size  : 29 bytes;Speed : 347+3x cycles plus 1 if rounded down;   x is the number of set bits in the result.;===============================================================`
Code: [Select]
`;===============================================================sqrtE:;===============================================================;Input:;     E is the value to find the square root of;Outputs:;     A is E-D^2;     B is 0;     D is the result;     E is not changed;     HL is not changed;Destroys:;     C=2D+1 if D is even, 2D-1 if D is odd        xor a               ;1      4         4        ld d,a              ;1      4         4        ld c,a              ;1      4         4        ld b,4              ;2      7         7sqrtELoop:        rlc d               ;2      8        32        ld c,d              ;1      4        16        scf                 ;1      4        16        rl c                ;2      8        32        rlc e               ;2      8        32        rla                 ;1      4        16        rlc e               ;2      8        32        rla                 ;1      4        16        cp c                ;1      4        16        jr c,\$+4            ;4    12|15      48+3x          inc d             ;--    --        --          sub c             ;--    --        --        djnz sqrtELoop      ;2    13|8       47        ret                 ;1     10        10;===============================================================;Size  : 25 bytes;Speed : 332+3x cycles;   x is the number of set bits in the result. This will not;   exceed 4, so the range for cycles is 332 to 344. To put this;   into perspective, under the slowest conditions (4 set bits;   in the result at 6MHz), this can execute over 18000 times;   in a second.;===============================================================`
It doesn't matter if they are optimised for speed or size, I just want to know what optimisation tricks I still need to establish. I just copied these out of my math routines folder, so some of them have random scratch work with them...
Title: Re: ASM Optimized routines
Post by: Quigibo on August 11, 2011, 04:02:27 pm
For your DE_Times_BC, this is one byte more in overhead, but much faster:

Code: [Select]
`       ld a,c       ld c,b       ld hl,0       ld b,16Mul_Loop_1:         add hl,hl         add a,a         rl c         jr nc,\$+3           add hl,de         djnz Mul_Loop_1       ret`
You could also call it the more unconventional way with CA_TIMES_DE which saves a byte and is still faster.

Code: [Select]
`       ld hl,0       ld b,16Mul_Loop_1:         add hl,hl         add a,a         rl c         jr nc,\$+3           add hl,de         djnz Mul_Loop_1       ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 11, 2011, 04:20:03 pm
For your DE_Times_BC, this is one byte more in overhead, but much faster:
Code: [Select]
`       ld a,c       ld c,b       ld hl,0       ld b,16Mul_Loop_1:         add hl,hl         add a,a         rl c         jr nc,\$+3           add hl,de         djnz Mul_Loop_1       ret`
I like this method for speeding things up! This is exactly the kind of thing I was hoping for. I want to understand how to program better, so the more ideas I can learn, the better off I should be on my quest :D
Title: Re: ASM Optimized routines
Post by: Xeda112358 on December 01, 2011, 11:07:15 am
Hmm, here is a signed division routine I wrote... I compared it to the HL_Div_BC routine I wrote.
If both inputs are positive, this is exactly the same speed, but if both are negative, it takes 50 cycles more and if only one is negative, it takes 71 cycles more.
Code: [Select]
`;===============================================================HL_SDiv_BC:;===============================================================;Performs HL/BC;Speed:   1494 cycles;         **same cycles as the regular HL_Div_BC;         add 25 if HL is negative;         add 25 if BC is negative;         add another 46 if only one is negative ;Size:    54 bytes;         **31 bytes larger than the regular HL_Div_BC;Inputs:;     HL is the numerator;     BC is the denominator;Outputs:;     HL is the quotient;     DE is the remainder;     BC is not changed;     A is 0;     z flag is set;     c flag is reset;===============================================================     ld a,h     xor b     and 128     push afabsHL:     bit 7,h     jr z,absBC     ld a,l \ cpl \ ld l,a     ld a,h \ cpl \ ld h,a     inc hlabsBC:     bit 7,b     jr z,\$+9     ld a,c \ cpl \ ld c,a     ld a,b \ cpl \ ld b,a     inc bc     add hl,hl       ld a,15       ld de,0Div_Loop_1:         add hl,hl         ex de,hl         adc hl,hl         or a         sbc hl,bc         jr c,\$+5           inc e           jr \$+3         add hl,bc         ex de,hl         dec a         jr nz,Div_Loop_1       pop af \ ret z     ld a,l \ cpl \ ld l,a     ld a,h \ cpl \ ld h,a     inc hl       ret`
Title: Re: ASM Optimized routines
Post by: calc84maniac on December 01, 2011, 11:39:56 am
I took a shot at optimizing it some more, I'm not sure if it works, but I think it should.

Code: [Select]
`;===============================================================HL_SDiv_BC:;===============================================================;Performs HL/BC;Speed:   1168 to 1318 cycles depending on how many set bits in the result;         add 19 if HL is negative;         add 19 if BC is positive;         add another 28 if only one is negative ;Size:    54 bytes;         **31 bytes larger than the regular HL_Div_BC;Inputs:;     HL is the numerator;     BC is the denominator;Outputs:;     HL is the quotient;     DE is the remainder;     BC = -abs(BC);===============================================================     ld a,h     xor b     push afabsHL:     add hl,hl     jr nc,negabsBC     xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,anegabsBC:     bit 7,b     jr nz,\$+8     xor a \ sub c \ ld c,a     sbc a,a \ sub b \ ld b,a       ex de,hl       xor a       ld h,a       ld l,a       ld a,15Div_Loop_1:         rl e \ rl d         adc hl,hl         add hl,bc         jr c,\$+4          sbc hl,bc         dec a         jr nz,Div_Loop_1       ex de,hl       adc hl,hl       pop af \ ret p     xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ret`
Edit: Just realized I needed to clear the carry before the loop. My fix renders Xeda's ld hl,0 comment moot though, sorry :P
Title: Re: ASM Optimized routines
Post by: Xeda112358 on December 01, 2011, 11:42:58 am
Wow, awesome! But can the ld hl,0 be changed to sbc hl,hl? That would save a byte (but I think it is 5 cycles slower... :/ )
Title: Re: ASM Optimized routines
Post by: Runer112 on December 12, 2011, 03:46:00 pm
Here's a very optimized way to convert a 16-bit signed number into an 8-bit signed number in a with overflow handling (if hl<-128, a=-128; if hl>127, a=127). Two added bonus to being super small and super fast are that it destroys nothing and that you could easily modify it to make the input a 16-bit register other than hl.

Code: [Select]
`Signed16To8: ld a,l add a,a sbc a,a sub h ld a,l ret z ld a,h add a,a sbc a,a xor %01111111 ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on December 12, 2011, 03:50:27 pm
Yes, this was truly awesome getting to witness live amazingness while I also tried to create the same routine O.O My attempt was extremely ugly compared to this :) Beautiful code, Runer112
Title: Re: ASM Optimized routines
Post by: Xeda112358 on March 12, 2012, 02:50:35 pm
Wow, I am surprised I haven't posted these in this topic as I have been very proud of them for a long while now. They make a legitimate use of RRD and RLD, so for those questioning the use, check it out:
Code: [Select]
`ShiftScreenRight4:;Shifts the graph screen right 4 pixels     ld hl,plotSScreen     ld c,64       xor a       ld b,12         rrd         inc hl         djnz \$-3       dec c       jr nz,\$-9     retShiftScreenLeft4:;Shifts the graph screen left 4 pixels     ld hl,plotSScreen+767     ld c,64       xor a       ld b,12         rld         dec hl         djnz \$-3       dec c       jr nz,\$-9     ret`It is the same size as shifting 1 pixel, though 7680 cycles slower. That is still about 1 bazillion times faster than shifting left or right 1 pixel, 4 times. I've been using these for years in my graphics related programs x.x I hope they prove useful!

EDIT: In this case, 1 bazillion == 88664, apparently. To run the shifting right code once, it is 22166 cycles. The above codes use 29846 cycles.
Title: Re: ASM Optimized routines
Post by: Xeda112358 on May 02, 2012, 08:28:56 pm
Hmm, not sure why I haven't posted this here, yet, either. This is pretty useful, especially for parsing a list of numbers from some form of user input. Feel free to optimise and report back :D

Code: [Select]
`;=============================================================== ConvRStr: ;=============================================================== ;Input: ;     DE points to the base 10 number string in RAM. ;Outputs: ;     HL is the 16-bit value of the number ;     DE points to the byte after the number ;     BC is HL/10;     c flag reset (nc);     z flag reset (nz);Destroys: ;     A (actually, add 30h and you get the ending token) ;Size:  23 bytes ;Speed: 104n+42+11c;       n is the number of digits ;       c is at most n-2 ;       at most 595 cycles for any 16-bit decimal value ;===============================================================      ld hl,0          ;  10 : 210000 ConvLoop:             ;      ld a,(de)        ;   7 : 1A      sub 30h          ;   7 : D630      cp 10            ;   7 : FE0A      ret nc           ;5|11 : D0      inc de           ;   6 : 13                       ;      ld b,h           ;   4 : 44      ld c,l           ;   4 : 4D      add hl,hl        ;  11 : 29      add hl,hl        ;  11 : 29      add hl,bc        ;  11 : 09      add hl,hl        ;  11 : 29                       ;      add a,l          ;   4 : 85      ld l,a           ;   4 : 6F      jr nc,ConvLoop   ;12|23: 30EE      inc h            ; --- : 24      jr ConvLoop      ; --- : 18EB `The ones with t-states as '---' are computed along with the previous instruction to make calculations easier. Anyways, to give an idea, at the slowest, this can execute 9803 times per second (assuming you are using call which takes another 17 t-states). This stops reading when a character that is not a decimal number is run into (for example, a comma or newline).

EDIT: By removing that one byte, timing is much more easily computed and slowest time drops from 625 to 595 t-states :) This means it can execute an extra 459 times per second. It also makes the c flag have a definite output and as well the z flag
Title: Re: ASM Optimized routines
Post by: calc84maniac on May 02, 2012, 08:52:30 pm
The ret c on line 23 is redundant, but otherwise a great routine :)
Title: Re: ASM Optimized routines
Post by: Xeda112358 on May 02, 2012, 09:07:45 pm
Oh, wow, awesome! I cannot believe I didn't see that, that is a source of some of my other optimisations from the original code o.O
Title: Re: ASM Optimized routines
Post by: calc84maniac on May 02, 2012, 10:25:44 pm
Here's a very optimized way to convert a 16-bit signed number into an 8-bit signed number in a with overflow handling (if hl<-128, a=-128; if hl>127, a=127). Two added bonus to being super small and super fast are that it destroys nothing and that you could easily modify it to make the input a 16-bit register other than hl.

Code: [Select]
`Signed16To8: ld a,l add a,a sbc a,a sub h ld a,l ret z ld a,h add a,a sbc a,a xor %01111111 ret`

Implied challenge accepted!
Code: [Select]
`Signed16To8: ld a,l add a,a ld a,h adc a,l cp l ret z ld a,\$7F ret p inc a ret`
Edit: Whoops, misteak

Edit2: This routine is a failure, disregard its failtasticness
Title: Re: ASM Optimized routines
Post by: Xeda112358 on July 03, 2012, 12:13:46 pm
Necroedit: For a much better routine, please try the routines at the end of this (https://www.omnimaga.org/asm-language/asm-optimized-routines/msg367877/#msg367877) post!

I created this last night for my next project:
Code: [Select]
`PseudoRandWord:;Outputs:;     BC was the previous pseudorandom number;     HL is the pseudorandom number;f(n+1)=(241f(n)+257) mod 65536   ;65536;181 cycles, add 17 if called     ld hl,(randSeed)     ld c,l     ld b,h     add hl,hl     add hl,bc     add hl,hl     add hl,bc     add hl,hl     add hl,bc     add hl,hl     add hl,hl     add hl,hl     add hl,hl     add hl,bc     inc h     inc hl     ld (randSeed),hl     ret`There are a few other nice features, too. For example, every 16-bit value is hit if you run this 65536 times. Or, if you only read 1 byte (for example, H from the output), it will hit every 8-bit number once if you run this 256 times. Plus, it can be seeded, which has its own uses. This can be modified to be smaller, too, if you know what you are doing, but I just like the numbers 241 and 257. Anyways, it produces some nice results :)

P.S.-I used this in a routine called "ShuffleDeck" and it works very well.
Title: Re: ASM Optimized routines
Post by: chickendude on July 04, 2012, 11:19:47 am
I don't understand the theory behind that algorithm, but you could save a couple clocks with SMC ;)

And is ShuffleDeck a hint at what your next project might be?
Title: Re: ASM Optimized routines
Post by: Xeda112358 on July 09, 2012, 08:08:46 am
Yes, you can use SMC to save at least 6 cycles for RAM programs :) My next mini project is an app with some small games (including card games). I don't have my computer with me, but I will post a working sound routine next time I get a chance.

The way it works is that we are using mod 216, so I selected two numbers relatively prime to 65536 (so any odd number, in this case). There are a few other conditions dealing with the Euler phi function, I believe, but I got lucky with the numbers I chose, so I didn't need to look it up. If you check, I chose prime numbers, specifically, because I figured those would give me the best shot.

If you choose the wrong values, you will get cycles of 2n. I am not sure how familiar you are with group theory, but essentially, you will be creating sub groups and the order (size) of a subgroup will always divide the order of the main group. So some values will make cycles of 32768, 16384, and other smaller powers of 2. (gah, there is so much cool theory behind this, but I don't have much time).

EDIT: ooh, here is a useful routine :)
Code: [Select]
`FindNumPages:;Inputs:;     The app base page is loaded in MemBank1;Outputs:;     c flag set if the field was found;     nc means the app header subfield was not found;     A is the number of app pages;     B is 0;     (HL) is the number of app pages     ld hl,4000h     ld bc,128     ld a,c     or aFNPLoop:     cpir     ret po     ret nz     inc a     cp (hl)     jr z,\$+5     dec a     jr FNPLoop     inc l     ld a,(hl)     scf     ret`I made that to be a faster alternative to using a bcall :)
Title: Re: ASM Optimized routines
Post by: Runer112 on July 09, 2012, 02:45:26 pm
Optimized a bit. :) The largest optimization was removing end checking, because it's impossible for an application not to have a number of pages field. I also optimized the search loop by rearranging it a bit to remove the unconditional jump.

Code: [Select]
`FindNumPages:;Inputs:;     The app base page is loaded in MemBank1;Outputs:;     A, (HL) is the number of app pages     ld hl,4000h     ld a,81h     ld c,aFNPLoop:     dec a     cpir     inc a     cp (hl)     jr nz,FNPLoop     inc l     ld a,(hl)     ret`
Title: Re: ASM Optimized routines
Post by: thepenguin77 on July 09, 2012, 02:46:04 pm
There are actually rare cases where that routine could fail. Of course I would assume it will work 99.9% of apps, if someone changed the order of the header and put the time stamp key in front of the number of pages, it could theoretically contain \$80, \$81.

But, now that I think about it, this is so rare that it will never happen.
Title: Re: ASM Optimized routines
Post by: calc84maniac on July 09, 2012, 04:54:29 pm
Isn't that routine searching for \$80, \$81 anyway?

Edit: Oh, I see what you're saying. The time stamp data could contain \$80, \$81.
Title: Re: ASM Optimized routines
Post by: NanoWar on July 12, 2012, 06:02:41 pm
Has anybody got a good rectangle function? It should use variable width by pixel, not byte... Here's my ugly code:
Code: [Select]
`rectangle ;inputs: l=Y, a=X, b=height, c=length ;save coords & stuff h, a push hl push bc call rectangle.calc ;below pop bc pop hlrectangle.display; inputs: h=X, l=Y, b=height ld a, h ld e, l ld h, \$00 ld d, h add hl, de add hl, de add hl, hl add hl, hl ;l*12 ld e, a srl e srl e srl e ;x/8 add hl, de ld de, gbuf add hl, derectangle.display.loop: push bc push hl ld a, (hl) ld c, a ld a, (rectangle.scanline1) ; somewhere in ram... xor c ld (hl), a inc hl ld a, (rectangle.scanline2) or a jr z, rectangle.display.noloop2 ld b, arectangle.display.loop2: ld a, (hl) xor \$FF ld (hl), a inc hl djnz rectangle.display.loop2rectangle.display.noloop2: ld a, (hl) ld c, a ld a, (rectangle.scanline3) xor c ld (hl), a pop hl pop bc ld de, 12 add hl, de djnz rectangle.display.loop retrectangle.calc;inputs:; a = x; b = height; c = length ld d, a ld a, \$FF ld (rectangle.scanline1), a xor a ld (rectangle.scanline2), a ld (rectangle.scanline3), a ld a, d and 7 ld d, a or a jr z, rectangle.skipShift1 ld e, \$FFrectangle.shift1 srl e dec a or a jr nz, rectangle.shift1 ld a, e ld (rectangle.scanline1), arectangle.skipShift1 ld a, d ; a = shift right ld h, a ; save add a, c ; a + c ld b, a ; save b = a + c and 7 ; /8 Rest? ld d, a ; Rest ld a, 8 sub d ; 8 - Rest ld d, a ; = d ld e, \$FFrectangle.shift2 sla e dec a or a jr nz, rectangle.shift2 ld a, e ld (rectangle.scanline3), a ld a, 16 ld e, h ; a sub e ; 16 - a sub d ; -d srl a srl a srl a ld d, a ; ld a, c srl a srl a srl a ; /8 sub d ld d, a ld a, (rectangle.scanline2) add a, d ld (rectangle.scanline2), a ; ld a, b and %11111000 or a ; if (shift_right + length)<8, do (rectangle.scanline1 & rectangle.scanline3) ret nz ld a, (rectangle.scanline1) ld d, a ld a, (rectangle.scanline3) and d ld (rectangle.scanline1), a xor a ld (rectangle.scanline2), a ld (rectangle.scanline3), a ret`How bad is it? :D
Title: Re: ASM Optimized routines
Post by: chickendude on July 15, 2012, 03:38:27 am
You can check out the MirageOS source (http://brandonw.net/calcstuff/MirageOS.txt), but i think it just draws four lines using their line routine.
Title: Re: ASM Optimized routines
Post by: NanoWar on July 18, 2012, 03:59:28 am
Oh it should be a filled rect routine :) .
Title: Re: ASM Optimized routines
Post by: Hayleia on July 18, 2012, 05:00:08 am
There is a filled rectangle routine in Axe. I don't know where you can see its source code though.
Title: Re: ASM Optimized routines
Post by: deeph on July 18, 2012, 05:49:18 am
Here's one I use for one of my project :

Code: [Select]
`;=======================================;; Rectangle Filling Routine Version 1.0 ;; By Jason Kovacs & The TCPA - 10/11/99 ;;=======================================;; Input:  D = Top Left X Coordinate, E = Top Left Y Coordinate; H = Bottom Right X Coord,  L = Bottom Right Y Coord; C = Color of Lines (0-White, 1-Black, 2-XORed);; Output: A Rectangle is drawn to the Graph Buffer with its border;    and everything within it Filled in according to the value in;    reg C which specifies the Color.;; Registers Affected: AF Destroyed; B=0 ; C, DE, HL Preserved.;    The Index Registers and the Shadow Registers Aren't Used.Rectangle_Filled: ld a,l sub e inc a ld b,a ld a,h sub d inc a    push deRect_Fill_Loop: push af call V_Line pop af inc d dec a jr nz, Rect_Fill_Loop pop de ret;=======================================;; Horizontal and Verticle Line Routines ;; By Jason Kovacs & The TCPA - 10/11/99 ;;=======================================;; For H_Line and V_Line:;; Input:  B = Length of Line (Number of Pixels); C = Color of Line (0-White, 1-Black, 2-XORed); D = X Coordinate Start of the Line; E = Y Coordinate Start of the Line;; Output: Lines are Drawn to the Graph Buffer, and the Starting;    Byte and Pixel Mask are Automatically determined according;    to the Input of the Coordinates in DE.;; Registers Affected:  All Registers are Preserved Except AF.V_Line: push de push hl push bc ld a,d call Getpix pop bc push bc ld d,c ld c,a ld a,d ld de,12 or a call z,V_White_Line dec a call z,V_Black_Line dec a call z,V_XORed_Line pop bc pop hl pop de retV_White_Line: ld a,c cpl ld c,a V_White_Line_2: ld a,(hl) and c ld (hl),a add hl,de djnz V_White_Line_2 xor a retV_Black_Line: ld a,(hl) or c ld (hl),a add hl,de djnz V_Black_Line xor a retV_XORed_Line: ld a,(hl) xor c ld (hl),a add hl,de djnz V_XORed_Line retGetpix: ld d,0 ld h,d ld l,e add hl,de add hl,de add hl,hl add hl,hl ld de,plotsscreen add hl,deGetbit: ld b,0 ld c,a and %00000111 srl c srl c srl c add hl,bc ld b,a inc b ld a,%00000001GBLoop: rrca djnz GBLoop ret`
Title: Re: ASM Optimized routines
Post by: NanoWar on July 19, 2012, 07:42:36 pm
Ah cool thanks. This was a bit off topic I guess :)
Title: Re: ASM Optimized routines
Post by: Xeda112358 on July 20, 2012, 10:13:55 am
Here is a routine that I started a long time ago (back when I coded only in hex). I have been too lazy to replace the hex code with mnemonics, but it can draw all sorts of rectangle types :) I am sure it can be optimised, but it works:

(note, RectData needs to be 24 bytes of free ram. I typically use data in the op registers or cmdshadow, or something like that.)
Code: [Select]
`DrawRectToGraph:     ld hl,9340h;===============================================================DrawRectToBuffer:DrawRect:;===============================================================;Inputs:;     A is the type of rectangle to draw;        0 =White;        1 =Black;        2 =XOR;        3 =Black border;        4 =White border;        5 =XOR border;        6 =Black border, white inside;        7 =Black border, XOR inside;        8 =White border, black inside;        9 =White border, XOR inside;        10=Shift Up;        11=Shift Down;;;        14=pxlTestRect  (returns the number of on pixels in the rectangular region);        15=pxlTestBorder (returns the number of on pixels on the border, good for collision detection);     B is the height;     C is the Y pixel coordinate;     D is the width in pixels;     E is is the X pixel coordinate;===============================================================     di     push hl     pop ix     ex af,af';Check if coords are negative     ld a,c     or a     jp p,\$+9       add a,b       ret nc       ret z       ld b,a       ld c,0     ld a,e     or a     jp p,\$+9       add a,d       ret nc       ret z       ld d,a       ld e,0;Check dimensions     ld a,b     or a     ret z     jp p,\$+6       neg       ld b,a     add a,c     sub 64     jr c,\$+6       neg       add a,b       ld b,a     ld a,d     or a     ret z     jp p,\$+6       neg       ld d,a     add a,e     sub 96     jr c,\$+6       neg       add a,d       ld d,a     ld a,c     cp 64     ret nc     ld a,e     cp 96     ret ncMakePattern:     push bc     ld hl,RectData     ld b,24     xor a     ld (hl),a     inc l     djnz \$-2     ld hl,RectData     ld c,RectData+12     ld a,e     sub 8     jr c,\$+6       inc l       inc c       jr \$-6     add a,8     ld e,a     ld b,a     inc b     ld a,d     add a,e     ld e,a     ld a,1     rrca     djnz \$-1     ld b,l     push af     ld l,c     or (hl)     ld (hl),a     ld l,b     pop af     dec a     scf     adc a,a     ld (hl),a     ld a,e     sub 8     jr c,\$+10     jr z,\$+10       inc l       ld (hl),-1       inc c       jr \$-10     add a,8     ld b,a     or a     ld a,1     jr z,\$+5     rrca     djnz \$-1     ld b,l     push af     ld l,c     or (hl)     ld (hl),a     ld l,b     pop af     dec a     cpl     and (hl)     ld (hl),a     pop bc     ld a,b     ld b,0     ld h,b     ld l,c     add hl,hl     add hl,bc     add hl,hl     add hl,hl     push ix     pop bc     add hl,bc     ld b,a     ex af,af'     .db \$CB,\$67,\$28,\$10,\$D6,\$10,\$F5,\$0E,\$18,\$11     .dw RectData     .db \$1A,\$2F,\$12,\$13,\$0D,\$20,\$F9,\$F1     or a     jr nz,\$+13h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$2F,\$A6,\$77,\$13,\$23,\$0D,\$20,\$F7,\$10,\$F0,\$C9     dec a     jr nz,\$+12h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$B6,\$77,\$13,\$23,\$0D,\$20,\$F8,\$10,\$F1,\$C9     dec a     jr nz,\$+12h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$AE,\$77,\$13,\$23     .db \$0D,\$20,\$F8,\$10,\$F1,\$C9     dec a     jr nz,\$+26h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$B6,\$77,\$13,\$23,\$0D,\$20,\$F8,\$05,\$C8     .db \$05,\$28,\$0F,\$0E,\$0C,\$11     .dw RectData+12     .db \$1A,\$B6,\$77,\$13,\$23,\$0D,\$20,\$F8,\$10,\$F1,\$04,\$18,\$DC     dec a     jr nz,\$+28h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$A6,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F7,\$05,\$C8,\$05,\$28,\$10,\$0E,\$0C,\$11     .dw RectData+12     .db \$1A,\$A6,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F7,\$10,\$F0,\$04,\$18,\$DA     dec a     jr nz,\$+26h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$AE     .db \$77,\$13,\$23,\$0D,\$20,\$F8,\$05,\$C8,\$05,\$28,\$0F,\$0E,\$0C,\$11     .dw RectData+12     .db \$1A,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F8     .db \$10,\$F1,\$04,\$18,\$DC     dec a     jr nz,\$+36h     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$B6,\$77,\$13,\$23,\$0D,\$20,\$F8,\$05,\$C8,\$05     .db \$28,\$1F,\$E5,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$A6,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F7,\$E1,\$0E,\$0C,\$11     .dw RectData+12     .db \$1A     .db \$B6,\$77,\$13,\$23,\$0D,\$20,\$F8,\$10,\$E1,\$04,\$18,\$CC     .db \$3D,\$20,\$33,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$B6,\$77,\$13     .db \$23,\$0D,\$20,\$F8,\$05,\$C8,\$05,\$28,\$1E,\$E5,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F8,\$E1     .db \$0E,\$0C,\$11     .dw RectData+12     .db \$1A,\$B6,\$77,\$13,\$23,\$0D,\$20,\$F8,\$10,\$E2,\$04,\$18,\$CD,\$3D,\$20,\$34,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$A6,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F7,\$05,\$C8,\$05,\$28,\$1E,\$E5,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$B6     .db \$77,\$13,\$23,\$0D,\$20,\$F8,\$E1,\$0E,\$0C,\$11     .dw RectData+12     .db \$1A,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F8,\$10,\$E2,\$04,\$18     .db \$CC,\$3D,\$20,\$35,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$A6,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F7,\$05,\$C8,\$05,\$28,\$1F,\$E5     .db \$0E,\$0C,\$11     .dw RectData     .db \$1A,\$AE,\$77,\$13,\$23,\$0D,\$20,\$F8,\$E1,\$0E,\$0C,\$11     .dw RectData+12     .db \$1A,\$A6,\$AE,\$77,\$13     .db \$23,\$0D,\$20,\$F7,\$10,\$E1,\$04,\$18,\$CB,\$3D,\$20,\$37,\$05,\$C8,\$F3,\$E5,\$D9,\$01,\$0C,\$00,\$E1,\$09,\$D9,\$0E     .db \$0C,\$11     .dw RectData     .db \$D5,\$D9,\$D1,\$D9,\$1A,\$2F,\$A6,\$D9,\$47,\$1A,\$A6,\$B0,\$13,\$23,\$D9,\$77,\$13,\$23,\$0D,\$20     .db \$EF,\$10,\$E4,\$0E,\$0C,\$11     .dw RectData     .db \$1A,\$2F,\$A6,\$77,\$13,\$23,\$0D,\$20,\$F7,\$FB,\$C9     .db \$3D,\$20,\$40,\$F3,\$C5     .db \$11,\$0C,\$00,\$19,\$10,\$FD,\$2B,\$E5,\$D9,\$11,\$F4,\$FF,\$E1,\$19,\$D9,\$C1,\$05,\$C8,\$0E,\$0C,\$11     .dw RectData+11     .db \$D5     .db \$D9,\$D1,\$D9,\$1A,\$2F,\$A6,\$D9,\$47,\$1A,\$A6,\$B0,\$1B,\$2B,\$D9,\$77,\$1B,\$2B,\$0D,\$20,\$EF,\$10,\$E4,\$0E,\$0C     .db \$11     .dw RectData+11     .db \$1A,\$2F,\$A6,\$77,\$1B,\$2B,\$0D,\$20,\$F7,\$FB,\$C9     dec a     ret z     dec a     ret z     exx     ld de,0     ld c,8     exxPxlTestRect:     dec a     jr nz,PxlTestBorder     ld c,12     ld de,RectDataPxlTstRectLoop:       call PxlTestWithMask       djnz PxlTstRectLoop-5       exx;DE contains the number of pixels       retPxlTestBorder:     dec a     ret nz     ld c,12     ld de,RectData     call PxlTestWithMask     dec b     jr z,PxlTestBorder-4     dec b     jr z,PxlTestBrdrEnd     ld c,12     ld de,RectData+12PxlTstBrdrLoop:       call PxlTestWithMask       djnz PxlTstBrdrLoop-5PxlTestBrdrEnd:     ld de,RectData     ld c,12     call PxlTestWithMask     exx;DE contains the number of on pixels       retPxlTestWithMask:     ld a,(de)     and (hl)     exx     ld b,c     add a,a     jr nc,\$+3     inc de     jr z,\$+4     djnz \$-6     exx     inc de     inc hl     dec c     jr nz,PxlTestWithMask     ret`
Title: Re: ASM Optimized routines
Post by: chickendude on October 01, 2012, 09:21:55 am
Not an optimized "routine", or really even useful, but i was trying to figure out how to add new instructions (and not macros) to spasm, and finally got this:
Code: [Select]
`.addinstr add ahl,de 00CE19 3 NOP 1 ;add hl,de \ adc a,0.addinstr add ahl,cde 8919 2 NOP 1 ;add hl,de \ adc a,c`Now whenever i want to do 24-bit addition (well, kinda) i can just use the new add ahl,de and add ahl,cde instructions ;)

EDIT: oops, my bbcode leaves a lot to be desired...
Title: Re: ASM Optimized routines
Post by: ben_g on October 01, 2012, 03:02:45 pm
so does it work like this?
Code: [Select]
`.addinstr <instruction> <asm code for the instruction, written in hex> <amounth of bytes of asm code> NOP 1`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on October 01, 2012, 06:35:31 pm
In most cases, yes, that is what you will want to do. If you open up the file TASM80, you will see the format of other such instructions.
Title: Re: ASM Optimized routines
Post by: chickendude on October 01, 2012, 06:37:09 pm
More or less, though you have to watch out for the ordering of the hex (it's backwards):
00CE19

And maybe there's a limit of four bytes/command, 'cuz nothing over four bytes seems to work. The command will take up that many bytes, but everything past four bytes seems to turn into random bytes.

You can use other values, too, like an asterisk can be used to pull in data, for example from tasmtabs.htm (NOTOUCH=NOP):
Code: [Select]
`                                          EXAMPLE         EXAMPLEINSTRUCTION DEFINITION                    SOURCE          OBJECT-------------------------------------------------------------------XYZ  *      FF   3  NOTOUCH 1             xyz 1234h       FF 34 12XYZ  *      FF   2  NOTOUCH 1             xyz 1234h       FF 34ZYX  *      FE   3  SWAP    1             zyx 1234h       FE 12 34ZYX  *      FE   3  R2      1             zyx \$+4         FE 01 00ABC  *,*    FD   3  COMBINE 1             abc 45h,67h     FD 45 67ABC  *,*    FD   3  CSWAP   1             abc 45h,67h     FD 67 45ADD  A,#*   FC   2  NOTOUCH 1             add A,#'B'      FC 42RET  ""     FB   1  NOTOUCH 1             ret             FBLD   IX,*   21DD 4  NOTOUCH 1             ld  IX,1234h    DD 21 34 12LD   IX,*   21DD 4  NOTOUCH 1 1 0         ld  IX,1234h    DD 21 68 24LD   IX,*   21DD 4  NOTOUCH 1 0 1         ld  IX,1234h    DD 21 35 12LD   IX,*   21DD 4  NOTOUCH 1 1 1         ld  IX,1234h    DD 21 69 24LD   IX,*   21DD 4  NOTOUCH 1 8 12        ld  IX,34h      DD 21 12 34`
Title: Re: ASM Optimized routines
Post by: chickendude on February 01, 2013, 05:08:48 am
Here's a quick filled rectangle routine i wrote the other night. It probably doesn't belong here since i don't think it's that optimized, but it should be smaller and faster than Jason Kovacs' routine posted on the previous page. As always, do whatever you want with the code and please don't give me credit! It takes the starting x/y coordinates in de, and the width and height in pixels in bc. rectangle_filled_solid will make a solid black rectangle, rectangle_filled_xor will xor the pixels within the rectangle's area.
EDIT: Added Xeda's optimisations. (see below for a slightly faster version)
Code: [Select]
`#ifdef TI83PGBUF_LSB = \$40GBUF_MSB = \$93#elseGBUF_LSB = \$29GBUF_MSB = \$8E#endif;b = # columns;c = # rows;d = starting x;e = starting yrectangle_filled_xor: ld a,\$AE ;xor (hl) jr rectangle_filled2rectangle_filled_solid: ld a,\$B6 ;or (hl)rectangle_filled2: push de push bc ld (or_xor),a ;use smc for xor/solid fill ld a,d ;starting x and \$7 ;what bit do we start on? ex af,af' ld a,d ;starting x ld l,e ;ld hl,e ld h,0 ; .. ld d,h ;set d = 0 add hl,de ;starting y * 12 add hl,de ;x3 add hl,hl ;x6 add hl,hl ;x12 rra ;a = x coord / 8 rra ; rra ; and %00011111 ;starting x/8 (starting byte in gbuf) add a,GBUF_LSB ld e,a ; ld d,GBUF_MSB ; add hl,de ;hl = offset in gbuf ex af,af' ;carry should be reset and z affected from and \$7 ld e,a ld a,%10000000 jr z,\$+6 rra dec e jr nz,\$-2 ld d,a ;starting bit to drawrectangle_loop_y: push bc push hlrectangle_loop_x: ld e,a ;save a (overwritten with or (hl))or_xor = \$ or (hl) ;smc will modify this to or/xor ld (hl),a ld a,e ;recall a rrca ;rotate a to draw the next bit jr nc,\$+3 inc hl djnz rectangle_loop_x pop hl ;hl = first column in gbuf row ld c,12 ;b = 0, bc = 12 add hl,bc ;move down to next row pop bc ;restore b (# columns) ld a,d ;restore a (starting bit to draw) dec c jr nz,rectangle_loop_yrectangle_end: pop bc pop de ret`
Here's a little example of how you could use the routine to make a text box:
Code: [Select]
`start: ld de,\$041B ld bc,\$1103 call draw_box2 call ionFastCopy bcall _getkey ret  draw_box2: call rectangle_filled_solid inc d inc e dec b dec b dec c dec c call rectangle_filled_xor ret`
EDIT: Small optimization
Title: Re: ASM Optimized routines
Post by: Xeda112358 on February 01, 2013, 01:39:49 pm
Awesome, I definitely like rectangel codes! I am glad you preserve the coordinates, too. Here are a few optimisations that I see without examining the code too much.
Code: [Select]
`  ld hl,or_xor   ld (hl),a   ;use smc for xor/solid fill `To save a byte and a clock cycle, this can simply be:
Code: [Select]
`  ld (or_xor),a`
Code: [Select]
`   rra    ;a = x coord / 8    rra    ;    rra    ;    and %00011111 ;starting x/8 (starting byte in gbuf)    ld e,a   ;    add hl,de  ;add x    ld de,gbuf  ;    add hl,de  ;hl = offset in gbuf `To save 7 clock cycles and 0 bytes:
Code: [Select]
`   rra    ;a = x coord / 8    rra    ;    rra    ;    and %00011111 ;starting x/8 (starting byte in gbuf)    or 40h   ;gbuf=9340h, 40h = %01000000, so this won't cause problems   ld e,a   ;   ld d,93h   add hl,de  ;add x `

Code: [Select]
`  pop hl   pop bc    ;restore b (# columns)   pop af   ld de,12   add hl,de   ;move down to next row `Since B is already 0 at the start of this code, save a byte and 3 t-states:
Code: [Select]
`  pop hl   ld c,12  add hl,bc  pop bc    ;restore b (# columns)   pop af `And now you can actually modify that whole routine to be a little faster:
Code: [Select]
`  ld d,arectangle_loop_y:   push bc  push hl rectangle_loop_x:    ld e,a or_xor = \$    or (hl)   ;smc will modify this to or/xor    ld (hl),a    ld a,e    rrca     jr nc,\$+3     inc hl    djnz rectangle_loop_x   pop hl   ld c,12  add hl,bc  pop bc  ld a,d  dec c    jr nz,rectangle_loop_y `You get rid of a push af / pop af in the loop which takes 21 t-states and replace it with a ld a,d which is 4 t-states.

This is definitely smaller than my routine. In mine, I make a 12-byte pattern to OR or XOR onto the screen o_o
One of the tricks I use is to find the first byte non-zero byte of the pattern:
Code: [Select]
`;D is the x coordinate, here;HL points to the pattern buffer (12 bytes)     ld a,d     and 7     ld a,80h     ld b,a     jr z,\$+5       rrca       djnz \$-1;A is the mask if it were for a pixel;B is 0     add a,a     dec a     ld (hl),a`So for example, if D and 7 = 3, you would get %00010000. 'add a,a' turns that to %00100000 and then dec a â†’ %00011111.
And if you are worried about ' D and 7 ' = 0, you get %10000000â†’%00000000â†’%11111111 which is correct, or if 'D and 7' returns 7, %00000001â†’%00000010â†’%00000001.
Title: Re: ASM Optimized routines
Post by: chickendude on February 03, 2013, 04:11:16 am
That's a cool trick with the add a,a \ dec a. My first go at it only handled multiples of 8 and looked like this:
Code: [Select]
`;b = # columns ;c = # rows ;d = starting x ;e = starting y rectangle_filled2: ld a,d ;a = starting x coord ld l,e ;ld hl,e ld h,0 ; .. ld d,h ;set d = 0 add hl,de ;starting y * 12 add hl,de ;x3 add hl,hl ;x6 add hl,hl ;x12 rra ;a = x coord / 8 rra ; rra ; and %00011111 ;starting x/8 ld e,a ; add hl,de ;add x ld de,gbuf add hl,de ;offset in gbuf ld a,b ;b = no columns rra rra rra and %00011111 ;no. columns / 8 ld b,a ld a,12 sub b ld e,a ld d,0 rectangle_loop_y: push bc rectangle_loop_x: ld (hl),\$FF inc hl djnz rectangle_loop_x pop bc ;restore c (# columns) add hl,de ;move down to next row dec c jr nz,rectangle_loop_y rectangle_end: ret`Much smaller and faster, but certain cases (for example, a rectangle less than 8 pixels wide, 2 byte rectangles, etc.) seemed like they were going to bump up the size quite a bit and what I wanted was to write it more for size than speed since it's not being used in any time-critical areas of the game (mostly in menus and the battle engine). How do you handle cases where, say, a rectangle doesn't fill an entire byte?

The gbuf bit is also pretty creative (i guess you could also just add a,\$40) but unfortunately wouldn't work out since it's being written for the 83 as well, though i guess a simple define would do. Yep:
Code: [Select]
`#ifdef TI83PGBUF_LSB = \$40GBUF_MSB = \$93 .org progstart-2 .db \$bb,\$6d#else GBUF_LSB = \$29GBUF_MSB = \$8E .org progstart#endif;... rra ;a = x coord / 8 rra ; rra ; and %00011111 ;starting x/8 (starting byte in gbuf) add a,GBUF_LSB ld e,a ; ld d,GBUF_MSB ; add hl,de ;hl = offset in gbuf`...works just fine for the 83/+ :)

I hope deeph doesn't mind, if you feel like checking their project out it's over at yAronet: http://www.yaronet.com/posts.php?s=153983

EDIT: Here's a version that draws vertically (here b = height and c = width) which seems to be slightly faster (and's the same size):
Code: [Select]
`#ifdef TI83PGBUF_LSB = \$40GBUF_MSB = \$93#elseGBUF_LSB = \$29GBUF_MSB = \$8E#endif;b = # rows;c = # columns;d = starting x;e = starting yrectangle_filled_xor: ld a,\$AE ;xor (hl) jr rectangle_filled2rectangle_filled_solid: ld a,\$B6 ;or (hl)rectangle_filled2: push de push bc ld (or_xor),a ;use smc for xor/solid fill ld a,d ;starting x and \$7 ;what bit do we start on? ex af,af' ld a,d ;starting x ld l,e ;ld hl,e ld h,0 ; .. ld d,h ;set d = 0 add hl,de ;starting y * 12 add hl,de ;x3 add hl,hl ;x6 add hl,hl ;x12 rra ;a = x coord / 8 rra ; rra ; and %00011111 ;starting x/8 (starting byte in gbuf) add a,GBUF_LSB ld e,a ; ld d,GBUF_MSB ; add hl,de ;hl = offset in gbuf ex af,af' ;carry should be reset and z affected from and \$7 ld d,a ld a,%10000000 jr z,\$+6 rra dec d jr nz,\$-2 ld e,12rectangle_loop_x: push af push bc push hl ld c,arectangle_loop_y:or_xor = \$ or (hl) ;smc will modify this to or/xor ld (hl),a ld a,c add hl,de djnz rectangle_loop_y pop hl pop bc pop af rrca jr nc,\$+3 inc hl dec c jr nz,rectangle_loop_xrectangle_end: pop bc pop de ret`
EDIT: Small optimization :)
Title: Re: ASM Optimized routines
Post by: NanoWar on February 03, 2013, 12:39:46 pm
So these work only with rectangle width W where W modulo 8 = 0 and W >= 8 ?

Any chance of a generic rectangle routine?
Title: Re: ASM Optimized routines
Post by: chickendude on February 03, 2013, 11:51:44 pm
They should work with any rectangle with a width/height greater than 0 (there's no error checking for invalid dimensions and no clipping). I can see what i can do for a generic rectangle routine, what i've been doing is just calling the routine twice:
Code: [Select]
`start: ld de,\$0204 ld bc,\$2121 call draw_box2 call ionFastCopy bcall _getkey retdraw_box2: call rectangle_filled_solid inc d inc e dec b dec b dec c dec c call rectangle_filled_xor ret`
EDIT: What i meant was that the first routine i wrote only handled multiples of 8, the other routines (the one in post #64 and at the bottom of #66) can handle any rectangle with valid (non-zero) dimensions. Well, as long as they don't go off-screen!

EDIT2: Here's a simple normal rectangle routine. I just converted the other routine to not draw the inside of the rectangle, so it won't erase what's inside the rectangle. It's currently at 90 bytes and not terribly fast, slightly slower than drawing a filled rectangle. By default a rectangle needs to be at least 3x2 (that is X*Y), but if you don't mind adding 8 bytes (altogether 99 bytes) you can use it to draw horizontal and vertical lines and even plot pixels ;) You can just uncomment those lines.
Code: [Select]
`;b = # rows;c = # columns;d = starting x;e = starting yrectangle: push de push bc ld a,d ;starting x and \$7 ;what bit do we start on? ex af,af' ld a,d ;starting x ld l,e ;ld hl,e ld h,0 ; .. ld d,h ;set d = 0 add hl,de ;starting y * 12 add hl,de ;x3 add hl,hl ;x6 add hl,hl ;x12 rra ;a = x coord / 8 rra ; rra ; and %00011111 ;starting x/8 (starting byte in gbuf) add a,GBUF_LSB ld e,a ; ld d,GBUF_MSB ; add hl,de ;hl = offset in gbuf ex af,af' ;carry should be reset and z affected from and \$7 ld e,a ld a,%10000000 jr z,\$+6 rra dec e jr nz,\$-2 dec b ;you could adjust your input to take care of this, ie b = width-2, c = height-1 and save 3 bytes here dec b ;we draw the ends separately dec c ;we'll draw the last line at the end ld d,a ;starting bit to draw;d = starting bitrectangle_loop_y: push bc push hl call rectangle_loop_x pop hl ;hl = first column in gbuf row ld c,12 ;b = 0, bc = 12 add hl,b ;move down to next row pop bc ;restore b (# columns) xor a; cp c ; # UNCOMMENT TO ALLOW LINES WITH A HEIGHT OF 1 PIXEL; jr z,rectangle_end ; # ld (ld_hl),a ;change ld (hl),a to nop ld a,d ;restore a (starting bit to draw) dec c jr nz,rectangle_loop_y ld a,\$77 ;ld (hl),a ld (ld_hl),a ;return nop to ld (hl),a ld a,d call rectangle_loop_xrectangle_end: pop bc pop de retrectangle_loop_x: or (hl) ;first bit ld (hl),a; inc b ; # UNCOMMENT TO ALLOW LINES WITH A WIDTH OF 1 PIXEL; ret z ; # ; dec b ; # ; jr z,rectangle_loop_x_end ; # ld a,drectangle_loop_x_inner: rrca ;rotate a to draw the next bit jr nc,\$+3 inc hl ld e,a ;save a (overwritten with or (hl)) or (hl) ;smc will modify this to or/xorld_hl = \$ ld (hl),a ld a,e ;recall a djnz rectangle_loop_x_innerrectangle_loop_x_end: rrca ;rotate a to draw the next bit jr nc,\$+3 inc hl or (hl) ;last bit ld (hl),a ret`If you don't care about the coordinates, you can just remove the push/pops and get another 4 bytes. And i've got another idea which should be faster (essentially working with bytes instead of pixels), though it might take up a bit more space. If you're interested let me know and i'll try to work on it some.

As always, use and abuse without restrictions :)

EDIT3: Small optimization
Title: Re: ASM Optimized routines
Post by: Xeda112358 on February 04, 2013, 07:23:40 am
EDIT Five years later, I found my code didn't work that well :( At the bottom of this post is a working routine, but not ideal.

Hmm, for 'rectangle_loop_x' and 'rectangle_loop_x_end' you have 'or (hl)' which I think needs to be SMC'd.

I decided to try and optimise my code a bit and I managed to optimise it for speed in most cases and size. Unfortunately, the code is still much bigger than chickendude's at 133 bytes x.x To give an indicator of speed (6MHz):

my old routine, 9x18 rectangle : 1926 times in two seconds
my new routine, 9x18 : 5668 times in two seconds
chickendude's old routine : 1287 times in two seconds
chickendude's new routine : untested .__.

So for cases where you don't need crazy speed, chickendude's is still very fast and much smaller (73 bytes versus 133).
Code: [Select]
`Rectangle_or: ld a,\$B6 jr RectangleRectangle_xor: ld a,\$AERectangle:;    DE = (x,y);    BC = (height,width) ld (smc_logic1),a ld (smc_logic0),a push de push bc push bc ld a,d call ComputeByte ld (smc_FirstByte),a ex (sp),hl ld a,d neg and 7 ld b,a ld a,l sub b ex (sp),hl ld c,a call ComputeByte cpl ld (smc_LastByte),a ld b,a    sra c \ sra c \ sra c    inc c; ld a,c; and %11111000; rra \ rra \ rra;   inc a; ld c,a ld a,d ld d,0 ld h,d ld l,e add hl,hl add hl,de add hl,hl add hl,hl and %11111000 rra \ rra \ rra add a,GBUF_LSB ld e,a ld d,GBUF_MSB add hl,de;HL points to the first byte pop de;D is the height;E is the number of bytes wide inc c dec c jr nz,RectOverLoop-1 ld a,(smc_FirstByte) and b ld c,a ;value ld b,d ;height ld de,12 ld a,csmc_logic1: or (hl) ld (hl),a add hl,de djnz \$-4 pop bc pop de ret ld e,cRectOverLoop: ld b,e ld c,12 .db 3Eh       ;start of ld a,*smc_FirstByte: .db 0RectLoop:smc_logic0: or (hl) ld (hl),a inc hl dec c    jr z,ExitLoop ld a,-1 djnz RectLoop;    jp p,\$+4;    dec b .db 3Eh       ;start of ld a,*smc_LastByte: .db 0 or (hl) ld (hl),a add hl,bcExitLoop:    dec d jr nz,RectOverLoop pop bc pop de retComputeByte: and 7 ld b,a ld a,80h jr z,\$+5   rrca   djnz \$-1 add a,a dec a ret`
Necro-edit:
This code is working, and it performs clipping. Using the above benchmarks, this code draws approximately 4240 of those rectangle per two seconds at 6MHz on an actual calc. The downside is the size :( The core routine is 119 bytes, and the code for XOR rectangle is an additional 61 bytes, OR rectangle is also 61 bytes, and Erase rectangle is 66 bytes. They do not preserve registers. However, they were made to run in an app, so they don't rely on SMC-- If they did use SMC, it could probably fit all of the routines in just under 200 bytes.
Code: [Select]
`;;;;rectXOR;;rectOR;;rectErase;;  (B,C) = (x,y) signed;;  (D,E) = (w,h) unsigned;;  HL points to bufrectXOR:    push hl    call rectSub    pop ix    ret nc    ex de,hl    add ix,de    ex de,hl    push ix    pop hl    dec b    jp m,xorrect0    inc bxor_rect_loop:    push bc    push hl    ld a,(hl) \ xor d \ ld (hl),a \ inc hl    dec b    jr z,\$+8    ld a,(hl) \ cpl \ ld (hl),a \ inc hl \ djnz \$-4    ld a,(hl) \ xor e \ ld (hl),a    ld bc,12    pop hl    add hl,bc    pop bc    dec c    jr nz,xor_rect_loop    retxorrect0:    ld a,d    and e    ld b,c    ld c,a    ld de,12    ld a,c    xor (hl)    ld (hl),a    add hl,de    djnz \$-4    retrectErase:    push hl    call rectSub    pop ix    ret nc    ex de,hl    add ix,de    ex de,hl    push ix    pop hl    ld a,d    cpl    ld d,a    ld a,e    cpl    ld e,a    dec b    jp m,eraserect0    inc berase_rect_loop:    push bc    push hl    ld a,(hl) \ and d \ ld (hl),a \ inc hl    dec b    jr z,\$+7    xor a    ld (hl),a \ inc hl \ djnz \$-2    ld a,(hl) \ and e \ ld (hl),a    ld bc,12    pop hl    add hl,bc    pop bc    dec c    jr nz,erase_rect_loop    reteraserect0:    ld a,d    xor e    ld b,c    ld c,a    ld de,12    ld a,c    and (hl)    ld (hl),a    add hl,de    djnz \$-4    retrectOR:    push hl    call rectSub    pop ix    ret nc    ex de,hl    add ix,de    ex de,hl    push ix    pop hl    dec b    jp m,orrect0    inc bor_rect_loop:    push bc    push hl    ld a,(hl) \ or d \ ld (hl),a \ inc hl    dec b    jr z,\$+8    ld c,-1    ld (hl),c \ inc hl \ djnz \$-2    ld a,(hl) \ or e \ ld (hl),a    ld bc,12    pop hl    add hl,bc    pop bc    dec c    jr nz,or_rect_loop    retorrect0:    ld a,d    and e    ld b,c    ld c,a    ld de,12    ld a,c    or (hl)    ld (hl),a    add hl,de    djnz \$-4    retrectsub:;(B,C) = (x,y) signed;(D,E) = (w,h) unsigned;Output:;  Start Mask  D;  End Mask    E;  Byte width  B;  Height      C;  buf offset  HL  bit 7,b  jr z,+_  ;Here, b is negative, so we have to add width to x.  ;If the result is still negative, the entire box is out of bounds, so return  ;otherwise, set width=newvalue,b=0  ld a,d  add a,b  ret nc  ld d,a  ld b,0_:  bit 7,c  jr z,+_  ld a,e  add a,c  ret nc  ld e,a  ld c,0_:;We have clipped all negative areas.;Now we need to verify that (x,y) are on the screen.;If they aren't, then the whole rectangle is off-screen so no need to draw.  ld a,b  cp 96  ret nc  ld a,c  cp 64  ret nc;Let's also verfiy that height and width are non-zero:  ld a,d  or a  ret z  ld a,e  or a  ret z;Now we need to clip the width and height to be in-bounds  add a,c  cp 65  jr c,+_  ;Here we need to set e=64-c  ld a,64  sub c  ld e,a_:  ld a,d  add a,b  cp 97  jr c,+_  ;Here we need to set d=96-b  ld a,96  sub b  ld d,a_:;B is starting X;C is starting Y;D is width;E is height  push bc  ld a,b  and 7  ld b,a  ld a,-1  jr z,+_  rra \ djnz \$-1_:  inc a  cpl  ld h,a    ;start mask  ld a,b  add a,d  and 7  ld b,a  ld a,-1  jr z,+_  rra \ djnz \$-1_:  inc a  ld l,a  ;end mask  ex (sp),hl  ;stack now holds DE  ;HL is now the coordinates  ;B=0, C=height  ;A,BC are free to destroy  ld a,h  ld h,b  add hl,hl  add hl,bc  add hl,hl  add hl,hl  ld b,a  rrca  rrca  rrca  and 31  add a,l  ld l,a  jr nc,\$+3  inc h;B is the starting x, D is width;Only A,B,D,E are available  ld a,b  add a,d  and \$F8  ld d,a  ld a,b  and \$F8  ld b,a  ld a,d  sub b  rrca  rrca  rrca  ld b,a  ld c,e  pop de  scf  ret`
Title: Re: ASM Optimized routines
Post by: chickendude on February 04, 2013, 11:37:01 am
The other routine is just a plain rectangle (non-filled) drawing routine. I don't bother SMC'ing the or (hl), just the ld (hl),a. What it does is draw the first and last pixels of the line/border outside of the main loop then either draws (ld (hl),a) all pixels in between or skips (nop's) them all. It's a bit larger so unless the speed is that important you might be better off drawing two filled rectangles with the other routine, one OR'd and a slightly smaller one XOR'd.

Also, i just realized i don't need the "or a" after "ex af,af'" since the flags should still be preserved from the "and \$7", so we can delete that.

It's interesting to see how our syntax/style differs even on bits of code that do exactly the same thing :)

And if you don't mind sacrificing just a few clocks (altogether maybe between 8-64), maybe you could try this and save 2 bytes:
Code: [Select]
`ComputeByte: and 7 ld b,a ld a,\$FF ret z   srl a ;or or a \ rra   djnz \$-2 ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on February 04, 2013, 01:03:55 pm
It's interesting to see how our syntax/style differs even on bits of code that do exactly the same thing :)
I noticed that, too, it is kind of neat. I noticed that I mask A and then shift, whereas you shift, then mask. I've done both, but I typically do the former.
And if you don't mind sacrificing just a few clocks (altogether maybe between 8-64), maybe you could try this and save 2 bytes:
Nice, I like that! When B is not 0, that is actually anywhere from 6 cycles faster to 18 cycles slower, so I think that is a great trade-off.
By adding a byte, I can make your routine anywhere from 0 cycles to 24 cycles faster when b>0
Code: [Select]
`ComputeByte: neg   ; or 'cpl \ inc a and 7 ld b,a ld a,FFh ret z add a,a djnz \$-1 ret`This happens to be faster than my original and smaller by one byte. Still, it is larger than yours :/
Title: Re: ASM Optimized routines
Post by: Xeda112358 on March 07, 2013, 04:14:15 pm
EDIT3: (continued below) Smaller, faster version below.
I wanted to optimise an old routine in Grammer to compute the GCD of two 16-bit numbers. I came up with this:
Code: [Select]
`GCDDE_HL:;Inputs:;     HL,DE are the two values;Outputs:;     B is 0;     DE is 0;     HL is the GCD;     C is not changed;Destroys:;     A     xor a              ;AF     4     ld b,a             ;47     4CheckMax:               ;     sbc hl,de          ;ED52   15n     jr z,AdjustGCD     ;28**   12n-5     jr nc,ParityCheck  ;30**   12n-5    xor a              ;AF     4(n-a)     sub l              ;95     4(n-a)     ld l,a             ;6F     4(n-a)     sbc a,a            ;9F     4(n-a)     sub h              ;94     4(n-a)     ld h,a             ;67     4(n-a)     ex de,hl     jp CheckMax        ;C3**** 10(n-a)ParityCheck:            ;     bit 0,e            ;CB**   8a     jr nz,DE_Odd       ;20**   12a-5b     bit 0,l            ;CB**   8b     jr z,BothEven      ;28**   12b-5c     rr d               ;CB**   8(n-a-b-c)     rr e               ;CB**   8(n-a-b-c)     jp CheckMax        ;C3**** 10(n-a-b-c)BothEven:               ;     inc b              ;04     4c     rr d \ rr e        ;       16c     rr h \ rr l        ;       16c     jp CheckMax        ;       10cDE_Odd:                 ;     bit 0,l            ;       8b     jr nz,BothOdd      ;       12b-5d     rr h \ rr l        ;       16(n-a-b-d)     jp CheckMax        ;       10(n-a-b-d)BothOdd:                ;     sbc hl,de          ;       15d     rr h \ rr l        ;       16d     jp CheckMax        ;       10dAdjustGCD:              ;     ex de,hl           ;       4     inc b              ;       4     dec b              ;       4     ret z              ;       11+4(k>0)     add hl,hl          ;       11k     djnz \$-1           ;       13k-5     ret                ;       --`It is a lot faster than my other version which used division to compute the mod of two 16-bit numbers x.x The JP instructions can be changed to JR for better portability and to save a byte each time.

EDIT: And if I didn't make a mistake, the 8-bit version:
Code: [Select]
`GCD_A_C:;Outputs:;    A is the GCD;    C should be the smallest odd number that divides both inputs;    B is 0;Destroys:;    D     ld b,1CheckMax:     sub c     jr z,AdjustGCD     jr nc,ParityCheck     neg     ld d,a     ld a,c     ld c,a     jr CheckMaxParityCheck:     rrc c     jr c,c_Odd     inc b     rrca     jr nc,CheckMax     rlca     djnz CheckMaxc_Odd:     rlc c     rrca     jr nc,CheckMax     rlca     jr CheckMaxAdjustGCD:     ld a,c     dec b     ret z     add a,a     djnz \$-1     ret`EDIT2: I think I computed a massive overestimate of the slowest speed of the first routine to be a little over 4000 cycles. My old routine used about 1500 cycles at the fastest for a non-trivial result. 1500 is likely close to the slowest that the new routine will run at :)
EDIT3: 14 bytes saved, runs faster?
Code: [Select]
`GCDDE_HL:;Inputs:;     HL,DE are the two values;Outputs:;     B is 0;     DE is 0;     HL is the GCD;     C is not changed;     A is not changed     ld b,1     or aCheckMax:               ;     sbc hl,de          ;ED52   15n     jr z,AdjustGCD     ;28**   12n-5     jr nc,ParityCheck  ;30**   12n-5     add hl,de     or a     ex de,hlParityCheck:            ;     bit 0,e            ;CB**   8a     jr nz,DE_Odd       ;20**   12a-5b     bit 0,l            ;CB**   8b     jr z,BothEven      ;28**   12b-5c     rr d               ;CB**   8(n-a-b-c)     rr e               ;CB**   8(n-a-b-c)     jp CheckMax        ;C3**** 10(n-a-b-c)BothEven:               ;     inc b              ;04     4c     rr d \ rr e        ;       16cHL_Even:     rr h \ rr l        ;       16c     jp CheckMax        ;       10cDE_Odd:                 ;     bit 0,l            ;       8b     jr z,HL_Even       ;       12b-5d     sbc hl,de          ;       15d     rr h \ rr l        ;       16d     jp nz,CheckMax        ;       10dAdjustGCD:              ;     ex de,hl           ;       4     dec b              ;       4     ret z              ;       11+4(k>0)     add hl,hl          ;       11k     djnz \$-1           ;       13k-5     ret                ;       --`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on July 04, 2013, 08:46:10 am
EDIT: Fixed a problem to take care of the case where HL= 8000h (thanks Jacobly!)
This routine a few pages back can be optimised:
Code: [Select]
`SignedDivision: ld a,h xor d push af bit 7,h jr z,\$+8 xor a sub l ld l,a sbc a,a sub h ld h,a bit 7,d jr z,\$+8 xor a sub e ld e,a sbc a,a sub d ld d,a call RegularDivision pop af add a,a ret nc xor a sub l ld l,a sbc a,a sub h ld h,a ret`
For the sign testing, I came up with this:
Code: [Select]
`SignedDivision: ld a,h xor d push af xor d jp p,\$+9 xor a sub l ld l,a sbc a,a sub h ld h,a bit 7,d jr z,\$+8 xor a sub e ld e,a sbc a,a sub d ld d,a call RegularDivision pop af ret p xor a sub l ld l,a sbc a,a sub h ld h,a ret`In all, it saves 1 bytes and at least 5 t-states (it will be either 5 or 10).
Title: Re: ASM Optimized routines
Post by: Xeda112358 on October 20, 2013, 03:53:19 pm
Here is a sqrtL routine:
Code: [Select]
`SqrtL:;Inputs:;     L is the value to find the square root of;Outputs:;      C is the result;      B,L are 0;     DE is not changed;      H is how far away it is from the next smallest perfect square;      L is 0;      z flag set if it was a perfect square;Destroyed:;      A     ld bc,400h       ; 10    10     ld h,c           ; 4      4sqrt8Loop:            ;     add hl,hl        ;11     44     add hl,hl        ;11     44     rl c             ; 8     32     ld a,c           ; 4     16     rla              ; 4     16     sub a,h          ; 4     16     jr nc,\$+5        ;12|19  48+7x       inc c       cpl       ld h,a     djnz sqrt8Loop   ;13|8   47     ret              ;10     10;287+7x, x is the number of bits in the result;min: 287;max: 315;19 bytes`Also, in case anybody needed a small GCD (Greatest Common Divisor) routine, I have this:
Code: [Select]
`GCDHL_DE:;Outputs:;     DE is the GCDGCDLoop:     or a     sbc hl,de     ret z     jr nc,\$-3     add hl,de     ex de,hl     jp GCDLoop`(a faster one is a few posts up)
If you need a fast way to see if a 16-bit number is divisible by 3 (without actually dividing)
Code: [Select]
`HL_mod_3:;Outputs:;     Preserves HL;     A is the remainder;     destroys DE,BC;     z flag if divisible by 3, else nz     ld bc,030Fh     ld a,h     add a,l     sbc a,0   ;conditional decrement;Now we need to add the upper and lower nibble in a     ld d,a     and c     ld e,a     ld a,d     rlca     rlca     rlca     rlca     and c     add a,e     sub c     jr nc,\$+3     add a,c;add the lower half nibbles     ld d,a     sra d     sra d     and b     add a,d     sub b     ret nc     add a,b     ret;at most 132 cycles, at least 123`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on November 13, 2013, 03:09:47 pm
I made this rather fast 8.8 fixed point natural log (ln, loge) routine today based on some math I have been working on. It currently averages 677 t-states, but only works on [1,2] I plan to extend this to the whole range of numbers by using a 5 element LUT and a single division making the average somewhere around 1100 t-states:
Code: [Select]
`lognat:;Input:  H.L needs to be on [1,2];Output: H.L if z flag is set, else if nz, no result;avg speed: 677 t-states dec h    dec h jr nz,\$+8 inc l \ dec l ret nz ld l,177 ret inc h ret nz ld b,h ld c,l ld e,l ld d,8 add hl,hl add hl,hl add hl,de ex de,hl add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ adc a,a ld h,a \ ld l,b sla h \ jr c,\$+3 \ ld l,c add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc rl l ld a,h adc a,b ld l,a ld h,b cp a ret`I had fun twisting some of the logic, but there is a division and multiplication in there. It might seem backwards, but it was optimal :)
Title: Re: ASM Optimized routines
Post by: Xeda112358 on November 14, 2013, 08:17:22 pm
This version works on (0,128) and averages less than 1250 t-states:
Code: [Select]
`lognat:;Input:  H.L needs to be on (0,128.0);Output: H.L if c flag set; returns nc if input is negative (HL not modified);Error:; The error on the outputs is as follows:; 20592 inputs are exact; 12075 inputs are off by 1/256; 100 inputs are off by 2/256; So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.;Size: 177 bytes;Speed: average speed is less than 1250 t-states ld a,h \ or l \ jr nz,\$+5 ld h,80h \ ret dec h dec h jr nz,\$+9 inc l \ dec l jr nz,normalizeln-1 ld l,177 ret inc h jr nz,normalizeln ld b,h ld c,l ld e,l ld d,8 add hl,hl add hl,hl add hl,de ex de,hl call HL_Div_DE ld h,a \ ld l,b sla h \ jr c,\$+3 \ ld l,c add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc add hl,hl \ jr c,\$+3 \ add hl,bc rl l ld a,h adc a,b ld h,b ld l,a scf retHL_Div_DE: add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ adc a,a \ ret inc hnormalizeln: xor a inc h \ ret m ld d,a \ ld e,a ld a,l jr z,toosmall inc e \ srl h \ rra \ jr nz,\$-4 rla \ rl h dec estepin: ld l,a push de call lognat pop de ;now multiply DE by 355, then divide by 2 (rounding) ld b,d \ ld c,e \ ld a,d ex de,hl add hl,hl add hl,hl ;4 add hl,bc ;5 add hl,hl ;10 add hl,bc ;11 add hl,hl ;22 add hl,hl add hl,hl add hl,hl add hl,bc add hl,hl add hl,bc sra h \ rr l adc hl,de scf rettoosmall: dec d dec e \ add a,a \ jr nc,\$-2 inc h jp stepin`The worst error is 2/256, which is much better than I thought it would be :) On [1,2] it is at worst 1/256 and the average case is <1/683 which is smaller than 1/256 (the smallest positive 8.8 number).
Title: Re: ASM Optimized routines
Post by: Xeda112358 on November 14, 2013, 11:50:30 pm
If you are willing to use a 256-byte LUT, then lg() can be computed in an average of 340 t-states, and for an additional 325 (at most), ln() can be computed from that:
(these are fixed point 8.8 routines)
Code: [Select]
`ln_88:;Input: HL is a fixed point number;Output: ln(H.L)->H.L;Speed: Avg: 340+(325 worst case) call lg_88 ;now signed multiply HL by 355, then divide by 2 (rounding)    ld de,0    bit 7,h    jr z,\$+9    dec e \ xor a \ sub l \ ld l,a    sbc a,a \ sub h \ ld h,a    ld b,h    ld c,l    xor a add hl,hl    add hl,hl \ rla add hl,bc \ adc a,d add hl,hl \ rla add hl,bc \ adc a,d add hl,hl \ rla add hl,hl \ rla add hl,hl \ rla add hl,hl \ rla add hl,bc \ adc a,d add hl,hl \ rla add hl,bc \ adc a,d    sra a \ rr h    ld l,h    ld h,a    inc e    ret nz    xor a \ sub l \ ld l,a    sbc a,a \ sub h \ ld h,a    retlg_88:;Input: HL is a fixed point number;Output: lg(H.L)->H.L;Speed: Avg: 340 ld de,lgLUT ld b,0 ld a,h or a ret m ld a,l jr z,\$+8 inc b \ srl h \ rra \ jr nz,\$-4 or a \ jr nz,\$+6 ld hl,8000h \ ret rra \ inc b \ jr nc,\$-2;A is the element to look up in the LUT ld l,a    ld c,h    dec b add hl,hl add hl,de ld e,(hl) inc hl ld d,(hl)    ex de,hl add hl,bc retlglut:.dw \$F800.dw \$F996.dw \$FA52.dw \$FACF.dw \$FB2C.dw \$FB76.dw \$FBB3.dw \$FBE8.dw \$FC16.dw \$FC3F.dw \$FC64.dw \$FC86.dw \$FCA5.dw \$FCC1.dw \$FCDC.dw \$FCF4.dw \$FD0B.dw \$FD21.dw \$FD36.dw \$FD49.dw \$FD5C.dw \$FD6D.dw \$FD7E.dw \$FD8E.dw \$FD9D.dw \$FDAC.dw \$FDBA.dw \$FDC8.dw \$FDD5.dw \$FDE2.dw \$FDEE.dw \$FDFA.dw \$FE06.dw \$FE11.dw \$FE1C.dw \$FE26.dw \$FE31.dw \$FE3B.dw \$FE44.dw \$FE4E.dw \$FE57.dw \$FE60.dw \$FE69.dw \$FE71.dw \$FE7A.dw \$FE82.dw \$FE8A.dw \$FE92.dw \$FE9A.dw \$FEA1.dw \$FEA9.dw \$FEB0.dw \$FEB7.dw \$FEBE.dw \$FEC5.dw \$FECB.dw \$FED2.dw \$FED8.dw \$FEDF.dw \$FEE5.dw \$FEEB.dw \$FEF1.dw \$FEF7.dw \$FEFD.dw \$FF03.dw \$FF09.dw \$FF0E.dw \$FF14.dw \$FF19.dw \$FF1E.dw \$FF24.dw \$FF29.dw \$FF2E.dw \$FF33.dw \$FF38.dw \$FF3D.dw \$FF42.dw \$FF47.dw \$FF4B.dw \$FF50.dw \$FF55.dw \$FF59.dw \$FF5E.dw \$FF62.dw \$FF67.dw \$FF6B.dw \$FF6F.dw \$FF74.dw \$FF78.dw \$FF7C.dw \$FF80.dw \$FF84.dw \$FF88.dw \$FF8C.dw \$FF90.dw \$FF94.dw \$FF98.dw \$FF9B.dw \$FF9F.dw \$FFA3.dw \$FFA7.dw \$FFAA.dw \$FFAE.dw \$FFB2.dw \$FFB5.dw \$FFB9.dw \$FFBC.dw \$FFC0.dw \$FFC3.dw \$FFC6.dw \$FFCA.dw \$FFCD.dw \$FFD0.dw \$FFD4.dw \$FFD7.dw \$FFDA.dw \$FFDD.dw \$FFE0.dw \$FFE4.dw \$FFE7.dw \$FFEA.dw \$FFED.dw \$FFF0.dw \$FFF3.dw \$FFF6.dw \$FFF9.dw \$FFFC.dw \$FFFF`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on November 16, 2013, 09:51:46 am
Here is my 8.8 fixed point, table-based arctangent routine:
Code: [Select]
`arctan_88:;Input:;    D.E;Output: atan(D.E)->D.E    push de    ld a,d    or a    jp p,\$+5    neg    ld d,a    dec a    jr nz,checkneedinv    inc e \ dec e \ jr nz,checkneedinv    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ retcheckneedinv:    inc a    call nz,DEgt1_Inv;0.E is the value to atan    ld hl,adjustatan    push hl    ld a,e    cp 46 \ ret c    dec a \ cp 42h \ ret c    dec a \ cp 4Eh \ ret c    dec a \ cp 57h \ ret c    dec a \ cp 5Eh \ ret c    dec a \ cp 64h \ ret c    dec a \ cp 6Ah \ ret c    dec a \ cp 6Fh \ ret c    sub 6Fh \ ld e,a    ld hl,atanlut    add hl,de    ld a,(hl)    retadjustatan:    ld e,a    pop bc    ld a,b    or a    jp p,\$+5    neg    jr z,\$+9    ld hl,402    or a    sbc hl,de    ex de,hl    rl b    ret nc    xor a    sub e    ld e,a    sbc a,a    sub d    ld d,a    retDEgt1_Inv:;Works if DE>1    ld hl,256    ld b,8InvLoop:    add hl,hl    sbc hl,de    jr nc,\$+3    add hl,de    adc a,a    djnz InvLoop    cpl    ld e,a    ld d,b    retatanlut:.db \$6F.db \$6F.db \$70.db \$71.db \$72.db \$73.db \$73.db \$74.db \$75.db \$76.db \$77.db \$77.db \$78.db \$79.db \$7A.db \$7B.db \$7B.db \$7C.db \$7D.db \$7E.db \$7F.db \$7F.db \$80.db \$81.db \$82.db \$82.db \$83.db \$84.db \$85.db \$85.db \$86.db \$87.db \$88.db \$88.db \$89.db \$8A.db \$8B.db \$8B.db \$8C.db \$8D.db \$8E.db \$8E.db \$8F.db \$90.db \$90.db \$91.db \$92.db \$93.db \$93.db \$94.db \$95.db \$95.db \$96.db \$97.db \$97.db \$98.db \$99.db \$9A.db \$9A.db \$9B.db \$9C.db \$9C.db \$9D.db \$9E.db \$9E.db \$9F.db \$A0.db \$A0.db \$A1.db \$A2.db \$A2.db \$A3.db \$A3.db \$A4.db \$A5.db \$A5.db \$A6.db \$A7.db \$A7.db \$A8.db \$A9.db \$A9.db \$AA.db \$AA.db \$AB.db \$AC.db \$AC.db \$AD.db \$AD.db \$AE.db \$AF.db \$AF.db \$B0.db \$B0.db \$B1.db \$B2.db \$B2.db \$B3.db \$B3.db \$B4.db \$B5.db \$B5.db \$B6.db \$B6.db \$B7.db \$B7.db \$B8.db \$B9.db \$B9.db \$BA.db \$BA.db \$BB.db \$BB.db \$BC.db \$BC.db \$BD.db \$BE.db \$BE.db \$BF.db \$BF.db \$C0.db \$C0.db \$C1.db \$C1.db \$C2.db \$C2.db \$C3.db \$C3.db \$C4.db \$C4.db \$C5.db \$C6.db \$C6.db \$C7.db \$C7.db \$C8.db \$C8.db \$C9`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on November 19, 2013, 12:15:26 pm
EDIT:(27-Oct-2015) Updated at the end of the original post with a muuuch better routine.

I wrote this routine today for generating pseudorandom numbers. If I did the math correctly, it has a period of 4292870399. To think of it differently, it executes in about 1430 1592 t-states, so that means at 15MHZ, it would take about 4 days, 19 hours, 49 minutes, 41 seconds to loop back to the start of the over 4 billion number sequence. Or, if you generated one number per second, it would take over 136 years to finish the cycle.
Code: [Select]
`Rand24:;Inputs: seed1,seed2;Outputs:; HLA is the pseudo-random number; seed1,seed2 incremented accordingly;Destroys: BC,DE;Notes:; seed1*243+83 mod 65519 -> seed1; seed2*251+43 mod 65521 -> seed2; Output seed1*seed2 ld hl,(seed1) xor a ld b,h \ ld c,l ld de,83 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,hl \ rla ;60 add hl,hl \ rla ;120 add hl,bc \ adc a,d ;121 add hl,hl \ rla ;242 add hl,bc \ adc a,d ;243 add hl,de \ adc a,d ;243x+83;now AHL mod 65519; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl add hl,bc add hl,de ld de,65519 jr nc,\$+5 or a \ sbc hl,de or a \ sbc hl,de jr nc,\$+3 add hl,de ld (seed1),hl;seed1 done, now we need to do seed2 ld hl,(seed2); seed1*243+83 mod 65519 -> seed1; seed2*251+43 mod 65521 -> seed2; Output seed1*seed2 xor a ld b,h \ ld c,l ld de,43 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,bc \ adc a,d ;31 add hl,hl \ rla ;62 add hl,hl \ rla ;124 add hl,bc \ adc a,d ;125 add hl,hl \ rla ;250 add hl,bc \ adc a,d ;251 add hl,de \ adc a,d ;251x+83;now AHL mod 65521; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl sbc hl,bc add hl,de ld de,65521 jr nc,\$+5 or a \ sbc hl,de or a \ sbc hl,de jr nc,\$+3 add hl,de ld (seed2),hl;now seed1 and seed 2 are computed ld bc,(seed1) ex de,hl call BC_Times_DE ex de,hl ld l,b ld h,0 ld b,h ld c,l add hl,hl add hl,hl add hl,bc add hl,hl add hl,hl add hl,bc add hl,hl add hl,bc ld c,d ld d,e ld e,a ld a,c sbc hl,bc sbc a,b ret nc ld c,43 add hl,bc ret;now do BC_times_DEBC_Times_DE:;BHLA is the result ld a,b or a ld hl,0 ld b,h;1 add a,a jr nc,\$+4 ld h,d ld l,e;2 add hl,hl rla jr nc,\$+4 add hl,de adc a,b;227+10b-7p add hl,hl rla jr nc,\$+4 add hl,de adc a,b add hl,hl rla jr nc,\$+4 add hl,de adc a,b add hl,hl rla jr nc,\$+4 add hl,de adc a,b add hl,hl rla jr nc,\$+4 add hl,de adc a,b add hl,hl rla jr nc,\$+4 add hl,de adc a,b add hl,hl rla jr nc,\$+4 add hl,de adc a,b;===;AHL is the result of B*DE*256 push hl ld h,b ld l,b ld b,a ld a,c ld c,h;1 add a,a jr nc,\$+4 ld h,d ld l,e;2 add hl,hl rla jr nc,\$+4 add hl,de adc a,c;227+10b-7p add hl,hl rla jr nc,\$+4 add hl,de adc a,c add hl,hl rla jr nc,\$+4 add hl,de adc a,c add hl,hl rla jr nc,\$+4 add hl,de adc a,c add hl,hl rla jr nc,\$+4 add hl,de adc a,c add hl,hl rla jr nc,\$+4 add hl,de adc a,c add hl,hl rla jr nc,\$+4 add hl,de adc a,c pop de;Now BDE*256+AHL ld c,a ld a,l ld l,h ld h,c add hl,de ret nc inc b;BHLA is the 32-bit result retseed1: .dw 0seed2: .dw 0`To test its 'randomness' I only used HL and it did not exhibit a perfectly uniform random distribution (standard deviation was slightly off from the expected value). However, it seems to be on par with the routine used by the OS and I believe I am using a modified version of that.

Now I will feel confident to use this in my math libraries :)

EDIT: Added a mod 16777259 to the output to fix a flaw.

EDIT: The much better routine:
• It passes all the tests at CAcert Labs (http://www.cacert.at/cgi-bin/rngresults)
• The cycle size is almost 4.3 million times longer .
• It is 550% faster (287cc versus 1592cc)
• It would take over 11,184,544 years at the calculator's max speed to reach the full cycle.
Code: [Select]
`rand:;Tested and passes all CAcert tests;Uses a very simple 32-bit LCG and 32-bit LFSR;it has a period of 18,446,744,069,414,584,320;roughly 18.4 quintillion.;291ccseed1_0=\$+1    ld hl,12345seed1_1=\$+1    ld de,6789    ld b,h    ld c,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    inc l    add hl,bc    ld (seed1_0),hl    ld hl,(seed1_1)    adc hl,de    ld (seed1_1),hl    ex de,hlseed2_0=\$+1    ld hl,9876seed2_1=\$+1    ld bc,54321    add hl,hl \ rl c \ rl b    ld (seed2_1),bc    sbc a,a    and %11000101    xor l    ld l,a    ld (seed2_0),hl    ex de,hl    add hl,bc    ret`EDIT (28 August 2019): Hey, so here is a compact version from a few years ago:
Code: [Select]
`;#define smc ;uncomment if you are using SMCrand16:;collaboration by Zeda with Runer112;160cc or 148cc if using SMC;26 bytes;cycle: 4,294,901,760 (almost 4.3 billion)#ifdef smcseed1=\$+1 ld hl,9999#else    ld hl,(seed1)#endif    ld b,h    ld c,l    add hl,hl    add hl,hl    inc l    add hl,bc    ld (seed1),hl#ifdef smcseed2=\$+1 ld hl,9999#else    ld hl,(seed2)#endif    add hl,hl    sbc a,a    and %00101101    xor l    ld l,a    ld (seed2),hl    add hl,bc    ret`And here is a different routine that is faster and might be better (it is on the tests I am using):
Code: [Select]
`xsp32:;Inputs: (seed1), (seed2), and (seed3) are 16-bit seeds. (seed1) and (seed2) can't both be 0.;Outputs: HL is the pseudorandom number;Destroys: A,DE,BC;cycle: 281,474,976,645,120;It would take about 185 years at 15MHz to repeat;min: 258cc (236cc if using SMC);max: 288cc (266cc if using SMC);avg: 273cc (251cc if using SMC);63 bytes (62 bytes if using SMC)#ifdef SMCseed1 = \$+1  ld hl,12345seed2 = \$+1  ld de,6789#else  ld hl,(seed1)  ld de,(seed2)#endif;first, XOR it with itself, shifted left 23 bits;low bit of d needs to be shifted in  ld a,h  rra  ld a,l  rra  jr nc,+_  rl e  ccf  rr e_:  xor d  ld d,a;XOR it with itself, shifted right 15 bits  ld a,h  rla  ld a,e  rla  xor l  ld l,a  ld a,e  rla  ld a,d  rla  jr nc,+_  rr e  ccf  rl e_:  xor h  ld h,a;XOR it with itself, shifted left 17 bits;HL<<1  ld (seed1),hl  add hl,hl  ld a,h  xor d  ld h,a  ld a,l  xor e  ld l,a  ld (seed2),hl  ex de,hl#ifdef SMCseed3 = \$+1  ld hl,33333#else  ld hl,(seed3)#endif  inc hl  inc h  ld (seed3),hl  add hl,de  ret`It has a smaller period, but still far larger than a calc needs. It uses smaller state, and combines a 32-bit xorshift with a basic 16-bit counter (increments by 257).

Just a 32-bit xorshift routine, which is pretty decent on its own:
Code: [Select]
`xs32:;32-bit xorshift;seed^=seed<<23;seed^=seed>>15;seed^=seed<<17;min: 209cc (193cc if using SMC);max: 239cc (223cc if using SMC);avg: 224cc (208cc if using SMC);53 bytes (52 bytes if using SMC)#ifdef SMCseed1 = \$+1  ld hl,12345seed2 = \$+1  ld de,6789#else  ld hl,(seed1)  ld de,(seed2)#endif;first, XOR it with itself, shifted left 23 bits;low bit of d needs to be shifted in  ld a,h  rra  ld a,l  rra  jr nc,+_  rl e  ccf  rr e_:  xor d  ld d,a;XOR it with itself, shifted right 15 bits  ld a,h  rla  ld a,e  rla  xor l  ld l,a  ld a,e  rla  ld a,d  rla  jr nc,+_  rr e  ccf  rl e_:  xor h  ld h,a;XOR it with itself, shifted left 17 bits;HL<<1  ld (seed1),hl  add hl,hl  ld a,h  xor d  ld h,a  ld a,l  xor e  ld l,a  ld (seed2),hl  ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on November 28, 2013, 08:45:11 am
Here is a 32-bit square root routine. I am fairly sure there are optimisations that can be made, especially to the stuff I added last night. I already found a bunch of code that I removed, about 32 bytes, saving 128 t-states.
Code: [Select]
`SqrtHLDE:;input: HLDE;Output: BC;310 bytes;Average is about 1443 t-states push de xor a ld b,a ld e,l ld l,h ld h,a add hl,hl add hl,hl cp h jr nc,\$+5 dec h ld a,4 add hl,hl add hl,hl ld c,a sub h jr nc,\$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,\$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,\$+6 cpl ld h,a inc c inc c ld a,c ld l,e add hl,hl add hl,hl add a,a ld c,a sub h jr nc,\$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,\$+6 cpl ld h,a inc c inc c ld a,c add a,a \ ld c,a add hl,hl add hl,hl jr nc,\$+6 sub h \ jp \$+6 sub h jr nc,\$+6 inc c \ inc c cpl ld h,a ld a,l ld l,h add a,a ld h,a adc hl,hl adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a;iteration 9;now I need to rotate in more bits pop de sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla e \ adc hl,hl \ sla e \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla e \ adc hl,hl \ sla e \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sll c \ rl b sla e \ adc hl,hl \ sla e \ adc hl,hl jr nc,\$+9    or a sbc hl,bc inc c jp \$+13 sbc hl,bc jr nc,\$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla e \ adc hl,hl \ jr nc,\$+4    inc c \ ret    sla e \ adc hl,hl \ jr nc,\$+8    sbc hl,bc \ or a \ jp \$+7    or a \ sbc hl,bc \ ret c    scf \ sbc hl,bc \ ret c    inc c \ ret`
Title: Re: ASM Optimized routines
Post by: Matrefeytontias on April 29, 2014, 06:04:00 am
If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) :
Code: [Select]
`sDiv2416: xor d push af xor d ld b, a jp p, divdPos xor a sub l ld l, a ld a, 0 sbc a, h ld h, a ld a, 0 sbc a, b ld b, adivdPos: bit 7, d jr z, divrPos xor a sub e ld e, a ld a, 0 sbc a, d ld d, a ld a, bdivrPos: push hl pop ix ld hl, 0 ld b, 24divLoop: add ix, ix rla adc hl, hl sbc hl, de jr nc, \$ + 4 add hl, de ; jp nc, xxxx ; it works because the carry can not be set and "inc ix" is 2 bytes .db \$D2 inc ix djnz divLoop push ix pop hl ld b, a pop af ld a, b ret p xor a sub l ld l, a ld a, 0 sbc a, h ld h, a ld a, 0 sbc a, b ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on March 01, 2015, 10:14:07 pm
This routine is a fast way to copy a chunk of data. For BC>34, calling this proves to be faster than using LDIR:
Code: [Select]
`fastLDIR:;copy BC bytes from HL to DE;Cost:;    27cc for having to call;    110cc for setting up the loop, worst case;    10cc * ceiling(BC/n)        ;n=2^k for some k, see the line below "ldirloop:";    16cc * BC;costs roughly 152-BC*(5-10/n) more than a simple LDIR (worst case);for n=4,  BC>=61 saves;for n=8,  BC>=41 saves;for n=16, BC>=35 saves   * default, see the "ldirloop" to change;for n=32, BC>=33 saves;for n=64, BC>=32 saves    push hl    push af    xor a    sub c    and 15               ;change to n-1    add a,a    ld hl,ldirloop    add a,l    ld l,a    jr nc,\$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.    inc h       ;    pop af    ex (sp),hl    retldirloop:;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)    ldi    ldi    ldi    ldi        ldi    ldi    ldi    ldi        ldi    ldi    ldi    ldi        ldi    ldi    ldi_ldirloop_end:    ldi    jp pe,ldirloop    ret`This might be useful for things like copying code to RAM (from an App) in speed critical applications, or other data handling tasks.

EDIT 9 Oct 18: Saved 1 byte and 3cc. Originally, I had 'ld a,16 \ sub c \ and 15'. The 'ld a,16' could hold any multiple of n (16 in this example), including 0, so I just used 'xor a'. I also updated all the timing info.
Title: Re: ASM Optimized routines
Post by: Xeda112358 on March 02, 2015, 06:13:50 am
Not sure if this is useful for any actual applications, but to get a mask for the lowest set bit in C:
Code: [Select]
`    xor a    sub c    and c`So if c=%10110100, this would return a=%00000100 (it also always returns nc, and z only when c=0).
Title: Re: ASM Optimized routines
Post by: Xeda112358 on April 17, 2015, 05:39:14 pm
This is a fast 16-bit Lehmer RNG. It is seeded, and has a period of 65536. This code
Code: [Select]
`smc = 1   ;use 1 if the code is in RAM, since it is faster. If it is in an app, use 0 and define a 2-byte RAM location for the seed.lehmer:;;Input:;;  (seed) has the seed value of the RNG;;Output:;;  (seed) is updated, HL is the result;;Destroys:;;  A,DE,BC;;Timing:;;  if seed>0     231cc or 232cc, condition dependent;;  if seed=0     91cc;;  if smc=1      subtract 6cc;;Size: 44 bytes;;Notes:;;    Uses the Lehmer RNG used by the Sinclair ZX81;;    75x mod 65537 -> x#IF smc == 0    ld hl,(seed)#ELSEseed = \$+1    ld hl,0#ENDIF;multiply by 75    ld c,l    ld b,h    xor a    adc hl,hl \ jr z,special \ ld d,a \ rla    add hl,hl \ rla    add hl,hl \ rla \ add hl,bc \ adc a,d    add hl,hl \ rla    add hl,hl \ rla \ add hl,bc \ adc a,d    add hl,hl \ rla \ add hl,bc;modulo 65537, see note below on how this works    ld e,a    sbc hl,de       ;No need to reset the c flag since it is already    jr nc,\$+3    inc hl    ld (seed),hl    retspecial:;In the case that HL=0, this should be interpreted as 65536 = -1 mod 65537, so return -75 mod 65537 = -74 mod 65536 in HL    ld hl,-74    ld (seed),hl    ret    ;mod by 2^16 + 1 (a prime);current form is A*2^16+HL;need:;  (A*2^16+HL) mod (2^16+1);add 0 as +1-1;  (A*(2^16+1-1)+HL) mod (2^16+1);distribute;  (A*(2^16+1)-A+HL) mod (2^16+1);A*(2^16+1) mod 2^16+1 = 0, so remove;  (-A+HL) mod (2^16+1);Oh hey, that's easy! :P;I use this trick everywhere, you should, too.`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 02, 2015, 09:06:26 am
So earlier on IRC I posted a challenge that I was thinking about before bed: Convert an 8-bit unsigned integer to BCD as fast as you can.

I managed to get my code down to 143.5cc, then jacobly noticed an optimization on mine, and I noticed a bug that had been carried through from the beginning. After all was done, we got it down to 131cc.

The code given is a subroutine (so adding an ret to the end, +1 byte +10cc). Without further ado:
Code: [Select]
`L_To_Dec:;;Unrolled;;Converts the 8-bit register L to binary coded decimal;;Digits stored in LA (A has the lower 2 digits, L the upper).;;Inputs: L is the 8-bit unsigned int to convert;;Output: A has the lower 2 digits (in BCD form), L has the upper;;Destroys: H,F;;141cc ;;27 bytes    ld h,0    add hl,hl    add hl,hl    add hl,hl    add hl,hl    ld a,h \ daa  \ rl l    adc a,a \ daa \ rl l    adc a,a \ daa \ rl l    adc a,a \ daa \ rl l    adc a,a \ daa \ rl l    ret`

EDIT: Found a transcription error. Had to adjust code, size, and timing by +4 bytes, +16cc :(
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 03, 2015, 10:54:32 am
Related to the previous post, I offer several alternatives to the bcalls _SetXXOP1 and _SetXXOP2. These routines are for converting 8-bit integers to TI floats. The advantages with my routines are that they are faster, you don't need to truncate to only the bottom 2 digits, and you can store the output to any location instead of just OP1 or OP2.

So first, if you still want the output to be the same:
Code: [Select]
`setXXOP2:    ld hl,OP2    jr setXXsetXXOP1:    ld hl,OP1setXX:;;Inputs: A is the unsigned int;;        HL is where to write the TI float;;Destroys:All;;291cc+38b (or 144cc if A=0);;average: b=29/255;;295.3215686cc;;59 bytes    ld c,0    ld (hl),c \ inc hl    ld (hl),81h    inc hl \ ld (hl),c    ld d,h \ ld e,l    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    or a \ ret z    ;If A is zero, exit early. +138cc    ld l,a          ;\    ld h,c          ; |    add hl,hl       ; |Start converting A to BCD    add hl,hl       ; |    add hl,hl       ; |    add hl,hl       ; |    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD    adc a,a \ daa \ rl l    ; |Number is in LA    adc a,a \ daa \ rl l    ; |    adc a,a \ daa \ rl l    ; |    adc a,a \ daa           ;/ +124cc    ex de,hl    ld (hl),a    and \$F0    ret nz      ;+29cc    rld         ;\ Rotate up 1 digit    dec hl      ; |    ld (hl),80h ; |    ret         ; /`And if you want to get all 3 digits:
Code: [Select]
`setXXXOP1:    ld hl,OP1setXXX:;;Inputs: A is the unsigned int;;        HL is where to write the TI float;;Destroys:All;;423cc+13a+63b (or 233cc if A=0);;average: a=99/255, b=29/255;;435.2117647cc average;;64 bytes    ld bc,\$0700    ld (hl),c    inc hl    ld (hl),83h    ld d,h    ld e,l    inc hl \ ld (hl),c \ djnz \$-2    or a \ ret z    ;If A is zero, exit early. +227cc    ld l,a          ;\    ld h,c          ; |    add hl,hl       ; |Start converting A to BCD    add hl,hl       ; |    add hl,hl       ; |    add hl,hl       ; |    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD    adc a,a \ daa \ rl l    ; |Number is in LA    adc a,a \ daa \ rl l    ; |    adc a,a \ daa \ rl l    ; |    adc a,a \ daa \ rl l    ;/ +132cc    ex de,hl    jr nz,\$+6 \ ld e,a \ xor a \ ld (hl),81h    ;+(21+4/85)cc    inc hl    ld (hl),e    inc hl    ld (hl),a    ld a,e    and \$F0    ret nz      ;+48cc    rld         ;\ Rotate up 1 digit    dec hl      ; |    rld         ; |    dec hl      ; |    dec (hl)    ; |Decrement exponent    ret         ; /+63(29/255)cc`And if you want to do that, but maybe a little faster:
Code: [Select]
`setXXX:;;Inputs: A is the unsigned int;;        HL is where to write the TI float;;Destroys:All;;334cc+13a+63b (or 144cc if A=0);;average: a=99/255, b=29/255;;346.2117647cc average;;75 bytes    ld c,0    ld (hl),c    inc hl    ld (hl),83h    ld d,h    ld e,l    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c        or a \ ret z    ;If A is zero, exit early. +227cc    ld l,a          ;\    ld h,c          ; |    add hl,hl       ; |Start converting A to BCD    add hl,hl       ; |    add hl,hl       ; |    add hl,hl       ; |    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD    adc a,a \ daa \ rl l    ; |Number is in LA    adc a,a \ daa \ rl l    ; |    adc a,a \ daa \ rl l    ; |    adc a,a \ daa \ rl l    ;/ +132cc    ex de,hl    jr nz,\$+6 \ ld e,a \ xor a \ ld (hl),81h    ;+(21+4/85)cc    inc hl    ld (hl),e    inc hl    ld (hl),a    ld a,e    and \$F0    ret nz      ;+48cc    rld         ;\ Rotate up 1 digit    dec hl      ; |    rld         ; |    dec hl      ; |    dec (hl)    ; |Decrement exponent    ret         ; /+63(29/255)cc`And if you want to convert signed 8-bit ints:
Code: [Select]
`setXXX_signed:;;Inputs: A is the signed int;;        HL is where to write the TI float    ld c,0    ld (hl),c    add a,a    jr c,\$+6    neg    ld (hl),80h    inc hl    ld (hl),81h    ld d,h    ld e,l    inc hl \ ld (hl),c \ djnz \$-2    or a \ ret z    ld l,a          ;\    ld h,c          ; |    add hl,hl       ; |Start converting A to BCD    add hl,hl       ; |    add hl,hl       ; |    add hl,hl       ; |    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD    adc a,a \ daa \ rl l    ; |Number is in cA    adc a,a \ daa \ rl l    ; |(c is carry)    adc a,a \ daa           ;/    ex de,hl    jr nc,\$+15    ld (hl),82h    inc hl    inc hl    ld (hl),a    xor a    rld    or \$10    dec hl    ld (hl),a    ret    inc hl    ld (hl),a    and \$F0    ret nz    rld         ;\ Rotate up 1 digit    dec hl      ; |    ld (hl),80h ; |    ret         ; /`
The slowest routine here has a worst case of 499cc and slowest average case is 436cc.
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 04, 2015, 10:52:59 am
I decided to see if I could make a faster routine than what Runer made that did the same stuff and I came up with the following code. Mine is 3 bytes larger and on average almost twice as fast (ranging from 120cc to 525cc, and one case of 59cc). It has output in HL and returns a DataType error for inputs that are not non-negative reals, and a Domain error if the input exceeds a 16-bit unsigned integer.

One other advantage that mine has is that you can convert a float pointed to by DE, so you aren't limited to just OP1.
Code: [Select]
`ConvOP1:;;Output: HL is the 16-bit result.    ld de,OP1ConvFloat:;;Input: DE points to the float.;;Output: HL is the 16-bit result.;;Errors: DataType if the float is negative or complex;;        Domain if the integer exceeds 16 bits.;;Timings:  Assume no errors were called.;;  Input is on:;;  (0,1)         => 59cc                        Average=59;;  0 or [1,10)   => 120cc or 129cc                     =124.5;;  [10,100)      => 176cc or 177cc                     =176.5;;  [100,1000)    => 309cc, 310cc, 318cc, or 319cc.     =314;;  [1000,10000)  => 376cc to 378cc                     =377;;  [10000,65536) => 514cc to 516cc, or 523cc to 525cc  =519.5;;Average case:  496.577178955078125cc;;vs 959.656982421875cc;;87 bytes     ld a,(de)    or a    jr nz,ErrDataType    inc de    ld hl,0    ld a,(de)    inc de    sub 80h    ret c    jr z,final    cp 5    jp c,enterloopErrDomain:;Throws a domain error.    bcall(_ErrDomain)ErrDataType:;Throws a data type error.    bcall(_ErrDataType)loop:    ld a,b    ld b,h    ld c,l    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hlenterloop:    ld b,a    ex de,hl    ld a,(hl) \ and \$F0 \ rra \ ld c,a \ rra \ rra \ sub c \ add a,(hl)    inc hl    ex de,hl    add a,l    ld l,a    jr nc,\$+3    inc h    dec b    ret z    djnz loop    ld b,h    ld c,l    xor a;check overflow in this mul by 10!    add hl,hl \ adc a,a    add hl,hl \ adc a,a    add hl,bc \ adc a,0    add hl,hl \ adc a,a    jr nz,ErrDomainfinal:    ld a,(de)    rrca    rrca    rrca    rrca    and 15    add a,l    ld l,a    ret nc    inc h    ret nz    jr ErrDomain`If you do not want to throw errors and are comfortable returning garbage results:
Code: [Select]
`ConvOP1:;;Output: HL is the 16-bit result.    ld de,OP1ConvFloat:;;Input: DE points to the float.;;Output: HL is the 16-bit result.;;  Input is on:;;  (0,1)         => 57cc                        Average=57;;  0 or [1,10)   => 117cc or 126cc                     =121.5;;  [10,100)      => 174cc or 175cc                     =174.5;;  [100,1000)    => 276cc, 277cc, 285cc, or 286cc.     =281;;  [1000,10000)  => 374cc to 376cc                     =375;;  [10000,65536) => 481cc to483cc, or 490cc to 492cc  =486.5;;Average case:  467.88153076171875cc     ld hl,0    ld a,(de)    or a    ret z    inc de    ld a,(de)    inc de    sub 80h    ret c    jr z,final    cp 5    jp c,enterloop    retloop:    ld a,b    ld b,h    ld c,l    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hlenterloop:    ld b,a    ex de,hl    ld a,(hl) \ and \$F0 \ rra \ ld c,a \ rra \ rra \ sub c \ add a,(hl)    inc hl    ex de,hl    add a,l    ld l,a    jr nc,\$+3    inc h    dec b    ret z    djnz loop    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,bc    add hl,hlfinal:    ld a,(de)    rrca    rrca    rrca    rrca    and 15    add a,l    ld l,a    ret nc    inc h    ret`
EDIT: I do not have time right now, but I found some really significant speed improvements while I was at work yesterday. Gotta take a kitten to the vet, but later I'll post the update ^~^

Here, this one is a bit faster, a little larger, and has the same outputs (DE,A) as the OS routine, but with modified error handling (no negative or non-real  inputs, max input is 65535 instead of 9999).
Code: [Select]
`ConvOP1:    ld hl,OP1convFloat:;;Inputs: HL points to the TI Float.;;Outputs: The float is converted to a 16-bit integer held in DE, or LSB in A.;;avg: 436.723938;;  0-digit : 30cc              (0,1);;  1-digit : 92cc or 100cc     0 or [1,10);;  2-digit : 134cc             [10,100);;  3-digit : 257cc or 265cc    [100,1000);;  4-digit : 329cc or 330cc    [1000,10000);;  5-digit : 453cc or 454cc or 461cc or 462cc   [10000,65536);;95 bytes    ld a,(hl)    or a    jr nz,err_datatype    ld de,0    inc hl    ld a,(hl)    sub 80h    ld b,a      ;ugh, gotta keep A as the LSB, but we need to get B to be the exponent    ld a,e      ;    ret c    inc hl    ld a,(hl)    jr z,lastdigit2;    ld a,(hl)   ;\ I feel particularly proud of this code, so feel free to use it ^~^    and \$F0     ; |It converts a byte of BCD to an 8-bit int.    rra         ; |The catch is, I only have A and C to use, and (hl) is the BCD    ld c,a      ; |number.    rra         ; |If you come up with faster, please let me know and post it in    rra         ; |the optimized routines thread on popular TI forums.    sub a,c     ; |Algo: Multiply upper digit by -6, add the original byte.    add a,(hl)  ;/ Result is upper_digit*(-6+16)+lower_digit    ld e,a    dec b    ret z    dec b    jr z,lastdigit    inc hl    ex de,hl    ld a,b    sla l    add hl,hl    ld b,h    ld c,l    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,hl    add hl,bc    ex de,hl    ld b,a    ld a,(hl)   ;\ I feel particularly proud of this code, so feel free to use it ^~^    and \$F0     ; |It converts a byte of BCD to an 8-bit int.    rra         ; |The catch is, I only have A and C to use, and (hl) is the BCD    ld c,a      ; |number.    rra         ; |If you come up with faster, please let me know and post it in    rra         ; |the optimized routines thread on popular TI forums.    sub a,c     ; |Algo: Multiply upper digit by -6, add the original byte.    add a,(hl)  ;/ Result is upper_digit*(-6+16)+lower_digit. Ha, repost.;    inc hl      ;    add a,e    ld e,a    jr nc,\$+3    inc d    dec b    ret z    dec b    jr nz,err_domainlastdigit:    inc hl    ld a,(hl)    ld h,d    ld l,e    add hl,hl    add hl,hl    add hl,de    add hl,hl    jr c,err_domain    ex de,hllastdigit2:    rrca    rrca    rrca    rrca    and 15    add a,e    ld e,a    ret nc    inc d    ret nzerr_domain:    bcall(_ErrDomain) err_datatype:     bcall(_ErrDataType)`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on October 16, 2017, 10:19:05 pm
Since I was able to optimize my best 16-bit multiply even further today, I thought I'd share! I even think it can be further optimized. And for that matter, here is my favorite DE_Times_A
 mul16 596.34375cc, 92 bytes (incl. DE_Times_A)
Code: [Select]
`mul16:;Inputs:;   BC,DE are unsigned integers;Output:;   HL:DE is the 32-bit product;Destroys:;   A,B,C;min: 359cc;max: 717cc;avg: 596.34375cc;92 bytes    ld a,c    call DE_Times_A    push hl    push af    ld a,b    call DE_Times_A+2    pop bc    pop de;AHL; BDE    ld c,d    add hl,bc    adc a,0;AHLE    ld d,l    ld l,h    ld h,a;HLDE    retDE_Times_A:;Input: DE,A;Output: A:HL is the product, C=0, B,DE unaffected, z flag set if result is zero, c flag set if A is input as 1, else nc.;A:128~255 219+6{0,10}+{0,19}    avg=258.5   *1/2;A:64~127  203+5{0,10}+{0,19}    avg=237.5   *1/4;A:32~63   187+4{0,10}+{0,19}    avg=216.5   *1/8;A:16~31   171+3{0,10}+{0,19}    avg=195.5   *1/16;A:8~15    155+2{0,10}+{0,19}    avg=174.5   *1/32;A:4~7     139+{0,10}+{0,19}     avg=153.5   *1/64;A:2~3     123+{0,19}            avg=132.5   *1/128;A:1       107cc                 avg=107     *1/256;A:0       119cc                 avg=119     *1/256;overall avg: 237.671875cc    ld c,0    ld h,d    ld l,e    rla \ jr c,mul_07    rla \ jr c,mul_06    rla \ jr c,mul_05    rla \ jr c,mul_04    rla \ jr c,mul_03    rla \ jr c,mul_02    rla \ jr c,mul_01    rla    ret c    ld h,a    ld l,a    retmul_07:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_06:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_05:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_04:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_03:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_02:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_01:    add hl,hl \ rla \ ret nc \ add hl,de \ adc a,c    ret`

DE_Times_A, 237.671875cc, 72 bytes
Code: [Select]
`DE_Times_A:;Input: DE,A;Output: A:HL is the product, C=0, B,DE unaffected, z flag set if result is zero, c flag set if A is input as 1, else nc.;A:128~255 219+6{0,10}+{0,19}    avg=258.5   *1/2;A:64~127  203+5{0,10}+{0,19}    avg=237.5   *1/4;A:32~63   187+4{0,10}+{0,19}    avg=216.5   *1/8;A:16~31   171+3{0,10}+{0,19}    avg=195.5   *1/16;A:8~15    155+2{0,10}+{0,19}    avg=174.5   *1/32;A:4~7     139+{0,10}+{0,19}     avg=153.5   *1/64;A:2~3     123+{0,19}            avg=132.5   *1/128;A:1       107cc                 avg=107     *1/256;A:0       119cc                 avg=119     *1/256;overall avg: 237.671875cc    ld c,0    ld h,d    ld l,e    rla \ jr c,mul_07    rla \ jr c,mul_06    rla \ jr c,mul_05    rla \ jr c,mul_04    rla \ jr c,mul_03    rla \ jr c,mul_02    rla \ jr c,mul_01    rla    ret c    ld h,a    ld l,a    retmul_07:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_06:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_05:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_04:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_03:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_02:    add hl,hl \ rla \ jr nc,\$+4 \ add hl,de \ adc a,cmul_01:    add hl,hl \ rla \ ret nc \ add hl,de \ adc a,c    ret`

Title: Re: ASM Optimized routines
Post by: JamesV on October 16, 2017, 10:50:12 pm

This is great, Xeda!

I recently took a fairly standard Div16/8 routine and modified it slightly to allow the dividend to be signed. To do this, it checks HL (the dividend) at the start, and if it's negative then it negates it back to positive, performs the division, and then negates the result. Is this an efficient way of handling this, or would you suggest something else?

For what it's worth, it performs well enough for the task I was working on, but I'm just curious as I've not delved much into bit-level multiplication or division before :)

Code: [Select]
`; divide HL by C (HL is signed, C is not); output: HL = quotient, A = remainderdivHLbyCs:    bit     7,h    push    af    jr      z,divHLbyCsStart    ld a,h \ cpl \ ld h,a    ld a,l \ cpl \ ld l,a    inc     hldivHLbyCsStart:    xor     a    ld      b,16divHLbyCsLoop:    add     hl,hl    rla    cp      c    jp      c,divHLbyCsNext    sub     c    inc     ldivHLbyCsNext:    djnz    divHLbyCLoop    ld      b,a    pop     af    ld      a,b    ret     z    ld a,h \ cpl \ ld h,a    ld a,l \ cpl \ ld l,a    inc     hl    ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on October 16, 2017, 11:52:22 pm
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.

Code: [Select]
`; divide HL by C (HL is signed, C is not); output: HL = quotient, A = remainderdivHLbyCs:    bit 7,h    push    af    jr      z,divHLbyCsStart    xor a \ sub l \ ld l,a    sbc a,a \ sub h \ ld h,adivHLbyCsStart:    xor     a    ld      b,16divHLbyCsLoop:    add     hl,hl    rla    cp      c    jp      c,divHLbyCsNext    sub     c    inc     ldivHLbyCsNext:    djnz    divHLbyCLoop    ld      b,a    pop     af    ld      a,b    ret     z    xor a \ sub l \ ld l,a    sbc a,a \ sub h \ ld h,a    ld a,c    sub b    ret`I changed the output remainder so that it returns the remainder modulo c. So -7/3 returns 2 instead of 1. This means that no bytes nor clock cycles were saved after all :P
Title: Re: ASM Optimized routines
Post by: JamesV on October 17, 2017, 04:49:18 pm
Thanks for the optimisations/fixes! I guess I was lucky that I don't actually use the remainder in my program, and also C will only ever be between 1-64 inclusive. I like the negate HL optimisation too, that's neat :)
Title: Re: ASM Optimized routines
Post by: Xeda112358 on July 27, 2018, 07:40:53 pm
I have two neat routines, pushpop and diRestore. If you want to preserve HL,DE,BC, and AF, then you can just put a call to pushpop at the start of your routine. If you want interrupts to be restored on exiting your routine, you can call diRestore at the top of your code.

They both mess with the stack to insert a return routine on the stack. For example, when your code calls diRestore, then an ret will first return to the code to restore the interrupt status, and then back to the calling routine.

Code: [Select]
`pushpop:;26 bytes, adds 229cc to the calling routine  ex (sp),hl  push de  push bc  push af  push hl  ld hl,pushpopret  ex (sp),hl  push hl  push af  ld hl,12  add hl,sp  ld a,(hl)  inc hl  ld h,(hl)  ld l,a  pop af  retpushpopret:  pop af  pop bc  pop de  pop hl  ret`
Code: [Select]
`diRestore:    ex (sp),hl    push hl    push af    ld hl,restoreei    ld a,r    jp pe,+_    dec hl    dec hl_:    di    inc sp    inc sp    inc sp    inc sp    ex (sp),hl    dec sp    dec sp    dec sp    dec sp    pop af    retrestoredi:    di    retrestoreei:    ei    ret`Edit: calc84maniac brought up that those 'inc sp' instructions might cause some issues if interrupts fire in between them. The code is modified to disable interrupts first, then mess with the stack.
Title: Re: ASM Optimized routines
Post by: calc84maniac on August 04, 2018, 06:55:30 pm
32-Bit Endian Swap (eZ80)

Swaps the byte order of a 32-bit value stored in EUHL. Can be adapted to work with other register combinations as well.

Code: [Select]
`endianswap32:    push hl    ld h,e    ld e,l    inc sp    push hl    inc sp    pop hl    inc sp    ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on October 16, 2018, 06:33:14 pm
Here is a decent arctangent routine that works on [0,1), so you'll have to do the work to extend it outside that range. It uses a small lookup table and linear interpolation.
Spoiler For Range Reduction:
atan(-x)=-atan(x) is useful to extend to negative numbers
atan(x)=pi/2-atan(1/x) is useful to extend to inputs with magnitude >=1
Code: [Select]
`atan8:;computes 256*atan(A/256)->A;56 bytes including the LUT;min: 246cc;max: 271cc;avg: 258.5cc  rlca  rlca  rlca  ld d,a  and 7  ld hl,atan8LUT  add a,l  ld l,a#if (atan8LUT&255)>248    ;this section not included in size/speed totals  jr nc,\$+3               ;can add three bytes, 12cc to max, 11cc to min, and 11.5cc to avg  inc h#endif  ld c,(hl)  inc hl  ld a,(hl)  sub c  ld e,0  ex de,hl  ld d,l  ld e,a  sla h \ jr nc,\$+3 \ ld l,e  add hl,hl \ jr nc,\$+3 \ add hl,de  add hl,hl \ jr nc,\$+3 \ add hl,de  add hl,hl \ jr nc,\$+3 \ add hl,de  add hl,hl \ jr nc,\$+3 \ add hl,de  add hl,hl  add hl,hl  add hl,hl;  add hl,hl    ;used in rounding...  ld a,h;  rra          ;but doesn't seem to improve the error  adc a,c  retatan8LUT:  .db 0,32,63,92,119,143,165,184,201`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on March 24, 2019, 11:58:29 am
32-bit square root:
Code: [Select]
`sqrtHLIX:;Input: HLIX;Output: DE is the sqrt, AHL is the remainder;speed: 751+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL;min: 1103;max: 1237;avg: 1165.5;166 bytes  call sqrtHL   ;expects returns A as sqrt, HL as remainder, D = 0  add a,a  ld e,a  rl d  ld a,ixh  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e;Now we have four more iterations;The first two are no problem  ld a,ixl  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc esqrt32_iter15:;On the next iteration, HL might temporarily overflow by 1 bit  sll e \ rl d      ;sla e \ rl d \ inc e  add a,a  adc hl,hl  add a,a  adc hl,hl       ;This might overflow!  jr c,sqrt32_iter15_br0;  sbc hl,de  jr nc,+_  add hl,de  dec e  jr sqrt32_iter16sqrt32_iter15_br0:  or a  sbc hl,de_:  inc e;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anywayssqrt32_iter16:  add a,a  ld b,a        ;either 0x00 or 0x80  adc hl,hl  rla  adc hl,hl  rla;AHL - (DE+DE+1)  sbc hl,de \ sbc a,b  inc e  or a  sbc hl,de \ sbc a,b  ret p  add hl,de  adc a,b  dec e  add hl,de  adc a,b  ret`This uses this sqrtHL routine:
Code: [Select]
`;written by ZedasqrtHL:;returns A as the sqrt, HL as the remainder, D = 0;min: 352cc;max: 391cc;avg: 371.5cc  ld de,05040h  ; 10  ld a,h        ; 4  sub e         ; 4  jr nc,sq7     ;\  add a,e       ; | branch 1: 12cc  ld d,16       ; | branch 2: 18ccsq7:            ;/; ----------  cp d          ; 4  jr c,sq6      ;\  sub d         ; | branch 1: 12cc  set 5,d       ; | branch 2: 19ccsq6:            ;/; ----------  res 4,d       ; 8  srl d         ; 8  set 2,d       ; 8  cp d          ; 4  jr c,sq5      ;\  sub d         ; | branch 1: 12cc  set 3,d       ; | branch 2: 19ccsq5:            ;/  srl d         ; 8; ----------  inc a         ; 4  sub d         ; 4  jr nc,sq4     ;\  dec d         ; | branch 1: 12cc  add a,d       ; | branch 2: 19cc  dec d         ; | <-- this resets the low bit of D, so `srl d` resets carry.sq4:            ;/  srl d         ; 8  ld h,a        ; 4; ----------  ld a,e        ; 4  sbc hl,de     ; 15  jr nc,sq3     ;\  add hl,de     ; | 12cc or 18ccsq3:            ;/  ccf           ; 4  rra           ; 4  srl d         ; 8  rra           ; 4; ----------  ld e,a        ; 4  sbc hl,de     ; 15  jr c,sq2      ;\  or 20h        ; | branch 1: 23cc  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.sq2:            ; | branch 2: 21cc  add hl,de     ;/  xor 18h       ; 7  srl d         ; 8  rra           ; 4; ----------  ld e,a        ; 4  sbc hl,de     ; 15  jr c,sq1      ;\  or 8          ; | branch 1: 23cc  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.sq1:            ; | branch 2: 21cc  add hl,de     ;/  xor 6         ; 7  srl d         ; 8  rra           ; 4; ----------  ld e,a        ; 4  sbc hl,de     ; 15  jr nc,+_      ;    \  add hl,de     ; 15  |  srl d         ; 8   |  rra           ; 4   | branch 1: 38cc  ret           ; 10  | branch 2: 40cc_:              ;     |  inc a         ; 4   |  srl d         ; 8   |  rra           ; 4   |  ret           ; 10 /`
sqrtHL was from my work on the float routines, sqrtHLIX was inspired by this (https://www.omnimaga.org/asm-language/(z80)-32-bit-by-16-bits-division-and-32-bit-square-root/msg406942/#msg406942) thread :)

EDIT: Optimized some more :)
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 16, 2019, 11:41:13 pm
Here are some routines that I've added to the repository (https://github.com/Zeda/Z80-Optimized-Routines):

itoa_8
Converts an 8-bit signed integer to an ASCII string.
Code: [Select]
`;Converts an 8-bit signed integer to a stringitoa_8:;Input:;   A is a signed integer;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes);Output:;   The number is converted to a null-terminated string at HL;Destroys:;   Up to five bytes at HL;   All registers preserved.;on 0 to 9:       252       D=0;on 10 to 99:     258+20D   D=0 to 9;on 100 to 127:   277+20D   D=0 to 2;on -1 to -9:     276       D=0;on -10 to -99:   282+20D   D=0 to 9;on -100 to -128: 301+20D   D=0 to 2;min: 252cc  (+23cc over original);max: 462cc  (-49cc over original);avg: 343.74609375cc = 87999/256;54 bytes  push hl  push de  push bc  push af  or a  jp p,itoa_pos  neg  ld (hl),\$1A  ;start if neg char on TI-OS  inc hlitoa_pos:;A is on [0,128];calculate 100s place, plus 1 for a future calculation  ld b,'0'  cp 100 \ jr c,\$+5 \ sub 100 \ inc b;calculate 10s place digit, +1 for future calculation  ld de,\$0A2F  inc e \ sub d \ jr nc,\$-2  ld c,a;Digits are now in D, C, A; strip leading zeros!  ld a,'0'  cp b \ jr z,\$+5 \ ld (hl),b \ inc hl \ .db \$FE  ; start of `cp *` to skip the next byte, turns into `cp \$BB` which will always return nz and nc  cp e \ jr z,\$+4 \ ld (hl),e \ inc hl  add a,c  add a,d  ld (hl),a  inc hl  ld (hl),0  pop af  pop bc  pop de  pop hl  ret`
fixed88_to_string
Uses the itoa_8 routine to convert an 8.8 fixed-point number to a string.
Code: [Select]
`;This converts a fixed-point number to a string.;It displays up to 3 digits after the decimal.fixed88_to_str:;Inputs:;   D.E is the fixed-point number;   HL points to where the string gets output.;      Needs at most 9 bytes.;Outputs:;   HL is preserved;Destroys:;   AF,DE,BC;First check if the input is negative.;If so, write a negative sign and negate  push hl  ld a,d  or a  jp p,+_  ld (hl),\$1A      ;negative sign on TI-OS  inc hl  xor a  sub e  ld e,a  sbc a,a  sub d_:;Our adjusted number is in A.E;Now we can print the integer part  call itoa_8;Check if we need to print the fractional part  xor a  cp e  jr z,fixed88_to_str_end;We need to write the fractional part, so seek the end of the string;Search for the null byte. A is already 0  cpir;Write a decimal  dec hl  ld (hl),'.'  ld b,3_:;Multiply E by 10, converting overflow to an ASCII digit  call fixed88_to_str_e_times_10  inc hl  ld (hl),a  djnz -_;Strip the ending zeros  ld a,'0'_:  cp (hl)  dec hl  jr z,-_;write a null byte  inc hl  inc hl  ld (hl),0fixed88_to_str_end:;restore HL  pop hl  retfixed88_to_str_e_times_10:  ld a,e  ld d,0  add a,a \ rl d  add a,a \ rl d  add a,e \ jr nc,\$+3 \ inc d  add a,a  ld e,a  ld a,d  rla  add a,'0'  ret`
sqrtA
This is a very fast, unrolled routine to compute the square root of A.

Code: [Select]
`sqrtA:;Input: A;Output: D is the square root, A is the remainder (input-D^2);Destroys: BC;speed: 161+{0,6}+{0,1}+{0,1}+{0,3};min: 161cc;max: 172cc;avg: 166.5cc;45 bytes  ld d,\$40  sub d  jr nc,+_  add a,d  ld d,0_:  set 4,d  sub d  jr nc,+_  add a,d  .db \$01   ;start of ld bc,** which is 10cc to skip the next two bytes._:  set 5,d  res 4,d  srl d  set 2,d  sub d  jr nc,+_  add a,d  .db \$01   ;start of ld bc,** which is 10cc to skip the next two bytes._:  set 3,d  res 2,d  srl d  inc d  sub d  jr nc,+_  add a,d  dec d_:  inc d  srl d  ret`
sqrtfixed_88
An unrolled, fast 8.8 fixed-point square root routine. Uses the above sqrtA routine.
Code: [Select]
`sqrtfixed_88:;Input: A.E ==> D.E;Output: DE is the sqrt, AHL is the remainder;Speed: 690+6{0,13}+{0,3+{0,18}}+{0,38}+sqrtA;min: 855cc;max: 1003cc;avg: 924.5cc;152 bytes  call sqrtA  ld l,a  ld a,e  ld h,0  ld e,d  ld d,h  sla e  rl d  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e;Now we have four more iterations;The first two are no problem  sll e \ rl d  add hl,hl  add hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc e  sll e \ rl d  add hl,hl  add hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db \$FE     ;start of `cp *`_:  inc esqrtfixed_88_iter11:;On the next iteration, HL might temporarily overflow by 1 bit  sll e \ rl d      ;sla e \ rl d \ inc e  add hl,hl  add hl,hl  jr c,sqrtfixed_88_iter11_br0;  sbc hl,de  jr nc,+_  add hl,de  dec e  jr sqrtfixed_88_iter12sqrtfixed_88_iter11_br0:  or a  sbc hl,de_:  inc e;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anywayssqrtfixed_88_iter12:  ld b,a      ;A is 0, so B is 0  add hl,hl  add hl,hl  rla;AHL - (DE+DE+1)  sbc hl,de \ sbc a,b  inc e  or a  sbc hl,de \ sbc a,b  ret p  add hl,de  adc a,b  dec e  add hl,de  adc a,b  ret`
ncr_HL_DE
Computes 'HL choose DE' in such a way so that overflow only occurs if the final result overflows 16 bits.
Code: [Select]
`; Requires;    mul16          ;BC*DE ==> DEHL;    DEHL_Div_BC    ;DEHL/BC ==> DEHLncr_HL_DE:;"n choose r", defined as n!/(r!(n-r)!);Computes "HL choose DE";Inputs: HL,DE;Outputs:;   HL is the result;       "HL choose DE";   carry flag reset means overflow;Destroys:;   A,BC,DE,IX;Notes:;   Overflow is returned as 0;   Overflow happens if HL choose DE exceeds 65535;   This algorithm is constructed in such a way that intermediate;   operations won't erroneously trigger overflow.;66 bytes  ld bc,1  or a  sbc hl,de  jr c,ncr_oob  jr z,ncr_exit  sbc hl,de  add hl,de  jr c,\$+3  ex de,hl  ld a,h  or l  push hl  pop ixncr_exit:  ld h,b  ld l,c  scf  ret zncr_loop:  push bc \ push de  push hl \ push bc  ld b,h  ld c,l  call mul16          ;BC*DE ==> DEHL  pop bc  call DEHL_Div_BC    ;result in DEHL  ld a,d  or e  pop bc  pop de  jr nz,ncr_overflow  add hl,bc  jr c,ncr_overflow  pop bc  inc bc  ld a,b  cp ixh  jr c,ncr_loop  ld a,ixl  cp c  jr nc,ncr_loop  retncr_overflow:  pop bc  xor a  ld b,ancr_oob:  ld h,b  ld l,b  ret`
EDIT: Optimized itoa_8 above. Here are some more routines:
uitoa_8
Converts an 8-bit unsigned integer to an ASCII string.
Code: [Select]
`;Converts an 8-bit unsigned integer to a stringuitoa_8:;Input:;   A is a signed integer;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes);Output:;   The number is converted to a null-terminated string at HL;Destroys:;   Up to four bytes at HL;   All registers preserved.;on 0 to 9:     238              D=0;on 10 to 99:   244+20D          D=0 to 9;on 100 to 255: 257+2{0,6}+20D   D=0 to 5;min: 238cc;max: 424cc;avg: 317.453125cc = 81268/256 = (238*10 + 334*90+313*156)/256;52 bytes  push hl  push de  push bc  push af;A is on [0,255];calculate 100s place, plus 1 for a future calculation  ld b,'0'  cp 100 \ jr c,\$+5 \ sub 100 \ inc b  cp 100 \ jr c,\$+5 \ sub 100 \ inc b;calculate 10s place digit, +1 for future calculation  ld de,\$0A2F  inc e \ sub d \ jr nc,\$-2  ld c,a;Digits are now in D, C, A; strip leading zeros!  ld a,'0'  cp b \ jr z,\$+5 \ ld (hl),b \ inc hl \ .db \$FE  ; start of `cp *` to skip the next byte, turns into `cp \$BB` which will always return nz and nc  cp e \ jr z,\$+4 \ ld (hl),e \ inc hl  add a,c  add a,d  ld (hl),a  inc hl  ld (hl),0  pop af  pop bc  pop de  pop hl  ret`
itoa_16
Converts a 16-bit signed integer to an ASCII string.
Code: [Select]
`;Converts a 16-bit signed integer to an ASCII string.itoa_16:;Input:;   DE is the number to convert;   HL points to where to write the ASCII string (up to 7 bytes needed).;Output:;   HL points to the null-terminated ASCII string;      NOTE: This isn't necessarily the same as the input HL.  push de  push bc  push af  push hl  bit 7,d  jr z,+_  xor a  sub e  ld e,a  sbc a,a  sub d  ld d,a  ld (hl),\$1A     ;negative char on TI-OS  inc hl_:  ex de,hl  ld bc,-10000  ld a,'0'-1  inc a \ add hl,bc \ jr c,\$-2  ld (de),a  inc de  ld bc,1000  ld a,'9'+1  dec a \ add hl,bc \ jr nc,\$-2  ld (de),a  inc de  ld bc,-100  ld a,'0'-1  inc a \ add hl,bc \ jr c,\$-2  ld (de),a  inc de  ld a,l  ld h,'9'+1  dec h \ add a,10 \ jr nc,\$-3  add a,'0'  ex de,hl  ld (hl),d  inc hl  ld (hl),a  inc hl  ld (hl),0;No strip the leading zeros  pop hl;If the first char is a negative sign, skip it  ld a,(hl)  cp \$1A  push af  ld a,'0'  jr nz,\$+3  inc hl  cp (hl)  jr z,\$-2;Check if we need to re-write the negative sign  pop af  jr nz,+_  dec hl  ld (hl),a_:  pop af  pop bc  pop de  ret`
uitoa_16
Converts a 16-bit unsigned integer to an ASCII string.
Code: [Select]
`;Converts a 16-bit unsigned integer to an ASCII string.uitoa_16:;Input:;   DE is the number to convert;   HL points to where to write the ASCII string (up to 6 bytes needed).;Output:;   HL points to the null-terminated ASCII string;      NOTE: This isn't necessarily the same as the input HL.  push de  push bc  push af  ex de,hl  ld bc,-10000  ld a,'0'-1  inc a \ add hl,bc \ jr c,\$-2  ld (de),a  inc de  ld bc,1000  ld a,'9'+1  dec a \ add hl,bc \ jr nc,\$-2  ld (de),a  inc de  ld bc,-100  ld a,'0'-1  inc a \ add hl,bc \ jr c,\$-2  ld (de),a  inc de  ld a,l  ld h,'9'+1  dec h \ add a,10 \ jr nc,\$-3  add a,'0'  ex de,hl  ld (hl),d  inc hl  ld (hl),a  inc hl  ld (hl),0;No strip the leading zeros  ld c,-6  add hl,bc  ld a,'0'  inc hl \ cp (hl) \ jr z,\$-2  pop af  pop bc  pop de  ret`
Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 28, 2019, 09:33:29 am
Here is a masked sprite routine (no clipping)! Interleave the data with the mask, where the MASK is ANDed with the buffer, and the data is ORed on top of that:
Code: [Select]
`;Masked Sprite routineputsprite_masked:;Inputs:;   (A,L) = (x,y);   B is height;   IX points to the sprite data;       first byte is the data;       second byte is mask;       continues, alternating like this.;;Outputs:;   Mask is ANDed to the buffer, then data is ORed on top of that.;;Destroys:;   AF, BC, DE, HL, IX;;Notes:;   To set a pixel...;     black: mask is any, data is 1;     white: mask is 0, data is 0;     clear: mask is 1, data is 0 (keeps the data from the buffer);;This routine is free to use :);65 bytes (or 66 bytes if gbuf is not located at 0x**40  ld e,l  ld h,0  ld d,h  add hl,hl  add hl,de  add hl,hl  add hl,hl  ld e,a  and 7  ld c,a  xor e  ;essentially gets E with the bottom 3 bits reset#if (plotSScreen&255) = 64  inc a  rra  rra  rra  ld e,a  ld d,plotSScreen>>8#else  rra  rra  rra  ld e,a  add hl,de  ld de,plotSScreen#endif  add hl,deputsprite_masked_loop:  push bc  xor a  ld d,(ix)  ld e,a  sub c  ld b,c  ld c,\$FF  inc ix  ld a,(ix)  jr z,putsprite_masked_rotdoneputsprite_masked_rot:  scf  rra  rr c  srl d  rr e  djnz putsprite_masked_rotputsprite_masked_rotdone:  and (hl)  or d  ld (hl),a  inc hl  ld a,(hl)  and c  or e  ld (hl),a  ld c,11  add hl,bc  inc ix  pop bc  djnz putsprite_masked_loop  ret`But if you want even faster and smaller, use a non-traditional mask technique by ORing the mask onto the buffer, then XORing the data on top of it. The format is less intuitive, but it allows for white/black/clear/invert instead of just white/black/clear:
Code: [Select]
`;Masked Sprite routineputsprite_masked:;Inputs:;   (A,L) = (x,y);   B is height;   IX points to the sprite data;       first byte is the data;       second byte is mask;       continues, alternating like this.;;Outputs:;   Mask is ORed to the buffer, then data is XORed on top of that.;;Destroys:;   AF, BC, DE, HL, IX;;Notes:;   To set a pixel...;     black: mask is 1, data is 0;     white: mask is 1, data is 1;     clear: mask is 0, data is 0 (keeps the data from the buffer);     invert: mask is 0, data is 1 (inverts the data from the buffer);;This routine is free to use :);63 bytes (or 64 bytes if gbuf is not located at 0x**40  ld e,l  ld h,0  ld d,h  add hl,hl  add hl,de  add hl,hl  add hl,hl  ld e,a  and 7  ld c,a  xor e  ;essentially gets E with the bottom 3 bits reset#if (plotSScreen&255) = 64  inc a  rra  rra  rra  ld e,a  ld d,plotSScreen>>8#else  rra  rra  rra  ld e,a  add hl,de  ld de,plotSScreen#endif  add hl,deputsprite_masked_loop:  push bc  xor a  ld d,(ix)  ld e,a  or c  ld b,c  ld c,e  inc ix  ld a,(ix)  jr z,putsprite_masked_rotdoneputsprite_masked_rot:  rra  rr c  srl d  rr e  djnz putsprite_masked_rotputsprite_masked_rotdone:  or (hl)  xor d  ld (hl),a  inc hl  ld a,(hl)  or c  xor e  ld (hl),a  ld c,11  add hl,bc  inc ix  pop bc  djnz putsprite_masked_loop  ret`

I also made some "bigsprite" routines! These do have clipping, too. First, they use some common subroutines for computing masks and performing most of the clipping and shifting:
Code: [Select]
`;133 bytes total;This is made by Zeda, feel free to use it for whatever.;Takes inputs for a big sprite and sets up masks and clipping;requires 4 bytes of temporary RAM, but doesn't use SMCspritetmp = 8000h     ;relocate this as needed! Just need 4 bytes.sprite_width  = spritetmp+0sprite_x      = spritetmp+1sprite_mask0  = spritetmp+2sprite_mask1  = spritetmp+3bigsprite_subroutine:;Inputs:;     B is the X-coordinate;     C is the Y-Coordinate;     DE points to the sprite;     H is the height;     L is the width in bytes;Outputs:;   carry flag is set if okay to draw, nc if out-of-bounds.;   B is height.;   C is width.;   HL points to the byte to start drawing at.;   DE points to where to start sourcing the sprite data;   (sprite_width) is the width of the sprite in bytes;   (sprite_x) is the intitial x coordinate to begin drawing at;   (sprite_mask0) is the left mask;   (sprite_mask1) is the right mask;92 bytes;First check if the sprite is on-screen in the horizontal direction  ld a,c  cp 64  jr c,+_  add a,h  ret nc  ld h,a  push hl  xor a  ld h,a  sub c  ex de,hl  add hl,de  dec a  jr nz,\$-2  ex de,hl  pop hl  xor a  ld c,a_:;Next check h+c<=64  ld a,64  sub c  cp h  jr nc,+_  ld h,a_:;Make sure the height is not now 0  ld a,h  or a  ret z;Save the width and height of the sprite  push hl               ;height,width  ld h,b  ld (sprite_width),hl  ;x,width  push de               ;sprite pointer;Set up a pointer to the routine for shifting the  routine for shifting the sprite data  ld ixh,rshiftHA_7>>8  ld a,h  cpl  and 7  ld l,a  add a,a  add a,l  add a,rshiftHA_7&255  ld ixl,a#if (rshiftHA_7&255)>234  jr nc,\$+4  inc ixh#endif  ld a,b  and 7  ld de,spritemask  add a,e  ld e,a#if spritemask&255>248  jr nc,\$+3  inc d#endif  ld a,(de)  ld (sprite_mask0),a  cpl  ld (sprite_mask1),a;;  ld a,c  add a,a  sbc a,a  ld h,a  ld a,b  ld b,h  ld l,c  add hl,hl  add hl,bc  add hl,hl  add hl,hl  ld c,a  add a,a  sbc a,a  ld b,a  ld a,c  sra c  sra c  sra c  add hl,bc  ld bc,plotSScreen  add hl,bc  pop de  pop bc  ;B is height  ;C is width  ex de,hl  scf  retrshiftHA_7:  rr h \ rra  rr h \ rra  rr h \ rra  rr h \ rra  rr h \ rra  rr h \ rra  rr h \ rra  ex de,hl  ld e,a  retspritemask:  .db \$00,\$80,\$C0,\$E0,\$F0,\$F8,\$FC,\$FEcall_ix:  jp (ix)`Then you can draw a big sprite with OR logic:
Code: [Select]
`bigsprite_OR:;Inputs:;     B is the X-coordinate;     C is the Y-Coordinate;     DE points to the sprite;     H is the height;     L is the width in bytes;68 bytes;Set up the clipping  call bigsprite_subroutine  ret ncbigsprite_OR_loop:  push bc   ;height,width  push de   ;gbuf ptr  push hl   ;sprite data pointer  ld a,(sprite_x)  ld c,a  add a,8  ld (sprite_x),aspriteloop_OR:  push bc  push hl  ld h,(hl)  xor a  call call_ix  ld a,c  cp 96  jr nc,+_  ld a,(hl)  or d  ld (hl),a  ld a,c_:  inc hl  add a,8  cp 96  jr nc,+_  ld a,(sprite_mask1)  ld a,(hl)  or e  ld (hl),a_:  ld bc,11  add hl,bc  ex de,hl  pop hl  ld a,(sprite_width)  ld c,a  add hl,bc  pop bc  djnz spriteloop_OR  pop hl  inc hl  pop de  inc de  pop bc  dec c  jr nz,bigsprite_OR_loop  ret`Or draw with XOR logic:
Code: [Select]
`bigsprite_XOR:;Inputs:;     B is the X-coordinate;     C is the Y-Coordinate;     DE points to the sprite;     H is the height;     L is the width in bytes;68 bytes;Set up the clipping  call bigsprite_subroutine  ret ncbigsprite_XOR_loop:  push bc   ;height,width  push de   ;gbuf ptr  push hl   ;sprite data pointer  ld a,(sprite_x)  ld c,a  add a,8  ld (sprite_x),aspriteloop_XOR:  push bc  push hl  ld h,(hl)  xor a  call call_ix  ld a,c  cp 96  jr nc,+_  ld a,(hl)  xor d  ld (hl),a  ld a,c_:  inc hl  add a,8  cp 96  jr nc,+_  ld a,(sprite_mask1)  ld a,(hl)  xor e  ld (hl),a_:  ld bc,11  add hl,bc  ex de,hl  pop hl  ld a,(sprite_width)  ld c,a  add hl,bc  pop bc  djnz spriteloop_XOR  pop hl  inc hl  pop de  inc de  pop bc  dec c  jr nz,bigsprite_XOR_loop  ret`Or draw with AND logic:
Code: [Select]
`bigsprite_AND:;Inputs:;     B is the X-coordinate;     C is the Y-Coordinate;     DE points to the sprite;     H is the height;     L is the width in bytes;69 bytes;Set up the clipping  call bigsprite_subroutine  ret ncbigsprite_AND_loop:  push bc   ;height,width  push de   ;gbuf ptr  push hl   ;sprite data pointer  ld a,(sprite_x)  ld c,a  add a,8  ld (sprite_x),aspriteloop_AND:  push bc  push hl  ld h,(hl)  scf \ sbc a,a  call call_ix  ld a,c  cp 96  jr nc,+_  ld a,(hl)  and d  ld (hl),a  ld a,c_:  inc hl  add a,8  cp 96  jr nc,+_  ld a,(sprite_mask1)  ld a,(hl)  and e  ld (hl),a_:  ld bc,11  add hl,bc  ex de,hl  pop hl  ld a,(sprite_width)  ld c,a  add hl,bc  pop bc  djnz spriteloop_AND  pop hl  inc hl  pop de  inc de  pop bc  dec c  jr nz,bigsprite_AND_loop  ret`Or draw with Erase logic:
Code: [Select]
`bigsprite_Erase:;Inputs:;     B is the X-coordinate;     C is the Y-Coordinate;     DE points to the sprite;     H is the height;     L is the width in bytes;67 bytes;Set up the clipping  call bigsprite_subroutine  ret ncbigsprite_Erase_loop:  push bc   ;height,width  push de   ;gbuf ptr  push hl   ;sprite data pointer  ld a,(sprite_x)  ld c,a  add a,8  ld (sprite_x),aspriteloop_Erase:  push bc  push hl  ld h,(hl)  xor a  call call_ix  ld a,c  cp 96  jr nc,+_  ld a,d  cpl  and (hl)  ld (hl),a  ld a,c_:  inc hl  add a,8  cp 96  jr nc,+_  ld a,e  cpl  and (hl)  ld (hl),a_:  ld bc,11  add hl,bc  ex de,hl  pop hl  ld a,(sprite_width)  ld c,a  add hl,bc  pop bc  djnz spriteloop_Erase  pop hl  inc hl  pop de  inc de  pop bc  dec c  jr nz,bigsprite_Erase_loop  ret`Or draw with Overwrite logic:
Code: [Select]
`bigsprite_Overwrite:;Inputs:;     B is the X-coordinate;     C is the Y-Coordinate;     DE points to the sprite;     H is the height;     L is the width in bytes;71 bytes;Set up the clipping  call bigsprite_subroutine  ret ncbigsprite_Overwrite_loop:  push bc   ;height,width  push de   ;gbuf ptr  push hl   ;sprite data pointer  ld a,(sprite_x)  ld c,a  add a,8  ld (sprite_x),aspriteloop_Overwrite:  push bc  push hl  ld h,(hl)  xor a  call call_ix  ld a,c  cp 96  jr nc,+_  ld a,(sprite_mask0)  and (hl)  or d  ld (hl),a  ld a,c_:  inc hl  add a,8  cp 96  jr nc,+_  ld a,(sprite_mask1)  and (hl)  or e  ld (hl),a_:  ld bc,11  add hl,bc  ex de,hl  pop hl  ld a,(sprite_width)  ld c,a  add hl,bc  pop bc  djnz spriteloop_Overwrite  pop hl  inc hl  pop de  inc de  pop bc  dec c  jr nz,bigsprite_Overwrite_loop  ret`
Title: Re: ASM Optimized routines
Post by: SpiroH on August 28, 2019, 10:07:05 am
@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language :/.

Title: Re: ASM Optimized routines
Post by: Xeda112358 on August 28, 2019, 10:11:55 am
@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language :/.

Thanks :) I wrote these with apps in mind, so I tried to reduce the need for external RAM. I should definitely make versions that take full advantage of SMC, though!
Title: Re: ASM Optimized routines
Post by: Xeda112358 on September 10, 2019, 06:18:29 pm
Here is a circle routine that I made!
Code: [Select]
`;Written by Zeda Thomas, free to use.;This draws a circle centered at 8-bit coordinates and with radius up to 127.;IX points to a `plot` routine that takes (B,C)=(x,y) as input and does something;with it, like plot the pixel a certain color, or plot a "big" pixel, or whatever.;   plot;     Takes coordinates, (B,C) = (x,y) and plots the point.;;For example, on the TI-83+/84+/SE the plot routine might look like:; plot:;   call getpixelloc;   ret nc            ;Exit if the coordinates are out-of-bounds;   or (hl);   ld (hl),a;   ret;;; Required subroutines:;     call_ix:;       jp (ix)circle:;Input:; (B,C) is the center (x,y); D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).; IX points to a `plot` routine that takes (B,C)=(x,y) as input.  ld a,d  add a,a  ret c  ret z  ld l,d  dec a  ld e,a  dec a        ;if the pixel is only 1 wide, just plot the point  jp z,call_ix ;Jump to the plot routine  xor a  ld h,-1  ld d,1  scf     ;skip the first plotcircleloop:  call nc,plot4  inc h  sub d  inc d  inc d  jr nc,circleloop_:  dec l  call plot4  add a,e  dec e  ret z  dec e  jr nc,-_  jp circleloopplot4:;BC is center;HL is x,y  push de  push af  push hl  push bc;If H is 0, or L is 0, we need to draw only half  push hl  ld a,b  sub h  ld b,a  add a,h  add a,h  ld h,a  ld a,c  sub l  ld c,a  add a,l  add a,l  ld l,a;B is x0-x;C is y0-y;H is x0+x;L is y0+y;plot(x0-x,y0-y);plot(x0+x,y0+y)  push bc  push hl  call call_ix    ;call the plot routine  pop bc  push bc  call call_ix    ;call the plot routine;now swap the y coords  pop hl  pop bc  ld a,l  ld l,c  ld c,a  pop de  xor a  cp d  jr z,+_  cp e  jr z,+_;plot(x0-x,y0+y);plot(x0+x,y0-y)  push hl  call call_ix    ;call the plot routine  pop bc  call call_ix    ;call the plot routine_:  pop bc  pop hl  pop af  pop de  ret`The really cool feature about this is that you can define a custom plot routine pointed to by IX, so it isn't TI-specific, and you can do all sorts of wonky things like:
Draw 2x2 pixels:
(https://i.imgur.com/qJAWeQl.png)
Code: [Select]
`;calling with `ld ix,pixelOn_2x2`pixelOn_2x2:  sla b  ret c  sla c  ret c  push bc  call pixelOn  pop bc  inc b  push bc  call pixelOn  pop bc  inc c  push bc  call pixelOn  pop bc  dec b  jp pixelOn`
Or draw a circle whose "pixels" are circles:
(https://i.imgur.com/vyHQXaE.png)
Code: [Select]
`;calling with `ld ix,pixelOn_circle`pixelOn_circle:  ld a,b  cp 32  ret nc  add a,a  add a,a  ld b,a  ld a,c  cp 32  ret nc  add a,a  add a,a  ld c,a  ld d,4  push ix    ;need to save IX!  ld ix,pixelOn  call circle  pop ix  ret`EDIT: I inlined some subroutines because there was no reason to have them called. It was a waste of clock cycles and space!
EDIT: Have a separate, filled rectangle routine!
Note that if you pass the same arguments as the regular circle routine, this only draws the inside part  and skips the border.
Code: [Select]
`;Written by Zeda Thomas, free to use.;This draws the fill of a circle centered at 8-bit coordinates and with radius;up to 127.;IX points to a `horizontal line` routine that takes E=x, A=y, D=width as input;and does something with it, like plot a horizontal line.;; For example, on the ti-83+/84+/SE calculators, you might have:;     horizontal_line:;       ld b,e;       ld c,a;       ld e,1;       ld hl,gbuf;       jp rectOR; Required subroutines:;     call_ix:;       jp (ix)filledcircle:;Input:; (B,C) is the center (x,y); D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).; IX points to a `plot` routine that takes (B,C)=(x,y) as input.  ld a,d  add a,a  ret c  ret z  ld l,d  dec a  ld e,a  xor a  ld h,-1  ld d,1filledcircleloop:  ; call c,fillcircle_plot  inc h  sub d  inc d  inc d  jr nc,filledcircleloop_:  dec l  call fillcircle_plot  add a,e  dec e  ret z  dec e  jr nc,-_  jp filledcircleloopfillcircle_plot:  inc h  dec h  ret z  push hl  push de  push bc  push af  dec h  ld a,b  sub h  ld e,a  ld d,h  sll d   ;aka `slia`, undocumented  ld a,l  or a  ld h,c  jr z,+_  add a,h  push de  push hl  call nz,call_ix  pop hl  pop de_:  ld a,h  sub l  call call_ix  pop af  pop bc  pop de  pop hl  ret`