### Author Topic: ASM Optimized routines  (Read 43685 times)

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #90 on: October 16, 2017, 11:52:22 pm »
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.

Code: [Select]
; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
    bit 7,h
    push    af
    jr      z,divHLbyCsStart
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
divHLbyCsStart:
    xor     a
    ld      b,16
divHLbyCsLoop:
    add     hl,hl
    rla
    cp      c
    jp      c,divHLbyCsNext
    sub     c
    inc     l
divHLbyCsNext:
    djnz    divHLbyCsLoop
    ld      b,a
    pop     af
    ld      a,b
    ret     z
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld a,c
    sub b
    ret

I changed the output remainder so that it returns the remainder modulo c. So -7/3 returns 2 instead of 1. This means that no bytes nor clock cycles were saved after all.
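For anyone who wants to check values off-calc, here is a rough Python model of the listing as posted: sign check before, unsigned restoring division, sign fix-up after. The function name and the assertion limits are my own assumptions, and C is restricted to 1..127 because of the caveat above.

```python
def div_hl_by_c_signed(hl: int, c: int):
    """Model of the posted routine: returns (quotient, remainder)."""
    assert -32768 <= hl <= 32767 and 1 <= c <= 127
    negative = hl < 0
    n = -hl if negative else hl          # sign check before the loop
    q = r = 0
    for bit in range(15, -1, -1):        # 16 iterations, like ld b,16
        r = (r << 1) | ((n >> bit) & 1)  # add hl,hl \ rla
        q <<= 1
        if r >= c:                       # cp c
            r -= c                       # sub c
            q |= 1                       # inc l
    if not negative:
        return q, r
    return -q, c - r                     # negate HL, then ld a,c \ sub b


print(div_hl_by_c_signed(-7, 3))
```

Note that, exactly as in the listing, a negative dividend that divides evenly comes back with remainder c rather than 0.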

#### JamesV

• Posts: 266
• Rating: +76/-0
##### Re: ASM Optimized routines
« Reply #91 on: October 17, 2017, 04:49:18 pm »
Thanks for the optimisations/fixes! I guess I was lucky that I don't actually use the remainder in my program, and also C will only ever be between 1-64 inclusive. I like the negate HL optimisation too, that's neat

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #92 on: July 27, 2018, 07:40:53 pm »
I have two neat routines, pushpop and diRestore. If you want to preserve HL,DE,BC, and AF, then you can just put a call to pushpop at the start of your routine. If you want interrupts to be restored on exiting your routine, you can call diRestore at the top of your code.

They both mess with the stack to insert a return routine on the stack. For example, when your code calls diRestore, then a ret will first return to the code to restore the interrupt status, and then back to the calling routine.

Code: [Select]
pushpop:
;26 bytes, adds 229cc to the calling routine
  ex (sp),hl
  push de
  push bc
  push af
  push hl
  ld hl,pushpopret
  ex (sp),hl
  push hl
  push af
  ld hl,12
  add hl,sp
  ld a,(hl)
  inc hl
  ld h,(hl)
  ld l,a
  pop af
  ret
pushpopret:
  pop af
  pop bc
  pop de
  pop hl
  ret
Code: [Select]
diRestore:
    ex (sp),hl
    push hl
    push af
    ld hl,restoreei
    ld a,r
    jp pe,+_
    dec hl
    dec hl
_:
    di
    inc sp
    inc sp
    inc sp
    inc sp
    ex (sp),hl
    dec sp
    dec sp
    dec sp
    dec sp
    pop af
    ret
restoredi:
    di
    ret
restoreei:
    ei
    ret

Edit: calc84maniac brought up that those 'inc sp' instructions might cause some issues if interrupts fire in between them. The code is modified to disable interrupts first, then mess with the stack.
« Last Edit: July 28, 2018, 09:34:41 am by Xeda112358 »

#### calc84maniac

• eZ80 Guru
• Coder Of Tomorrow
• LV11 Super Veteran (Next: 3000)
• Posts: 2897
• Rating: +467/-17
##### Re: ASM Optimized routines
« Reply #93 on: August 04, 2018, 06:55:30 pm »
32-Bit Endian Swap (eZ80)

Swaps the byte order of a 32-bit value stored in EUHL. Can be adapted to work with other register combinations as well.

Code: [Select]
endianswap32:
    push hl
    ld h,e
    ld e,l
    inc sp
    push hl
    inc sp
    pop hl
    inc sp
    ret
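To check expected results on a PC, the swap can be modelled in a couple of lines of Python. This only shows what the value should look like before and after; the function name is an assumption and it says nothing about how the eZ80 registers are shuffled.

```python
def endian_swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned value."""
    return int.from_bytes((value & 0xFFFFFFFF).to_bytes(4, "little"), "big")


print(hex(endian_swap32(0x12345678)))
```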
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #94 on: October 16, 2018, 06:33:14 pm »
Here is a decent arctangent routine that works on [0,1), so you'll have to do the work to extend it outside that range. It uses a small lookup table and linear interpolation.
Spoiler For Range Reduction:
atan(-x)=-atan(x) is useful to extend to negative numbers
atan(x)=pi/2-atan(1/x) is useful to extend to inputs with magnitude >=1
Code: [Select]
atan8:
;computes 256*atan(A/256)->A
;56 bytes including the LUT
;min: 246cc
;max: 271cc
;avg: 258.5cc
  rlca
  rlca
  rlca
  ld d,a
  and 7
  ld hl,atan8LUT
  add a,l
  ld l,a
#if (atan8LUT&255)>248    ;this section not included in size/speed totals
  jr nc,$+3    ;can add three bytes, 12cc to max, 11cc to min, and 11.5cc to avg
  inc h
#endif
  ld c,(hl)
  inc hl
  ld a,(hl)
  sub c
  ld e,0
  ex de,hl
  ld d,l
  ld e,a
  sla h \ jr nc,$+3 \ ld l,e
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl
  add hl,hl
  add hl,hl
;  add hl,hl    ;used in rounding...
  ld a,h
;  rra          ;but doesn't seem to improve the error
  adc a,c
  ret
atan8LUT:
  .db 0,32,63,92,119,143,165,184,201
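The lookup-plus-interpolation idea and the two range-reduction identities can be sketched in Python. This is a model for checking values, not a transcription of the asm: the function names are assumptions, and the rounding in the last bit may differ from the routine.

```python
import math

# round(256*atan(k/8)) for k = 0..8, same values as atan8LUT
ATAN8_LUT = [0, 32, 63, 92, 119, 143, 165, 184, 201]


def atan8_model(a: int) -> int:
    """Approximate 256*atan(a/256) for a in 0..255."""
    index, frac = a >> 5, a & 31      # top 3 bits pick the segment, low 5 bits interpolate
    lo, hi = ATAN8_LUT[index], ATAN8_LUT[index + 1]
    return lo + ((hi - lo) * frac >> 5)


def _atan_unit(x: float) -> float:
    """atan for 0 <= x <= 1 in radians, going through the 8-bit model."""
    return atan8_model(min(255, int(x * 256))) / 256


def atan_full(x: float) -> float:
    """Extend to any real x with the two range-reduction identities."""
    if x < 0:
        return -atan_full(-x)                    # atan(-x) = -atan(x)
    if x >= 1:
        return math.pi / 2 - _atan_unit(1 / x)   # atan(x) = pi/2 - atan(1/x)
    return _atan_unit(x)


print(atan8_model(128), atan_full(2.0))
```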

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #95 on: March 24, 2019, 11:58:29 am »
32-bit square root:
Code: [Select]
sqrtHLIX:
;Input: HLIX
;Output: DE is the sqrt, AHL is the remainder
;speed: 751+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL
;min: 1103
;max: 1237
;avg: 1165.5
;166 bytes
  call sqrtHL   ;expects sqrtHL to return A as sqrt, HL as remainder, D = 0
  add a,a
  ld e,a
  rl d
  ld a,ixh
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
;Now we have four more iterations
;The first two are no problem
  ld a,ixl
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
sqrt32_iter15:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrt32_iter16
sqrt32_iter15_br0:
  or a
  sbc hl,de
_:
  inc e
;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  ld b,a        ;either 0x00 or 0x80
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

This uses this sqrtHL routine:
Code: [Select]
;written by Zeda
sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 352cc
;max: 391cc
;avg: 371.5cc
  ld de,05040h  ; 10
  ld a,h        ; 4
  sub e         ; 4
  jr nc,sq7     ;\
  add a,e       ; | branch 1: 12cc
  ld d,16       ; | branch 2: 18cc
sq7:            ;/
; ----------
  cp d          ; 4
  jr c,sq6      ;\
  sub d         ; | branch 1: 12cc
  set 5,d       ; | branch 2: 19cc
sq6:            ;/
; ----------
  res 4,d       ; 8
  srl d         ; 8
  set 2,d       ; 8
  cp d          ; 4
  jr c,sq5      ;\
  sub d         ; | branch 1: 12cc
  set 3,d       ; | branch 2: 19cc
sq5:            ;/
  srl d         ; 8
; ----------
  inc a         ; 4
  sub d         ; 4
  jr nc,sq4     ;\
  dec d         ; | branch 1: 12cc
  add a,d       ; | branch 2: 19cc
  dec d         ; | <-- this resets the low bit of D, so srl d resets carry.
sq4:            ;/
  srl d         ; 8
  ld h,a        ; 4
; ----------
  ld a,e        ; 4
  sbc hl,de     ; 15
  jr nc,sq3     ;\
  add hl,de     ; | 12cc or 18cc
sq3:            ;/
  ccf           ; 4
  rra           ; 4
  srl d         ; 8
  rra           ; 4
; ----------
  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq2      ;\
  or 20h        ; | branch 1: 23cc
  db 254        ; |   <-- start of cp * which is 7cc to skip the next byte.
sq2:            ; | branch 2: 21cc
  add hl,de     ;/
  xor 18h       ; 7
  srl d         ; 8
  rra           ; 4
; ----------
  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq1      ;\
  or 8          ; | branch 1: 23cc
  db 254        ; |   <-- start of cp * which is 7cc to skip the next byte.
sq1:            ; | branch 2: 21cc
  add hl,de     ;/
  xor 6         ; 7
  srl d         ; 8
  rra           ; 4
; ----------
  ld e,a        ; 4
  sbc hl,de     ; 15
  jr nc,+_      ;    \
  add hl,de     ; 15  |
  srl d         ; 8   |
  rra           ; 4   | branch 1: 38cc
  ret           ; 10  | branch 2: 40cc
_:              ;     |
  inc a         ; 4   |
  srl d         ; 8   |
  rra           ; 4   |
  ret           ; 10 /
sqrtHL came from my work on the float routines; sqrtHLIX was inspired by this thread.
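Both routines are unrolled versions of the usual two-bits-at-a-time integer square root. A plain Python model of that method, returning root and remainder like the asm does, might look like this. The function name and the `bits` parameter are assumptions, and the asm keeps its trial value in a different form, so treat this as a way to generate expected values only.

```python
def isqrt_rem(n: int, bits: int = 32):
    """Bitwise square root: returns (root, remainder) with n == root*root + remainder."""
    root = rem = 0
    for shift in range(bits - 2, -1, -2):
        rem = (rem << 2) | ((n >> shift) & 3)  # bring down the next two bits
        trial = (root << 2) | 1                # (2*root+1)^2 - (2*root)^2 = 4*root + 1
        root <<= 1
        if rem >= trial:                       # the subtraction fits, so this root bit is 1
            rem -= trial
            root |= 1
    return root, rem


print(isqrt_rem(123456789))
```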

EDIT: Optimized some more
« Last Edit: September 09, 2019, 04:33:21 pm by Xeda112358 »

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #96 on: August 16, 2019, 11:41:13 pm »
Here are some routines that I've added to the repository:

itoa_8
Converts an 8-bit signed integer to an ASCII string.
Code: [Select]
;Converts an 8-bit signed integer to a string
itoa_8:
;Input:
;   A is a signed integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to five bytes at HL
;   All registers preserved.
;on 0 to 9:       252       D=0
;on 10 to 99:     258+20D   D=0 to 9
;on 100 to 127:   277+20D   D=0 to 2
;on -1 to -9:     276       D=0
;on -10 to -99:   282+20D   D=0 to 9
;on -100 to -128: 301+20D   D=0 to 2
;min: 252cc  (+23cc over original)
;max: 462cc  (-49cc over original)
;avg: 343.74609375cc = 87999/256
;54 bytes
  push hl
  push de
  push bc
  push af
  or a
  jp p,itoa_pos
  neg
  ld (hl),$1A      ;start with the neg char on TI-OS
  inc hl
itoa_pos:
;A is on [0,128]
;calculate 100s place, plus 1 for a future calculation
  ld b,'0'
  cp 100 \ jr c,$+5 \ sub 100 \ inc b
;calculate 10s place digit, +1 for future calculation
  ld de,$0A2F
  inc e \ sub d \ jr nc,$-2
  ld c,a
;Digits are now in D, C, A
; strip leading zeros!
  ld a,'0'
  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of cp * to skip the next byte, turns into cp $BB which will always return nz and nc
  cp e \ jr z,$+4 \ ld (hl),e \ inc hl
  add a,c
  add a,d
  ld (hl),a
  inc hl
  ld (hl),0
  pop af
  pop bc
  pop de
  pop hl
  ret
fixed88_to_string
Uses the itoa_8 routine to convert an 8.8 fixed-point number to a string.
Code: [Select]
;This converts a fixed-point number to a string.
;It displays up to 3 digits after the decimal.
fixed88_to_str:
;Inputs:
;   D.E is the fixed-point number
;   HL points to where the string gets output.
;      Needs at most 9 bytes.
;Outputs:
;   HL is preserved
;Destroys:
;   AF,DE,BC
;First check if the input is negative.
;If so, write a negative sign and negate
  push hl
  ld a,d
  or a
  jp p,+_
  ld (hl),$1A      ;negative sign on TI-OS
  inc hl
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
_:
;Our adjusted number is in A.E
;Now we can print the integer part
  call itoa_8
;Check if we need to print the fractional part
  xor a
  cp e
  jr z,fixed88_to_str_end
;We need to write the fractional part, so seek the end of the string
;Search for the null byte. A is already 0
  cpir
;Write a decimal
  dec hl
  ld (hl),'.'
  ld b,3
_:
;Multiply E by 10, converting overflow to an ASCII digit
  call fixed88_to_str_e_times_10
  inc hl
  ld (hl),a
  djnz -_
;Strip the ending zeros
  ld a,'0'
_:
  cp (hl)
  dec hl
  jr z,-_
;write a null byte
  inc hl
  inc hl
  ld (hl),0
fixed88_to_str_end:
;restore HL
  pop hl
  ret
fixed88_to_str_e_times_10:
  ld a,e
  ld d,0
  add a,a \ rl d
  add a,a \ rl d
  add a,e \ jr nc,$+3 \ inc d
  add a,a
  ld e,a
  ld a,d
  rla
  add a,'0'
  ret
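The fractional part is printed by multiplying the low byte by 10 three times and peeling off whatever overflows past 8 bits as the next digit. A small Python model of just that step (the function name and the `places` parameter are assumptions):

```python
def frac_digits(frac: int, places: int = 3) -> str:
    """Decimal digits of frac/256, truncated to `places`, trailing zeros stripped."""
    digits = []
    for _ in range(places):
        frac *= 10
        digits.append(chr(ord("0") + (frac >> 8)))  # the overflow is the next digit
        frac &= 0xFF                                # keep the 8-bit fraction
    return "".join(digits).rstrip("0")


print(frac_digits(0x40))
```

So 0x40 (0.25) gives "25", while 0x01 (0.00390625) gives "003" because only three digits are kept.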
sqrtA
This is a very fast, unrolled routine to compute the square root of A.

Code: [Select]
sqrtA:
;Input: A
;Output: D is the square root, A is the remainder (input-D^2)
;Destroys: BC
;speed: 161+{0,6}+{0,1}+{0,1}+{0,3}
;min: 161cc
;max: 172cc
;avg: 166.5cc
;45 bytes
  ld d,$40
  sub d
  jr nc,+_
  add a,d
  ld d,0
_:
  set 4,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 5,d
  res 4,d
  srl d
  set 2,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 3,d
  res 2,d
  srl d
  inc d
  sub d
  jr nc,+_
  add a,d
  dec d
_:
  inc d
  srl d
  ret

sqrtfixed_88
An unrolled, fast 8.8 fixed-point square root routine. Uses the above sqrtA routine.
Code: [Select]
sqrtfixed_88:
;Input: A.E ==> D.E
;Output: DE is the sqrt, AHL is the remainder
;Speed: 690+6{0,13}+{0,3+{0,18}}+{0,38}+sqrtA
;min: 855cc
;max: 1003cc
;avg: 924.5cc
;152 bytes
  call sqrtA
  ld l,a
  ld a,e
  ld h,0
  ld e,d
  ld d,h
  sla e
  rl d
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
;Now we have four more iterations
;The first two are no problem
  sll e \ rl d
  add hl,hl
  add hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
  sll e \ rl d
  add hl,hl
  add hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of cp *
_:
  inc e
sqrtfixed_88_iter11:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add hl,hl
  add hl,hl
  jr c,sqrtfixed_88_iter11_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrtfixed_88_iter12
sqrtfixed_88_iter11_br0:
  or a
  sbc hl,de
_:
  inc e
;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrtfixed_88_iter12:
  ld b,a      ;A is 0, so B is 0
  add hl,hl
  add hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

ncr_HL_DE
Computes 'HL choose DE' in such a way that overflow only occurs if the final result overflows 16 bits.
Code: [Select]
; Requires
;    mul16          ;BC*DE ==> DEHL
;    DEHL_Div_BC    ;DEHL/BC ==> DEHL
ncr_HL_DE:
;"n choose r", defined as n!/(r!(n-r)!)
;Computes "HL choose DE"
;Inputs: HL,DE
;Outputs:
;   HL is the result
;       "HL choose DE"
;   carry flag reset means overflow
;Destroys:
;   A,BC,DE,IX
;Notes:
;   Overflow is returned as 0
;   Overflow happens if HL choose DE exceeds 65535
;   This algorithm is constructed in such a way that intermediate
;   operations won't erroneously trigger overflow.
;66 bytes
  ld bc,1
  or a
  sbc hl,de
  jr c,ncr_oob
  jr z,ncr_exit
  sbc hl,de
  add hl,de
  jr c,$+3
  ex de,hl
  ld a,h
  or l
  push hl
  pop ix
ncr_exit:
  ld h,b
  ld l,c
  scf
  ret z
ncr_loop:
  push bc \ push de
  push hl \ push bc
  ld b,h
  ld c,l
  call mul16          ;BC*DE ==> DEHL
  pop bc
  call DEHL_Div_BC    ;result in DEHL
  ld a,d
  or e
  pop bc
  pop de
  jr nz,ncr_overflow
  add hl,bc
  jr c,ncr_overflow
  pop bc
  inc bc
  ld a,b
  cp ixh
  jr c,ncr_loop
  ld a,ixl
  cp c
  jr nc,ncr_loop
  ret
ncr_overflow:
  pop bc
  xor a
  ld b,a
ncr_oob:
  ld h,b
  ld l,b
  ret
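The "only overflows if the final result overflows" property comes from building the binomial one factor at a time and dividing at every step, so each partial value is itself a smaller binomial coefficient. A Python sketch of that idea follows; it is not a transcription of the asm, the name `ncr16` is an assumption, and Python integers are unbounded where the asm holds the product in 32 bits.

```python
def ncr16(n: int, r: int) -> int:
    """n choose r, returning 0 when r > n or when the result exceeds 65535."""
    if r > n:
        return 0
    r = min(r, n - r)                    # C(n, r) == C(n, n-r); use the shorter loop
    result = 1
    for i in range(1, r + 1):
        # result is C(n-r+i-1, i-1) here, so the division below is always exact
        result = result * (n - r + i) // i
        if result > 0xFFFF:              # partial values only grow, so the final one overflows too
            return 0
    return result


print(ncr16(18, 9))
```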
EDIT: Optimized itoa_8 above. Here are some more routines:
uitoa_8
Converts an 8-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts an 8-bit unsigned integer to a string
uitoa_8:
;Input:
;   A is an unsigned integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to four bytes at HL
;   All registers preserved.
;on 0 to 9:     238              D=0
;on 10 to 99:   244+20D          D=0 to 9
;on 100 to 255: 257+2{0,6}+20D   D=0 to 5
;min: 238cc
;max: 424cc
;avg: 317.453125cc = 81268/256 = (238*10 + 334*90+313*156)/256
;52 bytes
  push hl
  push de
  push bc
  push af
;A is on [0,255]
;calculate 100s place, plus 1 for a future calculation
  ld b,'0'
  cp 100 \ jr c,$+5 \ sub 100 \ inc b
  cp 100 \ jr c,$+5 \ sub 100 \ inc b
;calculate 10s place digit, +1 for future calculation
  ld de,$0A2F
  inc e \ sub d \ jr nc,$-2
  ld c,a
;Digits are now in D, C, A
; strip leading zeros!
  ld a,'0'
  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of cp * to skip the next byte, turns into cp $BB which will always return nz and nc
  cp e \ jr z,$+4 \ ld (hl),e \ inc hl
  add a,c
  add a,d
  ld (hl),a
  inc hl
  ld (hl),0
  pop af
  pop bc
  pop de
  pop hl
  ret
itoa_16
Converts a 16-bit signed integer to an ASCII string.
Code: [Select]
;Converts a 16-bit signed integer to an ASCII string.
itoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 7 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
  push de
  push bc
  push af
  push hl
  bit 7,d
  jr z,+_
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a
  ld (hl),$1A      ;negative char on TI-OS
  inc hl
_:
  ex de,hl
  ld bc,-10000
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de
  ld bc,1000
  ld a,'9'+1
  dec a \ add hl,bc \ jr nc,$-2
  ld (de),a
  inc de
  ld bc,-100
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de
  ld a,l
  ld h,'9'+1
  dec h \ add a,10 \ jr nc,$-3
  add a,'0'
  ex de,hl
  ld (hl),d
  inc hl
  ld (hl),a
  inc hl
  ld (hl),0
;Now strip the leading zeros
  pop hl
;If the first char is a negative sign, skip it
  ld a,(hl)
  cp $1A
  push af
  ld a,'0'
  jr nz,$+3
  inc hl
  cp (hl)
  jr z,$-2
;Check if we need to re-write the negative sign
  pop af
  jr nz,+_
  dec hl
  ld (hl),a
_:
  pop af
  pop bc
  pop de
  ret
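The digit extraction is plain repeated subtraction of powers of ten, followed by stripping the leading zeros. A Python model of that approach, for producing expected strings on a PC, is below. It is not a transcription: the `neg_char` default of "-" is an assumption (the listing writes the TI-OS character $1A), and keeping a single "0" for an input of zero is a choice made in this sketch.

```python
def itoa_16(value: int, neg_char: str = "-") -> str:
    """Signed 16-bit integer to decimal text by repeated subtraction."""
    assert -32768 <= value <= 32767
    sign = ""
    if value < 0:
        sign, value = neg_char, -value
    digits = []
    for power in (10000, 1000, 100, 10, 1):
        digit = 0
        while value >= power:           # count how many times this power fits
            value -= power
            digit += 1
        digits.append(str(digit))
    text = "".join(digits).lstrip("0") or "0"   # strip leading zeros, keep one for zero
    return sign + text


print(itoa_16(-12345))
```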
uitoa_16
Converts a 16-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts a 16-bit unsigned integer to an ASCII string.
uitoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 6 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
  push de
  push bc
  push af
  ex de,hl
  ld bc,-10000
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de
  ld bc,1000
  ld a,'9'+1
  dec a \ add hl,bc \ jr nc,$-2
  ld (de),a
  inc de
  ld bc,-100
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de
  ld a,l
  ld h,'9'+1
  dec h \ add a,10 \ jr nc,$-3
  add a,'0'
  ex de,hl
  ld (hl),d
  inc hl
  ld (hl),a
  inc hl
  ld (hl),0
;Now strip the leading zeros
  ld c,-6
  add hl,bc
  ld a,'0'
  inc hl \ cp (hl) \ jr z,$-2
  pop af
  pop bc
  pop de
  ret
« Last Edit: August 17, 2019, 10:35:30 am by Xeda112358 »

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #97 on: August 28, 2019, 09:33:29 am »
Here is a masked sprite routine (no clipping)!
Interleave the data with the mask, where the MASK is ANDed with the buffer, and the data is ORed on top of that:
Code: [Select]
;Masked Sprite routine
putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ANDed to the buffer, then data is ORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is any, data is 1
;     white: mask is 0, data is 0
;     clear: mask is 1, data is 0 (keeps the data from the buffer)
;
;This routine is free to use :)
;65 bytes (or 66 bytes if gbuf is not located at 0x**40)
  ld e,l
  ld h,0
  ld d,h
  add hl,hl
  add hl,de
  add hl,hl
  add hl,hl
  ld e,a
  and 7
  ld c,a
  xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
  inc a
  rra
  rra
  rra
  ld e,a
  ld d,plotSScreen>>8
#else
  rra
  rra
  rra
  ld e,a
  add hl,de
  ld de,plotSScreen
#endif
  add hl,de
putsprite_masked_loop:
  push bc
  xor a
  ld d,(ix)
  ld e,a
  sub c
  ld b,c
  ld c,$FF
  inc ix
  ld a,(ix)
  jr z,putsprite_masked_rotdone
putsprite_masked_rot:
  scf
  rra
  rr c
  srl d
  rr e
  djnz putsprite_masked_rot
putsprite_masked_rotdone:
  and (hl)
  or d
  ld (hl),a
  inc hl
  ld a,(hl)
  and c
  or e
  ld (hl),a
  ld c,11
  add hl,bc
  inc ix
  pop bc
  djnz putsprite_masked_loop
  ret

But if you want even faster and smaller, use a non-traditional mask technique by ORing the mask onto the buffer, then XORing the data on top of it. The format is less intuitive, but it allows for white/black/clear/invert instead of just white/black/clear:
Code: [Select]
;Masked Sprite routine
putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ORed to the buffer, then data is XORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is 1, data is 0
;     white: mask is 1, data is 1
;     clear: mask is 0, data is 0 (keeps the data from the buffer)
;     invert: mask is 0, data is 1 (inverts the data from the buffer)
;
;This routine is free to use :)
;63 bytes (or 64 bytes if gbuf is not located at 0x**40)
  ld e,l
  ld h,0
  ld d,h
  add hl,hl
  add hl,de
  add hl,hl
  add hl,hl
  ld e,a
  and 7
  ld c,a
  xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
  inc a
  rra
  rra
  rra
  ld e,a
  ld d,plotSScreen>>8
#else
  rra
  rra
  rra
  ld e,a
  add hl,de
  ld de,plotSScreen
#endif
  add hl,de
putsprite_masked_loop:
  push bc
  xor a
  ld d,(ix)
  ld e,a
  or c
  ld b,c
  ld c,e
  inc ix
  ld a,(ix)
  jr z,putsprite_masked_rotdone
putsprite_masked_rot:
  rra
  rr c
  srl d
  rr e
  djnz putsprite_masked_rot
putsprite_masked_rotdone:
  or (hl)
  xor d
  ld (hl),a
  inc hl
  ld a,(hl)
  or c
  xor e
  ld (hl),a
  ld c,11
  add hl,bc
  inc ix
  pop bc
  djnz putsprite_masked_loop
  ret
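The two mask formats are easiest to compare one bit at a time. Here is a tiny Python model of the per-pixel logic from the notes in both listings, where 1 is a dark pixel; the function names are assumptions.

```python
def and_or(buf: int, mask: int, data: int) -> int:
    """Traditional format: mask is ANDed with the buffer, then data is ORed on top."""
    return (buf & mask) | data


def or_xor(buf: int, mask: int, data: int) -> int:
    """Alternative format: mask is ORed onto the buffer, then data is XORed on top."""
    return (buf | mask) ^ data


for buf in (0, 1):
    print(buf,
          or_xor(buf, 1, 0),   # black
          or_xor(buf, 1, 1),   # white
          or_xor(buf, 0, 0),   # clear (keeps buf)
          or_xor(buf, 0, 1))   # invert
```

The fourth column is the one the AND/OR format cannot express: no mask/data pair there gives the opposite of the buffer bit.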

I also made some "bigsprite" routines! These do have clipping, too. First, they use some common subroutines for computing masks and performing most of the clipping and shifting:
Code: [Select]
;133 bytes total
;This is made by Zeda, feel free to use it for whatever.
;Takes inputs for a big sprite and sets up masks and clipping
;requires 4 bytes of temporary RAM, but doesn't use SMC
spritetmp = 8000h     ;relocate this as needed! Just need 4 bytes.
sprite_width  = spritetmp+0
sprite_x      = spritetmp+1
sprite_mask0  = spritetmp+2
sprite_mask1  = spritetmp+3
bigsprite_subroutine:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;Outputs:
;   carry flag is set if okay to draw, nc if out-of-bounds.
;   B is height.
;   C is width.
;   HL points to the byte to start drawing at.
;   DE points to where to start sourcing the sprite data
;   (sprite_width) is the width of the sprite in bytes
;   (sprite_x) is the initial x coordinate to begin drawing at
;   (sprite_mask0) is the left mask
;   (sprite_mask1) is the right mask
;92 bytes
;First check if the sprite is on-screen in the vertical direction
  ld a,c
  cp 64
  jr c,+_
  add a,h
  ret nc
  ld h,a
  push hl
  xor a
  ld h,a
  sub c
  ex de,hl
  add hl,de
  dec a
  jr nz,$-2
  ex de,hl
  pop hl
  xor a
  ld c,a
_:
;Next check h+c<=64
  ld a,64
  sub c
  cp h
  jr nc,+_
  ld h,a
_:
;Make sure the height is not now 0
  ld a,h
  or a
  ret z
;Save the width and height of the sprite
  push hl                 ;height,width
  ld h,b
  ld (sprite_width),hl    ;x,width
  push de                 ;sprite pointer
;Set up a pointer to the routine for shifting the sprite data
  ld ixh,rshiftHA_7>>8
  ld a,h
  cpl
  and 7
  ld l,a
  add a,a
  add a,l
  add a,rshiftHA_7&255
  ld ixl,a
#if (rshiftHA_7&255)>234
  jr nc,$+4
  inc ixh
#endif
  ld a,b
  and 7
  ld de,spritemask
  add a,e
  ld e,a
#if spritemask&255>248
  jr nc,$+3
  inc d
#endif
  ld a,(de)
  ld (sprite_mask0),a
  cpl
  ld (sprite_mask1),a
;;
  ld a,c
  add a,a
  sbc a,a
  ld h,a
  ld a,b
  ld b,h
  ld l,c
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  ld c,a
  add a,a
  sbc a,a
  ld b,a
  ld a,c
  sra c
  sra c
  sra c
  add hl,bc
  ld bc,plotSScreen
  add hl,bc
  pop de
  pop bc
  ;B is height
  ;C is width
  ex de,hl
  scf
  ret
rshiftHA_7:
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  ex de,hl
  ld e,a
  ret
spritemask:
  .db $00,$80,$C0,$E0,$F0,$F8,$FC,$FE
call_ix:
  jp (ix)

Then you can draw a big sprite with OR logic:
Code: [Select]
bigsprite_OR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes
;Set up the clipping
  call bigsprite_subroutine
  ret nc
bigsprite_OR_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a
spriteloop_OR:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  or d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  or e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_OR
  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_OR_loop
  ret

Or draw with XOR logic:
Code: [Select]
bigsprite_XOR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes
;Set up the clipping
  call bigsprite_subroutine
  ret nc
bigsprite_XOR_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a
spriteloop_XOR:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  xor d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  xor e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_XOR
  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_XOR_loop
  ret

Or draw with AND logic:
Code: [Select]
bigsprite_AND:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;69 bytes
;Set up the clipping
  call bigsprite_subroutine
  ret nc
bigsprite_AND_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a
spriteloop_AND:
  push bc
  push hl
  ld h,(hl)
  scf \ sbc a,a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  and d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  and e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_AND
  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_AND_loop
  ret

Or draw with Erase logic:
Code: [Select]
bigsprite_Erase:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;67 bytes
;Set up the clipping
  call bigsprite_subroutine
  ret nc
bigsprite_Erase_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a
spriteloop_Erase:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,d
  cpl
  and (hl)
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,e
  cpl
  and (hl)
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_Erase
  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_Erase_loop
  ret

Or draw with Overwrite logic:
Code: [Select]
bigsprite_Overwrite:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;71 bytes
;Set up the clipping
  call bigsprite_subroutine
  ret nc
bigsprite_Overwrite_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a
spriteloop_Overwrite:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(sprite_mask0)
  and (hl)
  or d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  and (hl)
  or e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_Overwrite
  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_Overwrite_loop
  ret

#### SpiroH

• Posts: 715
• Rating: +153/-23
##### Re: ASM Optimized routines
« Reply #98 on: August 28, 2019, 10:07:05 am »
@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language .

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #99 on: August 28, 2019, 10:11:55 am »

Thanks! I wrote these with apps in mind, so I tried to reduce the need for external RAM. I should definitely make versions that take full advantage of SMC, though!

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4639
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #100 on: September 10, 2019, 06:18:29 pm »
Here is a circle routine that I made!
Code: [Select]
;Written by Zeda Thomas, free to use.
;This draws a circle centered at 8-bit coordinates and with radius up to 127.
;IX points to a plot routine that takes (B,C)=(x,y) as input and does something
;with it, like plot the pixel a certain color, or plot a "big" pixel, or whatever.
;   plot
;     Takes coordinates, (B,C) = (x,y) and plots the point.
;
;For example, on the TI-83+/84+/SE the plot routine might look like:
; plot:
;   call getpixelloc
;   ret nc            ;Exit if the coordinates are out-of-bounds
;   or (hl)
;   ld (hl),a
;   ret
;
;
; Required subroutines:
;     call_ix:
;       jp (ix)
circle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a plot routine that takes (B,C)=(x,y) as input.
  ld a,d
  add a,a
  ret c
  ret z
  ld l,d
  dec a
  ld e,a
  dec a        ;if the pixel is only 1 wide, just plot the point
  jp z,call_ix ;Jump to the plot routine
  xor a
  ld h,-1
  ld d,1
  scf     ;skip the first plot
circleloop:
  call nc,plot4
  inc h
  sub d
  inc d
  inc d
  jr nc,circleloop
_:
  dec l
  call plot4
  add a,e
  dec e
  ret z
  dec e
  jr nc,-_
  jp circleloop
plot4:
;BC is center
;HL is x,y
  push de
  push af
  push hl
  push bc
;If H is 0, or L is 0, we need to draw only half
  push hl
  ld a,b
  sub h
  ld b,a
  add a,h
  add a,h
  ld h,a
  ld a,c
  sub l
  ld c,a
  add a,l
  add a,l
  ld l,a
;B is x0-x
;C is y0-y
;H is x0+x
;L is y0+y
;plot(x0-x,y0-y)
;plot(x0+x,y0+y)
  push bc
  push hl
  call call_ix    ;call the plot routine
  pop bc
  push bc
  call call_ix    ;call the plot routine
;now swap the y coords
  pop hl
  pop bc
  ld a,l
  ld l,c
  ld c,a
  pop de
  xor a
  cp d
  jr z,+_
  cp e
  jr z,+_
;plot(x0-x,y0+y)
;plot(x0+x,y0-y)
  push hl
  call call_ix    ;call the plot routine
  pop bc
  call call_ix    ;call the plot routine
_:
  pop bc
  pop hl
  pop af
  pop de
  ret

The really cool feature about this is that you can define a custom plot routine pointed to by IX, so it isn't TI-specific, and you can do all sorts of wonky things like:
Draw 2x2 pixels:

Code: [Select]
;calling with ld ix,pixelOn_2x2
pixelOn_2x2:
  sla b
  ret c
  sla c
  ret c
  push bc
  call pixelOn
  pop bc
  inc b
  push bc
  call pixelOn
  pop bc
  inc c
  push bc
  call pixelOn
  pop bc
  dec b
  jp pixelOn
Or draw a circle whose "pixels" are circles:

Code: [Select]
;calling with ld ix,pixelOn_circle
pixelOn_circle:
  ld a,b
  cp 32
  ret nc
  add a,a
  add a,a
  ld b,a
  ld a,c
  cp 32
  ret nc
  add a,a
  add a,a
  ld c,a
  ld d,4
  push ix    ;need to save IX!
  ld ix,pixelOn
  call circle
  pop ix
  ret

EDIT: I inlined some subroutines because there was no reason to have them called. It was a waste of clock cycles and space!
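The pluggable-plot idea carries over to any language. Below is a Python sketch that uses a textbook midpoint circle with 8-way symmetry, which is not the stepping used in the asm, and passes the plot routine in as a callback. `circle`, `make_plot_2x2` and the set-based "screen" are all assumptions for illustration.

```python
def circle(cx, cy, r, plot):
    """Midpoint circle; every pixel goes through the caller-supplied plot(x, y)."""
    if r <= 0:
        return
    x, y, err = r, 0, 1 - r
    while x >= y:
        for dx, dy in ((x, y), (y, x)):      # the two octant mirrors
            for sx in (1, -1):
                for sy in (1, -1):           # then the four quadrants
                    plot(cx + sx * dx, cy + sy * dy)
        y += 1
        if err < 0:
            err += 2 * y + 1
        else:
            x -= 1
            err += 2 * (y - x) + 1


def make_plot_2x2(plot):
    """Wrap a plot routine so each logical pixel becomes a 2x2 block."""
    def plot_2x2(x, y):
        for dx in (0, 1):
            for dy in (0, 1):
                plot(2 * x + dx, 2 * y + dy)
    return plot_2x2


screen = set()
circle(10, 10, 4, make_plot_2x2(lambda x, y: screen.add((x, y))))
print(len(screen))
```

Swapping the callback is all it takes to change what a "pixel" means, which is the same role IX plays in the asm.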
EDIT: Have a separate, filled circle routine!
Note that if you pass the same arguments as the regular circle routine, this only draws the inside part and skips the border.
Code: [Select]
;Written by Zeda Thomas, free to use.
;This draws the fill of a circle centered at 8-bit coordinates and with radius
;up to 127.
;IX points to a horizontal line routine that takes E=x, A=y, D=width as input
;and does something with it, like plot a horizontal line.
;
; For example, on the ti-83+/84+/SE calculators, you might have:
;     horizontal_line:
;       ld b,e
;       ld c,a
;       ld e,1
;       ld hl,gbuf
;       jp rectOR
; Required subroutines:
;     call_ix:
;       jp (ix)
filledcircle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a horizontal line routine that takes E=x, A=y, D=width as input.
  ld a,d
  add a,a
  ret c
  ret z
  ld l,d
  dec a
  ld e,a
  xor a
  ld h,-1
  ld d,1
filledcircleloop:
  ; call c,fillcircle_plot
  inc h
  sub d
  inc d
  inc d
  jr nc,filledcircleloop
_:
  dec l
  call fillcircle_plot
  add a,e
  dec e
  ret z
  dec e
  jr nc,-_
  jp filledcircleloop
fillcircle_plot:
  inc h
  dec h
  ret z
  push hl
  push de
  push bc
  push af
  dec h
  ld a,b
  sub h
  ld e,a
  ld d,h
  sll d   ;aka slia, undocumented
  ld a,l
  or a
  ld h,c
  jr z,+_
  add a,h
  push de
  push hl
  call nz,call_ix
  pop hl
  pop de
_:
  ld a,h
  sub l
  call call_ix
  pop af
  pop bc
  pop de
  pop hl
  ret
« Last Edit: September 10, 2019, 09:05:45 pm by Xeda112358 »