
### Author Topic: ASM Optimized routines  (Read 43627 times)


#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4638
• Rating: +717/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #90 on: October 16, 2017, 11:52:22 pm »
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.

Code: [Select]
; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
bit 7,h
push    af
jr      z,divHLbyCsStart
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
divHLbyCsStart:
xor     a
ld      b,16
divHLbyCsLoop:
add     hl,hl
rla
cp      c
jp      c,divHLbyCsNext
sub     c
inc     l
divHLbyCsNext:
djnz    divHLbyCsLoop
ld      b,a
pop     af
ld      a,b
ret     z
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ld a,c
sub b
ret
I changed the output remainder so that it returns the remainder modulo C, so -7/3 returns 2 instead of 1. This means that no bytes or clock cycles were saved after all.
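For checking results off-calc, here is a rough Python model of the shift-and-subtract loop and the sign handling described above. It is only a sketch: the function name is made up, and it assumes 1 <= C <= 127 (the range where the routine is said to be safe) and that each loop pass shifts HL left before the `rla`.

```python
def div_hl_by_c_signed(hl, c):
    """Model of divHLbyCs: hl is a signed 16-bit value, c is unsigned, 1 <= c <= 127."""
    negative = hl < 0
    n = (-hl if negative else hl) & 0xFFFF    # the routine negates HL up front
    a = 0
    for _ in range(16):
        carry = n >> 15                        # top bit shifted out of HL
        n = (n << 1) & 0xFFFF
        a = (a << 1) | carry                   # rla
        if a >= c:                             # cp c / sub c / inc l
            a -= c
            n |= 1
    if negative:
        return -n, c - a                       # negate quotient, remainder = C - r
    return n, a


print(div_hl_by_c_signed(7, 3))
print(div_hl_by_c_signed(-7, 3))
```

Note that, going by the listing, an exact negative division such as -6/3 comes back with a remainder of C rather than 0, because the final step is always `C - r`. The model reproduces that.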

#### JamesV

• LV5 Advanced (Next: 300)
• Posts: 266
• Rating: +76/-0
##### Re: ASM Optimized routines
« Reply #91 on: October 17, 2017, 04:49:18 pm »
Thanks for the optimisations/fixes! I guess I was lucky that I don't actually use the remainder in my program, and also C will only ever be between 1-64 inclusive. I like the negate HL optimisation too, that's neat

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #92 on: July 27, 2018, 07:40:53 pm »
I have two neat routines, pushpop and diRestore. If you want to preserve HL,DE,BC, and AF, then you can just put a call to pushpop at the start of your routine. If you want interrupts to be restored on exiting your routine, you can call diRestore at the top of your code.

They both mess with the stack to insert a return routine on the stack. For example, when your code calls diRestore, a ret will first return to the code that restores the interrupt status, and then back to the calling routine.

Code: [Select]
pushpop:
;26 bytes, adds 229cc to the calling routine
ex (sp),hl
push de
push bc
push af
push hl
ld hl,pushpopret
ex (sp),hl
push hl
push af
ld hl,12
add hl,sp     ;point at the caller's original HL on the stack
ld a,(hl)
inc hl
ld h,(hl)
ld l,a
pop af
ret
pushpopret:
pop af
pop bc
pop de
pop hl
ret
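To see what pushpop leaves behind, the stack juggling can be traced with a small Python model. This is a sketch only: the stack is a list whose last element is the top, register values are placeholder strings, and the function name is hypothetical. It assumes the routine reads the saved HL back from 12 bytes above SP, as the byte count and stack layout suggest.

```python
def pushpop_model(hl, de, bc, af, ret_routine, ret_caller):
    # stack[-1] is the top of the stack
    stack = [ret_caller, ret_routine]         # state right after `call pushpop`
    stack[-1], hl = hl, stack[-1]             # ex (sp),hl : save HL, HL = return into routine
    stack += [de, bc, af]                     # push de / push bc / push af
    stack.append(hl)                          # push hl
    stack[-1], hl = "pushpopret", stack[-1]   # ld hl,pushpopret / ex (sp),hl
    stack.append(hl)                          # push hl (return address into the routine)
    stack.append(af)                          # push af
    hl = stack[-7]                            # word 12 bytes above SP = original HL
    af = stack.pop()                          # pop af
    pc = stack.pop()                          # ret
    return pc, hl, stack


pc, hl, stack = pushpop_model("HL", "DE", "BC", "AF", "routine", "caller")
print(pc, hl, stack)
```

After the call, execution continues in the calling routine with every register intact, and the routine's own `ret` lands on pushpopret, which pops AF, BC, DE, and HL before returning to the original caller.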
Code: [Select]
diRestore:
ex (sp),hl
push hl
push af
ld hl,restoreei
ld a,r
jp pe,+_
dec hl
dec hl
_:
di
inc sp
inc sp
inc sp
inc sp
ex (sp),hl
dec sp
dec sp
dec sp
dec sp
pop af
ret
restoredi:
di
ret
restoreei:
ei
ret
Edit: calc84maniac brought up that those 'inc sp' instructions might cause some issues if interrupts fire in between them. The code is modified to disable interrupts first, then mess with the stack.
« Last Edit: July 28, 2018, 09:34:41 am by Xeda112358 »

#### calc84maniac

• eZ80 Guru
• Coder Of Tomorrow
• LV11 Super Veteran (Next: 3000)
• Posts: 2897
• Rating: +467/-17
##### Re: ASM Optimized routines
« Reply #93 on: August 04, 2018, 06:55:30 pm »
32-Bit Endian Swap (eZ80)

Swaps the byte order of a 32-bit value stored in EUHL. Can be adapted to work with other register combinations as well.

Code: [Select]
endianswap32:
push hl
ld h,e
ld e,l
inc sp
push hl
inc sp
pop hl
inc sp
ret
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #94 on: October 16, 2018, 06:33:14 pm »
Here is a decent arctangent routine that works on [0,1), so you'll have to do the work to extend it outside that range. It uses a small lookup table and linear interpolation.
Spoiler For Range Reduction:
atan(-x)=-atan(x) is useful to extend to negative numbers
atan(x)=pi/2-atan(1/x) is useful to extend to inputs with magnitude >=1
Code: [Select]
atan8:
;computes 256*atan(A/256)->A
;56 bytes including the LUT
;min: 246cc
;max: 271cc
;avg: 258.5cc
rlca
rlca
rlca
ld d,a
and 7
ld hl,atan8LUT
add a,l
ld l,a
#if (atan8LUT&255)>248    ;this section not included in size/speed totals
jr nc,$+3               ;can add three bytes, 12cc to max, 11cc to min, and 11.5cc to avg
inc h
#endif
ld c,(hl)
inc hl
ld a,(hl)
sub c
ld e,0
ex de,hl
ld d,l
ld e,a
sla h \ jr nc,$+3 \ ld l,e
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl
add hl,hl
add hl,hl
;  add hl,hl    ;used in rounding...
ld a,h
;  rra          ;but doesn't seem to improve the error
add a,c
ret
atan8LUT:
.db 0,32,63,92,119,143,165,184,201
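As a cross-check of the lookup-plus-interpolation idea, here is a small Python model. It is a sketch under the assumption that the top three bits of A select the table segment and the low five bits interpolate linearly, with the product truncated rather than rounded; the names are illustrative.

```python
ATAN8_LUT = [0, 32, 63, 92, 119, 143, 165, 184, 201]  # 256*atan(k/8) for k = 0..8


def atan8(a):
    """Approximate 256*atan(a/256) for 0 <= a <= 255."""
    index, frac = a >> 5, a & 31           # segment number and position inside it
    base = ATAN8_LUT[index]
    delta = ATAN8_LUT[index + 1] - base
    return base + ((delta * frac) >> 5)    # linear interpolation, truncated


print([atan8(a) for a in (0, 16, 32, 128, 255)])
```

Inputs outside [0,1) would still need the range-reduction identities from the spoiler above.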

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #95 on: March 24, 2019, 11:58:29 am »
32-bit square root:
Code: [Select]
sqrtHLIX:
;Input: HLIX
;Output: DE is the sqrt, AHL is the remainder
;speed: 751+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL
;min: 1103
;max: 1237
;avg: 1165.5
;166 bytes

call sqrtHL   ;expects returns A as sqrt, HL as remainder, D = 0
add a,a
ld e,a
rl d

ld a,ixh
sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

;Now we have four more iterations
;The first two are no problem
ld a,ixl
sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sqrt32_iter15:
;On the next iteration, HL might temporarily overflow by 1 bit
sll e \ rl d      ;sla e \ rl d \ inc e
add a,a
adc hl,hl
add a,a
adc hl,hl       ;This might overflow!
jr c,sqrt32_iter15_br0
;
sbc hl,de
jr nc,+_
add hl,de
dec e
jr sqrt32_iter16
sqrt32_iter15_br0:
or a
sbc hl,de
_:
inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
add a,a
ld b,a        ;either 0x00 or 0x80
adc hl,hl
rla
adc hl,hl
rla
;AHL - (DE+DE+1)
sbc hl,de \ sbc a,b
inc e
or a
sbc hl,de \ sbc a,b
ret p
add hl,de
adc a,b
dec e
add hl,de
adc a,b
ret
This uses this sqrtHL routine:
Code: [Select]
;written by Zeda
sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 352cc
;max: 391cc
;avg: 371.5cc

ld de,05040h  ; 10
ld a,h        ; 4
sub e         ; 4
jr nc,sq7     ;\
add a,e       ; | branch 1: 12cc
ld d,16       ; | branch 2: 18cc
sq7:            ;/

; ----------

cp d          ; 4
jr c,sq6      ;\
sub d         ; | branch 1: 12cc
set 5,d       ; | branch 2: 19cc
sq6:            ;/

; ----------
res 4,d       ; 8
srl d         ; 8
set 2,d       ; 8
cp d          ; 4
jr c,sq5      ;\
sub d         ; | branch 1: 12cc
set 3,d       ; | branch 2: 19cc
sq5:            ;/
srl d         ; 8

; ----------

inc a         ; 4
sub d         ; 4
jr nc,sq4     ;\
dec d         ; | branch 1: 12cc
add a,d       ; | branch 2: 19cc
dec d         ; | <-- this resets the low bit of D, so srl d resets carry.
sq4:            ;/
srl d         ; 8
ld h,a        ; 4

; ----------

ld a,e        ; 4
sbc hl,de     ; 15
jr nc,sq3     ;\
add hl,de     ; | 12cc or 18cc
sq3:            ;/
ccf           ; 4
rra           ; 4
srl d         ; 8
rra           ; 4

; ----------

ld e,a        ; 4
sbc hl,de     ; 15
jr c,sq2      ;\
or 20h        ; | branch 1: 23cc
db 254        ; |   <-- start of cp * which is 7cc to skip the next byte.
sq2:            ; | branch 2: 21cc

xor 18h       ; 7
srl d         ; 8
rra           ; 4

; ----------

ld e,a        ; 4
sbc hl,de     ; 15
jr c,sq1      ;\
or 8          ; | branch 1: 23cc
db 254        ; |   <-- start of cp * which is 7cc to skip the next byte.
sq1:            ; | branch 2: 21cc

xor 6         ; 7
srl d         ; 8
rra           ; 4

; ----------

ld e,a        ; 4
sbc hl,de     ; 15
jr nc,+_      ;    \
add hl,de     ; 15  |
srl d         ; 8   |
rra           ; 4   | branch 1: 38cc
ret           ; 10  | branch 2: 40cc
_:              ;     |
inc a         ; 4   |
srl d         ; 8   |
rra           ; 4   |
ret           ; 10 /

sqrtHL was from my work on the float routines, sqrtHLIX was inspired by this thread
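For anyone comparing against a reference, both routines follow the usual digit-by-digit (two bits per step) square root. A minimal Python sketch of that method follows; it is a generic model, not a transcription of the unrolled code, and the function name is made up.

```python
def isqrt_rem(n, bits=32):
    """Return (root, remainder) with root*root + remainder == n, for 0 <= n < 2**bits."""
    root = 0
    rem = 0
    for shift in range(bits - 2, -1, -2):
        rem = (rem << 2) | ((n >> shift) & 3)   # bring down the next two bits
        trial = (root << 2) | 1                 # 4*root + 1, like sll e \ rl d
        root <<= 1
        if rem >= trial:                        # trial subtraction succeeded
            rem -= trial
            root |= 1
    return root, rem


print(isqrt_rem(0xFFFFFFFF))
print(isqrt_rem(1000000))
```

The remainder can need one bit more than the root, which is why sqrtHLIX returns it in AHL rather than HL alone.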

EDIT: Optimized some more
« Last Edit: September 09, 2019, 04:33:21 pm by Xeda112358 »

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #96 on: August 16, 2019, 11:41:13 pm »
Here are some routines that I've added to the repository:

itoa_8
Converts an 8-bit signed integer to an ASCII string.
Code: [Select]
;Converts an 8-bit signed integer to a string

itoa_8:
;Input:
;   A is a signed integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to five bytes at HL
;   All registers preserved.
;on 0 to 9:       252       D=0
;on 10 to 99:     258+20D   D=0 to 9
;on 100 to 127:   277+20D   D=0 to 2
;on -1 to -9:     276       D=0
;on -10 to -99:   282+20D   D=0 to 9
;on -100 to -128: 301+20D   D=0 to 2

;min: 252cc  (+23cc over original)
;max: 462cc  (-49cc over original)
;avg: 343.74609375cc = 87999/256
;54 bytes
push hl
push de
push bc
push af
or a
jp p,itoa_pos
neg
ld (hl),$1A      ;start if neg char on TI-OS
inc hl
itoa_pos:

;A is on [0,128]
;calculate 100s place, plus 1 for a future calculation
ld b,'0'
cp 100 \ jr c,$+5 \ sub 100 \ inc b

;calculate 10s place digit, +1 for future calculation
ld de,$0A2F
inc e \ sub d \ jr nc,$-2
ld c,a

;Digits are now in B, E, and C (C holds the ones digit minus 10)
; strip leading zeros!
ld a,'0'
cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of cp * to skip the next byte, turns into cp $BB which will always return nz and nc
cp e \ jr z,$+4 \ ld (hl),e \ inc hl
add a,c
add a,d
ld (hl),a
inc hl
ld (hl),0

pop af
pop bc
pop de
pop hl
ret
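A quick way to check the expected output strings is a Python model of the same steps: optional sign, hundreds by comparison, tens by repeated subtraction, then leading-zero stripping. This is only a sketch with a made-up function name; it assumes $1A is the negative sign and returns the bytes that would be written at HL.

```python
NEG_SIGN = 0x1A  # TI-OS negative-sign character used by the routine


def itoa_8(a):
    """Model of itoa_8 for -128 <= a <= 127; returns the null-terminated bytes."""
    out = bytearray()
    if a < 0:
        out.append(NEG_SIGN)
        a = -a
    hundreds = ord('0')
    if a >= 100:
        a -= 100
        hundreds += 1
    tens = ord('0') - 1
    while True:                      # inc e \ sub d \ jr nc,$-2
        tens += 1
        a -= 10
        if a < 0:
            break
    ones = a + 10                    # undo the subtraction that underflowed
    if hundreds != ord('0'):
        out += bytes([hundreds, tens])
    elif tens != ord('0'):
        out.append(tens)
    out += bytes([ord('0') + ones, 0])
    return bytes(out)


print(itoa_8(-128), itoa_8(42), itoa_8(0))
```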

fixed88_to_string
Uses the itoa_8 routine to convert an 8.8 fixed-point number to a string.
Code: [Select]
;This converts a fixed-point number to a string.
;It displays up to 3 digits after the decimal.

fixed88_to_str:
;Inputs:
;   D.E is the fixed-point number
;   HL points to where the string gets output.
;      Needs at most 9 bytes.
;Outputs:
;   HL is preserved
;Destroys:
;   AF,DE,BC

;First check if the input is negative.
;If so, write a negative sign and negate
push hl
ld a,d
or a
jp p,+_
ld (hl),$1A      ;negative sign on TI-OS
inc hl
xor a
sub e
ld e,a
sbc a,a
sub d
_:

;Our adjusted number is in A.E
;Now we can print the integer part
call itoa_8

;Check if we need to print the fractional part
xor a
cp e
jr z,fixed88_to_str_end

;We need to write the fractional part, so seek the end of the string
;Search for the null byte. A is already 0
cpir

;Write a decimal
dec hl
ld (hl),'.'

ld b,3
_:
;Multiply E by 10, converting overflow to an ASCII digit
call fixed88_to_str_e_times_10
inc hl
ld (hl),a
djnz -_

;Strip the ending zeros
ld a,'0'
_:
cp (hl)
dec hl
jr z,-_

;write a null byte
inc hl
inc hl
ld (hl),0

fixed88_to_str_end:
;restore HL
pop hl
ret

fixed88_to_str_e_times_10:
ld a,e
ld d,0
add a,a \ rl d
add a,a \ rl d
add a,e \ jr nc,$+3 \ inc d
add a,a
ld e,a
ld a,d
rla
add a,'0'
ret
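The fractional part works by repeatedly multiplying the 8-bit fraction by 10 and peeling off whatever overflows past 8 bits as the next decimal digit. A minimal Python sketch of that step (function name hypothetical, three digits as in the routine):

```python
def fraction_digits(e):
    """Decimal digits of e/256 (0 <= e <= 255), at most three, trailing zeros removed."""
    digits = []
    for _ in range(3):
        e *= 10
        digits.append(e >> 8)      # overflow above 8 bits is the next digit
        e &= 0xFF
    while digits and digits[-1] == 0:
        digits.pop()
    return digits


print(fraction_digits(128))   # 0.5
print(fraction_digits(64))    # 0.25
```

Because the digits are truncated, 255/256 prints as .996 rather than being rounded.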

sqrtA
This is a very fast, unrolled routine to compute the square root of A.

Code: [Select]
sqrtA:
;Input: A
;Output: D is the square root, A is the remainder (input-D^2)
;Destroys: BC
;speed: 161+{0,6}+{0,1}+{0,1}+{0,3}
;min: 161cc
;max: 172cc
;avg: 166.5cc
;45 bytes
ld d,$40

sub d
jr nc,+_
add a,d
ld d,0
_:

set 4,d
sub d
jr nc,+_
add a,d
.db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
set 5,d
res 4,d
srl d

set 2,d
sub d
jr nc,+_
add a,d
.db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
set 3,d
res 2,d
srl d

inc d
sub d
jr nc,+_
add a,d
dec d
_:
inc d
srl d
ret

sqrtfixed_88
An unrolled, fast 8.8 fixed-point square root routine. Uses the above sqrtA routine.
Code: [Select]
sqrtfixed_88:
;Input: A.E ==> D.E
;Output: DE is the sqrt, AHL is the remainder
;Speed: 690+6{0,13}+{0,3+{0,18}}+{0,38}+sqrtA
;min: 855cc
;max: 1003cc
;avg: 924.5cc
;152 bytes

call sqrtA
ld l,a
ld a,e
ld h,0
ld e,d
ld d,h

sla e
rl d

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add a,a \ adc hl,hl
add a,a \ adc hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

;Now we have four more iterations
;The first two are no problem
sll e \ rl d
add hl,hl
add hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sll e \ rl d
add hl,hl
add hl,hl
sbc hl,de
jr nc,+_
add hl,de
dec e
.db $FE     ;start of cp *
_:
inc e

sqrtfixed_88_iter11:
;On the next iteration, HL might temporarily overflow by 1 bit
sll e \ rl d      ;sla e \ rl d \ inc e
add hl,hl
add hl,hl
jr c,sqrtfixed_88_iter11_br0
;
sbc hl,de
jr nc,+_
add hl,de
dec e
jr sqrtfixed_88_iter12
sqrtfixed_88_iter11_br0:
or a
sbc hl,de
_:
inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrtfixed_88_iter12:
ld b,a      ;A is 0, so B is 0
add hl,hl
add hl,hl
rla
;AHL - (DE+DE+1)
sbc hl,de \ sbc a,b
inc e
or a
sbc hl,de \ sbc a,b
ret p
add hl,de
adc a,b
dec e
add hl,de
adc a,b
ret

ncr_HL_DE
Computes 'HL choose DE' in such a way so that overflow only occurs if the final result overflows 16 bits.
Code: [Select]
; Requires
;    mul16          ;BC*DE ==> DEHL
;    DEHL_Div_BC    ;DEHL/BC ==> DEHL

ncr_HL_DE:
;"n choose r", defined as n!/(r!(n-r)!)
;Computes "HL choose DE"
;Inputs: HL,DE
;Outputs:
;   HL is the result
;       "HL choose DE"
;   carry flag reset means overflow
;Destroys:
;   A,BC,DE,IX
;Notes:
;   Overflow is returned as 0
;   Overflow happens if HL choose DE exceeds 65535
;   This algorithm is constructed in such a way that intermediate
;   operations won't erroneously trigger overflow.
;66 bytes
ld bc,1
or a
sbc hl,de
jr c,ncr_oob
jr z,ncr_exit
sbc hl,de
add hl,de
jr c,$+3
ex de,hl
ld a,h
or l
push hl
pop ix
ncr_exit:
ld h,b
ld l,c
scf
ret z
ncr_loop:
push bc \ push de
push hl \ push bc
ld b,h
ld c,l
call mul16          ;BC*DE ==> DEHL
pop bc
call DEHL_Div_BC    ;result in DEHL
ld a,d
or e
pop bc
pop de
jr nz,ncr_overflow
add hl,bc
jr c,ncr_overflow
pop bc
inc bc
ld a,b
cp ixh
jr c,ncr_loop
ld a,ixl
cp c
jr nc,ncr_loop
ret
ncr_overflow:
pop bc
xor a
ld b,a
ncr_oob:
ld h,b
ld l,b
ret
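The reason intermediate overflow cannot happen is that each step adds `result*m/i` to the running result, so every partial value is itself a binomial coefficient no larger than the final answer. A Python sketch of that recurrence, assuming this is the update the loop performs (the function name and the `None` return for overflow are illustrative choices):

```python
def ncr16(n, r):
    """n choose r with 16-bit overflow detection; returns None on overflow."""
    if r > n:
        return 0                      # out of bounds, like ncr_oob
    m = n - r
    k, m = min(r, m), max(r, m)       # loop over the smaller of r and n-r
    result = 1
    for i in range(1, k + 1):
        term = result * m // i        # exact division: this is C(m+i-1, i)
        result += term                # C(m+i, i) = C(m+i-1, i-1) + C(m+i-1, i)
        if result > 0xFFFF:
            return None
    return result


print(ncr16(18, 9), ncr16(19, 9))
```

The product `result * m` is at most a 16-bit by 16-bit multiply, which matches the mul16 and DEHL_Div_BC helpers the routine requires.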

EDIT: Optimized itoa_8 above. Here are some more routines:
uitoa_8
Converts an 8-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts an 8-bit unsigned integer to a string

uitoa_8:
;Input:
;   A is an unsigned integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 4 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to four bytes at HL
;   All registers preserved.
;on 0 to 9:     238              D=0
;on 10 to 99:   244+20D          D=0 to 9
;on 100 to 255: 257+2{0,6}+20D   D=0 to 5
;min: 238cc
;max: 424cc
;avg: 317.453125cc = 81268/256 = (238*10 + 334*90+313*156)/256
;52 bytes

push hl
push de
push bc
push af
;A is on [0,255]
;calculate 100s place, plus 1 for a future calculation
ld b,'0'
cp 100 \ jr c,$+5 \ sub 100 \ inc b
cp 100 \ jr c,$+5 \ sub 100 \ inc b

;calculate 10s place digit, +1 for future calculation
ld de,$0A2F
inc e \ sub d \ jr nc,$-2
ld c,a

;Digits are now in B, E, and C (C holds the ones digit minus 10)
; strip leading zeros!
ld a,'0'
cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of cp * to skip the next byte, turns into cp $BB which will always return nz and nc
cp e \ jr z,$+4 \ ld (hl),e \ inc hl
add a,c
add a,d
ld (hl),a
inc hl
ld (hl),0

pop af
pop bc
pop de
pop hl
ret

itoa_16
Converts a 16-bit signed integer to an ASCII string.
Code: [Select]
;Converts a 16-bit signed integer to an ASCII string.

itoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 7 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
push de
push bc
push af
push hl
bit 7,d
jr z,+_
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld (hl),$1A      ;negative char on TI-OS
inc hl
_:
ex de,hl

ld bc,-10000
ld a,'0'-1
inc a \ add hl,bc \ jr c,$-2
ld (de),a
inc de

ld bc,1000
ld a,'9'+1
dec a \ add hl,bc \ jr nc,$-2
ld (de),a
inc de

ld bc,-100
ld a,'0'-1
inc a \ add hl,bc \ jr c,$-2
ld (de),a
inc de

ld a,l
ld h,'9'+1
dec h \ add a,10 \ jr nc,$-3
add a,'0'
ex de,hl
ld (hl),d
inc hl
ld (hl),a
inc hl
ld (hl),0

;Now strip the leading zeros
pop hl

;If the first char is a negative sign, skip it
ld a,(hl)
cp $1A
push af
ld a,'0'
jr nz,$+3
inc hl
cp (hl)
jr z,$-2

;Check if we need to re-write the negative sign
pop af
jr nz,+_
dec hl
ld (hl),a
_:

pop af
pop bc
pop de
ret

uitoa_16
Converts a 16-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts a 16-bit unsigned integer to an ASCII string.

uitoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 6 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
push de
push bc
push af
ex de,hl

ld bc,-10000
ld a,'0'-1
inc a \ add hl,bc \ jr c,$-2
ld (de),a
inc de

ld bc,1000
ld a,'9'+1
dec a \ add hl,bc \ jr nc,$-2
ld (de),a
inc de

ld bc,-100
ld a,'0'-1
inc a \ add hl,bc \ jr c,$-2
ld (de),a
inc de

ld a,l
ld h,'9'+1
dec h \ add a,10 \ jr nc,$-3
add a,'0'
ex de,hl
ld (hl),d
inc hl
ld (hl),a
inc hl
ld (hl),0

;Now strip the leading zeros
ld c,-6
add hl,bc     ;B is still $FF, so this steps back to just before the first digit
ld a,'0'
inc hl \ cp (hl) \ jr z,$-2

pop af
pop bc
pop de
ret
« Last Edit: August 17, 2019, 10:35:30 am by Xeda112358 »

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #97 on: August 28, 2019, 09:33:29 am »
Here is a masked sprite routine (no clipping)! Interleave the data with the mask, where the MASK is ANDed with the buffer, and the data is ORed on top of that:
Code: [Select]
;Masked Sprite routine

putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ANDed to the buffer, then data is ORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is any, data is 1
;     white: mask is 0, data is 0
;     clear: mask is 1, data is 0 (keeps the data from the buffer)
;
;This routine is free to use :)
;65 bytes (or 66 bytes if gbuf is not located at 0x**40)

ld e,l
ld h,0
ld d,h
add hl,hl
add hl,de
add hl,hl
add hl,hl
ld e,a
and 7
ld c,a
xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
inc a
rra
rra
rra
ld e,a
ld d,plotSScreen>>8
#else
rra
rra
rra
ld e,a
add hl,de
ld de,plotSScreen
#endif
add hl,de

putsprite_masked_loop:
push bc
xor a
ld d,(ix)
ld e,a
sub c
ld b,c
ld c,$FF
inc ix
ld a,(ix)
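jr z,putsprite_masked_adjust_done
putsprite_masked_adjust_loop:
scf
rra
rr c
srl d
rr e
djnz putsprite_masked_adjust_loop
putsprite_masked_adjust_done:
and (hl)
or d
ld (hl),a
inc hl
ld a,(hl)
and c
or e
ld (hl),a
ld c,11
add hl,bc
inc ix
pop bc
djnz putsprite_masked_loop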
ret
But if you want even faster and smaller, use a non-traditional mask technique by ORing the mask onto the buffer, then XORing the data on top of it. The format is less intuitive, but it allows for white/black/clear/invert instead of just white/black/clear:
Code: [Select]
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ORed to the buffer, then data is XORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is 1, data is 0
;     white: mask is 1, data is 1
;     clear: mask is 0, data is 0 (keeps the data from the buffer)
;     invert: mask is 0, data is 1 (inverts the data from the buffer)
;
;This routine is free to use :)
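;63 bytes (or 64 bytes if gbuf is not located at 0x**40)

putsprite_masked:
ld e,l
ld h,0
ld d,h
add hl,hl
add hl,de
add hl,hl
add hl,hl
ld e,a
and 7
ld c,a
xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
inc a
rra
rra
rra
ld e,a
ld d,plotSScreen>>8
#else
rra
rra
rra
ld e,a
add hl,de
ld de,plotSScreen
#endif
add hl,de

putsprite_masked_loop:
push bc
xor a
ld d,(ix)
ld e,a
or c
ld b,c
ld c,e
inc ix
ld a,(ix)
jr z,putsprite_masked_adjust_done
putsprite_masked_adjust_loop:
rra
rr c
srl d
rr e
djnz putsprite_masked_adjust_loop
putsprite_masked_adjust_done:
or (hl)
xor d
ld (hl),a
inc hl
ld a,(hl)
or c
xor e
ld (hl),a
ld c,11
add hl,bc
inc ix
pop bc
djnz putsprite_masked_loop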
ret
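The two mask conventions are easy to mix up, so here is a tiny Python sketch of the per-pixel logic from the Notes sections of both routines (1 is a dark pixel; the function names are made up for illustration):

```python
def and_or(buffer, mask, data):
    """First routine: AND the mask onto the buffer, then OR the data."""
    return (buffer & mask) | data


def or_xor(buffer, mask, data):
    """Second routine: OR the mask onto the buffer, then XOR the data."""
    return (buffer | mask) ^ data


for buf in (0, 1):
    print(buf, and_or(buf, 1, 0), or_xor(buf, 0, 1))
```

The OR/XOR form gets its fourth state (invert) from the mask=0, data=1 combination, which the AND/OR form cannot express.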

I also made some "bigsprite" routines! These do have clipping, too. First, they use some common subroutines for computing masks and performing most of the clipping and shifting:
Code: [Select]
;133 bytes total

;This is made by Zeda, feel free to use it for whatever.
;Takes inputs for a big sprite and sets up masks and clipping
;requires 4 bytes of temporary RAM, but doesn't use SMC

spritetmp = 8000h     ;relocate this as needed! Just need 4 bytes.
sprite_width  = spritetmp+0
sprite_x      = spritetmp+1
sprite_mask0  = spritetmp+2
sprite_mask1  = spritetmp+3

bigsprite_subroutine:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;Outputs:
;   carry flag is set if okay to draw, nc if out-of-bounds.
;   B is height.
;   C is width.
;   HL points to the byte to start drawing at.
;   DE points to where to start sourcing the sprite data
;   (sprite_width) is the width of the sprite in bytes
;   (sprite_x) is the initial x coordinate to begin drawing at
;92 bytes

;First check if the sprite is on-screen in the vertical direction
ld a,c
cp 64
jr c,+_
add a,h
ret nc
ld h,a
push hl
xor a
ld h,a
sub c
ex de,hl
add hl,de
dec a
jr nz,$-2
ex de,hl
pop hl
xor a
ld c,a
_:
;Next check h+c<=64
ld a,64
sub c
cp h
jr nc,+_
ld h,a
_:

;Make sure the height is not now 0
ld a,h
or a
ret z

;Save the width and height of the sprite
push hl                 ;height,width
ld h,b
ld (sprite_width),hl    ;x,width
push de                 ;sprite pointer

;Set up a pointer to the routine for shifting the sprite data
ld ixh,rshiftHA_7>>8
ld a,h
cpl
and 7
ld l,a
add a,a
add a,l
add a,rshiftHA_7&255
ld ixl,a
#if (rshiftHA_7&255)>234
jr nc,$+4
inc ixh
#endif

ld a,b
and 7
ld de,spritemask
add a,e
ld e,a
#if (spritemask&255)>248
jr nc,$+3
inc d
#endif
ld a,(de)
ld (sprite_mask0),a
cpl
ld (sprite_mask1),a
;
;
ld a,c
add a,a
sbc a,a
ld h,a
ld a,b
ld b,h
ld l,c
add hl,hl
add hl,bc
add hl,hl
add hl,hl
ld c,a
add a,a
sbc a,a
ld b,a
ld a,c
sra c
sra c
sra c
add hl,bc
ld bc,plotSScreen
add hl,bc

pop de
pop bc      ;B is height
            ;C is width
ex de,hl
scf
ret

rshiftHA_7:
rr h \ rra
rr h \ rra
rr h \ rra
rr h \ rra
rr h \ rra
rr h \ rra
rr h \ rra
ex de,hl
ld e,a
ret
spritemask:
.db $00,$80,$C0,$E0,$F0,$F8,$FC,$FE
call_ix:
jp (ix)
Then you can draw a big sprite with OR logic:
Code: [Select]
bigsprite_OR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes

;Set up the clipping
call bigsprite_subroutine
ret nc

bigsprite_OR_loop:
push bc   ;height,width
push de   ;gbuf ptr
push hl   ;sprite data pointer
ld a,(sprite_x)
ld c,a
add a,8       ;the next column starts 8 pixels to the right
ld (sprite_x),a

spriteloop_OR:
push bc
push hl
ld h,(hl)
xor a
call call_ix
ld a,c
cp 96
jr nc,+_
ld a,(hl)
or d
ld (hl),a
ld a,c
_:
inc hl
add a,8       ;x of the right-hand byte
cp 96
jr nc,+_
ld a,(hl)
or e
ld (hl),a
_:
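ld bc,11
add hl,bc     ;next row of the buffer (12 bytes per row, HL was already incremented)
ex de,hl
pop hl
ld a,(sprite_width)
ld c,a
add hl,bc     ;next row of the sprite data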
pop bc
djnz spriteloop_OR

pop hl
inc hl
pop de
inc de
pop bc
dec c
jr nz,bigsprite_OR_loop
ret
Or draw with XOR logic:
Code: [Select]
bigsprite_XOR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes

;Set up the clipping
call bigsprite_subroutine
ret nc

bigsprite_XOR_loop:
push bc   ;height,width
push de   ;gbuf ptr
push hl   ;sprite data pointer
ld a,(sprite_x)
ld c,a
add a,8       ;the next column starts 8 pixels to the right
ld (sprite_x),a

spriteloop_XOR:
push bc
push hl
ld h,(hl)
xor a
call call_ix
ld a,c
cp 96
jr nc,+_
ld a,(hl)
xor d
ld (hl),a
ld a,c
_:
inc hl
add a,8       ;x of the right-hand byte
cp 96
jr nc,+_
ld a,(hl)
xor e
ld (hl),a
_:
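ld bc,11
add hl,bc     ;next row of the buffer (12 bytes per row, HL was already incremented)
ex de,hl
pop hl
ld a,(sprite_width)
ld c,a
add hl,bc     ;next row of the sprite data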
pop bc
djnz spriteloop_XOR

pop hl
inc hl
pop de
inc de
pop bc
dec c
jr nz,bigsprite_XOR_loop
ret
Or draw with AND logic:
Code: [Select]
bigsprite_AND:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;69 bytes

;Set up the clipping
call bigsprite_subroutine
ret nc

bigsprite_AND_loop:
push bc   ;height,width
push de   ;gbuf ptr
push hl   ;sprite data pointer
ld a,(sprite_x)
ld c,a
add a,8       ;the next column starts 8 pixels to the right
ld (sprite_x),a

spriteloop_AND:
push bc
push hl
ld h,(hl)
scf \ sbc a,a
call call_ix
ld a,c
cp 96
jr nc,+_
ld a,(hl)
and d
ld (hl),a
ld a,c
_:
inc hl
add a,8       ;x of the right-hand byte
cp 96
jr nc,+_
ld a,(hl)
and e
ld (hl),a
_:
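ld bc,11
add hl,bc     ;next row of the buffer (12 bytes per row, HL was already incremented)
ex de,hl
pop hl
ld a,(sprite_width)
ld c,a
add hl,bc     ;next row of the sprite data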
pop bc
djnz spriteloop_AND

pop hl
inc hl
pop de
inc de
pop bc
dec c
jr nz,bigsprite_AND_loop
ret
Or draw with Erase logic:
Code: [Select]
bigsprite_Erase:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;67 bytes

;Set up the clipping
call bigsprite_subroutine
ret nc

bigsprite_Erase_loop:
push bc   ;height,width
push de   ;gbuf ptr
push hl   ;sprite data pointer
ld a,(sprite_x)
ld c,a
add a,8       ;the next column starts 8 pixels to the right
ld (sprite_x),a

spriteloop_Erase:
push bc
push hl
ld h,(hl)
xor a
call call_ix
ld a,c
cp 96
jr nc,+_
ld a,d
cpl
and (hl)
ld (hl),a
ld a,c
_:
inc hl
add a,8       ;x of the right-hand byte
cp 96
jr nc,+_
ld a,e
cpl
and (hl)
ld (hl),a
_:
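ld bc,11
add hl,bc     ;next row of the buffer (12 bytes per row, HL was already incremented)
ex de,hl
pop hl
ld a,(sprite_width)
ld c,a
add hl,bc     ;next row of the sprite data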
pop bc
djnz spriteloop_Erase

pop hl
inc hl
pop de
inc de
pop bc
dec c
jr nz,bigsprite_Erase_loop
ret
Or draw with Overwrite logic:
Code: [Select]
bigsprite_Overwrite:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;71 bytes

;Set up the clipping
call bigsprite_subroutine
ret nc

bigsprite_Overwrite_loop:
push bc   ;height,width
push de   ;gbuf ptr
push hl   ;sprite data pointer
ld a,(sprite_x)
ld c,a
add a,8       ;the next column starts 8 pixels to the right
ld (sprite_x),a

spriteloop_Overwrite:
push bc
push hl
ld h,(hl)
xor a
call call_ix
ld a,c
cp 96
jr nc,+_
ld a,(sprite_mask0)   ;keep only the buffer bits to the left of the sprite
and (hl)
or d
ld (hl),a
ld a,c
_:
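inc hl
add a,8       ;x of the right-hand byte
cp 96
jr nc,+_
ld a,(sprite_mask1)   ;keep only the buffer bits to the right of the sprite
and (hl)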
or e
ld (hl),a
_:
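ld bc,11
add hl,bc     ;next row of the buffer (12 bytes per row, HL was already incremented)
ex de,hl
pop hl
ld a,(sprite_width)
ld c,a
add hl,bc     ;next row of the sprite data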
pop bc
djnz spriteloop_Overwrite

pop hl
inc hl
pop de
inc de
pop bc
dec c
jr nz,bigsprite_Overwrite_loop
ret

#### SpiroH

• LV8 Addict (Next: 1000)
• Posts: 715
• Rating: +153/-23
##### Re: ASM Optimized routines
« Reply #98 on: August 28, 2019, 10:07:05 am »
@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language .

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #99 on: August 28, 2019, 10:11:55 am »
Quote from: SpiroH
> @Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
> I wonder how many are really interested in speeding up the calculations these days.
> It seems all they care about is python, java and what-have-you-funky-high-level-language .

Thanks! I wrote these with apps in mind, so I tried to reduce the need for external RAM. I should definitely make versions that take full advantage of SMC, though!

#### Xeda112358

##### Re: ASM Optimized routines
« Reply #100 on: September 10, 2019, 06:18:29 pm »
Here is a circle routine that I made!
Code: [Select]
;Written by Zeda Thomas, free to use.

;This draws a circle centered at 8-bit coordinates and with radius up to 127.
;IX points to a plot routine that takes (B,C)=(x,y) as input and does something
;with it, like plot the pixel a certain color, or plot a "big" pixel, or whatever.
;   plot
;     Takes coordinates, (B,C) = (x,y) and plots the point.
;
;For example, on the TI-83+/84+/SE the plot routine might look like:
; plot:
;   call getpixelloc
;   ret nc            ;Exit if the coordinates are out-of-bounds
;   or (hl)
;   ld (hl),a
;   ret
;
;
; Required subroutines:
;     call_ix:
;       jp (ix)

circle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a plot routine that takes (B,C)=(x,y) as input.
ld a,d
add a,a      ;carry if the radius is 128 or more, zero if it is 0
ret c
ret z
ld l,d
dec a
ld e,a
dec a        ;if the pixel is only 1 wide, just plot the point
jp z,call_ix ;Jump to the plot routine
xor a
ld h,-1
ld d,1
scf     ;skip the first plot
circleloop:
call nc,plot4
inc h
sub d
inc d
inc d
jr nc,circleloop
_:
dec l
call plot4
add a,e
dec e
ret z
dec e
jr nc,-_
jp circleloop

plot4:
;BC is center
;HL is x,y

push de
push af
push hl
push bc

;If H is 0, or L is 0, we need to draw only half
push hl

ld a,b
sub h
ld b,a
add a,h
add a,h
ld h,a

ld a,c
sub l
ld c,a
add a,l
add a,l
ld l,a

;B is x0-x
;C is y0-y
;H is x0+x
;L is y0+y

;plot(x0-x,y0-y)
;plot(x0+x,y0+y)
push bc
push hl
call call_ix    ;call the plot routine
pop bc
push bc
call call_ix    ;call the plot routine

;now swap the y coords
pop hl
pop bc
ld a,l
ld l,c
ld c,a
pop de
xor a
cp d
jr z,+_
cp e
jr z,+_

;plot(x0-x,y0+y)
;plot(x0+x,y0-y)
push hl
call call_ix    ;call the plot routine
pop bc
call call_ix    ;call the plot routine
_:

pop bc
pop hl
pop af
pop de
ret
The really cool feature about this is that you can define a custom plot routine pointed to by IX, so it isn't TI-specific, and you can do all sorts of wonky things like:
Draw 2x2 pixels:

Code: [Select]
;calling with ld ix,pixelOn_2x2

pixelOn_2x2:
sla b
ret c
sla c
ret c
push bc
call pixelOn
pop bc
inc b
push bc
call pixelOn
pop bc
inc c
push bc
call pixelOn
pop bc
dec b
jp pixelOn

Or draw a circle whose "pixels" are circles:

Code: [Select]
;calling with ld ix,pixelOn_circle

pixelOn_circle:
ld a,b
cp 32
ret nc
ld b,a
ld a,c
cp 32
ret nc
ld c,a
ld d,4
push ix    ;need to save IX!
ld ix,pixelOn
call circle
pop ix
ret
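The plug-in plot routine idea carries over to any language with callbacks. Below is a Python sketch using the textbook midpoint circle, which is not the same error bookkeeping as the routine above; `circle`, `plot`, and `plot_2x2` are illustrative names, with `plot` standing in for the routine that IX points to.

```python
def circle(cx, cy, r, plot):
    """Call plot(x, y) for every point of a circle of radius r centered at (cx, cy)."""
    x, y, err = r, 0, 1 - r
    while x >= y:
        for dx, dy in ((x, y), (y, x)):        # mirror across the diagonal
            for sx in (1, -1):
                for sy in (1, -1):             # and across both axes
                    plot(cx + sx * dx, cy + sy * dy)
        y += 1
        if err < 0:
            err += 2 * y + 1
        else:
            x -= 1
            err += 2 * (y - x) + 1


def plot_2x2(pixels):
    """Return a plot callback that draws each point as a 2x2 block into the set `pixels`."""
    def plot(x, y):
        for ox in (0, 1):
            for oy in (0, 1):
                pixels.add((2 * x + ox, 2 * y + oy))
    return plot


points = set()
circle(10, 10, 5, lambda x, y: points.add((x, y)))
print(sorted(points))
```

Swapping the lambda for `plot_2x2(some_set)` gives the "big pixel" effect without touching the circle code, which is the same trick as pointing IX at pixelOn_2x2.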
EDIT: I inlined some subroutines because there was no reason to have them called. It was a waste of clock cycles and space!
EDIT: Have a separate, filled circle routine!
Note that if you pass the same arguments as the regular circle routine, this only draws the inside part and skips the border.
Code: [Select]
;Written by Zeda Thomas, free to use.

;This draws the fill of a circle centered at 8-bit coordinates and with radius
;up to 127.
;IX points to a horizontal line routine that takes E=x, A=y, D=width as input
;and does something with it, like plot a horizontal line.
;
; For example, on the ti-83+/84+/SE calculators, you might have:
;     horizontal_line:
;       ld b,e
;       ld c,a
;       ld e,1
;       ld hl,gbuf
;       jp rectOR

; Required subroutines:
;     call_ix:
;       jp (ix)

filledcircle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a horizontal line routine that takes E=x, A=y, D=width as input.
ld a,d
add a,a      ;carry if the radius is 128 or more, zero if it is 0
ret c
ret z
ld l,d
dec a
ld e,a
xor a
ld h,-1
ld d,1
filledcircleloop:
; call c,fillcircle_plot
inc h
sub d
inc d
inc d
jr nc,filledcircleloop
_:
dec l
call fillcircle_plot
add a,e
dec e
ret z
dec e
jr nc,-_
jp filledcircleloop

fillcircle_plot:
inc h
dec h
ret z
push hl
push de
push bc
push af
dec h
ld a,b
sub h
ld e,a
ld d,h
sll d   ;aka slia, undocumented

ld a,l
or a
ld h,c
jr z,+_
add a,h      ;A = y0+y for the lower line
push de
push hl
call nz,call_ix
pop hl
pop de
_:
ld a,h
sub l
call call_ix
pop af
pop bc
pop de
pop hl
ret
« Last Edit: September 10, 2019, 09:05:45 pm by Xeda112358 »