• ASM Optimized routines 5 1
Currently:

### Author Topic: ASM Optimized routines  (Read 61238 times)

0 Members and 1 Guest are viewing this topic.

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #90 on: October 16, 2017, 11:52:22 pm »
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.

Code: [Select]
; divide HL by C (HL is signed, C is not); output: HL = quotient, A = remainderdivHLbyCs:    bit 7,h    push    af    jr      z,divHLbyCsStart    xor a \ sub l \ ld l,a    sbc a,a \ sub h \ ld h,adivHLbyCsStart:    xor     a    ld      b,16divHLbyCsLoop:    add     hl,hl    rla    cp      c    jp      c,divHLbyCsNext    sub     c    inc     ldivHLbyCsNext:    djnz    divHLbyCLoop    ld      b,a    pop     af    ld      a,b    ret     z    xor a \ sub l \ ld l,a    sbc a,a \ sub h \ ld h,a    ld a,c    sub b    retI changed the output remainder so that it returns the remainder modulo c. So -7/3 returns 2 instead of 1. This means that no bytes nor clock cycles were saved after all

#### JamesV

• Posts: 275
• Rating: +76/-0
##### Re: ASM Optimized routines
« Reply #91 on: October 17, 2017, 04:49:18 pm »
Thanks for the optimisations/fixes! I guess I was lucky that I don't actually use the remainder in my program, and also C will only ever be between 1-64 inclusive. I like the negate HL optimisation too, that's neat

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #92 on: July 27, 2018, 07:40:53 pm »
I have two neat routines, pushpop and diRestore. If you want to preserve HL,DE,BC, and AF, then you can just put a call to pushpop at the start of your routine. If you want interrupts to be restored on exiting your routine, you can call diRestore at the top of your code.

They both mess with the stack to insert a return routine on the stack. For example, when your code calls diRestore, then an ret will first return to the code to restore the interrupt status, and then back to the calling routine.

Code: [Select]
pushpop:;26 bytes, adds 229cc to the calling routine  ex (sp),hl  push de  push bc  push af  push hl  ld hl,pushpopret  ex (sp),hl  push hl  push af  ld hl,12  add hl,sp  ld a,(hl)  inc hl  ld h,(hl)  ld l,a  pop af  retpushpopret:  pop af  pop bc  pop de  pop hl  ret
Code: [Select]
diRestore:    ex (sp),hl    push hl    push af    ld hl,restoreei    ld a,r    jp pe,+_    dec hl    dec hl_:    di    inc sp    inc sp    inc sp    inc sp    ex (sp),hl    dec sp    dec sp    dec sp    dec sp    pop af    retrestoredi:    di    retrestoreei:    ei    retEdit: calc84maniac brought up that those 'inc sp' instructions might cause some issues if interrupts fire in between them. The code is modified to disable interrupts first, then mess with the stack.
« Last Edit: July 28, 2018, 09:34:41 am by Xeda112358 »

#### calc84maniac

• eZ80 Guru
• Coder Of Tomorrow
• LV11 Super Veteran (Next: 3000)
• Posts: 2898
• Rating: +467/-17
##### Re: ASM Optimized routines
« Reply #93 on: August 04, 2018, 06:55:30 pm »
32-Bit Endian Swap (eZ80)

Swaps the byte order of a 32-bit value stored in EUHL. Can be adapted to work with other register combinations as well.

Code: [Select]
endianswap32:    push hl    ld h,e    ld e,l    inc sp    push hl    inc sp    pop hl    inc sp    ret
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #94 on: October 16, 2018, 06:33:14 pm »
Here is a decent arctangent routine that works on [0,1), so you'll have to do the work to extend it outside that range. It uses a small lookup table and linear interpolation.
Spoiler For Range Reduction:
atan(-x)=-atan(x) is useful to extend to negative numbers
atan(x)=pi/2-atan(1/x) is useful to extend to inputs with magnitude >=1
Code: [Select]
atan8:;computes 256*atan(A/256)->A;56 bytes including the LUT;min: 246cc;max: 271cc;avg: 258.5cc  rlca  rlca  rlca  ld d,a  and 7  ld hl,atan8LUT  add a,l  ld l,a#if (atan8LUT&255)>248    ;this section not included in size/speed totals  jr nc,$+3 ;can add three bytes, 12cc to max, 11cc to min, and 11.5cc to avg inc h#endif ld c,(hl) inc hl ld a,(hl) sub c ld e,0 ex de,hl ld d,l ld e,a sla h \ jr nc,$+3 \ ld l,e  add hl,hl \ jr nc,$+3 \ add hl,de add hl,hl \ jr nc,$+3 \ add hl,de  add hl,hl \ jr nc,$+3 \ add hl,de add hl,hl \ jr nc,$+3 \ add hl,de  add hl,hl  add hl,hl  add hl,hl;  add hl,hl    ;used in rounding...  ld a,h;  rra          ;but doesn't seem to improve the error  adc a,c  retatan8LUT:  .db 0,32,63,92,119,143,165,184,201

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #95 on: March 24, 2019, 11:58:29 am »
32-bit square root:
Code: [Select]
sqrtHLIX:;Input: HLIX;Output: DE is the sqrt, AHL is the remainder;speed: 751+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL;min: 1103;max: 1237;avg: 1165.5;166 bytes  call sqrtHL   ;expects returns A as sqrt, HL as remainder, D = 0  add a,a  ld e,a  rl d  ld a,ixh  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db $FE ;start of cp *_: inc e sll e \ rl d add a,a \ adc hl,hl add a,a \ adc hl,hl sbc hl,de jr nc,+_ add hl,de dec e .db$FE     ;start of cp *_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db $FE ;start of cp *_: inc e sll e \ rl d add a,a \ adc hl,hl add a,a \ adc hl,hl sbc hl,de jr nc,+_ add hl,de dec e .db$FE     ;start of cp *_:  inc e;Now we have four more iterations;The first two are no problem  ld a,ixl  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db $FE ;start of cp *_: inc e sll e \ rl d add a,a \ adc hl,hl add a,a \ adc hl,hl sbc hl,de jr nc,+_ add hl,de dec e .db$FE     ;start of cp *_:  inc esqrt32_iter15:;On the next iteration, HL might temporarily overflow by 1 bit  sll e \ rl d      ;sla e \ rl d \ inc e  add a,a  adc hl,hl  add a,a  adc hl,hl       ;This might overflow!  jr c,sqrt32_iter15_br0;  sbc hl,de  jr nc,+_  add hl,de  dec e  jr sqrt32_iter16sqrt32_iter15_br0:  or a  sbc hl,de_:  inc e;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anywayssqrt32_iter16:  add a,a  ld b,a        ;either 0x00 or 0x80  adc hl,hl  rla  adc hl,hl  rla;AHL - (DE+DE+1)  sbc hl,de \ sbc a,b  inc e  or a  sbc hl,de \ sbc a,b  ret p  add hl,de  adc a,b  dec e  add hl,de  adc a,b  retThis uses this sqrtHL routine:
Code: [Select]
;written by ZedasqrtHL:;returns A as the sqrt, HL as the remainder, D = 0;min: 352cc;max: 391cc;avg: 371.5cc  ld de,05040h  ; 10  ld a,h        ; 4  sub e         ; 4  jr nc,sq7     ;\  add a,e       ; | branch 1: 12cc  ld d,16       ; | branch 2: 18ccsq7:            ;/; ----------  cp d          ; 4  jr c,sq6      ;\  sub d         ; | branch 1: 12cc  set 5,d       ; | branch 2: 19ccsq6:            ;/; ----------  res 4,d       ; 8  srl d         ; 8  set 2,d       ; 8  cp d          ; 4  jr c,sq5      ;\  sub d         ; | branch 1: 12cc  set 3,d       ; | branch 2: 19ccsq5:            ;/  srl d         ; 8; ----------  inc a         ; 4  sub d         ; 4  jr nc,sq4     ;\  dec d         ; | branch 1: 12cc  add a,d       ; | branch 2: 19cc  dec d         ; | <-- this resets the low bit of D, so srl d resets carry.sq4:            ;/  srl d         ; 8  ld h,a        ; 4; ----------  ld a,e        ; 4  sbc hl,de     ; 15  jr nc,sq3     ;\  add hl,de     ; | 12cc or 18ccsq3:            ;/  ccf           ; 4  rra           ; 4  srl d         ; 8  rra           ; 4; ----------  ld e,a        ; 4  sbc hl,de     ; 15  jr c,sq2      ;\  or 20h        ; | branch 1: 23cc  db 254        ; |   <-- start of cp * which is 7cc to skip the next byte.sq2:            ; | branch 2: 21cc  add hl,de     ;/  xor 18h       ; 7  srl d         ; 8  rra           ; 4; ----------  ld e,a        ; 4  sbc hl,de     ; 15  jr c,sq1      ;\  or 8          ; | branch 1: 23cc  db 254        ; |   <-- start of cp * which is 7cc to skip the next byte.sq1:            ; | branch 2: 21cc  add hl,de     ;/  xor 6         ; 7  srl d         ; 8  rra           ; 4; ----------  ld e,a        ; 4  sbc hl,de     ; 15  jr nc,+_      ;    \  add hl,de     ; 15  |  srl d         ; 8   |  rra           ; 4   | branch 1: 38cc  ret           ; 10  | branch 2: 40cc_:              ;     |  inc a         ; 4   |  srl d         ; 8   |  rra           ; 4   |  ret           ; 10 /
sqrtHL was from my work on the float routines, sqrtHLIX was inspired by this thread

EDIT: Optimized some more
« Last Edit: September 09, 2019, 04:33:21 pm by Xeda112358 »

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #96 on: August 16, 2019, 11:41:13 pm »
Here are some routines that I've added to the repository:

itoa_8
Converts an 8-bit signed integer to an ASCII string.
Code: [Select]
;Converts an 8-bit signed integer to a stringitoa_8:;Input:;   A is a signed integer;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes);Output:;   The number is converted to a null-terminated string at HL;Destroys:;   Up to five bytes at HL;   All registers preserved.;on 0 to 9:       252       D=0;on 10 to 99:     258+20D   D=0 to 9;on 100 to 127:   277+20D   D=0 to 2;on -1 to -9:     276       D=0;on -10 to -99:   282+20D   D=0 to 9;on -100 to -128: 301+20D   D=0 to 2;min: 252cc  (+23cc over original);max: 462cc  (-49cc over original);avg: 343.74609375cc = 87999/256;54 bytes  push hl  push de  push bc  push af  or a  jp p,itoa_pos  neg  ld (hl),$1A ;start if neg char on TI-OS inc hlitoa_pos:;A is on [0,128];calculate 100s place, plus 1 for a future calculation ld b,'0' cp 100 \ jr c,$+5 \ sub 100 \ inc b;calculate 10s place digit, +1 for future calculation  ld de,$0A2F inc e \ sub d \ jr nc,$-2  ld c,a;Digits are now in D, C, A; strip leading zeros!  ld a,'0'  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db$FE  ; start of cp * to skip the next byte, turns into cp $BB which will always return nz and nc cp e \ jr z,$+4 \ ld (hl),e \ inc hl  add a,c  add a,d  ld (hl),a  inc hl  ld (hl),0  pop af  pop bc  pop de  pop hl  ret
fixed88_to_string
Uses the itoa_8 routine to convert an 8.8 fixed-point number to a string.
Code: [Select]
;This converts a fixed-point number to a string.;It displays up to 3 digits after the decimal.fixed88_to_str:;Inputs:;   D.E is the fixed-point number;   HL points to where the string gets output.;      Needs at most 9 bytes.;Outputs:;   HL is preserved;Destroys:;   AF,DE,BC;First check if the input is negative.;If so, write a negative sign and negate  push hl  ld a,d  or a  jp p,+_  ld (hl),$1A ;negative sign on TI-OS inc hl xor a sub e ld e,a sbc a,a sub d_:;Our adjusted number is in A.E;Now we can print the integer part call itoa_8;Check if we need to print the fractional part xor a cp e jr z,fixed88_to_str_end;We need to write the fractional part, so seek the end of the string;Search for the null byte. A is already 0 cpir;Write a decimal dec hl ld (hl),'.' ld b,3_:;Multiply E by 10, converting overflow to an ASCII digit call fixed88_to_str_e_times_10 inc hl ld (hl),a djnz -_;Strip the ending zeros ld a,'0'_: cp (hl) dec hl jr z,-_;write a null byte inc hl inc hl ld (hl),0fixed88_to_str_end:;restore HL pop hl retfixed88_to_str_e_times_10: ld a,e ld d,0 add a,a \ rl d add a,a \ rl d add a,e \ jr nc,$+3 \ inc d  add a,a  ld e,a  ld a,d  rla  add a,'0'  ret
sqrtA
This is a very fast, unrolled routine to compute the square root of A.

Code: [Select]
sqrtA:;Input: A;Output: D is the square root, A is the remainder (input-D^2);Destroys: BC;speed: 161+{0,6}+{0,1}+{0,1}+{0,3};min: 161cc;max: 172cc;avg: 166.5cc;45 bytes  ld d,$40 sub d jr nc,+_ add a,d ld d,0_: set 4,d sub d jr nc,+_ add a,d .db$01   ;start of ld bc,** which is 10cc to skip the next two bytes._:  set 5,d  res 4,d  srl d  set 2,d  sub d  jr nc,+_  add a,d  .db $01 ;start of ld bc,** which is 10cc to skip the next two bytes._: set 3,d res 2,d srl d inc d sub d jr nc,+_ add a,d dec d_: inc d srl d ret sqrtfixed_88 An unrolled, fast 8.8 fixed-point square root routine. Uses the above sqrtA routine. Code: [Select] sqrtfixed_88:;Input: A.E ==> D.E;Output: DE is the sqrt, AHL is the remainder;Speed: 690+6{0,13}+{0,3+{0,18}}+{0,38}+sqrtA;min: 855cc;max: 1003cc;avg: 924.5cc;152 bytes call sqrtA ld l,a ld a,e ld h,0 ld e,d ld d,h sla e rl d sll e \ rl d add a,a \ adc hl,hl add a,a \ adc hl,hl sbc hl,de jr nc,+_ add hl,de dec e .db$FE     ;start of cp *_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db $FE ;start of cp *_: inc e sll e \ rl d add a,a \ adc hl,hl add a,a \ adc hl,hl sbc hl,de jr nc,+_ add hl,de dec e .db$FE     ;start of cp *_:  inc e  sll e \ rl d  add a,a \ adc hl,hl  add a,a \ adc hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db $FE ;start of cp *_: inc e;Now we have four more iterations;The first two are no problem sll e \ rl d add hl,hl add hl,hl sbc hl,de jr nc,+_ add hl,de dec e .db$FE     ;start of cp *_:  inc e  sll e \ rl d  add hl,hl  add hl,hl  sbc hl,de  jr nc,+_  add hl,de  dec e  .db $FE ;start of cp *_: inc esqrtfixed_88_iter11:;On the next iteration, HL might temporarily overflow by 1 bit sll e \ rl d ;sla e \ rl d \ inc e add hl,hl add hl,hl jr c,sqrtfixed_88_iter11_br0; sbc hl,de jr nc,+_ add hl,de dec e jr sqrtfixed_88_iter12sqrtfixed_88_iter11_br0: or a sbc hl,de_: inc e;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anywayssqrtfixed_88_iter12: ld b,a ;A is 0, so B is 0 add hl,hl add hl,hl rla;AHL - (DE+DE+1) sbc hl,de \ sbc a,b inc e or a sbc hl,de \ sbc a,b ret p add hl,de adc a,b dec e add hl,de adc a,b ret ncr_HL_DE Computes 'HL choose DE' in such a way so that overflow only occurs if the final result overflows 16 bits. Code: [Select] ; Requires; mul16 ;BC*DE ==> DEHL; DEHL_Div_BC ;DEHL/BC ==> DEHLncr_HL_DE:;"n choose r", defined as n!/(r!(n-r)!);Computes "HL choose DE";Inputs: HL,DE;Outputs:; HL is the result; "HL choose DE"; carry flag reset means overflow;Destroys:; A,BC,DE,IX;Notes:; Overflow is returned as 0; Overflow happens if HL choose DE exceeds 65535; This algorithm is constructed in such a way that intermediate; operations won't erroneously trigger overflow.;66 bytes ld bc,1 or a sbc hl,de jr c,ncr_oob jr z,ncr_exit sbc hl,de add hl,de jr c,$+3  ex de,hl  ld a,h  or l  push hl  pop ixncr_exit:  ld h,b  ld l,c  scf  ret zncr_loop:  push bc \ push de  push hl \ push bc  ld b,h  ld c,l  call mul16          ;BC*DE ==> DEHL  pop bc  call DEHL_Div_BC    ;result in DEHL  ld a,d  or e  pop bc  pop de  jr nz,ncr_overflow  add hl,bc  jr c,ncr_overflow  pop bc  inc bc  ld a,b  cp ixh  jr c,ncr_loop  ld a,ixl  cp c  jr nc,ncr_loop  retncr_overflow:  pop bc  xor a  ld b,ancr_oob:  ld h,b  ld l,b  ret
EDIT: Optimized itoa_8 above. Here are some more routines:
uitoa_8
Converts an 8-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts an 8-bit unsigned integer to a stringuitoa_8:;Input:;   A is a signed integer;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes);Output:;   The number is converted to a null-terminated string at HL;Destroys:;   Up to four bytes at HL;   All registers preserved.;on 0 to 9:     238              D=0;on 10 to 99:   244+20D          D=0 to 9;on 100 to 255: 257+2{0,6}+20D   D=0 to 5;min: 238cc;max: 424cc;avg: 317.453125cc = 81268/256 = (238*10 + 334*90+313*156)/256;52 bytes  push hl  push de  push bc  push af;A is on [0,255];calculate 100s place, plus 1 for a future calculation  ld b,'0'  cp 100 \ jr c,$+5 \ sub 100 \ inc b cp 100 \ jr c,$+5 \ sub 100 \ inc b;calculate 10s place digit, +1 for future calculation  ld de,$0A2F inc e \ sub d \ jr nc,$-2  ld c,a;Digits are now in D, C, A; strip leading zeros!  ld a,'0'  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db$FE  ; start of cp * to skip the next byte, turns into cp $BB which will always return nz and nc cp e \ jr z,$+4 \ ld (hl),e \ inc hl  add a,c  add a,d  ld (hl),a  inc hl  ld (hl),0  pop af  pop bc  pop de  pop hl  ret
itoa_16
Converts a 16-bit signed integer to an ASCII string.
Code: [Select]
;Converts a 16-bit signed integer to an ASCII string.itoa_16:;Input:;   DE is the number to convert;   HL points to where to write the ASCII string (up to 7 bytes needed).;Output:;   HL points to the null-terminated ASCII string;      NOTE: This isn't necessarily the same as the input HL.  push de  push bc  push af  push hl  bit 7,d  jr z,+_  xor a  sub e  ld e,a  sbc a,a  sub d  ld d,a  ld (hl),$1A ;negative char on TI-OS inc hl_: ex de,hl ld bc,-10000 ld a,'0'-1 inc a \ add hl,bc \ jr c,$-2  ld (de),a  inc de  ld bc,1000  ld a,'9'+1  dec a \ add hl,bc \ jr nc,$-2 ld (de),a inc de ld bc,-100 ld a,'0'-1 inc a \ add hl,bc \ jr c,$-2  ld (de),a  inc de  ld a,l  ld h,'9'+1  dec h \ add a,10 \ jr nc,$-3 add a,'0' ex de,hl ld (hl),d inc hl ld (hl),a inc hl ld (hl),0;No strip the leading zeros pop hl;If the first char is a negative sign, skip it ld a,(hl) cp$1A  push af  ld a,'0'  jr nz,$+3 inc hl cp (hl) jr z,$-2;Check if we need to re-write the negative sign  pop af  jr nz,+_  dec hl  ld (hl),a_:  pop af  pop bc  pop de  ret
uitoa_16
Converts a 16-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts a 16-bit unsigned integer to an ASCII string.uitoa_16:;Input:;   DE is the number to convert;   HL points to where to write the ASCII string (up to 6 bytes needed).;Output:;   HL points to the null-terminated ASCII string;      NOTE: This isn't necessarily the same as the input HL.  push de  push bc  push af  ex de,hl  ld bc,-10000  ld a,'0'-1  inc a \ add hl,bc \ jr c,$-2 ld (de),a inc de ld bc,1000 ld a,'9'+1 dec a \ add hl,bc \ jr nc,$-2  ld (de),a  inc de  ld bc,-100  ld a,'0'-1  inc a \ add hl,bc \ jr c,$-2 ld (de),a inc de ld a,l ld h,'9'+1 dec h \ add a,10 \ jr nc,$-3  add a,'0'  ex de,hl  ld (hl),d  inc hl  ld (hl),a  inc hl  ld (hl),0;No strip the leading zeros  ld c,-6  add hl,bc  ld a,'0'  inc hl \ cp (hl) \ jr z,$-2 pop af pop bc pop de ret « Last Edit: August 17, 2019, 10:35:30 am by Xeda112358 » #### Xeda112358 • they/them • Moderator • LV12 Extreme Poster (Next: 5000) • Posts: 4676 • Rating: +718/-6 • Calc-u-lator, do doo doo do do do. ##### Re: ASM Optimized routines « Reply #97 on: August 28, 2019, 09:33:29 am » Here is a masked sprite routine (no clipping)! Interleave the data with the mask, where the MASK is ANDed with the buffer, and the data is ORed on top of that: Code: [Select] ;Masked Sprite routineputsprite_masked:;Inputs:; (A,L) = (x,y); B is height; IX points to the sprite data; first byte is the data; second byte is mask; continues, alternating like this.;;Outputs:; Mask is ANDed to the buffer, then data is ORed on top of that.;;Destroys:; AF, BC, DE, HL, IX;;Notes:; To set a pixel...; black: mask is any, data is 1; white: mask is 0, data is 0; clear: mask is 1, data is 0 (keeps the data from the buffer);;This routine is free to use :);65 bytes (or 66 bytes if gbuf is not located at 0x**40 ld e,l ld h,0 ld d,h add hl,hl add hl,de add hl,hl add hl,hl ld e,a and 7 ld c,a xor e ;essentially gets E with the bottom 3 bits reset#if (plotSScreen&255) = 64 inc a rra rra rra ld e,a ld d,plotSScreen>>8#else rra rra rra ld e,a add hl,de ld de,plotSScreen#endif add hl,deputsprite_masked_loop: push bc xor a ld d,(ix) ld e,a sub c ld b,c ld c,$FF  inc ix  ld a,(ix)  jr z,putsprite_masked_rotdoneputsprite_masked_rot:  scf  rra  rr c  srl d  rr e  djnz putsprite_masked_rotputsprite_masked_rotdone:  and (hl)  or d  ld (hl),a  inc hl  ld a,(hl)  and c  or e  ld (hl),a  ld c,11  add hl,bc  inc ix  pop bc  djnz putsprite_masked_loop  retBut if you want even faster and smaller, use a non-traditional mask technique by ORing the mask onto the buffer, then XORing the data on top of it. The format is less intuitive, but it allows for white/black/clear/invert instead of just white/black/clear:
Code: [Select]
;Masked Sprite routineputsprite_masked:;Inputs:;   (A,L) = (x,y);   B is height;   IX points to the sprite data;       first byte is the data;       second byte is mask;       continues, alternating like this.;;Outputs:;   Mask is ORed to the buffer, then data is XORed on top of that.;;Destroys:;   AF, BC, DE, HL, IX;;Notes:;   To set a pixel...;     black: mask is 1, data is 0;     white: mask is 1, data is 1;     clear: mask is 0, data is 0 (keeps the data from the buffer);     invert: mask is 0, data is 1 (inverts the data from the buffer);;This routine is free to use :);63 bytes (or 64 bytes if gbuf is not located at 0x**40  ld e,l  ld h,0  ld d,h  add hl,hl  add hl,de  add hl,hl  add hl,hl  ld e,a  and 7  ld c,a  xor e  ;essentially gets E with the bottom 3 bits reset#if (plotSScreen&255) = 64  inc a  rra  rra  rra  ld e,a  ld d,plotSScreen>>8#else  rra  rra  rra  ld e,a  add hl,de  ld de,plotSScreen#endif  add hl,deputsprite_masked_loop:  push bc  xor a  ld d,(ix)  ld e,a  or c  ld b,c  ld c,e  inc ix  ld a,(ix)  jr z,putsprite_masked_rotdoneputsprite_masked_rot:  rra  rr c  srl d  rr e  djnz putsprite_masked_rotputsprite_masked_rotdone:  or (hl)  xor d  ld (hl),a  inc hl  ld a,(hl)  or c  xor e  ld (hl),a  ld c,11  add hl,bc  inc ix  pop bc  djnz putsprite_masked_loop  ret

I also made some "bigsprite" routines! These do have clipping, too. First, they use some common subroutines for computing masks and performing most of the clipping and shifting:
Code: [Select]

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #102 on: July 23, 2020, 10:53:14 pm »
rand8 (very fast)
I have a beautiful RNG. It is 47cc and 10 bytes.

Code: [Select]
;HL must be non-zero  add hl,hl  sbc a,a  and %00101101  xor l  ld l,a  ld a,r  add a,a   ;Because R is a 7-bit register  add a,h;HL is the new seed;A is the output, or AL for 16 bits.
Spoiler For Explanation:
Upon first inspection, that use of the R register might look like it is being used as a source of randomness, but it is in fact being used as a cheap counter. This leaves us in a very interesting position: If the R register happens to be random or chaotic in your use-case, then that is good! But if R just acts as a counter, then that is also good, because that is what the PRNG needs.

The basic trick is to combine a counter or LCG with an LFSR to smooth out the distribution of the LSFR. Since LSFRs are very fast, and R is a built-in counter on the Z80, we can combine them to make a very, very fast random number generator that is decent quality.

Now, there are some caveats. Due to the nature of the R register, this does not have a fixed cycle length. On average, it has a cycle length of 5592320 (meaning you could expect it to loop after every 5592320 numbers generated). The cycle length could be as low as 65535 (it is essentially just the LFSR), or as high as 8388480. If you wait for external input between calls, then you basically have a true RNG.
Spoiler For Images:
Here is a gif demonstration:

Another good use-case is fire animation! (sorry for the poor quality gif, it's really a lot smoother IRL)

eZ80 8-bit RNG
If HL is 24-bit, as in ADL mode on the eZ80, then your LFSR needs different taps (but in this case, it doesn't improve cycle length because we are still using H instead of UHL):
Code: [Select]
;HL must be non-zero  add hl,hl  sbc a,a  and %10000111  xor l  ld l,a  ld a,r  add a,a   ;Because R is a 7-bit register  add a,h;HL is the new seed;A is the output, or AL for 16 bits.
I have no idea what the speed is on the eZ80. 10 fetches, maybe?
Spoiler For Comparison Images:
this comparison might be a little easier to interpret:
(LFSR only)

(combined with R as a counter)

8-bit RNG, 39cc
Apparently even an 8-bit LFSR combined with the R register does a great job at generating noise, so here is an even faster version (faster for the Z80, though this code also works on the eZ80):
Code: [Select]
;E must be non-zero;10 bytes, 39cc;min period: 255;max period: 32640;avg period: 21760  sla e  sbc a,a  and %01011111  xor e  ld e,a  ld a,r  add a,e;E is the new seed, non-zero so long as the input was non-zero;A is the output
Spoiler For Image:

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4676
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: ASM Optimized routines
« Reply #103 on: June 18, 2021, 05:19:34 pm »
I wrote a program to generate 16-bit multiplication-by-constant routines. It isn't perfect, but it probably produced optimal code for a lot of them!

The attached tar.xz file has the (undocumented) program and a 28.8MB file of routines

For example, it has these two routines for mul_hl_1337:
Code: [Select]
mul_hl_1337:  ; size optimized;17 bytes, 173cc  ld b,h  ld c,l  add hl,hl  add hl,hl  add hl,bc  add hl,hl  add hl,hl  add hl,hl  add hl,bc  add hl,hl  add hl,bc  add hl,hl  add hl,bc  add hl,hl  add hl,hl  add hl,hl  add hl,bc
Code: [Select]
mul_hl_1337:  ; speed optimized;25 bytes, 148cc  ld b,h  ld c,l  add hl,hl  add hl,bc  add hl,hl  add hl,bc  ld b,h  ld c,l  add hl,hl  add hl,bc  ld a,h  ld h,l  rra  rr h  rra  rr h  rra  and 11000000b  ld l,a  or a  sbc hl,bc
« Last Edit: June 18, 2021, 05:35:44 pm by Xeda112358 »

#### E37

• LV6 Super Member (Next: 500)
• Posts: 313
• Rating: +21/-0
• Trial and error is the best teacher
##### Re: ASM Optimized routines
« Reply #104 on: June 18, 2021, 06:15:20 pm »
Wow, that's really neat. I don't have a use for that at the moment but I bet I can find one in some old, slow code.

How did you go about making the routine generator? I assume there are some simple rules it follows.
I'm still around... kind of.