ASM Optimized routines

« previous next »

Print

Pages: 1 ... 5 6 [7] 8 Go Down

Author Topic: ASM Optimized routines (Read 102272 times)

0 Members and 1 Guest are viewing this topic.

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #90 on: October 16, 2017, 11:52:22 pm »

I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.

; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
    bit 7,h
    push    af
    jr      z,divHLbyCsStart
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
divHLbyCsStart:
    xor     a
    ld      b,16
divHLbyCsLoop:
    add     hl,hl
    rla
    cp      c
    jp      c,divHLbyCsNext
    sub     c
    inc     l
divHLbyCsNext:
    djnz    divHLbyCLoop
    ld      b,a
    pop     af
    ld      a,b
    ret     z
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld a,c
    sub b
    ret

I changed the output remainder so that it returns the remainder modulo c. So -7/3 returns 2 instead of 1. This means that no bytes nor clock cycles were saved after all

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

JamesV

LV5 Advanced (Next: 300)
Posts: 276
Rating: +77/-0

Re: ASM Optimized routines

« Reply #91 on: October 17, 2017, 04:49:18 pm »

Thanks for the optimisations/fixes! I guess I was lucky that I don't actually use the remainder in my program, and also C will only ever be between 1-64 inclusive. I like the negate HL optimisation too, that's neat

Logged

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #92 on: July 27, 2018, 07:40:53 pm »

I have two neat routines, pushpop and diRestore. If you want to preserve HL,DE,BC, and AF, then you can just put a call to pushpop at the start of your routine. If you want interrupts to be restored on exiting your routine, you can call diRestore at the top of your code.

They both mess with the stack to insert a return routine on the stack. For example, when your code calls diRestore, then an ret will first return to the code to restore the interrupt status, and then back to the calling routine.

pushpop:
;26 bytes, adds 229cc to the calling routine
  ex (sp),hl
  push de
  push bc
  push af
  push hl
  ld hl,pushpopret
  ex (sp),hl
  push hl
  push af
  ld hl,12
  add hl,sp
  ld a,(hl)
  inc hl
  ld h,(hl)
  ld l,a
  pop af
  ret
pushpopret:
  pop af
  pop bc
  pop de
  pop hl
  ret

diRestore:
    ex (sp),hl
    push hl
    push af
    ld hl,restoreei
    ld a,r
    jp pe,+_
    dec hl
    dec hl
_:
    di
    inc sp
    inc sp
    inc sp
    inc sp
    ex (sp),hl
    dec sp
    dec sp
    dec sp
    dec sp
    pop af
    ret
restoredi:
    di
    ret
restoreei:
    ei
    ret

Edit: calc84maniac brought up that those 'inc sp' instructions might cause some issues if interrupts fire in between them. The code is modified to disable interrupts first, then mess with the stack.

« Last Edit: July 28, 2018, 09:34:41 am by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

calc84maniac

eZ80 Guru
Coder Of Tomorrow
LV11 Super Veteran (Next: 3000)
Posts: 2912
Rating: +471/-17

Re: ASM Optimized routines

« Reply #93 on: August 04, 2018, 06:55:30 pm »

32-Bit Endian Swap (eZ80)

Swaps the byte order of a 32-bit value stored in EUHL. Can be adapted to work with other register combinations as well.

endianswap32:
    push hl
    ld h,e
    ld e,l
    inc sp
    push hl
    inc sp
    pop hl
    inc sp
    ret

Logged

"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #94 on: October 16, 2018, 06:33:14 pm »

Here is a decent arctangent routine that works on [0,1), so you'll have to do the work to extend it outside that range. It uses a small lookup table and linear interpolation.

Spoiler For Range Reduction:

atan8:
;computes 256*atan(A/256)->A
;56 bytes including the LUT
;min: 246cc
;max: 271cc
;avg: 258.5cc
  rlca
  rlca
  rlca
  ld d,a
  and 7
  ld hl,atan8LUT
  add a,l
  ld l,a
#if (atan8LUT&255)>248    ;this section not included in size/speed totals
  jr nc,$+3               ;can add three bytes, 12cc to max, 11cc to min, and 11.5cc to avg
  inc h
#endif
  ld c,(hl)
  inc hl
  ld a,(hl)
  sub c
  ld e,0
  ex de,hl
  ld d,l
  ld e,a
  sla h \ jr nc,$+3 \ ld l,e
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl
  add hl,hl
  add hl,hl
;  add hl,hl    ;used in rounding...
  ld a,h
;  rra          ;but doesn't seem to improve the error
  adc a,c
  ret
atan8LUT:
  .db 0,32,63,92,119,143,165,184,201

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #95 on: March 24, 2019, 11:58:29 am »

32-bit square root:

sqrtHLIX:
;Input: HLIX
;Output: DE is the sqrt, AHL is the remainder
;speed: 751+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL
;min: 1103
;max: 1237
;avg: 1165.5
;166 bytes

  call sqrtHL   ;expects returns A as sqrt, HL as remainder, D = 0
  add a,a
  ld e,a
  rl d

  ld a,ixh
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

;Now we have four more iterations
;The first two are no problem
  ld a,ixl
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

sqrt32_iter15:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrt32_iter16
sqrt32_iter15_br0:
  or a
  sbc hl,de
_:
  inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  ld b,a        ;either 0x00 or 0x80
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

This uses this sqrtHL routine:

;written by Zeda
sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 352cc
;max: 391cc
;avg: 371.5cc


  ld de,05040h  ; 10
  ld a,h        ; 4
  sub e         ; 4
  jr nc,sq7     ;\
  add a,e       ; | branch 1: 12cc
  ld d,16       ; | branch 2: 18cc
sq7:            ;/

; ----------

  cp d          ; 4
  jr c,sq6      ;\
  sub d         ; | branch 1: 12cc
  set 5,d       ; | branch 2: 19cc
sq6:            ;/

; ----------
  res 4,d       ; 8
  srl d         ; 8
  set 2,d       ; 8
  cp d          ; 4
  jr c,sq5      ;\
  sub d         ; | branch 1: 12cc
  set 3,d       ; | branch 2: 19cc
sq5:            ;/
  srl d         ; 8

; ----------

  inc a         ; 4
  sub d         ; 4
  jr nc,sq4     ;\
  dec d         ; | branch 1: 12cc
  add a,d       ; | branch 2: 19cc
  dec d         ; | <-- this resets the low bit of D, so `srl d` resets carry.
sq4:            ;/
  srl d         ; 8
  ld h,a        ; 4

; ----------

  ld a,e        ; 4
  sbc hl,de     ; 15
  jr nc,sq3     ;\
  add hl,de     ; | 12cc or 18cc
sq3:            ;/
  ccf           ; 4
  rra           ; 4
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq2      ;\
  or 20h        ; | branch 1: 23cc
  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.
sq2:            ; | branch 2: 21cc
  add hl,de     ;/

  xor 18h       ; 7
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq1      ;\
  or 8          ; | branch 1: 23cc
  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.
sq1:            ; | branch 2: 21cc
  add hl,de     ;/

  xor 6         ; 7
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr nc,+_      ;    \
  add hl,de     ; 15  |
  srl d         ; 8   |
  rra           ; 4   | branch 1: 38cc
  ret           ; 10  | branch 2: 40cc
_:              ;     |
  inc a         ; 4   |
  srl d         ; 8   |
  rra           ; 4   |
  ret           ; 10 /

sqrtHL was from my work on the float routines, sqrtHLIX was inspired by this thread

EDIT: Optimized some more

« Last Edit: September 09, 2019, 04:33:21 pm by Xeda112358 »

Logged

+1/-0 karm for this message

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #96 on: August 16, 2019, 11:41:13 pm »

Here are some routines that I've added to the repository:

itoa_8
Converts an 8-bit signed integer to an ASCII string.

;Converts an 8-bit signed integer to a string

itoa_8:
;Input:
;   A is a signed integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to five bytes at HL
;   All registers preserved.
;on 0 to 9:       252       D=0
;on 10 to 99:     258+20D   D=0 to 9
;on 100 to 127:   277+20D   D=0 to 2
;on -1 to -9:     276       D=0
;on -10 to -99:   282+20D   D=0 to 9
;on -100 to -128: 301+20D   D=0 to 2

;min: 252cc  (+23cc over original)
;max: 462cc  (-49cc over original)
;avg: 343.74609375cc = 87999/256
;54 bytes
  push hl
  push de
  push bc
  push af
  or a
  jp p,itoa_pos
  neg
  ld (hl),$1A  ;start if neg char on TI-OS
  inc hl
itoa_pos:
;A is on [0,128]
;calculate 100s place, plus 1 for a future calculation
  ld b,'0'
  cp 100 \ jr c,$+5 \ sub 100 \ inc b

;calculate 10s place digit, +1 for future calculation
  ld de,$0A2F
  inc e \ sub d \ jr nc,$-2
  ld c,a

;Digits are now in D, C, A
; strip leading zeros!
  ld a,'0'
  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of `cp *` to skip the next byte, turns into `cp $BB` which will always return nz and nc
  cp e \ jr z,$+4 \ ld (hl),e \ inc hl
  add a,c
  add a,d
  ld (hl),a
  inc hl
  ld (hl),0

  pop af
  pop bc
  pop de
  pop hl
  ret

fixed88_to_string
Uses the itoa_8 routine to convert an 8.8 fixed-point number to a string.

;This converts a fixed-point number to a string.
;It displays up to 3 digits after the decimal.

fixed88_to_str:
;Inputs:
;   D.E is the fixed-point number
;   HL points to where the string gets output.
;      Needs at most 9 bytes.
;Outputs:
;   HL is preserved
;Destroys:
;   AF,DE,BC

;First check if the input is negative.
;If so, write a negative sign and negate
  push hl
  ld a,d
  or a
  jp p,+_
  ld (hl),$1A      ;negative sign on TI-OS
  inc hl
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
_:

;Our adjusted number is in A.E
;Now we can print the integer part
  call itoa_8

;Check if we need to print the fractional part
  xor a
  cp e
  jr z,fixed88_to_str_end

;We need to write the fractional part, so seek the end of the string
;Search for the null byte. A is already 0
  cpir

;Write a decimal
  dec hl
  ld (hl),'.'

  ld b,3
_:
;Multiply E by 10, converting overflow to an ASCII digit
  call fixed88_to_str_e_times_10
  inc hl
  ld (hl),a
  djnz -_

;Strip the ending zeros
  ld a,'0'
_:
  cp (hl)
  dec hl
  jr z,-_

;write a null byte
  inc hl
  inc hl
  ld (hl),0

fixed88_to_str_end:
;restore HL
  pop hl
  ret

fixed88_to_str_e_times_10:
  ld a,e
  ld d,0
  add a,a \ rl d
  add a,a \ rl d
  add a,e \ jr nc,$+3 \ inc d
  add a,a
  ld e,a
  ld a,d
  rla
  add a,'0'
  ret

sqrtA
This is a very fast, unrolled routine to compute the square root of A.

sqrtA:
;Input: A
;Output: D is the square root, A is the remainder (input-D^2)
;Destroys: BC
;speed: 161+{0,6}+{0,1}+{0,1}+{0,3}
;min: 161cc
;max: 172cc
;avg: 166.5cc
;45 bytes
  ld d,$40

  sub d
  jr nc,+_
  add a,d
  ld d,0
_:

  set 4,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 5,d
  res 4,d
  srl d

  set 2,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 3,d
  res 2,d
  srl d

  inc d
  sub d
  jr nc,+_
  add a,d
  dec d
_:
  inc d
  srl d
  ret

sqrtfixed_88
An unrolled, fast 8.8 fixed-point square root routine. Uses the above sqrtA routine.

sqrtfixed_88:
;Input: A.E ==> D.E
;Output: DE is the sqrt, AHL is the remainder
;Speed: 690+6{0,13}+{0,3+{0,18}}+{0,38}+sqrtA
;min: 855cc
;max: 1003cc
;avg: 924.5cc
;152 bytes

  call sqrtA
  ld l,a
  ld a,e
  ld h,0
  ld e,d
  ld d,h

  sla e
  rl d

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

;Now we have four more iterations
;The first two are no problem
  sll e \ rl d
  add hl,hl
  add hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add hl,hl
  add hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

sqrtfixed_88_iter11:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add hl,hl
  add hl,hl
  jr c,sqrtfixed_88_iter11_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrtfixed_88_iter12
sqrtfixed_88_iter11_br0:
  or a
  sbc hl,de
_:
  inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrtfixed_88_iter12:
  ld b,a      ;A is 0, so B is 0
  add hl,hl
  add hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

ncr_HL_DE
Computes 'HL choose DE' in such a way so that overflow only occurs if the final result overflows 16 bits.

; Requires
;    mul16          ;BC*DE ==> DEHL
;    DEHL_Div_BC    ;DEHL/BC ==> DEHL

ncr_HL_DE:
;"n choose r", defined as n!/(r!(n-r)!)
;Computes "HL choose DE"
;Inputs: HL,DE
;Outputs:
;   HL is the result
;       "HL choose DE"
;   carry flag reset means overflow
;Destroys:
;   A,BC,DE,IX
;Notes:
;   Overflow is returned as 0
;   Overflow happens if HL choose DE exceeds 65535
;   This algorithm is constructed in such a way that intermediate
;   operations won't erroneously trigger overflow.
;66 bytes
  ld bc,1
  or a
  sbc hl,de
  jr c,ncr_oob
  jr z,ncr_exit
  sbc hl,de
  add hl,de
  jr c,$+3
  ex de,hl
  ld a,h
  or l
  push hl
  pop ix
ncr_exit:
  ld h,b
  ld l,c
  scf
  ret z
ncr_loop:
  push bc \ push de
  push hl \ push bc
  ld b,h
  ld c,l
  call mul16          ;BC*DE ==> DEHL
  pop bc
  call DEHL_Div_BC    ;result in DEHL
  ld a,d
  or e
  pop bc
  pop de
  jr nz,ncr_overflow
  add hl,bc
  jr c,ncr_overflow
  pop bc
  inc bc
  ld a,b
  cp ixh
  jr c,ncr_loop
  ld a,ixl
  cp c
  jr nc,ncr_loop
  ret
ncr_overflow:
  pop bc
  xor a
  ld b,a
ncr_oob:
  ld h,b
  ld l,b
  ret

EDIT: Optimized itoa_8 above. Here are some more routines:
uitoa_8
Converts an 8-bit unsigned integer to an ASCII string.

;Converts an 8-bit unsigned integer to a string

uitoa_8:
;Input:
;   A is a signed integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to four bytes at HL
;   All registers preserved.
;on 0 to 9:     238              D=0
;on 10 to 99:   244+20D          D=0 to 9
;on 100 to 255: 257+2{0,6}+20D   D=0 to 5
;min: 238cc
;max: 424cc
;avg: 317.453125cc = 81268/256 = (238*10 + 334*90+313*156)/256
;52 bytes

  push hl
  push de
  push bc
  push af
;A is on [0,255]
;calculate 100s place, plus 1 for a future calculation
  ld b,'0'
  cp 100 \ jr c,$+5 \ sub 100 \ inc b
  cp 100 \ jr c,$+5 \ sub 100 \ inc b

;calculate 10s place digit, +1 for future calculation
  ld de,$0A2F
  inc e \ sub d \ jr nc,$-2
  ld c,a

;Digits are now in D, C, A
; strip leading zeros!
  ld a,'0'
  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of `cp *` to skip the next byte, turns into `cp $BB` which will always return nz and nc
  cp e \ jr z,$+4 \ ld (hl),e \ inc hl
  add a,c
  add a,d
  ld (hl),a
  inc hl
  ld (hl),0

  pop af
  pop bc
  pop de
  pop hl
  ret

itoa_16
Converts a 16-bit signed integer to an ASCII string.

;Converts a 16-bit signed integer to an ASCII string.

itoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 7 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
  push de
  push bc
  push af
  push hl
  bit 7,d
  jr z,+_
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a
  ld (hl),$1A     ;negative char on TI-OS
  inc hl
_:
  ex de,hl

  ld bc,-10000
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld bc,1000
  ld a,'9'+1
  dec a \ add hl,bc \ jr nc,$-2
  ld (de),a
  inc de

  ld bc,-100
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld a,l
  ld h,'9'+1
  dec h \ add a,10 \ jr nc,$-3
  add a,'0'
  ex de,hl
  ld (hl),d
  inc hl
  ld (hl),a
  inc hl
  ld (hl),0

;No strip the leading zeros
  pop hl

;If the first char is a negative sign, skip it
  ld a,(hl)
  cp $1A
  push af
  ld a,'0'
  jr nz,$+3
  inc hl
  cp (hl)
  jr z,$-2

;Check if we need to re-write the negative sign
  pop af
  jr nz,+_
  dec hl
  ld (hl),a
_:

  pop af
  pop bc
  pop de
  ret

uitoa_16
Converts a 16-bit unsigned integer to an ASCII string.

;Converts a 16-bit unsigned integer to an ASCII string.

uitoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 6 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
  push de
  push bc
  push af
  ex de,hl

  ld bc,-10000
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld bc,1000
  ld a,'9'+1
  dec a \ add hl,bc \ jr nc,$-2
  ld (de),a
  inc de

  ld bc,-100
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld a,l
  ld h,'9'+1
  dec h \ add a,10 \ jr nc,$-3
  add a,'0'
  ex de,hl
  ld (hl),d
  inc hl
  ld (hl),a
  inc hl
  ld (hl),0

;No strip the leading zeros
  ld c,-6
  add hl,bc
  ld a,'0'
  inc hl \ cp (hl) \ jr z,$-2
  pop af
  pop bc
  pop de
  ret

« Last Edit: August 17, 2019, 10:35:30 am by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #97 on: August 28, 2019, 09:33:29 am »

Here is a masked sprite routine (no clipping)! Interleave the data with the mask, where the MASK is ANDed with the buffer, and the data is ORed on top of that:

;Masked Sprite routine
putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ANDed to the buffer, then data is ORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is any, data is 1
;     white: mask is 0, data is 0
;     clear: mask is 1, data is 0 (keeps the data from the buffer)
;
;This routine is free to use :)
;65 bytes (or 66 bytes if gbuf is not located at 0x**40

  ld e,l
  ld h,0
  ld d,h
  add hl,hl
  add hl,de
  add hl,hl
  add hl,hl
  ld e,a
  and 7
  ld c,a
  xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
  inc a
  rra
  rra
  rra
  ld e,a
  ld d,plotSScreen>>8
#else
  rra
  rra
  rra
  ld e,a
  add hl,de
  ld de,plotSScreen
#endif
  add hl,de


putsprite_masked_loop:
  push bc
  xor a
  ld d,(ix)
  ld e,a
  sub c
  ld b,c
  ld c,$FF
  inc ix
  ld a,(ix)
  jr z,putsprite_masked_rotdone
putsprite_masked_rot:
  scf
  rra
  rr c
  srl d
  rr e
  djnz putsprite_masked_rot
putsprite_masked_rotdone:
  and (hl)
  or d
  ld (hl),a
  inc hl
  ld a,(hl)
  and c
  or e
  ld (hl),a
  ld c,11
  add hl,bc
  inc ix
  pop bc
  djnz putsprite_masked_loop
  ret

But if you want even faster and smaller, use a non-traditional mask technique by ORing the mask onto the buffer, then XORing the data on top of it. The format is less intuitive, but it allows for white/black/clear/invert instead of just white/black/clear:

;Masked Sprite routine
putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ORed to the buffer, then data is XORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is 1, data is 0
;     white: mask is 1, data is 1
;     clear: mask is 0, data is 0 (keeps the data from the buffer)
;     invert: mask is 0, data is 1 (inverts the data from the buffer)
;
;This routine is free to use :)
;63 bytes (or 64 bytes if gbuf is not located at 0x**40

  ld e,l
  ld h,0
  ld d,h
  add hl,hl
  add hl,de
  add hl,hl
  add hl,hl
  ld e,a
  and 7
  ld c,a
  xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
  inc a
  rra
  rra
  rra
  ld e,a
  ld d,plotSScreen>>8
#else
  rra
  rra
  rra
  ld e,a
  add hl,de
  ld de,plotSScreen
#endif
  add hl,de

putsprite_masked_loop:
  push bc
  xor a
  ld d,(ix)
  ld e,a
  or c
  ld b,c
  ld c,e
  inc ix
  ld a,(ix)
  jr z,putsprite_masked_rotdone
putsprite_masked_rot:
  rra
  rr c
  srl d
  rr e
  djnz putsprite_masked_rot
putsprite_masked_rotdone:
  or (hl)
  xor d
  ld (hl),a
  inc hl
  ld a,(hl)
  or c
  xor e
  ld (hl),a
  ld c,11
  add hl,bc
  inc ix
  pop bc
  djnz putsprite_masked_loop
  ret

I also made some "bigsprite" routines! These do have clipping, too. First, they use some common subroutines for computing masks and performing most of the clipping and shifting:

;133 bytes total

;This is made by Zeda, feel free to use it for whatever.
;Takes inputs for a big sprite and sets up masks and clipping
;requires 4 bytes of temporary RAM, but doesn't use SMC


spritetmp = 8000h     ;relocate this as needed! Just need 4 bytes.
sprite_width  = spritetmp+0
sprite_x      = spritetmp+1
sprite_mask0  = spritetmp+2
sprite_mask1  = spritetmp+3

bigsprite_subroutine:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;Outputs:
;   carry flag is set if okay to draw, nc if out-of-bounds.
;   B is height.
;   C is width.
;   HL points to the byte to start drawing at.
;   DE points to where to start sourcing the sprite data
;   (sprite_width) is the width of the sprite in bytes
;   (sprite_x) is the intitial x coordinate to begin drawing at
;   (sprite_mask0) is the left mask
;   (sprite_mask1) is the right mask
;92 bytes

;First check if the sprite is on-screen in the horizontal direction
  ld a,c
  cp 64
  jr c,+_
  add a,h
  ret nc
  ld h,a
  push hl
  xor a
  ld h,a
  sub c
  ex de,hl
  add hl,de
  dec a
  jr nz,$-2
  ex de,hl
  pop hl
  xor a
  ld c,a
_:
;Next check h+c<=64
  ld a,64
  sub c
  cp h
  jr nc,+_
  ld h,a
_:

;Make sure the height is not now 0
  ld a,h
  or a
  ret z

;Save the width and height of the sprite
  push hl               ;height,width
  ld h,b
  ld (sprite_width),hl  ;x,width
  push de               ;sprite pointer

;Set up a pointer to the routine for shifting the  routine for shifting the sprite data
  ld ixh,rshiftHA_7>>8
  ld a,h
  cpl
  and 7
  ld l,a
  add a,a
  add a,l
  add a,rshiftHA_7&255
  ld ixl,a
#if (rshiftHA_7&255)>234
  jr nc,$+4
  inc ixh
#endif

  ld a,b
  and 7
  ld de,spritemask
  add a,e
  ld e,a
#if spritemask&255>248
  jr nc,$+3
  inc d
#endif
  ld a,(de)
  ld (sprite_mask0),a
  cpl
  ld (sprite_mask1),a
;
;
  ld a,c
  add a,a
  sbc a,a
  ld h,a
  ld a,b
  ld b,h
  ld l,c
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  ld c,a
  add a,a
  sbc a,a
  ld b,a
  ld a,c
  sra c
  sra c
  sra c
  add hl,bc
  ld bc,plotSScreen
  add hl,bc

  pop de
  pop bc
  ;B is height
  ;C is width
  ex de,hl
  scf
  ret

rshiftHA_7:
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  ex de,hl
  ld e,a
  ret

spritemask:
  .db $00,$80,$C0,$E0,$F0,$F8,$FC,$FE
call_ix:
  jp (ix)

Then you can draw a big sprite with OR logic:

bigsprite_OR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_OR_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_OR:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  or d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  or e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_OR


  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_OR_loop
  ret

Or draw with XOR logic:

bigsprite_XOR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_XOR_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_XOR:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  xor d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  xor e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_XOR


  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_XOR_loop
  ret

Or draw with AND logic:

bigsprite_AND:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;69 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_AND_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_AND:
  push bc
  push hl
  ld h,(hl)
  scf \ sbc a,a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  and d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  and e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_AND


  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_AND_loop
  ret

Or draw with Erase logic:

bigsprite_Erase:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;67 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_Erase_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_Erase:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,d
  cpl
  and (hl)
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,e
  cpl
  and (hl)
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_Erase

  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_Erase_loop
  ret

Or draw with Overwrite logic:

bigsprite_Overwrite:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;71 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_Overwrite_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_Overwrite:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(sprite_mask0)
  and (hl)
  or d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  and (hl)
  or e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_Overwrite

  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_Overwrite_loop
  ret

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

SpiroH

LV8 Addict (Next: 1000)
Posts: 729
Rating: +153/-23

Re: ASM Optimized routines

« Reply #98 on: August 28, 2019, 10:07:05 am »

@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language

.

Logged

Windows: kArmTI - TI-Nspire emulator with skin | XML2TXT, XML2LUA, TXT2XML, LUA2XML and 2TNS
Mac: nSpiKx - TI-Nspire emulator for Mac
TI-Nspire: nTris - Update for CX | nTris - SDL implementation | nMah - Mahjong clone

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #99 on: August 28, 2019, 10:11:55 am »

Quote from: SpiroH on August 28, 2019, 10:07:05 am

@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language .

Thanks

I wrote these with apps in mind, so I tried to reduce the need for external RAM. I should definitely make versions that take full advantage of SMC, though!

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #100 on: September 10, 2019, 06:18:29 pm »

Here is a circle routine that I made!

;Written by Zeda Thomas, free to use.

;This draws a circle centered at 8-bit coordinates and with radius up to 127.
;IX points to a `plot` routine that takes (B,C)=(x,y) as input and does something
;with it, like plot the pixel a certain color, or plot a "big" pixel, or whatever.
;   plot
;     Takes coordinates, (B,C) = (x,y) and plots the point.
;
;For example, on the TI-83+/84+/SE the plot routine might look like:
; plot:
;   call getpixelloc
;   ret nc            ;Exit if the coordinates are out-of-bounds
;   or (hl)
;   ld (hl),a
;   ret
;
;
; Required subroutines:
;     call_ix:
;       jp (ix)

circle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a `plot` routine that takes (B,C)=(x,y) as input.
  ld a,d
  add a,a
  ret c
  ret z
  ld l,d
  dec a
  ld e,a
  dec a        ;if the pixel is only 1 wide, just plot the point
  jp z,call_ix ;Jump to the plot routine
  xor a
  ld h,-1
  ld d,1
  scf     ;skip the first plot
circleloop:
  call nc,plot4
  inc h
  sub d
  inc d
  inc d
  jr nc,circleloop
_:
  dec l
  call plot4
  add a,e
  dec e
  ret z
  dec e
  jr nc,-_
  jp circleloop


plot4:
;BC is center
;HL is x,y

  push de
  push af
  push hl
  push bc

;If H is 0, or L is 0, we need to draw only half
  push hl

  ld a,b
  sub h
  ld b,a
  add a,h
  add a,h
  ld h,a

  ld a,c
  sub l
  ld c,a
  add a,l
  add a,l
  ld l,a

;B is x0-x
;C is y0-y
;H is x0+x
;L is y0+y


;plot(x0-x,y0-y)
;plot(x0+x,y0+y)
  push bc
  push hl
  call call_ix    ;call the plot routine
  pop bc
  push bc
  call call_ix    ;call the plot routine

;now swap the y coords
  pop hl
  pop bc
  ld a,l
  ld l,c
  ld c,a
  pop de
  xor a
  cp d
  jr z,+_
  cp e
  jr z,+_


;plot(x0-x,y0+y)
;plot(x0+x,y0-y)
  push hl
  call call_ix    ;call the plot routine
  pop bc
  call call_ix    ;call the plot routine
_:

  pop bc
  pop hl
  pop af
  pop de
  ret

The really cool feature about this is that you can define a custom plot routine pointed to by IX, so it isn't TI-specific, and you can do all sorts of wonky things like:
Draw 2x2 pixels:

;calling with `ld ix,pixelOn_2x2`

pixelOn_2x2:
  sla b
  ret c
  sla c
  ret c
  push bc
  call pixelOn
  pop bc
  inc b
  push bc
  call pixelOn
  pop bc
  inc c
  push bc
  call pixelOn
  pop bc
  dec b
  jp pixelOn

Or draw a circle whose "pixels" are circles:

;calling with `ld ix,pixelOn_circle`

pixelOn_circle:
  ld a,b
  cp 32
  ret nc
  add a,a
  add a,a
  ld b,a
  ld a,c
  cp 32
  ret nc
  add a,a
  add a,a
  ld c,a
  ld d,4
  push ix    ;need to save IX!
  ld ix,pixelOn
  call circle
  pop ix
  ret

EDIT: I inlined some subroutines because there was no reason to have them called. It was a waste of clock cycles and space!
EDIT: Have a separate, filled rectangle routine!
Note that if you pass the same arguments as the regular circle routine, this only draws the inside part and skips the border.

;Written by Zeda Thomas, free to use.

;This draws the fill of a circle centered at 8-bit coordinates and with radius
;up to 127.
;IX points to a `horizontal line` routine that takes E=x, A=y, D=width as input
;and does something with it, like plot a horizontal line.
;
; For example, on the ti-83+/84+/SE calculators, you might have:
;     horizontal_line:
;       ld b,e
;       ld c,a
;       ld e,1
;       ld hl,gbuf
;       jp rectOR

; Required subroutines:
;     call_ix:
;       jp (ix)

filledcircle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a `plot` routine that takes (B,C)=(x,y) as input.
  ld a,d
  add a,a
  ret c
  ret z
  ld l,d
  dec a
  ld e,a
  xor a
  ld h,-1
  ld d,1
filledcircleloop:
  ; call c,fillcircle_plot
  inc h
  sub d
  inc d
  inc d
  jr nc,filledcircleloop
_:
  dec l
  call fillcircle_plot
  add a,e
  dec e
  ret z
  dec e
  jr nc,-_
  jp filledcircleloop

fillcircle_plot:
  inc h
  dec h
  ret z
  push hl
  push de
  push bc
  push af
  dec h
  ld a,b
  sub h
  ld e,a
  ld d,h
  sll d   ;aka `slia`, undocumented

  ld a,l
  or a
  ld h,c
  jr z,+_
  add a,h
  push de
  push hl
  call nz,call_ix
  pop hl
  pop de
_:
  ld a,h
  sub l
  call call_ix
  pop af
  pop bc
  pop de
  pop hl
  ret

« Last Edit: September 10, 2019, 09:05:45 pm by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #101 on: April 05, 2020, 10:56:59 am »

I have two routines that I am proud of and thought they'd be useful!

Binary Search
Binary Search is an efficient way to search through sorted data. Regardless of the data, the core of the algorithm is the same, so I made a routine that is the core algorithm!
All you have to do is pass a pointer to a routine in IX that compares two pieces of data

;Requires `call_ix` routine that looks like:
;   call_ix:
;       jp (ix)
;
;IX points to a comparison routine that takes as input:
;   HL points to the input element to find
;   DE is the element to compare to
;Outputs are:
;   carry set if the HL element is less than the DE element
;   carry reset if the HL element is greater than or equal to the DE element
;   z set if the HL element is equal to the DE element
;   nz set if the HL element is not equal to the DE element
;
;This is useful if you have a table of pointers to strings, and want to find a
;string in the array. See the end of this file for an example.
;

binarysearch:
;This is a general-purpose binary search routine.
;
;Inputs:
;   BC is the element to find
;   HL is the number of elements
;   IX points to a callback routine that compares the input element
;Outputs:
;   DE is the matching element index
;   z means match found
;   nz means no match
;       In this case, DE is interpreted as where the match should have been
;Destroys:
;   AF, DE
;RAM needed:
;   10 bytes of stack space (4 pushes and a call)
;
  ld de,-1
binarysearch_loop_inc_lower:
  inc de
binarysearch_loop:
  push hl
  push de
  or a
  sbc hl,de
  jr z,binarysearch_nomatch
  rr h
  rr l
  add hl,de
  ld d,h
  ld e,l
  push hl   ;test index
  push bc   ;input
  ld h,b
  ld l,c
  call call_ix
  pop bc    ;input
  jr nc,binarysearch_input_bigger_or_equal
  pop hl    ;test index is the new upper index
  pop de    ;restore the lower index
  pop af    ;dummy pop
  jr binarysearch_loop

binarysearch_input_bigger_or_equal:
  pop de    ;test index +1 is the new lower index
  pop hl    ;dummy pop
  pop hl    ;restore upper index
  jr nz,binarysearch_loop_inc_lower
  ;a match was found
  ;DE is left as the index
  ret

binarysearch_nomatch:
  pop de    ;lower index
  pop hl    ;upper index (also lower index and test index)
  or 1
  ret

Heapsort
Heapsort is a clever and efficient sorting algorithm. Much like the Binary Search routine above, you pass IX pointing to a routine that actually interprets your data, so this can work on even your weirdest data format. IX points to a routine that either swaps two elements, or compares two elements (based on carry flag passed in).
NOTE: This also includes the heapify routine, used by heapsort. This might also be useful if you only need to arrange data into a heap structure, but don't need to sort.

; This routine requires the following subroutine:
;     call_ix:
;         jp (ix)
;
; To use SMC, define SMC.
;     #define SMC
;
; If you are not using SMC, you'll need to define `arraylen` to point
;to 2 bytes of scrap RAM

heapsort:
;Inputs:
;   BC is the size of the array.
;   IX points to the routine that compares or swpas two entries
;     HL is the index of the first element
;     DE is the index of the second element
;     c means swap
;     nc means compare HL to DE
;       Should return:
;         z if they are equal
;         nc if HL>=DE
;         c if HL<DE
;Outputs:
;   The data is sorted.
;Destroys:
;   All
;Notes:
;   You can make the comparison routine work any way that you want :)
;   For example, you can invert the carry flag output to sort descending.

;If the array is size 0 or 1, then it is sorted
    inc b
    dec b
    jr nz,heapsort_heapify
    dec c
    ret z
    inc c
    ret z

heapsort_heapify:
    call heapify
    ld hl,(arraylen)
heapsort_loop:
    dec hl
    ld (arraylen),hl
    push hl
;HL is the last element
    ld de,0
    call heapsort_swap
    ld bc,1
    call propogate
    pop hl
    ld a,h
    or l
    jr nz,heapsort_loop
    ret

heapify:
;Inputs:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
    ld (arraylen),bc
    srl b
    rr c
_:
    push bc
    call propogate
    pop bc
    dec bc
    ld a,b
    or c
    jr nz,-_
    ret

propogate:
;BC-1 is the parent index
;2BC-1 is the child0 index
;2BC is the child1 index
    sla c
    rl b
    ld d,b
    ld e,c

#ifdef SMC
arraylen=$+1
    ld hl,0
#else
    ld hl,(arraylen)
#endif

    sbc hl,de
    add hl,de
    ret c  ;no children
;z means one child
;compare the two children
    ld h,b
    ld l,c
    dec hl
;HL is the child0 index
;DE is the child1 index
    call nz,heapsort_cmp
    jr nc,+_
    ex de,hl
_:

;parent is (HL+1)>>1
;child is HL
    ld d,h
    ld e,l
    inc hl
    srl h
    rr l
    dec hl
    call heapsort_cmp
    ret nc
    call heapsort_swap

;values heapsort_swapped, now set parent=child+1
    ld b,d
    ld c,e
    inc bc
    jr propogate

heapsort_swap:
;HL and DE are the indices of the elements to heapsort_swap
    push hl
    push de
    scf
    call call_ix
    pop de
    pop hl
    ret

heapsort_cmp:
    push hl
    push de
    or a
    call call_ix
    pop de
    pop hl
    ret

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #102 on: July 23, 2020, 10:53:14 pm »

rand8 (very fast)
I have a beautiful RNG. It is 47cc and 10 bytes.

;HL must be non-zero
  add hl,hl
  sbc a,a
  and %00101101
  xor l
  ld l,a
  ld a,r
  add a,a   ;Because R is a 7-bit register
  add a,h
;HL is the new seed
;A is the output, or AL for 16 bits.

Spoiler For Explanation:

Spoiler For Images:

eZ80 8-bit RNG
If HL is 24-bit, as in ADL mode on the eZ80, then your LFSR needs different taps (but in this case, it doesn't improve cycle length because we are still using H instead of UHL):

;HL must be non-zero
  add hl,hl
  sbc a,a
  and %10000111
  xor l
  ld l,a
  ld a,r
  add a,a   ;Because R is a 7-bit register
  add a,h
;HL is the new seed
;A is the output, or AL for 16 bits.

I have no idea what the speed is on the eZ80. 10 fetches, maybe?

Spoiler For Comparison Images:

8-bit RNG, 39cc
Apparently even an 8-bit LFSR combined with the R register does a great job at generating noise, so here is an even faster version (faster for the Z80, though this code also works on the eZ80):

;E must be non-zero
;10 bytes, 39cc
;min period: 255
;max period: 32640
;avg period: 21760

  sla e
  sbc a,a
  and %01011111
  xor e
  ld e,a
  ld a,r
  add a,e

;E is the new seed, non-zero so long as the input was non-zero
;A is the output

Spoiler For Image:

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #103 on: June 18, 2021, 05:19:34 pm »

I wrote a program to generate 16-bit multiplication-by-constant routines. It isn't perfect, but it probably produced optimal code for a lot of them!

The attached tar.xz file has the (undocumented) program and a 28.8MB file of routines

For example, it has these two routines for mul_hl_1337:

mul_hl_1337:  ; size optimized
;17 bytes, 173cc
  ld b,h
  ld c,l
  add hl,hl
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  add hl,hl
  add hl,bc

mul_hl_1337:  ; speed optimized
;25 bytes, 148cc
  ld b,h
  ld c,l
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,bc
  ld b,h
  ld c,l
  add hl,hl
  add hl,bc
  ld a,h
  ld h,l
  rra
  rr h
  rra
  rr h
  rra
  and 11000000b
  ld l,a
  or a
  sbc hl,bc

z80-const-mul.tar.xz (919.95 kB - downloaded 213 times.)

« Last Edit: June 18, 2021, 05:35:44 pm by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

E37

LV6 Super Member (Next: 500)
Posts: 358
Rating: +23/-0
Trial and error is the best teacher

Re: ASM Optimized routines

« Reply #104 on: June 18, 2021, 06:15:20 pm »

Wow, that's really neat. I don't have a use for that at the moment but I bet I can find one in some old, slow code.

How did you go about making the routine generator? I assume there are some simple rules it follows.

Logged

I'm still around... kind of.

Print

Pages: 1 ... 5 6 [7] 8 Go Up

« previous next »

Server load over the past 5, 10 and 15 minutes respectively: 0.5380859375, 0.5205078125, 0.6005859375

Page created in 0.118 seconds with 61 queries.