• ASM Optimized routines 5 1
Currently:  

Author Topic: ASM Optimized routines  (Read 35339 times)

0 Members and 1 Guest are viewing this topic.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #90 on: October 16, 2017, 11:52:22 pm »
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.

Code: [Select]
; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
    bit 7,h
    push    af
    jr      z,divHLbyCsStart
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
divHLbyCsStart:
    xor     a
    ld      b,16
divHLbyCsLoop:
    add     hl,hl
    rla
    cp      c
    jp      c,divHLbyCsNext
    sub     c
    inc     l
divHLbyCsNext:
    djnz    divHLbyCLoop
    ld      b,a
    pop     af
    ld      a,b
    ret     z
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld a,c
    sub b
    ret
I changed the output remainder so that it returns the remainder modulo c. So -7/3 returns 2 instead of 1. This means that no bytes nor clock cycles were saved after all :P

Offline JamesV

  • LV5 Advanced (Next: 300)
  • *****
  • Posts: 266
  • Rating: +76/-0
    • View Profile
    • James V's TI-Calculator page
Re: ASM Optimized routines
« Reply #91 on: October 17, 2017, 04:49:18 pm »
Thanks for the optimisations/fixes! I guess I was lucky that I don't actually use the remainder in my program, and also C will only ever be between 1-64 inclusive. I like the negate HL optimisation too, that's neat :)

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #92 on: July 27, 2018, 07:40:53 pm »
I have two neat routines, pushpop and diRestore. If you want to preserve HL,DE,BC, and AF, then you can just put a call to pushpop at the start of your routine. If you want interrupts to be restored on exiting your routine, you can call diRestore at the top of your code.

They both mess with the stack to insert a return routine on the stack. For example, when your code calls diRestore, then an ret will first return to the code to restore the interrupt status, and then back to the calling routine.

Code: [Select]
pushpop:
;26 bytes, adds 229cc to the calling routine
  ex (sp),hl
  push de
  push bc
  push af
  push hl
  ld hl,pushpopret
  ex (sp),hl
  push hl
  push af
  ld hl,12
  add hl,sp
  ld a,(hl)
  inc hl
  ld h,(hl)
  ld l,a
  pop af
  ret
pushpopret:
  pop af
  pop bc
  pop de
  pop hl
  ret
Code: [Select]
diRestore:
    ex (sp),hl
    push hl
    push af
    ld hl,restoreei
    ld a,r
    jp pe,+_
    dec hl
    dec hl
_:
    di
    inc sp
    inc sp
    inc sp
    inc sp
    ex (sp),hl
    dec sp
    dec sp
    dec sp
    dec sp
    pop af
    ret
restoredi:
    di
    ret
restoreei:
    ei
    ret
Edit: calc84maniac brought up that those 'inc sp' instructions might cause some issues if interrupts fire in between them. The code is modified to disable interrupts first, then mess with the stack.

Offline calc84maniac

  • eZ80 Guru
  • Coder Of Tomorrow
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2897
  • Rating: +467/-17
    • View Profile
    • TI-Boy CE
Re: ASM Optimized routines
« Reply #93 on: August 04, 2018, 06:55:30 pm »
32-Bit Endian Swap (eZ80)

Swaps the byte order of a 32-bit value stored in EUHL. Can be adapted to work with other register combinations as well.

Code: [Select]
endianswap32:
    push hl
    ld h,e
    ld e,l
    inc sp
    push hl
    inc sp
    pop hl
    inc sp
    ret
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #94 on: October 16, 2018, 06:33:14 pm »
Here is a decent arctangent routine that works on [0,1), so you'll have to do the work to extend it outside that range. It uses a small lookup table and linear interpolation.
Spoiler For Range Reduction:
atan(-x)=-atan(x) is useful to extend to negative numbers
atan(x)=pi/2-atan(1/x) is useful to extend to inputs with magnitude >=1
Code: [Select]
atan8:
;computes 256*atan(A/256)->A
;56 bytes including the LUT
;min: 246cc
;max: 271cc
;avg: 258.5cc
  rlca
  rlca
  rlca
  ld d,a
  and 7
  ld hl,atan8LUT
  add a,l
  ld l,a
#if (atan8LUT&255)>248    ;this section not included in size/speed totals
  jr nc,$+3               ;can add three bytes, 12cc to max, 11cc to min, and 11.5cc to avg
  inc h
#endif
  ld c,(hl)
  inc hl
  ld a,(hl)
  sub c
  ld e,0
  ex de,hl
  ld d,l
  ld e,a
  sla h \ jr nc,$+3 \ ld l,e
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl
  add hl,hl
  add hl,hl
;  add hl,hl    ;used in rounding...
  ld a,h
;  rra          ;but doesn't seem to improve the error
  adc a,c
  ret
atan8LUT:
  .db 0,32,63,92,119,143,165,184,201

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #95 on: March 24, 2019, 11:58:29 am »
32-bit square root:
Code: [Select]
sqrtHLIX:
;Input: HLIX
;Output: DE is the sqrt, AHL is the remainder
;speed: 751+6{0,6}+{0,3+{0,18}}+{0,38}+sqrtHL
;min: 1103
;max: 1237
;avg: 1165.5
;166 bytes

  call sqrtHL   ;expects returns A as sqrt, HL as remainder, D = 0
  add a,a
  ld e,a
  rl d

  ld a,ixh
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

;Now we have four more iterations
;The first two are no problem
  ld a,ixl
  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

sqrt32_iter15:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add a,a
  adc hl,hl
  add a,a
  adc hl,hl       ;This might overflow!
  jr c,sqrt32_iter15_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrt32_iter16
sqrt32_iter15_br0:
  or a
  sbc hl,de
_:
  inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrt32_iter16:
  add a,a
  ld b,a        ;either 0x00 or 0x80
  adc hl,hl
  rla
  adc hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret
This uses this sqrtHL routine:
Code: [Select]
;written by Zeda
sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 352cc
;max: 391cc
;avg: 371.5cc


  ld de,05040h  ; 10
  ld a,h        ; 4
  sub e         ; 4
  jr nc,sq7     ;\
  add a,e       ; | branch 1: 12cc
  ld d,16       ; | branch 2: 18cc
sq7:            ;/

; ----------

  cp d          ; 4
  jr c,sq6      ;\
  sub d         ; | branch 1: 12cc
  set 5,d       ; | branch 2: 19cc
sq6:            ;/

; ----------
  res 4,d       ; 8
  srl d         ; 8
  set 2,d       ; 8
  cp d          ; 4
  jr c,sq5      ;\
  sub d         ; | branch 1: 12cc
  set 3,d       ; | branch 2: 19cc
sq5:            ;/
  srl d         ; 8

; ----------

  inc a         ; 4
  sub d         ; 4
  jr nc,sq4     ;\
  dec d         ; | branch 1: 12cc
  add a,d       ; | branch 2: 19cc
  dec d         ; | <-- this resets the low bit of D, so `srl d` resets carry.
sq4:            ;/
  srl d         ; 8
  ld h,a        ; 4

; ----------

  ld a,e        ; 4
  sbc hl,de     ; 15
  jr nc,sq3     ;\
  add hl,de     ; | 12cc or 18cc
sq3:            ;/
  ccf           ; 4
  rra           ; 4
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq2      ;\
  or 20h        ; | branch 1: 23cc
  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.
sq2:            ; | branch 2: 21cc
  add hl,de     ;/

  xor 18h       ; 7
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq1      ;\
  or 8          ; | branch 1: 23cc
  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.
sq1:            ; | branch 2: 21cc
  add hl,de     ;/

  xor 6         ; 7
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr nc,+_      ;    \
  add hl,de     ; 15  |
  srl d         ; 8   |
  rra           ; 4   | branch 1: 38cc
  ret           ; 10  | branch 2: 40cc
_:              ;     |
  inc a         ; 4   |
  srl d         ; 8   |
  rra           ; 4   |
  ret           ; 10 /

sqrtHL was from my work on the float routines, sqrtHLIX was inspired by this thread :)

EDIT: Optimized some more :)

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #96 on: August 16, 2019, 11:41:13 pm »
Here are some routines that I've added to the repository:

itoa_8
Converts an 8-bit signed integer to an ASCII string.
Code: [Select]
;Converts an 8-bit signed integer to a string

itoa_8:
;Input:
;   A is a signed integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to five bytes at HL
;   All registers preserved.
;on 0 to 9:       252       D=0
;on 10 to 99:     258+20D   D=0 to 9
;on 100 to 127:   277+20D   D=0 to 2
;on -1 to -9:     276       D=0
;on -10 to -99:   282+20D   D=0 to 9
;on -100 to -128: 301+20D   D=0 to 2

;min: 252cc  (+23cc over original)
;max: 462cc  (-49cc over original)
;avg: 343.74609375cc = 87999/256
;54 bytes
  push hl
  push de
  push bc
  push af
  or a
  jp p,itoa_pos
  neg
  ld (hl),$1A  ;start if neg char on TI-OS
  inc hl
itoa_pos:
;A is on [0,128]
;calculate 100s place, plus 1 for a future calculation
  ld b,'0'
  cp 100 \ jr c,$+5 \ sub 100 \ inc b

;calculate 10s place digit, +1 for future calculation
  ld de,$0A2F
  inc e \ sub d \ jr nc,$-2
  ld c,a

;Digits are now in D, C, A
; strip leading zeros!
  ld a,'0'
  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of `cp *` to skip the next byte, turns into `cp $BB` which will always return nz and nc
  cp e \ jr z,$+4 \ ld (hl),e \ inc hl
  add a,c
  add a,d
  ld (hl),a
  inc hl
  ld (hl),0

  pop af
  pop bc
  pop de
  pop hl
  ret

fixed88_to_string
Uses the itoa_8 routine to convert an 8.8 fixed-point number to a string.
Code: [Select]
;This converts a fixed-point number to a string.
;It displays up to 3 digits after the decimal.

fixed88_to_str:
;Inputs:
;   D.E is the fixed-point number
;   HL points to where the string gets output.
;      Needs at most 9 bytes.
;Outputs:
;   HL is preserved
;Destroys:
;   AF,DE,BC

;First check if the input is negative.
;If so, write a negative sign and negate
  push hl
  ld a,d
  or a
  jp p,+_
  ld (hl),$1A      ;negative sign on TI-OS
  inc hl
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
_:

;Our adjusted number is in A.E
;Now we can print the integer part
  call itoa_8

;Check if we need to print the fractional part
  xor a
  cp e
  jr z,fixed88_to_str_end

;We need to write the fractional part, so seek the end of the string
;Search for the null byte. A is already 0
  cpir

;Write a decimal
  dec hl
  ld (hl),'.'

  ld b,3
_:
;Multiply E by 10, converting overflow to an ASCII digit
  call fixed88_to_str_e_times_10
  inc hl
  ld (hl),a
  djnz -_

;Strip the ending zeros
  ld a,'0'
_:
  cp (hl)
  dec hl
  jr z,-_

;write a null byte
  inc hl
  inc hl
  ld (hl),0

fixed88_to_str_end:
;restore HL
  pop hl
  ret

fixed88_to_str_e_times_10:
  ld a,e
  ld d,0
  add a,a \ rl d
  add a,a \ rl d
  add a,e \ jr nc,$+3 \ inc d
  add a,a
  ld e,a
  ld a,d
  rla
  add a,'0'
  ret

sqrtA
This is a very fast, unrolled routine to compute the square root of A.

Code: [Select]
sqrtA:
;Input: A
;Output: D is the square root, A is the remainder (input-D^2)
;Destroys: BC
;speed: 161+{0,6}+{0,1}+{0,1}+{0,3}
;min: 161cc
;max: 172cc
;avg: 166.5cc
;45 bytes
  ld d,$40

  sub d
  jr nc,+_
  add a,d
  ld d,0
_:

  set 4,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 5,d
  res 4,d
  srl d

  set 2,d
  sub d
  jr nc,+_
  add a,d
  .db $01   ;start of ld bc,** which is 10cc to skip the next two bytes.
_:
  set 3,d
  res 2,d
  srl d

  inc d
  sub d
  jr nc,+_
  add a,d
  dec d
_:
  inc d
  srl d
  ret

sqrtfixed_88
An unrolled, fast 8.8 fixed-point square root routine. Uses the above sqrtA routine.
Code: [Select]
sqrtfixed_88:
;Input: A.E ==> D.E
;Output: DE is the sqrt, AHL is the remainder
;Speed: 690+6{0,13}+{0,3+{0,18}}+{0,38}+sqrtA
;min: 855cc
;max: 1003cc
;avg: 924.5cc
;152 bytes

  call sqrtA
  ld l,a
  ld a,e
  ld h,0
  ld e,d
  ld d,h

  sla e
  rl d

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add a,a \ adc hl,hl
  add a,a \ adc hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

;Now we have four more iterations
;The first two are no problem
  sll e \ rl d
  add hl,hl
  add hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

  sll e \ rl d
  add hl,hl
  add hl,hl
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  .db $FE     ;start of `cp *`
_:
  inc e

sqrtfixed_88_iter11:
;On the next iteration, HL might temporarily overflow by 1 bit
  sll e \ rl d      ;sla e \ rl d \ inc e
  add hl,hl
  add hl,hl
  jr c,sqrtfixed_88_iter11_br0
;
  sbc hl,de
  jr nc,+_
  add hl,de
  dec e
  jr sqrtfixed_88_iter12
sqrtfixed_88_iter11_br0:
  or a
  sbc hl,de
_:
  inc e

;On the next iteration, HL is allowed to overflow, DE could overflow with our current routine, but it needs to be shifted right at the end, anyways
sqrtfixed_88_iter12:
  ld b,a      ;A is 0, so B is 0
  add hl,hl
  add hl,hl
  rla
;AHL - (DE+DE+1)
  sbc hl,de \ sbc a,b
  inc e
  or a
  sbc hl,de \ sbc a,b
  ret p
  add hl,de
  adc a,b
  dec e
  add hl,de
  adc a,b
  ret

ncr_HL_DE
Computes 'HL choose DE' in such a way so that overflow only occurs if the final result overflows 16 bits.
Code: [Select]
; Requires
;    mul16          ;BC*DE ==> DEHL
;    DEHL_Div_BC    ;DEHL/BC ==> DEHL

ncr_HL_DE:
;"n choose r", defined as n!/(r!(n-r)!)
;Computes "HL choose DE"
;Inputs: HL,DE
;Outputs:
;   HL is the result
;       "HL choose DE"
;   carry flag reset means overflow
;Destroys:
;   A,BC,DE,IX
;Notes:
;   Overflow is returned as 0
;   Overflow happens if HL choose DE exceeds 65535
;   This algorithm is constructed in such a way that intermediate
;   operations won't erroneously trigger overflow.
;66 bytes
  ld bc,1
  or a
  sbc hl,de
  jr c,ncr_oob
  jr z,ncr_exit
  sbc hl,de
  add hl,de
  jr c,$+3
  ex de,hl
  ld a,h
  or l
  push hl
  pop ix
ncr_exit:
  ld h,b
  ld l,c
  scf
  ret z
ncr_loop:
  push bc \ push de
  push hl \ push bc
  ld b,h
  ld c,l
  call mul16          ;BC*DE ==> DEHL
  pop bc
  call DEHL_Div_BC    ;result in DEHL
  ld a,d
  or e
  pop bc
  pop de
  jr nz,ncr_overflow
  add hl,bc
  jr c,ncr_overflow
  pop bc
  inc bc
  ld a,b
  cp ixh
  jr c,ncr_loop
  ld a,ixl
  cp c
  jr nc,ncr_loop
  ret
ncr_overflow:
  pop bc
  xor a
  ld b,a
ncr_oob:
  ld h,b
  ld l,b
  ret

EDIT: Optimized itoa_8 above. Here are some more routines:
uitoa_8
Converts an 8-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts an 8-bit unsigned integer to a string

uitoa_8:
;Input:
;   A is a signed integer
;   HL points to where the null-terminated ASCII string is stored (needs at most 5 bytes)
;Output:
;   The number is converted to a null-terminated string at HL
;Destroys:
;   Up to four bytes at HL
;   All registers preserved.
;on 0 to 9:     238              D=0
;on 10 to 99:   244+20D          D=0 to 9
;on 100 to 255: 257+2{0,6}+20D   D=0 to 5
;min: 238cc
;max: 424cc
;avg: 317.453125cc = 81268/256 = (238*10 + 334*90+313*156)/256
;52 bytes

  push hl
  push de
  push bc
  push af
;A is on [0,255]
;calculate 100s place, plus 1 for a future calculation
  ld b,'0'
  cp 100 \ jr c,$+5 \ sub 100 \ inc b
  cp 100 \ jr c,$+5 \ sub 100 \ inc b

;calculate 10s place digit, +1 for future calculation
  ld de,$0A2F
  inc e \ sub d \ jr nc,$-2
  ld c,a

;Digits are now in D, C, A
; strip leading zeros!
  ld a,'0'
  cp b \ jr z,$+5 \ ld (hl),b \ inc hl \ .db $FE  ; start of `cp *` to skip the next byte, turns into `cp $BB` which will always return nz and nc
  cp e \ jr z,$+4 \ ld (hl),e \ inc hl
  add a,c
  add a,d
  ld (hl),a
  inc hl
  ld (hl),0

  pop af
  pop bc
  pop de
  pop hl
  ret

itoa_16
Converts a 16-bit signed integer to an ASCII string.
Code: [Select]
;Converts a 16-bit signed integer to an ASCII string.

itoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 7 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
  push de
  push bc
  push af
  push hl
  bit 7,d
  jr z,+_
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a
  ld (hl),$1A     ;negative char on TI-OS
  inc hl
_:
  ex de,hl

  ld bc,-10000
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld bc,1000
  ld a,'9'+1
  dec a \ add hl,bc \ jr nc,$-2
  ld (de),a
  inc de

  ld bc,-100
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld a,l
  ld h,'9'+1
  dec h \ add a,10 \ jr nc,$-3
  add a,'0'
  ex de,hl
  ld (hl),d
  inc hl
  ld (hl),a
  inc hl
  ld (hl),0

;No strip the leading zeros
  pop hl

;If the first char is a negative sign, skip it
  ld a,(hl)
  cp $1A
  push af
  ld a,'0'
  jr nz,$+3
  inc hl
  cp (hl)
  jr z,$-2

;Check if we need to re-write the negative sign
  pop af
  jr nz,+_
  dec hl
  ld (hl),a
_:

  pop af
  pop bc
  pop de
  ret

uitoa_16
Converts a 16-bit unsigned integer to an ASCII string.
Code: [Select]
;Converts a 16-bit unsigned integer to an ASCII string.

uitoa_16:
;Input:
;   DE is the number to convert
;   HL points to where to write the ASCII string (up to 6 bytes needed).
;Output:
;   HL points to the null-terminated ASCII string
;      NOTE: This isn't necessarily the same as the input HL.
  push de
  push bc
  push af
  ex de,hl

  ld bc,-10000
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld bc,1000
  ld a,'9'+1
  dec a \ add hl,bc \ jr nc,$-2
  ld (de),a
  inc de

  ld bc,-100
  ld a,'0'-1
  inc a \ add hl,bc \ jr c,$-2
  ld (de),a
  inc de

  ld a,l
  ld h,'9'+1
  dec h \ add a,10 \ jr nc,$-3
  add a,'0'
  ex de,hl
  ld (hl),d
  inc hl
  ld (hl),a
  inc hl
  ld (hl),0

;No strip the leading zeros
  ld c,-6
  add hl,bc
  ld a,'0'
  inc hl \ cp (hl) \ jr z,$-2
  pop af
  pop bc
  pop de
  ret

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #97 on: August 28, 2019, 09:33:29 am »
Here is a masked sprite routine (no clipping)! Interleave the data with the mask, where the MASK is ANDed with the buffer, and the data is ORed on top of that:
Code: [Select]
;Masked Sprite routine
putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ANDed to the buffer, then data is ORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is any, data is 1
;     white: mask is 0, data is 0
;     clear: mask is 1, data is 0 (keeps the data from the buffer)
;
;This routine is free to use :)
;65 bytes (or 66 bytes if gbuf is not located at 0x**40

  ld e,l
  ld h,0
  ld d,h
  add hl,hl
  add hl,de
  add hl,hl
  add hl,hl
  ld e,a
  and 7
  ld c,a
  xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
  inc a
  rra
  rra
  rra
  ld e,a
  ld d,plotSScreen>>8
#else
  rra
  rra
  rra
  ld e,a
  add hl,de
  ld de,plotSScreen
#endif
  add hl,de


putsprite_masked_loop:
  push bc
  xor a
  ld d,(ix)
  ld e,a
  sub c
  ld b,c
  ld c,$FF
  inc ix
  ld a,(ix)
  jr z,putsprite_masked_rotdone
putsprite_masked_rot:
  scf
  rra
  rr c
  srl d
  rr e
  djnz putsprite_masked_rot
putsprite_masked_rotdone:
  and (hl)
  or d
  ld (hl),a
  inc hl
  ld a,(hl)
  and c
  or e
  ld (hl),a
  ld c,11
  add hl,bc
  inc ix
  pop bc
  djnz putsprite_masked_loop
  ret
But if you want even faster and smaller, use a non-traditional mask technique by ORing the mask onto the buffer, then XORing the data on top of it. The format is less intuitive, but it allows for white/black/clear/invert instead of just white/black/clear:
Code: [Select]
;Masked Sprite routine
putsprite_masked:
;Inputs:
;   (A,L) = (x,y)
;   B is height
;   IX points to the sprite data
;       first byte is the data
;       second byte is mask
;       continues, alternating like this.
;
;Outputs:
;   Mask is ORed to the buffer, then data is XORed on top of that.
;
;Destroys:
;   AF, BC, DE, HL, IX
;
;Notes:
;   To set a pixel...
;     black: mask is 1, data is 0
;     white: mask is 1, data is 1
;     clear: mask is 0, data is 0 (keeps the data from the buffer)
;     invert: mask is 0, data is 1 (inverts the data from the buffer)
;
;This routine is free to use :)
;63 bytes (or 64 bytes if gbuf is not located at 0x**40

  ld e,l
  ld h,0
  ld d,h
  add hl,hl
  add hl,de
  add hl,hl
  add hl,hl
  ld e,a
  and 7
  ld c,a
  xor e  ;essentially gets E with the bottom 3 bits reset
#if (plotSScreen&255) = 64
  inc a
  rra
  rra
  rra
  ld e,a
  ld d,plotSScreen>>8
#else
  rra
  rra
  rra
  ld e,a
  add hl,de
  ld de,plotSScreen
#endif
  add hl,de

putsprite_masked_loop:
  push bc
  xor a
  ld d,(ix)
  ld e,a
  or c
  ld b,c
  ld c,e
  inc ix
  ld a,(ix)
  jr z,putsprite_masked_rotdone
putsprite_masked_rot:
  rra
  rr c
  srl d
  rr e
  djnz putsprite_masked_rot
putsprite_masked_rotdone:
  or (hl)
  xor d
  ld (hl),a
  inc hl
  ld a,(hl)
  or c
  xor e
  ld (hl),a
  ld c,11
  add hl,bc
  inc ix
  pop bc
  djnz putsprite_masked_loop
  ret



I also made some "bigsprite" routines! These do have clipping, too. First, they use some common subroutines for computing masks and performing most of the clipping and shifting:
Code: [Select]
;133 bytes total

;This is made by Zeda, feel free to use it for whatever.
;Takes inputs for a big sprite and sets up masks and clipping
;requires 4 bytes of temporary RAM, but doesn't use SMC


spritetmp = 8000h     ;relocate this as needed! Just need 4 bytes.
sprite_width  = spritetmp+0
sprite_x      = spritetmp+1
sprite_mask0  = spritetmp+2
sprite_mask1  = spritetmp+3

bigsprite_subroutine:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;Outputs:
;   carry flag is set if okay to draw, nc if out-of-bounds.
;   B is height.
;   C is width.
;   HL points to the byte to start drawing at.
;   DE points to where to start sourcing the sprite data
;   (sprite_width) is the width of the sprite in bytes
;   (sprite_x) is the intitial x coordinate to begin drawing at
;   (sprite_mask0) is the left mask
;   (sprite_mask1) is the right mask
;92 bytes

;First check if the sprite is on-screen in the horizontal direction
  ld a,c
  cp 64
  jr c,+_
  add a,h
  ret nc
  ld h,a
  push hl
  xor a
  ld h,a
  sub c
  ex de,hl
  add hl,de
  dec a
  jr nz,$-2
  ex de,hl
  pop hl
  xor a
  ld c,a
_:
;Next check h+c<=64
  ld a,64
  sub c
  cp h
  jr nc,+_
  ld h,a
_:

;Make sure the height is not now 0
  ld a,h
  or a
  ret z

;Save the width and height of the sprite
  push hl               ;height,width
  ld h,b
  ld (sprite_width),hl  ;x,width
  push de               ;sprite pointer

;Set up a pointer to the routine for shifting the  routine for shifting the sprite data
  ld ixh,rshiftHA_7>>8
  ld a,h
  cpl
  and 7
  ld l,a
  add a,a
  add a,l
  add a,rshiftHA_7&255
  ld ixl,a
#if (rshiftHA_7&255)>234
  jr nc,$+4
  inc ixh
#endif

  ld a,b
  and 7
  ld de,spritemask
  add a,e
  ld e,a
#if spritemask&255>248
  jr nc,$+3
  inc d
#endif
  ld a,(de)
  ld (sprite_mask0),a
  cpl
  ld (sprite_mask1),a
;
;
  ld a,c
  add a,a
  sbc a,a
  ld h,a
  ld a,b
  ld b,h
  ld l,c
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  ld c,a
  add a,a
  sbc a,a
  ld b,a
  ld a,c
  sra c
  sra c
  sra c
  add hl,bc
  ld bc,plotSScreen
  add hl,bc

  pop de
  pop bc
  ;B is height
  ;C is width
  ex de,hl
  scf
  ret

rshiftHA_7:
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  rr h \ rra
  ex de,hl
  ld e,a
  ret

spritemask:
  .db $00,$80,$C0,$E0,$F0,$F8,$FC,$FE
call_ix:
  jp (ix)
Then you can draw a big sprite with OR logic:
Code: [Select]
bigsprite_OR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_OR_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_OR:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  or d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  or e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_OR


  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_OR_loop
  ret
Or draw with XOR logic:
Code: [Select]
bigsprite_XOR:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;68 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_XOR_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_XOR:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  xor d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  xor e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_XOR


  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_XOR_loop
  ret
Or draw with AND logic:
Code: [Select]
bigsprite_AND:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;69 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_AND_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_AND:
  push bc
  push hl
  ld h,(hl)
  scf \ sbc a,a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(hl)
  and d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  ld a,(hl)
  and e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_AND


  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_AND_loop
  ret
Or draw with Erase logic:
Code: [Select]
bigsprite_Erase:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;67 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_Erase_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_Erase:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,d
  cpl
  and (hl)
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,e
  cpl
  and (hl)
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_Erase

  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_Erase_loop
  ret
Or draw with Overwrite logic:
Code: [Select]
bigsprite_Overwrite:
;Inputs:
;     B is the X-coordinate
;     C is the Y-Coordinate
;     DE points to the sprite
;     H is the height
;     L is the width in bytes
;71 bytes

;Set up the clipping
  call bigsprite_subroutine
  ret nc

bigsprite_Overwrite_loop:
  push bc   ;height,width
  push de   ;gbuf ptr
  push hl   ;sprite data pointer
  ld a,(sprite_x)
  ld c,a
  add a,8
  ld (sprite_x),a

spriteloop_Overwrite:
  push bc
  push hl
  ld h,(hl)
  xor a
  call call_ix
  ld a,c
  cp 96
  jr nc,+_
  ld a,(sprite_mask0)
  and (hl)
  or d
  ld (hl),a
  ld a,c
_:
  inc hl
  add a,8
  cp 96
  jr nc,+_
  ld a,(sprite_mask1)
  and (hl)
  or e
  ld (hl),a
_:
  ld bc,11
  add hl,bc
  ex de,hl
  pop hl
  ld a,(sprite_width)
  ld c,a
  add hl,bc
  pop bc
  djnz spriteloop_Overwrite

  pop hl
  inc hl
  pop de
  inc de
  pop bc
  dec c
  jr nz,bigsprite_Overwrite_loop
  ret

Offline SpiroH

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 713
  • Rating: +153/-23
    • View Profile
Re: ASM Optimized routines
« Reply #98 on: August 28, 2019, 10:07:05 am »
@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language :/.


Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #99 on: August 28, 2019, 10:11:55 am »
@Zeda: Nice performance aware exercise! (Except for the many push and pop which look a bit dated to me).
I wonder how many are really interested in speeding up the calculations these days.
It seems all they care about is python, java and what-have-you-funky-high-level-language :/.


Thanks :) I wrote these with apps in mind, so I tried to reduce the need for external RAM. I should definitely make versions that take full advantage of SMC, though!

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4571
  • Rating: +716/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #100 on: September 10, 2019, 06:18:29 pm »
Here is a circle routine that I made!
Code: [Select]
;Written by Zeda Thomas, free to use.

;This draws a circle centered at 8-bit coordinates and with radius up to 127.
;IX points to a `plot` routine that takes (B,C)=(x,y) as input and does something
;with it, like plot the pixel a certain color, or plot a "big" pixel, or whatever.
;   plot
;     Takes coordinates, (B,C) = (x,y) and plots the point.
;
;For example, on the TI-83+/84+/SE the plot routine might look like:
; plot:
;   call getpixelloc
;   ret nc            ;Exit if the coordinates are out-of-bounds
;   or (hl)
;   ld (hl),a
;   ret
;
;
; Required subroutines:
;     call_ix:
;       jp (ix)

circle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a `plot` routine that takes (B,C)=(x,y) as input.
  ld a,d
  add a,a
  ret c
  ret z
  ld l,d
  dec a
  ld e,a
  dec a        ;if the pixel is only 1 wide, just plot the point
  jp z,call_ix ;Jump to the plot routine
  xor a
  ld h,-1
  ld d,1
  scf     ;skip the first plot
circleloop:
  call nc,plot4
  inc h
  sub d
  inc d
  inc d
  jr nc,circleloop
_:
  dec l
  call plot4
  add a,e
  dec e
  ret z
  dec e
  jr nc,-_
  jp circleloop


plot4:
;BC is center
;HL is x,y

  push de
  push af
  push hl
  push bc

;If H is 0, or L is 0, we need to draw only half
  push hl

  ld a,b
  sub h
  ld b,a
  add a,h
  add a,h
  ld h,a

  ld a,c
  sub l
  ld c,a
  add a,l
  add a,l
  ld l,a

;B is x0-x
;C is y0-y
;H is x0+x
;L is y0+y


;plot(x0-x,y0-y)
;plot(x0+x,y0+y)
  push bc
  push hl
  call call_ix    ;call the plot routine
  pop bc
  push bc
  call call_ix    ;call the plot routine

;now swap the y coords
  pop hl
  pop bc
  ld a,l
  ld l,c
  ld c,a
  pop de
  xor a
  cp d
  jr z,+_
  cp e
  jr z,+_


;plot(x0-x,y0+y)
;plot(x0+x,y0-y)
  push hl
  call call_ix    ;call the plot routine
  pop bc
  call call_ix    ;call the plot routine
_:

  pop bc
  pop hl
  pop af
  pop de
  ret
The really cool feature about this is that you can define a custom plot routine pointed to by IX, so it isn't TI-specific, and you can do all sorts of wonky things like:
Draw 2x2 pixels:

Code: [Select]
;calling with `ld ix,pixelOn_2x2`

pixelOn_2x2:
  sla b
  ret c
  sla c
  ret c
  push bc
  call pixelOn
  pop bc
  inc b
  push bc
  call pixelOn
  pop bc
  inc c
  push bc
  call pixelOn
  pop bc
  dec b
  jp pixelOn

Or draw a circle whose "pixels" are circles:

Code: [Select]
;calling with `ld ix,pixelOn_circle`

pixelOn_circle:
  ld a,b
  cp 32
  ret nc
  add a,a
  add a,a
  ld b,a
  ld a,c
  cp 32
  ret nc
  add a,a
  add a,a
  ld c,a
  ld d,4
  push ix    ;need to save IX!
  ld ix,pixelOn
  call circle
  pop ix
  ret
EDIT: I inlined some subroutines because there was no reason to have them called. It was a waste of clock cycles and space!
EDIT: Have a separate, filled rectangle routine!
Note that if you pass the same arguments as the regular circle routine, this only draws the inside part  and skips the border.
Code: [Select]
;Written by Zeda Thomas, free to use.

;This draws the fill of a circle centered at 8-bit coordinates and with radius
;up to 127.
;IX points to a `horizontal line` routine that takes E=x, A=y, D=width as input
;and does something with it, like plot a horizontal line.
;
; For example, on the ti-83+/84+/SE calculators, you might have:
;     horizontal_line:
;       ld b,e
;       ld c,a
;       ld e,1
;       ld hl,gbuf
;       jp rectOR

; Required subroutines:
;     call_ix:
;       jp (ix)

filledcircle:
;Input:
; (B,C) is the center (x,y)
; D is the radius, unsigned, less than 128 (0 or greater than 128 just quits).
; IX points to a `plot` routine that takes (B,C)=(x,y) as input.
  ld a,d
  add a,a
  ret c
  ret z
  ld l,d
  dec a
  ld e,a
  xor a
  ld h,-1
  ld d,1
filledcircleloop:
  ; call c,fillcircle_plot
  inc h
  sub d
  inc d
  inc d
  jr nc,filledcircleloop
_:
  dec l
  call fillcircle_plot
  add a,e
  dec e
  ret z
  dec e
  jr nc,-_
  jp filledcircleloop

fillcircle_plot:
  inc h
  dec h
  ret z
  push hl
  push de
  push bc
  push af
  dec h
  ld a,b
  sub h
  ld e,a
  ld d,h
  sll d   ;aka `slia`, undocumented

  ld a,l
  or a
  ld h,c
  jr z,+_
  add a,h
  push de
  push hl
  call nz,call_ix
  pop hl
  pop de
_:
  ld a,h
  sub l
  call call_ix
  pop af
  pop bc
  pop de
  pop hl
  ret