• ASM Optimized routines 5 1
Currently:

Author Topic: ASM Optimized routines  (Read 52388 times)

0 Members and 1 Guest are viewing this topic.

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #75 on: November 13, 2013, 03:09:47 pm »
I made this rather fast 8.8 fixed point natural log (ln, loge) routine today based on some math I have been working on. It currently averages 677 t-states, but only works on [1,2] I plan to extend this to the whole range of numbers by using a 5 element LUT and a single division making the average somewhere around 1100 t-states:
Code: [Select]
lognat:;Input:  H.L needs to be on [1,2];Output: H.L if z flag is set, else if nz, no result;avg speed: 677 t-states dec h    dec h jr nz,$+8 inc l \ dec l ret nz ld l,177 ret inc h ret nz ld b,h ld c,l ld e,l ld d,8 add hl,hl add hl,hl add hl,de ex de,hl add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ adc a,a ld h,a \ ld l,b sla h \ jr c,$+3 \ ld l,c add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc rl l ld a,h adc a,b ld l,a ld h,b cp a retI had fun twisting some of the logic, but there is a division and multiplication in there. It might seem backwards, but it was optimal

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #76 on: November 14, 2013, 08:17:22 pm »
This version works on (0,128) and averages less than 1250 t-states:
Code: [Select]

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #78 on: November 16, 2013, 09:51:46 am »
Here is my 8.8 fixed point, table-based arctangent routine:
Code: [Select]
arctan_88:;Input:;    D.E;Output: atan(D.E)->D.E    push de    ld a,d    or a    jp p,$+5 neg ld d,a dec a jr nz,checkneedinv inc e \ dec e \ jr nz,checkneedinv pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ retcheckneedinv: inc a call nz,DEgt1_Inv;0.E is the value to atan ld hl,adjustatan push hl ld a,e cp 46 \ ret c dec a \ cp 42h \ ret c dec a \ cp 4Eh \ ret c dec a \ cp 57h \ ret c dec a \ cp 5Eh \ ret c dec a \ cp 64h \ ret c dec a \ cp 6Ah \ ret c dec a \ cp 6Fh \ ret c sub 6Fh \ ld e,a ld hl,atanlut add hl,de ld a,(hl) retadjustatan: ld e,a pop bc ld a,b or a jp p,$+5    neg    jr z,$+9 ld hl,402 or a sbc hl,de ex de,hl rl b ret nc xor a sub e ld e,a sbc a,a sub d ld d,a retDEgt1_Inv:;Works if DE>1 ld hl,256 ld b,8InvLoop: add hl,hl sbc hl,de jr nc,$+3    add hl,de    adc a,a    djnz InvLoop    cpl    ld e,a    ld d,b    retatanlut:.db $6F.db$6F.db $70.db$71.db $72.db$73.db $73.db$74.db $75.db$76.db $77.db$77.db $78.db$79.db $7A.db$7B.db $7B.db$7C.db $7D.db$7E.db $7F.db$7F.db $80.db$81.db $82.db$82.db $83.db$84.db $85.db$85.db $86.db$87.db $88.db$88.db $89.db$8A.db $8B.db$8B.db $8C.db$8D.db $8E.db$8E.db $8F.db$90.db $90.db$91.db $92.db$93.db $93.db$94.db $95.db$95.db $96.db$97.db $97.db$98.db $99.db$9A.db $9A.db$9B.db $9C.db$9C.db $9D.db$9E.db $9E.db$9F.db $A0.db$A0.db $A1.db$A2.db $A2.db$A3.db $A3.db$A4.db $A5.db$A5.db $A6.db$A7.db $A7.db$A8.db $A9.db$A9.db $AA.db$AA.db $AB.db$AC.db $AC.db$AD.db $AD.db$AE.db $AF.db$AF.db $B0.db$B0.db $B1.db$B2.db $B2.db$B3.db $B3.db$B4.db $B5.db$B5.db $B6.db$B6.db $B7.db$B7.db $B8.db$B9.db $B9.db$BA.db $BA.db$BB.db $BB.db$BC.db $BC.db$BD.db $BE.db$BE.db $BF.db$BF.db $C0.db$C0.db $C1.db$C1.db $C2.db$C2.db $C3.db$C3.db $C4.db$C4.db $C5.db$C6.db $C6.db$C7.db $C7.db$C8.db $C8.db$C9

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #79 on: November 19, 2013, 12:15:26 pm »
EDIT:(27-Oct-2015) Updated at the end of the original post with a muuuch better routine.

I wrote this routine today for generating pseudorandom numbers. If I did the math correctly, it has a period of 4292870399. To think of it differently, it executes in about 1430 1592 t-states, so that means at 15MHZ, it would take about 4 days, 19 hours, 49 minutes, 41 seconds to loop back to the start of the over 4 billion number sequence. Or, if you generated one number per second, it would take over 136 years to finish the cycle.
Code: [Select]
Rand24:;Inputs: seed1,seed2;Outputs:; HLA is the pseudo-random number; seed1,seed2 incremented accordingly;Destroys: BC,DE;Notes:; seed1*243+83 mod 65519 -> seed1; seed2*251+43 mod 65521 -> seed2; Output seed1*seed2 ld hl,(seed1) xor a ld b,h \ ld c,l ld de,83 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,hl \ rla ;60 add hl,hl \ rla ;120 add hl,bc \ adc a,d ;121 add hl,hl \ rla ;242 add hl,bc \ adc a,d ;243 add hl,de \ adc a,d ;243x+83;now AHL mod 65519; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl add hl,bc add hl,de ld de,65519 jr nc,$+5 or a \ sbc hl,de or a \ sbc hl,de jr nc,$+3 add hl,de ld (seed1),hl;seed1 done, now we need to do seed2 ld hl,(seed2); seed1*243+83 mod 65519 -> seed1; seed2*251+43 mod 65521 -> seed2; Output seed1*seed2 xor a ld b,h \ ld c,l ld de,43 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,bc \ adc a,d ;31 add hl,hl \ rla ;62 add hl,hl \ rla ;124 add hl,bc \ adc a,d ;125 add hl,hl \ rla ;250 add hl,bc \ adc a,d ;251 add hl,de \ adc a,d ;251x+83;now AHL mod 65521; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl sbc hl,bc add hl,de ld de,65521 jr nc,$+5 or a \ sbc hl,de or a \ sbc hl,de jr nc,$+3 add hl,de ld (seed2),hl;now seed1 and seed 2 are computed ld bc,(seed1) ex de,hl call BC_Times_DE ex de,hl ld l,b ld h,0 ld b,h ld c,l add hl,hl add hl,hl add hl,bc add hl,hl add hl,hl add hl,bc add hl,hl add hl,bc ld c,d ld d,e ld e,a ld a,c sbc hl,bc sbc a,b ret nc ld c,43 add hl,bc ret;now do BC_times_DEBC_Times_DE:;BHLA is the result ld a,b or a ld hl,0 ld b,h;1 add a,a jr nc,$+4 ld h,d ld l,e;2 add hl,hl rla jr nc,$+4 add hl,de adc a,b;227+10b-7p add hl,hl rla jr nc,$+4 add hl,de adc a,b add hl,hl rla jr nc,$+4 add hl,de adc a,b add hl,hl rla jr nc,$+4 add hl,de adc a,b add hl,hl rla jr nc,$+4 add hl,de adc a,b add hl,hl rla jr nc,$+4 add hl,de adc a,b add hl,hl rla jr nc,$+4 add hl,de adc a,b;===;AHL is the result of B*DE*256 push hl ld h,b ld l,b ld b,a ld a,c ld c,h;1 add a,a jr nc,$+4 ld h,d ld l,e;2 add hl,hl rla jr nc,$+4 add hl,de adc a,c;227+10b-7p add hl,hl rla jr nc,$+4 add hl,de adc a,c add hl,hl rla jr nc,$+4 add hl,de adc a,c add hl,hl rla jr nc,$+4 add hl,de adc a,c add hl,hl rla jr nc,$+4 add hl,de adc a,c add hl,hl rla jr nc,$+4 add hl,de adc a,c add hl,hl rla jr nc,$+4 add hl,de adc a,c pop de;Now BDE*256+AHL ld c,a ld a,l ld l,h ld h,c add hl,de ret nc inc b;BHLA is the 32-bit result retseed1: .dw 0seed2: .dw 0To test its 'randomness' I only used HL and it did not exhibit a perfectly uniform random distribution (standard deviation was slightly off from the expected value). However, it seems to be on par with the routine used by the OS and I believe I am using a modified version of that.

Now I will feel confident to use this in my math libraries

EDIT: Added a mod 16777259 to the output to fix a flaw.

EDIT: The much better routine:
• It passes all the tests at CAcert Labs
• The cycle size is almost 4.3 million times longer .
• It is 550% faster (287cc versus 1592cc)
• It would take over 11,184,544 years at the calculator's max speed to reach the full cycle.
Code: [Select]
rand:;Tested and passes all CAcert tests;Uses a very simple 32-bit LCG and 32-bit LFSR;it has a period of 18,446,744,069,414,584,320;roughly 18.4 quintillion.;291ccseed1_0=$+1 ld hl,12345seed1_1=$+1    ld de,6789    ld b,h    ld c,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    inc l    add hl,bc    ld (seed1_0),hl    ld hl,(seed1_1)    adc hl,de    ld (seed1_1),hl    ex de,hlseed2_0=$+1 ld hl,9876seed2_1=$+1    ld bc,54321    add hl,hl \ rl c \ rl b    ld (seed2_1),bc    sbc a,a    and %11000101    xor l    ld l,a    ld (seed2_0),hl    ex de,hl    add hl,bc    retEDIT (28 August 2019): Hey, so here is a compact version from a few years ago:
Code: [Select]
;#define smc ;uncomment if you are using SMCrand16:;collaboration by Zeda with Runer112;160cc or 148cc if using SMC;26 bytes;cycle: 4,294,901,760 (almost 4.3 billion)#ifdef smcseed1=$+1 ld hl,9999#else ld hl,(seed1)#endif ld b,h ld c,l add hl,hl add hl,hl inc l add hl,bc ld (seed1),hl#ifdef smcseed2=$+1 ld hl,9999#else    ld hl,(seed2)#endif    add hl,hl    sbc a,a    and %00101101    xor l    ld l,a    ld (seed2),hl    add hl,bc    retAnd here is a different routine that is faster and might be better (it is on the tests I am using):
Code: [Select]
xsp32:;Inputs: (seed1), (seed2), and (seed3) are 16-bit seeds. (seed1) and (seed2) can't both be 0.;Outputs: HL is the pseudorandom number;Destroys: A,DE,BC;cycle: 281,474,976,645,120;It would take about 185 years at 15MHz to repeat;min: 258cc (236cc if using SMC);max: 288cc (266cc if using SMC);avg: 273cc (251cc if using SMC);63 bytes (62 bytes if using SMC)#ifdef SMCseed1 = $+1 ld hl,12345seed2 =$+1  ld de,6789#else  ld hl,(seed1)  ld de,(seed2)#endif;first, XOR it with itself, shifted left 23 bits;low bit of d needs to be shifted in  ld a,h  rra  ld a,l  rra  jr nc,+_  rl e  ccf  rr e_:  xor d  ld d,a;XOR it with itself, shifted right 15 bits  ld a,h  rla  ld a,e  rla  xor l  ld l,a  ld a,e  rla  ld a,d  rla  jr nc,+_  rr e  ccf  rl e_:  xor h  ld h,a;XOR it with itself, shifted left 17 bits;HL<<1  ld (seed1),hl  add hl,hl  ld a,h  xor d  ld h,a  ld a,l  xor e  ld l,a  ld (seed2),hl  ex de,hl#ifdef SMCseed3 = $+1 ld hl,33333#else ld hl,(seed3)#endif inc hl inc h ld (seed3),hl add hl,de retIt has a smaller period, but still far larger than a calc needs. It uses smaller state, and combines a 32-bit xorshift with a basic 16-bit counter (increments by 257). Just a 32-bit xorshift routine, which is pretty decent on its own: Code: [Select] xs32:;32-bit xorshift;seed^=seed<<23;seed^=seed>>15;seed^=seed<<17;min: 209cc (193cc if using SMC);max: 239cc (223cc if using SMC);avg: 224cc (208cc if using SMC);53 bytes (52 bytes if using SMC)#ifdef SMCseed1 =$+1  ld hl,12345seed2 = $+1 ld de,6789#else ld hl,(seed1) ld de,(seed2)#endif;first, XOR it with itself, shifted left 23 bits;low bit of d needs to be shifted in ld a,h rra ld a,l rra jr nc,+_ rl e ccf rr e_: xor d ld d,a;XOR it with itself, shifted right 15 bits ld a,h rla ld a,e rla xor l ld l,a ld a,e rla ld a,d rla jr nc,+_ rr e ccf rl e_: xor h ld h,a;XOR it with itself, shifted left 17 bits;HL<<1 ld (seed1),hl add hl,hl ld a,h xor d ld h,a ld a,l xor e ld l,a ld (seed2),hl ret « Last Edit: August 28, 2019, 05:27:53 pm by Xeda112358 » Xeda112358 • they/them • Moderator • LV12 Extreme Poster (Next: 5000) • Posts: 4659 • Rating: +718/-6 • Calc-u-lator, do doo doo do do do. Re: ASM Optimized routines « Reply #80 on: November 28, 2013, 08:45:11 am » Here is a 32-bit square root routine. I am fairly sure there are optimisations that can be made, especially to the stuff I added last night. I already found a bunch of code that I removed, about 32 bytes, saving 128 t-states. Code: [Select] SqrtHLDE:;input: HLDE;Output: BC;310 bytes;Average is about 1443 t-states push de xor a ld b,a ld e,l ld l,h ld h,a add hl,hl add hl,hl cp h jr nc,$+5 dec h ld a,4 add hl,hl add hl,hl ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c ld l,e add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add a,a \ ld c,a add hl,hl add hl,hl jr nc,$+6 sub h \ jp $+6 sub h jr nc,$+6 inc c \ inc c cpl ld h,a ld a,l ld l,h add a,a ld h,a adc hl,hl adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a;iteration 9;now I need to rotate in more bits pop de sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla d \ adc hl,hl \ sla d \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla e \ adc hl,hl \ sla e \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla e \ adc hl,hl \ sla e \ adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sll c \ rl b sla e \ adc hl,hl \ sla e \ adc hl,hl jr nc,$+9    or a sbc hl,bc inc c jp $+13 sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a sla e \ adc hl,hl \ jr nc,$+4 inc c \ ret sla e \ adc hl,hl \ jr nc,$+8    sbc hl,bc \ or a \ jp $+7 or a \ sbc hl,bc \ ret c scf \ sbc hl,bc \ ret c inc c \ ret Matrefeytontias • Axe roxxor (kinda) • LV10 31337 u53r (Next: 2000) • Posts: 1982 • Rating: +310/-12 • Axe roxxor Re: ASM Optimized routines « Reply #81 on: April 29, 2014, 06:04:00 am » If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) : Code: [Select] sDiv2416: xor d push af xor d ld b, a jp p, divdPos xor a sub l ld l, a ld a, 0 sbc a, h ld h, a ld a, 0 sbc a, b ld b, adivdPos: bit 7, d jr z, divrPos xor a sub e ld e, a ld a, 0 sbc a, d ld d, a ld a, bdivrPos: push hl pop ix ld hl, 0 ld b, 24divLoop: add ix, ix rla adc hl, hl sbc hl, de jr nc,$ + 4 add hl, de ; jp nc, xxxx ; it works because the carry can not be set and "inc ix" is 2 bytes .db $D2 inc ix djnz divLoop push ix pop hl ld b, a pop af ld a, b ret p xor a sub l ld l, a ld a, 0 sbc a, h ld h, a ld a, 0 sbc a, b ret « Last Edit: April 29, 2014, 06:07:13 am by Matrefeytontias » Xeda112358 • they/them • Moderator • LV12 Extreme Poster (Next: 5000) • Posts: 4659 • Rating: +718/-6 • Calc-u-lator, do doo doo do do do. Re: ASM Optimized routines « Reply #82 on: March 01, 2015, 10:14:07 pm » This routine is a fast way to copy a chunk of data. For BC>34, calling this proves to be faster than using LDIR: Code: [Select] fastLDIR:;copy BC bytes from HL to DE;Cost:; 27cc for having to call; 110cc for setting up the loop, worst case; 10cc * ceiling(BC/n) ;n=2^k for some k, see the line below "ldirloop:"; 16cc * BC;costs roughly 152-BC*(5-10/n) more than a simple LDIR (worst case);for n=4, BC>=61 saves;for n=8, BC>=41 saves;for n=16, BC>=35 saves * default, see the "ldirloop" to change;for n=32, BC>=33 saves;for n=64, BC>=32 saves push hl push af xor a sub c and 15 ;change to n-1 add a,a ld hl,ldirloop add a,l ld l,a jr nc,$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.    inc h       ;    pop af    ex (sp),hl    retldirloop:;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)    ldi    ldi    ldi    ldi        ldi    ldi    ldi    ldi        ldi    ldi    ldi    ldi        ldi    ldi    ldi_ldirloop_end:    ldi    jp pe,ldirloop    retThis might be useful for things like copying code to RAM (from an App) in speed critical applications, or other data handling tasks.

EDIT 9 Oct 18: Saved 1 byte and 3cc. Originally, I had 'ld a,16 \ sub c \ and 15'. The 'ld a,16' could hold any multiple of n (16 in this example), including 0, so I just used 'xor a'. I also updated all the timing info.
« Last Edit: October 09, 2018, 07:31:40 pm by Xeda112358 »

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #83 on: March 02, 2015, 06:13:50 am »
Not sure if this is useful for any actual applications, but to get a mask for the lowest set bit in C:
Code: [Select]
    xor a    sub c    and cSo if c=%10110100, this would return a=%00000100 (it also always returns nc, and z only when c=0).

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #84 on: April 17, 2015, 05:39:14 pm »
This is a fast 16-bit Lehmer RNG. It is seeded, and has a period of 65536. This code
Code: [Select]
smc = 1   ;use 1 if the code is in RAM, since it is faster. If it is in an app, use 0 and define a 2-byte RAM location for the seed.lehmer:;;Input:;;  (seed) has the seed value of the RNG;;Output:;;  (seed) is updated, HL is the result;;Destroys:;;  A,DE,BC;;Timing:;;  if seed>0     231cc or 232cc, condition dependent;;  if seed=0     91cc;;  if smc=1      subtract 6cc;;Size: 44 bytes;;Notes:;;    Uses the Lehmer RNG used by the Sinclair ZX81;;    75x mod 65537 -> x#IF smc == 0    ld hl,(seed)#ELSEseed = $+1 ld hl,0#ENDIF;multiply by 75 ld c,l ld b,h xor a adc hl,hl \ jr z,special \ ld d,a \ rla add hl,hl \ rla add hl,hl \ rla \ add hl,bc \ adc a,d add hl,hl \ rla add hl,hl \ rla \ add hl,bc \ adc a,d add hl,hl \ rla \ add hl,bc;modulo 65537, see note below on how this works ld e,a sbc hl,de ;No need to reset the c flag since it is already jr nc,$+3    inc hl    ld (seed),hl    retspecial:;In the case that HL=0, this should be interpreted as 65536 = -1 mod 65537, so return -75 mod 65537 = -74 mod 65536 in HL    ld hl,-74    ld (seed),hl    ret    ;mod by 2^16 + 1 (a prime);current form is A*2^16+HL;need:;  (A*2^16+HL) mod (2^16+1);add 0 as +1-1;  (A*(2^16+1-1)+HL) mod (2^16+1);distribute;  (A*(2^16+1)-A+HL) mod (2^16+1);A*(2^16+1) mod 2^16+1 = 0, so remove;  (-A+HL) mod (2^16+1);Oh hey, that's easy! :P;I use this trick everywhere, you should, too.

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #85 on: August 02, 2015, 09:06:26 am »
So earlier on IRC I posted a challenge that I was thinking about before bed: Convert an 8-bit unsigned integer to BCD as fast as you can.

I managed to get my code down to 143.5cc, then jacobly noticed an optimization on mine, and I noticed a bug that had been carried through from the beginning. After all was done, we got it down to 131cc.

The code given is a subroutine (so adding an ret to the end, +1 byte +10cc). Without further ado:
Code: [Select]
L_To_Dec:;;Unrolled;;Converts the 8-bit register L to binary coded decimal;;Digits stored in LA (A has the lower 2 digits, L the upper).;;Inputs: L is the 8-bit unsigned int to convert;;Output: A has the lower 2 digits (in BCD form), L has the upper;;Destroys: H,F;;141cc ;;27 bytes    ld h,0    add hl,hl    add hl,hl    add hl,hl    add hl,hl    ld a,h \ daa  \ rl l    adc a,a \ daa \ rl l    adc a,a \ daa \ rl l    adc a,a \ daa \ rl l    adc a,a \ daa \ rl l    ret

EDIT: Found a transcription error. Had to adjust code, size, and timing by +4 bytes, +16cc
« Last Edit: August 03, 2015, 08:55:23 am by Xeda112358 »

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #86 on: August 03, 2015, 10:54:32 am »
Related to the previous post, I offer several alternatives to the bcalls _SetXXOP1 and _SetXXOP2. These routines are for converting 8-bit integers to TI floats. The advantages with my routines are that they are faster, you don't need to truncate to only the bottom 2 digits, and you can store the output to any location instead of just OP1 or OP2.

So first, if you still want the output to be the same:
Code: [Select]
setXXOP2:    ld hl,OP2    jr setXXsetXXOP1:    ld hl,OP1setXX:;;Inputs: A is the unsigned int;;        HL is where to write the TI float;;Destroys:All;;291cc+38b (or 144cc if A=0);;average: b=29/255;;295.3215686cc;;59 bytes    ld c,0    ld (hl),c \ inc hl    ld (hl),81h    inc hl \ ld (hl),c    ld d,h \ ld e,l    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    inc hl \ ld (hl),c    or a \ ret z    ;If A is zero, exit early. +138cc    ld l,a          ;\    ld h,c          ; |    add hl,hl       ; |Start converting A to BCD    add hl,hl       ; |    add hl,hl       ; |    add hl,hl       ; |    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD    adc a,a \ daa \ rl l    ; |Number is in LA    adc a,a \ daa \ rl l    ; |    adc a,a \ daa \ rl l    ; |    adc a,a \ daa           ;/ +124cc    ex de,hl    ld (hl),a    and $F0 ret nz ;+29cc rld ;\ Rotate up 1 digit dec hl ; | ld (hl),80h ; | ret ; /And if you want to get all 3 digits: Code: [Select] setXXXOP1: ld hl,OP1setXXX:;;Inputs: A is the unsigned int;; HL is where to write the TI float;;Destroys:All;;423cc+13a+63b (or 233cc if A=0);;average: a=99/255, b=29/255;;435.2117647cc average;;64 bytes ld bc,$0700    ld (hl),c    inc hl    ld (hl),83h    ld d,h    ld e,l    inc hl \ ld (hl),c \ djnz $-2 or a \ ret z ;If A is zero, exit early. +227cc ld l,a ;\ ld h,c ; | add hl,hl ; |Start converting A to BCD add hl,hl ; | add hl,hl ; | add hl,hl ; | ld a,h \ daa \ rl l ; |Finish converting A to BCD adc a,a \ daa \ rl l ; |Number is in LA adc a,a \ daa \ rl l ; | adc a,a \ daa \ rl l ; | adc a,a \ daa \ rl l ;/ +132cc ex de,hl jr nz,$+6 \ ld e,a \ xor a \ ld (hl),81h    ;+(21+4/85)cc    inc hl    ld (hl),e    inc hl    ld (hl),a    ld a,e    and $F0 ret nz ;+48cc rld ;\ Rotate up 1 digit dec hl ; | rld ; | dec hl ; | dec (hl) ; |Decrement exponent ret ; /+63(29/255)ccAnd if you want to do that, but maybe a little faster: Code: [Select] setXXX:;;Inputs: A is the unsigned int;; HL is where to write the TI float;;Destroys:All;;334cc+13a+63b (or 144cc if A=0);;average: a=99/255, b=29/255;;346.2117647cc average;;75 bytes ld c,0 ld (hl),c inc hl ld (hl),83h ld d,h ld e,l inc hl \ ld (hl),c inc hl \ ld (hl),c inc hl \ ld (hl),c inc hl \ ld (hl),c inc hl \ ld (hl),c inc hl \ ld (hl),c inc hl \ ld (hl),c or a \ ret z ;If A is zero, exit early. +227cc ld l,a ;\ ld h,c ; | add hl,hl ; |Start converting A to BCD add hl,hl ; | add hl,hl ; | add hl,hl ; | ld a,h \ daa \ rl l ; |Finish converting A to BCD adc a,a \ daa \ rl l ; |Number is in LA adc a,a \ daa \ rl l ; | adc a,a \ daa \ rl l ; | adc a,a \ daa \ rl l ;/ +132cc ex de,hl jr nz,$+6 \ ld e,a \ xor a \ ld (hl),81h    ;+(21+4/85)cc    inc hl    ld (hl),e    inc hl    ld (hl),a    ld a,e    and $F0 ret nz ;+48cc rld ;\ Rotate up 1 digit dec hl ; | rld ; | dec hl ; | dec (hl) ; |Decrement exponent ret ; /+63(29/255)ccAnd if you want to convert signed 8-bit ints: Code: [Select] setXXX_signed:;;Inputs: A is the signed int;; HL is where to write the TI float ld c,0 ld (hl),c add a,a jr c,$+6    neg    ld (hl),80h    inc hl    ld (hl),81h    ld d,h    ld e,l    inc hl \ ld (hl),c \ djnz $-2 or a \ ret z ld l,a ;\ ld h,c ; | add hl,hl ; |Start converting A to BCD add hl,hl ; | add hl,hl ; | add hl,hl ; | ld a,h \ daa \ rl l ; |Finish converting A to BCD adc a,a \ daa \ rl l ; |Number is in cA adc a,a \ daa \ rl l ; |(c is carry) adc a,a \ daa ;/ ex de,hl jr nc,$+15    ld (hl),82h    inc hl    inc hl    ld (hl),a    xor a    rld    or $10 dec hl ld (hl),a ret inc hl ld (hl),a and$F0    ret nz    rld         ;\ Rotate up 1 digit    dec hl      ; |    ld (hl),80h ; |    ret         ; /
The slowest routine here has a worst case of 499cc and slowest average case is 436cc.

Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4659
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
Re: ASM Optimized routines
« Reply #87 on: August 04, 2015, 10:52:59 am »
I decided to see if I could make a faster routine than what Runer made that did the same stuff and I came up with the following code. Mine is 3 bytes larger and on average almost twice as fast (ranging from 120cc to 525cc, and one case of 59cc). It has output in HL and returns a DataType error for inputs that are not non-negative reals, and a Domain error if the input exceeds a 16-bit unsigned integer.

One other advantage that mine has is that you can convert a float pointed to by DE, so you aren't limited to just OP1.
Code: [Select]
ConvOP1:;;Output: HL is the 16-bit result.    ld de,OP1ConvFloat:;;Input: DE points to the float.;;Output: HL is the 16-bit result.;;Errors: DataType if the float is negative or complex;;        Domain if the integer exceeds 16 bits.;;Timings:  Assume no errors were called.;;  Input is on:;;  (0,1)         => 59cc                        Average=59;;  0 or [1,10)   => 120cc or 129cc                     =124.5;;  [10,100)      => 176cc or 177cc                     =176.5;;  [100,1000)    => 309cc, 310cc, 318cc, or 319cc.     =314;;  [1000,10000)  => 376cc to 378cc                     =377;;  [10000,65536) => 514cc to 516cc, or 523cc to 525cc  =519.5;;Average case:  496.577178955078125cc;;vs 959.656982421875cc;;87 bytes     ld a,(de)    or a    jr nz,ErrDataType    inc de    ld hl,0    ld a,(de)    inc de    sub 80h    ret c    jr z,final    cp 5    jp c,enterloopErrDomain:;Throws a domain error.    bcall(_ErrDomain)ErrDataType:;Throws a data type error.    bcall(_ErrDataType)loop:    ld a,b    ld b,h    ld c,l    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hlenterloop:    ld b,a    ex de,hl    ld a,(hl) \ and $F0 \ rra \ ld c,a \ rra \ rra \ sub c \ add a,(hl) inc hl ex de,hl add a,l ld l,a jr nc,$+3    inc h    dec b    ret z    djnz loop    ld b,h    ld c,l    xor a;check overflow in this mul by 10!    add hl,hl \ adc a,a    add hl,hl \ adc a,a    add hl,bc \ adc a,0    add hl,hl \ adc a,a    jr nz,ErrDomainfinal:    ld a,(de)    rrca    rrca    rrca    rrca    and 15    add a,l    ld l,a    ret nc    inc h    ret nz    jr ErrDomainIf you do not want to throw errors and are comfortable returning garbage results:
Code: [Select]
ConvOP1:;;Output: HL is the 16-bit result.    ld de,OP1ConvFloat:;;Input: DE points to the float.;;Output: HL is the 16-bit result.;;  Input is on:;;  (0,1)         => 57cc                        Average=57;;  0 or [1,10)   => 117cc or 126cc                     =121.5;;  [10,100)      => 174cc or 175cc                     =174.5;;  [100,1000)    => 276cc, 277cc, 285cc, or 286cc.     =281;;  [1000,10000)  => 374cc to 376cc                     =375;;  [10000,65536) => 481cc to483cc, or 490cc to 492cc  =486.5;;Average case:  467.88153076171875cc     ld hl,0    ld a,(de)    or a    ret z    inc de    ld a,(de)    inc de    sub 80h    ret c    jr z,final    cp 5    jp c,enterloop    retloop:    ld a,b    ld b,h    ld c,l    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hlenterloop:    ld b,a    ex de,hl    ld a,(hl) \ and $F0 \ rra \ ld c,a \ rra \ rra \ sub c \ add a,(hl) inc hl ex de,hl add a,l ld l,a jr nc,$+3    inc h    dec b    ret z    djnz loop    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,bc    add hl,hlfinal:    ld a,(de)    rrca    rrca    rrca    rrca    and 15    add a,l    ld l,a    ret nc    inc h    ret
EDIT: I do not have time right now, but I found some really significant speed improvements while I was at work yesterday. Gotta take a kitten to the vet, but later I'll post the update ^~^

Here, this one is a bit faster, a little larger, and has the same outputs (DE,A) as the OS routine, but with modified error handling (no negative or non-real  inputs, max input is 65535 instead of 9999).
Code: [Select]

JamesV

• Posts: 266
• Rating: +76/-0
Re: ASM Optimized routines
« Reply #89 on: October 16, 2017, 10:50:12 pm »

This is great, Xeda!

I recently took a fairly standard Div16/8 routine and modified it slightly to allow the dividend to be signed. To do this, it checks HL (the dividend) at the start, and if it's negative then it negates it back to positive, performs the division, and then negates the result. Is this an efficient way of handling this, or would you suggest something else?

For what it's worth, it performs well enough for the task I was working on, but I'm just curious as I've not delved much into bit-level multiplication or division before

Code: [Select]
; divide HL by C (HL is signed, C is not); output: HL = quotient, A = remainderdivHLbyCs:    bit     7,h    push    af    jr      z,divHLbyCsStart    ld a,h \ cpl \ ld h,a    ld a,l \ cpl \ ld l,a    inc     hldivHLbyCsStart:    xor     a    ld      b,16divHLbyCsLoop:    add     hl,hl    rla    cp      c    jp      c,divHLbyCsNext    sub     c    inc     ldivHLbyCsNext:    djnz    divHLbyCLoop    ld      b,a    pop     af    ld      a,b    ret     z    ld a,h \ cpl \ ld h,a    ld a,l \ cpl \ ld l,a    inc     hl    ret