# Omnimaga

## Calculator Community => TI Calculators => ASM => Topic started by: Xeda112358 on January 29, 2015, 09:25:53 am

Title: eZ80 Optimized Routines
Post by: Xeda112358 on January 29, 2015, 09:25:53 am
EDIT: Routines so far:
mul16
mul32
gcd16
sqrt16
sqrt24
rand24
setbuffer
hl*a/255
div16
prng24

Now that the eZ80 has been confirmed as the new processor in the next line of TI calculators, I thought it would be fun to start making optimized routines in preparation!

We can take advantage of the 8-bit multiplication routine for multiplication of course. As a note, I did not find Karatusba multiplication to be faster (it was 66 to 76 clock cycles).
mul16 optimized
Code: [Select]
`mul16:;;Expects Z80 mode;;Inputs: hl,bc;;Outputs: Upper 16 bits in DE, lower 16 bits in BC;;54 or 56 t-states;;30 bytes     ld d,c \ ld e,l \ mlt de \ push de ;11    ld d,h \ ld e,b \ mlt de           ;8    ld a,l \ ld l,c \ ld c,a           ;3    mlt hl \ mlt bc \ add hl,bc        ;13    jr nc,\$+3 \ inc de \ pop bc        ;6    ld a,b \ add a,l \ ld b,a          ;3    ld a,e \ adc a,h \ ld e,a          ;3    ret nc \ inc d \ ret               ;7 (+2 for carry)`
Notice on the right side are the instruction timings. MLT is always 6 cycles and can be used in Z80 or ADL mode. All they did was include it in the extended instructions. In this case, I use Z80 mode to take advantage of some of the math and as a result, I also save 3 t-states over ADL mode which would require that add hl,bc to be done in z80 mode, so "add.s hl,bc" which costs 2 cycles instead of one, and the pushes/pops would require 4cc instead of 3cc each.

So, who wants to have fun? :D

EDIT: 5-Feb-15
mul32 optimized
Code: [Select]
`mul32:;;Expects Z80 mode;;Inputs:  ix points to the first 32-bit number;;         ix+4 points to the next 32-bit number;;Outputs: ix+8 points to the 64-bit result;;Algorithm: Karatsuba;348cc to 375cc;compute z0    ld hl,(ix)        ;5    ld bc,(ix+4)      ;5    call mul16        ;59+2    ld (ix+8),bc      ;5    ld (ix+10),de     ;5;compute z2                ld hl,(ix+2)      ;5    ld bc,(ix+6)      ;5    call mul16        ;59+2    ld (ix+12),bc     ;5    ld (ix+14),de     ;5;compute z1, most complicated as it requires 17-bits of info for each factor    ld hl,(ix+2)      ;5    ld bc,(ix)        ;5    add hl,bc         ;1    rla               ;1    ld b,h            ;1    ld c,l            ;1                           ld hl,(ix+6)      ;5    ld de,(ix+4)      ;5    add hl,de         ;1    rla               ;1                           push hl           ;3    push bc           ;3    push af           ;3    call mul16        ;59+2    pop af            ;3    and 3             ;2    ex de,hl          ;1    pop de            ;3    jr z,\$+7 ;label0  ;3|(6+1);bit 0 means add [stack0] to the upper 16 bits;bit 1 means add [stack1] to the upper 16 bits;both mean add 2^32        jp po,\$+5 \ or 4                     ;--    rra \ jr nc,\$+7                      ;4+4    rrca \ add hl,bc \ adc a,0 \ rlca    ;--                                         ;    srl a \ pop de \ jr nc,\$+5           ;8+2    add hl,bc \ adc a,0                  ;    ld d,b \ ld e,c                      ;2;z1 = AHLDE-z0-z2    ld bc,(ix+8) \ ex de,hl \ sbc hl,bc              ;8    ld bc,(ix+10) \ ex de,hl \ sbc hl,bc \ sbc a,0   ;10    ld bc,(ix+12) \ ex de,hl \ sbc hl,bc             ;8    ld bc,(ix+14) \ ex de,hl \ sbc hl,bc \ sbc a,0   ;10    ex de,hl                                         ;1    ld bc,(ix+10) \ add hl,bc \ ld (ix+10),hl \ ex de,hl  ;12    ld bc,(ix+12) \ adc hl,bc \ ld (ix+12),hl \ adc a,0   ;13    ret z \ ld hl,(ix+14) \ add a,l \ ld l,a              ;7+16    jr nc,\$+3 \ inc h \ ld (ix+14),hl \ ret               ;-- `gcd16 optimized
Code: [Select]
`GCD16:;;Expect Z80 mode;;Inputs: HL,DE;;Output: HL;;        BC=0;;        DE=0;;        A=0;Binary GCD algorithm;About 432cc on average (average of 500 iterations with randomized inputs on [0,65535];78 bytes    xor a    ld b,a    ld c,a    sbc hl,bc    ex de,hl    ret z    sbc hl,bc    ex de,hl    ret z;factor out cofactor powers of two    ld a,l \ or e \ rra \ jr nc,\$+16    srl h \ rr l    srl d \ rr e    inc b    ld a,l \ or e \ rra \ jr nc,\$-12.loop:;factor out powers of 2 from hl    ld a,l \ rra \ ld a,h \ jr c,\$+10    rra \ rr l \ bit 0,l \ jr z,\$-5 \ ld h,a;factor out powers of 2 from de    ld a,e \ rra \ ld a,d \ jr c,\$+10    rra \ rr e \ bit 0,e \ jr z,\$-5 \ ld d,a    xor a    sbc hl,de    jr z,finish    jr nc,loop    add hl,de    or a    ex de,hl    sbc hl,de    jr loop.finish:    ex de,hl    dec b    inc b    ret z    add hl,hl \ djnz \$-1 \ ret`sqrt16 optimized
Code: [Select]
`sqrt16:;;Expext Z80 mode;;Inputs: HL;;Output: A;Unrolled, speed optimized.;At most 112 clock cycles;111 bytes.    xor a \ ld c,a \ ld d,a \ ld e,l \ ld l,h \ ld h,c    add hl,hl \ add hl,hl \ sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla    sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla    sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla    sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla    sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla    sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    sla c \ ld a,c \ add a,a \ add hl,hl \ add hl,hl    jr nc,\$+5 \ sub h \ jr \$+5 \ sub h \ jr nc,\$+5 \ inc c \ cpl \ ld h,a    ld e,h \ ld h,l \ ld l,c \ ld a,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    sbc hl,de \ rla \ ret `rand24 optimized
Code: [Select]
`Rand24:;;Expects Z80 mode;;Inputs: seed1,seed2;;Outputs:;; AHL is the pseudo-random number;; seed1,seed2 incremented accordingly;;Destroys: BC,DE;;Notes:;; seed1*243+83 mod 65519 -> seed1;; seed2*251+43 mod 65521 -> seed2;; Output seed1*seed2 mod 16777259;;215cc worst case ld hl,(seed1)    ld d,h    ld h,243    ld e,h    mlt hl    mlt de    ld a,83    add a,l    ld l,a    ld a,h    adc a,e    ld h,a    ld a,d    adc a,0;now AHL mod 65519; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL    ld c,a    ld b,17    mlt bc    add hl,bc ld de,65519 jr nc,\$+5 or a \ sbc hl,de or a \ sbc hl,de jr nc,\$+3 add hl,de ld (seed1),hl;seed1 done, now we need to do seed2 ld hl,(seed2)    ld d,h    ld h,251    ld e,h    mlt hl    mlt de    ld a,43    add a,l    ld l,a    ld a,h    adc a,e    ld h,a    ld a,d    adc a,0;now AHL mod 65521; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL    ld c,a    ld b,15    mlt bc    add hl,bc ld de,65521 jr nc,\$+5 or a \ sbc hl,de or a \ sbc hl,de jr nc,\$+3 add hl,de ld (seed2),hl;now seed1 and seed 2 are computed;seed1*seed2 mod 16777259 ld bc,(seed1) call mul16   ; -> DEBC;16777259 = 2^24+43;D2^24+EBC mod 16777259 = EBC+(D*2^24+43D-43D) mod 16777259;D2^24+EBC mod 16777259 = EBC+(16777259D-43D) mod 16777259;D2^24+EBC mod 16777259 = EBC-43D mod 16777259    ld a,e    ld e,43    mlt de;ABC-de    ld h,b \ ld l,c    or a    sbc hl,de    sbc a,0    ret nc    ld bc,43    add hl,bc    ret`
Set Buffer optimized
Code: [Select]
`setBuf:;;Any mode;;Inputs: A is the byte to set each byte in the buffer to;;        HL points to the buffer;;        BC is the buffer size, greater than 1;;8 bytes;;14+3*BC clock cycles    ld (hl),a    dec bc    ld d,h    ld e,l    inc de    ldir    ret`
EDIT 9-Feb-15
HL*A/255 (can be used for division by 3,5,15,17,51, and 85, among others)
This one performs A*HL/255. Be warned that this does not work on certain boundary values. For example, A=85 would imply division by 3, but if you input HL as divisible by 3, you get one less than the actual result. So 9*85/255 should return 3, but this routine returns 2 instead.
Code: [Select]
`fastMul_Ndiv255:;;Expects Z80 mode;;Description: Multiplies HL by A/255;;Inputs: A,HL;;Outputs: HL is the result;;         A is a copy of the upper byte of the result;;         DE is the product of the input A,H;;         B is a copy of E;;         C is a copy of the upper byte of the product of inputs A,L;;32cc;;18 bytes    ld d,a    ld e,h    ld h,d    mlt hl    mlt de;15;DE; DE; HL;  HL    ld c,h    ld b,e    ld a,d    add hl,bc \ adc a,0    add hl,de \ adc a,0    ld l,h \ ld h,a    ret;17`HL*A/255 "fixed"
Modifying this to correct the rounding issue is a pain in terms of speed hit and size hit:
Code: [Select]
`fastMul_Ndiv255_fix:;;Expects Z80 mode;;Description: Multiplies HL by A/255;;Inputs: A,HL;;Outputs: HL is the result;;52cc;;26 bytes    push af    ld d,a    ld e,l    ld l,d    mlt hl    mlt de;HL; HL; DE;  DE    ld c,d    ld b,l    ld a,h    add hl,de \ adc a,0    add hl,bc \ adc a,0    ld d,l \ ld l,h \ ld h,a    pop af  \ add a,e    ret nc    adc a,d    ret nc    inc hl    ret `HL*A/255 Rounded
However, rounding it is at little cost and works well:
Code: [Select]
`fastMul_Ndiv255_round:;;Expects Z80 mode;;Description: Multiplies HL by A/255;;Inputs: A,HL;;Outputs: HL is the result;;37cc;;23 bytes    ld d,a    ld e,l    ld l,d    mlt hl    mlt de;15;HL; HL; DE;  DE    ld c,d    ld b,l    ld a,h    add hl,de \ adc a,0    add hl,bc \ adc a,0    ld l,h \ ld h,a    sla e \ jr nc,\$+3 \ inc hl    ret;22 `sqrt24 optimized
Finally, a routine that expects ADL mode instead of Z80 mode! This is an integer square root routine, but can be used for 8.8 fixed point numbers by copying the fixed point number to HL as 8.16 where the lower 8 bits are zero (or if you had extended precision from a previous calculation, feel free to use that). then the square root is 4.8
Code: [Select]
`sqrt24:;;Expects ADL mode;;Inputs: HL;;Outputs: DE is the integer square root;;         HL is the difference inputHL-DE^2;;         c flag reset    xor a \ ld b,l \ push bc \ ld b,a \ ld d,a \ ld c,a \ ld l,a \ ld e,a;Iteration 1    add hl,hl \ rl c \ add hl,hl \ rl c    sub c \ jr nc,\$+6 \ inc e \ inc e \ cpl \ ld c,a;Iteration 2    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e    sub c \ jr nc,\$+6 \ inc e \ inc e \ cpl \ ld c,a;Iteration 3    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e    sub c \ jr nc,\$+6 \ inc e \ inc e \ cpl \ ld c,a;Iteration 4    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e    sub c \ jr nc,\$+6 \ inc e \ inc e \ cpl \ ld c,a;Iteration 5    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e    sub c \ jr nc,\$+6 \ inc e \ inc e \ cpl \ ld c,a;Iteration 6    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e    sub c \ jr nc,\$+6 \ inc e \ inc e \ cpl \ ld c,a;Iteration 7    add hl,hl \ rl c \ add hl,hl \ rl c \ rl b    ex de,hl \ add hl,hl \ push hl \ sbc hl,bc \ jr nc,\$+8    ld a,h \ cpl \ ld b,a    ld a,l \ cpl \ ld c,a    pop hl    jr nc,\$+4 \ inc hl \ inc hl    ex de,hl;Iteration 8    add hl,hl \ ld l,c \ ld h,b \ adc hl,hl \ adc hl,hl    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl    jr nc,\$+6 \ sbc hl,de \ inc de \ inc de;Iteration 9    pop af    rla \ adc hl,hl \ rla \ adc hl,hl    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl    jr nc,\$+6 \ sbc hl,de \ inc de \ inc de;Iteration 10    rla \ adc hl,hl \ rla \ adc hl,hl    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl    jr nc,\$+6 \ sbc hl,de \ inc de \ inc de;Iteration 11    rla \ adc hl,hl \ rla \ adc hl,hl    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl    jr nc,\$+6 \ sbc hl,de \ inc de \ inc de;Iteration 11    rla \ adc hl,hl \ rla \ adc hl,hl    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl    jr nc,\$+6 \ sbc hl,de \ inc de \ inc de    rr d \ rr e \ ret`It is huge and I believe less than 240cc.
div16 optimized
Code: [Select]
`div16:;;Inputs: DE is the numerator, BC is the divisor;;Outputs: DE is the result;;         A is a copy of E;;         HL is the remainder;;         BC is not changed;140 bytes;145cc    xor a \ sbc hl,hl    ld a,d    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ cpl \ ld d,a    ld a,e    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ adc hl,hl \ sbc hl,bc \ jr nc,\$+3 \ add hl,bc    rla \ cpl \ ld e,a    ret`prng24 optimized
This is a lot faster and smaller than the rand24 routine. I'm also pretty sure this routine wins on unpredictability, too, and it has a huge cycle length.
Code: [Select]
`prng24:;;Expects ADL mode.;;Output: HL;;50cc;;33 bytes;;cycle length: 281,474,959,933,440 (about 2.8 trillion)    ld de,(seed1)    or a    sbc hl,hl    add hl,de    add hl,hl    add hl,hl    inc l    add hl,de    ld (seed1),hl    ld hl,(seed2)    add hl,hl    sbc a,a    and %00011011    xor l    ld l,a    ld (seed2),hl    add hl,de    ret`
Spoiler For template for Zeda because she is lazy:
routine optimized
Code: [Select]
Title: Re: eZ80 Optimized Routines
Post by: Sorunome on January 30, 2015, 08:52:16 am
Well I didn't read it through but that would be sure helpful!
Also, the comments assume z80 mode? So it only works for 16bit multiplication? And the result also only being in 16 bit?
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on January 30, 2015, 09:02:00 am
No, the eZ80 has two modes, ADL mode (24-bit) and Z80 (16-bit). In Z80 mode, certain instructions are one or two cycles faster, so it is beneficial for me to use that mode since I don't use any ADL-specific instructions. As well, add hl,hl sets the c flag on 16-bit overflow in Z80 mode, which I needed, but doesn't set it in ADL mode (it has to overflow 24 bits). Since there is no easy way to access the top bits of HL, it would have to be performed in Z80 mode, making it take even more clock cycles.

The full 32-bit result is returned.
Title: Re: eZ80 Optimized Routines
Post by: Sorunome on January 30, 2015, 09:03:44 am
*24-bit result is returned :P

Also, does that mean the factors can only be 16 bit?
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on January 30, 2015, 09:11:02 am
As it is only a 16-bit by 16-bit multiplication, yes, only up to 16-bit factors :P

And the upper 16 bits of the result are in DE, lower 16 in BC.
Title: Re: eZ80 Optimized Routines
Post by: Sorunome on February 05, 2015, 02:12:41 pm
Ui, so many new routines! I especially like your template!
Also, I didn't read them through but the titles are promising :)

EDIT: what is gcd?
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on February 05, 2015, 02:17:37 pm
GCD is the Greatest Common Divisor. So GCD(15,27) is 3 since 3 is the largest number that divides both 27 and 15. It isn't used often, bt if somebody wanted to make a 16-bit rational library *cough*I should*cough* it is extremely useful.
Title: Re: eZ80 Optimized Routines
Post by: Sorunome on February 05, 2015, 02:54:35 pm
Oh, thanks!
And yay more library stuff by xeda :P
Title: Re: eZ80 Optimized Routines
Post by: NanoWar on February 05, 2015, 05:34:59 pm
We need a math lib for z80. Do it.
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on February 06, 2015, 08:35:51 am
For the Z80, I made floating point routines for 80-bit floats (64-bit precision) and 24-bit floats (16-bit precision). I just started working on a single-precision floating point library for the eZ80 and Z80 that is conforming to IEEE-754 standards and it is going alright.

I've written a bunch of fixed-point and integer routines, too.
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on February 09, 2015, 10:33:25 am
So far today I have added
hl*a/255 which can be useful for things like dividing by anything divisible by 255 (like 3,5,15,17,51,85) as well as other values!
sqrt2424-bit integer square root, which can be useful for 8.8 fixed point square root (as 12 bits of precision are needed).
div16 is the 16-bit division. It is 145cc versus the 56cc worst case for mul16. The eZ80 will have a big advantage with multiplication over division.
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on March 01, 2015, 10:27:22 pm
For those who don't care to check for timings and stuff, there was a trick for the Z80 to speed up copying data in RAM faster than LDIR. THe premise was to unroll the loop as LDI \ LDI \ ... \ LDI \ jp pe,ldi_loop
LDI is 16cc, the jump is 10cc, and an LDIR is 21cc for each byte, minus 5 (the last copied byte is 16cc). So unrolled to 4 LDI instructions saves 10cc for every 4 bytes, unrolled 8, 12, or 16 times saves  30, 50, or 70 cc for each chunk.

For the eZ80, this trick does not work. LDIR takes 3cc for each byte copied, plus 2cc for the last one, whereas LDI takes 5cc. This makes LDIR asymptotically faster than any unrolled LDI loop by 40% and the very worst (less than 1 in a ten million opportunities) is exactly the same speed.

TL;DR: On the eZ80, always use LDIR to copy chunks of data.
Title: Re: eZ80 Optimized Routines
Post by: chickendude on March 05, 2015, 07:21:40 am
That "one in ten million" is when bc = 1? How does pipelining work for an instruction like ldir (or any of the other repeat instructions)?
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on March 18, 2015, 10:53:06 pm
That "one in ten million" is when bc = 1? How does pipelining work for an instruction like ldir (or any of the other repeat instructions)?
Yup, it is when BC=1. Pipelining works for ldir by reading the first byte of the instruction (1cc) then the next (1cc), then I am assuming each iteration does the RAM copy (2cc to read byte, then write), increment/decrement stuff (1cc). I assume looping comes at no additional cost.
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on October 15, 2018, 10:45:21 pm
Here is a fast and accurate arctan routine. It only works on the range [0,1), but it is easy enough to extend the range to any input. It's only good to approximately 8 bits, though.

Input is in E, output is H=256*atan(E/256). For example, E=177 returns H=154. In reality, 256*atan(177/256)=154.8633...
Code: [Select]
`atan8:;returns H=256*arctan(E/256);48cc if ADL mode;takes 164cc to call this on the TI-84+CE  ld c,201  ld b,e  mlt bc    ;x*201  xor a  sub e  ld d,a  mlt de    ;x(256-x)  ld l,e  ld h,70  ld e,h  mlt de  ;upper bytes  mlt hl  ;lower bytes  ld a,e  add a,h  ld l,a  ld h,d  jr nc,\$+3  inc h  add hl,bc  ret`EDIT: Added some screen shots comparing the functions and showing the error. It looks like the result deviates as far as 1.5/256 from the actual.
Title: Re: eZ80 Optimized Routines
Post by: jcochran on October 11, 2021, 12:28:05 pm
The mult16 routine from Xeda112358 has a bug, and can be made one byte shorter.

To demonstrate bug, multiply 0ffffh by 0ffffh.
Correct result = 0fffe0001
Incorrect result = 0feff0001

New routine.
Code: [Select]
`mul16:;;Expects Z80 mode;;Inputs: hl,bc;;Outputs: Upper 16 bits in DE, lower 16 bits in BC;;53 or 55 t-states;;29 bytes     ld d,c    ld e,l    mlt de    push de     ; Push low product onto stack.    ld d,h      ; Prepare for high product    ld e,b    ld h,e      ; Swap bytes for 2 middle products. Do here when I have copies of high values    ld b,d      ; instead of later, needing A as temp.    mlt de      ; Calculate high product    mlt hl      ; Calculate first middle product    mlt bc      ; Calculate second middle product    add hl,bc   ; Sum middle products    jr nc,\$+3    inc d       ; Inc D if carry.  (The INC DE was a bug).    pop bc      ; Retrieve low product.    ld a,b      ; Add in middle products to create final answer.    add a,l    ld b,a    ld a,e    adc a,h    ld e,a    ret nc    inc d    ret`
Title: Re: eZ80 Optimized Routines
Post by: Xeda112358 on October 11, 2021, 12:32:17 pm
Oh, fantastic! Good catch!
Title: Re: eZ80 Optimized Routines
Post by: jcochran on October 12, 2021, 04:18:27 pm
If you don't insist on the current register usage, you can get a bit shorter and faster.
Code: [Select]
`; HLDE = HL * DE; 52 cc, 28 bytesMult16: ld b,h ld c,d mlt bc push bc ld b,l ld c,e ld l,c ld e,b mlt bc mlt de mlt hl xor a add hl,de ld e,h ld h,l ld l,a adc a,a ld d,a add hl,bc ex de,hl pop bc adc hl,bc ret`
Unfortunately, I don't see any way to make it shorter or faster.
Title: Re: eZ80 Optimized Routines
Post by: jcochran on October 27, 2021, 10:36:07 am
Was wondering how "optimized" the multiply 32 routine was using the karatsuba algorithm vs a straight forward multiply. And the answer is the straight forward multiply is significantly faster.

Following routine was written in several stages.
1. Write straight forward multiple, doing the 16 sub products from least significant to most significant, using AHL as a running sum.
2. For 3 of the additions, it's impossible for a carry to propagate from the 16 bit add. This allowed for the adc a,a to be deleted, resulting in 3 bytes and cycles saved.
3. Since the B and C registers weren't used, decided to use them to cache a few values from the numbers being multiplied. This saves 20 bytes and 32 cycles.
4. Finally, a minor modification was made for the 30 product. Saves 1 byte.

Code: [Select]
`; (IX+8) = (IX+0)*(IX+4);; Register usage; AHL = running sum for result.;       Periodically, L is saved as next byte and sum shifted right by 1 byte.; DE = Next sub product to be added; B  = Cached value of (ix+0) and later (ix+7); C  = Cached value of (ix+4) and later (ix+3); IX = pointer to values being multiplied and result.;; 178 bytes; 275 clock cyclesMult32: xor a; Calc byte 0 ld b,(ix+0) ld h,b ; Prod 00 ld c,(ix+4) ld l,c mlt hl ld (ix+8),l ld l,h ld h,a; Calc byte 1 ld d,b ; Prod 01 ld e,(ix+5) mlt de add hl,de ; No carry possible ld d,(ix+1) ; Prod 10 ld e,c mlt de add hl,de adc a,a ld (ix+9),l ld l,h ld h,a xor a; Calc byte 2 ld d,b ; Prod 02 ld e,(ix+6) mlt de add hl,de ; No carry possible ld d,(ix+2) ; Prod 20 ld e,c mlt de add hl,de adc a,a ld d,(ix+1) ; Prod 11 ld e,(ix+5) mlt de add hl,de adc a,0 ld (ix+10),l ld l,h ld h,a xor a; Calc byte 3 ld d,b ; Prod 03 (Last use of cached (ix+0) in b) ld b,(ix+7) ; Cache (ix+7) ld e,b mlt de add hl,de adc a,a ld de,(ix+3) ; Prod 30. Minor optimization that saves a byte ld c,e ; Cache (ix+3) mlt de add hl,de adc a,0 ld d,(ix+1) ; Prod 12 ld e,(ix+6) mlt de add hl,de adc a,0 ld d,(ix+2) ; Prod 21 ld e,(ix+5) mlt de add hl,de adc a,0 ld (ix+11),l ld l,h ld h,a xor a; Calc byte 4 ld d,c ; Prod 31 ld e,(ix+5) mlt de add hl,de adc a,a ld d,(ix+1) ; Prod 13 ld e,b mlt de add hl,de adc a,0 ld d,(ix+2) ; Prod 22 ld e,(ix+6) mlt de add hl,de adc a,0 ld (ix+12),l ld l,h ld h,a xor a; Calc byte 5 ld d,(ix+2) ; Prod 23 ld e,b mlt de add hl,de adc a,a ld d,c ; Prod 32 ld e,(ix+6) mlt de add hl,de adc a,0 ld (ix+13),l ld l,h ld h,a; Calc bytes 6 & 7 mlt bc ; Prod 33 add hl,bc ; No carry possible ld (ix+14),hl ret`