eZ80 Optimized Routines

Xeda112358:
So far today I have added:
hl*a/255, which can be useful for things like dividing by any divisor of 255 (like 3, 5, 15, 17, 51, 85) as well as other values!
sqrt24, a 24-bit integer square root, which can be useful for an 8.8 fixed-point square root (as 12 bits of precision are needed).
div16, the 16-bit division. It is 145cc versus the 56cc worst case for mul16, so on the eZ80 multiplication has a big advantage over division.
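To make the first two uses concrete, here is a small Python model of the arithmetic (not the eZ80 routines themselves). The function names are made up for illustration, and it assumes the routines truncate their results, which the post does not state. Dividing by 3 works because 85/255 is exactly 1/3, and an 8.8 square root works because sqrt(X/256)*256 = sqrt(X*256), so a 16-bit input becomes a 24-bit radicand with a 12-bit result.

```python
import math

def hl_times_a_over_255(hl, a):
    # Model of hl*a/255, assuming a truncating (floor) result.
    return (hl * a) // 255

def div3(hl):
    # 85/255 == 1/3, so scaling by 85/255 divides by 3.
    return hl_times_a_over_255(hl, 85)

def sqrt_8_8(x):
    # x is the raw 16-bit value of an 8.8 fixed-point number.
    # Shifting left by 8 gives a 24-bit radicand; the root fits in 12 bits.
    return math.isqrt(x << 8)

print(div3(300))               # 300 / 3
print(hex(sqrt_8_8(0x0400)))   # sqrt(4.0) in 8.8 format
```

The same pattern gives division by 5, 15, 17, 51, or 85 by passing a = 51, 17, 15, 5, or 3.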

Xeda112358:
For those who don't care to check timings and stuff: on the Z80 there was a trick to copy data in RAM faster than LDIR. The premise was to unroll the loop as LDI \ LDI \ ... \ LDI \ jp pe,ldi_loop
LDI is 16cc, the jump is 10cc, and LDIR is 21cc for each byte, minus 5 (the last copied byte is 16cc). So unrolling to 4 LDI instructions saves 10cc for every 4 bytes, and unrolling 8, 12, or 16 times saves 30, 50, or 70cc for each chunk.
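The savings per chunk can be checked with a few lines of Python. This is only a cycle-count model using the Z80 figures quoted above (the function names are invented here), and it ignores the one-time 5cc discount on the last byte of an LDIR.

```python
def z80_ldir_chunk(n):
    # LDIR costs 21cc per byte while it keeps repeating.
    return 21 * n

def z80_unrolled_chunk(n):
    # n LDI instructions at 16cc each, plus one 10cc jp pe per chunk.
    return 16 * n + 10

for n in (4, 8, 12, 16):
    saved = z80_ldir_chunk(n) - z80_unrolled_chunk(n)
    print(n, "x LDI saves", saved, "cc per chunk")
```

In this model the saving is 5n - 10 cycles per chunk, so more unrolling always helps on the Z80, at the cost of code size.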

For the eZ80, this trick does not work. LDIR takes 3cc for each byte copied plus a fixed 2cc, whereas LDI takes 5cc per byte. This makes LDIR asymptotically 40% cheaper than any unrolled LDI loop (3cc versus 5cc per byte), and in the very worst case (fewer than 1 in ten million possible byte counts) it is exactly the same speed.
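The same kind of model for the eZ80, again as a rough Python sketch with invented function names. It uses only the figures quoted in this post (2cc plus 3cc per byte for LDIR, 5cc per LDI) and charges nothing for the loop jump in the unrolled version, which is the most favorable assumption possible for LDI.

```python
def ez80_ldir(bc):
    # 2cc fixed cost plus 3cc per byte copied.
    return 2 + 3 * bc

def ez80_ldi_unrolled(bc):
    # 5cc per LDI; the loop jump is deliberately left out (assumption).
    return 5 * bc

for bc in (1, 2, 16, 256, 65536):
    ldir, ldi = ez80_ldir(bc), ez80_ldi_unrolled(bc)
    print(bc, ldir, ldi, round(ldir / ldi, 3))
```

The two only tie at BC=1. For every larger count LDIR is ahead, and the ratio approaches 3/5 as BC grows.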

TL;DR: On the eZ80, always use LDIR to copy chunks of data.

chickendude:
That "one in ten million" is when bc = 1? How does pipelining work for an instruction like ldir (or any of the other repeat instructions)?

Xeda112358:

--- Quote from: chickendude on March 05, 2015, 07:21:40 am ---That "one in ten million" is when bc = 1? How does pipelining work for an instruction like ldir (or any of the other repeat instructions)?

--- End quote ---
Yup, it is when BC=1. Pipelining works for ldir by reading the first byte of the instruction (1cc), then the next (1cc). After that, I am assuming each iteration does the RAM copy (2cc: read the byte, then write it) and the increment/decrement stuff (1cc). I assume looping comes at no additional cost.

Xeda112358:
Here is a fast and accurate arctan routine. It only works on the range [0,1), but it is easy enough to extend the range to any input. It's only good to approximately 8 bits, though.

Input is in E, output is H=256*atan(E/256). For example, E=177 returns H=154. In reality, 256*atan(177/256)=154.8633...

--- Code: ---atan8:
;returns H=256*arctan(E/256)
;48cc if ADL mode
;takes 164cc to call this on the TI-84+CE
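  ld c,201
  ld b,e
  mlt bc    ;BC = 201*x
  xor a
  sub e     ;A = 256-x (mod 256)
  ld d,a
  mlt de    ;DE = x*(256-x)
  ld l,e    ;L = low byte of x*(256-x)
  ld h,70
  ld e,h    ;E = 70
  mlt de    ;DE = 70 * upper byte
  mlt hl    ;HL = 70 * lower byte
  ld a,e
  add a,h   ;add the overlapping middle bytes
  ld l,a
  ld h,d
  jr nc,$+3 ;skip the INC if the add did not carry
  inc h     ;propagate the carry
  add hl,bc ;HL = 201*x + 70*x*(256-x)/256
  ret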

--- End code ---
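For anyone who wants to check the routine without a calculator, here is a Python model that follows the listing above instruction by instruction. It is a sketch of the 16-bit arithmetic only and assumes the usual MLT behavior (8-bit by 8-bit product into the register pair). In effect the routine computes 201*x + 70*x*(256-x)/256 and returns the high byte. Tracing the listing exactly as printed, this model gives H=153 for E=177, one below the 154 quoted above, so either the model or the posted figure may be off by a rounding step.

```python
import math

def atan8(e):
    """Model of the atan8 listing: input E in 0..255, returns H."""
    bc = e * 201                      # mlt bc      -> 201*x
    a = (-e) & 0xFF                   # xor a / sub e
    de = a * e                        # mlt de      -> x*(256-x)
    low, high = de & 0xFF, de >> 8
    de = high * 70                    # mlt de      -> 70 * upper byte
    hl = low * 70                     # mlt hl      -> 70 * lower byte
    mid = (de & 0xFF) + (hl >> 8)     # ld a,e / add a,h
    h = ((de >> 8) + (mid >> 8)) & 0xFF   # ld h,d / inc h on carry
    hl = (h << 8) | (mid & 0xFF)
    hl = (hl + bc) & 0xFFFF           # add hl,bc (low 16 bits)
    return hl >> 8

worst = max(abs(atan8(e) - 256 * math.atan(e / 256)) for e in range(256))
print(atan8(177), round(256 * math.atan(177 / 256), 4), round(worst, 3))
```

The last line prints the modeled result for E=177, the true value, and the largest deviation over all 256 inputs, which makes it easy to compare against the error plots.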
EDIT: Added some screenshots comparing the functions and showing the error. It looks like the result deviates as far as 1.5/256 from the actual value.
