Author Topic: [z80] Floating Point Routines (Read 52426 times)

Streetwalrus · « **Reply #30 on:** April 12, 2014, 05:48:16 pm »

No, I think it goes beyond 16 bits, and if it uses division then the decimal part is needed for accuracy. Gonna check that later and tell you more.

TIfanx1999 · « **Reply #31 on:** April 12, 2014, 07:17:54 pm »

Quote from: Xeda112358 on April 12, 2014, 05:42:20 pm

Hmm, isn't 16-bit math sufficient, then? Also, in good news, I actually shaved off close to 18000 more clock cycles from the square root routine, putting it at a little over 3 times faster than TI's. I am back to working on the exponential and logarithm routine, but they are table based (using a single 64-element LUT with 9 bytes to each element). From this I will build the Float->Str routine.

Awesome Xeda!

chickendude · « **Reply #32 on:** April 26, 2014, 04:59:12 pm »

Xeda, i was just looking through the 24-bit division routine and saw this line:

Code:

    or a \ sbc hl,de \ jr c,$+7 \ set 7,b \ jp $+4 \ add hl,de \ srl d \ rr e

I was just wondering whether, instead of all those "jp $+4"s, you had tried using "set 7,b \ .db $38 ;jr c,... \ add hl,de \ ; ...". You might be able to save two bytes and 3 t-states (x 15 repetitions). Since the carry will never be set there, it'll just skip the add hl,de, which it will read as the operand of the jr. When the condition is false, jr is actually faster (and, of course, smaller) than a jp.

calc84maniac · « **Reply #33 on:** April 26, 2014, 05:01:41 pm »

Quote from: chickendude on April 26, 2014, 04:59:12 pm

Xeda, i was just looking through the 24-bit division routine and saw this line:
Code:

    or a \ sbc hl,de \ jr c,$+7 \ set 7,b \ jp $+4 \ add hl,de \ srl d \ rr e

I was just wondering if instead of all those "jp $+4"s, if you tried using "set 7,b \ .db $56 ;jr c,... \ add hl,de \ ; ..." you might be able to save a byte and 3 t-states (x 15 repetitions). Since the carry will never be set there, it'll just skip the add hl,de which it will read as part of the jr. When the condition is false, jr is actually faster (and, of course, smaller) than a jp.

That sounds like a good idea. However, remember that jr c is $38 (56 in decimal) and not $56.

Also, that actually saves two bytes, so even better.
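For anyone who wants to double-check those counts, here is a small tally of the "no borrow" path in both versions. It is only a sketch: the sizes and timings are the standard documented Z80 figures typed in by hand, and every name in it is made up for illustration.

```python
# (bytes, T-states) per instruction; standard Z80 figures, entered by hand.
SET_7_B = (2, 8)
JP_NN = (3, 10)
JR_C_SKIP = (1, 7)    # only the $38 opcode byte; 7 T-states when not taken
ADD_HL_DE = (1, 11)   # opcode $19

def tally(layout, executed):
    """Return (bytes occupied by layout, T-states spent on executed)."""
    size = sum(nbytes for nbytes, _ in layout)
    cycles = sum(tstates for _, tstates in executed)
    return size, cycles

# Original: set 7,b \ jp $+4 \ add hl,de   (the jp hops over the add)
orig = tally([SET_7_B, JP_NN, ADD_HL_DE], [SET_7_B, JP_NN])

# Proposed: set 7,b \ .db $38 \ add hl,de  ($19 is swallowed as the jr offset)
new = tally([SET_7_B, JR_C_SKIP, ADD_HL_DE], [SET_7_B, JR_C_SKIP])

saved_bytes = orig[0] - new[0]
saved_cycles = orig[1] - new[1]
print(saved_bytes, saved_cycles)
```

One thing to watch if the change is applied: the earlier "jr c,$+7" would have to become "jr c,$+5", since the add hl,de it targets ends up two bytes closer.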

chickendude · « **Reply #34 on:** April 26, 2014, 07:49:07 pm »

Oops, i was looking at the decimal version

Thanks for catching that (and that it actually saves two bytes), though i'm pretty sure it wouldn't have been an issue for Xeda

Xeda112358 · « **Reply #35 on:** April 30, 2014, 09:24:13 am »

Oh, that's a really good idea! I actually have to rewrite that division, though (it doesn't return accurate results because it isn't keeping track of all of the remainder term). But even so, I wonder how many more routines can use a similar optimization? Excellent!

Xeda112358 · « **Reply #36 on:** November 24, 2015, 11:49:48 am »

Necro update:

I have some single-precision floating point routines that seem to be working. They have been tested, though not rigorously. The routines currently available behave as expected, including being able to compute and display ##\log_{2} \left(\pi^{-32}\right)## to 7 digits of accuracy. The single->string routine is not as complete as it could be, but it is working.

The calling syntax is consistent: HL points to the first argument, DE to a second argument if needed, and BC to where the result is output. Only cmpSingle is different, because it does not store an output. Instead, it returns the result of the comparison in the flags register. With that exception, no float routines modify the main or shadow registers.

Now for an example of use on a TI-OS, let's compute and display ##\log_{2} \left(\pi^{-32}\right)##:

Code:

    ld hl,const_pi  ;pi is the first arg
    ld d,h \ ld e,l ;pi is also the second
    ld bc,scrap
    call mulSingle  ;pi*pi = pi^2
    ld h,b \ ld l,c ;Gonna square the result
    ld d,b \ ld e,c ;BC points to the result of the previous multiply, now HL and DE do, too.
    call mulSingle  ;= pi^4
    call mulSingle  ;= pi^8
    call mulSingle  ;= pi^16
    call mulSingle  ;= pi^32
    call invSingle  ;= 1/pi^32 = pi^-32
    call lgSingle   ;= lg(pi^-32)
    call single2string
    bcall(_PutS)
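As a sanity check on what the calculator should print, the same sequence of steps can be mirrored on a PC. This is just a sketch in Python doubles rather than the single-precision format, so it gives a reference value and says nothing about how the routines work internally. The function name is made up. The closed form is -32·lg(π), which is about -52.84788.

```python
import math

def lg_pi_to_minus_32() -> float:
    """Mirror the call sequence above using ordinary double arithmetic."""
    x = math.pi
    x = x * x                # mulSingle: pi^2
    for _ in range(4):       # four more squarings: pi^4, pi^8, pi^16, pi^32
        x = x * x
    x = 1.0 / x              # invSingle: pi^-32
    return math.log2(x)      # lgSingle

result = lg_pi_to_minus_32()
print(f"{result:.7g}")       # 7 significant digits, like the quoted accuracy
```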

From the readme, included routines are:

Code:

absSingle
  func: |x| -> z
  mem:  None
addSingle
  func: x+y -> z
  mem:  6 bytes
  Note: special cases not done
subSingle
  func: x-y -> z
  mem:  10 bytes
  Note: special cases not done
rsubSingle
  func: -x+y -> z
  mem:  10 bytes
  Note: special cases not done
invSingle
  func: 1/x -> z
  mem:  5 bytes
divSingle
  func: x/y -> z
  mem:  5 bytes
cmpSingle
  func: compare x to y, no output
        return z flag if x=y (error is up to the last 2 bits)
        return c flag if x<y
        return nc if x>=y
  mem:  None
single2string
  func: string(x) -> z
  mem:  44 bytes
mulSingle
  func: x*y -> z
  mem:  6 bytes
negSingle
  func: -x -> z
  mem:  None

Xeda112358 · « **Reply #37 on:** November 25, 2015, 01:33:01 pm »

I haven't updated the public download, but I finished:
log₁₀(x)
ln(x)
log_y(x)
2^x
e^x
y^x
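For readers wondering how six routines can hang off one base-2 log and one base-2 exponential, the standard change-of-base identities are enough. The sketch below only illustrates those identities in Python doubles. It is an assumption that the routines are layered this way, and all of the names are hypothetical.

```python
import math

LN_2 = math.log(2.0)         # ln(2)
LOG10_2 = math.log10(2.0)    # log10(2)
LG_E = 1.0 / LN_2            # lg(e)

def lg(x: float) -> float:       # stand-in for a base-2 log routine
    return math.log2(x)

def pow2(x: float) -> float:     # stand-in for a 2^x routine
    return 2.0 ** x

def log10_via_lg(x): return lg(x) * LOG10_2
def ln_via_lg(x):    return lg(x) * LN_2
def log_y(y, x):     return lg(x) / lg(y)       # log base y of x
def exp_via_pow2(x): return pow2(x * LG_E)      # e^x
def pow_via_pow2(y, x): return pow2(x * lg(y))  # y^x, for y > 0

print(log10_via_lg(1000.0), log_y(3.0, 81.0), pow_via_pow2(3.0, 4.0))
```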

And I'm working on making a much more general and better (asymptotically faster, less RAM used) routine for converting floats to base 10 strings (and TI floats).

Short Term Plan:
After the better float->string, I plan to implement a string->single and tifloat->single. These should be fairly easy to implement (I have done it several times before with success).
After that, a very simple math parser and I/O. This will be difficult-ish (6 on a scale of 1 to 10).

Longterm Schedule:
Update the exponential and log routines with an algorithm that handles complex numbers. Medium difficulty using the existing framework. This will supply most of the trig routines, too.
Add complex support to the math parser.
Extend these routines to the 80-bit floats.
Go nuts because now I can pump out tons of routines since I have the building blocks for most of math.

Xeda112358 · « **Reply #38 on:** November 26, 2015, 12:35:06 pm »

I have not added extra-precision calculation, but I really need to. The loss of accuracy builds up in the bottom bits, meaning I lose about 1 decimal digit of accuracy (so we only have 6 digits of accuracy). TI uses 4 extra digits of precision in their intermediate calculations, which is why they manage to keep much of the error out.
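The effect is easy to reproduce off-calc. The sketch below uses IEEE singles as a stand-in (an assumption, since the routines here use their own single format) and compares π^32 by repeated squaring when every product is rounded back to single against the same chain carried in extra precision and rounded once at the end. The function names are made up.

```python
import math
import struct
from fractions import Fraction

def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def pow32_rounding_each_step(x: float) -> float:
    y = to_single(x)
    for _ in range(5):           # five squarings give x^32
        y = to_single(y * y)     # round back to single after every multiply
    return y

def pow32_extended(x: float) -> float:
    y = to_single(x)
    for _ in range(5):
        y = y * y                # keep the extra bits of a double
    return to_single(y)          # round once, at the very end

def rel_error(value: float, exact: Fraction) -> float:
    return float(abs(Fraction(value) - exact) / exact)

# Exact 32nd power of the single-precision input, as a rational number.
exact = Fraction(to_single(math.pi)) ** 32
err_single = rel_error(pow32_rounding_each_step(math.pi), exact)
err_extended = rel_error(pow32_extended(math.pi), exact)
print(err_single, err_extended)
```

The extended chain lands within half a unit in the last place, while the per-step chain can drift further because each squaring doubles the relative error carried in from the previous step.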

Anyways, I added a randSingle function, hyperbolic functions (sinh, cosh, tanh), and a faked sqrt (using logs and exponents, until I find or rewrite the lost code). I also overhauled the single2str function as planned, and I made a few other routines including a single->TI Float routine.

I've updated the public download, but I'll also upload it to this post.
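To illustrate what "faked" means here: with only exp and log available, a square root and the hyperbolic functions fall out of textbook identities. This is a sketch in Python doubles with hypothetical names, and it is not taken from the actual source.

```python
import math

def fake_sqrt(x: float) -> float:
    """sqrt(x) = 2^(lg(x)/2); zero is special-cased because lg(0) is undefined."""
    if x < 0.0:
        raise ValueError("negative input")
    if x == 0.0:
        return 0.0
    return 2.0 ** (math.log2(x) / 2.0)

def sinh_via_exp(x: float) -> float:
    e = math.exp(x)
    return (e - 1.0 / e) / 2.0

def cosh_via_exp(x: float) -> float:
    e = math.exp(x)
    return (e + 1.0 / e) / 2.0

def tanh_via_exp(x: float) -> float:
    e2 = math.exp(2.0 * x)       # fine for moderate x; overflows for large x
    return (e2 - 1.0) / (e2 + 1.0)

print(fake_sqrt(2.0), sinh_via_exp(1.0), cosh_via_exp(1.0), tanh_via_exp(1.0))
```

A sqrt built this way inherits the error of both the log and the exponential, which is one reason a dedicated routine is worth recovering.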

Have a screenshot!

Xeda112358 · « **Reply #39 on:** December 05, 2015, 06:55:20 pm »

Bugs fixed:
Special Formats
single2str

New and Updated Routines:
str2Single
geomeanSingle

New Bugs:
str2single

joijoi · « **Reply #40 on:** December 24, 2015, 01:05:39 am »

Hello,
Are you still working on the 80-bit version of these routines?
Thanks in advance for any answer that you could give.

Xeda112358 · « **Reply #41 on:** December 28, 2015, 12:13:44 am »

At the moment, I am not working on the 80-bit versions. However, I am not finished working with them. Life is busy again, so it will probably be a while yet before I work on floating point routines again.

Xeda112358 · « **Reply #42 on:** December 10, 2018, 11:45:55 pm »

As an update and backup to Omni, I have been working on some of the extended precision routines. At this moment, I've rewritten multiplication and square roots, and I am about to work on division.

The square root routine is now averaging 9183.443cc, a full three times faster than the previous implementation (and over 9 times faster than the TI-OS float algorithm)!

The multiplication routine is now averaging 9925.527cc, a more humble 8.5% improvement (still ~3.5 times faster than the TI-OS float multiplication). A good chunk of the speed improvement comes from a slightly faster 16-bit multiply routine from Runer112, which also has much nicer/more useful output registers.

I had an issue with the new square root algorithm, so for now I have patched it up so that it works, but it's about 1000cc slower than necessary. In the process of implementing this new algorithm, I rewrote some faster division routines and I might be able to get division down below 13000cc, a 30+% speed improvement (3 times faster than the OS routine).

Xeda112358 · « **Reply #43 on:** December 12, 2018, 10:37:30 pm »

Okay, with a lot of discussion with Runer112 over IRC/Discord, and a lot of coding, here is an update!
First, it seems like there might be an accuracy issue with the lower bits of xdiv, but that could just be an issue from truncating the input.

xsqrt is averaging ~9183.443cc
xmul is averaging ~9925.527cc
xdiv is averaging ~11107.370cc

Compared to the old set of routines from a few years ago, division is now almost exactly 40% faster, multiplication is still about 8.5% faster, and square roots are three times faster.

Compared to the OS routines, multiply is about 3.59 times faster, divide is about 3.65 times faster, and square root is about 9.45 times faster.
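The ratios above can be turned back into rough cycle counts for the OS and for the old routines. The sketch below only rearranges the figures quoted in this post. It assumes "N% faster" means N% fewer cycles and "three times faster" means one third of the cycles, and the variable names are made up.

```python
# Average cycle counts quoted above for the new routines.
new_cc = {"xsqrt": 9183.443, "xmul": 9925.527, "xdiv": 11107.370}

# "Times faster than the OS" figures quoted above.
os_speedup = {"xsqrt": 9.45, "xmul": 3.59, "xdiv": 3.65}

# Implied OS cost = new cost * speedup factor.
implied_os_cc = {name: new_cc[name] * os_speedup[name] for name in new_cc}

# Implied cost of the old routines, under the assumptions stated above.
implied_old_cc = {
    "xsqrt": new_cc["xsqrt"] * 3,
    "xmul": new_cc["xmul"] / (1 - 0.085),
    "xdiv": new_cc["xdiv"] / (1 - 0.40),
}

for name in new_cc:
    print(f"{name}: OS ~{implied_os_cc[name]:.0f}cc, old ~{implied_old_cc[name]:.0f}cc")
```

Under those assumptions the old division comes out near 18500cc, which agrees with the earlier estimate that 13000cc would have been a 30+% improvement.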

Pretty much the one thing the OS has better is converting floats to strings, which the OS, using BCD floats, is wayyyyy faster for. I project that converting to a string could take on average 50000cc with my method, versus maybe 600cc for a TI float (assuming no special formatting).
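The reason BCD wins so easily is that every mantissa nibble already is a decimal digit, so conversion is mostly unpacking. Here is a sketch in Python, assuming the usual TI-83+/84+ real layout (one type/sign byte, one exponent byte biased by $80, seven BCD mantissa bytes). The function name is made up, and the output format is simplified to plain scientific notation.

```python
def ti_float_to_str(raw: bytes) -> str:
    """Unpack a 9-byte TI real into a scientific-notation string."""
    if len(raw) != 9:
        raise ValueError("expected 9 bytes")
    sign = "-" if raw[0] & 0x80 else ""          # bit 7 of the first byte is the sign
    exponent = raw[1] - 0x80                     # exponent is stored with a $80 bias
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[2:])  # 14 BCD digits
    mantissa = (digits[0] + "." + digits[1:].rstrip("0")).rstrip(".")
    return f"{sign}{mantissa}e{exponent}"

pi_bytes = bytes([0x00, 0x80, 0x31, 0x41, 0x59, 0x26, 0x53, 0x58, 0x98])
print(ti_float_to_str(pi_bytes))
```

A binary float has no such shortcut: each decimal digit has to be extracted with multiplications or divisions by powers of ten, which is where the tens of thousands of cycles go.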

EDIT: I'm still not good with GitHub, so I'm going to try to get this to work, but I make no promises. z80float

Xeda112358 · « **Reply #44 on:** December 15, 2018, 11:52:40 pm »

So I added in addition/subtraction and float->str. I haven't calculated timing for the add/sub or float->str, but about 1500cc and 90000cc, respectively, seem like reasonable guesses. The conversion introduces error in the last digits, so I should only return about 16 digits. At the moment, the only formatting that truncates to 16 digits max is when the exponent is too high in magnitude (ex. 1.234567890123456e-9).

Anyways, you can git the source on GitHub, but here are some ugly screenshots attached with evident rounding issues.

EDIT: It turns out I computed the 10^-(2^k) table with lower precision. Now that it is fixed, numbers are being displayed with a bit better precision. I am currently working on implementing the B-G algorithm.
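For context on why that table matters, one common way to use a 10^-(2^k) table is to scale a number into [1, 10) with at most one multiply per table entry while accumulating the decimal exponent. The sketch below shows that idea in Python doubles. It is an assumption that the routine works along these lines, and the function names and table size are made up.

```python
def build_table(k_max: int = 5):
    """10^-(2^k) for k = 0..k_max; on-calc these would be stored constants."""
    return [10.0 ** -(2 ** k) for k in range(k_max + 1)]

def normalize(x: float, table):
    """Scale 1 <= x < 10^(2^len(table)) into [1, 10); return (mantissa, exponent)."""
    if x < 1.0:
        raise ValueError("this sketch only handles x >= 1")
    exp10 = 0
    for k in reversed(range(len(table))):   # largest power first
        scaled = x * table[k]
        if scaled >= 1.0:                   # this power of ten fits, so keep it
            x = scaled
            exp10 += 2 ** k
    return x, exp10

mantissa, exponent = normalize(12345.678, build_table())
print(mantissa, exponent)
```

Every digit printed afterwards comes from the scaled mantissa, so any error in the table entries goes straight into the displayed digits.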
