Omnimaga

Calculator Community => TI Calculators => ASM => Topic started by: Xeda112358 on August 13, 2013, 12:09:18 pm

Title: [z80] Floating Point Routines
Post by: Xeda112358 on August 13, 2013, 12:09:18 pm
EDIT: I'm still not good with GitHub, so I'm going to try to get this to work, but I make no promises. z80float (https://github.com/Zeda/z80float)


Hopefully, this can evolve into a collection of floating point routines, with this first post edited to keep them all in one easy place. I am working on some floating point routines for use in things like Grammer, and I hope they will some day make their way into a community-made OS, so I hope people find them useful (and feel free to use them). I am still working on these routines and trying to optimise them, too.

edit: I removed the code because it was too much. Instead, I have made a public folder available here (https://drive.google.com/folderview?id=0B4HNIXQZLWM8Z3NQMGJTbHVhRm8&usp=sharing) to keep these organised.

Tari (http://www.cemetech.net/forum/profile.php?mode=viewprofile&u=433), from Cemetech posted a link to some routines received from Scott Dial (http://detachedsolutions.com/aboutus/) for BCD format (the same as the TI-OS). They can be found here (http://pastebin.com/Myk7xnSN).

Examples of use might be:
Code: [Select]
   ld hl,FP1
   ld de,FP2
   call FloatDiv
   <<code>>
FP1:
  .db 0,2,  $c9,$0f,$da,$a2,$21,$68,$c2,$34    ;pi, not rounded up
FP2:
  .db 0,2,  $ad,$f8,$54,$58,$a2,$bb,$4a,$9A    ;e, not rounded up
The 80-bit division and multiplication routines do not handle overflow yet. On the to-do list:




From these, routines such as those dealing with probability, statistics, and the like should be easier to make. The advantage of using BCD is for displaying numbers or reading them during parsing. With the 80-bit format, most non-integers don't have an exact representation (like .1), and in order to parse a base-10 string of digits, you would have to do some very time-consuming work (probably a hundred times slower than with BCD). However, for compiled languages and where text output isn't necessary, 80-bit floats offer a much wider range of numbers and greater accuracy, all at a decent speed.
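As a quick illustration of that parsing/printing point (in Python rather than assembly, just to show the idea): one tenth is exact in BCD, but a binary float can only approximate it, which is exactly why base-10 text conversion takes real work with the 80-bit format.
Code: [Select]
# 0.1 is exact as a BCD float but only approximate as a binary float.
print(f"{0.1:.20f}")   # 0.10000000000000000555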
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on August 14, 2013, 12:13:29 pm
I added the 80-bit Floating Point Addition routine which was friendlier than I expected. Also, for current timings, Division is a little under 60 000 t-states, multiplication is a little over 30 000 t-states, and addition is just under 1400 t-states. I need to make all of these faster x.x

EDIT: Oh, also, the floating point addition code includes a routine for setting a number to infinity while keeping its sign, among other things. It also has rounding implemented.

EDIT: In my test app, here are the sizes so far:
FloatAdd : 244 bytes (this includes the code for handling infinity and a few other subroutines that will be shared by others)
FloatDiv : 244 bytes
FloatMul : 205 bytes (this includes a subroutine for 64-bit integer multiplication)
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on August 16, 2013, 04:28:26 pm
I modified the addition routine (I rewrote it from scratch) and it now handles adding negative numbers and whatnot. This means the subtraction routine only needed to toggle the sign of the second input, so multiplication, division, addition, and subtraction are working :)

I also put the code here (https://drive.google.com/folderview?id=0B4HNIXQZLWM8Z3NQMGJTbHVhRm8&usp=sharing) because the first post was getting too cluttered for my liking. If you check it out, you will notice the yet-to-be-tested code for the CORDIC algorithm that should compute sine and cosine at the same time.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on August 18, 2013, 11:45:16 pm
I added a few 24-bit floating point routines, including a log2 routine. The advantage of the 24-bit floats is that everything can be done in the registers, giving them even better speed. The obvious downside is much more limited accuracy (about 4 to 5 decimal digits), but they do provide a fast alternative to integer math. They could be used in languages like Axe or Grammer.

I haven't fully tested the log2 routine, but I did test a few key values that were more likely to have accuracy issues. I am pleased to say that the worst I got was about .0000763 away from the actual value (though I won't be surprised if there are values with as much as 5 times that error). The main issue is that it doesn't handle negative inputs.

By the way, for timing, the log2 routine is around 1200 to 3500 t-states, so it can do 1700 to 5000 computations per second at 6MHz. Graphing a log() equation should be pretty fast ;D
Title: Re: [z80] Floating Point Routines
Post by: Eiyeron on August 19, 2013, 05:24:08 am
That's neat. I certainly won't use them, but keep up the good work!
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on October 21, 2013, 08:11:39 am
I updated the 80-bit floating point multiplication routine. It is a bit larger now, but I used a modified approach from the last version and it is now much faster. From ~32000 t-states, I brought it down to ~24000 2 days ago, and last night, I brought it down to <16000 t-states. Now it is more accurate and provides a wider range of numbers and it is faster than the OS. Division on the other hand is still pretty slow (the OS does it faster).

As well, I have a squareroot routine for the 24-bit floats that works, but I want to make it faster. I haven't added that to the routines list yet.
Title: Re: [z80] Floating Point Routines
Post by: Matrefeytontias on October 21, 2013, 08:58:57 am
You could pretty much write an OS out of that.
Title: Re: [z80] Floating Point Routines
Post by: Sorunome on October 21, 2013, 09:15:50 am
/me waits on xeda to write a custom OS which tops all of TI, even has mathprint and supports the normal basic progs - all only far better :P
Title: Re: [z80] Floating Point Routines
Post by: Matrefeytontias on October 21, 2013, 11:35:44 am
*cough* *cough* Join KnightOS dev team *cough* *cough* add float routines *cough* *cough* write a homescreen app *cough*
Title: Re: Re: [z80] Floating Point Routines
Post by: DJ Omnimaga on October 21, 2013, 12:18:19 pm
Given what SirCmpwn did to AssemblyBandit when he tried to contribute to KnightOS CSE, I would be careful about contributing for it.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on October 21, 2013, 03:04:15 pm
Yeah, I dunno about joining that team, but you are free to use any of these routines. They haven't yet been rigorously optimised (obviously, if I managed to make a routine twice as fast there is huge room for improvement). My hope was that these would actually be used for an OS, whether it is an OS I write or an OS others write. I also planned to use these routines if I ever finish Grammer 3, or Grammer 4 (an OS).

I might make a concept programming language using these 80-bit and 24-bit floats, or I might incorporate them into FileSyst somehow.
Title: Re: [z80] Floating Point Routines
Post by: Sorunome on October 21, 2013, 03:08:48 pm
Given what SirCmpwn did to AssemblyBandit when he tried to contribute to KnightOS CSE, I would be careful about contributing for it.
What happened to AssemblyBandit? O.O
Title: Re: Re: [z80] Floating Point Routines
Post by: DJ Omnimaga on October 22, 2013, 02:58:58 pm
He was falsely accused of stealing KnightOS code after he provided sircmpwn with a fixed version of the 84+CSE remake. Sir needed help to get KOS to run on the color calc, hence why AsmBandit wanted to help.
Title: Re: [z80] Floating Point Routines
Post by: Streetwalrus on October 22, 2013, 03:15:48 pm
Wtf? O.O I hope he thanked him at least. Also, with an open source project, stealing would be claiming the work as yours and not releasing the source. Knowing AsmBandit, this couldn't happen.
Title: Re: Re: [z80] Floating Point Routines
Post by: DJ Omnimaga on October 22, 2013, 07:31:00 pm
True. He just published a fixed copy on Sir's repository anyway. Oh well
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on October 24, 2013, 08:48:06 pm
I have updated the 24-bit floating point routines. Most of the routines have easy input/output. If there are two inputs, they are in BHL and CDE. If there is one input, it is in AHL or BHL. Outputs for the floating point routines are in AHL or BHL. There are also a handful of extra routines, such as:

BC_Times_DE (returns 32-bit result, worst case is less than 700 t-states)
normalise24 to renormalise 24-bit floats
SetInf to return a float as infinity (keeping the original sign)
FloatToUInt to convert a float to an unsigned integer
FloatToSInt to convert a float to a signed integer
SqrtHL_prec16 returns the square root of HL to 16 bits of accuracy.

The actual floating point routines are:

Float24Mul
Float24Div
Float24Add
Float24Sub
Float24Lg
Float24Sqrt
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on March 16, 2014, 01:40:43 pm
A while ago, I thought of a way to possibly make division faster, but I never had the time or focus to actually write it out in assembly. I was working on it today after satisfying myself with the math involved, and it doesn't work yet (there is a bug somewhere that seems to be incrementing the wrong byte in the result on occasion). However, the fix for that should not make it much slower. My goal was to get division below 45 000 t-states, and from current tests, it will probably be more like 25 000 t-states. That is nearly twice as fast as the OS division and a little faster than the OS multiplication, and this is with 19 digits of accuracy.

My worry when I thought of the algorithm was that it seemed like it would be pretty close to the speed of long division and might only be faster for very large or very small numbers. It turns out that on the z80, this method blows away long division when using a floating point storage format.

Here is the way it works: if my numbers are always normalised (meaning the upper bit is always set, unless the number is 0), then I might want to compute something like the following, in hexadecimal: 8AD176/AC0980 (in base 10, 9097590/11274624). What I do is make an estimate for each byte of the quotient. Normally on the z80, we work bit by bit because handling that many bytes is difficult, but my algorithm isn't quite long division; the speed-up comes from how each chunk of 8 bits is chosen. I just take the upper 16 bits of the remainder and the upper 8 bits of the denominator and divide those to get my estimate. The nice part is that this estimate is off by at most 1 (and if it is off, it is an overestimate). Then whatever the estimate was, multiply it by the denominator (as in base-10 long division) and subtract. So for the example:
Code: [Select]
First digit = 0 ('digits' are 8-bit ints, so on [0,255])
Now 8AD1/AC = CE, so 8AD176.00 - AC0980*0.CE    = 8AD176-8A6FA5 = 61D1
Now 61D1/AC = 91, so 61D1.0000 - AC0980*.0091   = 61D1.0-6171.6180 = 5F.9E80
Now 5F9E/AC = 8E, so 5F.9E80   - AC0980*.00008E = 5F.9E8000-5F.6D4500 = .313B
In this case, there were no overestimates. We would have known if there had been one because the subtraction step would have yielded a negative result; to adjust for it, decrement the new digit by 1 and add AC0980 back to the remainder. So the example gives 8AD176/AC0980 = 0.CE918E, or in base 10, 9097590/11274624 = .806908488274

In my routine, that multiplication step is just multiplying an 8-bit number by a 64-bit number, and I combine it with the subtraction (so it is a fused multiply-subtract routine); it only needs to be performed 8 times. Hopefully when I get out of work later, I won't be too tired to keep working on this :)
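If you want to play with the idea before reading the assembly, here is a minimal Python sketch of the estimate-and-correct loop described above (my own illustration, not the z80 code; the function name and the defensive while loop are mine):
Code: [Select]
# Each base-256 "digit" is estimated from the top 16 bits of the running
# remainder divided by the top 8 bits of the (normalized) denominator, then a
# fused multiply-subtract removes est*den; if the result goes negative, the
# estimate was too high and gets corrected (the post argues one correction is
# enough; a while loop is used here just to be safe).
def div_base256(num, den, digits):
    """Assumes num < den and den's top byte >= 0x80; the integer-part digit
    is handled separately, as in the post."""
    shift = den.bit_length() - 8       # position of the denominator's top byte
    den_top = den >> shift             # upper 8 bits of the denominator
    q, rem = 0, num
    for _ in range(digits):
        rem <<= 8                      # bring in the next base-256 place
        est = (rem >> shift) // den_top
        rem -= est * den               # fused multiply-subtract
        while rem < 0:                 # overestimate: step back and restore
            est -= 1
            rem += den
        q = (q << 8) | est
    return q

# The walkthrough above: 8AD176/AC0980 -> 0.CE918E (leading digit 0 implied)
print(hex(div_base256(0x8AD176, 0xAC0980, 3)))   # 0xce918e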
Title: Re: [z80] Floating Point Routines
Post by: chickendude on March 18, 2014, 01:21:38 am
This stuff is mostly all way beyond me but don't think no one is reading/that no one finds it interesting. I think these routines are really cool and i'm trying to make the effort to understand them. I hope you keep working on them!
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on March 18, 2014, 10:45:35 am
Cool, thanks :) I finally got all of the small bugs out, and it currently takes 25053 t-states to perform pi/e and 22483 t-states for e/pi. However, the number format is slightly different from my other 80-bit floats. After experimenting earlier this year, I found it is actually a little friendlier to store the exponent+sign in little-endian order, with 16384 representing an exponent of 0.

Floating point is basically like scientific notation. So in base 10, 314.1593=3.141593*10^2. The way it would be stored in floating point is a sign bit (positive, so usually 0), followed by the exponent (2), followed by the string "3141593". Now if we want to square this, we can multiply 3141593*3141593, add the exponents, and XOR the signs. So you get +:4:9869606577649 and this tells us that the number is +98696.06577649. In practice, we truncate and round the digits, so to 6 digits, +:4:986961.

In binary, it is the same idea. If I keep the numbers normalized (keeping the mantissa on [1,2)), then division of 3/5, for example, is dividing +:1:11000000 / +:2:10100000. We can subtract exponents, XOR the signs, and then we perform 11000000/10100000 to whatever precision we need. Integer division would give us 0, but if we go for 8 more iterations of the same division algorithm, we get 8 bits of the fraction. Rounding, we get +:-1:10011010, which corresponds to .10011010 = 154/256 = .6015625.
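To make that 3/5 walkthrough concrete, here is a tiny Python sketch of the same steps (just an illustration; the helper names are made up):
Code: [Select]
# Binary "scientific notation" division: subtract exponents, divide the
# normalized mantissas, renormalize, and round the mantissa to 8 bits.
def split(x):
    """Return (mantissa in [1,2), exponent) for a positive number."""
    e = 0
    while x >= 2:
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    return x, e

def fdiv8(a, b):
    ma, ea = split(a)
    mb, eb = split(b)
    m, e = ma / mb, ea - eb
    if m < 1:                  # quotient of two [1,2) mantissas lies in (0.5, 2)
        m, e = m * 2, e - 1
    return round(m * 128), e   # 8 mantissa bits, leading 1 included

m, e = fdiv8(3, 5)
print(bin(m), e)               # 0b10011010 -1, i.e. 1.203125 * 2**-1 = 0.6015625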

Here is my code, currently. I am pretty sure it isn't nearly as optimized as it could be and it is nearly 900 bytes:
Code: [Select]
Div_Sub:
;DE/C, DE <C*256, C>127
   ld a,d
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ adc a,a \ jr c,$+5 \ ret p \ cp c \ ret c \ inc e \ ret
FloatDiv_80:
; 1 bit sign + 15 bits signed exponent (16384 is exp = 0)   (little endian)
; 64 bits mantissa, (big endian)
;Inputs:
;  HL points to dividend
;  DE points to divisor
   ex de,hl
   call LoadFPOPs
   ld hl,(fpOP1)
   ld de,(fpOP2)
   ld a,h
   xor d
   push af
   res 7,d
   res 7,h
   sbc hl,de
   ld bc,16384
   add hl,bc
   pop af
   and $80
   or h
   ld h,a
   ld (fpOP3),hl
;Now perform the division of fpOP2/fpOP1
;The algo works like this:
;  Take the first byte of fpOP2, compare against that of fpOP1
;  If it is bigger, since fpOP1 should have bit 7 set (normalized numbers),
;    it divides at most once. So the first byte is 1, subtract fpOP2-fpOP1->fpOP2
;  After this, we repeatedly compare the upper two bytes of fpOP2 to the first byte
;    of fpOP1. This is to estimate how many times fpOP2 can be divided by fpOP1.
;  This is just a guestimate, but each digit is an overestimate by at most 1!
;
;  Example with smaller numbers. Take 8AD176/AC0980
;   First digit = 0   ('digits' are 8-bit ints, so on [0,255])
;   Now 8AD1/AC = CE, so 8AD176.00 - AC0980*0.CE    = 8AD176-8A6FA5 = 61D1
;   Now 61D1/AC = 91, so 61D1.0000 - AC0980*.0091   = 61D1.0-6171.6180 = 5F.9E80
;   Now 5F9E/AC = 8E, so 5F.9E80   - AC0980*.00008E = 5F.9E8000-5F.6D4500 = .313B
;  In this case, there were no overestimates. We would have known if the subtraction step
;  yielded a negative result. To adjust for this, decrement the new digit by 1 and add AC0980 back to the remainder.
;  So the example gives 8AD176/AC0980 = 0.CE918E, or in base 10, 9097590/11274624=.806908488274

;fpOP1+2 has denom
;fpOP2+2 has num
   ld de,fpOP2-2
   ld hl,fpOP2+2
   ldi \ ldi \ ldi
   ldi \ ldi \ ldi
   ldi \ ldi \ ldi
   ldi \ ldi \ ldi
denom = fpOP1+2
numer = fpOP2-2
outp  = numer-1
   ld hl,denom
   ld de,numer
   call cp_64b
   ld hl,numer-1
   ld (hl),0
   jr c,noadjust
      inc (hl)
      ex de,hl
      inc de
      ld hl,denom
      call sub_64b
      ex de,hl \ dec hl
noadjust:
   inc hl
   ld de,numer+8
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   ld de,801Eh
   ld hl,800Bh
   ld a,(hl)
   rra
   jr nc,directcopy
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ ret
directcopy:
   inc hl
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ld hl,(fpOP3) \ dec hl \ ld (fpOP3),hl \ ret
div_sub_1:
   ld bc,(denom)
   ld a,(hl)
   inc hl
   push hl
   ld l,(hl)
   ld h,a
   ex de,hl
   call Div_Sub
   ld c,e
   ex de,hl
   call fused_mul_sub
   ld hl,9
   add hl,de
   ex de,hl
   pop hl
   ret
fused_mul_sub:
;multiply denominator*E and subtract from numerator
   xor a
   ld hl,(denom+6) \ ld b,a \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   ld a,(de) \ sub l \ ld (de),a \ dec de
   ld a,h \ adc a,b

   ld hl,(denom+5) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+4) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+3) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+2) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+1) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom-1) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ dec de
   ld l,a
   ld a,(de)
   sbc a,l
;if c flag is set, overestimate
   ld a,c \ ld (de),a
   ret nc
   ld hl,8
   add hl,de
   ex de,hl
   ld hl,denom+7
   ld a,(de) \ add a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec de
   ex de,hl \ dec (hl) \ ex de,hl
   ret



;num+7 - hl

sub_64b:
;(de)-(hl), big endian 64-bit.
   ld bc,7
   add hl,bc
   ex de,hl
   add hl,bc
   ex de,hl
   ld a,(de) \ sub (hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ ret
cp_64b:
;compares (de) to (hl), big endian 64-bit ints
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret

LoadFPOPs:
;HL points to the first
;DE points to the second
   push de
   ld de,fpOP1
   xor a
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   pop hl
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   ret
.echo "Size:",$-Div_Sub
It is so large because I unrolled much of it. As an example, using e and pi:
Code: [Select]
   ld hl,float_e
   ld de,float_pi
   jp FloatDiv_80
;e/pi =0.dd816a76547ca9910802972996d4e3
float_pi:
  .dw 16384+1 \ .db $c9,$0f,$da,$a2,$21,$68,$c2,$34    ;pi, not rounded up
float_e:
  .dw 16384+1 \  .db $ad,$f8,$54,$58,$a2,$bb,$4a,$9A    ;e, not rounded up
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on March 19, 2014, 07:40:17 am
At a hefty 1631 bytes and a change of syntax, I have managed to get division under 20 000 t-states and multiplication at about 13000 t-states.


The change of syntax is basically this: the exponent is biased so that 16384 represents an exponent of 0, the sign is bit 15, and together these comprise the first two bytes, stored little-endian. The following 8 bytes are the mantissa and are also little-endian (it was big-endian before).


I also changed how the multiplication and division routines are done. For division, at each step, instead of doing 8 8x8->16 multiplications, it is faster to do 4 8x16->24 multiplications. For multiplication, instead of doing 16 16x16->32 multiplications, I do 32 8x16->24 multiplications.


I should also note that these timings are without code to check for overflow, zero, infinities, or NAN, but I changed the exponent storage to make those checks easier. Now if I can get addition and subtraction in this new format, I will be tempted to write some really slow routines for exponentials, logs, and trig functions. Then they will at least be available until I have time to design much faster algorithms.


Also, I have been trying to think of a way to make a "geometric mean" function. Basically, the geometric mean of 'a' and 'b' is sqrt(ab). I could just multiply and then take the square root, but for the past few months I have had the idea of fusing the two operations into one. The idea is that since an NxN multiplication returns a number of size 2N, I can try to produce the product 2 bits at a time. A restoring square root takes in the upper two bits of its argument at each iteration, so I can feed the bits directly from the multiplication into the square root algorithm without ever storing a 128-bit integer, while still getting full 64-bit precision (instead of multiplying to get a 64-bit approximation and taking the square root of that).


The geometric mean would be useful when combined with the arithmetic mean (a+b)/2, especially if I can get it to be streamlined and fast. There is a class of algorithms that use the Arithmetic-Geometric Mean. This provides the asymptotically fastest known method of computing things like natural log, arctangent, and others.
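For reference, the AGM iteration itself is tiny; a rough Python sketch of just the iteration (not the planned z80 code) looks like this:
Code: [Select]
# Arithmetic-geometric mean: repeatedly replace (a, b) by their arithmetic and
# geometric means; the two values agree quadratically fast.
from math import sqrt

def agm(a, b, tol=1e-15):
    while abs(a - b) > tol * a:
        a, b = (a + b) / 2, sqrt(a * b)
    return a

print(agm(1.0, 2.0))   # ~1.4567910310469068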
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on March 19, 2014, 01:11:59 pm
I have a new version of the 80-bit floats, and I made an example routine to compute the square root using Newton's method (basically .5(Ans+X/Ans), iterated 5 times). Since there isn't already a square root routine and this gets about 64 bits of precision, I have left it in, but it is slow.
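In Python terms, that Newton iteration is simply the following (a sketch of the method, not the actual routine):
Code: [Select]
# Newton's method for sqrt(X): average an overestimate and an underestimate.
def newton_sqrt(x, guess, iterations=5):
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess

print(newton_sqrt(2.0, 1.0))   # ~1.4142135623730951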


Attached is the current source, and it is updated in the Google Docs thing. For a summary of timings:
Code: [Select]

Routine          Clock cycles    Ops per sec (6MHz)
Add/Sub            1200 cc        5000
Multiplication    13000 cc         461
Division          19000 cc         315
Sqrt             108000 cc          55
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on April 10, 2014, 10:33:46 pm

I added in formats and full support for signed Zero, signed Infinity, and NAN. I also found that Newton's method may actually be my best option for the square root algorithm, but I did mix in my own algorithm, too. The one I came up with converges only linearly, but its iterations are faster than an iteration of Newton's method. I used one iteration of my method, at a cost of about 2000 cycles, to give a better approximation than 2 iterations of Newton's method, so I got to remove one Newton iteration plus the already-optimized first iteration. The result was over 20 000 clock cycles saved, putting it about 700 clock cycles slower than TI's (but if I only had to compute 47 bits of precision, it would be over 22000 t-states faster for the same accuracy :P).


Anyways, I updated the google docs thing, and I also attached the files. I reorganized things by splitting up main routines (add/sub, multiply, divide, square root) into their own files and then I just #include them.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on April 11, 2014, 09:17:37 am
Since the previous upload has been downloaded already, I guess I will upload this revised version. Last night in bed I was thinking about better ways to do the square root and I realized my method was not in fact as fast as it could be. See, I did the following computation in my head:


##x_{n}## is a sequence converging to ##\sqrt{x}## and is defined by:
##x_{n+1}=\frac{x_{n}+\frac{x}{x_{n}}}{2}##
(Basically, it averages an overestimate and underestimate of the square root.)


So say at iteration n, I have the error being ##k=x_{n}-\sqrt{x}##. Then the next iteration of the sequence has the error:


##x_{n+1}-\sqrt{x}=\frac{x_{n}+\frac{x}{x_{n}}}{2}-\sqrt{x}##


##=\frac{x_{n}+\frac{x}{x_{n}}-2\sqrt{x}}{2}##


##=\frac{x_{n}-\sqrt{x}+\frac{x}{x_{n}}-\sqrt{x}}{2}##


##=\frac{k+\frac{x}{x_{n}}-\sqrt{x}}{2}##


##=\frac{k+\frac{x}{\sqrt{x}+k}-\sqrt{x}}{2}##


##=\frac{k+\frac{x-x-k\sqrt{x}}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k\sqrt{x}}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k(x_{n}-k)}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k x_{n}-k^{2}}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k x_{n}-k^{2}}{x_{n}}}{2}##


##=\frac{k-k+\frac{k^{2}}{x_{n}}}{2}##

##=\frac{\frac{k^{2}}{x_{n}}}{2}##
What does this even mean? Suppose I had m bits of accuracy. Then the new accuracy, as long as x is on [1,4), as my values are, is at least 2m+1 bits of accuracy (up to 2m+2). Basically, if I could get 8 bits of accuracy initially and cheaply, the next iteration would yield at least 17 bits, then 35, then 71. What I decided to do, then, was just use a simple and fast 16-bit square root routine that I wrote to get an 8-bit estimate. However, when I went to look in my folder, I found that I had already written a 32-bit square root routine that returned a 16-bit result.
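Here is a quick numerical check of that bit-doubling claim (a Python illustration, not part of the library):
Code: [Select]
# Start from a roughly 8-bit seed (what a 16-bit integer square root provides)
# and watch the number of correct bits roughly double with each iteration.
from math import sqrt, log2

x = 3.0
true = sqrt(x)
est = round(true * 256) / 256          # crude seed, correct to ~8-10 bits
for i in range(4):
    err = abs(est - true) / true
    bits = -log2(err) if err else 52   # cap at double-precision limits
    print(f"iteration {i}: ~{bits:.0f} correct bits")
    est = 0.5 * (est + x / est)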


This means that I successfully removed 3 iterations of Newton's method since last night, bringing the total to just 2 iterations. Now it is down from 108 000 t-states to just 46 000, putting it much faster than TI's.



Title: Re: [z80] Floating Point Routines
Post by: TIfanx1999 on April 11, 2014, 09:39:07 am
I saw on IRC that you made significant speed gains and said that it might be ~2x faster than TI's. Kudos to you! :D
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on April 11, 2014, 09:47:18 am
Well, here are timings I got from WabbitEmu for the OS (86825 ccs) and mine (46831ccs). So it isn't quite twice as fast, but it is almost. I am also working on a routine to cut out another 16000 or so, so then it will be almost 3 times faster. For the timings I have:
Code: [Select]

Args used:
1.570796326794897
57.29577951308232
For example, 57.29577951308232/1.570796326794897


                TI-OS      Float80    diff      ratio     analysis
add/subtract    2758       3166       +408      1.1479    Add/sub is a bit slower, possibly noticeably
multiply        35587      10851      -24736    0.3049    Multiplication is significantly faster. Noticeable.
divide          40521      18538      -21983    0.4575    Division is significantly faster. Noticeable.
square root     86825      46831      -39994    0.5394    Square roots are significantly faster. Noticeable.


notes: TI-Floats are approximately 47 bits of precision. Float80 uses 64 bits of precision (that is 14 digits versus 19)
Title: Re: [z80] Floating Point Routines
Post by: Streetwalrus on April 11, 2014, 10:22:06 am
Oh wow, that's nice! I might use these in Illusiat since they are faster. It would be nice if you made an Axiom out of them.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on April 11, 2014, 10:31:42 am
For that, what kind of floating point precision would you need? 80-bit floats are pretty wild, but you may have seen that I have a bunch of 24-bit floats and I can make 40-bit floats that are really fast, too (like, cut all the times in 4ths). In fact, the multiplication routine uses a divide-and-conquer algorithm that has an intermediate step of computing 32-bit multiplications (and a 40-bit float would have 32-bit multiplication).
Title: Re: [z80] Floating Point Routines
Post by: TIfanx1999 on April 11, 2014, 10:37:59 am
Well, here are timings I got from WabbitEmu for the OS (86825 ccs) and mine (46831ccs). So it isn't quite twice as fast, but it is almost. I am also working on a routine to cut out another 16000 or so, so then it will be almost 3 times faster. For the timings I have:
Code: [Select]

Args used:
1.570796326794897
57.29577951308232
For example, 57.29577951308232/1.570796326794897


                TI-OS      Float80    diff      ratio     analysis
add/subtract    2758       3166       +408      1.1479    Add/sub is a bit slower, possibly noticeably
multiply        35587      10851      -24736    0.3049    Multiplication is significantly faster. Noticeable.
divide          40521      18538      -21983    0.4575    Division is significantly faster. Noticeable.
square root     86825      46831      -39994    0.5394    Square roots are significantly faster. Noticeable.


notes: TI-Floats are approximately 47 bits of precision. Float80 uses 64 bits of precision (that is 14 digits versus 19)

:o
That's awesome Xeda! ;D
Title: Re: [z80] Floating Point Routines
Post by: Streetwalrus on April 11, 2014, 11:36:08 am
I don't know how much precision the original needs. I know that calculations go way high in the battle engine and that there is a 9999 damage cap if it goes beyond that. Also it's trimmed to an integer in the end.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on April 12, 2014, 05:42:20 pm
Hmm, isn't 16-bit math sufficient, then? Also, in good news, I actually shaved off close to 18000 more clock cycles from the square root routine, putting it at a little over 3 times faster than TI's. I am back to working on the exponential and logarithm routine, but they are table based (using a single 64-element LUT with 9 bytes to each element). From this I will build the Float->Str routine.
Title: Re: [z80] Floating Point Routines
Post by: Streetwalrus on April 12, 2014, 05:48:16 pm
No, I think it goes beyond 16 bits, and if it uses division then the decimal part is needed for accuracy. Gonna check that later and tell you more.
Title: Re: [z80] Floating Point Routines
Post by: TIfanx1999 on April 12, 2014, 07:17:54 pm
Hmm, isn't 16-bit math sufficient, then? Also, in good news, I actually shaved off close to 18000 more clock cycles from the square root routine, putting it at a little over 3 times faster than TI's. I am back to working on the exponential and logarithm routine, but they are table based (using a single 64-element LUT with 9 bytes to each element). From this I will build the Float->Str routine.

Awesome Xeda! :D
Title: Re: [z80] Floating Point Routines
Post by: chickendude on April 26, 2014, 04:59:12 pm
Xeda, i was just looking through the 24-bit division routine and saw this line:
Code: [Select]
or a \ sbc hl,de \ jr c,$+7 \ set 7,b \ jp $+4 \ add hl,de \ srl d \ rr e
I was just wondering if, instead of all those "jp $+4"s, you could use "set 7,b \ .db $38 ;jr c,... \ add hl,de \ ; ..." to save two bytes and 3 t-states (x 15 repetitions). Since the carry will never be set there, it'll just skip the add hl,de, which it will read as part of the jr. When the condition is false, jr is actually faster (and, of course, smaller) than a jp.
Title: Re: [z80] Floating Point Routines
Post by: calc84maniac on April 26, 2014, 05:01:41 pm
Xeda, i was just looking through the 24-bit division routine and saw this line:
Code: [Select]
or a \ sbc hl,de \ jr c,$+7 \ set 7,b \ jp $+4 \ add hl,de \ srl d \ rr e
I was just wondering if, instead of all those "jp $+4"s, you could use "set 7,b \ .db $56 ;jr c,... \ add hl,de \ ; ..." to save a byte and 3 t-states (x 15 repetitions). Since the carry will never be set there, it'll just skip the add hl,de, which it will read as part of the jr. When the condition is false, jr is actually faster (and, of course, smaller) than a jp.

That sounds like a good idea. However, remember that jr c is $38 and not $56 ;)

Also, that actually saves two bytes, so even better.
Title: Re: [z80] Floating Point Routines
Post by: chickendude on April 26, 2014, 07:49:07 pm
Oops, i was looking at the decimal version  :-X Thanks for catching that (and that it actually saves two bytes), though i'm pretty sure it wouldn't have been an issue for Xeda ;)
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on April 30, 2014, 09:24:13 am
Oh, that's a really good idea! I actually have to rewrite that division, though (it doesn't return accurate results because it isn't keeping track of all of the remainder term). But even so, I wonder how many more routines could use a similar optimization? Excellent!
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on November 24, 2015, 11:49:48 am
Necro update:

I have some Single Precision Floating Point Routines (https://drive.google.com/folderview?id=0B4HNIXQZLWM8LVdMbjRHVWFidkU&usp=sharing) that seem to be working. They have been tested, but not rigorously. The routines currently available seem to work as expected, including being able to compute and display ##\log_{2} \left(\pi^{-32}\right)## to 7 digits of accuracy. The single->string routine is not as complete as it could be, but it is working.

Syntax is consistent with HL pointing to the first argument, DE to a second argument if needed, and BC pointing to where the result is output. Only cmpSingle is different, because it does not store an output. Instead, it returns the result of the comparison in the flags register. With that exception, no float routines modify the main or shadow registers.

Now for an example of use on a TI-OS, let's compute and display ##\log_{2} \left(\pi^{-32}\right)##:
Code: [Select]
    ld hl,const_pi  ;pi is the first arg
    ld d,h \ ld e,l ;pi is also the second
    ld bc,scrap
    call mulSingle  ;pi*pi = pi^2
    ld h,b \ ld l,c ;Gonna square the result
    ld d,b \ ld e,c ;BC points to the result of the previous multiply, now HL and DE do, too.
    call mulSingle  ;= pi^4
    call mulSingle  ;= pi^8
    call mulSingle  ;= pi^16
    call mulSingle  ;= pi^32
    call invSingle  ;= 1/pi^32 = pi^-32
    call lgSingle   ;= lg(pi^-32)
    call single2string
    bcall(_PutS)

From the readme, included routines are:
Code: [Select]
absSingle
  func: |x| -> z
  mem:  None
addSingle
  func: x+y -> z
  mem:  6 bytes
  Note: special cases not done
subSingle
  func: x-y -> z
  mem:  10 bytes
  Note: special cases not done
rsubSingle
  func: -x+y -> z
  mem:  10 bytes
  Note: special cases not done
invSingle
  func: 1/x -> z
  mem:  5 bytes
divSingle
  func: x/y -> z
  mem:  5 bytes
cmpSingle
  func: compare x to y, no output
        return z flag if x=y (error is up to the last 2 bits)
        return c flag if x<y
        return nc if x>=y
  mem:  None
single2string
  func: string(x) -> z
  mem:  44 bytes
mulSingle
  func: x*y -> z
  mem:  6 bytes
negSingle
  func: -x -> z
  mem:  None
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on November 25, 2015, 01:33:01 pm
I haven't updated the public download, but I finished:
log10(x)
ln(x)
log_y(x)
2^x
e^x
y^x

And I'm working on making a much more general and better (asymptotically faster, less RAM used) routine for converting floats to base 10 strings (and TI floats).

Short Term Plan:
After the better float->string, I plan to implement a string->single and tifloat->single. These should be fairly easy to implement (I have done it several times before with success).
After that, a very simple math parser and I/O. This will be difficult-ish (6 on a scale of 1 to 10).

Longterm Schedule:
Update the exponential and log routine with a complex algorithm. Medium difficulty using the existing framework. This will supply most of the trig routines, too.
Add complex support to the math parser.
Extend these routines to the 80-bit floats.
Go nuts because now I can pump out tons of routines since I have the building blocks for most of math.


Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on November 26, 2015, 12:35:06 pm
I have not added extra-precision calculation, but I really need to. The loss of accuracy builds up in those bottom bits, meaning I lose 1 decimal digit of accuracy (so we only have 6 digits of accuracy). TI uses 4 extra digits of precision in their intermediate calculations, which is why they manage to keep much of the error out.
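The effect of those extra digits is easy to demonstrate with Python's decimal module (this only illustrates the principle; it is not how the z80 code works):
Code: [Select]
# With exactly the working precision, (1/3)*3 keeps a visible error; with a
# few guard digits in the intermediate steps and one final rounding, it doesn't.
from decimal import Decimal, getcontext

getcontext().prec = 6
print((Decimal(1) / Decimal(3)) * Decimal(3))   # 0.999999

getcontext().prec = 10                          # 4 extra digits of precision
x = (Decimal(1) / Decimal(3)) * Decimal(3)      # 0.9999999999
getcontext().prec = 6
print(+x)                                       # 1.00000 after the final rounding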

Anyways, I added a randSingle function, hyperbolic functions (sinh, cosh, tanh), a faked sqrt (using logs and exponents, until I find or rewrite lost code). I also overhauled the single2str function as planned, and I made a few other routines including a single->TI Float routine.

I've updated the public download, but I'll also upload to this post :)

Have a screenshot!
(http://i.imgur.com/jPtgn1V.png)
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on December 05, 2015, 06:55:20 pm
Bugs fixed:
Spoiler For Special Formats:
Basically I had ZERO with an exponent of $FF and INF/NAN with exponents of $00. However, a calculation resulting in 'ZERO' was usually only indicated by an exponent shift from $01 to $00, and INF was indicated by a shift from $FF to $00, so I could rely on the z flag. The problem is, an addition could technically result in a value with an exponent of $FF and not register as an overflow. Then when the result was fed into another routine, it would be flagged as ZERO instead of INF!

Now all special values have an exponent reading $00.
Spoiler For single2str:
There was an issue where, when |x|<.1, the output string would be partially mangled. This was fixed. There also may have been an unnoticed off-by-one error when returning negative exponents, but I finally noticed and fixed it.
New and Updated Routines
Spoiler For str2Single:
Converts a TI-ASCII string into a float.
Spoiler For geomeanSingle:
Computes the geometric mean of two numbers.
New Bugs:
Spoiler For str2single:
This does not yet handle the engineering E.
This crashes on some inputs, possibly if the strings are too large.

Title: Re: [z80] Floating Point Routines
Post by: joijoi on December 24, 2015, 01:05:39 am
Hello,
are you still working on the 80-bit version of these routines?
Thanks in advance for any answer you could give.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on December 28, 2015, 12:13:44 am
At the moment, I am not working on the 80-bit versions. However, I am not finished working with them. Life is busy again, so it will probably be a while yet before I work on floating point routines again.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on December 10, 2018, 11:45:55 pm
As an update and backup to Omni, I have been working on some of the extended precision routines. At this moment, I've rewritten multiplication and square roots, and I am about to work on division.

The square root routine is now averaging 9183.443cc, a full three times faster than the previous implementation! (and over 9 times faster than the TI-OS float algorithm).

The multiplication routine is now averaging 9925.527cc, a more humble 8.5% improvement. (still ~3.5 times faster than the TI-OS float multiplication). A good chunk of the speed improvement comes from a slightly faster 16-bit multiply routine from Runer112, which also has much nicer/more useful output registers.


I had an issue with the new square root algorithm, so for now I have patched it up so that it works, but it's about 1000cc slower than necessary. In the process of implementing this new algorithm (https://www.omnimaga.org/math-and-science/a-faster-newton's-method-square-root/), I rewrote some faster division routines and I might be able to get division down below 13000cc, a 30+% speed improvement (3 times faster than the OS routine).
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on December 12, 2018, 10:37:30 pm
Okay, with a lot of discussion with Runer112 over IRC/Discord, and a lot of coding, here is an update!
First, it seems like there might be an accuracy issue with the lower bits of xdiv, but that could just be an issue from truncating the input.

xsqrt is averaging ~9183.443cc
xmul is averaging ~9925.527cc
xdiv is averaging ~11107.370cc

Comparing to the old set of routines from a few years ago, division is now almost exactly 40% faster, multiplication is still about 8.5% faster, and square roots are three times faster.

Comparing to the OS routines, multiply is about 3.59 times faster, divide is about 3.65 times faster, and square root is about 9.45 times faster.

Pretty much the one thing the OS does better is converting floats to strings, which its BCD floats make wayyyyy faster. I project that converting to a string could take on average 50000cc with my method, versus maybe 600cc for a TI float (assuming no special formatting).

EDIT: I'm still not good with GitHub, so I'm going to try to get this to work, but I make no promises. z80float (https://github.com/Zeda/z80float)
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on December 15, 2018, 11:52:40 pm
So I added in addition/subtraction and float->str. I haven't calculated timings for the add/sub or float->str, but about 1500cc and 90000cc are reasonable guesses. The conversion introduces error in the last digits, so I should only return about 16 digits. At the moment, the only formatting that truncates to 16 digits max is when the exponent is too large in magnitude (ex. 1.234567890123456e-9).

Anyways, you can git the source on GitHub, but here are some ugly screenshots attached with evident rounding issues :P


EDIT: It turns out I computed the 10^-(2^k) table with lower precision. Now that it is fixed, numbers are being displayed with a bit better precision. I am currently working on implementing the B-G algorithm.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on January 05, 2019, 12:58:36 pm
As an update, the B-G algorithm is implemented, as well as inverse trig and inverse hyperbolic functions, and the natural logarithm. There have also been a bunch of bug fixes and optimizations.

My initial implementation of the 64-bit square root had an issue that was too daunting to track down. Instead, I just made a quick patch to make it work, requiring a 32-bit square operation. I finally decided to rework the code a bit and fixed the issue, allowing me to get rid of that 32-bit square and replace it with a 16-bit square (as originally intended).

xsqrt is now under 6600 clock cycles on average!

EDIT:
I made some more accurate timings, including taking into account the time it takes to make a bcall.
Code: [Select]
          TI-OS       z80float  dif         %
ln    = 131547.46cc   ~165000   +33452.54   125.43%   much slower :(
atan  = 173317.82cc   ~174000   +682.18     100.39%   slightly slower
atanh = 175320.91cc   ~174000   -1320.91     99.25%   slightly faster
sqrt  =  77699.51cc   6540.79   -71158.72     8.42%   Way faster!
mul   =  30229.53cc   9928.23   -20301.30    32.84%   over 3 times faster
add   =   1737.99cc   2094.31   +356.32     120.50%   slower :(
Also, have a recent screenshot:
(https://i.imgur.com/YKmKP3Q.png)
Keep in mind that I'm displaying one extra digit beyond the accuracy of these floats, so the last digit is useless. In the next digit, the error is off by less than 2.0!
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on January 14, 2019, 02:40:17 pm
x-post

I have been focusing on the single-precision floats this past week or so. I rewrote or re-worked a lot of routines. I got rid of most of the tables by switching to a polynomial approximation for the 2^x routine (thanks to the Sollya (http://sollya.gforge.inria.fr/) program!) and using the B-G algorithm to compute lnSingle. It turned out to be faster this way, anyways.
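The general shape of a polynomial-based pow2 is easy to sketch in Python (under my own assumptions: a plain Taylor series stands in for the Sollya minimax coefficients, and ldexp plays the role of the exponent adjustment):
Code: [Select]
# 2^x = 2^n * 2^f with n the integer part and f in [0,1); only 2^f needs the
# polynomial, and the 2^n factor is just an exponent-field adjustment.
from math import log, ldexp

LN2 = log(2.0)

def pow2(x, terms=12):
    n = int(x // 1)              # integer part -> exponent shift
    f = x - n                    # fractional part in [0, 1)
    term, acc = 1.0, 1.0
    for k in range(1, terms):    # Taylor series for e^(f*ln 2)
        term *= f * LN2 / k
        acc += term
    return ldexp(acc, n)

print(pow2(10.5), 2 ** 10.5)     # both ~1448.1546878700...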

I implemented sine, cosine, and tangent, the first two again using a minimax polynomial approximation. I optimized the square-root routine (much faster, but a few bytes bigger). I re-implemented the B-G algorithm using math optimizations I came up with a few months ago. I opted for two B-G implementations: one for lnSingle, which requires only 1 iteration for single precision, and one for the inverse trig and hyperbolic functions, which needs 2 iterations. For anybody looking to save on size, you can just use the second B-G routine for the natural logarithm. It will be a little slower, but it'll work just fine (maybe even give you an extra half-bit of precision :P).

I included the Python program that I use for converting numbers to my single precision format. You can use it to convert a single float or a bunch of them. I also included a Python tool I made for computing more efficient coefficients in the B-G algorithm, but that'll only be useful to me and maybe a handful of other people. It's there on the off chance somebody stumbles across my project looking for a B-G implementation.


The single precision floats are largely complete in that I can't think of any other functions that I want to add. There is still work to be done on range reduction and verification, as well as bug fixes and more extensive testing.

Here is a current screenshot of some of the routines and their outputs:
(https://i.imgur.com/2b3cjBi.png)

The current list of single-precision routines:
Code: [Select]
Basic arithmetic:
  absSingle     |x| -> z       Computes the absolute value
  addSingle     x+y -> z
  ameanSingle   (x+y)/2 -> z.  Arithmetic mean of two numbers.
  cmpSingle     cmp(x,y)       Compare two numbers. Output is in the flags register!
  rsubSingle    y-x -> z
  subSingle     x-y -> z
  divSingle     x/y -> z
  invSingle     1/x -> z
  mulSingle     x*y -> z
  negSingle     -x  -> z
  sqrtSingle    sqrt(x)   -> z
  geomeanSingle sqrt(x*y) -> z

Logs, Exponentials, Powers
  expSingle    e^x -> z
  pow2Single   2^x -> z
  pow10Single  10^x-> z
  powSingle    y^x -> z
  lgSingle     log2(x)  -> z
  lnSingle     ln(x)    -> z
  log10Single  log10(x) -> z
  logSingle    log_y(x) -> z

Trig, Hyperbolic, and their Inverses
  acoshSingle   acosh(x) -> z
  acosSingle    acos(x)  -> z
  asinhSingle   asinh(x) -> z
  asinSingle    asin(x)  -> z
  atanhSingle   atanh(x) -> z
  atanSingle    atan(x)  -> z
  coshSingle    cosh(x)  -> z
  cosSingle     cos(x)   -> z
  sinhSingle    sinh(x)  -> z
  sinSingle     sin(x)   -> z
  tanhSingle    tanh(x)  -> z
  tanSingle     tan(x)   -> z

Special-Purpose    Used by various internal functions, or optimized for special cases
  bg2iSingle     1/BG(x,y) -> z   Fewer iterations, but enough to be suitable for ln(x). Kind of a special-purpose routine
  bgiSingle      1/BG(x,y) -> z   More iterations, general-purpose, needed for the inverse trig and hyperbolics
  div255Single   x/255 -> z
  div85Single    x/85  -> z
  div51Single    x/51  -> z
  div17Single    x/17  -> z
  div15Single    x/15  -> z
  div5Single     x/5   -> z
  div3Single     x/3   -> z
  mul10Single    x*10  -> z
  mulSingle_p375         x*0.375  -> z      Used in bg2iSingle.  x*(3/8)
  mulSingle_p34375       x*0.34375-> z      Used in bgiSingle.   x*(11/32)
  mulSingle_p041015625   x*0.041015625-> z  Used in bgiSingle.   x*(21/512)

Miscellaneous and Utility
  randSingle    rand   -> z
  single2str    str(x) -> z           Convert a single to a null-terminated string, with formatting
  single2TI     tifloat(x) -> z       Converts a single to a TI-float. Useful for interacting with the TI-OS
  ti2single     single(tifloat x)->z  Converts a TI-float to a single. Useful for interacting with the TI-OS
  single2char   Honestly, I forgot what it does, but I use it in some string routines. probably converts to a uint8
  pushpop       pushes the main registers to the stack and sets up a routine so that when your code exits, it restores registers. Replaces manually surrounding code with push...pop
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on January 22, 2019, 04:38:28 pm
Update:
For the extended-precision floats, I added:
Code: [Select]
xcmp     for comparing two numbers
xneg     -x -> z
xabs     |x|-> z
xinv     1/x -> z   Observed a bug in 1/pi !
xpow     x^y -> z
xpow2    2^x
xpow10   10^x
xlog     log_y(x)   It's failing miserably
xlg      log2(x)
xlog10   log10(x)   Observed a bug in log10(pi)
I made the str->single routine better (it had been quickly thrown together and failed on many/most cases). Now it appears that digits get swapped in some cases! :( I have to look into this.

To Do:
Look into the string->single routine and figure out what is wrong
I still have to look into the bugs observed in the single-precision Mandelbrot set program
Look into the errors in xinv, xlog, and xlog10  (these might all be related, or maybe I accidentally a byte in the built-in constants).
Have to make xsin, xcos, xtan, xsinh, xcosh, xtanh, xtoTI (x-float to TI float), TItox (TI float to x-float), and strtox (string --> x-float).
For all of the trig routines, I still need to apply range-reduction :|

Once these are done, it's just finding and fixing bugs and optimizing, and the project is as complete as what I wanted to do. BCD floats were a cute idea, but I'm a bit more realistic now :P
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on February 21, 2020, 12:50:43 am
Wow, it has been over a year since I last updated this thread. I followed up on that todo list a while ago, but I'm posting to announce a new addition to the library: 24-bit floats! I basically wrote all of these in the last 48 hours and I'm pretty burned-out:
https://github.com/Zeda/z80float/tree/master/f24

These floats are a fantastic balance of speed and usefulness on the Z80, and I think all of the routines fit into about 2500 bytes, so they are compact, too.
Title: Re: [z80] Floating Point Routines
Post by: Xeda112358 on May 06, 2021, 07:59:08 pm
Wow, it has been over a year since I last updated this thread! :D

I've added support for IEEE-754 binary32 floats here (https://github.com/Zeda/z80float/tree/master/f32)!
What is really nice about this is that compilers like fasmg have built-in support for these floats, and there are a plethora of existing tools to work with them.
I've rearranged the project a bit, and I am working on making the code more assembler-agnostic (so it doesn't rely on spasm's unique features). I've added a lot of conversion routines, especially for f32 and f24 floats.
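As a quick reference for the binary32 bit layout (shown here with Python's struct module; this is just the standard IEEE-754 encoding, nothing specific to the library):
Code: [Select]
# IEEE-754 binary32: 1 sign bit, 8 exponent bits (bias 127), 23 mantissa bits
# with an implicit leading 1. These are the bit patterns tools like fasmg emit.
import struct

bits = struct.unpack('>I', struct.pack('>f', 3.14159265))[0]
print(hex(bits))                      # 0x40490fdb
sign     = bits >> 31
exponent = ((bits >> 23) & 0xFF) - 127
mantissa = bits & 0x7FFFFF
print(sign, exponent, hex(mantissa))  # 0 1 0x490fdb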

I took an unplanned path with the f32 floats and ended up making addition, subtraction, multiplication, division, and square roots work entirely within registers and the stack, so no external RAM was required. I haven't taken the time to do a speed analysis of these, but I'm thinking they are a little slower than the single routines. I expect that multiplication and division are slower, and hopefully addition and subtraction are faster.