[z80] Floating Point Routines


Xeda112358:
I have updated the 24-bit floating point routines. Most of the routines have easy input/output. If there are two inputs, it is BHL and CDE. If it is one input, it is AHL or BHL. Outputs for the floating point routines are AHL or BHL. There are also a handful of extra routines, such as:

BC_Times_DE (returns 32-bit result, worst case is less than 700 t-states)
normalise24 to renormalise 24-bit floats
SetInf to return a float as infinity (keeping the original sign)
FloatToUInt to convert a float to an unsigned integer
FloatToSInt to convert a float to a signed integer
SqrtHL_prec16 returns the square root of HL to 16 bits of accuracy.

The actual floating point routines are:

Float24Mul
Float24Div
Float24Add
Float24Sub
Float24Lg
Float24Sqrt

Xeda112358:
A while ago, I thought of a way to possibly make division faster, but I never had the time or focus to actually write it out in assembly. I was working on it today after satisfying myself with the math involved, and it doesn't work yet (there is a bug somewhere that occasionally seems to increment the wrong byte in the result). However, the fix for that should not make it much slower. My goal was to get division below 45 000 t-states, and from current tests it will probably be more like 25 000 t-states. That is nearly twice as fast as the OS division and a little faster than the OS multiplication, and it is for 19 digits of accuracy.

My worry when I thought of the algorithm was that it would be pretty close to the speed of long division and might only be faster for very large or very small numbers. It seems that on the Z80, this method blows away long division when using a floating point storage format.

Here is the way it works. If my numbers are always normalised (meaning the upper bit is always set, unless the number is 0), then I might want something like the following in hexadecimal: 8AD176/AC0980 (in base 10, 9097590/11274624). I start by estimating the first byte of the quotient. Normally on the Z80 we work a bit at a time, because handling too many bytes at once is difficult. However, my algorithm isn't quite long division. The speed-up comes from how we choose each chunk of 8 bits: I take the upper 16 bits of the numerator and the upper 8 bits of the denominator, and divide those to get my estimate. The nice part is that this will only be off by as much as 1 (and if it is, it was an overestimate). Then, whatever the estimate was, multiply it by the denominator (like in base 10 long division) and subtract. So for the example:

--- Code: ---First digit = 0 ('digits' are 8-bit ints, so on [0,255])
Now 8AD1/AC = CE, so 8AD176.00 - AC0980*0.CE = 8AD176-8A6FA5 = 61D1
Now 61D1/AC = 91, so 61D1.0000 - AC0980*.0091 = 61D1.0-6171.6180 = 5F.9E80
Now 5F9E/AC = 8E, so 5F.9E80 - AC0980*.00008E = 5F.9E8000-5F.6D4500 = .313B

--- End code ---
In this case, there were no overestimates. We would have known if the subtraction step had yielded a negative result. To adjust for this, decrement the new digit by 1 and add AC0980 back to the remainder. So the example gives 8AD176/AC0980 = 0.CE918E, or in base 10, 9097590/11274624=.806908488274
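The estimate-and-correct loop described above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not the Z80 routine: the function name `divide_bytewise` and its parameters are made up for illustration, and it works on plain integers instead of byte buffers. The correction is written as a loop because, with only the top byte of the divisor used for the estimate, the textbook bound for this kind of quotient guess allows it to be up to 2 too large (7F0000/80FFFF is such a case), so a single fix-up is treated here as the common case, not as a guarantee.

```python
def divide_bytewise(numer, denom, digits, nbytes=3):
    """Return (quotient digits, final remainder) for numer/denom, numer < denom.

    Both inputs are nbytes-wide integers and denom must have its top bit set.
    Each quotient digit is one byte of the fraction 0.d1 d2 d3 ... in base 256.
    """
    assert denom >> (8 * nbytes - 1) == 1, "denominator must be normalised"
    assert 0 <= numer < denom
    top = denom >> (8 * (nbytes - 1))       # upper 8 bits of the denominator
    rem = numer
    out = []
    for _ in range(digits):
        rem <<= 8                           # move to the next base-256 place
        # Estimate: upper 16 bits of the remainder over upper 8 bits of denom.
        est = min((rem >> (8 * (nbytes - 1))) // top, 0xFF)
        rem -= est * denom                  # multiply and subtract
        while rem < 0:                      # estimate was too high: back off
            est -= 1
            rem += denom
        out.append(est)
    return out, rem


digits, rem = divide_bytewise(0x8AD176, 0xAC0980, 3)
print([format(d, "02X") for d in digits], format(rem, "X"))
```

For the worked example this reproduces the digits CE, 91, 8E and leaves 313B00 as the remainder (the .313B above, shifted up by one byte).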

That multiplication step is just multiplying an 8-bit number by a 64-bit number in my routine, and I combine this with subtraction (so it is a fused multiply-subtract routine) and it only needs to be performed 8 times. Hopefully when I get out of work later, I won't be too tired to keep working on this :)
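As a rough model of the fused multiply-subtract step, the sketch below multiplies a byte string by one quotient digit and subtracts the product from the numerator in a single pass, least significant byte first. It is an illustration under assumptions: the name `fused_mul_sub`, the list-of-bytes representation, and the return convention are hypothetical, and on the Z80 each 8x8 product is itself a shift-and-add sequence because the CPU has no multiply instruction.

```python
def fused_mul_sub(numer, denom, digit):
    """numer -= digit * denom, in place; both are big-endian byte lists.

    numer holds one more byte than denom (room for the product's top byte).
    Returns True when the result went negative, i.e. the digit was too big.
    """
    assert len(numer) == len(denom) + 1
    carry = 0                                # high byte of the running product
    borrow = 0
    for i in range(len(denom) - 1, -1, -1):  # least significant byte first
        prod = denom[i] * digit + carry      # 8x8 -> 16 bits, plus carry-in
        carry = prod >> 8
        diff = numer[i + 1] - (prod & 0xFF) - borrow
        borrow = 1 if diff < 0 else 0
        numer[i + 1] = diff & 0xFF
    diff = numer[0] - carry - borrow         # top byte absorbs the last carry
    numer[0] = diff & 0xFF
    return diff < 0


n = [0x8A, 0xD1, 0x76, 0x00]
overestimate = fused_mul_sub(n, [0xAC, 0x09, 0x80], 0xCE)
print([format(b, "02X") for b in n], overestimate)
```

A True return value corresponds to the carry check at the end of the assembly routine: the caller would then add the denominator back and decrement the digit.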

chickendude:
This stuff is mostly way beyond me, but don't think no one is reading or that no one finds it interesting. I think these routines are really cool and I'm trying to make the effort to understand them. I hope you keep working on them!

Xeda112358:
Cool, thanks :) I finally got all of the small bugs out and it currently takes 25053 t-states to perform pi/e and 22483 t-states for e/pi. However, I have the number format slightly different from my other 80-bit floats. After experimenting earlier this year, it is actually a little friendlier to store the exponent+sign in little-endian mode and have 16384 represent the exponent 0.

Floating point is basically like scientific notation. So in base 10, 314.1593=3.141593*10^2. The way it would be stored in floating point is a sign bit (positive, so usually 0), followed by the exponent (2), followed by the string "3141593". Now if we want to square this, we can multiply 3141593*3141593, add the exponents, and XOR the signs. So you get +:4:9869606577649 and this tells us that the number is +98696.06577649. In practice, we truncate and round the digits, so to 6 digits, +:4:986961.
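The base-10 squaring example can be checked with a small sketch. The tuple layout (sign, exponent, digit string) follows the sign:exponent:digits notation above; the function name `dec_mul` and the round-half-up rule are assumptions made for the illustration.

```python
def dec_mul(a, b, digits=6):
    """Multiply two (sign, exponent, digit_string) decimal floats."""
    sign = a[0] ^ b[0]                       # XOR the signs
    exp = a[1] + b[1]                        # add the exponents
    prod = str(int(a[2]) * int(b[2]))        # multiply the digit strings
    # Each input reads as d.ddd..., so the product is below 100. An extra
    # leading digit means it reached 10 or more: bump the exponent.
    if len(prod) == len(a[2]) + len(b[2]):
        exp += 1
    if len(prod) > digits:                   # round half up to `digits` digits
        drop = len(prod) - digits
        prod = str((int(prod) + 5 * 10 ** (drop - 1)) // 10 ** drop)
        if len(prod) > digits:               # rounding carried, e.g. 9.99 -> 10.0
            prod = prod[:digits]
            exp += 1
    return sign, exp, prod


x = (0, 2, "3141593")                        # +314.1593
print(dec_mul(x, x))                         # (0, 4, '986961')
```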

In binary, it is the same idea. If I keep the numbers normalized (keeping the mantissa on [1,2)), then division of 3/5, for example, is dividing +:1:11000000 / +:2:10100000. We subtract the exponents, XOR the signs, and then perform 11000000/10100000 to whatever precision we need. Integer division would give us 1, but if we go for 8 more iterations of the same division algorithm, we get 8 more bits after the binary point: 1.00110011. Rounding to 8 bits, we get +:-1:10011010, which corresponds to .10011010 = 154/256 = .6015625.
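The 3/5 example can be reproduced with a short sketch using 8-bit mantissas. The name `bin_div`, the tuple layout, and the single guard bit used for rounding are assumptions for illustration; they are not taken from the posted routines.

```python
def bin_div(a, b, bits=8):
    """Divide two (sign, exponent, mantissa) floats with `bits`-bit mantissas.

    Mantissas are normalised: the top bit is set, so they represent [1, 2).
    """
    sign = a[0] ^ b[0]                       # XOR the signs
    exp = a[1] - b[1]                        # subtract the exponents
    num = a[2]
    if num < b[2]:                           # ratio below 1: shift up, fix exponent
        num <<= 1
        exp -= 1
    q = (num << bits) // b[2]                # bits-1 fraction bits plus a guard bit
    q = (q + 1) >> 1                         # round to nearest using the guard bit
    if q >> bits:                            # rounding overflowed to 2.0
        q >>= 1
        exp += 1
    return sign, exp, q


s, e, m = bin_div((0, 1, 0b11000000), (0, 2, 0b10100000))   # 3 / 5
print(s, e, format(m, "08b"), m / 2 ** 7 * 2.0 ** e)         # 0 -1 10011010 0.6015625
```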

Here is my code, currently. I am pretty sure it isn't nearly as optimized as it could be and it is nearly 900 bytes:

--- Code: ---Div_Sub:
;DE/C, DE <C*256, C>127
ld a,d
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
sla e \ adc a,a \ jr c,$+5 \ ret p \ cp c \ ret c \ inc e \ ret
FloatDiv_80:
; 1 bit sign + 15 bits signed exponent (16384 is exp = 0) (little endian)
; 64 bits mantissa, (big endian)
;Inputs:
; HL points to dividend
; DE points to divisor
ex de,hl
call LoadFPOPs
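ld hl,(fpOP2) ;exponent+sign of the dividend (LoadFPOPs put it in fpOP2)
ld de,(fpOP1) ;exponent+sign of the divisor, so HL-DE below is dividend-divisor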
ld a,h
xor d
push af
res 7,d
res 7,h
sbc hl,de
ld bc,16384
add hl,bc
pop af
and $80
or h
ld h,a
ld (fpOP3),hl
;Now perform the division of fpOP2/fpOP1
;The algo works like this:
; Compare the mantissa of fpOP2 (numerator) against that of fpOP1 (denominator).
; If it is not smaller, then since fpOP1 should have bit 7 set (normalized numbers),
; it divides at most once. So the first byte is 1, subtract fpOP2-fpOP1->fpOP2
; After this, we repeatedly divide the upper two bytes of fpOP2 by the first byte
; of fpOP1. This is to estimate how many times fpOP1 divides into fpOP2.
; This is just a guesstimate, but each digit is an overestimate by at most 1!
;
; Example with smaller numbers. Take 8AD176/AC0980
; First digit = 0 ('digits' are 8-bit ints, so on [0,255])
; Now 8AD1/AC = CE, so 8AD176.00 - AC0980*0.CE = 8AD176-8A6FA5 = 61D1
; Now 61D1/AC = 91, so 61D1.0000 - AC0980*.0091 = 61D1.0-6171.6180 = 5F.9E80
; Now 5F9E/AC = 8E, so 5F.9E80 - AC0980*.00008E = 5F.9E8000-5F.6D4500 = .313B
; In this case, there were no overestimates. We would have known if the subtraction step
; had yielded a negative result. To adjust, decrement the new digit by 1 and add AC0980 back to the remainder.
; So the example gives 8AD176/AC0980 = 0.CE918E, or in base 10, 9097590/11274624=.806908488274

;fpOP1+2 has denom
;fpOP2+2 has num
ld de,fpOP2-2
ld hl,fpOP2+2
ldi \ ldi \ ldi
ldi \ ldi \ ldi
ldi \ ldi \ ldi
ldi \ ldi \ ldi
denom = fpOP1+2
numer = fpOP2-2
outp = numer-1
ld hl,denom
ld de,numer
call cp_64b
ld hl,numer-1
ld (hl),0
jr c,noadjust
inc (hl)
ex de,hl
inc de
ld hl,denom
call sub_64b
ex de,hl \ dec hl
noadjust:
inc hl
ld de,numer+8
call div_sub_1
call div_sub_1
call div_sub_1
call div_sub_1
call div_sub_1
call div_sub_1
call div_sub_1
call div_sub_1
ld de,801Eh
ld hl,800Bh
ld a,(hl)
rra
jr nc,directcopy
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
inc hl \ ld a,(hl) \ rra \ ld (de),a \ ret
directcopy:
inc hl
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ld hl,(fpOP3) \ dec hl \ ld (fpOP3),hl \ ret
div_sub_1:
ld bc,(denom)
ld a,(hl)
inc hl
push hl
ld l,(hl)
ld h,a
ex de,hl
call Div_Sub
ld c,e
ex de,hl
call fused_mul_sub
ld hl,9
add hl,de
ex de,hl
pop hl
ret
fused_mul_sub:
;multiply denominator*E and subtract from numerator
xor a
ld hl,(denom+6) \ ld b,a \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
ld a,(de) \ sub l \ ld (de),a \ dec de
ld a,h \ adc a,b

ld hl,(denom+5) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

ld hl,(denom+4) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

ld hl,(denom+3) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

ld hl,(denom+2) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

ld hl,(denom+1) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

ld hl,(denom) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

ld hl,(denom-1) \ ld l,b
sla h \ jr nc,$+3 \ ld l,c
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add hl,hl \ jr nc,$+3 \ add hl,bc
add a,l \ jr nc,$+3 \ inc h \ ld l,a
ld a,(de) \ sub l \ ld (de),a \ ld a,h \ dec de
ld l,a
ld a,(de)
sbc a,l
;if c flag is set, overestimate
ld a,c \ ld (de),a
ret nc
ld hl,8
add hl,de
ex de,hl
ld hl,denom+7
ld a,(de) \ add a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
ld a,(de) \ adc a,(hl) \ ld (de),a \ dec de
ex de,hl \ dec (hl) \ ex de,hl
ret

;num+7 - hl

sub_64b:
;(de)-(hl), big endian 64-bit.
ld bc,7
add hl,bc
ex de,hl
add hl,bc
ex de,hl
ld a,(de) \ sub (hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
ld a,(de) \ sbc a,(hl) \ ld (de),a \ ret
cp_64b:
;compares (de) to (hl), big endian 64-bit ints
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
ld a,(de) \ cp (hl) \ ret

LoadFPOPs:
;HL points to the first
;DE points to the second
push de
ld de,fpOP1
xor a
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ld (de),a \ inc de
ld (de),a \ inc de
ld (de),a \ inc de
ld (de),a \ inc de
pop hl
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ld (de),a \ inc de
ld (de),a \ inc de
ld (de),a \ inc de
ld (de),a \ inc de
ret
.echo "Size:",$-Div_Sub

--- End code ---
It is so large because I unrolled much of it. As an example, using e and pi:

--- Code: --- ld hl,float_e
ld de,float_pi
jp FloatDiv_80
;e/pi =0.dd816a76547ca9910802972996d4e3
float_pi:
.dw 16384+1 \ .db $c9,$0f,$da,$a2,$21,$68,$c2,$34 ;pi, not rounded up
float_e:
.dw 16384+1 \ .db $ad,$f8,$54,$58,$a2,$bb,$4a,$9A ;e, not rounded up

--- End code ---
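The hex digits quoted in the comment can be sanity-checked off-calculator with big-integer arithmetic. This sketch only divides the two 64-bit mantissas from the example (both exponents are equal, so they cancel); the helper name `quotient_hex` is hypothetical. Because the stored mantissas are truncated, only the leading digits should be expected to match the quoted expansion.

```python
E_MANT = 0xADF85458A2BB4A9A    # float_e mantissa bytes, read big-endian
PI_MANT = 0xC90FDAA22168C234   # float_pi mantissa bytes, read big-endian


def quotient_hex(numer, denom, hex_digits=16):
    """First hex digits after the point of numer/denom (numer < denom)."""
    q = (numer << (4 * hex_digits)) // denom
    return format(q, "0{}x".format(hex_digits))


print("0." + quotient_hex(E_MANT, PI_MANT))
```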

Xeda112358:
At a hefty 1631 bytes and a change of format, I have managed to get division under 20 000 t-states and multiplication at about 13 000 t-states.

The change of format is basically this: the exponent uses 16384 for 0, the sign is bit 15, and these together comprise the first two bytes in little-endian. The following 8 bytes are the mantissa and are also little-endian (it was big-endian before).
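A minimal sketch of that 10-byte layout, using Python's `struct` module: the helper names `pack_float80` and `unpack_float80` are made up for illustration, and the field order (sign/exponent word first, then mantissa) follows the description above.

```python
import struct

BIAS = 16384                                 # stored value that means exponent 0


def pack_float80(sign, exponent, mantissa):
    """Pack into 10 bytes: 16-bit sign/exponent word, then 64-bit mantissa."""
    assert sign in (0, 1)
    assert 0 <= exponent + BIAS < 0x8000     # must fit in 15 bits
    word = (sign << 15) | (exponent + BIAS)
    return struct.pack("<HQ", word, mantissa)   # "<" = little-endian, no padding


def unpack_float80(data):
    """Inverse of pack_float80: returns (sign, exponent, mantissa)."""
    word, mantissa = struct.unpack("<HQ", data)
    return word >> 15, (word & 0x7FFF) - BIAS, mantissa


pi_bytes = pack_float80(0, 1, 0xC90FDAA22168C234)
print(pi_bytes.hex())                        # 014034c26821a2da0fc9
```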

I also changed how the multiplication and division routines are done. For the division at each step, instead of doing 8 8x8->16 multiplications, it is faster to do 4 8x16->24 multiplications. For the multiplication, instead of doing 16 16x16->32 multiplications, I do 32 8x16->24 multiplications.
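The count of 32 partial products for a 64x64 multiplication can be illustrated as follows: 8 bytes of one operand times 4 sixteen-bit words of the other. This is a sketch of the decomposition only (the function name `mul64_by_8x16` is hypothetical); it says nothing about how the Z80 routine orders or accumulates the partial products.

```python
def mul64_by_8x16(a, b):
    """64x64 -> 128-bit product built from 8x16 -> 24-bit partial products."""
    count = 0
    total = 0
    for i in range(8):                       # bytes of a, least significant first
        byte = (a >> (8 * i)) & 0xFF
        for j in range(4):                   # 16-bit words of b
            word = (b >> (16 * j)) & 0xFFFF
            partial = byte * word            # always fits in 24 bits
            total += partial << (8 * i + 16 * j)
            count += 1
    return total, count


product, n = mul64_by_8x16(0xC90FDAA22168C234, 0xADF85458A2BB4A9A)
print(n, format(product, "032x"))
```

The same split explains the division figure: one quotient byte times the 64-bit denominator is 4 such 8x16 products per step.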

I should also note that I have these timings without code to check for overflow, zero, infinities, or NAN, but I changed exponent storage to make these easier to do. Now if I can get addition and subtraction in this new format, I will be tempted to write some really slow routines for exponentials, logs, and trig functions. Then they will be available until I have time to actually design much faster algorithms.

Also, I have been trying to think of a way to make a "geometric mean" function. Basically, the geometric mean of 'a' and 'b' is sqrt(ab). I could just multiply and then perform the square root, but for the past few months I have had the idea of fusing the two operations into one. The idea is that since an NxN multiplication returns a number of size 2N, I can try to get 2 bits of the product at a time. A restoring square root takes in the next two upper bits at each iteration, so I can feed the bits directly from the multiplication into the square root algorithm without having to store a 128-bit integer, while still getting full precision to 64 bits (instead of multiplying to get a 64-bit approximation and taking the square root of that).
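The square-root half of that idea can be sketched as a routine that consumes its input two bits at a time, most significant first. In the sketch below, `product_bit_pairs` is only a stand-in: it forms the full product and then slices it, which is exactly what a fused routine would avoid (producing the top bits of a product early is the hard part, since carries from the low partial products can still ripple upward). All names are hypothetical.

```python
def product_bit_pairs(a, b, n):
    """Yield the 2n-bit product of two n-bit ints, two bits at a time, MSB first.

    Stand-in only: a fused routine would generate these pairs on the fly.
    """
    prod = a * b
    for shift in range(2 * n - 2, -1, -2):
        yield (prod >> shift) & 3


def sqrt_from_pairs(pairs):
    """Restoring square root that consumes two input bits per iteration."""
    root = 0
    rem = 0
    for pair in pairs:
        rem = (rem << 2) | pair              # bring down the next two bits
        trial = (root << 2) | 1              # 4*root + 1
        root <<= 1
        if rem >= trial:                     # the new root bit is 1
            rem -= trial
            root |= 1
    return root, rem


def geometric_mean(a, b, n=64):
    """floor(sqrt(a*b)) for two n-bit integers."""
    return sqrt_from_pairs(product_bit_pairs(a, b, n))[0]


print(geometric_mean(9, 4, n=4))             # 6
```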

The geometric mean would be useful when combined with the arithmetic mean (a+b)/2, especially if I can get it to be streamlined and fast. There is a class of algorithms that use the Arithmetic-Geometric Mean. This provides the asymptotically fastest known method of computing things like natural log, arctangent, and others.
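For reference, the arithmetic-geometric mean itself is a very short iteration: replace the pair (a, b) with their arithmetic and geometric means until the two agree. The sketch below uses ordinary double-precision floats and a hypothetical `agm` helper with an arbitrary tolerance and iteration cap; it is meant to show the iteration, not an 80-bit implementation.

```python
import math


def agm(a, b, tol=1e-15, max_iter=60):
    """Arithmetic-geometric mean of two positive floats."""
    for _ in range(max_iter):
        if abs(a - b) <= tol * max(a, b):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)  # arithmetic mean, geometric mean
    return a


print(agm(1.0, 2.0))                          # about 1.45679103
```

Convergence is quadratic, so the number of correct digits roughly doubles per step, which is why a fast geometric mean routine matters so much for this family of algorithms.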
