Author Topic: 68k 64-bit multiplication (Read 12820 times)

christop · « **on:** August 03, 2011, 01:47:29 am »

Hi all! I am in search of a fast routine that can multiply two 64-bit unsigned integers and return a 128-bit unsigned integer. Preferably the inputs are in %d0:%d1 and %d2:%d3, and the output is in %d0:%d1:%d2:%d3, but I'll take what I can get if the order is different (perhaps the product can be in %d4:%d5:%d6:%d7).

Does anyone know where I can find such a routine?

Another option is a routine that can multiply two 1.63 fixed-point values and return a 2.62 fixed-point product (or, preferably, normalized to 1.63). Such a routine might be faster than shifting a 128-bit product and dropping the lower 64 bits, as there may be no need to calculate all of the low-order bits--just the ones that influence the upper 64 bits.

On a somewhat related topic, I also need a 64-bit division routine (64 / 64 => 64), also normalized.

In case you're curious what I need this for, I'm writing a floating-point library which uses 64-bit significands, and I have to multiply these significands for the multiplication floating-point operation. The significand can be thought of as a normalized 1.63 fixed-point value, which means the high bit (the integer part) is always 1. As such, the product "p" of two 1.63 fixed-point values (a 2.62 fixed-point value) must be 1.0 <= p < 4.0.

AngelFish · « **Reply #1 on:** August 03, 2011, 02:02:22 am »

A fast routine? You're probably better off doing floating point than 64 bit integer math on a processor with 16 bit multiplication.

Assuming d0 and d2 are the high words:

128 bit result: d0*d2*2^64+(d0*d3+d1*d2)*2^32+d1*d3. This uses 32 bit multiplication, a problem I assume you've already solved.

I don't have a division routine, but it sounds positively painful.

christop · « **Reply #2 on:** August 03, 2011, 10:28:38 am »

I'm actually writing a floating-point library!

I've already figured out the math behind 64-bit multiplication using only 16-bit multiplies:

Code: [Select]

(1000a+100b+10c+1d)*(1000e+100f+10g+1h) =
1000000ae + 100000af + 10000ag + 1000ah +
 100000be +  10000bf +  1000bg +  100bh +
  10000ce +   1000cf +   100cg +   10ch +
   1000de +    100df +    10dg +    1dh

(Each digit represents 16 bits, since I'm multiplying two 16-bit "digits" at a time.)

Here's where each of the half-products falls into place in the final 128-bit product:

Code: [Select]

   ah
   bg
   cf
   de
  ag  dh
  bf ch
  ce dg
 af bh
 be cg
ae  df
00000000

Obviously, if I'm going to use only the top 64 bits (top 4 digits), I have to calculate at least ten half-products. The remaining five half-products may or may not affect the sum, so some of them can possibly not be calculated. It depends on whether their sums carry into the upper 64 bits. The trick is to do this quickly in 68k asm.

FWIW, the TI-AMS floating-point format uses 16 BCD digits in 64 bits, so TI-AMS has to multiply values in 4-bit or 8-bit units. I know PedroM pre-multiplies one of the operands by the values 0 through 9 (using only addition and subtraction), but it still has to perform a lot of multiply instructions. My code uses 80-bit extended-precision IEEE754 floating-point values, so it should be faster, even with all 16 multiply instructions. Plus this will give me more than 19 decimal digits of precision vs 16 decimal digits with TI-AMS floats. On the other hand, float->decimal and decimal->float are slower than with BCD, but I expect those operations to be performed less often than pure math operations.

christop · « **Reply #3 on:** August 06, 2011, 12:06:03 am »

Continuing from my last post, here's what I've gotten so far. I decided to start with a plain 64x64->128 multiplication routine but with a few optimization tricks.

First of all, this routine will have to execute 16 multiply instructions, since there are four 16-bit words per 64-bit value (4*4=16), as I noted in my previous post. To make the rest of this post make more sense, I've numbered all 16 of these 32-bit products:

Code: [Select]

00 11 22 33     ae af ag ah
44 55 66 77  =  be bf bg bh
88 99 aa bb     ce cf cg ch
cc dd ee ff     de df dg dh

The first little optimization I thought to do is to push the first 4 32-bit products onto the stack (storing the 128-bit result on the stack). The first four products are 00 (88|55|22) (dd|aa|77) ff, where (...) provides a few different options in that position. I wanted to minimize the number of swaps needed to multiply these products (swaps are needed to put the appropriate 16-bit value in the lower 16 bits of the register that contains that value), so I chose the products 00 88 dd ff. This sequence requires only two swaps. (I actually calculate them in reverse order--ff dd 88 00--so that 00 ends up at the top of the stack (lowest address) because it's the most significant word, and we're storing values in big endian). From here I was just thinking about calculating the remaining 12 products and add them to the current 128-bit result, with carry, using inline loops (or unrolled loops). Either that, or push each of them onto the stack as individual partial products and add them all with carry.

The second optimization I thought of is to combine or "pack" the products together into partial products like the first one. This gave me the following seven partial products:

Code: [Select]

0088ddff
.44ccee.
.1199bb.
..55aa..
..2277..
...66...
...33...

Then I would add up only seven partial products rather than 13 (or worse, 16). Note that the ordering of the products within each partial product is somewhat arbitrary. I could have used the partial products ..5577.. and ..22aa.. instead of ..55aa.. and ..22aa.., for example. They add up the same either way.

The last optimization tick comes from a recent realization of mine: the maximum product of two unsigned 16-bit values is 0xFFFE0001. What this means is that staggered long (32-bit) values can be added together without carry. Here's an example:

Code: [Select]

FFFE0001
+   FFFE0001
+       FFFE0001
----------------
FFFEFFFFFFFF0001

This addition can be extended to any number of staggered longs without carry.

This optimization resulted in the following layout:

Code: [Select]

first partial product:
0088ddff +
.44ccee.

second partial product:
.1199bb. +
..55aa..

third partial product:
..2277.. +
...66...

fourth partial product:
...33...

The first partial product is calculated by adding the first set of products (00 88 dd ff) and the second set of products (44 cc ee) without carries, since they are staggered from each other. The second partial product is calculated similarly and is then added to the first partial product (using addition with carries). The third and fourth partial products are also calculated and added to the result in turn. Afterwards, the 128-bit result on the stack is popped off into the registers %d0-%d3 (with a single movem.l (%sp)+,%d0-%d3 instruction) and the routine returns.

Using this staggered-add trick, I just have to add up 4 partial products with carry instead of 16 or even 13. Hopefully I can find a few other tricks like this to help speed up the routine even more.

The stack usage of this routine will be 256 bits, or 32 bytes. I might be able to reduce that by storing part of the 128-bit result in registers and part of it on the stack, which will surely complicate it more but might also make it a bit faster as well.

UPDATE: I finally implemented a working multiply routine, almost exactly as I described it here. It takes a bit more than 32 bytes of stack space, though, since it has to save a few registers. My first test was to multiply 0x000000E8D4A51000 (1e12 decimal) by 0x000000000280DE80 (42e6 decimal) to get the product 0x000000000000000246DDF97976680000 (42e18 decimal), which the routine managed to compute correctly after a few bug fixes. I'll need to test some larger values to ensure that all bits are touched with non-zero values.

Right now the routine takes 2326 cycles, which isn't too bad but not that great either. The 16 multiplies account for just over half of that (1184 cycles), so I may be able to reduce the total by a few hundred cycles with some additional optimizing.

Lionel Debroux · « **Reply #4 on:** August 06, 2011, 03:33:36 am »

Yeah, try to store as much stuff as possible in registers, because working with memory slows down quite a bit

christop · « **Reply #5 on:** August 06, 2011, 10:52:57 pm »

As I've been posting here at Cemetech, I have a working routine now. It's not necessarily the most optimized but it's pretty quick (it runs in 1824 cycles).

Here's a screenshot showing a little test program multiplying a few inputs:

If you're so inclined, you can double-check the results yourself.

On a somewhat but not too related note, the Windows calculator sucks. It can't do fractions in "Programmer" mode, and that's the only mode that supports different bases (eg, hexadecimal), which means it can't calculate hexadecimal fractions like 1.6A09E667F. Microsoft must think that programmers don't (or can't) use fractions. I'm a programmer, and I do in fact use fractions at work! It also uses weird keyboard shortcuts like Y for exponentiation or F9 for negate. The only sensible keyboard shortcuts are the keys 0 through 9 and +-/*. Lame.

Fortunately, I use Linux at home, so I don't have these problems at home. gcalctool FTW!

</rant>

AngelFish · « **Reply #6 on:** August 07, 2011, 12:14:15 am »

Um, fractions aren't defined for INTEGER data types, which are all the Windows calculator deals with in programmer mode.

christop · « **Reply #7 on:** August 07, 2011, 01:44:37 am »

Quote from: Qwerty.55 on August 07, 2011, 12:14:15 am

Um, fractions aren't defined for INTEGER data types, which are all the Windows calculator deals with in programmer mode.

That was basically the point of my gripe: that it supports only integers in Programmer mode. It would be nice if it supported fractions in hexadecimal, octal, and binary. (Hexadecimal/octal/binary are not integer types; they are bases).

For example, if I want to see what 2.375 is in binary, I use the standard Gnome calculator (gcalctool) and switch to binary. 2.375 decimal is 10.011 in binary, 2.6 in hexadecimal, or 2.3 in octal. It's really handy for calculating fixed-point values, for example (which is a common task in some areas of PROGRAMMING), but the Windows calculator can't do this basic task.

fb39ca4 · « **Reply #8 on:** August 07, 2011, 01:46:55 am »

It would be nice to have a hexadecimal calculator that actually supports a decimal point, it would be very useful for fixed point math.

ninja'd it seems

AngelFish · « **Reply #9 on:** August 07, 2011, 05:52:58 am »

Quote from: christop on August 07, 2011, 01:44:37 am

Quote from: Qwerty.55 on August 07, 2011, 12:14:15 am
Um, fractions aren't defined for INTEGER data types, which are all the Windows calculator deals with in programmer mode.
That was basically the point of my gripe: that it supports only integers in Programmer mode. It would be nice if it supported fractions in hexadecimal, octal, and binary. (Hexadecimal/octal/binary are not integer types; they are bases).

For example, if I want to see what 2.375 is in binary, I use the standard Gnome calculator (gcalctool) and switch to binary. 2.375 decimal is 10.011 in binary, 2.6 in hexadecimal, or 2.3 in octal. It's really handy for calculating fixed-point values, for example (which is a common task in some areas of PROGRAMMING), but the Windows calculator can't do this basic task.

I know that hexadecimal and the rest are bases. I've worked in some pretty crazy bases before (base -5i, anyone?)

All I was saying was that the Windows calculator only handles one data type in that particular mode and the operations you'd like aren't defined for that data type. Calculator is like Notepad. It works, but it doesn't have a lot of nice features. I'm frankly surprised they even put Integer data types in there. Personally, I never need more than integer types, but a cursory glance at Google shows some calculators you can use.

harold · « **Reply #10 on:** August 07, 2011, 06:58:42 am »

I use wolfram alpha for that, it can't do imaginary bases though

fb39ca4 · « **Reply #11 on:** August 07, 2011, 02:28:32 pm »

How do you work in base -5i

AngelFish · « **Reply #12 on:** August 07, 2011, 03:54:58 pm »

With a calculator handy

The basic fact you end up using is that any number x can be represented in base z as ∑a_izⁱ

Horrowind · « **Reply #13 on:** August 07, 2011, 08:00:42 pm »

so 17 = -(4i * (5i)¹ + 3 * (5i)⁰) = -[4i, 3]_5i?

I couldn't come up with an expression which used natural numbers lower then 5 only, but I may be missing something...

calcdude84se · « **Reply #14 on:** August 07, 2011, 08:42:35 pm »

Quote from: Qwerty.55 on August 07, 2011, 03:54:58 pm

With a calculator handy

The basic fact you end up using is that any number x can be represented in base z as ∑a_izⁱ

With the exception of when z = 0 or |z|=1, of course

Author Topic: 68k 64-bit multiplication (Read 12820 times)

christop

68k 64-bit multiplication

AngelFish

Re: 68k 64-bit multiplication

christop

Re: 68k 64-bit multiplication

christop

Re: 68k 64-bit multiplication

Lionel Debroux

Re: 68k 64-bit multiplication

christop

Re: 68k 64-bit multiplication

AngelFish

Re: 68k 64-bit multiplication

christop

Re: 68k 64-bit multiplication

fb39ca4

Re: 68k 64-bit multiplication

AngelFish

Re: 68k 64-bit multiplication

harold

Re: 68k 64-bit multiplication

fb39ca4

Re: 68k 64-bit multiplication

AngelFish

Re: 68k 64-bit multiplication

Horrowind

Re: 68k 64-bit multiplication

calcdude84se

Re: 68k 64-bit multiplication