Author Topic: 24 bit multiplication (Read 15673 times)

ACagliano · « **Reply #30 on:** December 11, 2011, 01:41:47 pm »

Ok. I am particularly interested now in 2-byte multiplication and 4-byte square rooting. How would they be done?

jacobly · « **Reply #31 on:** December 11, 2011, 02:06:55 pm »

// Multiply a times b
temp = 0
repeat for each bit in a
 temp <<= 1
 if (high bit of a set)
  temp += b
 a <<= 1
return temp

If a and b are 2 bytes, temp is 4 bytes, and you loop 16 times.
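For anyone who wants to step through the pseudocode above on a PC first, here is a minimal Python sketch of it. The function name `mul_shift_add` and the `bits` parameter are made up for illustration; the loop itself follows the pseudocode line by line.

```python
def mul_shift_add(a, b, bits=16):
    """MSB-first shift-and-add multiply, mirroring the pseudocode above."""
    mask = (1 << bits) - 1
    temp = 0
    for _ in range(bits):
        temp <<= 1                    # temp <<= 1
        if a & (1 << (bits - 1)):     # if (high bit of a set)
            temp += b                 #   temp += b
        a = (a << 1) & mask           # a <<= 1, kept to `bits` bits
    return temp

print(hex(mul_shift_add(0x1234, 0x5678)))
```

With 16-bit inputs the result never needs more than 32 bits, which is why temp is 4 bytes.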

Code: [Select]

// Sqrt a
temp = high byte of a
a <<= 8
b = 0
repeat for every 2 bits in a
 test = (b << 8) + 0x40
 b <<= 1
 if (temp >= test)
  temp -= test
  set low bit of b
 temp = (temp << 2) + high 2 bits of a
 a <<= 2
return b

If a is 4 bytes, then b is 2 bytes, temp needs 3 bytes, and you loop 16 times.
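A minimal Python sketch of the square root pseudocode, for checking it against known values. Two things here are my reading rather than something stated outright: `test` is taken as `(b << 8) + 0x40`, and temp is shifted left by two before the next two bits of a are added in, which is what the Axe routine quoted later in the thread does with its shifts. The name `isqrt32` is hypothetical.

```python
def isqrt32(a):
    """Integer square root of a 32-bit value, two input bits per iteration."""
    temp = a >> 24                    # temp = high byte of a
    a = (a << 8) & 0xFFFFFFFF         # a <<= 8
    b = 0
    for _ in range(16):               # one result bit per pair of input bits
        test = (b << 8) + 0x40        # (4*b + 1), scaled up by 64
        b <<= 1
        if temp >= test:
            temp -= test
            b |= 1                    # set low bit of b
        temp = (temp << 2) | (a >> 30)    # bring in the high 2 bits of a
        a = (a << 2) & 0xFFFFFFFF         # a <<= 2
    return b

print(isqrt32(0xFFFFFFFF))
```

In this model temp can grow past 16 bits for large inputs, so a Z80 version needs three bytes for it (the Axe routine uses HL plus A).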

ACagliano · « **Reply #32 on:** December 11, 2011, 02:23:11 pm »

Quote from: jacobly on December 11, 2011, 02:06:55 pm

Spoiler For code:
~~stolen~~ from Axe

Code: [Select]

p_MulFull:
 ; Input in hl, result in cahl
 ld c,h
 ld a,l
 ld hl,0 ;11
 ld b,16 ;7
__MulFullNext:
 add hl,hl ;11
 rla ;4
 rl c ;8
 jr nc,__MulFullSkip ;12/7
 add hl,de ;11
 adc a,0 ;7
 jr nc,__MulFullSkip
 inc c
__MulFullSkip:
 djnz __MulFullNext
 ret
__MulFullEnd:

Spoiler For code:
~~stole~~ my own routine from axe (and modified it)

Code: [Select]

p_Sqrt88:
 ; input in hlde, result in de
 ld b,16
 ld a,h
 ld c,l
 push de ; ld ixh,d
 pop ix ; ld ixl,e
 ld de,0
 ld h,d
 ld l,e
__Sqrt88Loop:
 sub $40
 sbc hl,de
 jr nc,__Sqrt88Skip
 add a,$40
 adc hl,de
__Sqrt88Skip:
 ccf
 rl e
 rl d
 add ix,ix
 rl c
 rla
 adc hl,hl
 add ix,ix
 rl c
 rla
 adc hl,hl
 djnz __Sqrt88Loop
 ret
__Sqrt88End:

Kool. Thanks. But, shouldn't the first one have two inputs?

Xeda112358 · « **Reply #33 on:** December 11, 2011, 02:30:16 pm »

So with two-byte multiplication, you can take advantage of the fact that add hl,hl is the same as shifting hl left. It even gives you the carry! So in this case:

Code: [Select]

     ld hl,0
     ld a,16
MultLoop:
     add hl,hl      ;shifts hl left
     rl e \ rl d    ;shifts de left and if hl overflowed, it overflows into de
     jr nc,$+6      ;if the bit shifted out of DE is 0, skip this chunk
       add hl,bc    ;add bc to hl (think of this as the first number)
       jr nc,$+3    ;overflow into de
         inc de
     dec a
     jr nz,MultLoop
     ret

That will multiply DE times BC and return the result in DEHL. I will see if I can port a square root routine for 32-bit...

EDIT: changed inc e to inc de
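To see why the EDIT matters, here is a register-level Python model of the routine above. It is only a sketch: the function name and the `carry_into_de` switch are invented for the comparison, and the 8-bit variant is my reconstruction of what `inc e` would have done. A carry out of HL has to ripple through all 16 bits of DE, so an 8-bit `inc e` loses it whenever E is $FF at that moment.

```python
def de_times_bc(de, bc, carry_into_de=True):
    """Model of the DE*BC -> DEHL loop. carry_into_de=False mimics 'inc e'."""
    hl = 0
    for _ in range(16):
        hl <<= 1                      # add hl,hl
        cy = hl >> 16
        hl &= 0xFFFF
        de = (de << 1) | cy           # rl e \ rl d
        cy = de >> 16
        de &= 0xFFFF
        if cy:                        # a 1 bit was shifted out of DE
            hl += bc                  # add hl,bc
            if hl > 0xFFFF:           # carry out of HL
                hl &= 0xFFFF
                if carry_into_de:
                    de = (de + 1) & 0xFFFF                   # inc de
                else:
                    de = (de & 0xFF00) | ((de + 1) & 0xFF)   # inc e
    return (de << 16) | hl

print(hex(de_times_bc(0x8080, 0xFFFF)))
print(hex(de_times_bc(0x8080, 0xFFFF, carry_into_de=False)))
```

The two printed values differ: on the ninth pass E holds $FF when HL carries, so the 8-bit increment wraps E to 0 without touching D.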

jacobly · « **Reply #34 on:** December 11, 2011, 02:48:20 pm »

Quote from: ACagliano on December 11, 2011, 02:23:11 pm

Quote from: jacobly on December 11, 2011, 02:06:55 pm
Spoiler For code:
~~stolen~~ from Axe

Code: [Select]

p_MulFull:
 ; Input in hl and de, result in cahl
 ld c,h
 ld a,l
 ld hl,0 ;11
 ld b,16 ;7
__MulFullNext:
 add hl,hl ;11
 rla ;4
 rl c ;8
 jr nc,__MulFullSkip ;12/7
 add hl,de ;11
 adc a,0 ;7
 jr nc,__MulFullSkip
 inc c
__MulFullSkip:
 djnz __MulFullNext
 ret
__MulFullEnd:


Kool. Thanks. But, shouldn't the first one have two inputs?

Of course, hl and de, isn't that what I said

FloppusMaximus · « **Reply #35 on:** December 11, 2011, 04:41:31 pm »

Quote from: jacobly on December 08, 2011, 08:35:58 pm

My first multiplication routine takes 2746 - 4570 cycles, the second takes 1680 - 2880 cycles.

Oh boy, optimization time

The best I have so far is somewhere around 1800 cycles average (I'm too lazy to work out the exact probabilities at the moment, and not counting memory delays) using a squaring table and undocumented IX instructions. Input is BDE and CHL, output is BCDEAL. This routine works by expanding the formula 2xy = x²+y²-|x-y|², summed over each of the 9 pairs of bytes in the input.

(I'm not saying this is practical - unless you really have thousands of 24-bit multiplications to perform, you don't need this kind of speed. This is just for fun.)
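The identity is easy to sanity-check off-calculator. The sketch below is plain Python, not a translation of the routine: `SQR` stands in for the squaring table and `mul24_by_squares` is a name made up here. It sums x²+y²-|x-y|² over the 9 byte pairs, each weighted by its byte position, then halves the total.

```python
SQR = [i * i for i in range(256)]     # stand-in for the squaring table

def mul24_by_squares(x, y):
    """24x24-bit multiply using only table lookups, adds, subtracts, shifts."""
    xs = [(x >> (8 * i)) & 0xFF for i in range(3)]
    ys = [(y >> (8 * j)) & 0xFF for j in range(3)]
    total = 0                         # accumulates 2*x*y
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            two_xy = SQR[xi] + SQR[yj] - SQR[abs(xi - yj)]   # 2 * xi * yj
            total += two_xy << (8 * (i + j))
    return total >> 1                 # total is always even

print(hex(mul24_by_squares(0x123456, 0xABCDEF)))
```

The x² terms taken together equal (Σ xᵢ²·256ⁱ) × $10101, since each xᵢ² appears at three byte offsets, and the same holds for the y² terms. That looks like the job of the BC_DE_HL_times_10101 helper, though that is my reading of the code.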

Code: [Select]

SUBFIRST .macro src1, src2, hdest, ldest
 exx
 ld a, src1
 sub src2
 jr nc, $ + 4
 neg
 exx
 ld l, a
 ld a, ldest
 sub (hl)
 ld ldest, a
 inc h
 ld a, hdest
 sbc a, (hl)
 ld hdest, a
  .endm

SUBNEXT .macro src1, src2, hdest, ldest
 dec h
 ex af, af'
 exx
 ld a, src1
 sub src2
 jr nc, $ + 4
 neg
 exx
 ld l, a
 ex af, af'
 ld a, ldest
 sbc a, (hl)
 ld ldest, a
 inc h
 ld a, hdest
 sbc a, (hl)
 ld hdest, a
  .endm

BDE_times_CHL_sqrdiff_v3:
 ld a, d
 exx
 ld h, high(sqrtab)
 ld l, a
 ld e, (hl)
 inc h
 ld d, (hl)  ; DE = d²
 exx
 ld a, b
 exx
 ld l, a
 ld b, (hl)
 dec h
 ld c, (hl)  ; BC = b²
 exx
 ld a, e
 exx
 ld l, a
 ld a, (hl)
 inc h
 ld h, (hl)
 ld l, a   ; HL = e²
 call BC_DE_HL_times_10101
 push bc
  push hl
   push de
    exx
    ld a, h
    exx
    ld h, high(sqrtab)
    ld l, a
    ld e, (hl)
    inc h
    ld d, (hl)  ; DE = h²
    exx
    ld a, c
    exx
    ld l, a
    ld b, (hl)
    dec h
    ld c, (hl)  ; BC = c²
    exx
    ld a, l
    exx
    ld l, a
    ld a, (hl)
    inc h
    ld h, (hl)
    ld l, a  ; HL = l²
    call BC_DE_HL_times_10101
    pop ix
   add ix, de
   pop de
  adc hl, de
  ex de, hl
  pop hl
 adc hl, bc
 ld b, h
 ld c, l   ; BCDEIX = total
 push af

  ld h, high(sqrtab)
  SUBFIRST e, l, ixh, ixl
  SUBNEXT  d, h, d, e
  SUBNEXT  b, c, b, c
  jp nc, BDE_times_CHL_sqrdiff_v3_nc1
  pop af
 ccf
 push af
BDE_times_CHL_sqrdiff_v3_nc1:

  inc b

  dec h
  SUBFIRST e, h, e, ixh
  SUBNEXT  d, c, c, d
  jr nc, BDE_times_CHL_sqrdiff_v3_nc2
  dec b
  jp nz, BDE_times_CHL_sqrdiff_v3_nc2
  pop af
 ccf
 push af
BDE_times_CHL_sqrdiff_v3_nc2:

  dec h
  SUBFIRST d, l, e, ixh
  SUBNEXT  b, h, c, d
  jr nc, BDE_times_CHL_sqrdiff_v3_nc3
  dec b
  jp nz, BDE_times_CHL_sqrdiff_v3_nc3
  pop af
 ccf
 push af
BDE_times_CHL_sqrdiff_v3_nc3:

  inc c

  dec h
  SUBFIRST b, l, d, e
  jr nc, BDE_times_CHL_sqrdiff_v3_nc4
  dec c
  jp nz, BDE_times_CHL_sqrdiff_v3_nc4
  dec b
  jp nz, BDE_times_CHL_sqrdiff_v3_nc4
  pop af
 ccf
 push af
BDE_times_CHL_sqrdiff_v3_nc4:

  dec h
  SUBFIRST e, c, d, e
  pop hl
 jr nc, BDE_times_CHL_sqrdiff_v3_nc5
 dec c
 jp nz, BDE_times_CHL_sqrdiff_v3_nc5
 dec b
 jp nz, BDE_times_CHL_sqrdiff_v3_nc5
 inc l
BDE_times_CHL_sqrdiff_v3_nc5:

 dec b
 dec c

 rr l
 rr b
 rr c
 rr d
 rr e
 ld a, ixl
 ld l, a
 ld a, ixh
 rra
 rr l
 ret


BC_DE_HL_times_10101:
 push bc
  ld a, h
  ex af, af'
  sub a
  ld c, a
  ld b, l
  add hl, bc
  adc a, a
  ld b, e
  add hl, bc
  adc a, c  ; AHL = [ L+H+E L ]
  pop bc
 push hl
  push bc
   ld c, a
   ld b, 0
   ex af, af'
   ld h, a
   add hl, bc  ; no way this can carry (initial HL is a square)
   ld c, a
   ld b, e
   sub a
   add hl, bc
   adc a, a  ; AHL(SP+2) = [ H+E L+H L+H+E L ]
   add hl, de
   adc a, 0  ; AHL(SP+2) = [ H+E+D L+H+E L+H+E L ]
   pop bc
  add hl, bc
  adc a, 0  ; AHL(SP) = [ H+E+D+B L+H+E+C L+H+E L ]
  ld e, d
  ld d, c
  add hl, de
  adc a, b
  jr nc, BC_DE_HL_times_10101_nc1
  inc b   ; BAHL(SP) = [ B B H+E+D+C+B L+H+E+D+C L+H+E L ]
BC_DE_HL_times_10101_nc1:
  add a, e
  jr nc, BC_DE_HL_times_10101_nc2
  inc b   ; BAHL(SP) = [ B D+B H+E+D+C+B L+H+E+D+C L+H+E L ]
BC_DE_HL_times_10101_nc2:
  pop de
 add a, c
 ld c, a
 ret nc
 inc b   ; BCHLDE = [ B D+C+B H+E+D+C+B L+H+E+D+C L+H+E L ]
 ret

To get back to the topic somewhat, ACagliano, it sounds like you're more interested in squaring than in general multiplication. Squaring can be considerably faster, especially if you use a lookup table (e.g., my best 16-bit squaring routine is around 170 cycles, versus around 800 for general multiplication.)
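As a rough illustration of why squaring is cheaper with a table (this is one possible decomposition, not necessarily the routine mentioned above): write x = 256h + l, so x² = 65536h² + 256·(2hl) + l², and get 2hl from squares as h² + l² - (h-l)². That is three table lookups and a few additions. The names `SQR` and `square16` are hypothetical.

```python
SQR = [i * i for i in range(256)]     # 256-entry table of 16-bit squares

def square16(x):
    """Square a 16-bit value with three lookups, no multiplication loop."""
    h, l = x >> 8, x & 0xFF
    cross = SQR[h] + SQR[l] - SQR[abs(h - l)]    # equals 2*h*l
    return (SQR[h] << 16) + (cross << 8) + SQR[l]

print(square16(0xFFFF) == 0xFFFF ** 2)
```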

cerzus69 · « **Reply #36 on:** December 12, 2011, 10:43:43 am »

Quote from: jacobly on December 07, 2011, 11:05:18 pm

I do have a 24-bit floating-point multiplication routine

saved 2 bytes, 1149 cycles saved by using iy too (and in a way compatible with TIOS, imagine that)
Code: [Select]

 ; hldebc = hlc * bde
 ld (iy+asm_Flag1),b
 xor a
 ld ix,0
 ld b,24
Loop:
 add ix,ix
 rla
 rl c
 adc hl,hl
 jr nc,Next
 add ix,de
 adc a,(iy+asm_Flag1)
 rl c
 jr nc,Next
 inc hl
Next:
 djnz Loop
 ld e,a
 ld d,c
 push ix ; ld c,ixl
 pop bc ; ld b,ixh

Jacobly, are you sure this works? I've been going through the code, and it seems to me that the second 'rl c' should instead be 'add the carry flag to c'. Two 'rl c' instructions per loop seems wrong to me. Could you explain, please? I've also tried it in wabbitemu, taking the different input registers into account, and it is still not doing the right thing.

ACagliano · « **Reply #37 on:** December 12, 2011, 02:35:39 pm »

Yeah, all I need is 16-bit subtraction (which 'sub' supports, I think), 16-bit squaring, 32-bit addition, then 32-bit square rooting (or will I need to go up to 40-bit?).

Xeda112358 · « **Reply #38 on:** December 12, 2011, 02:48:12 pm »

16-bit subtraction

Code: [Select]

or a     ;to make sure the c flag is reset. Not always necessary if you know the c flag will be reset
sbc hl,bc  ;you can do sbc hl,de also.

32-bit addition (you mean two 32-bit inputs?)

Code: [Select]

;Inputs:
;     HLBC is one of the 32-bit inputs
;     DE points to the other 32-bit input in RAM
;Outputs:
;     HLBC is the 32-bit result
;     DE is incremented 3 times
;     A=H
;     c flag is set if there is an overflow
     ld a,(de) \ inc de
     add a,c \ ld c,a
     ld a,(de) \ inc de
     adc a,b \ ld b,a
     ld a,(de) \ inc de
     adc a,l \ ld l,a
     ld a,(de)
     adc a,h \ ld h,a
     ret
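The same byte-by-byte pattern in Python, in case it helps to see the carry chain spelled out. This is a sketch with made-up names: `hlbc` is the register value as one 32-bit integer and `mem` is the four bytes DE points to, low byte first, as the routine reads them.

```python
def add32(hlbc, mem):
    """Add four little-endian bytes to a 32-bit value, one byte at a time."""
    regs = [(hlbc >> (8 * i)) & 0xFF for i in range(4)]   # C, B, L, H
    carry = 0                         # the first add ignores any incoming carry
    out = 0
    for i in range(4):
        s = regs[i] + mem[i] + carry  # add a,c, then adc for the next three
        out |= (s & 0xFF) << (8 * i)
        carry = s >> 8                # becomes the c flag on the last pass
    return out, carry

print(add32(0xFFFFFFFF, [1, 0, 0, 0]))
```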

Squaring and square rooting... I will think on it

Also, I am working on a mini math library that will include RAM based math (so all the values will be in RAM). It seems like a few of these commands will need to rely on some memory. If they do, I suggest using the OP registers (11 bytes of RAM each).

jacobly · « **Reply #39 on:** December 12, 2011, 07:07:45 pm »

Quote from: cerzus69 on December 12, 2011, 10:43:43 am

Jacobly, are you sure this works because I've been going through the code and it seems to me like the second 'rl c' should instead be 'add carry flag to c'. 2 times 'rl c' per loop seems wrong to me. Could you explain please? Because I've tried it as well in wabbitemu, taking the different input in account, and it is still not doing the right thing.

That's strange. My test program must not have been working right, because when I went back and changed it a bit, it suddenly started telling me that the second routine doesn't work.

Anyway, my new test program seems to agree with this change.

Code: [Select]

 ; hldebc = hlc * bde
 ld (iy+asm_Flag1),b
 xor a
 ld ix,0
 ld b,24
Loop:
 add ix,ix
 rla
 rl c
 adc hl,hl
 jr nc,Next
 add ix,de
 adc a,(iy+asm_Flag1)
 jr nc,Next
 inc c
 jr nz,Next
 inc hl
Next:
 djnz Loop
 ld e,a
 ld d,c
 push ix ; ld c,ixl
 pop bc ; ld b,ixh
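For anyone who wants to check the fix without an emulator, here is a register-level Python model of the loop above. It is a sketch only: the function name and the `old_rl_c` switch are mine, and the switch re-creates the earlier version with the second `rl c` so the two can be compared. The point of the fix is that a carry out of A must be added to C (and on into HL if C wraps), not rotated into it.

```python
def mul24(hlc, bde, old_rl_c=False):
    """hldebc = hlc * bde, modelled on the 48-bit register HL:C:A:IX."""
    hl, c = hlc >> 8, hlc & 0xFF
    b_hi, de = bde >> 16, bde & 0xFFFF        # b_hi lives at (iy+asm_Flag1)
    a = ix = 0
    for _ in range(24):
        ix <<= 1; cy = ix >> 16; ix &= 0xFFFF             # add ix,ix
        a = (a << 1) | cy; cy = a >> 8; a &= 0xFF         # rla
        c = (c << 1) | cy; cy = c >> 8; c &= 0xFF         # rl c
        hl = (hl << 1) | cy; cy = hl >> 16; hl &= 0xFFFF  # adc hl,hl
        if cy:                                            # multiplier bit was 1
            ix += de; cy = ix >> 16; ix &= 0xFFFF         # add ix,de
            a += b_hi + cy; cy = a >> 8; a &= 0xFF        # adc a,(iy+asm_Flag1)
            if old_rl_c:
                c = (c << 1) | cy; cy = c >> 8; c &= 0xFF # second rl c
                if cy:
                    hl = (hl + 1) & 0xFFFF
            elif cy:
                c = (c + 1) & 0xFF                        # inc c
                if c == 0:
                    hl = (hl + 1) & 0xFFFF                # inc hl
    return (hl << 32) | (c << 24) | (a << 16) | ix        # HL:D:E:BC

print(hex(mul24(3, 0x800000)), hex(mul24(3, 0x800000, old_rl_c=True)))
```

With inputs 3 and $800000 the fixed loop gives $1800000, while the old one shifts C an extra time and gives $2800000.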

cerzus69 · « **Reply #40 on:** December 13, 2011, 11:06:38 am »

Quote from: jacobly on December 12, 2011, 07:07:45 pm

That's strange. My test program must not have been working right, because when I went back and changed it a bit, it suddenly started telling me that the second routine doesn't work.
Anyway, my new test program seems to agree with this change.

Cool, thanks a lot, indeed it works now!
