Author Topic: eZ80 Optimized Routines (Read 15899 times)

jcochran · « **Reply #15 on:** October 11, 2021, 12:28:05 pm »

The mult16 routine from Xeda112358 has a bug, and can be made one byte shorter.

To demonstrate bug, multiply 0ffffh by 0ffffh.
Correct result = 0fffe0001
Incorrect result = 0feff0001

New routine.

Code: [Select]

mul16:
;;Expects Z80 mode
;;Inputs: hl,bc
;;Outputs: Upper 16 bits in DE, lower 16 bits in BC
;;53 or 55 t-states
;;29 bytes 
    ld d,c
    ld e,l
    mlt de
    push de     ; Push low product onto stack.
    ld d,h      ; Prepare for high product
    ld e,b
    ld h,e      ; Swap bytes for 2 middle products. Do here when I have copies of high values
    ld b,d      ; instead of later, needing A as temp.
    mlt de      ; Calculate high product
    mlt hl      ; Calculate first middle product
    mlt bc      ; Calculate second middle product
    add hl,bc   ; Sum middle products
    jr nc,$+3
    inc d       ; Inc D if carry.  (The INC DE was a bug).
    pop bc      ; Retrieve low product.
    ld a,b      ; Add in middle products to create final answer.
    add a,l
    ld b,a
    ld a,e
    adc a,h
    ld e,a
    ret nc
    inc d
    ret

Xeda112358 · « **Reply #16 on:** October 11, 2021, 12:32:17 pm »

Oh, fantastic! Good catch!

jcochran · « **Reply #17 on:** October 12, 2021, 04:18:27 pm »

If you don't insist on the current register usage, you can get a bit shorter and faster.

Code: [Select]

; HLDE = HL * DE
; 52 cc, 28 bytes
Mult16:
 ld b,h
 ld c,d
 mlt bc
 push bc
 ld b,l
 ld c,e
 ld l,c
 ld e,b
 mlt bc
 mlt de
 mlt hl
 xor a
 add hl,de
 ld e,h
 ld h,l
 ld l,a
 adc a,a
 ld d,a
 add hl,bc
 ex de,hl
 pop bc
 adc hl,bc
 ret

Unfortunately, I don't see any way to make it shorter or faster.

jcochran · « **Reply #18 on:** October 27, 2021, 10:36:07 am »

Was wondering how "optimized" the multiply 32 routine was using the karatsuba algorithm vs a straight forward multiply. And the answer is the straight forward multiply is significantly faster.

Following routine was written in several stages.
1. Write straight forward multiple, doing the 16 sub products from least significant to most significant, using AHL as a running sum.
2. For 3 of the additions, it's impossible for a carry to propagate from the 16 bit add. This allowed for the adc a,a to be deleted, resulting in 3 bytes and cycles saved.
3. Since the B and C registers weren't used, decided to use them to cache a few values from the numbers being multiplied. This saves 20 bytes and 32 cycles.
4. Finally, a minor modification was made for the 30 product. Saves 1 byte.

Code: [Select]

; (IX+8) = (IX+0)*(IX+4)
;
; Register usage
; AHL = running sum for result.
;       Periodically, L is saved as next byte and sum shifted right by 1 byte.
; DE = Next sub product to be added
; B  = Cached value of (ix+0) and later (ix+7)
; C  = Cached value of (ix+4) and later (ix+3)
; IX = pointer to values being multiplied and result.
;
; 178 bytes
; 275 clock cycles

Mult32:
 xor a
; Calc byte 0
 ld b,(ix+0)
 ld h,b  ; Prod 00
 ld c,(ix+4)
 ld l,c
 mlt hl
 ld (ix+8),l
 ld l,h
 ld h,a

; Calc byte 1
 ld d,b  ; Prod 01
 ld e,(ix+5)
 mlt de
 add hl,de  ; No carry possible
 ld d,(ix+1) ; Prod 10
 ld e,c
 mlt de
 add hl,de
 adc a,a
 ld (ix+9),l
 ld l,h
 ld h,a
 xor a

; Calc byte 2
 ld d,b  ; Prod 02
 ld e,(ix+6)
 mlt de
 add hl,de  ; No carry possible
 ld d,(ix+2) ; Prod 20
 ld e,c
 mlt de
 add hl,de
 adc a,a
 ld d,(ix+1) ; Prod 11
 ld e,(ix+5)
 mlt de
 add hl,de
 adc a,0
 ld (ix+10),l
 ld l,h
 ld h,a
 xor a

; Calc byte 3
 ld d,b  ; Prod 03 (Last use of cached (ix+0) in b)
 ld b,(ix+7) ; Cache (ix+7)
 ld e,b
 mlt de
 add hl,de
 adc a,a
 ld de,(ix+3) ; Prod 30. Minor optimization that saves a byte
 ld c,e  ; Cache (ix+3)
 mlt de
 add hl,de
 adc a,0
 ld d,(ix+1) ; Prod 12
 ld e,(ix+6)
 mlt de
 add hl,de
 adc a,0
 ld d,(ix+2) ; Prod 21
 ld e,(ix+5)
 mlt de
 add hl,de
 adc a,0
 ld (ix+11),l
 ld l,h
 ld h,a
 xor a

; Calc byte 4
 ld d,c  ; Prod 31
 ld e,(ix+5)
 mlt de
 add hl,de
 adc a,a
 ld d,(ix+1) ; Prod 13
 ld e,b
 mlt de
 add hl,de
 adc a,0
 ld d,(ix+2) ; Prod 22
 ld e,(ix+6)
 mlt de
 add hl,de
 adc a,0
 ld (ix+12),l
 ld l,h
 ld h,a
 xor a

; Calc byte 5
 ld d,(ix+2) ; Prod 23
 ld e,b
 mlt de
 add hl,de
 adc a,a
 ld d,c  ; Prod 32
 ld e,(ix+6)
 mlt de
 add hl,de
 adc a,0
 ld (ix+13),l
 ld l,h
 ld h,a

; Calc bytes 6 & 7
 mlt bc  ; Prod 33
 add hl,bc  ; No carry possible
 ld (ix+14),hl
 ret

Author Topic: eZ80 Optimized Routines (Read 15899 times)

jcochran

Re: eZ80 Optimized Routines

Xeda112358

Re: eZ80 Optimized Routines

jcochran

Re: eZ80 Optimized Routines

jcochran

Re: eZ80 Optimized Routines