### Author Topic: eZ80 Optimized Routines  (Read 8107 times)

0 Members and 1 Guest are viewing this topic.

#### jcochran

• LV0 Newcomer (Next: 5)
• Posts: 3
• Rating: +0/-0
##### Re: eZ80 Optimized Routines
« Reply #15 on: October 11, 2021, 12:28:05 pm »
The mult16 routine from Xeda112358 has a bug, and can be made one byte shorter.

To demonstrate bug, multiply 0ffffh by 0ffffh.
Correct result = 0fffe0001
Incorrect result = 0feff0001

New routine.
Code: [Select]
mul16:
;;Expects Z80 mode
;;Inputs: hl,bc
;;Outputs: Upper 16 bits in DE, lower 16 bits in BC
;;53 or 55 t-states
;;29 bytes
ld d,c
ld e,l
mlt de
push de     ; Push low product onto stack.
ld d,h      ; Prepare for high product
ld e,b
ld h,e      ; Swap bytes for 2 middle products. Do here when I have copies of high values
ld b,d      ; instead of later, needing A as temp.
mlt de      ; Calculate high product
mlt hl      ; Calculate first middle product
mlt bc      ; Calculate second middle product
add hl,bc   ; Sum middle products
jr nc,\$+3
inc d       ; Inc D if carry.  (The INC DE was a bug).
pop bc      ; Retrieve low product.
ld b,a
ld a,e
ld e,a
ret nc
inc d
ret

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4685
• Rating: +718/-6
• Calc-u-lator, do doo doo do do do.
##### Re: eZ80 Optimized Routines
« Reply #16 on: October 11, 2021, 12:32:17 pm »
Oh, fantastic! Good catch!

#### jcochran

• LV0 Newcomer (Next: 5)
• Posts: 3
• Rating: +0/-0
##### Re: eZ80 Optimized Routines
« Reply #17 on: October 12, 2021, 04:18:27 pm »
If you don't insist on the current register usage, you can get a bit shorter and faster.
Code: [Select]
; HLDE = HL * DE
; 52 cc, 28 bytes
Mult16:
ld b,h
ld c,d
mlt bc
push bc
ld b,l
ld c,e
ld l,c
ld e,b
mlt bc
mlt de
mlt hl
xor a
ld e,h
ld h,l
ld l,a
ld d,a
ex de,hl
pop bc
ret

Unfortunately, I don't see any way to make it shorter or faster.

#### jcochran

• LV0 Newcomer (Next: 5)
• Posts: 3
• Rating: +0/-0
##### Re: eZ80 Optimized Routines
« Reply #18 on: October 27, 2021, 10:36:07 am »
Was wondering how "optimized" the multiply 32 routine was using the karatsuba algorithm vs a straight forward multiply. And the answer is the straight forward multiply is significantly faster.

Following routine was written in several stages.
1. Write straight forward multiple, doing the 16 sub products from least significant to most significant, using AHL as a running sum.
2. For 3 of the additions, it's impossible for a carry to propagate from the 16 bit add. This allowed for the adc a,a to be deleted, resulting in 3 bytes and cycles saved.
3. Since the B and C registers weren't used, decided to use them to cache a few values from the numbers being multiplied. This saves 20 bytes and 32 cycles.
4. Finally, a minor modification was made for the 30 product. Saves 1 byte.

Code: [Select]
; (IX+8) = (IX+0)*(IX+4)
;
; Register usage
; AHL = running sum for result.
;       Periodically, L is saved as next byte and sum shifted right by 1 byte.
; DE = Next sub product to be added
; B  = Cached value of (ix+0) and later (ix+7)
; C  = Cached value of (ix+4) and later (ix+3)
; IX = pointer to values being multiplied and result.
;
; 178 bytes
; 275 clock cycles

Mult32:
xor a
; Calc byte 0
ld b,(ix+0)
ld h,b ; Prod 00
ld c,(ix+4)
ld l,c
mlt hl
ld (ix+8),l
ld l,h
ld h,a

; Calc byte 1
ld d,b ; Prod 01
ld e,(ix+5)
mlt de
add hl,de ; No carry possible
ld d,(ix+1) ; Prod 10
ld e,c
mlt de
ld (ix+9),l
ld l,h
ld h,a
xor a

; Calc byte 2
ld d,b ; Prod 02
ld e,(ix+6)
mlt de
add hl,de ; No carry possible
ld d,(ix+2) ; Prod 20
ld e,c
mlt de
ld d,(ix+1) ; Prod 11
ld e,(ix+5)
mlt de
ld (ix+10),l
ld l,h
ld h,a
xor a

; Calc byte 3
ld d,b ; Prod 03 (Last use of cached (ix+0) in b)
ld b,(ix+7) ; Cache (ix+7)
ld e,b
mlt de
ld de,(ix+3) ; Prod 30. Minor optimization that saves a byte
ld c,e ; Cache (ix+3)
mlt de
ld d,(ix+1) ; Prod 12
ld e,(ix+6)
mlt de
ld d,(ix+2) ; Prod 21
ld e,(ix+5)
mlt de
ld (ix+11),l
ld l,h
ld h,a
xor a

; Calc byte 4
ld d,c ; Prod 31
ld e,(ix+5)
mlt de
ld d,(ix+1) ; Prod 13
ld e,b
mlt de
ld d,(ix+2) ; Prod 22
ld e,(ix+6)
mlt de
ld (ix+12),l
ld l,h
ld h,a
xor a

; Calc byte 5
ld d,(ix+2) ; Prod 23
ld e,b
mlt de
ld d,c ; Prod 32
ld e,(ix+6)
mlt de