Author Topic: Fast 8-bit Squaring (Read 2557 times)

Xeda112358 · « **on:** January 28, 2014, 09:03:43 am »

Last night I was thinking again about my Sine Approximation routine and holy crud, I must have been really tired not to recognize this:

In the approximation, I was taking the upper 8 bits, truncating propagation. This was so that I had a fixed-point routine. But if I had taken the lower 8 bits, I would have had a useful 8-bit integer routine for -x²-x.

But wait.

If I start the accumulator with x (which is actually 3 cycles faster to do), then I end up with -x².

Well then, anybody wanna optimize this? ^^ :

Code:

L_sqrd:
;Input: L
;Output: L*L->A
;159 t-states
;39 bytes
 ld h,l
 ld c,l
 rr h
 sbc a,a
 xor l
 add a,c
 ld c,a
 rl l

 rr h
 sbc a,a
 xor l
 and %11111000
 add a,c
 ld c,a
 rl l

 rr h
 sbc a,a
 xor l
 and %11100000
 add a,c
 ld c,a
 rl l

 ld a,h
 rrca
 xor l 
 and 128
 xor c
 neg
 ret

It might be easier if I explain that it is computing:

Code:

=(hhhhhhhh^abcdefgh)+(gggggg00^bcdefg00)+(ffff0000^cdef0000)+(ee000000^de000000)
=(hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000)+(e0000000^d0000000)
=((hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000))^e0000000^d0000000

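^ is XOR
+ is addition modulo 256
abcdefgh are the bits of the input, with a the most significant bit
The accumulator starts at the input, so the routine adds abcdefgh to this sum and then negates the total with NEG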
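For anyone who wants to check the three forms without a Z80 at hand, here is a rough Python model of them. The function names (`sbc_mask`, `terms`, `neg_square`, `neg_square_xor_last`) are made up for this sketch and do not come from the routine; the masks and shifts are read off the listing above.

```python
def sbc_mask(x, bit):
    """0xFF if the given bit of x is set, else 0x00 (what `sbc a,a` leaves in A)."""
    return 0xFF if (x >> bit) & 1 else 0x00


def terms(x, masks):
    """The four XOR terms: the unmasked first one, then one per mask."""
    out = [sbc_mask(x, 0) ^ x]                       # hhhhhhhh ^ abcdefgh
    for i, m in enumerate(masks, start=1):
        out.append((sbc_mask(x, i) ^ (x << i)) & m)  # shifted input, masked
    return out


def neg_square(x, masks=(0xF8, 0xE0, 0x80), start=None):
    """Accumulator before NEG. start=None means 'start at x', as the routine does."""
    acc = x if start is None else start
    return (acc + sum(terms(x, masks))) & 0xFF


def neg_square_xor_last(x):
    """Third form: the last term only touches bit 7, so XOR can replace the add."""
    t0, t1, t2, t3 = terms(x, (0xF8, 0xE0, 0x80))
    return ((x + t0 + t1 + t2) ^ t3) & 0xFF


for x in range(256):
    assert neg_square(x) == (-x * x) & 0xFF                        # -x^2
    assert neg_square(x, masks=(0xFC, 0xF0, 0xC0)) == neg_square(x)  # wider masks, same sum
    assert neg_square_xor_last(x) == neg_square(x)                 # XOR form
    assert neg_square(x, start=0) == (-x * x - x) & 0xFF           # -x^2 - x
```

The wider masks in the first form give the same sum because the extra bit in each term is a bit XORed with itself, which is always 0.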

EDIT: Cross-posted:
I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,l and use the first iteration to compute C directly. To do this, I shift L at the beginning, then after the "sbc a,a" I OR the result with L. If A was $FF, the result is $FF (which is -1), else it is 2*L:

Code:

L_sqrd:
;Input: L
;Output: L*L->A
;151 t-states
;37 bytes
 ld h,l
;First iteration, get the lowest 3 bits
 sla l
 rr h
 sbc a,a
 or l
;second iteration, get the next 2 bits
 ld c,a
 rr h
 sbc a,a
 xor l
 and $F8
 add a,c
;third iteration, get the next 2 bits
 ld c,a
 sla l
 rr h
 sbc a,a
 xor l
 and $E0
 add a,c
;fourth iteration, get the last bit
 ld c,a
 ld a,l
 add a,a
 rrc h
 xor h
 and $80
 xor c
 neg
 ret
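
The first-iteration shortcut can be sanity-checked off-target. This is only a sketch in Python, with hypothetical function names, comparing the original xor/add sequence against the OR version for every input:

```python
def first_iter_original(x):
    """ld c,l / rr h / sbc a,a / xor l / add a,c  (accumulator starts at L)."""
    m = 0xFF if x & 1 else 0x00      # sbc a,a: $FF if bit 0 was set, else 0
    return ((m ^ x) + x) & 0xFF      # ~x + x = $FF, or x + x = 2x


def first_iter_optimized(x):
    """sla l / rr h / sbc a,a / or l  (L already holds 2x)."""
    m = 0xFF if x & 1 else 0x00
    return m | ((x << 1) & 0xFF)     # $FF OR anything = $FF, 0 OR 2x = 2x


assert all(first_iter_original(x) == first_iter_optimized(x) for x in range(256))
```

Both versions leave $FF for odd inputs and 2*L (mod 256) for even inputs, which is why the OR is a safe replacement here.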

Also, as a note, you can stop after any iteration, perform NEG, and mask the result to get the lower bits of the square. After the first iteration A holds -L*L mod 8 in its low 3 bits, after the second it holds -L*L mod 32 in its low 5 bits, and after the third it holds -L*L mod 128 in its low 7 bits.
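
The early-exit property can be checked the same way. A minimal Python sketch, assuming the masks from the listing ($FF for the unmasked first term, then $F8, $E0, $80); `acc_after` and `low_bits_of_square` are names invented for this example:

```python
MASKS = (0xFF, 0xF8, 0xE0, 0x80)   # one mask per iteration
MODULI = (8, 32, 128, 256)         # precision reached after each iteration


def acc_after(x, iterations):
    """Accumulator (before NEG) after the given number of iterations, 1 to 4."""
    acc = x                                      # accumulator starts at the input
    for i in range(iterations):
        sign = 0xFF if (x >> i) & 1 else 0x00    # sbc a,a on bit i of the input
        acc += (sign ^ (x << i)) & MASKS[i]
    return acc & 0xFF


def low_bits_of_square(x, iterations):
    """NEG, then mask down to the bits that are valid at this point."""
    return (-acc_after(x, iterations)) % MODULI[iterations - 1]


for k in (1, 2, 3, 4):
    for x in range(256):
        assert low_bits_of_square(x, k) == (x * x) % MODULI[k - 1]
```

Each later term is a multiple of 8, 32, or 128 respectively, so it cannot disturb the low bits that are already settled.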

Xeda112358 · « **Reply #1 on:** January 28, 2014, 12:29:06 pm »

I managed to optimize the first iteration. I thought to analyze it while in class, so I saved two bytes, 8 cycles.

EDIT: Here is some pseudocode for getting the full square (not just the lower bits):

Code:

A_sqrd:
;Input: A is an x-bit number (for example 8-bit, 16-bit, et cetera)
;Output: A*A (ACC needs to be 2x bits wide)
 A<<1→B     ;Store A shifted once into B.
 0→ACC     ;Initialise the accumulator with 0
 1<<(x-1)→MASK    ;Set the top (most significant) bit of MASK
 For (x-1) Iterations:
  ACC<<1→ACC   ;shift ACC (same as *2 or ACC+ACC)
  B<<1→B    ;shift left once, keep carry
  ((SBC(0,0)^A)&MASK)+ACC→ACC ;SBC(0,0) is "subtract with carry" and will yield either -1 (carry flag set) or 0 (carry flag reset).
  (MASK>>1)|MASK→MASK  ;arithmetic right shift
 Return ((A<<x)-A)-ACC   ;ACC is (2^x-1)*A-A*A at this point
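
For reference, the pseudocode translates almost line for line into Python. This is a sketch rather than a tuned implementation; `square_by_masks` and its `width` parameter are names chosen for the example, and Python's unbounded integers stand in for a double-width ACC.

```python
def square_by_masks(a, width=8):
    """Square a `width`-bit number with shifts, XOR, AND and add only."""
    full = (1 << width) - 1
    b = (a << 1) & full                 # A<<1 -> B, truncated to the register width
    acc = 0                             # 0 -> ACC
    mask = 1 << (width - 1)             # top bit of MASK
    for _ in range(width - 1):
        acc <<= 1                       # ACC<<1 -> ACC
        carry = (b >> (width - 1)) & 1  # bit shifted out of B
        b = (b << 1) & full             # B<<1 -> B
        sign = full if carry else 0     # SBC(0,0): all ones if carry, else 0
        acc += (sign ^ a) & mask        # ((SBC(0,0)^A)&MASK)+ACC -> ACC
        mask = (mask >> 1) | mask       # arithmetic right shift of MASK
    return ((a << width) - a) - acc     # ACC holds (2^width - 1)*A - A*A


assert all(square_by_masks(a) == a * a for a in range(256))
assert all(square_by_masks(a, 16) == a * a for a in range(0, 65536, 251))
```

The loop never looks at the top bit of A directly; its contribution is supplied by the final `(a << width) - a` term.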

Implemented on the Z80, it kind of sucks, taking twice as long as a traditional routine. However, I think it is novel, and it takes 7 iterations:

Code:

L_sqrd:
;Input: A
;Output: L*L->HL
;It's really slow compared to a generic 8-bit multiplication
;It's also pretty big
;But it is a novel approach!
 ld e,a            ;this just stays a constant for the algorithm
 rlca
 ld d,a            ;this will be rotated to get the bits of the input
 ld hl,0            ;the accumulator (H must start at 0 too, since add hl,hl shifts it)
 ld bc,0780h      ;c is a mask
loop:
 add hl,hl
 rlc d
 sbc a,a
 xor e
 and c
 add a,l
 ld l,a
 jr nc,$+3
 inc h
 sra c
 djnz loop
;HL is 255*input-input*input
;add input
 ld c,e
 add hl,bc
;HL is 256*input-input*input
;negate HL
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
;HL is input*input-256*input
;add input*256
 ld e,b
 add hl,de
;HL is input*input
 ret
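
To check the intermediate values named in the comments, here is a register-level trace of the routine in Python. It is a sketch under two assumptions: the input arrives in A, and H is zero when the loop starts. The function name `trace_full_square` is made up for this example.

```python
def trace_full_square(a):
    """Return (HL after the loop, final HL, final D) for an 8-bit input a."""
    e = a                                      # ld e,a
    d = ((a << 1) | (a >> 7)) & 0xFF           # rlca / ld d,a
    hl = 0                                     # accumulator, H assumed to be 0
    c = 0x80                                   # ld bc,0780h: C is the mask, B = 7
    for _ in range(7):                         # djnz loop
        hl = (hl << 1) & 0xFFFF                # add hl,hl
        carry = d >> 7                         # rlc d: bit 7 goes to carry
        d = ((d << 1) | carry) & 0xFF
        sign = 0xFF if carry else 0x00         # sbc a,a
        hl = (hl + ((sign ^ e) & c)) & 0xFFFF  # xor e / and c / add a,l / inc h on carry
        c = (c >> 1) | (c & 0x80)              # sra c keeps bit 7
    after_loop = hl                            # 255*input - input*input
    hl = (hl + e) & 0xFFFF                     # ld c,e / add hl,bc (B is 0 after djnz)
    hl = (-hl) & 0xFFFF                        # negate HL
    hl = (hl + (d << 8)) & 0xFFFF              # ld e,b / add hl,de (D is back to the input)
    return after_loop, hl, d


for a in range(256):
    after_loop, result, d = trace_full_square(a)
    assert after_loop == 255 * a - a * a
    assert result == a * a
    assert d == a
```

After rlca plus seven rlc rotations, D has turned a full 8 places, which is why it can be reused as the high byte of 256*input at the end.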
