Topic: Assembly Programmers - Help Axe Optimize!


Offline Runer112

Re: Assembly Programmers - Help Axe Optimize!
« Reply #270 on: December 12, 2011, 11:57:46 pm »
Yeah, I see no way to optimize the full 32-bit multiplication... But fixed-point multiplication, now that's an entirely different story! First, here's a totally different approach to sign handling that reduces p_88Mul to less than half of its current size! ;D


Original routine: 38 bytes, ~1128 cycles
Code:
p_88Mul:
.db __88MulEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_MulFull
ld l,h
ld h,a
pop af
xor h
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__88MulEnd:

Smaller routine: 18 bytes, ~1089 cycles
Code:
p_88Mul:
.db __88MulEnd-1-$
push hl
call $3F00+sub_MulFull
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:


20 bytes saved? Not bad at all! But what if you're more interested in shaving off cycles than bytes? Don't worry, I covered that base too. Instead of using the slower p_MulFull, this final routine uses my faster p_Mul for 8 bits of the multiplication and an inlined, slightly different version of faster multiplication for the other 8 bits. End result: it's about 260 cycles faster than the smaller solution, or about 30% faster! ;D It's 16 bytes larger than my smaller method, but actually it would often end up resulting in smaller programs because it relies on the much more popular p_Mul instead of p_MulFull.


Faster routine: 34 bytes, ~831 cycles
Code:
p_88Mul:
.db __88MulEnd-1-$
push hl
ld c,l
ld a,h
ld l,0
ld b,b \ .db 8 \ call $3F00+sub_Mul
ld a,c
ld bc,8<<8+0
__88MulNext:
add hl,hl
rla
jr nc,__88MulSkip
add hl,de
adc a,c
__88MulSkip:
djnz __88MulNext
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
« Last Edit: December 13, 2011, 12:04:38 am by Runer112 »

Offline Quigibo

« Reply #271 on: December 13, 2011, 01:19:54 am »
Wow, thanks! However, there seems to be an issue. The 3 pictures attached are the output from the Mandelbrot Set demo program. The first is from the original routine. The second is from your new size-optimized version. As you can see, it works, but the rounding appears to be asymmetrical (which might still be okay). The last one is from your speed-optimized version. I think you have a bug somewhere...  :P
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

« Reply #272 on: December 13, 2011, 01:38:30 am »
I think I can explain the asymmetry of the size-optimized version. Because it adjusts signs differently, I think it now rounds down (toward negative infinity) instead of toward zero like the old routine.
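
To make the rounding difference concrete, here is a rough Python model of the arithmetic only, not an emulation of either routine. It assumes p_MulFull yields the full unsigned 32-bit product, and it ignores overflow cases; the function names are made up for this sketch. The old routine multiplies magnitudes and negates afterwards, which truncates toward zero. The size-optimized routine instead corrects the high byte of the unsigned product, which floors.

```python
def to_signed(value):
    """Interpret a 16-bit value as two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mul88_truncating(x, y):
    """Model of the old approach: multiply magnitudes, then fix the sign."""
    sx, sy = to_signed(x), to_signed(y)
    magnitude = (abs(sx) * abs(sy)) >> 8
    if (sx < 0) != (sy < 0):
        magnitude = -magnitude
    return magnitude & 0xFFFF


def mul88_flooring(x, y):
    """Model of the size-optimized approach: unsigned product, then
    subtract the other operand's low byte from the high byte of the
    result for each negative operand (the `sub e` / `sub c` steps)."""
    x &= 0xFFFF
    y &= 0xFFFF
    product = x * y                      # assumed output of p_MulFull
    low = (product >> 8) & 0xFF          # bits 8-15 of the product
    high = (product >> 16) & 0xFF        # bits 16-23 of the product
    if x & 0x8000:
        high = (high - (y & 0xFF)) & 0xFF
    if y & 0x8000:
        high = (high - (x & 0xFF)) & 0xFF
    return (high << 8) | low


# -0.5 times 1/256 is a tiny negative number:
# truncation gives 0, flooring gives -1/256 ($FFFF).
print(hex(mul88_truncating(0xFF80, 0x0001)), hex(mul88_flooring(0xFF80, 0x0001)))
```

The two models agree whenever the exact product is non-negative or a multiple of 1/256; they differ by one unit in the last place otherwise, which would explain a lopsided picture.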

However, I have no clue what is going on with the speed-optimized routine. Can you look at the debugger and confirm that the call to sub_Mul is actually entering where it's supposed to, at __MulByte? I wouldn't be surprised if the offset call macro that you probably had to add for call nz,__MulByte in p_Mul is throwing off the offset calls because of its own size.
« Last Edit: December 13, 2011, 01:40:46 am by Runer112 »

Offline Quigibo

« Reply #273 on: December 13, 2011, 02:07:20 am »
The disassembly looks fine to me. All the jumps, calls, and everything of that nature are aligned. I tried 4 test cases with different combinations of sign values and they seemed okay. Since the generated picture is relatively close to the original, even though this is a chaotic system sensitive to errors, I would guess that only a few special cases cause it to return a wrong result.

EDIT: I made a program to run them side by side on random numbers and quit when the output is different.  Here is an output that gives different results between the routines:

$FFE0 ** $F5F1 (-0.125 ** -10.059)

Results in $0143 (1.26) in size optimized.
Results in $0239 (2.22) in speed optimized.
« Last Edit: December 13, 2011, 02:31:40 am by Quigibo »

Offline Runer112

« Reply #274 on: December 13, 2011, 03:04:07 am »
That edit was helpful; it gave me a hunch as to what the problem was, and (I think) that hunch was correct. Unfortunately, the fix for this problem will cost a byte and about 70 cycles. It will still be about 20% faster than the small routine though. And it still relies on the more common p_Mul instead of p_MulFull, so being 17 bytes larger might still be worth it.


Faster routine: 35 bytes, ~900 cycles
Code:
p_88Mul:
.db __88MulEnd-1-$
push hl
ld c,l
ld a,h
ld l,0
ld b,b \ .db 8 \ call $3F00+sub_Mul
ld b,8
__88MulNext:
add hl,hl
rla
rl c
jr nc,__88MulSkip
add hl,de
adc a,0
__88MulSkip:
djnz __88MulNext
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
« Last Edit: December 13, 2011, 03:04:48 am by Runer112 »

Offline calc84maniac

« Reply #275 on: December 17, 2011, 11:29:24 pm »
So... Z-Test. At a cost of 8 cycles, you can go from 17 bytes plus 3 bytes times the number of options (limited to something like 85?) to 16 bytes plus 2 bytes times the number of options (limited to amount of program space).

Here's my method:
Code:
  ld de,-range
  add hl,de
  ld de,jumptable_end
  jr c,default
  add hl,hl
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
default:
  ex de,hl
  jp (hl)
  .dw Label0
  .dw Label1
  .dw Label2
  ;.....
jumptable_end:
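
For anyone following along, the indexing in the listing above can be sketched in Python. This is only a model under assumptions: the addresses and the names `build_table` and `dispatch` are invented for illustration, and memory is a plain dictionary. The point it shows is that subtracting the option count leaves a negative index, so the table is addressed backwards from `jumptable_end`, and the out-of-range case falls through to the code right after the table.

```python
def build_table(base, targets):
    """Store 16-bit little-endian targets starting at `base`.
    Returns (memory, address just past the table)."""
    memory = {}
    address = base
    for target in targets:
        memory[address] = target & 0xFF
        memory[address + 1] = target >> 8
        address += 2
    return memory, address


def dispatch(value, memory, table_end, count):
    """Return the address the routine would jump to (count must be >= 1)."""
    total = (value & 0xFFFF) + ((-count) & 0xFFFF)  # ld de,-range / add hl,de
    if total > 0xFFFF:                # carry set means value >= count
        return table_end              # default: run the code after the table
    hl = total & 0xFFFF               # value - count, as a 16-bit negative
    hl = (hl + hl) & 0xFFFF           # add hl,hl (two bytes per entry)
    hl = (hl + table_end) & 0xFFFF    # add hl,de (index back from the end)
    return memory[hl] | (memory[hl + 1] << 8)


# Hypothetical addresses, chosen only for the demonstration.
memory, table_end = build_table(0x9E00, [0xA000, 0xA100, 0xA200])
for value in (0, 1, 2, 3):
    print(value, hex(dispatch(value, memory, table_end, 3)))
```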
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline Quigibo

« Reply #276 on: December 17, 2011, 11:39:31 pm »
Wow, thanks! I was considering that, but I assumed the overhead would be larger, not smaller! Thanks!

Also, I could move the labels to the data section of the code to make it even faster!

Code:
  ld de,-range
  add hl,de
  jr c,default
  add hl,hl
  ld de,jumptable_end
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
  ex de,hl
  jp (hl)
default:
« Last Edit: December 17, 2011, 11:41:17 pm by Quigibo »

Offline calc84maniac

« Reply #277 on: December 17, 2011, 11:50:41 pm »
If you wanted to save 2 cycles in the case of a jump, you could use an odd table setup with all the LSBs in a row followed by all the MSBs in a row, like so:

Code:
  ld de,-range
  add hl,de
  jr c,routine_end
  ex de,hl
  ld hl,jumptable_end
  add hl,de
  ld a,(hl)
  add hl,de
  ld l,(hl)
  ld h,a
  jp (hl)
routine_end:

I imagine that might not work well with the way pointers are handled in the compiler, though.

Edit:
And I suppose the current Z-Test is actually limited to 39 options due to the range of the JR instruction...
« Last Edit: December 17, 2011, 11:56:28 pm by calc84maniac »

Offline Runer112

« Reply #278 on: December 19, 2011, 01:12:53 am »
First, an optimization that I can't give you code for: making *^CONST use an equivalent constant division optimization if one exists. And don't forget about the trivial cases, *^1 and *^0. Of course, these only apply if you don't change this operation to return a 32-bit result somehow. Which it really should. :P

Next, some silly optimizations: ^0, <<ᴇ8000, >>ᴇ7FFF should simply be 0, while ≥≥ᴇ8000 and ≤≤ᴇ7FFF should simply be 1. If you're wondering why ^0 should be 0, that's what the general modulus routine would return anyways.

Finally, some optimizations for signed comparisons. These have been lacking general forms which take advantage of absolute jumps as well as optimized forms for constants for quite some time. Thanks to jacobly and calc84maniac for helping me come up with the first two! If either of you two are reading this, feel free to look at the other operations and try to optimize them. ;)

Code:
p_SGT0:
.db 8
ld a,h
or l
jr z,$+6
add hl,hl
sbc hl,hl
inc hl
p_SLE0:
.db 9
ld a,h
or l
jr z,$+6
add hl,hl
ccf
sbc hl,hl
inc hl
p_SLtLeXX:
.db 11
ld a,h
add a,$80
ld h,a
ld de,$0000 ;$8000-const
add hl,de
sbc hl,hl
inc hl
.db rp_Ans,6
p_SGtGeXX:
.db 12
ld a,h
add a,$80
ld h,a
xor a
ld de,$0000 ;$8000-const
add hl,de
ld h,a
rla
ld l,a
.db rp_Ans,6
p_SIntGt:
.db 11
scf
sbc hl,de
add hl,hl
jp pe,$+4
ccf
sbc hl,hl
inc hl
p_SIntGe:
.db 11
xor a
sbc hl,de
add hl,hl
jp po,$+4
ccf
ld h,a
rla
ld l,a
p_SIntLt:
.db 11
scf
sbc hl,de
add hl,hl
jp po,$+4
ccf
sbc hl,hl
inc hl
p_SIntLe:
.db 11
xor a
sbc hl,de
add hl,hl
jp pe,$+4
ccf
ld h,a
rla
ld l,a
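
The constant forms (p_SLtLeXX and p_SGtGeXX) rely on a bias trick that can be checked with a short Python model. This is a sketch of the arithmetic only; the function names are invented, and it assumes the compiler patches in `$8000-const` exactly as the listing's comments say. Adding $80 to H flips the sign bit, which maps signed order onto unsigned order, and the carry out of the 16-bit add then answers the comparison. The constant -$8000 is degenerate in this model, which fits the earlier suggestion to special-case it.

```python
def to_signed(value):
    """Interpret a 16-bit value as two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def biased_carry(x, const):
    """Carry flag after: ld a,h / add a,$80 / ld h,a / ld de,$8000-const / add hl,de."""
    hl = (x ^ 0x8000) & 0xFFFF          # flipping bit 15 biases signed to unsigned
    de = (0x8000 - const) & 0xFFFF      # value assumed to be patched into the routine
    return (hl + de) > 0xFFFF


def signed_lt_const(x, const):
    """Model of p_SLtLeXX: sbc hl,hl / inc hl turns 'no carry' into 1."""
    return 0 if biased_carry(x, const) else 1


def signed_ge_const(x, const):
    """Model of p_SGtGeXX: rla copies the carry into the result."""
    return 1 if biased_carry(x, const) else 0


print(signed_lt_const(0xFFFE, -1), signed_ge_const(0xFFFE, -1))  # -2 compared with -1
```

Under this model, the "less than or equal" and "greater than" variants would simply use const+1 in place of const.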
« Last Edit: December 19, 2011, 02:52:21 pm by Runer112 »

Offline jacobly

« Reply #279 on: December 20, 2011, 04:47:46 am »
p_DrawOff: save 1 byte, save ~40 cycles
Original
Code:
xor a
ld e,a
dec a
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
and (hl)
or e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
cpl
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
Optimized
Code:
xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:

p_Pix: save 2 bytes, save ~6 cycles
Original
Code:
p_Pix:
.db __PixEnd-1-$ ;Draws pixel (c,l)
ld de,plotSScreen
pop af
pop bc
push af
ld b,0

ld a,l
cp 64
ld a,b
ret nc
ld a,c
cp 96
ld a,b
ret nc

ld h,b
ld a,l
add a,a
add a,l
ld l,a
add hl,hl
add hl,hl
add hl,de
ld a,c
srl c
srl c
srl c
add hl,bc
and %00000111
ld b,a
ld a,%10000000
ret z
___GetPixLoop:
rrca
djnz ___GetPixLoop
ret
__PixEnd:
Optimized
Code:
p_Pix:
.db __PixEnd-1-$ ;Draws pixel (c,l)
ld de,plotSScreen
pop af
pop bc
push af
ld b,0

ld a,c
cp 96
ld a,b
ret nc
sla l
ret c
sla l
ret c

ld h,b
ex de,hl
add hl,de
add hl,de
add hl,de
ld a,c
srl c
srl c
srl c
add hl,bc
and %00000111
ld b,a
ld a,%10000000
ret z
___GetPixLoop:
rrca
djnz ___GetPixLoop
ret
__PixEnd:
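
As a sanity check on the optimized p_Pix, the following Python sketch models what it computes. It is an illustration under assumptions: offsets are relative to plotSScreen rather than absolute addresses, inputs are treated as 8-bit, and the function names are made up. The first function shows why two `sla l` / `ret c` pairs replace `cp 64`: either shift carries out exactly when y >= 64, and what remains is 4*y, which is then added three times to get 12*y.

```python
def row_times_four(y):
    """Model of `sla l / ret c / sla l / ret c`.
    Returns 4*y, or None where the routine would return early."""
    l = y & 0xFF
    for _ in range(2):
        carry = l >> 7
        l = (l << 1) & 0xFF
        if carry:
            return None
    return l


def pix(x, y):
    """Return (byte offset from plotSScreen, bit mask); mask 0 means off-screen."""
    x &= 0xFF
    if x >= 96:
        return None, 0
    quad = row_times_four(y)
    if quad is None:
        return None, 0
    offset = 3 * quad + (x >> 3)   # 12 bytes per row, 8 pixels per byte
    mask = 0x80 >> (x & 7)         # rrca applied (x & 7) times to %10000000
    return offset, mask


print(pix(0, 0), pix(95, 63), pix(96, 0))
```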

p_ArcTan: save 1 byte, save ~1 cycle
Original
Code:
p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-1
add hl,de ;\
jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
Optimized
Code:
p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-2
add hl,de ;\
ld c,c \ .db $FA ; |
;jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:

Offline jacobly

« Reply #280 on: December 24, 2011, 01:45:01 pm »
p_DrawOr/Xor: save 17 bytes (plus 4 every time a custom buffer is used)
aligned saves 98 cycles, unaligned saves ~173 cycles
save additional 21 cycles every time a custom buffer is used
Code:
p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawOrClipBottom
ld a,b
jr __DrawOrClipBottom
__DrawOrClipTop:
inc ix
inc c
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawOrAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:

p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawXorClipBottom
ld a,b
jr __DrawXorClipBottom
__DrawXorClipTop:
inc ix
inc c
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawXorAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:

Offline Xeda112358

« Reply #281 on: December 24, 2011, 03:58:42 pm »
I finally have an optimisation that might work or be useful >.> Runer112 apparently mentioned optimising the p_FreqOut routine by replacing:
Code:
dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2
with this:
Code:
cpd
jp pe,__FreqOutLoop2
However, the issue was that the frequency would be thrown off, as it cut out 8*HL cycles. When I was stealing the code for my own evil intentions, I saw this optimisation, thought of that issue, and came up with this solution:
Code:
p_FreqOut:
xor a
__FreqOutLoop1:
push bc
xor %00000011
ld e,a
__FreqOutLoop2:
ld a,h
or l
jr z,__FreqOutDone
cpd
ld a,e
scf
jp pe,__FreqOutLoop2
__FreqOutDone:
pop bc
out ($00),a
ret nc
jr __FreqOutLoop1
__FreqOutEnd:
The way the code is reordered, now, it should only cut out 8*HL/BC cycles which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.


EDIT: Okay, found a problem: the inner loop is now actually 2 cycles slower, so that will slow the routine by 2*HL as well.

Offline jacobly

« Reply #282 on: December 26, 2011, 12:17:26 pm »
p_DrawOr: 18 bytes saved
p_DrawXor: 18 bytes saved
p_DrawOff: 14 bytes saved
p_DrawMsk: 10 bytes saved
p_DrawMsk2: 11 bytes saved
Code:
p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOrClipBottom
ld b,d
jr __DrawOrNoClipV
__DrawOrClipTop:
inc ix
inc e
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
ld b,a
__DrawOrNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawOrAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:

p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawXorClipBottom
ld b,d
jr __DrawXorNoClipV
__DrawXorClipTop:
inc ix
inc e
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
ld b,a
__DrawXorNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawXorAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:

p_DrawOff:
.db __DrawOffEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOffClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOffClipBottom
ld b,d
jr __DrawOffNoClipV
__DrawOffClipTop:
inc ix
inc e
jr nz,__DrawOffClipTop
__DrawOffClipBottom:
ld b,a
__DrawOffNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawOffAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawOffLoop
inc d
cp 96-7
jr nc,__DrawOffLoop
inc d
__DrawOffLoop:
push bc
ld b,c
ld c,(ix+0)
xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
ld bc,13
add hl,bc
inc ix
pop bc
djnz __DrawOffLoop
ret
__DrawOffAligned:
ld e,12
__DrawOffAlignedLoop:
ld a,(ix)
ld (hl),a
inc ix
add hl,de
djnz __DrawOffAlignedLoop
ret
__DrawOffEnd:

p_DrawMsk:
.db __DrawMskEnd-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMskClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMskClipBottom
ld b,d
jr __DrawMskNoClipV
__DrawMskClipTop:
inc ix
inc e
jr nz,__DrawMskClipTop
__DrawMskClipBottom:
ld b,a
__DrawMskNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMskAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMskLoop
inc d
cp 96-7
jr nc,__DrawMskLoop
inc d

__DrawMskLoop:
push bc

push hl

ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMskShift:
srl e
rr h
srl c
rra
djnz __DrawMskShift

ld b,h
pop hl
push af

dec d
jr z,__DrawMskSkipRight1

push bc
xor b
cpl
ld c,a

ld a,(hl)
or b
and c
ld (hl),a
pop bc

__DrawMskSkipRight1:
dec hl
inc d
push de
jr z,__DrawMskSkipLeft1

ld a,c
xor e
cpl
ld d,a

ld a,(hl)
or e
and d
ld (hl),a

__DrawMskSkipLeft1:
ld de,appBackUpScreen-plotSScreen+1
add hl,de
pop de
pop af
dec d
jr z,__DrawMskSkipRight2

or b
cpl

and (hl)
or b
ld (hl),a

__DrawMskSkipRight2:
dec hl
inc d
jr z,__DrawMskSkipLeft2

ld a,c
or e
cpl

and (hl)
or e
ld (hl),a

__DrawMskSkipLeft2:
ld bc,plotSScreen-appBackUpScreen+13
add hl,bc

inc ix
pop bc
djnz __DrawMskLoop
ret
__DrawMskAligned:
push hl
ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,(ix+0)
ld d,a
xor (ix+8)
cpl
ld e,a

and (hl)
or d
ld (hl),a

pop hl

ld a,(hl)
or d
and e
ld (hl),a

inc ix
ld de,12
add hl,de
djnz __DrawMskAligned
ret
__DrawMskEnd:

p_DrawMsk2:
.db __DrawMsk2End-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMsk2ClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMsk2ClipBottom
ld b,d
jr __DrawMsk2NoClipV
__DrawMsk2ClipTop:
inc ix
inc e
jr nz,__DrawMsk2ClipTop
__DrawMsk2ClipBottom:
ld b,a
__DrawMsk2NoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMsk2Aligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMsk2Loop
inc d
cp 96-7
jr nc,__DrawMsk2Loop
inc d
__DrawMsk2Loop:
push bc
push hl

ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMsk2Shift:
srl e
rr h
srl c
rra
djnz __DrawMsk2Shift

ld b,h ;e = left spr, b = right spr, c = left msk, a = right msk
pop hl

dec d
jr z,__DrawMsk2SkipRight

cpl
and (hl)
xor b
ld (hl),a

__DrawMsk2SkipRight:
dec hl
inc d
jr z,__DrawMsk2SkipLeft

ld a,c
cpl
and (hl)
xor e
ld (hl),a

__DrawMsk2SkipLeft:
ld bc,13
add hl,bc

inc ix
pop bc
djnz __DrawMsk2Loop
ret
__DrawMsk2Aligned:
ld e,12
__DrawMsk2AlignedLoop:
ld a,(ix+8)
cpl
and (hl)
xor (ix+0)
ld (hl),a
inc ix
add hl,de
djnz __DrawMsk2AlignedLoop
ret
__DrawMsk2End:

Offline Runer112

« Reply #283 on: February 02, 2012, 11:56:23 pm »
Just a small optimization I see with the new Nth string command. Because you restack the return location by popping it into bc, you're already loading bc with a value that's at least $4000 for applications and at least $8000 for programs, so the ld b,h inside the loop is not necessary.

Offline jacobly

« Reply #284 on: September 18, 2012, 03:03:40 am »
Thanks to a suggestion from calc84maniac, I have optimized the routine that is used for both *^ and ** to be 25-50% faster. ;D In addition, every use of *^ would be 2 bytes smaller.

p_MulFull: same size, save 300-550 cycles
Original
Code:
p_MulFull:
.db __MulFullEnd-1-$
ld c,h
ld a,l
ld hl,0
ld b,16
__MulFullNext:
add hl,hl
rla
rl c
jr nc,__MulFullSkip
add hl,de
adc a,0
jr nc,__MulFullSkip
inc c
__MulFullSkip:
djnz __MulFullNext
ret
__MulFullEnd:
Optimized
Code:
p_MulFull:
.db __MulFullEnd-1-$
xor a
ld c,h
ld h,a
or l
ld l,h
call nz,__MulFullByte-p_MulFull-1
ld a,c
__MulFullByte:
ld b,8
__MulFullNext:
rra
jr nc,__MulFullSkip
add hl,de
__MulFullSkip:
rr h
rr l
djnz __MulFullNext
ret
__MulFullEnd:
Note: Output changed: hl = bits 16-31 of the result, do rra after the routine returns to get a = bits 8-15 of the result.
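
The optimized routine and the output note can be traced with a register-level Python model. This is a sketch based on my reading of the listing, not code from Axe; `mul_byte` and `mul_full` are made-up names, and only A, HL, DE and the carry flag are modelled. Each pass shifts the running sum right one bit per multiplier bit, and the bits that fall out of L are rotated into A, which is why one extra `rra` after the return leaves bits 8-15 in A.

```python
def mul_byte(a, hl, de, carry):
    """Model of __MulFullByte: eight shift-right multiply steps."""
    for _ in range(8):
        # rra: bit 0 of A goes to carry, the old carry enters bit 7
        bit = a & 1
        a = (a >> 1) | (carry << 7)
        carry = bit
        if carry:                      # jr nc,__MulFullSkip not taken
            total = hl + de            # add hl,de
            hl = total & 0xFFFF
            carry = total >> 16
        # rr h / rr l: shift the 17-bit value carry:HL right by one
        out = hl & 1
        hl = (hl >> 1) | (carry << 15)
        carry = out
    return a, hl, carry


def mul_full(x, y):
    """Return (HL, A after one extra rra) for the 16x16 product x*y."""
    hl, carry = 0, 0                   # xor a / ld h,a / ld l,h ; or l clears carry
    low = x & 0xFF
    if low:                            # call nz,__MulFullByte
        _, hl, carry = mul_byte(low, hl, y, carry)
    a, hl, carry = mul_byte(x >> 8, hl, y, carry)   # ld a,c and fall through
    a = (a >> 1) | (carry << 7)        # the rra mentioned in the note
    return hl, a


hl, a = mul_full(0x0100, 0x1234)
print(hex(hl), hex(a))
```

In this model HL matches bits 16-31 and A matches bits 8-15 of the plain integer product for every pair tried, which agrees with the note about the changed output.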
« Last Edit: September 18, 2012, 03:10:34 am by jacobly »