Assembly Programmers - Help Axe Optimize!

Runer112

Project Author
LV11 Super Veteran (Next: 3000)
Posts: 2289
Rating: +639/-31

Re: Assembly Programmers - Help Axe Optimize!

« Reply #270 on: December 12, 2011, 11:57:46 pm »

Yeah, I see no way to optimize the full 32-bit multiplication... But fixed-point multiplication, now that's an entirely different story! First, here's a totally different approach to sign handling that reduces p_88Mul to less than half of its current size!

Original routine: 38 bytes, ~1128 cycles

p_88Mul:
 .db __88MulEnd-1-$
 ld a,h
 xor d
 push af
 bit 7,h
 jr z,$+8
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 bit 7,d
 jr z,$+8
 xor a
 sub e
 ld e,a
 sbc a,a
 sub d
 ld d,a
 call $3F00+sub_MulFull
 ld l,h
 ld h,a
 pop af
 xor h
 ret p
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 ret
__88MulEnd:

Smaller routine: 18 bytes, ~1089 cycles

p_88Mul:
 .db __88MulEnd-1-$
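 push hl  ;save the first operand for the sign fix-up
 call $3F00+sub_MulFull  ;unsigned multiply: a = bits 16-23, h = bits 8-15
 pop bc  ;bc = first operand
 bit 7,b  ;was the first operand negative?
 jr z,$+3
 sub e  ;yes: subtract the low byte of the second operand
 ld l,h  ;l = bits 8-15 of the product
 ld h,a  ;h = bits 16-23 (corrected so far)
 bit 7,d  ;was the second operand negative?
 ret z
 sub c  ;yes: subtract the low byte of the first operand
 ld h,a
 ret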
__88MulEnd:
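
The sign fix-up in the smaller routine can be checked off-calculator. The Python sketch below is only a model, not Axe code: `mul88_small` and `mul88_reference` are hypothetical names, and it assumes the unsigned multiply yields the full 32-bit product, from which bits 8-23 are taken.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul88_small(x, y):
    """Model of the fix-up: unsigned multiply, then correct bits 16-23.

    x and y are 16-bit patterns holding signed 8.8 fixed-point values.
    """
    product = x * y                          # unsigned 16x16 -> 32 bits
    low = (product >> 8) & 0xFF              # bits 8-15: result low byte
    high = (product >> 16) & 0xFF            # bits 16-23: result high byte
    if x & 0x8000:                           # x negative: product is too
        high = (high - (y & 0xFF)) & 0xFF    # large by y << 16
    if y & 0x8000:                           # y negative: product is too
        high = (high - (x & 0xFF)) & 0xFF    # large by x << 16
    return (high << 8) | low


def mul88_reference(x, y):
    """Signed product shifted right by 8 (floor), wrapped to 16 bits."""
    return ((to_signed16(x) * to_signed16(y)) >> 8) & 0xFFFF
```

Only the low byte of each operand is needed for the correction because everything above bit 23 of the product is discarded anyway.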

20 bytes saved? Not bad at all! But what if you're more interested in shaving off cycles than bytes? Don't worry, I covered that base too. Instead of using the slower p_MulFull, this final routine uses my faster p_Mul for 8 bits of the multiplication and an inlined, slightly different version of faster multiplication for the other 8 bits. End result: it's about 260 cycles faster than the smaller solution, or about 30% faster!

It's 16 bytes larger than my smaller method, but actually it would often end up resulting in smaller programs because it relies on the much more popular p_Mul instead of p_MulFull.
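
The idea of splitting the work into two 8-bit halves can be illustrated with plain integers. This Python sketch only models the arithmetic, not the register-level loop or the p_Mul entry point; `mul_split` and `mul_direct` are hypothetical names.

```python
def mul_split(hl, de):
    """Bits 8-23 of hl*de, built from two 8-bit-by-16-bit partial products."""
    h, l = hl >> 8, hl & 0xFF
    high_part = h * de          # lands at bit 8 and above of the full product
    low_part = (l * de) >> 8    # only the bits at position 8 and above matter
    return (high_part + low_part) & 0xFFFF


def mul_direct(hl, de):
    """Bits 8-23 of the full unsigned 16x16 product."""
    return ((hl * de) >> 8) & 0xFFFF
```

The two agree exactly because the high partial product is a multiple of 256 once it is placed in the full product, so dropping the low 8 bits of the other partial product loses nothing extra.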

Faster routine: 34 bytes, ~831 cycles

p_88Mul:
 .db __88MulEnd-1-$
 push hl
 ld c,l
 ld a,h
 ld l,0
 ld b,b \ .db 8 \ call $3F00+sub_Mul
 ld a,c
 ld bc,8<<8+0
__88MulNext:
 add hl,hl
 rla
 jr nc,__88MulSkip
 add hl,de
 adc a,c
__88MulSkip:
 djnz __88MulNext
 pop bc
 bit 7,b
 jr z,$+3
 sub e
 ld l,h
 ld h,a
 bit 7,d
 ret z
 sub c
 ld h,a
 ret
__88MulEnd:

« Last Edit: December 13, 2011, 12:04:38 am by Runer112 »

Quigibo

The Executioner
CoT Emeritus
LV11 Super Veteran (Next: 3000)
Posts: 2031
Rating: +1075/-24
I wish real life had a "Save" and "Load" button...

« Reply #271 on: December 13, 2011, 01:19:54 am »

Wow, thanks! However, there seems to be an issue. The three attached pictures are the output of the Mandelbrot Set demo program. The first is from the original routine. The second is from your new size-optimized version: as you can see, it works, but the rounding appears to be asymmetrical (which might still be okay). The last one is from your speed-optimized version. I think you have a bug somewhere...

Attachments: mbrot1.gif, mbrot2.gif, mbrot3.gif (192x128 each)

Runer112

« Reply #272 on: December 13, 2011, 01:38:30 am »

I think I can explain the asymmetry of the size-optimized version. Because it adjusts signs differently, I think it now rounds down instead of towards zero like the old routine.

However, I have no clue what is going on with the speed-optimized routine. Can you look at the debugger and confirm that the call to sub_Mul is actually entering where it's supposed to be entering, at __MulByte? Because I wouldn't be surprised if the fact that you probably had to add the offset call macro for call nz,__MulByte in p_Mul is messing up the offset calls due to its own size.
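
The rounding difference between the two sign-handling strategies can be illustrated with a small Python model. It assumes the old routine truncates the magnitude of the product (negate, multiply unsigned, negate back) and that the new one floors; the function names are hypothetical.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul88_toward_zero(x, y):
    """Multiply magnitudes, drop the fraction bits, then restore the sign."""
    sx, sy = to_signed16(x), to_signed16(y)
    magnitude = (abs(sx) * abs(sy)) >> 8
    result = -magnitude if (sx < 0) != (sy < 0) else magnitude
    return result & 0xFFFF


def mul88_floor(x, y):
    """Signed product shifted right arithmetically, which rounds down."""
    return ((to_signed16(x) * to_signed16(y)) >> 8) & 0xFFFF
```

For example, $FFFF (-1/256) times $0080 (0.5) gives $0000 when rounding toward zero but $FFFF when rounding down, while positive products come out the same either way.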

« Last Edit: December 13, 2011, 01:40:46 am by Runer112 »

Quigibo

« Reply #273 on: December 13, 2011, 02:07:20 am »

The disassembly looks fine to me. All the jumps, calls, and everything of that nature are aligned. I tried 4 test cases with different combinations of sign values and they seemed okay. Since the generated picture is relatively close to the original, even though it comes from a chaotic system that is sensitive to errors, I would guess that only a few special cases return a wrong result.

EDIT: I made a program to run them side by side on random numbers and quit when the output is different. Here is an output that gives different results between the routines:

$FFE0 ** $F5F1 (-0.125 ** -10.059)

Results in $0143 (1.26) in size optimized.
Results in $0239 (2.22) in speed optimized.

« Last Edit: December 13, 2011, 02:31:40 am by Quigibo »

Runer112

« Reply #274 on: December 13, 2011, 03:04:07 am »

That edit was helpful; it gave me a hunch as to what the problem was, and (I think) that hunch was correct. Unfortunately, the fix for this problem will cost a byte and about 70 cycles. It will still be about 20% faster than the small routine, though. And it still relies on the more common p_Mul instead of p_MulFull, so being 17 bytes larger might still be worth it.

Faster routine: 35 bytes, ~900 cycles

p_88Mul:
 .db __88MulEnd-1-$
 push hl
 ld c,l
 ld a,h
 ld l,0
 ld b,b \ .db 8 \ call $3F00+sub_Mul
 ld b,8
__88MulNext:
 add hl,hl
 rla
 rl c
 jr nc,__88MulSkip
 add hl,de
 adc a,0
__88MulSkip:
 djnz __88MulNext
 pop bc
 bit 7,b
 jr z,$+3
 sub e
 ld l,h
 ld h,a
 bit 7,d
 ret z
 sub c
 ld h,a
 ret
__88MulEnd:

« Last Edit: December 13, 2011, 03:04:48 am by Runer112 »

calc84maniac

eZ80 Guru
Coder Of Tomorrow
LV11 Super Veteran (Next: 3000)
Posts: 2912
Rating: +471/-17

« Reply #275 on: December 17, 2011, 11:29:24 pm »

So... Z-Test. At a cost of 8 cycles, you can go from 17 bytes plus 3 bytes times the number of options (limited to something like 85?) to 16 bytes plus 2 bytes times the number of options (limited to amount of program space).

Here's my method:

  ld de,-range
  add hl,de
  ld de,jumptable_end
  jr c,default
  add hl,hl
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
default:
  ex de,hl
  jp (hl)
  .dw Label0
  .dw Label1
  .dw Label2
  ;.....
jumptable_end:
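
The address arithmetic in this lookup, where the table is indexed backwards from its end label, can be modelled in Python. This is a sketch under assumptions: `table_entry_address` is a hypothetical name, `option_count` stands for `range` and is assumed to be at least 1, and `table_end` stands for the address of `jumptable_end`.

```python
def table_entry_address(value, option_count, table_end):
    """Return the address the lookup would read, or None for the default case.

    Mirrors: ld de,-range / add hl,de / jr c,default / add hl,hl / add hl,de
    """
    biased = value + ((-option_count) & 0xFFFF)   # add hl,de with de = -range
    if biased > 0xFFFF:                           # carry set: value >= range
        return None
    biased &= 0xFFFF                              # value - range as 16 bits
    doubled = (biased * 2) & 0xFFFF               # add hl,hl: 2-byte entries
    return (doubled + table_end) & 0xFFFF         # add hl,de: de = table end
```

Because `value - range` is negative for every valid option, adding it twice to the end label lands on the matching entry counted from the start of the table, so no separate start label is needed.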

Quigibo

« Reply #276 on: December 17, 2011, 11:39:31 pm »

Wow, thanks! I was considering that, but I assumed the overhead would be larger, not smaller! Thanks!

Also, I could move the labels to the data section of the code to make it even faster!

  ld de,-range
  add hl,de
  jr c,default
  add hl,hl
  ld de,jumptable_end
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
  ex de,hl
  jp (hl)
default:

« Last Edit: December 17, 2011, 11:41:17 pm by Quigibo »

calc84maniac

« Reply #277 on: December 17, 2011, 11:50:41 pm »

If you wanted to save 2 cycles in the case of a jump, you could use an odd table setup with all the LSBs in a row followed by all the MSBs in a row, like so:

  ld de,-range
  add hl,de
  jr c,routine_end
  ex de,hl
  ld hl,jumptable_end
  add hl,de
  ld a,(hl)
  add hl,de
  ld l,(hl)
  ld h,a
  jp (hl)
routine_end:

I imagine that might not work well with the way pointers are handled in the compiler, though.

Edit:
And I suppose the current Z-Test is actually limited to 39 options due to the range of the JR instruction...

« Last Edit: December 17, 2011, 11:56:28 pm by calc84maniac »

Runer112

« Reply #278 on: December 19, 2011, 01:12:53 am »

First, an optimization that I can't give you code for: making *^CONST use an equivalent constant division optimization if one exists. And don't forget about the trivial cases, *^1 and *^0. Of course, these only apply if you don't change this operation to return a 32-bit result somehow. Which it really should.

Next, some silly optimizations: ^0, <<ᴇ8000, >>ᴇ7FFF should simply be 0, while ≥≥ᴇ8000 and ≤≤ᴇ7FFF should simply be 1. If you're wondering why ^0 should be 0, that's what the general modulus routine would return anyways.
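
The four comparison cases can be confirmed by brute force over every 16-bit input. A minimal Python sketch, with hypothetical helper names, treating ᴇ8000 as -32768 and ᴇ7FFF as 32767:

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def signed_lt(a, b):
    return 1 if to_signed16(a) < to_signed16(b) else 0


def signed_gt(a, b):
    return 1 if to_signed16(a) > to_signed16(b) else 0


def signed_ge(a, b):
    return 1 if to_signed16(a) >= to_signed16(b) else 0


def signed_le(a, b):
    return 1 if to_signed16(a) <= to_signed16(b) else 0


def constant_comparisons_are_trivial():
    """Check every 16-bit input against the two extreme constants."""
    for x in range(0x10000):
        if signed_lt(x, 0x8000) != 0 or signed_gt(x, 0x7FFF) != 0:
            return False
        if signed_ge(x, 0x8000) != 1 or signed_le(x, 0x7FFF) != 1:
            return False
    return True
```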

Finally, some optimizations for signed comparisons. These have been lacking general forms which take advantage of absolute jumps as well as optimized forms for constants for quite some time. Thanks to jacobly and calc84maniac for helping me come up with the first two! If either of you two are reading this, feel free to look at the other operations and try to optimize them.

p_SGT0:
 .db 8
 ld a,h
 or l
 jr z,$+6
 add hl,hl
 sbc hl,hl
 inc hl
p_SLE0:
 .db 9
 ld a,h
 or l
 jr z,$+6
 add hl,hl
 ccf
 sbc hl,hl
 inc hl
p_SLtLeXX: 
 .db 11
 ld a,h
 add a,$80
 ld h,a
 ld de,$0000  ;$8000-const
 add hl,de
 sbc hl,hl
 inc hl
 .db rp_Ans,6
p_SGtGeXX:
 .db 12
 ld a,h
 add a,$80
 ld h,a
 xor a
 ld de,$0000  ;$8000-const
 add hl,de
 ld h,a
 rla
 ld l,a
 .db rp_Ans,6
p_SIntGt:
 .db 11
 scf
 sbc hl,de
 add hl,hl
 jp pe,$+4
 ccf
 sbc hl,hl
 inc hl
p_SIntGe:
 .db 11
 xor a
 sbc hl,de
 add hl,hl
 jp po,$+4
 ccf
 ld h,a
 rla
 ld l,a
p_SIntLt:
 .db 11
 scf
 sbc hl,de
 add hl,hl
 jp po,$+4
 ccf
 sbc hl,hl
 inc hl
p_SIntLe:
 .db 11
 xor a
 sbc hl,de
 add hl,hl
 jp pe,$+4
 ccf
 ld h,a
 rla
 ld l,a
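
The constant forms p_SLtLeXX and p_SGtGeXX both rely on flipping the sign bit so that a signed comparison becomes an unsigned one, then reading the carry of a single addition. A Python model of that idea follows; the function names are hypothetical, and `const` is the comparison constant, so the patched-in operand is `$8000-const` as in the listing's comment.

```python
def carry_after_bias(x, const):
    """Model of both routines up to the carry flag of `add hl,de`.

    x is a 16-bit pattern; const is the signed comparison constant.
    Returns 1 when the addition carries, else 0.
    """
    biased = (x + 0x8000) & 0xFFFF        # ld a,h / add a,$80 / ld h,a
    addend = (0x8000 - const) & 0xFFFF    # operand of ld de,...
    return 1 if biased + addend > 0xFFFF else 0


def less_than_const(x, const):
    """sbc hl,hl / inc hl turns the carry into 1 - carry."""
    return 1 - carry_after_bias(x, const)


def greater_equal_const(x, const):
    """ld h,a / rla / ld l,a (with a = 0) turns the carry into hl."""
    return carry_after_bias(x, const)
```

With `const = -32768` the addend wraps to zero and the carry is never set, which is one more reason to treat the ᴇ8000 comparisons as the trivial cases described earlier in this post.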

« Last Edit: December 19, 2011, 02:52:21 pm by Runer112 »

jacobly

LV5 Advanced (Next: 300)
Posts: 205
Rating: +161/-1

« Reply #279 on: December 20, 2011, 04:47:46 am »

p_DrawOff: save 1 byte, save ~40 cycles

Original

 xor a
 ld e,a
 dec a
__DrawOffShift:
 srl c
 rr e
 rra
 djnz __DrawOffShift
 dec d
 jr z,__DrawOffSkipRight
 ld b,a
 and (hl)
 or e
 ld (hl),a
 ld a,b
__DrawOffSkipRight:
 dec hl
 inc d
 jr z,__DrawOffSkipLeft
 cpl
 and (hl)
 or c
 ld (hl),a
__DrawOffSkipLeft:

Optimized

 xor a
 ld e,$FF
__DrawOffShift:
 srl c
 rr e
 rra
 djnz __DrawOffShift
 dec d
 jr z,__DrawOffSkipRight
 ld b,a
 or (hl)
 and e
 ld (hl),a
 ld a,b
__DrawOffSkipRight:
 dec hl
 inc d
 jr z,__DrawOffSkipLeft
 and (hl)
 or c
 ld (hl),a
__DrawOffSkipLeft:
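
That the reworked masking writes the same bytes as the original can be checked with a Python model of the shift chain. This is a sketch with hypothetical function names; it ignores the d-register clipping tests and only compares the two bytes each version would store.

```python
def shift_chain(c, e, a, count):
    """Apply `srl c / rr e / rra` count times to the 24-bit chain c:e:a."""
    for _ in range(count):
        carry = c & 1
        c >>= 1                                        # srl c
        e, carry = (e >> 1) | (carry << 7), e & 1      # rr e
        a, carry = (a >> 1) | (carry << 7), a & 1      # rra
    return c, e, a


def original_bytes(sprite, shift, right_bg, left_bg):
    """Original: a starts as $FF and becomes the keep-mask for the right byte."""
    c, e, a = shift_chain(sprite, 0x00, 0xFF, shift)
    right = (a & right_bg) | e
    left = ((a ^ 0xFF) & left_bg) | c                  # cpl / and (hl) / or c
    return right, left


def optimized_bytes(sprite, shift, right_bg, left_bg):
    """Optimized: e starts as $FF, so a ends up as the left keep-mask directly."""
    c, e, a = shift_chain(sprite, 0xFF, 0x00, shift)
    right = (a | right_bg) & e                         # or (hl) / and e
    left = (a & left_bg) | c                           # no cpl needed
    return right, left
```

The saved byte comes from dropping `cpl`: filling e with ones means the mask that falls into a is already the complement of the one the original built.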

p_Pix: save 2 bytes, save ~6 cycles

Original

p_Pix:
 .db __PixEnd-1-$  ;Draws pixel (c,l)
 ld de,plotSScreen
 pop af
 pop bc
 push af
 ld b,0

 ld a,l
 cp 64
 ld a,b
 ret nc
 ld a,c
 cp 96
 ld a,b
 ret nc

 ld h,b
 ld a,l
 add a,a
 add a,l
 ld l,a
 add hl,hl
 add hl,hl
 add hl,de
 ld a,c
 srl c
 srl c
 srl c
 add hl,bc
 and %00000111
 ld b,a
 ld a,%10000000
 ret z
___GetPixLoop:
 rrca
 djnz ___GetPixLoop
 ret
__PixEnd:

Optimized

p_Pix:
 .db __PixEnd-1-$  ;Draws pixel (c,l)
 ld de,plotSScreen
 pop af
 pop bc
 push af
 ld b,0

 ld a,c
 cp 96
 ld a,b
 ret nc
 sla l
 ret c
 sla l
 ret c

 ld h,b
 ex de,hl
 add hl,de
 add hl,de
 add hl,de
 ld a,c
 srl c
 srl c
 srl c
 add hl,bc
 and %00000111
 ld b,a
 ld a,%10000000
 ret z
___GetPixLoop:
 rrca
 djnz ___GetPixLoop
 ret
__PixEnd:
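
The optimized row handling folds the bounds check into the multiplication: shifting the row left twice overflows 8 bits exactly when the row is 64 or more. A short Python model (hypothetical function name) of that step:

```python
def row_offset_optimized(y):
    """Model of `sla l / ret c` twice, then three `add hl,de`.

    Returns None when either shift carries (row off-screen), otherwise
    the byte offset of the row within the 12-byte-wide buffer.
    """
    l = y & 0xFF
    for _ in range(2):
        if l & 0x80:            # sla l would set carry, so ret c is taken
            return None
        l = (l << 1) & 0xFF
    return 3 * l                # l is now 4*y, so three additions give 12*y
```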

p_ArcTan: save 1 byte, save ~1 cycle

Original

p_ArcTan:
 .db __ArcTanEnd-1-$
 ex de,hl  ;de = y
 pop hl
 ex (sp),hl  ;hl = x
 push hl
 ld a,h  ;\
 xor d  ;/ Get parity
 jp m,__ArcTanSS-p_ArcTan-1
 add hl,de  ;\
 jr __ArcTanDS ; |
__ArcTanSS:   ; |hl = x +- y
 sbc hl,de  ; |
__ArcTanDS:   ;/
 ex de,hl  ;de = x +- y
 ld b,6  ;\
__ArcTan64:   ; |
 add hl,hl  ; |hl = 64y
 djnz __ArcTan64 ;/
 call $3F00+sub_SDiv ;hl = 64y/(x +- y)
 pop af  ;\
 rla   ; |Right side, fine
 ret nc  ;/
 sbc a,a  ;\
 sub h  ; |Reverse sign extend
 ld h,a  ;/
 ld a,l  ;\
 add a,128  ; |Add or sub 128
 ld l,a  ;/
 ret
__ArcTanEnd:

Optimized

p_ArcTan:
 .db __ArcTanEnd-1-$
 ex de,hl  ;de = y
 pop hl
 ex (sp),hl  ;hl = x
 push hl
 ld a,h  ;\
 xor d  ;/ Get parity
 jp m,__ArcTanSS-p_ArcTan-2
 add hl,de  ;\
 ld c,c \ .db $FA ; |
 ;jr __ArcTanDS ; |
__ArcTanSS:   ; |hl = x +- y
 sbc hl,de  ; |
__ArcTanDS:   ;/
 ex de,hl  ;de = x +- y
 ld b,6  ;\
__ArcTan64:   ; |
 add hl,hl  ; |hl = 64y
 djnz __ArcTan64 ;/
 call $3F00+sub_SDiv ;hl = 64y/(x +- y)
 pop af  ;\
 rla   ; |Right side, fine
 ret nc  ;/
 sbc a,a  ;\
 sub h  ; |Reverse sign extend
 ld h,a  ;/
 ld a,l  ;\
 add a,128  ; |Add or sub 128
 ld l,a  ;/
 ret
__ArcTanEnd:

jacobly

« Reply #280 on: December 24, 2011, 01:45:01 pm »

p_DrawOr/Xor: save 17 bytes (plus 4 every time a custom buffer is used)
aligned saves 98 cycles, unaligned saves ~173 cycles
save additional 21 cycles every time a custom buffer is used

p_DrawOr:
 .db __DrawOrEnd-1-$
 push hl
 pop ix   ;Input ix = Sprite
 ld hl,plotSScreen  ;Input hl = Buffer
 pop af
 pop bc   ;Input c = Sprite Y Position
 pop de   ;Input e = Sprite X Position
 push af
 ld b,7
 ld a,e
 add a,b
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 ld d,a
 ld a,c
 add a,b
 jr c,__DrawOrClipTop
 sub 64+7
 ret nc
 cpl
 cp b
 jr c,__DrawOrClipBottom
 ld a,b
 jr __DrawOrClipBottom
__DrawOrClipTop:
 inc ix
 inc c
 jr nz,__DrawOrClipTop
__DrawOrClipBottom:
 inc a
 ld b,0
 sla c
 sla c
 add hl,bc
 add hl,bc
 add hl,bc
 ld c,d
 add hl,bc
 ld b,a
 ld a,e
 and 7
 jr z,__DrawOrAligned
 ld c,a
 ld a,e
 cp -7
 sbc a,a
 ld d,a
 and e
 cp 96-7
 sbc a,a
 ld e,a
__DrawOrLoop:
 push bc
 ld b,c
 ld c,(ix)
 xor a
__DrawOrShift:
 srl c
 rra
 djnz __DrawOrShift
 and e
 or (hl)
 ld (hl),a
 dec hl
 ld a,c
 and d
 or (hl)
 ld (hl),a
 ld c,13
 add hl,bc
 inc ix
 pop bc
 djnz __DrawOrLoop
 ret
__DrawOrAligned:
 ld de,12
__DrawOrAlignedLoop:
 ld a,(ix)
 or (hl)
 ld (hl),a
 inc ix
 add hl,de
 djnz __DrawOrAlignedLoop
 ret
__DrawOrEnd:

p_DrawXor:
 .db __DrawXorEnd-1-$
 push hl
 pop ix   ;Input ix = Sprite
 ld hl,plotSScreen  ;Input hl = Buffer
 pop af
 pop bc   ;Input c = Sprite Y Position
 pop de   ;Input e = Sprite X Position
 push af
 ld b,7
 ld a,e
 add a,b
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 ld d,a
 ld a,c
 add a,b
 jr c,__DrawXorClipTop
 sub 64+7
 ret nc
 cpl
 cp b
 jr c,__DrawXorClipBottom
 ld a,b
 jr __DrawXorClipBottom
__DrawXorClipTop:
 inc ix
 inc c
 jr nz,__DrawXorClipTop
__DrawXorClipBottom:
 inc a
 ld b,0
 sla c
 sla c
 add hl,bc
 add hl,bc
 add hl,bc
 ld c,d
 add hl,bc
 ld b,a
 ld a,e
 and 7
 jr z,__DrawXorAligned
 ld c,a
 ld a,e
 cp -7
 sbc a,a
 ld d,a
 and e
 cp 96-7
 sbc a,a
 ld e,a
__DrawXorLoop:
 push bc
 ld b,c
 ld c,(ix)
 xor a
__DrawXorShift:
 srl c
 rra
 djnz __DrawXorShift
 and e
 xor (hl)
 ld (hl),a
 dec hl
 ld a,c
 and d
 xor (hl)
 ld (hl),a
 ld c,13
 add hl,bc
 inc ix
 pop bc
 djnz __DrawXorLoop
 ret
__DrawXorAligned:
 ld de,12
__DrawXorAlignedLoop:
 ld a,(ix)
 xor (hl)
 ld (hl),a
 inc ix
 add hl,de
 djnz __DrawXorAlignedLoop
 ret
__DrawXorEnd:
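
The horizontal clipping in the unaligned path boils down to two byte masks produced by `cp` / `sbc a,a` pairs. The following Python sketch models just that mask setup; `clip_masks` is a hypothetical name, and x is the 8-bit X position after the earlier range check has limited it to -7..95.

```python
def clip_masks(x):
    """Return (left_mask, right_mask) for an unaligned sprite column.

    A mask of $00 suppresses the byte that would fall off the screen edge.
    """
    left = 0xFF if x < 0xF9 else 0x00           # cp -7 / sbc a,a / ld d,a
    probe = left & x                            # and e
    right = 0xFF if probe < 96 - 7 else 0x00    # cp 96-7 / sbc a,a / ld e,a
    return left, right
```

For X positions -7..-1 (stored as 249..255) only the right byte is drawn, for 89..95 only the left byte is drawn, and everything in between draws both.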

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4704
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

« Reply #281 on: December 24, 2011, 03:58:42 pm »

I finally have an optimisation that might work or be useful >.> Runer112 apparently mentioned optimising the p_FreqOut routine by replacing:

dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2

with this:

cpd
jp pe,__FreqOutLoop2

The issue was that the frequency would be thrown off, as it cut out 8*HL cycles. When I was stealing the code for my own evil intentions, though, I saw this optimisation, thought of that issue, and came up with this solution:


p_FreqOut:
 xor a
__FreqOutLoop1:
 push bc
 xor %00000011
 ld e,a
__FreqOutLoop2:
 ld a,h
 or l
 jr z,__FreqOutDone
 cpd
 ld a,e
 scf
 jp pe,__FreqOutLoop2
__FreqOutDone:
 pop bc
 out ($00),a
 ret nc
 jr __FreqOutLoop1
__FreqOutEnd:

The way the code is reordered, now, it should only cut out 8*HL/BC cycles which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.

EDIT: Okay, found a problem: the inner loop is now actually 2 cycles slower, so that will slow the routine by 2*HL cycles, too.

« Last Edit: December 24, 2011, 04:03:24 pm by Xeda112358 »

jacobly

« Reply #282 on: December 26, 2011, 12:17:26 pm »

p_DrawOr: 18 bytes saved
p_DrawXor: 18 bytes saved
p_DrawOff: 14 bytes saved
p_DrawMsk: 10 bytes saved
p_DrawMsk2: 11 bytes saved

p_DrawOr:
 .db __DrawOrEnd-1-$
 push hl
 pop ix   ;Input ix = Sprite
 ld hl,plotSScreen  ;Input hl = Buffer
 pop af
 pop de   ;Input e = Sprite Y Position
 pop bc   ;Input c = Sprite X Position
 push af
 ld d,7
 ld a,e
 add a,d
 jr c,__DrawOrClipTop
 sub 64+7
 ret nc
 cpl
 cp d
 jr c,__DrawOrClipBottom
 ld b,d
 jr __DrawOrNoClipV
__DrawOrClipTop:
 inc ix
 inc e
 jr nz,__DrawOrClipTop
__DrawOrClipBottom:
 ld b,a
__DrawOrNoClipV:
 ld a,c
 add a,d
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 sla e
 sla e
 add hl,de
 add hl,de
 add hl,de
 ld e,a
 inc b
 ld a,c
 and d
 ld d,-7*3
 add hl,de
 jr z,__DrawOrAligned
 ld e,c
 ld c,a
 ld a,e
 cp -7
 sbc a,a
 ld d,a
 and e
 cp 96-7
 sbc a,a
 ld e,a
__DrawOrLoop:
 push bc
 ld b,c
 ld c,(ix)
 xor a
__DrawOrShift:
 srl c
 rra
 djnz __DrawOrShift
 and e
 or (hl)
 ld (hl),a
 dec hl
 ld a,c
 and d
 or (hl)
 ld (hl),a
 ld c,13
 add hl,bc
 inc ix
 pop bc
 djnz __DrawOrLoop
 ret
__DrawOrAligned:
 ld de,12
__DrawOrAlignedLoop:
 ld a,(ix)
 or (hl)
 ld (hl),a
 inc ix
 add hl,de
 djnz __DrawOrAlignedLoop
 ret
__DrawOrEnd:

p_DrawXor:
 .db __DrawXorEnd-1-$
 push hl
 pop ix   ;Input ix = Sprite
 ld hl,plotSScreen  ;Input hl = Buffer
 pop af
 pop de   ;Input e = Sprite Y Position
 pop bc   ;Input c = Sprite X Position
 push af
 ld d,7
 ld a,e
 add a,d
 jr c,__DrawXorClipTop
 sub 64+7
 ret nc
 cpl
 cp d
 jr c,__DrawXorClipBottom
 ld b,d
 jr __DrawXorNoClipV
__DrawXorClipTop:
 inc ix
 inc e
 jr nz,__DrawXorClipTop
__DrawXorClipBottom:
 ld b,a
__DrawXorNoClipV:
 ld a,c
 add a,d
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 sla e
 sla e
 add hl,de
 add hl,de
 add hl,de
 ld e,a
 inc b
 ld a,c
 and d
 ld d,-7*3
 add hl,de
 jr z,__DrawXorAligned
 ld e,c
 ld c,a
 ld a,e
 cp -7
 sbc a,a
 ld d,a
 and e
 cp 96-7
 sbc a,a
 ld e,a
__DrawXorLoop:
 push bc
 ld b,c
 ld c,(ix)
 xor a
__DrawXorShift:
 srl c
 rra
 djnz __DrawXorShift
 and e
 xor (hl)
 ld (hl),a
 dec hl
 ld a,c
 and d
 xor (hl)
 ld (hl),a
 ld c,13
 add hl,bc
 inc ix
 pop bc
 djnz __DrawXorLoop
 ret
__DrawXorAligned:
 ld de,12
__DrawXorAlignedLoop:
 ld a,(ix)
 xor (hl)
 ld (hl),a
 inc ix
 add hl,de
 djnz __DrawXorAlignedLoop
 ret
__DrawXorEnd:

p_DrawOff:
 .db __DrawOffEnd-1-$
 push hl
 pop ix   ;Input ix = Sprite
 ld hl,plotSScreen  ;Input hl = Buffer
 pop af
 pop de   ;Input e = Sprite Y Position
 pop bc   ;Input c = Sprite X Position
 push af
 ld d,7
 ld a,e
 add a,d
 jr c,__DrawOffClipTop
 sub 64+7
 ret nc
 cpl
 cp d
 jr c,__DrawOffClipBottom
 ld b,d
 jr __DrawOffNoClipV
__DrawOffClipTop:
 inc ix
 inc e
 jr nz,__DrawOffClipTop
__DrawOffClipBottom:
 ld b,a
__DrawOffNoClipV:
 ld a,c
 add a,d
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 ld d,0
 sla e
 sla e
 add hl,de
 add hl,de
 add hl,de
 ld e,a
 add hl,de
 inc b
 ld a,c
 and 7
 jr z,__DrawOffAligned
 ld e,c
 ld c,a
 ld a,e
 cp -7
 jr nc,__DrawOffLoop
 inc d
 cp 96-7
 jr nc,__DrawOffLoop
 inc d
__DrawOffLoop:
 push bc
 ld b,c
 ld c,(ix+0)
 xor a
 ld e,$FF
__DrawOffShift:
 srl c
 rr e
 rra
 djnz __DrawOffShift
 dec d
 jr z,__DrawOffSkipRight
 ld b,a
 or (hl)
 and e
 ld (hl),a
 ld a,b
__DrawOffSkipRight:
 dec hl
 inc d
 jr z,__DrawOffSkipLeft
 and (hl)
 or c
 ld (hl),a
__DrawOffSkipLeft:
 ld bc,13
 add hl,bc
 inc ix
 pop bc
 djnz __DrawOffLoop
 ret
__DrawOffAligned:
 ld e,12
__DrawOffAlignedLoop:
 ld a,(ix)
 ld (hl),a
 inc ix
 add hl,de
 djnz __DrawOffAlignedLoop
 ret
__DrawOffEnd:

p_DrawMsk:
 .db __DrawMskEnd-1-$
 ex (sp),hl
 pop ix   ;Input hl = Sprite
 pop de
 pop bc
 push hl
 ld hl,plotSScreen
 ld d,7
 ld a,e
 add a,d
 jr c,__DrawMskClipTop
 sub 64+7
 ret nc
 cpl
 cp d
 jr c,__DrawMskClipBottom
 ld b,d
 jr __DrawMskNoClipV
__DrawMskClipTop:
 inc ix
 inc e
 jr nz,__DrawMskClipTop
__DrawMskClipBottom:
 ld b,a
__DrawMskNoClipV:
 ld a,c
 add a,d
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 ld d,0
 sla e
 sla e
 add hl,de
 add hl,de
 add hl,de
 ld e,a
 add hl,de
 inc b
 ld a,c
 and 7
 jr z,__DrawMskAligned
 ld e,c
 ld c,a
 ld a,e
 cp -7
 jr nc,__DrawMskLoop
 inc d
 cp 96-7
 jr nc,__DrawMskLoop
 inc d

__DrawMskLoop:
 push bc

 push hl

 ld b,c
 ld e,(ix+0)
 xor a
 ld h,a
 ld c,(ix+8)
__DrawMskShift:
 srl e
 rr h
 srl c
 rra
 djnz __DrawMskShift

 ld b,h
 pop hl
 push af

 dec d
 jr z,__DrawMskSkipRight1

 push bc
 xor b
 cpl
 ld c,a

 ld a,(hl)
 or b
 and c
 ld (hl),a
 pop bc

__DrawMskSkipRight1:
 dec hl
 inc d
 push de
 jr z,__DrawMskSkipLeft1

 ld a,c
 xor e
 cpl
 ld d,a

 ld a,(hl)
 or e
 and d
 ld (hl),a

__DrawMskSkipLeft1:
 ld de,appBackUpScreen-plotSScreen+1
 add hl,de
 pop de
 pop af
 dec d
 jr z,__DrawMskSkipRight2

 or b
 cpl

 and (hl)
 or b
 ld (hl),a

__DrawMskSkipRight2:
 dec hl
 inc d
 jr z,__DrawMskSkipLeft2

 ld a,c
 or e
 cpl

 and (hl)
 or e
 ld (hl),a

__DrawMskSkipLeft2:
 ld bc,plotSScreen-appBackUpScreen+13
 add hl,bc

 inc ix
 pop bc
 djnz __DrawMskLoop
 ret
__DrawMskAligned:
 push hl
 ld de,appBackUpScreen-plotSScreen
 add hl,de

 ld a,(ix+0)
 ld d,a
 xor (ix+8)
 cpl
 ld e,a

 and (hl)
 or d
 ld (hl),a

 pop hl

 ld a,(hl)
 or d
 and e
 ld (hl),a

 inc ix
 ld de,12
 add hl,de
 djnz __DrawMskAligned
 ret
__DrawMskEnd:

p_DrawMsk2:
 .db __DrawMsk2End-1-$
 ex (sp),hl
 pop ix   ;Input hl = Sprite
 pop de
 pop bc
 push hl
 ld hl,plotSScreen
 ld d,7
 ld a,e
 add a,d
 jr c,__DrawMsk2ClipTop
 sub 64+7
 ret nc
 cpl
 cp d
 jr c,__DrawMsk2ClipBottom
 ld b,d
 jr __DrawMsk2NoClipV
__DrawMsk2ClipTop:
 inc ix
 inc e
 jr nz,__DrawMsk2ClipTop
__DrawMsk2ClipBottom:
 ld b,a
__DrawMsk2NoClipV:
 ld a,c
 add a,d
 cp 96+7
 ret nc
 rrca
 rrca
 rrca
 and $1f
 ld d,0
 sla e
 sla e
 add hl,de
 add hl,de
 add hl,de
 ld e,a
 add hl,de
 inc b
 ld a,c
 and 7
 jr z,__DrawMsk2Aligned
 ld e,c
 ld c,a
 ld a,e
 cp -7
 jr nc,__DrawMsk2Loop
 inc d
 cp 96-7
 jr nc,__DrawMsk2Loop
 inc d
__DrawMsk2Loop:
 push bc
 push hl

 ld b,c
 ld e,(ix+0)
 xor a
 ld h,a
 ld c,(ix+8)
__DrawMsk2Shift:
 srl e
 rr h
 srl c
 rra
 djnz __DrawMsk2Shift

 ld b,h   ;e = left spr, b = right spr, c = left msk, a = right msk
 pop hl

 dec d
 jr z,__DrawMsk2SkipRight

 cpl
 and (hl)
 xor b
 ld (hl),a

__DrawMsk2SkipRight:
 dec hl
 inc d
 jr z,__DrawMsk2SkipLeft

 ld a,c
 cpl
 and (hl)
 xor e
 ld (hl),a

__DrawMsk2SkipLeft:
 ld bc,13
 add hl,bc

 inc ix
 pop bc
 djnz __DrawMsk2Loop
 ret
__DrawMsk2Aligned:
 ld e,12
__DrawMsk2AlignedLoop:
 ld a,(ix+8)
 cpl
 and (hl)
 xor (ix+0)
 ld (hl),a
 inc ix
 add hl,de
 djnz __DrawMsk2AlignedLoop
 ret
__DrawMsk2End:

Runer112

« Reply #283 on: February 02, 2012, 11:56:23 pm »

Just a small optimization I see with the new Nth string command. Because you restack the return location by popping it into bc, you're already loading bc with a value that's at least $4000 for applications and at least $8000 for programs, so the ld b,h inside the loop is not necessary.

jacobly

« Reply #284 on: September 18, 2012, 03:03:40 am »

Thanks to a suggestion from calc84maniac, I have optimized the routine that is used for both *^ and ** to be 25-50% faster.

In addition, every use of *^ would be 2 bytes smaller.

p_MulFull: same size, save 300-550 cycles

Original

p_MulFull:
 .db __MulFullEnd-1-$
 ld c,h
 ld a,l
 ld hl,0
 ld b,16
__MulFullNext:
 add hl,hl
 rla
 rl c
 jr nc,__MulFullSkip
 add hl,de
 adc a,0
 jr nc,__MulFullSkip
 inc c
__MulFullSkip:
 djnz __MulFullNext
 ret
__MulFullEnd:

Optimized

p_MulFull:
 .db __MulFullEnd-1-$
 xor a
 ld c,h
 ld h,a
 or l
 ld l,h
 call nz,__MulFullByte-p_MulFull-1
 ld a,c
__MulFullByte:
 ld b,8
__MulFullNext:
 rra
 jr nc,__MulFullSkip
 add hl,de
__MulFullSkip:
 rr h
 rr l
 djnz __MulFullNext
 ret
__MulFullEnd:

Note: the output has changed. hl now holds bits 16-31 of the result; execute rra after the routine returns to get bits 8-15 of the result in a.
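
The optimized routine's behaviour, including the extra rra, can be traced with a register-level Python model. This is a sketch rather than a cycle-accurate emulation; the function names are hypothetical, and flags other than carry are ignored.

```python
def mul_full_optimized(hl, de):
    """Model of the shift-right multiply; returns (hl, a, carry) at `ret`."""
    c = hl >> 8                      # ld c,h
    a = hl & 0xFF                    # xor a / or l leaves the low byte in a
    acc = 0                          # hl = 0
    carry = 0                        # or l clears the carry flag
    passes = [a, c] if a else [c]    # call nz skips the first pass when l = 0
    for multiplier in passes:
        a = multiplier
        for _ in range(8):
            bit = a & 1
            a = (a >> 1) | (carry << 7)          # rra: carry enters bit 7
            carry = bit
            if carry:                            # jr nc skips the addition
                acc += de                        # add hl,de
                carry = acc >> 16                # carry out of the 16-bit add
                acc &= 0xFFFF
            acc, carry = (carry << 15) | (acc >> 1), acc & 1   # rr h / rr l
    return acc, a, carry


def middle_byte(a, carry):
    """The rra performed by the caller: yields bits 8-15 of the product."""
    return (carry << 7) | (a >> 1)
```

Each pass shifts the accumulator right instead of left, so the low bits of the product fall out through the carry into a, which is why one final rra is needed to line them up.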

« Last Edit: September 18, 2012, 03:10:34 am by jacobly »
