### Author Topic: Assembly Programmers - Help Axe Optimize!  (Read 140691 times)


#### Runer112

• Project Author
• LV11 Super Veteran (Next: 3000)
• Posts: 2289
• Rating: +639/-31
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #270 on: December 12, 2011, 11:57:46 pm »
Yeah, I see no way to optimize the full 32-bit multiplication... But fixed-point multiplication, now that's an entirely different story! First, here's a totally different approach to sign handling that reduces p_88Mul to less than half of its current size!

Original routine: 38 bytes, ~1128 cycles

Code:

    p_88Mul:
        .db __88MulEnd-1-$
        ld a,h
        xor d
        push af
        bit 7,h
        jr z,$+8
        xor a
        sub l
        ld l,a
        sbc a,a
        sub h
        ld h,a
        bit 7,d
        jr z,$+8
        xor a
        sub e
        ld e,a
        sbc a,a
        sub d
        ld d,a
        call $3F00+sub_MulFull
        ld l,h
        ld h,a
        pop af
        xor h
        ret p
        xor a
        sub l
        ld l,a
        sbc a,a
        sub h
        ld h,a
        ret
    __88MulEnd:

Smaller routine: 18 bytes, ~1089 cycles

Code:

    p_88Mul:
        .db __88MulEnd-1-$
        push hl
        call $3F00+sub_MulFull
        pop bc
        bit 7,b
        jr z,$+3
        sub e
        ld l,h
        ld h,a
        bit 7,d
        ret z
        sub c
        ld h,a
        ret
    __88MulEnd:

20 bytes saved? Not bad at all! But what if you're more interested in shaving off cycles than bytes? Don't worry, I covered that base too. Instead of using the slower p_MulFull, this final routine uses my faster p_Mul for 8 bits of the multiplication and an inlined, slightly different version of faster multiplication for the other 8 bits. End result: it's about 260 cycles faster than the smaller solution, or about 30% faster! It's 16 bytes larger than my smaller method, but actually it would often end up resulting in smaller programs because it relies on the much more popular p_Mul instead of p_MulFull.

Faster routine: 34 bytes, ~831 cycles

Code:

    p_88Mul:
        .db __88MulEnd-1-$
        push hl
        ld c,l
        ld a,h
        ld l,0
        ld b,b \ .db 8 \ call $3F00+sub_Mul
        ld a,c
        ld bc,8<<8+0
    __88MulNext:
        add hl,hl
        rla
        jr nc,__88MulSkip
        add hl,de
        adc a,c
    __88MulSkip:
        djnz __88MulNext
        pop bc
        bit 7,b
        jr z,$+3
        sub e
        ld l,h
        ld h,a
        bit 7,d
        ret z
        sub c
        ld h,a
        ret
    __88MulEnd:
« Last Edit: December 13, 2011, 12:04:38 am by Runer112 »
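
For anyone who wants to check the sign trick in the 18-byte routine off-calc, here is a rough Python model. It assumes p_MulFull returns the unsigned 32-bit product with bits 16-23 in a and bits 8-15 in h, which is how the original routine's `ld l,h` / `ld h,a` appears to use it. The function names are made up for this sketch. The idea is that the unsigned product only differs from the signed one by a multiple of 2^16, so it is enough to subtract the other operand's low byte from the result's high byte once per negative input.

```python
def to_signed16(v):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return v - 0x10000 if v & 0x8000 else v


def mul88_sign_fix(x, y):
    """Model of the 18-byte p_88Mul; x, y and the result are 16-bit patterns."""
    product = x * y                   # unsigned 16x16 -> 32 bits (p_MulFull)
    a = (product >> 16) & 0xFF        # bits 16-23, assumed to come back in a
    h = (product >> 8) & 0xFF         # bits 8-15, assumed to come back in h
    if x & 0x8000:                    # bit 7,b (b = high byte of x)
        a = (a - (y & 0xFF)) & 0xFF   # sub e
    if y & 0x8000:                    # bit 7,d
        a = (a - (x & 0xFF)) & 0xFF   # sub c
    return (a << 8) | h               # ld l,h / ld h,a


def mul88_reference(x, y):
    """Signed 8.8 product, rounded down, as a 16-bit pattern."""
    return ((to_signed16(x) * to_signed16(y)) >> 8) & 0xFFFF


# 1.5 * -2.0 = -3.0 in 8.8 fixed point
print(hex(mul88_sign_fix(0x0180, 0xFE00)))  # 0xfd00
```

In this model the corrected result matches the rounded-down signed product for every input pair tried, including both operands negative.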

#### Quigibo

• The Executioner
• CoT Emeritus
• LV11 Super Veteran (Next: 3000)
• Posts: 2031
• Rating: +1075/-24
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #271 on: December 13, 2011, 01:19:54 am »
Wow thanks!  However there seems to be an issue.  The 3 pictures attached are the output from the Mandelbrot Set demo program. The first is the original routine.  The second is your new size optimized version.  As you can see it works, but the rounding appears to be asymmetrical (which might still be okay).  The last one is your speed optimized version.  I think you have a bug somewhere...
___Axe_Parser___
Today the calculator, tomorrow the world!

#### Runer112

• Project Author
• LV11 Super Veteran (Next: 3000)
• Posts: 2289
• Rating: +639/-31
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #272 on: December 13, 2011, 01:38:30 am »
I think I can explain the asymmetry of the size-optimized version. Because it adjusts signs differently, I think it now rounds down instead of towards zero like the old routine.
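
A small Python sketch of the two rounding behaviours described here, under the assumption that the old routine works on magnitudes and re-applies the sign (truncation toward zero) while the size-optimized one effectively takes the floor. Both function names are invented for illustration.

```python
def to_signed16(v):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return v - 0x10000 if v & 0x8000 else v


def mul88_toward_zero(x, y):
    """Old behaviour: multiply magnitudes, drop 8 bits, restore the sign."""
    sx, sy = to_signed16(x), to_signed16(y)
    magnitude = (abs(sx) * abs(sy)) >> 8
    if (sx < 0) != (sy < 0):
        magnitude = -magnitude
    return magnitude & 0xFFFF


def mul88_round_down(x, y):
    """New behaviour: Python's >> on a negative int rounds toward -infinity."""
    return ((to_signed16(x) * to_signed16(y)) >> 8) & 0xFFFF


# (1/256) * (-1/256) is a tiny negative number
print(hex(mul88_toward_zero(0x0001, 0xFFFF)))  # 0x0
print(hex(mul88_round_down(0x0001, 0xFFFF)))   # 0xffff
```

The two only disagree when the exact product is negative and not representable in 8.8, which would explain a slight asymmetry in something like the Mandelbrot demo.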

However, I have no clue what is going on with the speed-optimized routine. Can you look at the debugger and confirm that the call to sub_Mul is actually entering where it's supposed to be entering, at __MulByte? I wouldn't be surprised if the offset call macro you probably had to add for call nz,__MulByte in p_Mul is throwing off the offset calls because of its own size.
« Last Edit: December 13, 2011, 01:40:46 am by Runer112 »

#### Quigibo

• The Executioner
• CoT Emeritus
• LV11 Super Veteran (Next: 3000)
• Posts: 2031
• Rating: +1075/-24
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #273 on: December 13, 2011, 02:07:20 am »
The disassembly looks fine to me.  All the jumps, calls, and everything of that nature are aligned.  I tried 4 test cases with different combinations of sign values and they seemed okay.  Since the generated picture is relatively close to the original, given that it is a chaotic system sensitive to errors, I would guess it is only a few special cases that cause it to return a wrong result.

EDIT: I made a program to run them side by side on random numbers and quit when the output is different.  Here is an output that gives different results between the routines:

$FFE0 ** $F5F1 (-0.125 ** -10.059)

Results in $0143 (1.26) in size optimized. Results in $0239 (2.22) in speed optimized.
« Last Edit: December 13, 2011, 02:31:40 am by Quigibo »
___Axe_Parser___
Today the calculator, tomorrow the world!

#### Runer112

• Project Author
• LV11 Super Veteran (Next: 3000)
• Posts: 2289
• Rating: +639/-31
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #274 on: December 13, 2011, 03:04:07 am »
That edit was helpful, it gave me a hunch as to what the problem was and (I think) that hunch was correct. Unfortunately, the fix for this problem will cost a byte and about 70 cycles. It will still be about 20% faster than the small routine though. And it still relies on the more common p_Mul instead of p_MulFull, so being 17 bytes larger might still be worth it.

Faster routine: 35 bytes, ~900 cycles

Code:

    p_88Mul:
        .db __88MulEnd-1-$
        push hl
        ld c,l
        ld a,h
        ld l,0
        ld b,b \ .db 8 \ call $3F00+sub_Mul
        ld b,8
    __88MulNext:
        add hl,hl
        rla
        rl c
        jr nc,__88MulSkip
        add hl,de
        adc a,0
    __88MulSkip:
        djnz __88MulNext
        pop bc
        bit 7,b
        jr z,$+3
        sub e
        ld l,h
        ld h,a
        bit 7,d
        ret z
        sub c
        ld h,a
        ret
    __88MulEnd:

« Last Edit: December 13, 2011, 03:04:48 am by Runer112 »

#### calc84maniac

• eZ80 Guru
• Coder Of Tomorrow
• LV11 Super Veteran (Next: 3000)
• Posts: 2912
• Rating: +471/-17
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #275 on: December 17, 2011, 11:29:24 pm »
So... Z-Test. At a cost of 8 cycles, you can go from 17 bytes plus 3 bytes times the number of options (limited to something like 85?) to 16 bytes plus 2 bytes times the number of options (limited to amount of program space). Here's my method:

Code:

        ld de,-range
        add hl,de
        ld de,jumptable_end
        jr c,default
        add hl,hl
        add hl,de
        ld e,(hl)
        inc hl
        ld d,(hl)
    default:
        ex de,hl
        jp (hl)
        .dw Label0
        .dw Label1
        .dw Label2
        ;.....
    jumptable_end:

"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

#### Quigibo

• The Executioner
• CoT Emeritus
• LV11 Super Veteran (Next: 3000)
• Posts: 2031
• Rating: +1075/-24
• I wish real life had a "Save" and "Load" button...
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #276 on: December 17, 2011, 11:39:31 pm »
Wow thanks! I was considering that, but I assumed the overhead would be large, not smaller! Thanks! Also, I could move the labels to the data section of the code to make it even faster!

Code:

        ld de,-range
        add hl,de
        jr c,default
        add hl,hl
        ld de,jumptable_end
        add hl,de
        ld e,(hl)
        inc hl
        ld d,(hl)
        ex de,hl
        jp (hl)
    default:

« Last Edit: December 17, 2011, 11:41:17 pm by Quigibo »
___Axe_Parser___
Today the calculator, tomorrow the world!

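To make the jump-table Z-Test from replies #275 and #276 a little easier to follow, here is a hedged Python model of the address arithmetic. The memory dictionary, the `table_end` address, and the function name are assumptions for this sketch. The point is that adding `-range` does the bounds check and leaves a negative index in hl, which is then doubled and added to the end of the table.

```python
def z_test_dispatch(index, labels, default, table_end=0x9000):
    """Return the 16-bit target picked by the table-based Z-Test model.

    labels must be non-empty; table_end is a hypothetical address.
    """
    rng = len(labels)
    table_start = table_end - 2 * rng
    memory = {}
    for i, target in enumerate(labels):          # .dw Label0, Label1, ...
        memory[table_start + 2 * i] = target & 0xFF
        memory[table_start + 2 * i + 1] = target >> 8

    total = index + ((-rng) & 0xFFFF)            # ld de,-range / add hl,de
    if total > 0xFFFF:                           # carry set: index >= range
        return default                           # jr c,default
    hl = total & 0xFFFF                          # index - range, mod 2^16
    hl = (hl * 2) & 0xFFFF                       # add hl,hl
    hl = (hl + table_end) & 0xFFFF               # add hl,de (de = jumptable_end)
    return memory[hl] | (memory[hl + 1] << 8)    # ld e,(hl) / inc hl / ld d,(hl)


labels = [0x9D95, 0x9DA0, 0x9DB7]
print(hex(z_test_dispatch(1, labels, default=0x9E00)))  # 0x9da0
print(hex(z_test_dispatch(3, labels, default=0x9E00)))  # 0x9e00
```

Because each entry is a 2-byte pointer rather than a compare-and-jump, the per-option cost is 2 bytes, which matches the numbers quoted above.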
#### calc84maniac

• eZ80 Guru
• Coder Of Tomorrow
• LV11 Super Veteran (Next: 3000)
• Posts: 2912
• Rating: +471/-17
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #277 on: December 17, 2011, 11:50:41 pm »
If you wanted to save 2 cycles in the case of a jump, you could use an odd table setup with all the LSBs in a row followed by all the MSBs in a row, like so:

Code:

        ld de,-range
        add hl,de
        jr c,routine_end
        ex de,hl
        ld hl,jumptable_end
        add hl,de
        ld a,(hl)
        add hl,de
        ld l,(hl)
        ld h,a
        jp (hl)
    routine_end:

I imagine that might not work well with the way pointers are handled in the compiler, though.

Edit: And I suppose the current Z-Test is actually limited to 39 options due to the range of the JR instruction...
« Last Edit: December 17, 2011, 11:56:28 pm by calc84maniac »
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

#### Runer112

• Project Author
• LV11 Super Veteran (Next: 3000)
• Posts: 2289
• Rating: +639/-31
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #278 on: December 19, 2011, 01:12:53 am »
First, an optimization that I can't give you code for: making *^CONST use an equivalent constant division optimization if one exists. And don't forget about the trivial cases, *^1 and *^0. Of course, these only apply if you don't change this operation to return a 32-bit result somehow. Which it really should.

Next, some silly optimizations: ^0, <<ᴇ8000, >>ᴇ7FFF should simply be 0, while ≥≥ᴇ8000 and ≤≤ᴇ7FFF should simply be 1. If you're wondering why ^0 should be 0, that's what the general modulus routine would return anyways.

Finally, some optimizations for signed comparisons. These have been lacking general forms which take advantage of absolute jumps as well as optimized forms for constants for quite some time. Thanks to jacobly and calc84maniac for helping me come up with the first two! If either of you two are reading this, feel free to look at the other operations and try to optimize them.

Code:

    p_SGT0:
        .db 8
        ld a,h
        or l
        jr z,$+6
        add hl,hl
        sbc hl,hl
        inc hl
    p_SLE0:
        .db 9
        ld a,h
        or l
        jr z,$+6
        add hl,hl
        ccf
        sbc hl,hl
        inc hl
    p_SLtLeXX:
        .db 11
        ld a,h
        add a,$80
        ld h,a
        ld de,$0000 ;$8000-const
        add hl,de
        sbc hl,hl
        inc hl
        .db rp_Ans,6
    p_SGtGeXX:
        .db 12
        ld a,h
        add a,$80
        ld h,a
        xor a
        ld de,$0000 ;$8000-const
        add hl,de
        ld h,a
        rla
        ld l,a
        .db rp_Ans,6
    p_SIntGt:
        .db 11
        scf
        sbc hl,de
        add hl,hl
        jp pe,$+4
        ccf
        sbc hl,hl
        inc hl
    p_SIntGe:
        .db 11
        xor a
        sbc hl,de
        add hl,hl
        jp po,$+4
        ccf
        ld h,a
        rla
        ld l,a
    p_SIntLt:
        .db 11
        scf
        sbc hl,de
        add hl,hl
        jp po,$+4
        ccf
        sbc hl,hl
        inc hl
    p_SIntLe:
        .db 11
        xor a
        sbc hl,de
        add hl,hl
        jp pe,$+4
        ccf
        ld h,a
        rla
        ld l,a

« Last Edit: December 19, 2011, 02:52:21 pm by Runer112 »

#### jacobly

• LV5 Advanced (Next: 300)
• Posts: 205
• Rating: +161/-1
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #279 on: December 20, 2011, 04:47:46 am »
p_DrawOff: save 1 byte, save ~40 cycles

Original

Code:

        xor a
        ld e,a
        dec a
    __DrawOffShift:
        srl c
        rr e
        rra
        djnz __DrawOffShift
        dec d
        jr z,__DrawOffSkipRight
        ld b,a
        and (hl)
        or e
        ld (hl),a
        ld a,b
    __DrawOffSkipRight:
        dec hl
        inc d
        jr z,__DrawOffSkipLeft
        cpl
        and (hl)
        or c
        ld (hl),a
    __DrawOffSkipLeft:

Optimized

Code:

        xor a
        ld e,$FF
    __DrawOffShift:
        srl c
        rr e
        rra
        djnz __DrawOffShift
        dec d
        jr z,__DrawOffSkipRight
        ld b,a
        or (hl)
        and e
        ld (hl),a
        ld a,b
    __DrawOffSkipRight:
        dec hl
        inc d
        jr z,__DrawOffSkipLeft
        and (hl)
        or c
        ld (hl),a
    __DrawOffSkipLeft:
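
Going back to the constant signed comparison in reply #278 (p_SLtLeXX), here is a hedged Python model of how the `add a,$80` bias turns a signed compare into a carry test. It assumes the compiler patches the `ld de` operand with `$8000-const`, as the comment in the routine suggests. The function name is invented for this sketch.

```python
def signed_lt_const(x, const):
    """Model of p_SLtLeXX: 1 if signed x < signed const, else 0.

    x and const are 16-bit patterns.
    """
    hl = (x + 0x8000) & 0xFFFF        # ld a,h / add a,$80 / ld h,a
    de = (0x8000 - const) & 0xFFFF    # ld de,$8000-const
    carry = (hl + de) > 0xFFFF        # add hl,de
    hl = 0xFFFF if carry else 0x0000  # sbc hl,hl
    return (hl + 1) & 0xFFFF          # inc hl


def to_signed16(v):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return v - 0x10000 if v & 0x8000 else v


print(signed_lt_const(0xFFFF, 0x0001))  # -1 < 1  -> 1
print(signed_lt_const(0x0005, 0xFFFE))  # 5 < -2  -> 0
```

In this model the one constant that misbehaves is $8000: the patched operand becomes 0, the carry can never be set, and the routine always returns 1. That fits the suggestion above that <<ᴇ8000 should simply be special-cased to 0.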

p_Pix: save 2 bytes, save ~6 cycles
Original

Code:

    p_Pix:
        .db __PixEnd-1-$ ;Draws pixel (c,l)
        ld de,plotSScreen
        pop af
        pop bc
        push af
        ld b,0
        ld a,l
        cp 64
        ld a,b
        ret nc
        ld a,c
        cp 96
        ld a,b
        ret nc
        ld h,b
        ld a,l
        add a,a
        add a,l
        ld l,a
        add hl,hl
        add hl,hl
        add hl,de
        ld a,c
        srl c
        srl c
        srl c
        add hl,bc
        and %00000111
        ld b,a
        ld a,%10000000
        ret z
    ___GetPixLoop:
        rrca
        djnz ___GetPixLoop
        ret
    __PixEnd:

Optimized

Code:

    p_Pix:
        .db __PixEnd-1-$ ;Draws pixel (c,l)
        ld de,plotSScreen
        pop af
        pop bc
        push af
        ld b,0
        ld a,c
        cp 96
        ld a,b
        ret nc
        sla l
        ret c
        sla l
        ret c
        ld h,b
        ex de,hl
        add hl,de
        add hl,de
        add hl,de
        ld a,c
        srl c
        srl c
        srl c
        add hl,bc
        and %00000111
        ld b,a
        ld a,%10000000
        ret z
    ___GetPixLoop:
        rrca
        djnz ___GetPixLoop
        ret
    __PixEnd:
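
As a rough cross-check of the optimized p_Pix, here is a Python sketch of what it appears to compute: a byte offset of 12*y + x/8 into the buffer and a bit mask of $80 shifted right by x mod 8. The function name is made up, and `None` stands in for the early returns (where the real routine leaves a mask of 0 in a).

```python
def pix_offset_and_mask(x, y):
    """Model of the optimized p_Pix for byte inputs x (in c) and y (in l)."""
    if x >= 96:                      # ld a,c / cp 96 / ret nc
        return None
    l = y & 0xFF
    for _ in range(2):               # sla l / ret c, twice
        l <<= 1
        if l > 0xFF:                 # carry out of bit 7 means y >= 64
            return None
    offset = 3 * l + (x >> 3)        # three add hl,de (4y each), then add hl,bc
    mask = 0x80 >> (x & 7)           # ld a,%10000000, then rrca x&7 times
    return offset, mask


print(pix_offset_and_mask(10, 2))    # (25, 32)
print(pix_offset_and_mask(10, 64))   # None
```

The two `sla l` / `ret c` pairs do double duty: they reject any y of 64 or more and leave 4*y in l, so the separate `cp 64` and the multiply-by-3 in a are no longer needed.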

p_ArcTan: save 1 byte, save ~1 cycle
Original

Code:

    p_ArcTan:
        .db __ArcTanEnd-1-$
        ex de,hl                ;de = y
        pop hl
        ex (sp),hl              ;hl = x
        push hl
        ld a,h                  ;\
        xor d                   ;/ Get parity
        jp m,__ArcTanSS-p_ArcTan-1
        add hl,de               ;\
        jr __ArcTanDS           ; |
    __ArcTanSS:                 ; |hl = x +- y
        sbc hl,de               ; |
    __ArcTanDS:                 ;/
        ex de,hl                ;de = x +- y
        ld b,6                  ;\
    __ArcTan64:                 ; |
        add hl,hl               ; |hl = 64y
        djnz __ArcTan64         ;/
        call $3F00+sub_SDiv     ;hl = 64y/(x +- y)
        pop af                  ;\
        rla                     ; |Right side, fine
        ret nc                  ;/
        sbc a,a                 ;\
        sub h                   ; |Reverse sign extend
        ld h,a                  ;/
        ld a,l                  ;\
        add a,128               ; |Add or sub 128
        ld l,a                  ;/
        ret
    __ArcTanEnd:

Optimized

Code:

    p_ArcTan:
        .db __ArcTanEnd-1-$
        ex de,hl                ;de = y
        pop hl
        ex (sp),hl              ;hl = x
        push hl
        ld a,h                  ;\
        xor d                   ;/ Get parity
        jp m,__ArcTanSS-p_ArcTan-2
        add hl,de               ;\
        ld c,c \ .db $FA        ; |
        ;jr __ArcTanDS          ; |
    __ArcTanSS:                 ; |hl = x +- y
        sbc hl,de               ; |
    __ArcTanDS:                 ;/
        ex de,hl                ;de = x +- y
        ld b,6                  ;\
    __ArcTan64:                 ; |
        add hl,hl               ; |hl = 64y
        djnz __ArcTan64         ;/
        call $3F00+sub_SDiv     ;hl = 64y/(x +- y)
        pop af                  ;\
        rla                     ; |Right side, fine
        ret nc                  ;/
        sbc a,a                 ;\
        sub h                   ; |Reverse sign extend
        ld h,a                  ;/
        ld a,l                  ;\
        add a,128               ; |Add or sub 128
        ld l,a                  ;/
        ret
    __ArcTanEnd:

#### jacobly

• LV5 Advanced (Next: 300)
• Posts: 205
• Rating: +161/-1
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #280 on: December 24, 2011, 01:45:01 pm »
p_DrawOr/Xor: save 17 bytes (plus 4 every time a custom buffer is used)
aligned saves 98 cycles, unaligned saves ~173 cycles
save additional 21 cycles every time a custom buffer is used

Code:

    p_DrawOr:
        .db __DrawOrEnd-1-$
        push hl
        pop ix ;Input ix = Sprite
        ld hl,plotSScreen ;Input hl = Buffer
        pop af
        pop bc ;Input c = Sprite Y Position
        pop de ;Input e = Sprite X Position
        push af
        ld b,7
        ld a,e
        add a,b
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        ld d,a
        ld a,c
        add a,b
        jr c,__DrawOrClipTop
        sub 64+7
        ret nc
        cpl
        cp b
        jr c,__DrawOrClipBottom
        ld a,b
        jr __DrawOrClipBottom
    __DrawOrClipTop:
        inc ix
        inc c
        jr nz,__DrawOrClipTop
    __DrawOrClipBottom:
        inc a
        ld b,0
        sla c
        sla c
        add hl,bc
        add hl,bc
        add hl,bc
        ld c,d
        add hl,bc
        ld b,a
        ld a,e
        and 7
        jr z,__DrawOrAligned
        ld c,a
        ld a,e
        cp -7
        sbc a,a
        ld d,a
        and e
        cp 96-7
        sbc a,a
        ld e,a
    __DrawOrLoop:
        push bc
        ld b,c
        ld c,(ix)
        xor a
    __DrawOrShift:
        srl c
        rra
        djnz __DrawOrShift
        and e
        or (hl)
        ld (hl),a
        dec hl
        ld a,c
        and d
        or (hl)
        ld (hl),a
        ld c,13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawOrLoop
        ret
    __DrawOrAligned:
        ld de,12
    __DrawOrAlignedLoop:
        ld a,(ix)
        or (hl)
        ld (hl),a
        inc ix
        add hl,de
        djnz __DrawOrAlignedLoop
        ret
    __DrawOrEnd:
    p_DrawXor:
        .db __DrawXorEnd-1-$
        push hl
        pop ix ;Input ix = Sprite
        ld hl,plotSScreen ;Input hl = Buffer
        pop af
        pop bc ;Input c = Sprite Y Position
        pop de ;Input e = Sprite X Position
        push af
        ld b,7
        ld a,e
        add a,b
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        ld d,a
        ld a,c
        add a,b
        jr c,__DrawXorClipTop
        sub 64+7
        ret nc
        cpl
        cp b
        jr c,__DrawXorClipBottom
        ld a,b
        jr __DrawXorClipBottom
    __DrawXorClipTop:
        inc ix
        inc c
        jr nz,__DrawXorClipTop
    __DrawXorClipBottom:
        inc a
        ld b,0
        sla c
        sla c
        add hl,bc
        add hl,bc
        add hl,bc
        ld c,d
        add hl,bc
        ld b,a
        ld a,e
        and 7
        jr z,__DrawXorAligned
        ld c,a
        ld a,e
        cp -7
        sbc a,a
        ld d,a
        and e
        cp 96-7
        sbc a,a
        ld e,a
    __DrawXorLoop:
        push bc
        ld b,c
        ld c,(ix)
        xor a
    __DrawXorShift:
        srl c
        rra
        djnz __DrawXorShift
        and e
        xor (hl)
        ld (hl),a
        dec hl
        ld a,c
        and d
        xor (hl)
        ld (hl),a
        ld c,13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawXorLoop
        ret
    __DrawXorAligned:
        ld de,12
    __DrawXorAlignedLoop:
        ld a,(ix)
        xor (hl)
        ld (hl),a
        inc ix
        add hl,de
        djnz __DrawXorAlignedLoop
        ret
    __DrawXorEnd:

#### Xeda112358

• they/them
• Moderator
• LV12 Extreme Poster (Next: 5000)
• Posts: 4704
• Rating: +719/-6
• Calc-u-lator, do doo doo do do do.
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #281 on: December 24, 2011, 03:58:42 pm »
I finally have an optimisation that might work or be useful >.> Runer112 apparently mentioned optimising the p_FreqOut routine by replacing:

Code:

        dec hl
        dec bc
        ld a,b
        or c
        jr nz,__FreqOutLoop2

with this:

Code:

        cpd
        jp pe,__FreqOutLoop2

However, the issue was that the frequency would be thrown off as it cut out 8*HL cycles. However, when I was stealing the code for my own evil intentions, I saw this optimisation and thought of that issue and here is my solution:

Code:

    p_FreqOut:
        xor a
    __FreqOutLoop1:
        push bc
        xor %00000011
        ld e,a
    __FreqOutLoop2:
        ld a,h
        or l
        jr z,__FreqOutDone
        cpd
        ld a,e
        scf
        jp pe,__FreqOutLoop2
    __FreqOutDone:
        pop bc
        out ($00),a
        ret nc
        jr __FreqOutLoop1
    __FreqOutEnd:

The way the code is reordered, now, it should only cut out 8*HL/BC cycles which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.

EDIT: Okay, found a problem: It is actually 2 cycles slower in the inside loop, now, so that will just slow the routine by 2*hl, too
« Last Edit: December 24, 2011, 04:03:24 pm by Xeda112358 »

#### jacobly

• Posts: 205
• Rating: +161/-1
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #282 on: December 26, 2011, 12:17:26 pm »
p_DrawOr: 18 bytes saved
p_DrawXor: 18 bytes saved
p_DrawOff: 14 bytes saved
p_DrawMsk: 10 bytes saved
p_DrawMsk2: 11 bytes saved
Code:

    p_DrawOr:
        .db __DrawOrEnd-1-$
        push hl
        pop ix ;Input ix = Sprite
        ld hl,plotSScreen ;Input hl = Buffer
        pop af
        pop de ;Input e = Sprite Y Position
        pop bc ;Input c = Sprite X Position
        push af
        ld d,7
        ld a,e
        add a,d
        jr c,__DrawOrClipTop
        sub 64+7
        ret nc
        cpl
        cp d
        jr c,__DrawOrClipBottom
        ld b,d
        jr __DrawOrNoClipV
    __DrawOrClipTop:
        inc ix
        inc e
        jr nz,__DrawOrClipTop
    __DrawOrClipBottom:
        ld b,a
    __DrawOrNoClipV:
        ld a,c
        add a,d
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        sla e
        sla e
        add hl,de
        add hl,de
        add hl,de
        ld e,a
        inc b
        ld a,c
        and d
        ld d,-7*3
        add hl,de
        jr z,__DrawOrAligned
        ld e,c
        ld c,a
        ld a,e
        cp -7
        sbc a,a
        ld d,a
        and e
        cp 96-7
        sbc a,a
        ld e,a
    __DrawOrLoop:
        push bc
        ld b,c
        ld c,(ix)
        xor a
    __DrawOrShift:
        srl c
        rra
        djnz __DrawOrShift
        and e
        or (hl)
        ld (hl),a
        dec hl
        ld a,c
        and d
        or (hl)
        ld (hl),a
        ld c,13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawOrLoop
        ret
    __DrawOrAligned:
        ld de,12
    __DrawOrAlignedLoop:
        ld a,(ix)
        or (hl)
        ld (hl),a
        inc ix
        add hl,de
        djnz __DrawOrAlignedLoop
        ret
    __DrawOrEnd:
    p_DrawXor:
        .db __DrawXorEnd-1-$
        push hl
        pop ix ;Input ix = Sprite
        ld hl,plotSScreen ;Input hl = Buffer
        pop af
        pop de ;Input e = Sprite Y Position
        pop bc ;Input c = Sprite X Position
        push af
        ld d,7
        ld a,e
        add a,d
        jr c,__DrawXorClipTop
        sub 64+7
        ret nc
        cpl
        cp d
        jr c,__DrawXorClipBottom
        ld b,d
        jr __DrawXorNoClipV
    __DrawXorClipTop:
        inc ix
        inc e
        jr nz,__DrawXorClipTop
    __DrawXorClipBottom:
        ld b,a
    __DrawXorNoClipV:
        ld a,c
        add a,d
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        sla e
        sla e
        add hl,de
        add hl,de
        add hl,de
        ld e,a
        inc b
        ld a,c
        and d
        ld d,-7*3
        add hl,de
        jr z,__DrawXorAligned
        ld e,c
        ld c,a
        ld a,e
        cp -7
        sbc a,a
        ld d,a
        and e
        cp 96-7
        sbc a,a
        ld e,a
    __DrawXorLoop:
        push bc
        ld b,c
        ld c,(ix)
        xor a
    __DrawXorShift:
        srl c
        rra
        djnz __DrawXorShift
        and e
        xor (hl)
        ld (hl),a
        dec hl
        ld a,c
        and d
        xor (hl)
        ld (hl),a
        ld c,13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawXorLoop
        ret
    __DrawXorAligned:
        ld de,12
    __DrawXorAlignedLoop:
        ld a,(ix)
        xor (hl)
        ld (hl),a
        inc ix
        add hl,de
        djnz __DrawXorAlignedLoop
        ret
    __DrawXorEnd:
    p_DrawOff:
        .db __DrawOffEnd-1-$
        push hl
        pop ix ;Input ix = Sprite
        ld hl,plotSScreen ;Input hl = Buffer
        pop af
        pop de ;Input e = Sprite Y Position
        pop bc ;Input c = Sprite X Position
        push af
        ld d,7
        ld a,e
        add a,d
        jr c,__DrawOffClipTop
        sub 64+7
        ret nc
        cpl
        cp d
        jr c,__DrawOffClipBottom
        ld b,d
        jr __DrawOffNoClipV
    __DrawOffClipTop:
        inc ix
        inc e
        jr nz,__DrawOffClipTop
    __DrawOffClipBottom:
        ld b,a
    __DrawOffNoClipV:
        ld a,c
        add a,d
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        ld d,0
        sla e
        sla e
        add hl,de
        add hl,de
        add hl,de
        ld e,a
        add hl,de
        inc b
        ld a,c
        and 7
        jr z,__DrawOffAligned
        ld e,c
        ld c,a
        ld a,e
        cp -7
        jr nc,__DrawOffLoop
        inc d
        cp 96-7
        jr nc,__DrawOffLoop
        inc d
    __DrawOffLoop:
        push bc
        ld b,c
        ld c,(ix+0)
        xor a
        ld e,$FF
    __DrawOffShift:
        srl c
        rr e
        rra
        djnz __DrawOffShift
        dec d
        jr z,__DrawOffSkipRight
        ld b,a
        or (hl)
        and e
        ld (hl),a
        ld a,b
    __DrawOffSkipRight:
        dec hl
        inc d
        jr z,__DrawOffSkipLeft
        and (hl)
        or c
        ld (hl),a
    __DrawOffSkipLeft:
        ld bc,13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawOffLoop
        ret
    __DrawOffAligned:
        ld e,12
    __DrawOffAlignedLoop:
        ld a,(ix)
        ld (hl),a
        inc ix
        add hl,de
        djnz __DrawOffAlignedLoop
        ret
    __DrawOffEnd:
    p_DrawMsk:
        .db __DrawMskEnd-1-$
        ex (sp),hl
        pop ix ;Input hl = Sprite
        pop de
        pop bc
        push hl
        ld hl,plotSScreen
        ld d,7
        ld a,e
        add a,d
        jr c,__DrawMskClipTop
        sub 64+7
        ret nc
        cpl
        cp d
        jr c,__DrawMskClipBottom
        ld b,d
        jr __DrawMskNoClipV
    __DrawMskClipTop:
        inc ix
        inc e
        jr nz,__DrawMskClipTop
    __DrawMskClipBottom:
        ld b,a
    __DrawMskNoClipV:
        ld a,c
        add a,d
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        ld d,0
        sla e
        sla e
        add hl,de
        add hl,de
        add hl,de
        ld e,a
        add hl,de
        inc b
        ld a,c
        and 7
        jr z,__DrawMskAligned
        ld e,c
        ld c,a
        ld a,e
        cp -7
        jr nc,__DrawMskLoop
        inc d
        cp 96-7
        jr nc,__DrawMskLoop
        inc d
    __DrawMskLoop:
        push bc
        push hl
        ld b,c
        ld e,(ix+0)
        xor a
        ld h,a
        ld c,(ix+8)
    __DrawMskShift:
        srl e
        rr h
        srl c
        rra
        djnz __DrawMskShift
        ld b,h
        pop hl
        push af
        dec d
        jr z,__DrawMskSkipRight1
        push bc
        xor b
        cpl
        ld c,a
        ld a,(hl)
        or b
        and c
        ld (hl),a
        pop bc
    __DrawMskSkipRight1:
        dec hl
        inc d
        push de
        jr z,__DrawMskSkipLeft1
        ld a,c
        xor e
        cpl
        ld d,a
        ld a,(hl)
        or e
        and d
        ld (hl),a
    __DrawMskSkipLeft1:
        ld de,appBackUpScreen-plotSScreen+1
        add hl,de
        pop de
        pop af
        dec d
        jr z,__DrawMskSkipRight2
        or b
        cpl
        and (hl)
        or b
        ld (hl),a
    __DrawMskSkipRight2:
        dec hl
        inc d
        jr z,__DrawMskSkipLeft2
        ld a,c
        or e
        cpl
        and (hl)
        or e
        ld (hl),a
    __DrawMskSkipLeft2:
        ld bc,plotSScreen-appBackUpScreen+13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawMskLoop
        ret
    __DrawMskAligned:
        push hl
        ld de,appBackUpScreen-plotSScreen
        add hl,de
        ld a,(ix+0)
        ld d,a
        xor (ix+8)
        cpl
        ld e,a
        and (hl)
        or d
        ld (hl),a
        pop hl
        ld a,(hl)
        or d
        and e
        ld (hl),a
        inc ix
        ld de,12
        add hl,de
        djnz __DrawMskAligned
        ret
    __DrawMskEnd:
    p_DrawMsk2:
        .db __DrawMsk2End-1-$
        ex (sp),hl
        pop ix ;Input hl = Sprite
        pop de
        pop bc
        push hl
        ld hl,plotSScreen
        ld d,7
        ld a,e
        add a,d
        jr c,__DrawMsk2ClipTop
        sub 64+7
        ret nc
        cpl
        cp d
        jr c,__DrawMsk2ClipBottom
        ld b,d
        jr __DrawMsk2NoClipV
    __DrawMsk2ClipTop:
        inc ix
        inc e
        jr nz,__DrawMsk2ClipTop
    __DrawMsk2ClipBottom:
        ld b,a
    __DrawMsk2NoClipV:
        ld a,c
        add a,d
        cp 96+7
        ret nc
        rrca
        rrca
        rrca
        and $1f
        ld d,0
        sla e
        sla e
        add hl,de
        add hl,de
        add hl,de
        ld e,a
        add hl,de
        inc b
        ld a,c
        and 7
        jr z,__DrawMsk2Aligned
        ld e,c
        ld c,a
        ld a,e
        cp -7
        jr nc,__DrawMsk2Loop
        inc d
        cp 96-7
        jr nc,__DrawMsk2Loop
        inc d
    __DrawMsk2Loop:
        push bc
        push hl
        ld b,c
        ld e,(ix+0)
        xor a
        ld h,a
        ld c,(ix+8)
    __DrawMsk2Shift:
        srl e
        rr h
        srl c
        rra
        djnz __DrawMsk2Shift
        ld b,h ;e = left spr, b = right spr, c = left msk, a = right msk
        pop hl
        dec d
        jr z,__DrawMsk2SkipRight
        cpl
        and (hl)
        xor b
        ld (hl),a
    __DrawMsk2SkipRight:
        dec hl
        inc d
        jr z,__DrawMsk2SkipLeft
        ld a,c
        cpl
        and (hl)
        xor e
        ld (hl),a
    __DrawMsk2SkipLeft:
        ld bc,13
        add hl,bc
        inc ix
        pop bc
        djnz __DrawMsk2Loop
        ret
    __DrawMsk2Aligned:
        ld e,12
    __DrawMsk2AlignedLoop:
        ld a,(ix+8)
        cpl
        and (hl)
        xor (ix+0)
        ld (hl),a
        inc ix
        add hl,de
        djnz __DrawMsk2AlignedLoop
        ret
    __DrawMsk2End:

#### Runer112

• Project Author
• LV11 Super Veteran (Next: 3000)
• Posts: 2289
• Rating: +639/-31
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #283 on: February 02, 2012, 11:56:23 pm »
Just a small optimization I see with the new Nth string command. Because you restack the return location by popping it into bc, you're already loading bc with a value that's at least $4000 for applications and at least $8000 for programs, so the ld b,h inside the loop is not necessary.

#### jacobly

• LV5 Advanced (Next: 300)
• Posts: 205
• Rating: +161/-1
##### Re: Assembly Programmers - Help Axe Optimize!
« Reply #284 on: September 18, 2012, 03:03:40 am »
Thanks to a suggestion from calc84maniac, I have optimized the routine that is used for both *^ and ** to be 25-50% faster. In addition, every use of *^ would be 2 bytes smaller.

p_MulFull: same size, save 300-550 cycles

Original

Code:

    p_MulFull:
        .db __MulFullEnd-1-$
        ld c,h
        ld a,l
        ld hl,0
        ld b,16
    __MulFullNext:
        add hl,hl
        rla
        rl c
        jr nc,__MulFullSkip
        add hl,de
        adc a,0
        jr nc,__MulFullSkip
        inc c
    __MulFullSkip:
        djnz __MulFullNext
        ret
    __MulFullEnd:

Optimized

Code:

    p_MulFull:
        .db __MulFullEnd-1-$
        xor a
        ld c,h
        ld h,a
        or l
        ld l,h
        call nz,__MulFullByte-p_MulFull-1
        ld a,c
    __MulFullByte:
        ld b,8
    __MulFullNext:
        rra
        jr nc,__MulFullSkip
        add hl,de
    __MulFullSkip:
        rr h
        rr l
        djnz __MulFullNext
        ret
    __MulFullEnd:
Note: Output changed: hl = bits 16-31 of the result, do rra after the routine returns to get a = bits 8-15 of the result.
« Last Edit: September 18, 2012, 03:10:34 am by jacobly »
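
To illustrate the note about the changed output, here is a hedged Python model of the optimized p_MulFull that tracks the carry flag explicitly. It is only a sketch of my reading of the routine, with invented helper names. Each pass shifts the multiplier byte out of a with `rra` while the low product bit that falls out of `rr l` is rotated back into a on the next iteration.

```python
def rr8(value, carry):
    """Rotate an 8-bit value right through carry; return (value, carry)."""
    return (carry << 7) | (value >> 1), value & 1


def byte_pass(a, acc, carry, de):
    """One run of the 8-iteration loop at __MulFullByte (acc models hl)."""
    for _ in range(8):
        a, carry = rr8(a, carry)              # rra: next multiplier bit -> carry
        if carry:                             # jr nc,__MulFullSkip
            acc += de                         # add hl,de
            carry = acc >> 16                 # carry out of the 16-bit add
            acc &= 0xFFFF
        out = acc & 1                         # rr h / rr l: shift carry:hl right
        acc = (carry << 15) | (acc >> 1)
        carry = out
    return a, acc, carry


def mul_full(hl, de):
    """Return (bits 16-31, bits 8-15) of the unsigned product hl * de."""
    c, a = hl >> 8, hl & 0xFF                 # ld c,h ... or l
    acc, carry = 0, 0
    if a != 0:                                # call nz,__MulFullByte
        a, acc, carry = byte_pass(a, acc, carry, de)
    a, acc, carry = byte_pass(c, acc, carry, de)   # ld a,c, then fall through
    a, carry = rr8(a, carry)                  # the caller's extra rra
    return acc, a


hi, mid = mul_full(0x1234, 0x5678)
print(hex(hi), hex(mid))                      # 0x626 0x0
```

In this model the final `rra` is what lines a up with bits 8-15; without it, a holds bits 7-14 with bit 15 still sitting in carry.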