Author Topic: Assembly Programmers - Help Axe Optimize! (Read 145164 times)

Runer112 · « **Reply #150 on:** March 12, 2011, 10:47:24 pm »

I'm back to take on a few more routines that I either couldn't follow or just decided not to try in my first mass optimization post.

p_Sqrt: 1 byte and 4 cycles saved. I still think it may be a good idea to replace this with a restoring square root algorithm, though, like the one I suggested a while ago here. Although maybe not that exact one, because I wrote that when I was still not too familiar with assembly and it may not be very optimized.

Code: (Original code: 14 bytes, n*37+36 cycles)

p_Sqrt:
 .db __SqrtEnd-1-$
 ld a,-1
 ld d,a
 ld e,a
__SqrtLoop:
 add hl,de
 inc a
 dec e
 dec de
 jr c,__SqrtLoop
 ld h,0
 ld l,a
 ret
__SqrtEnd:

Code: (Optimized code: 13 bytes, n*37+32 cycles)

p_Sqrt:
 .db __SqrtEnd-1-$
 ld de,-1&$FF
 ld b,e
 ld c,e
__SqrtLoop:
 add hl,bc
 inc e
 dec c
 dec bc
 jr c,__SqrtLoop
 ex de,hl
 ret
__SqrtEnd:
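
In case the loop looks like magic: it leans on the fact that the first n odd numbers add up to n², so subtracting 1, 3, 5, ... until HL runs out counts out the square root. Here is the optimized routine again with comments added (instructions unchanged; the comments are only a reading of the code, so treat them as a sketch):

```z80
p_Sqrt:
 .db __SqrtEnd-1-$
 ld de,-1&$FF ;D = 0, E = $FF: E is the result counter, starting at -1
 ld b,e
 ld c,e  ;BC = $FFFF = -1, the first odd number, negated
__SqrtLoop:
 add hl,bc ;HL -= next odd number; carry is set only if HL was big enough
 inc e  ;count this pass (8-bit inc leaves carry alone)
 dec c  ;C is odd here, so this never borrows from B...
 dec bc  ;...and the two decrements together step BC by -2
 jr c,__SqrtLoop ;keep going while the last subtraction did not underflow
 ex de,hl ;D is still 0, so HL = E = floor(sqrt(input))
 ret
__SqrtEnd:
```

The failing pass is counted too, which is why E starts at -1 instead of 0.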

p_Sin: 3 bytes and 8 cycles saved.

Code: (Original code: 29 bytes, too lazy to test cycles)

p_Sin:
 .db __SinEnd-1-$
 add a,a
 rr c
 ld d,a
 cpl
 ld e,a
 xor a
 ld b,8
__SinLoop:
 rrc e
 jr nc,__SinSkip
 add a,d
__SinSkip:
 rra
 djnz __SinLoop
 adc a,a
 ld l,a
 ld h,b
 rl c
 ret nc
 cpl
 inc a
 ret z
 ld l,a
 dec h
 ret
__SinEnd:

Code: (Optimized code: 26 bytes, too lazy to test, but 8 cycles fewer)

p_Sin:
 .db __SinEnd-1-$
 ld c,a
 add a,a
 ld d,a
 cpl
 ld e,a
 xor a
 ld b,8
__SinLoop:
 rra
 rrc e
 jr nc,__SinSkip
 add a,d
__SinSkip:
 djnz __SinLoop
 ld l,a
 ld h,b
 or c
 ret p
 xor a
 sub l
 ret z
 ld l,a
 dec h
 ret
__SinEnd:

p_Log: 1 byte saved.

Code: (Original code: 11 bytes, n*31+17 cycles)

p_Log:
 .db 11
 ld a,16
 scf
__LogLoop:
 adc hl,hl
 dec a
 jr nc,__LogLoop
 ld l,a
 ld h,0

Code: (Optimized code: 10 bytes, n*31+13 cycles)

p_Log:
 .db 10
 ld de,16
 scf
__LogLoop:
 adc hl,hl
 dec e
 jr nc,__LogLoop
 ex de,hl

Before we leave this routine, though, the output for hl=0 isn't really correct; it returns 255. You could change the dec e in my suggested routine to dec de to give a slightly more accurate result of -1, but again, that's not quite correct either. The real result of log(0) should be negative infinity, which would be most properly represented by -32768. For the small cost of 3 bytes, the following routine would give you this result:

Code: (Mathematically correct code: 13 bytes, only a little bit slower cycles)

p_Log:
 .db 13
 ld de,16
__LogLoop:
 add hl,hl
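 dec e ;count before testing carry (dec e leaves carry alone) so nonzero inputs give the same result as the routines above
 jr c,__LogLoopEnd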
 jr nz,__LogLoop
__LogLoopEnd:
 ex de,hl
 ccf
 rr h
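
For comparison, the dec de variant mentioned above would look like this. It is the 10-byte routine with one instruction swapped; the comments are added for illustration, and it has not been timed, although dec de is 2 cycles slower than dec e on each pass:

```z80
p_Log:
 .db 10
 ld de,16
 scf  ;the set carry becomes a marker bit shifted in behind the input
__LogLoop:
 adc hl,hl ;shift left; carry out is the next bit of the input, top bit first
 dec de  ;16-bit decrement, affects no flags
 jr nc,__LogLoop ;stop at the first 1 bit that falls out
 ex de,hl ;HL = position of the highest set bit, or $FFFF = -1 for an input of 0
```

With an input of 0 the only 1 bit to fall out is the marker itself, on the 17th pass, which is where the 16-17 = -1 comes from.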

p_Exp: As with the suggestion above, this isn't an optimization but a suggested improvement. With the current routine, I see two issues. Firstly, it returns 2^(input mod 256) instead of 2^(input), the latter of which is probably what you would expect of a 16-bit math function. Secondly, this routine does not do anything special for inputs with high values mod 256, which could result in it taking up to 7195 cycles. The following routine would correct both of these behaviors. Also note that it is a subroutine instead of inline code (more on turning inline code into subroutines later).

Code: (Mathematically correct code: 16 bytes, only a little bit slower cycles)

p_Exp:
 .db __ExpEnd-p_Exp-1
 ld b,l
 ld a,l
 and %11110000
 or h
 ld hl,0
 ret nz
 inc b
 scf
__ExpLoop:
 adc hl,hl
 djnz __ExpLoop
 ret
__ExpEnd:
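
The range check at the top is what deals with both issues at once. Pulled out on its own with comments (same instructions as in the routine above; the comments are an added reading of it):

```z80
 ld a,l
 and %11110000 ;A = bits 4-7 of L
 or h  ;A is now nonzero if any of bits 4-15 of HL are set
 ld hl,0 ;16-bit load, changes no flags, so Z still reflects the OR
 ret nz  ;HL >= 16: 2^HL doesn't fit in 16 bits, return 0 right away
```

Anything that gets past the ret nz is in the range 0-15, so the shift loop never runs more than 16 times.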

__DrawMskAligned: 2 bytes, 72 cycles saved. I only tackled the aligned part of the masked sprite routine because the rest is scary.

Code: (Original code: 33 bytes, 1481 cycles)

__DrawMskAligned:
 dec hl
__DrawMskAlignedLoop:
 ld a,(ix+0)
 xor (ix+8)
 cpl

 ld c,a
 ld a,(hl)
 or (ix+0)
 and c
 ld (hl),a

 ld de,appBackUpScreen-plotSScreen
 add hl,de

 ld a,c
 and (hl)
 or (ix+0)
 ld (hl),a

 inc ix
 ld de,plotSScreen-appBackUpScreen+12
 add hl,de

 djnz __DrawMskAlignedLoop

Code: (Optimized code: 31 bytes, 1409 cycles)

__DrawMskAligned:
 dec hl
__DrawMskAlignedLoop:
 push hl
 ld de,appBackUpScreen-plotSScreen
 add hl,de

 ld a,(ix+0)
 ld d,a
 xor (ix+8)
 cpl
 ld e,a

 and (hl)
 or d
 ld (hl),a

 pop hl

 ld a,(hl)
 or d
 and e
 ld (hl),a

 inc ix
 ld de,12
 add hl,de

 djnz __DrawMskAlignedLoop

Finally, here are some routines that I feel would be better suited to be subroutines instead of inline code. Feel free to disagree with me on any or all of these.

p_DispStrApp and p_TextStrApp: Any application that displays text will most likely be doing so far more than once, making the size savings from turning these into subroutines probably quite large. Also, the OS routines used to display text are slow enough that you wouldn't notice any speed difference from the overhead of a subroutine.
p_Length: Not really necessary to turn into a subroutine, it just seems like the kind of function that should be.
p_Log and p_Exp: Like p_Mul and p_Sqrt, I think that math routines should probably be subroutines. The current routines are only 11 and 10 bytes respectively, but if you use the routines I suggested above, they would then be 13 and 16 bytes respectively, making them much more worthy of being a subroutine. The p_Exp routine I suggested actually relies on being a subroutine.
p_GetBit and p_GetBit16: Although listed as only being 12 bytes in Commands.inc, they have an extra 2 bytes of overhead register shifting not shown in the routines. Since these are both 14-byte math functions that could definitely be called on more than once in programs that use them, I feel that they should probably be subroutines.

And for one going in the other direction, perhaps p_OnKey doesn't need to be a subroutine? Very few programs use the on key, and I'm guessing that any program that does isn't likely to use it more than once. Or at least if you don't want to change it, can you change the .db 10 into .db __OnKeyEnd-p_OnKey-1 or .db __OnKeyEnd-1-$? I was always confused when I saw this routine about whether or not it was inserted as inline code or a subroutine.
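
To spell out the size-byte suggestion: assuming $ evaluates to the address of the .db line itself (which the other routines in this thread rely on), End-1-$ comes out as the number of bytes between the size byte and the end label, and it stays correct when the routine is edited. A made-up pair of routines showing the two forms side by side (p_Example, p_Example2 and their bodies are hypothetical, not from Commands.inc):

```z80
p_Example:
 .db 4   ;hard-coded size, has to be recounted by hand after every edit
 ld h,0  ;2 bytes
 ld l,a  ;1 byte
 ret   ;1 byte

p_Example2:
 .db __Example2End-1-$ ;the assembler works out the same 4 on its own
 ld h,0
 ld l,a
 ret
__Example2End:
```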

Quigibo · « **Reply #151 on:** March 27, 2011, 04:18:51 pm »

Found a very small optimization that I'm surprised you didn't find.

Code: (Original)

p_Min:
 .db 8
 pop de
 or a
 sbc hl,de
 add hl,de
 jr c,$+3
 ex de,hl

Code: (Optimized)

p_Min:
 .db 8
 pop de
 or a
 sbc hl,de
 ex de,hl  
 jr nc,$+3
 add hl,de

Code: (Original)

p_Max:
 .db 8
 pop de
 or a
 sbc hl,de
 add hl,de
 jr nc,$+3
 ex de,hl

Code: (Optimized)

p_Max:
 .db 8
 pop de
 or a
 sbc hl,de
 ex de,hl
 jr c,$+3
 add hl,de

It saves 7 clock cycles in the best case yet the worst case remains the same. So that's a 3.5 clock cycle speed up in the average uniform case.
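
Counting it out for p_Min, looking only at the three instructions that differ, with standard Z80 timings (add hl,de = 11, ex de,hl = 4, jr = 12 taken / 7 not taken). This is a hand tally, so double-check it; HL and DE in the comments mean the values going into the sbc:

```z80
;Original tail
 add hl,de ;11
 jr c,$+3 ;12 if HL < DE (done, 23 total), else 7
 ex de,hl ;4  (HL >= DE path, 22 total)

;Optimized tail
 ex de,hl ;4
 jr nc,$+3 ;12 if HL >= DE (done, 16 total), else 7
 add hl,de ;11 (HL < DE path, 22 total)
```

By this count the HL >= DE path drops from 22 to 16 and the HL < DE path from 23 to 22, so per input it is 6 and 1 cycles saved, which still averages out to 3.5. p_Max is the same with the two paths swapped.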

Runer112 · « **Reply #152 on:** March 27, 2011, 06:01:33 pm »

Quigibo, I noticed that you implemented a bunch of my optimizations/bug fixes, which is awesome. Having left a few out is fine, but I'm stuck wondering why you left out a specific few of my suggestions in particular that I thought would be useful. Did you just miss some of these, or did you purposely leave them out?

An optimized Exch() routine
Turning p_DispStrApp and p_TextStrApp into subroutines
Adjusted GetCalc() routines; I see you included one of the three adjusted routines (p_GetArc) probably because it's smaller than the original routine, but this routine includes the adjustment made in the other two routines I suggested and it seems incongruous for only the archived GetCalc() routine to have this fix implemented

And on an unrelated note, could you extend the DispGraph change to DispGraphClrDraw as well? And because I know a bunch of people may not like DispGraph no longer working in 15MHz mode, could you perhaps include a routine like DispGraphNormal that saves the CPU speed setting, puts the CPU in 6MHz mode, calls the normal DispGraph routine, and then restores the CPU speed? Something like this:

Code:

p_FastCopy6MHz:
 .db __FastCopy6MHzEnd-1-$
 in a,($02)
 rla
 in a,($20)
 push af
 xor a
 out ($20),a
 call sub_FastCpy
 pop af
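 ret nc ;carry = port 2 bit 7, which is clear on the 6MHz-only TI-83+, so only that model skips the restore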
 out ($20),a
 ret
__FastCopy6MHzEnd:

Quigibo · « **Reply #153 on:** March 27, 2011, 06:33:30 pm »

Oh yeah, now I remember why I didn't want to change DispGraph before.

The overhead is small relative to the size of the routine, though. I think I'll just make regular DispGraph account for the speed settings, and it would be roughly the same size as the original.

Runer112 · « **Reply #154 on:** March 27, 2011, 06:38:24 pm »

Sorry if I'm pestering you about this, but could you please respond to the first part of my last post? Even if you tell me you purposely left them out for whatever reasons that's completely fine, I just wouldn't want you to have missed suggestions that you might have actually wanted to implement.

Quigibo · « **Reply #155 on:** March 28, 2011, 04:30:42 am »

Oh yeah, sorry. I was trying to release this quickly with my limited time, so I didn't have time to make changes that would require me to rewrite core routines. The Exch() optimization requires this: even though it works for Axioms, it's not the same for regular routines yet, but it will be eventually. I haven't converted the subroutines yet because I kind of overlooked them, but I'll get to that next version. As for GetCalc(), I'm not sure if I'm ready to change it yet because it will create an incompatibility with current programs that still use Ptr-2 to reference OS vars. I will probably put this on the poll.

Runer112 · « **Reply #156 on:** March 28, 2011, 11:41:47 am »

Alright, that makes sense. Thanks.

calc84maniac · « **Reply #157 on:** April 06, 2011, 01:19:54 am »

Code: (Original)

p_IntGt:
 .db 7
 ex de,hl
 xor a
 sbc hl,de
 ld h,a
 rla
 ld l,a

Code: (Optimized)

p_IntGt:
 .db 6
 scf
 sbc hl,de
 sbc hl,hl
 inc hl

Also, I think optimized comparisons for comparing to immediate values would be a good idea. For example, <3 becomes ld de,-3 \ add hl,de \ sbc hl,hl \ inc hl
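
Written out as a block, that <3 example would be the following (same four instructions, with comments added for illustration):

```z80
 ld de,-3
 add hl,de ;carry set if HL >= 3 (unsigned)
 sbc hl,hl ;HL = -1 if carry was set, else 0
 inc hl  ;HL = 0 if the input was >= 3, 1 if it was < 3
```

That comes to 7 bytes, and it is the same sbc hl,hl / inc hl ending as the optimized p_IntGt above.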

Runer112 · « **Reply #158 on:** April 06, 2011, 11:50:59 am »

For that matter, a lot of things could still use optimized immediate value versions. Pretty much every comparison could, as well as a few other operators. For instance, p_BoolOrImm, p_BoolAndImm, and p_BoolXorImm have been sitting in Commands.inc since Axe 0.4.7 and I would love to see them implemented. Constant bit setting and resetting would also be awesome. Unfortunately none of these things can be achieved with an Axiom, or I would be all over this.

calc84maniac · « **Reply #159 on:** April 19, 2011, 11:00:30 pm »

I made a "real" square root algorithm, meaning one that doesn't use the repeated subtraction method. Certainly not a size optimization, but it certainly makes the average execution time a lot lower.

Code:

p_Sqrt:
 .db __SqrtEnd-1-$
 ex de,hl
 ld bc,$8000
 ld h,c
 ld l,c
__SqrtLoop:
 srl b
 rr c
 add hl,bc
 ex de,hl
 sbc hl,de
 jr nc,__SqrtNoOverflow
 add hl,de
 ex de,hl
 or a
 sbc hl,bc
 ;jr __SqrtOverflow ;Commented out in favor of super optimization
 .db $DA ;JP C opcode to skip next 2 bytes since carry is reset here.
__SqrtNoOverflow:
 ex de,hl
 add hl,bc
__SqrtOverflow:
 srl h
 rr l
 srl b
 rr c
 jr nc,__SqrtLoop
 ret
__SqrtEnd:

Runer112 · « **Reply #160 on:** April 19, 2011, 11:07:34 pm »

How does it compare to this one? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it.

calc84maniac · « **Reply #161 on:** April 19, 2011, 11:11:39 pm »

Quote from: Runer112 on April 19, 2011, 11:07:34 pm

How does it compare to this one? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it.

Er... that looks impressive

I'll have to take a closer look at how it works sometime

Quigibo · « **Reply #162 on:** April 20, 2011, 06:08:32 pm »

I'll think about it. I'm not sure how often square roots are actually needed and how applicable they are to speed constraints.

Another thing, I'm adding new auto-opts for constants in the comparisons (less than, less than or equal to, greater than, and greater than or equal to). I have found nice optimizations for powers of 2 and some low numbers, but if anyone wants to try writing some, I could definitely use them. I probably missed some or may have sub-optimal solutions.
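
As a guess at the kind of thing a power-of-2 case allows, here is a sketch for <256. It is only an illustration, not one of the optimizations referred to above, and it assumes an unsigned comparison with the value in HL and a 0/1 result returned in HL. For this constant only H needs to be looked at:

```z80
 ld a,h
 add a,$FF ;carry set if H >= 1, i.e. HL >= 256
 sbc hl,hl ;HL = -1 if carry was set, else 0
 inc hl  ;HL = 0 if the input was >= 256, 1 if it was < 256
```

That is 6 bytes against 7 for the general ld de,-256 \ add hl,de \ sbc hl,hl \ inc hl form, and 10 cycles faster by standard Z80 timings.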

Runer112 · « **Reply #163 on:** April 20, 2011, 06:10:56 pm »

If you post what you have so far, I can look at them and see if I can find any optimizations for them. I also might see if there are any other optimized comparisons you don't already have that I could contribute.

Builderboy · « **Reply #164 on:** April 20, 2011, 06:38:42 pm »

What someone needs to do is take a program that can emulate the z80's basic mathematical and jumping functions in a program setting and brute force out *the* most optimized program that performs a small simple task like multiplying two numbers.
