Omnimaga

Topic started by: Quigibo on February 26, 2010, 03:52:51 pm

Title: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on February 26, 2010, 03:52:51 pm
I'm going to post most of the assembly routines I use in Axe Parser here to see if any of you asm programmers can help me with their optimizations.

The most important thing right now is the clipped sprite routine since it's really big.  Here's what I've got so far:

p_DrawOr8x8:
   push   hl
   pop   ix      ;Input hl = Sprite
   ld   b,7      ;Input c = Sprite X Position
   ld   d,0      ;Input e = Sprite Y Position
   ld   h,d
   ld   a,c
   add   a,b
   jr   c,__ClipLeft
   sub   96+7
   ret   nc
   cpl
   cp   b
   jr   nc,__NoClipH
__ClipRight:
   inc   d
   jr   __ClipHDone
__ClipLeft:
   add   a,89
   ld   c,a
__ClipHDone:
   inc   d      ;d,c,e are updated
__NoClipH:
   ld   a,e
   add   a,b
   jr   c,__ClipTop
   sub   64+7
   ret   nc
   cpl
   cp   b
   jr   nc,__NoClipV
   jr   __ClipBottom
__ClipTop:
   inc   ix
   inc   e
   jr   nz,__ClipTop
__ClipBottom:
   ld   b,a
__NoClipV:         ;b,ix,e are updated.
   dec   d
   jr   z,__NoFix
   inc   e
__NoFix:
   push   de
   ld   l,e
   ld   d,h
   add   hl,hl
   add   hl,de
   add   hl,hl
   add   hl,hl
   ld   e,c
   ld   a,e
   srl   e
   srl   e
   srl   e
   add   hl,de
   ld   de,plotSScreen-11
   add   hl,de
   pop   de
   inc   b
   and   %00000111
   jr   z,__DrawOr8x8Aligned
   ld   c,a
__DrawOr8x8Loop:
   push   bc
   ld   b,c
   ld   c,(ix+0)
   xor   a
__DrawOr8x8Shift:
   srl   c
   rra
   djnz   __DrawOr8x8Shift
   dec   d
   jr   z,__SkipRight
   or   (hl)
   ld   (hl),a
__SkipRight:
   dec   hl
   inc   d
   jr   z,__SkipLeft
   ld   a,c
   or   (hl)
   ld   (hl),a
__SkipLeft:
   ld   c,13
   add   hl,bc
   inc   ix
   pop   bc
   djnz   __DrawOr8x8Loop
   ret
__DrawOr8x8Aligned:
   dec   hl
   ld   de,12
__DrawOr8x8AlignedLoop:
   ld   a,(ix+0)
   or   (hl)
   ld   (hl),a
   inc   ix
   add   hl,de
   djnz   __DrawOr8x8AlignedLoop
   ret
__DrawOr8x8End:


If you spot anything that can be optimized, bold it so I can see what you changed, thanks!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on February 26, 2010, 04:57:59 pm
I would prefer no extra RAM usage at all. Otherwise, games may not run on any TI-84+ manufactured after April 2007 and will not be compatible with the regular 83+, meaning a considerable drop in the author's audience.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Galandros on February 26, 2010, 05:09:49 pm
I would prefer no extra RAM usage at all. Otherwise, games may not run on any TI-84+ manufactured after April 2007 and will not be compatible with the regular 83+, meaning a considerable drop in the author's audience.
Generally, RAM usage just means the routine needs a few temporary bytes to store data (bytes in the program itself, in RAM that TI-OS makes available, or in free RAM areas of TI-OS).  Only when you need a good amount of memory do you use the extra RAM pages.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Eeems on February 26, 2010, 06:14:01 pm
Also, if I'm not wrong, there is a little extra RAM in the newer calcs, but not that much.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on February 26, 2010, 06:21:02 pm
Yeah. I'm not sure if it's accessed the same way, though.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Galandros on February 26, 2010, 07:11:58 pm
In the new calcs, only page 53h is still there. It is the same port as before. Dunno what 3rd-party software uses it...
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on February 26, 2010, 08:43:34 pm
In the new calcs, only page 53h is still there. It is the same port as before. Dunno what 3rd-party software uses it...
Actually, we don't know which one is still there. All we know is that pages $82-$87 appear to be the same 16K of physical memory.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ztrumpet on February 27, 2010, 10:28:07 am
In the new calcs, only page 53h is still there. It is the same port as before. Dunno what 3rd-party software uses it...
If I'm not mistaken, I think Omnicalc's Quick Apps uses it and it works fine on a new 84+se.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on February 27, 2010, 09:23:58 pm
Safe Copy requires the undocumented instruction:
  in f,(c)

Will that be compatible with the Nspire?  If anyone has one, could you please try adding this to any Axe Parser code and see if it crashes:

Asm(0E10ED70)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on February 27, 2010, 10:21:03 pm
Safe Copy requires the undocumented instruction:
  in f,(c)

Will that be compatible with the Nspire?  If anyone has one, could you please try adding this to any Axe Parser code and see if it crashes:

Asm(0E10ED70)
It won't be compatible. But it doesn't require that instruction anyway. I always do this:
Code:
in a,($10)
rla
jr c,$-3 ;or was it "nc"?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on February 27, 2010, 11:58:30 pm
Hopefully, though, maybe someone will write a 84+ emu for the Nspire to replace the current one, now that Ndless is out :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 14, 2010, 03:23:09 am
Does anyone know an efficient way to do signed multiplication for a 2's complement system?  I can't find any tutorials on the internet.  My naive method is to remove and keep track of the sign bit for each term, multiply the positive versions together, and then add the new sign bit.  Is there a better method?

This is what I'm using to get the sign bit out of de and keep track of it with the 'b' register.  b starts at zero, so I can use bit 0 of b as the new sign bit if I repeat this for hl.

Code:
bit 7,d
jr z,__MulSNotNeg
inc b
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
__MulSNotNeg:

I would have to do the same thing for hl and then again at the end when I need to put the bit back so it seems like a lot of extra code...
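
For anyone who wants to check this bookkeeping off-calc, here is a rough Python model of the approach described above (not Z80 code). It mirrors the byte-wise negation from the snippet and uses a sign toggle in place of bit 0 of b. The function names are made up for illustration, and the multiply itself is just Python's `*` truncated to 16 bits.

```python
def negate16_bytewise(de):
    """Mirror of: xor a / sub e / ld e,a / sbc a,a / sub d / ld d,a."""
    d, e = de >> 8, de & 0xFF
    a = (0 - e) & 0xFF             # xor a / sub e
    carry = 1 if e != 0 else 0     # the sub borrows whenever e is non-zero
    e = a                          # ld e,a
    a = (0 - carry) & 0xFF         # sbc a,a leaves $00 or $FF
    a = (a - d) & 0xFF             # sub d
    d = a                          # ld d,a
    return (d << 8) | e


def signed_mul_via_magnitudes(hl, de):
    """Strip the signs, multiply the magnitudes, then put the sign back."""
    sign = 0                       # plays the role of bit 0 of b
    if hl & 0x8000:
        hl = negate16_bytewise(hl)
        sign ^= 1
    if de & 0x8000:
        de = negate16_bytewise(de)
        sign ^= 1
    product = (hl * de) & 0xFFFF   # stand-in for an unsigned 16-bit multiply
    return negate16_bytewise(product) if sign else product
```

So the sign handling costs three negations in the worst case, which matches the "a lot of extra code" worry.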
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 14, 2010, 09:38:51 am
No need at all. 16-bit * 16-bit -> 16-bit will give the same result for unsigned and signed arithmetic. Problem solved! ;)

Seriously, try multiplying some signed values in your parser and you will get the right results.

Edit:
Scratch that, I just tried it myself. What is your normal multiplication routine?

Edit2:
I just disassembled it, and I think you are only doing an 8-bit * 16-bit multiplication. That could explain the bad outputs.
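
The claim at the top of this post can be checked with a few lines of Python (a model of the arithmetic only, not of any routine in Axe). It compares the low 16 bits of the product when the inputs are read as unsigned and as signed. The helper names are invented for the example.

```python
import random


def to_signed(x):
    """Read a 16-bit pattern as a two's complement value."""
    return x - 0x10000 if x & 0x8000 else x


def mul16_unsigned(a, b):
    return (a * b) & 0xFFFF


def mul16_signed(a, b):
    return (to_signed(a) * to_signed(b)) & 0xFFFF


def agree_on_samples(count=10000, seed=1):
    """True if both interpretations give the same low word on random inputs."""
    rng = random.Random(seed)
    pairs = ((rng.randrange(0x10000), rng.randrange(0x10000)) for _ in range(count))
    return all(mul16_unsigned(a, b) == mul16_signed(a, b) for a, b in pairs)
```

The two only agree when the routine really uses all 16 bits of both inputs, which is why an 8-bit * 16-bit routine breaks on negative multipliers.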
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Galandros on March 14, 2010, 10:17:13 am
I have some ideas, but I don't know how well they can be implemented:
There are some shift instructions that preserve the sign bit, for example sra. And sla is another arithmetic shift... Probably this isn't useful or as fast as other methods.
When the multiplication is finished you can set the correct sign based on the inputs.
You can probably use the sign flag to optimize.

neg is equivalent to cpl / inc a. Dunno how this goes into 16-bit pair registers.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 14, 2010, 12:55:18 pm
Here is your original multiplication routine:
Code:
xor a
 or d
 jr nz,$+3
 ex de,hl
 ld a,l
 ld hl,0
_multloop:
 rra
 jr nc,$+3
 add hl,de
 sla e
 rl d
 or a
 jr nz,_multloop
 ret

Here is my modified signed version, only 8 extra bytes (all in the overhead). It multiplies two signed values, one of which is between -256 and 255.
Code:
ld a,d
 rrca
 cp d
 jr nz,$+3
 ex de,hl
 xor a
 inc h
 jr nz,$+3
 sub e
 ld h,a
 ld a,l
 ld l,0
 or a ;Returns if multiplying by 0 or -256, also resets carry flag
 ret z
_multloop:
 rra
 jr nc,$+3
 add hl,de
 sla e
 rl d
 or a
 jr nz,_multloop
 ret
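
To see why the original routine gives odd answers when both inputs are negative, here is a Python model of its loop as listed above (a sketch for checking the arithmetic, with a made-up function name). The multiplier is only the low byte of one operand, so anything in its high byte is dropped.

```python
def mult_original(hl, de):
    """Model of the original loop: the multiplier is a single byte held in A."""
    if de >> 8 == 0:               # xor a / or d / jr nz,$+3 / ex de,hl
        hl, de = de, hl
    a = hl & 0xFF                  # ld a,l (the high byte of HL is ignored)
    hl = 0                         # ld hl,0
    while True:
        carry = a & 1              # rra: bit 0 falls into carry, 0 comes in on top
        a >>= 1
        if carry:                  # jr nc,$+3 / add hl,de
            hl = (hl + de) & 0xFFFF
        de = (de << 1) & 0xFFFF    # sla e / rl d
        if a == 0:                 # or a / jr nz,_multloop
            return hl
```

In this model -1 * 1 comes out as $FFFF, but -1 * -1 comes out as $FF01 because only $FF of the second -1 is used.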
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 14, 2010, 05:11:40 pm
Thanks!  I'll try that :)

What about division?  Clearly I'll need more than just overhead since there's no overflow to loop around.  Is there a more efficient way to do it than what I was originally planning with multiplication?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on March 14, 2010, 08:11:17 pm
btw are you planning to provide the source code to some of the hardcore asm coders on this board eventually, in case some people might find some optimizations to make the compiled code smaller? That could maybe help too.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 15, 2010, 01:44:10 pm
Kind of.  I'm going to release all of my templated assembly code that the executable programs use, but I don't think I will release the source of the parser itself.  Right now, I'm not too worried about the optimizations.  It's the actual code of the Parser I am trying to finish first so I can release a beta, but I keep getting distracted by wanting to add more commands since they're relatively easier and more fun  :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Iambian on March 15, 2010, 02:02:31 pm
I've got no clue whether or not you've seen this, but I think it might be of some help. http://map.grauw.nl/sources/external/z80bits.html

Will post later if something else catches my eye.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on March 15, 2010, 05:46:22 pm
Kind of.  I'm going to release all of my templated assembly code that the executable programs use, but I don't think I will release the source of the parser itself.  Right now, I'm not too worried about the optimizations.  It's the actual code of the Parser I am trying to finish first so I can release a beta, but I keep getting distracted by wanting to add more commands since they're relatively easier and more fun  :P
Never release the source to the public before releasing the software, btw. If, for example, you decide to post Axe 0.2 on both Omnimaga and Ticalc at the same time, wait until it makes it to the ticalc archives before releasing the source. Better protection against code thieves.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 15, 2010, 08:57:53 pm
I've got no clue whether or not you've seen this, but I think it might be of some help. http://map.grauw.nl/sources/external/z80bits.html
Yeah, that's what I've been using, but it doesn't have many signed routines, just unsigned.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on March 15, 2010, 10:09:55 pm
Hmmm it would seem so.  Is it really so much of a hassle to flip the negative bit, multiply/divide, then flip it again?  Seems trivial compared to any other modification, although I'm not an asm guy :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 15, 2010, 11:08:03 pm
Hmmm it would seem so.  Is it really so much of a hassle to flip the negative bit, multiply/divide, then flip it again?  Seems trivial compared to any other modification, although I'm not an asm guy :P
You can't just flip a bit. You have to subtract from zero.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 15, 2010, 11:15:32 pm
Builderboy, you're thinking of a sign-magnitude system, where the top bit is just a sign bit.  2's complement is a little bit different.  The advantage of a 2's complement system is that it is arithmetically (and apparently also multiplicatively) compatible with unsigned numbers.

By the way calcmaniac, your code didn't work.  I tried -1 times 1 and it returned a weird number.  But I was able to create my own routine after looking at some other code.  It's a little slower, but it's roughly the same size as the original 8-bit routine.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 15, 2010, 11:41:33 pm
By the way calcmaniac, your code didn't work.  I tried -1 times 1 and it returned a weird number.  But I was able to create my own routine after looking at some other code.  It's a little slower, but it's roughly the same size as the original 8-bit routine.
I just typed it in on my calculator and tried $ffff*$0001 and $0001*$ffff and both gave $ffff. Did you make a typo somewhere or something?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 15, 2010, 11:48:02 pm
I meant -1 times -1 sorry.  I just copied and pasted it btw.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 15, 2010, 11:59:11 pm
I tried -1*-1 and I ended up with 1. I dunno, maybe there's something going on with the underscore in the loop label? I've never really used that.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 16, 2010, 09:21:55 pm
I just made optimized min/max routines that you can use :D
Code:
Min_HLDE:
 xor a
 sbc hl,de
 jr c,$+4
 ld h,a
 ld l,a
 add hl,de

Code:
Max_HLDE:
 xor a
 sbc hl,de
 jr nc,$+4
 ld h,a
 ld l,a
 add hl,de
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 16, 2010, 09:58:29 pm
That's a cleaver trick!  But wouldn't something like this be simpler?

or a
sbc hl,de
add hl,de
jr nc,$+3
ex de,hl


But I'm trying to convert all of my math commands to signed operations anyway, so I would need to tweak it a bit.

I'm going to try your multiplication routine again and see if it's smaller.  I think I forgot the ret at the end which might have been what screwed me up.
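
For comparing the two approaches off-calc, here is a rough Python model of Min_HLDE and of the shorter swap version (unsigned only, function names invented for the example). Read literally, the swap version with `jr nc` leaves the larger value in HL, so it lines up with Max_HLDE. Using `jr c` instead would give the smaller one.

```python
def min_hlde(hl, de):
    """Model of: xor a / sbc hl,de / jr c,$+4 / ld h,a / ld l,a / add hl,de."""
    diff = hl - de                 # sbc hl,de with carry cleared by xor a
    borrow = diff < 0
    hl = diff & 0xFFFF
    if not borrow:                 # jr c skips the two loads when HL < DE
        hl = 0                     # ld h,a / ld l,a
    return (hl + de) & 0xFFFF      # add hl,de


def swap_variant(hl, de):
    """Model of: or a / sbc hl,de / add hl,de / jr nc,$+3 / ex de,hl."""
    carry = hl < de                # add hl,de carries exactly when the sbc borrowed
    if carry:                      # jr nc skips the swap
        hl, de = de, hl
    return hl
```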

Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: AaroneusTheGreat on March 16, 2010, 10:13:06 pm
Quote
That's a cleaver trick!

Actually a cleaver trick would be more like juggling butcher's knives...  :P

Sorry I saw your typo and thought it was pretty funny, I make mistakes like that all the time and it makes me giggle that someone else does too from time to time.  ;D

</end off topic comment>

BTW I think you're doing a smart thing by asking for help optimizing your code, there are some excellent programmers here who would more than likely love to get their hands on your code to help make this as good a program as it could possibly be.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on March 16, 2010, 10:13:57 pm
That's a cleaver trick!  But wouldn't something like this be simpler?

or a
sbc hl,de
add hl,de
jr nc,$+3
ex de,hl


But I'm trying to convert all of my math commands to signed operations anyway, so I would need to tweak it a bit.
Ah, I didn't think of that. :P Here's a good way to do signed comparison by the way:
or a
sbc hl,de
ld a,h
rla
jp po,$+4
ccf

It should give the same flag outputs you would expect from an unsigned compare (note that rla does not modify the Z or P/V flags)

Edit:
Now I came up with one that can restore the original value of HL without destroying the C flag in the process:
or a
sbc hl,de
ld a,h
jp po,$+4
cpl
add hl,de
rla
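
A quick way to convince yourself the first snippet works is to model the flags in Python (a sketch only, with hypothetical helper names). After the subtraction, the carry ends up as sign XOR overflow, which is the usual signed less-than test.

```python
def to_signed(x):
    """Read a 16-bit pattern as a two's complement value."""
    return x - 0x10000 if x & 0x8000 else x


def signed_less_than(hl, de):
    """Model of: or a / sbc hl,de / ld a,h / rla / jp po,$+4 / ccf."""
    result = (hl - de) & 0xFFFF
    sign = result >> 15                             # rla moves bit 15 into carry
    overflow = ((hl ^ de) & (hl ^ result)) >> 15    # P/V after sbc hl,de
    return bool(sign ^ overflow)                    # ccf only runs on overflow
```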
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ztrumpet on March 17, 2010, 04:00:24 pm
That's really neat!  Great job calc84! ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 20, 2010, 10:00:23 pm
Does anyone know any good sin/cos routines that are under 128 bytes?  The entire circle should be 256 brads (binary radians) so each quadrant is 64.  It doesn't need to be 100% accurate, but it should be pretty close.  It doesn't need to be that fast either, but I would prefer using a method that doesn't need multiplication such as a look up table or CORDIC.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on March 21, 2010, 11:40:46 am
Um well first question, (even though I'm not going to end up writing this routine :P) what should the output be in since we don't have floating point?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 21, 2010, 04:42:47 pm
It should be 128*sin() so that the number fits in a single signed byte.
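
To make the format concrete, here is a small Python sketch of the lookup-table idea (not a Z80 routine, and not what Axe ended up using). It assumes a 65-entry quarter-wave table and clamps the peak to 127, since 128 does not fit in a signed byte. Both choices, and the names, are assumptions for the example.

```python
import math

# Quarter wave for angles 0..64 brads, scaled by 128 and clamped to 127.
QUARTER = [min(127, round(128 * math.sin(2 * math.pi * i / 256))) for i in range(65)]


def sin8(angle):
    """Sine of an angle in brads (256 per circle), as a signed byte value."""
    a = angle & 0xFF
    idx = a & 0x3F                 # position inside the quadrant
    if a & 0x40:                   # quadrants 1 and 3 run backwards through the table
        idx = 64 - idx
    value = QUARTER[idx]
    return -value if a & 0x80 else value


def cos8(angle):
    """Cosine is sine shifted by a quarter circle (64 brads)."""
    return sin8(angle + 64)
```

The table is 65 bytes, so it leaves some room under the 128-byte budget for the lookup code itself.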
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Galandros on March 23, 2010, 07:04:20 pm
Does anyone know any good sin/cos routines that are under 128 bytes?  The entire circle should be 256 brads (binary radians) so each quadrant is 64.  It doesn't need to be 100% accurate, but it should be pretty close.  It doesn't need to be that fast either, but I would prefer using a method that doesn't need multiplication such as a look up table or CORDIC.
I think so. It was made by Will West, but I couldn't find the original post on Revsoft, so I've added it as an attachment.

see: "Sin_A   parabolic approximation of sin(a) a is in units of pi/256"
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 23, 2010, 09:53:02 pm
Hey thanks!  That will definitely come in handy!  It does use multiplication in that routine though, but oh well.  It just makes it difficult on the parsing side to have to call other subroutines that may or may not already have been added, but I guess I'll figure out a way to template it better to make this easier in the future anyway.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 02, 2010, 10:30:41 am
Your 4-level grayscale routine is pretty unoptimized, and it doesn't even use the right dither pattern (1/3 and 2/3). I figured I could help out here. This is about as fast as it gets (with your double-buffer layout). The small size cost is worth it in this case, I think.
Code:
DispGraphRR:
di
ld a,$80
out ($10),a
ld (save_sp),sp
ld l,plotSScreen&$ff - 1
ld de,appbackupscreen - plotSScreen
ld sp,plotSScreen - appbackupscreen + 12
ld c,$1f
dec (iy+asmflags2)
jr nz,gray4skip
ld (iy+asmflags2),3
jr gray4entry3
gray4skip:
ld a,(flags+asmflags2)
dec a
jr z,gray4entry2

gray4entry1:
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
ld a,c
cp $2c
jr z,gray4end
out ($10),a

gray4loop1:
ld a,(hl)
add hl,de
xor (hl)
and %11011011
xor (hl)
add hl,sp
out ($11),a
djnz gray4loop2

gray4entry2:
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
ld a,c
cp $2c
jr z,gray4end
out ($10),a

gray4loop2:
ld a,(hl)
add hl,de
xor (hl)
and %01101101
xor (hl)
add hl,sp
out ($11),a
djnz gray4loop3

gray4entry3:
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
ld a,c
cp $2c
jr z,gray4end
out ($10),a

gray4loop3:
ld a,(hl)
add hl,de
xor (hl)
and %10110110
xor (hl)
add hl,sp
out ($11),a
djnz gray4loop1
jr gray4entry1

gray4end:
ld sp,(save_sp)
ei
ret

Edit: Some misnamed/missing labels
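
The inner loops above lean on one bit trick, which is easier to see in a few lines of Python (a model of the arithmetic only). The xor / and / xor sequence takes the first buffer's bit wherever the mask bit is 1 and the second buffer's bit wherever it is 0. The names `front` and `back` are just labels for the example.

```python
MASKS = (0b11011011, 0b01101101, 0b10110110)


def mix(front, back, mask):
    """Model of: ld a,(hl) / add hl,de / xor (hl) / and mask / xor (hl)."""
    return ((front ^ back) & mask) ^ back


def frames_showing_front(bit):
    """How many of the three masks select the first buffer at this bit position."""
    return sum((m >> bit) & 1 for m in MASKS)
```

Every bit position is set in exactly two of the three masks, which is where the 2/3 and 1/3 dither comes from.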
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 02, 2010, 12:57:12 pm
Will this one work in 15 MHz mode too, or just 6 MHz like his own routine?

(btw by 15 MHz, I really mean on real hardware, not just emulator)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 02, 2010, 05:58:35 pm
Thanks!  Although that looks much larger than my current routine, I'll have to see if the improvement in speed/quality is significant enough to justify the size increase.  I'll do some more testing this week.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 02, 2010, 08:48:00 pm
DJ_Omni, it probably won't work in 15MHz mode. On the other hand, Quigibo's routine almost could run fine in 15MHz, which is a bad thing (means it would be pretty slow in 6MHz)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 02, 2010, 09:55:09 pm
Yeah, I think it will be worth the size increase.  It's a 46 byte increase, but it has the added bonus that it doesn't need to be initialized with a separate command (after I modify this routine).  Also, I think there are a few places I can optimize to save on memory, but still not hinder the speed.

EDIT: Actually that code is pretty rock solid optimized, there were only a couple places I made improvements.

In the entry points, it's better if the instructions are in this order:
Code:
ld a,c
cp $2c
jr z,__Disp4LvlDone
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
out ($10),a

Because this way, it jumps out of the loop sooner when it gets to the end of the routine.  Not that big of a deal, but it saves several clock cycles each render.

Also, the jump table I changed to this:
Code:
inc (iy+asm_flag2)
jr z,__Disp4Lvlentry3
ld a,(flags+asm_flag2)
inc a
jr z,__Disp4Lvlentry2
ld (iy+asm_flag2),-2
So that you don't need to initialize the byte that keeps track of gray layer since it falls through if the number was uninitialized.

Thanks again!  This does look a lot better  ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 02, 2010, 10:52:32 pm
Nice to see possible optimizations :D

I don't mind an additional 46 bytes in my progs if the speed increases a lot personally. It's only if you use grayscale, anyway.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 02, 2010, 11:32:24 pm
Actually, I just tried this on hardware, and it's too fast for even 6MHz.  But not to worry, I can group some of it into a subroutine to both add the needed delay and reduce the size of the code at the same time.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 02, 2010, 11:34:12 pm
If it's too fast on 6 MHz too, will it glitch on it too?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 03, 2010, 12:46:22 am
Hmm, did you try with ALCDFIX?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 03, 2010, 01:28:31 am
Me or Quigibo?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 03, 2010, 12:20:52 pm
Quigibo, since he said he was having slight lcd problems with my routine
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 03, 2010, 06:21:31 pm
Aaah ok. Will the problems occur only on certain calcs, btw?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 03, 2010, 06:27:44 pm
Well, this is probably just the same thing that caused problems with certain games on newer calcs (which require longer delays when interfacing with the LCD). I suppose it wouldn't hurt to add a little more delay to be safe.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 03, 2010, 06:43:33 pm
Aaah ok. Wasn't it on random calcs, tho? I remember way back in 2004 as soon as the 84+ came out there were already people with LCD issues, including regular 83+
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 03, 2010, 07:09:15 pm
I have the newer 84 with the slower LCD by the way.  I'll see if this fixes it.  Although, I would rather the user not need to do this when running axe games.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 03, 2010, 07:15:49 pm
same here. I think this is a major hassle when you download a game for your new calc only to find out you will need an additional program.
/me stabs Texas Instruments x.x

They need to quit with their crappy LSD... er... LCDs
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: TIfanx1999 on June 03, 2010, 11:07:18 pm
I have the newer 84 with the slower LCD by the way.  I'll see if this fixes it.  Although, I would rather the user not need to do this when running axe games.
If it's a shoddy LCD it really can't be helped. :( It's a shame that some of the hardware is low quality.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 04, 2010, 01:14:54 am
Alright, I made a big modification.  It's now only 7 bytes larger than the original routine, still doesn't need to be initialized, and is significantly faster and better looking, but still just slow enough to work on any LCD without having to run fixer programs.

The only downside is that I use a shadow register.  That means this command should NEVER be used in an interrupt routine.  That's alright though because you should never have LCD commands inside of an interrupt anyway.  It's much better to just have the interrupt control a counter and then your drawing routine updates the LCD depending on the counter.

Let me know if this can be optimized more and still run on the slower LCDs.

Code:
p_Disp4Lvl:
di
ld a,$80
out ($10),a
ld (OP1),sp
ld sp,appbackupscreen - plotSScreen
ld e,(plotSScreen-appbackupscreen+12)&$ff
ld c,-$0C
ex af,af'
ld a,%11011011
inc (iy+asm_flag2)
jr z,__Disp4Lvlskip
add a,a
ld hl,(flags+asm_flag2)
inc l
jr z,__Disp4Lvlskip
rlca
ld (iy+asm_flag2),-2
__Disp4Lvlskip:
ld l,plotSScreen&$ff-1
ex af,af'
__Disp4Lvlentry:
ld a,c
add a,$2C
ld h,plotSScreen>>8
inc l
ld b,64
out ($10),a
__Disp4Lvlloop:
ld a,(hl)
add hl,sp
xor (hl)
ex af,af'
cp e                              ;Epic coincidence that e just happened to be between 182 and 219
rra
ld d,a
ex af,af'
and d
xor (hl)
out ($11),a
ld d,(plotSScreen-appbackupscreen+12)>>8
add hl,de
djnz __Disp4Lvlloop
inc c
jr nz,__Disp4Lvlentry
__Disp4LvlDone:
ld sp,(OP1)
ei
ret
__Disp4LvlEnd:
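
The "epic coincidence" can be checked with a short Python model of the `cp e` / `rra` pair (a sketch, not the routine itself). The two screen addresses are assumed to be the usual ti83plus.inc equates, and `next_mask` is a made-up name. The carry from the compare has to be 0 after the mask 219 and 1 after 109 and 182, so the low byte in e has to be above 182 and no more than 219.

```python
PLOTSSCREEN = 0x9340        # assumed equates from the usual ti83plus.inc
APPBACKUPSCREEN = 0x9872
E = (PLOTSSCREEN - APPBACKUPSCREEN + 12) & 0xFF


def next_mask(a, e=E):
    """Model of: cp e / rra."""
    carry = 1 if a < e else 0          # cp e sets carry when A < E
    return (carry << 7) | (a >> 1)     # rra shifts right and pulls carry into bit 7


def cycles_correctly(e):
    """True if the three dither masks rotate into each other for this e."""
    return (next_mask(0xDB, e), next_mask(0x6D, e), next_mask(0xB6, e)) == (0x6D, 0xB6, 0xDB)
```

With those equates e works out to 218, right inside the window.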
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 04, 2010, 01:32:38 am
Although I won't understand the code, I am glad you found a solution :D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 09, 2010, 01:16:54 am
Hmm, remember when I suggested you could use IX as a pointer to the variables for easier access when doing 8-bit operations? That would be too much of a hassle, we agreed. But what if you moved the variables to the END of savesscreen? Then they would be within the range of the IY register (which points 4 bytes after the end of savesscreen). Just something to consider :)
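
As a sanity check on the range, here is a small Python calculation (a sketch under stated assumptions, not Axe's real layout). It assumes the usual ti83plus.inc addresses, 27 two-byte variables (A-Z plus theta) packed at the very end of savesscreen, and a made-up helper name. (iy+d) takes a signed 8-bit displacement, so every byte has to sit within -128..127 of IY.

```python
SAVESSCREEN = 0x86EC        # assumed equates from the usual ti83plus.inc
FLAGS = 0x89F0              # IY normally points here
BUFFER_SIZE = 768


def iy_offsets(var_count=27, var_size=2):
    """Displacements from IY for variables packed at the end of savesscreen."""
    end = SAVESSCREEN + BUFFER_SIZE
    first = end - var_count * var_size
    return [first + i * var_size - FLAGS for i in range(var_count)]
```

Under those assumptions the displacements run from -58 up to -6, comfortably inside the range.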
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ztrumpet on June 09, 2010, 06:28:02 pm
This is really cool!
Quigibo, is it possible to have the routine that's too fast, or is it way too fast? ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 09, 2010, 09:59:04 pm
Hmm, remember when I suggested you could use IX as a pointer to the variables for easier access when doing 8-bit operations? That would be too much of a hassle, we agreed. But what if you moved the variables to the END of savesscreen? Then they would be within the range of the IY register (which points 4 bytes after the end of savesscreen). Just something to consider :)
Hmm... interesting proposition.  Although it would optimize some operations like addition and subtraction, there is still one key advantage to using the existing variable slots.  By using only the least significant bytes of those variables, you never need to do "conversions" when you switch from word to byte mode.  You have to get input and output somehow otherwise there's not much advantage to the new mode.  By being able to skip the conversions, I think that will save more memory in the long run than by using the iy or ix registers.

By the way, I've further optimized that grayscale command and it's basically the same size as my original now except way faster which is excellent.

@ztrumpet: I'm not sure if it actually works, I'm going to assume calc83manic tested it on his hardware and got it to work since he proposed it, but the main issue here is compatibility.  This isn't the only sacrifice I've had to make.  There are a few other places where undocumented commands could make routines more efficient, but that wouldn't please the Nspire users.  I could add waste instructions to the super fast routine, but it really adds a lot to the size since I have to add 3 times as many.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 09, 2010, 11:08:09 pm
Actually, I didn't test it on-calc. I've written things like this before, so this time I just typed it up.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 09, 2010, 11:41:29 pm
I've further optimized that grayscale command and it's basically the same size as my original now except way faster which is excellent.
Cool to hear :D I can't wait to see the difference :)

Now I wonder how easy it would be to make a grayscale version of the 3D racing game that comes with Axe

calc83manic
You downgraded calc84maniac to 6 MHz hardware :(
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on June 09, 2010, 11:53:09 pm
calc83manic
You downgraded calc84maniac to 6 MHz hardware :(

He also downgraded him to having a mood disorder.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 09, 2010, 11:55:49 pm
calc83manic
You downgraded calc84maniac to 6 MHz hardware :(

He also downgraded him to having a mood disorder.
oh crap I didn't notice x.x. I hope he still finishes all his projects including the 8 level grayscale raycaster D: *runs*
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: TIfanx1999 on June 10, 2010, 07:44:46 am
It's awesome that you were able to further optimize the grayscale, this makes me very happy! :D
calc83manic
You downgraded calc84maniac to 6 MHz hardware :(

He also downgraded him to having a mood disorder.
oh crap I didn't notice x.x. I hope he still finishes all his projects including the 8 level grayscale raycaster D: *runs*
LOL, you guys... :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: tr1p1ea on June 10, 2010, 08:41:26 am
Maybe you could use ixl/ixh to avoid the shadow reg? It is 9 clocks for ld iirc tho :S.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on June 10, 2010, 09:31:50 am
It's awesome that you were able to further optimize the grayscale, this makes me very happy! :D
calc83manic
You downgraded calc84maniac to 6 MHz hardware :(

He also downgraded him to having a mood disorder.
oh crap I didn't notice x.x. I hope he still finishes all his projects including the 8 level grayscale raycaster D: *runs*
LOL, you guys... :P
lol :P

Maybe you could use ixl/ixh to avoid the shadow reg? It is 9 clocks for ld iirc tho :S.
Aren't those incompatible with the Nspire, though? Or am I confusing them?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ztrumpet on June 10, 2010, 10:07:44 am
LOL you guys. :P

Maybe you could use ixl/ixh to avoid the shadow reg? It is 9 clocks for ld iirc tho :S.
Aren't those incompatible with the Nspire, though? Or am I confusing them?
I'm pretty sure they're not compatible, since they are undocumented. :(
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 10, 2010, 10:54:43 am
Also, you can't use bit-shift instructions on ixh/ixl.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 10, 2010, 04:38:40 pm
*bump* (excuse the double post, I didn't really read this before)

Hmm, remember when I suggested you could use IX as a pointer to the variables for easier access when doing 8-bit operations? That would be too much of a hassle, we agreed. But what if you moved the variables to the END of savesscreen? Then they would be within the range of the IY register (which points 4 bytes after the end of savesscreen). Just something to consider :)
Hmm... interesting proposition.  Although it would optimize some operations like addition and subtraction, there is still one key advantage to using the existing variable slots.  By using only the least significant bytes of those variables, you never need to do "conversions" when you switch from word to byte mode.  You have to get input and output somehow otherwise there's not much advantage to the new mode.  By being able to skip the conversions, I think that will save more memory in the long run than by using the iy or ix registers.
What are these "conversions" you are talking about, and why are they affected by using the IY register? You can keep using normal memory loads/stores as much as you want, IY is just a bonus optimization (especially once you add 8-bit math mode). I'd imagine being able to directly add, subtract, and, or, xor variables in only 3 bytes is worth moving the variables to the end of the buffer.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 11, 2010, 04:22:57 pm
Oh, I see!  So you mean move all the variables to the end, not separate 8-bit variables.  That's actually a really good idea then.  And I don't think this would cause any compatibility issues assuming no one is abusing the Asm() feature.  The only drawback is that buffer overflows will flow into the A-Z variables, making them harder to debug, but that's really a non-issue.  In fact, it's probably far safer than overflowing into some random RAM values as far as stability goes.

I'll wait until I add the feature before moving the variables though.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on July 09, 2010, 04:21:39 pm
Here is a speed optimization for the length() command (the size turned out to be the same):
Code:
xor a
ld b,a
ld c,a
cpir
ld hl,-1
sbc hl,bc
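
To see why the negative count works, here is a minimal Python sketch of the same arithmetic. It is not from the original post: `axe_length` and the byte-array memory model are made up for illustration. CPIR decrements BC once per byte it compares, including the terminating zero, so starting from BC = 0 the final -1 - BC comes out as the string length.

```python
def axe_length(memory, start):
    """Model of: xor a / ld b,a / ld c,a / cpir / ld hl,-1 / sbc hl,bc."""
    bc = 0                      # BC starts at 0, so CPIR may scan up to 65536 bytes
    hl = start
    while True:
        byte = memory[hl]
        hl = (hl + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF  # CPIR decrements BC for every byte compared
        if byte == 0 or bc == 0:
            break
    # ld hl,-1 / sbc hl,bc with carry clear: HL = -1 - BC (mod 65536)
    return (0xFFFF - bc) & 0xFFFF


data = bytearray(b"HELLO\x00")
print(axe_length(data, 0))
```

The carry is clear at the `sbc` because `xor a` resets it and CPIR leaves it alone.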
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on July 09, 2010, 04:35:59 pm
Thanks! I thought about doing it that way, but I couldn't figure it out. It's smart to use the negative. ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on September 22, 2010, 01:21:17 pm
Oh, how I love to bump this thread! :D

I'm thinking it would be neat for the commands that return booleans to not always have to calculate the 0 or 1 value directly. I was thinking instead that the command will set an internal compiler variable that tells which condition code to use. Then, the following command can either optimize according to the condition or otherwise generate a 0 or 1 as usual.

The most common application for this is the If statement. Take, for example, If A=5:
Code:
  ld hl,(var_a)
  ld de,5
  or a
  sbc hl,de
  ;The Z condition code is set to correspond to 1
  jp nz,end

Or how about B<10+A->A:
Code:
  ld hl,(var_b)
  ld de,-10
  add hl,de
  ;The NC condition code is set to correspond to 1
  ld hl,(var_a)
  jp c,no_inc
  inc hl
no_inc:

I imagine the "=0" command, when applied while condition code is active, will invert the condition variable and generate no code.

And... I guess that is all for now. Good day!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on September 22, 2010, 04:24:15 pm
I'm thinking it would be neat for the commands that return booleans to not always have to calculate the 0 or 1 value directly. I was thinking instead that the command will set an internal compiler variable that tells which condition code to use. Then, the following command can either optimize according to the condition or otherwise generate a 0 or 1 as usual.

How do you surmise that the compiler would know when to use which? For instance, A≠5 or (B≠6) could be optimized to A-5 or (B-6), but A≠5 and (B≠6) could not be similarly optimized.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on September 22, 2010, 04:32:51 pm
A≠5 and (B≠6) would be something like:
Code:
  ld hl,(var_a)
  ld de,5
  or a
  sbc hl,de
  ;The NZ condition code is set to correspond to 1
  ;If A=5, HL is already 0 from the sbc, which is the result of the "and"
  jp z,_
  ld hl,(var_b)
  ld de,6
  or a
  sbc hl,de
  ;The NZ condition code is set to correspond to 1
  ld hl,1
  jr nz,_
  dec hl
_
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on September 22, 2010, 06:42:19 pm
And how would the parser decide to do that ???
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on September 22, 2010, 06:48:29 pm
The problem with conditional "short-circuit evaluation" is that it has to do a lot of non-linear "look-ahead" parsing to determine if it's okay to get out of the statement early or not.  You might for example have If A≠5 and sub(EQL,B,C) which might need to evaluate the second expression even if the first one is false.  The idea definitely sounds good though, but it seems like it would be really complicated for the compiler to tell whether or not it can actually use that optimization and be completely compatible with previous versions.  And even when it can, I would have to write completely new block code and assembly templates for those conditionals.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on October 20, 2010, 08:56:30 am
Oops, necropost, oh well :P

I don't know if this approach was purposely left out, as it's 15 bytes larger than the current routine and sometimes slower. I'm referring to the square root routine. Whereas the current routine (14 bytes) takes 37n+38 T-states (linear time), where n is the result+1 (1-256), the following routine (29 bytes) takes 5n+800 T-states (near constant time), where n is the number of set bits in the result (0-8). The existing routine is faster for values that would yield results of 0-19, but this routine would be faster for values that would yield results of 20-255, which is a much broader range of the 8-bit spectrum. Also, it would be much more reliable to run at a near constant speed in programs which rely on that to run smoothly themselves. The existing routine would take only a few hundred T-states for low inputs, but would take up to OVER NINE THOUSAND T-states to calculate the square roots for the highest inputs. So it's up to you if this is something you want to use.

Code:
p_Sqrt:
.db __SqrtEnd-1-$
ld a,l
ld l,h
ld de,$0040
ld h,d
ld b,8
or a
__SqrtLoop:
sbc hl,de
jr nc,__SqrtSkip
add hl,de
__SqrtSkip:
ccf
rl d
rla
adc hl,hl
rla
adc hl,hl
djnz __SqrtLoop
ld h,0
ld l,d
ret
__SqrtEnd:
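
For reference, the general bit-by-bit square root technique that gives this kind of near-constant running time can be sketched in Python. This is an illustration of the technique under made-up naming (`isqrt16` is hypothetical), not a line-by-line transcription of the routine above. The point is that it always runs exactly 8 iterations for a 16-bit input, one per result bit, instead of counting up to the answer.

```python
def isqrt16(n):
    """Bit-by-bit integer square root of a 16-bit value (always 8 iterations)."""
    assert 0 <= n <= 0xFFFF
    root = 0
    remainder = 0
    for _ in range(8):
        # Bring down the next two bits of n, most significant pair first.
        remainder = (remainder << 2) | ((n >> 14) & 3)
        n = (n << 2) & 0xFFFF
        # 4*root + 1 is how much (2*root + 1)^2 exceeds (2*root)^2.
        trial = (root << 2) | 1
        root <<= 1
        if remainder >= trial:
            remainder -= trial
            root |= 1
    return root


print(isqrt16(400), isqrt16(65535))
```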
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on October 30, 2010, 08:11:33 pm
I think it's been long enough that I can safely double post :P


Bit routine optimizations! Please tell me if any of these wouldn't work correctly, as I wrote them myself and I'm not a terribly experienced assembly programmer so that's a definite possibility.


Code: (Current code)
p_GetBit0:
.db 5 ;5 bytes, 36 T-states
add hl,hl
ccf
sbc hl,hl
inc hl


p_GetBit1:
.db 6 ;6 bytes, 47 T-states
add hl,hl
add hl,hl
ccf
sbc hl,hl
inc hl


p_GetBit2:
.db 7 ;7 bytes, 58 T-states
add hl,hl
add hl,hl
add hl,hl
ccf
sbc hl,hl
inc hl


p_GetBit6:
.db 7 ;7 bytes, 37 T-states
ld a,h
rra
rra
ccf
sbc hl,hl
inc hl

p_GetBit7:
.db 6 ;6 bytes, 33 T-states
rr h
ccf
sbc hl,hl
inc hl

p_GetBit8:
.db 6 ;6 bytes, 33 T-states
rl l
ccf
sbc hl,hl
inc hl


p_GetBit9:
.db 7 ;7 bytes, 37 T-states
ld a,l
rla
rla
ccf
sbc hl,hl
inc hl

p_GetBit10:
.db 8 ;8 bytes, 30/29 T-states
bit 5,l
ld hl,0
jr z,$+3
inc l




p_GetBit14:
.db 7 ;7 bytes, 37 T-states
ld a,l
rra
rra
ccf
sbc hl,hl
inc hl

p_GetBit15:
.db 6 ;6 bytes, 33 T-states
rr l
ccf
sbc hl,hl
inc hl

 
       
Code: (Optimized code)
p_GetBit0:
.db 5 ;5 bytes, 27 T-states
xor a
add hl,hl
ld h,a
rla
ld l,a

p_GetBit1:
.db 6 ;6 bytes, 38 T-states
xor a
add hl,hl
add hl,hl
ld h,a
rla
ld l,a

p_GetBit2:
.db 7 ;7 bytes, 49 T-states
xor a
add hl,hl
add hl,hl
add hl,hl
ld h,a
rla
ld l,a

p_GetBit6:
.db 7 ;7 bytes, 26 T-states
ld a,%00000010
and h
rrca
ld h,0
ld l,a


p_GetBit7:
.db 6 ;6 bytes, 22 T-states
ld a,%00000001
and h
ld h,0
ld l,a

p_GetBit8:
.db 5 ;5 bytes, 27 T-states
xor a
ld h,a
add hl,hl
ld l,h
ld h,a

p_GetBit9:
.db 6 ;6 bytes, 38 T-states
xor a
add hl,hl
ld h,a
add hl,hl
ld l,h
ld h,a

p_GetBit10:
.db 7 ;7 bytes, 49 T-states
xor a
add hl,hl
add hl,hl
ld h,a
add hl,hl
ld l,h
ld h,a

p_GetBit14:
.db 7 ;7 bytes, 26 T-states
ld a,%00000010
and l
rrca
ld h,0
ld l,a


p_GetBit15:
.db 5 ;5 bytes, 20 T-states
xor a
ld h,a
inc a
and l
ld l,a
 


Other optimizations:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on November 10, 2010, 02:38:37 am
Signed greater than comparison:

Code: (Current code)
p_SIntGt:
.db 13 ;13 bytes, 58 T-states
ex de,hl
xor a
ld b,h
sbc hl,de
ld h,a
rra
xor b
xor d
rlca
and 1
ld l,a
       
Code: (Optimized code)
p_SIntGt:
.db 12 ;12 bytes, 67 T-states
ld bc,$8000
add hl,bc
ex de,hl
add hl,bc
xor a
sbc hl,de
ld h,a
rla
ld l,a


You getting all this Quigibo? :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on November 10, 2010, 02:40:58 am
I think he is too busy, which might explain why he doesn't respond.
/me hopes his school schedule doesn't get so drastic that he gets forced to quit the community for good... I am not too worried about the future of Axe programming, though. I was worried that if he became less active, there would be less activity in his sub-forum since he replied to a lot of help topics, but activity still continued. I guess a huge thanks to you and a bunch of other people is in order. Sadly, having quit programming a while ago, I did not really participate much, though X.x
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on November 10, 2010, 07:55:16 pm
Yeah, I'm still reading all of this, even though I'm less active, I still visit just about every day :)  I've even been able to do a little more progress with Axe even with my busy schedule.

Runer112, are you sure that comparison is correct?  It seems like all it does is just change the high order bit before doing the subtraction.  It needs to check if the parity changed in that bit before and after the subtraction.  I actually already have plans to optimize this since I will be able to use the parity/overflow flag once I get relative jump replacement working with the axioms (so I can carry that feature over to the built-in commands).
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on November 10, 2010, 09:20:06 pm
Changing the high order bit does work, actually. It changes a comparison in the -32768 to 32767 range to a comparison in the 0 to 65535 range (effectively changing from a signed comparison to an unsigned comparison).
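
A quick way to convince yourself is the Python sketch below. The helper names are made up for illustration, and it models only the arithmetic, not the Z80 flags. Adding $8000 to both operands flips bit 15, which maps -32768..32767 onto 0..65535 while keeping the order intact.

```python
def to_signed(x):
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x


def signed_gt_via_bias(a, b):
    """Signed a > b, done as an unsigned compare after adding $8000 to both."""
    ua = (a + 0x8000) & 0xFFFF   # add hl,bc with bc = $8000 flips bit 15
    ub = (b + 0x8000) & 0xFFFF
    return 1 if ua > ub else 0


samples = [0x0000, 0x0001, 0x2000, 0x7FFF, 0x8000, 0xA000, 0xE000, 0xFFFF]
mismatches = [
    (a, b)
    for a in samples
    for b in samples
    if signed_gt_via_bias(a, b) != (1 if to_signed(a) > to_signed(b) else 0)
]
print(mismatches)
```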
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on November 10, 2010, 11:55:08 pm
Changing the high order bit does work, actually. It changes a comparison in the -32768 to 32767 range to a comparison in the 0 to 65535 range (effectively changing from a signed comparison to an unsigned comparison).

Yup :) This is the only signed comparison for which this method is better though.

Do all the bit optimizations look correct by the way?



EDIT: If you plan on optimizing the signed comparisons to use the parity/overflow flag, you might want to check into that a bit. I was playing around with signed comparisons and wabbitemu was telling me very strange things. It seemed to tell me that signed comparisons relied on an xor of the p/v and s flags. Which makes no sense, but that's what wabbitemu was telling me. See table below.

hl      de      sbc hl,de   c   p/v   s   hl>>de
2000    6000    C000        1   0     1   0
2000    A000    8000        1   1     1   1
2000    E000    4000        1   0     0   1
6000    2000    4000        0   0     0   1
6000    A000    C000        1   1     1   1
6000    E000    8000        1   1     1   1
A000    2000    8000        0   0     1   0
A000    6000    4000        0   1     0   0
A000    E000    C000        1   0     1   0
E000    2000    C000        0   0     1   0
E000    6000    8000        0   0     1   0
E000    A000    4000        0   0     0   1
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on November 11, 2010, 12:32:17 am
Yeah, I'm still reading all of this, even though I'm less active, I still visit just about every day :)  I've even been able to do a little more progress with Axe even with my busy schedule.
Ah phew, good to hear x.x. Still, I hope the schedule won't get even more hectic with the time. X.x
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on November 11, 2010, 01:04:30 am
It seemed to tell me that signed comparisons relied on an xor of the p/v and s flags. Which makes no sense, but that's what wabbitemu was telling me.
It actually does make a bit of sense. Whether the mathematical (non-overflowed) result of the subtraction is positive or negative should give you the result of the comparison.  However, if there was a signed overflow, it will give the wrong result. So the sign flag needs to be inverted if there was an overflow, and XOR achieves this perfectly.
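
The same reasoning can be checked with a small Python model of the 16-bit subtraction flags. This is only a sketch: `sbc16_flags` is a hypothetical helper, and it computes P/V as signed overflow, which is what `sbc hl,de` reports in that flag. With that model, S xor P/V is 1 exactly when hl is less than de as signed numbers.

```python
def to_signed(x):
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x


def sbc16_flags(hl, de, carry_in=0):
    """Return (result, c, pv, s) for a modelled sbc hl,de."""
    raw = hl - de - carry_in
    result = raw & 0xFFFF
    c = 1 if raw < 0 else 0                       # unsigned borrow
    s = result >> 15                              # sign bit of the wrapped result
    exact = to_signed(hl) - to_signed(de) - carry_in
    pv = 0 if -32768 <= exact <= 32767 else 1     # signed overflow
    return result, c, pv, s


def signed_lt(hl, de):
    _, _, pv, s = sbc16_flags(hl, de)
    return s ^ pv                                 # flip the sign when it overflowed


print(sbc16_flags(0x2000, 0xA000), signed_lt(0x2000, 0xA000))
```

For 2000 and A000 this gives result 8000 with c, p/v and s all set, so S xor P/V is 0 and hl is not less than de, which agrees with that row of the table.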
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on November 11, 2010, 01:43:50 am
Yeah, my point is that Quigibo is probably just better off using the signed comparisons he already uses instead of bothering with the p/v flag, because it gets messy.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on November 15, 2010, 08:27:44 am
Actually, the main reason he didn't use the p/v flag is because his routines didn't support absolute jumps. They apparently do now, so some speed-up using these flags might be possible.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on November 28, 2010, 06:04:40 pm
Cool, Quigibo added all my optimized auto-optimizations :) But I think you missed p_GetBit15, which can be optimized to be the same as p_Mod2.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on November 28, 2010, 06:08:28 pm
Cool, Quigibo added all my optimized auto-optimizations :) But I think you missed p_GetBit15, which can be optimized to be the same as p_Mod2.

Great! So, it optimizes the Axe script or the Asm conversion?

Like, the following program:

Code:
Output(0,0,"Hello World")
Is optimized to:

Code:
Output(0,0,"Hello World
Or is it Assembly that is optimized?

Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calcdude84se on November 28, 2010, 06:09:36 pm
This relates to the underlying z80 machine code (or ASM, if you prefer) that Axe generates. ;)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on November 28, 2010, 06:10:09 pm
This relates to the underlying z80 machine code (or ASM, if you prefer) that Axe generates. ;)

Good then, so the 'executables', that is, the .asm files, are now running faster!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ASHBAD_ALVIN on November 28, 2010, 06:40:57 pm
I would love to help with optimization, I can help with virtually everything except messing with archive, writing apps, and advanced sprite routines.

Other than that, I can help with everything else.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 01, 2010, 03:38:20 pm
Just some minor stuff. None of these improve size, but they'll shave off a few cycles.

Code:
p_Add510:
.db 4
inc h
inc h
dec hl
dec hl
p_Add514:
.db 4
inc h
inc h
inc hl
inc hl
p_Add767:
.db 4
inc h
inc h
inc h
dec hl
p_Add769:
.db 4
inc h
inc h
inc h
inc hl
p_Add1024:
.db 4
inc h
inc h
inc h
inc h
p_Sub510:
.db 4
dec h
dec h
inc hl
inc hl
p_Sub514:
.db 4
dec h
dec h
dec hl
dec hl
p_Sub767:
.db 4
dec h
dec h
dec h
inc hl
p_Sub769:
.db 4
dec h
dec h
dec h
dec hl
p_Sub1024:
.db 4
dec h
dec h
dec h
dec h
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 17, 2010, 01:11:47 am
Some constant equality checking optimizations. Feel free to extrapolate these optimizations to values that previously weren't optimized because the code would have been too large.

Code: (Current code)
p_EQN256:
.db 9
inc h
ld a,l
or h
jr z,$+5
ld hl,255
inc l
p_EQN2:
.db 10
inc hl
inc hl
ld a,l
or h
jr z,$+5
ld hl,255
inc l
p_EQN1:
.db 9
inc hl
ld a,l
or h
jr z,$+5
ld hl,255
inc l
p_EQ0:
.db 8
ld a,l
or h
jr z,$+5
ld hl,255
inc l
p_EQ256:
.db 9
dec h
ld a,l
or h
jr z,$+5
ld hl,255
inc l

       
Code: (Optimized code)
p_EQN256:
.db 8
inc h
ld a,l
or h
add a,255
sbc hl,hl
inc hl
p_EQN2:
.db 8
inc l
ld a,l
and h
sub 255
sbc hl,hl
inc hl

p_EQN1:
.db 7
ld a,l
and h
sub 255
sbc hl,hl
inc hl

p_EQ0:
.db 7
ld a,l
or h
add a,255
sbc hl,hl
inc hl
p_EQ256:
.db 8
dec h
ld a,l
or h
add a,255
sbc hl,hl
inc hl



Also, p_Div32768 can be optimized to be the same as p_SLT0 and p_GetBit0.
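
In case the carry trick in these is not obvious, here is a rough Python model of two of the optimized routines. The function names are made up, and only the values of A, the carry and HL are modelled. The idea is to fold HL into A so that exactly one input produces (or avoids) a carry, then let `sbc hl,hl` spread that carry into 0 or $FFFF and `inc hl` turn it into 1 or 0.

```python
def eq_neg1(hl):
    """Model of p_EQN1: ld a,l / and h / sub 255 / sbc hl,hl / inc hl."""
    a = (hl & 0xFF) & (hl >> 8)      # $FF only when both bytes are $FF
    carry = 1 if a < 255 else 0      # sub 255 borrows unless A was $FF
    hl = (0 - carry) & 0xFFFF        # sbc hl,hl leaves 0 or $FFFF
    return (hl + 1) & 0xFFFF         # inc hl gives 1 or 0


def eq_0(hl):
    """Model of p_EQ0: ld a,l / or h / add a,255 / sbc hl,hl / inc hl."""
    a = (hl & 0xFF) | (hl >> 8)      # 0 only when both bytes are 0
    carry = 1 if a + 255 > 255 else 0  # add a,255 carries unless A was 0
    hl = (0 - carry) & 0xFFFF
    return (hl + 1) & 0xFFFF


print(eq_neg1(0xFFFF), eq_neg1(0xFFFE), eq_0(0), eq_0(0x0100))
```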
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 20, 2010, 05:42:17 pm
A few more equality checking optimizations. Not sure why I didn't see these before.

Code: (Current code)
p_NEN3:
.db 10
inc hl
inc hl
inc hl
ld a,l
or h
jr z,$+5
ld hl,1
p_NEN2:
.db 9
inc hl
inc hl
ld a,l
or h
jr z,$+5
ld hl,1
p_NEN1:
.db 8
inc hl
ld a,l
or h
jr z,$+5
ld hl,1

       
Code: (Optimized code)
p_NEN3:
.db 9
inc l
inc l
ld a,l
and h
add a,1
sbc hl,hl
inc hl
p_NEN2:
.db 8
inc l
ld a,l
and h
add a,1
sbc hl,hl
inc hl
p_NEN1:
.db 7
ld a,l
and h
add a,1
sbc hl,hl
inc hl

Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on December 20, 2010, 05:53:58 pm
will this increase speed, decrease size or both?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 20, 2010, 05:57:40 pm
These will decrease size, but also slightly decrease speed. I believe that Quigibo prefers smaller sizes over slightly slower code.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on December 20, 2010, 06:00:00 pm
Concerning optimizations:

I used an Assembly Disassembler to disassemble my game uPong and the code was 1531 lines! It was huge. Hopefully, thanks to Runer112's and other members' optimizations, Axe programs can get smaller and smaller :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on December 20, 2010, 06:01:55 pm
Concerning optimizations:

I used an Assembly Disassembler to disassemble my game uPong and the code was 1531 lines! It was huge. Hopefully, thanks to Runer112's and other members' optimizations, Axe programs can get smaller and smaller :)

Axe programs always have some overhead. You can get rid of memory storing like with For( loops and switch them to use registers, and you'd get it several times faster too :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 20, 2010, 06:34:04 pm
The more assembly instructions available, the more you could optimize programs. But I don't want Axe to just turn into an assembly compiler with a 16-bit math engine and some other pre-built structures on the side. If you want that, you probably know assembly anyways and could just write your program in assembly and grab some of Axe's routines.

However, three instructions that I believe would complement Axe very well without defeating the purpose of its hl-based system are the ex de,hl, push hl, and pop hl instructions.

Quigibo had toyed with the idea of implementing the ex de,hl instruction before, allowing it to be called with either →π or π. They would do the same thing, but it would be a more familiar syntax for programmers used to a variable system in which storing and recalling values are different instructions. And/or he could implement the instruction as the Exch() command with no arguments.

Although there would definitely need to be a warning to only use push and pop if you know how stacks work and know what you're doing, those would be pretty helpful I think. Maybe push hl could be StoreGDB and pop hl could be RecallGDB .


In case Quigibo is reading this, any chance of these happening?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Ashbad on December 20, 2010, 06:35:49 pm
well, that's a great idea, but maybe even more commands to let advanced users toy with assembly instructions further?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 20, 2010, 06:39:20 pm
I don't think push and pop are that good of an idea because of the stack thing.  And all of those are only 2 hex characters of the Asm() command anyway.  The exchange I am still thinking about, I'm trying to see how practical it would be and how many commands are actually safe to use and I could change some commands to use bc instead.

Thanks again for the optimizations of the equalities, those are very widely used in programs so it's great to see those optimized.  I extended your optimizations to equality comparisons against variables too, which actually reduces the number of auto-optimizations I need to do since the general cases are smaller.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 20, 2010, 06:45:47 pm
I realize that the stack can be very dangerous, as one push without a pop and you can say goodbye to the contents of your RAM. But the stack is also one of the most powerful programming tools present on the calculator. As long as you added warnings not to use the stack commands unless users know how the stack functions, I think it would be a very nice addition. And I would be happy to write an addition to the documentation about stack usage so users who aren't familiar with stacks could learn their power. ;)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Ashbad on December 20, 2010, 06:48:48 pm
very true, I personally love the stack most among most other things, that's one thing I always loved about raw assembly programming.

Then, Axe can appeal to just about everybody :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on December 20, 2010, 07:05:13 pm
I would suggest a 2nd version within the folder Developers Tools.
That way some noob won't just find it on his calc.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 20, 2010, 08:03:06 pm
There's really no reason to make an alternate version. Users would most likely learn about these new commands from the updates thread or by looking in the commands list, and Quigibo can add warnings to both of those.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on December 21, 2010, 03:07:37 pm
These will decrease size, but also slightly decrease speed. I believe that Quigibo prefers smaller sizes over slightly slower code.
I think it depends, though. If the slight speed decrease is 0.01 FPS to 0.02 it may not sound like much if executed once, but if someone uses something 20 times a loop he will see the difference. Otherwise I think he definitely optimizes for size.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on December 21, 2010, 03:29:20 pm
Hmmmm I wonder if in future versions there might be an option to compile for either speed or size, depending on the optimization you want O.O
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on December 21, 2010, 08:54:51 pm
Wouldn't it make the compiler super large, though? :O
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on December 21, 2010, 10:31:22 pm
or just have Axe version ?.?.? A  and ?.?.?B

EDIT: post 1000!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 21, 2010, 10:45:40 pm
The problem with a compile setting for size/speed is that generally, you want part of your code to be optimized for size and another part to be optimized for speed.  This could be done with "speed-op" and "size-op" commands or perhaps a prefix for labels like "Lbl +LBL" indicates everything until the next label is optimized for speed and everything else by default is optimized for size.  Any shared subroutines like multiplication are added in the second pass so if multiplication was flagged to be fast in one place, it would be fast everywhere else too since it has to add the huge routine to your code anyway, all the other calls should share it.

It would make the parser a lot bigger, but it has room.  I'm almost full on the first page, but I still have about 10KB of extra room on the second page.  Commands are data in the Axe app so they would go on that page anyway.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on December 21, 2010, 10:46:47 pm
By the way, was there a major change to the way Axe parsed →{BUNCH OF COMMANDS} somewhere between 0.2.6 and 0.4.4? I upgraded and didn't notice it until now :-\

EDIT: 10 KB remaining? Can I make a quick feature request then? Some sort of a way to pack Axioms into the app would be epic, even if we have to do it on the computer. Just a suggestion.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 21, 2010, 10:51:50 pm
Yeah, I don't remember when that was, but it optimizes better that way.  When storing stuff to a constant pointer, it returns the stuff you stored.  If storing to a variable pointer, it returns the pointer itself, or the pointer plus 1 if you're storing 2 bytes at a time.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on December 22, 2010, 02:55:40 am
The problem with a compile setting for size/speed is that generally, you want part of your code to be optimized for size and another part to be optimized for speed.  This could be done with "speed-op" and "size-op" commands or perhaps a prefix for labels like "Lbl +LBL" indicates everything until the next label is optimized for speed and everything else by default is optimized for size.  Any shared subroutines like multiplication are added in the second pass so if multiplication was flagged to be fast in one place, it would be fast everywhere else too since it has to add the huge routine to your code anyway, all the other calls should share it.

It would make the parser a lot bigger, but it has room.  I'm almost full on the first page, but I still have about 10KB of extra room on the second page.  Commands are data in the Axe app so they would go on that page anyway.
That would be a nice idea actually. My worry was effectively about some parts needing to be smaller but others faster. Good luck with whatever you decide. :)

Also I notice Axe is at 70% now ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 29, 2010, 04:32:35 am
You guys won't believe this, but I found an optimization to the general multiplication.  O.O

Code:
;#####  OLD  #####
p_Mul:
ld b,h
ld c,l
ld hl,0
ld a,16
__MulNext:
add hl,hl
rl e
rl d
jr nc,__MulSkip
add hl,bc
jr nc,__MulSkip
inc de
__MulSkip:
dec a
jr nz,__MulNext
ret
Code:
;######  NEW  ######
p_Mul:
ld c,h
ld a,l
ld hl,0
ld b,16
__MulNext:
add hl,hl
rla
rl c
jr nc,__MulSkip
add hl,de
adc a,0
jr nc,__MulSkip
inc c
__MulSkip:
djnz __MulNext
ret

It's a small optimization: it's the same size as the original, but it saves an average of 4 clock cycles per bit of the multiplication.  Since it's a 16-bit multiplication, that's an average of 64 clock cycles (11 microseconds) and about a 6% increase in speed.  So it's not that big of a deal, I was just really shocked I was able to optimize it.
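
For anyone who wants to check the new loop off-calc, here is a register-level Python model of it. This is a sketch, not part of Axe: `axe_mul` is a made-up name and the flags are reduced to a single carry variable. C:A holds the multiplier and shifts out one bit per pass while HL collects the partial products, so C:A:HL together behave like one 32-bit register.

```python
def axe_mul(hl, de):
    """Model of the NEW p_Mul loop; returns (HL, C:A) after 16 passes."""
    c, a = hl >> 8, hl & 0xFF            # ld c,h / ld a,l
    hl = 0                               # ld hl,0
    for _ in range(16):                  # ld b,16 ... djnz __MulNext
        carry = hl >> 15                 # add hl,hl
        hl = (hl << 1) & 0xFFFF
        a = (a << 1) | carry             # rla
        carry, a = a >> 8, a & 0xFF
        c = (c << 1) | carry             # rl c
        carry, c = c >> 8, c & 0xFF
        if carry:                        # jr nc,__MulSkip
            hl += de                     # add hl,de
            carry, hl = hl >> 16, hl & 0xFFFF
            a += carry                   # adc a,0
            carry, a = a >> 8, a & 0xFF
            if carry:                    # jr nc,__MulSkip
                c = (c + 1) & 0xFF       # inc c
    return hl, (c << 8) | a


low, high = axe_mul(1234, 5678)
print(low, high)
```

In this model HL ends up with the low 16 bits of the product and C:A with the high 16 bits.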
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on December 29, 2010, 04:44:27 am
Wow, that's such a common operator I can't even imagine the potential for speed increase in large programs like raycasters O.O
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jnesselr on December 29, 2010, 02:18:48 pm
That is actually an amazing increase in speed. Very nice!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on December 30, 2010, 04:04:42 am
Nice! O.O
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on January 04, 2011, 12:11:05 pm
It doesn't look like the optimized p_EQN2 and p_EQN1 routines that I suggested made it into Axe 0.4.7. Did you maybe forget to add them? Or were my routines incorrect?

Also, I think you might have overlooked my footnote in the first equality optimization post mentioning that p_Div32768 can be optimized to be the same as p_SLT0 and p_GetBit0. :P



EDIT: Also, it doesn't look like p_EQNX and p_NENX are being used by the parser. Is this just an accident, or did you intentionally leave them out for now due to the problem of the constant in the source code not equaling the constant that should be inserted?

And along the lines of comparing negative shorts, if you get p_NENX working, it should be the same size and faster for -3 than p_NEN3, so the latter should be removed. And for p_NEN2, can't the first instruction just be inc l instead of inc hl?


EDIT 2: I think I've mentioned this in the past as well, but it looks like you must've missed seeing it/adding it. p_GetBit15 could be optimized to be the same as p_Mod2.


EDIT 3: (Man this is going to be one long post) It was smart of you to make the greater than, greater or equal, less than, and less or equal comparisons with second arguments that are expressions call the opposite routine after popping the first argument into de, thus avoiding any double ex de,hl's. Could something like this also be used for the greater than and less or equal comparisons with the second argument being a variable? A byte could be saved by making these instances use ex de,hl / ld hl,($0000) to load the variable instead of ld de,($0000) and then call the opposite routine.

EDIT 4: Going along the lines of edit 3, p_SIntGt and p_SIntLe could be optimized for variable arguments in the same way. Although no bytes would be saved calling p_SIntLt instead of p_SIntGt, cycles would be.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on January 04, 2011, 05:32:59 pm
I think most of those I just missed, thanks for catching them.  The p_EQNX and p_NENX were intentionally left out, though, because I need to rewrite my optimizer to handle negative shorts first.  As for your new optimizations, I'm not sure if I want to add those because it would require me to write a lot more code for the parser, since I just have all the math operations and optimizations macro'd in right now.  Checking for a variable would be a little tricky in that section.  But I'll try it out later if I have time.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on January 04, 2011, 05:34:52 pm
Alright, these aren't that urgent anyways. Just trying to squeeze every last byte and cycle out of Axe programs. ;)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on January 05, 2011, 06:42:20 pm
Wow that took a long time. But I hope the results will be worth it.
Quigibo, get out your reading glasses. ;)

(By the way, I haven't tested these myself, but the code looks solid. If you believe that any of these would not work or have any questions, tell me.)



Smaller nibble retrieval routines. 1 byte saved for reading from RAM, 3 bytes saved for reading from ROM.

Thanks to calc84maniac for reminding me that $0000-$7FFF is read-only!

Code: (Original routine: 18 bytes, ~72 cycles)
p_Nib1:
.db __Nib1End-$-1
scf
rr h
rr l
ld a,(hl)
jr c,__Nib1Skip
rrca
rrca
rrca
rrca
__Nib1Skip:
and %00001111
ld l,a
ld h,0
ret
__Nib1End:
   
Code: (Optimized routine: 17 bytes, ~105 cycles)
p_Nib1:
.db __Nib1End-$-1
xor a
scf
rr h
rr l
ld b,(hl)
__Nib1Loop:
rrd
ccf
jr c,__Nib1Loop
ld (hl),b
ld l,a
ld h,0
ret
__Nib1End:

Code: (Original routine: 18 bytes, ~68 cycles)
p_Nib2:
.db __Nib2End-$-1
srl h
rr l
ld a,(hl)
jr c,__Nib2Skip
rrca
rrca
rrca
rrca
__Nib2Skip:
and %00001111
ld l,a
ld h,0
ret
__Nib2End:
   
Code: (Optimized routine: 15 bytes, ~77 cycles)
p_Nib2:
.db __Nib2End-$-1
xor a
srl h
rr l
rrd
jr c,__Nib2Skip
rld
__Nib2Skip:
ld l,a
ld h,0
ret
__Nib2End:



Smaller and faster nibble storage routine. 1 byte and ~17 cycles saved.

Code: (Original routine: 23 bytes, ~127 cycles)
p_NibSto:
.db __NibStoEnd-$-1
pop bc
pop de
push bc
scf
rr h
rr l
ld b,(hl)
ex de,hl ;hl = byte ;de = addr
ld a,%11110000
jr c,__NibStoSkip
add hl,hl
add hl,hl
add hl,hl
add hl,hl
cpl
__NibStoSkip:
and b
or l
ld (de),a
ret
__NibStoEnd:
   
Code: (Optimized routine: 22 bytes, ~110 cycles)
p_NibSto:
.db __NibStoEnd-$-1
pop bc
pop de
push bc
scf
rr h
rr l
jr nc,__NibStoHigh ;carry clear selects the high nibble, as in the original routine
rrd
ld a,e
rld
ret
__NibStoHigh:
rld
ld a,e
rrd
ret
__NibStoEnd:
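
Since rld and rrd are not used very often, here is a small Python model of what the two instruction sequences in the optimized routine do to the byte at (hl). The helper names are made up for illustration, and only A and the memory byte are modelled. Each sequence rotates one nibble out of the byte, swaps in the new value through A, and rotates back, so the other nibble ends up where it started.

```python
def rld(a, mem):
    """Z80 RLD: (HL) low -> (HL) high, (HL) high -> A low, A low -> (HL) low."""
    new_mem = ((mem << 4) & 0xF0) | (a & 0x0F)
    new_a = (a & 0xF0) | (mem >> 4)
    return new_a, new_mem


def rrd(a, mem):
    """Z80 RRD: (HL) low -> A low, A low -> (HL) high, (HL) high -> (HL) low."""
    new_mem = ((a & 0x0F) << 4) | (mem >> 4)
    new_a = (a & 0xF0) | (mem & 0x0F)
    return new_a, new_mem


def rrd_then_rld(mem, value, a=0):
    """rrd / ld a,e / rld: replaces the LOW nibble of the byte."""
    a, mem = rrd(a, mem)
    a = value                # ld a,e
    a, mem = rld(a, mem)
    return mem


def rld_then_rrd(mem, value, a=0):
    """rld / ld a,e / rrd: replaces the HIGH nibble of the byte."""
    a, mem = rld(a, mem)
    a = value                # ld a,e
    a, mem = rrd(a, mem)
    return mem


print(hex(rrd_then_rld(0xAB, 0x5)), hex(rld_then_rrd(0xAB, 0x5)))
```

Whatever was in A beforehand does not matter, because it is rotated into the byte and then straight back out again.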



Faster buffer inversion routine. 9951 cycles saved.

Code: (Original routine: 16 bytes, 38425 cycles)
p_InvBuff:
.db __InvBuffEnd-1-$
ld hl,plotSScreen
ld bc,768
__InvBuffLoop:
ld a,(hl)
cpl
ld (hl),a
inc hl
dec bc
ld a,b
or c
jr nz,__InvBuffLoop
ret
__InvBuffEnd:
   
Code: (Optimized routine: 16 bytes, 28474 cycles)
p_InvBuff:
.db __InvBuffEnd-1-$
ld hl,plotSScreen
ld bc,3
__InvBuffLoop:
ld a,(hl)
cpl
ld (hl),a
inc hl
djnz __InvBuffLoop
dec c
jr nz,__InvBuffLoop
ret
__InvBuffEnd:
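
The cycle counts above can be reproduced with a bit of arithmetic. The Python sketch below (function names made up for illustration) just adds up the documented Z80 T-state costs of each instruction, counting taken and not-taken jumps separately.

```python
BYTES = 768  # size of the screen buffer


def original_cycles(n=BYTES):
    setup = 10 + 10                       # ld hl,nn / ld bc,nn
    body = 7 + 4 + 7 + 6 + 6 + 4 + 4      # ld a,(hl) / cpl / ld (hl),a / inc hl / dec bc / ld a,b / or c
    jumps = (n - 1) * 12 + 7              # jr nz taken every pass except the last
    return setup + n * body + jumps + 10  # plus ret


def optimized_cycles(n=BYTES):
    setup = 10 + 10                       # ld hl,nn / ld bc,nn
    body = 7 + 4 + 7 + 6                  # ld a,(hl) / cpl / ld (hl),a / inc hl
    outer = n // 256                      # b wraps from 0, so each djnz run covers 256 bytes
    djnz = (n - outer) * 13 + outer * 8   # djnz falls through once per outer pass
    outer_cost = outer * 4 + (outer - 1) * 12 + 7  # dec c, then jr nz taken all but once
    return setup + n * body + djnz + outer_cost + 10


print(original_cycles(), optimized_cycles(), original_cycles() - optimized_cycles())
```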



You'll laugh at this... but I managed to save 4 cycles in the unarchive and archive routines. And only when the targeted variable is already where it should be (already in RAM for unarchive, already archived for archive). But hey, why not take all the savings you can get?

I think this works. It relies on the page number returned in b always being 0 if a RAM page and always being in the range or $01-$7F if a flash page.

Code: (Original routine: 18 bytes, a lot of cycles) [Select]
p_Unarchive:
.db __UnarchiveEnd-1-$
MOV9TOOP1()
B_CALL(_ChkFindSym)
ld hl,0
ret c
inc b
dec b
ret z
B_CALL(_Arc_Unarc)
ld hl,1
ret
__UnarchiveEnd:
   
Code: (Optimized routine: 18 bytes, a lot of-4 cycles) [Select]
p_Unarchive:
.db __UnarchiveEnd-1-$
MOV9TOOP1()
B_CALL(_ChkFindSym)
ld hl,0
ret c
dec b
ret m
inc b
B_CALL(_Arc_Unarc)
ld hl,1
ret
__UnarchiveEnd:

Code: (Original routine: 18 bytes, a lot of cycles) [Select]
p_Archive:
.db __ArchiveEnd-1-$
MOV9TOOP1()
B_CALL(_ChkFindSym)
ld hl,0
ret c
inc b
dec b
ret nz
B_CALL(_Arc_Unarc)
ld hl,1
ret
__ArchiveEnd:
   
Code: (Optimized routine: 18 bytes, a lot of-4 cycles) [Select]
p_Archive:
.db __ArchiveEnd-1-$
MOV9TOOP1()
B_CALL(_ChkFindSym)
ld hl,0
ret c
dec b
ret p
inc b
B_CALL(_Arc_Unarc)
ld hl,1
ret
__ArchiveEnd:
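
To see where the page-number assumption matters, here is a small Python model of the two early-exit tests in p_Unarchive, run over every possible value of b. It is not part of Axe and the function names are hypothetical; p_Archive uses the complementary conditions (ret nz / ret p), so the same page values are affected there.

```python
def returns_early_original(b):
    """inc b / dec b / ret z: leaves b unchanged and returns when b == 0."""
    return b == 0


def returns_early_optimized(b):
    """dec b / ret m: returns when bit 7 of (b - 1) is set."""
    return ((b - 1) & 0xFF) >= 0x80


def disagreeing_pages():
    """Page numbers for which the two checks make different decisions."""
    return [b for b in range(256)
            if returns_early_original(b) != returns_early_optimized(b)]
```

In this model the two versions agree for b = $00 through $80 and only differ for $81 and up, which is why the optimization depends on flash pages staying within $01-$7F.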



Smaller archived variable locating. 4 bytes saved.

Code: (Original routine: 55 bytes, a lot of cycles) [Select]
p_GetArc:
.db __GetArcEnd-1-$
push de
MOV9TOOP1()
B_CALL(_ChkFindSym)
ld hl,0
jr c,__GetArcFail
ld a,(OP1)
cp ListObj
jr z,__GetArcName
cp ProgObj
jr z,__GetArcName
cp AppvarObj
jr z,__GetArcName
cp GroupObj
jr z,__GetArcName
__GetArcStatic:
ld hl,14
jr __GetArcDone
__GetArcName:
ld hl,9
add hl,de
B_CALL(_LoadDEIndPaged)
ld d,0
inc hl
inc hl
__GetArcDone:
add hl,de
__GetArcFail:
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
ret
__GetArcEnd:
   
Code: (Optimized routine: 51 bytes, a lot of cycles) [Select]
p_GetArc:
.db __GetArcEnd-1-$
push de
MOV9TOOP1()
B_CALL(_ChkFindSym)
ld hl,0
jr c,__GetArcFail
and %00011111
ld d,b
ld hl,__GetArcVarTypes
ld bc,__GetArcEnd-__GetArcVarTypes
cpir
ld b,d
ld hl,14
jr nz,__GetArcDone
ld l,9
add hl,de
B_CALL(_LoadDEIndPaged)
ld d,0
inc e
inc e
__GetArcDone:
add hl,de
__GetArcFail:
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
ret
__GetArcVarTypes:
.db ListObj,ProgObj,AppvarObj,GroupObj
__GetArcEnd:



Smaller 8-bit get bit routine. 1 byte saved.

Code: (Original routine: 13 bytes, ~110 cycles) [Select]
p_GetBit:
.db 13
ld a,e
and %00000111
inc a
ld b,a
ld a,l
__GetBitLoop:
add a,a
djnz __GetBitLoop
ld h,b
ld l,b
rl l
   
Code: (Optimized routine: 12 bytes, ~152 cycles) [Select]
p_GetBit:
.db 12
ld a,e
and %00000111
inc a
ld b,a
xor a
__GetBitLoop:
ld h,a
add hl,hl
djnz __GetBitLoop
ld l,h
ld h,a



As long as the low byte of vx_SptBuff is at most $F8 (at most $F7 for p_FlipV, since it loads vx_SptBuff+8 and then decrements only l): faster sprite flipping routines. 16 cycles saved each.

Code: (Original routine: 13 bytes, 338 cycles) [Select]
p_FlipV:
.db __FlipVEnd-1-$
ex de,hl
ld hl,vx_SptBuff+8
ld b,8
__FlipVLoop:
dec hl
ld a,(de)
ld (hl),a
inc de
djnz __FlipVLoop
ret
__FlipVEnd:
   
Code: (Optimized routine: 13 bytes, 322 cycles) [Select]
p_FlipV:
.db __FlipVEnd-1-$
ex de,hl
ld hl,vx_SptBuff+8
ld b,8
__FlipVLoop:
dec l
ld a,(de)
ld (hl),a
inc de
djnz __FlipVLoop
ret
__FlipVEnd:

Code: (Original routine: 21 bytes, 1907 cycles) [Select]
p_FlipH:
.db __FlipHEnd-1-$
ld de,vx_SptBuff
push de
ld b,8
__FlipHLoop1:
ld c,(hl)
ld a,1
__FlipHLoop2:
rr c
rla
jr nc,__FlipHLoop2
ld (de),a
inc hl
inc de
djnz __FlipHLoop1
pop hl
ret
__FlipHEnd:
   
Code: (Optimized routine: 21 bytes, 1891 cycles) [Select]
p_FlipH:
.db __FlipHEnd-1-$
ld de,vx_SptBuff
push de
ld b,8
__FlipHLoop1:
ld c,(hl)
ld a,1
__FlipHLoop2:
rr c
rla
jr nc,__FlipHLoop2
ld (de),a
inc hl
inc e
djnz __FlipHLoop1
pop hl
ret
__FlipHEnd:
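
The alignment condition can be checked with a rough Python model of the pointer walks. This is a sketch with made-up function names, and the buffer addresses used with it are arbitrary examples rather than the real value of vx_SptBuff. It compares stepping the full 16-bit pointer against stepping only its low byte.

```python
def walk_inc_hl(base, count=8):
    """Addresses written when the pointer advances with a 16-bit inc."""
    return [(base + i) & 0xFFFF for i in range(count)]


def walk_inc_l(base, count=8):
    """Same walk, but only the low byte is incremented (inc l / inc e)."""
    high, low = base & 0xFF00, base & 0xFF
    return [high | ((low + i) & 0xFF) for i in range(count)]


def walk_dec_hl(base, count=8):
    """p_FlipV style: start at base+count, pre-decrement with dec hl."""
    ptr = (base + count) & 0xFFFF
    out = []
    for _ in range(count):
        ptr = (ptr - 1) & 0xFFFF
        out.append(ptr)
    return out


def walk_dec_l(base, count=8):
    """p_FlipV style, but pre-decrementing only the low byte (dec l)."""
    ptr = (base + count) & 0xFFFF
    out = []
    for _ in range(count):
        ptr = (ptr & 0xFF00) | ((ptr - 1) & 0xFF)
        out.append(ptr)
    return out
```

In this model the incrementing walk is safe up to a low byte of $F8, while the decrementing walk already diverges at $F8, because vx_SptBuff+8 has then carried into the high byte before the first dec l.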



Smaller and faster sprite rotating routines. 2 bytes smaller and 166 cycles faster. These also save 16 cycles from relying on the low byte of vx_SptBuff being at most $F8.

Code: (Original routine: 22 bytes, 2874 cycles) [Select]
p_RotC:
.db __RotCEnd-1-$
ex de,hl
ld hl,vx_SptBuff
ld c,8
__RotCLoop1:
push hl
ld b,8
ld a,(de)
__RotCLoop2:
rla
rr (hl)
inc hl
djnz __RotCLoop2
pop hl
inc de
dec c
jr nz,__RotCLoop1
ret
__RotCEnd:
   
Code: (Optimized routine: 20 bytes, 2708 cycles) [Select]
p_RotC:
.db __RotCEnd-1-$
ex de,hl
ld c,8+1
__RotCLoop1:
ld hl,vx_SptBuff
dec c
ret z
ld b,8
ld a,(de)
__RotCLoop2:
rla
rr (hl)
inc l
djnz __RotCLoop2
inc de
jr __RotCLoop1
__RotCEnd:

Code: (Original routine: 22 bytes, 2874 cycles) [Select]
p_RotCC:
.db __RotCCEnd-1-$
ex de,hl
ld hl,vx_SptBuff
ld c,8
__RotCCLoop1:
push hl
ld b,8
ld a,(de)
__RotCCLoop2:
rra
rl (hl)
inc hl
djnz __RotCCLoop2
pop hl
inc de
dec c
jr nz,__RotCCLoop1
ret
__RotCCEnd:
   
Code: (Optimized routine: 20 bytes, 2708 cycles) [Select]
p_RotCC:
.db __RotCCEnd-1-$
ex de,hl
ld c,8+1
__RotCCLoop1:
ld hl,vx_SptBuff
dec c
ret z
ld b,8
ld a,(de)
__RotCCLoop2:
rra
rl (hl)
inc l
djnz __RotCCLoop2
inc de
jr __RotCCLoop1
__RotCCEnd:



That's all I have for now. I think I got just about everything I could possibly find, but I might have some more later. And if you want all the routines in one file, I uploaded them all here (http://pastebin.com/Ux0B9322).
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on January 05, 2011, 06:49:06 pm
Woah a lot of new optimizations! O.O Nice job Runer112!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Eeems on January 05, 2011, 06:54:54 pm
Great job guys! :D
I should probably get back into working with Axe again sometime. Although it's hard after moving over to assembly again.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on January 05, 2011, 06:57:55 pm
Yeah it can be hard if you don't have as much freedom to do certain things you need. When I switched to Axe it was a bit hard to use BASIC again.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Eeems on January 05, 2011, 07:05:04 pm
Yeah I know :(
For me it's working with higher-level stuff; my head just seems to work better with lower-level languages. Well, almost: I can still work with JavaScript and PHP pretty well :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on January 05, 2011, 07:14:09 pm
Your nibble read routine that reads from archive will fail, because it's ROM. In that case, it might work to do something like this:
Code: [Select]
p_Nib2:
.db __Nib2End-$-1
xor a
srl h
rr l
rrd
jr c,__Nib2Skip
rld
__Nib2Skip:
ld l,a
ld h,0
ret
__Nib2End:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on January 05, 2011, 07:16:59 pm
I just sort of blindly copied the $8000-$FFFF routine, not taking into account the fact that $0000-$7FFF was ROM. Good catch, I'll edit my post now. And it turns out that's actually smaller anyways. ;)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on January 05, 2011, 10:59:54 pm
 O.O How do you do this? You're a madman!

So I've never really used or known what the rrd and rld instructions did. I thought they were some of those obscure instructions like daa, which they are, but I guess there are situations where you can use them, like with daa in the hex routine. Awesome job there! I can't believe I missed the push pop thing in the sprite rotation ones, that was embarrassing... I really do like the getcalc ones, but it uses an inline self-reference. I think because there's only one, I can easily replace it, but I'll have to make sure. The inversion one is excellent as well. I could have sworn I tried that same method before but couldn't get it the same size.

However, these are the concerns I have:  First, the sprite rotation commands, why did you move the ret to the middle of the routine?  It looks like that's just going to add more cycles since a conditional jr takes the same amount of cycles as a regular jr anyway.  Next, is it really a safe assumption that all ROM pages are between $7F and $FF for all current models and potentially future models?  And lastly, are you sure trying to modify ROM (unsuccessfully) has no potential side effects to things like flags and registers?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on January 05, 2011, 11:07:27 pm
The CPU has no way of knowing whether the write fails or not, so there are no side effects.

Also, I think the p_GetArc routine fails if the archived VAT entry overlaps into another page. You really need to check for that.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on January 05, 2011, 11:11:30 pm
However, these are the concerns I have:  First, the sprite rotation commands, why did you move the ret to the middle of the routine?  It looks like that's just going to add more cycles since a conditional jr takes the same amount of cycles as a regular jr anyway.

Yeah, I'm not really sure why I did that. Feel free to initialize c to 8 instead and decrease and check c at the end using a conditional jump instead.

Next, is it really a safe assumption that all ROM pages are between $7F and $FF for all current models and potentially future models?

$01 through $7F are all ROM pages, and $80-$87 are all RAM pages (at least for the calculators that have all those RAM pages), so it would make sense that $80 and up is RAM. But feel free to leave this optimization out anyways, it only saves 4 cycles part of the time.

And lastly, are you sure trying to modify ROM (unsuccessfully) has no potential side effects to things like flags and registers?

After a quick test, yes, rrd and rld affect a correctly even when hl points to a byte in ROM.



EDIT: By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on January 06, 2011, 02:19:30 am

EDIT: By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.

This would be really great. When I programmed in Axe, knowing the size of commands was always something I wanted to be aware of, and I'm sure a lot of people would like it, since some people might be really tight on memory and want to find every way to optimize for size.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: TIfanx1999 on January 06, 2011, 08:41:27 am
Faster buffer inversion routine. 9951 cycles saved.
It's over 9000!!!!
What?!? O.O
9000?
 <_< Yea, I know... I had to...
But seriously dude, all those optimizations are awesome!  ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on January 06, 2011, 10:09:17 am
By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
/me loves runner
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on January 06, 2011, 10:11:37 am
By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
/me loves runner

And so does Scout.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on January 07, 2011, 12:20:48 am
Faster buffer inversion routine. 9951 cycles saved.
It's over 9000!!!!
What?!? O.O
9000?
 <_< Yea, I know... I had to...
But seriously dude, all those optimizations are awesome!  ;D
Lol I just actually noticed that ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on January 09, 2011, 01:00:51 pm
Oh damn, you know what? Now I remember why I had the conditional return in the middle of the sprite rotating routines, Quigibo. Without it, the routines would return vx_SptBuff+8 in hl. Oops... But instead of re-implementing the conditional return, here's the better fix:

Code: [Select]
p_RotC:
.db __RotCEnd-1-$
ex de,hl
ld c,8
__RotCLoop1:
ld hl,vx_SptBuff+8
ld b,8
ld a,(de)
__RotCLoop2:
dec l
rra
rr (hl)
djnz __RotCLoop2
inc de
dec c
jr nz,__RotCLoop1
ret
__RotCEnd:

p_RotCC:
.db __RotCCEnd-1-$
ex de,hl
ld c,8
__RotCCLoop1:
ld hl,vx_SptBuff+8
ld b,8
ld a,(de)
__RotCCLoop2:
dec l
rla
rl (hl)
djnz __RotCCLoop2
inc de
dec c
jr nz,__RotCCLoop1
ret
__RotCCEnd:



EDIT: And as a side note, would it be possible to reformat DS<() so that the variable is reinitialized to its maximum value at the End? That way, 3 bytes could be saved by having both the zero and not zero conditions using the same store command. For example:

Code: [Select]
ld hl,(var)
dec hl
ld a,h
or l
jp nz,DS_End
;Code inside statement goes here
ld hl,max
DS_End:
ld (var),hl
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on January 10, 2011, 05:35:19 pm
Quigibo, you could probably optimize const->{expr} statements to give a lot of optimization benefits:
Code: [Select]
;const->{expr}
;Evaluate expr here
ld (hl),const

;const->{expr}r
;Evaluate expr here
ld (hl),const & $FF
inc hl
ld (hl),const >> 8

;const->{expr}rr
;Evaluate expr here
ld (hl),const >> 8
inc hl
ld (hl),const & $FF

These optimizations would still be compatible with code in earlier Axe versions because HL ends up exactly as it used to.

Edit:
These extra optimizations are also possible for storing 0:
Code: [Select]
;0->{expr}r or 0->{expr}rr
;Evaluate expr here
xor a
ld (hl),a
inc hl
ld (hl),a
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 06, 2011, 08:45:09 pm
:-[ It looks like yet another error has been discovered with my attempts to optimize things. The nibble retrieval routines and the nibble storage routine that I posted treat low and high nibbles in opposite ways. I'm pretty sure that the nibble retrieval routines are backwards and that the conditional jr c jumps should be changed to jr nc.
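
To make the mismatch concrete, here is a rough Python model of rrd and rld and of the optimized p_Nib2 and p_NibSto routines as posted earlier in the thread. The function names are made up for illustration, and the model covers only the nibble shuffling, not the address halving or the stack handling.

```python
def rrd(a, m):
    """Z80 RRD: (HL) low -> A low, A low -> (HL) high, (HL) high -> (HL) low."""
    return (a & 0xF0) | (m & 0x0F), ((a & 0x0F) << 4) | (m >> 4)


def rld(a, m):
    """Z80 RLD: (HL) high -> A low, (HL) low -> (HL) high, A low -> (HL) low."""
    return (a & 0xF0) | (m >> 4), ((m & 0x0F) << 4) | (a & 0x0F)


def read_nibble(byte, index):
    """Posted p_Nib2: xor a / rrd / jr c,skip / rld (ROM, so writes are lost)."""
    a, _ = rrd(0, byte)
    if index & 1 == 0:            # carry clear after srl h / rr l
        a, _ = rld(a, byte)
    return a


def store_nibble(byte, index, value, a_in=0):
    """Posted p_NibSto: rld / ld a,e / rrd when carry is set, rrd / ld a,e / rld otherwise."""
    first, second = (rld, rrd) if index & 1 else (rrd, rld)
    _, byte = first(a_in, byte)            # a_in is whatever A held on entry
    _, byte = second(value & 0xFF, byte)   # ld a,e loads the value to store
    return byte
```

As modelled, the read routine returns the high nibble for even indices and the low nibble for odd ones, while the store routine writes even indices to the low nibble and odd ones to the high nibble, so a value stored at one index reads back from the other.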
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: squidgetx on February 06, 2011, 08:47:17 pm
So would changing this make the new nibble routines opposite the ones found in 0.4.6? (or the same?)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on February 07, 2011, 01:11:13 am
Nice catch, can't wait for the new version :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: squidgetx on February 14, 2011, 07:17:34 am
Could this possibly be auto-optimized:

pxl-Test(CONST1,CONST2)

to

{CONST2*12+(CONST1/8)+L6}re(CONST1^8) (except ofc the math is all precalculated during parsing time.)? It saves more than 10 bytes and 200 cycles.
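
The precalculation being described can be sketched in Python. This assumes the usual 96x64 monochrome buffer layout (12 bytes per row, leftmost pixel in the most significant bit); the function names are hypothetical and only illustrate the arithmetic a parser could do ahead of time.

```python
BUFFER_WIDTH_BYTES = 12   # 96 pixels / 8 pixels per byte


def pixel_location(x, y):
    """Return (byte offset into the buffer, bit mask) for pixel (x, y)."""
    offset = y * BUFFER_WIDTH_BYTES + x // 8
    mask = 0x80 >> (x % 8)     # leftmost pixel is the most significant bit
    return offset, mask


def pxl_test(buffer, x, y):
    """Return 1 if pixel (x, y) is set in the 768-byte buffer, else 0."""
    offset, mask = pixel_location(x, y)
    return 1 if buffer[offset] & mask else 0
```

With constant coordinates, both the offset and the mask are known at parse time, so only one byte load and one bit test would be left for run time.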
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 15, 2011, 04:01:39 pm
Some improvements to MemKit! :P


Next(): 2 bytes and a few cycles saved. Also, isn't the end-of-VAT check in the wrong place? I could be wrong because my VAT experience isn't too great, but because this routine checks for the end of the VAT at the start, wouldn't this command advance the VAT pointer to the end of the VAT and not recognize it as the end until the next Next()? This would cause problems with programs reading garbage VAT data for the last "entry." If I'm right about this (which may not be the case), the third block of code I posted should hopefully recognize the end of the VAT as soon as it hits it and never advance the VAT pointer to point to the end.

Code: (Original code: 26 bytes, 152/66 cycles) [Select]

 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    e,(hl)
 inc   e
 xor   a
 ld    d,a
 sbc   hl,de
 ld    (axv_X1t),hl
 ret
 
   
Code: (Optimized code: 24 bytes, 144/66 cycles) [Select]

 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    (axv_X1t),hl
 ret
 
   
Code: (Optimized (and fixed?) code: 24 bytes, 144/113 cycles) [Select]

 ld    hl,(axv_X1t)
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    (axv_X1t),hl
 ret
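
The cpl trick in the optimized versions works because d still holds $FF from ld de,-6, so loading the complemented name length into e makes de equal to -(length+1). A quick Python model of the pointer arithmetic (hypothetical function names, 16-bit wraparound assumed):

```python
def next_entry_original(ptr, name_len):
    """ld e,(hl) / inc e / xor a / ld d,a / sbc hl,de, starting from hl = ptr-6."""
    e = (name_len + 1) & 0xFF              # inc e wraps at 8 bits
    return (ptr - 6 - e) & 0xFFFF          # d = 0 and carry cleared by xor a


def next_entry_optimized(ptr, name_len):
    """ld a,(hl) / cpl / ld e,a / add hl,de, with d left at $FF by ld de,-6."""
    de = 0xFF00 | (~name_len & 0xFF)       # equals -(name_len + 1) mod 65536
    return (ptr - 6 + de) & 0xFFFF
```

The two agree for every length from 0 to 254; they only differ for a length byte of 255, which should not occur for VAT name lengths.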
 


Dim()rr: Fixed the page offset.

Code: (Original code) [Select]

 ld    ix,(axv_X1t)
 ld    l,(ix-6)
 ld    h,0
 
   
Code: (Fixed code) [Select]

 ld    ix,(axv_X1t)
 ld    l,(ix-5)
 ld    h,0
 


Print(): n*16-13 cycles saved, n=name length. Assuming an average name length of 4.5 characters, 59 cycles saved.

Code: (Original code: 18 bytes, n*55+51 cycles) [Select]

 ld    ix,(axv_X1t)
 ld    b,(ix-6)
Ax6_Loop:
 ld    a,(ix-7)
 ld    (hl),a
 inc   hl
 dec   ix
 djnz  Ax6_Loop
 ld    (hl),b
 ret
 
   
Code: (Optimized code: 18 bytes, n*39+64 cycles) [Select]

 ex    de,hl
 ld    hl,(axv_X1t)
 ld    bc,-6
 add   hl,bc
 ld    b,(hl)
 ex    de,hl
Ax6_Loop:
 dec   de
 ld    a,(de)
 ld    (hl),a
 inc   hl
 djnz  Ax6_Loop
 ld    (hl),b
 ret
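
The savings figure follows directly from the two cycle formulas in the headers. As a small worked check in Python (function names are made up for illustration):

```python
def print_cycles_original(n):
    """Cycle formula from the original routine's header, n = name length."""
    return 55 * n + 51


def print_cycles_optimized(n):
    """Cycle formula from the optimized routine's header."""
    return 39 * n + 64


def print_cycles_saved(n):
    """Difference of the two formulas, which simplifies to 16*n - 13."""
    return print_cycles_original(n) - print_cycles_optimized(n)
```

So the optimized loop is already ahead for a one-character name, and the gap grows by 16 cycles per additional character.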
 
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 16, 2011, 01:47:44 pm
Yay, double post! But it's been almost a day and I have a pretty good question/suggestion. This relates to the screen display commands. This was brought to mind when squidgetx made a post mentioning something I had discovered a while ago when documenting the speed of Axe commands. What he mentioned is that DispGraphr actually runs faster than DispGraph. Here's a quote of my response to that:

I see you've been reading up on my Commands documentation, eh squidgetx? Yeah, that's an interesting thing I discovered when speed testing the display commands. On calculators like mine with the old, "good" screen drivers, the screen driver delay seems to be pretty low and constant from calculator to calculator. DispGraph could run just as fast or faster than DispGraphr on these calculators. However, due to inconsistencies with the screen drivers in newer units, the routine may run too fast for the driver on some calculators, causing display problems, so Quigibo had to add a portion of code to pause the routine until the driver says it is ready. However, this pause itself adds some overhead time, making the routine slower.

Quigibo, the DispGraphr routine doesn't have any throttling system in place, yet no problems have been reported with it on newer calculators. Could you just remove the throttling system from the DispGraph routine and add one or two time-wasting instructions to make each loop iteration take as many cycles as each DispGraphr loop iteration?


EDIT: Hmm I don't know if Quigibo reads this thread and would see that, so I'm probably going to post that in a major thread he reads or send him a message about that.


The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on February 16, 2011, 02:58:19 pm
Quote
Print()

What function would that be Runer?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on February 16, 2011, 11:32:36 pm
@squidgetx
I don't think pixel testing points with constant coordinates is common enough to warrant the pixel tester to treat it as a special case.  99% of the time, you're going to be using variable arguments to test pixels.  If not, the code can probably be made more efficient without a pixel test in the first place.

The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?

Unfortunately that is not entirely true.  There has actually been at least 1 report that the 3-level routine is too fast and causes flickers once in a great while on very new hardware.  If there was a lower bound for clock cycles, I'm right on it.  Although, I could still probably take the safety stuff off the safe copy routine, still have it faster (but not too fast) and still be smaller.  I will look into that.

And I do read most of these threads, I'm just generally too busy to post, but I try to when I have small pockets of free time :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 17, 2011, 09:47:36 pm
Now that you have absolute jumps implemented:

Code: (Original code) [Select]

p_Exchange:
.db 13
pop de
ex (sp),hl
pop bc
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
ld a,b
or c
jr nz,$-8

   
Code: (Optimized code) [Select]

p_Exchange:
.db 12
pop de
ex (sp),hl
pop bc
__ExchangeLoop:
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
jp pe,__ExchangeLoop ;pe is right: ldi leaves P/V set while bc is nonzero



Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 20, 2011, 06:47:27 pm
I felt bad last time I optimized the constant bit-checking auto optimizations because I left about half of them out, stuck with the 8-byte plain old bit check routine. But thanks to a random revelation I had while lying in bed last night, I have come back for the forgotten ones!


Code: (Original code) [Select]
p_GetBit2:
.db 7 ;7 bytes, 49 cycles
xor a
add hl,hl
add hl,hl
add hl,hl
ld h,a
rla
ld l,a
p_GetBit3:
.db 8 ;8 bytes, 30/29 cycles
bit 4,h
ld hl,0
jr z,$+3
inc l

p_GetBit4:
.db 8 ;8 bytes, 30/29 cycles
bit 3,h
ld hl,0
jr z,$+3
inc l

p_GetBit5:
.db 8 ;8 bytes, 30/29 cycles
bit 2,h
ld hl,0
jr z,$+3
inc l

p_GetBit10:
.db 7 ;7 bytes, 49 cycles
xor a
add hl,hl
add hl,hl
ld h,a
add hl,hl
ld l,h
ld h,a
p_GetBit11:
.db 8 ;8 bytes, 30/29 cycles
bit 4,l
ld hl,0
jr z,$+3
inc l

p_GetBit12:
.db 8 ;8 bytes, 30/29 cycles
bit 3,l
ld hl,0
jr z,$+3
inc l

p_GetBit13:
.db 8 ;8 bytes, 30/29 cycles
bit 2,l
ld hl,0
jr z,$+3
inc l

 
   
Code: (Optimized code) [Select]
p_GetBit2:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 5,h
cp h
sbc hl,hl
inc hl


p_GetBit3:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 4,h
cp h
sbc hl,hl
inc hl
p_GetBit4:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 3,h
cp h
sbc hl,hl
inc hl
p_GetBit5:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 2,h
cp h
sbc hl,hl
inc hl
p_GetBit10:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 5,l
cp l
sbc hl,hl
inc hl


p_GetBit11:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 4,l
cp l
sbc hl,hl
inc hl
p_GetBit12:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 3,l
cp l
sbc hl,hl
inc hl
p_GetBit13:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 2,l
cp l
sbc hl,hl
inc hl
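
The set/cp trick can be checked exhaustively with a short Python model. This is a sketch with hypothetical function names; it follows the thread's convention that bit 0 is the leftmost bit of the 16-bit value, so bits 0-7 live in h and bits 8-15 in l.

```python
def get_bit_optimized(hl, axe_bit):
    """Model of ld a,r / set n,r / cp r / sbc hl,hl / inc hl."""
    h, l = hl >> 8, hl & 0xFF
    reg = h if axe_bit < 8 else l          # bits 0-7 are in h, 8-15 in l
    z80_bit = 7 - (axe_bit % 8)            # Z80 numbers bits from the right
    a = reg                                # ld a,r
    reg |= 1 << z80_bit                    # set n,r
    carry = 1 if a < reg else 0            # cp r: carry when a - r borrows
    result = (0 - carry) & 0xFFFF          # sbc hl,hl gives $0000 or $FFFF
    return (result + 1) & 0xFFFF           # inc hl


def get_bit_reference(hl, axe_bit):
    """Plain shift-and-mask definition of the same bit."""
    return (hl >> (15 - axe_bit)) & 1
```

If the bit was already set, the set instruction changes nothing, cp leaves carry clear, and the result is 1; otherwise the register grows, cp borrows, and sbc hl,hl / inc hl produce 0.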
 
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: DJ Omnimaga on February 22, 2011, 12:15:45 am
Nice to see new optimizations :D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on March 12, 2011, 10:47:24 pm
I'm back to take on a few more routines that I either couldn't follow or just decided not to try in my first mass optimization post (http://ourl.ca/4175/160993).



p_Sqrt: 1 byte and 4 cycles saved. I still think it may be a good idea to replace this with a restoring square root algorithm, though, like the one I suggested a while ago here (http://ourl.ca/4175/130486). Although maybe not that exact one, because I wrote that when I was still not too familiar with assembly and it may not be very optimized.

Code: (Original code: 14 bytes, n*37+36 cycles) [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ld a,-1
ld d,a
ld e,a
__SqrtLoop:
add hl,de
inc a
dec e
dec de
jr c,__SqrtLoop
ld h,0
ld l,a
ret
__SqrtEnd:
   
   
Code: (Optimized code: 13 bytes, n*37+32 cycles) [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ld de,-1&$FF
ld b,e
ld c,e
__SqrtLoop:
add hl,bc
inc e
dec c
dec bc
jr c,__SqrtLoop
ex de,hl
ret
__SqrtEnd:
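
Both versions rely on the fact that n squared is the sum of the first n odd numbers, so the loop subtracts 1, 3, 5, ... until the subtraction borrows. A rough Python model of that loop (the function name is made up for illustration):

```python
def sqrt_by_odd_subtraction(hl):
    """Model of the p_Sqrt loop: returns (result, loop iterations)."""
    odd = 1
    result = -1                    # the counter starts at -1 in both listings
    iterations = 0
    while True:
        iterations += 1
        result += 1                # inc a / inc e
        borrow = hl < odd          # add hl,bc with bc = -odd: no carry means borrow
        hl = (hl - odd) & 0xFFFF
        odd += 2                   # dec c / dec bc turn -odd into -(odd + 2)
        if borrow:
            return result, iterations
```

The loop runs once more than the result, which is why the running time in the headers grows linearly with the size of the root.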
   



p_Sin: 3 bytes and 8 cycles saved.

Code: (Original code: 29 bytes, too lazy to test cycles) [Select]
p_Sin:
.db __SinEnd-1-$
add a,a
rr c
ld d,a
cpl
ld e,a
xor a
ld b,8
__SinLoop:
rrc e
jr nc,__SinSkip
add a,d
__SinSkip:
rra
djnz __SinLoop
adc a,a
ld l,a
ld h,b
rl c
ret nc
cpl
inc a
ret z
ld l,a
dec h
ret
__SinEnd:
   
   
Code: (Optimized code: 26 bytes, too lazy to test-8 cycles) [Select]
p_Sin:
.db __SinEnd-1-$
ld c,a
add a,a
ld d,a
cpl
ld e,a
xor a
ld b,8
__SinLoop:
rra
rrc e
jr nc,__SinSkip
add a,d
__SinSkip:
djnz __SinLoop
ld l,a
ld h,b
or c
ret p
xor a
sub l
ret z
ld l,a
dec h
ret
__SinEnd:
   



p_Log: 1 byte saved.

Code: (Original code: 11 bytes, n*31+17 cycles) [Select]
p_Log:
.db 11
ld a,16
scf
__LogLoop:
adc hl,hl
dec a
jr nc,__LogLoop
ld l,a
ld h,0
   
   
Code: (Optimized code: 10 bytes, n*31+13 cycles) [Select]
p_Log:
.db 10
ld de,16
scf
__LogLoop:
adc hl,hl
dec e
jr nc,__LogLoop
ex de,hl
   

Before we leave this routine, though, the output for hl=0 isn't really correct; it returns 255. You could change the dec e in my suggested routine to dec de to give a slightly more accurate result of -1, but again, that's not quite correct either. The real result of log(0) should be negative infinity, which would be most properly represented by -32768. For the small cost of 3 bytes, the following routine would give you this result:

Code: (Mathematically correct code: 13 bytes, only a little bit slower cycles) [Select]
p_Log:
.db 13
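ld de,16
__LogLoop:
dec e ;count every shift, including the one that sets carry
add hl,hl ;16-bit add leaves the zero flag from dec e intact
jr c,__LogLoopEnd
jr nz,__LogLoop
__LogLoopEnd:
ex de,hl
ccf
rr h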
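
For reference, the result this variant is aiming for can be written as a small Python model: the floor of log2 for nonzero inputs, and -32768 for 0. The function name is hypothetical, and the model decrements the counter on every shift, including the one that produces the carry, so that $8000 gives 15 and 1 gives 0.

```python
def log2_16(hl):
    """Intended result: floor(log2(hl)) for 1..65535, -32768 for 0."""
    e = 16
    for _ in range(16):
        e -= 1                       # one count per shift
        carry = hl >> 15             # add hl,hl shifts the top bit into carry
        hl = (hl << 1) & 0xFFFF
        if carry:
            return e                 # position of the highest set bit
    return -32768                    # no bit found: ccf / rr h produce $8000
```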
   



p_Exp: As with the suggestion above, this isn't an optimization but a suggested improvement. With the current routine, I see two issues. Firstly, it returns 2^(input mod 256) instead of 2^(input), the latter of which is probably what you would expect of a 16-bit math function. Secondly, this routine does not do anything special for inputs with high values mod 256, which could result in it taking up to 7195 cycles. The following routine would correct both of these behaviors. Also note that it is a subroutine instead of inline code (more on turning inline code into subroutines later).

Code: (Mathematically correct code: 16 bytes, only a little bit slower cycles) [Select]
p_Exp:
.db __ExpEnd-p_Exp-1
ld b,l
ld a,l
and %11110000
or h
ld hl,0
ret nz
inc b
scf
__ExpLoop:
adc hl,hl
djnz __ExpLoop
ret
__ExpEnd:
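
A short Python model of this routine (hypothetical function name) shows the intended behavior: inputs of 16 or more return 0 immediately, and smaller inputs shift a single carried-in 1 into place.

```python
def exp2_16(x):
    """Model of the suggested p_Exp: 2**x for x in 0..15, otherwise 0."""
    h, l = x >> 8, x & 0xFF
    if (l & 0xF0) | h:               # and %11110000 / or h / ret nz
        return 0
    hl, carry = 0, 1                 # ld hl,0 ... scf
    for _ in range(l + 1):           # inc b, then djnz runs input+1 times
        total = (hl << 1) | carry    # adc hl,hl
        hl, carry = total & 0xFFFF, total >> 16
    return hl
```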
   



__DrawMskAligned: 2 bytes, 72 cycles saved. I only tackled the aligned part of the masked sprite routine because the rest is scary.

Code: (Original code: 33 bytes, 1481 cycles) [Select]
__DrawMskAligned:
dec hl
__DrawMskAlignedLoop:
ld a,(ix+0)
xor (ix+8)
cpl

ld c,a
ld a,(hl)
or (ix+0)
and c
ld (hl),a

ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,c
and (hl)
or (ix+0)
ld (hl),a

inc ix
ld de,plotSScreen-appBackUpScreen+12
add hl,de

djnz __DrawMskAlignedLoop
     
   
Code: (Optimized code: 31 bytes, 1409 cycles) [Select]
__DrawMskAligned:
dec hl
__DrawMskAlignedLoop:
push hl
ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,(ix+0)
ld d,a
xor (ix+8)
cpl
ld e,a

and (hl)
or d
ld (hl),a

pop hl

ld a,(hl)
or d
and e
ld (hl),a

inc ix
ld de,12
add hl,de

djnz __DrawMskAlignedLoop
     





Finally, here are some routines that I feel would be better suited to be subroutines instead of inline code. Feel free to disagree with me on any or all of these.

And for one going in the other direction, perhaps p_OnKey doesn't need to be a subroutine? Very few programs use the on key, and I'm guessing that any program that does isn't likely to use it more than once. Or at least if you don't want to change it, can you change the .db 10 into .db __OnKeyEnd-p_OnKey-1 or .db __OnKeyEnd-1-$? I was always confused when I saw this routine about whether or not it was inserted as inline code or a subroutine.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 27, 2011, 04:18:51 pm
Found a very small optimization that I'm surprised you didn't find  ;)

Code: (Original) [Select]
p_Min:
.db 8
pop de
or a
sbc hl,de
add hl,de
jr c,$+3
ex de,hl    
Code: (Optimized) [Select]
p_Min:
.db 8
pop de
or a
sbc hl,de
ex de,hl
jr nc,$+3
add hl,de

Code: (Original) [Select]
p_Max:
.db 8
pop de
or a
sbc hl,de
add hl,de
jr nc,$+3
ex de,hl
Code: (Optimized) [Select]
p_Max:
.db 8
pop de
or a
sbc hl,de
ex de,hl
jr c,$+3
add hl,de

It saves 6 clock cycles on one branch and 1 on the other, so nothing gets slower.  That's a 3.5 clock cycle speed up in the average uniform case.
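
Both the equivalence and the timing can be checked with a rough Python model of p_Min; p_Max is the same with the branch conditions swapped. The function names are made up for illustration, and the cycle counts cover only the instructions after sbc hl,de, using standard Z80 timings.

```python
ADD_HL_DE, EX_DE_HL, JR_TAKEN, JR_FALL = 11, 4, 12, 7


def min_original(hl, de):
    """sbc hl,de / add hl,de / jr c,$+3 / ex de,hl -> (result, cycles)."""
    carry = hl < de                      # add hl,de restores hl and the same carry
    if carry:
        return hl, ADD_HL_DE + JR_TAKEN
    return de, ADD_HL_DE + JR_FALL + EX_DE_HL


def min_optimized(hl, de):
    """sbc hl,de / ex de,hl / jr nc,$+3 / add hl,de -> (result, cycles)."""
    diff, carry = (hl - de) & 0xFFFF, hl < de
    if not carry:
        return de, EX_DE_HL + JR_TAKEN   # hl now holds the old de
    return (de + diff) & 0xFFFF, EX_DE_HL + JR_FALL + ADD_HL_DE
```

In this model the optimized version costs 16 or 22 cycles where the original costs 22 or 23, which averages out to the 3.5 cycle speed up mentioned above.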
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on March 27, 2011, 06:01:33 pm
Quigibo, I noticed that you implemented a bunch of my optimizations/bug fixes, which is awesome. Having left a few out is fine, but I'm stuck wondering why you left out a specific few of my suggestions in particular that I thought would be useful. Did you just miss some of these, or did you purposely leave them out?


And on an unrelated note, could you extend the DispGraph change to DispGraphClrDraw as well? And because I know a bunch of people may not like DispGraph no longer working in 15MHz mode, could you perhaps include a routine like DispGraphNormal that saves the CPU speed setting, puts the CPU in 6MHz mode, calls the normal DispGraph routine, and then restores the CPU speed? Something like this:

Code: [Select]
p_FastCopy6MHz:
.db __FastCopy6MHzEnd-1-$
in a,($02)
rla
in a,($20)
push af
xor a
out ($20),a
call sub_FastCpy
pop af
ret nc ;carry holds bit 7 of port 2, set when the speed port exists
out ($20),a
ret
__FastCopy6MHzEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 27, 2011, 06:33:30 pm
Oh yeah, now I remember why I didn't want to change DispGraph before  :banghead:

The overhead is small relative to the size of the routine, though. I think I'll just make regular DispGraph account for the speed settings, and it would be roughly the same size as the original.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on March 27, 2011, 06:38:24 pm
Sorry if I'm pestering you about this, but could you please respond to the first part of my last post? Even if you tell me you purposely left them out for whatever reasons that's completely fine, I just wouldn't want you to have missed suggestions that you might have actually wanted to implement.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on March 28, 2011, 04:30:42 am
Oh yeah, sorry.  I was trying to release this quickly with my limited time so I didn't have time to make changes that would require me to rewrite core routines.  The exch() optimization requires this, because even though it works for Axioms, it's not the same for regular routines yet, but it will be eventually.  The subroutines I didn't convert yet because I kind of overlooked it, but I'll get to it next version.  As for GetCalc() I'm not sure if I'm ready to change it yet because it will create an incompatibility with current programs that still use the Ptr-2 to reference OS vars.  I will probably put this on the poll.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on March 28, 2011, 11:41:47 am
Alright, that makes sense. Thanks. :)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on April 06, 2011, 01:19:54 am
Code: (Original) [Select]
p_IntGt:
.db 7
ex de,hl
xor a
sbc hl,de
ld h,a
rla
ld l,a
Code: (Optimized) [Select]
p_IntGt:
.db 6
scf
sbc hl,de
sbc hl,hl
inc hl

Also, I think optimized comparisons for comparing to immediate values would be a good idea. For example, <3 becomes ld de,-3 \ add hl,de \ sbc hl,hl \ inc hl
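
Both sequences lean on the same sbc hl,hl / inc hl idiom, which turns a carry flag into 0 and no carry into 1. A rough Python model (hypothetical function names, unsigned 16-bit values assumed):

```python
def int_gt_original(hl, de):
    """ex de,hl / xor a / sbc hl,de / ld h,a / rla / ld l,a."""
    return 1 if de < hl else 0           # carry from the swapped subtraction


def int_gt_optimized(hl, de):
    """scf / sbc hl,de / sbc hl,hl / inc hl."""
    carry = 1 if hl - de - 1 < 0 else 0  # borrow from hl - de - 1
    return ((0 - carry) + 1) & 0xFFFF    # $FFFF + 1 = 0, or 0 + 1 = 1


def less_than_const(hl, const):
    """ld de,-const / add hl,de / sbc hl,hl / inc hl."""
    carry = 1 if hl + ((-const) & 0xFFFF) > 0xFFFF else 0
    return ((0 - carry) + 1) & 0xFFFF
```

One edge case in this model: for a constant of 0 the add never carries, so the sequence would return 1 for every input, and a comparison against 0 would need separate handling.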
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 06, 2011, 11:50:59 am
For that matter, a lot of things could still use optimized immediate value versions. Pretty much every comparison could, as well as a few other operators. For instance, p_BoolOrImm, p_BoolAndImm, and p_BoolXorImm have been sitting in Commands.inc since Axe 0.4.7 and I would love to see them implemented. Constant bit setting and resetting would also be awesome. Unfortunately none of these things can be achieved with an Axiom, or I would be all over this.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on April 19, 2011, 11:00:30 pm
I made a "real" square root algorithm, meaning one that doesn't use the repeated subtraction method. It's not a size optimization, but it certainly makes the average execution time a lot lower.
Code: [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ex de,hl
ld bc,$8000
ld h,c
ld l,c
__SqrtLoop:
srl b
rr c
add hl,bc
ex de,hl
sbc hl,de
jr nc,__SqrtNoOverflow
add hl,de
ex de,hl
or a
sbc hl,bc
;jr __SqrtOverflow ;Commented out in favor of super optimization
.db $DA ;JP C opcode to skip next 2 bytes since carry is reset here.
__SqrtNoOverflow:
ex de,hl
add hl,bc
__SqrtOverflow:
srl h
rr l
srl b
rr c
jr nc,__SqrtLoop
ret
__SqrtEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 19, 2011, 11:07:34 pm
How does it compare to this one (http://ourl.ca/4175/130486)? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it. :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on April 19, 2011, 11:11:39 pm
Quote from: Runer112
How does it compare to this one (http://ourl.ca/4175/130486)? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it. :P

Er... that looks impressive :P I'll have to take a closer look at how it works sometime
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on April 20, 2011, 06:08:32 pm
I'll think about it.  I'm not sure how often square roots are actually needed and how applicable they are to speed constraints.

Another thing, I'm adding new auto-opts for constants in the comparisons (less than, less than or equal to, greater than, and greater than or equal to).  I have found nice optimizations for powers of 2 and some low numbers, but if anyone wants to try writing some, I could definitely use them.  I probably missed some or may have sub-optimal solutions.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 20, 2011, 06:10:56 pm
If you post what you have so far, I can look at them and see if I can find any optimizations for them. I also might see if there are any other optimized comparisons you don't already have that I could contribute.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on April 20, 2011, 06:38:42 pm
What someone needs to do is take a program that can emulate the z80's basic mathematical and jumping functions in a program setting and brute force out *the* most optimized program that performs a small simple task like multiplying two numbers :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on April 20, 2011, 07:58:47 pm
That's actually impossible since it's a form of the Halting Problem (http://en.wikipedia.org/wiki/Halting_problem).  It could be faked to some extent but it would be incredibly inefficient. Runer, I'll get those routines up soon.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on April 20, 2011, 08:16:23 pm
I was thinking more along the lines of: given an existing routine (with a certain size and speed), it must create a smaller and faster routine, so there is only a finite number of cases to test.  If a candidate routine took longer than the speed of the example, it could immediately be terminated and the search could go on to the next routine, guaranteeing that the program would eventually terminate.
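
As a toy illustration of that bounded search (not a real z80 superoptimizer), the sketch below brute-forces every program that is strictly shorter than a reference program. The four-instruction "instruction set" acting on an 8-bit accumulator and the function names are assumptions made up for the example. It only handles straight-line code, so every candidate trivially terminates; with jumps in the mix you would need the cycle cutoff described above.

```python
from itertools import product

# A made-up, tiny subset of instructions acting on an 8-bit accumulator.
OPS = {
    "inc a": lambda a: (a + 1) & 0xFF,
    "dec a": lambda a: (a - 1) & 0xFF,
    "cpl": lambda a: a ^ 0xFF,
    "neg": lambda a: (-a) & 0xFF,
}

def run(program, a):
    """Execute a list of instruction names on accumulator value a."""
    for op in program:
        a = OPS[op](a)
    return a

def find_shorter(reference):
    """Return the first strictly shorter program with identical output
    for all 256 inputs, or None if no such program exists."""
    target = [run(reference, a) for a in range(256)]
    for length in range(len(reference)):          # only shorter lengths
        for candidate in product(OPS, repeat=length):
            if [run(candidate, a) for a in range(256)] == target:
                return list(candidate)
    return None
```

With this toy set, `find_shorter(["cpl", "inc a"])` comes back with `["neg"]`, since two's-complement negation is the same as complementing and adding one. The search space grows exponentially with program length, which is the "incredibly inefficient" part.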
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 21, 2011, 12:09:28 am
Here are the most optimized constant comparisons I could come up with. I'm not sure what optimized comparisons for powers of 2 you found, because I didn't really find any. The only individual special cases I found dealt with 0, 32768, and 65535. And I hope the parser can handle the fancy constant mangling operations necessary for some of these. ;)

Code: [Select]
p_GE0:
.db 3
ld hl,1

p_GT65535:
.db 3
ld hl,0

p_LE65535:
.db 3
ld hl,1

p_LT0:
.db 3
ld hl,0

p_GE1 =p_NE0
p_GT0 =p_NE0
p_LE0 =p_EQ0
p_LT1 =p_EQ0

p_GE32768 =p_Div32768
p_GT32767 =p_Div32768
p_LE32767 =p_SGE0
p_LT32768 =p_SGE0

p_GE65535 =p_EQN1
p_GT65534 =p_EQN1
p_LE65534 =p_NEN1
p_LT65535 =p_NEN1

p_GEconstMod256EQ0:
.db 6
ld a,h
sub const>>8
sbc hl,hl
inc hl

p_GTconstMod256EQ255:
.db 6
ld a,h
sub const+1>>8
sbc hl,hl
inc hl

p_LEconstMod256EQ255:
.db 6
ld a,h
add a,-(const+1>>8)
sbc hl,hl
inc hl

p_LTconstMod256EQ0:
.db 6
ld a,h
add a,-(const>>8)
sbc hl,hl
inc hl

p_GEconst:
.db 8
xor a
ld de,-const
add hl,de
ld h,a
rla
ld l,a

p_GTconst:
.db 8
xor a
ld de,-(const+1)
add hl,de
ld h,a
rla
ld l,a

p_LEconst:
.db 7
ld de,-(const+1)
add hl,de
sbc hl,hl
inc hl

p_LTconst:
.db 7
ld de,-const
add hl,de
sbc hl,hl
inc hl
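
A quick way to double-check which constant belongs with which comparison is to model the two general patterns in Python. This is only a sketch with made-up helper names; it assumes that `add hl,de` sets carry on 16-bit overflow and that `sbc hl,hl` followed by `inc hl` turns "carry set" into 0 and "carry clear" into 1.

```python
def add16(hl, de):
    """Model of 'add hl,de': returns (result, carry_out)."""
    total = hl + de
    return total & 0xFFFF, 1 if total > 0xFFFF else 0

def add_then_not_carry(hl, addend):
    # ld de,addend / add hl,de / sbc hl,hl / inc hl
    _, carry = add16(hl, addend & 0xFFFF)
    return 0 if carry else 1

def add_then_carry(hl, addend):
    # xor a / ld de,addend / add hl,de / ld h,a / rla / ld l,a
    _, carry = add16(hl, addend & 0xFFFF)
    return carry

def less_than(hl, const):
    return add_then_not_carry(hl, -const)

def less_equal(hl, const):
    return add_then_not_carry(hl, -(const + 1))

def greater_equal(hl, const):
    return add_then_carry(hl, -const)

def greater_than(hl, const):
    return add_then_carry(hl, -(const + 1))
```

In this model, adding -const gives the "less than" / "greater or equal" pair and adding -(const+1) gives the "less or equal" / "greater than" pair. Note that the general forms break down for "less than 0" and "greater or equal 0" (adding zero never carries), which is why those need the special cases listed at the top.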
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 25, 2011, 02:50:35 pm
Stolen borrowed from WikiTI (http://wikiti.brandonw.net/index.php?title=83Plus:Ports:20):

Code: [Select]

#define FULLSPEED  in a,(2) \ rla \ sbc a,a \ out (20h),a


And this gives you the added bonus of the CPU operating approximately 25KHz faster at full speed mode!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on April 25, 2011, 03:04:55 pm
Quote from: Runer112
Stolen borrowed from WikiTI (http://wikiti.brandonw.net/index.php?title=83Plus:Ports:20):

Code: [Select]

#define FULLSPEED  in a,(2) \ rla \ sbc a,a \ out (20h),a


And this gives you the added bonus of the CPU operating approximately 25KHz faster at full speed mode!

Note: This has the side effect of out (0),0 on the TI-83+. Is that okay?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on April 25, 2011, 03:07:22 pm
Quote from: calc84maniac
Quote from: Runer112
Stolen borrowed from WikiTI (http://wikiti.brandonw.net/index.php?title=83Plus:Ports:20):

Code: [Select]

#define FULLSPEED  in a,(2) \ rla \ sbc a,a \ out (20h),a


And this gives you the added bonus of the CPU operating approximately 25KHz faster at full speed mode!
Note: This has the side effect of out (0),0 on the TI-83+. Is that okay?

Axe doesn't work on the 83+, right?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: TIfanx1999 on April 25, 2011, 03:34:18 pm
As far as I know, Axe should work fine on the TI-83+.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 25, 2011, 09:18:01 pm
Quote from: calc84maniac
Quote from: Runer112
Stolen borrowed from WikiTI (http://wikiti.brandonw.net/index.php?title=83Plus:Ports:20):

Code: [Select]

#define FULLSPEED  in a,(2) \ rla \ sbc a,a \ out (20h),a


And this gives you the added bonus of the CPU operating approximately 25KHz faster at full speed mode!
Note: This has the side effect of out (0),0 on the TI-83+. Is that okay?


Since this is the current routine:

Code: [Select]

#define FULLSPEED  in a,(2) \ and 80h \ rlca \ out (20h),a


It shouldn't make 83+ compatibility any worse than it already is.
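
To make the comparison concrete, here is a small Python sketch of the value each macro ends up writing to port 20h, as a function of what was read from port 2. The function names are made up, and it assumes (as I understand WikiTI) that bit 7 of port 2 is clear on the TI-83+ and set on the later models.

```python
def fullspeed_proposed(port2):
    # in a,(2) / rla / sbc a,a
    carry = (port2 >> 7) & 1       # rla shifts bit 7 into carry
    return (0 - carry) & 0xFF      # sbc a,a leaves $00 or $FF

def fullspeed_current(port2):
    # in a,(2) / and 80h / rlca
    a = port2 & 0x80               # keep only bit 7
    return ((a << 1) | (a >> 7)) & 0xFF   # rlca moves bit 7 into bit 0
```

Under that assumption both versions output 0 when bit 7 is clear (the TI-83+ case), so the side effect is identical there; on the other models the current macro writes $01 and the proposed one writes $FF.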
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on April 26, 2011, 06:48:41 pm
And

Quote from: WikiTI
The only side effect of this is that on the TI-83+ Basic this will cause both linkport lines to go high - which shouldn't matter too much if you're not using the linkport at that time, especially since both lines are high normally...

so there shouldn't be a problem. Except if you really tried, I guess.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on April 26, 2011, 06:50:11 pm
Well, the only problem is that it would completely mess with any program that happens to be using X->Port stuff.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on April 26, 2011, 06:52:25 pm
True. Most people put Full in the beginning of the program anyway, but I guess it could cause problems.

As Runer112 said, no more than we already have.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 01, 2011, 01:09:25 am
Was randomly browsing through the B_CALLs on WikiTI and found one that should save some bytes in the p_GetArc routine!

Code: (Original code: 56 bytes) [Select]
p_GetArc:
.db __GetArcEnd-1-$
push de
MOV9TOOP1()
B_CALL(_ChkFindSym)
jr c,__GetArcFail
push de
ex de,hl
ld hl,(progPtr)
sbc hl,de
pop de
ld hl,9
jr c,__GetArcName
__GetArcStatic:
ld l,12
and %00011111
jr z,__GetArcDone
cp l
jr z,__GetArcDone
ld l,14
jr __GetArcDone
__GetArcName:
add hl,de
B_CALL(_LoadDEIndPaged)
ld d,0
inc e
inc e
__GetArcDone:
add hl,de
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
ret
__GetArcFail:
ld hl,0
pop de
ret
__GetArcEnd:
   
   
Code: (Optimized code: 51 bytes) [Select]
p_GetArc:
.db __GetArcEnd-1-$
push de
MOV9TOOP1()
B_CALL(_ChkFindSym)
jr c,__GetArcFail
B_CALL(_IsFixedName) ;$4363
ld hl,9
jr z,__GetArcName
__GetArcStatic:
ld l,12
and %00011111
jr z,__GetArcDone
cp l
jr z,__GetArcDone
ld l,14
jr __GetArcDone
__GetArcName:
add hl,de
B_CALL(_LoadDEIndPaged)
ld d,0
inc e
inc e
__GetArcDone:
add hl,de
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
ret
__GetArcFail:
ld hl,0
pop de
ret
__GetArcEnd:
   


EDIT: And on the topic of the GetCalc() routines, have you decided yet what to do about real and complex number variables? Because right now p_GetArc supports them correctly but the other GetCalc() routines do not. Whether or not you want to support (correctly) adjusting the pointer for real and complex number variables, it would be a good idea to standardize the routines.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on May 17, 2011, 07:30:54 pm
Oops, necropost, oh well :P

Quote from: Runer112
I don't know if this approach was purposely left out, as it's 15 bytes larger than the current routine and sometimes slower. I'm referring to the square root routine. Whereas the current routine (14 bytes) takes 37n+38 T-states (linear time), where n is the result+1 (1-256), the following routine (29 bytes) takes 5n+800 T-states (near constant time), where n is the number of set bits in the result (0-8). The existing routine is faster for values that would yield results of 0-19, but this routine would be faster for values that would yield results of 20-255, which is a much broader range of the 8-bit spectrum. Also, it would be much more reliable to run at a near constant speed in programs which rely on that to run smoothly themselves. The existing routine would take only a few hundred T-states for low inputs, but would take up to OVER NINE THOUSAND T-states to calculate the square roots for the highest inputs. So it's up to you if this is something you want to use.

Code: [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ld a,l
ld l,h
ld de,$0040
ld h,d
ld b,8
or a
__SqrtLoop:
sbc hl,de
jr nc,__SqrtSkip
add hl,de
__SqrtSkip:
ccf
rl d
rla
adc hl,hl
rla
adc hl,hl
djnz __SqrtLoop
ld h,0
ld l,d
ret
__SqrtEnd:


Methinks this really should be added.  It's *much* faster most of the time, and runs in more constant time, which is something that would be great for a routine with a reliable speed.
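
The crossover point follows directly from the two T-state formulas quoted above. A small Python sketch (function names made up; the formulas are taken as given, not re-measured):

```python
def cycles_subtraction(result):
    """Current routine: 37n+38 T-states with n = result + 1."""
    return 37 * (result + 1) + 38

def cycles_bitwise(result):
    """Proposed routine: 5n+800 T-states with n = set bits in the result."""
    return 5 * bin(result).count("1") + 800

# First 8-bit result for which the proposed routine wins.
crossover = next(r for r in range(256)
                 if cycles_bitwise(r) < cycles_subtraction(r))

worst_subtraction = max(cycles_subtraction(r) for r in range(256))
worst_bitwise = max(cycles_bitwise(r) for r in range(256))
```

By these formulas the crossover is at a result of 20 (810 versus 815 T-states), the current routine peaks at 9510 T-states for a result of 255, and the proposed routine never exceeds 840.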
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on May 17, 2011, 09:11:15 pm
And the size difference shouldn't be a big deal, because anyone who wants to use square roots in a big project would probably want a fast routine anyway
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 18, 2011, 03:03:13 am
Yeah, I guess I'll add it then.  Found an optimization for it too; b is zero at the end of the djnz so it can be used to zero the h register which saves a byte and 3 cycles.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: z80man on May 18, 2011, 04:49:41 am
Would it be possible to have normal and full compilation modes. So if a program runs at normal by default the code for changing the clock to normal at every dispgraph wouldn't be needed. Also this could be used by 83+ owners so that when they compile a program, full commands are ignored.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Compynerd255 on May 18, 2011, 10:17:46 am
Quote from: z80man
Would it be possible to have normal and full compilation modes. So if a program runs at normal by default the code for changing the clock to normal at every dispgraph wouldn't be needed. Also this could be used by 83+ owners so that when they compile a program, full commands are ignored.

Full commands are already ignored in 83 Plus mode. In fact, if a Full command is run on an 83 Plus, HL returns zero and nothing happens. But I have seen the size of the Full command, and yes, I think it would be a good idea to have some option where Full and Normal are skipped.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 18, 2011, 11:03:11 am
z80man, the code for saving the CPU clock speed actually serves another useful purpose. It also saves the interrupt status, because the display routines have to disable interrupts to safely run. Because of the new display safety routine, if a program were designed to run at 15MHz, all it would need is one Full at the start and one copy of this safety routine. Removing any CPU speed instructions would only save about 20 bytes, not a very large savings. But I guess I still see the merits of your suggestion for people who want to crazily super-optimize. :P

Also, why is this in the Help Axe Optimize thread? It sounds more like a feature request.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 21, 2011, 07:15:18 pm
I'm back, and this time with screen update routine optimizations! I've used 71 cycles as the target minimum delay between port outputs, because that's the number that you said worked for your calculator with a bad LCD driver. If you want these routines to target 72 or 73 cycles between port outputs instead, that's an easy modification for the first two routines. The grayscale routines could be harder.


EDIT: If you're going to use any of these, make sure to actually test them first.

EDIT 2: I previously didn't have an optimization for p_DispGS, but after more closely inspecting the routine, now I do!




p_FastCopy: 1 byte and 1548 cycles saved.

Code: (Original code: 46 bytes, ~59389 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_FastCopy:
.db __FastCopyEnd-1-$
FastCopy:
ld hl,plotSScreen
ld a,$80
out ($10),a
ld c,-$0C
call $0000 ;Safety
push af
__FastCopyAgain:
ld b,64 ;7
ld a,c ;4
add a,$2C ;7
out ($10),a ;11
ld a,(hl) ;7 (waste)
inc de ;6 (waste)
__FastCopyLoop:
push af ;11 (waste)
pop af ;10 (waste)
ld de,12 ;10
ld a,(hl) ;7
add hl,de ;11
out ($11),a ;11
djnz __FastCopyLoop ;13/8
ld de,1-(12*64) ;10
add hl,de ;11
inc c ;4
jr nz,__FastCopyAgain ;12
__FastCopyRestore:
pop af
out ($20),a
ret c
ei
ret
__FastCopyEnd:
.db rp_Ans,__FastCopyEnd-__FastCopyAgain+3
     
   
Code: (Optimized code: 45 bytes, ~57841 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_FastCopy:
.db __FastCopyEnd-1-$
ld hl,plotSScreen
ld c,-$0C
ld a,$80
out ($10),a ;??cc into
call $0000
push af
__FastCopyAgain:
push hl
ld a,c
add a,$2C
out ($10),a ;many cc into, 73cc loop
inc de ;waste
ld b,64
__FastCopyLoop:
ld a,(hl) ;waste
inc de ;waste
dec de ;waste
ld de,12
ld a,(hl)
add hl,de
out ($11),a ;71cc into, 71cc loop
djnz __FastCopyLoop
pop hl
inc hl
inc c
jr nz,__FastCopyAgain
__FastCopyRestore:
pop af
out ($20),a
ret c
ei
ret
__FastCopyEnd:
.db rp_Ans,__FastCopyEnd-p_FastCopy-11
   




p_DrawAndClr: 2 bytes and 1548 cycles saved. Pretty much the same optimization as above.

Code: (Original code: 47 bytes, ~59389 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_DrawAndClr:
.db __DrawAndClrEnd-1-$
ld hl,plotSScreen
ld a,$80
out ($10),a
ld c,-$0C
call $0000 ;Safety
push af
__DrawAndClrAgain:
ld b,64 ;7
ld a,c ;4
add a,$2C ;7
out ($10),a ;11
ld a,(hl) ;7 (waste)
inc de ;6 (waste)
__DrawAndClrLoop:
ld de,12 ;10
ld a,(hl) ;7
ld (hl),d ;7
ld (hl),d ;7 (waste)
ld (hl),d ;7 (waste)
add hl,de ;11
out ($11),a ;11
djnz __DrawAndClrLoop ;13/8
ld de,1-(12*64) ;10
add hl,de ;11
inc c ;4
jr nz,__DrawAndClrAgain ;12
__DrawAndClrRestore:
pop af
out ($20),a
ret c
ei
ret
__DrawAndClrEnd:
.db rp_Ans,__DrawAndClrEnd-__DrawAndClrAgain+3
     
   
Code: (Optimized code: 45 bytes, ~57841 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_DrawAndClr:
.db __DrawAndClrEnd-1-$
ld hl,plotSScreen
ld c,-$0C
ld a,$80
out ($10),a ;??cc into
call $0000
push af
__DrawAndClrAgain:
push hl
ld a,c
add a,$2C
out ($10),a ;many cc into, 73cc loop
inc de ;waste
ld b,64
__DrawAndClrLoop:
inc de ;waste
dec de ;waste
ld de,12
ld a,(hl)
ld (hl),d
add hl,de
out ($11),a ;71cc into, 71cc loop
djnz __DrawAndClrLoop
pop hl
inc hl
inc c
jr nz,__DrawAndClrAgain
__DrawAndClrRestore:
pop af
out ($20),a
ret c
ei
ret
__DrawAndClrEnd:
.db rp_Ans,__DrawAndClrEnd-p_DrawAndClr-11
   




p_DispGS: ~4847 cycles faster! This is more of a bug fix than an optimization; the old routine copied 13 columns!

Code: (Original code: 66 bytes, ~63507 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_DispGS:
.db __DispGSEnd-1-$
call $0000
push af
ld a,$80
out ($10),a
ld (OP2),sp
ld hl,flags+asm_Flag2
rr (hl)
sbc a,a
xor %01010101
ld (hl),a
ld c,a
ld l,appbackupscreen&$ff-1
ld sp,plotSScreen-appbackupscreen
__DispGSNext:
ld a,l ;4
ld b,64 ;7
add a,$21-(appbackupscreen&$ff);7
out ($10),a ;11 Into loop: 59 T-states
inc l ;4
ld h,appbackupscreen>>8 ;7
ld de,appbackupscreen-plotSScreen+12;11
__DispGSLoop:
ld a,(hl) ;7 Loop: 61 T-states
rrc c ;8
and c ;4
add hl,sp ;11
or (hl) ;7
out ($11),a ;11
add hl,de ;11
djnz __DispGSLoop ;13/8 Next Loop: 60 T-states
ld a,l ;4
cp 12+(appbackupscreen&$ff);7
jr nz,__DispGSNext ;12
__DispGSDone:
ld sp,(OP2)
__DispGSRestore:
pop af
out ($20),a
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-2
     
   
Code: (Optimized code: 66 bytes, ~58660 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_DispGS:
.db __DispGSEnd-1-$
call $0000
push af
ld a,$80
out ($10),a ;many cc into
ld (OP2),sp
ld hl,flags+asm_Flag2
rr (hl)
sbc a,a
xor %01010101
ld (hl),a
ld c,a
ld l,appbackupscreen&$ff-1
ld sp,plotSScreen-appbackupscreen
__DispGSNext:
ld a,l
ld b,64
add a,$20-(appbackupscreen&$ff-1)
out ($10),a ;113cc into, 71cc loop
inc hl
ld h,appbackupscreen>>8
ld de,appbackupscreen-plotSScreen+12
__DispGSLoop:
ld a,(hl)
rrc c
and c
add hl,sp
or (hl)
out ($11),a ;71cc into, 72cc loop
add hl,de
djnz __DispGSLoop
ld a,l
cp 12+(appbackupscreen&$ff-1)
jr nz,__DispGSNext
__DispGSDone:
ld sp,(OP2)
__DispGSRestore:
pop af
out ($20),a
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-2
   




p_Disp4Lvl: 3 bytes larger, but ~7693 cycles faster! Extra bonuses: updates in row-major order for cleaner grayscale AND works with any pair of buffers! :w00t:

Code: (Original code: 79 bytes, ~78433 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_Disp4Lvl:
.db __Disp4LvlEnd-1-$
call $0000
push af
ld (OP2+2),sp
ld a,$80
out ($10),a
ld sp,appbackupscreen - plotSScreen
ld e,(plotSScreen-appbackupscreen+12)&$ff
ld c,-$0C
ex af,af'
ld a,%11011011
ld hl,flags+asm_flag2
inc (hl)
jr z,__Disp4Lvlskip
add a,a
ld b,(hl)
inc b
jr z,__Disp4Lvlskip
rlca
ld (hl),-2
__Disp4Lvlskip:
ld l,plotSScreen&$ff-1
ex af,af'
__Disp4Lvlentry:
ld a,c
add a,$2C
ld h,plotSScreen>>8
inc l
ld b,64
out ($10),a
__Disp4Lvlloop:
ld a,(hl)
add hl,sp
xor (hl)
ex af,af'
cp e
rra
ld d,a
ex af,af'
and d
xor (hl)
out ($11),a
ld d,(plotSScreen-appbackupscreen+12)>>8
add hl,de
djnz __Disp4Lvlloop
inc c
jr nz,__Disp4Lvlentry
__Disp4LvlDone:
ld sp,(OP2+2)
__Disp4LvlRestore:
pop af
out ($20),a
ret c
ei
ret
__Disp4LvlEnd:
.db rp_Ans,__Disp4LvlEnd-p_Disp4Lvl-2
     
   
Code: (Optimized code: 82 bytes, ~70740 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_Disp4Lvl:
.db __Disp4LvlEnd-1-$
ld hl,appBackUpScreen
ld de,plotSScreen
call $0000
push af
push hl
ld a,$07
out ($10),a ;many cc into
ld a,%11011011
or a
ld hl,flags+asm_flag2
inc (hl)
jr z,__Disp4LvlSkip
rra
ld b,(hl)
inc b
jr z,__Disp4LvlSkip
rra
ld (hl),-2
__Disp4LvlSkip:
ex af,af'
pop hl
ld a,$80
__Disp4LvlEntry:
out ($10),a ;76+cc into, 71cc loop
push af
ex (sp),hl ;waste
ex (sp),hl ;waste
nop ;waste
ld a,$20
out ($10),a ;71cc into
ld b,12
__Disp4LvlLoop:
ex af,af'
rra
ld c,a
ex af,af'
ld a,(de)
xor (hl)
and c
xor (hl)
inc de
inc hl
out ($11),a ;71cc into, 77cc loop
djnz __Disp4LvlLoop
inc bc ;waste
ex af,af'
rra
ex af,af'
pop af
inc a
bit 6,a
jr z,__Disp4LvlEntry
__Disp4LvlDone:
ld a,$05
out ($10),a ;73cc into
pop af
out ($20),a
ret c
ei
ret
__Disp4LvlEnd:
.db rp_Ans,__Disp4LvlEnd-p_Disp4Lvl-8
   





Also, I'm going to bump a few old optimization suggestions. They may have been skipped because Axe couldn't support them at the time, but in case it can now or in the near future, I'll make sure they aren't forgotten. And I'll throw in a new optimization that would also require an upgraded command parser.


And as a side note, would it be possible to reformat DS<() so that the variable is reinitialized to its maximum value at the End? That way, 3 bytes could be saved by having both the zero and not zero conditions using the same store command. For example:

Code: [Select]
ld hl,(var)
dec hl
ld a,h
or l
jp nz,DS_End
;Code inside statement goes here
ld hl,max
DS_End:
ld (var),hl


Now that you have absolute jumps implemented:

Code: (Original code) [Select]

p_Exchange:
.db 13
pop de
ex (sp),hl
pop bc
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
ld a,b
or c
jr nz,$-8

   
Code: (Optimized code) [Select]

p_Exchange:
.db 12
pop de
ex (sp),hl
pop bc
__ExchangeLoop:
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
jp pe,__ExchangeLoop ;or is it po?
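
On the pe/po question: as far as I know, LDI leaves the P/V flag set whenever BC is still nonzero after its decrement, and none of `dec hl`, `ld (hl),a` or `inc hl` touch the flags, so `jp pe` should be the right condition. Here is a rough Python model of the optimized loop under that assumption (the function name and the flat `mem` array are made up for the example):

```python
def exchange(mem, de, hl, bc):
    """Swap bc bytes between mem[de...] and mem[hl...] like the loop above."""
    while True:
        a = mem[de]                      # ld a,(de)
        mem[de] = mem[hl]                # ldi: (de) <- (hl)
        hl = (hl + 1) & 0xFFFF           #      inc hl
        de = (de + 1) & 0xFFFF           #      inc de
        bc = (bc - 1) & 0xFFFF           #      dec bc
        parity_even = bc != 0            #      P/V set (pe) while bc != 0
        mem[(hl - 1) & 0xFFFF] = a       # dec hl / ld (hl),a / inc hl
        if not parity_even:              # jp pe,__ExchangeLoop falls through
            return mem
```

For example, running it on the eight bytes `ABCDwxyz` with de=0, hl=4, bc=4 swaps the two halves. As with the original `ld a,b \ or c` test, an entry value of bc=0 would loop 65536 times.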





Code: (Original code: 27 bytes, ~220 cycles) [Select]
p_DKeyVar:
.db __DKeyVarEnd-1-$
dec l
ld a,l
rra
rra
rra
and %00000111
inc a
ld b,a
ld a,%01111111
rlca
djnz $-1
ld h,a
ld a,l
and %00000111
inc a
ld b,a
ld a,%10000000
rlca
djnz $-1
ld l,a
ret
__DKeyVarEnd:
   
   
Code: (Optimized code: 23 bytes, ~259 cycles) [Select]
p_DKeyVar:
.db __DKeyVarEnd-1-$
ld c,l
dec c
ld a,c
rra
rra
rra
call __DKeyVarMask
cpl
ld h,a
ld a,c
__DKeyVarMask:
and %00000111
inc a
ld b,a
ld a,%10000000
rlca
djnz $-1
ld l,a
ret
__DKeyVarEnd:
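
To convince yourself the shared-mask version returns the same H and L as the original for every input, here is a Python sketch that models both (function names made up; `rra` is modelled as a plain shift because only bits 0-2 survive the following `and %00000111`):

```python
def rlca(a):
    """Model of 'rlca' on an 8-bit value."""
    return ((a << 1) | (a >> 7)) & 0xFF

def rotate(a, times):
    for _ in range(times):
        a = rlca(a)
    return a

def dkeyvar_original(l):
    l = (l - 1) & 0xFF                         # dec l
    h = rotate(0x7F, ((l >> 3) & 7) + 1)       # group mask, one bit clear
    low = rotate(0x80, (l & 7) + 1)            # key mask, one bit set
    return (h << 8) | low

def mask(a):
    """Body of __DKeyVarMask: rotate %10000000 left (a & 7) + 1 times."""
    return rotate(0x80, (a & 7) + 1)

def dkeyvar_optimized(l):
    c = (l - 1) & 0xFF                         # ld c,l / dec c
    h = mask(c >> 3) ^ 0xFF                    # call __DKeyVarMask / cpl
    return (h << 8) | mask(c)                  # fall through for the key mask
```

The two agree because rotating the complement is the same as complementing the rotation. For key code 1, both give H=$FE and L=$01 in this model.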


   
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on May 22, 2011, 05:56:31 am
So, 3 level greyscale routine optimized? Great Runer, I don't really get what you did there, but those look like lots of ASM optimizations for Axe, very nice job!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 22, 2011, 05:14:53 pm
Thanks for those! :) I was actually thinking of changing the 3 level grayscale to be row major too so I'm probably going to be using a whole new routine for that.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 22, 2011, 07:13:24 pm
Row-major 3-level grayscale? I already made one of those, it was just slower and a bit larger than the current routine so I didn't think you'd want it. It's 4 bytes larger and about 8000 cycles slower than the column-major routine I posted above, but here it is:

Code: (70 bytes, ~66541 cycles with 3-cycle LCD port delay, excluding p_Safety) [Select]
p_DispGS:
.db __DispGSEnd-1-$
ld hl,plotSScreen
ld de,appBackUpScreen
call $0000
push af
ld a,$07
out ($10),a ;many cc into
ld a,(flags+asm_Flag2)
rra
sbc a,a
xor %01010101
ld (flags+asm_Flag2),a
ld c,a
ld a,$80
__DispGSNext:
push af
out ($10),a ;74cc into, 71cc loop
ex (sp),hl ;waste
ex (sp),hl ;waste
rrc c
ld b,12
ld a,$20
out ($10),a ;71cc into
push af ;waste
pop af ;waste
__DispGSLoop:
inc bc ;waste
dec c ;waste
ld a,(de)
and c
or (hl)
inc de
inc hl
out ($11),a ;72cc into, 71cc loop
ld a,(hl) ;waste
djnz __DispGSLoop
pop af
inc a
bit 6,a
jr z,__DispGSNext
__DispGSDone:
pop af
out ($20),a
ld a,$05
out ($10),a ;83cc into
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-8
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 23, 2011, 06:59:41 am
Runer, I was testing out your 4 level grayscale routine and I'm getting some really weird results.  It's literally showing black and white lines across the otherwise perfect gray that shift about every second.  When I add some pause between displays, it looks better, but still not as good as the column major routine.  The emulator makes it look fine, so I can't upload a screenshot, but try this upload on hardware.

Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 23, 2011, 11:22:29 am
Is 71 cycles between outputs still too fast for your calculator perhaps? I don't notice any strange black and white lines on my calculator, which has a good LCD driver. The only lines I saw were the diagonal light and dark stripes that are inherent in any unsynced grayscale routine. Can you perhaps elaborate on the problem?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 24, 2011, 09:22:01 pm
Sorry for the late response, this is how the program looks on my calculator (see attachment).  I'm pretty sure it has nothing to do with the delay being too short since it looks better when I add more pause between each DispGraph.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 24, 2011, 10:36:16 pm
Hmm I see what you mean... perhaps the mask rotation/logic is wrong in my routine? I wanted to send the program you posted to wabbitemu so I could debug the mask and logic computations at each step, but wabbitemu refuses to accept your program... And I don't see anything obviously wrong with my mask or logic.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 24, 2011, 10:46:57 pm
Really?  Wabbitemu accepts it fine for me... It was just your routine replacing the current 4 level grayscale.  Random squares were drawn to each buffer and then it did a "Repeat getkey(15):DispGraphrr:End" loop.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 24, 2011, 11:00:47 pm
No worries, I compiled it as an Axiom, sent that to wabbitemu, and am debugging it as we speak. I'm also asking the master of grayscale (thepenguin77) if he sees anything obviously wrong with it.

EDIT: Quigibo, try putting something like a Pause 10-12 in your loop. I think the new routine is actually going too fast and is running at 1.5x your LCD's refresh rate, near-perfectly skipping every third frame.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 25, 2011, 01:18:47 am
Yeah, like I said, if I add pause to the loop, it looks better (pause 11 is perfect gray).  But my point is that I'm not sure anymore if having the routines row major is actually an advantage because the old routine produced just as perfect a gray when it was in sync yet it didn't have graphical problems when it wasn't.  This routine seems less resilient to changes in the pause time.

Maybe my calculator is just the exception though, I don't know what the statistics are for what percentage of calculators have bad LCDs.  Regardless, I don't think any single routine can produce perfect gray across ALL calculator models.  So I think I should just forget about column/row ordering and just stick with the smallest, fastest routines.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 25, 2011, 11:15:39 am
The only case when this new 4-level grayscale routine should have noticeable problems is when it's alone in a loop with no other delay, simply because it's faster than the old routine. In most real situations that it would be used in, there would probably be much larger delay between display calls, like rendering a frame, in which case you want the routine to be as fast as possible.

I would leave the 3-level grayscale routine in column-major but use the 4-level grayscale routine in row-major order, because they are both faster than their alternatives. Although this will allow for 4-level grayscale to draw from arbitrary buffers and not allow for it in 3-level grayscale, I wouldn't worry too much about the incongruity. Being able to call 4-level grayscale with arbitrary buffer arguments would be quite awesome. ;D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 26, 2011, 02:22:21 am
Anyone up for some math?  :P

I want to implement the reciprocal function for fixed point math.  For 8.8 numbers, A⁻¹ is essentially just E10000//A; however, that division requires a dividend larger than can fit in a register pair.  Ideally, the routine could hijack a jump point into the current division routine instead of rewriting another one.  It's possible, though, that due to the symmetry involved there is a significantly more optimized method using a slightly different approach, but I can't think of how that would work.  Has anyone seen or written a routine like this before?
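
To pin down what the result should be, here is a minimal Python sketch of a signed 8.8 reciprocal. It is not a proposal for the z80 routine itself; the function names are made up, and it simply assumes truncating division on the magnitude with the sign reapplied afterwards.

```python
def to_signed(x):
    """Interpret a 16-bit value as two's-complement."""
    return x - 0x10000 if x & 0x8000 else x

def inverse_88(a):
    """Reciprocal of an 8.8 fixed point number, returned as 16 bits."""
    value = to_signed(a & 0xFFFF)
    if value == 0:
        raise ZeroDivisionError("reciprocal of zero")
    magnitude = 0x10000 // abs(value)   # note the 17-bit dividend
    result = -magnitude if value < 0 else magnitude
    return result & 0xFFFF
```

So 2.0 ($0200) maps to 0.5 ($0080), 0.5 maps back to 2.0, and -2.0 ($FE00) maps to -0.5 ($FF80). The smallest input, $0001, overflows the 16-bit result, so a real routine would have to decide what to return there.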
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 26, 2011, 04:35:58 pm
I don't know of any speed-optimized function specific to taking the inverse. But that definitely doesn't mean one doesn't exist. However, you could easily implement it if you added 8.8 fixed point division:


p_Inverse:
   .db 7
   ex   de,hl
   ld   hl,$100
   call   $0000      ;sub_88Div
   .db rp_Ans,2


p_88Div:
   .db __88DivEnd-1-$
   ld   a,h
   xor   d
   push   af
   bit   7,h
   jr   z,$+8
   xor   a
   sub   l
   ld   l,a
   sbc   a,a
   sub   h
   ld   h,a
   bit   7,d
   jr   z,$+8
   xor   a
   sub   e
   ld   e,a
   sbc   a,a
   sub   d
   ld   d,a
   ld   b,24
   call   $0000      ;sub_Div+2
   pop   af
   add   a,a
   ret   nc
   xor   a
   sub   l
   ld   l,a
   sbc   a,a
   sub   h
   ld   h,a
   ret
__88DivEnd:
   .db   rp_Ans,12



EDIT: Just kidding, that hijacking of the 16/16 division routine to make an 8.8 division routine doesn't work. But it's definitely possible to hijack the 16/16 division routine at least for an 8.8 inverse.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 26, 2011, 08:27:08 pm
I'm not planning to add 8.8 division.  I think just multiplying by the inverse should work with enough accuracy.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 26, 2011, 08:28:41 pm
But the logical way to get the inverse is to divide, is it not?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 26, 2011, 08:31:35 pm
Right, but an inverse can use a standard 16/16 division instead of a 24/16.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 26, 2011, 08:37:17 pm
Yeah, I actually had a routine written for that which hijacked the 16/16 division routine, but deleted it in favor of the 8.8 division routine. However I realized that the 8.8 division routine doesn't work, so I'll try to recreate what I had before:

Code: [Select]
p_Inverse:
.db __InverseEnd-1-$
xor a
bit 7,h
push af
jr z,$+8
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
ex de,hl
ld bc,16<<8
ld hl,1
call $0000 ;sub_Div+10
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__InverseEnd:
.db rp_Ans,10
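
The `sub l \ ld l,a \ sbc a,a \ sub h \ ld h,a` sequence (entered with a=0) is a 16-bit negate of HL. Here is a Python sketch of it for anyone who wants to step through the carry handling; the function name is made up and only the standard `sub`/`sbc` carry behaviour is assumed.

```python
def negate_hl(hl):
    """Model of: xor a / sub l / ld l,a / sbc a,a / sub h / ld h,a."""
    h, l = hl >> 8, hl & 0xFF
    a = 0                          # xor a
    carry = 1 if l > a else 0      # sub l borrows whenever l is nonzero
    a = (a - l) & 0xFF
    l = a                          # ld l,a
    a = (0 - carry) & 0xFF         # sbc a,a leaves $00 or $FF
    a = (a - h) & 0xFF             # sub h
    h = a                          # ld h,a
    return (h << 8) | l
```

In this model the result equals the two's-complement negation for all 65536 inputs.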
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on May 26, 2011, 08:40:56 pm
I actually have a copy of the routine you posted earlier and it was a bit more optimized so no worries :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on May 26, 2011, 08:42:33 pm
Yeah it was more optimized, but I don't think it worked. It would've screwed up normal 16/16 division because of how I reordered the initialization in p_Div to destroy hl before loading hl into ac.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: thepenguin77 on June 10, 2011, 03:14:11 pm
This is a really simple one. When an interrupt is called, interrupts are automatically disabled. So you don't need to start the interrupt routine with DI.

(Thinking that interrupts were enabled by default caused runer quite a headache over IRC ;))
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on June 10, 2011, 11:20:26 pm
More stuff regarding interrupts. SMC'ing the active port 6 page into the interrupt handler is, as far as I know, only necessary for applications. You could get rid of this if the code is being compiled to a program to save 9 bytes.



And on the topic of stuff that involves port 6, I think it would be nice if the archive byte reading routine avoided using a B_CALL for a massive speed boost, especially for code compiled as programs:

p_ReadArc: 18 bytes (2x) larger, but ~1400 cycles (!!!10x!!!) faster

Code: (36 bytes, ~142 cycles) [Select]
p_ReadArc:
.db __ReadArcEnd-1-$
ld c,a
in a,(6)
ld b,a
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,c
out (6),a
ld c,(hl)
inc hl
bit 7,h
jr z,__ReadArcNoBoundary
set 6,h
res 7,h
inc a
out (6),a
__ReadArcNoBoundary:
ld l,(hl)
ld h,c
ld a,b
out (6),a
ret
__ReadArcEnd:
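
For anyone following the pointer math in the routine above, here is a small Python model of just the page and address calculation (a sketch only; the function names are invented and the byte reads themselves are left out). It assumes A holds the base flash page and HL the 16-bit pointer on entry, as the code suggests.

```python
def map_archive_pointer(base_page, hl):
    """Return (flash page, address inside the $4000-$7FFF window)."""
    bank = (hl >> 14) & 3                          # top two bits of H (rlca / rlca)
    page = (base_page + ((bank - 1) & 3)) & 0xFF   # dec a / and %00000011 / add a,c
    addr = (hl & 0x3FFF) | 0x4000                  # set 6,h / res 7,h
    return page, addr

def step(page, addr):
    """Advance one byte, moving to the next page when the address hits $8000."""
    addr += 1
    if addr & 0x8000:                              # bit 7,h
        addr = (addr & 0x3FFF) | 0x4000
        page = (page + 1) & 0xFF
    return page, addr

print(map_archive_pointer(0x10, 0x8000), step(0x10, 0x7FFF))
```

Both calls land on page $11 at $4000, which is the page boundary case the second `out (6),a` exists for.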

p_ReadArcApp: 36 bytes (3x) larger, but ~1050 cycles (4x) faster

Code: (54 bytes, ~396 cycles) [Select]
p_ReadArcApp:
.db __ReadArcAppEnd-1-$
push hl
ld hl,$0000
ld de,ramCode
ld bc,__ReadArcAppRamCodeEnd-__ReadArcAppRamCode
ldir
pop hl
ld e,a
ld c,6
in b,(c)
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,e
call ramCode
ld e,d
inc hl
bit 7,h
jr z,__ReadArcAppNoBoundary
set 6,h
res 7,h
inc a
__ReadArcAppNoBoundary:
call ramCode
ex de,hl
ret
__ReadArcAppEnd:
.db rp_Ans,__ReadArcAppEnd-p_ReadArcApp-3

__ReadArcAppRamCode:
out (6),a
ld d,(hl)
out (c),b
ret
__ReadArcAppRamCodeEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 11, 2011, 01:42:59 am
Quote
This is a really simple one. When an interrupt is called, interrupts are automatically disabled. So you don't need to start the interrupt routine with DI.

They are disabled automatically already... there is a di at the start of the interrupt routine.  Is there some bug with that?

Also, about those archive reading commands... archive reading isn't as useful as it should be due to those sector boundary issues.  For instance, you can't reliably iterate a tilemap in archive because there is a small chance it could overlap between a sector boundary and iterating over it would add a "glitch byte" to the map since each sector adds an extra byte in front.  Although I guess you could modify those routines to take that into account, that might work since you can't read more than 64 consecutive kilobytes anyway.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 11, 2011, 01:51:59 am
Quote
This is a really simple one. When an interrupt is called, interrupts are automatically disabled. So you don't need to start the interrupt routine with DI.

They are disabled automatically already... there is a di at the start of the interrupt routine.  Is there some bug with that?

Also, about those archive reading commands... archive reading isn't as useful as it should be due to those sector boundary issues.  For instance, you can't reliably iterate a tilemap in archive because there is a small chance it could overlap between a sector boundary and iterating over it would add a "glitch byte" to the map since each sector adds an extra byte in front.  Although I guess you could modify those routines to take that into account, that might work since you can't read more than 64 consecutive kilobytes anyway.
There's no chance of overlapping a sector boundary, but yeah you can overlap a page boundary. TI-OS doesn't allow variables to cross sector boundaries.

Edit: About the DI thing, he means that it's a waste of a byte and 4 cycles to DI when it has already been done by the hardware.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on June 17, 2011, 03:49:26 am
I made a one-byte optimization to p_SDiv:
Old, then new:
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_Div
pop af
add a,a
ret nc
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:

I'm also working on a fixed-point division routine (that hijacks the normal division routine), but I think I need to make sure it works before I post it :P

Edit:
Well, I've convinced myself now that it works. You'll need to add in the stuff to correctly format the routine since I don't fully understand how that works (especially calling in the middle of other routines)
Code: [Select]
p_88Div:
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld bc,$1000
ld a,l
ld l,h
ld h,c
call __DivLoop
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret

Overflow checking isn't handled, but I suppose that's normal. It might be nice to saturate the result, though.
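
As a reference for what the routine is meant to compute, here is a hedged Python model of signed 8.8 division (names are hypothetical, and the internals of the shared division loop are not modelled). It assumes magnitudes are divided with a 24-bit dividend and the sign is reapplied afterwards; the `saturate` flag sketches the clamping idea mentioned above.

```python
def to_signed(raw):
    """Interpret a 16-bit value as two's complement."""
    return raw - 0x10000 if raw & 0x8000 else raw

def div_88(a, b, saturate=False):
    """Signed 8.8 division on raw 16-bit values."""
    x, y = to_signed(a), to_signed(b)
    if y == 0:
        raise ZeroDivisionError("8.8 division by 0.0")
    q = (abs(x) << 8) // abs(y)        # 24-bit dividend, truncating
    if (x < 0) != (y < 0):             # sign of H xor D
        q = -q
    if saturate:
        q = max(-0x8000, min(0x7FFF, q))
    return q & 0xFFFF                  # without saturation the result wraps

print(hex(div_88(0x0300, 0x0200)), hex(div_88(0x7F00, 0x0001, saturate=True)))
```

Without saturation, 127.0 divided by 1/256 wraps all the way around to 0, which is the kind of result clamping would avoid.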
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 17, 2011, 04:55:30 am
Cool, thanks!  I was also able to do that same sign flag optimization to the 8.8 multiplication routine.  Any idea what might be a good token for fixed point division?  That's the main thing holding me back from adding it.  /* is the first thing that comes to mind but I think it's confusing.  /// could also work but that's a lot to type...
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on June 17, 2011, 12:26:31 pm
Maybe /^?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: yrinfish on June 17, 2011, 12:29:04 pm
why not // ? is that already something?
/me checks the new wiki

ah, it is...

what about ./? the dot is a 'point' and .* for multiplication
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on June 17, 2011, 12:34:09 pm
Quote
why not // ? is that already something?
/me checks the new wiki

ah, it is...

what about ./? the dot is a 'point' and .* for multiplication

I like ./ because it's fixed-point division.

And // is probably bad because of the way Quigibo made Axe.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on June 17, 2011, 12:35:24 pm
Since it's fixed point division, how do you think /. looks? I think that operator is actually sort of logical with the decimal point. And although I don't know if you would want to change the fixed point multiplication operator and break compatibility with old programs, I think *. would make a great complement.


Ninja response to Scout: I think putting the division/multiplication operator first would be a good idea, so the parser doesn't see the decimal point first and have to worry about whether what follows is a comment or a math operator.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: yrinfish on June 17, 2011, 12:37:09 pm
decimal dot doesn't mean a comment if it's not the first token on the line.

and @scout, yes, that was intentional.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Munchor on June 17, 2011, 12:38:26 pm
Oh I forgot comments :P

@yrinfish: Well that seems right too.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on June 17, 2011, 12:39:27 pm
Quote
decimal dot doesn't mean a comment if it's not the first token on the line.

Actually the comment indicator works anywhere in a line. I don't think Quigibo intended for this originally, but I think having the side effect of inline comments is actually pretty nice.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: yrinfish on June 17, 2011, 12:42:40 pm
Edit: It does? oh

Ok, scout thinks:

A./B Good idea
A./B Argh, comments
A./B Oh, that's true


lol

But I think it is not really a good choice when you want this:

Code: [Select]
:If H
: 35->A
:Else
: 10->A
:End
:./B->C

But, we'll see that later
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 17, 2011, 05:14:30 pm
eh.... as much as I like the dot, I would really like to avoid it so it doesn't look weird next to another new feature I'm adding.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on June 27, 2011, 03:31:09 pm
Found some optimizations for p_IntSetup:

Code: (Original code: 45 bytes, a lot of cycles) [Select]
p_IntSetup:
.db __IntEnd-p_IntSetup-1
di
ld de,$8B01
ld a,d
ld i,a
ld a,l
ld hl,$8B00
ld b,e
ld c,l
ld (hl),$8A
ldir

and %00000110
out (4),a
ld a,%00001000
out (3),a
ld a,%00001010
out (3),a

ld de,$8A8A
ld bc,__IntDataEnd-__IntData
ld hl,$0000
ldir

in a,(6)
ld ($8A8A+__IntDataSMC-__IntData+1),a
__IntEnd:
.db rp_Ans,9
       
   
Code: (Optimized code: 42 bytes, a lot minus 5 cycles) [Select]
p_IntSetup:
.db __IntEnd-p_IntSetup-1
di
ld de,$8B01
ld a,d
ld i,a
ld a,l
ld hl,$8B00
ld b,e
ld c,l
ld (hl),$8A
ldir

and %00000110
out (4),a
ld a,%00001000
out (3),a
ld a,(hl)
out (3),a

ld d,a
ld e,a
ld c,__IntDataEnd-__IntData
ld hl,$0000
ldir

in a,(6)
ld ($8A8A+__IntDataSMC-__IntData+1),a
__IntEnd:
.db rp_Ans,9
       

Also don't forget the two interrupt optimizations that have been suggested earlier! :) (removing di and removing port 6 SMC in programs)





Also, a bug report. After talking with thepenguin77 on IRC, I think I have confirmed my old belief that MemKit's Next() command has a bug regarding reaching the end of the VAT:

Next(): 2 bytes and a few cycles saved. Also, isn't the end-of-VAT check in the wrong place? I could be wrong because my VAT experience isn't too great, but because this routine checks for the end of the VAT at the start, wouldn't this command advance the VAT pointer to the end of the VAT and not recognize it as the end until the next Next()? This would cause problems with programs reading garbage VAT data for the last "entry." If I'm right about this (which may not be the case), the third block of code I posted should hopefully recognize the end of the VAT as soon as it hits it and never advance the VAT pointer to point to the end.

Code: (Original code: 26 bytes, 152/66 cycles) [Select]

 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    e,(hl)
 inc   e
 xor   a
 ld    d,a
 sbc   hl,de
 ld    (axv_X1t),hl
 ret
   
   
Code: (Optimized code: 24 bytes, 144/66 cycles) [Select]

 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    (axv_X1t),hl
 ret
   
   
Code: (Optimized (and fixed?) code: 24 bytes, 144/113 cycles) [Select]

 ld    hl,(axv_X1t)
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    (axv_X1t),hl
 ret
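
To illustrate the ordering question, here is a toy Python model of the two variants (a sketch with made-up addresses and function names, not MemKit source). It assumes, as the code does, that the name length sits 6 bytes below the entry pointer and that the next entry starts one byte below the name.

```python
def advance(mem, ptr):
    """Skip one VAT entry: add hl,-6 then add hl,-(name length + 1)."""
    name_len = mem[ptr - 6]
    return ptr - 6 - name_len - 1

def next_check_first(mem, ptr, vat_end):
    """Original order: test for the end, then advance."""
    if ptr == vat_end:
        return None                    # 'ret z' before touching the pointer
    return advance(mem, ptr)

def next_check_after(mem, ptr, vat_end):
    """Proposed order: advance, then test for the end."""
    nxt = advance(mem, ptr)
    return None if nxt == vat_end else nxt

# Two fake entries with 3- and 5-character names; the VAT ends at 978.
mem = {1000 - 6: 3, 990 - 6: 5}
VAT_END = 978
print(next_check_first(mem, 990, VAT_END), next_check_after(mem, 990, VAT_END))
```

In this model the check-first variant hands back 978 (the end marker itself) as if it were a real entry, while the check-after variant reports the end immediately.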
   
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on June 30, 2011, 01:06:47 am
A mix of a feature request and an optimization:

How about compound assignment operators (e.g. +=) for most Axe operations? They would offer savings on every operation that doesn't use a 2-byte variable at a constant address as the main operand. They could also offer even larger savings on basic operations like addition, subtraction, and bitwise logic.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on June 30, 2011, 03:01:03 am
Those might get peephole optimized eventually, so I don't want to spend time on that feature right now.  I realize now that a lot of optimizations I've already done could have been done via peepholes instead of special casing and it would have been a whole lot easier to code and more generalizable.  I'm only doing the most conservative peephole ops right now and it's already reducing all code by about 1-2% on average.  I'm thinking I can get this to over 5% eventually, especially in code that is not already hyper optimized.

Also, I don't want to remove the port 6 stuff because I think it potentially allows more advanced paging Axioms to use the regular interrupts.  Plus it would be annoying to change.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on July 01, 2011, 12:39:04 pm
Quote
A mix of a feature request and an optimization:

How about compound assignment operators (e.g. +=) for most Axe operations? They would offer savings on every operation that doesn't use a 2-byte variable at a constant address as the main operand. They could also offer even larger savings on basic operations like addition, subtraction, and bitwise logic.

I was just gonna post that in the other thread. Yes.

And I'm not good enough at ASM to give you any big suggestions, but here's another little request:
·EFF00
to LD L,0 and
·E00FF
to LD H,0. (The · dots are the 16-bit AND operator.)

EDIT: And
+EFF00
to LD H,255 and
+E00FF
to LD L,255.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on July 01, 2011, 06:49:17 pm
That's been implemented since 0.5.2. :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on July 03, 2011, 09:08:13 pm
Time to update <_<
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on July 12, 2011, 01:05:00 am
Wow, I just did something I didn't think was even possible. I found a good use for a forward djnz. *.*

The following is the part of p_Disp4Lvl that initializes the mask:

Code: (17 bytes, ~62 cycles) [Select]
ld a,%11011011
or a
ld hl,flags+asm_flag2
inc (hl)
jr z,__Disp4Lvlskip
rra
ld b,(hl)
inc b
jr z,__Disp4Lvlskip
rra
ld (hl),-2
   
Code: (16 bytes, ~60 cycles) [Select]
ld a,%11011011
or a
ld hl,flags+asm_flag2
inc (hl)
jr z,__Disp4Lvlskip
rra
ld b,(hl)
djnz __Disp4Lvlskip
rra
ld (hl),-2
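
Why the swap is safe can be checked with a small Python model of both versions (a sketch; `init_mask` and its arguments are invented for illustration). The point is that `djnz` skips whenever the counter is not 1, while `inc b / jr z` skips only when it is $FF, so the two agree only because the counter byte cycles through $FE, $FF and 0.

```python
def rra(a, carry):
    """Rotate A right through carry; returns (new A, new carry)."""
    return ((carry << 7) | (a >> 1)) & 0xFF, a & 1

def init_mask(counter, use_djnz):
    """Returns (mask, new counter); counter models the byte at flags+asm_flag2."""
    a, carry = 0b11011011, 0            # ld a,%11011011 / or a
    counter = (counter + 1) & 0xFF      # inc (hl) leaves carry alone
    if counter == 0:                    # jr z,__Disp4Lvlskip
        return a, counter
    a, carry = rra(a, carry)
    if use_djnz:
        skip = ((counter - 1) & 0xFF) != 0   # djnz jumps while B is nonzero
    else:
        skip = ((counter + 1) & 0xFF) == 0   # inc b / jr z
    if skip:
        return a, counter
    a, carry = rra(a, carry)
    return a, 0xFE                      # ld (hl),-2

for start in (0xFE, 0xFF, 0x00):
    print(hex(start), init_mask(start, False), init_mask(start, True))
```

For any other starting value (say 4) the two versions disagree, so the trick depends on nothing else writing to that byte.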
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on July 12, 2011, 02:02:39 am
Awesome wow!  Yeah, forward djnz is about as rare as cpir.  Although I think calc84maniac's original 4 level routine used them as well but for a different purpose.

Also on the same subject, although you'll be the only one who knows what I'm talking about, all 12 DispGraph forms work perfectly now.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on July 12, 2011, 08:48:57 am
Just checking, have you actually tested the routines out? Because I didn't actually test those routines I gave you, I just modeled them after some routines I knew worked and hoped these would still work as well. :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on July 12, 2011, 03:33:09 pm
Yeah I tested everything.  One of them had a problem that I fixed with the buffer ordering being switched though.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on July 14, 2011, 11:31:27 pm
Here's a peephole optimization suggestion: Keep track of whether the value in HL is a constant or not, and if so, what constant. For example, I have some code:
Code: [Select]
If condition
do stuff
Else
16->W
End
Obviously, after the Else, HL has to be 0. Thus the 16 can be reduced to a ld l,16 instead of ld hl,16. It might be possible to auto-optimize stuff like 1->A:2->B into 1->A+1->B, but you could always leave that to the user like usual.

Also, I found it a bit annoying that when I did something like If E<(96*256), the part in the parentheses wasn't reduced to a constant before doing the less-than operation. Could the look-ahead parsing be able to detect constants in parentheses?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on July 14, 2011, 11:34:17 pm
Quote
Also, I found it a bit annoying that when I did something like If E<(96*256), the part in the parentheses wasn't reduced to a constant before doing the less-than operation. Could the look-ahead parsing be able to detect constants in parentheses?

This times a million.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ztrumpet on July 14, 2011, 11:35:45 pm
Quote
Also, I found it a bit annoying that when I did something like If E<(96*256), the part in the parentheses wasn't reduced to a constant before doing the less-than operation. Could the look-ahead parsing be able to detect constants in parentheses?

This times a million.
This times a million and five.

Seriously, I thought Axe did this already.  Apparently not... so, please? :D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on July 25, 2011, 11:16:41 am
Quigibo, you read my mind. I was about to make a post with code for commands that deal with archived variables to work with variables in RAM too, but you added that in Axe 1.0.2 before I could finish! However, I'll make a post anyways because my p_GetArc routine is smaller. :P I also have a few other things.




p_GetArc: 7 bytes smaller.

Code: (Old code: 76 bytes) [Select]
p_GetArc:
.db __GetArcEnd-1-$
push de
MOV9TOOP1()
B_CALL(_ChkFindSym)
jr c,__GetArcFail
dec b
inc b
jr z,__GetArcRam
B_CALL(_IsFixedName)
ld hl,9
jr z,__GetArcName
__GetArcStatic:
ld l,12
and %00011111
jr z,__GetArcDone
cp l
jr z,__GetArcDone
ld l,14
jr __GetArcDone
__GetArcName:
add hl,de
bit 7,h
jr z,$+7
res 7,h
set 6,h
inc b
B_CALL(_LoadDEIndPaged)
ld d,0
inc e
inc e
__GetArcDone:
add hl,de
ex de,hl
__GetArcStore:
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
ret
__GetArcRam:
and %00011111
jr z,__GetArcStore
cp CplxObj
jr z,__GetArcStore
inc de
inc de
jr __GetArcStore
__GetArcFail:
ld hl,0
pop de
ret
__GetArcEnd:
       
   
Code: (New code: 69 bytes) [Select]
p_GetArc:
.db __GetArcEnd-1-$
push de
MOV9TOOP1()
B_CALL(_ChkFindSym)
jr c,__GetArcFail
dec b
inc b
jr z,__GetArcRam
B_CALL(_IsFixedName)
ld hl,9
jr z,__GetArcName
ld l,12
__GetArcChkFloat:
and %00011111
jr z,__GetArcDone
cp CplxObj
jr z,__GetArcDone
inc l
inc l
jr __GetArcDone
__GetArcName:
add hl,de
bit 7,h
jr z,$+7
res 7,h
set 6,h
inc b
B_CALL(_LoadDEIndPaged)
ld d,0
inc e
inc e
__GetArcDone:
add hl,de
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
ret
__GetArcRam:
ld h,b
ld l,b
jr __GetArcChkFloat
__GetArcFail:
ld hl,0
pop de
ret
__GetArcEnd:
       




p_ReadArc: Bumping an old request for larger but drastically faster archive reading routines. The routines would need to be modified slightly to allow for reading from RAM as well, but that should be no problem. I would understand if you didn't want to add the app version, but the program version is immensely better in my opinion.

Quote
And on the topic of stuff that involves port 6, I think it would be nice if the archive byte reading routine avoided using a B_CALL for a massive speed boost, especially for code compiled as programs:

p_ReadArc: 18 bytes (2x) larger, but ~1400 cycles (!!!10x!!!) faster

Code: (36 bytes, ~142 cycles) [Select]
p_ReadArc:
.db __ReadArcEnd-1-$
ld c,a
in a,(6)
ld b,a
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,c
out (6),a
ld c,(hl)
inc hl
bit 7,h
jr z,__ReadArcNoBoundary
set 6,h
res 7,h
inc a
out (6),a
__ReadArcNoBoundary:
ld l,(hl)
ld h,c
ld a,b
out (6),a
ret
__ReadArcEnd:

p_ReadArcApp: 36 bytes (3x) larger, but ~1050 cycles (4x) faster

Code: (54 bytes, ~396 cycles) [Select]
p_ReadArcApp:
.db __ReadArcAppEnd-1-$
push hl
ld hl,$0000
ld de,ramCode
ld bc,__ReadArcAppRamCodeEnd-__ReadArcAppRamCode
ldir
pop hl
ld e,a
ld c,6
in b,(c)
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,e
call ramCode
ld e,d
inc hl
bit 7,h
jr z,__ReadArcAppNoBoundary
set 6,h
res 7,h
inc a
__ReadArcAppNoBoundary:
call ramCode
ex de,hl
ret
__ReadArcAppEnd:
.db rp_Ans,__ReadArcAppEnd-p_ReadArcApp-3

__ReadArcAppRamCode:
out (6),a
ld d,(hl)
out (c),b
ret
__ReadArcAppRamCodeEnd:




p_CopyArc: Modified to allow for sources in RAM.

Code: (Old code: 22 bytes) [Select]
p_CopyArc:
.db __CopyArcEnd-1-$
pop ix
pop de
ex (sp),hl
ld b,a
ld a,h
rlca
rlca
dec a
and %00000011
add a,b
set 6,h
res 7,h
pop bc
B_CALL(_FlashToRAM)
jp (ix)
__CopyArcEnd:
       
   
Code: (New code: 28 bytes) [Select]
p_CopyArc:
.db __CopyArcEnd-1-$
ex (sp),hl
pop bc
pop de
ex (sp),hl
or a
jr z,__CopyArcRam
push bc
ld b,a
ld a,h
rlca
rlca
dec a
and %00000011
add a,b
set 6,h
res 7,h
pop bc
B_CALL(_FlashToRAM)
ret
__CopyArcRam:
ldir
ret
__CopyArcEnd:
       




Also, I'm not sure why I just realized this now, but why don't the 8-bit logic operations on variables just load the variable into a instead of de to save 2 bytes?



Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on July 26, 2011, 06:04:13 am
Hmm, I'm still not sure if the extra speed is worth the size increase.  I guess a new argument for the speed is to make file reads more consistent (a program using a file from archive might run slower than one reading from ram). But I will put this up in the poll since I'd like to know how many people this would benefit or hurt.

I don't do that optimization for the 8-bit logical operators because then I'd need duplicate commands and even more special casing.  This can easily be peephole optimized in the future, however, so it might become a non-issue.

I was trying to recursively parse constants in parentheses in the last update, but it was extremely complicated so I gave up.  I will have to modify the core number reading system to get it to work (which I was planning to do eventually anyway) so I will get to it then.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on July 27, 2011, 02:47:08 pm
Wait.... Runer!  What were you thinking!  The App code for file reading can be the same as the program code, but just use port 7 instead of port 6 and set the high bits of hl for the $8000-$BFFF range.  That's what the Axe app does. :)

EDIT: Also, another thing that the routines would need is to disable interrupts and then restore them afterwards... which I can use the "Safety" code for, but it's going to be slower and even larger.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on July 27, 2011, 10:47:25 pm
The app version might need to disable interrupts, but why would the program version need to? Both Axe's and the OS's interrupt handlers back up the page in the $4000-$7FFF bank and restore it upon returning.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on August 11, 2011, 01:10:51 pm
I was toying around with some math routines while I was away and I was curious about the square root algorithms. Are they designed to return the square root rounded down, up, or just rounded? If it is rounded down and you want to round it to the nearest integer answer, here is a code I made a while ago (it isn't even close to what Axe needs, but it should only be taken as an example):
Code: [Select]
;===============================================================
sqrtE:
;===============================================================
;Input:
;     E is the value to find the square root of
;Outputs:
;     A is E-D^2
;     B is 0
;     D is the rounded result
;     E is not changed
;     HL is not changed
;Destroys:
;     C
;
        xor a               ;1      4         4
        ld d,a              ;1      4         4
        ld c,a              ;1      4         4
        ld b,4              ;2      7         7
sqrtELoop:
        rlc d               ;2      8        32
        ld c,d              ;1      4        16
        scf                 ;1      4        16
        rl c                ;2      8        32

        rlc e               ;2      8        32
        rla                 ;1      4        16
        rlc e               ;2      8        32
        rla                 ;1      4        16

        cp c                ;1      4        16
        jr c,$+4            ;4    12|15      48+3x
          inc d             ;--    --        --
          sub c             ;--    --        --
        djnz sqrtELoop      ;2    13|8       47
        inc d               ;1      4         4       ;tentatively round up
        cp d                ;1      4         4       ;carry unless remainder > truncated root
        jr nc,$+3           ;3    12|11     12|11
          dec d             ;--    --        --       ;remainder <= root: round down
        ret                 ;1     10        10
;===============================================================
;Size  : 30 bytes
;Speed : 351+3x cycles plus 1 if rounded up
;   x is the number of set bits in the truncated result.
;===============================================================

The only reason that I mention this is that I know a lot of graphical algorithms would have better results if the square root was returned in rounded form as opposed to just rounded up or down.
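
The rounding rule is easy to sanity-check in a few lines of Python (a sketch with a made-up function name, not a translation of any Axe routine). After the usual two-bits-at-a-time loop leaves a truncated root and a remainder, the nearest integer is one higher exactly when the remainder is greater than the root, because (root + 0.5)^2 = root^2 + root + 0.25.

```python
import math

def sqrt8_rounded(e):
    """Nearest-integer square root of an 8-bit value, digit by digit."""
    root, rem = 0, 0
    for shift in (6, 4, 2, 0):
        rem = (rem << 2) | ((e >> shift) & 3)   # bring down the next two bits
        trial = (root << 2) | 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    if rem > root:                              # closer to root + 1
        root += 1
    return root

# Integer-only reference: floor(sqrt(e) + 0.5) == (isqrt(4e) + 1) // 2
mismatches = [e for e in range(256)
              if sqrt8_rounded(e) != (math.isqrt(4 * e) + 1) // 2]
print(mismatches)
```

Note the strict comparison: values such as 0, 2 and 6, where the remainder equals the truncated root, must stay rounded down.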

Sorry if this was already covered and I missed it :/
EDIT:
Quote
Wow, I just did something I didn't think was even possible. I found a good use for a forward djnz. *.*
>.> Hehe, I use forward djnz in many-- if not most-- of my programs... It is one of the most useful tricks I use and is kind of my signature touch :) I use it to save time and memory a lot, especially in instances like this:
Code: [Select]
     ld b,a
     or a \ jr nz,Next1
       ;code
Next1:
     djnz Next2
       ;code
Next2:
     djnz Next3
       ;code
;...et cetera
/me loves it
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on August 11, 2011, 04:29:36 pm
All of Axe math simply truncates, so I think the current square root algorithm is pretty good. Anyways you have to remember that Axe uses 16-bit math and that's an 8-bit square root function. :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on August 16, 2011, 07:39:41 pm
Yeah, I know, but I just wanted to give an example. It is really only the last few bytes that are important, though, and I wanted to give a simple, easy to follow example. Also, great job with the optimisations :D I wish I could help more, but most of the codes are a bit beyond my optimisation abilities.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on August 30, 2011, 02:45:57 am
I think I just performed the most ridiculous, impressive optimization I've ever performed on an Axe command. 27 bytes optimized down to about 63% of its size: 17 bytes! :w00t:

Code: (Old code: 27 bytes, ~220.5 cycles) [Select]
p_DKeyVar:
.db __DKeyVarEnd-1-$
dec l
ld a,l
rra
rra
rra
and %00000111
inc a
ld b,a
ld a,%01111111
rlca
djnz $-1
ld h,a
ld a,l
and %00000111
inc a
ld b,a
ld a,%10000000
rlca
djnz $-1
ld l,a
ret
__DKeyVarEnd:
       
   
Code: (New code: 17 bytes, ~225 cycles) [Select]
p_DKeyVar:
.db __DKeyVarEnd-1-$
ld a,l
ld hl,%0111111111110111
rlc h
adc a,l
jr c,$-3
ld l,%00000001
rrc l
inc a
jr nz,$-3
ret
__DKeyVarEnd:
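
For readers wondering what the routine produces, here is a Python sketch (function names invented; behaviour inferred only from the two listings above). It assumes L holds a key code starting at 1 and that the result is a pair of masks: H with a single 0 bit selecting the key group, and L with a single 1 bit selecting the key within the group. The second function steps through the 17-byte version instruction by instruction.

```python
def dkey_masks(key):
    """High-level view of the result: (group mask in H, key bit in L)."""
    group, bit = (key - 1) >> 3, (key - 1) & 7
    return (~(1 << group)) & 0xFF, 1 << bit

def dkey_new(key):
    """Step-by-step model of the short routine."""
    a, h, l = key, 0x7F, 0xF7               # ld a,l / ld hl,%0111111111110111
    while True:
        carry = h >> 7
        h = ((h << 1) | carry) & 0xFF       # rlc h
        total = a + l + carry               # adc a,l: -9 the first time, -8 after
        a = total & 0xFF
        if total < 0x100:                   # jr c loops while carry is set
            break
    l = 0x01                                # ld l,%00000001
    while True:
        l = ((l >> 1) | (l << 7)) & 0xFF    # rrc l
        a = (a + 1) & 0xFF                  # inc a
        if a == 0:                          # jr nz loops until A reaches 0
            break
    return h, l

print(all(dkey_new(k) == dkey_masks(k) for k in range(1, 65)))
```

The neat part is that the carry out of `rlc h` doubles as the "+1" that turns the first subtraction of 9 into subtractions of 8 afterwards, so one loop both divides by 8 and leaves a negative count for the second loop.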
       
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on August 30, 2011, 03:02:02 am
O_O  I don't even understand what's going on here.  That's quite impressive!

EDIT: Also, a really obvious optimization I just noticed is that the return should be replaced by a jump to the direct key command so it doesn't have to return and re-call it.  :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on August 30, 2011, 06:23:00 am
what does that code do though ???
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on September 18, 2011, 03:15:23 am
At this rate, I'll have optimized just about every Axe routine eventually! ;)




p_ToHex: 31 cycles faster.

Code: (Old code: 25 bytes, 670 cycles) [Select]
p_ToHex:
.db __ToHexEnd-$-1
ld b,4
ld de,vx_SptBuff
push de
__ToHexLoop:
ld a,$1F
__ToHexShift:
add hl,hl
rla
jr nc,__ToHexShift
daa
add a,$A0
adc a,$40
ld (de),a
inc de
djnz __ToHexLoop
xor a
ld (de),a
pop hl
ret
__ToHexEnd:
       
   
Code: (New code: 25 bytes, 639 cycles) [Select]
p_ToHex:
.db __ToHexEnd-$-1
ld bc,4<<8+$1F
ld de,vx_SptBuff
__ToHexLoop:
ld a,c
__ToHexShift:
add hl,hl
rla
jr nc,__ToHexShift
daa
add a,$A0
adc a,$40
ld (de),a
inc e
djnz __ToHexLoop
ex de,hl
ld (hl),b
ld l,vx_SptBuff&$FF
ret
__ToHexEnd:
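
The `daa / add a,$A0 / adc a,$40` sequence in both versions is the classic nibble-to-ASCII trick. A rough Python model (helper names are made up; only the addition case of DAA is modelled) shows why it works. It assumes the shift loop leaves A = $F0 + nibble with carry set, which is what starting from $1F and rotating in four bits gives.

```python
def daa_after_add(a, carry, half=0):
    """Z80 DAA following an addition-type instruction (N flag clear)."""
    corr = 0
    if half or (a & 0x0F) > 9:
        corr |= 0x06
    if carry or a > 0x99:
        corr |= 0x60
        carry = 1
    return (a + corr) & 0xFF, carry

def nibble_to_ascii(n):
    a, carry = 0xF0 | n, 1               # state after four add hl,hl / rla steps
    a, carry = daa_after_add(a, carry)   # $5n for 0-9, $60-$65 for A-F
    total = a + 0xA0                     # add a,$A0: carries only for A-F
    a, carry = total & 0xFF, total >> 8
    return (a + 0x40 + carry) & 0xFF     # adc a,$40

print("".join(chr(nibble_to_ascii(n)) for n in range(16)))
```

Digits end up at $30 + n, while letters pick up the extra carry and land at $41 upward.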
       




p_ShiftLeft: 1 byte smaller, 67 cycles faster. You could save an additional 384 cycles by giving up the minor size savings and loading 12<<8+4 into de at the start of the routine and then replacing the immediate data operands in the loop with d and e.

Code: (Old code: 17 bytes, 27542 cycles) [Select]
p_ShiftLeft:
.db __ShiftLeftEnd-1-$
ld hl,plotSScreen+767
ld c,64
__ShiftLeftLoop:
ld b,12
or a
__ShiftLeftShift:
rl (hl)
dec hl
djnz __ShiftLeftShift
dec c
jr nz,__ShiftLeftLoop
ret
__ShiftLeftEnd:
       
   
Code: (New code: 16 bytes, 27475 cycles) [Select]
p_ShiftLeft:
.db __ShiftLeftEnd-1-$
ld hl,plotSScreen+767
xor a
__ShiftLeftLoop:
ld b,12
__ShiftLeftShift:
rl (hl)
dec hl
djnz __ShiftLeftShift
add a,4
jr nz,__ShiftLeftLoop
ret
__ShiftLeftEnd:
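
The reason `add a,4` can replace both the row counter and the `or a` is easy to see in a short Python trace (a sketch; the function name is invented). Starting from 0, adding 4 wraps to 0 after exactly 64 steps, and the addition leaves carry clear every time except on that final wrap, when the loop exits anyway.

```python
def row_counter_trace():
    """Count loop passes and record the carry left by each 'add a,4'."""
    a, rows, carries = 0, 0, []          # xor a also clears carry for row 1
    while True:
        rows += 1                        # one 12-byte row has been shifted
        total = a + 4                    # add a,4
        a, carry = total & 0xFF, total >> 8
        carries.append(carry)
        if a == 0:                       # jr nz,__ShiftLeftLoop falls through
            break
    return rows, carries

rows, carries = row_counter_trace()
print(rows, sum(carries))
```

So every row starts with a clear carry going into the first `rl (hl)`, without spending a byte on `or a`.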
       




p_ShiftRight: 1 byte smaller, 67 cycles faster. Same deal as p_ShiftLeft.

Code: (Old code: 17 bytes, 27542 cycles) [Select]
p_ShiftRight:
.db __ShiftRightEnd-1-$
ld hl,plotSScreen
ld c,64
__ShiftRightLoop:
ld b,12
or a
__ShiftRightShift:
rr (hl)
inc hl
djnz __ShiftRightShift
dec c
jr nz,__ShiftRightLoop
ret
__ShiftRightEnd:
       
   
Code: (New code: 16 bytes, 27475 cycles) [Select]
p_ShiftRight:
.db __ShiftRightEnd-1-$
ld hl,plotSScreen
xor a
__ShiftRightLoop:
ld b,12
__ShiftRightShift:
rr (hl)
inc hl
djnz __ShiftRightShift
add a,4
jr nz,__ShiftRightLoop
ret
__ShiftRightEnd:
       




p_FreqOut: 1 byte smaller. Takes advantage of an absolute jump. This is a strange routine to optimize, because optimizing it results in it running about 15% faster which would result in slightly higher pitched and shorter notes. Although this command is rarely used, this augmentation might still make the optimization not worth it. Whether or not you include the optimization, it might be a good idea to change this routine to use p_Safety.

Code: (Old code: 23 bytes) [Select]
p_FreqOut:
.db __FreqOutEnd-1-$
xor a
__FreqOutLoop1:
push bc
ld e,a
__FreqOutLoop2:
ld a,h
or l
jr z,__FreqOutDone
dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2
ld a,e
xor %00000011
scf
__FreqOutDone:
pop bc
out ($00),a
ret nc
jr __FreqOutLoop1
__FreqOutEnd:
       
   
Code: (New code: 22 bytes) [Select]
p_FreqOut:
.db __FreqOutEnd-1-$
xor a
__FreqOutLoop1:
push bc
ld e,a
__FreqOutLoop2:
ld a,h
or l
jr z,__FreqOutDone
cpd
jp pe,__FreqOutLoop2
ld a,e
xor %00000011
scf
__FreqOutDone:
pop bc
out ($00),a
ret nc
jr __FreqOutLoop1
__FreqOutEnd:
       




p_IntSetup: 4 bytes smaller. I thought this was some pretty impressive work. ;D And regarding interrupts, I still think the port 6 saving and restoring shenanigans aren't necessary for programs. The only reason port 6 would need to be restored to the value it held when interrupts were enabled is if the user is using a shell application in conjunction with their Axe program. In that case, either the designer of the shell application interface system could provide modified interrupt routines in an Axiom, or the user is probably intelligent enough to be able to provide their own interrupt routines. (Actually it wouldn't even need to be their own, they could just copy the one for applications from the Commands.inc file)

Code: (Old code: 42 bytes, a lot of cycles) [Select]
p_IntSetup:
.db __IntEnd-p_IntSetup-1
di
ld de,$8B01
ld a,d
ld i,a
ld a,l
ld hl,$8B00
ld b,e
ld c,l
ld (hl),$8A
ldir

and %00000110
out (4),a
ld a,%00001000
out (3),a
ld a,(hl)
out (3),a

ld d,a
ld e,a
ld c,__IntDataEnd-__IntData
ld hl,$0000
ldir

in a,(6)
ld ($8A8A+__IntDataSMC-__IntData+1),a
__IntEnd:
.db rp_Ans,9
       
   
Code: (New code: 38 bytes, more cycles but who cares?) [Select]
p_IntSetup:
.db __IntEnd-p_IntSetup-1
di
ld a,l
ld hl,$8C06
ld de,$8C05
ld bc,$8C05-$8A8A

and l
out (4),a
ld a,h
out (3),a
dec a
ld i,a
dec a
out (3),a

ld (hl),a
lddr

ld hl,$0000
ld c,__IntDataEnd-__IntData
ldir

in a,(6)
ld ($8A8A+__IntDataSMC-__IntData+1),a
__IntEnd:
.db rp_Ans,11
       




p_DtoF: 2 bytes smaller. Takes advantage of a bcall to do the same thing. It appears that B_CALL(_SetXXXXOP2) always returns OP2+1, which could be used to save an additional 2 bytes, but this bcall could theoretically be changed in future OS versions and break this optimization.

Code: (Old code: 13 bytes, a lot of cycles) [Select]
p_DtoF:
.db 13
ex (sp),hl
B_CALL(_SetXXXXOP2)
ld hl,OP2
pop de
ld bc,9
ldir
       
   
Code: (New code: 11 bytes, a lot plus a few cycles) [Select]
p_DtoF:
.db 11
ex (sp),hl
B_CALL(_SetXXXXOP2)
ld hl,OP2
pop de
B_CALL(_Mov9B)
       
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on September 20, 2011, 12:26:00 am
p_Length: 1 byte smaller, 2 cycles faster. Takes advantage of the fact that you will not need to search more than 16384 bytes starting at $4000-$7FFF or 32768 bytes starting at $8000-$FFFF, and also you shouldn't be searching at $0000-$3FFF.
Code: (Old code: 11 bytes) [Select]
p_Length:
.db __LengthEnd-$-1
xor a
ld b,a
ld c,a
cpir
ld hl,-1
sbc hl,bc
ret
__LengthEnd:
Code: (New code: 10 bytes) [Select]
p_Length:
.db __LengthEnd-$-1
xor a
ld b,h
ld d,h
ld e,l
cpir
scf
sbc hl,de
ret
__LengthEnd:
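
A rough Python model of the two p_Length listings, as a sanity check on why `scf` / `sbc hl,de` yields the same length as `ld hl,-1` / `sbc hl,bc`. The helper names (`cpir_for_zero`, `length_old`, `length_new`) are made up for this sketch, and it only models the registers that matter here.

```python
def cpir_for_zero(mem, hl, bc):
    """Model CPIR with A = 0: stop after comparing a zero byte or when BC hits 0."""
    while True:
        found = mem[hl] == 0
        hl = (hl + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if found or bc == 0:
            return hl, bc

def length_old(mem, start):
    # xor a / ld b,a / ld c,a / cpir / ld hl,-1 / sbc hl,bc (carry is clear)
    _, bc = cpir_for_zero(mem, start, 0)
    return (0xFFFF - bc) & 0xFFFF

def length_new(mem, start, c=0):
    # ld b,h / ld d,h / ld e,l / cpir / scf / sbc hl,de
    # C keeps whatever it held on entry, so BC is at least H*256.
    bc = (start & 0xFF00) | c
    hl, _ = cpir_for_zero(mem, start, bc)
    return (hl - start - 1) & 0xFFFF
```

After CPIR, HL points one past the terminating zero, so subtracting the start address plus one (the set carry) leaves the byte count.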
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on October 09, 2011, 10:16:40 am
Speed optimization for p_CheckSum by using an absolute jump.
Code: (Old Code: 19 bytes, 63.5*n+37 cycles) [Select]
p_CheckSum:
.db __CheckSumEnd-$-1
ld b,h
ld c,l
pop af
pop hl
push af
xor a
ld d,a
__CheckSumLoop:
add a,(hl)
ld e,a
jr nc,$+3
inc d
cpi
ex de,hl
ret po
ex de,hl
jr __CheckSumLoop
__CheckSumEnd:
Code: (New Code: 19 bytes, 44.5*n+65 cycles) [Select]
p_CheckSum:
.db __CheckSumEnd-$-1
ld b,h
ld c,l
pop af
pop hl
push af
xor a
ld d,a
__CheckSumLoop:
add a,(hl)
jr nc,$+3
inc d
cpi
jp pe,__CheckSumLoop
ld h,d
ld l,a
ret
__CheckSumEnd:
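
For readers who don't read Z80: a minimal Python sketch of what both p_CheckSum listings compute, assuming a non-empty block of bytes. The function name `checksum` is hypothetical; `a` and `d` mirror the registers used in the loop.

```python
def checksum(data):
    """16-bit sum of all bytes: A holds the low byte, D counts the carries."""
    a, d = 0, 0
    for byte in data:
        a += byte                 # add a,(hl)
        if a > 0xFF:              # jr nc,$+3 skips the inc when there is no carry
            a &= 0xFF
            d = (d + 1) & 0xFF    # inc d
    return (d << 8) | a           # ld h,d / ld l,a
```

Plugging the posted cycle formulas in, the new loop comes out ahead from n = 2 upward (63.5*2+37 = 164 versus 44.5*2+65 = 154).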
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on October 09, 2011, 05:38:20 pm
Hmm, would this optimisation work to save one more byte? (sorry, I could be wrong):
Code: [Select]
p_CheckSum:
.db __CheckSumEnd-$-1
ld b,h
ld c,l
pop hl
ex      (sp),hl
xor a
ld d,a
__CheckSumLoop:
add a,(hl)
jr nc,$+3
inc d
cpi
jp pe,__CheckSumLoop
ld h,d
ld l,a
ret
__CheckSumEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on October 09, 2011, 07:21:47 pm
Ah, nice use of ex (sp),hl :D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on October 09, 2011, 07:26:47 pm
Thanks :) I think I learned it from you folks :)
EDIT: It does use 2 more cycles though, right?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on October 09, 2011, 07:30:34 pm
Quote from: Xeda112358 on October 09, 2011, 07:26:47 pm
Thanks :) I think I learned it from you folks :)
EDIT: It does use 2 more cycles though, right?
Actually, ex (sp),hl takes 2 fewer cycles than pop af and push af combined, so it's faster too :)
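
A small Python sketch of the two stack shuffles, to show they leave HL and the stack in the same state. The stack is modelled as a list with the top at the end; the function names and the sample addresses are invented for illustration. The cycle counts in the comments are the standard Z80 timings.

```python
def pop_af_pop_hl_push_af(stack):
    stack = list(stack)
    af = stack.pop()              # pop af  (10 cycles): return address
    hl = stack.pop()              # pop hl: the argument underneath it
    stack.append(af)              # push af (11 cycles): put the return address back
    return hl, stack

def pop_hl_ex_sp_hl(stack):
    stack = list(stack)
    hl = stack.pop()                  # pop hl: return address
    hl, stack[-1] = stack[-1], hl     # ex (sp),hl (19 cycles): swap with the argument
    return hl, stack
```

The second version replaces 21 cycles of `pop af` plus `push af` with one 19-cycle instruction, and it leaves AF untouched.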
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on October 09, 2011, 07:37:42 pm
What does checksum do?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on October 13, 2011, 11:32:57 am
Here, slightly optimized Bitmap():
Old code, 7 bytes and lots of cycles
Code: [Select]
p_EzSprite:
.db 7
pop de
ld a,e
pop de
ld d,a
B_CALL(_DisplayImage)

New code, 6 bytes and lots of cycles minus 4 :P
Code: [Select]
p_EzSprite:
.db 6
pop bc
pop de
ld d,c
B_CALL(_DisplayImage)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on October 14, 2011, 02:54:36 pm
Is this an optimisation? I get the feeling that there is a reason it doesn't end in a ret and that it uses a jr...

Code: (Old Code: 7 bytes, 30 or 38 cycles) [Select]
p_DecWord:
.db 7
ld a,(hl)
dec (hl)
or a
jr nz,$+4
inc hl
dec (hl)
Code: (New Code: 6 bytes, 29 or 36 cycles) [Select]
p_DecWord:
.db 6
ld a,(hl)
dec (hl)
or a
ret nz
inc hl
dec (hl)

EDIT: Yep, suspicion confirmed XD
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on November 04, 2011, 01:58:14 am
Not an optimization, but I'm posting this here since more assembly people will read it.  Since the Bitmap() command is being replaced with something actually useful, that means the "Fix 8" and "Fix 9" will also need to be replaced.  Are there any useful flags (particularly for text) that would be useful to Axe programmers that I haven't already covered with the other fix commands?  A couple I can think of are an APD toggle or Lowercase toggle.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: LincolnB on November 04, 2011, 10:24:39 am
Hm...I say this as an Axe programmer, not knowing ASM...how about UPSIDE DOWN TEXT! om nom nom nom
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on November 15, 2011, 12:01:37 am
p_Input: saves three bytes and lots of cycles (original first, optimized second)
Code: [Select]
p_Input:
.db __InputEnd-$-1
res 6,(iy+$1C)
set 7,(iy+$09)
xor a
ld (ioPrompt),a
B_CALL(_GetStringInput)
B_CALL(_ZeroOP1)
ld hl,$2D04
ld (OP1),hl
B_CALL(_ChkFindSym)
inc de
inc de
ex de,hl
ret
__InputEnd:
Code: [Select]
p_Input:
.db __InputEnd-$-1
res 6,(iy+$1C)
set 7,(iy+$09)
xor a
ld (ioPrompt),a
B_CALL(_GetStringInput)
B_CALL(_ZeroOP1)
ld a,$2D
ld (OP1+1),a
rst rFindSym
inc de
inc de
ex de,hl
ret
__InputEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on November 16, 2011, 05:52:32 pm
Thanks! :D
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: parserp on November 16, 2011, 05:54:13 pm
Quote from: LincolnB on November 04, 2011, 10:24:39 am
Hm...I say this as an Axe programmer, not knowing ASM...how about UPSIDE DOWN TEXT! om nom nom nom
Yeah! it could be something like Fix 11 :D
that would be awesome!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on November 16, 2011, 08:54:40 pm
But TI doesn't have a text flag for that :(  It has to be something that already exists.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: ztrumpet on November 16, 2011, 09:31:22 pm
I like the lowercase toggle option.  Actually, now that I think about it, it wouldn't be that useful because there's no reliable input command.  Hmm...
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on November 17, 2011, 12:31:08 pm
I believe there is a Feature Wishlist (http://ourl.ca/4057;topicseen) thread ...

And to be relevant to the topic, I have not found any *actual* optimisations just from the quick look I made. There might be more, but y'all have done an amazing job so far!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 06, 2011, 01:36:32 am
p_SDiv: same size, saves an average of 3.5 cycles
Edit: oops, abs(hl) isn't always positive (what???)
Original
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
Optimized
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d ; a = h
jp p,$+9
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
; a = high byte of abs(hl)
; therefore, bit 7 is clear
; xor d ; or works too
; jp p,$+9
; Edit: doesn't work if hl = $8000
; because abs($8000) = $8000
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
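
The $8000 edge case mentioned in the comments can be checked with a short Python model of the six-instruction negate used throughout these routines. The name `neg16` is made up for this sketch.

```python
def neg16(hl):
    """Model: xor a / sub l / ld l,a / sbc a,a / sub h / ld h,a."""
    h, l = hl >> 8, hl & 0xFF
    new_l = (0 - l) & 0xFF            # xor a / sub l / ld l,a
    a = 0xFF if l != 0 else 0x00      # sbc a,a turns the borrow into $FF or $00
    new_h = (a - h) & 0xFF            # sub h / ld h,a
    return (new_h << 8) | new_l
```

Negating $8000 gives $8000 again, so bit 7 of the high byte is still set afterwards and cannot be assumed clear.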
Edit: Some more attempts at optimizing.
Optimized (speed)
+1 byte, avg -19 cycles
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d ; a = h
jp p,$+9
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ld a,d
or a
jp p,$+9
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld b,b
.db 2
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:

p_Div:
.db __DivEnd-1-$
ld a,d
or a
ld b,16
; ...
Optimized (size)
-4 bytes, avg +37 cycles
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
ld b,2
__SDivRepeat:
ex de,hl
xor h
jp p,__SDivSkip
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
__SDivSkip:
xor a
djnz __SDivRepeat
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
Optimized (lol)
-8 bytes, avg +110-4 cycles
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
ld b,2
__SDivRepeat:
push af
xor d
jp p,__SDivSkip
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
__SDivSkip:
dec b
ret m
ex de,hl
ld b,b
.db 1
call z,$3F00+sub_Div
inc b
pop af
djnz __SDivRepeat
jr __SDivRepeat+2
__SDivEnd:

p_Div:
.db __DivEnd-1-$
ld a,d
or a
ld b,16
; ...

Edit: p_SortD: same size, 2*size-5 cycles faster
Original
Code: [Select]
p_SortD:
.db __SortDEnd-1-$
ld c,l
ex de,hl
__SortDLoop2:
ld b,c
push hl
jr __SortDJumpIn
__SortDLoop1:
inc hl
cp (hl)
jr c,__SortDSkip
__SortDJumpIn:
ld a,(hl)
ld d,h
ld e,l
__SortDSkip:
djnz __SortDLoop1
ld b,(hl)
ld (hl),a
ld a,b
ld (de),a
pop hl
dec c
jr nz,__SortDLoop2
ret
__SortDEnd:
Optimized
Code: [Select]
p_SortD:
.db __SortDEnd-1-$
ld c,l
ex de,hl
__SortDLoop2:
ld b,c
push hl
jr __SortDJumpIn
__SortDLoop1:
inc hl
cp (hl)
jr c,__SortDSkip
__SortDJumpIn:
ld a,(hl)
ld d,h
ld e,l
__SortDSkip:
djnz __SortDLoop1
ldi
dec hl
ld (hl),a
pop hl
jp pe,__SortDLoop2
ret
__SortDEnd:
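
A Python sketch of both p_SortD listings, useful for checking that swapping in `ldi` / `dec hl` / `ld (hl),a` keeps the behaviour. It assumes a list of 1 to 255 bytes starting at offset 0; the function name and the `use_ldi` switch are invented here.

```python
def sort_descending(data, use_ldi):
    mem = bytearray(data)
    c = len(mem)                      # ld c,l (size, 1..255)
    while True:
        hl = 0                        # push hl / pop hl keep the base pointer
        b = c                         # ld b,c
        a, de = mem[hl], hl           # __SortDJumpIn: current minimum and its address
        b -= 1                        # djnz __SortDLoop1
        while b:
            hl += 1                   # inc hl
            if a >= mem[hl]:          # cp (hl) / jr c skips when a < (hl)
                a, de = mem[hl], hl
            b -= 1
        if use_ldi:
            mem[de] = mem[hl]         # ldi: (de) <- (hl), then hl++, de++, bc--
            hl, de, c = hl + 1, de + 1, c - 1
            hl -= 1                   # dec hl
            mem[hl] = a               # ld (hl),a
        else:
            last = mem[hl]            # ld b,(hl)
            mem[hl] = a               # ld (hl),a
            mem[de] = last            # ld a,b / ld (de),a
            c -= 1                    # dec c
        if c == 0:                    # jp pe / jr nz: loop while the count is non-zero
            return bytes(mem)
```

Each pass moves the smallest remaining byte to the end of the shrinking window, which is why the result is in descending order. DE is reloaded at `__SortDJumpIn` on the next pass, so the increment done by `ldi` is never observed.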

Edit: p_Reciprocal: same size, saves avg 2 cycles
Original

Code: [Select]
p_Reciprocal:
.db __ReciprocalEnd-1-$
xor a
bit 7,h
push af
jr z,$+7
sub l
ld l,a
sbc a,a
sub h
ld h,a
ex de,hl
ld bc,$1000
ld hl,1
xor a
ld b,b
.db 10
call $3F00+sub_Div
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__ReciprocalEnd:
Optimized
avg -2 cycles
Code: [Select]
p_Reciprocal:
.db __ReciprocalEnd-1-$
xor a
bit 7,h
push af
jr z,$+8
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
ex de,hl
ld bc,$1000
ld hl,1
ld b,b
.db 10
call $3F00+sub_Div
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__ReciprocalEnd:
Optimized Moar
avg -33 cycles
Code: [Select]
p_Reciprocal:
.db __ReciprocalEnd-1-$
xor a
bit 7,h
push af
jr z,$+8
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
ex de,hl
ld bc,$1001
ld hl,2
ld b,b
.db 16
call $3F00+sub_Div
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__ReciprocalEnd:

Edit: p_Mod: 2 bytes smaller, 96 cycles faster!
Original
Code: [Select]
p_Mod:
.db __ModEnd-1-$
ld a,h
ld c,l
ld hl,0
ld b,16
__ModLoop:
scf
rl c
rla
adc hl,hl
sbc hl,de
jr nc,__ModSkip
add hl,de
dec c
__ModSkip:
djnz __ModLoop
ret
__ModEnd:
Optimized
Code: [Select]
p_Mod:
.db __ModEnd-1-$
ld a,h
ld c,l
ld hl,0
ld b,16
__ModLoop:
sla c
rla
adc hl,hl
sbc hl,de
jr nc,__ModSkip
add hl,de
__ModSkip:
djnz __ModLoop
ret
__ModEnd:
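
A Python model of the two p_Mod loops, as a sketch only. The function name and the `keep_quotient` switch are hypothetical; the checks below use divisors up to $8000, where `adc hl,hl` cannot overflow 16 bits.

```python
def mod_model(n, d, keep_quotient):
    """Return (HL, A:C) after the 16-step restoring division loop."""
    a, c, hl = n >> 8, n & 0xFF, 0
    for _ in range(16):
        # original: scf / rl c (bit 0 <- 1); optimized: sla c (bit 0 <- 0)
        carry, c = c >> 7, ((c << 1) | (1 if keep_quotient else 0)) & 0xFF
        carry, a = a >> 7, ((a << 1) | carry) & 0xFF      # rla
        hl = hl * 2 + carry                               # adc hl,hl
        carry, hl = hl >> 16, hl & 0xFFFF
        diff = hl - d - carry                             # sbc hl,de
        hl = diff & 0xFFFF
        if diff < 0:                                      # borrow: jr nc not taken
            hl = (hl + d) & 0xFFFF                        # add hl,de restores HL
            if keep_quotient:
                c = (c - 1) & 0xFF                        # dec c clears the quotient bit
    return hl, (a << 8) | c
```

The remainder in HL never depends on the bits shifted into A:C, so dropping `scf` and `dec c` only discards the quotient.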
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Happybobjr on December 06, 2011, 07:56:54 am
Awesome, I needed a speed boost in division.
p_SDiv is signed 16-bit division, right?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 09, 2011, 05:53:34 am
Quote from: Quigibo on November 04, 2011, 01:58:14 am
Not an optimization, but I'm posting this here since more assembly people will read it.  Since the Bitmap() command is being replaced with something actually useful, that means the "Fix 8" and "Fix 9" will also need to be replaced.  Are there any useful flags (particularly for text) that would be useful to Axe programmers that I haven't already covered with the other fix commands?  A couple I can think of are an APD toggle or Lowercase toggle.

I agree with adding the lowercase enable flag, especially if p_GetKeyPause is modified slightly. For example:
Code: [Select]
p_GetKeyPause: ; Change to subroutine
.db __GetKeyPauseEnd-1-$
B_CALL(_GetKeyRetOff)
res 7,(iy+40)
ld h,0
ld l,a
cp $fc
ret c
ld a,($8446)
ld h,a ; Edit: something like inc h \ ld l,a might be easier
; since lowercase letters would be consecutive
; or ld hl,($8446) \ ld h,a
ret
__GetKeyPauseEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 10, 2011, 05:06:35 pm
Thanks for the optimizations :)

Unfortunately that p_SortD won't work because ldi also increases de.  Also, the p_Mod won't work because c's right bit will never be set in either of those cases.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 10, 2011, 05:11:59 pm
For p_SortD, affecting de doesn't matter because if you follow the code path, the next occurrence of de is when it is loaded from hl, so its contents don't matter.
As for p_Mod, ac is the division result, which is never needed. You can also notice that in the original routine, the new bits shifted into ac are never read.

Edit: fixed grammar ::)

Also, some peephole ops I would find useful.
Code: [Select]
.db 3
sbc hl,de
ld a,h
or l
.db 2
sbc hl,de
Code: [Select]
.db 8
ld de,$0000
add hl,de
sbc hl,hl
inc hl
dec hl
.db 6
ld de,$0000
add hl,de
sbc hl,hl
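
A toy Python sketch of how pattern/replacement pairs like the two above could be applied. It matches mnemonic strings rather than opcode bytes, and the table layout and function name are assumptions for illustration, not how Axe stores or applies its peephole data.

```python
PEEPHOLES = [
    # (pattern, replacement), both as tuples of mnemonic strings
    (("sbc hl,de", "ld a,h", "or l"), ("sbc hl,de",)),
    (("ld de,$0000", "add hl,de", "sbc hl,hl", "inc hl", "dec hl"),
     ("ld de,$0000", "add hl,de", "sbc hl,hl")),
]

def apply_peepholes(code):
    """Rewrite the instruction list until no pattern matches any more."""
    out = list(code)
    changed = True
    while changed:
        changed = False
        for pattern, replacement in PEEPHOLES:
            n = len(pattern)
            for i in range(len(out) - n + 1):
                if tuple(out[i:i + n]) == pattern:
                    out[i:i + n] = replacement
                    changed = True
                    break
    return out
```

The first rule is safe only because `sbc hl,de` already sets the zero flag from the 16-bit result, so the `ld a,h` / `or l` test adds nothing unless later code reads A.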
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 10, 2011, 07:20:44 pm
Oh I forgot to mention that I've optimized division now by having the long division routine call the modulus subroutine to save space...  I do see what you mean now about the sorting, so I added that change. :)

Generally I don't do peephole optimizations unless all the registers are the same, but I'm 99% sure that this particular one should be okay and that nothing else in the Axe protocol currently relies on the "a" register after a zero check, so I'll add that one.  That second one seems rare; it would only occur if you added 1 after a signed greater-than-or-equal-to-zero comparison.  So, at least until I make it faster, I'm only trying to do the most common/significant optimizations because each one I add slows down parsing.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 10, 2011, 07:43:58 pm
I had a sneaking suspicion that you would find some reason to call p_Mod :/

For the first peephole optimization, since you already had
Code: [Select]
.db 3
sbc hl,hl
ld a,h
or l
.db 2
sbc hl,hl
I figured that you already checked that the value of a is not used.

The second one was an optimization for 32-bit subtraction (it was supposed to be p_LtLeXX followed by dec hl). However, the following should work because it has no differing side effects and should be more common. (btw, I think Runer may have made a similar suggestion)
Code: [Select]
.db 2
inc hl
dec hl
.db 0 ;<- don't know if this works
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 11, 2011, 07:34:47 am
Lol... I'm already optimizing :P

p_ArcTan: same size, save 19 or 10 (avg 14.5) cycles
Original
Code: [Select]
p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ; |Get parity
rla ;/
jr c,__ArcTanSS ;\
add hl,de ; |
add hl,de ; |
__ArcTanSS: ; |
or a ; |hl = x +- y
sbc hl,de ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
Optimized
Code: [Select]
p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-1
add hl,de ;\
jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
I'm curious as to why you multiplied by 64 before dividing. It would seem that if the times 64 was after the division, the result would generally be the same, but there would be less of a chance of overflow. It's possible though that it doesn't matter.
Edit: Oh yeah... accuracy. Your way is more accurate, nvm.

p_DrawBmp: saves 3 to 4 bytes, and (8 ± 0 or 4) × (visible height) - 8 cycles
Original
Code: [Select]
p_DrawBmp:
; ...
__DrawBmpGoodSize:
ld b,a ;B = plot_height
push bc ;****** BEGIN BUFFER CALCULATIONS ******
; ...
__DrawBmpLeftLoop:
inc c
dec c
jr z,__DrawBmpSkipMain
dec c
; ...
__DrawBmpOnLeft: ;A = X + 8
inc c
dec c
ld d,(hl)
inc hl
ld e,c ;E = 0 if z
jr z,__DrawBmpSt
; ...
__DrawBmpStSkip:
ld a,e
pop de ;D = X
ld e,c
pop bc
ld c,e ;C = bytes
; ...
__DrawBmpColWall:
inc c
dec c
jr z,__DrawBmpSkipMain
dec c
ld a,d
jr nz,__DrawBmpColLeft
cp 88
ld d,(hl)
inc hl
jr nc,__DrawBmpSkipMain
ld e,c
jr __DrawBmpSt
; ...
Optimized
Code: [Select]
p_DrawBmp:
; ... c = bytes + 1 is required for the rest of the optimizations
__DrawBmpGoodSize:
ld b,a ;B = plot_height
inc c ;C = bytes+1
push bc ;****** BEGIN BUFFER CALCULATIONS ******
; ... undo inc c above, affect z flag the same as before, c is still one more than before
__DrawBmpLeftLoop:
dec c
jr z,__DrawBmpSkipMain
; ... since c is one more than before, check e = c - 1 for 0, instead of c
__DrawBmpOnLeft: ;A = X + 8
ld d,(hl)
inc hl
ld e,c
dec e ;E = 0 and z (if bytes = 0)
jr z,__DrawBmpSt
; ... this stores one more than before to e, but all code paths lead to
; either pop de, ld e,(hl), or ld e,c before e is ever used.
__DrawBmpStSkip:
ld a,e
pop de ;D = X
ld e,c
pop bc
ld c,e ;C = bytes+1
; ... same as above
__DrawBmpColWall:
dec c
jr z,__DrawBmpSkipMain
ld a,d
jr nz,__DrawBmpColLeft
cp 88
ld d,(hl)
inc hl
jr nc,__DrawBmpSkipMain
; I do not understand the reason for ld e,c, however, c is one more than before,
; so dec e to have e be the same as before, but I don't know if this is necessary.
ld e,c
dec e
jr __DrawBmpSt
; ...

Sorry for bumping some of these so soon, but I wanted to change them to work with the new version.

p_88Mul: same size, saves 1 or 6 (avg 3.5) cycles
Original
Code: [Select]
p_88Mul:
.db __88MulEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_MulFull
ld l,h
ld h,a
pop af
xor h
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__88MulEnd:
Optimized
Code: [Select]
p_88Mul:
.db __88MulEnd-1-$
ld a,h
xor d
push af
xor d ; a = h
jp p,$+9-p_88Mul-1
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_MulFull
ld l,h
ld h,a
pop af
xor h
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__88MulEnd:

p_SDiv: same size, saves 1 or 6 (avg 3.5) cycles
Original
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
Optimized
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d ; a = h
jp p,$+9-p_SDiv-1
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_Div
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:

p_Reciprocal: same size, saves 31 cycles
Let me know if you want this one explained.
Original
Code: [Select]
p_Reciprocal:
.db __ReciprocalEnd-1-$
xor a
bit 7,h
push af
jr z,$+8
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
ex de,hl
ld bc,$1000
ld hl,1
ld b,b \ .db 7 \ call $3F00+sub_Mod
ld h,a
ld l,c
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__ReciprocalEnd:
Optimized
Code: [Select]
p_Reciprocal:
.db __ReciprocalEnd-1-$
xor a
bit 7,h
push af
jr z,$+8
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
ex de,hl
ld bc,$1001
ld hl,2
ld b,b \ .db 13 \ call $3F00+sub_Mod
ld h,a
ld l,c
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__ReciprocalEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 12, 2011, 01:48:13 am
First, an optimization so simple, it doesn't even deserve a fancy side-by-side comparison. It's funny that nobody (including myself, until now) noticed this optimization, because it's something you wouldn't even think about optimizing. But anyways, getting to the optimization: remove the ld hl,0 at the start of p_Mul! Who would've thought, optimize code by simply deleting a line of it! ;D


Second, I actually thought of the above optimization while toying around with the multiplication routine attempting to optimize it in another manner. Although I know you generally like to optimize for size, I think that with a routine like multiplication that is so heavily used, a routine optimized more for speed would be worth it. Especially if the cost in size is a paltry two bytes! This optimization will get a fancy side-by-side comparison. :)


Smaller routine: 14 bytes, ~836 cycles
Code: [Select]
p_Mul:
.db __MulEnd-1-$
ld c,h
ld a,l
ld b,16
__MulNext:
add hl,hl
add a,a
rl c
jr nc,__MulSkip
add hl,de
__MulSkip:
djnz __MulNext
ret
__MulEnd:
Faster routine: 16 bytes, ~741 cycles
Code: [Select]
p_Mul:
.db __MulEnd-1-$
ld c,l
ld a,h
call __MulByte
ld a,c
__MulByte:
ld b,8
__MulNext:
add hl,hl
add a,a
jr nc,__MulSkip
add hl,de
__MulSkip:
djnz __MulNext
ret
__MulEnd:



EDIT: The division routine just gave me a great idea. For another paltry two bytes, if you'd like, you can cut the time for multiplying 8-bit numbers in half! Compared to Axe's current routine, you would get a speed increase anywhere from 12% to 120% for a total size increase of one byte! Not bad! Of the three routines I have provided, this routine is definitely my favorite.


Even faster routine: 18 bytes,
~749 cycles for 16-bit inputs (h!=0),
~386 cycles for 8-bit inputs (h=0)
Code: [Select]
p_Mul:
.db __MulEnd-1-$
ld c,l
xor a
ld l,a
add a,h
call nz,__MulByte
ld a,c
__MulByte:
ld b,8
__MulNext:
add hl,hl
add a,a
jr nc,__MulSkip
add hl,de
__MulSkip:
djnz __MulNext
ret
__MulEnd:



EDIT 2: Another thought: the same basic optimization I applied in the 16-byte routine (the faster routine before the 8-bit optimization) could be applied to p_MulFull to save a couple hundred cycles in each fixed-point multiplication. High-order multiplication and fixed-point multiplication could no longer share a routine, but it might be worth it considering the two multiplication techniques are (in my experience) not commonly used in the same scenario.
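
A Python sketch of the three p_Mul variants above, to illustrate why HL never needs clearing: sixteen `add hl,hl` steps push every bit of the old HL out of the 16-bit result. The function names are invented for this model.

```python
def mul_small(hl, de):
    """14-byte routine: multiplier bits come from C:A, the product builds in HL."""
    c, a = hl >> 8, hl & 0xFF                 # ld c,h / ld a,l (HL is not cleared)
    for _ in range(16):
        hl = (hl << 1) & 0xFFFF               # add hl,hl
        carry, a = a >> 7, (a << 1) & 0xFF    # add a,a
        carry, c = c >> 7, ((c << 1) | carry) & 0xFF   # rl c
        if carry:
            hl = (hl + de) & 0xFFFF           # add hl,de
    return hl

def mul_byte(hl, de, a):
    """__MulByte: eight shift-and-add steps driven by the bits of A."""
    for _ in range(8):
        hl = (hl << 1) & 0xFFFF               # add hl,hl
        carry, a = a >> 7, (a << 1) & 0xFF    # add a,a
        if carry:
            hl = (hl + de) & 0xFFFF           # add hl,de
    return hl

def mul_fast(hl, de):
    """16-byte routine: high byte first, then fall through for the low byte."""
    c, a = hl & 0xFF, hl >> 8                 # ld c,l / ld a,h
    hl = mul_byte(hl, de, a)                  # call __MulByte
    return mul_byte(hl, de, c)                # ld a,c, then fall into __MulByte

def mul_faster(hl, de):
    """18-byte routine: skips the first pass entirely when H is zero."""
    c, h = hl & 0xFF, hl >> 8                 # ld c,l
    hl = h << 8                               # xor a / ld l,a
    if h:                                     # add a,h / call nz,__MulByte
        hl = mul_byte(hl, de, h)
    return mul_byte(hl, de, c)
```

All three return the low 16 bits of the product; the last one does half the work whenever the first operand fits in a byte.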
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Builderboy on December 12, 2011, 02:28:58 am
*Builderboy votes for using super fast multiplication :D *
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 12, 2011, 02:58:15 am
@jacobly
Your p_DrawBmp optimizations don't do the same thing as the original code (which is why you're confused about the use of e).  For instance, right after __DrawBmpColWall, my original routine checks if c is zero, then checks if c is one.  Yours checks if c is zero, then if it's non-zero.  The e register, which is the left aligned byte to be shifted, is only loaded with c as an optimization in the case that c is zero because the sprite is clipped on that wall.  This is always the same as doing ld e,0.

But thanks for the other optimizations, I have added all of them :)

@Runer112
Mind == Blown.  That's unbelievably cool!  Modular arithmetic is quite a strange beast sometimes.  Anyway, I like that last suggestion best as well :)  However, even though removing the zero loading works for the 16 bit multiplication and (I presume) the 32 bit multiplication, I don't think that last optimization will work for 32-bit, unless you can think of another method?
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 12, 2011, 11:57:46 pm
Yeah, I see no way to optimize the full 32-bit multiplication... But fixed-point multiplication, now that's an entirely different story! First, here's a totally different approach to sign handling that reduces p_88Mul to less than half of its current size! ;D


Original routine: 38 bytes, ~1128 cycles
Code: [Select]
p_88Mul:
.db __88MulEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_MulFull
ld l,h
ld h,a
pop af
xor h
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__88MulEnd:
Smaller routine: 18 bytes, ~1089 cycles
Code: [Select]
p_88Mul:
.db __88MulEnd-1-$
push hl
call $3F00+sub_MulFull
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
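
A Python sketch of the sign-correction idea in the 18-byte routine. It assumes p_MulFull leaves bits 16 to 23 of the unsigned 32-bit product in A and bits 8 to 15 in H, and preserves DE; that register convention is inferred from the listing, not confirmed. The function names are hypothetical.

```python
def to_signed(x):
    return x - 0x10000 if x & 0x8000 else x

def mul88_small(hl, de):
    product = hl * de                     # unsigned 16x16 product (assumed p_MulFull result)
    a = (product >> 16) & 0xFF            # assumed: A = bits 16..23
    h = (product >> 8) & 0xFF             # assumed: H = bits 8..15
    b, c = hl >> 8, hl & 0xFF             # pop bc: the original HL
    if b & 0x80:                          # bit 7,b: HL was negative
        a = (a - (de & 0xFF)) & 0xFF      # sub e
    l, h = h, a                           # ld l,h / ld h,a
    if de & 0x8000:                       # bit 7,d: DE was negative
        h = (h - c) & 0xFF                # sub c / ld h,a
    return (h << 8) | l

def mul88_reference(hl, de):
    """Signed 8.8 product, rounded toward negative infinity."""
    return ((to_signed(hl) * to_signed(de)) >> 8) & 0xFFFF
```

Treating a negative operand as unsigned adds 65536 times the other operand to the product, so subtracting the other operand's low byte from bits 16 to 23 undoes it. In this model the result is floored: -1/256 times 0.5 comes out as $FFFF rather than 0.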


20 bytes saved? Not bad at all! But what if you're more interested in shaving off cycles than bytes? Don't worry, I covered that base too. Instead of using the slower p_MulFull, this final routine uses my faster p_Mul (http://ourl.ca/4175/270383) for 8 bits of the multiplication and an inlined, slightly different version of faster multiplication for the other 8 bits. End result: it's about 260 cycles faster than the smaller solution, or about 30% faster! ;D It's 16 bytes larger than my smaller method, but actually it would often end up resulting in smaller programs because it relies on the much more popular p_Mul instead of p_MulFull.


Faster routine: 34 bytes, ~831 cycles
Code: [Select]
p_88Mul:
.db __88MulEnd-1-$
push hl
ld c,l
ld a,h
ld l,0
ld b,b \ .db 8 \ call $3F00+sub_Mul
ld a,c
ld bc,8<<8+0
__88MulNext:
add hl,hl
rla
jr nc,__88MulSkip
add hl,de
adc a,c
__88MulSkip:
djnz __88MulNext
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 13, 2011, 01:19:54 am
Wow thanks!  However there seems to be an issue.  The 3 pictures attached are the output from the Mandelbrot Set demo program. The first is the original routine.  The second is your new size optimized version.  As you can see it works, but the rounding appears to be asymmetrical (which might still be okay).  The last one is your speed optimized version.  I think you have a bug somewhere...  :P
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 13, 2011, 01:38:30 am
I think I can explain the asymmetry of the size-optimized version. Because it adjusts signs differently, I think it now rounds down instead of towards zero like the old routine.

However, I have no clue what is going on with the speed-optimized routine. Can you look at the debugger and confirm that the call to sub_Mul is actually entering where it's supposed to be entering, at __MulByte? Because I wouldn't be surprised if the fact that you probably had to add the offset call macro for call nz,__MulByte in p_Mul is messing up the offset calls due to its own size.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 13, 2011, 02:07:20 am
The disassembly looks fine to me.  All the jumps, calls, and everything of that nature are aligned.  I tried 4 test cases with different combinations of sign values and they seemed okay.  Since the generated picture is relatively close to the original given that it was a chaotic system sensitive to errors, I would guess it is only a few special cases that cause it to return a wrong result.

EDIT: I made a program to run them side by side on random numbers and quit when the output is different.  Here is an output that gives different results between the routines:

$FFE0 ** $F5F1 (-0.125 ** -10.059)

Results in $0143 (1.26) in size optimized.
Results in $0239 (2.22) in speed optimized.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 13, 2011, 03:04:07 am
That edit was helpful, it gave me a hunch as to what the problem was and (I think) that hunch was correct. Unfortunately, the fix for this problem will cost a byte and about 70 cycles. It will still be about 20% faster than the small routine though. And it still relies on the more common p_Mul instead of p_MulFull, so being 17 bytes larger might still be worth it.


Faster routine: 35 bytes, ~900 cycles
Code: [Select]
p_88Mul:
.db __88MulEnd-1-$
push hl
ld c,l
ld a,h
ld l,0
ld b,b \ .db 8 \ call $3F00+sub_Mul
ld b,8
__88MulNext:
add hl,hl
rla
rl c
jr nc,__88MulSkip
add hl,de
adc a,0
__88MulSkip:
djnz __88MulNext
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on December 17, 2011, 11:29:24 pm
So... Z-Test. At a cost of 8 cycles, you can go from 17 bytes plus 3 bytes times the number of options (limited to something like 85?) to 16 bytes plus 2 bytes times the number of options (limited to amount of program space).

Here's my method:
Code: [Select]
  ld de,-range
  add hl,de
  ld de,jumptable_end
  jr c,default
  add hl,hl
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
default:
  ex de,hl
  jp (hl)
  .dw Label0
  .dw Label1
  .dw Label2
  ;.....
jumptable_end:
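
A Python sketch of the address arithmetic in the listing above. The table address `TABLE_END` and the label values in the usage are invented; the size formulas are the ones quoted in the post.

```python
TABLE_END = 0x9E00                           # hypothetical address of jumptable_end

def ztest_target(index, labels):
    """Return the address jp (hl) would jump to for a given index."""
    count = len(labels)                      # "range" in the listing
    table_start = TABLE_END - 2 * count
    memory = {}
    for i, addr in enumerate(labels):        # .dw Label0, Label1, ...
        memory[table_start + 2 * i] = addr & 0xFF
        memory[table_start + 2 * i + 1] = addr >> 8
    hl = index + (0x10000 - count)           # ld de,-range / add hl,de
    if hl > 0xFFFF:                          # carry set: index >= range
        return TABLE_END                     # jr c,default with de = jumptable_end
    hl = (hl * 2) & 0xFFFF                   # add hl,hl
    hl = (hl + TABLE_END) & 0xFFFF           # add hl,de
    return memory[hl] | (memory[hl + 1] << 8)    # ld e,(hl) / inc hl / ld d,(hl)

def old_size(options):
    return 17 + 3 * options

def new_size(options):
    return 16 + 2 * options
```

Because the index is biased by -range before doubling, the table is addressed backwards from its end, and a single carry check covers every out-of-range index.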
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Quigibo on December 17, 2011, 11:39:31 pm
Wow thanks!  I was considering that, but I assumed the overhead would be larger, not smaller!  Thanks!

Also, I could move the labels to the data section of the code to make it even faster!

Code: [Select]
  ld de,-range
  add hl,de
  jr c,default
  add hl,hl
  ld de,jumptable_end
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
  ex de,hl
  jp (hl)
default:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on December 17, 2011, 11:50:41 pm
If you wanted to save 2 cycles in the case of a jump, you could use an odd table setup with all the LSBs in a row followed by all the MSBs in a row, like so:

Code: [Select]
  ld de,-range
  add hl,de
  jr c,routine_end
  ex de,hl
  ld hl,jumptable_end
  add hl,de
  ld a,(hl)
  add hl,de
  ld l,(hl)
  ld h,a
  jp (hl)
routine_end:

I imagine that might not work well with the way pointers are handled in the compiler, though.

Edit:
And I suppose the current Z-Test is actually limited to 39 options due to the range of the JR instruction...
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on December 19, 2011, 01:12:53 am
First, an optimization that I can't give you code for: making *^CONST use an equivalent constant division optimization if one exists. And don't forget about the trivial cases, *^1 and *^0. Of course, these only apply if you don't change this operation to return a 32-bit result somehow. Which it really should. :P

Next, some silly optimizations: ^0, <<ᴇ8000, >>ᴇ7FFF should simply be 0, while ≥≥ᴇ8000 and ≤≤ᴇ7FFF should simply be 1. If you're wondering why ^0 should be 0, that's what the general modulus routine would return anyways.

Finally, some optimizations for signed comparisons. These have been lacking general forms which take advantage of absolute jumps as well as optimized forms for constants for quite some time. Thanks to jacobly and calc84maniac for helping me come up with the first two! If either of you two are reading this, feel free to look at the other operations and try to optimize them. ;)

Code: [Select]
p_SGT0:
.db 8
ld a,h
or l
jr z,$+6
add hl,hl
sbc hl,hl
inc hl
p_SLE0:
.db 9
ld a,h
or l
jr z,$+6
add hl,hl
ccf
sbc hl,hl
inc hl
p_SLtLeXX:
.db 11
ld a,h
add a,$80
ld h,a
ld de,$0000 ;$8000-const
add hl,de
sbc hl,hl
inc hl
.db rp_Ans,6
p_SGtGeXX:
.db 12
ld a,h
add a,$80
ld h,a
xor a
ld de,$0000 ;$8000-const
add hl,de
ld h,a
rla
ld l,a
.db rp_Ans,6
p_SIntGt:
.db 11
scf
sbc hl,de
add hl,hl
jp pe,$+4
ccf
sbc hl,hl
inc hl
p_SIntGe:
.db 11
xor a
sbc hl,de
add hl,hl
jp po,$+4
ccf
ld h,a
rla
ld l,a
p_SIntLt:
.db 11
scf
sbc hl,de
add hl,hl
jp po,$+4
ccf
sbc hl,hl
inc hl
p_SIntLe:
.db 11
xor a
sbc hl,de
add hl,hl
jp pe,$+4
ccf
ld h,a
rla
ld l,a
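
A Python sketch of the two tricks used in the listings above: the sign-bias comparison against a constant (p_SLtLeXX) and the sign-xor-overflow comparison (p_SIntLt). The function names are invented, and the operand order in the second model (1 when DE < HL) is read off the listing rather than taken from Axe's documentation.

```python
def to_signed(x):
    return x - 0x10000 if x & 0x8000 else x

def slt_const(hl, const):
    """p_SLtLeXX: 1 if signed HL < const. Not valid for const = $8000."""
    hl ^= 0x8000                              # ld a,h / add a,$80 / ld h,a
    de = (0x8000 - const) & 0xFFFF            # ld de,$8000-const
    carry = (hl + de) >> 16                   # add hl,de
    return 1 - carry                          # sbc hl,hl / inc hl

def slt_general(de, hl):
    """p_SIntLt: 1 if signed DE < signed HL."""
    true_diff = to_signed(hl) - to_signed(de) - 1     # scf / sbc hl,de
    result = true_diff & 0xFFFF
    overflow = not -0x8000 <= true_diff <= 0x7FFF     # P/V flag after sbc hl,de
    carry = result >> 15                      # add hl,hl moves the sign into carry
    if overflow:                              # jp po,$+4 skips the ccf otherwise
        carry ^= 1                            # ccf
    return 1 - carry                          # sbc hl,hl / inc hl
```

The constant form breaks down for $8000 because the biased constant becomes zero and the add can never carry, which fits the suggestion to special-case comparisons against ᴇ8000.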
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 20, 2011, 04:47:46 am
p_DrawOff: save 1 byte, save ~40 cycles
Original
Code: [Select]
xor a
ld e,a
dec a
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
and (hl)
or e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
cpl
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
Optimized
Code: [Select]
xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
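
A Python model of the two masking sequences, as a check that seeding E with $FF and swapping the AND/OR order writes the same bytes. It ignores the clipping flags in D and only covers shift counts 1 to 7; the function names are hypothetical.

```python
def shift24(c, e, a, count):
    """srl c / rr e / rra repeated: shift the 24-bit value C:E:A right."""
    value = ((c << 16) | (e << 8) | a) >> count
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF

def drawoff_original(sprite, shift, left, right):
    c, e, a = shift24(sprite, 0x00, 0xFF, shift)   # xor a / ld e,a / dec a
    right = (a & right) | e                        # and (hl) / or e
    left = ((a ^ 0xFF) & left) | c                 # cpl / and (hl) / or c
    return left, right

def drawoff_optimized(sprite, shift, left, right):
    c, e, a = shift24(sprite, 0xFF, 0x00, shift)   # xor a / ld e,$FF
    right = (a | right) & e                        # or (hl) / and e
    left = (a & left) | c                          # and (hl) / or c
    return left, right
```

In the optimized form A ends up holding the already-inverted mask, which is what makes the `cpl` unnecessary.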

p_Pix: save 2 bytes, save ~6 cycles
Original
Code: [Select]
p_Pix:
.db __PixEnd-1-$ ;Draws pixel (c,l)
ld de,plotSScreen
pop af
pop bc
push af
ld b,0

ld a,l
cp 64
ld a,b
ret nc
ld a,c
cp 96
ld a,b
ret nc

ld h,b
ld a,l
add a,a
add a,l
ld l,a
add hl,hl
add hl,hl
add hl,de
ld a,c
srl c
srl c
srl c
add hl,bc
and %00000111
ld b,a
ld a,%10000000
ret z
___GetPixLoop:
rrca
djnz ___GetPixLoop
ret
__PixEnd:
Optimized
Code: [Select]
p_Pix:
.db __PixEnd-1-$ ;Draws pixel (c,l)
ld de,plotSScreen
pop af
pop bc
push af
ld b,0

ld a,c
cp 96
ld a,b
ret nc
sla l
ret c
sla l
ret c

ld h,b
ex de,hl
add hl,de
add hl,de
add hl,de
ld a,c
srl c
srl c
srl c
add hl,bc
and %00000111
ld b,a
ld a,%10000000
ret z
___GetPixLoop:
rrca
djnz ___GetPixLoop
ret
__PixEnd:

p_ArcTan: save 1 byte, save ~1 cycle
Original
Code: [Select]
p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-1
add hl,de ;\
jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
Optimized
Code: [Select]
p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-2
add hl,de ;\
ld c,c \ .db $FA ; |
;jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 24, 2011, 01:45:01 pm
p_DrawOr/Xor: save 17 bytes (plus 4 every time a custom buffer is used)
aligned saves 98 cycles, unaligned saves ~173 cycles
save additional 21 cycles every time a custom buffer is used
Code: [Select]
p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawOrClipBottom
ld a,b
jr __DrawOrClipBottom
__DrawOrClipTop:
inc ix
inc c
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawOrAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:

p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawXorClipBottom
ld a,b
jr __DrawXorClipBottom
__DrawXorClipTop:
inc ix
inc c
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawXorAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on December 24, 2011, 03:58:42 pm
I finally have an optimisation that might work or be useful >.> Runer112 apparently mentioned optimising the p_FreqOut routine by replacing:
Code: [Select]
dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2
with this:
Code: [Select]
cpd
jp pe,__FreqOutLoop2
The issue was that the frequency would be thrown off, as it cut out 8*HL cycles. However, when I was stealing the code for my own evil intentions, I saw this optimisation, thought of that issue, and came up with this solution:
Code: [Select]

p_FreqOut:
xor a
__FreqOutLoop1:
push bc
xor %00000011
ld e,a
__FreqOutLoop2:
ld a,h
or l
jr z,__FreqOutDone
cpd
ld a,e
scf
jp pe,__FreqOutLoop2
__FreqOutDone:
pop bc
out ($00),a
ret nc
jr __FreqOutLoop1
__FreqOutEnd:
With the code reordered this way, it should only cut out 8*HL/BC cycles, which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.


EDIT: Okay, found a problem: the inner loop is now actually 2 cycles slower, so that will slow the routine by 2*HL, too.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on December 26, 2011, 12:17:26 pm
p_DrawOr: 18 bytes saved
p_DrawXor: 18 bytes saved
p_DrawOff: 14 bytes saved
p_DrawMsk: 10 bytes saved
p_DrawMsk2: 11 bytes saved
Code: [Select]
p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOrClipBottom
ld b,d
jr __DrawOrNoClipV
__DrawOrClipTop:
inc ix
inc e
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
ld b,a
__DrawOrNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawOrAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:

p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawXorClipBottom
ld b,d
jr __DrawXorNoClipV
__DrawXorClipTop:
inc ix
inc e
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
ld b,a
__DrawXorNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawXorAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:

p_DrawOff:
.db __DrawOffEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOffClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOffClipBottom
ld b,d
jr __DrawOffNoClipV
__DrawOffClipTop:
inc ix
inc e
jr nz,__DrawOffClipTop
__DrawOffClipBottom:
ld b,a
__DrawOffNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawOffAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawOffLoop
inc d
cp 96-7
jr nc,__DrawOffLoop
inc d
__DrawOffLoop:
push bc
ld b,c
ld c,(ix+0)
xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
ld bc,13
add hl,bc
inc ix
pop bc
djnz __DrawOffLoop
ret
__DrawOffAligned:
ld e,12
__DrawOffAlignedLoop:
ld a,(ix)
ld (hl),a
inc ix
add hl,de
djnz __DrawOffAlignedLoop
ret
__DrawOffEnd:

p_DrawMsk:
.db __DrawMskEnd-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMskClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMskClipBottom
ld b,d
jr __DrawMskNoClipV
__DrawMskClipTop:
inc ix
inc e
jr nz,__DrawMskClipTop
__DrawMskClipBottom:
ld b,a
__DrawMskNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMskAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMskLoop
inc d
cp 96-7
jr nc,__DrawMskLoop
inc d

__DrawMskLoop:
push bc

push hl

ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMskShift:
srl e
rr h
srl c
rra
djnz __DrawMskShift

ld b,h
pop hl
push af

dec d
jr z,__DrawMskSkipRight1

push bc
xor b
cpl
ld c,a

ld a,(hl)
or b
and c
ld (hl),a
pop bc

__DrawMskSkipRight1:
dec hl
inc d
push de
jr z,__DrawMskSkipLeft1

ld a,c
xor e
cpl
ld d,a

ld a,(hl)
or e
and d
ld (hl),a

__DrawMskSkipLeft1:
ld de,appBackUpScreen-plotSScreen+1
add hl,de
pop de
pop af
dec d
jr z,__DrawMskSkipRight2

or b
cpl

and (hl)
or b
ld (hl),a

__DrawMskSkipRight2:
dec hl
inc d
jr z,__DrawMskSkipLeft2

ld a,c
or e
cpl

and (hl)
or e
ld (hl),a

__DrawMskSkipLeft2:
ld bc,plotSScreen-appBackUpScreen+13
add hl,bc

inc ix
pop bc
djnz __DrawMskLoop
ret
__DrawMskAligned:
push hl
ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,(ix+0)
ld d,a
xor (ix+8)
cpl
ld e,a

and (hl)
or d
ld (hl),a

pop hl

ld a,(hl)
or d
and e
ld (hl),a

inc ix
ld de,12
add hl,de
djnz __DrawMskAligned
ret
__DrawMskEnd:

p_DrawMsk2:
.db __DrawMsk2End-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMsk2ClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMsk2ClipBottom
ld b,d
jr __DrawMsk2NoClipV
__DrawMsk2ClipTop:
inc ix
inc e
jr nz,__DrawMsk2ClipTop
__DrawMsk2ClipBottom:
ld b,a
__DrawMsk2NoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMsk2Aligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMsk2Loop
inc d
cp 96-7
jr nc,__DrawMsk2Loop
inc d
__DrawMsk2Loop:
push bc
push hl

ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMsk2Shift:
srl e
rr h
srl c
rra
djnz __DrawMsk2Shift

ld b,h ;e = left spr, b = right spr, c = left msk, a = right msk
pop hl

dec d
jr z,__DrawMsk2SkipRight

cpl
and (hl)
xor b
ld (hl),a

__DrawMsk2SkipRight:
dec hl
inc d
jr z,__DrawMsk2SkipLeft

ld a,c
cpl
and (hl)
xor e
ld (hl),a

__DrawMsk2SkipLeft:
ld bc,13
add hl,bc

inc ix
pop bc
djnz __DrawMsk2Loop
ret
__DrawMsk2Aligned:
ld e,12
__DrawMsk2AlignedLoop:
ld a,(ix+8)
cpl
and (hl)
xor (ix+0)
ld (hl),a
inc ix
add hl,de
djnz __DrawMsk2AlignedLoop
ret
__DrawMsk2End:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 02, 2012, 11:56:23 pm
Just a small optimization I see with the new Nth string command. Because you restack the return location by popping it into bc, you're already loading bc with a value that's at least $4000 for applications and at least $8000 for programs, so the ld b,h inside the loop is not necessary.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jacobly on September 18, 2012, 03:03:40 am
Thanks to a suggestion from calc84maniac, I have optimized the routine that is used for both *^ and ** to be 25-50% faster. ;D In addition, every use of *^ would be 2 bytes smaller.

p_MulFull: same size, save 300-550 cycles
Original
Code: [Select]
p_MulFull:
.db __MulFullEnd-1-$
ld c,h
ld a,l
ld hl,0
ld b,16
__MulFullNext:
add hl,hl
rla
rl c
jr nc,__MulFullSkip
add hl,de
adc a,0
jr nc,__MulFullSkip
inc c
__MulFullSkip:
djnz __MulFullNext
ret
__MulFullEnd:
Optimized
Code: [Select]
p_MulFull:
.db __MulFullEnd-1-$
xor a
ld c,h
ld h,a
or l
ld l,h
call nz,__MulFullByte-p_MulFull-1
ld a,c
__MulFullByte:
ld b,8
__MulFullNext:
rra
jr nc,__MulFullSkip
add hl,de
__MulFullSkip:
rr h
rr l
djnz __MulFullNext
ret
__MulFullEnd:
Note: Output changed: hl = bits 16-31 of the result, do rra after the routine returns to get a = bits 8-15 of the result.
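To double-check the note about the changed output, here is a rough Python model of the optimized loop. It is only a sketch: `mul_full` and its return tuple are made-up names, and the model assumes the usual Z80 flag behaviour (rra rotates through carry, add hl,de leaves the 17th bit in carry, and the carry left by the low-byte pass is still there when the high-byte pass starts).

```python
def mul_full(hl, de):
    """Model of the optimized p_MulFull. Returns (hl, a, carry) at the final ret."""
    mult_hi, mult_lo = hl >> 8, hl & 0xFF
    acc = 0      # HL, used as the running high word
    carry = 0    # xor a / or l leave carry reset
    a = 0
    # call nz: the low byte only gets its own pass when it is non-zero
    passes = ([mult_lo] if mult_lo else []) + [mult_hi]
    for multiplier_byte in passes:
        a = multiplier_byte
        for _ in range(8):                      # ld b,8 ... djnz
            a, carry = (carry << 7) | (a >> 1), a & 1   # rra
            if carry:                           # jr nc skips the add
                total = acc + de                # add hl,de
                acc, carry = total & 0xFFFF, total >> 16
            wide = (carry << 16) | acc          # rr h / rr l
            acc, carry = wide >> 1, wide & 1
    return acc, a, carry


def rra(a, carry):
    """The extra rra the note asks for after the routine returns."""
    return (carry << 7) | (a >> 1)
```

In this model the returned HL matches bits 16-31 of the product, and one more rra on A yields bits 8-15, which is what the note says.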
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on September 18, 2012, 09:37:08 am
And if you ever want a signed high multiplication, I think this routine would work along with that one:
Code: [Select]
p_MulFullSigned:
.db __MulFullSignedEnd-1-$
push hl
call $3F00+sub_MulFull
pop bc
xor a
bit 7,b
jr z,$+4
sbc hl,de
or d
ret p
sbc hl,bc
ret
__MulFullSignedEnd:

Edit: more optimized
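The reasoning behind the two conditional subtractions can be checked with a small Python model. This is just a sketch with made-up names (`signed16`, `mul_high_signed`); it replaces the call to sub_MulFull with a plain unsigned multiply and assumes that call leaves DE untouched.

```python
def signed16(value):
    """Interpret a 16-bit pattern as a two's complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul_high_signed(hl, de):
    """Model of p_MulFullSigned: bits 16-31 of the signed product."""
    high = (hl * de) >> 16   # call sub_MulFull: unsigned high word
    if hl & 0x8000:          # bit 7,b / jr z,$+4 / sbc hl,de
        high -= de
    if de & 0x8000:          # or d / ret p / sbc hl,bc
        high -= hl
    return high & 0xFFFF
```

A negative operand reads as 65536 more than its signed value when treated as unsigned, so the unsigned product is too large by 65536 times the other operand. That error sits entirely in the high word, hence one 16-bit subtraction per negative operand.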
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: squidgetx on December 12, 2012, 10:22:33 am
Optimizing constant address calls?
Anyway, 5->oVAR : (oVAR)() compiles to
Code: [Select]
ld hl, 5
push hl
call $9D9D
when it could just compile to
Code: [Select]
call $0005

Right now the only way to call an address that's not a label is using asm(CDXXXX), and that way makes assigning r1-r6 arguments extremely annoying (manual store)
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on February 15, 2013, 07:02:51 pm
I am not sure if I had an outdated source (1.1.2) but I saw this code and a one-byte optimisation:
Code: [Select]
p_NthStr:
.db __NthStrEnd-$-1
pop bc
pop de
push bc
ex de,hl
__NthStrLoop:
ld a,d
or e
ret z
xor a
ld b,h
cpir
dec de
jr __NthStrLoop
__NthStrEnd:
It took me a second to figure out what you were doing with 'ld b,h', but when I did, I saw that you could just move it outside the loop to save 4 t-states each loop. But then I realised that BC is already large enough since it holds the return address, so you can actually just remove it altogether.
Code: [Select]
p_NthStr:
.db __NthStrEnd-$-1
pop bc
pop de
push bc
ex de,hl
__NthStrLoop:
ld a,d
or e
ret z
xor a
cpir
dec de
jr __NthStrLoop
__NthStrEnd:

I hope that actually works!
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: calc84maniac on February 15, 2013, 11:18:31 pm
I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).
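That failure case can be sketched with a tiny Python model of cpir. Everything here is illustrative: the `cpir` function, the buffer, and the two BC values ($4000 standing in for a return address inside an app, $8000 for one inside a RAM program) are assumptions, not code from Axe.

```python
def cpir(memory, hl, bc, a):
    """Model of cpir: scan forward for byte A, giving up when BC hits zero.

    Returns (hl, bc, found).
    """
    while True:
        found = memory[hl] == a
        hl = (hl + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if found or bc == 0:
            return hl, bc, found


# A $5000-byte string with its zero terminator at offset $5000.
long_string = bytes([0x41]) * 0x5000 + b"\x00"
```

With BC = $4000 the scan stops after $4000 bytes without reaching the terminator, while BC = $8000 gets there. So the leftover return address is only a safe count when no string is longer than it.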
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on February 15, 2013, 11:29:47 pm
I am not sure if I had an outdated source (1.1.2) but I saw this code and a one-byte optimisation

You do have an outdated version of Axe, I already added that optimization in 1.2.0. :P


I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

Pfft what are the chances of that...
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on February 16, 2013, 07:27:26 am
You do have an outdated version of Axe, I already added that optimization in 1.2.0. :P
Darn, I actually do have 1.2.1 in a different folder, I completely forgot about that .__. I am glad that I got something right, though :D

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).
I was worried about that, but I figured that it would be pretty rare. It would definitely be the only scenario in which it would fail, too. .__.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Deep Toaster on April 06, 2013, 05:42:16 pm
Don't know if it's been mentioned before (and maybe there's a reason it's this way), but p_SendByte starts by loading B and C individually where p_GetByte loads them together (saving a byte).
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on April 06, 2013, 05:44:01 pm
No reason whatsoever. Good catch.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on July 04, 2013, 08:59:58 am
EDIT: Jacobly pointed out the case HL = 8000h, so this doesn't work D:

Hopefully this file has the updated SDiv routine. I have this:
Original routine
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d
jp p,__SDivSkip1-p_SDiv-1
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
__SDivSkip1:
bit 7,d
jr z,__SDivSkip2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
__SDivSkip2:
call $3F00+sub_Div
x_SDivEntry:
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
Modified routine: same size, 1 or 6 cycles saved
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d
jp p,__SDivSkip1-p_SDiv-1
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
__SDivSkip1:
xor d
jp p,__SDivSkip2-p_SDiv-1
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
__SDivSkip2:
call $3F00+sub_Div
x_SDivEntry:
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
And my only change is the two lines after __SDivSkip1.
Same size, save at least 1 cycle (up to 6 cycles).

EDIT: The same modification can be made to the fixed point signed division routine.
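For reference, the HL = 8000h problem can be seen in a small Python model of just the divisor sign test. This is a sketch under the assumption that A still holds the (possibly negated) high byte of HL when execution reaches __SDivSkip1; the function name is made up.

```python
def divisor_sign_tests(hl, de):
    """Return (original, modified) answers to 'is DE negative?'."""
    a = hl >> 8                      # ld a,h / xor d / push af / xor d -> A = H
    if a & 0x80:                     # jp p not taken: HL is negative
        hl = (-hl) & 0xFFFF          # xor a / sub l / ld l,a / sbc a,a / sub h / ld h,a
        a = hl >> 8                  # A is left holding the new H
    d = de >> 8
    original = bool(d & 0x80)        # bit 7,d / jr z
    modified = bool((a ^ d) & 0x80)  # xor d / jp p
    return original, modified
```

The `xor d` shortcut relies on bit 7 of A being clear at that point. That holds for every HL except 8000h, which negates to itself and leaves bit 7 set, so the test on DE comes out inverted.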
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Matrefeytontias on July 12, 2013, 01:30:49 pm
Seeing the discussion about Fill( in the axiom request thread, I was surprised that this wasn't implemented this way already:

Code: [Select]
; Fill(ptr, amount, byte (not word))
; hl = ptr, de = byte, bc = amount
 ld (hl),e
 dec bc
 ld a,c
 or b
 ret z ; or whatever to quit
 ld e,l
 ld d,h
 inc de
 ldir
 ret ; or whatever to quit

I don't think it's really optimized though >_>
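The trick is that ldir copies one byte at a time, so with the destination one byte ahead of the source, the first byte gets smeared across the whole range. Here is a rough Python model of the snippet, assuming amount is at least 1 (with 0, the `dec bc` would wrap to 65535). The `fill` name and the bytearray are only for illustration.

```python
def fill(memory, ptr, amount, value):
    """Model of the Fill( snippet: store one byte, then let ldir propagate it."""
    memory[ptr] = value              # ld (hl),e
    remaining = amount - 1           # dec bc
    if remaining == 0:               # ld a,c / or b / ret z
        return
    src, dst = ptr, ptr + 1          # ld e,l / ld d,h / inc de
    for _ in range(remaining):       # ldir: byte-by-byte overlapping copy
        memory[dst] = memory[src]
        src += 1
        dst += 1


buffer = bytearray(8)
fill(buffer, 2, 4, 0xAA)
```

After the call, only the four bytes starting at offset 2 have changed.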
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: jo-thijs on October 31, 2013, 12:02:41 pm
I found this in the Commands.inc file of axe1.2.2a:
p_IntNe:
   .db 8
   xor   a
   sbc   hl,de
   jr   z,$+5
   ld   hl,1

I can't find the purpose of xor a.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Runer112 on October 31, 2013, 12:14:49 pm
Reset the carry flag for sbc hl,de it seems. :P
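A quick Python model shows why that matters. `sbc hl,de` subtracts the incoming carry as well, so a stale carry turns the comparison into an off-by-one. The `int_ne` function and its `carry_in` parameter are made up for the sketch; `carry_in=0` corresponds to having the `xor a` in place.

```python
def int_ne(hl, de, carry_in):
    """Model of p_IntNe with whatever carry is present when sbc hl,de runs."""
    diff = (hl - de - carry_in) & 0xFFFF   # sbc hl,de
    return 0 if diff == 0 else 1           # jr z,$+5 / ld hl,1
```

With the carry reset, equal inputs give 0 as expected. With a leftover carry, 5 and 5 compare as "not equal" and 6 and 5 as "equal".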
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on July 27, 2015, 02:49:49 pm
I think I finally have a major optimization after having worked on link routines for the past couple of weeks. I didn't modify the timeout or syncing code, just the core get/send stuff. I've tested it and it is reliable.

For reference, in the event that p_SendByte doesn't have to wait, the new routine is 931cc vs 1647cc. Here are my proposed routines:

p_GetByte: +0 bytes, presumably as much faster as p_SendByte
Code: [Select]
p_GetByte:
.db __GetByteEnd-$-1
di
ld bc,$0803 ;Bit counter in b, bit mask in c
ld hl,-1
xor a
out (0),a ;Make sure we are reset
in a,(0)
and c ;Check to see if sender is ready
dec a
ret nz ;If not, then go back
inc a
out (0),a ;Relay a confirmation
ex (sp),hl ;Wait until confirmation is read (59 T-states minimum)
ex (sp),hl
ld a,(de) ;Bit counter in b and bitmask in c
xor a ;Store received byte in l
ld hl,$AA
out (0),a ;Reset the ports to receive data

__GetByteLoop:
    in a,(0)
    xor l
    rra
    jr c,__GetByteLoop
    in a,(0)
    rra
    rra             ;bits cycled in are masked with 0x55. Need to invert anyways, so mask at the end with 0xAA
    rr l
    djnz __GetByteLoop
    ret
__GetByteEnd:
   
p_SendByte: -4 bytes, -723cc
Code: [Select]
p_SendByte:
    .db __SendByteEnd-$-1
di
ld bc,$5503 ;Bit counter in b, bit mask in c
ld a,%00000010
out (0),a ;Indicate we are ready to send
__SendByteTimeout:
dec hl
ld a,h
or l
jr z,__SendByteDone
in a,(0) ;Loop is 59 T-states maximum
and c
jr nz,__SendByteTimeout ;Keep looping till we get it
out (0),a
__SendLoop:
    rrc e
    ccf
    rla
    sla b
    ccf
    rla
    out (0),a
    ex (sp),hl
    ex (sp),hl
    nop
    jr nz,__SendLoop
;need 37cc
    xor a
    ex (sp),hl
    ex (sp),hl
__SendByteDone:
    out (0),a
    ret
__SendByteEnd:
EDIT: I looked at the timeout code for p_SendByte, and realized that my code didn't need B to be a counter but instead I was using D as a kind of counter. By using B instead of D, I could cut out the ld d,$55, saving 2 bytes and 7cc.
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on September 21, 2019, 07:19:54 pm
Here is an optimized p_LineShr routine. NOTE: It flips the meaning of the carry flag on output, so the line routines that use this will need to ret c instead of ret nc.
Original routine
Code: [Select]
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
cp 64
ret nc
ld h,a
ld a,d
cp 64
ret nc

ld a,l
cp 96
ret nc
ld a,e
cp 96
ret nc

sub l
jr nc,__LineShrSkipRev
ex de,hl
neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
push af ; Saving DX (it will be popped into HL below)
ld a,l ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
rra
rra
rra
and %00011111
ld c,a
ld b,0
add ix,bc
ld a,d
add a,a
add a,a
ld c,a
add ix,bc
add ix,bc
add ix,bc
ld a,l ; Calculating the starting pixel mask
and %00000111
inc a
ld b,a
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
pop hl ; Recalling DX
ld l,a ; H=DX, L=DY
cp h
jr nc,__LineVert ; Line is more vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
scf ; Setting up gradient counter
ccf
rra
scf
ret ; c=1, z=vertical major
__LineShrEnd:
   
Optimized routine: -4 bytes, -13cc
Code: [Select]
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
ld h,a
ld a,63
cp h
ret c
cp d
ret c

ld a,95
cp l
ret c
cp e
ret c
ld a,e

sub l
jr nc,__LineShrSkipRev
ex de,hl
neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
push af ; Saving DX (it will be popped into HL below)
ld a,d
add a,a
add a,a
ld c,a
ld b,0
add ix,bc
add ix,bc
add ix,bc
ld a,l
and 7
ld e,a
xor l
rra
rra
rra
ld c,a
add ix,bc
ld b,a
inc b
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
pop hl ; Recalling DX
ld l,a ; H=DX, L=DY
cp h
jr nc,__LineVert ; Line is more vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
res 0,a ; Setting up gradient counter
rrca
ret ; c=0, z=vertical major
__LineShrEnd:



Or this version, which saves only 3 bytes but 10 more clock cycles:
Code: [Select]
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
ld h,a
ld a,63
cp h
ret c
cp d
ret c

ld a,95
cp l
ret c
cp e
ret c
ld a,e

sub l
jr nc,__LineShrSkipRev
ex de,hl
neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
ld e,a ; Saving DX
ld a,l ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
rra
rra
rra
and %00011111
ld c,a
ld b,0
add ix,bc
ld a,d
add a,a
add a,a
ld c,a
add ix,bc
add ix,bc
add ix,bc
ld a,l ; Calculating the starting pixel mask
and %00000111
inc a
ld b,a
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data

ld h,e ; DX
ld l,a ; DY

ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
cp h
jr nc,__LineVert ; Line is more vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
res 0,a ; Setting up gradient counter
rrca
ret ; c=0, z=vertical major
__LineShrEnd:
Title: Re: Assembly Programmers - Help Axe Optimize!
Post by: Xeda112358 on October 20, 2019, 10:47:31 am
p_EQ0

The current routine is 7 bytes and 36cc:
Code: [Select]
;7 bytes, 36cc
ld a,l
or h
add a,255
sbc hl,hl
inc hl

But we can save 8cc without sacrificing bytes:
Code: [Select]
;7 bytes, 28cc
xor a
cp h
ld h,a
sbc a,l
sbc a,a
ld l,a
inc l
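Both versions can be checked against each other over every possible HL with a small Python model. This is only a sketch: the function names are made up, and the flag handling follows the standard Z80 rules for add, cp and sbc.

```python
def eq0_original(hl):
    """Model of the 36cc routine: HL = 1 if HL was 0, else 0."""
    h, l = hl >> 8, hl & 0xFF
    a = l | h                        # ld a,l / or h
    carry = 1 if a + 255 > 255 else 0    # add a,255: carry unless A was 0
    hl = (-carry) & 0xFFFF           # sbc hl,hl
    return (hl + 1) & 0xFFFF         # inc hl


def eq0_faster(hl):
    """Model of the 28cc routine: same result, 8-bit operations only."""
    h, l = hl >> 8, hl & 0xFF
    a = 0                            # xor a
    carry = 1 if a < h else 0        # cp h: borrow unless H was 0
    h = a                            # ld h,a
    carry = 1 if a - l - carry < 0 else 0    # sbc a,l: borrow unless L and carry were 0
    a = (-carry) & 0xFF              # sbc a,a
    l = (a + 1) & 0xFF               # ld l,a / inc l
    return (h << 8) | l
```

The faster version chains the borrow from H into the test on L, so the final carry is set whenever either byte was non-zero.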