Author Topic: Assembly Coding Optimization  (Read 13218 times)


Halifax
Assembly Coding Optimization
« on: March 23, 2007, 10:46:00 am »
Ok I have compiled this code from MaxCoderz and figured it would be useful for some ASM programmers over here. You all can post your own routines and optimized stuff here too.


Optimization 1 - Loading 0 into A
----------------------------------
Unoptimized
CODE
ld a,0

to
CODE
xor a

or

sub a,a

ld a,0 is unoptimized because it takes 2 bytes and 7 T-states, while "xor a" and "sub a,a" each take 1 byte and 4 T-states. Both leave 0 in A just like ld a,0, but unlike ld a,0 they also change the flags.
------------------------------------


Optimization 2 - Comparing the value 0
------------------------------------
Unoptimized
CODE
cp 0

to
CODE
or a

Same as above. "or a" takes 1 byte and 4 T-states, while "cp 0" takes 2 bytes and 7 T-states. Both set the zero and sign flags the same way, although other flags such as P/V can differ.
-------------------------------------


Optimization 3 - Loading numbers into register pairs
-------------------------------------
Unoptimized
CODE
ld b,$88
ld c,$45

to
CODE
ld bc,$8845

The first one takes 14 T-states and 4 bytes, while "ld bc,$8845" takes only 10 T-states and 3 bytes. The high byte ($88) goes into B and the low byte ($45) goes into C.
---------------------------------------


Optimization 4 - Loading a Pic onto the screen (SPEED OPTIMIZATION)
---------------------------------------
Unoptimized
CODE
ld hl,pic      ; source: the picture data
ld de,gbuf     ; destination: the graph buffer
ld bc,768      ; 96x64 pixels / 8 = 768 bytes
ldir           ; copies BC bytes from (HL) to (DE)

to
CODE
ld hl,pic
ld de,gbuf
ld bc,768

copyfast:
 
There are 10 types of people in this world-- those that can read binary, and those that can't.

Fallen Ghost

« Reply #1 on: March 24, 2007, 06:13:00 am »
For #5:

Why should interrupts be enabled?

If an interrupt triggers, it will return correctly (let's suppose) and leave the stack pointer at the same value, so it does not change the bytes you are about to write with your pushes, and the result is the same.
But as we know, interrupts can trigger at any time. If the interrupt fires when b=1 and there are only a couple of pushes left to do, and the interrupt routine does more pushes than the number remaining in your routine, then whatever data sits before the buffer will be overwritten. So it is not a good idea, and interrupts should be disabled.

Halifax
« Reply #2 on: March 24, 2007, 11:22:00 am »
Fallen_Ghost, none of those routines belong to me. They were copied straight from MaxCoderz. I just pasted them in here, as I stated at the top of my post up there.

QUOTE

Ok I have compiled this code from MaxCoderz and figured it would be useful for some ASM programmers over here. You all can post your own routines and optimized stuff here too.

calc84maniac
« Reply #3 on: March 25, 2007, 08:48:00 am »
If you don't want the flags affected, use ld a,0 instead of xor a.
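
As a small illustration of when that matters, here is a sketch in TASM-style syntax. The routine name IsBelowTen is made up for this example; it relies on the carry flag surviving the load of 0 into A.

```asm
; Hypothetical example: return A = 1 if A < 10 on entry, else A = 0.
IsBelowTen:
 CP 10          ; carry is set if A < 10
 LD A,0         ; A = 0, carry preserved (XOR A would reset carry here)
 ADC A,A        ; A = 0 + 0 + carry
 RET
```

With xor a in place of ld a,0 the routine would always return 0, because the carry from the CP would be lost before the ADC.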
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Jon
« Reply #4 on: March 25, 2007, 04:44:00 pm »
Here's a simple size optimization for direct input:
Instead of:
CODE
out (1),a
nop
nop
in a,(1)

Use:
CODE
out (1),a
ld a,(de)
in a,(1)

LD A,(DE) creates the same delay as 2 NOP's, but it takes 1 byte instead of 2

Halifax
« Reply #5 on: March 26, 2007, 09:44:00 am »
wow very nice

EDIT: I just gained (with this post here) post number 666!

Iambian
« Reply #6 on: April 10, 2007, 10:16:00 am »
Size and speed optimization:
If you make an unconditional CALL right before a RET, you can replace that CALL with a JP instruction and drop the RET, since the called routine's RET will then return directly to your caller. This won't work if the called routine doesn't exit using a RET, but that's up to you to decide. If it works, that's one byte saved by not having to use a RET.

If the called routine is close enough (within reach of a signed 8-bit relative jump), you can save another byte by using the JR instruction instead.
For example:
CODE
 CALL SomeRoutine
 RET
;more code
SomeRoutine:
 XOR A
 RET

can be condensed to this:
CODE
 JR SomeRoutine
;more code
SomeRoutine:
 XOR A
 RET

If you can rearrange the code to make "SomeRoutine" appear right after the calling routine, you can save two more bytes by omitting JR altogether.

In this respect, it may do you some good to rearrange your ASM code to take advantage of this kind of optimization. Of course, you're going to have to be wary of errors resulting from the use of JR, such as a target that ends up out of range.

Oh, and as a response to the previous post that made a size optimization using a "LD A,(DE)", it's not exactly the same as using two NOPs. The "LD A,(DE)" takes 7 clock cycles against 8 for the two NOPs, so it is faster by one clock cycle. I understand that this is negligible, but I just wanted to point that out.
A Cherry-Flavored Iambian draws near... what do you do? ...

Jon
« Reply #7 on: April 12, 2007, 11:56:00 am »
yeah good point :) 7<8  heh

Iambian
« Reply #8 on: April 16, 2007, 09:21:00 am »
It might also be worth it to mention, in case you're doing interrupts, that if you wanted to call the TI-OS's interrupt service routine (RST 38h) on a conditional jump, you could save some memory by using a trick of the Z80's instruction set.

Instead of, say, "CALL C,0038h" or "JP C,0038h", you could do "JR C,$FF". Any of the four conditions that JR supports (NZ, Z, NC, C) works the same way.

This works because the "$FF" is a displacement of -1, measured from the address after the two-byte JR, so the jump lands one byte back, on the displacement byte itself (JR C,$+1). That byte, "$FF", is also the opcode for RST 38h. In that sense, you're combining two instructions in one.
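
As a rough sketch of that trick in TASM-style syntax (the label MaybeRunISR is made up for illustration), the two bytes can be written with the $+1 form or spelled out as raw data:

```asm
MaybeRunISR:
 JR C,$+1       ; assembles to $38,$FF ($ is the address of this JR)
                ; carry set:   the jump lands on the $FF byte, which runs
                ;              as RST 38h and later returns to the RET below
                ; carry clear: execution falls through to the RET below
 RET

; The same two bytes written out by hand:
; .db $38,$FF    ; $38 = JR C,d / $FF = displacement -1 = opcode of RST 38h
```

Either way the sequence is 2 bytes, against 3 bytes for CALL C,0038h.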

For what you'd use this for, I'd have no idea. Perhaps someone that wanted to call the interrupt service routine while the interrupts were off? Perhaps it might be a way to keep romcalls like _getCSC and _getKey working while the interrupts are gone.

Jon
« Reply #9 on: April 16, 2007, 04:09:00 pm »
That's brilliant, man.  And it's pure luck that the opcode for rst 38h is the signed 8-bit value for -1. That's friggin' awesome, kudos! :)
Although, wouldn't the command be jr c,$-1 ? I believe tasm will interpret jr c,$ff to mean jr c,$00ff, and hence give you a relative branch range error.

Iambian
« Reply #10 on: April 17, 2007, 04:28:00 am »
I actually got that trick off of some very old Z80 documentation. I just thought it curious.

In TASM, JR (condition),$+2 gives you the instruction after it, and JR (condition),$-1 gives you the byte just before it. To make it continuously loop on itself (jump back to the beginning of the instruction), you'd do JR (condition),$+0, so it's natural to then believe that JR (condition),$+1 would give you one byte within the instruction.

Also, no errors will happen since TASM already recognizes the instruction to take a single byte argument. Believe me; I've tried a similar stunt before (though not as memory-efficient), especially if you read up on my unreadable Z80 source :)

Also, to keep on topic:

For a speed optimization when working with a list of two-byte values, you can abuse the stack to quickly address these values. That is, set SP to the start of the list. Then you can repeatedly POP values off the table. The values are not destroyed, only read. If you need to update a value, you can PUSH the new value back in and then use INC SP twice to move the pointer forward again to where you were (although just POP-ing the value again saves two clock cycles compared to using INC SP twice, unless you're POP-ing into an index register). Accessing values in this fashion will save you many clock cycles, especially compared to the standard "LD E,(HL) \ INC HL \ LD D,(HL) \ INC HL" sequence that eats up a hefty 26 clock cycles. Simply POP-ing the value uses only ten clock cycles.

This is especially useful for bubble-sorting or perhaps grabbing the largest and smallest value on the table.

Remember to disable interrupts and to save SP prior to editing SP to read the table.
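
A minimal sketch of that idea in TASM-style syntax, summing a small table of words. The names SumTable, Table, Count and SavedSP are made up for this example, and it assumes interrupts were enabled on entry and that the code runs from RAM so SavedSP is writable.

```asm
Count .equ 4                 ; number of words in the table (1..255)

; Sum Count 16-bit words starting at Table; result in HL.
SumTable:
 DI                  ; an interrupt would push onto our fake stack and corrupt the table
 LD (SavedSP),SP     ; remember the real stack pointer
 LD SP,Table         ; SP now points at the first word
 LD HL,0
 LD B,Count
SumLoop:
 POP DE              ; 10 clock cycles: DE = next word, SP moves on by 2
 ADD HL,DE
 DJNZ SumLoop
 LD SP,(SavedSP)     ; restore the real stack before returning
 EI                  ; assumption: interrupts were on when we were called
 RET

SavedSP:
 .dw 0
Table:
 .dw 1000,2000,3000,4000
```

Nothing may be CALLed or PUSHed for other purposes between the LD SP,Table and the restore, since any such push would overwrite table entries.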
------------------
Another optimization trick: when you want to, say, stop the program to output an error message, you could, instead of loading the address of the string into HL, do the following:

CODE
 CALL ErrorCode \ .db "ERROR1",0

ErrorCode:
 POP HL
;process string address now in HL
 JR ProgramEnd

That works because when the CALL is made, the return address pushed onto the stack points to the string right after it, since that would have been the next address to execute. ErrorCode ends with a jump to a place that restores SP prior to exiting, so this will only work if you have code to restore the stack. If all your error messages are of a fixed size, you could save more space by cutting out the null terminator and editing your text output routine to cope with a fixed size.

For further optimization (in case you don't want to have code to jump over all that CALL and text), note that all uppercase ASCII letters ($41-$5A) fall within the opcode block for 8-bit loads ($40-$7F). Lowercase letters do too, but "p" through "w" ($70-$77) store a register to (HL), except "v" ($76), which is HALT, so avoid those. If your text has no such letters, special characters, spaces, or numbers, you could make the call a conditional one and have your little Z80 run the text as if it were code when the condition fails. This is a dangerous practice, especially if you are going to change your text later, and you'd need to look at an opcode table to determine which registers get destroyed by the various loads. The reason you cannot use the space character is that its value ($20) is the opcode of a JR instruction (JR NZ). If you were especially savvy about the placement of your code, you could use this feature to your advantage.

Take care with the use of that optimization. It certainly isn't a speed optimization, but it most certainly is a size optimization, especially if you're CALLing on a condition that JR doesn't take (like M or P, for instance).
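
To make the conditional-call variant concrete, here is a sketch in TASM-style syntax. The routine name CheckError and the convention that A is non-zero on failure are assumptions for this example, and the text is "ERROR" rather than "ERROR1" because the digit "1" ($31) is not a load opcode.

```asm
; Hypothetical routine: A is non-zero on entry if something went wrong.
CheckError:
 OR A
 CALL NZ,ErrorCode   ; error: the pushed return address points at the text below
 .db "ERROR",0       ; no error: these six bytes execute as
                     ;  'E' $45 LD B,L   'R' $52 LD D,D   'R' $52 LD D,D
                     ;  'O' $4F LD C,A   'R' $52 LD D,D    0  $00 NOP
 ; normal path continues here; only B and C have been overwritten
 RET

ErrorCode:
 POP HL              ; HL = address of "ERROR",0
 ; ...display the string at HL here (routine not shown)...
 RET                 ; stack is balanced again, so this returns to CheckError's caller
```

Changing the message means rechecking every byte against an opcode table, which is the danger mentioned above.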

Fallen Ghost

« Reply #11 on: April 24, 2007, 02:29:00 pm »
One other trick I found is a small speed optimization (but you lose 1 byte):
Instead of doing this (22 T-states, 4 bytes)
CODE
ld e,a     ; 4 T-states, 1 byte
ld d,0     ; 7 T-states, 2 bytes
add hl,de  ; 11 T-states, 1 byte

One could do (19/20 T-states, 5 bytes)
CODE
add a,l

calc84maniac
« Reply #12 on: April 27, 2007, 02:27:00 am »
And, rather than comparing, you can repeatedly 'dec a'. And as a bonus you can assume a is 0 when the function gets called. :)

Edit: There is also no command "call (hl)". Not to mention addresses are two bytes, not one.
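
A minimal sketch of the 'dec a' idea in TASM-style syntax, assuming A holds a small index (0, 1 or 2) and using made-up labels Dispatch and Case0 through Case2:

```asm
; Dispatch on A = 0, 1 or 2 without any CP instructions.
Dispatch:
 OR A           ; Z set if A = 0
 JR Z,Case0
 DEC A          ; Z set if A was 1
 JR Z,Case1
 DEC A          ; Z set if A was 2
 JR Z,Case2
 RET            ; any other value: do nothing

Case0:          ; each handler is entered with A = 0
 RET
Case1:
 RET
Case2:
 RET
```

Each DEC A is 1 byte and 4 T-states, where a "cp imm8" would be 2 bytes and 7 T-states.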

Fallen Ghost

« Reply #13 on: April 27, 2007, 10:37:00 am »
But actually, "cp reg8" takes 4T, while "cp imm8" takes 2 bytes and 7T.

For that effect, you could do this:

CODE
ld hl,jump_table
add a,a

Halifax
« Reply #14 on: April 27, 2007, 10:41:00 am »
Oh yeah, you're right, Fallen_Ghost, heh, good catch. :thumb:

Isn't it kind of self-explanatory that jp (ix) would work since jp (hl) works? ;)

Oh yeah, and just in case you didn't know or wanted an easier way to find out whether a command works, you could just look in TASM80.tab