Quote: I would prefer no extra RAM usage at all. Otherwise, games may not run on any TI-84+ manufactured after April 2007 and will not be compatible with the regular 83+, meaning a considerable drop in the author's audience.
Generally, RAM usage just means the routine needs a few temporary bytes to store data (bytes in the program itself, in the RAM that TI-OS makes available, or in the free RAM areas of TI-OS). You only use the extra RAM pages when you need a good amount of memory.
Quote: In the new calcs only the page 53h is still there. It is the same port as before. Dunno what 3rd party software uses it...
Actually, we don't know which one is still there. All we know is that pages $82-$87 appear to be the same 16K of physical memory.
Quote: In the new calcs only the page 53h is still there. It is the same port as before. Dunno what 3rd party software uses it...
If I'm not mistaken, I think Omnicalc's Quick Apps uses it and it works fine on a new 84+SE.
Quote: Safe Copy requires the undocumented instruction:
in f,(c)
Will that be compatible with the Nspire? If anyone has one, could you please try adding this to any Axe Parser code and see if it crashes:
Asm(0E10ED70)
It won't be compatible. But it doesn't require that instruction anyway. I always do this:
in a,($10)
rla
jr c,$-3 ;or was it "nc"?
bit 7,d
jr z,__MulSNotNeg
inc b
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
__MulSNotNeg:
xor a
or d
jr nz,$+3
ex de,hl
ld a,l
ld hl,0
_multloop:
rra
jr nc,$+3
add hl,de
sla e
rl d
or a
jr nz,_multloop
ret
ld a,d
rrca
cp d
jr nz,$+3
ex de,hl
xor a
inc h
jr nz,$+3
sub e
ld h,a
ld a,l
ld l,0
or a ;Returns if multiplying by 0 or -256, also resets carry flag
ret z
_multloop:
rra
jr nc,$+3
add hl,de
sla e
rl d
or a
jr nz,_multloop
ret
Quote: Kind of. I'm going to release all of my templated assembly code that the executable programs use, but I don't think I will release the source of the parser itself. Right now, I'm not too worried about the optimizations. It's the actual code of the parser I am trying to finish first so I can release a beta, but I keep getting distracted by wanting to add more commands since they're relatively easier and more fun :P
Never release the source to the public before releasing the software, btw. If, for example, you decide to post Axe 0.2 on both Omnimaga and ticalc at the same time, wait until it makes it into the ticalc archives before releasing the source. Better protection against code thieves.
Quote: I've got no clue whether or not you've seen this, but I think it might be of some help. http://map.grauw.nl/sources/external/z80bits.html
Yeah, that's what I've been using, but it doesn't have many signed routines, just unsigned.
Quote: Hmmm it would seem so. Is it really so much of a hassle to flip the negative bit, multiply/divide, then flip it again? Seems trivial compared to any other modification, although I'm not an asm guy :P
You can't just flip a bit. You have to subtract from zero.
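To make the difference concrete, here is a small Python model of 16-bit two's-complement values. The helper names are made up for illustration and are not anything from Axe. Flipping bit 15 of 1 gives $8001, which reads as -32767, while subtracting from zero gives $FFFF, which reads as -1.

```python
MASK = 0xFFFF


def to_signed(x):
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x


def flip_sign_bit(x):
    """What 'just flipping the negative bit' would do."""
    return (x ^ 0x8000) & MASK


def negate(x):
    """Two's-complement negation: subtract from zero, modulo 2^16."""
    return (0 - x) & MASK


print(hex(flip_sign_bit(1)), to_signed(flip_sign_bit(1)))
print(hex(negate(1)), to_signed(negate(1)))
```

This is the same operation the `xor a / sub e / ld e,a / sbc a,a / sub d / ld d,a` sequences in the multiply routines perform on DE, one byte at a time.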
Quote: By the way calcmaniac, your code didn't work. I tried -1 times 1 and it returned a weird number. But I was able to create my own routine after looking at some other code. It's a little slower, but it's roughly the same size as the original 8-bit routine.
I just typed it in on my calculator and tried $ffff*$0001 and $0001*$ffff and both gave $ffff. Did you make a typo somewhere or something?
Min_HLDE:
xor a
sbc hl,de
jr c,$+4
ld h,a
ld l,a
add hl,de
Max_HLDE:
xor a
sbc hl,de
jr nc,$+4
ld h,a
ld l,a
add hl,de
Quote: That's a clever trick! But wouldn't something like this be simpler?
Ah, I didn't think of that. :P Here's a good way to do signed comparison by the way:
or a
sbc hl,de
add hl,de
jr nc,$+3
ex de,hl
But I'm trying to convert all of my math commands to signed operations anyway, so I would need to tweak it a bit.
Quote: Does anyone know any good sin/cos routines that are under 128 bytes? The entire circle should be 256 brads (binary radians) so each quadrant is 64. It doesn't need to be 100% accurate, but it should be pretty close. It doesn't need to be that fast either, but I would prefer using a method that doesn't need multiplication, such as a lookup table or CORDIC.
I think so. It was made by Will West, but I couldn't find the original post on Revsoft, so I'm adding it as an attachment.
DispGraphRR:
di
ld a,$80
out ($10),a
ld (save_sp),sp
ld l,plotSScreen&$ff - 1
ld de,appbackupscreen - plotSScreen
ld sp,plotSScreen - appbackupscreen + 12
ld c,$1f
dec (iy+asmflags2)
jr nz,gray4skip
ld (iy+asmflags2),3
jr gray4entry3
gray4skip:
ld a,(flags+asmflags2)
dec a
jr z,gray4entry2
gray4entry1:
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
ld a,c
cp $2c
jr z,gray4end
out ($10),a
gray4loop1:
ld a,(hl)
add hl,de
xor (hl)
and %11011011
xor (hl)
add hl,sp
out ($11),a
djnz gray4loop2
gray4entry2:
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
ld a,c
cp $2c
jr z,gray4end
out ($10),a
gray4loop2:
ld a,(hl)
add hl,de
xor (hl)
and %01101101
xor (hl)
add hl,sp
out ($11),a
djnz gray4loop3
gray4entry3:
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
ld a,c
cp $2c
jr z,gray4end
out ($10),a
gray4loop3:
ld a,(hl)
add hl,de
xor (hl)
and %10110110
xor (hl)
add hl,sp
out ($11),a
djnz gray4loop1
jr gray4entry1
gray4end:
ld sp,(save_sp)
ei
ret
ld a,c
cp $2c
jr z,__Disp4LvlDone
ld h,plotSScreen >> 8
inc l
ld b,64
inc c
out ($10),a
inc (iy+asm_flag2)
jr z,__Disp4Lvlentry3
ld a,(flags+asm_flag2)
inc a
jr z,__Disp4Lvlentry2
ld (iy+asm_flag2),-2
So that you don't need to initialize the byte that keeps track of the gray layer, since it falls through if the number was uninitialized.
I have the newer 84 with the slower LCD by the way. I'll see if this fixes it. Although, I would rather the user not need to do this when running Axe games.
If it's a shoddy LCD it really can't be helped. :( It's a shame that some of the hardware is low quality.
p_Disp4Lvl:
di
ld a,$80
out ($10),a
ld (OP1),sp
ld sp,appbackupscreen - plotSScreen
ld e,(plotSScreen-appbackupscreen+12)&$ff
ld c,-$0C
ex af,af'
ld a,%11011011
inc (iy+asm_flag2)
jr z,__Disp4Lvlskip
add a,a
ld hl,(flags+asm_flag2)
inc l
jr z,__Disp4Lvlskip
rlca
ld (iy+asm_flag2),-2
__Disp4Lvlskip:
ld l,plotSScreen&$ff-1
ex af,af'
__Disp4Lvlentry:
ld a,c
add a,$2C
ld h,plotSScreen>>8
inc l
ld b,64
out ($10),a
__Disp4Lvlloop:
ld a,(hl)
add hl,sp
xor (hl)
ex af,af'
cp e ;Epic coincidence that e just happened to be between 182 and 219
rra
ld d,a
ex af,af'
and d
xor (hl)
out ($11),a
ld d,(plotSScreen-appbackupscreen+12)>>8
add hl,de
djnz __Disp4Lvlloop
inc c
jr nz,__Disp4Lvlentry
__Disp4LvlDone:
ld sp,(OP1)
ei
ret
__Disp4LvlEnd:
Quote: Hmm, remember when I suggested you could use IX as a pointer to the variables for easier access when doing 8-bit operations? That would be too much of a hassle, we agreed. But what if you moved the variables to the END of savesscreen? Then they would be within the range of the IY register (which points 4 bytes after the end of savesscreen). Just something to consider :)
Hmm... interesting proposition. Although it would optimize some operations like addition and subtraction, there is still one key advantage to using the existing variable slots. By using only the least significant bytes of those variables, you never need to do "conversions" when you switch from word to byte mode. You have to get input and output somehow, otherwise there's not much advantage to the new mode. By being able to skip the conversions, I think that will save more memory in the long run than by using the iy or ix registers.
Quote: I've further optimized that grayscale command and it's basically the same size as my original now, except way faster, which is excellent.
Cool to hear :D I can't wait to see the difference :)
Quote: calc83manic
You downgraded calc84maniac to 6 MHz hardware :(
oh crap I didn't notice x.x. I hope he still finishes all his projects including the 8 level grayscale raycaster D: *runs*
He also downgraded him to having a mood disorder.
LOL, you guys... :P
It's awesome that you were able to further optimize the grayscale, this makes me very happy! :D
lol :P
Quote: Maybe you could use ixl/ixh to avoid the shadow reg? It is 9 clocks for ld iirc tho :S.
Aren't those incompatible with the Nspire, though? Or am I confusing them?
I'm pretty sure they're not compatible, since they are undocumented. :(
What are these "conversions" you are talking about, and why are they affected by using the IY register? You can keep using normal memory loads/stores as much as you want, IY is just a bonus optimization (especially once you add 8-bit math mode). I'd imagine being able to directly add, subtract, and, or, xor variables in only 3 bytes is worth moving the variables to the end of the buffer.
xor a
ld b,a
ld c,a
cpir
ld hl,-1
sbc hl,bc
ld hl,(var_a)
ld de,5
or a
sbc hl,de
;The Z condition code is set to correspond to 1
jp nz,end
ld hl,(var_b)
ld de,-10
add hl,de
;The NC condition code is set to correspond to 1
ld hl,(var_a)
jp c,no_inc
inc hl
no_inc:
I'm thinking it would be neat for the commands that return booleans to not always have to calculate the 0 or 1 value directly. I was thinking instead that the command will set an internal compiler variable that tells which condition code to use. Then, the following command can either optimize according to the condition or otherwise generate a 0 or 1 as usual.
ld hl,(var_a)
ld de,5
or a
sbc hl,de
;The NZ condition code is set to correspond to 1
ld hl,1
jp nz,_
ld hl,(var_b)
ld de,6
or a
sbc hl,de
;The NZ condition code is set to correspond to 1
ld hl,1
jr nz,_
dec hl
_
p_Sqrt:
.db __SqrtEnd-1-$
ld a,l
ld l,h
ld de,$0040
ld h,d
ld b,8
or a
__SqrtLoop:
sbc hl,de
jr nc,__SqrtSkip
add hl,de
__SqrtSkip:
ccf
rl d
rla
adc hl,hl
rla
adc hl,hl
djnz __SqrtLoop
ld h,0
ld l,d
ret
__SqrtEnd:
Code: (Current code) p_GetBit0: | Code: (Optimized code) p_GetBit0: |
Code: (Current code) p_SIntGt: | Code: (Optimized code) p_SIntGt: |
Changing the high order bit does work, actually. It changes a comparison in the -32768 to 32767 range to a comparison in the 0 to 65535 range (effectively changing from a signed comparison to an unsigned comparison).
hl | de | sbc hl,de | c | p/v | s | hl>>de |
2000 | 6000 | C000 | 1 | 0 | 1 | 0 |
2000 | A000 | 8000 | 1 | 1 | 1 | 1 |
2000 | E000 | 4000 | 1 | 0 | 0 | 1 |
6000 | 2000 | 4000 | 0 | 0 | 0 | 1 |
6000 | A000 | C000 | 1 | 1 | 1 | 1 |
6000 | E000 | 8000 | 1 | 1 | 1 | 1 |
A000 | 2000 | 8000 | 0 | 0 | 1 | 0 |
A000 | 6000 | 4000 | 0 | 1 | 0 | 0 |
A000 | E000 | C000 | 1 | 0 | 1 | 0 |
E000 | 2000 | C000 | 0 | 0 | 1 | 0 |
E000 | 6000 | 8000 | 0 | 0 | 1 | 0 |
E000 | A000 | 4000 | 0 | 0 | 0 | 1 |
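The high-bit claim can be checked with a few lines of Python. This is only a model of the arithmetic, with hypothetical helper names, using the same four sample values as the table: toggling bit 15 of both operands and then comparing them as unsigned numbers gives the same answer as a signed comparison.

```python
def to_signed(x):
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x


def signed_gt_via_flip(hl, de):
    """Signed hl > de, done as an unsigned compare after toggling bit 15."""
    return (hl ^ 0x8000) > (de ^ 0x8000)


values = [0x2000, 0x6000, 0xA000, 0xE000]
for hl in values:
    for de in values:
        # The flipped unsigned compare must agree with the true signed compare
        assert signed_gt_via_flip(hl, de) == (to_signed(hl) > to_signed(de))
```

Toggling bit 15 maps -32768..32767 onto 0..65535 without changing the order, which is why the comparison result is preserved.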
Quote: Yeah, I'm still reading all of this, even though I'm less active, I still visit just about every day :) I've even been able to make a little more progress with Axe even with my busy schedule.
Ah phew, good to hear x.x. Still, I hope the schedule won't get even more hectic with time. X.x
Quote: It seemed to tell me that signed comparisons relied on an xor of the p/v and s flags. Which makes no sense, but that's what wabbitemu was telling me.
It actually does make a bit of sense. Whether the mathematical (non-overflowed) result of the subtraction is positive or negative should give you the result of the comparison. However, if there was a signed overflow, it will give the wrong result. So the sign flag needs to be inverted if there was an overflow, and XOR achieves this perfectly.
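A rough Python model of `sbc hl,de` shows the same thing. The function names are hypothetical, and only the C, P/V and S flags are modelled. Signed "hl < de" comes out as S xor P/V.

```python
def to_signed(x):
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x


def sbc_hl_de(hl, de, carry_in=0):
    """Model of a 16-bit subtract with borrow: returns (result, c, pv, s)."""
    full = hl - de - carry_in
    result = full & 0xFFFF
    c = 1 if full < 0 else 0                      # unsigned borrow
    true_diff = to_signed(hl) - to_signed(de) - carry_in
    pv = 0 if -32768 <= true_diff <= 32767 else 1  # signed overflow
    s = (result >> 15) & 1                        # sign of the wrapped result
    return result, c, pv, s


def signed_lt(hl, de):
    """Signed hl < de, taken from the flags as S xor P/V."""
    _, _, pv, s = sbc_hl_de(hl, de)
    return bool(s ^ pv)


print(sbc_hl_de(0x2000, 0xA000))
print(signed_lt(0xA000, 0x6000))
```

The first call corresponds to the second row of the table (result $8000 with c, p/v and s all set). In that row S alone would claim hl is smaller, and the overflow flag corrects it.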
Cool, Quigibo added all my optimized auto-optimizations :) But I think you missed p_GetBit15, which can be optimized to be the same as p_Mod2.
Output(0,0,"Hello World")
Output(0,0,"Hello World
This relates to the underlying z80 machine code (or ASM, if you prefer) that Axe generates. ;)
p_Add510:
.db 4
inc h
inc h
dec hl
dec hl
p_Add514:
.db 4
inc h
inc h
inc hl
inc hl
p_Add767:
.db 4
inc h
inc h
inc h
dec hl
p_Add769:
.db 4
inc h
inc h
inc h
inc hl
p_Add1024:
.db 4
inc h
inc h
inc h
inc h
p_Sub510:
.db 4
dec h
dec h
inc hl
inc hl
p_Sub514:
.db 4
dec h
dec h
dec hl
dec hl
p_Sub767:
.db 4
dec h
dec h
dec h
inc hl
p_Sub769:
.db 4
dec h
dec h
dec h
dec hl
p_Sub1024:
.db 4
dec h
dec h
dec h
dec h
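As a sanity check on the constants, the sequences above can be replayed in a small Python model. It is a sketch only: flags are ignored and the helper names are invented. It relies on the fact that `inc h` and `dec h` move HL by 256 because only the high byte changes.

```python
MASK = 0xFFFF


def inc_h(hl):
    # inc h: only the high byte changes, so HL moves by 256 (wrapping at 16 bits)
    return (hl + 0x100) & MASK


def dec_h(hl):
    return (hl - 0x100) & MASK


def inc_hl(hl):
    return (hl + 1) & MASK


def dec_hl(hl):
    return (hl - 1) & MASK


# Instruction sequences copied from the listings above, keyed by net change
SEQUENCES = {
    510: [inc_h, inc_h, dec_hl, dec_hl],
    514: [inc_h, inc_h, inc_hl, inc_hl],
    767: [inc_h, inc_h, inc_h, dec_hl],
    769: [inc_h, inc_h, inc_h, inc_hl],
    1024: [inc_h, inc_h, inc_h, inc_h],
    -510: [dec_h, dec_h, inc_hl, inc_hl],
    -514: [dec_h, dec_h, dec_hl, dec_hl],
    -767: [dec_h, dec_h, dec_h, inc_hl],
    -769: [dec_h, dec_h, dec_h, dec_hl],
    -1024: [dec_h, dec_h, dec_h, dec_h],
}


def run(delta, hl):
    """Apply the sequence for a given constant to a starting HL value."""
    for step in SEQUENCES[delta]:
        hl = step(hl)
    return hl


print(run(510, 0x1234) == (0x1234 + 510) & MASK)
```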
Code: (Current code) p_EQN256: | Code: (Optimized code) p_EQN256: |
Code: (Current code) p_NEN3: | Code: (Optimized code) p_NEN3: |
Concerning optimizations:
I used an Assembly Disassembler to disassemble my game uPong and the code was 1531 lines! It was huge. Hopefully, thanks to Runner's and other members' optimizations, Axe programs can get smaller and smaller :)
Quote: These will decrease size, but also slightly decrease speed. I believe that Quigibo prefers smaller sizes over slightly slower code.
I think it depends, though. A slight speed decrease of 0.01 to 0.02 FPS may not sound like much if the command is executed once, but if someone uses it 20 times per loop he will see the difference. Otherwise I think he definitely optimizes for size.
Quote: The problem with a compile setting for size/speed is that generally, you want part of your code to be optimized for size and another part to be optimized for speed. This could be done with "speed-op" and "size-op" commands, or perhaps a prefix for labels like "Lbl +LBL" indicates everything until the next label is optimized for speed and everything else by default is optimized for size. Any shared subroutines like multiplication are added in the second pass, so if multiplication was flagged to be fast in one place, it would be fast everywhere else too: since it has to add the huge routine to your code anyway, all the other calls should share it.
That would be a nice idea actually. My worry was effectively about some parts needing to be smaller but others faster. Good luck with whatever you decide. :)
It would make the parser a lot bigger, but it has room. I'm almost full on the first page, but I still have about 10KB of extra room on the second page. Commands are data in the Axe app so they would go on that page anyway.
Code: ;##### OLD ##### | Code: ;###### NEW ###### |
Routine | Original | Optimized |
p_Nib1 | 18 bytes, ~72 cycles | 17 bytes, ~105 cycles |
p_Nib2 | 18 bytes, ~68 cycles | 15 bytes, ~77 cycles |
p_NibSto | 23 bytes, ~127 cycles | 22 bytes, ~110 cycles |
p_InvBuff | 16 bytes, 38425 cycles | 16 bytes, 28474 cycles |
p_Unarchive | 18 bytes, a lot of cycles | 18 bytes, a lot of cycles minus 4 |
p_Archive | 18 bytes, a lot of cycles | 18 bytes, a lot of cycles minus 4 |
p_GetArc | 55 bytes, a lot of cycles | 51 bytes, a lot of cycles |
p_GetBit | 13 bytes, ~110 cycles | 12 bytes, ~152 cycles |
p_FlipV | 13 bytes, 338 cycles | 13 bytes, 322 cycles |
p_FlipH | 21 bytes, 1907 cycles | 21 bytes, 1891 cycles |
p_RotC | 22 bytes, 2874 cycles | 20 bytes, 2708 cycles |
p_RotCC | 22 bytes, 2874 cycles | 20 bytes, 2708 cycles |
p_Nib2:
.db __Nib2End-$-1
xor a
srl h
rr l
rrd
jr c,__Nib2Skip
rld
__Nib2Skip:
ld l,a
ld h,0
ret
__Nib2End:
However, these are the concerns I have: First, the sprite rotation commands, why did you move the ret to the middle of the routine? It looks like that's just going to add more cycles since a conditional jr takes the same amount of cycles as a regular jr anyway.
Next, is it really a safe assumption that all ROM pages are between $7F and $FF for all current models and potentially future models?
And lastly, are you sure that trying (unsuccessfully) to modify ROM has no potential side effects on things like flags and registers?
This would be really great. When I programmed in Axe, knowing the size of commands was always something I wanted to be aware of, and I'm sure a lot of people would like it, since some people might be really tight on memory and want to find every way to optimize for size.
EDIT: By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
Quote: Faster buffer inversion routine. 9951 cycles saved.
It's over 9000!!!!
Quote: By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
/me loves runner
Lol I just actually noticed that ;D
What?!? O.O
9000?
<_< Yea, I know... I had to...
But seriously dude, all those optimizations are awesome! ;D
p_RotC:
.db __RotCEnd-1-$
ex de,hl
ld c,8
__RotCLoop1:
ld hl,vx_SptBuff+8
ld b,8
ld a,(de)
__RotCLoop2:
dec l
rra
rr (hl)
djnz __RotCLoop2
inc de
dec c
jr nz,__RotCLoop1
ret
__RotCEnd:
p_RotCC:
.db __RotCCEnd-1-$
ex de,hl
ld c,8
__RotCCLoop1:
ld hl,vx_SptBuff+8
ld b,8
ld a,(de)
__RotCCLoop2:
dec l
rla
rl (hl)
djnz __RotCCLoop2
inc de
dec c
jr nz,__RotCCLoop1
ret
__RotCCEnd:
ld hl,(var)
dec hl
ld a,h
or l
jp nz,DS_End
;Code inside statement goes here
ld hl,max
DS_End:
ld (var),hl
;const->{expr}
;Evaluate expr here
ld (hl),const
;const->{expr}r
;Evaluate expr here
ld (hl),const & $FF
inc hl
ld (hl),const >> 8
;const->{expr}rr
;Evaluate expr here
ld (hl),const >> 8
inc hl
ld (hl),const & $FF
;0->{expr}r or 0->{expr}rr
;Evaluate expr here
xor a
ld (hl),a
inc hl
ld (hl),a
Code: (Original code: 26 bytes, 152/66 cycles) | Code: (Optimized code: 24 bytes, 144/66 cycles) | Code: (Optimized (and fixed?) code: 24 bytes, 144/113 cycles) |
Code: (Original code) | Code: (Fixed code) |
Code: (Original code: 18 bytes, n*55+51 cycles) | Code: (Optimized code: 18 bytes, n*39+64 cycles) |
I see you've been reading up on my Commands documentation, eh squidgetx? Yeah, that's an interesting thing I discovered when speed testing the display commands. On calculators like mine with the old, "good" screen drivers, the screen driver delay seems to be pretty low and constant from calculator to calculator. DispGraph could run just as fast or faster than DispGraphr on these calculators. However, due to inconsistencies with the screen drivers in newer units, the routine may run too fast for the driver on some calculators, causing display problems, so Quigibo had to add a portion of code to pause the routine until the driver says it is ready. However, this pause itself adds some overhead time, making the routine slower.
Quigibo, the DispGraphr routine doesn't have any throttling system in place, yet no problems have been reported with it on newer calculators. Could you just remove the throttling system from the DispGraph routine and add one or two time-wasting instructions to make each loop iteration take as many cycles as each DispGraphr loop iteration?
EDIT: Hmm I don't know if Quigibo reads this thread and would see that, so I'm probably going to post that in a major thread he reads or send him a message about that.
Print()
The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?
Code: (Original code) | Code: (Optimized code) |
Code: (Original code) p_GetBit2: | Code: (Optimized code) p_GetBit2: |
Code: (Original code: 14 bytes, n*37+36 cycles) p_Sqrt: | Code: (Optimized code: 13 bytes, n*37+32 cycles) p_Sqrt: |
Code: (Original code: 29 bytes, too lazy to test cycles) p_Sin: | Code: (Optimized code: 26 bytes, too lazy to test, minus 8 cycles) p_Sin: |
Code: (Original code: 11 bytes, n*31+17 cycles) p_Log: | Code: (Optimized code: 10 bytes, n*31+13 cycles) p_Log: |
Code: (Mathematically correct code: 13 bytes, only a little bit slower) p_Log: |
Code: (Mathematically correct code: 16 bytes, only a little bit slower) p_Exp: |
Code: (Original code: 33 bytes, 1481 cycles) __DrawMskAligned: | Code: (Optimized code: 31 bytes, 1409 cycles) __DrawMskAligned: |
Code: (Original) p_Min: | Code: (Optimized) p_Min: |
Code: (Original) p_Max: | Code: (Optimized) p_Max: |
p_FastCopy6MHz:
.db __FastCopy6MHzEnd-1-$
in a,($02)
rla
in a,($20)
push af
xor a
out ($20),a
call sub_FastCpy
pop af
ret c
out ($20),a
ret
__FastCopy6MHzEnd:
Code: (Original) p_IntGt: | Code: (Optimized) p_IntGt: |
p_Sqrt:
.db __SqrtEnd-1-$
ex de,hl
ld bc,$8000
ld h,c
ld l,c
__SqrtLoop:
srl b
rr c
add hl,bc
ex de,hl
sbc hl,de
jr nc,__SqrtNoOverflow
add hl,de
ex de,hl
or a
sbc hl,bc
;jr __SqrtOverflow ;Commented out in favor of super optimization
.db $DA ;JP C opcode to skip next 2 bytes since carry is reset here.
__SqrtNoOverflow:
ex de,hl
add hl,bc
__SqrtOverflow:
srl h
rr l
srl b
rr c
jr nc,__SqrtLoop
ret
__SqrtEnd:
Quote: How does it compare to this one (http://ourl.ca/4175/130486)? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it. :P
Er... that looks impressive :P I'll have to take a closer look at how it works sometime.
p_GE0:
.db 3
ld hl,1
p_GT65535:
.db 3
ld hl,0
p_LE65535:
.db 3
ld hl,1
p_LT0:
.db 3
ld hl,0
p_GE1 =p_NE0
p_GT0 =p_NE0
p_LE0 =p_EQ0
p_LT1 =p_EQ0
p_GE32768 =p_Div32768
p_GT32767 =p_Div32768
p_LE32767 =p_SGE0
p_LT32768 =p_SGE0
p_GE65535 =p_EQN1
p_GT65534 =p_EQN1
p_LE65534 =p_NEN1
p_LT65535 =p_NEN1
p_GEconstMod256EQ0:
.db 6
ld a,h
sub const>>8
sbc hl,hl
inc hl
p_GTconstMod256EQ255:
.db 6
ld a,h
sub const+1>>8
sbc hl,hl
inc hl
p_LEconstMod256EQ255:
.db 6
ld a,h
add a,-(const+1>>8)
sbc hl,hl
inc hl
p_LTconstMod256EQ0:
.db 6
ld a,h
add a,-(const>>8)
sbc hl,hl
inc hl
p_GEconst:
.db 8
xor a
ld de,-const
add hl,de
ld h,a
rla
ld l,a
p_GTconst:
.db 8
xor a
ld de,-(const+1)
add hl,de
ld h,a
rla
ld l,a
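p_LEconst:
.db 7
ld de,-(const+1) ;carry set when hl > const
add hl,de
sbc hl,hl ;hl = -1 if carry, else 0
inc hl ;hl = 1 when hl <= const
p_LTconst:
.db 7
ld de,-const ;carry set when hl >= const
add hl,de
sbc hl,hl ;hl = -1 if carry, else 0
inc hl ;hl = 1 when hl < const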
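The general-constant routines all rest on one idea: adding the 16-bit negation of a constant to HL sets carry exactly when HL is at least that constant. Here is a quick Python model of that idea, with invented function names. It assumes the constant is in the range 1 to 65534, since the edge cases (0 and 65535) are covered by the special-case entries earlier in the list. It shows which constant each comparison needs.

```python
def add16(hl, de):
    """16-bit add hl,de: returns (result, carry)."""
    total = hl + de
    return total & 0xFFFF, total >> 16


def ge_const(hl, const):
    # add hl,-const carries when hl >= const; the routine returns the carry
    _, carry = add16(hl, (-const) & 0xFFFF)
    return carry


def gt_const(hl, const):
    # hl > const is the same as hl >= const+1
    _, carry = add16(hl, (-(const + 1)) & 0xFFFF)
    return carry


def le_const(hl, const):
    # sbc hl,hl / inc hl turns the carry into 1 - carry
    _, carry = add16(hl, (-(const + 1)) & 0xFFFF)
    return 1 - carry


def lt_const(hl, const):
    _, carry = add16(hl, (-const) & 0xFFFF)
    return 1 - carry


print(ge_const(10, 10), gt_const(10, 10), le_const(10, 10), lt_const(10, 10))
```

So "less than or equal" and "greater than" pair with const+1, while "less than" and "greater than or equal" pair with const.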
#define FULLSPEED in a,2 \ rla \ sbc a,a \ out (20h),a
Note: This has the side effect of out (0),0 on the TI-83+. Is that okay?
Quote: Borrowed from WikiTI (http://wikiti.brandonw.net/index.php?title=83Plus:Ports:20):
#define FULLSPEED in a,2 \ rla \ sbc a,a \ out (20h),a
And this gives you the added bonus of the CPU operating approximately 25KHz faster at full speed mode!
#define FULLSPEED in a,(2) \ and 80h \ rlca \ out (20h),a
The only side effect of this is that on the TI-83+ Basic this will cause both linkport lines to go high - which shouldn't matter too much if you're not using the linkport at that time, especially since both lines are high normally...
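For comparison, here is a tiny Python model of the value each macro leaves in A just before the `out (20h),a`. It assumes, as the discussion implies, that bit 7 of port 2 is clear on the TI-83+ and set on the faster models. The function names are invented for the sketch.

```python
def fullspeed_sbc(port2):
    """Value written by: in a,(2) ; rla ; sbc a,a"""
    carry = (port2 >> 7) & 1         # rla moves bit 7 into carry
    return 0xFF if carry else 0x00   # sbc a,a leaves -carry in A


def fullspeed_and(port2):
    """Value written by: in a,(2) ; and 80h ; rlca"""
    a = port2 & 0x80                 # keep only bit 7
    return ((a << 1) | (a >> 7)) & 0xFF  # rlca rotates bit 7 into bit 0


for port2 in (0x00, 0x80):
    print(hex(port2), hex(fullspeed_sbc(port2)), hex(fullspeed_and(port2)))
```

Both versions write 0 when bit 7 is clear. They differ only when it is set, where the first writes $FF and the second writes $01.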
Code: (Original code: 56 bytes) p_GetArc: | Code: (Optimized code: 51 bytes) p_GetArc: |
Oops, necropost, oh well :P
I don't know if this approach was purposely left out, as it's 15 bytes larger than the current routine and sometimes slower. I'm referring to the square root routine. Whereas the current routine (14 bytes) takes 37n+38 T-states (linear time), where n is the result+1 (1-256), the following routine (29 bytes) takes 5n+800 T-states (near constant time), where n is the number of set bits in the result (0-8). The existing routine is faster for values that would yield results of 0-19, but this routine would be faster for values that would yield results of 20-255, which is a much broader range of the 8-bit spectrum. Also, a near-constant running time is much more dependable for programs that rely on consistent timing to run smoothly. The existing routine would take only a few hundred T-states for low inputs, but would take up to OVER NINE THOUSAND T-states to calculate the square roots for the highest inputs. So it's up to you if this is something you want to use.
p_Sqrt:
.db __SqrtEnd-1-$
ld a,l
ld l,h
ld de,$0040
ld h,d
ld b,8
or a
__SqrtLoop:
sbc hl,de
jr nc,__SqrtSkip
add hl,de
__SqrtSkip:
ccf
rl d
rla
adc hl,hl
rla
adc hl,hl
djnz __SqrtLoop
ld h,0
ld l,d
ret
__SqrtEnd:
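The crossover can be reproduced from the two cycle formulas quoted in the post. The Python sketch below does that. It also includes a generic digit-by-digit integer square root to illustrate the bit-at-a-time idea. That function is a textbook version written for this example, not a transcription of the routine above.

```python
def linear_cycles(result):
    """37n+38 T-states, where n is the result + 1."""
    return 37 * (result + 1) + 38


def bitwise_cycles(result):
    """5n+800 T-states, where n is the number of set bits in the result."""
    return 5 * bin(result).count("1") + 800


def isqrt16(n):
    """Digit-by-digit square root of a 16-bit value, one result bit per pass."""
    root = 0
    bit = 1 << 14            # highest power of four that fits in 16 bits
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root


# Smallest result for which the bitwise routine wins
crossover = min(r for r in range(256) if bitwise_cycles(r) < linear_cycles(r))
print(crossover, linear_cycles(255), bitwise_cycles(255))
```

The loop in `isqrt16` always runs eight passes for a 16-bit input, which is where the near-constant running time comes from.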
Quote: Would it be possible to have normal and full compilation modes? If a program runs at normal speed by default, the code for changing the clock to normal at every DispGraph wouldn't be needed. This could also be used by 83+ owners so that when they compile a program, Full commands are ignored.
Full commands are already ignored in 83 Plus mode. In fact, if a Full command is run on an 83 Plus, HL returns zero and nothing happens. But I have seen the size of the Full command, and yes, I think it would be a good idea to have some option where Full and Normal are skipped.
Routine | Original | Optimized |
p_FastCopy | 46 bytes, ~59389 cycles | 45 bytes, ~57841 cycles |
p_DrawAndClr | 47 bytes, ~59389 cycles | 45 bytes, ~57841 cycles |
p_DispGS | 66 bytes, ~63507 cycles | 66 bytes, ~58660 cycles |
p_Disp4Lvl | 79 bytes, ~78433 cycles | 82 bytes, ~70740 cycles |
(Cycle counts assume a 3-cycle LCD port delay and exclude p_Safety.)
And as a side note, would it be possible to reformat DS<() so that the variable is reinitialized to its maximum value at the End? That way, 3 bytes could be saved by having both the zero and not zero conditions using the same store command. For example:
ld hl,(var)
dec hl
ld a,h
or l
jp nz,DS_End
;Code inside statement goes here
ld hl,max
DS_End:
ld (var),hl
Now that you have absolute jumps implemented:
Code: (Original code)
p_Exchange:
.db 13
pop de
ex (sp),hl
pop bc
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
ld a,b
or c
jr nz,$-8
Code: (Optimized code)
p_Exchange:
.db 12
pop de
ex (sp),hl
pop bc
__ExchangeLoop:
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
jp pe,__ExchangeLoop ;pe is right: ldi leaves P/V set while bc is still nonzero
Code: (Original code: 27 bytes, ~220 cycles) p_DKeyVar: | Code: (Optimized code: 23 bytes, ~259 cycles) p_DKeyVar: |
p_DispGS:
.db __DispGSEnd-1-$
ld hl,plotSScreen
ld de,appBackUpScreen
call $0000
push af
ld a,$07
out ($10),a ;many cc into
ld a,(flags+asm_Flag2)
rra
sbc a,a
xor %01010101
ld (flags+asm_Flag2),a
ld c,a
ld a,$80
__DispGSNext:
push af
out ($10),a ;74cc into, 71cc loop
ex (sp),hl ;waste
ex (sp),hl ;waste
rrc c
ld b,12
ld a,$20
out ($10),a ;71cc into
push af ;waste
pop af ;waste
__DispGSLoop:
inc bc ;waste
dec c ;waste
ld a,(de)
and c
or (hl)
inc de
inc hl
out ($11),a ;72cc into, 71cc loop
ld a,(hl) ;waste
djnz __DispGSLoop
pop af
inc a
bit 6,a
jr z,__DispGSNext
__DispGSDone:
pop af
out ($20),a
ld a,$05
out ($10),a ;83cc into
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-8
p_Inverse:
.db __InverseEnd-1-$
xor a
bit 7,h
push af
jr z,$+8
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
ex de,hl
ld bc,16<<8
ld hl,1
call $0000 ;sub_Div+10
pop af
ret z
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__InverseEnd:
.db rp_Ans,12
p_ReadArc:
.db __ReadArcEnd-1-$
ld c,a
in a,(6)
ld b,a
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,c
out (6),a
ld c,(hl)
inc hl
bit 7,h
jr z,__ReadArcNoBoundary
set 6,h
res 7,h
inc a
out (6),a
__ReadArcNoBoundary:
ld l,(hl)
ld h,c
ld a,b
out (6),a
ret
__ReadArcEnd:
p_ReadArcApp:
.db __ReadArcAppEnd-1-$
push hl
ld hl,$0000
ld de,ramCode
ld bc,__ReadArcAppRamCodeEnd-__ReadArcAppRamCode
ldir
pop hl
ld e,a
ld c,6
in b,(c)
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,e
call ramCode
ld e,d
inc hl
bit 7,h
jr z,__ReadArcAppNoBoundary
set 6,h
res 7,h
inc a
__ReadArcAppNoBoundary:
call ramCode
ex de,hl
ret
__ReadArcAppEnd:
.db rp_Ans,__ReadArcAppEnd-p_ReadArcApp-3
__ReadArcAppRamCode:
out (6),a
ld d,(hl)
out (c),b
ret
__ReadArcAppRamCodeEnd:
This is a really simple one. When an interrupt is called, interrupts are automatically disabled. So you don't need to start the interrupt routine with DI.
There's no chance of overlapping a sector boundary, but yeah you can overlap a page boundary. TI-OS doesn't allow variables to cross sector boundaries.
Quote: This is a really simple one. When an interrupt is called, interrupts are automatically disabled. So you don't need to start the interrupt routine with DI.
They are disabled automatically already... there is a di at the start of the interrupt routine. Is there some bug with that?
Also, about those archive reading commands... archive reading isn't as useful as it should be due to those sector boundary issues. For instance, you can't reliably iterate over a tilemap in archive, because there is a small chance it could straddle a sector boundary, and iterating over it would add a "glitch byte" to the map since each sector adds an extra byte in front. Although I guess you could modify those routines to take that into account; that might work, since you can't read more than 64 consecutive kilobytes anyway.
[Code comparison for p_SDiv: old vs. new]
p_88Div:
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld bc,$1000
ld a,l
ld l,h
ld h,c
call __DivLoop
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
Why not // ? Is that already something? /me checks the new wiki
Ah, it is...
What about ./ ? The dot is a 'point', and .* for multiplication.
A decimal dot doesn't mean a comment if it's not the first token on the line.
:If H
: 35->A
:Else
: 10->A
:End
:./B->C
[Code comparison for p_IntSetup. Original: 45 bytes, a lot of cycles. Optimized: 42 bytes, a lot minus 5 cycles.]
Next(): 2 bytes and a few cycles saved. Also, isn't the end-of-VAT check in the wrong place? I could be wrong because my VAT experience isn't too great, but because this routine checks for the end of the VAT at the start, wouldn't this command advance the VAT pointer to the end of the VAT and not recognize it as the end until the next Next()? This would cause problems with programs reading garbage VAT data for the last "entry." If I'm right about this (which may not be the case), the third block of code I posted should hopefully recognize the end of the VAT as soon as it hits it and never advance the VAT pointer to point to the end.
Original code (26 bytes, 152/66 cycles):
ld hl,(axv_X1t)
ld de,($982E)
or a
sbc hl,de
ret z
add hl,de
ld de,-6
add hl,de
ld e,(hl)
inc e
xor a
ld d,a
sbc hl,de
ld (axv_X1t),hl
ret
Optimized code (24 bytes, 144/66 cycles):
ld hl,(axv_X1t)
ld de,($982E)
or a
sbc hl,de
ret z
add hl,de
ld de,-6
add hl,de
ld a,(hl)
cpl
ld e,a
add hl,de
ld (axv_X1t),hl
ret
Optimized (and fixed?) code (24 bytes, 144/113 cycles):
ld hl,(axv_X1t)
ld de,-6
add hl,de
ld a,(hl)
cpl
ld e,a
add hl,de
ld de,($982E)
or a
sbc hl,de
ret z
add hl,de
ld (axv_X1t),hl
ret
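For what it's worth, the 2-byte saving in the optimized listings leans on two's complement: after ld de,-6 the D register still holds $FF, so cpl on the length byte gives DE = -(length+1) directly. Below is a small Python model of just that arithmetic. The function names are mine, and it assumes the byte read at ptr-6 is a name length below 255; it says nothing about where the end-of-VAT check belongs.

```python
def step_original(hl, name_len):
    """ld e,(hl) / inc e / xor a / ld d,a / sbc hl,de"""
    e = (name_len + 1) & 0xFF          # inc e wraps at 8 bits
    return (hl - e) & 0xFFFF           # d = 0 and carry was cleared by xor a


def step_optimized(hl, name_len):
    """ld a,(hl) / cpl / ld e,a / add hl,de, with d still $FF from ld de,-6"""
    e = ~name_len & 0xFF               # cpl
    de = 0xFF00 | e                    # 16-bit value equal to -(name_len + 1)
    return (hl + de) & 0xFFFF


# Both versions step the pointer back by name_len + 1 bytes.
print(hex(step_original(0x9000, 8)), hex(step_optimized(0x9000, 8)))
```

The two only disagree for a length byte of 255, where inc e wraps to 0 in the original; that should not come up for a valid name.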
A mix of a feature request and an optimization:
How about compound assignment operators (e.g. +=) for most Axe operations? They would offer savings on every operation that doesn't use a 2-byte variable at a constant address as the main operand. They could also offer even larger savings on basic operations like addition, subtraction, and bitwise logic.
[Code comparison: 17 bytes, ~62 cycles vs. 16 bytes, ~60 cycles; both listings begin with ld a,%11011011]
If condition
do stuff
Else
16->W
End
Obviously, after the Else, HL has to be 0. Thus the 16 can be loaded with ld l,16 instead of ld hl,16. It might be possible to auto-optimize stuff like 1->A:2->B into 1->A+1->B, but you could always leave that to the user like usual.
Also, I found it a bit annoying that when I did something like If E<(96*256), the part in the parentheses wasn't reduced to a constant before doing the less-than operation. Could the look-ahead parsing detect constants in parentheses?
This times a million and five.
This times a million.
[Code comparison for p_GetArc. Old: 76 bytes. New: 69 bytes.]
And on the topic of stuff that involves port 6, I think it would be nice if the archive byte reading routine avoided using a B_CALL for a massive speed boost, especially for code compiled as programs:
p_ReadArc: 18 bytes (2x) larger, but ~1400 cycles (!!!10x!!!) faster (36 bytes, ~142 cycles)
p_ReadArc:
.db __ReadArcEnd-1-$
ld c,a
in a,(6)
ld b,a
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,c
out (6),a
ld c,(hl)
inc hl
bit 7,h
jr z,__ReadArcNoBoundary
set 6,h
res 7,h
inc a
out (6),a
__ReadArcNoBoundary:
ld l,(hl)
ld h,c
ld a,b
out (6),a
ret
__ReadArcEnd:
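To make the port 6 juggling in p_ReadArc easier to follow, here is a rough Python model of the address translation only. It assumes, based on the add a,c, that A holds the base flash page on entry and HL the 16-bit offset; the function names are mine and not part of Axe.

```python
def map_archive_address(base_page, hl):
    """Return (page written to port 6, address inside the $4000-$7FFF window)."""
    h = hl >> 8
    offset = ((h >> 6) - 1) & 3            # rlca / rlca / dec a / and %00000011
    page = (base_page + offset) & 0xFF     # add a,c
    mapped = 0x4000 | (hl & 0x3FFF)        # set 6,h / res 7,h
    return page, mapped


def second_byte_location(page, mapped):
    """Where the second byte is read from, including the page-boundary case."""
    mapped += 1                            # inc hl
    if mapped & 0x8000:                    # bit 7,h: ran past $7FFF
        mapped = 0x4000 | (mapped & 0x3FFF)    # set 6,h / res 7,h
        page = (page + 1) & 0xFF           # inc a / out (6),a
    return page, mapped


print(map_archive_address(0x10, 0x8123))   # (17, 16675), i.e. page $11, address $4123
print(second_byte_location(0x10, 0x7FFF))  # (17, 16384), i.e. page $11, address $4000
```

This only covers the page boundary; it does nothing about the per-sector extra byte mentioned earlier in the thread.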
p_ReadArcApp: 36 bytes (3x) larger, but ~1050 cycles (4x) faster (54 bytes, ~396 cycles)
p_ReadArcApp:
.db __ReadArcAppEnd-1-$
push hl
ld hl,$0000
ld de,ramCode
ld bc,__ReadArcAppRamCodeEnd-__ReadArcAppRamCode
ldir
pop hl
ld e,a
ld c,6
in b,(c)
ld a,h
set 6,h
res 7,h
rlca
rlca
dec a
and %00000011
add a,e
call ramCode
ld e,d
inc hl
bit 7,h
jr z,__ReadArcAppNoBoundary
set 6,h
res 7,h
inc a
__ReadArcAppNoBoundary:
call ramCode
ex de,hl
ret
__ReadArcAppEnd:
.db rp_Ans,__ReadArcAppEnd-p_ReadArcApp-3
__ReadArcAppRamCode:
out (6),a
ld d,(hl)
out (c),b
ret
__ReadArcAppRamCodeEnd:
[Code comparison for p_CopyArc. Old: 22 bytes. New: 28 bytes.]
;===============================================================
sqrtE:
;===============================================================
;Input:
; E is the value to find the square root of
;Outputs:
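; A is E minus the square of the unrounded (floor) root
; B is 0
; D is the rounded result (also rounds up when A equals the floor root, e.g. E=0 gives D=1)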
; E is not changed
; HL is not changed
;Destroys:
; C
;
xor a ;1 4 4
ld d,a ;1 4 4
ld c,a ;1 4 4
ld b,4 ;2 7 7
sqrtELoop:
rlc d ;2 8 32
ld c,d ;1 4 16
scf ;1 4 16
rl c ;2 8 32
rlc e ;2 8 32
rla ;1 4 16
rlc e ;2 8 32
rla ;1 4 16
cp c ;1 4 16
jr c,$+4 ;4 12|15 48+3x
inc d ;-- -- --
sub c ;-- -- --
djnz sqrtELoop ;2 13|8 47
cp d ;1 4 4
jr c,$+3 ;3 12|11 12|11
inc d ;-- -- --
ret ;1 10 10
;===============================================================
;Size : 29 bytes
;Speed : 347+3x cycles plus 1 if rounded down
; x is the number of set bits in the result.
;===============================================================
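Since the rounding step at the end of sqrtE is easy to misread, here is a register-level Python model of the routine, compared against a plain round-to-nearest square root. The helper names are mine; the model only follows the instructions as listed above and ignores timing.

```python
import math


def sqrt_e(e):
    """Step-by-step model of sqrtE; returns (D, A) as left in the registers."""
    a = 0
    d = 0
    for _ in range(4):                      # ld b,4 ... djnz sqrtELoop
        d = (d << 1) & 0xFF                 # rlc d (top bit is always 0 here)
        c = ((d << 1) | 1) & 0xFF           # ld c,d / scf / rl c
        for _ in range(2):                  # rlc e / rla, twice
            top = (e >> 7) & 1
            e = ((e << 1) | top) & 0xFF
            a = ((a << 1) | top) & 0xFF
        if a >= c:                          # cp c / jr c,$+4 skips the next two
            d += 1                          # inc d
            a -= c                          # sub c
    if a >= d:                              # cp d / jr c,$+3
        d += 1                              # inc d
    return d, a


def rounded_sqrt(e):
    """Round-to-nearest square root, computed with integers only."""
    r = math.isqrt(e)
    return r + 1 if e - r * r > r else r


# Inputs where the routine and true rounding disagree.
mismatches = [e for e in range(256) if sqrt_e(e)[0] != rounded_sqrt(e)]
print(mismatches)
```

In this model the routine matches true rounding everywhere except inputs of the form r*r+r (0, 2, 6, 12, ...), where the remainder equals the floor root and cp d does not set carry, so D is bumped up by one.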
[Code comparison for p_DKeyVar. Old: 27 bytes, ~220.5 cycles. New: 17 bytes, ~225 cycles.]
[Code comparison for p_ToHex. Old: 25 bytes, 670 cycles. New: 25 bytes, 639 cycles.]
[Code comparison for p_ShiftLeft. Old: 17 bytes, 27542 cycles. New: 16 bytes, 27475 cycles.]
[Code comparison for p_ShiftRight. Old: 17 bytes, 27542 cycles. New: 16 bytes, 27475 cycles.]
[Code comparison for p_FreqOut. Old: 23 bytes. New: 22 bytes.]
[Code comparison for p_IntSetup. Old: 42 bytes, a lot of cycles. New: 38 bytes, more cycles but who cares?]
[Code comparison for p_DtoF. Old: 13 bytes, a lot of cycles. New: 11 bytes, a lot plus a few cycles.]
[Code comparison for p_Length. Old: 11 bytes. New: 10 bytes.]
[Code comparison for p_CheckSum. Old: 19 bytes, 63.5*n+37 cycles. New: 19 bytes, 44.5*n+65 cycles.]
p_CheckSum:
.db __CheckSumEnd-$-1
ld b,h
ld c,l
pop hl
ex (sp),hl
xor a
ld d,a
__CheckSumLoop:
add a,(hl)
jr nc,$+3
inc d
cpi
jp pe,__CheckSumLoop
ld h,d
ld l,a
ret
__CheckSumEnd:
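As a sanity check on what the p_CheckSum listing computes, here is a Python model of its loop. It assumes the stacked argument is the data pointer and HL is the byte count, which is how I read the pop hl / ex (sp),hl shuffle; the function name and the bytes-based memory are just for illustration.

```python
def checksum(memory, start, count):
    """16-bit sum of `count` bytes from `start`, mirroring the add/inc d/cpi loop."""
    a = 0                                   # low byte of the sum (xor a)
    d = 0                                   # high byte of the sum (ld d,a)
    hl = start
    bc = count & 0xFFFF
    while True:
        a += memory[hl]                     # add a,(hl)
        if a > 0xFF:                        # carry out of the low byte
            a &= 0xFF
            d = (d + 1) & 0xFF              # inc d
        hl += 1                             # cpi advances HL and decrements BC
        bc = (bc - 1) & 0xFFFF
        if bc == 0:                         # jp pe falls through once BC hits 0
            break
    return (d << 8) | a                     # ld h,d / ld l,a


print(checksum(bytes([1, 2, 3]), 0, 3))     # 6
```

Note that, as in the assembly, a count of 0 is treated as 65536 because BC is decremented before it is tested.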
Thanks :) I think I learned it from you folks :)
Actually, ex (sp),hl takes 2 fewer cycles than pop af and push af combined, so it's faster too :)
EDIT: It does use 2 more cycles though, right?
p_EzSprite:
.db 7
pop de
ld a,e
pop de
ld d,a
B_CALL(_DisplayImage)
p_EzSprite:
.db 6
pop bc
pop de
ld d,c
B_CALL(_DisplayImage)
[Code comparison for p_DecWord. Old: 7 bytes, 30 or 38 cycles. New: 6 bytes, 29 or 36 cycles.]
[Code comparison for p_Input: two listings]
Hm...I say this as an Axe programmer, not knowing ASM...how about UPSIDE DOWN TEXT! om nom nom nom
Yeah! It could be something like Fix 11 :D
[Code comparison for p_SDiv: original vs. optimized]
[Code comparison for p_SDiv. Optimized (speed): +1 byte, avg -19 cycles. Optimized (size): -4 bytes, avg +37 cycles. Optimized (lol): -8 bytes, avg +110-4 cycles.]
[Code comparison for p_SortD: original vs. optimized]
[Code comparison for p_Reciprocal. Original. Optimized: avg -2 cycles. Optimized Moar: avg -33 cycles.]
[Code comparison for p_Mod: original vs. optimized]
Not an optimization, but I'm posting this here since more assembly people will read it. Since the Bitmap() command is being replaced with something actually useful, "Fix 8" and "Fix 9" will also need to be replaced. Are there any flags (particularly for text) that would be useful to Axe programmers and that I haven't already covered with the other Fix commands? A couple I can think of are an APD toggle or a lowercase toggle.
p_GetKeyPause: ; Change to subroutine
.db __GetKeyPauseEnd-1-$
B_CALL(_GetKeyRetOff)
res 7,(iy+40)
ld h,0
ld l,a
cp $fc
ret c
ld a,($8446)
ld h,a ; Edit: something like inc h \ ld l,a might be easier
; since lowercase letters would be consecutive
; or ld hl,($8446) \ ld h,a
ret
__GetKeyPauseEnd:
.db 3
sbc hl,de
ld a,h
or l
.db 2
sbc hl,de
.db 8
ld de,$0000
add hl,de
sbc hl,hl
inc hl
dec hl
.db 6
ld de,$0000
add hl,de
sbc hl,hl
.db 3
sbc hl,hl
ld a,h
or l
.db 2
sbc hl,hl
I figured that you already checked that the value of a is not used.
[Code comparison for p_ArcTan: original vs. optimized]
[Code comparison for p_DrawBmp: original vs. optimized]
[Code comparison for p_88Mul: original vs. optimized]
[Code comparison for p_SDiv: original vs. optimized]
[Code comparison for p_Reciprocal: original vs. optimized]
[Code comparison for p_Mul. Smaller routine: 14 bytes, ~836 cycles. Faster routine: 16 bytes, ~741 cycles.]
[p_Mul, even faster routine: 18 bytes, ~749 cycles for 16-bit inputs (h!=0), ~386 cycles for 8-bit inputs (h=0)]
[Code comparison for p_88Mul. Original routine: 38 bytes, ~1128 cycles. Smaller routine: 18 bytes, ~1089 cycles.]
[p_88Mul, faster routine: 34 bytes, ~831 cycles]
[p_88Mul, faster routine: 35 bytes, ~900 cycles]
ld de,-range
add hl,de
ld de,jumptable_end
jr c,default
add hl,hl
add hl,de
ld e,(hl)
inc hl
ld d,(hl)
default:
ex de,hl
jp (hl)
.dw Label0
.dw Label1
.dw Label2
;.....
jumptable_end:
ld de,-range
add hl,de
jr c,default
add hl,hl
ld de,jumptable_end
add hl,de
ld e,(hl)
inc hl
ld d,(hl)
ex de,hl
jp (hl)
default:
ld de,-range
add hl,de
jr c,routine_end
ex de,hl
ld hl,jumptable_end
add hl,de
ld a,(hl)
add hl,de
ld l,(hl)
ld h,a
jp (hl)
routine_end:
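The first two jump-table listings index the table backwards from jumptable_end, which lets a single add hl,de do both the range check and the rebasing. A minimal Python model of that address arithmetic follows; the function name and the example addresses are made up, and it assumes `range` is the number of .dw entries (at least 1).

```python
def table_entry_address(index, table_size, jumptable_end):
    """Address of the .dw entry for `index`, or None when the default case is taken."""
    total = index + (0x10000 - table_size)     # ld de,-range / add hl,de
    if total > 0xFFFF:                         # carry set: index >= range
        return None                            # jr c,default
    hl = total & 0xFFFF                        # equals index - range as a 16-bit value
    hl = (hl + hl) & 0xFFFF                    # add hl,hl (two bytes per entry)
    return (hl + jumptable_end) & 0xFFFF       # add hl,de with de = jumptable_end


end = 0x9E40                                   # hypothetical address of jumptable_end
size = 3                                       # Label0, Label1, Label2
print([table_entry_address(i, size, end) for i in range(5)])
```

In-range indices land on table_start + 2*index, where table_start = jumptable_end - 2*range; anything else falls through to the default path.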
p_SGT0:
.db 8
ld a,h
or l
jr z,$+6
add hl,hl
sbc hl,hl
inc hl
p_SLE0:
.db 9
ld a,h
or l
jr z,$+6
add hl,hl
ccf
sbc hl,hl
inc hl
p_SLtLeXX:
.db 11
ld a,h
add a,$80
ld h,a
ld de,$0000 ;$8000-const
add hl,de
sbc hl,hl
inc hl
.db rp_Ans,6
p_SGtGeXX:
.db 12
ld a,h
add a,$80
ld h,a
xor a
ld de,$0000 ;$8000-const
add hl,de
ld h,a
rla
ld l,a
.db rp_Ans,6
p_SIntGt:
.db 11
scf
sbc hl,de
add hl,hl
jp pe,$+4
ccf
sbc hl,hl
inc hl
p_SIntGe:
.db 11
xor a
sbc hl,de
add hl,hl
jp po,$+4
ccf
ld h,a
rla
ld l,a
p_SIntLt:
.db 11
scf
sbc hl,de
add hl,hl
jp po,$+4
ccf
sbc hl,hl
inc hl
p_SIntLe:
.db 11
xor a
sbc hl,de
add hl,hl
jp pe,$+4
ccf
ld h,a
rla
ld l,a
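The constant-compare templates rely on biasing: adding $80 to H flips the sign bit, which turns a signed comparison into an unsigned one that the carry flag can answer. Here is a Python sketch of p_SLtLeXX under the assumption that the compiler patches in $8000-const as the comment says; the function name and the reading of the result as "x < const" are my interpretation, not something confirmed by the source.

```python
def signed_less_than_const(x, const):
    """Model of p_SLtLeXX with DE patched to $8000-const; returns 1 or 0 as left in HL."""
    hl = (x & 0xFFFF) ^ 0x8000             # ld a,h / add a,$80 / ld h,a (flip sign bit)
    de = (0x8000 - const) & 0xFFFF         # the patched constant
    carry = (hl + de) > 0xFFFF             # add hl,de
    return 0 if carry else 1               # sbc hl,hl / inc hl


for x in (-5, 9, 10, 11):
    print(x, signed_less_than_const(x, 10))
```

One edge case in this model: for const = -32768 the patched value is 0, so the carry never sets and the result is always 1, even though nothing is less than -32768.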
[Code comparison: original vs. optimized; both listings begin with xor a]
[Code comparison for p_Pix: original vs. optimized]
[Code comparison for p_ArcTan: original vs. optimized]
p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawOrClipBottom
ld a,b
jr __DrawOrClipBottom
__DrawOrClipTop:
inc ix
inc c
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawOrAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:
p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawXorClipBottom
ld a,b
jr __DrawXorClipBottom
__DrawXorClipTop:
inc ix
inc c
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawXorAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:
dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2
with this:
cpd
jp pe,__FreqOutLoop2
The issue was that the frequency would be thrown off, as it cut out 8*HL cycles. However, when I was stealing the code for my own evil intentions, I saw this optimisation, thought of that issue, and here is my solution:
p_FreqOut:
xor a
__FreqOutLoop1:
push bc
xor %00000011
ld e,a
__FreqOutLoop2:
ld a,h
or l
jr z,__FreqOutDone
cpd
ld a,e
scf
jp pe,__FreqOutLoop2
__FreqOutDone:
pop bc
out ($00),a
ret nc
jr __FreqOutLoop1
__FreqOutEnd:
The way the code is reordered now, it should only cut out 8*HL/BC cycles, which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.
p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOrClipBottom
ld b,d
jr __DrawOrNoClipV
__DrawOrClipTop:
inc ix
inc e
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
ld b,a
__DrawOrNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawOrAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:
p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawXorClipBottom
ld b,d
jr __DrawXorNoClipV
__DrawXorClipTop:
inc ix
inc e
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
ld b,a
__DrawXorNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawXorAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:
p_DrawOff:
.db __DrawOffEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOffClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOffClipBottom
ld b,d
jr __DrawOffNoClipV
__DrawOffClipTop:
inc ix
inc e
jr nz,__DrawOffClipTop
__DrawOffClipBottom:
ld b,a
__DrawOffNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawOffAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawOffLoop
inc d
cp 96-7
jr nc,__DrawOffLoop
inc d
__DrawOffLoop:
push bc
ld b,c
ld c,(ix+0)
xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
ld bc,13
add hl,bc
inc ix
pop bc
djnz __DrawOffLoop
ret
__DrawOffAligned:
ld e,12
__DrawOffAlignedLoop:
ld a,(ix)
ld (hl),a
inc ix
add hl,de
djnz __DrawOffAlignedLoop
ret
__DrawOffEnd:
p_DrawMsk:
.db __DrawMskEnd-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMskClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMskClipBottom
ld b,d
jr __DrawMskNoClipV
__DrawMskClipTop:
inc ix
inc e
jr nz,__DrawMskClipTop
__DrawMskClipBottom:
ld b,a
__DrawMskNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMskAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMskLoop
inc d
cp 96-7
jr nc,__DrawMskLoop
inc d
__DrawMskLoop:
push bc
push hl
ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMskShift:
srl e
rr h
srl c
rra
djnz __DrawMskShift
ld b,h
pop hl
push af
dec d
jr z,__DrawMskSkipRight1
push bc
xor b
cpl
ld c,a
ld a,(hl)
or b
and c
ld (hl),a
pop bc
__DrawMskSkipRight1:
dec hl
inc d
push de
jr z,__DrawMskSkipLeft1
ld a,c
xor e
cpl
ld d,a
ld a,(hl)
or e
and d
ld (hl),a
__DrawMskSkipLeft1:
ld de,appBackUpScreen-plotSScreen+1
add hl,de
pop de
pop af
dec d
jr z,__DrawMskSkipRight2
or b
cpl
and (hl)
or b
ld (hl),a
__DrawMskSkipRight2:
dec hl
inc d
jr z,__DrawMskSkipLeft2
ld a,c
or e
cpl
and (hl)
or e
ld (hl),a
__DrawMskSkipLeft2:
ld bc,plotSScreen-appBackUpScreen+13
add hl,bc
inc ix
pop bc
djnz __DrawMskLoop
ret
__DrawMskAligned:
push hl
ld de,appBackUpScreen-plotSScreen
add hl,de
ld a,(ix+0)
ld d,a
xor (ix+8)
cpl
ld e,a
and (hl)
or d
ld (hl),a
pop hl
ld a,(hl)
or d
and e
ld (hl),a
inc ix
ld de,12
add hl,de
djnz __DrawMskAligned
ret
__DrawMskEnd:
p_DrawMsk2:
.db __DrawMsk2End-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMsk2ClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMsk2ClipBottom
ld b,d
jr __DrawMsk2NoClipV
__DrawMsk2ClipTop:
inc ix
inc e
jr nz,__DrawMsk2ClipTop
__DrawMsk2ClipBottom:
ld b,a
__DrawMsk2NoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMsk2Aligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMsk2Loop
inc d
cp 96-7
jr nc,__DrawMsk2Loop
inc d
__DrawMsk2Loop:
push bc
push hl
ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMsk2Shift:
srl e
rr h
srl c
rra
djnz __DrawMsk2Shift
ld b,h ;e = left spr, b = right spr, c = left msk, a = right msk
pop hl
dec d
jr z,__DrawMsk2SkipRight
cpl
and (hl)
xor b
ld (hl),a
__DrawMsk2SkipRight:
dec hl
inc d
jr z,__DrawMsk2SkipLeft
ld a,c
cpl
and (hl)
xor e
ld (hl),a
__DrawMsk2SkipLeft:
ld bc,13
add hl,bc
inc ix
pop bc
djnz __DrawMsk2Loop
ret
__DrawMsk2Aligned:
ld e,12
__DrawMsk2AlignedLoop:
ld a,(ix+8)
cpl
and (hl)
xor (ix+0)
ld (hl),a
inc ix
add hl,de
djnz __DrawMsk2AlignedLoop
ret
__DrawMsk2End:
[Code comparison for p_MulFull: original vs. optimized]
p_MulFullSigned:
.db __MulFullSignedEnd-1-$
push hl
call $3F00+sub_MulFull
pop bc
xor a
bit 7,b
jr z,$+4
sbc hl,de
or d
ret p
sbc hl,bc
ret
__MulFullSignedEnd:
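The correction in p_MulFullSigned follows from writing each signed operand as its unsigned value minus 65536 when the sign bit is set. A Python check of that identity is below. It assumes sub_MulFull returns the high 16 bits of the unsigned product in HL and leaves DE intact, which is how the listing reads to me; the function names are mine.

```python
def mul_full_unsigned(x, y):
    """High word of the 32-bit unsigned product (assumed behaviour of sub_MulFull)."""
    return ((x * y) >> 16) & 0xFFFF


def mul_full_signed(x, y):
    """Model of p_MulFullSigned: fix up the unsigned high word for negative inputs."""
    hl = mul_full_unsigned(x, y)
    if x & 0x8000:                         # bit 7,b: first operand negative
        hl = (hl - y) & 0xFFFF             # sbc hl,de
    if y & 0x8000:                         # or d / ret p: second operand negative
        hl = (hl - x) & 0xFFFF             # sbc hl,bc
    return hl


def reference(x, y):
    """High word of the true signed product, for comparison."""
    sx = x - 0x10000 if x & 0x8000 else x
    sy = y - 0x10000 if y & 0x8000 else y
    return ((sx * sy) >> 16) & 0xFFFF


print(hex(mul_full_signed(0xFFFF, 0x0002)), hex(reference(0xFFFF, 0x0002)))
```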
ld hl, 5
push hl
call $9D9D
when it could just compile to call $0005
p_NthStr:
.db __NthStrEnd-$-1
pop bc
pop de
push bc
ex de,hl
__NthStrLoop:
ld a,d
or e
ret z
xor a
ld b,h
cpir
dec de
jr __NthStrLoop
__NthStrEnd:
It took me a second to figure out what you were doing with 'ld b,h', but when I did, I saw that you could just move it outside the loop to save 4 t-states each loop. But then I realised that BC is already large enough since it holds the return address, so you can actually just remove it altogether.
p_NthStr:
.db __NthStrEnd-$-1
pop bc
pop de
push bc
ex de,hl
__NthStrLoop:
ld a,d
or e
ret z
xor a
cpir
dec de
jr __NthStrLoop
__NthStrEnd:
I am not sure if I had an outdated source (1.1.2), but I saw this code and a one-byte optimisation:
I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).
You do have an outdated version of Axe, I already added that optimization in 1.2.0. :P
Darn, I actually do have 1.2.1 in a different folder, I completely forgot about that .__. I am glad that I got something right, though :D
I was worried about that, but I figured that it would be pretty rare. It would definitely be the only scenario in which it would fail, too. .__.
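For anyone not fluent in cpir, here is what the p_NthStr loop does, modelled in Python. The function name and the sample data are mine. The model deliberately ignores the BC limit being argued about above, so it behaves like the assembly only while each scan finds its terminator before BC runs out.

```python
def nth_str(memory, start, n):
    """Return the address just past the n-th zero terminator at or after `start`."""
    hl = start
    while n:                               # ld a,d / or e / ret z
        while memory[hl] != 0:             # cpir with A = 0 scans for the terminator
            hl += 1
        hl += 1                            # cpir leaves HL one past the matching byte
        n -= 1                             # dec de
    return hl


data = b"ab\0cde\0f\0"
print([nth_str(data, 0, n) for n in range(3)])   # [0, 3, 7]
```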
[Code comparison for p_SDiv. Original routine. Smaller routine: 1 byte, 1|6 cycles saved.]
; Fill(ptr, amount, byte (not word))
; hl = ptr, de = byte, bc = amount
ld (hl),e
dec bc
ld a,c
or b
ret z ; or whatever to quit
ld e,l
ld d,h
inc de
ldir
ret ; or whatever to quit, as above
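The Fill sketch above uses the usual overlapping-ldir trick: write one byte, then copy from ptr to ptr+1 so each copied byte is the one written a step earlier. A Python model of the same steps follows (names are illustrative, and a bytearray stands in for RAM):

```python
def fill(memory, ptr, amount, byte):
    """Fill `amount` bytes at `ptr` with `byte` using an overlapping forward copy."""
    memory[ptr] = byte                     # ld (hl),e
    bc = (amount - 1) & 0xFFFF             # dec bc
    if bc == 0:                            # ld a,c / or b / ret z
        return
    src, dst = ptr, ptr + 1                # ld e,l / ld d,h / inc de
    for _ in range(bc):                    # ldir: copy forward one byte at a time
        memory[dst] = memory[src]
        src += 1
        dst += 1


ram = bytearray(8)
fill(ram, 2, 4, 0xAA)
print(ram.hex())                           # 0000aaaaaaaa0000
```

As with the assembly, an amount of 0 is not special-cased: dec bc wraps to $FFFF and the copy covers 65536 bytes in total, so callers would need to guard against it.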
p_GetByte:
.db __GetByteEnd-$-1
di
ld bc,$0803 ;Bit counter in b, bit mask in c
ld hl,-1
xor a
out (0),a ;Make sure we are reset
in a,(0)
and c ;Check to see if sender is ready
dec a
ret nz ;If not, then go back
inc a
out (0),a ;Relay a confirmation
ex (sp),hl ;Wait until confirmation is read (59 T-states minimum)
ex (sp),hl
ld a,(de) ;Bit counter in b and bitmask in c
xor a ;Store received byte in l
ld hl,$AA
out (0),a ;Reset the ports to receive data
__GetByteLoop:
in a,(0)
xor l
rra
jr c,__GetByteLoop
in a,(0)
rra
rra ;bits cycled in are masked with 0x55. Need to invert anyways, so mask at the end with 0xAA
rr l
djnz __GetByteLoop
ret
p_SendByte:
.db __SendByteEnd-$-1
di
ld bc,$5503 ;Bit counter in b, bit mask in c
ld a,%00000010
out (0),a ;Indicate we are ready to send
__SendByteTimeout:
dec hl
ld a,h
or l
jr z,__SendByteDone
in a,(0) ;Loop is 59 T-states maximum
and c
jr nz,__SendByteTimeout ;Keep looping till we get it
out (0),a
__SendLoop:
rrc e
ccf
rla
sla b
ccf
rla
out (0),a
ex (sp),hl
ex (sp),hl
nop
jr nz,__SendLoop
;need 37cc
xor a
ex (sp),hl
ex (sp),hl
__SendByteDone:
out (0),a
ret
__SendByteEnd:
EDIT: I looked at the timeout code for p_SendByte, and realized that my code didn't need B to be a counter but instead I was using D as a kind of counter. By using B instead of D, I could cut out the ld d,$55, saving 2 bytes and 7cc.
Original routine:
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc
;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
cp 64
ret nc
ld h,a
ld a,d
cp 64
ret nc
ld a,l
cp 96
ret nc
ld a,e
cp 96
ret nc
sub l
jr nc,__LineShrSkipRev
ex de,hl
neg
;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
push af ; Saving DX (it will be popped into HL below)
ld a,l ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
rra
rra
rra
and %00011111
ld c,a
ld b,0
add ix,bc
ld a,d
add a,a
add a,a
ld c,a
add ix,bc
add ix,bc
add ix,bc
ld a,l ; Calculating the starting pixel mask
and %00000111
inc a
ld b,a
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
pop hl ; Recalling DX
ld l,a ; H=DX, L=DY
cp h
jr nc,__LineVert ; Line is more vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
scf ; Setting up gradient counter
ccf
rra
scf
ret ; c=1, z=vertical major
__LineShrEnd:
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc
;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
ld h,a
ld a,63
cp h
ret c
cp d
ret c
ld a,95
cp l
ret c
cp e
ret c
ld a,e
sub l
jr nc,__LineShrSkipRev
ex de,hl
neg
;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
push af ; Saving DX (it will be popped into HL below)
ld a,d
add a,a
add a,a
ld c,a
ld b,0
add ix,bc
add ix,bc
add ix,bc
ld a,l
and 7
ld e,a
xor l
rra
rra
rra
ld c,a
add ix,bc
ld b,e ; E = x1 AND 7 (bit position within the byte)
inc b
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
pop hl ; Recalling DX
ld l,a ; H=DX, L=DY
cp h
jr nc,__LineVert ; Line is more vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
res 0,a ; Setting up gradient counter
rrca
ret ; c=0, z=vertical major
__LineShrEnd:
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc
;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
ld h,a
ld a,63
cp h
ret c
cp d
ret c
ld a,95
cp l
ret c
cp e
ret c
ld a,e
sub l
jr nc,__LineShrSkipRev
ex de,hl
neg
;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
ld e,a ; Saving DX
ld a,l ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
rra
rra
rra
and %00011111
ld c,a
ld b,0
add ix,bc
ld a,d
add a,a
add a,a
ld c,a
add ix,bc
add ix,bc
add ix,bc
ld a,l ; Calculating the starting pixel mask
and %00000111
inc a
ld b,a
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld h,e ; DX
ld l,a ; DY
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
cp h
jr nc,__LineVert ; Line is more vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
res 0,a ; Setting up gradient counter
rrca
ret ; c=0, z=vertical major
__LineShrEnd:
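All three p_LineShr listings are meant to end up with the same two things for the starting pixel: a byte offset into the buffer and a one-bit mask. Here is a Python model of that setup as written in the first and third listings; the function name is mine, and x and y are assumed to have already passed the 96/64 bounds checks.

```python
def pixel_location(x, y):
    """Return (byte offset from the buffer start, bit mask) for pixel (x, y)."""
    c = (y * 4) & 0xFF                     # ld a,d / add a,a / add a,a / ld c,a
    offset = (x >> 3) + c + c + c          # IX += L/8, then add ix,bc three times
    mask = 0x01                            # ld a,%00000001
    for _ in range((x & 7) + 1):           # rrca repeated (x AND 7) + 1 times
        mask = ((mask >> 1) | (mask << 7)) & 0xFF
    return offset, mask


print(pixel_location(0, 0))                # (0, 128)
print(pixel_location(13, 2))               # (25, 4)
```

The loop count has to come from x AND 7; the mask that comes out is $80 shifted right by that many bits.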
;7 bytes, 36cc
ld a,l
or h
add a,255
sbc hl,hl
inc hl
;7 bytes, 28cc
xor a
cp h
ld h,a
sbc a,l
sbc a,a
ld l,a
inc l
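Both 7-byte snippets compute the same thing, HL = 1 if HL was 0 and HL = 0 otherwise; they differ only in how the "is it nonzero" carry is produced. A Python model of each, checked over all 65536 inputs, is below. The function names are mine, and the cycle counts quoted above are taken from the post rather than measured here.

```python
def not_via_add(hl):
    """ld a,l / or h / add a,255 / sbc hl,hl / inc hl"""
    a = (hl & 0xFF) | (hl >> 8)
    carry = 1 if a + 255 > 0xFF else 0     # carry set exactly when a != 0
    hl = (0 - carry) & 0xFFFF              # sbc hl,hl gives 0 or $FFFF
    return (hl + 1) & 0xFFFF               # inc hl


def not_via_cp(hl):
    """xor a / cp h / ld h,a / sbc a,l / sbc a,a / ld l,a / inc l"""
    h, l = hl >> 8, hl & 0xFF
    carry = 1 if h > 0 else 0              # 0 - h borrows when h != 0
    h = 0                                  # ld h,a (A is still 0)
    carry = 1 if l + carry > 0 else 0      # 0 - l - carry borrows when either is nonzero
    a = 0xFF if carry else 0x00            # sbc a,a
    l = (a + 1) & 0xFF                     # ld l,a / inc l
    return (h << 8) | l


print(not_via_add(0), not_via_cp(0), not_via_add(0x1200), not_via_cp(0x1200))
```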