Author Topic: Assembly Programmers - Help Axe Optimize! (Read 145156 times)

calc84maniac · « **Reply #285 on:** September 18, 2012, 09:37:08 am »

And if you ever want a signed high multiplication, I think this routine would work along with that one:

p_MulFullSigned:
 .db __MulFullSignedEnd-1-$
 push hl
 call $3F00+sub_MulFull
 pop bc
 xor a
 bit 7,b
 jr z,$+4
 sbc hl,de
 or d
 ret p
 sbc hl,bc
 ret
__MulFullSignedEnd:

Edit: more optimized
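
For anyone who wants to check the sign correction without a calculator, here is a rough Python model of the routine above. It assumes sub_MulFull returns the high 16 bits of the unsigned 16x16 product in HL and leaves DE alone, which is only my reading of the call; the function names are made up for the sketch.

```python
def mul_full_high(x, y):
    """Assumed behaviour of sub_MulFull: high word of the unsigned product."""
    return ((x & 0xFFFF) * (y & 0xFFFF)) >> 16

def mul_full_signed_high(hl, de):
    bc = hl                          # push hl ... pop bc
    hl = mul_full_high(hl, de)       # call sub_MulFull
    if bc & 0x8000:                  # bit 7,b
        hl = (hl - de) & 0xFFFF      # sbc hl,de (carry cleared by xor a)
    if de & 0x8000:                  # or d / ret p
        hl = (hl - bc) & 0xFFFF      # sbc hl,bc (carry cleared by or d)
    return hl

def to_signed(v):
    return v - 0x10000 if v & 0x8000 else v

def reference_high(hl, de):
    """High word of the true signed 32-bit product."""
    return ((to_signed(hl) * to_signed(de)) >> 16) & 0xFFFF
```

Each negative operand makes the unsigned product too large by 65536 times the other operand, so subtracting the other operand from the high word undoes it.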

squidgetx · « **Reply #286 on:** December 12, 2012, 10:22:33 am »

Optimizing constant address calls?
Anyway, 5->^oVAR : (^oVAR)() compiles to

Code: [Select]

ld hl, 5
push hl
call $9D9D

when it could just compile to

Code: [Select]

call $0005

Right now the only way to call an address that's not a label is using asm(CDXXXX), and that way makes assigning r1-r6 arguments extremely annoying (manual store)

Xeda112358 · « **Reply #287 on:** February 15, 2013, 07:02:51 pm »

I am not sure if I had an outdated source (1.1.2), but I saw this code and a one-byte optimisation:

Code: [Select]

p_NthStr:
 .db __NthStrEnd-1-$
 pop bc
 pop de
 push bc
 ex de,hl
__NthStrLoop:
 ld a,d
 or e
 ret z
 xor a
 ld b,h
 cpir
 dec de
 jr __NthStrLoop
__NthStrEnd:

It took me a second to figure out what you were doing with 'ld b,h', but when I did, I saw that you could just move it outside the loop to save 4 t-states each loop. But then I realised that BC is already large enough since it holds the return address, so you can actually just remove it altogether.

Code: [Select]

p_NthStr:
 .db __NthStrEnd-1-$
 pop bc
 pop de
 push bc
 ex de,hl
__NthStrLoop:
 ld a,d
 or e
 ret z
 xor a
 cpir
 dec de
 jr __NthStrLoop
__NthStrEnd:

I hope that actually works!

calc84maniac · « **Reply #288 on:** February 15, 2013, 11:18:31 pm »

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).
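
To make the failure case concrete, here is a small Python model of the loop with a 16-bit BC that is never reloaded. The helper names and the example addresses ($4100 as a return address inside an app page, $9D95 as one inside a RAM program) are illustrative assumptions, not taken from the Axe source.

```python
def cpir_zero(mem, hl, bc):
    """Model of CPIR with A=0: stop on a zero byte or when BC reaches 0."""
    while True:
        found = mem[hl] == 0
        hl = (hl + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if found or bc == 0:
            return hl, bc

def nth_str(mem, hl, n, bc):
    """Skip n zero-terminated strings; bc starts as the return address."""
    while n != 0:                     # ld a,d / or e / ret z
        hl, bc = cpir_zero(mem, hl, bc)
        n -= 1                        # dec de
    return hl

def demo_memory():
    """One 20000-byte string at $8000, followed by a second string."""
    mem = bytearray(0x10000)
    mem[0x8000:0x8000 + 20000] = b"A" * 20000
    mem[0x8000 + 20001] = ord("B")
    return mem
```

With bc = $9D95 the scan reaches the terminator and lands on the second string; with bc = $4100 the counter runs out first and the routine returns a pointer into the middle of the first string.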

Runer112 · « **Reply #289 on:** February 15, 2013, 11:29:47 pm »

Quote from: Xeda112358 on February 15, 2013, 07:02:51 pm

I am not sure if I had an outdated source (1.1.2), but I saw this code and a one-byte optimisation

You do have an outdated version of Axe; I already added that optimization in 1.2.0.

Quote from: calc84maniac on February 15, 2013, 11:18:31 pm

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

Pfft what are the chances of that...

Xeda112358 · « **Reply #290 on:** February 16, 2013, 07:27:26 am »

Quote from: Runer112 on February 15, 2013, 11:29:47 pm

You do have an outdated version of Axe; I already added that optimization in 1.2.0.

Darn, I actually do have 1.2.1 in a different folder, I completely forgot about that .__. I am glad that I got something right, though

Quote from: calc84maniac on February 15, 2013, 11:18:31 pm

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

I was worried about that, but I figured that it would be pretty rare. It would definitely be the only scenario in which it would fail, too. .__.

Deep Toaster · « **Reply #291 on:** April 06, 2013, 05:42:16 pm »

Don't know if it's been mentioned before (and maybe there's a reason it's this way), but p_SendByte starts by loading B and C individually where p_GetByte loads them together (saving a byte).

Runer112 · « **Reply #292 on:** April 06, 2013, 05:44:01 pm »

No reason whatsoever. Good catch.

Xeda112358 · « **Reply #293 on:** July 04, 2013, 08:59:58 am »

EDIT: Jacobly pointed out the case HL = 8000h, so this doesn't work

Hopefully this file has the updated SDiv routine. I have this:

Original routine

Code: [Select]

p_SDiv:
 .db __SDivEnd-1-$
 ld a,h
 xor d
 push af
 xor d
 jp p,__SDivSkip1-p_SDiv-1
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
__SDivSkip1:
 bit 7,d
 jr z,__SDivSkip2
 xor a
 sub e
 ld e,a
 sbc a,a
 sub d
 ld d,a
__SDivSkip2:
 call $3F00+sub_Div
x_SDivEntry:
 pop af
 ret p
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 ret
__SDivEnd:

Modified routine: same size, 1 or 6 cycles saved

Code: [Select]

p_SDiv:
 .db __SDivEnd-1-$
 ld a,h
 xor d
 push af
 xor d
 jp p,__SDivSkip1-p_SDiv-1
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
__SDivSkip1:
 xor d
 jp p,__SDivSkip2-p_SDiv-1
 xor a
 sub e
 ld e,a
 sbc a,a
 sub d
 ld d,a
__SDivSkip2:
 call $3F00+sub_Div
x_SDivEntry:
 pop af
 ret p
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 ret
__SDivEnd:

And my only change is the two lines after __SDivSkip1.
Same size, save at least 1 cycle (up to 6 cycles).

EDIT: The same modification can be made to the fixed point signed division routine.
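
For the record, here is a quick Python check of why HL = 8000h breaks the modified sign test. It only models which branch is taken at __SDivSkip1, not the division itself, and the function names are made up for the sketch.

```python
def neg16(v):
    return (-v) & 0xFFFF

def a_at_skip1(hl):
    """A on arrival at __SDivSkip1: the high byte of HL after the conditional negate."""
    if hl & 0x8000:
        hl = neg16(hl)
    return hl >> 8

def negates_de_original(hl, d):
    return bool(d & 0x80)                      # bit 7,d / jr z

def negates_de_modified(hl, d):
    return bool((a_at_skip1(hl) ^ d) & 0x80)   # xor d / jp p

def mismatching_hl():
    """All HL values for which the two tests disagree for some D."""
    return {hl for hl in range(0x10000)
            for d in (0x00, 0x7F, 0x80, 0xFF)
            if negates_de_original(hl, d) != negates_de_modified(hl, d)}
```

The xor d trick relies on bit 7 of A being clear at that point. Negating 8000h gives 8000h again, so it is the one value where that bit stays set.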

Matrefeytontias · « **Reply #294 on:** July 12, 2013, 01:30:49 pm »

Seeing the discussion about Fill( in the axiom request thread, I was surprised that this wasn't implemented this way already:

Code: [Select]

; Fill(ptr, amount, byte (not word))
; hl = ptr, de = byte, bc = amount
 ld (hl),e
 dec bc
 ld a,c
 or b
 ret z ; or whatever to quit
 ld e,l
 ld d,h
 inc de
 ldir
 ret ; or whatever to quit, as above

I don't think it's really optimized though >_>
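
A minimal Python sketch of what the routine above does, in case the overlapping LDIR is not obvious. The function names and the flat 64 KB bytearray are just for illustration.

```python
def ldir(mem, hl, de, bc):
    """Model of LDIR: copy byte by byte, low addresses first, until BC is 0."""
    while True:
        mem[de] = mem[hl]
        hl = (hl + 1) & 0xFFFF
        de = (de + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if bc == 0:
            return

def fill(mem, ptr, amount, byte):
    mem[ptr] = byte                  # ld (hl),e
    bc = (amount - 1) & 0xFFFF       # dec bc
    if bc == 0:                      # ld a,c / or b / ret z
        return
    ldir(mem, ptr, (ptr + 1) & 0xFFFF, bc)   # de = hl+1, then ldir
```

Because the destination is one byte ahead of the source, every byte copied is the one written on the previous step, so the first byte smears across the whole range. Note that in this model an amount of 0 wraps BC to $FFFF, just as dec bc would.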

jo-thijs · « **Reply #295 on:** October 31, 2013, 12:02:41 pm »

I found this in the Commands.inc file of Axe 1.2.2a:
p_IntNe:
   .db 8
   xor   a
   sbc   hl,de
   jr   z,$+5
   ld   hl,1

I can't find the purpose of xor a.

Runer112 · « **Reply #296 on:** October 31, 2013, 12:14:49 pm »

It resets the carry flag for sbc hl,de, it seems.
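
To illustrate why that matters, here is a tiny Python model of p_IntNe with and without the carry cleared first. The clear_carry switch is a made-up parameter for the comparison.

```python
def sbc_hl_de(hl, de, carry):
    """Model of SBC HL,DE: HL - DE - carry, returning (result, carry out)."""
    r = hl - de - carry
    return r & 0xFFFF, int(r < 0)

def int_ne(hl, de, carry_in=0, clear_carry=True):
    if clear_carry:
        carry_in = 0                 # xor a
    hl, _ = sbc_hl_de(hl, de, carry_in)
    return 1 if hl != 0 else 0       # jr z,$+5 / ld hl,1
```

If the carry happened to be set on entry and was not cleared, HL = DE + 1 would subtract to zero and the routine would wrongly report the values as equal.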

Xeda112358 · « **Reply #297 on:** July 27, 2015, 02:49:49 pm »

I think I finally have a major optimization after having worked on link routines for the past couple of weeks. I didn't modify the timeout or syncing code, just the core get/send stuff. I've tested it and it is reliable.

For reference, in the event that p_SendByte doesn't have to wait, the new routine is 931cc vs 1647cc. Here are my proposed routines:

p_GetByte: +0 bytes, presumably sped up by about as much as p_SendByte

Code: [Select]

p_GetByte:
 .db __GetByteEnd-$-1
 di
 ld bc,$0803  ;Bit counter in b, bit mask in c
 ld hl,-1
 xor a
 out (0),a   ;Make sure we are reset
 in a,(0)
 and c   ;Check to see if sender is ready
 dec a
 ret nz   ;If not, then go back
 inc a
 out (0),a   ;Relay a confirmation
 ex (sp),hl   ;Wait until confirmation is read (59 T-states minimum)
 ex (sp),hl
 ld a,(de)   ;Dummy read, only here to pad the delay
 xor a
 ld hl,$AA   ;Received byte is built up in l
 out (0),a   ;Reset the ports to receive data

__GetByteLoop:
    in a,(0)
    xor l
    rra
    jr c,__GetByteLoop
    in a,(0)
    rra
    rra             ;bits cycled in are masked with 0x55. Need to invert anyways, so mask at the end with 0xAA
    rr l
    djnz __GetByteLoop
    ret
__GetByteEnd:

p_SendByte: -4 bytes, -723cc

Code: [Select]

p_SendByte:
    .db __SendByteEnd-$-1
 di
 ld bc,$5503  ;Bit counter in b, bit mask in c
 ld a,%00000010
 out (0),a   ;Indicate we are ready to send
__SendByteTimeout:
 dec hl
 ld a,h
 or l
 jr z,__SendByteDone
 in a,(0)   ;Loop is 59 T-states maximum
 and c
 jr nz,__SendByteTimeout ;Keep looping till we get it
 out (0),a
__SendLoop:
    rrc e
    ccf
    rla
    sla b
    ccf
    rla
    out (0),a
    ex (sp),hl
    ex (sp),hl
    nop
    jr nz,__SendLoop
;need 37cc
    xor a
    ex (sp),hl
    ex (sp),hl
__SendByteDone:
    out (0),a
    ret
__SendByteEnd:

EDIT: I looked at the timeout code for p_SendByte, and realized that my code didn't need B to be a counter but instead I was using D as a kind of counter. By using B instead of D, I could cut out the ld d,$55, saving 2 bytes and 7cc.
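
Here is how I read the B = $55 trick in __SendLoop, as a small Python model. The assumption is that the Z flag from sla b survives until jr nz, since ccf, rla, out, ex (sp),hl and nop leave Z alone; the function names are made up.

```python
def sla(b):
    """Model of SLA B: returns (new B, carry out of bit 7)."""
    return (b << 1) & 0xFF, b >> 7

def send_loop_carries(b=0x55):
    """Carry produced by sla b on each pass, until B becomes 0."""
    carries = []
    while True:
        b, carry = sla(b)
        carries.append(carry)
        if b == 0:                   # jr nz,__SendLoop falls through
            return carries
```

So B acts as both the loop counter (it empties after exactly eight shifts) and the source of an alternating bit that gets rotated into A next to each data bit.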

Xeda112358 · « **Reply #298 on:** September 21, 2019, 07:19:54 pm »

Here is an optimized p_LineShr routine. NOTE: It flips the meaning of the carry flag on output, so the line routines that use this will need to ret c instead of ret nc.

Original routine

Code: [Select]

p_LineShr:
 .db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
 ld a,l
 pop bc
 pop hl
 pop de
 ex (sp),hl
 ld d,l
 pop hl
 ex (sp),hl
 push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
 cp 64
 ret nc
 ld h,a
 ld a,d
 cp 64
 ret nc

 ld a,l
 cp 96
 ret nc
 ld a,e
 cp 96
 ret nc

 sub l
 jr nc,__LineShrSkipRev
 ex de,hl
 neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
 push af   ; Saving DX (it will be popped into HL below)
 ld a,l   ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
 rra
 rra
 rra
 and %00011111
 ld c,a
 ld b,0
 add ix,bc
 ld a,d
 add a,a
 add a,a
 ld c,a
 add ix,bc
 add ix,bc
 add ix,bc
 ld a,l   ; Calculating the starting pixel mask
 and %00000111
 inc a
 ld b,a
 ld a,%00000001
__LineShrMaskLoop:
 rrca
 djnz __LineShrMaskLoop
 ld c,a
 ld a,h   ; Calculating delta Y and negating the Y increment if necessary
 sub d   ; This is the last instruction for which we need the original data
 ld de,12
 jr nc,__LineShrSkipNeg
 ld de,-12
 neg
__LineShrSkipNeg:
 pop hl   ; Recalling DX
 ld l,a   ; H=DX, L=DY
 cp h
 jr nc,__LineVert  ; Line is rather vertical than horizontal
 ld a,h
__LineVert:
 ld b,a   ; Pixel counter
 inc b
 cp l
 scf    ; Setting up gradient counter
 ccf
 rra
 scf
 ret    ; c=1, z=vertical major
__LineShrEnd:

Optimized routine: -4 bytes, -13cc

Code: [Select]

p_LineShr:
 .db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
 ld a,l
 pop bc
 pop hl
 pop de
 ex (sp),hl
 ld d,l
 pop hl
 ex (sp),hl
 push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
 ld h,a
 ld a,63
 cp h
 ret c
 cp d
 ret c

 ld a,95
 cp l
 ret c
 cp e
 ret c
 ld a,e

 sub l
 jr nc,__LineShrSkipRev
 ex de,hl
 neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
 push af   ; Saving DX (it will be popped into HL below)
 ld a,d
 add a,a
 add a,a
 ld c,a
 ld b,0
 add ix,bc
 add ix,bc
 add ix,bc
 ld a,l
 and 7
 ld e,a
 xor l
 rra
 rra
 rra
 ld c,a
 add ix,bc
 ld b,e   ; x1 and 7, saved above
 inc b
 ld a,%00000001
__LineShrMaskLoop:
 rrca
 djnz __LineShrMaskLoop
 ld c,a
 ld a,h   ; Calculating delta Y and negating the Y increment if necessary
 sub d   ; This is the last instruction for which we need the original data
 ld de,12
 jr nc,__LineShrSkipNeg
 ld de,-12
 neg
__LineShrSkipNeg:
 pop hl   ; Recalling DX
 ld l,a   ; H=DX, L=DY
 cp h
 jr nc,__LineVert  ; Line is rather vertical than horizontal
 ld a,h
__LineVert:
 ld b,a   ; Pixel counter
 inc b
 cp l
 res 0,a   ; Setting up gradient counter
 rrca
 ret    ; c=0, z=vertical major
__LineShrEnd:

Or this version: it only saves 3 bytes, but it saves 10 more clock cycles:

Code: [Select]

p_LineShr:
 .db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
 ld a,l
 pop bc
 pop hl
 pop de
 ex (sp),hl
 ld d,l
 pop hl
 ex (sp),hl
 push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
 ld h,a
 ld a,63
 cp h
 ret c
 cp d
 ret c

 ld a,95
 cp l
 ret c
 cp e
 ret c
 ld a,e

 sub l
 jr nc,__LineShrSkipRev
 ex de,hl
 neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
 ld e,a   ; Saving DX
 ld a,l   ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
 rra
 rra
 rra
 and %00011111
 ld c,a
 ld b,0
 add ix,bc
 ld a,d
 add a,a
 add a,a
 ld c,a
 add ix,bc
 add ix,bc
 add ix,bc
 ld a,l   ; Calculating the starting pixel mask
 and %00000111
 inc a
 ld b,a
 ld a,%00000001
__LineShrMaskLoop:
 rrca
 djnz __LineShrMaskLoop
 ld c,a
 ld a,h   ; Calculating delta Y and negating the Y increment if necessary
 sub d   ; This is the last instruction for which we need the original data

 ld h,e   ; DX

 ld de,12
 jr nc,__LineShrSkipNeg
 ld de,-12
 neg
__LineShrSkipNeg:
 ld l,a   ; DY, stored after the negation so L holds |DY|
 cp h
 jr nc,__LineVert  ; Line is rather vertical than horizontal
 ld a,h
__LineVert:
 ld b,a   ; Pixel counter
 inc b
 cp l
 res 0,a   ; Setting up gradient counter
 rrca
 ret    ; c=0, z=vertical major
__LineShrEnd:
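
The new tail (res 0,a / rrca) can be checked against the old one (scf / ccf / rra / scf) with a few lines of Python. This only models A and the carry flag, and the function names are made up.

```python
def tail_original(a):
    carry = 0                                # scf / ccf
    a, carry = (carry << 7) | (a >> 1), a & 1   # rra
    carry = 1                                # scf
    return a, carry

def tail_optimized(a):
    a &= 0xFE                                # res 0,a
    carry = a & 1                            # rrca: bit 0 goes to carry and bit 7
    a = ((a >> 1) | (carry << 7)) & 0xFF
    return a, carry
```

Both leave A halved, and neither sequence touches the Z flag set by cp l. The only visible difference is the carry on exit: 1 in the original, 0 in the optimized versions.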

Xeda112358 · « **Reply #299 on:** October 20, 2019, 10:47:31 am »

p_EQ0

The current routine is 7 bytes and 36cc:

Code: [Select]

;7 bytes, 36cc
 ld a,l
 or h
 add a,255
 sbc hl,hl
 inc hl

But we can save 8cc without sacrificing bytes:

Code: [Select]

;7 bytes, 28cc
 xor a
 cp h
 ld h,a
 sbc a,l
 sbc a,a
 ld l,a
 inc l
