Author Topic: ASM Optimized routines (Read 102422 times)

chickendude · « **Reply #60 on:** October 01, 2012, 09:21:55 am »

Not an optimized "routine", or really even useful, but i was trying to figure out how to add new instructions (and not macros) to spasm, and finally got this:

Code: [Select]

.addinstr add ahl,de 00CE19 3 NOP 1 ;add hl,de \ adc a,0
.addinstr add ahl,cde 8919 2 NOP 1 ;add hl,de \ adc a,c

Now whenever i want to do 24-bit addition (well, kinda) i can just use the new add ahl,de and add ahl,cde instructions

EDIT: oops, my bbcode leaves a lot to be desired...

ben_g · « **Reply #61 on:** October 01, 2012, 03:02:45 pm »

so does it work like this?

Code: [Select]

.addinstr <instruction> <asm code for the instruction, written in hex> <amount of bytes of asm code> NOP 1

Xeda112358 · « **Reply #62 on:** October 01, 2012, 06:35:31 pm »

In most cases, yes, that is what you will want to do. If you open up the file TASM80, you will see the format of other such instructions.

chickendude · « **Reply #63 on:** October 01, 2012, 06:37:09 pm »

More or less, though you have to watch out for the ordering of the hex (it's backwards):
00CE19
19 = add hl,de
CE00 = adc a,0
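
The reversal can be checked mechanically. Here is a small Python sketch (the helper name `addinstr_hex` is made up for illustration and is not part of spasm): it takes the opcode bytes in the order the CPU should see them and builds the hex field the way the two definitions above have it.

```python
def addinstr_hex(opcode_bytes):
    """Hex field for .addinstr, given the bytes in the order the CPU executes them."""
    # The field is written last byte first, so reverse before formatting.
    return "".join(f"{b:02X}" for b in reversed(opcode_bytes))

print(addinstr_hex([0x19, 0xCE, 0x00]))  # add hl,de \ adc a,0
print(addinstr_hex([0x19, 0x89]))        # add hl,de \ adc a,c
```

Both results match the hex fields used in the two .addinstr lines earlier in the thread.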

And maybe there's a limit of four bytes/command, 'cuz nothing over four bytes seems to work. The command will take up that many bytes, but everything past four bytes seems to turn into random bytes.

You can use other values, too, like an asterisk can be used to pull in data, for example from tasmtabs.htm (NOTOUCH=NOP):

Code: [Select]

                                          EXAMPLE         EXAMPLE
INSTRUCTION DEFINITION                    SOURCE          OBJECT
-------------------------------------------------------------------
XYZ  *      FF   3  NOTOUCH 1             xyz 1234h       FF 34 12
XYZ  *      FF   2  NOTOUCH 1             xyz 1234h       FF 34
ZYX  *      FE   3  SWAP    1             zyx 1234h       FE 12 34
ZYX  *      FE   3  R2      1             zyx $+4         FE 01 00
ABC  *,*    FD   3  COMBINE 1             abc 45h,67h     FD 45 67
ABC  *,*    FD   3  CSWAP   1             abc 45h,67h     FD 67 45
ADD  A,#*   FC   2  NOTOUCH 1             add A,#'B'      FC 42
RET  ""     FB   1  NOTOUCH 1             ret             FB
LD   IX,*   21DD 4  NOTOUCH 1             ld  IX,1234h    DD 21 34 12
LD   IX,*   21DD 4  NOTOUCH 1 1 0         ld  IX,1234h    DD 21 68 24
LD   IX,*   21DD 4  NOTOUCH 1 0 1         ld  IX,1234h    DD 21 35 12
LD   IX,*   21DD 4  NOTOUCH 1 1 1         ld  IX,1234h    DD 21 69 24
LD   IX,*   21DD 4  NOTOUCH 1 8 12        ld  IX,34h      DD 21 12 34
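
Going by the last four LD IX rows, the two numbers after the class look like a shift count and a value that gets ORed in before the operand is stored low byte first. A Python sketch of that reading, inferred only from the example rows above (the function names are hypothetical, and the real assembler may differ in details such as operand width):

```python
def notouch_operand(value, shift=0, or_mask=0):
    """16-bit operand as [low, high] bytes after the optional shift and OR."""
    word = ((value << shift) | or_mask) & 0xFFFF
    return [word & 0xFF, word >> 8]

def encode_ld_ix(value, shift=0, or_mask=0):
    # 21DD in the definition comes out low byte first: DD 21
    return [0xDD, 0x21] + notouch_operand(value, shift, or_mask)

for args in [(0x1234,), (0x1234, 1, 0), (0x1234, 0, 1), (0x1234, 1, 1), (0x34, 8, 0x12)]:
    print(" ".join(f"{b:02X}" for b in encode_ld_ix(*args)))
```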

chickendude · « **Reply #64 on:** February 01, 2013, 05:08:48 am »

Here's a quick filled rectangle routine i wrote the other night. It probably doesn't belong here since i don't think it's that optimized, but it should be smaller and faster than Jason Kovacs' routine posted on the previous page. As always, do whatever you want with the code and please don't give me credit! It takes the starting x/y coordinates in de, and the width and height in pixels in bc. rectangle_filled_solid will make a solid black rectangle, rectangle_filled_xor will xor the pixels within the rectangle's area.
EDIT: Added Xeda's optimisations. (see below for a slightly faster version)

Code: [Select]

#ifdef TI83P
GBUF_LSB = $40
GBUF_MSB = $93
#else
GBUF_LSB = $29
GBUF_MSB = $8E
#endif

;b = # columns
;c = # rows
;d = starting x
;e = starting y
rectangle_filled_xor:
 ld a,$AE   ;xor (hl)
 jr rectangle_filled2
rectangle_filled_solid:
 ld a,$B6   ;or (hl)
rectangle_filled2:
 push de
 push bc
  ld (or_xor),a  ;use smc for xor/solid fill
  ld a,d    ;starting x
  and $7    ;what bit do we start on?
  ex af,af'
   ld a,d   ;starting x
   ld l,e   ;ld hl,e
   ld h,0   ; ..
   ld d,h   ;set d = 0
   add hl,de  ;starting y * 12
   add hl,de  ;x3
   add hl,hl  ;x6
   add hl,hl  ;x12
   rra    ;a = x coord / 8
   rra    ;
   rra    ;
   and %00011111 ;starting x/8 (starting byte in gbuf)
   add a,GBUF_LSB
   ld e,a   ;
   ld d,GBUF_MSB ;
   add hl,de  ;hl = offset in gbuf
  ex af,af'   ;carry should be reset and z affected from and $7
  ld e,a
  ld a,%10000000
   jr z,$+6
   rra
   dec e
   jr nz,$-2
  ld d,a    ;starting bit to draw
rectangle_loop_y:
  push bc
  push hl
rectangle_loop_x:
   ld e,a   ;save a (overwritten with or (hl))
or_xor = $
   or (hl)   ;smc will modify this to or/xor
   ld (hl),a
   ld a,e   ;recall a
   rrca   ;rotate a to draw the next bit
    jr nc,$+3
    inc hl
   djnz rectangle_loop_x
  pop hl    ;hl = first column in gbuf row
  ld c,12    ;b = 0, bc = 12
  add hl,bc   ;move down to next row
  pop bc    ;restore b (# columns)
  ld a,d    ;restore a (starting bit to draw)
  dec c
   jr nz,rectangle_loop_y
rectangle_end:
 pop bc
 pop de
 ret
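
For anyone following the address arithmetic at the top of the routine, here is a rough Python model of it. It assumes the TI-83+ buffer address from the defines ($9340) and the usual 12 bytes per row; the function names are only for illustration.

```python
GBUF_83P = 0x9340  # GBUF_MSB/GBUF_LSB for the TI-83+ in the defines above

def gbuf_address(x, y, base=GBUF_83P):
    """Address of the buffer byte holding pixel (x, y); each row is 12 bytes."""
    return base + y * 12 + (x >> 3)

def start_mask(x):
    """Bit for pixel x inside its byte; bit 7 is the leftmost pixel."""
    return 0x80 >> (x & 7)

print(hex(gbuf_address(4, 27)), format(start_mask(4), "08b"))
```

The y*12 part corresponds to the add hl,de / add hl,hl chain, and the mask corresponds to rotating %10000000 right (x and 7) times.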

Here's a little example of how you could use the routine to make a text box:

Code: [Select]

start: 
 ld de,$041B 
 ld bc,$1103 
 call draw_box2 
 call ionFastCopy 
 bcall _getkey 
 ret 
 
draw_box2: 
 call rectangle_filled_solid 
 inc d 
 inc e 
 dec b 
 dec b 
 dec c 
 dec c 
 call rectangle_filled_xor 
 ret
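
To see why the two calls leave just a border, here is a Python sketch that models the screen as a 768-byte buffer and mimics draw_box2 pixel by pixel. It is only a model of the idea (no clipping, just like the routine), and the helper names are made up.

```python
WIDTH_BYTES, HEIGHT = 12, 64

def rect_fill(buf, x, y, w, h, xor=False):
    """Pixel-by-pixel fill: OR for solid, XOR to invert."""
    for row in range(y, y + h):
        for col in range(x, x + w):
            i = row * WIDTH_BYTES + (col >> 3)
            bit = 0x80 >> (col & 7)
            buf[i] = buf[i] ^ bit if xor else buf[i] | bit

def draw_box(buf, x, y, w, h):
    rect_fill(buf, x, y, w, h)                        # solid black block
    rect_fill(buf, x + 1, y + 1, w - 2, h - 2, True)  # flip the inside back to white

def pixel(buf, x, y):
    return (buf[y * WIDTH_BYTES + (x >> 3)] >> (7 - (x & 7))) & 1

buf = bytearray(WIDTH_BYTES * HEIGHT)
draw_box(buf, 4, 27, 17, 3)  # same numbers as ld de,$041B / ld bc,$1103
print(sum(bin(b).count("1") for b in buf), "pixels set")
```

Note that the XOR pass only gives a clean white interior because the first pass made every pixel in the box black.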

EDIT: Small optimization

Xeda112358 · « **Reply #65 on:** February 01, 2013, 01:39:49 pm »

Awesome, I definitely like rectangle codes! I am glad you preserve the coordinates, too. Here are a few optimisations that I see without examining the code too much.

Code: [Select]

  ld hl,or_xor 
  ld (hl),a   ;use smc for xor/solid fill

To save a byte and 4 clock cycles, this can simply be:

Code: [Select]

  ld (or_xor),a

Code: [Select]

   rra    ;a = x coord / 8 
   rra    ; 
   rra    ; 
   and %00011111 ;starting x/8 (starting byte in gbuf) 
   ld e,a   ; 
   add hl,de  ;add x 
   ld de,gbuf  ; 
   add hl,de  ;hl = offset in gbuf

To save 7 clock cycles and 0 bytes:

Code: [Select]

   rra    ;a = x coord / 8 
   rra    ; 
   rra    ; 
   and %00011111 ;starting x/8 (starting byte in gbuf) 
   or 40h   ;gbuf=9340h, 40h = %01000000, so this won't cause problems
   ld e,a   ;
   ld d,93h
   add hl,de  ;add x
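
The or 40h trick relies on the byte column (0 to 11) and the low byte of the buffer address sharing no set bits, so OR and ADD give the same result. A quick Python check of that condition; the $29 value is the TI-83 low byte that appears in the defines elsewhere in this thread:

```python
def or_matches_add(lsb):
    """True if (n | lsb) == n + lsb for every byte column n = 0..11."""
    return all((n | lsb) == n + lsb for n in range(12))

print(or_matches_add(0x40))  # TI-83+ buffer at $9340
print(or_matches_add(0x29))  # TI-83 buffer at $8E29
```

With $40 the bits never overlap; with $29 they do (for example 1 | $29 is still $29), so a buffer at that address needs a real add.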

Code: [Select]

  pop hl 
  pop bc    ;restore b (# columns) 
  pop af 
  ld de,12 
  add hl,de   ;move down to next row

Since B is already 0 at the start of this code, save a byte and 3 t-states:

Code: [Select]

  pop hl 
  ld c,12
  add hl,bc
  pop bc    ;restore b (# columns) 
  pop af

And now you can actually modify that whole routine to be a little faster:

Code: [Select]

  ld d,a
rectangle_loop_y: 
  push bc
  push hl 
rectangle_loop_x: 
   ld e,a 
or_xor = $ 
   or (hl)   ;smc will modify this to or/xor 
   ld (hl),a 
   ld a,e 
   rrca 
    jr nc,$+3 
    inc hl 
   djnz rectangle_loop_x 
  pop hl 
  ld c,12
  add hl,bc
  pop bc
  ld a,d
  dec c 
   jr nz,rectangle_loop_y

You get rid of a push af / pop af in the loop which takes 21 t-states and replace it with a ld a,d which is 4 t-states.

This is definitely smaller than my routine. In mine, I make a 12-byte pattern to OR or XOR onto the screen o_o
One of the tricks I use is to find the first non-zero byte of the pattern:

Code: [Select]

;D is the x coordinate, here
;HL points to the pattern buffer (12 bytes)
     ld a,d
     and 7
     ld b,a
     ld a,80h
     jr z,$+5
       rrca
       djnz $-1
;A is the mask if it were for a pixel
;B is 0
     add a,a
     dec a
     ld (hl),a

So for example, if D and 7 = 3, you would get %00010000. 'add a,a' turns that to %00100000 and then dec a → %00011111.
And if you are worried about ' D and 7 ' = 0, you get %10000000→%00000000→%11111111 which is correct, or if 'D and 7' returns 7, %00000001→%00000010→%00000001.
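
The same mask arithmetic, written out as a Python sketch so all eight cases can be listed at once (the function name is only for illustration):

```python
def first_byte_mask(x):
    pixel = 0x80 >> (x & 7)        # 80h rotated right (x and 7) times
    return (pixel * 2 - 1) & 0xFF  # add a,a \ dec a, wrapped to 8 bits

for x in range(8):
    print(x, format(first_byte_mask(x), "08b"))
```

In every case the result is the same as shifting $FF right by (x and 7), which is the form used in a later reply.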

chickendude · « **Reply #66 on:** February 03, 2013, 04:11:16 am »

That's a cool trick with the add a,a \ dec a. My first go at it only handled multiples of 8 and looked like this:

Code: [Select]

;b = # columns 
;c = # rows 
;d = starting x 
;e = starting y 
rectangle_filled2: 
 ld a,d  ;a = starting x coord 
 ld l,e  ;ld hl,e 
 ld h,0  ; .. 
 ld d,h  ;set d = 0 
 add hl,de ;starting y * 12 
 add hl,de ;x3 
 add hl,hl ;x6 
 add hl,hl ;x12 
 rra   ;a = x coord / 8 
 rra   ; 
 rra   ; 
 and %00011111 ;starting x/8 
 ld e,a  ; 
 add hl,de ;add x 
 ld de,gbuf 
 add hl,de ;offset in gbuf 
 ld a,b   ;b = no columns 
 rra 
 rra 
 rra 
 and %00011111 ;no. columns / 8 
 ld b,a 
 ld a,12 
 sub b 
 ld e,a 
 ld d,0 
rectangle_loop_y: 
 push bc 
rectangle_loop_x: 
  ld (hl),$FF 
  inc hl 
  djnz rectangle_loop_x 
 pop bc   ;restore b (# byte columns) and c (# rows)
 add hl,de  ;move down to next row 
 dec c 
  jr nz,rectangle_loop_y 
rectangle_end: 
 ret

Much smaller and faster, but certain cases (for example, a rectangle less than 8 pixels wide, 2 byte rectangles, etc.) seemed like they were going to bump up the size quite a bit and what I wanted was to write it more for size than speed since it's not being used in any time-critical areas of the game (mostly in menus and the battle engine). How do you handle cases where, say, a rectangle doesn't fill an entire byte?

The gbuf bit is also pretty creative (i guess you could also just add a,$40) but unfortunately wouldn't work out since it's being written for the 83 as well, though i guess a simple define would do. Yep:

Code: [Select]

#ifdef TI83P
GBUF_LSB = $40
GBUF_MSB = $93

 .org progstart-2
 .db $bb,$6d
#else 
GBUF_LSB = $29
GBUF_MSB = $8E
 .org progstart
#endif
;...
 rra    ;a = x coord / 8
 rra    ;
 rra    ;
 and %00011111 ;starting x/8 (starting byte in gbuf)
 add a,GBUF_LSB
 ld e,a   ;
 ld d,GBUF_MSB ;
 add hl,de  ;hl = offset in gbuf

...works just fine for the 83/+

I hope deeph doesn't mind, if you feel like checking their project out it's over at yAronet: http://www.yaronet.com/posts.php?s=153983

EDIT: Here's a version that draws vertically (here b = height and c = width), which seems to be slightly faster and is the same size:

Code: [Select]

#ifdef TI83P
GBUF_LSB = $40
GBUF_MSB = $93
#else
GBUF_LSB = $29
GBUF_MSB = $8E
#endif

;b = # rows
;c = # columns
;d = starting x
;e = starting y
rectangle_filled_xor:
 ld a,$AE   ;xor (hl)
 jr rectangle_filled2
rectangle_filled_solid:
 ld a,$B6   ;or (hl)
rectangle_filled2:
 push de
 push bc
  ld (or_xor),a  ;use smc for xor/solid fill
  ld a,d    ;starting x
  and $7    ;what bit do we start on?
  ex af,af'
   ld a,d   ;starting x
   ld l,e   ;ld hl,e
   ld h,0   ; ..
   ld d,h   ;set d = 0
   add hl,de  ;starting y * 12
   add hl,de  ;x3
   add hl,hl  ;x6
   add hl,hl  ;x12
   rra    ;a = x coord / 8
   rra    ;
   rra    ;
   and %00011111 ;starting x/8 (starting byte in gbuf)
   add a,GBUF_LSB
   ld e,a   ;
   ld d,GBUF_MSB ;
   add hl,de  ;hl = offset in gbuf
  ex af,af'   ;carry should be reset and z affected from and $7
  ld d,a
  ld a,%10000000
   jr z,$+6
   rra
   dec d
   jr nz,$-2
  ld e,12
rectangle_loop_x:
  push af
  push bc
  push hl
   ld c,a
rectangle_loop_y:
or_xor = $
   or (hl)   ;smc will modify this to or/xor
   ld (hl),a
   ld a,c
   add hl,de
   djnz rectangle_loop_y
  pop hl
  pop bc
  pop af
  rrca
   jr nc,$+3
   inc hl
  dec c
   jr nz,rectangle_loop_x
rectangle_end:
 pop bc
 pop de
 ret

EDIT: Small optimization

NanoWar · « **Reply #67 on:** February 03, 2013, 12:39:46 pm »

So these work only with rectangle width W where W modulo 8 = 0 and W >= 8 ?

Any chance of a generic rectangle routine?

chickendude · « **Reply #68 on:** February 03, 2013, 11:51:44 pm »

They should work with any rectangle with a width/height greater than 0 (there's no error checking for invalid dimensions and no clipping). I can see what i can do for a generic rectangle routine, what i've been doing is just calling the routine twice:

Code: [Select]

start:
 ld de,$0204
 ld bc,$2121
 call draw_box2
 call ionFastCopy
 bcall _getkey
 ret

draw_box2:
 call rectangle_filled_solid
 inc d
 inc e
 dec b
 dec b
 dec c
 dec c
 call rectangle_filled_xor
 ret

EDIT: What i meant was that the first routine i wrote only handled multiples of 8, the other routines (the one in post #64 and at the bottom of #66) can handle any rectangle with valid (non-zero) dimensions. Well, as long as they don't go off-screen!

EDIT2: Here's a simple normal rectangle routine. I just converted the other routine to not draw the inside of the rectangle, so it won't erase what's inside the rectangle. It's currently at 90 bytes and not terribly fast, slightly slower than drawing a filled rectangle. By default a rectangle needs to be at least 3x2 (that is X*Y), but if you don't mind adding 8 bytes (98 bytes altogether) you can use it to draw horizontal and vertical lines and even plot pixels.

You can just uncomment the lines marked with # in the code.

Code: [Select]

;b = # rows
;c = # columns
;d = starting x
;e = starting y
rectangle:
 push de
 push bc
  ld a,d    ;starting x
  and $7    ;what bit do we start on?
  ex af,af'
   ld a,d   ;starting x
   ld l,e   ;ld hl,e
   ld h,0   ; ..
   ld d,h   ;set d = 0
   add hl,de  ;starting y * 12
   add hl,de  ;x3
   add hl,hl  ;x6
   add hl,hl  ;x12
   rra    ;a = x coord / 8
   rra    ;
   rra    ;
   and %00011111 ;starting x/8 (starting byte in gbuf)
   add a,GBUF_LSB
   ld e,a   ;
   ld d,GBUF_MSB ;
   add hl,de  ;hl = offset in gbuf
  ex af,af'   ;carry should be reset and z affected from and $7
  ld e,a
  ld a,%10000000
   jr z,$+6
   rra
   dec e
   jr nz,$-2
  dec b    ;you could adjust your input to take care of this, ie b = width-2, c = height-1 and save 3 bytes here
  dec b    ;we draw the ends separately
  dec c    ;we'll draw the last line at the end
  ld d,a    ;starting bit to draw
;d = starting bit
rectangle_loop_y:
  push bc
   push hl
    call rectangle_loop_x
   pop hl   ;hl = first column in gbuf row
   ld c,12   ;b = 0, bc = 12
   add hl,bc  ;move down to next row
  pop bc    ;restore b (# columns)
  xor a
;  cp c    ; # UNCOMMENT TO ALLOW LINES WITH A HEIGHT OF 1 PIXEL
;  jr z,rectangle_end ; # 
  ld (ld_hl),a  ;change ld (hl),a to nop
  ld a,d    ;restore a (starting bit to draw)
  dec c
   jr nz,rectangle_loop_y
  ld a,$77   ;ld (hl),a
  ld (ld_hl),a  ;return nop to ld (hl),a
  ld a,d
  call rectangle_loop_x
rectangle_end:
 pop bc
 pop de
 ret

rectangle_loop_x:
 or (hl)     ;first bit
 ld (hl),a
; inc b     ; # UNCOMMENT TO ALLOW LINES WITH A WIDTH OF 1 PIXEL
;  ret z     ; # 
; dec b     ; # 
;  jr z,rectangle_loop_x_end ; # 
 ld a,d
rectangle_loop_x_inner:
 rrca     ;rotate a to draw the next bit
  jr nc,$+3
  inc hl
 ld e,a     ;save a (overwritten with or (hl))
 or (hl)     ;smc will modify this to or/xor
ld_hl = $
 ld (hl),a
 ld a,e     ;recall a
 djnz rectangle_loop_x_inner
rectangle_loop_x_end:
 rrca     ;rotate a to draw the next bit
  jr nc,$+3
  inc hl
 or (hl)     ;last bit
 ld (hl),a
 ret

If you don't care about preserving the coordinates, you can just remove the push/pops and save another 4 bytes. And i've got another idea which should be faster (essentially working with bytes instead of pixels), though it might take up a bit more space. If you're interested let me know and i'll try to work on it some.

As always, use and abuse without restrictions

EDIT3: Small optimization

Xeda112358 · « **Reply #69 on:** February 04, 2013, 07:23:40 am »

EDIT: Five years later, I found my code didn't work that well.

At the bottom of this post is a working routine, though not an ideal one.

Hmm, for 'rectangle_loop_x' and 'rectangle_loop_x_end' you have 'or (hl)' which I think needs to be SMC'd.

I decided to try to optimise my code a bit, and I managed to optimise it for size and, in most cases, for speed. Unfortunately, the code is still much bigger than chickendude's, at 133 bytes.

To give an indicator of speed (6MHz):

my old routine, 9x18 rectangle : 1926 times in two seconds
my new routine, 9x18 : 5668 times in two seconds
chickendude's old routine : 1287 times in two seconds
chickendude's new routine : untested .__.

So for cases where you don't need crazy speed, chickendude's is still very fast and much smaller (73 bytes versus 133).
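
For a rough feel of what those counts mean per call, the numbers can be turned into clock cycles. This Python sketch assumes the nominal 6 MHz clock quoted above and ignores the overhead of the benchmark loop itself, so treat the results as ballpark figures only.

```python
CLOCK_HZ = 6_000_000  # nominal 6 MHz, as quoted above
SECONDS = 2

def cycles_per_call(calls):
    """Approximate clock cycles per call, including benchmark loop overhead."""
    return CLOCK_HZ * SECONDS // calls

for name, calls in [("my old routine", 1926),
                    ("my new routine", 5668),
                    ("chickendude's old routine", 1287)]:
    print(name, cycles_per_call(calls))
```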

Code: [Select]

Rectangle_or:
 ld a,$B6
 jr Rectangle
Rectangle_xor:
 ld a,$AE
Rectangle:
;    DE = (x,y)
;    BC = (height,width)
 ld (smc_logic1),a
 ld (smc_logic0),a
 push de
 push bc
 push bc
 ld a,d
 call ComputeByte
 ld (smc_FirstByte),a
 ex (sp),hl
 ld a,d
 neg
 and 7
 ld b,a
 ld a,l
 sub b
 ex (sp),hl
 ld c,a
 call ComputeByte
 cpl
 ld (smc_LastByte),a
 ld b,a
    sra c \ sra c \ sra c
    inc c
; ld a,c
; and %11111000
; rra \ rra \ rra
;   inc a
; ld c,a

 ld a,d
 ld d,0
 ld h,d
 ld l,e
 add hl,hl
 add hl,de
 add hl,hl
 add hl,hl
 and %11111000
 rra \ rra \ rra
 add a,GBUF_LSB
 ld e,a
 ld d,GBUF_MSB
 add hl,de
;HL points to the first byte 
 pop de
;D is the height
;E is the number of bytes wide 
 inc c
 dec c
 jr nz,RectOverLoop-1
  ld a,(smc_FirstByte)
  and b
  ld c,a  ;value
  ld b,d  ;height
  ld de,12
   ld a,c
smc_logic1:
   or (hl)
   ld (hl),a
   add hl,de
   djnz $-4
   pop bc
   pop de
   ret
 ld e,c
RectOverLoop:
 ld b,e
 ld c,12
 .db 3Eh       ;start of ld a,*
smc_FirstByte:
 .db 0
RectLoop:
smc_logic0:
 or (hl)
 ld (hl),a
 inc hl
 dec c
    jr z,ExitLoop
 ld a,-1
 djnz RectLoop
;    jp p,$+4
;    dec b
 .db 3Eh       ;start of ld a,*
smc_LastByte:
 .db 0
 or (hl)
 ld (hl),a
 add hl,bc
ExitLoop:
    dec d
 jr nz,RectOverLoop
 pop bc
 pop de
 ret

ComputeByte:
 and 7
 ld b,a
 ld a,80h
 jr z,$+5
   rrca
   djnz $-1
 add a,a
 dec a
 ret

Necro-edit:
This code is working, and it performs clipping. Using the above benchmarks, this code draws approximately 4240 of those rectangles per two seconds at 6MHz on an actual calc. The downside is the size.

The core routine is 119 bytes, and the code for XOR rectangle is an additional 61 bytes, OR rectangle is also 61 bytes, and Erase rectangle is 66 bytes. They do not preserve registers. However, they were made to run in an app, so they don't rely on SMC. If they did use SMC, all of the routines could probably fit in just under 200 bytes.

Code: [Select]

;;
;;rectXOR
;;rectOR
;;rectErase
;;  (B,C) = (x,y) signed
;;  (D,E) = (w,h) unsigned
;;  HL points to buf

rectXOR:
    push hl
    call rectSub
    pop ix
    ret nc
    ex de,hl
    add ix,de
    ex de,hl
    push ix
    pop hl
    dec b
    jp m,xorrect0
    inc b
xor_rect_loop:
    push bc
    push hl
    ld a,(hl) \ xor d \ ld (hl),a \ inc hl
    dec b
    jr z,$+8
    ld a,(hl) \ cpl \ ld (hl),a \ inc hl \ djnz $-4
    ld a,(hl) \ xor e \ ld (hl),a
    ld bc,12
    pop hl
    add hl,bc
    pop bc
    dec c
    jr nz,xor_rect_loop
    ret
xorrect0:
    ld a,d
    and e
    ld b,c
    ld c,a
    ld de,12
    ld a,c
    xor (hl)
    ld (hl),a
    add hl,de
    djnz $-4
    ret
rectErase:
    push hl
    call rectSub
    pop ix
    ret nc
    ex de,hl
    add ix,de
    ex de,hl
    push ix
    pop hl
    ld a,d
    cpl
    ld d,a
    ld a,e
    cpl
    ld e,a
    dec b
    jp m,eraserect0
    inc b
erase_rect_loop:
    push bc
    push hl
    ld a,(hl) \ and d \ ld (hl),a \ inc hl
    dec b
    jr z,$+7
    xor a
    ld (hl),a \ inc hl \ djnz $-2
    ld a,(hl) \ and e \ ld (hl),a
    ld bc,12
    pop hl
    add hl,bc
    pop bc
    dec c
    jr nz,erase_rect_loop
    ret
eraserect0:
    ld a,d
    xor e
    ld b,c
    ld c,a
    ld de,12
    ld a,c
    and (hl)
    ld (hl),a
    add hl,de
    djnz $-4
    ret
rectOR:
    push hl
    call rectSub
    pop ix
    ret nc
    ex de,hl
    add ix,de
    ex de,hl
    push ix
    pop hl
    dec b
    jp m,orrect0
    inc b
or_rect_loop:
    push bc
    push hl
    ld a,(hl) \ or d \ ld (hl),a \ inc hl
    dec b
    jr z,$+8
    ld c,-1
    ld (hl),c \ inc hl \ djnz $-2
    ld a,(hl) \ or e \ ld (hl),a
    ld bc,12
    pop hl
    add hl,bc
    pop bc
    dec c
    jr nz,or_rect_loop
    ret
orrect0:
    ld a,d
    and e
    ld b,c
    ld c,a
    ld de,12
    ld a,c
    or (hl)
    ld (hl),a
    add hl,de
    djnz $-4
    ret
rectSub:
;(B,C) = (x,y) signed
;(D,E) = (w,h) unsigned
;Output:
;  Start Mask  D
;  End Mask    E
;  Byte width  B
;  Height      C
;  buf offset  HL
  bit 7,b
  jr z,+_
  ;Here, b is negative, so we have to add width to x.
  ;If the result is still negative, the entire box is out of bounds, so return
  ;otherwise, set width=newvalue,b=0
  ld a,d
  add a,b
  ret nc
  ld d,a
  ld b,0
_:
  bit 7,c
  jr z,+_
  ld a,e
  add a,c
  ret nc
  ld e,a
  ld c,0
_:
;We have clipped all negative areas.
;Now we need to verify that (x,y) are on the screen.
;If they aren't, then the whole rectangle is off-screen so no need to draw.
  ld a,b
  cp 96
  ret nc
  ld a,c
  cp 64
  ret nc
;Let's also verify that height and width are non-zero:
  ld a,d
  or a
  ret z
  ld a,e
  or a
  ret z
;Now we need to clip the width and height to be in-bounds
  add a,c
  cp 65
  jr c,+_
  ;Here we need to set e=64-c
  ld a,64
  sub c
  ld e,a
_:
  ld a,d
  add a,b
  cp 97
  jr c,+_
  ;Here we need to set d=96-b
  ld a,96
  sub b
  ld d,a
_:
;B is starting X
;C is starting Y
;D is width
;E is height

  push bc
  ld a,b
  and 7
  ld b,a
  ld a,-1
  jr z,+_
  rra \ djnz $-1
_:
  inc a
  cpl
  ld h,a    ;start mask

  ld a,b
  add a,d
  and 7
  ld b,a
  ld a,-1
  jr z,+_
  rra \ djnz $-1
_:
  inc a
  ld l,a  ;end mask
  ex (sp),hl
  ;stack now holds DE
  ;HL is now the coordinates
  ;B=0, C=height
  ;A,BC are free to destroy
  ld a,h
  ld h,b
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  ld b,a
  rrca
  rrca
  rrca
  and 31
  add a,l
  ld l,a
  jr nc,$+3
  inc h

;B is the starting x, D is width
;Only A,B,D,E are available
  ld a,b
  add a,d
  and $F8
  ld d,a

  ld a,b
  and $F8
  ld b,a
  ld a,d
  sub b
  rrca
  rrca
  rrca
  ld b,a
  ld c,e
  pop de
  scf
  ret
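
The clipping steps described in the comments of the subroutine above can be summarised in a few lines of Python. This is only a sketch of the clipping logic for a 96x64 screen, with x and y as signed values and w and h as sizes; it does not model the masks or the buffer offset, and the function name is made up.

```python
SCREEN_W, SCREEN_H = 96, 64

def clip_rect(x, y, w, h):
    """Return the on-screen part of the rectangle as (x, y, w, h), or None."""
    if x < 0:
        w += x  # drop the part hanging off the left edge
        x = 0
    if y < 0:
        h += y  # drop the part hanging off the top edge
        y = 0
    if w <= 0 or h <= 0 or x >= SCREEN_W or y >= SCREEN_H:
        return None  # nothing left to draw
    return x, y, min(w, SCREEN_W - x), min(h, SCREEN_H - y)

print(clip_rect(-3, 5, 10, 4))
print(clip_rect(90, 60, 20, 20))
```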

chickendude · « **Reply #70 on:** February 04, 2013, 11:37:01 am »

The other routine is just a plain rectangle (non-filled) drawing routine. I don't bother SMC'ing the or (hl), just the ld (hl),a. What it does is draw the first and last pixels of the line/border outside of the main loop then either draws (ld (hl),a) all pixels in between or skips (nop's) them all. It's a bit larger so unless the speed is that important you might be better off drawing two filled rectangles with the other routine, one OR'd and a slightly smaller one XOR'd.

Also, i just realized i don't need the "or a" after "ex af,af'" since the flags should still be preserved from the "and $7", so we can delete that.

It's interesting to see how our syntax/style differs even on bits of code that do exactly the same thing

And if you don't mind sacrificing just a few clocks (altogether maybe between 8-64), maybe you could try this and save 2 bytes:

Code: [Select]

ComputeByte:
 and 7
 ld b,a
 ld a,$FF
 ret z
   srl a  ;or or a \ rra
   djnz $-2
 ret

Xeda112358 · « **Reply #71 on:** February 04, 2013, 01:03:55 pm »

Quote from: chickendude on February 04, 2013, 11:37:01 am

It's interesting to see how our syntax/style differs even on bits of code that do exactly the same thing

I noticed that, too, it is kind of neat. I noticed that I mask A and then shift, whereas you shift, then mask. I've done both, but I typically do the former.

Quote from: chickendude on February 04, 2013, 11:37:01 am

And if you don't mind sacrificing just a few clocks (altogether maybe between 8-64), maybe you could try this and save 2 bytes:

Nice, I like that! When B is not 0, that is actually anywhere from 6 cycles faster to 18 cycles slower, so I think that is a great trade-off.
By adding a byte, I can make your routine anywhere from 0 cycles to 24 cycles faster when b>0

Code: [Select]

ComputeByte:
 neg   ; or 'cpl \ inc a'
 and 7
 ld b,a
 ld a,0FFh
 ret z
 add a,a
 djnz $-1
 ret

This happens to be faster than my original and smaller by one byte. Still, it is larger than yours

Xeda112358 · « **Reply #72 on:** March 07, 2013, 04:14:15 pm »

EDIT3: (continued below) Smaller, faster version below.
I wanted to optimise an old routine in Grammer to compute the GCD of two 16-bit numbers. I came up with this:

Code: [Select]

GCDDE_HL:
;Inputs:
;     HL,DE are the two values
;Outputs:
;     B is 0
;     DE is 0
;     HL is the GCD
;     C is not changed
;Destroys:
;     A
     xor a              ;AF     4
     ld b,a             ;47     4
CheckMax:               ;
     sbc hl,de          ;ED52   15n
     jr z,AdjustGCD     ;28**   12n-5
     jr nc,ParityCheck  ;30**   12n-5
     xor a              ;AF     4(n-a)
     sub l              ;95     4(n-a)
     ld l,a             ;6F     4(n-a)
     sbc a,a            ;9F     4(n-a)
     sub h              ;94     4(n-a)
     ld h,a             ;67     4(n-a)
     ex de,hl
     jp CheckMax        ;C3**** 10(n-a)
ParityCheck:            ;
     bit 0,e            ;CB**   8a
     jr nz,DE_Odd       ;20**   12a-5b
     bit 0,l            ;CB**   8b
     jr z,BothEven      ;28**   12b-5c
     rr d               ;CB**   8(n-a-b-c)
     rr e               ;CB**   8(n-a-b-c)
     jp CheckMax        ;C3**** 10(n-a-b-c)
BothEven:               ;
     inc b              ;04     4c
     rr d \ rr e        ;       16c
     rr h \ rr l        ;       16c
     jp CheckMax        ;       10c
DE_Odd:                 ;
     bit 0,l            ;       8b
     jr nz,BothOdd      ;       12b-5d
     rr h \ rr l        ;       16(n-a-b-d)
     jp CheckMax        ;       10(n-a-b-d)
BothOdd:                ;
     sbc hl,de          ;       15d
     rr h \ rr l        ;       16d
     jp CheckMax        ;       10d
AdjustGCD:              ;
     ex de,hl           ;       4
     inc b              ;       4
     dec b              ;       4
     ret z              ;       11+4(k>0)
     add hl,hl          ;       11k
     djnz $-1           ;       13k-5
     ret                ;       --

It is a lot faster than my other version which used division to compute the mod of two 16-bit numbers

The JP instructions can be changed to JR for better portability and to save a byte each time.
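
For comparison, here is the textbook binary (Stein) GCD in Python. It is not a line-by-line transcription of the routine above, just a sketch of the general idea of replacing division with shifts, parity tests and subtraction.

```python
def binary_gcd(a, b):
    """GCD of two non-negative integers using only shifts and subtraction."""
    if a == 0 or b == 0:
        return a | b
    shift = 0
    while (a | b) & 1 == 0:  # both even: factor out a shared 2
        a >>= 1
        b >>= 1
        shift += 1
    while a & 1 == 0:        # remaining 2s in a are not common factors
        a >>= 1
    while b:
        while b & 1 == 0:
            b >>= 1
        if a > b:
            a, b = b, a
        b -= a               # both odd here, so the difference is even
    return a << shift        # put the shared 2s back

print(binary_gcd(48, 18))
```

The final shift plays the same role as the add hl,hl / djnz loop at AdjustGCD.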

EDIT: And if I didn't make a mistake, the 8-bit version:

Code: [Select]

GCD_A_C:
;Outputs:
;    A is the GCD
;    C should be the smallest odd number that divides both inputs
;    B is 0
;Destroys:
;    D
     ld b,1
CheckMax:
     sub c
     jr z,AdjustGCD
     jr nc,ParityCheck
     neg
     ld d,a
     ld a,c
     ld c,d
     jr CheckMax
ParityCheck:
     rrc c
     jr c,c_Odd
     inc b
     rrca
     jr nc,CheckMax
     rlca
     djnz CheckMax
c_Odd:
     rlc c
     rrca
     jr nc,CheckMax
     rlca
     jr CheckMax
AdjustGCD:
     ld a,c
     dec b
     ret z
     add a,a
     djnz $-1
     ret

EDIT2: I think my estimate of a little over 4000 cycles for the slowest case of the first routine is a massive overestimate. My old routine used about 1500 cycles at the fastest for a non-trivial result; 1500 is likely close to the slowest that the new routine will run.

EDIT3: 14 bytes saved, runs faster?

Code: [Select]

GCDDE_HL:
;Inputs:
;     HL,DE are the two values
;Outputs:
;     B is 0
;     DE is 0
;     HL is the GCD
;     C is not changed
;     A is not changed
     ld b,1
     or a
CheckMax:               ;
     sbc hl,de          ;ED52   15n
     jr z,AdjustGCD     ;28**   12n-5
     jr nc,ParityCheck  ;30**   12n-5
     add hl,de
     or a
     ex de,hl
ParityCheck:            ;
     bit 0,e            ;CB**   8a
     jr nz,DE_Odd       ;20**   12a-5b
     bit 0,l            ;CB**   8b
     jr z,BothEven      ;28**   12b-5c
     rr d               ;CB**   8(n-a-b-c)
     rr e               ;CB**   8(n-a-b-c)
     jp CheckMax        ;C3**** 10(n-a-b-c)
BothEven:               ;
     inc b              ;04     4c
     rr d \ rr e        ;       16c
HL_Even:
     rr h \ rr l        ;       16c
     jp CheckMax        ;       10c
DE_Odd:                 ;
     bit 0,l            ;       8b
     jr z,HL_Even       ;       12b-5d
     sbc hl,de          ;       15d
     rr h \ rr l        ;       16d
     jp nz,CheckMax        ;       10d
AdjustGCD:              ;
     ex de,hl           ;       4
     dec b              ;       4
     ret z              ;       11+4(k>0)
     add hl,hl          ;       11k
     djnz $-1           ;       13k-5
     ret                ;       --

Xeda112358 · « **Reply #73 on:** July 04, 2013, 08:46:10 am »

EDIT: Fixed a problem to take care of the case where HL= 8000h (thanks Jacobly!)
This routine a few pages back can be optimised:

Quote from: Quigibo on May 01, 2010, 03:19:23 am

Code: [Select]

SignedDivision:
 ld a,h
 xor d
 push af
 bit 7,h
 jr z,$+8
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 bit 7,d
 jr z,$+8
 xor a
 sub e
 ld e,a
 sbc a,a
 sub d
 ld d,a
 call RegularDivision
 pop af
 add a,a
 ret nc
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 ret

For the sign testing, I came up with this:

Code: [Select]

SignedDivision:
 ld a,h
 xor d
 push af

 xor d
 jp p,$+9
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a

 bit 7,d
 jr z,$+8
 xor a
 sub e
 ld e,a
 sbc a,a
 sub d
 ld d,a

 call RegularDivision

 pop af
 ret p

 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 ret

In all, it saves 1 byte and at least 5 t-states (it will be either 5 or 10).
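
The overall flow (remember the sign of the result, divide the magnitudes, negate at the end) looks like this as a Python sketch on 16-bit words. Python's integer division stands in for RegularDivision here, which is assumed to return an unsigned quotient; the divisor is assumed to be non-zero.

```python
def neg16(v):
    """Two's-complement negation of a 16-bit word."""
    return (-v) & 0xFFFF

def signed_div16(hl, de):
    """Quotient of two 16-bit two's-complement words, truncated toward zero."""
    negative = ((hl ^ de) & 0x8000) != 0  # ld a,h \ xor d: the signs differ
    if hl & 0x8000:
        hl = neg16(hl)
    if de & 0x8000:
        de = neg16(de)
    q = hl // de                          # stand-in for RegularDivision
    return neg16(q) if negative else q

print(hex(signed_div16(0xFFF9, 2)))  # -7 / 2
```

Negating $8000 gives $8000 again, which is the corner case mentioned in the EDIT at the top of the post.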

Xeda112358 · « **Reply #74 on:** October 20, 2013, 03:53:19 pm »

Here is a sqrtL routine:

Code: [Select]

SqrtL:
;Inputs:
;     L is the value to find the square root of
;Outputs:
;      C is the result
;      B,L are 0
;     DE is not changed
;      H is the remainder (input - C*C), i.e. how far the input is above the nearest perfect square below it
;      H = 0 exactly when the input was a perfect square (the z flag on exit does not reliably show this)
;Destroyed:
;      A
     ld bc,400h       ; 10    10
     ld h,c           ; 4      4
sqrt8Loop:            ;
     add hl,hl        ;11     44
     add hl,hl        ;11     44
     rl c             ; 8     32
     ld a,c           ; 4     16
     rla              ; 4     16
     sub a,h          ; 4     16
     jr nc,$+5        ;12|19  48+7x
       inc c
       cpl
       ld h,a
     djnz sqrt8Loop   ;13|8   47
     ret              ;10     10
;287+7x, x is the number of bits in the result
;min: 287
;max: 315
;19 bytes
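
The loop is the usual digit-by-digit square root, taking two bits of the input per pass. A Python sketch of the same steps, with the matching instructions noted in the comments (the function name is only for illustration):

```python
def sqrt8(value):
    """Return (root, remainder) with root*root + remainder == value, for 0..255."""
    root, rem = 0, 0
    for _ in range(4):                         # four 2-bit groups, like ld b,4
        rem = (rem << 2) | ((value >> 6) & 3)  # add hl,hl twice moves 2 bits into h
        value = (value << 2) & 0xFF
        root <<= 1                             # rl c
        if rem > 2 * root:                     # sub a,h borrows
            rem -= 2 * root + 1                # cpl \ ld h,a
            root += 1                          # inc c
    return root, rem

print(sqrt8(25), sqrt8(8), sqrt8(255))
```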

Also, in case anybody needed a small GCD (Greatest Common Divisor) routine, I have this:

Code: [Select]

GCDHL_DE:
;Outputs:
;     DE is the GCD
GCDLoop:
     or a
     sbc hl,de
     ret z
     jr nc,$-3
     add hl,de
     ex de,hl
     jp GCDLoop

(a faster one is a few posts up)
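
In Python terms the small routine is plain repeated subtraction with a swap whenever the subtraction would go negative. A sketch, assuming both inputs are non-zero (with a zero input this loops forever, just like the assembly):

```python
def gcd_subtract(hl, de):
    """GCD by repeated subtraction; both inputs must be non-zero."""
    while hl != de:
        if hl > de:
            hl -= de         # sbc hl,de with no borrow
        else:
            hl, de = de, hl  # add hl,de \ ex de,hl
    return de

print(gcd_subtract(48, 18))
```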
If you need a fast way to see if a 16-bit number is divisible by 3 (without actually dividing)

Code: [Select]

HL_mod_3:
;Outputs:
;     Preserves HL
;     A is the remainder
;     destroys DE,BC
;     z flag if divisible by 3, else nz
     ld bc,030Fh
     ld a,h
     add a,l
     adc a,0   ;add the carry back in (256 = 1 mod 3)
;Now we need to add the upper and lower nibble in a
     ld d,a
     and c
     ld e,a
     ld a,d
     rlca
     rlca
     rlca
     rlca
     and c

     add a,e
     sub c
     jr nc,$+3
     add a,c
;add the lower half nibbles

     ld d,a
     sra d
     sra d
     and b
     add a,d
     sub b
     ret nc
     add a,b
     ret
;at most 132 cycles, at least 123
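
The idea behind the routine is that 256, 16 and 4 all leave a remainder of 1 when divided by 3, so bytes, nibbles and bit pairs can simply be added together without changing the remainder. A Python sketch of that folding (not a register-level transcription; in particular the last step here loops until the value is below 3 instead of subtracting once):

```python
def mod3(hl):
    """Remainder of a 16-bit value mod 3 using only adds, shifts and masks."""
    a = (hl >> 8) + (hl & 0xFF)  # 256 = 1 (mod 3): add the two bytes
    a = (a >> 4) + (a & 0x0F)    # 16 = 1 (mod 3): add the nibbles
    a = (a >> 4) + (a & 0x0F)    # fold again in case the sum grew past a nibble
    a = (a >> 2) + (a & 3)       # 4 = 1 (mod 3): add the bit pairs
    a = (a >> 2) + (a & 3)
    while a >= 3:
        a -= 3
    return a

print(mod3(0x8081), mod3(0x00FF))
```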
