Omnimaga: The Coders Of Tomorrow
Topic: Assembly Programmers - Help Axe Optimize!
Runer112
« Reply #270 on: 13 December, 2011, 06:57:46 »
+5

Yeah, I see no way to optimize the full 32-bit multiplication... But fixed-point multiplication, now that's an entirely different story! First, here's a totally different approach to sign handling that reduces p_88Mul to less than half of its current size!


Original routine: 38 bytes, ~1128 cycles

p_88Mul:
.db __88MulEnd-1-$
ld a,h
xor d
push af
bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
call $3F00+sub_MulFull
ld l,h
ld h,a
pop af
xor h
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__88MulEnd:
Smaller routine: 18 bytes, ~1089 cycles

p_88Mul:
.db __88MulEnd-1-$
push hl
call $3F00+sub_MulFull
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
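
For anyone who wants to check the sign trick without a calculator, here is a rough Python model of the smaller routine. It is only a sketch: it assumes sub_MulFull returns the unsigned 32-bit product with bits 16-23 in a and bits 0-15 in hl, and the names mul88_small, mul88_reference and to_signed16 are made up for this illustration.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul88_small(hl, de):
    """Model of the 18-byte routine: unsigned multiply, then fix the upper byte."""
    product = hl * de                    # call sub_MulFull (unsigned 16x16 -> 32)
    a = (product >> 16) & 0xFF           # a = bits 16-23 (assumed output)
    mid = (product >> 8) & 0xFF          # h = bits 8-15 (assumed output)
    if hl & 0x8000:                      # bit 7,b / sub e
        a = (a - (de & 0xFF)) & 0xFF
    if de & 0x8000:                      # bit 7,d / sub c
        a = (a - (hl & 0xFF)) & 0xFF
    return (a << 8) | mid                # h = a, l = old h


def mul88_reference(hl, de):
    """Signed 8.8 product, rounded down, wrapped to 16 bits."""
    return ((to_signed16(hl) * to_signed16(de)) >> 8) & 0xFFFF
```

The idea is that a negative operand read as unsigned is 65536 too large, so the unsigned product is too large by the other operand shifted up 16 bits. Only the low byte of that excess lands in bits 16-23, which is why a single sub per operand is enough.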


20 bytes saved? Not bad at all! But what if you're more interested in shaving off cycles than bytes? Don't worry, I covered that base too. Instead of using the slower p_MulFull, this final routine uses my faster p_Mul for 8 bits of the multiplication and an inlined, slightly different version of the faster multiplication for the other 8 bits. End result: it's about 260 cycles faster than the smaller solution, or about 30% faster! It's 16 bytes larger than my smaller method, but it would often result in smaller programs anyway, because it relies on the much more popular p_Mul instead of p_MulFull.


Faster routine: 34 bytes, ~831 cycles

p_88Mul:
.db __88MulEnd-1-$
push hl
ld c,l
ld a,h
ld l,0
ld b,b \ .db 8 \ call $3F00+sub_Mul
ld a,c
ld bc,8<<8+0
__88MulNext:
add hl,hl
rla
jr nc,__88MulSkip
add hl,de
adc a,c
__88MulSkip:
djnz __88MulNext
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
« Last Edit: 13 December, 2011, 07:04:38 by Runer112 » Logged
Quigibo
« Reply #271 on: 13 December, 2011, 08:19:54 »
0

Wow, thanks! However, there seems to be an issue. The 3 pictures attached are the output from the Mandelbrot Set demo program. The first is the original routine. The second is your new size-optimized version. As you can see, it works, but the rounding appears to be asymmetrical (which might still be okay). The last one is your speed-optimized version. I think you have a bug somewhere...


* mbrot1.gif (1.71 KB, 192x128 - viewed 387 times.)

* mbrot2.gif (1.71 KB, 192x128 - viewed 385 times.)

* mbrot3.gif (1.78 KB, 192x128 - viewed 381 times.)

___Axe_Parser___
Today the calculator, tomorrow the world!
Runer112
« Reply #272 on: 13 December, 2011, 08:38:30 »
0

I think I can explain the asymmetry of the size-optimized version. Because it adjusts signs differently, I think it now rounds down instead of towards zero like the old routine.
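
To put a number on that rounding difference, here is a tiny Python sketch. It works on signed raw 8.8 values (the real value times 256), and mul88_toward_zero and mul88_round_down are hypothetical stand-ins for the old and new sign handling, not the routines themselves.

```python
def mul88_toward_zero(x, y):
    """Old approach: multiply magnitudes, drop the low 8 bits, then restore the sign."""
    magnitude = (abs(x) * abs(y)) >> 8
    return -magnitude if (x < 0) != (y < 0) else magnitude


def mul88_round_down(x, y):
    """New approach: the low 8 bits of the signed product are simply discarded."""
    return (x * y) >> 8                  # Python's >> rounds toward negative infinity


# -1/256 times 0.5 is -1/512, which cannot be represented in 8.8
print(mul88_toward_zero(-1, 128))        # 0
print(mul88_round_down(-1, 128))         # -1
```

The two only disagree when the product is negative and the discarded fraction is nonzero, which would fit a picture that is slightly lopsided rather than outright wrong.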

However, I have no clue what is going on with the speed-optimized routine. Can you look at the debugger and confirm that the call to sub_Mul is actually entering where it's supposed to be entering, at __MulByte? Because I wouldn't be surprised if the fact that you probably had to add the offset call macro for call nz,__MulByte in p_Mul is messing up the offset calls due to its own size.
« Last Edit: 13 December, 2011, 08:40:46 by Runer112 » Logged
Quigibo
« Reply #273 on: 13 December, 2011, 09:07:20 »
0

The disassembly looks fine to me. All the jumps, calls, and everything of that nature are aligned. I tried 4 test cases with different combinations of sign values and they seemed okay. Since the generated picture is relatively close to the original, even though it is a chaotic system sensitive to errors, I would guess that only a few special cases cause it to return a wrong result.

EDIT: I made a program to run them side by side on random numbers and quit when the output is different.  Here is an output that gives different results between the routines:

$FFE0 ** $F5F1 (-0.125 ** -10.059)

Results in $0143 (1.26) in size optimized.
Results in $0239 (2.22) in speed optimized.
« Last Edit: 13 December, 2011, 09:31:40 by Quigibo » Logged

Runer112
« Reply #274 on: 13 December, 2011, 10:04:07 »
0

That edit was helpful. It gave me a hunch as to what the problem was and (I think) that hunch was correct. Unfortunately, the fix for this problem will cost a byte and about 70 cycles. It will still be about 20% faster than the small routine, though. And it still relies on the more common p_Mul instead of p_MulFull, so being 17 bytes larger might still be worth it.


Faster routine: 35 bytes, ~900 cycles

p_88Mul:
.db __88MulEnd-1-$
push hl
ld c,l
ld a,h
ld l,0
ld b,b \ .db 8 \ call $3F00+sub_Mul
ld b,8
__88MulNext:
add hl,hl
rla
rl c
jr nc,__88MulSkip
add hl,de
adc a,0
__88MulSkip:
djnz __88MulNext
pop bc
bit 7,b
jr z,$+3
sub e
ld l,h
ld h,a
bit 7,d
ret z
sub c
ld h,a
ret
__88MulEnd:
« Last Edit: 13 December, 2011, 10:04:48 by Runer112 » Logged
calc84maniac
« Reply #275 on: 18 December, 2011, 06:29:24 »
0

So... Z-Test. At a cost of 8 cycles, you can go from 17 bytes plus 3 bytes times the number of options (limited to something like 85?) to 16 bytes plus 2 bytes times the number of options (limited to amount of program space).

Here's my method:

  ld de,-range
  add hl,de
  ld de,jumptable_end
  jr c,default
  add hl,hl
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
default:
  ex de,hl
  jp (hl)
  .dw Label0
  .dw Label1
  .dw Label2
  ;.....
jumptable_end:
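
The negative-offset indexing above can be a little hard to picture, so here is a rough Python model of the dispatch. Everything concrete in it is an assumption for illustration: RANGE, the TABLE_END address and the label addresses are invented, and memory is just a dict standing in for RAM.

```python
RANGE = 3                              # hypothetical number of options
TABLE_END = 0x9000                     # hypothetical address of jumptable_end
LABELS = [0x9D95, 0x9DA0, 0x9DB7]      # hypothetical Label0..Label2 addresses
DEFAULT = TABLE_END                    # default case starts right after the table

# Lay the table out as .dw entries ending at TABLE_END (little-endian words).
memory = {}
for index, address in enumerate(LABELS):
    base = TABLE_END - 2 * (RANGE - index)
    memory[base] = address & 0xFF
    memory[base + 1] = address >> 8


def dispatch(hl):
    """Return the address the routine would jump to for the 16-bit input hl."""
    total = hl + ((-RANGE) & 0xFFFF)   # ld de,-range / add hl,de
    if total > 0xFFFF:                 # jr c,default: input was >= range
        return DEFAULT
    hl = total & 0xFFFF                # now holds (input - range), a negative value
    hl = (hl + hl) & 0xFFFF            # add hl,hl: two bytes per entry
    hl = (hl + TABLE_END) & 0xFFFF     # add hl,de: index backwards from the end
    return memory[hl] | (memory[hl + 1] << 8)


print(hex(dispatch(1)))                # 0x9da0
```

Because the single add both range-checks the input and turns it into a negative offset from the end of the table, no separate compare is needed.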

"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman
Quigibo
« Reply #276 on: 18 December, 2011, 06:39:31 »
0

Wow, thanks! I was considering that, but I assumed the overhead would be larger, not smaller!

Also, I could move the labels to the data section of the code to make it even faster!


  ld de,-range
  add hl,de
  jr c,default
  add hl,hl
  ld de,jumptable_end
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
  ex de,hl
  jp (hl)
default:
« Last Edit: 18 December, 2011, 06:41:17 by Quigibo » Logged

calc84maniac
« Reply #277 on: 18 December, 2011, 06:50:41 »
0

If you wanted to save 2 cycles in the case of a jump, you could use an odd table setup with all the LSBs in a row followed by all the MSBs in a row, like so:


  ld de,-range
  add hl,de
  jr c,routine_end
  ex de,hl
  ld hl,jumptable_end
  add hl,de
  ld a,(hl)
  add hl,de
  ld l,(hl)
  ld h,a
  jp (hl)
routine_end:

I imagine that might not work well with the way pointers are handled in the compiler, though.

Edit:
And I suppose the current Z-Test is actually limited to 39 options due to the range of the JR instruction...
« Last Edit: 18 December, 2011, 06:56:28 by calc84maniac » Logged

Runer112
« Reply #278 on: 19 December, 2011, 08:12:53 »
0

First, an optimization that I can't give you code for: making *^CONST use an equivalent constant division optimization if one exists. And don't forget about the trivial cases, *^1 and *^0. Of course, these only apply if you don't change this operation to return a 32-bit result somehow. Which it really should.

Next, some silly optimizations: ^0, <<ᴇ8000, >>ᴇ7FFF should simply be 0, while ≥≥ᴇ8000 and ≤≤ᴇ7FFF should simply be 1. If you're wondering why ^0 should be 0, that's what the general modulus routine would return anyways.

Finally, some optimizations for signed comparisons. For quite some time, these have lacked general forms that take advantage of absolute jumps, as well as optimized forms for constants. Thanks to jacobly and calc84maniac for helping me come up with the first two! If either of you two are reading this, feel free to look at the other operations and try to optimize them.


p_SGT0:
.db 8
ld a,h
or l
jr z,$+6
add hl,hl
sbc hl,hl
inc hl
p_SLE0:
.db 9
ld a,h
or l
jr z,$+6
add hl,hl
ccf
sbc hl,hl
inc hl
p_SLtLeXX:
.db 11
ld a,h
add a,$80
ld h,a
ld de,$0000 ;$8000-const
add hl,de
sbc hl,hl
inc hl
.db rp_Ans,6
p_SGtGeXX:
.db 12
ld a,h
add a,$80
ld h,a
xor a
ld de,$0000 ;$8000-const
add hl,de
ld h,a
rla
ld l,a
.db rp_Ans,6
p_SIntGt:
.db 11
scf
sbc hl,de
add hl,hl
jp pe,$+4
ccf
sbc hl,hl
inc hl
p_SIntGe:
.db 11
xor a
sbc hl,de
add hl,hl
jp po,$+4
ccf
ld h,a
rla
ld l,a
p_SIntLt:
.db 11
scf
sbc hl,de
add hl,hl
jp po,$+4
ccf
sbc hl,hl
inc hl
p_SIntLe:
.db 11
xor a
sbc hl,de
add hl,hl
jp pe,$+4
ccf
ld h,a
rla
ld l,a
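
As a sanity check on the constant comparisons, here is a rough Python model of p_SLtLeXX in its less-than form. It is a sketch under assumptions: signed_lt_const and to_signed16 are invented names, and the constant is assumed to be patched in as $8000-const exactly as the comment in the listing says.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def signed_lt_const(hl, const):
    """Return 1 if hl (a 16-bit pattern) is less than const as signed values."""
    biased = hl ^ 0x8000                 # ld a,h / add a,$80 / ld h,a
    de = (0x8000 - const) & 0xFFFF       # ld de,$8000-const
    carry = biased + de > 0xFFFF         # add hl,de
    return 0 if carry else 1             # sbc hl,hl / inc hl


print(signed_lt_const(0xFFFF, 0))        # 1, since -1 < 0
print(signed_lt_const(0x0005, 5))        # 0, since 5 < 5 is false
```

Flipping the sign bit maps signed order onto unsigned order, so the carry out of one 16-bit add answers the comparison. In this model a constant of -32768 turns into de = 0 and never carries, which matches the note above that such edge cases should be folded to a fixed result instead.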
« Last Edit: 19 December, 2011, 21:52:21 by Runer112 » Logged
jacobly
« Reply #279 on: 20 December, 2011, 11:47:46 »
+1

p_DrawOff: save 1 byte, save ~40 cycles
Original

xor a
ld e,a
dec a
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
and (hl)
or e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
cpl
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
Optimized

xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
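
The two p_DrawOff versions build their masks differently, so here is a rough Python check that they write the same bytes. It is only a sketch: the function names are invented, the c:e:a register chain is modelled as one 24-bit shift, and the clipping logic that uses d is left out.

```python
def draw_off_original(sprite, n, screen_left, screen_right):
    """Original: c:e:a starts as sprite:$00:$FF, then and-mask followed by or-sprite."""
    bits = ((sprite << 16) | 0x0000FF) >> n
    c, e, a = (bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF
    right = (screen_right & a) | e       # and (hl) / or e
    left = (screen_left & (~a & 0xFF)) | c   # cpl / and (hl) / or c
    return left, right


def draw_off_optimized(sprite, n, screen_left, screen_right):
    """Optimized: c:e:a starts as sprite:$FF:$00, so no cpl is needed."""
    bits = ((sprite << 16) | 0x00FF00) >> n
    c, e, a = (bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF
    right = (screen_right | a) & e       # or (hl) / and e
    left = (screen_left & a) | c         # and (hl) / or c
    return left, right


print(draw_off_optimized(0xFF, 3, 0x00, 0x00))   # (31, 224)
```

Here n is the shift count from 1 to 7, since the aligned case takes a different path in the routine.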

p_Pix: save 2 bytes, save ~6 cycles
Original

p_Pix:
.db __PixEnd-1-$ ;Draws pixel (c,l)
ld de,plotSScreen
pop af
pop bc
push af
ld b,0

ld a,l
cp 64
ld a,b
ret nc
ld a,c
cp 96
ld a,b
ret nc

ld h,b
ld a,l
add a,a
add a,l
ld l,a
add hl,hl
add hl,hl
add hl,de
ld a,c
srl c
srl c
srl c
add hl,bc
and %00000111
ld b,a
ld a,%10000000
ret z
___GetPixLoop:
rrca
djnz ___GetPixLoop
ret
__PixEnd:
Optimized

p_Pix:
.db __PixEnd-1-$ ;Draws pixel (c,l)
ld de,plotSScreen
pop af
pop bc
push af
ld b,0

ld a,c
cp 96
ld a,b
ret nc
sla l
ret c
sla l
ret c

ld h,b
ex de,hl
add hl,de
add hl,de
add hl,de
ld a,c
srl c
srl c
srl c
add hl,bc
and %00000111
ld b,a
ld a,%10000000
ret z
___GetPixLoop:
rrca
djnz ___GetPixLoop
ret
__PixEnd:
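
The p_Pix change folds the Y bounds check into the multiply by 4, which can be checked with a rough Python model. This is a sketch with invented function names: each one returns the byte offset into the buffer and the bit mask, or None when the routine would bail out early.

```python
def pix_original(x, y):
    """Original: compare against 64 and 96, then compute 12*y + x/8."""
    if y >= 64 or x >= 96:               # cp 64 / ret nc, cp 96 / ret nc
        return None
    offset = ((y * 3) & 0xFF) * 4 + (x >> 3)
    return offset, 0x80 >> (x & 7)


def pix_optimized(x, y):
    """Optimized: two sla l / ret c pairs reject y >= 64 while computing 4*y."""
    if x >= 96:                          # cp 96 / ret nc
        return None
    l = y
    for _ in range(2):
        l <<= 1                          # sla l
        if l > 0xFF:                     # ret c
            return None
        l &= 0xFF
    offset = 3 * l + (x >> 3)            # add hl,de three times, then add hl,bc
    return offset, 0x80 >> (x & 7)


print(pix_optimized(95, 63))             # (767, 1)
```

Any y of 64 or more has bit 6 or bit 7 set, so one of the two shifts pushes a 1 into carry and the separate cp 64 is no longer needed.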

p_ArcTan: save 1 byte, save ~1 cycle
Original

p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-1
add hl,de ;\
jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
Optimized

p_ArcTan:
.db __ArcTanEnd-1-$
ex de,hl ;de = y
pop hl
ex (sp),hl ;hl = x
push hl
ld a,h ;\
xor d ;/ Get parity
jp m,__ArcTanSS-p_ArcTan-2
add hl,de ;\
ld c,c \ .db $FA ; |
;jr __ArcTanDS ; |
__ArcTanSS: ; |hl = x +- y
sbc hl,de ; |
__ArcTanDS: ;/
ex de,hl ;de = x +- y
ld b,6 ;\
__ArcTan64: ; |
add hl,hl ; |hl = 64y
djnz __ArcTan64 ;/
call $3F00+sub_SDiv ;hl = 64y/(x +- y)
pop af ;\
rla ; |Right side, fine
ret nc ;/
sbc a,a ;\
sub h ; |Reverse sign extend
ld h,a ;/
ld a,l ;\
add a,128 ; |Add or sub 128
ld l,a ;/
ret
__ArcTanEnd:
jacobly
« Reply #280 on: 24 December, 2011, 20:45:01 »
+1

p_DrawOr/Xor: save 17 bytes (plus 4 every time a custom buffer is used)
aligned saves 98 cycles, unaligned saves ~173 cycles
save additional 21 cycles every time a custom buffer is used

p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawOrClipBottom
ld a,b
jr __DrawOrClipBottom
__DrawOrClipTop:
inc ix
inc c
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawOrAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:

p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop bc ;Input c = Sprite Y Position
pop de ;Input e = Sprite X Position
push af
ld b,7
ld a,e
add a,b
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,a
ld a,c
add a,b
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp b
jr c,__DrawXorClipBottom
ld a,b
jr __DrawXorClipBottom
__DrawXorClipTop:
inc ix
inc c
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
inc a
ld b,0
sla c
sla c
add hl,bc
add hl,bc
add hl,bc
ld c,d
add hl,bc
ld b,a
ld a,e
and 7
jr z,__DrawXorAligned
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:
Xeda112358
« Reply #281 on: 24 December, 2011, 22:58:42 »
0

I finally have an optimisation that might work or be useful >.> Runer112 apparently mentioned optimising the p_FreqOut routine by replacing:

dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2
with this:

cpd
jp pe,__FreqOutLoop2
The issue was that the frequency would be thrown off, as it cut out 8*HL cycles. When I was stealing the code for my own evil intentions, I saw this optimisation, thought of that issue, and came up with this solution:

p_FreqOut:
xor a
__FreqOutLoop1:
push bc
xor %00000011
ld e,a
__FreqOutLoop2:
ld a,h
or l
jr z,__FreqOutDone
cpd
ld a,e
scf
jp pe,__FreqOutLoop2
__FreqOutDone:
pop bc
out ($00),a
ret nc
jr __FreqOutLoop1
__FreqOutEnd:
The way the code is reordered, now, it should only cut out 8*HL/BC cycles which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.


EDIT: Okay, found a problem: it is actually 2 cycles slower in the inner loop now, so that will slow the routine by 2*hl, too.
« Last Edit: 24 December, 2011, 23:03:24 by Xeda112358 » Logged





jacobly
« Reply #282 on: 26 December, 2011, 19:17:26 »
+2

p_DrawOr: 18 bytes saved
p_DrawXor: 18 bytes saved
p_DrawOff: 14 bytes saved
p_DrawMsk: 10 bytes saved
p_DrawMsk2: 11 bytes saved

p_DrawOr:
.db __DrawOrEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOrClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOrClipBottom
ld b,d
jr __DrawOrNoClipV
__DrawOrClipTop:
inc ix
inc e
jr nz,__DrawOrClipTop
__DrawOrClipBottom:
ld b,a
__DrawOrNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawOrAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawOrLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawOrShift:
srl c
rra
djnz __DrawOrShift
and e
or (hl)
ld (hl),a
dec hl
ld a,c
and d
or (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawOrLoop
ret
__DrawOrAligned:
ld de,12
__DrawOrAlignedLoop:
ld a,(ix)
or (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawOrAlignedLoop
ret
__DrawOrEnd:

p_DrawXor:
.db __DrawXorEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawXorClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawXorClipBottom
ld b,d
jr __DrawXorNoClipV
__DrawXorClipTop:
inc ix
inc e
jr nz,__DrawXorClipTop
__DrawXorClipBottom:
ld b,a
__DrawXorNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
inc b
ld a,c
and d
ld d,-7*3
add hl,de
jr z,__DrawXorAligned
ld e,c
ld c,a
ld a,e
cp -7
sbc a,a
ld d,a
and e
cp 96-7
sbc a,a
ld e,a
__DrawXorLoop:
push bc
ld b,c
ld c,(ix)
xor a
__DrawXorShift:
srl c
rra
djnz __DrawXorShift
and e
xor (hl)
ld (hl),a
dec hl
ld a,c
and d
xor (hl)
ld (hl),a
ld c,13
add hl,bc
inc ix
pop bc
djnz __DrawXorLoop
ret
__DrawXorAligned:
ld de,12
__DrawXorAlignedLoop:
ld a,(ix)
xor (hl)
ld (hl),a
inc ix
add hl,de
djnz __DrawXorAlignedLoop
ret
__DrawXorEnd:

p_DrawOff:
.db __DrawOffEnd-1-$
push hl
pop ix ;Input ix = Sprite
ld hl,plotSScreen ;Input hl = Buffer
pop af
pop de ;Input e = Sprite Y Position
pop bc ;Input c = Sprite X Position
push af
ld d,7
ld a,e
add a,d
jr c,__DrawOffClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawOffClipBottom
ld b,d
jr __DrawOffNoClipV
__DrawOffClipTop:
inc ix
inc e
jr nz,__DrawOffClipTop
__DrawOffClipBottom:
ld b,a
__DrawOffNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawOffAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawOffLoop
inc d
cp 96-7
jr nc,__DrawOffLoop
inc d
__DrawOffLoop:
push bc
ld b,c
ld c,(ix+0)
xor a
ld e,$FF
__DrawOffShift:
srl c
rr e
rra
djnz __DrawOffShift
dec d
jr z,__DrawOffSkipRight
ld b,a
or (hl)
and e
ld (hl),a
ld a,b
__DrawOffSkipRight:
dec hl
inc d
jr z,__DrawOffSkipLeft
and (hl)
or c
ld (hl),a
__DrawOffSkipLeft:
ld bc,13
add hl,bc
inc ix
pop bc
djnz __DrawOffLoop
ret
__DrawOffAligned:
ld e,12
__DrawOffAlignedLoop:
ld a,(ix)
ld (hl),a
inc ix
add hl,de
djnz __DrawOffAlignedLoop
ret
__DrawOffEnd:

p_DrawMsk:
.db __DrawMskEnd-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMskClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMskClipBottom
ld b,d
jr __DrawMskNoClipV
__DrawMskClipTop:
inc ix
inc e
jr nz,__DrawMskClipTop
__DrawMskClipBottom:
ld b,a
__DrawMskNoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMskAligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMskLoop
inc d
cp 96-7
jr nc,__DrawMskLoop
inc d

__DrawMskLoop:
push bc

push hl

ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMskShift:
srl e
rr h
srl c
rra
djnz __DrawMskShift

ld b,h
pop hl
push af

dec d
jr z,__DrawMskSkipRight1

push bc
xor b
cpl
ld c,a

ld a,(hl)
or b
and c
ld (hl),a
pop bc

__DrawMskSkipRight1:
dec hl
inc d
push de
jr z,__DrawMskSkipLeft1

ld a,c
xor e
cpl
ld d,a

ld a,(hl)
or e
and d
ld (hl),a

__DrawMskSkipLeft1:
ld de,appBackUpScreen-plotSScreen+1
add hl,de
pop de
pop af
dec d
jr z,__DrawMskSkipRight2

or b
cpl

and (hl)
or b
ld (hl),a

__DrawMskSkipRight2:
dec hl
inc d
jr z,__DrawMskSkipLeft2

ld a,c
or e
cpl

and (hl)
or e
ld (hl),a

__DrawMskSkipLeft2:
ld bc,plotSScreen-appBackUpScreen+13
add hl,bc

inc ix
pop bc
djnz __DrawMskLoop
ret
__DrawMskAligned:
push hl
ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,(ix+0)
ld d,a
xor (ix+8)
cpl
ld e,a

and (hl)
or d
ld (hl),a

pop hl

ld a,(hl)
or d
and e
ld (hl),a

inc ix
ld de,12
add hl,de
djnz __DrawMskAligned
ret
__DrawMskEnd:

p_DrawMsk2:
.db __DrawMsk2End-1-$
ex (sp),hl
pop ix ;Input hl = Sprite
pop de
pop bc
push hl
ld hl,plotSScreen
ld d,7
ld a,e
add a,d
jr c,__DrawMsk2ClipTop
sub 64+7
ret nc
cpl
cp d
jr c,__DrawMsk2ClipBottom
ld b,d
jr __DrawMsk2NoClipV
__DrawMsk2ClipTop:
inc ix
inc e
jr nz,__DrawMsk2ClipTop
__DrawMsk2ClipBottom:
ld b,a
__DrawMsk2NoClipV:
ld a,c
add a,d
cp 96+7
ret nc
rrca
rrca
rrca
and $1f
ld d,0
sla e
sla e
add hl,de
add hl,de
add hl,de
ld e,a
add hl,de
inc b
ld a,c
and 7
jr z,__DrawMsk2Aligned
ld e,c
ld c,a
ld a,e
cp -7
jr nc,__DrawMsk2Loop
inc d
cp 96-7
jr nc,__DrawMsk2Loop
inc d
__DrawMsk2Loop:
push bc
push hl

ld b,c
ld e,(ix+0)
xor a
ld h,a
ld c,(ix+8)
__DrawMsk2Shift:
srl e
rr h
srl c
rra
djnz __DrawMsk2Shift

ld b,h ;e = left spr, b = right spr, c = left msk, a = right msk
pop hl

dec d
jr z,__DrawMsk2SkipRight

cpl
and (hl)
xor b
ld (hl),a

__DrawMsk2SkipRight:
dec hl
inc d
jr z,__DrawMsk2SkipLeft

ld a,c
cpl
and (hl)
xor e
ld (hl),a

__DrawMsk2SkipLeft:
ld bc,13
add hl,bc

inc ix
pop bc
djnz __DrawMsk2Loop
ret
__DrawMsk2Aligned:
ld e,12
__DrawMsk2AlignedLoop:
ld a,(ix+8)
cpl
and (hl)
xor (ix+0)
ld (hl),a
inc ix
add hl,de
djnz __DrawMsk2AlignedLoop
ret
__DrawMsk2End:
Runer112
« Reply #283 on: 03 February, 2012, 06:56:23 »
0

Just a small optimization I see with the new Nth string command. Because you restack the return location by popping it into bc, you're already loading bc with a value that's at least $4000 for applications and at least $8000 for programs, so the ld b,h inside the loop is not necessary.
jacobly
« Reply #284 on: 18 September, 2012, 09:03:40 »
+2

Thanks to a suggestion from calc84maniac, I have optimized the routine that is used for both *^ and ** to be 25-50% faster. In addition, every use of *^ would be 2 bytes smaller.

p_MulFull: same size, save 300-550 cycles
Original

p_MulFull:
.db __MulFullEnd-1-$
ld c,h
ld a,l
ld hl,0
ld b,16
__MulFullNext:
add hl,hl
rla
rl c
jr nc,__MulFullSkip
add hl,de
adc a,0
jr nc,__MulFullSkip
inc c
__MulFullSkip:
djnz __MulFullNext
ret
__MulFullEnd:
Optimized

p_MulFull:
.db __MulFullEnd-1-$
xor a
ld c,h
ld h,a
or l
ld l,h
call nz,__MulFullByte-p_MulFull-1
ld a,c
__MulFullByte:
ld b,8
__MulFullNext:
rra
jr nc,__MulFullSkip
add hl,de
__MulFullSkip:
rr h
rr l
djnz __MulFullNext
ret
__MulFullEnd:
Note: Output changed: hl = bits 16-31 of the result, do rra after the routine returns to get a = bits 8-15 of the result.
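
The right-shifting loop in the optimized routine can be modelled in a few lines of Python to see where the result bits end up. This is a simplified sketch: mul_full_high is an invented name, the shortcut that skips a zero low byte is ignored, and the bits that the real routine leaves in a and the carry flag are collected in one variable called spilled.

```python
def mul_full_high(hl, de):
    """Return (bits 16-31, bits 8-15) of the unsigned product hl * de."""
    acc = 0                              # plays the role of HL
    spilled = 0                          # bits shifted out of the low end of HL
    for multiplier_byte in (hl & 0xFF, hl >> 8):   # low byte first, then ld a,c
        for _ in range(8):               # ld b,8 ... djnz
            if multiplier_byte & 1:      # rra / jr nc
                acc += de                # add hl,de (the carry out is kept)
            multiplier_byte >>= 1
            spilled = (spilled >> 1) | ((acc & 1) << 15)
            acc >>= 1                    # rr h / rr l
    return acc, spilled >> 8


print(mul_full_high(0x0180, 0x0200))     # (3, 0)
```

Since each step adds before shifting right, the accumulator never needs more than 17 bits, and the rr h picks up that 17th bit from the carry flag. That is what lets 16-bit registers produce the upper half of a 32-bit product.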
« Last Edit: 18 September, 2012, 09:10:34 by jacobly » Logged