Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Topics - Xeda112358

Pages: [1] 2 3 ... 13
1
TI Z80 / Z80 Optimized Routines Repository
« on: August 13, 2019, 02:38:20 pm »
Hi folks! I've noticed that the "Z80 Optimized Routines" threads and their equivalents on various sites aren't very easy to navigate. I am starting a repository on GitHub in the hopes of addressing these three issues:

  • Organization! "Is this routine documented? What page is it on?
  • Collaboration! "Is there a better version later in the thread? On what page!? Here is yetanotherversion!"
  • Cleanliness! "What is this random request doing in the middle of the thread?"

I initialized the repository here.

My plan is to start porting Cemtech's thread, Omnimaga's thread, UnitedTI's thread, Z80 Heaven's routines, and my private routines folder.


If you want to help port documentation, I only ask that you cite the original author if possible, except when the original author doesn't care to be cited. If you want to add your own routines, keep it organized! And please, if you see an optimization, please make it!


A final note: I think it would be great to have an eZ80 and TI-BASIC repository, too, but I don't think I'm up for maintaining that!

2
TI Z80 / Shuffle - Shuffle a TI List Really Fast
« on: July 28, 2019, 11:49:54 pm »
Hey there, it's ya gender non-specific diminutive Zeda, here, and today we'll be looking at the Fisher-Yates algorithm and just how freaking efficient it can be for shuffling a list. For reference, it takes one second to shuffle a 999-element list at 6MHz, and if that ain't the way your deity intended it, I don't know what is.

First, how do we shuffle L1 in BASIC?
Code: [Select]
rand(dim(L1->L2
SortA(L2,L1

This is a super clever algorithm, but slow as heck as the lists get bigger. Plus, it uses an extra list of the same size, wasting precious RAM. So how does the Fisher-Yates algorithm work? You start at the last element. Randomly choose an element up to and including the current element and swap them. Now move down one element and repeat (so now the last element is off limits, then the last two, et cetera). Repeat this until there is one element left.

This is easy to perform in-place, and it performs n-1 swaps, making it significantly faster than the BASIC algorithm above. In fact, let's implement it in BASIC:
Code: [Select]
dim(L1->N
For(K,N,2,-1
randInt(1,K->A
L1(K->B
L1(A->L1(K
B->L1(A
End
This takes approximately 37.5 seconds to sort a 999 element list. I don't even have the RAM needed to test the regular method, but extrapolating, it would take the "normal" method approximately 73 seconds for 999 elements. So basically, the Fisher-Yates algorithm is actually faster even in TI-BASIC (after about 400 elements, though).

So without further ado, the assembly code!
Code: [Select]
;Randomizes a TI-list in Ans

_RclAns= 4AD7h
seed1  = $80F8
seed2  = $80FC

seed1_0=seed1
seed1_1=seed1+2
seed2_0=seed2
seed2_1=seed2+2
#define bcall(x) rst 28h \ .dw x


.db $BB,$6D
.org $9D95

; Put it into 15MHz mode if possible!
  in a,(2)
  add a,a
  sbc a,a
  out (20h),a


; Initialize the random seed
  ld hl,seed1
  ld b,7
  ld a,r
_:
  xor (hl)
  ld (hl),a
  inc hl
  djnz -_
  or 99
  or (hl)
  ld (hl),a
 

; Locate Ans, verify that it is a list or complex list
  bcall(_RclAns)
  ex de,hl
  ld c,(hl)
  inc hl
  ld b,(hl)
  inc hl
  ld (list_base),hl
  dec a
  jr z,+_
  sub 12
  ret nz
  dec a
_:

;A is 0 if a real list, -1 if complex
;HL points to the first element
;BC is the number of elements
  and $29     ;make it either NOP or ADD HL,HL
  ld (get_complex_element),a
  sub 29h
  sbc a,a
;FF if real, 00 if complex
  cpl
  and 9
  add a,9
  ld (element_size),a


shuffle_loop:
  push bc

  push bc
  call rand
  pop bc
  ex de,hl
  call mul16
  dec bc
  ;swap elements DE and BC
 
  call get_element
  push hl
  ld d,b
  ld e,c
  call get_element
  pop de

  call swap_elements
 
  pop bc
  dec bc
  ld a,c
  dec a
  jr nz,shuffle_loop
  inc b
  dec b
  jr nz,shuffle_loop
  ret


swap_elements:
;HL and DE point to the elements
element_size = $+2
  ld bc,255
_:
  ld a,(de)
  ldi
  dec hl
  ld (hl),a
  inc hl
  djnz -_
  ret


get_element:
;Input:
;   DE is the element to locate
;Output:
;   HL points to the element
  ld l,e
  ld h,d
  add hl,hl
  add hl,hl
  add hl,hl
  add hl,de
get_complex_element:
  nop
list_base = $+1
  ld de,0
  add hl,de
  ret


rand:
;Tested and passes all CAcert tests
;Uses a very simple 32-bit LCG and 32-bit LFSR
;it has a period of 18,446,744,069,414,584,320
;roughly 18.4 quintillion.
;LFSR taps: 0,2,6,7  = 11000101
;291cc
;Thanks to Runer112 for his help on optimizing the LCG and suggesting to try the much simpler LCG. On their own, the two are terrible, but together they are great.
    ld hl,(seed1)
    ld de,(seed1+2)
    ld b,h
    ld c,l
    add hl,hl \ rl e \ rl d
    add hl,hl \ rl e \ rl d
    inc l
    add hl,bc
    ld (seed1_0),hl
    ld hl,(seed1_1)
    adc hl,de
    ld (seed1_1),hl
    ex de,hl
;;lfsr
    ld hl,(seed2)
    ld bc,(seed2+2)
    add hl,hl \ rl c \ rl b
    ld (seed2_1),bc
    sbc a,a
    and %11000101
    xor l
    ld l,a
    ld (seed2_0),hl
    ex de,hl
    add hl,bc
    ret


mul16:
;BC*DE
  ld hl,0
  ld a,16
mul16_loop:
  add hl,hl
  rl e
  rl d
  jr nc,+_
  add hl,bc
  jr nc,+_
  inc de
_:
  dec a
  jr nz,mul16_loop
  ret
It isn't perfect, but it is pretty good and importantly, it is fast! The biggest problem is in the random number generator, but even that is still pretty good for this application.

3
TI Z80 / YATE - Yet Another Tilemap Engine
« on: May 27, 2019, 09:58:09 pm »
Hi all! At Eeems' suggestion, I am posting this work-in-progress code :P

I've been working on-and-off on my tilemap engine since my Pokemon Amber attempt. The current rewrite is slower, but a bit easier to work with on the my end. It also allows the surrounding map to still be animated if I need to pull up a menu or dialog! I've also added in map compression using modified code from my LZD project, as well as event tiles!

In this version, I mess around with OS RAM to make larger chunks of contiguous non-user RAM, so it might cause crashes.

As of now, all you can do is walk around the map and enter/exit map editing mode. Since maps are currently stored in the app, map editing is not permanent. If you leave the map and come back, all changes are reverted.

If you want to edit the map:
Code: [Select]
Press [mode] to enter map editing mode
Use [+] and [-] to scroll through the available tiles for the map
Use [*] to toggle walkability. You'll see corner guards if the tile can't be walked on.
Use [Enter] to set a tile.
Use [stat] to exit map editing

When you hover over a tile that can't be walked over, a solid border will be drawn.
When you walk over an event tile, a dotted border will be drawn.


If you want to *actually* edit or create maps, you'll have to do that on your computer and recompile the app. You'll need spasm (with app signing enabled!) and Python with numpy. The map structure is:
Code: [Select]
.db height,width
 .db start_y,start_x   ;The player will be drawn at y+5 and x+6 !
 .db number_of_tiles
 .dw tile_index_0
 .dw tile_index_1
 ...

;map data
 .db blah
 .db blah
 .db blah
...

;Event code
 .db 0
The map data is actually stored transposed. So the tiles you see across the top row are actually stored in the left column of bytes! This makes it a headache to code.

You can use up to 64 different tiles in a map.

Each byte in the map has the index into the tileset stored in the lower six bits. Setting  the top bit means it can't be walked on, and setting bit 6 means walking into it triggers an event.

When events are triggered, the map's event handler is parsed an executed. It is a very basic scripting language that currently handles addition, subtraction, =, <, &, warping, and conditional execution. Scripts end with a null byte (0x00) and are stored similarly to how they'd be parsed in RPN format, currently with the exception of conditional execution (since the condition is checked before executing). 'x' and 'y' can be used to read the player's current coordinates, 'i' evaluates a condition (like an If statement) and if it is true, it executes the next command, otherwise it skips. 'w' is used to warp and requires bytes in the form of `map_id,start_y-5,start_x-6`. Finally, 8-bit integers are prefixed by a `.db 1`. For a convoluted example, here is the event handler for inside the "house" :
Code: [Select]
;if (y == 3 or y == 4) and x == 7:
;    warp(0,2-5,5-6)
.db "y",_num,3,"-",_num,2,"<x",_num,7,"=&iw",0,2-5,5-6
.db 0


Once you've made your map, you'll need to compress it. First you'll compile it to a binary, then you need to compress that data and convert the compressed data back to assembly. For example:
Code: [Select]
spasm maps/map9.z80 maps/map9.bin
python zlz.py maps/map9.bin maps/map9_comp.z80 map9

And finally, you need to add the map to the mapLUT found near the bottom of main.z80, just add the label name (in the above case, `.dw map9`) and include the compressed version of the file at the bottom.



The system is largely interrupt-based. A custom interrupt updates the current keypress, updates the tile animation, updates the LCD, and draws the tilemap. All you have to do is set the appropriate flag:
Code: [Select]
rpgflags  = 33
updateLCD = 0   ;Tells the interrupts to update the LCD
drawmap   = 1   ;Tells the interrupts to redraw the map
animate   = 2   ;Tells the interrupts to animate tiles
noupdate  = 3   ;Overrides the interrupt's LCD update and tilemap redraw
When a tile's animation counter runs out, it automatically sets the drawmap flag. When a map is redrawn, it automatically set the updateLCD flag and resets the drawmap flag (no need to keep doing it). And if the noupdate flag is set, then interrupt-based map drawing and LCD updating is disabled (but these can still be used manually).

There are three graphics buffers, one of which is quite non-standard. They are all vertically aligned instead of horizontally aligned (the OS' format is horizontal). That means the second byte corresponds to the 8-pixel chunk below the first byte, as opposed to adjacent to it.
Code: [Select]
gbuf     : This is a 1120-byte buffer! It is 80 pixels tall and 112 pixels wide. Only the center part is shown. This is where the tiemap is drawn.
maskbuf : A 768 byte buffer. During LCD updates, the data here is ANDed on top of the gbuf data.
topbuf  : A 768 byte buffer. During LCD updates, the data here is ORed on top of the previous data.

The latter two buffers are used for overworld graphics, like the player sprite, or in the future, dialog, menus, and NPCs, etc.

4
TI Z80 / Unnamed Optimizing Compiler
« on: April 17, 2019, 11:30:48 pm »
Hi, for a long time I've wanted to create a compiler (targeting the Z80) that was even better at optimizing. Back when I was first brainstorming a Grammer compiler, I had in mind the issues of existing Z80 C compilers and even Axe. The Z80 really is poorly suited to C, but Axe is actually pretty amazing (as many know). One thing that C, Axe, and Grammer all have in common is a standard way of passing arguments. For example, in the latter cases, the first arguments are passed in HL and BC (respectively). This doesn't always produce optimal code, but it is a lot easier to work with.

So my fantasy was writing a program on my computer that tried to find a more optimal way of passing arguments. I haven't managed it, but it is a lovely dream :)

I want to put this small, undocumented set of programs out for the public. I wrote it today and there are a bunch of issues, but on some code it works really well.

As an example, here is an example of input:
Code: [Select]
Disp((2*x+1)*6+x)
Disp((27*x+1)*4+x)
And output:
Code: [Select]
  ld hl,(var_x)
  add hl,hl \ inc l
  ld b,h \ ld c,l \ add hl,hl \ add hl,bc \ add hl,hl
  ld de,(var_x)
  add hl,de
  call disp_uint16
  ld de,27
  ld hl,(var_x)
  call mulHLDE
  inc hl
  add hl,hl \ add hl,hl
  ld de,(var_x)
  add hl,de
  call disp_uint16

irToZ80.db basically contains the conversions from a high-level token to low level assembly. Each routine has input and output registers defined, as well as size and timing to use as metrics for optimization.

I reused tokens.db from another project, all that is important is name, type, and precedence.

If you run ./compile, it passes test.txt through sy.py to convert the source code into an intermediate representation (basically RPN), then passes that file through irToZ80.py to generate Z80 assembly code.

I basically implemented what I think is Djikstra's Shortest Path algorithm to find the solution that minimizes the target metric. In this case, I try to find the code that minimizes (6*size)^2+(speed)^2.

5
Math and Science / The General Class of Karatsuba-Like Multiplication
« on: March 16, 2019, 11:51:33 am »
In playing with ideas to improve multiplication in my float routines, I recreated an algorithm that I soon learned was a known, but rarely discussed generalization of the Karatsuba algorithm!

For a long time, I thought of the Karatsuba algorithm as a special case of Toom-Cook multiplication. If you aren't familiar with that, I discuss it here (see the spoiler!). To be fair, it is a special case, but there is actually another way to arrive to the Karatsuba algorithm, and honestly it is a bit easier to implement and recognize.

To arrive at the algorithm, we start with polynomial multiplication. I'll start off with multiplying two cubic polynomials and that should give us enough terms to get a good view of what's going on:
##
\begin{array}{l}
f(x)g(x) &=& (a_{0}+a_{1}x+a_{2}x^{2}+a_{3}x^{3})(b_{0}+b_{1}x+b_{2}x^{2}+b_{3}x^{3})\\

&=&a_{0}b_{0}\\
&&+x(a_{0}b_{1}+a_{1}b_{0})\\
&&+x^{2}(a_{0}b_{2}+a_{1}b_{1}+a_{2}b_{0})\\
&&+x^{3}(a_{0}b_{3}+a_{1}b_{2}+a_{2}b_{1}+a_{3}b_{0})\\
&&+x^{4}(a_{1}b_{3}+a_{2}b_{2}+a_{3}b_{1})\\
&&+x^{5}(a_{2}b_{3}+a_{3}b_{2})\\
&&+x^{6}(a_{3}b_{3})\\
\end{array}
##


There is a lot of symmetry going on there in the form of ##a_{n}b_{m}+\dots+a_{m}b_{n}##. My observation was that ##a_{n}b_{m}+a_{m}b_{n} = (a_{n}+a_{m})(b_{n}+b_{m})-a_{n}b_{n}-a_{m}b_{m}##.On the surface, this looks like we are trading two multiplications for three (more work), but we actually get to reuse some of those terms in other calculations. So let's reorganize:

##
\begin{array}{l}
f(x)g(x) &=& (a_{0}+a_{1}x+a_{2}x^{2}+a_{3}x^{3})(b_{0}+b_{1}x+b_{2}x^{2}+b_{3}x^{3})\\

&=&a_{0}b_{0}\\
&&+x((a_{0}b_{1}+a_{1}b_{0}))\\
&&+x^{2}((a_{0}b_{2}+a_{2}b_{0})+a_{1}b_{1})\\
&&+x^{3}((a_{0}b_{3}+a_{3}b_{0})+(a_{1}b_{2}+a_{2}b_{1}))\\
&&+x^{4}((a_{1}b_{3}+a_{3}b_{1})+a_{2}b_{2})\\
&&+x^{5}((a_{2}b_{3}+a_{3}b_{2}))\\
&&+x^{6}(a_{3}b_{3})\\
\end{array}
##


Now let's define some auxiliary calculations and rewrite the product:
##
z_{0} = a_{0}b_{0}\\
z_{1} = a_{1}b_{1}\\
z_{2} = a_{2}b_{2}\\
z_{3} = a_{3}b_{3}\\

z_{4} = (a_{0}+a_{1})(b_{0}+b_{1})\\
z_{5} = (a_{0}+a_{2})(b_{0}+b_{2})\\
z_{6} = (a_{0}+a_{3})(b_{0}+b_{3})\\
z_{7} = (a_{1}+a_{2})(b_{1}+b_{2})\\
z_{8} = (a_{1}+a_{3})(b_{1}+b_{3})\\
z_{9} = (a_{2}+a_{3})(b_{2}+b_{3})\\

\begin{array}{l}
f(x)g(x) &=& (a_{0}+a_{1}x+a_{2}x^{2}+a_{3}x^{3})(b_{0}+b_{1}x+b_{2}x^{2}+b_{3}x^{3})\\

&=&z_{0}\\
&&+x(z_{4}-z_{0}-z_{1})\\
&&+x^{2}(z_{5}-z_{0}-z_{2}+z_{1})\\
&&+x^{3}(z_{6}-z_{0}-z_{3}+z_{7}-z_{1}-z_{2})\\
&&+x^{4}(z_{8}-z_{1}-z_{3}+z_{2})\\
&&+x^{5}(z_{9}-z_{2}-z_{3})\\
&&+x^{6}z_{3}\\
\end{array}
##


So with just this trick, we went from 16 multiplies to 10, at the cost of 12 extra additions/subtractions. In general, to multiply two n-th degree polynomials, we need ##\frac{(n+1)(n+2)}{2}## multiplies instead of ##(n+1)^{2}##, so almost half as many multiplies are needed. These are both ##O(n^{2})## operations, but the trick is to apply recursion. For example, if we want to multiply two degree-15 polynomials, we can break it up into 4 cubic polynomials and treat the smaller polynomials as coefficients. Then we need to multiply 10 pairs of cubic polynomials, resulting in 100 total multiplies. That's a lot, but it's less than 256 with the standard method, or even 136 with the algorithm above. If we carry this out another step, we can multiply two 63-degree polynomials with 1000 multiplies instead of 4096 or 2080. In fact, for a ##(4^{n}-1)-th## degree polynomial, we need ##10^{n}## multiplies, making this algorithm ##O(n^{log_{4}(10)})=O(n^{1.66096\dots})##. This is a non-trivial, better-than-##O(n^{2})## multiplication algorithm.


Using our generalized technique, let's multiply two quadratic polynomials:
Spoiler For Spoiler:
##
z_{0} = a_{0}b_{0}\\
z_{1} = a_{1}b_{1}\\
z_{2} = a_{2}b_{2}\\
z_{3} = (a_{0}+a_{1})(b_{0}+b_{1})\\
z_{4} = (a_{0}+a_{2})(b_{0}+b_{2})\\
z_{5} = (a_{1}+a_{2})(b_{1}+b_{2})\\

\begin{array}{l}
f(x)g(x) &=& (a_{0}+a_{1}x+a_{2}x^{2})(b_{0}+b_{1}x+b_{2}x^{2})\\

&=&z_{0}\\
&&+x(z_{3}-z_{0}-z_{1})\\
&&+x^{2}(z_{4}-z_{0}-z_{2}+z_{1})\\
&&+x^{3}(z_{5}-z_{1}-z_{2})\\
&&+x^{4}(z_{2})\\
\end{array}
##

And by applying this recursively we get an ##O(n^{log_{3}(6)})=O(n^{1.63092\dots})## algorithm.

And the Karatsuba algorithm that we've all come to know and love:
Spoiler For Spoiler:
##
z_{0} = a_{0}b_{0}\\
z_{1} = a_{1}b_{1}\\
z_{2} = (a_{0}+a_{1})(b_{0}+b_{1})\\

\begin{array}{l}
f(x)g(x) &=& (a_{0}+a_{1}x)(b_{0}+b_{1}x)\\
&=&z_{0}+x(z_{2}-z_{0}-z_{1})+x^{2}(z_{1})\\
\end{array}
##

And by applying this recursively we get an ##O(n^{log_{2}(3)})=O(n^{1.58496\dots})## algorithm.


But is Karatsuba the best of this class of algorithm?
Spoiler For Yes:
To recursively apply multiplying ##(k-1)-th## degree polynomials, we get an algorithm that is ##
O(n^{log_{k}((k)(k+1)/2)})=O(n^{log_{k}(k)+log_{k}(k+1)-log_{k}(2)})
=O(n^{1+log_{k}(k+1)-log_{k}(2)})
=O(n^{2+log_{k}(1+1/k)-log_{k}(2)})
##

And the integer, ##k>1## that minimizes the exponent is 2.

A neat application of the non-recursive algorithm discussed above can be used to multiply a number by it's complement, which is another approach to how I came up with this Sine Approximation Algorithm.
Spoiler For Spoiler:
So let's assume that all of our coefficients are either 0 or 1. The complement of 0 is 1, and vice versa. This is useful in programming, since inverting the bits of a number is basically doing -x-1.

Let's define some operations. Where x and y are binary (0 or 1), then
  • ~x means the complement of x, and can be rewritten (1-x).
  • x|y is bitwise-OR and can be rewritten as x+y-xy.
  • x&y is bitwise-AND and can be rewritten as x*y or xy for brevity.
  • x^y is bitwise-XOR and can be rewritten as x+y-2xy.
Observe, then, that:
  • x~x is always 0.
  • x&x=x.
  • x~y+y~x = x-xy+y-yx = x+y-2xy = x^y.


So if we want to multiply, say, a 4-bit number by its complement, we have:
Code: [Select]
z0 = a0~a0=0
z1 = a1~a1=0
z2 = a2~a2=0
z3 = a3~a3=0

z4 = (a0+a1)(~a0+~a1) = a0~a1 + a1~a0 = a0^a1
z5 = (a0+a2)(~a0+~a2) = a0~a2 + a2~a0 = a0^a2
z6 = (a0+a3)(~a0+~a3) = a0~a3 + a3~a0 = a0^a3
z7 = (a1+a2)(~a1+~a2) = a1~a2 + a2~a1 = a1^a2
z8 = (a1+a3)(~a1+~a3) = a1~a3 + a3~a1 = a1^a3
z9 = (a2+a3)(~a2+~a3) = a2~a3 + a3~a2 = a2^a3
f(x)g(x) = (a0+a1x+a2x^2+a3x^3)(~a0+~a1x+~a2x^2+~a3x^3)

=z0                      |=0
 +x(z4-z0-z1)            | +x(a0^a1)
 +x^2(z5-z0-z2+z1)       | +x^2(a0^a2)
 +x^3(z6-z0-z3+z7-z1-z2) | +x^3(a0^a3+a1^a2)
 +x^4(z8-z1-z3+z2)       | +x^4(a1^a3)
 +x^5(z9-z2-z3)          | +x^5(a2^a3)
 +x^6z3                  | +0

6
TI Z80 / [BASIC] bigpi- Arbitrary Precision Pi Calculator
« on: March 14, 2019, 12:22:41 pm »
Happy Pi day!

Here is an arbitrary precision calculator for the digits of pi. It is written in TI-BASIC, so it is slooooow, but it works!

You basically pass in how many digits you want and it returns them in 5-digit chunks in L1. Keep in mind, if a chunk only has 4 digits, that means the top digit is 0 !

7
TI Z80 / A Calculator for the TI-83+/84+
« on: February 19, 2019, 01:05:18 am »
I have been working on a side project for a little while that will revolutionize the 83+/84+ monochrome series: A basic calculator program.

That's right, your TI-83+ is no longer just for games. If you want to add or subtract-- or even multiply and divide-- this program can do that for you!

Be warned, some of the math is a little buggy still. You'll need the Floatlib app located here, but otherwise just run it as an assembly program. I made a custom input routine, so it might feel a little clunky, but it should be pretty straight-forward (except [DEL] actually acts as a backspace). Exit with [ON].

8
TI Z80 / [Grammer] Quadratic Formula
« on: February 01, 2019, 12:18:42 am »
Requires Grammer 2.50.1.1 and up.

This program gives you the zeros of a quadratic function! Inputs must include a decimal point so that Grammer knows it is a float. Hopefully that can be fixed in the future. Also it looks like Grammer has some float issues, so it gives weird answers when the result is complex.

9
News / POTY 2018 !
« on: December 04, 2018, 11:37:42 am »
TICalc.org's Program Of The Year polls are open!
For the TI-83+/84+ category, it looks like a bunch of older projects split between two authors-- myself and squidgetx (ticalc profile).

Up for the vote are the following programs (in the order listed on ticalc.org).

Batlib (ticalc link)
Batlib is a huge library for BASIC programmers. It has 120+ functions mostly for graphics and string and data manipulation, list and matrix manipulation, compression, and much more.



CopyProg (ticalc link)
This is a small program that allows users to copy variables from RAM or Archive to another variable. CopyProg2 has a few other features as well (like line reading and reading the names of the variables on your calc). It allows you to do things like copy an archived appvar to a temp program for execution ;)



Embers (ticalc link)
Embers is an ARPG that won Omnimaga Contest 2012. This was a really well put together game. It features good graphics, good AI, and storyline.



FloatLib (ticalc link)
Floatlib is an app that holds many single-precision float routines from addition to hyperbolic tangent. It comes with a built-in reference manual and there are some examples like computing the Mandelbrot Set.



Gravity Guy (ticalc link)
Gravity Guy is "a port/variation of the popular iphone/flash game" of the same name. You basically get to flip the direction of gravity to help navigate obstacles.


LblRW (ticalc link)
LblRW is a small utility for BASIC programmers that lets you read or modify data within the BASIC program, using a label offset. That's a mouthful. Basically, "Hey, I want to store player data in my RPG, let's store it after Lbl PD." Or you can store monster data for quick access for example ;)
(no screenshot, sorry :( )


StickNinja (ticalc link)
StickNinja was an Omnimaga Contest 2011 entry that earned 3rd place. It's basically a platformer with awesome Stick Figure and Ninja graphics. Collect coins, destroy enemies; It's got it all.

10
TI Z80 / Hacked Lists!
« on: November 04, 2018, 10:56:56 am »
So, fun fact: floating point numbers can be changed to a variable reference instead, kind of like a shortcut on your computer, and the OS doesn't verify it.

In assembly, when you are looking for a variable, you put a name in OP1 and use FindSym or ChkFindSym. Those names are 9 bytes, which is also the size of a float (not a coincidence-- it's part of why names are limited to 8 bytes).
You can actually change the contents of a Real variable to such a name, and when you try to read it's value, it will instead return the value of the variable it points to.
For example, change the bytes of the Real variable, A, from 008000000000000000 (the number 0) to 015D01000000000000 and when you read A, it will return the contents of L2.

Or, since lists are just stored as a bunch of Real (or Complex) numbers, you can modify elements of the list to be pointers. In this way, you could read Str1, Str2, Str3,..., Str0 by reading an element of L1, which could be useful, occasionally :P

Some things to note:
You can't modify the contents of the original variable in this way. If A points to Str2, then "33"+A→A does not modify Str2. However, "33"+A will work like "33"+Str2.
Storing the names of programs and appvars and whatnot isn't useful.


Attached is a program that turns L1 into a 4-element list {L2,[A],Str1,L1}. Here is the source. You need to delete L1 first, my code wasn't reliably deleting it.
Code: [Select]
#define bcall(x) rst 28h \ .dw x
rMOV9TOOP1  = 20h
_ChkFindSym = 42F1h
_CreateRList= 4315h
.db $BB,$6D
.org $9D95
  ld hl,name
  rst rMOV9TOOP1
  bcall(_ChkFindSym)
  ret c
  ld hl,name
  rst rMOV9TOOP1
  ld hl,4
  bcall(_CreateRList)
  inc de
  inc de
  ld hl,data
  ld bc,31
  ldir
  ret
data:
  .db 1,$5D,1,0,0,0,0,0,0
  .db 2,$5C,0,0,0,0,0,0,0
  .db 4,$AA,0,0,0,0,0,0,0
name:
  .db 1,$5D,0,0

11
Math and Science / Efficient Plotting of Polynomials
« on: October 28, 2018, 10:32:12 pm »
Hi folks, I wanted to share a technique I came up with for evaluating sequential points of a polynomial, which is especially useful when you are drawing them. A naive approach would be evaluate the whole polynomial at each point, but the method I'll show you allows you to evaluate at a few points, and then for each subsequent point just use additions.

The basic premise is to use ##f(x)## as a starting point to get to ##f(x+1)##. For example, take ##f(x)=3+2x##. To get to ##f(x+1)##, all you have to do is add 2. Things get more complicated as you go to higher degree polynomials. For example, take ##f(x)=3+2x-3x^{2}+x^{3}##. Then to get to ##f(x+1)##, you need to add ##3x^{2}-3x##. However, you can go even further by finding how to get from ##3x^{2}-3x## to ##3(x+1)^{2}-3(x+1)##.

Now comes the theory.

Take ##f_{1}(x)=f(x+1)-f(x)## and ##f_{n+1}(x)=f_{n}(x+1)-f_{n}(x)##.
Then ##f(x+1)=f(x)+f_{1}(x)## and ##f_{n}(x+1)=f_{n}(x)+f_{n+1}(x)##.

##\begin{aligned}
f_{2}(x) &=f_{1}(x+1)-f_{1}(x),\\
&=(f(x+2)-f(x+1))-(f(x+1)-f(x)),\\
&=f(x+2)-2f(x+1)+f(x).
\end{aligned}##


##\begin{aligned}
f_{3}(x)&=f_{2}(x+1)-f_{2}(x),\\
&=(f(x+3)-2f(x+2)+f(x+1))-(f(x+2)-2f(x+1)+f(x)),\\
&=f(x+3)-3f(x+2)+3f(x+1))-f(x).
\end{aligned}##

And in general,
##f_{n}(x)=\sum_{k=0}^{n}{{n\choose k}(-1)^{k}f(x+n-k)}##
If you set ##x=0##:
##f_{n}(0)=\sum_{k=0}^{n}{{n\choose k}(-1)^{k}f(n-k)}##

So now let's take again, ##f(x)=3+2x-3x^{2}+x^{3}##.
##\begin{aligned}
f(0)&=3,\\
f_{1}(0)&=f(1)-f(0)&=0,\\
f_{2}(0)&=f(2)-2f(1)+f(0)&=0,\\
f_{3}(0)&=f(3)-3f(2)+3f(1)-f(0)&=6,\\
f_{k>3}(0)&=0.\\
\end{aligned}##

So what these values allow us to do, ##{3,0,0,6}##, is use a chain of additions to get the next value, instead of evaluating the cubic polynomial at each x:
##\begin{array}{rrrr}
[&3,&0,&0,&6]\\
[&3,&0,&6,&6]\\
[&3,&6,&12,&6]\\
[&9,&18,&18,&6]\\
[&27,&36,&24,&6]\\
[&63,&60,&30,&6]\\
[&123,&90,&36,&6]\\
[&213,&126,&42,&6]\\
...
\end{array}##

Basically, add the second element to the first, third element to the second, and fourth element to the third. The new value is the first. Repeat. So the next iteration would yield [213+126,126+42,42+6,6]=[339,168,48,6], and indeed, f(8)=339.

I'd like to apologise for how garbled this is, I'm really tired.

Also, you should note that for an n-th degree polynomial, you need to compute up to ##f_{n}(x)##, but everything after that is 0.

12
TI Z80 / Yet Another VAT Sort
« on: October 09, 2018, 07:23:57 pm »
A few months ago I implemented heapsort and used it to create a sorted array of VAT entries. While it is fast, instead of directly sorting the VAT, it creates an array of pointers to each entry and sorts the pointers. This means that the VAT itself remains unsorted (not necessarily an issue), and a variable amount of RAM needs to be allocated. In the example, if the user had over 383 variables in the named variables table, only the first 383 entries could be sorted (unlikely, but an issue nevertheless).

My goal here was to sort the VAT in-place, using only a fixed amount of overhead memory. After a couple of weeks of brainstorming and testing, I ended up with an adaptive insertion sort implementation and a mergesort implementation.
Code: (Insertion Sort) [Select]
pTemp       = 982Eh ;bottom of named vars VAT
progPtr = 9830h ;top of named vars VAT
OP1         = 8478h
;uses 19 bytes of RAM, 4 bytes of stack space
#define tmp OP1+14
#define head tmp+1
#define head_lag tmp+3
#macro advanceVAT()
#ifdef speed
;17cc saved per iteration
;8 bytes vs a 3 byte call (but no 8-byte subroutine)
  ld bc,-6
  add hl,bc
  sbc a,a   ;HL<=FE66, so carry flag is set
  sub (hl)
  ld c,a
  add hl,bc
#else
  call advance_VAT    ;preserves DE
#endif
#endmacro


sortVAT:
#ifdef nointerrupt
  di
#endif
  ld hl,(progPtr)
isort_main:
_:
  ld (head_lag),hl
  ld d,h
  ld e,l
  advanceVAT()
  ld (head),hl
#ifdef speed
;11 bytes, 29cc or 46cc. avg=29.06640625cc
  ld a,(pTemp)
  cp l
  jr nz,$+7
  ld a,(pTemp+1)
  cp h
  ret z
#else
;adds 8 bytes, 55cc
  ld bc,(pTemp) ;Need to verify that we haven't reached the end of the progVAT
  or a          ;
  sbc hl,bc     ;
  ret z         ;
  add hl,bc     ;
#endif
  call cmpVAT
  ld hl,(head)
  jr nc,-_
;if it makes it here, then (head) needs to be inserted into the previous part of the VAT
;We might be able to speed it up a little more if I also grab the next element
;  If (head_advance) is bigger than (head), then no need to start the search from the beginning
  ld de,tmp
#ifdef speed
  ldd
  ldd
  ldd
  ldd
  ldd
  ldd
  ld b,0
  ld c,(hl)
  lddr
  ldd
#else
  ld bc,6
  lddr
  ld c,(hl)
  inc c
  lddr
#endif
  ld hl,(progPtr)
_:
  push hl
#ifdef speed
;+5 bytes, -11cc
  ld bc,-6
  add hl,bc
  ld de,tmp-6
  call cmpVAT_stepin
#else
  ex de,hl
  ld hl,tmp
  call cmpVAT
#endif
  pop hl
  jr c,+_
  advanceVAT()
  jp -_
_:
;HL is where to insert
  ld de,(head)
  or a
  sbc hl,de
  ld b,h
  ld c,l
  ld hl,-6
  add hl,de
  ld a,l
  sub a,(hl)
  ld l,a
  jr nc,$+4
  dec h
  or a
  inc de
  ex de,hl
#ifdef speed
  call fastldir
#else
  ldir
#endif
  ;begin at DE, copy tmp. First need size of tmp
  ld hl,tmp-6
  ld c,(hl)
  sbc hl,bc
  ld a,c
  ldir
#ifdef speed
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  add a,7
#else
  ld c,7
  add a,c
  ldir
#endif
  ld hl,(head_lag)
  ld c,a
  ld a,l
  sub c
  ld l,a
  jp nc,isort_main
  dec h
  jp isort_main
#ifndef speed
advance_VAT:
  ld bc,-6
  add hl,bc
  sbc a,a   ;HL<=FE66, so carry flag is set
  sub (hl)
  ld c,a
  add hl,bc
  ret
#endif
cmpVAT:
;if @HL>[email protected], return nc
  ld bc,-6
  add hl,bc
  ex de,hl
  add hl,bc
cmpVAT_stepin:
  ld a,(de)
  cp (hl)
  jr nc,first_longer
;the second name is longer.
  ld c,a
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,-_
  scf
  ret
first_longer:
;the first name is longer, so load c with the size of the second name
  ld c,(hl)
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,-_
  ret
#ifdef speed
fastldir:
;copy BC bytes from HL to DE
;copy BC bytes from HL to DE
;breaks even at 26 bytes
; 5% faster than LDIR with 35 bytes
;10% faster than LDIR with 48 bytes
;15% faster than LDIR with 91 bytes
;20% faster than LDIR with 635 bytes
;max is ~ 20.8% faster than LDIR
;Cost: 104+16N+10ceiling(N/16)
    push hl
;    push af
    xor a
    sub c
    and 15               ;change to n-1
    add a,a
    ld hl,ldirloop
    add a,l
    ld l,a
#if (ldirloop>>8)!=(_ldirloop_end>>8)
    jr nc,$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
    inc h       ;
#endif
;    pop af
    ex (sp),hl
    ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
_ldirloop_end:
    ldi
    jp pe,ldirloop
    ret
#endif
#undefine tmp
#undefine head
#undefine head_lag1
.echo $-sortVAT," bytes"
Code: (Mergesort) [Select]
pTemp       = 982Eh ;bottom of named vars VAT
progPtr = 9830h ;top of named vars VAT
OP1         = 8478h
;uses 29 bytes of state.
; 4 bytes stack space
; 10 bytes variables
; 15 bytes swap
#define var_n OP1+15
#define partition_size var_n+2
#define var_k partition_size
#define head0 var_n+4
#define head1 var_n+6
#define tail var_n+8

cmpVAT:
;HL points to one entry
;DE points to the second entry
;return c if first>second
;return nc if first<=second
  ld bc,-6
  add hl,bc
  ld a,(hl)
  ex de,hl
  add hl,bc
  ld b,(hl)
  ex de,hl
  ;A is the size of the first name
  ;HL points to the byte above the first name
  ;B is the size of the second name
  ;DE points to the byte above the second name
  ld c,a
  jr nc,second_name_longer
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  djnz -_
  inc c
  scf
  ret
second_name_longer:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,second_name_longer
  ret
nthstr:
;Input:
;   HL points to the size byte of the name of the first element
;   BC is the number of elements
;Output:
;   c flag set if end of VAT reached, else nc
;   HL points to the element
;speed:
; worst: 34+(95-7/256)*BC
; best:  34+(95-14/256)*BC
  ld de,(pTemp)
_:
  or a      ;Can do some analysis to omit this code for most runs
  sbc hl,de ;could speed this up by almost 37%
  add hl,de ;Still, only adds .001 seconds per 100 VAT entries :P
  ret c     ;35cc
  ld a,-6
  sub (hl)
  add a,l
  ld l,a
  jr c,$+3
  dec h
  or a
  cpd
  jp pe,-_
  or a
  ret
sortVAT:
  ld hl,(progPtr) ;get the number of elements in the VAT
  ld de,-6        ;
  add hl,de       ;
  ld bc,0         ;
  call nthstr     ;
  ld (var_n),bc   ;
  ld hl,1
  ld (var_k),hl
sort_main:      ;until partition_size exceeds size of VAT
  ld hl,(progPtr)
sort_main_0:    ;merge multiples of partition_size
  ld (head0),hl
  ld de,-6
  add hl,de
  ld bc,(var_k)
  call nthstr
  jp c,sort_main_0_end
  push hl
  ld bc,(var_k)
  call nthstr
  ld de,6
  add hl,de
  ld (tail),hl
  pop hl
  add hl,de
  ld (head1),hl

  ex de,hl    ;head1
  ld hl,(head0)
  ;or a
  sbc hl,de
    ld hl,(tail)
  jr z,mergeloop_end
  ;or a         ;head0>=head1, so c flag should be reset already
  sbc hl,de
  jr z,mergeloop_end-1
  ld de,(head1)
mergeloop:
  ld hl,(head0)
  call cmpVAT
  jr nc,+_
#ifdef speed
;saves 42cc
  ld hl,(head1)
  ld de,var_n-1
  ldd
  ldd
  ldd
  ldd
  ldd
  ldd
  ld a,(hl)
  ldd
  ld c,a
  add a,7
  ld b,0
#else
  ld hl,(head1)
  ld de,-6
  add hl,de
  ld a,(hl)
  ld de,var_n-1
  ld hl,(head1)
  add a,7
  ld b,0
  ld c,a
#endif
  lddr
;HL now points to the bottom
  ld de,(head1)
  ld (head1),hl
  ld hl,(head0)
  sbc hl,de ;number of bytes that need to be shifted up by A
  ld b,h
  ld c,l
;need to shift from DE to (head1)
  ld hl,(head1)
  ex de,hl
  inc de
  inc hl
#ifdef speed
  call fastldir
#else
  ldir
#endif

  ld hl,var_n-1
  ld de,(head0)
  ld c,a
  lddr
  ex de,hl
  jp $+9
_:
  ld a,l
  sub c
  ld l,a
  jr nc,$+3
  dec h

  ld (head0),hl
  ld de,(head1)
  or a
  sbc hl,de
  ld hl,(tail)
  jr z,mergeloop_end
  ;or a         ;head0>=head1, so c flag is reset already
  sbc hl,de
  jr nz,mergeloop
  add hl,de
mergeloop_end:
  ex de,hl
  ;ld de,(tail)
  ld hl,(pTemp)
  ;or a
  sbc hl,de
  ex de,hl
  jp nz,sort_main_0
sort_main_0_end:
  ld hl,(var_k)
  add hl,hl
  ld (var_k),hl 
  ld bc,(var_n)
  add hl,bc
  jp nc,sort_main
  ret
#ifdef speed
fastldir:
;copy BC bytes from HL to DE
;copy BC bytes from HL to DE
;breaks even at 26 bytes
; 5% faster than LDIR with 35 bytes
;10% faster than LDIR with 48 bytes
;15% faster than LDIR with 91 bytes
;20% faster than LDIR with 635 bytes
;max is ~ 20.8% faster than LDIR
;Cost: 104+16N+10ceiling(N/16)
    push hl
    push af
    xor a
    sub c
    and 15               ;change to n-1
    add a,a
    ld hl,ldirloop
    add a,l
    ld l,a
#if (ldirloop>>8)!=(_ldirloop_end>>8)
    jr nc,$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
    inc h       ;
#endif
    pop af
    ex (sp),hl
    ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
_ldirloop_end:
    ldi
    jp pe,ldirloop
    ret
#endif
.echo $-$9D95," bytes"
#undefine var_n
#undefine partition_size
#undefine var_k
#undefine head0
#undefine head1
#undefine tail



I want to clarify that my mergesort method had to be in-place using O(1) space, so it devolved into an O(n2) algorithm.
With my estimates, mergesort would be faster than a non-adaptive insertion sort at 90 elements in the worst case. In practice, at 96 elements it was about 5% faster, probably because it wasn't a worst case scenario.

Either way, my insertion sort is fast compared with other implementations like that in MirageOS or DoorsCS7. It's much faster when the VAT is mostly sorted already, like if it gets sorted and then a user adds a few new programs. In that case, their implementations still seem to be O(n2) whereas mine is closer to O(n).

In both codes, the biggest speed factor was in shifting VAT entries. In the speed optimized versions, I take advantage of LDI over LDIR where I can, and I use my fastldir routine to eke out as much performance as possible, gaining a 10% speed boost in the overall routine in my test.

13
Site Feedback and Questions / Login Broken
« on: August 06, 2018, 04:31:38 pm »
I've noticed that when I try to sign in to Omni via the home page ( https://www.omnimaga.org/index.php ), it just hangs with a "Loading..." status bar at the top and never actually logs in. However, when I navigate to any other page and try to log in at the top, it works fine.
It has happened to me a few times in the past, but I so rarely need to log in that I attributed it to my internet connection.

14
TI Z80 / Heapsort, VATSort, and ListSort
« on: July 03, 2018, 10:29:08 pm »
Hey all, I implemented the heapsort algorithm with some features inspired by Sean McLaughlin's implementation. Heapsort operates in O(n*log(n)) time, so it is fast even as array sizes increase. The cool perk to Sean's implementation that I wanted to include in mine is the ability to have a callback function that performs comparisons. This is fantastic if your array contains pointers to, say, an array of strings. The callback function can take the inputs (in my case, HL and DE), perform a string comparison, and return the result to the heapsort routine. Heapsort will then know whether or not to swap the elements.

As an example, I created a VAT sorting routine. It creates an array of pointers to the VAT entries of named variables (like Programs, Appvars, etc.), and then uses a fancy callback function that compares the names of the VAT entries. Returned is an alphabetically sorted array of pointers.

As well, I made a very fast routine to sort TI lists. The callback routine for this takes indices, constructs pointers to the actual list data, compares them, then returns the result. Since the OS routine uses an O(n2) algorithm, my program can perform way faster on larger lists. I ran a few tests on random lists and came up with:

* at 100 elements, the OS routine and my routine were pretty close in speed.
* at 400 elements, my routine sorted it in ~1.3 seconds, versus ~9.5. for the OS routine
* at 999 elements, my routine sorted it in ~3.0 seconds, versus ~55.5. for the OS.

Code: (Heapsort) [Select]
#ifndef scrap
#define scrap 8000h
.echo "Warning, 'scrap' not defined, defining scrap=8000h"
#endif
#ifndef SMC
arraybase= scrap
arraylen = scrap+2
#endif
heapflags= 33
curchild = 0
heapsort:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
#ifdef fast
    call heapify
    ld hl,(arraylen)
#else
    push bc
    call heapify
    pop hl
#endif
_:
    dec hl
    ld (arraylen),hl
#ifndef fast
    push hl
#endif
    ld de,(arraybase)
    add hl,hl
    inc de
    add hl,de
#ifdef fast
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
#else
    call swap
#endif
    ld bc,1
    call propogate
#ifdef fast
    ld hl,(arraylen)
#else
    pop hl
#endif
    ld a,h
    or l
    jr nz,-_   
    ret
heapify:
;Inputs:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
    ld (arraybase),hl
    ld (arraylen),bc
    srl b
    rr c
_:
    push bc
    call propogate
    pop bc
    dec bc
    ld a,b
    or c
    jr nz,-_
    ret
propogate:
;BC-1 is the parent index
;2BC is the child1 index
    res curchild,(iy+heapflags)
proppost:
    sla c
    rl b
    ld d,b
    ld e,c
#ifdef SMC
arraylen=$+1
    ld hl,0
#else
    ld hl,(arraylen)
#endif
    sbc hl,de
    add hl,de
    ret c  ;no children
;compare the two children
#ifdef SMC
arraybase=$+1
    ld hl,0
#else
    ld hl,(arraybase)
#endif
    add hl,de
    add hl,de
    inc hl
    ld d,(hl)
    dec hl
    ld e,(hl)
    dec hl
    push hl
    ld a,(hl)
    dec hl
    ld l,(hl)
    ld h,a
    ;HL holds the value of child0
    ;DE holds the value of child1
    jr z,+_
    call callix
    jr nc,+_
    ex de,hl
    pop de
    inc de
    inc de
    set curchild,(iy+heapflags)
    push de
_:
;{stack} points to the child
;HL is the value of the child
;BC points to the parent
;now compare the child and parent
    ex de,hl
    ld hl,(arraybase)
    add hl,bc
    push hl
    dec hl
    ld a,(hl)
    dec hl
    ld l,(hl)
    ld h,a
    call callix
    pop hl
    pop de
    ret nc
    dec hl
    call swap
;values swapped, now set parent=child
;BC is the index of child1
    bit curchild,(iy+heapflags)
    jp z,proppost
    inc bc
    jp propogate
swap:
;HL points to the top of one word
;DE points to the top of another
;Must preserve BC
#ifdef fast
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    inc c
    inc bc
#else
    call +_
_:
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    inc bc
#endif
    ret
callix:
    jp (ix)
Even when optimizing for speed over size, my code is smaller, but I'm fairly sure there are more optimizations to be found!

15
TI Z80 / LZD Image Compression/Decompression
« on: June 27, 2018, 08:41:18 pm »
Hey all, I have been playing with compression. I made a Python program that compresses data, such as a TI Image, as well as a calculator program that can decompress such data to the graph screen.

While the Python program is general-purpose, the Z80 program assumes that the input is a compressed image, so you can mess things up! The neat thing is that in my example, the calc program (LZD) and the compressed data, along with their file headers, all came out to less than the size of the original image!

Pages: [1] 2 3 ... 13