Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Topics - Xeda112358

Pages: [1] 2 3 ... 13
1
News / POTY 2018 !
« on: December 04, 2018, 11:37:42 am »
TICalc.org's Program Of The Year polls are open!
For the TI-83+/84+ category, it looks like a bunch of older projects split between two authors-- myself and squidgetx (ticalc profile).

Up for the vote are the following programs (in the order listed on ticalc.org).

Batlib (ticalc link)
Batlib is a huge library for BASIC programmers. It has 120+ functions mostly for graphics and string and data manipulation, list and matrix manipulation, compression, and much more.



CopyProg (ticalc link)
This is a small program that allows users to copy variables from RAM or Archive to another variable. CopyProg2 has a few other features as well (like line reading and reading the names of the variables on your calc). It allows you to do things like copy an archived appvar to a temp program for execution ;)



Embers (ticalc link)
Embers is an ARPG that won Omnimaga Contest 2012. This was a really well put together game. It features good graphics, good AI, and storyline.



FloatLib (ticalc link)
Floatlib is an app that holds many single-precision float routines from addition to hyperbolic tangent. It comes with a built-in reference manual and there are some examples like computing the Mandelbrot Set.



Gravity Guy (ticalc link)
Gravity Guy is "a port/variation of the popular iphone/flash game" of the same name. You basically get to flip the direction of gravity to help navigate obstacles.


LblRW (ticalc link)
LblRW is a small utility for BASIC programmers that lets you read or modify data within the BASIC program, using a label offset. That's a mouthful. Basically, "Hey, I want to store player data in my RPG, let's store it after Lbl PD." Or you can store monster data for quick access for example ;)
(no screenshot, sorry :( )


StickNinja (ticalc link)
StickNinja was an Omnimaga Contest 2011 entry that earned 3rd place. It's basically a platformer with awesome Stick Figure and Ninja graphics. Collect coins, destroy enemies; It's got it all.

2
TI Z80 / Hacked Lists!
« on: November 04, 2018, 10:56:56 am »
So, fun fact: floating point numbers can be changed to a variable reference instead, kind of like a shortcut on your computer, and the OS doesn't verify it.

In assembly, when you are looking for a variable, you put a name in OP1 and use FindSym or ChkFindSym. Those names are 9 bytes, which is also the size of a float (not a coincidence-- it's part of why names are limited to 8 bytes).
You can actually change the contents of a Real variable to such a name, and when you try to read it's value, it will instead return the value of the variable it points to.
For example, change the bytes of the Real variable, A, from 008000000000000000 (the number 0) to 015D01000000000000 and when you read A, it will return the contents of L2.

Or, since lists are just stored as a bunch of Real (or Complex) numbers, you can modify elements of the list to be pointers. In this way, you could read Str1, Str2, Str3,..., Str0 by reading an element of L1, which could be useful, occasionally :P

Some things to note:
You can't modify the contents of the original variable in this way. If A points to Str2, then "33"+A→A does not modify Str2. However, "33"+A will work like "33"+Str2.
Storing the names of programs and appvars and whatnot isn't useful.


Attached is a program that turns L1 into a 4-element list {L2,[A],Str1,L1}. Here is the source. You need to delete L1 first, my code wasn't reliably deleting it.
Code: [Select]
#define bcall(x) rst 28h \ .dw x
rMOV9TOOP1  = 20h
_ChkFindSym = 42F1h
_CreateRList= 4315h
.db $BB,$6D
.org $9D95
  ld hl,name
  rst rMOV9TOOP1
  bcall(_ChkFindSym)
  ret c
  ld hl,name
  rst rMOV9TOOP1
  ld hl,4
  bcall(_CreateRList)
  inc de
  inc de
  ld hl,data
  ld bc,31
  ldir
  ret
data:
  .db 1,$5D,1,0,0,0,0,0,0
  .db 2,$5C,0,0,0,0,0,0,0
  .db 4,$AA,0,0,0,0,0,0,0
name:
  .db 1,$5D,0,0

3
Math and Science / Efficient Plotting of Polynomials
« on: October 28, 2018, 10:32:12 pm »
Hi folks, I wanted to share a technique I came up with for evaluating sequential points of a polynomial, which is especially useful when you are drawing them. A naive approach would be evaluate the whole polynomial at each point, but the method I'll show you allows you to evaluate at a few points, and then for each subsequent point just use additions.

The basic premise is to use ##f(x)## as a starting point to get to ##f(x+1)##. For example, take ##f(x)=3+2x##. To get to ##f(x+1)##, all you have to do is add 2. Things get more complicated as you go to higher degree polynomials. For example, take ##f(x)=3+2x-3x^{2}+x^{3}##. Then to get to ##f(x+1)##, you need to add ##3x^{2}-3x##. However, you can go even further by finding how to get from ##3x^{2}-3x## to ##3(x+1)^{2}-3(x+1)##.

Now comes the theory.

Take ##f_{1}(x)=f(x+1)-f(x)## and ##f_{n+1}(x)=f_{n}(x+1)-f_{n}(x)##.
Then ##f(x+1)=f(x)+f_{1}(x)## and ##f_{n}(x+1)=f_{n}(x)+f_{n+1}(x)##.

##\begin{aligned}
f_{2}(x) &=f_{1}(x+1)-f_{1}(x),\\
&=(f(x+2)-f(x+1))-(f(x+1)-f(x)),\\
&=f(x+2)-2f(x+1)+f(x).
\end{aligned}##


##\begin{aligned}
f_{3}(x)&=f_{2}(x+1)-f_{2}(x),\\
&=(f(x+3)-2f(x+2)+f(x+1))-(f(x+2)-2f(x+1)+f(x)),\\
&=f(x+3)-3f(x+2)+3f(x+1))-f(x).
\end{aligned}##

And in general,
##f_{n}(x)=\sum_{k=0}^{n}{{n\choose k}(-1)^{k}f(x+n-k)}##
If you set ##x=0##:
##f_{n}(0)=\sum_{k=0}^{n}{{n\choose k}(-1)^{k}f(n-k)}##

So now let's take again, ##f(x)=3+2x-3x^{2}+x^{3}##.
##\begin{aligned}
f(0)&=3,\\
f_{1}(0)&=f(1)-f(0)&=0,\\
f_{2}(0)&=f(2)-2f(1)+f(0)&=0,\\
f_{3}(0)&=f(3)-3f(2)+3f(1)-f(0)&=6,\\
f_{k>3}(0)&=0.\\
\end{aligned}##

So what these values allow us to do, ##{3,0,0,6}##, is use a chain of additions to get the next value, instead of evaluating the cubic polynomial at each x:
##\begin{array}{rrrr}
[&3,&0,&0,&6]\\
[&3,&0,&6,&6]\\
[&3,&6,&12,&6]\\
[&9,&18,&18,&6]\\
[&27,&36,&24,&6]\\
[&63,&60,&30,&6]\\
[&123,&90,&36,&6]\\
[&213,&126,&42,&6]\\
...
\end{array}##

Basically, add the second element to the first, third element to the second, and fourth element to the third. The new value is the first. Repeat. So the next iteration would yield [213+126,126+42,42+6,6]=[339,168,48,6], and indeed, f(8)=339.

I'd like to apologise for how garbled this is, I'm really tired.

Also, you should note that for an n-th degree polynomial, you need to compute up to ##f_{n}(x)##, but everything after that is 0.

4
TI Z80 / Yet Another VAT Sort
« on: October 09, 2018, 07:23:57 pm »
A few months ago I implemented heapsort and used it to create a sorted array of VAT entries. While it is fast, instead of directly sorting the VAT, it creates an array of pointers to each entry and sorts the pointers. This means that the VAT itself remains unsorted (not necessarily an issue), and a variable amount of RAM needs to be allocated. In the example, if the user had over 383 variables in the named variables table, only the first 383 entries could be sorted (unlikely, but an issue nevertheless).

My goal here was to sort the VAT in-place, using only a fixed amount of overhead memory. After a couple of weeks of brainstorming and testing, I ended up with an adaptive insertion sort implementation and a mergesort implementation.
Code: (Insertion Sort) [Select]
pTemp       = 982Eh ;bottom of named vars VAT
progPtr = 9830h ;top of named vars VAT
OP1         = 8478h
;uses 19 bytes of RAM, 4 bytes of stack space
#define tmp OP1+14
#define head tmp+1
#define head_lag tmp+3
#macro advanceVAT()
#ifdef speed
;17cc saved per iteration
;8 bytes vs a 3 byte call (but no 8-byte subroutine)
  ld bc,-6
  add hl,bc
  sbc a,a   ;HL<=FE66, so carry flag is set
  sub (hl)
  ld c,a
  add hl,bc
#else
  call advance_VAT    ;preserves DE
#endif
#endmacro


sortVAT:
#ifdef nointerrupt
  di
#endif
  ld hl,(progPtr)
isort_main:
_:
  ld (head_lag),hl
  ld d,h
  ld e,l
  advanceVAT()
  ld (head),hl
#ifdef speed
;11 bytes, 29cc or 46cc. avg=29.06640625cc
  ld a,(pTemp)
  cp l
  jr nz,$+7
  ld a,(pTemp+1)
  cp h
  ret z
#else
;adds 8 bytes, 55cc
  ld bc,(pTemp) ;Need to verify that we haven't reached the end of the progVAT
  or a          ;
  sbc hl,bc     ;
  ret z         ;
  add hl,bc     ;
#endif
  call cmpVAT
  ld hl,(head)
  jr nc,-_
;if it makes it here, then (head) needs to be inserted into the previous part of the VAT
;We might be able to speed it up a little more if I also grab the next element
;  If (head_advance) is bigger than (head), then no need to start the search from the beginning
  ld de,tmp
#ifdef speed
  ldd
  ldd
  ldd
  ldd
  ldd
  ldd
  ld b,0
  ld c,(hl)
  lddr
  ldd
#else
  ld bc,6
  lddr
  ld c,(hl)
  inc c
  lddr
#endif
  ld hl,(progPtr)
_:
  push hl
#ifdef speed
;+5 bytes, -11cc
  ld bc,-6
  add hl,bc
  ld de,tmp-6
  call cmpVAT_stepin
#else
  ex de,hl
  ld hl,tmp
  call cmpVAT
#endif
  pop hl
  jr c,+_
  advanceVAT()
  jp -_
_:
;HL is where to insert
  ld de,(head)
  or a
  sbc hl,de
  ld b,h
  ld c,l
  ld hl,-6
  add hl,de
  ld a,l
  sub a,(hl)
  ld l,a
  jr nc,$+4
  dec h
  or a
  inc de
  ex de,hl
#ifdef speed
  call fastldir
#else
  ldir
#endif
  ;begin at DE, copy tmp. First need size of tmp
  ld hl,tmp-6
  ld c,(hl)
  sbc hl,bc
  ld a,c
  ldir
#ifdef speed
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  add a,7
#else
  ld c,7
  add a,c
  ldir
#endif
  ld hl,(head_lag)
  ld c,a
  ld a,l
  sub c
  ld l,a
  jp nc,isort_main
  dec h
  jp isort_main
#ifndef speed
advance_VAT:
  ld bc,-6
  add hl,bc
  sbc a,a   ;HL<=FE66, so carry flag is set
  sub (hl)
  ld c,a
  add hl,bc
  ret
#endif
cmpVAT:
;if @HL>[email protected], return nc
  ld bc,-6
  add hl,bc
  ex de,hl
  add hl,bc
cmpVAT_stepin:
  ld a,(de)
  cp (hl)
  jr nc,first_longer
;the second name is longer.
  ld c,a
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,-_
  scf
  ret
first_longer:
;the first name is longer, so load c with the size of the second name
  ld c,(hl)
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,-_
  ret
#ifdef speed
fastldir:
;copy BC bytes from HL to DE
;copy BC bytes from HL to DE
;breaks even at 26 bytes
; 5% faster than LDIR with 35 bytes
;10% faster than LDIR with 48 bytes
;15% faster than LDIR with 91 bytes
;20% faster than LDIR with 635 bytes
;max is ~ 20.8% faster than LDIR
;Cost: 104+16N+10ceiling(N/16)
    push hl
;    push af
    xor a
    sub c
    and 15               ;change to n-1
    add a,a
    ld hl,ldirloop
    add a,l
    ld l,a
#if (ldirloop>>8)!=(_ldirloop_end>>8)
    jr nc,$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
    inc h       ;
#endif
;    pop af
    ex (sp),hl
    ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
_ldirloop_end:
    ldi
    jp pe,ldirloop
    ret
#endif
#undefine tmp
#undefine head
#undefine head_lag1
.echo $-sortVAT," bytes"
Code: (Mergesort) [Select]
pTemp       = 982Eh ;bottom of named vars VAT
progPtr = 9830h ;top of named vars VAT
OP1         = 8478h
;uses 29 bytes of state.
; 4 bytes stack space
; 10 bytes variables
; 15 bytes swap
#define var_n OP1+15
#define partition_size var_n+2
#define var_k partition_size
#define head0 var_n+4
#define head1 var_n+6
#define tail var_n+8

cmpVAT:
;HL points to one entry
;DE points to the second entry
;return c if first>second
;return nc if first<=second
  ld bc,-6
  add hl,bc
  ld a,(hl)
  ex de,hl
  add hl,bc
  ld b,(hl)
  ex de,hl
  ;A is the size of the first name
  ;HL points to the byte above the first name
  ;B is the size of the second name
  ;DE points to the byte above the second name
  ld c,a
  jr nc,second_name_longer
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  djnz -_
  inc c
  scf
  ret
second_name_longer:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,second_name_longer
  ret
nthstr:
;Input:
;   HL points to the size byte of the name of the first element
;   BC is the number of elements
;Output:
;   c flag set if end of VAT reached, else nc
;   HL points to the element
;speed:
; worst: 34+(95-7/256)*BC
; best:  34+(95-14/256)*BC
  ld de,(pTemp)
_:
  or a      ;Can do some analysis to omit this code for most runs
  sbc hl,de ;could speed this up by almost 37%
  add hl,de ;Still, only adds .001 seconds per 100 VAT entries :P
  ret c     ;35cc
  ld a,-6
  sub (hl)
  add a,l
  ld l,a
  jr c,$+3
  dec h
  or a
  cpd
  jp pe,-_
  or a
  ret
sortVAT:
  ld hl,(progPtr) ;get the number of elements in the VAT
  ld de,-6        ;
  add hl,de       ;
  ld bc,0         ;
  call nthstr     ;
  ld (var_n),bc   ;
  ld hl,1
  ld (var_k),hl
sort_main:      ;until partition_size exceeds size of VAT
  ld hl,(progPtr)
sort_main_0:    ;merge multiples of partition_size
  ld (head0),hl
  ld de,-6
  add hl,de
  ld bc,(var_k)
  call nthstr
  jp c,sort_main_0_end
  push hl
  ld bc,(var_k)
  call nthstr
  ld de,6
  add hl,de
  ld (tail),hl
  pop hl
  add hl,de
  ld (head1),hl

  ex de,hl    ;head1
  ld hl,(head0)
  ;or a
  sbc hl,de
    ld hl,(tail)
  jr z,mergeloop_end
  ;or a         ;head0>=head1, so c flag should be reset already
  sbc hl,de
  jr z,mergeloop_end-1
  ld de,(head1)
mergeloop:
  ld hl,(head0)
  call cmpVAT
  jr nc,+_
#ifdef speed
;saves 42cc
  ld hl,(head1)
  ld de,var_n-1
  ldd
  ldd
  ldd
  ldd
  ldd
  ldd
  ld a,(hl)
  ldd
  ld c,a
  add a,7
  ld b,0
#else
  ld hl,(head1)
  ld de,-6
  add hl,de
  ld a,(hl)
  ld de,var_n-1
  ld hl,(head1)
  add a,7
  ld b,0
  ld c,a
#endif
  lddr
;HL now points to the bottom
  ld de,(head1)
  ld (head1),hl
  ld hl,(head0)
  sbc hl,de ;number of bytes that need to be shifted up by A
  ld b,h
  ld c,l
;need to shift from DE to (head1)
  ld hl,(head1)
  ex de,hl
  inc de
  inc hl
#ifdef speed
  call fastldir
#else
  ldir
#endif

  ld hl,var_n-1
  ld de,(head0)
  ld c,a
  lddr
  ex de,hl
  jp $+9
_:
  ld a,l
  sub c
  ld l,a
  jr nc,$+3
  dec h

  ld (head0),hl
  ld de,(head1)
  or a
  sbc hl,de
  ld hl,(tail)
  jr z,mergeloop_end
  ;or a         ;head0>=head1, so c flag is reset already
  sbc hl,de
  jr nz,mergeloop
  add hl,de
mergeloop_end:
  ex de,hl
  ;ld de,(tail)
  ld hl,(pTemp)
  ;or a
  sbc hl,de
  ex de,hl
  jp nz,sort_main_0
sort_main_0_end:
  ld hl,(var_k)
  add hl,hl
  ld (var_k),hl 
  ld bc,(var_n)
  add hl,bc
  jp nc,sort_main
  ret
#ifdef speed
fastldir:
;copy BC bytes from HL to DE
;copy BC bytes from HL to DE
;breaks even at 26 bytes
; 5% faster than LDIR with 35 bytes
;10% faster than LDIR with 48 bytes
;15% faster than LDIR with 91 bytes
;20% faster than LDIR with 635 bytes
;max is ~ 20.8% faster than LDIR
;Cost: 104+16N+10ceiling(N/16)
    push hl
    push af
    xor a
    sub c
    and 15               ;change to n-1
    add a,a
    ld hl,ldirloop
    add a,l
    ld l,a
#if (ldirloop>>8)!=(_ldirloop_end>>8)
    jr nc,$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
    inc h       ;
#endif
    pop af
    ex (sp),hl
    ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
    ldi
_ldirloop_end:
    ldi
    jp pe,ldirloop
    ret
#endif
.echo $-$9D95," bytes"
#undefine var_n
#undefine partition_size
#undefine var_k
#undefine head0
#undefine head1
#undefine tail



I want to clarify that my mergesort method had to be in-place using O(1) space, so it devolved into an O(n2) algorithm.
With my estimates, mergesort would be faster than a non-adaptive insertion sort at 90 elements in the worst case. In practice, at 96 elements it was about 5% faster, probably because it wasn't a worst case scenario.

Either way, my insertion sort is fast compared with other implementations like that in MirageOS or DoorsCS7. It's much faster when the VAT is mostly sorted already, like if it gets sorted and then a user adds a few new programs. In that case, their implementations still seem to be O(n2) whereas mine is closer to O(n).

In both codes, the biggest speed factor was in shifting VAT entries. In the speed optimized versions, I take advantage of LDI over LDIR where I can, and I use my fastldir routine to eke out as much performance as possible, gaining a 10% speed boost in the overall routine in my test.

5
Site Feedback and Questions / Login Broken
« on: August 06, 2018, 04:31:38 pm »
I've noticed that when I try to sign in to Omni via the home page ( https://www.omnimaga.org/index.php ), it just hangs with a "Loading..." status bar at the top and never actually logs in. However, when I navigate to any other page and try to log in at the top, it works fine.
It has happened to me a few times in the past, but I so rarely need to log in that I attributed it to my internet connection.

6
TI Z80 / Heapsort, VATSort, and ListSort
« on: July 03, 2018, 10:29:08 pm »
Hey all, I implemented the heapsort algorithm with some features inspired by Sean McLaughlin's implementation. Heapsort operates in O(n*log(n)) time, so it is fast even as array sizes increase. The cool perk to Sean's implementation that I wanted to include in mine is the ability to have a callback function that performs comparisons. This is fantastic if your array contains pointers to, say, an array of strings. The callback function can take the inputs (in my case, HL and DE), perform a string comparison, and return the result to the heapsort routine. Heapsort will then know whether or not to swap the elements.

As an example, I created a VAT sorting routine. It creates an array of pointers to the VAT entries of named variables (like Programs, Appvars, etc.), and then uses a fancy callback function that compares the names of the VAT entries. Returned is an alphabetically sorted array of pointers.

As well, I made a very fast routine to sort TI lists. The callback routine for this takes indices, constructs pointers to the actual list data, compares them, then returns the result. Since the OS routine uses an O(n2) algorithm, my program can perform way faster on larger lists. I ran a few tests on random lists and came up with:

* at 100 elements, the OS routine and my routine were pretty close in speed.
* at 400 elements, my routine sorted it in ~1.3 seconds, versus ~9.5. for the OS routine
* at 999 elements, my routine sorted it in ~3.0 seconds, versus ~55.5. for the OS.

Code: (Heapsort) [Select]
#ifndef scrap
#define scrap 8000h
.echo "Warning, 'scrap' not defined, defining scrap=8000h"
#endif
#ifndef SMC
arraybase= scrap
arraylen = scrap+2
#endif
heapflags= 33
curchild = 0
heapsort:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
#ifdef fast
    call heapify
    ld hl,(arraylen)
#else
    push bc
    call heapify
    pop hl
#endif
_:
    dec hl
    ld (arraylen),hl
#ifndef fast
    push hl
#endif
    ld de,(arraybase)
    add hl,hl
    inc de
    add hl,de
#ifdef fast
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
#else
    call swap
#endif
    ld bc,1
    call propogate
#ifdef fast
    ld hl,(arraylen)
#else
    pop hl
#endif
    ld a,h
    or l
    jr nz,-_   
    ret
heapify:
;Inputs:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
    ld (arraybase),hl
    ld (arraylen),bc
    srl b
    rr c
_:
    push bc
    call propogate
    pop bc
    dec bc
    ld a,b
    or c
    jr nz,-_
    ret
propogate:
;BC-1 is the parent index
;2BC is the child1 index
    res curchild,(iy+heapflags)
proppost:
    sla c
    rl b
    ld d,b
    ld e,c
#ifdef SMC
arraylen=$+1
    ld hl,0
#else
    ld hl,(arraylen)
#endif
    sbc hl,de
    add hl,de
    ret c  ;no children
;compare the two children
#ifdef SMC
arraybase=$+1
    ld hl,0
#else
    ld hl,(arraybase)
#endif
    add hl,de
    add hl,de
    inc hl
    ld d,(hl)
    dec hl
    ld e,(hl)
    dec hl
    push hl
    ld a,(hl)
    dec hl
    ld l,(hl)
    ld h,a
    ;HL holds the value of child0
    ;DE holds the value of child1
    jr z,+_
    call callix
    jr nc,+_
    ex de,hl
    pop de
    inc de
    inc de
    set curchild,(iy+heapflags)
    push de
_:
;{stack} points to the child
;HL is the value of the child
;BC points to the parent
;now compare the child and parent
    ex de,hl
    ld hl,(arraybase)
    add hl,bc
    push hl
    dec hl
    ld a,(hl)
    dec hl
    ld l,(hl)
    ld h,a
    call callix
    pop hl
    pop de
    ret nc
    dec hl
    call swap
;values swapped, now set parent=child
;BC is the index of child1
    bit curchild,(iy+heapflags)
    jp z,proppost
    inc bc
    jp propogate
swap:
;HL points to the top of one word
;DE points to the top of another
;Must preserve BC
#ifdef fast
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    inc c
    inc bc
#else
    call +_
_:
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    inc bc
#endif
    ret
callix:
    jp (ix)
Even when optimizing for speed over size, my code is smaller, but I'm fairly sure there are more optimizations to be found!

7
TI Z80 / LZD Image Compression/Decompression
« on: June 27, 2018, 08:41:18 pm »
Hey all, I have been playing with compression. I made a Python program that compresses data, such as a TI Image, as well as a calculator program that can decompress such data to the graph screen.

While the Python program is general-purpose, the Z80 program assumes that the input is a compressed image, so you can mess things up! The neat thing is that in my example, the calc program (LZD) and the compressed data, along with their file headers, all came out to less than the size of the original image!

8
Math and Science / Fast 3-digit Squaring, Toom-Cook
« on: June 26, 2018, 02:18:40 pm »
Hi all, I realized that I have a lot of unshared results that I've worked out over the years! (Then again, I have a lot of shared results :P )
Today's is dealing with squaring 3-digit numbers.
If you aren't familiar with the Toom-Cook algorithm, read the spoiler! It's cool stuff. Otherwise, continue down.
Spoiler For Prerequisite:
The Toom-Cook algorithm basically generalizes multiplying integers as multiplying polynomials.For example, if we want to multiply 47*81, we can think of it as multiplying (7+4x)*(1+8x) and evaluating at x=10 (since our numbers were in base 10). This seems like a lot of unnecessary complexity, and it is for such small numbers, but let's look at a few key ideas.

If you are given 3 points on a plane, you can construct a unique quadratic function that passes through each point. The key here is that it is unique meaning there are no other degree-2 polynomials that pass through each point-- there is only one and that one must exists. On the other hand given 4 points, there might be a parabola that passes through each point, but it is not a guarantee and is in fact unlikely. As well, given three points, there exist infinitely many cubic polynomials (or quartic or quintic, etc.) that pass through each point. In general, if you have n+1 points, you are guaranteed to be able to find a unique polynomial of degree n that is the only such polynomial that passes through each point.

Guarantees and uniqueness are cool and all, but actually constructing the polynomials from those points can be tricky (but still "quickly" doable in a computational sense)! Luckily for our case, the points won't be totally arbitrary. Let's look back at our Toom-Cook example of multiplying 47*81. The polynomial product is (7+4x)*(1+8x), which we know should be a quadratic equation. Then if we evaluate this product at three easy points, we should be able to reconstruct the result. Specifically, let's evaluate at x=-1, x=0, and x=1:
(7-4)*(1-8)=-21
7*1=7
(7+4)*(1+8)=99

The easiest way to reconstruct the polynomial by hand, in my opinion, is to use matrices to represent a system of equations. In this case, with ai being the points we evaluated at -1,0, and 1:

[[  1   (-1)1    (-1)2   a0]
 [  1   01       02      a1]
 [  1   11       12      a2]]


   [[  1  -1   1   a0]
=   [  1   0   0   a1]
    [  1   1   1   a2]]

And now we reduce this to reduced row-eschelon form.

   [[  0   1   0   (a2-a0)/2]
=>  [  1   0   0   a1]
    [  1   0   1   (a2+a0)/2]]


   [[  0   1   0   (a2-a0)/2]
=>  [  1   0   0   a1]
    [  0   0   1   (a2+a0)/2-a1]]


Now that we know the resulting coefficients, let's construct our polynomial:
a1+(a2-a0)x/2+((a2+a0)/2-a1)x2
=7+60x+32x2
Let's evaluate this at x=10 and we have 3200+600+7 = 3807 which is indeed 47*81.

The nice thing about this is, since we've already solved the system of equations, we can reuse this and we don't need to compute it every time. We just compute the product at -1, 0, and , then plug it into our formula et voila. In fact, let's make a BASIC program that multiplies two ten-digit numbers and returns a 20-digit result. Input is passed through two 2-element lists, L1 and L2, and use those elements as 5-digit 'digits' to compute the product.
Code: [Select]
{1.-1
sum(L1Ans)sum(L2Ans->A
L1(1)L2(1->B
sum(L1)sum(L2->C
.5(C-A
{B,Ans,Ans+A-B->L1
0->C
L1e-5->L1    ;engineering E
For(K,1,3
C+L1(k->L1(K
e-5int(Ans->C
End
C->L1(4
e5fPart(L1->L1
The output is a 20-digit integer comprised of four 5-digit numbers. The first element is the lowest 5 digits, the fourth element is the upper 5 digits.
To square a 3-digit number, we'll result in a degree-4 polynomial. Consequently, we'll need to evaluate at 5 points. The points I chose were {0,1,i,-1,-i} where 'i' is the imaginary unit. It may seem like evaluating at a complex point would be, well, complex, but we'll see. First, let's write our polynomial as a0+a1x+a2x2.
Then:
b0=f(0)=a02
b1=f(1)=(a0+a1+a2)2
b2=f(i)=(a0+a1i-a2)2
b3=f(-1)=(a0-a1+a2)2
b4=f(-i)=(a0-ia1-a2)2
Squaring complex numbers is easy though-- (a+bi)2= a2-b2+2abi = (a-b)(a+b)+2abi
So simplifying:
b2=(a0-a2+a1)(a0-a2-a1)+2i(a0-a2)a1
b4=(a0-a2+a1)(a0-a2-a1)-2i(a0-a2)a1

Noticing symmetry, let's define instead:
c3=(a0-a2+a1)(a0-a2-a1)
c4=2(a0-a2)a1

Then:
b2=c3+ic4
b4=c3-ic4


    1       0       0       0       0   b0
    1       1       1       1       1   b1
    1       i      -1      -i       1   c3+ic4
    1      -1       1      -1       1   b3
    1      -i      -1       i       1   c3-ic4


    1       0       0       0       0   b0
    1       0       1       0       1   (b1+b3)/2 = c1
=>  0       1       0       1       0   (b1-b3)/2 = c2
    1       0      -1       0       1   c3
    0       1       0      -1       0   c4


    1       0       0       0       0   b0
    1       0       0       0       1   (c1+c3)/2 = d1
=>  0       0       1       0       0   (c1-c3)/2 = d2
    0       1       0       0       0   (c2+c4)/2 = d3
    0       0       0       1       0   (c2-c4)/2 = d4


Then our polynomial is:
b0+d3x+d2x2+d4x3+(d1-b0)x4


So for an example, let's find 348*348
Then we have:
a0=8
a1=4
a2=3
b0=64
b1=(8+4+3)2=225
b3=(8-4+3)2=49
c3=(8-3+4)(8-3-4)=9
c4=2(8-3)*4=40
c1=(225+49)/2 = 137
c2=c1-49 = 88
d1=(137+9)/2 = 73
d2=d1-9 = 64
d3=(88+40)/2 = 64
d4=d3-40 = 24

So then, our result is:
64+64x+64x2+24x3+9x4
Evaluating at x=10 eyelids yields:
64+640+6400+24000+90000 = 121104, which is indeed 384*384

Now let's generate some Python code:
Code: [Select]
def sqr3(a,b,c):
    d=(a+b+c)^2    #roughly 75% of cases exceed the original bit-width
    e=(a-b+c)^2    #roughly 1/6 of the time this will exceed the original bit-width. However, whenever this exceeds the bit-width, then d also exceeds it.
    d=(d+e)/2
    e=d-e
    f=(a+b-c)(a-b-c)   #it is never the case that both multiplicands exceed  their original word width, but about 1/3 of the time, one of them exceeds it (~2/3 of the time, neither exceed).
    b*=(a-c)        #neither multiplicand exceeds the original bit-width (though the second operand may become negative).
    b+=b
    a*=a
    e=(e+b)/2
    b=e-b
    d=(d-f)/2
    f+=d-a
    return [a,e,d,b,f]
Note that three of the sub-multiplications are just squaring the input. Also note that all divisions are by a power of two, and all results and operations are integer operations, not complex (assuming all inputs are integers).

9
Other Programming Languages / How To Add Two Sprites
« on: March 14, 2018, 01:39:13 pm »
For a few years now, I've been playing on-and-off with various ideas related to parsing and evaluating expressions. Today as I was working out the code for adding two values, I came across the case of adding a 'sprite' variable to some other variable.

If you had a programming language where you had two sprite variables, what would you want or expect 'sprite1+sprite2' to do, if anything? For that matter, how about adding an integer or float to it? How about multiplication or division?

I know, for example, that I would want 'sprite1 or sprite2' to return a sprite with it's contents bit-wise ORed together.

10
TI Z80 / Hash-Based Instruction Lookup
« on: November 04, 2017, 11:24:57 am »
So I was working on some code to take an alphanumeric input string and look it up in a table and do something. After rehashing a binary search algorithm for the nth time, I remembered an old idea that I had for finding a string in set of data.
Spoiler For Spoiler:
I could basically speed up the search by xor-ing each byte of the search string and using that value is an indicator of a potential match. If the input is n bytes long, then I xor the first n starting bytes of the data to search. If the XORed value matches, double check that you haven't already found a match! Otherwise, set variables tail=data_start and head=data_start+n and x as the XORed value of the input, and acc as the XORed value of the first n bytes of the data. Now XOR the byte at tail with acc and store it back to acc. Increment tail, increment head. XOR the byte at head with acc and store it back to acc. If acc==x then double check the string between tail and head isn't a match, otherwise, continue until head goes beyond the data.
After some poking around on Google, I came across a hash based search algorithm that was just the insight that I needed!

So I came up with some sample commands and I was lucky enough to find a simple hashing function that provided a unique value for all 46 functions.

I'm going to provide the code here, even though it isn't very pretty. This way if I ever lose it, I'll have this as reference :)
This example executes the command "Disp" which I just made display everything after it.
Code: [Select]
#include "grammer3.inc"

_DispHL = 4507h
_NewLine = 452Eh
_PutS   = 450Ah
#define bcall(x) rst 28h \ .dw x
scrap = 8000h
denom = scrap
numer = scrap+4
out   = scrap+8
.db $BB,$6D
.org $9D95
    ld hl,testinput
    ld bc,testinput_end-testinput
    call hashlookup
    jr c,+_
    ld hl,s_NOTFOUND
    bcall(_PutS)
    ret
_:
    inc de \ inc de ; Now DE points to the pointer to the parsing code
    ex de,hl
    ld a,(hl) \ inc hl \ ld h,(hl) \ ld l,a
    ex de,hl
    cpi
    ret po
    ;HL points to the input, BC is the size left
    ;DE points to the code that parses the instruction
    push de
    ret
testinput:
    .db "Disp Yay, it works!",0
testinput_end:
hashlookup:
;HL points to the input
;BC is the max size
;return nc if failed to match, c if success
    ld (input_save),hl
    ld (input_savesize),bc
    ld a,b
    or c
    jr z,match_null
    call computehash
    ld hl,(input_savesize)
    xor a
    sbc hl,bc
    jr z,match_fail
    ld b,h
    ld c,l
    ld d,a
    ex de,hl
    add hl,hl
    ld (hash),hl


    ld de,hashlut_builtin
    add hl,de
    ld e,(hl)
    inc hl
    ld d,(hl)

    ld hl,(input_save)
;BC is the input size
;HL points to the input string
;DE points to the comparison
    ld a,(de)
    cp c
    jr nz,match_fail
    inc de
    ld b,c

_:
    ld a,(de)
    inc de
    cp (hl)
    jr nz,match_fail
    inc hl
    djnz -_
    scf
    ret
match_null:
    ld de,t_null+1
match_fail:
    ld hl,(input_save)
    ld bc,(input_savesize)
    or a
    ret
computehash:
    ld e,0
_:
    ld a,(hl)
    sub 48
    ret c
    cp 10
    jr c,$+5
    sub 7
    ret c
    cp 68
    ret nc
    ld d,a
    add a,a
    add a,d
    xor e
    ld e,a
    cpi
    jp pe,-_
    ret
   
   
hashlut_builtin:
.dw t_null              ;00
.dw t_null              ;01
.dw t_null              ;02
.dw t_null              ;03
.dw t_null              ;04
.dw t_null              ;05
.dw t_null              ;06
.dw t_null              ;07
.dw t_null              ;08
.dw t_null              ;09
.dw t_null              ;0a
.dw t_cosh              ;0b
.dw t_null              ;0c
.dw t_null              ;0d
.dw t_null              ;0e
.dw t_null              ;0f
.dw t_null              ;10
.dw t_null              ;11
.dw t_atan              ;12
.dw t_null              ;13
.dw t_sinh              ;14
.dw t_null              ;15
.dw t_null              ;16
.dw t_null              ;17
.dw t_null              ;18
.dw t_null              ;19
.dw t_null              ;1a
.dw t_null              ;1b
.dw t_sqrt              ;1c
.dw t_null              ;1d
.dw t_null              ;1e
.dw t_max               ;1f
.dw t_null              ;20
.dw t_null              ;21
.dw t_ClrDraw           ;22
.dw t_null              ;23
.dw t_null              ;24
.dw t_null              ;25
.dw t_null              ;26
.dw t_null              ;27
.dw t_null              ;28
.dw t_Ellipse           ;29
.dw t_null              ;2a
.dw t_null              ;2b
.dw t_null              ;2c
.dw t_null              ;2d
.dw t_null              ;2e
.dw t_null              ;2f
.dw t_null              ;30
.dw t_null              ;31
.dw t_null              ;32
.dw t_null              ;33
.dw t_null              ;34
.dw t_null              ;35
.dw t_null              ;36
.dw t_null              ;37
.dw t_null              ;38
.dw t_null              ;39
.dw t_ln                ;3a
.dw t_null              ;3b
.dw t_null              ;3c
.dw t_pi                ;3d
.dw t_null              ;3e
.dw t_null              ;3f
.dw t_null              ;40
.dw t_null              ;41
.dw t_null              ;42
.dw t_null              ;43
.dw t_null              ;44
.dw t_null              ;45
.dw t_null              ;46
.dw t_null              ;47
.dw t_null              ;48
.dw t_null              ;49
.dw t_null              ;4a
.dw t_null              ;4b
.dw t_null              ;4c
.dw t_null              ;4d
.dw t_null              ;4e
.dw t_null              ;4f
.dw t_null              ;50
.dw t_null              ;51
.dw t_null              ;52
.dw t_null              ;53
.dw t_null              ;54
.dw t_null              ;55
.dw t_null              ;56
.dw t_null              ;57
.dw t_null              ;58
.dw t_null              ;59
.dw t_null              ;5a
.dw t_null              ;5b
.dw t_null              ;5c
.dw t_null              ;5d
.dw t_null              ;5e
.dw t_null              ;5f
.dw t_null              ;60
.dw t_null              ;61
.dw t_null              ;62
.dw t_null              ;63
.dw t_null              ;64
.dw t_null              ;65
.dw t_null              ;66
.dw t_null              ;67
.dw t_null              ;68
.dw t_randint           ;69
.dw t_asinh             ;6a
.dw t_null              ;6b
.dw t_tan               ;6c
.dw t_null              ;6d
.dw t_null              ;6e
.dw t_null              ;6f
.dw t_null              ;70
.dw t_null              ;71
.dw t_null              ;72
.dw t_null              ;73
.dw t_null              ;74
.dw t_acosh             ;75
.dw t_null              ;76
.dw t_null              ;77
.dw t_null              ;78
.dw t_null              ;79
.dw t_null              ;7a
.dw t_null              ;7b
.dw t_null              ;7c
.dw t_null              ;7d
.dw t_null              ;7e
.dw t_null              ;7f
.dw t_null              ;80
.dw t_atanh             ;81
.dw t_null              ;82
.dw t_null              ;83
.dw t_null              ;84
.dw t_null              ;85
.dw t_Line              ;86
.dw t_sin               ;87
.dw t_null              ;88
.dw t_null              ;89
.dw t_e                 ;8a
.dw t_null              ;8b
.dw t_null              ;8c
.dw t_mod               ;8d
.dw t_null              ;8e
.dw t_null              ;8f
.dw t_null              ;90
.dw t_min               ;91
.dw t_Circle            ;92
.dw t_gcd               ;93
.dw t_null              ;94
.dw t_null              ;95
.dw t_null              ;96
.dw t_null              ;97
.dw t_cos               ;98
.dw t_null              ;99
.dw t_null              ;9a
.dw t_null              ;9b
.dw t_null              ;9c
.dw t_null              ;9d
.dw t_null              ;9e
.dw t_null              ;9f
.dw t_null              ;a0
.dw t_log2              ;a1
.dw t_null              ;a2
.dw t_Tilemap           ;a3
.dw t_log10             ;a4
.dw t_null              ;a5
.dw t_null              ;a6
.dw t_null              ;a7
.dw t_null              ;a8
.dw t_Text              ;a9
.dw t_null              ;aa
.dw t_null              ;ab
.dw t_null              ;ac
.dw t_null              ;ad
.dw t_Disp              ;ae
.dw t_null              ;af
.dw t_null              ;b0
.dw t_null              ;b1
.dw t_null              ;b2
.dw t_null              ;b3
.dw t_null              ;b4
.dw t_null              ;b5
.dw t_null              ;b6
.dw t_null              ;b7
.dw t_DispBuf           ;b8
.dw t_lcm               ;b9
.dw t_setseed           ;ba
.dw t_null              ;bb
.dw t_null              ;bc
.dw t_null              ;bd
.dw t_null              ;be
.dw t_null              ;bf
.dw t_pow10             ;c0
.dw t_null              ;c1
.dw t_null              ;c2
.dw t_null              ;c3
.dw t_null              ;c4
.dw t_pow2              ;c5
.dw t_null              ;c6
.dw t_null              ;c7
.dw t_null              ;c8
.dw t_null              ;c9
.dw t_null              ;ca
.dw t_null              ;cb
.dw t_null              ;cc
.dw t_null              ;cd
.dw t_null              ;ce
.dw t_null              ;cf
.dw t_null              ;d0
.dw t_null              ;d1
.dw t_null              ;d2
.dw t_null              ;d3
.dw t_null              ;d4
.dw t_null              ;d5
.dw t_null              ;d6
.dw t_null              ;d7
.dw t_null              ;d8
.dw t_null              ;d9
.dw t_null              ;da
.dw t_null              ;db
.dw t_null              ;dc
.dw t_Shiftbuf          ;dd
.dw t_randseed          ;de
.dw t_null              ;df
.dw t_null              ;e0
.dw t_null              ;e1
.dw t_exp               ;e2
.dw t_Output            ;e3
.dw t_null              ;e4
.dw t_Sprite            ;e5
.dw t_acos              ;e6
.dw t_null              ;e7
.dw t_Rect              ;e8
.dw t_null              ;e9
.dw t_null              ;ea
.dw t_null              ;eb
.dw t_null              ;ec
.dw t_rand              ;ed
.dw t_null              ;ee
.dw t_null              ;ef
.dw t_null              ;f0
.dw t_null              ;f1
.dw t_null              ;f2
.dw t_null              ;f3
.dw t_null              ;f4
.dw t_null              ;f5
.dw t_null              ;f6
.dw t_null              ;f7
.dw t_null              ;f8
.dw t_asin              ;f9
.dw t_null              ;fa
.dw t_null              ;fb
.dw t_null              ;fc
.dw t_null              ;fd
.dw t_null              ;fe
.dw t_tanh              ;ff
t_null:
.db 0
t_cosh:
.db 4,"cosh"
#include "commands\cosh.z80"
t_atan:
.db 4,"atan"
#include "commands\atan.z80"
t_sinh:
.db 4,"sinh"
#include "commands\sinh.z80"
t_sqrt:
.db 4,"sqrt"
#include "commands\sqrt.z80"
t_max:
.db 3,"max"
#include "commands\max.z80"
t_ClrDraw:
.db 7,"ClrDraw"
#include "commands\ClrDraw.z80"
t_Ellipse:
.db 7,"Ellipse"
#include "commands\Ellipse.z80"
t_ln:
.db 2,"ln"
#include "commands\ln.z80"
t_pi:
.db 2,"pi"
#include "commands\pi.z80"
t_randint:
.db 7,"randint"
#include "commands\randint.z80"
t_asinh:
.db 5,"asinh"
#include "commands\asinh.z80"
t_tan:
.db 3,"tan"
#include "commands\tan.z80"
t_acosh:
.db 5,"acosh"
#include "commands\acosh.z80"
t_atanh:
.db 5,"atanh"
#include "commands\atanh.z80"
t_Line:
.db 4,"Line"
#include "commands\Line.z80"
t_sin:
.db 3,"sin"
#include "commands\sin.z80"
t_e:
.db 1,"e"
#include "commands\e.z80"
t_mod:
.db 3,"mod"
#include "commands\mod.z80"
t_min:
.db 3,"min"
#include "commands\min.z80"
t_Circle:
.db 6,"Circle"
#include "commands\Circle.z80"
t_gcd:
.db 3,"gcd"
#include "commands\gcd.z80"
t_cos:
.db 3,"cos"
#include "commands\cos.z80"
t_log2:
.db 4,"log2"
#include "commands\log2.z80"
t_Tilemap:
.db 7,"Tilemap"
#include "commands\Tilemap.z80"
t_log10:
.db 5,"log10"
#include "commands\log10.z80"
t_Text:
.db 4,"Text"
#include "commands\Text.z80"
t_Disp:
.db 4,"Disp"
#include "commands\Disp.z80"
t_DispBuf:
.db 7,"DispBuf"
#include "commands\DispBuf.z80"
t_lcm:
.db 3,"lcm"
#include "commands\lcm.z80"
t_setseed:
.db 7,"setseed"
#include "commands\setseed.z80"
t_pow10:
.db 5,"pow10"
#include "commands\pow10.z80"
t_pow2:
.db 4,"pow2"
#include "commands\pow2.z80"
t_Shiftbuf:
.db 8,"Shiftbuf"
#include "commands\Shiftbuf.z80"
t_randseed:
.db 8,"randseed"
#include "commands\randseed.z80"
t_exp:
.db 3,"exp"
#include "commands\exp.z80"
t_Output:
.db 6,"Output"
#include "commands\Output.z80"
t_Sprite:
.db 6,"Sprite"
#include "commands\Sprite.z80"
t_acos:
.db 4,"acos"
#include "commands\acos.z80"
t_Rect:
.db 4,"Rect"
#include "commands\Rect.z80"
t_rand:
.db 4,"rand"
#include "commands\rand.z80"
t_asin:
.db 4,"asin"
#include "commands\asin.z80"
t_tanh:
.db 4,"tanh"
#include "commands\tanh.z80"

poparg:
    ret
tostr:
    ret
s_NOTFOUND:
    .db "Command Not Found",0
input_save:
    .dw 0
input_savesize:
    .dw 0
hash:
    .dw 0
The code for the Disp.z80 file, already convoluted for your viewing pleasure :P
Code: [Select]
    .dw __tokenizeDisp  ;First code generates the token, returns a pointer to the name in HL, size of the name in BC
    .dw __parsecodeDisp
    .dw __compiledcodeDisp
    .dw __helpDisp

__tokenizeDisp:
    ld hl,+_
    ld bc,1
    ret
_:
    .db tokGraphics+0
__helpDisp:
    .db 0
__compiledcodeDisp:
    .db _poparg
    .db _tostr
    .db _inlineasm \ .dw +_-$-2
    bcall(_PutS)
    ret
_:
__parsecodeDisp:
;    call poparg
;    call tostr
    bcall(_PutS)
    bcall(_NewLine)
    ret
Edit:
Oops, and the grammer3.inc file (don't get your hopes up, it's just a test :P).
Code: [Select]
tokVars     = $00   ;allows for 32 variable types (int, str, float, etc.)
tokGraphics = $20   ;allows for 32 basic graphics commands (plot(), Pxl(), Rect(), etc.)
tokControl  = $40   ;allows for 32 control commands (If, Goto, etc.)
tokMath     = $60   ;allows for 128 basic math commands
tokExtend   = $E0
tokInclude  = $F0

var_uint8   = 0
var_int8    = 1
var_uint16  = 2
var_int16   = 3
var_uint32  = 4
var_int32   = 5
var_uint64  = 6
var_int64   = 7
var_float   = 8
var_floatext= 9     ;extended precision float
var_fixed88 = 10    ;8.8 fixed point number, optimized for speed over var_fixed
var_fixed1616=11    ;16.16 fixed point number, optimized for speed over var_fixed
var_fixed   = 12    ;customized size for integer and fractional part, up to 64 bits total
var_rational= 13    ;customized size for the numerator and denominator of: int8, int16, int32, and int64. Denominator is always unsigned.
var_complex = 14    ;complex pair of floats.
var_complexext=15   ;complex pair of extended precision floats.
var_gaussian= 16    ;complex pair of integers. Customized size for the real and imaginary parts: int8, int16, int32, and int64.
var_gaussian8=17
var_gaussian16=18


var_sprite  = 28
var_array   = 29
var_list    = 30
var_str     = 31


_poparg     = 0
_pusharg    = 1
_tostr      = 2
_inlineasm  = 3

11
Math and Science / A faster Newton's Method Square Root
« on: October 10, 2017, 02:04:50 pm »
I decided to compute some square roots by hand to keep my tired self awake and alert at work. I opted to use Newton's Method this time and I noticed an optimization!
Instead of attaining 2N precision with a 2N-by-N->2N division, I split it up into an N-by-N multiply and a N-by-N->N division, which on the Z80 or similar processor can be nearly twice as fast.


The standard algorithm works as follows:

To find the square root of c, start with a guess called x.
Iterate the following: (x+c/x)/2


Each iteration doubles the digits of precision; try it on your calculator if you aren't familiar with it! To fin the square root of 21, guess '5'. Type 5, hit [Enter]. Then do .5(Ans+21/Ans and hit [Enter] a few times.

The way it works is, if the current x is bigger than the square root of c, then c/x is less, and vice-versa. So averaging the two gets a closer estimate to the actual square root. My observation was that if I had, say, 3 digits precision, then those digits stay the same in the next iteration. Through some observations, this would mean the first 3 digits of c/x will match that of x. In fact, for c/x, I'll only need to compute it to 6 digits, but if I can skip the first 3, then the division is easier and faster! So, lets optimize:


=>.5(x+c/x)
=>.5(x+x+(c/x-x))
=>x+.5(c-x*x)/x

So it's probably not so clear, but essentially for 2n-digits precision, I need to square an n-digit number and divide an n-digit number by n digits, to n digits precision. This replaces a 2n/2n division.

The former is faster, so let's try an example of finding the square root of 21:

c=21
Guess: x=5
(c-x*x) = -4
-4/5/2 = -.4
x-.4 = 4.6

(c-x*x) = 21-4.6*4.6 = -.16
-.16/4.6/2 = -.0173913... ~.02
x-.02 = 4.58

(c-x*x) = 21-4.58*4.58 = .0236
.0236/4.58/2 = .0025764... ~.0026
x+.0026 = 4.5826

(c-x*x) = 21-4.5826*4.5826 = -.00022276
-.00022276/4.5826/2 = -.000024304979... ~-.000024305
x-.000024305 = 4.582575695



In practice by hand, it actually doesn't save much time, but that's because by hand I usually can perform a 2n-by-2n division a little slower than half that of a 2n-by-2n and I can do multiplication and division in roughly the same speed. So overall This gets me about 20% faster.

On the Z80, a 2N-by-N->2N division takes roughly the time of 1.6 2N-by-2N->4N multiplies. As well, a 2N-by-2N->4N multiply takes roughly 3.5 times the time of an N-by-N->2N multiply using recursive Karatsuba multiplication.

So the standard algorithm takes roughly 5.6 N-by-N->2N multiplies worth of time.
The modified algorithm takes roughly 2.9 N-by-N->2N multiplies worth of time.

12
TI Z80 / Shunting-Yard Algorithm
« on: September 27, 2017, 08:37:33 pm »
So I have been really wanting to implement the Shunting-Yard algorithm for a few years now and I think I've got it in assembly. I keep finding bugs, but in case other people are interested, here is my code. It assumes that the operator stack and output (combined) won't exceed 768 bytes.

Things that would be useful:
  • Converting numbers to an intermediate format.
  • Handling multi-byte tokens and named vars and functions.
This way the code can be pseudo-compiled when it is converted to RPN, then an RPN calculator can parse the code faster or prep it for actual compiling.
Code: [Select]
#define bcall(x) rst 28h \ .dw x
saveSScreen = 86ECh
scrap=saveSScreen
outhead=8000h
stackhead = 8002h
.db $BB,$6D
.org $9D95
    ld hl,test
    ld bc,test_end-test
    call shuntingyard
    call rpn
    ld hl,scrap
    bcall(450Ah)
    bcall(452Eh)
    ret
shuntingyard:
    ld de,scrap
    ld (outhead),de
    ld d,(scrap/256)+3
    ld (stackhead),de
_:
    ld a,(hl)
    call +_
    cpi
    jp pe,-_
    ld hl,scrap+768
    ld de,(stackhead)
    or a
    sbc hl,de
    ld b,h
    ld c,l
    ld hl,(outhead)
    ex de,hl
    jr z,$+3
    ldir
    dec de
    xor a
    ld (de),a
    ret
_:
    cp '.'
    jp z,num_dec
    cp 30h
    jr c,+_
    cp 3Ah
    jp c,num
_:
    cp '('
    jp nz,+_
    ex de,hl
    ld hl,(stackhead)
    dec hl
    ld (hl),','
    dec hl
    ld (hl),a
    ld (stackhead),hl
    ex de,hl
    ret
_:
    cp ')'
    jp nz,checkunops
    push hl
    push bc
    ld hl,scrap+768
    ld de,(stackhead)
    sbc hl,de
    jp z,ERR_Unmatched_lparens
    ld b,h
    ld c,l
    ex de,hl
    ld de,(outhead)
;BC is the size of the stack. Use this in case there is a missing ')' so we don't read garbage.
;basically search for the matching '(' while piping out the stack to the output.
outerloop:
    ld a,(hl)
    cp '('
    jr z,parens_found
    ld a,','
_:
    cp (hl)
    ldi
    jp po,ERR_Unmatched_lparens
    jr z,outerloop
    jp -_
parens_found:
    inc hl
    inc hl
    ld (outhead),de
    ld (stackhead),hl
    pop bc
    pop hl
    ret
checkunops:
checkbinops:
;; if the token is an operator, then:
;; while there is an operator at the top of the operator stack with
;; greater than or equal to precedence and the operator is left associative:
;; pop operators from the operator stack, onto the output queue.
;; push the read operator onto the operator stack.
;;
;;
    push bc
    ex de,hl
    call getprecedence
    ld a,c
    pop bc
    ex de,hl
    jp c,search_function
    ;now C is the precedence, with lower bit = 1 if left-associative
    push bc
    push hl
    ld de,(stackhead)
    ld hl,scrap+768
    sbc hl,de
    ld b,h
    ld c,l
    ld hl,(outhead)
    ex de,hl
    jr z,pushop
    ;a is the precedence against which to compare
_:
    push hl
    push bc
    push af
    ld a,(hl)
    call getprecedence
    jr c,+_
    pop hl
    ld a,h      ;incoming
    cp c
    jr nz,$+4
    rra \ nop

    pop bc
    pop hl
   
 ;======================================================
    jr nc,pushop
.echo "The following code only works until we have to add >1 byte tokens."
 ldi
 ldi
    jp pe,-_
    jp $+6
_:
    pop af
    pop bc
    pop hl
pushop:
    ld (outhead),de
    pop de
    dec hl
    ld (hl),','
    dec hl
    ld a,(de)
    ld (stackhead),hl
    ld (hl),a
    ex de,hl
    pop bc
    ret
search_function:
    jp ERR_Func_Not_Found
getprecedence:
    ld hl,binops
    ld b,(binops_end-binops)/2
_:
    cp (hl)
    inc hl
    ld c,(hl)
    ret z
    inc hl
    djnz -_
    scf
    ret
binops:
    .db 4,  $01
    .db '=',$50
    .db '|',$60
    .db '&',$70
    .db '-',$81     ;right associative is odd
    .db '+',$80     ;left associative is even
    .db '/',$83     ;right associative
    .db '*',$82     ;left associative
    .db '^',$85     ;right associative
binops_end:
num:
    ld de,(outhead)
_:
    ldi
    jp po,+_
    ld a,(hl)
    cp '.'
    jr z,num_dec+4
    cp 30h
    jr c,+_
    cp 3Ah
    jr c,-_
_:
    ld a,','
    ld (de),a
    inc de
    ld (outhead),de
    dec hl
    inc bc
    ret
num_dec:
    ld de,(outhead)
_:
    ldi
    jp po,+_
    ld a,(hl)
    cp 30h
    jr c,+_
    cp 3Ah
    jr c,-_
_:
    cp '.'
    jp z,ERR_Syntax_00
    ld a,','
    ld (de),a
    inc de
    ld (outhead),de
    dec hl
    inc bc
    ret
ERR_Syntax_00:      ;Too many decimal points.
ERR_Func_Not_Found:
ERR_Unmatched_lparens:
    ret
rpn:
    ret
test:
;    .db "(3.1415926535)"
.db "(3.142+6/2-7)*3^6*3"
test_end:

13
Hi all, I lost my previous references and example programs and it took me this morning to locate this algorithm, digest it, and spew out my own version.  I looked on all of my calculators and Omni first, so I'm going to post it here for when it happens again :P

Anyways, this is one of my favorite algorithms for evaluating logarithms:

Code: [Select]
;Natural Logarithm on [.5,2]
;Single precision
a=.5(1+x)
g=.5(a+x/a) ;half precision divide
g=.5(g+x/g) ;full precision divide
b=a
a=(a+g)/2
c=.5(a+g)
g=.5(c+a*g/c) ;full precision divide
c=a
b = a-b/4
a=(a+g)/2
c = a-c/4-b/16
return (x-1)(1-1/4)(1-1/16)/c
  • It achieves single precision accuracy (at least 24.4996 bits) on the range of inputs from [.5,2].
  • During range reduction, x is usually reduced to some value on [c,2c].
    • The best precision is found when c=.5sqrt(2) (range is [.5sqrt(2),sqrt(2)], achieving at least 31.5 bits
    • I prefer c=2/3 since 2/3 and 4/3 are equidistant from 1-- it makes it easier for me to analyze time complexity. This still offers at least 29.97 bits, which is better than single precision
  • Cost is:
    • amean: 7 . 'amean' is the same cost as an add in binary floats
    • half divide: 1
    • full divide: 3
    • multiply:    1
    • shift-by-2:  3
    • shift-by-4:  2. This is sightly more efficient on the Z80 than 4 single-shifts when values are in RAM
    • add/sub:     5
    • add/sub const:1

I derived this algorithm from this wonderful paper which is annoyingly never at the top of a Google search and in fact took me a loooong time to ever stumble upon it, sadly.

This paper greatly accelerates the classic Borchardt-Gauss algorithm to be on par with the AGM algorithm. At their core, both algorithms perform an arithmetic and geometric mean, but AGM requires them to be done in parallel (emulated on single-core processors by some simple variable juggling), whereas B-G does them sequentially. As well, AGM achieves quadratic convergence or better (I've seen some exponential convergence in specific special cases), whereas classic B-G usually achieves linear convergence. Carlson's version of the B-G algorithm adds O(log(n)^2) additions and O(log(n)) space for quadratic convergence (where n is the number of desired bits of accuracy).

I like the B-G-C algorithm since I can easily obtain the inverse trig functions and inverse hyperbolic functions as well as the natural logarithm.

14
Sorry, I'm on my phone so I'll probably not go too in-depth on this :( Bug me for details if I don't get around to it and you need them :P

So:
Given ##x\in[-.5ln2,.5ln2]##
Let ##y=x^{2}##
Let ##a=\frac{x}{2}\frac{1+\frac{5y}{156}\left(1+\frac{3y}{550}\left(1+\frac{y}{1512}\right)\right)}{1+\frac{3y}{26}\left(1+\frac{5y}{396}\left(1+\frac{y}{450}\right)\right)}##
Then ##e^{x}\approx\frac{1+a}{1-a}##

Accuracy is ~75.9666 bits.
7 increments, 1 decrement
6 constant multiplications
6 general multiplications
2 general divisions
1 div by 2 (right shift or decrement an exponent)

For comparison, that's comparable to 16 terms of the Taylor's series, or 8 terms of the standard Padé expansion (exponential is special in that it comes out to f(x)/f(-x) so it can be done even easier than most).

I basically carried out a Padé expansion for e^x to infinity, noticed that after the constant term all the even coefficients were zero, so I used a Padé expansion on that function to quickly find our approximation for a.

In my usage, I actually implemented a 2x function since I'm using binary floats with 64-bit precision. I take int(x) and save that for a final exponent for the float. Remove that value from x. By definition of int(), x is now non-negative. If x≥.5, increment that saved exponent, subtract 1 from x. Now x is on [-.5,.5]. Now we need to perform 2x, but that is equivalent to ex*ln(2). We've effectively applied range reduction to our original input and now we can rely on the algorithm at the top. That integer part that we saved earlier now gets added to the exponent byte, et voilà !

I think when I calculated max error using my floats it was accurate up to the last two bits or something. Don't quote me on that as I don't have those notes on me at the moment and it was done months ago.

15
TI Z80 / Floatlib
« on: May 14, 2016, 06:44:13 pm »
I made a cute little app for my calc that I thought I would share! It basically puts all of my single-precision floats into one app and provides a somewhat easy way to access the routines through a jump table.

It even features an in-app reference for the routines which is useful for when I'm programming on my calc, as well as a hexadecimal translation of the header just in case I forget the code.

I included two examples. The first simply computes and displays eπ. The second uses an algorithm similar to that of the Arithmetic-Geometric Mean algorithm to compute arctangent. It takes the input number as a string in Ans.

Warning: I've run into a few bugs (some are documented, others I haven't chased down yet). Pro-tip: Don't store your program variables in the range of scrap to about scrap+30. The default value for scrap is 8000h. If you change this, please note that the app expects the scrap region to be 256 bytes.

Pages: [1] 2 3 ... 13