Omnimaga

Calculator Community => Other Calc-Related Projects and Ideas => TI Z80 => Topic started by: ACagliano on February 23, 2012, 09:36:57 am

Title: CalcRAR
Post by: ACagliano on February 23, 2012, 09:36:57 am: I believe that I once read about an idea by another programmer to make an Archiver of sorts, with the capacity to work with the DCS file structure. I learned on Cemetech that he put the project on hold. Reading this sort of made me want to work on a similar program myself, mainly for practice with different data types.

It'll work similarly to WinRAR for Windows and Stuffit Expander for Mac. That's why I name it CalcRAR. This program will create an archive file, a specially-formatted program with data that CalcRAR will read to determine what files to create, data to put in the files, and where to put the files. The file, after creation, will be compressed. When it is read, it will be decompressed. The best part, you can run this program just as you would anything else, given CalcRAR installed. The Archive is designed with the DCS AP feature, so when you run it, CalcRAR unpackages the Archive, automatically runs the program, then returns to CalcRAR, which removes the unpacked files and quits.

The reason I posted is because I need some advice. Is RLE the best compression to go with here? Or is another type better? And I may actually need some help implementing it.
Title: Re: CalcRAR
Post by: turiqwalrus on February 23, 2012, 10:14:20 am: well, the main problem here is that most options for compressing on-calc(especially if there's lots of features/fancy GUI, etc.) take up a lot of space, most of the time more than is saved by the compression of the program :P
Title: Re: CalcRAR
Post by: ACagliano on February 23, 2012, 10:18:57 am: Quote from: turiqwalrus on February 23, 2012, 10:14:20 am
well, the main problem here is that most options for compressing on-calc(especially if there's lots of features/fancy GUI, etc.) take up a lot of space, most of the time more than is saved by the compression of the program :P

Well, then what would you recommend? Just not compress it?
Title: Re: CalcRAR
Post by: turiqwalrus on February 23, 2012, 11:31:59 am: it might make more sense to compress it off-calc.
that would require computer access for every user though :\
I don't know...
Title: Re: CalcRAR
Post by: ACagliano on February 23, 2012, 11:54:04 am: Well, my idea was this...

The program is made, then copied to Archive. A second program is created in RAM. The first one is compressed and copied into the second. The first is then deleted. The second is the compressed Archive. That is how compression works.

Decompression I need to think about.
Title: Re: CalcRAR
Post by: DrDnar on February 23, 2012, 01:01:17 pm: I suggest looking up the DEFLATE standard (https://secure.wikimedia.org/wikipedia/en/wiki/DEFLATE), which is open and specified in an RFC, unlike RAR.
Title: Re: CalcRAR
Post by: ACagliano on February 23, 2012, 02:32:17 pm: Well, I'm a bit confused as to application in z80. Would you perhaps be so kind as to write me a routine that compresses and decompresses a file? Or perhaps psuedocode it.
Title: Re: CalcRAR
Post by: shmibs on February 23, 2012, 03:39:36 pm: Quote from: ACagliano on February 23, 2012, 02:32:17 pm
Would you perhaps be so kind as to write me a routine that compresses and decompresses a file? Or perhaps psuedocode it.
isn't that just asking for him to write the entire thing for you?
also, i was under the impression that more complex compression formats like this were compressed off-calc because the calculators have too little RAM to handle it. (see Iambian's pucrunch programs).
Title: Re: CalcRAR
Post by: ACagliano on February 23, 2012, 05:20:25 pm: Well, I would like him to demonstrate, in pseudocode, the basic concept behind performing compression and decompression like that. There is more to my project than just that. :)

Yes, I know that the RAM limitation is an issue. However, I'm looking for a compression algorithm that can be used on calc. If it can't, it defeats the purpose of this project.
Title: Re: CalcRAR
Post by: thepenguin77 on February 23, 2012, 05:34:54 pm: Quote from: shmibs on February 23, 2012, 03:39:36 pm
Quote from: ACagliano on February 23, 2012, 02:32:17 pm
Would you perhaps be so kind as to write me a routine that compresses and decompresses a file? Or perhaps psuedocode it.
isn't that just asking for him to write the entire thing for you?
also, i was under the impression that more complex compression formats like this were compressed off-calc because the calculators have too little RAM to handle it. (see Iambian's pucrunch programs).

I think Iambian made a pucrunch compressor as an Axiom (or something similar to that).

For ACagliano, there are a couple different methods you could use, but RLE is definitely not the right choice. There are two main reasons you wouldn't use RLE. The first is that most programs on calc do not have long strips of the byte and the second is that if a program did have long amounts of the same number, the programmer would either compress it themselves or render that on the fly. What this means is that I would assume that RLE would make most programs larger

So, for a better alternative, so far in my life I've used LZ77 and DEFLATE. LZ77 is what's called a floating window where you search the past X bytes to see if you can find a match for the data your are currently looking at. X can be whatever you want, but for this project, since you probably won't be compressing stuff that won't fit in ram, you can just search the whole program for a match. DEFLATE uses the exact same principle as LZ77 except it also compresses the "data" used to tell the decompressor where to search for a byte.

What I would suggest is that first you take a look at LZ77 and get that running. Once you're happy with how that works (even if you're still increasing size when you compress), you can then make a few slight modifications and you'll be using DEFLATE, at which point you'll probably get some nice ratios.

If you want some source examples, I've done LZ77 and DEFLATE compression on the computer and I've decompressed them both on the calculator. (Though DEFLATE was done in TruVid so the code is rather scary (and runs at $F000)). The only problem with my code is that I made up my own data format to fit my needs.

And for information on these two, the LZ77 Wikipedia page (http://en.wikipedia.org/wiki/LZ77) is probably enough to get you going. For DEFLATE this (http://www.gzip.org/zlib/rfc-deflate.html) is the page I used. (You can't make up DEFLATE like you can LZ77)
Title: Re: CalcRAR
Post by: ACagliano on February 23, 2012, 05:45:10 pm: Thanks, thepenguin. Instead of showing me the source code, can you perhaps try to give me an example stream of data, and then that same data compressed in LZ77. From there, perhaps I could decipher it.
Title: Re: CalcRAR
Post by: thepenguin77 on February 23, 2012, 06:42:49 pm: LZ77 is good for really big amounts of data, and it gets rid of byte boundaries, but here's a small example.

First of all, since you'll be working with sizes no bigger than calculator ram, you should probably make your sliding window just big enough to include the whole program, (2,048, 4,096, 8,192, 16,384). So, for this example, I'll go with 2,048 (11 bits). Also, for max match size, I'll use 7 bits.

Also, 0 will mean uncompressed and will be followed by the number of uncompressed bytes (7 bits) and 1 will mean compressed and will be followed by search distance then size.

Key:
(1 bit)
8 bits
[11 bits]
{7 bits}

00 01 02 03 04 05 06 07 08 09 00 00 06 07 08 09 00 00 06 02 03 04 05 06 08 09 04 03 06 06 06 06 06 06 06 06 06 06 04 04 05 08 09 09 01 01 01 02 03 04 05

(0) {12} 00 01 02 03 04 05 06 07 08 09 00 00 (1) [6] {7} (1) [17] {5} (0) {5} 08 09 04 03 06 (1) [01] {9} (0) {8} 04 04 05 08 09 09 01 (1) [45] {5}

That wasn't easy. In this small example, even though we didn't fully make use of all the bits we had in the size fields, and this data was super short, we still went from 408 bits to 292 bits.

Also, if you manage to follow what happened there, you'll see that this method actually encompasses RLE ;D
Title: Re: CalcRAR
Post by: ACagliano on February 24, 2012, 08:41:54 am: I'm sorry thepenguin, but I am sooo lost.
Title: Re: CalcRAR
Post by: shmibs on February 24, 2012, 12:30:30 pm: in his compressed data, if a bit is a 0, it will tell the decompressor that the next section of bytes in the compressed stream is a string of uncompressed bytes. the next 7 bits in that same byte indicate the number of uncompressed bytes that will follow. the first uncompressed section in his code is this:
(0) {12} 00 01 02 03 04 05 06 07 08 09 00 00
with a 0 bit indicating that it's uncompressed, and the next 7 holding the number 12, indicating 12 bytes of uncompressed data will follow.

after that, the decompressor checks for a 0 or 1 in the next bit again. if it sees a 1, it knows that the next session is compressed, in that, rather than storing uncompressed data, it checks the number stored in the next 11 bits to find the point in the data that is being pointed to. it then checks the 7 bits after that, which hold the number of bytes to be copied from that spot in the data.

in his example, the first compressed section is this:
(1) [6] {7}
firstly, i'm fairly certain that this should be
(1) [7] {6}
instead, which says, move into the compressed data 7 bytes, and copy the next 6 bytes to that point in the uncompressed data.
he's taking advantage that this section:
06 07 08 09 00 00 06 07 08 09 00 00
has the same string of bytes occur twice in a row to have it only occur once in the compressed data, and just refer back to it for the second one.

i hope that helps a bit. =)
Title: Re: CalcRAR
Post by: ACagliano on February 24, 2012, 12:40:35 pm: All it helps me to understand is the first bit. lol

So, bit 1 is the compressed/uncompressed status.
Bit 2-8 is the length of the data string.
What is the 11 and the other?
Title: Re: CalcRAR
Post by: shmibs on February 24, 2012, 01:12:27 pm: alright, well that's a start :)
what his compressor does is check for the same string of data occuring twice. the compressor looks at the compressed data it has already written and searches for a byte that equals the byte it's currently at in the uncompressed data. if it finds that byte, it then looks at the next byte in the uncompressed data and checks if that one matches up with the byte after the one it found in the compressed data as well. it keeps doing this until two of the bytes don't match. if the compressor wants to be absolutely certain that it has found the best possible match for the current point in the uncompressed data, it can store the position and size of the data that matched there and continue searching through the rest of the compressed data up to its current position in the same way, storing the new position and size if it finds a better match(one that is longer). it then writes the position and size of the best match in the compressed data and moves to the point in the uncompressed data that follows the section just matched. conversely, if it never found a match, it writes a 0, stores a 1 into the next 7 bits, and then adds the byte to the compressed data. it then moves to the next byte in the uncompressed data, and, if it can't find a match for that one either, adds it to the compressed data and updates the 7 bit number to hold a 2 instead.

since this compression method depends on matches in previous sections of the compressed data itself, when you first start compressing, everything will initially be stored as uncompressed data, as there is nothing to search through to find matches yet. also, you can change the number of bytes in a row that must match for the match to be considered "good enough" to be added as compressed data. depending on the file, this could either increase or decrease the resulting size, so it's a good idea to include an option to define this size yourself prior to compression and then test different sizes to see which comes out best.

in review, the compressing program looks through the data that has already been compressed and tries to find a string of bytes that matches the one it's currently looking at in the data. in the example, once the compressing program reaches this point in the data:

00 01 02 03 04 05 06 07 08 09 00 00 | 06 07 08 09 00 00 06 02 03 04 05 06 08 09 04 03 06 06 06 06 06 06 06 06 06 06 04 04 05 08 09 09 01 01 01 02 03 04 05

Key:
(1 bit)
8 bits
[11 bits]
{7 bits}

and if it has a minimum match length of 3 bytes, it will currently have this much compressed code written:

(0) {12} 00 01 02 03 04 05 06 07 08 09 00 00

and the next section written in the compressed data will be this:

(1) [7] {6}

where the (1) indicates that the next section is compressed,

[7] indicates that the data to be compressed begins 7 bytes from the beginning of the compressed data, and

{6} indicates that 6 bytes are to be referenced from that point, those 6 bytes being 06 07 08 09 00 00

the compressor will then have this much written to the compressed program:

(0) {12} 00 01 02 03 04 05 06 07 08 09 00 00 (1) [7] {6}

and will be at this point in the data:

00 01 02 03 04 05 06 07 08 09 00 00 06 07 08 09 00 00 | 06 02 03 04 05 06 08 09 04 03 06 06 06 06 06 06 06 06 06 06 04 04 05 08 09 09 01 01 01 02 03 04 05

EDIT: i wrote this without having ever used this compression method myself, and just noticed that LZ77 looks through the UNcompressed data when searching for matches rather than the compressed (which makes a whole lot more sense, now that i think about lt :P) otherwise, though, this should still hold true.
Title: Re: CalcRAR
Post by: ACagliano on February 24, 2012, 02:03:12 pm: I really do give up. lol. Sorry. Compression was never my strong point.

Edit: If someone want to write a compression/decompression routine, I will be eternally grateful. But, for the time being, I'm afraid, it is outside my skill level.
Title: Re: CalcRAR
Post by: DJ Omnimaga on February 24, 2012, 02:38:17 pm: An decompressor/compressor that is on and off calc would be nice, because it would allow some calc games to take much less space when some files are not needed. It would kinda be like some Celtic III features allowing you to package multiple programs together, except that they would be compressed.
Title: Re: CalcRAR
Post by: shmibs on February 24, 2012, 02:59:39 pm: iambian already made both of those.
athena (http://ourl.ca/8448) packages and compresses a bunch of files together, so that all take up less space, and his pucrunch axiom and asm routines make it so asm and axe programs can decompress files during runtime, so they take up less space in the archive when not in use.
Title: Re: CalcRAR
Post by: ACagliano on February 24, 2012, 03:04:28 pm: Oh. Can you link me to the pucrunch axiom? I'll just use that. Will it be sufficient?
Title: Re: CalcRAR
Post by: shmibs on February 24, 2012, 03:08:48 pm: here is it: link (http://ourl.ca/9960)

it's a bit confusing to use, but a lot simpler than writing your own routines =)
Title: Re: CalcRAR
Post by: DJ Omnimaga on February 24, 2012, 03:19:08 pm: Quote from: shmibs on February 24, 2012, 02:59:39 pm
iambian already made both of those.
athena (http://ourl.ca/8448) packages and compresses a bunch of files together, so that all take up less space, and his pucrunch axiom and asm routines make it so asm and axe programs can decompress files during runtime, so they take up less space in the archive when not in use.
Oh wait I thought it didn't do any compression. But does it let you unzip/rar files directly from TI-BASIC/Axe code like Celtic III with groups? ???
Title: Re: CalcRAR
Post by: shmibs on February 24, 2012, 03:41:39 pm: athena can't be run directly, as in it has a gui and can't be passed names for what to decompress. that doesn't seem like it would be too difficult to add, but, as it is, you could just tell people what to decompress in the program, have them do it, and then check afterwards if they did it right, or even use it as a method to choose what level to play, assuming levels are packaged separately.
Title: Re: CalcRAR
Post by: DJ Omnimaga on February 24, 2012, 03:50:29 pm: Yeah I was asking because it would be nice for BASIC coders, so users don't need to constantly ungroup chapters like in Illusiat 10, 11 and 12 or FFTOM.
Title: Re: CalcRAR
Post by: Iambian on February 25, 2012, 01:11:18 am: Just to be clear, the pucrunch Axiom does not do any compression. All it does is decompress stuff. It is up to the user to use pucrunch to compress their data from the PC side of things.

The Athena Packager/Installer project does all compression PC-side, and includes a utility on-calc that does the decompression.

However, the idea that games should be able to do their own package management is a good idea, and should be included in the next version of Athena. I'll have to think about it.
Title: Re: CalcRAR
Post by: ACagliano on February 25, 2012, 06:31:45 pm: Ok, let me see if I understand this....

to compress using LZ77.

1. Set a pointer to the start of data to compress.
2. Search the data for the longest string of data that does not repeat. Leave that uncompressed.
3. Locate other sequences that emulate parts of the data string in 2 and write them as references to it.

Is that right?
Title: Re: CalcRAR
Post by: shmibs on February 25, 2012, 06:47:28 pm: i don't think so.
from what i've gathered, reading what thepenguin and wikipedia had to say, if string 1 is the uncompressed data and string 2 is the compressed data.

1. have a pointer set to the beginning of string 1 and look back through previous parts of string 1 to find sections that match the current position.

2. if there isn't any match (as will be the case when you start, because there isn't any previous data to look back at) write the byte directly to string 2 and move the pointer in string 1 forward one byte; if there is a match, write the pointer value and size of the section matched in string 1 to string 2, and move the pointer in string 1 forward to the end of the section that was matched.

and then, when decompressing

1. check if the current section in string two is indicated as compressed or uncompressed
2. if it's uncompressed, write directly to string 1 and move the pointer in string 2 forward to the next section; if it's compressed, copy the data indicated in string 1 (which you have already uncompressed and written, because the decompression is linear) to the end of string 1, and move the pointer in string 2 to the next section.

i was wrong before when i said the pointer is to a section in the compressed data. it really points to a section in the UNcompressed data, which makes a lot more sense because there's more data available in which to find matches.
Title: Re: CalcRAR
Post by: ACagliano on February 25, 2012, 06:51:32 pm: Ok, I understand that now. But, my question is...How many bytes should your compressor check for at a time. What stops the compressor from taking the whole file as String1?
Title: Re: CalcRAR
Post by: ACagliano on February 28, 2012, 09:09:45 am: bump
Title: Re: CalcRAR
Post by: thepenguin77 on March 02, 2012, 11:09:19 pm: Good bump, I missed this.

As far as your question, honestly, the entire program should be String1, that will get you higher compression ratios.

I used to have a smaller version of this example, but honestly, I think my long drawn out example is easier to understand. The confusion really is in the details.

More realistic example:
1. Determine what the maximum distance your compressor will have to search in reverse in order to find a match. (For simplicity, put your entire program into the search window.) Take this distance and round up to the nearest power of 2. (2^11, 2^12, 2^13, etc)
2. Write the beginning of the program to _old, write the location of a new (large) buffer to _new. _new will actually increase by bits, not bytes. This will take special handling.
3. Start a counter called _count, this will represent how many uncompressable bytes we've seen since the last uncompressed chunk was written. This starts at 0.

Loop1:
4. Grab the byte at _old. Starting at the beginning of your data to compress, search for this byte stopping only at _old-1.

Loop2: (for each match)
5. See if the next byte matches the next byte after _old.
6. If it matches, check the next one as well. Keep repeating loop2 until you either have a mismatch or you reach your string length (128 - 7 bits is probably a good choice)

(For this stage, all that is required is that the beginning of the match starts before _old, the actual matching string is allowed to go past _old. This ability allows for this method to perform RLE. (Think it through how that would work, _old would point to the second byte of the repeating byte sequence) It is also important that you do not allow your program to search for matches starting at _old. This would result in finding one super long match all the way to the end of your program.)

7. Determine which match you found is the longest.
8. If the longest match is longer than 2 bytes GOTO Compressed.
If the longest match is shorter than 2 bytes GOTO Uncompressed. (The reason for 2 is this: the data for compressed data is a 1 bit flag, a 7 bit ptr, and then a 10-14 bit size word. This means compressed data actually takes up about 2.3 bytes. If we only found a 2 byte match, the compressed data would actually be bigger.)

Uncompressed:
9. Increase _old
10. Increase _count
11. If _count is at 127, follow the procedure in Compressed for writing uncompressed data (the 0 bit is needed still)
12. GOTO Finish

Compressed:
13. Write a zero bit to _new
14. Write _count (7 bits) to _new
15. Write _count bytes starting at _old-_count to _new
16. Write 0 to _count
17. Write a 1 bit to _new
18. Write the offset to the match we found to _new (variable bits)
19. Write the size of the match we found to _new (7 bits)

Finish:
20. GOTO loop1 until _old is at the end of the uncompressed buffer.
21. If _count is >0, write those bytes

This will write the compressed data, however, you're also going to want a header. In this header, you'll want to write things like the uncompressed size as well as the number of bits used for your offset.

Now hopefully you understand ;D