General Discussion > Math and Science

A "new" compression format [subject to changes]

(1/4) > >>

harold:
I was trying to write a Deflate decompressor, and that's perfectly doable, but occasionally annoying. The "new" format I'm suggesting is basically Deflate with some changes. Some to make it less annoying, others because "why not". The goal is mostly "improved decompression speed".

The main structure is the same. Matches in a sliding window, with literals and match-lengths (and End of Block marker) in the same alphabet, and Huffman coding.

But there are changes.
The main change is that the Huffman codes are packed in dwords (specifically to help decoding with a 32bit "window"), starting at the most significant bit (helps with decoding). Those dwords are then saved with little-endian byte-order (to help x86 - ARM can work this more easily than x86 could in the reverse situation). They're aligned to their natural boundary, so there may be a little padding after the header (the point is to read in dwords, aligning them makes that more efficient on some architectures and only costs 3 bytes at worst).
The Huffman codes are, of course, canonical, but a different kind of canonical than in Deflate: long codes must begin with zeroes. IE, using the rule "shorter codes have a numerically (if filled with zeros to the right) higher value than longer codes". This ensures that no more than 9 rightmost bits can ever differ from zero, which keeps your decoding tables small without trickery.
The maximum length for a Huffman code is 31, up from 15. (that usually won't help, I'm throwing that in "because why not")
Furthermore, the match-length symbols start at 256, with EndOfBlock assigned the code 288. This makes it slightly more convenient to index a table directly with the match-symbol.
All 32 match-length symbols are used, following the same pattern of "extra bits" as in Deflate, but extended. The last 4 match-length-symbols have 6 "extra bits".
The distance codes work like in Deflate64 (which is like Deflate, but with codes 30 and 31 being valid and getting 14 "extra bits").

The suggested way to decode this is to take two dwords, shift them left (with the bits of the second dword appearing in the lsb of the first dword) by some amount, count the leading zeroes (you can OR with 1 and use BSR since the maximum length of one symbol is 31 bits anyway), shift right by some amount depending on the number of leading zeroes, add some offset depending on the number of leading zeroes, then index a table with that.
Alternatively, take a (possibly unaligned) qword, rotate it by 32 bits, then shift left (normally), count leading zeroes, etc..

The header is changed to this:
if the first byte is 0, end of stream
otherwise, the first dword (little-endian) has some bitfields:
bit 0(lsb): 1 (constant) (to ensure the first byte is not zero)
bit 1: 0 if the block is compressed, 1 is the block is stored uncompressed
bit 2-31: length of this block
if block is stored: just the bytes, raw.
if block is compressed:
dword (little endian): length of this block when decompressed
uint16 (little endian): is_present, bit[n] (numbered from lsb up) indicates that the range of symbols [n * 16 .. n * 16 + 15] is part of the alphabet (1) or not (0)
uint16 (little endian): is_default, bit[n] (see above) indicates that the range (see above) has its own 16bit mask (0) or is completely present (1)
For every range that needs a mask (ie is present and not default), an uint16 (little endian) that indicates for every symbol whether it's part of the alphabet (1) or not(0)
Two more masks for the match-lengths (these two masks are always present)
Then, some bytes to store the code lengths:
The lowest 5 bits indicate the code length (a length of 0 may be coded - enables you to drop a subrange mask and save a byte sometimes), the upper 3 bits are a repeat-count. If the repeat-count is 0, the repeat count is 8 + next_byte.
[up to 3 bytes padding to align the next part]

None of those changes are really spectacular, but IMO they all contribute a little bit to make the format less annoying.

The part that I'm least sure about is the header. I'm also not sure whether allowing longer match distances is a mistake or not. Actually I'm not sure whether the whole deal for encoding matches is the right way to go or not. I couldn't come up with anything better, but it's a pain in the behind to parse. For most lengths `n`, parsing the match takes so much time that decoding `n` literals would have been faster.

As always, suggestions are welcome (that's why I'm posting this in the first place). Especially if you've written a Deflate decompressor before (otherwise the changes might seem random).

SpiroH:
This sounds interesting although a bit too technical for many of the users. As you seem to be mostly interested in performance, i would suggest you carry out some benchmarking and then try to 'sell it' afterwards. Good luck with it. :)

harold:
I'll benchmark it of course, but it'll be hard to filter out the contribution of the format. I already have an implementation mostly done.

harold:
Ok, here are the first benchmark results.
For the file alice29.txt from the Canterbury Corpus, using only Huffman compression and no matches (those don't quite work yet), the decompression time on my machine is (approximately and rounded upwards) 1.2 miliseconds.
That's purely the decompression time, ie not counting file IO and such.
The uncompressed file size is 152,089 bytes, the compressed file size is 87,784 bytes (it would be a lot smaller with matches, about 55kb, but they don't really work yet).
That gives a output-throughput of 120MB/s. For reference, 7zip decompression speed (as measured by its built-in benchmarker) is about 50MB/s for a single core on my machine (limiting to a single core because my code is currently single-threaded, but that's not a limitation due to the format).
I expect it to slow down when I enable matches, though. Long matches will be fast (as fast as memcpy, essentially), but the more common short matches have a lot of overhead.

It isn't the most relevant comparison of course, LZMA doesn't even use Huffman compression at all. But it was the easiest comparison to make, and it's something.

DJ Omnimaga:
I wonder if such compression would be fast enough for calcs?

Navigation

[0] Message Index

[#] Next page

Go to full version