### Runer112's Huffman Compression Tutorial

Submitted By: Runer112 Date: June 14, 2011, 03:04:45 pm Views: 984

Huffman compression doesn't employ any tactic involving null-terminated strings of 1s. It relies on an algorithm that simply encodes more common values with smaller strings of bits and less common values with larger strings of bits, with 1s and 0s throughout. It does so by counting the number of occurrences of each value present in a set of data to give each value a "weight." Then, the two units with the smallest total "weight" are paired together, becoming a new unit with the combined weight. The two units with the smallest "weights" are continually paired together until all units have been joined into a tree with one root element. This will result in values with a higher weight being further up the tree and having been assigned a shorter bit string. Here's is an example of a finished Huffman tree that I shamelessly stole from Wikipedia, based on the string: j'aime aller sur le bord de l'eau les jeudis ou les jours impairs

Anyways, back to ralphdspam's question.

Can someone explain how to save a tree for Huffman compression? That would be helpful.

Seeing as you're asking how to save a tree, I'll go ahead assuming you already know how to make a tree like above. If not, feel free to ask how that's done, or you can read the Wikipedia article which is how I learned how Huffman coding works.

There may be better ways than the method I'm about to describe, but I'm not very familiar with binary heap/tree storage so I'll just explain the simplest way I could imagine!

I would code the tree starting from the top down as an array consisting of nodes and leaves. Just to clarify some terms before we continue: a node is a point on the tree that branches into two other elements, and a leaf is a terminal element that doesn't branch further. So nodes would be the circles in the above diagram, and leaves would be the ends that represent pieces of data.

I would build the tree into the array by following node branches down to leaves, and after reaching each leaf, backtracking until you reach an unvisited branch and exploring that branch. And every time you visit a node or leaf that hasn't yet been reached, assign it the next element in the array. The above tree could be navigated like so by always navigating 0-branches with priority:

Spoiler For A pretty big step-by-step table explaining how one would determine the order of elements in the array:

 Step Current location Table so far 0 null null 1 0 null   0 2 00 null   0   00 3 000 null   0   00   000 4 0000 null   0   00   000   0000 5 00000 null   0   00   000   0000   00000 6 000000 null   0   00   000   0000   00000   000000 7 00000 null   0   00   000   0000   00000   000000 8 000001 null   0   00   000   0000   00000   000000   000001 9 00000 null   0   00   000   0000   00000   000000   000001 10 0000 null   0   00   000   0000   00000   000000   000001 11 00001 null   0   00   000   0000   00000   000000   000001   00001 12 0000 null   0   00   000   0000   00000   000000   000001   00001 13 000 null   0   00   000   0000   00000   000000   000001   00001 14 0001 null   0   00   000   0000   00000   000000   000001   00001   0001 15 000 null   0   00   000   0000   00000   000000   000001   00001   0001 16 00 null   0   00   000   0000   00000   000000   000001   00001   0001 17 001 null   0   00   000   0000   00000   000000   000001   00001   0001   001 18 00 null   0   00   000   0000   00000   000000   000001   00001   0001   001 19 0 null   0   00   000   0000   00000   000000   000001   00001   0001   001 20 01 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01 21 010 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010 22 0100 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100 23 01000 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000 24 0100 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000 25 01001 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001 26 0100 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001 27 010 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001 28 0101 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101 29 010 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101 30 01 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101 31 011 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011 32 0110 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110 33 011 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110 34 0111 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111 35 011 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111 36 01 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111 37 0 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111 38 null null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111 39 1 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1 40 10 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10 41 100 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100 42 1000 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000 43 100 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000 44 1001 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001 45 100 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001 46 10 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001 47 101 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101 48 10 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101 49 1 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101 50 11 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11 51 110 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11   110 52 11 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11   110 53 111 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11   110   111 54 11 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11   110   111 55 1 null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11   110   111 56 null null   0   00   000   0000   00000   000000   000001   00001   0001   001   01   010   0100   01000   01001   0101   011   0110   0111   1   10   100   1000   1001   101   11   110   111

Man that took forever to format right...

Anyway. Once you have finished that process, the final table that you have will tell you the placement in which to fill in the real data in the real table! Yay, we're almost done! Now go through the list you came up with from the above process element by element, looking at what the bit string corresponds to in the Huffman tree. If the element is a node, encode that element in the array with the first bit being a 0 and the following bits representing the distance minus one between this element and the element containing the current bit string with an additional 1 at the end (I'll explain this later). If the element is a leaf, encode the data with the first bit being a 1 and the following bits being the uncompressed data represented by that bit string. The example I've been using throughout this post would generate a table like so:

Data is represented as 8 bits in the form of [Type bit]:[Data], so the actual data would be 7 bits. A number for [Data] means these 7 bits should be the binary representation of the number, which will be added to the table pointer if the current table entry is of this type and the bit read is a 1. This, along with the mandated pointer+1 every iteration, will simulate a jump to the 1-branch of the node. A symbol for [Data] means the 7 bits should be the ASCII value of the symbol (disregarding the high bit, because none of our ASCII characters have this bit set anyway)

 Offset in table: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 Data at location: 0:19 0:9 0:7 0:5 0:3 0:1 1:b 1:p 1:j 1:u 1:space 0:5 0:3 0:1 1:' 1:m 1:r 0:1 1:a 1:i 0:5 0:3 0:1 1:o 1:d 1:e 0:1 1:l 1:s

The decompressor can then decompress the data stream by repeating the following steps for each compressed value:
• Set a variable indicating the length, in number of bits, of the data stream
• Set a variable to point to the start of the table
• Halt decompression if bits left to read equals zero
• Check the table entry currently pointed to
• If the first bit is a 1, pull out and handle uncompressed data from current table entry, and then jump back to the second step
• (The first bit is a 0) Read a bit from the data stream and decrease the variable containing the number of bits left to read
• If the bit read is a 1, add the value of the current table entry to the table pointer
• Increase the table pointer by 1 entry
• Jump back to the third step

And that's it! Wasn't that simple?

But I'm sure that was confusing as hell, so please ask me what I mean by something if you get confused by any of this.

Tutorial Discussion topic