Huffman: generate the bits

Started by
8 comments, last by basananas 18 years, 3 months ago
I'm trying to create my own huffman compressing class.. I have built a tree and now I'm going to generate the bits for a value in the tree. a tree

      []
      /     []  [c]
   /\    
 [a]
if I understand correct this will generate the bits a:00 b:01 and c:1 altough b:01 and c:01 would be the same as just 1. and another thing, when decompressing later how can I know how many bits I need to read before I get to a leaf? bits: 0100100101001010010111001101011010101 this could generate many different trees because you could read all bits and just move left/right depending on 0/1 and then ju finaly hit a leaf at the end or you could read 3 bits and then start over at the top, or 5 bits and so on.
Advertisement
Well, I think that tree is meant to be:

    /\    ;  /\ c   ;a  b    ;


Which would make a 00, b, 01 and c 1. b's 01 is not the same as c's 1 - for b you have to write out 2 bits, but for c you only have to write 1. So when you decompress, you read bits and move down the tree until you reach a leaf, then that's the symbol you have. Then you start over.

So the stream you have starts something like: bacacba

Edit: second go at the diagram.
Furthermore, you store the determined tree with your compressed data so that you know how to decompress it later. This is sometimes referred to as a dictionary.

For decompression, with every bit you traverse the tree until you reach a leaf node. When you reach the leaf, output the uncompressed byte at that node and start traversing from the root of the tree again.

EDIT: ASCII amateurs [grin]
    (root)     /  \     0    1 (c)   / \   0   1 (a) (b)
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
Quote:Original post by McZ
when decompressing later how can I know how many bits I need to read before I get to a leaf?

bits: 0100100101001010010111001101011010101

this could generate many different trees because you could read all bits and just move left/right depending on 0/1 and then ju finaly hit a leaf at the end or you could read 3 bits and then start over at the top, or 5 bits and so on.
hmm, It sounds like you don't quite understand the principle yet.

You don't know how many bits you need to read before you get to a leaf, that's the whole point. You read 1 bit at a time, and if it's a zero you go left, if it's a one you go right, and if you hit a leaf then you output the symbol and the very next bit immediately tells you the first way to branch from the top of the tree again, for the next symbol. There only way for it to produce multiple different out symbol lists, is if you were to start from somewhere other than the very start of the bitstream when decompressing. You must always start from the start.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
so then I should store the stuff like this

00110a0010n

where a and n are the leaf characters.. but a is stored as bits too not an a like I wrote. so then the bitstream would be something like this

00110[bitstream for a]0010[bitstream for n]

so how do I know that I have found the first bit of the leaf and to read a byte for a character symbol at that leaf?
I'm not quite sure what you mean. If you have your tree and you want to output the bitstream for the symbol a, the bits you write are basically which way you have to traverse the tree in order to reach the leaf that represents a. Then, when you decode it, you traverse the tree using those bits, until you reach a leaf, when you output the character and start over. Try this for another explanation (it's pretty long though, and also describes other stuff).
I think he wants to know how to store the dictionary in the file. Please correct me if I'm wrong. If that is what you want to know then there are lots of ways of doing it. These are listed roughly in order of difficulty of implementation:
  • Don't. Use a fixed dictionary. This is often done when dealing with English text since the frequencies of characters in the English language are fairly predictable.

  • Store the frequencies of each character/byte at the beginning of the file. As long as the compressor and decompressor both use the same algorithm to generate the tree they will end up with the same codes. Adds a large but fixed-size overhead to the compressed file.

  • Store the lengths of each codeword. With a few constraints the decompressor can recover the original codewords. Often less overhead than storing character frequencies. Overhead is still fixed.

  • Store character frequencies or codeword lengths as above but in a compressed format. Lower overhead.

  • Use a default fixed dictionary which can be overridden on a file-by-file or block-by-block basis by an encoded dictionary. This is basically the technique the Deflate algorithm uses. When the default dictionary is not used the Deflate algorithm encodes the dictionary by huffman encoding the lengths of the huffman codes. Look up RFC 1951 (which ZQJ linked) for details.

Enigma
OK, let's take this from the top. You have the following string that you want to compress:

abccccba

And you have built the following tree (through whatever method) to compress the data with:

(root)
/ \
0 1 (c)
/ \
0 1
(a) (b)


Firstly, since you need the dictionary to decompress your data at the other end, you write out your dictionary. You can use one of the methods Enigma has mentioned above, or some other creative way of your own. All that matters is that you can reconstruct the same tree in the decompressor as you created in the compressor.

Next, you start compressing the data. You read in a symbol, look up the codeword for that symbol (eg, 'a'=00), and then write those bits to your output stream. Wash, rinse, repeat. This will give you the following output bits:

000111110100

You now have a compressed file that starts off with a description of your dictionary and the compressed data. The decompress this, first read your dictionary info and reconstruct your tree. Now start at the top of the tree and read your compressed data 1 bit at a time, taking the left/right branch for 0/1. When you reach a leaf node stop reading bits, spit out the symbol at that leaf node (the uncompressed data), move back to the top of your tree and start reading bits again. Keep going until you run out of bits and you (should) end up with your original data.

Going through the above compressed data, we start at the top of the tree. Read in the 0 so take the left branch, read in another 0 and take the next left branch. We find ourselves at a leaf node (a) so spit out an 'a' and go back to the top of the tree. Read in a 0 -> left branch. Read in a 1 -> right branch. We're at a leaf, spit out 'b', back to the top. Read in a 1 -> right branch, but we don't read in another bit because we've already reached a leaf so we spit out 'c' and go back to the top. Doing this, there is no way to confuse the bits '01' with '1'.

In general for a correct (but not necessarily optimal) Huffman tree, the codeword for any 1 symbol CANNOT be found at the start of the codeword for any other symbol. In the example tree, the codeword for 'c' is '1' so no other codeword can start with a '1'. Alternatively, if in some other tree 'c' was '01011', then no other codeword could start with the sequence '01011'. This is direct result of symbols only being on leaf nodes.
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
Thank you. finaly I begin to understand.

Enigma is correct it is how I can store the dictonary to recreate the tree. I have created a frequency table and a tree and generated the bits for each character. Now I just need a way to create a dictonary.

joanusdmentia> Thank you for that perfect explanation of how to recreate the compressed bitstream into uncompressed data.
I've built my own GZIP decompression algorithm once and found this article very helpful:

http://www.gzip.org/zlib/rfc-deflate.html

This topic is closed to new replies.

Advertisement