Storing bits in a container

Hi... I have written a program which takes in a file and computes the Huffman encoded values for each byte of the file. Currently, this program encodes each byte and stores it together with its corresponding encoded value in a std::map<std::uint8_t, std::vector<bool>>. I have been suggested several times not to use vector<bool> and to use std::bitset. But std::bitset is a fixed-size container and Huffman encodings can vary in length. So, I'm using vector<bool>.
Now, I'm trying to convert my encoded values to canonical form so that I can write them to the file on compression. I am referring to this article: http://www.arturocampos.com/ac_canonical_huffman.html#Canonical Huffman. I am trying to write the given algorithm in C++, and I am confused as hell and failing miserably. The variables indicating code length, number of codes, etc. are pretty straightforward and can be declared as int and suchlike. However, what am I supposed to use for the variable "code"? I am really confused because some addition operations are performed using this variable and integers. How would I do that if I declare it as vector<bool>? I tried declaring it as an integer, but on printing, I'm getting garbage values for some reason. What is the correct way of doing this? Please help. Thank you.

Is there a Huffman curse I don't know about? Everywhere I post asking about anything related to Huffman, I'm not getting any response. Is it a very serious curse?

Seriously though, the joke aside, I would really appreciate some inputs, good or bad (I don't mind people telling me off as long as I get a response). Please help. Thanks.

P. S. : Sorry for the double post.

I guess you want to append the bits on +, so '0101' + '1100' = '01011100'.

Your original file consists of bytes, but if you want your fixed-length bytes (8 bits) to correspond to variable-length bit-codes, then your result doesn't necessarily have an integer number of bytes (for example 11 bits, which is 1 byte + 3 bits).

So I guess you should go through each byte and find its code and append the codes for every byte in the file to a single vector<bool> in your case, which gives you a compressed file expressed in bits, where one bool in the vector corresponds to one bit.
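For illustration, appending one byte's code to that single output vector<bool> is just a range insert. A tiny sketch, assuming the std::map from the first post (the function name appendCode is made up here):

#include <cstdint>
#include <map>
#include <vector>

// Append the code for one input byte to the growing output bit stream.
// 'table' is assumed to be the std::map<std::uint8_t, std::vector<bool>>
// described in the first post.
void appendCode(std::vector<bool>& output,
                const std::map<std::uint8_t, std::vector<bool>>& table,
                std::uint8_t byte)
{
    const std::vector<bool>& code = table.at(byte);
    output.insert(output.end(), code.begin(), code.end());
}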

Then take 8 bools at a time from the vector and combine them to one 8-bit byte and write it to the compressed file. When you're done you have a new file with compressed bytes.

When you decompress, you take all the bytes in the compressed file and expand them into a new vector<bool>, 8 bools for each byte in the file, and then do the reverse operation to get the original bytes from the variable-length bit-codes.
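A minimal sketch of that packing and unpacking, assuming the whole compressed stream sits in a single std::vector<bool>; packBits and unpackBits are just illustrative names, and storing the number of padding bits in the last byte is left out:

#include <cstdint>
#include <vector>

// Pack a vector<bool> into bytes, 8 bits per byte, most significant bit first.
// The last byte is zero-padded; a real compressor also has to record how many
// bits are valid so the decoder knows where the padding starts.
std::vector<std::uint8_t> packBits(const std::vector<bool>& bits)
{
    std::vector<std::uint8_t> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i])
            bytes[i / 8] |= static_cast<std::uint8_t>(1u << (7 - i % 8));
    return bytes;
}

// Reverse step for decompression: expand every byte back into 8 bools.
std::vector<bool> unpackBits(const std::vector<std::uint8_t>& bytes)
{
    std::vector<bool> bits;
    bits.reserve(bytes.size() * 8);
    for (std::uint8_t b : bytes)
        for (int i = 7; i >= 0; --i)
            bits.push_back(((b >> i) & 1u) != 0);
    return bits;
}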

Typically this kind of compression is done as part of a stream. You encode the value into a byte and stream it out as you go.

If we start by assuming you've built your data dictionaries and encoding table, each source tuple will get encoded into a variable-length bit pattern. It looks like you have gotten that far, building up a Huffman coding table.

From there, let's say you start processing your data, and it comes out of the coding table with values: 10100 10110 011 1101 10111 ....

You would accumulate them. Internally the writer would accumulate the bits in a buffer. I imagine a sequence of function calls like this:

bitWriter.AddBits( 5, 10100 ); // Nothing written, 10100 accumulated, 5 bits used
bitWriter.AddBits( 5, 10110 ); // 10100101 written, 10 accumulated, 2 bits used
bitWriter.AddBits( 3, 011 ); // Nothing written, 10011 accumulated, 5 bits used
bitWriter.AddBits( 4, 1101 ); // 10011110 written, 1 accumulated, 1 bit used
bitWriter.AddBits( 5, 10111 ); // Nothing written, 110111 accumulated, 6 bits used
...
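A minimal sketch of such an accumulator, assuming a hypothetical BitWriter that buffers bits and writes whole bytes to an output stream (the class name and interface are made up to match the calls above, not an existing API). Note that in real C++ the second argument would have to be a binary literal such as 0b10100, since a bare 10100 would be read as decimal:

#include <cstdint>
#include <ostream>

// Hypothetical bit accumulator: collects bits most-significant-first and
// writes a byte to the output stream as soon as 8 bits are available.
class BitWriter
{
public:
    explicit BitWriter(std::ostream& out) : out_(out) {}

    // Append the lowest 'count' bits of 'code', most significant bit first,
    // e.g. AddBits(5, 0b10100).
    void AddBits(int count, std::uint32_t code)
    {
        for (int i = count - 1; i >= 0; --i)
        {
            buffer_ = static_cast<std::uint8_t>((buffer_ << 1) | ((code >> i) & 1u));
            if (++used_ == 8)
            {
                out_.put(static_cast<char>(buffer_));
                buffer_ = 0;
                used_ = 0;
            }
        }
    }

    // Pad the final partial byte with zero bits and write it out.
    void Flush()
    {
        if (used_ > 0)
        {
            out_.put(static_cast<char>(buffer_ << (8 - used_)));
            buffer_ = 0;
            used_ = 0;
        }
    }

private:
    std::ostream& out_;
    std::uint8_t buffer_ = 0;
    int used_ = 0;
};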

When you read them back out as a decompression stream, you traverse the stream one bit at a time through your Huffman tree. When you hit a leaf node, you write out the content of the leaf and reset back to the root before reading the next bit.

while( bitReader.hasBits() ) {
    treeNavigator.Reset();
    while( treeNavigator.GetChild( bitReader.nextBit() ) ) // navigate down left/right until you hit a leaf node
        ;
    dataStore.PushBack( treeNavigator.Value() );
}

Of course you may want to have an encoding tree function that handles more than one bit at a time, but effectively it will operate the same way navigating your tree.
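For the reading side, a matching sketch of the bitReader used in the pseudocode above (again a hypothetical class; how many padding bits sit in the final byte still has to be recorded somewhere by the compressor, which is ignored here):

#include <cstdint>
#include <istream>

// Hypothetical counterpart to the writer: hands the compressed stream back
// one bit at a time, most significant bit first within each byte.
class BitReader
{
public:
    explicit BitReader(std::istream& in) : in_(in) {}

    // True if another bit can be produced; fetches the next byte when needed.
    bool hasBits()
    {
        if (used_ < 8)
            return true;
        char c;
        if (!in_.get(c))
            return false;
        buffer_ = static_cast<std::uint8_t>(c);
        used_ = 0;
        return true;
    }

    // Return the next bit; call hasBits() first.
    bool nextBit()
    {
        bool bit = ((buffer_ >> (7 - used_)) & 1u) != 0;
        ++used_;
        return bit;
    }

private:
    std::istream& in_;
    std::uint8_t buffer_ = 0;
    int used_ = 8; // start "empty" so the first hasBits() fetches a byte
};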

Thanks for replying, guys. I really appreciate it. I just realized I wasn't entirely clear with my question:

Now, I'm trying to convert my encoded values to canonical form so that I can write it to the file on compression.

Here I didn't mention that I wanted to write the tree to the file for later decompression. However, you did answer another question I actually had but hadn't gotten to yet, because it was the next hurdle of the process. Thank you for answering that and removing that hurdle early for me. I understood your explanation of writing and reading the bit stream very clearly. Thank you for that.

However, I still don't get how to convert the codes I have into canonical form. The algorithm mentioned on the site above is not very clear (at least, to me) and I'm finding it difficult to follow and implement. Can you share your inputs on this, by any chance? Please clarify this for me. You've been of great help. Thank you.

I have been suggested several times not to use vector<bool> and to use std::bitset. But std::bitset is a fixed-size container and Huffman encodings can vary in length. So, I'm using vector<bool>.

std::vector<bool> is specialized by the standard library - it doesn't act the same way as any other std::vector<T>, it acts like a variable sized bitset, using compact bit storage! So you're doing the right thing(tm) here.

OK... So, how would I incorporate that into the conversion of Huffman codes to canonical form using the algorithm on the site above?

The website you are looking at is like most of them. It describes the algorithm for generating the codes in great detail, then glosses over how to actually implement it, how to save the codes, and how to most efficiently encode the table for decoding later.

You could try web-searching for one containing the interesting bits.

I implemented Huffman coding long ago and found that using one array for the symbol counts, plus one each for the left child, right child, and parent indices, was the simplest way to build the tree. Then, if you need a code, you start at the index of the symbol you want to encode, walk that implicit tree from there up to the root, and on the way back take 1 bit at each step depending on whether it was a left or right child.

All those extra tables for remembering full codes with very different lengths are just a caching optimization you can leave out and add back later, once you have checked that your implementations of the basic encoding and decoding algorithms are working.
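A small sketch of pulling one code out of such an array-based tree, under the layout described above (the parent/left arrays and index conventions here are assumptions, not the only way to lay it out):

#include <algorithm>
#include <vector>

// For node i, parent[i] is the index of its parent and left[p] is the index
// of p's left child; the root has parent[root] == -1 and the leaves sit at
// indices 0..255, one per byte value.
std::vector<bool> codeForSymbol(int symbol,
                                const std::vector<int>& parent,
                                const std::vector<int>& left)
{
    std::vector<bool> code;
    // Walk from the leaf up to the root, recording one bit per step:
    // 0 if this node is its parent's left child, 1 if it is the right child.
    for (int node = symbol; parent[node] != -1; node = parent[node])
        code.push_back(left[parent[node]] != node);
    // The bits were collected leaf-to-root, so reverse them to get the
    // root-to-leaf code used when encoding.
    std::reverse(code.begin(), code.end());
    return code;
}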

Try this Wiki which is pretty clear: http://en.wikipedia.org/wiki/Canonical_Huffman_code

I checked out the site you linked but I thought it was difficult to follow.

I think the part that's unclear is about integers and bits.

A code of bits is also a binary integer, for example 00 = 0, 01 = 1, 10 = 2, 11 = 3.

So you can add 1 to a binary number and get a new number, which has a new bit pattern.
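To make that concrete, a rough sketch of the canonical assignment from that Wikipedia page, assuming you already have a code length for every symbol (the function and variable names are just illustrative):

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Build canonical codes from the code lengths alone. 'code' is kept as a
// plain integer, exactly as in the pseudocode, and only converted to a
// vector<bool> of the right length when a symbol is assigned.
std::map<std::uint8_t, std::vector<bool>>
canonicalCodes(const std::map<std::uint8_t, int>& lengths)
{
    // Sort symbols by (code length, symbol value), as the algorithm requires.
    std::vector<std::pair<int, std::uint8_t>> order;
    for (const auto& entry : lengths)
        order.push_back({entry.second, entry.first});
    std::sort(order.begin(), order.end());

    std::map<std::uint8_t, std::vector<bool>> codes;
    std::uint32_t code = 0; // the running integer code
    for (std::size_t i = 0; i < order.size(); ++i)
    {
        int length = order[i].first;

        // Convert the integer 'code' into 'length' bits, MSB first.
        std::vector<bool> bits(length);
        for (int b = 0; b < length; ++b)
            bits[b] = ((code >> (length - 1 - b)) & 1u) != 0;
        codes[order[i].second] = bits;

        // Advance: add 1, then left-shift by the difference to the next
        // symbol's length. This is the step that needs 'code' to be an
        // integer rather than a vector<bool>.
        if (i + 1 < order.size())
            code = (code + 1) << (order[i + 1].first - length);
    }
    return codes;
}

The inner loop also shows one way to get from the integer back to a vector<bool> of the right length, including the leading zeros that are easy to lose when printing the raw int.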

Thank you for the link, Erik. I have read through the procedure and it seems a lot clearer. I wouldn't know for sure until I implement it. Even in this explanation, the pseudo code mentions a variable 'code = 0'. As per what you explained above, am I to understand that the code variable is a binary integer? In that case, can I declare it as an int type and perform the left shift operation given? Just to clarify, shift operations can be done on int types, right? Please tell me this. Thanks for your help.

This topic is closed to new replies.
