Sign in to follow this  
Komal Shashank

Storing bits in a container

Recommended Posts

Hi... I have written a program which takes in a file and computes the Huffman encoded values for each byte of the file. Currently, this program encodes each byte and stores each byte and its corresponding encoded value in a std::map<std::uint8_t, std::vector<bool>>. I have been suggested several times not to use vector<bool> and to use std::bitset. But std::bitset is fixed size container and Huffman encodings can vary in length. So, I'm using vector<bool>.
 
Now, I'm trying to convert my encoded values to canonical form so that I can write it to the file on compression. I am referring to this article: http://www.arturocampos.com/ac_canonical_huffman.html#Canonical Huffman. And I am trying to write the given algorithm in C++ and I am confused as hell and failing miserably. The variables indicating code length, number of codes, etc. are pretty straightforward and can be declared as int and suchlike. However, what am I supposed to use for the variable "code"? I am really confused because some addition operations are being performed using this variable and integers. How would I do that if I declare it as vector<bool>? I tried declaring it as an integer but on printing, I'm getting garbage values for some reason. Which is the most ideal and correct way of doing this? Please help. Thank you.
Edited by WDRKKS

Share this post


Link to post
Share on other sites

Is there a Huffman curse I don't know about? Everywhere I post asking about anything related to Huffman, I'm not getting any response. Is it a very serious curse?

 

Seriously though, the joke aside, I would really appreciate some inputs, good or bad (I don't mind people telling me off as long as I get a response). Please help. Thanks.

 

P. S. : Sorry for the double post.

Share this post


Link to post
Share on other sites

I guess you want to append the bits on +, so '0101' + '1100' = '01011100'.

Your original file consists of bytes, but if you want your fixed-length bytes (8 bits) to correspond to variable-length bit-codes, then your result doesn't necessarily have an integer number of bytes (for example 11 bits, which is 1 byte + 3 bits).

 

So I guess you should go through each byte and find its code and append the codes for every byte in the file to a single vector<bool> in your case, which gives you a compressed file expressed in bits, where one bool in the vector corresponds to one bit.

Then take 8 bools at a time from the vector and combine them to one 8-bit byte and write it to the compressed file. When you're done you have a new file with compressed bytes.

 

When you decompress you take all the bytes in the compressed file and expand them into a new vector<bool>, 8 bools for each byte in the file, and then to the reverse operation to get the original bytes from the variable-length bit-codes.

Share this post


Link to post
Share on other sites

Typically this kind of compression is done as part of a stream.  You encode the value into a byte and stream it out as you go..

 

 

If we start by assuming you've built your data dictionaries and encoding table, each source tuple will get encoded into a variable-length bit pattern. It looks like you have gotten that far, building up a Huffman coding table.

 

From there, let's say you start processing your data, and it comes out of the coding table with values: 10100 10110 011 1101 10011 ....

 

You would accumulate them.  Internally it would accumulate a buffer. I imagine this sequence of function calls with these parameters:

 

bitWriter.AddBits( 5, 10100 ); // Nothing written, 10100 accumulated, 5 bits used

bitWriter.AddBits( 5, 10110 ); // 10100101 written, 10 accumulated, 2 bits used

bitWriter.AddBits( 3, 011 );  // Nothing written, 10011 accumulated, 5 bits used

bitWriter.AddBits( 4, 1101 ); // 10011110 written, 1 accumulated, 1 bit used

bitWriter.AddBits( 5, 10111); // Nothing written, 110111 accumulated, 6 bits used

...

 

When you read them back out as a decompression stream, you traverse the stream one bit at a time through your Huffman encoded tree, when you hit a leaf node you write the content in the leaf and reset back to the root with the next bit read.

 

while(bitReader.hasBits() ) {

  treeNavigator.Reset();

  while( treeNavigator.GetChild( bitReader.nextBit() )  // navigate down left/right until you hit a leaf node

    ; 

  dataStore.PushBack( treeNavigator.Value() );

 

 

Of course you may want to have an encoding tree function that handles more than one bit at a time, but effectively it will operate the same way navigating your tree.

Share this post


Link to post
Share on other sites

Thanks for replying, guys. I really appreciate it. I just realized I wasn't a little clear with my question,

 

 

Now, I'm trying to convert my encoded values to canonical form so that I can write it to the file on compression.

 

 

Here I didn't mention that I wanted to write the tree to the file for later decompression. However, you did answer another question I actually had but didn't bother about yet, because it was the next hurdle of the process. Thank you for answering that and removing that hurdle early for me. I understood your explanation of writing and reading the bit stream very clearly. Thank you for that. smile.png

 

However, I am still not getting on how to convert the codes I have into canonical form. The algorithm mentioned in the site above is not very clear (at least, to me) and I'm finding it difficult to follow and implement. Can you share your inputs on this, by any chance? Please clarify this for me. You've been of great help. Thank you.

Share this post


Link to post
Share on other sites

I have been suggested several times not to use vector and to use std::bitset. But std::bitset is fixed size container and Huffman encodings can vary in length. So, I'm using vector.

std::vector<bool> is specialized by the standard library - it doesn't act the same way as any other std::vector<T>, it acts like a variable sized bitset, using compact bit storage! So you're doing the right thing(tm) here.

Share this post


Link to post
Share on other sites

The website you are looking at is like most are. They describe very detailed the algorithm for generating the codes, then glances over how to actually implement it, how to save the codes, how to most efficiently encode the table for decoding later.

You could try web-searching for one containing the interesting bits.

 

I implemented Huffman coding long ago and found using an array for remembering the counts and one each for left, right child, parent indices to build the tree in was the simplest implementation. Then if you need a code you start at the index of the symbol you want to encode, recursively walk that imagined tree from there to the top and on the way back take 1 bit at each step depending on if it was left or right child.

All those extraneous tables for remembering full codes with very large length differences are just a caching optimization you can leave out and add on later, when you already checked your implementations of the basic encoding and decoding algorithms are working.

Share this post


Link to post
Share on other sites

Try this Wiki which is pretty clear: http://en.wikipedia.org/wiki/Canonical_Huffman_code

I checked out the site you linked but I thought it was difficult to follow..

 

I think the part that's unclear is about integers and bits.

A code of bits is also a binary integer, for example 00 = 0, 01 = 1, 10 = 2, 11 = 3.

So you can add 1 to a binary number and get a new number, which has a new bit-pattern

Share this post


Link to post
Share on other sites

Thank you for the link, Erik. I have read through the procedure and it seems a lot clearer. I wouldn't know for sure until I implement it. Even in this explanation, the pseudo code mentions a variable 'code = 0'. As per what you explained above, am I to understand that the code variable is a binary integer? In that case, can I declare it as an int type and perform the left shift operation given? Just to clarify, shift operations can be done on int types, right? Please tell me this. Thanks for your help.

Share this post


Link to post
Share on other sites


As per what you explained above, am I to understand that the code variable is a binary integer? In that case, can I declare it as an int type and perform the left shift operation given? Just to clarify, shift operations can be done on int types, right?

 

Yes this should work well.

And ofcourse you need to remember or calculate how many bits are actually used for the code, as the int will always be the same number of bits in total (usually 32). And every time you shift left, one more bit is used.

Share this post


Link to post
Share on other sites


And ofcourse you need to remember or calculate how many bits are actually used for the code, as the int will always be the same number of bits in total (usually 32). And every time you shift left, one more bit is used.

 

 

Hi, Erik. Thanks for replying. If you don't mind me asking, can you elaborate on what you mean here? I didn't understand what you said here. Specifically, the 32 bit part. Thank you.

Share this post


Link to post
Share on other sites

This is about how integers work with bits in the computer. If you read up on bitwise operations and integers it should become clear.

When using an "int" for example, a certain number of bits is allocated by the compiler, it's not dynamic, so assigning 0 to the number uses the same number of bits as assigning 9999999.

It's usually irrelevant, but as you need to store your codes with a certain number of bits you need to be able to at some point pick out the specific bits actually required.

Share this post


Link to post
Share on other sites

OK... Now I understood what you meant. But I'm not sure how I would do this. Since I have the bit lengths for each code and the length will be same for each new corresponding canonical code, I was thinking I could convert resulting integer to binary and truncate it to that length. Correct me if I'm wrong, is this a good way of doing it?

 

Also, just to be clear, the code variable in the pseudo code is the new canonical code, right? And this is declared as integer? Because if these assumptions of mine are right, then I was thinking about doing the aforementioned method. Please clarify. Thank you.

Edited by WDRKKS

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this