Komal Shashank

How would I read/write bit by bit from/to a file?


Hi... I am trying to write Huffman codes, which are strings of bits contained in a vector<bool>, to a file. I wrote a simple test program to see how I would do this:

std::uint8_t F = 10111001;

std::ofstream K("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);

for(int i = 0; i < 256; i++)
{
    K << F;
}

It didn't work, and it was pointed out to me that the variable F holds a decimal number and that I had to write it as std::uint8_t F = 0xB9 in order to write the exact byte. I was also told that I don't need this representation when copying from file to file directly, because this is just a C++ representation and files are written with the exact bytes since they are already in binary. However, since I am writing Huffman codes generated by the C++ program, they would be strings of 1's and 0's. I need to write these as individual bits to the compressed file. I was able to pack these 8 bits at a time into a std::uint8_t and write the packed byte to the file. The file is being written correctly with the individual numbers as bits. However, after packing the bits, the resulting byte is going to be a decimal number and hence, the byte in the file changes when I write it. Is there a way of changing this to a '0x' prefixed hex value so that it can be written correctly to the file? Here's what I'm doing (it's just a test):

std::vector<bool> L;

L.push_back(1);
L.push_back(0);
L.push_back(1);
L.push_back(1);
L.push_back(0);
L.push_back(1);
L.push_back(0);
L.push_back(1);

std::uint8_t J = 0;

for(int i = 0; i < 8; i ++)
{
	if(L[i] == 1)
	{
		J |= 1;
	}

	J <<= 1;
}
// J becomes decimal 10110101 instead of 0xB5

std::ofstream K("C:/Users/WDR/Desktop/nz.enc", std::ios::binary);

for(int i = 0; i < 256; ++i)
{
	K <<  J;
}

Also, how would I read this written file again bit by bit for decompression? This entire question could be a simple and silly one, but I couldn't find a proper answer on searching. I would really appreciate it if someone took the time to clarify and answer this for me. Thank you very much.


I would recommend you try using std::bitset instead (http://www.cplusplus.com/reference/bitset/bitset/).

 

It has methods to convert it to a binary representation you can easily write to a file and read back (either as an unsigned long or unsigned long long if the bitset is small enough, or as a std::string).
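For instance, a minimal sketch (std::bitset<8> here is purely for illustration):

#include <bitset>
#include <iostream>

int main()
{
    std::bitset<8> b("10110101");        // construct from a string of '0'/'1' characters
    unsigned long v = b.to_ulong();      // 181, i.e. 0xB5, as an integer
    std::cout << b.to_string() << " = " << v << '\n';
}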

 

Edit: heh, just read your other thread, where you give your reason for not using bitset.

As you've heard, std::vector<bool> is a bit "special"; this unfortunately means it is a bit tricky to read and write it to a file.

It seems you have to iterate through it yourself and convert it to a packed binary representation. This could be done as a custom << stream operator.
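Something along these lines might work (an untested sketch; it assumes MSB-first packing and zero-padding of the final byte):

#include <cstdint>
#include <ostream>
#include <vector>

// Packs the bits MSB-first into bytes and writes the raw bytes to the stream.
std::ostream& operator<<(std::ostream& os, const std::vector<bool>& bits)
{
    std::uint8_t byte = 0;
    int count = 0;

    for (bool bit : bits)
    {
        byte = static_cast<std::uint8_t>((byte << 1) | (bit ? 1 : 0));

        if (++count == 8)
        {
            os.put(static_cast<char>(byte));
            byte = 0;
            count = 0;
        }
    }

    if (count != 0)   // flush the last, partially filled byte, padded with zeros
    {
        byte = static_cast<std::uint8_t>(byte << (8 - count));
        os.put(static_cast<char>(byte));
    }

    return os;
}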

 

Maybe someone else has some trick up their sleeve :)

Edited by Olof Hedman


It has been suggested to me many times to use a std::bitset for storing my bits, but since I am working with Huffman codes (variable bit length), I have to use std::vector<bool>. However, do you mean to say that I should use a std::bitset after I pack the 8 bits? In other words, should I write the contents of the std::uint8_t to the std::bitset before writing to file? Or, even better, completely eliminate the use of the std::uint8_t by packing the bits directly into the std::bitset?

 

Just to be clear, because I don't want to keep asking these questions again and again... I use std::vector<bool> for variable length Huffman codes, and std::bitset for packing those codes into bytes, right? Am I right? Please confirm. And thank you for answering.

 

EDIT: I didn't read your edit and went off to try the bitset packing method. Apparently, each bit in the bitset is being written as a byte, which kind of renders the definition of 'bit'set confusing (to me). Is there really no proper solution to this? Please help, guys. Thanks.

Edited by WDRKKS

I'm guessing here that by "decimal" you mean "as text"?

If so, this is your problem:

std::uint8_t F = 10111001;
std::ofstream K("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);
for(int i = 0; i < 256; i++)
{
    K << F;
}


The ostream operator<< always performs formatted output, even if the stream is opened in binary mode; for raw bytes you want the unformatted write() instead. This is a bit braindead, I know... Try:
std::uint8_t F = 10111001; // btw, this is missing a 0b prefix, it should be 0b10111001
std::ofstream K("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);
for(int i = 0; i < 256; i++)
{
    K.write(reinterpret_cast<const char*>(&F), sizeof(F)); // write() takes a const char*, hence the cast
}
You should get a 256-byte file where every byte is 0b10111001.
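To read it back bit by bit, a rough sketch (my own example, using unformatted reads on the same file) could be:

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);
    std::vector<bool> bits;

    char c;
    while (in.get(c))                        // read one byte at a time
    {
        std::uint8_t byte = static_cast<std::uint8_t>(c);
        for (int i = 7; i >= 0; --i)         // unpack the bits MSB-first
            bits.push_back((byte >> i) & 1);
    }
}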



However, after packing the bits, the resulting byte is going to be a decimal number and hence, the byte in the file changes when I write it.

 

The resulting byte doesn't change.  You are confusing content with representation.

0xB9 = 10111001 in binary, though your first snippet of code is not properly assigning that value to F.

10111001 in binary is NOT the same as 10111001  in decimal.  The first is a base 2 representation, the second is a base 10 representation.  Hex is a base 16 representation.

 

If you are not explicit about how you are assigning a literal to a variable in C++, it makes certain assumptions.  The line:

std::uint8_t F = 10111001;

Assigns the decimal value 10111001 to F.  F is only an 8-bit value though, so can only store decimal values from 0 to 255, so you overflow the assignment and end up with the decimal value 25.  There are various prefixes and suffixes you can use to inform the compiler what format the literal is in.  Without any of those, it makes assumptions about the type of the literal.  Best to be explicit and not rely on the compiler to figure it out for you.

// J becomes decimal 10110101 instead of 0xB5

Here again, J does not become decimal 10110101.  In this case, you've manipulated the individual bits, and that is its binary representation.  Its decimal representation is 181, and both are equal to 0xB5.  They are all just different ways of representing the same 8 bits of data.
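For example, these all put exactly the same 8 bits into a std::uint8_t (the 0b form requires C++14):

#include <cstdint>

std::uint8_t a = 181;          // decimal literal
std::uint8_t b = 0xB5;         // hexadecimal literal
std::uint8_t c = 0b10110101;   // binary literal (C++14)
// a, b and c all hold exactly the same 8 bits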


Here again, J does not become decimal 10110101.  In this case, you've manipulated the individual bits, and that is its binary representation.  Its decimal representation is 181, and both are equal to 0xB5.  They are all just different ways of representing the same 8 bits of data.

 

 

OK... If that's the case, then why is it that I see a different byte in the Hex Editor when I write J to the file? I am writing 10110101 i.e. 0xB5 to the file, but after writing and seeing it in the Hex Editor, I see 01101010 i.e. 0x6A. Apparently, I'm doing an extra left shift here.

 

EDIT: I was doing an extra left shift in the for loop.

//this

for(int i = 0; i < 8; i ++)
{
	if(L[i] == 1)
	{
		J |= 1;
	}

	J <<= 1; // this line
}

//changed to this

for(int i = 0; i < 8; i ++)
{
	if(L[i] == 1)
	{
		J |= 1;
	}

	if(i < 7)
		J <<= 1; // this line
}

Now I get my desired output of writing 0xB5 into the file. Mistake on my part, but at least this question cleared up my confusion about bit representation. Thanks a lot, guys. Now, do I read this written file back in as bytes or bits? I know how to do it byte by byte. How would I do it bit by bit? Thanks.

Edited by WDRKKS



Just to be clear, because I don't want to keep asking these questions again and again... I use std::vector<bool> for variable length Huffman codes, and std::bitset for packing those codes into bytes, right? Am I right? Please confirm. And thank you for answering.

 

vector<bool> should be fine for variable length bit sets. 

The only problem is converting it to a compact representation to be written to file.

bitset could be used to convert, but it is likely a bit overkill. 

 

The code you have:

std::uint8_t J = 0;

for(int i = 0; i < 8; i ++)
{
    if(L[i] == 1)
    {
        J |= 1;
    }

    J <<= 1;
}

 

is doing the conversion of the first 8 bits in the std::vector<bool> to a byte, which is also perfectly fine.

You just need to do this for each group of 8 bits and write to the file.

Since your set of bits is variable length, you probably need to add some header information on how many bits it has, so it can be read back correctly.

 

Then to read back, you first read how many bits are stored, then do something like this for each group:

 

void addBitsToVector(std::uint8_t bits, std::vector<bool>& result)
{
    // Unpack MSB-first so the order matches how the bits were packed above.
    for (int i = 7; i >= 0; --i)
    {
        result.push_back((bits >> i) & 0x1);
    }
}
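The read loop could then look something like this (a sketch; the 32-bit bit-count header at the front is my own choice of layout, not something your format requires):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::vector<bool> readEncodedFile(const std::string& fileName)
{
    std::ifstream in(fileName, std::ios::binary);

    std::uint32_t bitCount = 0;                     // hypothetical header: total number of bits
    in.read(reinterpret_cast<char*>(&bitCount), sizeof(bitCount));

    std::vector<bool> bits;
    char c;
    while (bits.size() < bitCount && in.get(c))
    {
        addBitsToVector(static_cast<std::uint8_t>(c), bits);
    }

    bits.resize(bitCount);                          // drop the padding bits of the last byte
    return bits;
}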


I polished up the write code to write my actual vector<bool>. Here is what I did,

void File_Handler::Write_Encoded_File(std::vector<bool> Encoded_Data, const std::string Output_File_Name)
{
	std::ofstream Output_File(Output_File_Name, std::ios::binary);

	int Bit_Counter = 0;

	std::uint8_t Packed_Byte = 0;

	for(int i = 0; i < Encoded_Data.size(); ++i)
	{
		if(Encoded_Data[i] == 1)
		{
			Packed_Byte |= 1;
		}

		if(i < Encoded_Data.size() - 1)
		{
			Packed_Byte <<= 1;
		}

		++Bit_Counter;

		if(Bit_Counter == 8)
		{
			Output_File << Packed_Byte;

			Bit_Counter = 0;
		}
	}
}

Then, I printed the vector<bool> and its size to the console and used the above function to write the bits to the file. Comparing the console output to the file in the Hex Editor, I see that there are 10099 bits in the vector<bool> and 10096 bits in the file. Obviously, I did not write the last 3 bits because the bit counter did not hit 8. How would I pad the extra 5 bits and write the last byte? Also, the bits in the file and the bits in the console do not match even for the first 10096 bits. Is the missing last byte the only reason why, or am I writing it wrong? I'm getting very close to solving this, so please tell me. Thank you very much.


You could reserve the first byte in the file to be a header that tells you how many value bits are in the last byte of the file. That's what I did when I wrote a Huffman encoder/decoder in C several years ago. For example, if your Huffman code contains 11 bits and each byte contains 8 bits, then the number of value bits in the last byte would be 3. If you want to guarantee that your code is portable, then you can't assume that each byte only contains 8 bits. To get the number of bits in a byte you can either use the CHAR_BIT macro or std::numeric_limits<unsigned char>::digits.
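A minimal sketch of that idea (writeHeader/readHeader are just illustrative names):

#include <climits>
#include <cstddef>
#include <cstdint>
#include <fstream>

// Writing: the first byte records how many value bits the last data byte holds.
void writeHeader(std::ofstream& out, std::size_t totalBits)
{
    std::uint8_t validBitsInLastByte = static_cast<std::uint8_t>(
        totalBits == 0 ? 0 : (totalBits % CHAR_BIT == 0 ? CHAR_BIT : totalBits % CHAR_BIT));
    out.put(static_cast<char>(validBitsInLastByte));
}

// Reading: recover the count first, then ignore the padding bits of the last byte.
std::uint8_t readHeader(std::ifstream& in)
{
    char c = 0;
    in.get(c);
    return static_cast<std::uint8_t>(c);
}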


OK... My code is not going to be portable. It's just for Windows. I can do what you have suggested, but is that the only reason why my output bits are not matching the file bits, or is it something related to my code? Am I making any mistake in the code above? Please tell me. Thanks.


The reason the number of bits doesn't match is that you only output when you have 8 bits accumulated, so you'll need to account for those extra bits somehow. However, the output bits not matching the file bits is some other issue. I don't know what your 'output bits' code is, and I don't know how you're comparing that with your 'file bits', so maybe that's your problem.


This is my output bits code... It just overloads the << operator so that I can print a vector<bool> using std::cout.

std::ostream& operator<<(std::ostream& os, std::vector<bool> Vec)
{
	std::copy(Vec.begin(), Vec.end(), std::ostream_iterator<bool>(os, ""));
	return os;
}

And since I'm not writing any header or anything and am just writing bits directly from the vector<bool>, I'm comparing the first 2-3 bytes of the file in the Hex Editor with the first 16-24 bits in the console output. They don't match. It worked with the test program above where I just wrote one byte to the file 256 times to create a 256-byte file. There was a match. Somehow, it's not working here.

 

However, please clarify this for me... Is the code I wrote above correct? Am I missing anything, like a reversing operation or something like that? Or is the function perfectly fine? Please tell me, and thank you for your time.

Edited by WDRKKS


Your code looks okay. I suggest you write a function that prints out your packed Huffman code as a string so you can compare the code before and after it has been written to a file. One thing to pay attention to is the bit order per byte. If someone else is to read your Huffman code, then you need to document in which order the bits in each byte should be read. I don't know how a vector of bools is implemented by your compiler, but it could very well use the reverse of the bit order you're using, or even use an unsigned int as storage or similar. That could be why the bit patterns in memory and on file don't match each other exactly.
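Something like this could serve for the comparison (a quick debugging helper, not production code):

#include <cstdint>
#include <iostream>

// Prints a packed byte as 8 characters, MSB-first, so it can be compared
// against the bits printed straight from the std::vector<bool>.
void printByteAsBits(std::uint8_t byte)
{
    for (int i = 7; i >= 0; --i)
        std::cout << (((byte >> i) & 1) ? '1' : '0');
    std::cout << '\n';
}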

 

As others have already said, you really shouldn't be using a vector of bools to begin with, since it's widely regarded as a broken corner of the C++ standard. It's not hard to implement your own container if you want a dynamic bitset.


SOL, I did a couple of tests and found out that the vector<bool> works fine (at least in this particular instance). I tested to see which part of the code is causing the problem and found out it was the Bit_Counter part of the loop. Somehow it is not packing the bits correctly. I tried to write just one byte in the same loop and it works and everything fits. The output window, the hex editor, everything matches. I'm not able to see what's wrong. The operation syntax is correct. Here's the edited code, which writes just the one byte perfectly:

for(int i = 0; i < Encoded_Data.size() /*this is 8 bits long now*/; ++i)
{
	if(Encoded_Data[i] == 1)
	{
		Packed_Byte |= 1;
	}

	if(i < Encoded_Data.size() - 1 /*naturally, this will be 7*/)
	{
		Packed_Byte <<= 1;
	}

	/*++Bit_Counter;

	if(Bit_Counter == 8)
	{
		Output_File << Packed_Byte;

		Bit_Counter = 0;
	}*/                        //this I removed completely
}

Output_File << Packed_Byte;

Any idea why? I'm clueless.


What are you testing with? This does not do what you think it does (taken from the code in your first post):
 

std::uint8_t F = 10111001;

 
 
 
This is wrong:

if (i < Encoded_Data.size() - 1 /*naturally, this will be 7*/)
{
    Packed_Byte <<= 1;
}

You should be testing against the number of bits in a byte and not the number of bits in the entire bit stream. Otherwise you'll end up with one extra shift. You need to detect the last byte separately.

Edited by SOL-2517


What are you testing with? This does not do what you think it does (taken from the code in your first post):
 

std::uint8_t F = 10111001;

 

Yeah, I've moved on past that. I've already mentioned that I've been corrected on that matter. That's fine.

 

 

This is wrong:

if (i < Encoded_Data.size() - 1 /*naturally, this will be 7*/)
{
    Packed_Byte <<= 1;
}

You should be testing against the number of bits in a byte and not the number of bits in the entire bit stream. Otherwise you'll end up with one extra shift. You need to detect the last byte separately.

 

I thought about this a little... and came up with this. I don't know whether it is correct.

for(int i = 0; i < Encoded_Data.size(); ++i)
{
	if(Encoded_Data[i] == 1)
	{
		Packed_Byte |= 1;
	}

	++Bit_Counter;

	if(Bit_Counter != 8)
	{
		Packed_Byte <<= 1;
	}

	if(Bit_Counter == 8)
	{
		Output_File << Packed_Byte;

		Bit_Counter = 0;
	}
}

I say this because, even though I am getting my desired output now, some bytes are still not matching. Here's a sample:

From Hex Editor,
10100000 00100101 10001011 10110011 11010101

From Console Output,
10100000 00100101 10001011 10110011 01010101

As you can see, the fifth byte's first bit is 1 instead of 0. Any idea why?

 

EDIT: I forgot to reset the packed byte (Packed_Byte = 0) when I reset the bit counter, so it was carrying bits over from the previous byte. Now it works perfectly. I also wrote the code for appending the extra padding bits in order to write the last byte. I just have one last thing to do: I have to write some kind of EOF marker into the file. How would I do this? I couldn't find anything proper about it when googling.
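(For anyone reading along: with the fixes above plus SOL-2517's count-byte header, the write loop could look roughly like the sketch below. This is a reconstruction, not necessarily WDRKKS's final code. With such a header, no separate EOF marker should be needed, since the decoder knows exactly how many bits of the last byte are real data.)

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

void Write_Encoded_File(const std::vector<bool>& Encoded_Data, const std::string& Output_File_Name)
{
    std::ofstream Output_File(Output_File_Name, std::ios::binary);

    // Header byte: how many bits of the last data byte are real data (1-8, assumes a non-empty vector).
    std::uint8_t Valid_Bits_In_Last_Byte = static_cast<std::uint8_t>(
        Encoded_Data.size() % 8 == 0 ? 8 : Encoded_Data.size() % 8);
    Output_File.put(static_cast<char>(Valid_Bits_In_Last_Byte));

    std::uint8_t Packed_Byte = 0;
    int Bit_Counter = 0;

    for (std::size_t i = 0; i < Encoded_Data.size(); ++i)
    {
        Packed_Byte = static_cast<std::uint8_t>((Packed_Byte << 1) | (Encoded_Data[i] ? 1 : 0));

        if (++Bit_Counter == 8)
        {
            Output_File.put(static_cast<char>(Packed_Byte));
            Packed_Byte = 0;              // the reset that was missing
            Bit_Counter = 0;
        }
    }

    if (Bit_Counter != 0)                 // pad and flush the final partial byte
    {
        Packed_Byte = static_cast<std::uint8_t>(Packed_Byte << (8 - Bit_Counter));
        Output_File.put(static_cast<char>(Packed_Byte));
    }
}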

Edited by WDRKKS


Do you really have a need to work on the generated "binary" file at all, e.g. somehow modify or inspect the file through your Hex editor?

If not, then why not just store the packed data as std::uint8_t and later read it back as std::uint8_t, converting to 0s and 1s as needed?


Do you really have a need to work on the generated "binary" file at all, e.g. somehow modify or inspect the file through your Hex editor?

If not, then why not just store the packed data as std::uint8_t and later read it back as std::uint8_t, converting to 0s and 1s as needed?

 

Wouldn't that defeat the purpose of *compressing* said file, which, as I understand it, is what Huffman coding is mainly for? Unless I misunderstood the problem.

