How would I read/write bit by bit from/to a file?

Hi... I am trying to write Huffman codes, which are strings of bits stored in a std::vector<bool>, to a file. I wrote a simple test program to see how I would do this,


std::uint8_t F = 10111001;

std::ofstream K("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);

for(int i = 0; i < 256; i++)
{
    K << F;
}

It didn't work, and it was pointed out to me that the variable F holds a decimal number and that I had to write it as std::uint8_t F = 0xB9 in order to write the exact byte. I was also told that I don't need this representation when copying from file to file directly, because this is just a C++ representation and files already store the exact bytes in binary. However, since I am writing Huffman codes generated by the C++ program, they would be strings of 1's and 0's, and I need to write these as individual bits to the compressed file. I was able to pack these bits 8 at a time into a std::uint8_t and write the packed byte to the file. The file is being written correctly with the individual numbers as bits. However, after packing the bits, the resulting byte is going to be a decimal number, and hence the byte in the file changes when I write it. Is there a way of changing this to a '0x'-prefixed hex value so that it can be written correctly to the file? Here's what I'm doing (it's just a test),


std::vector<bool> L;

L.push_back(1);
L.push_back(0);
L.push_back(1);
L.push_back(1);
L.push_back(0);
L.push_back(1);
L.push_back(0);
L.push_back(1);

std::uint8_t J = 0;

for(int i = 0; i < 8; i ++)
{
	if(L[i] == 1)
	{
		J |= 1;
	}

	J <<= 1;
}
// J becomes decimal 10110101 instead of 0xB5

std::ofstream K("C:/Users/WDR/Desktop/nz.enc", std::ios::binary);

for(int i = 0; i < 256; ++i)
{
	K <<  J;
}

Also, how would I read this written file again bit by bit for decompression? This entire question could be a simple and silly one, but I couldn't find a proper answer on searching. I would really appreciate it if someone took the time to clarify and answer this for me. Thank you very much.


I would recommend you try std::bitset instead (http://www.cplusplus.com/reference/bitset/bitset/).

It has methods to convert it to a representation you can easily write to a file and read back (either as an unsigned long / unsigned long long via to_ulong() / to_ullong() if the bitset is small enough, or as a std::string via to_string()).
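
For example (a quick demonstration; the 8-bit size here is arbitrary):

#include <bitset>
#include <iostream>

int main()
{
    std::bitset<8> b("10110101");        // construct from a string of '0'/'1' characters

    std::cout << b.to_ulong() << '\n';   // prints 181 (0xB5)
    std::cout << b.to_string() << '\n';  // prints "10110101"
    return 0;
}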

Edit: heh, just read your other thread, where you give your reason for not using bitset.

As you've heard, std::vector<bool> is a bit "special"; this unfortunately means it is a bit tricky to read it from and write it to a file.

It seems you have to iterate through it and convert it to a packed binary representation yourself. This could be done as a custom << stream operator, something like the sketch below.
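
A possible sketch of that idea (untested, just to show the shape; it packs 8 bits per byte, most significant bit first):

#include <cstddef>
#include <cstdint>
#include <ostream>
#include <vector>

// Custom insertion operator that packs a std::vector<bool> into raw bytes.
std::ostream& operator<<(std::ostream& out, const std::vector<bool>& bits)
{
    for(std::size_t i = 0; i < bits.size(); i += 8)
    {
        std::uint8_t packed = 0;
        for(std::size_t j = i; j < i + 8 && j < bits.size(); ++j)
            packed = static_cast<std::uint8_t>((packed << 1) | (bits[j] ? 1 : 0));

        // Note: a short final group ends up in the low bits of the last byte;
        // you still need to decide how to pad it and how to record the real bit count.
        out.write(reinterpret_cast<const char*>(&packed), 1);
    }
    return out;
}

With that in place you could stream the whole vector in one go, though you would still need to store the real bit count somewhere so a reader knows where the data ends.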

Maybe someone else has some trick up their sleeve :)

It has been suggested to me many times to use a std::bitset for storing my bits, but since I am working with Huffman codes (variable bit length), I have to use std::vector<bool>. However, do you mean to say that I should use a std::bitset after I pack the 8 bits? In other words, should I write the contents of the std::uint8_t to a std::bitset before writing to the file? Or, even better, completely eliminate the std::uint8_t by packing the bits directly into a std::bitset?

Just to be clear, because I don't want to keep asking these questions again and again... I use std::vector<bool> for variable length Huffman codes, and std::bitset for packing those codes into bytes, right? Am I right? Please confirm. And thank you for answering.

EDIT: I hadn't read your edit and went off to try the bitset packing method. Apparently, each bit in the bitset is being written as a whole byte, which kind of makes the name 'bit'set confusing (to me). Is there really no proper solution to this? Please help, guys. Thanks.

I'm guessing here that by "decimal" you mean "as text"?

If so, this is your problem:

std::uint8_t F = 10111001;
std::ofstream K("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);
for(int i = 0; i < 256; i++)
{
    K << F;
}


ostream's operator<< is meant for formatted (text-style) output, even if the stream is opened in binary mode, so it is not the right tool for raw bytes. This is a bit braindead, I know... Try the unformatted write() instead:
std::uint8_t F = 0b10111001; // note the 0b prefix (C++14); plain 10111001 is read as a decimal literal
std::ofstream K("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);
for(int i = 0; i < 256; i++)
{
    K.write(reinterpret_cast<const char*>(&F), sizeof(F)); // write() takes a char pointer and a byte count
}
You should get a 256-byte file where every byte is 0xB9 (0b10111001).
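
For reading the file back later, the unformatted counterpart is istream::read; a minimal sketch (same assumed file path as above):

#include <cstdint>
#include <fstream>

int main()
{
    std::uint8_t G = 0;
    std::ifstream In("C:/Users/WDR/Desktop/kml.enc", std::ios::binary);
    while(In.read(reinterpret_cast<char*>(&G), sizeof(G)))
    {
        // G now holds one raw byte from the file; individual bits can be
        // recovered with shifts and masks, e.g. (G >> 7) & 1 for the top bit.
    }
    return 0;
}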


However, after packing the bits, the resulting byte is going to be a decimal number and hence, the byte in the file changes when I write it.

The resulting byte doesn't change. You are confusing content with representation.

0xB9 is 10111001 in binary, though your first snippet of code is not actually assigning that value to F.

10111001 in binary is NOT the same as 10111001 in decimal. The first is a base 2 representation, the second is a base 10 representation. Hex is a base 16 representation.

If you are not explicit about the base of a literal in C++, the compiler makes certain assumptions. The line:


std::uint8_t F = 10111001;

assigns the decimal value 10111001 to F. F is only an 8-bit value though, so it can only store values from 0-255; the assignment overflows and F ends up holding the decimal value 25 (10111001 mod 256). There are prefixes you can use to tell the compiler what base the literal is in (0x for hex, 0b for binary in C++14, a leading 0 for octal); without one, the literal is read as decimal. Best to be explicit and not rely on the compiler to figure it out for you.
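
For example (a small demonstration of the same byte written three ways; the 0b binary prefix requires C++14):

#include <cstdint>
#include <iostream>

int main()
{
    std::uint8_t a = 185;          // decimal literal
    std::uint8_t b = 0xB9;         // hex literal, same value
    std::uint8_t c = 0b10111001;   // binary literal (C++14), same value again

    // The unary + promotes to int so the values print as numbers, not characters.
    std::cout << +a << ' ' << +b << ' ' << +c << '\n'; // prints: 185 185 185
    return 0;
}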


// J becomes decimal 10110101 instead of 0xB5

Here again, J does not become decimal 10110101. In this case, you've manipulated the individual bits, and that is its binary representation. Its decimal representation is 181, and both are equal to 0xB5. They are all just different ways of representing the same 8 bits of data.
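
A quick way to see that these are all the same byte (std::bitset is used here only for printing, not for storage):

#include <bitset>
#include <cstdint>
#include <iostream>

int main()
{
    std::uint8_t J = 0xB5; // the same 8 bits, whatever notation you write them in

    std::cout << std::dec << +J << '\n';     // 181
    std::cout << std::hex << +J << '\n';     // b5
    std::cout << std::bitset<8>(J) << '\n';  // 10110101
    return 0;
}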


Here again, J does not become decimal 10110101. In this case, you've manipulated the individual bits, and that is its binary representation. Its decimal representation is 181, and both are equal to 0xB5. They are all just different ways of representing the same 8 bits of data.

OK... If that's the case, then why do I see a different byte in the hex editor when I write J to the file? I am writing 10110101, i.e. 0xB5, to the file, but when I look at it in the hex editor afterwards I see 01101010, i.e. 0x6A. Apparently, I'm doing an extra left shift here.

EDIT: I was doing an extra left shift in the for loop.


//this

for(int i = 0; i < 8; i ++)
{
	if(L[i] == 1)
	{
		J |= 1;
	}

	J <<= 1; // this line
}

//changed to this

for(int i = 0; i < 8; i ++)
{
	if(L[i] == 1)
	{
		J |= 1;
	}

	if(i < 7)
		J <<= 1; // this line
}

Now I get my desired output of 0xB5 written to the file. Mistake on my part, but at least this question cleared up my confusion about bit representation. Thanks a lot, guys. Now, do I read the written file back in as bytes or as bits? I know how to do it byte by byte. How would I do it bit by bit? Thanks.


Just to be clear, because I don't want to keep asking these questions again and again... I use std::vector<bool> for variable length Huffman codes, and std::bitset for packing those codes into bytes, right? Am I right? Please confirm. And thank you for answering.

vector<bool> should be fine for variable length bit sets.

The only problem is converting it to a compact representation to be written to file.

bitset could be used to convert, but it is likely a bit overkill.

The code you have:

std::uint8_t J = 0;

for(int i = 0; i < 8; i ++)
{
    if(L[i] == 1)
    {
        J |= 1;
    }

    J <<= 1;
}

converts the first 8 bits of the std::vector<bool> into a byte (given the extra-shift fix you already made), and that approach is perfectly fine.

You just need to do this for each group of 8 bits and write to the file.

Since your set of bits is variable length, you probably need to add some header information about how many bits it contains, so it can be read back correctly.

Then to read back, you first read how many bits are stored, then do something like this for each group:

void addBitsToVector(std::uint8_t bits, std::vector<bool>& result) {
  // Unpack from the most significant bit down, matching the
  // MSB-first order your packing loop writes the bits in.
  for (int i = 7; i >= 0; --i) {
    result.push_back((bits >> i) & 0x1);
  }
}
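
Putting that together, a read-back sketch might look like this (readEncodedFile and the 4-byte bit-count header are assumptions for illustration; it reuses addBitsToVector from above):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

void addBitsToVector(std::uint8_t bits, std::vector<bool>& result); // defined above

// Assumed layout: a 4-byte total bit count written by the encoder, followed by
// the packed bytes, most significant bit first within each byte.
std::vector<bool> readEncodedFile(const std::string& fileName)
{
    std::ifstream in(fileName, std::ios::binary);

    std::uint32_t bitCount = 0; // assumes the reader and writer share the same byte order
    in.read(reinterpret_cast<char*>(&bitCount), sizeof(bitCount));

    std::vector<bool> result;
    char byte = 0;
    while(in.read(&byte, 1))
        addBitsToVector(static_cast<std::uint8_t>(byte), result);

    if(result.size() > bitCount)  // drop the zero padding in the last byte
        result.resize(bitCount);

    return result;
}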

I polished up the write code to write my actual vector<bool>. Here is what I did,


void File_Handler::Write_Encoded_File(std::vector<bool> Encoded_Data, const std::string Output_File_Name)
{
	std::ofstream Output_File(Output_File_Name, std::ios::binary);

	int Bit_Counter = 0;

	std::uint8_t Packed_Byte = 0;

	for(int i = 0; i < Encoded_Data.size(); ++i)
	{
		if(Encoded_Data[i] == 1)
		{
			Packed_Byte |= 1;
		}

		if(i < Encoded_Data.size() - 1)
		{
			Packed_Byte <<= 1;
		}

		++Bit_Counter;

		if(Bit_Counter == 8)
		{
			Output_File << Packed_Byte;

			Bit_Counter = 0;
		}
	}
}

Then, I printed the vector<bool> and its size to the console and used the above function to write the bits to the file. Comparing the console output to the file in the hex editor, I see that there are 10099 bits in the vector<bool> and 10096 bits in the file. Obviously, I did not write the last 3 bits because the bit counter did not hit 8. How would I pack the remaining 3 bits, pad the extra 5, and write the last byte? Also, the bits in the file and the bits in the console do not match even for the first 10096 bits. Is the missing last byte the only problem, or am I also writing the rest wrong? I'm getting very close to solving this, so please tell me. Thank you very much.

Anybody...? Please help, guys. Any input is duly appreciated.

Sorry for the double post.

You could reserve the first byte in the file to be a header that tells you how many value bits are in the last byte of the file. That's what I did when I wrote a Huffman encoder/decoder in C several years ago. For example, if your Huffman code contains 11 bits and each byte holds 8 bits, then the number of value bits in the last byte would be 3. If you want to guarantee that your code is portable, then you can't assume that each byte contains exactly 8 bits. To get the number of bits in a byte you can use either the CHAR_BIT macro or std::numeric_limits<unsigned char>::digits.
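
A minimal sketch of that header-byte idea, assuming 8-bit bytes for simplicity (writeEncodedFile and validBitsInLastByte are made-up names). It packs most-significant-bit first like your loop, resets the packed byte after each write, and zero-pads the final partial byte:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Assumed layout: one header byte holding the number of value bits in the
// last data byte, followed by the packed data, MSB-first within each byte.
void writeEncodedFile(const std::vector<bool>& bits, const std::string& fileName)
{
    std::ofstream out(fileName, std::ios::binary);

    std::uint8_t validBitsInLastByte =
        static_cast<std::uint8_t>(bits.size() % 8 == 0 ? 8 : bits.size() % 8);
    out.write(reinterpret_cast<const char*>(&validBitsInLastByte), 1);

    std::uint8_t packed = 0;
    int count = 0;
    for (bool bit : bits)
    {
        packed = static_cast<std::uint8_t>((packed << 1) | (bit ? 1 : 0));
        if (++count == 8)
        {
            out.write(reinterpret_cast<const char*>(&packed), 1);
            packed = 0;   // reset for the next group of 8
            count = 0;
        }
    }

    if (count > 0)        // pad the final partial byte with zero bits
    {
        packed = static_cast<std::uint8_t>(packed << (8 - count));
        out.write(reinterpret_cast<const char*>(&packed), 1);
    }
}

The decoder would then read the header byte first and ignore the padding bits at the end of the last byte.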
