ifstream::read(), tellg(), binary mode and carriage return

This topic is 1192 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.


It's an age-old issue that only seems to have one solution: use getline() to read text files in chunks, testing for failure after each read. However, writing a raw, simple, universal hands-on wrapper is for some reason far more difficult without such a hack.

 

I understand the necessity to interpret input when reading a text file. I also understand the necessity to read the file in chunks in text mode, since there's no telling how many line feeds there are.

 

What I do not understand is ifstream::read()'s insistence on inserting carriage returns into data read from a stream that's opened in binary mode. There's a lot of discussion on the web about how to handle different situations, but I frankly can't find answers to two simple questions:

 

- why does ifstream::read() interpret line endings in binary mode?

- and how can it tell that I'm reading a text file in binary mode anyway? Does it read one byte at a time and if it doesn't see any non-printable characters, automatically adds a carriage return when it encounters a line feed?

 

This behavior is not only confusing; the lack of an option to simply read any file as a binary data stream is perplexing. After all, tellg() reports the correct size in binary mode and data is transferred just as expected. Just not all of it.


This is entirely dependent on the program and platform you used to save the text file.

Windows standard line endings are "\r\n", Linux standard line endings are "\n".

In binary mode, fstream's read() merely exposes this platform-specific detail to you by returning the raw bytes that are actually in the file. It is likely your file really does contain "\r\n" at every line ending.

 

It is a text-mode fstream that translates line endings: both file << "hello\n"; and file << "hello" << std::endl; write "\r\n" if you're on Windows, or just "\n" if you're on Linux.
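To check this on your own machine, here is a minimal sketch (not code from this thread; write_file and raw_size are made-up helper names, and the file names in the usage note are placeholders). It writes the same string once through a text-mode stream and once through a binary-mode stream, then measures what actually landed on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `text` to `path`, in binary mode when `binary`
// is true and in text mode otherwise.
void write_file(const std::string& path, const std::string& text, bool binary) {
    std::ios::openmode mode = std::ios::out | std::ios::trunc;
    if (binary) mode |= std::ios::binary;
    std::ofstream out(path, mode);
    out << text;
}

// Hypothetical helper: returns the number of bytes stored on disk by reading
// the file back in binary mode, where no line-ending translation happens.
std::size_t raw_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    assert(in.is_open());
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return bytes.size();
}
```

Calling write_file("bin.txt", "hello\n", true) and then raw_size("bin.txt") should give 6 on any platform. With false as the last argument, the result should be 7 on Windows (the '\n' becomes "\r\n") and 6 on Linux.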

 

Edit: if you want to see this for yourself in an easier way, use gvim to convert the line endings in a text file between Windows and Linux. In gvim the file looks fine either way, because gvim determines the file format from the presence of \r\n or \n. In Notepad, however, the Windows endings look fine, while the Linux endings make the file appear as a single line.

Edited by KulSeran


I understand this. What I can't understand is the simple fact that using tellg() to get the size of a binary stream and then reading that number of raw bytes returns different data. tellg() returns the same size that the OS reports. ifstream::read() returns that size minus the number of lines in the file.

 

Note that I don't really care whether the stream contains just a '\n', '\r\n' or even just a '\r' - it's easy to handle each of these cases during parsing. All I want is to get the entire contents of a file opened in binary mode with one call without having to differentiate between text and binary sources, what the OS deems a valid line ending and what is actually written into the file itself.

 

I take it it's impossible, although it still escapes me why.
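For reference, reading a whole file in one call is possible with a binary-mode stream. Below is a minimal sketch, not the wrapper discussed in this thread; read_all_bytes and demo_round_trip are hypothetical names, and "demo.bin" is a placeholder file. It sizes the buffer from tellg(), reads once, and uses gcount() to confirm nothing was dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns every byte of `path`, untranslated.
std::vector<char> read_all_bytes(const std::string& path) {
    // std::ios::ate opens the stream positioned at the end of the file.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) throw std::runtime_error("cannot open " + path);

    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);

    std::vector<char> bytes(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    if (size > 0) in.read(bytes.data(), size);

    // gcount() is the number of bytes the last read() actually delivered.
    if (in.gcount() != size) throw std::runtime_error("short read on " + path);
    return bytes;
}

// Round-trip check: bytes written in binary mode come back unchanged,
// whatever mix of "\r\n" and "\n" they contain.
void demo_round_trip() {
    const std::string payload = "a\r\nb\n";
    { std::ofstream out("demo.bin", std::ios::binary); out << payload; }
    const std::vector<char> bytes = read_all_bytes("demo.bin");
    assert(std::string(bytes.begin(), bytes.end()) == payload);
}
```

The caller then decides how to treat '\n', "\r\n" or '\r' during parsing; the stream itself adds and removes nothing in binary mode.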


You'll need to demonstrate what exactly you're doing.  Given that:

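#include <fstream>
#include <iostream>  // std::cout was used without this header
#include <vector>

int main(int argc, char **argv) {
  // Binary mode: bytes are delivered exactly as stored on disk.
  std::ifstream ifile("test.txt", std::ios::binary);
  // Seek to the end so that tellg() reports the file size.
  ifile.seekg(0, std::ios::end);
  std::streamoff fileSize = ifile.tellg();
  std::cout << "file is " << fileSize << "bytes" << std::endl;
  // Deliberately over-allocate by 100 zeroed bytes so that any extra
  // bytes produced by read() would show up in the dump below.
  std::vector<char> bytes;
  bytes.resize(fileSize + 100, 0);
  ifile.seekg(0, std::ios::beg);
  // Ask for more than the file holds; gcount() says how much was delivered.
  ifile.read(&bytes[0], fileSize + 100);
  std::cout << "read in " << ifile.gcount() << "bytes" << std::endl;

  // Dump every byte as a decimal value (13 = '\r', 10 = '\n').
  for (auto itr = bytes.begin(); itr != bytes.end(); ++itr) {
    std::cout << static_cast<int>(*itr) << " ";
  }
}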

reports the same value for both size and bytes read.  In my case, 34 bytes for the test file:

this is some text





hello

The output is:

file is 34bytes
read in 34bytes
116 104 105 115 32 105 115 32 115 111 109 101 32 116 101 120 116 13 10 13 10 13
10 13 10 13 10 13 10 104 101 108 108 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

from which you can clearly see there are 34 bytes in the file, and every newline is "\r\n" because I used Notepad to create the file.

 

Edit: if I use gvim to convert the file to Unix line endings as I suggested, I get:

file is 29bytes
read in 29bytes
116 104 105 115 32 105 115 32 115 111 109 101 32 116 101 120 116 10 10 10 10 10
10 104 101 108 108 111 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Here again it should be clear that no '\r' (13) bytes were inserted, as the Unix line endings contain only the '\n' character.

Edited by KulSeran


I'm basically doing what you're doing, except that I'm not over-allocating the buffer or inflating the file size by an arbitrary amount.

	in->open(sPath, std::ios::binary);

	std::streamoff cpos = in->tellg();
	in->seekg(0, std::ios::end);
	std::streamoff epos = in->tellg();
	iNumBytes = epos;
	in->seekg(cpos, std::ios::beg);
	data = NewArray(BYTE, iNumBytes + 1);
 
	in->read((char*)data, iNumBytes);
Edited by irreversible


So, in your case, did you check that

in->gcount() == iNumBytes

holds after the read? If it does not, the file was not actually read in full.

 

Also, if the snippet you posted is exactly what you have in your code, you should probably change the iNumBytes computation to

iNumBytes = epos - cpos;

because you seek back to cpos rather than to the start of the file, and cpos might not be 0.
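As a small illustration of the epos - cpos point, here is a sketch with a hypothetical helper, remaining_bytes, that measures only what is left to read from the current position and then restores that position. It takes any std::istream, so the demo uses an in-memory stream rather than a file.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: number of bytes between the current read position
// and the end of the stream. The read position is restored before returning.
std::streamoff remaining_bytes(std::istream& in) {
    const std::istream::pos_type cpos = in.tellg();  // where we are now
    in.seekg(0, std::ios::end);
    const std::istream::pos_type epos = in.tellg();  // end of the stream
    in.seekg(cpos);                                  // absolute seek back
    return epos - cpos;
}

// After skipping a 4-byte header in a 10-byte stream, 6 bytes remain,
// not 10, and the read position is still 4.
void demo_remaining() {
    std::istringstream in("0123456789");
    in.seekg(4);
    assert(remaining_bytes(in) == 6);
    assert(static_cast<std::streamoff>(in.tellg()) == 4);
}
```

If the stream is always at position 0 when the size is taken, epos alone gives the same answer, which matches the total-size behaviour described for GetSize() later in the thread.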

Edited by KulSeran


Yup, the gcount() correctly matches the number of bytes reported by tellg() on all accounts:

 

READ FILE: 3898 3898
READ FILE: 65592 65592
READ FILE: 196662 196662
READ FILE: 196662 196662
READ FILE: 49208 49208
READ FILE: 196664 196664
READ FILE: 196662 196662
 
The conclusion I've come to is that ifstream::read(), simply put, interprets the input and prepends each newline with a carriage return, thus overflowing the actual amount of data that has to be read compared to the number of bytes on disk. This behavior is not only frustrating, it's frankly very silly.
 
As for the iNumBytes - the code is from IFileIO::GetSize(), which simply returns the total size of the file, not the remaining size. But thanks for the comment :).
 
Edit: I just checked and apparently the file contains both CR and LF as reported by Notepad++. Which, in turn, implies that both the OS and ifstream are reporting a wrong size. Say what?
 
Edit 2: GetFileSize() also returns the same size.
Edited by irreversible


Okay - this turned out to be a programming error :).

 

The bug was fairly silly, but subtle - apparently my pointer arithmetic didn't add up and the entire buffer wasn't parsed. Also, the difference happened to coincide with the number of lines in the file, compelling me to look elsewhere. When I NULL-terminated the buffer at the size index and checked for EOS instead of relying on offset calculations, everything added up. The strange thing is my offset calculation seems to be correct, which is why I didn't check it sooner.
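For illustration only (this is not my actual parser, just a sketch with the made-up names count_line_endings and demo_count): a buffer that is terminated at the size index can be walked until the terminator, with no offset arithmetic, and "\r\n", "\n" and a lone "\r" each count as one line ending.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts line endings in a NUL-terminated buffer.
// "\r\n", "\n" and a lone "\r" are each treated as a single line ending.
std::size_t count_line_endings(const char* text) {
    std::size_t count = 0;
    for (const char* p = text; *p != '\0'; ++p) {
        if (*p == '\r') {
            // p[1] is safe to read: at worst it is the terminator.
            if (p[1] == '\n') ++p;  // consume the LF of a CRLF pair
            ++count;
        } else if (*p == '\n') {
            ++count;
        }
    }
    return count;
}

// Eight raw bytes as they might come back from read(), then a terminator
// appended at the size index, mirroring the fix described above.
void demo_count() {
    const char raw[] = {'a', '\r', '\n', 'b', '\n', 'c', '\r', 'd'};
    std::vector<char> buf(raw, raw + sizeof raw);
    buf.push_back('\0');
    assert(count_line_endings(buf.data()) == 3);
}
```

The obvious caveat is that this only suits text: a genuinely binary file may contain 0 bytes, and the walk would stop at the first one.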


irreversible wrote:
"Edit: I just checked and apparently the file contains both CR and LF as reported by Notepad++. Which, in turn, implies that both the OS and ifstream are reporting a wrong size. Say what?

Edit 2: GetFileSize() also returns the same size."


Well then obviously you are mistaken about what the right size really is and are just misunderstanding the results you are getting. Could you post a complete minimal example along with an input file (in hex, please) and the output you get? Then we can parse it, see if it's actually a bug, and, if not, explain where your understanding goes wrong :)

EDIT: ignore, OP has now reported the bug was elsewhere
