binary I/O

28 comments, last by Fruny 15 years ago
So if I wanted to use getline() I guess I'd have to eat that '\n' somehow or seekg forward one byte or something.
Hey Apoch, you said this:
Quote:
You may also want to think of a way to avoid hard coding that "Data/Images/" path into your Anibox code. Maybe ImageLoader() should add it internally?

So what's the professional way to handle file I/O? The way I have it right now I assume there's a directory called "Data" in the same directory as the executable. And in the Data directory there are other directories like "Images" and "Sounds".

project5/project5 (executable)
project5/Data/Sounds
project5/Data/Images
project5/Data/Maps
project5/Data/Anibox

But now in my code there's lots of places where I hard code "Data/Images" or whatever and load it into a string. It's a little sporadic where I do it at. Is it better form to put all that directory info into one place? Like ImageLoader ALWAYS looks for files in "Data/Images" and if it's not there then it says it can't find it? And the map loader always loads and writes to "Data/Maps" so it never needs to be passed directory info?

And then would it be even better to have a separate .h file where I #define all the directory names like

#define IMAGEDIR "Data/Images"
#define MAPDIR "Data/Maps"
Using #defines is not really a great idea. If anything you should just use constant strings.

If you want to keep things flexible, store your directory names in a config file and read that prior to doing any other file IO.

Note that there's nothing wrong with hardcoding your paths per se; but you should centralize that information instead of scattering it all across your code.


I'd suggest just a simple solution like this:

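namespace FileLocations
{
   // "const char* const" makes the pointers themselves constant too, so
   // these definitions can safely live in a header included from many files.
   const char* const PathToImages = "path/to/images/";
   const char* const PathToAudio = "path/to/audio/";
   // etc.
}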



Then you can access the path definition with a simple FileLocations::PathToFoo, which is clear, self-explanatory, and avoids needless repetition of the path information.
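As a rough sketch of how that centralization could go one step further, the namespace can also own the job of gluing a directory and a file name together, so callers only ever pass a bare file name. The directory strings follow the layout described earlier in the thread; ImagePath and MapPath are hypothetical helper names, not anything from the original code.

```cpp
#include <string>

namespace FileLocations
{
   // Assumed directory names, taken from the Data/ layout described above.
   const char* const PathToImages = "Data/Images/";
   const char* const PathToMaps = "Data/Maps/";

   // Hypothetical helpers: build the full relative path for a file name.
   inline std::string ImagePath(const std::string& fileName)
   {
      return std::string(PathToImages) + fileName;
   }

   inline std::string MapPath(const std::string& fileName)
   {
      return std::string(PathToMaps) + fileName;
   }
}
```

With this in place, an image loader would call FileLocations::ImagePath("hero.png") and never mention "Data/Images" itself. If the directories later move into a config file, only this one spot has to change.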


Sounds good.
Quote:Original post by icecubeflower
Quote:
On a stream, the 'text' or 'binary' flag only controls whether end-of-line character translation should be performed

That's it? That's all it does? That doesn't make any sense to me.


Like Fruny says, it's "historical". There actually is one more difference on some platforms: under Windows, a text-mode stream will typically interpret the byte with value 26 (control-Z) as an "end of file" marker. Linux has no such in-file marker, and text and binary modes behave identically there; character 4 (control-D) is instead handled by the terminal, which signals end of input when you type it. You can, incidentally, trigger "end of file" on std::cin from the keyboard this way: control-Z followed by Enter on Windows, control-D at the start of a line on Linux. (The control key plus A through Z are mapped to characters 1 through 26, oddly enough. And if you want to annoy people (including yourself), try writing character 7 to std::cout multiple times.)

Quote:I thought binary was what was making all those 00xAADFD things. Man.


All files are composed of bytes. They do not "contain" text nor numbers nor anything else. That is an interpretation we impose upon them. Each byte, if you're on a sufficiently "normal" computer and are not using Unicode, is an 8-bit integral value that is interpreted as a character of text using the 'ASCII' mapping.

Consider the abstract concept of the number sixty-five. This is a small enough number to represent in 8 bits, so there is a single byte with that value. A file could well contain a byte with that value, and if you looked at it with a text editor, you would see the character 'A'. If you looked at it in a hex editor, you would see the value 41, which is sixty-five written in hexadecimal (base sixteen) - hence "hex". The hex editor would display the character for the digit symbol '4' and the character for the digit symbol '1', because that's how it's programmed.

That file could be meant to represent the text "A". But it could also be meant to represent the number sixty-five. The interpretation is up to whatever program is reading the file.

You could also, however, have a file which contains the two bytes that represent the text "65", which would then be interpreted by a person reading the file as the number sixty-five. If you open this file in a text editor, you see the character '6' followed by the character '5', making the text "65" (which happens to be something we interpret as a number) on screen. If you open it in a hex editor, you see the characters '3635', where '36' represents the first byte (54 in decimal) and '35' represents the second byte (53 in decimal).

Now, when you read in from a file using operator>>, with an integer type variable as the destination (short, int, long, or the unsigned variations of those), the resulting code attempts to read several bytes from the current file position "as a human would". That is, it would read the byte '6' (with value 54) and the byte '5' (with value 53) from the second file, and interpret those as digits 6 and 5 of a base ten number, and store the value 65 in the variable. It would fail on the first file, because the symbol 'A' is not used for writing numbers in base ten.
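A small sketch of that difference, using a std::istringstream in place of a file so it can be tried without creating anything on disk. The function names readAsInt and firstByteValue are made up for the illustration, and the byte values assume an ASCII system.

```cpp
#include <sstream>
#include <string>

// Tries to read a base-ten integer from the start of `text`, the way
// operator>> would read it from a file. Returns true on success.
bool readAsInt(const std::string& text, int& value)
{
   std::istringstream in(text);
   return static_cast<bool>(in >> value);
}

// Returns the numeric value of the first byte of `text` (0 if it is empty).
int firstByteValue(const std::string& text)
{
   return text.empty() ? 0 : static_cast<unsigned char>(text[0]);
}
```

Given the text "65", readAsInt succeeds and stores sixty-five, even though firstByteValue reports 54 for the '6'. Given "A", firstByteValue reports 65 but readAsInt fails, because 'A' is not a base-ten digit.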

The behaviour of operator>> depends on the type of the destination variable. This is a deliberate design decision made so that the operator can "do the right thing" and interpret "human-readable" (a term often confused with "text") data. However, for all primitive types (and you'll need to remember this for later), the operator will skip past any "leading" whitespace at the current position in the stream, but leave any "trailing" whitespace behind, if a value is successfully read.

If you read into a 'char', for example, the stream will skip any whitespace, and then (assuming there is any data left in the source at all) read the first non-whitespace byte of the file into the variable. (Remember, a char is a byte, even when a char is not 8 bits - i.e. a byte is not necessarily 8 bits in C++! That is only a minimum value; bytes are allowed to be larger. But "on sufficiently normal systems", they are not larger.)
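To see the leading/trailing whitespace rule in action, here is a minimal sketch that reads one char with operator>> and then peeks at whatever the stream would deliver next. CharRead and readChar are hypothetical names used only for this example.

```cpp
#include <sstream>
#include <string>

// What was read, plus the next unread character (as returned by peek()).
struct CharRead
{
   char value;
   int next;
};

// Reads a single char with operator>> from `text`.
CharRead readChar(const std::string& text)
{
   std::istringstream in(text);
   CharRead result{'\0', -1};
   in >> result.value;        // skips leading whitespace first
   result.next = in.peek();   // trailing whitespace is still in the stream
   return result;
}
```

For the input "  \n x y", the leading spaces and newline are skipped, 'x' is read, and the space after it is left waiting in the stream.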

Quote:Except getline() wasn't working for me. The very first getline() read an empty string. And then after that it read cartoon1, cartoon2, etc.


std::getline() reads up to the first delimiter ('\n', by default) that it finds in the file. An empty string is a perfectly valid line of text - an empty line. Thus, if the stream is at a point where the very next character is '\n', then an empty line is read.

Now, consider what happens if your file contains a human-readable number, immediately followed by '\n', and then you use operator>> to read the human-readable number. The trailing whitespace is left alone, and '\n' is a kind of whitespace, so the stream is now at a point where the very next character is '\n'. :)
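The effect is easy to reproduce. The sketch below reads a count with operator>> and then calls std::getline(); a flag decides whether the rest of the current line is discarded in between. The function name and the flag are invented for the example, and ignore() is just one way of "eating" the '\n'.

```cpp
#include <istream>
#include <limits>
#include <sstream>
#include <string>

// Reads an int with operator>>, then returns the next std::getline() result.
// If skipRestOfLine is true, everything up to and including the '\n' that
// follows the number is thrown away before the getline() call.
std::string readCountThenLine(std::istream& in, int& count, bool skipRestOfLine)
{
   in >> count;
   if(skipRestOfLine)
   {
      in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
   }
   std::string line;
   std::getline(in, line);
   return line;
}
```

With the data "19\ncartoon1\n", the call without skipping returns an empty string, and the call with skipping returns "cartoon1".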

Quote:So then I switched to the >> operator and that worked.


The operator>>, when told to read into a string (either char* or std::string), will only read up to the next whitespace. If you want to read multiple words, it will not work.
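A quick side-by-side sketch of the two behaviours on the same input; firstWord and firstLine are hypothetical names for the illustration.

```cpp
#include <sstream>
#include <string>

// Reads with operator>>: stops at the first whitespace after the word.
std::string firstWord(const std::string& text)
{
   std::istringstream in(text);
   std::string word;
   in >> word;
   return word;
}

// Reads with std::getline(): stops only at the '\n' delimiter.
std::string firstLine(const std::string& text)
{
   std::istringstream in(text);
   std::string line;
   std::getline(in, line);
   return line;
}
```

For the text "big red dragon\nnext", firstWord gives "big" while firstLine gives "big red dragon".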

Quote:I wonder if I did it Apoch's way now in binary if getline() will work.


Whether the file is binary or not has almost no bearing on what will happen when you call std::getline(). The only thing that happens is that in text mode, \r\n sequences in the file data will be interpreted as if the file actually contained \n (so the lines that you read in will not contain \r).

Quote:It basically worked before, it's just in text mode somewhere between the 19, the '\n' and cartoon1 it goofed up and read an empty string.


I still don't think you were understanding this properly (streams don't spontaneously change mode, which is what it sounds like you're saying), but hopefully the above has cleared things up.



If you want to avoid the problem with mixing operator>> and std::getline(), the standard recommendation is to:

1) Always use std::getline() for initial reads from the file; and then
2) If you need to extract human-readable numbers (or other stuff) out of a line, construct a std::stringstream object from the read-in line, and call operator>> on the stringstream.

This also gains you some robustness in the face of corrupt data. (Axiom: input is always harder to do than output. Of course, the part in between is usually even harder. :) ) If a line is supposed to contain a number but doesn't, the std::stringstream will "fail", but the file stream is unaffected.
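A minimal sketch of that two-step recipe for the simplest case, a line that should hold one integer. readIntLine is a made-up name, and for brevity it returns false both when there is no line left and when the line does not start with a number; real code would probably want to tell those apart.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Step 1: pull a whole line out of the file stream with std::getline().
// Step 2: parse that line through a separate std::istringstream.
bool readIntLine(std::istream& in, int& value)
{
   std::string line;
   if(!std::getline(in, line))
   {
      return false;   // no more lines
   }
   std::istringstream fields(line);
   return static_cast<bool>(fields >> value);   // only `fields` can fail here
}
```

Fed the lines "12", "oops" and "34", the second call reports failure, but the file stream itself stays in a good state and the third call reads 34 normally.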

Of course, that doesn't work so well if you have an operator>> overload for a class that expects to read multiple lines ;) In general, you have to Think(TM). Sorry! :)
I understood everything until the very end.
Quote:
1) Always use std::getline() for initial reads from the file; and then

I don't know what you mean by "initial reads". I think you just mean I can read strings terminated with a '\n'.

After that I think you are saying that >> will crash if I try to read something like "AABA" into an int.

I never heard of a stringstream before, I'm reading about it now. It looks a lot like fstream but there is no open() so I guess I'm not supposed to open the file with it. I guess that makes sense because you're telling me to have the file open already and do getline() for "initial reads." So the file is already open so I should construct a stringstream object from the read in line. I'm not really sure what that means, maybe I can figure it out.

I think basically you are saying >> will crash when reading bad data into an int so read it into a stringstream or something. And the stringstream will simply "fail" if it gets bad data so I can just handle it.

Right now I'm just assuming all files are perfectly formatted and I just >> whenever I expect an int. So I suppose when my program gets a bad file it will crash. Fruny tells me that is not code I can be proud of. So I guess I'll read up on stringstream.

So this is all better than using read() and write() like I started with? I kind of liked read() and write() but I couldn't figure out a way to use strings with read(). Plus read() and write() run into problems if I move to a different endian system.

So I guess when I have my file reading code to the point where I can "be proud of it" it will use getline and stringstream and if it gets a wacky format file the function reading it can return false or catch an exception or something and exit gracefully with an error message instead of crashing.

****************

Later. Wait I think I get what you mean. >> will never read in multiple words, it will stop at whitespace. getline() will read EVERYTHING until it hits a '\n'. So when you say "initial reads" you mean read everything with getline(). Use getline() first and then use stringstream to inspect the line you just grabbed with getline(). I think I can figure it out.

***************

Later. Okay I think you are saying use getline() whenever I'm reading into strings because you can read anything into a string. But reading bad data into an int will crash so read into a stringstream. ifstream>>stringstream. Then read the stringstream into an int. stringstream>>int And if it's bad data instead of crashing it will fail and I can deal with it.

[Edited by - icecubeflower on April 19, 2009 11:22:55 PM]
A common error-recovery technique when parsing simple files is to read a whole line into a string, use that string as the source for a stringstream and then read the elements of that line. That way, if the data is invalid, you can report it and move on to the next line as if no error had occurred. This means that once you have read the whole file, you can report each and every line where read has failed rather than just the first one.

When a read operation fails, e.g. when you try to read text into an int, the stream's failbit is set (which you can test with fail()). The offending characters are not consumed... which means you need to clear them out manually (hence why simply restarting on the next line is easier - or you can use the stream's ignore()) as well as the error flag itself (with clear()).
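For the case where you do read directly from the stream, the recovery could look roughly like this. It is only a sketch: the function name is invented, and the policy of dropping the rest of the line after a bad token is an assumption made for the example.

```cpp
#include <istream>
#include <limits>
#include <sstream>
#include <vector>

// Collects ints from `in`. When a read fails on something that is not a
// number, the error flag is reset and the rest of that line is discarded.
std::vector<int> readIntsSkippingJunk(std::istream& in)
{
   std::vector<int> values;
   int v = 0;
   while(true)
   {
      if(in >> v)
      {
         values.push_back(v);
         continue;
      }
      if(in.eof())
      {
         break;       // ran out of input, nothing to recover from
      }
      in.clear();     // reset failbit so the stream can be used again
      in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
   }
   return values;
}
```

Note that without the clear() call, ignore() would do nothing, because a stream in a failed state refuses every further read.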

Here is, for example, an untested chunk of code that should read integer triplets from a file:

#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct Point
{
   int x;
   int y;
   int z;
};

bool parse(std::istream& is, std::vector<Point>& points, std::ostream& log = std::clog)
{
   bool parse_ok = true;
   std::string line;
   std::istringstream iss;
   std::size_t line_number = 1;

   // Loop on the result of getline() itself rather than on eof(), so that
   // a failed final read is not processed as if it were a real line.
   while(std::getline(is, line))
   {
      iss.str(line);
      Point tmp;
      iss >> tmp.x >> tmp.y >> tmp.z;
      if(!iss.fail())
      {
         points.push_back(tmp);
      }
      else
      {
         log << "Invalid data on line " << line_number << std::endl;
         parse_ok = false;
      }
      iss.clear();       // for the fail and end-of-file (line) flags!
      ++line_number;
   }
   return parse_ok;
}


If you don't do that and simply read 3 ints at a time, if you ever have an extra or missing number, your data will be corrupted in a way that is harder to find out.

Better code yet would test the fail bit after each individual read.
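One way that per-read checking might look, as a sketch only: the return convention (0 for success, otherwise the 1-based index of the first field that could not be read) and the name parsePoint are choices made for this example.

```cpp
#include <sstream>
#include <string>

struct Point
{
   int x;
   int y;
   int z;
};

// Parses "x y z" from one line. Returns 0 on success, or 1, 2 or 3 to say
// which field was the first one that could not be read. `out` is only
// modified when all three fields were read.
int parsePoint(const std::string& line, Point& out)
{
   std::istringstream iss(line);
   Point tmp{};
   if(!(iss >> tmp.x)) return 1;
   if(!(iss >> tmp.y)) return 2;
   if(!(iss >> tmp.z)) return 3;
   out = tmp;
   return 0;
}
```

The caller can then log something like "line 12: field 2 is not a number" instead of only knowing that the line as a whole was bad.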
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan
Got it.

I can't remember what that std:: thing is for. I never use it. I think I don't need it because I said:

using namespace std;

Is it better to do it one way or the other or does it matter?
Quote:Original post by icecubeflower
I understood everything until the very end.
Quote:
1) Always use std::getline() for initial reads from the file; and then

I don't know what you mean by "initial reads". I think you just mean I can read strings terminated with a '\n'.


An initial read would be a read that is initial. Here, the relevant definition of initial would be "occurring at the beginning". I.e., you read data from the file with std::getline() first, and then parse it with the stringstream.

Quote:After that I think you are saying that >> will crash if I try to read something like "AABA" into an int.


It will not "crash". It will fail to read anything. It is important to understand exactly what happens. Try here.

Quote:I never heard of a stringstream before, I'm reading about it now. It looks a lot like fstream but there is no open() so I guess I'm not supposed to open the file with it.


I already told you that you construct it from the string that is read in with std::getline. Did you really not wonder why it is called a stringstream? :) That is because it reads data out of (or into, if you use it for output) a string.

Quote:So the file is already open so I should construct a stringstream object from the read in line. I'm not really sure what that means, maybe I can figure it out.


Here, "read in" is an adjective. The line has been read in; it is a read in line. You construct the stringstream object from that line by passing it to the stringstream constructor. That's why it's called a constructor.

Quote:I think basically you are saying >> will crash when reading bad data into an int so read it into a stringstream or something. And the stringstream will simply "fail" if it gets bad data so I can just handle it.


It would fail in the same way that the file stream would. But because only the stringstream is affected, it's much easier to recover.

Quote:So I suppose when my program gets a bad file it will crash.


Your program doesn't crash; it simply doesn't behave properly. If you loop to read input, you may end up with an infinite loop, for reasons that are explained in the previously linked FAQ.

Quote:So this is all better than using read() and write() like I started with?


There are reasons to use raw ("binary"; but that's an inaccurate term) I/O, and reasons to use formatted ("text"; likewise) I/O. If one were clearly better all the time, the other wouldn't exist.

Quote:
Later. Wait I think I get what you mean. >> will never read in multiple words, it will stop at whitespace. getline() will read EVERYTHING until it hits a '\n'. So when you say "initial reads" you mean read everything with getline(). Use getline() first and then use stringstream to inspect the line you just grabbed with getline(). I think I can figure it out.

***************

Later. Okay I think you are saying use getline() whenever I'm reading into strings because you can read anything into a string. But reading bad data into an int will crash so read into a stringstream. ifstream>>stringstream. Then read the stringstream into an int. stringstream>>int And if it's bad data instead of crashing it will fail and I can deal with it.


You don't read into a stringstream. They don't work like that. You read into a string, and construct the stringstream from the string (i.e., create a stringstream that will read from the string). Otherwise, that's basically it.
Quote:Original post by icecubeflower
I never use it.


I usually use it, unless I have reasons not to.

Quote:I think I don't need it because I said:

using namespace std;


Correct. Although one important rule is that you should never* do that in a header file - use fully qualified names instead. If you add a using directive in a header, anyone who uses it sees their global namespace polluted by everything you imported there, instead of the decision being left to them.

Quote:Is it better to do it one way or the other or does it matter?


It does matter, though you probably need not worry about it yet.

By the way, the C++ FAQ Lite is a very good read.


* for some values of "never". There may be situations where it makes sense, but in those cases it is deliberate and not just convenience.
