[C++] Basic Text File Parsing

Started by
10 comments, last by Zahlman 14 years, 1 month ago
Hello. I'm new to the site, I've been lurking a while and recently decided to plunge in and try to learn some OpenGL. I have some experience with c++ (beginner college courses, and messing around on my own), and was curious about trying to get into graphics. Anyway, I've been going through NeHe's tutorials. I'm up to lesson 10 now. I've been tweaking the code to get to know it better and oil my skills a bit (they're rusty from lack of use). Didn't have the glaux header, for instance, so I made a simple PNG loading class instead. Took me a whole weekend to read up on image loading, install the zlib and libpng libraries and write the code to get it working, but it was worth it once the cube loaded with the texture. So it's been fun. But I digress. Sorry. As I said, I'm at lesson 10, and was trying to update the c code in the world loading part of the tutorial to c++, making use of stream libraries and objects. The text file storing the vertex and texture mapping data has a few simple rules, - lines starting with // are comments and should be ignored. - there can be empty lines. - first line of valid data should read: "NUMPOLLIES nn", where nn is an int. - the rest of the data is formatted as lines of 5 ints each, holding the u, v, x, y and z coordinates. The update I did is basically:
SECTOR sector1;				// Our Model Goes Here:

int SetupWorld()
{
	float x, y, z, u, v;
	int numtriangles = 0 ;
	std::string line;
	std::fstream filein;
	std::stringstream ss;
	
	filein.open( "Data/World.txt", std::fstream::in );	// File To Load World Data From
    if ( !filein.is_open() ) abort();                   // If the file fails to load, abort

    while ( std::getline( filein , line ) )
    {
        if ( line[0] != '/' && ( ! line.empty() ) )   // If the line isn't a comment and you can read from it (not empty)
        {
            if ( line.find("NUMPOLLIES") == 0 )
            {
                ss << line.substr( 10 ) ;
                ss >> numtriangles ; 
                break ;                                 // once we have the number of triangles, we exit the loop
            }
        } 
    }
    if ( filein.eof() ) abort() ;                       // if we exited because the file ended without finding numtriangles, abort

	sector1.triangle = new TRIANGLE[numtriangles];
	sector1.numtriangles = numtriangles;
	for (int loop = 0; loop < numtriangles; loop++)
	{
		for (int vert = 0; vert < 3; )
		{
			std::getline( filein , line ) ;
			if ( line[0] != '/' && ( ! line.empty() ) )   // If the line isn't a comment and not empty
			{
                ss.clear() ;
                ss << line ;
                line.clear() ;
                ss >> x >> y >> z >> u >> v ;
                sector1.triangle[loop].vertex[vert].x = x;
                sector1.triangle[loop].vertex[vert].y = y;
                sector1.triangle[loop].vertex[vert].z = z;
                sector1.triangle[loop].vertex[vert].u = u;
                sector1.triangle[loop].vertex[vert].v = v;
                vert++ ;
            }
		}
	}
	filein.close() ;                                       // After getting all the triangles, close the file
	return 0 ;                                             // We're done
}
Thing is, I'm not sure that this is a good way of doing it. I've got the program working (need to slow it down, but it works), what I was hopiong you could help me with is noticing any obvious errors. Way I saw it, there are two parts to parsing the file: get the number of triangles, and then read the vertex data. That's why I put two loops. On the first loop, I enter the line into a string. If the first character is a slash, I decide it is a comment and skip it. If it is empty I also skip it. I keep skipping line until I find one that is neither empty nor starts with a slash. If the not-empty line starts with "numpollies" I feed a substring of the line to a stringstream object (ss). The substring would start from the end of NUMPOLLIES, so it should only contain whitespace and a number. I then feed it to the numtriangles variables, and once done I exit the loop. If I didn't exit the loop, it would keep going until I read the whole file and then abort the program. So then I enter the second loop. It is basically the same, but since I'm not looking for any keywords I just check for empty lines or comments, and anything that passes that filter is sent to ss and back out to the variables. The thing that bothers me is that I don't really like extracting a line from one stream, putting it in a string, doing the checking, then sending it back to another stream to load my variables. But I didn't see that I could do the necessary checks (comments, empty lines, keyword) any other way. Streams don't have find() methods and strings don't have formatted output methods. Is there something I'm missing? Thanks for reading this far, and I'll appreciate any help. [Edited by - Kian on February 26, 2010 7:12:15 PM]
Advertisement
I made some changes and added my own comments. This is completely untested and probably contains some errors, but the basic idea is sound, I think. As for your question about a better way of checking for comments: not really. Perhaps if you added a regular expression library.
int SetupWorld(){    std::fstream filein("Data/World.txt", std::fstream::in);    if ( !filein.is_open() ) abort();    do    {        std::string line;        std::getline( filein , line ) ;        // When you think about it, you just want to read until you find        // NUMPOLLIES, regardless of whether there was a comment or an empty line        if (  line.find("NUMPOLLIES") == 0 )        {            std::stringstream ss( line.substr(10) );            int numtriangles = 0;            ss >> numtriangles ;            // Let's pretend that sector1::triangle is a std::vector<TRIANGLE>            // Let's also pretend that TRIANGLE is a struct that contains an array of            // 3 VERTEXs.            sector1.triangle.reserve(numtriangles);            break ;        }    }while(filein.eof());    int num_verts = 0;    TRIANGLE triangle;    while(!filein.eof())    {        std::string line;        std::getline( filein , line ) ;        // You have to check if line is empty first. if line was empty then line[0] is incorrect        if (  !line.empty() && line[0] != '/' )        {            std::stringstream ss(line);            float x, y, z, u, v;            // You should really add some error checking.            // What if there weren't 5 floats on one line?            ss >> x >> y >> z >> u >> v ;            triangle.vertex[num_verts].x = x;            triangle.vertex[num_verts].y = y;            triangle.vertex[num_verts].z = z;            triangle.vertex[num_verts].u = u;            triangle.vertex[num_verts].v = v;            num_verts++;        }        if(num_verts == 3)        {            sector1.triangle.push_back(triangle);            num_verts = 0;        }    }    // filein will close automatically at the end of this function    // Why return anything if you're just going to hard-code 0 or abort?    return 0;}


[Edited by - nobodynews on February 26, 2010 2:14:45 AM]

C++: A Dialog | C++0x Features: Part1 (lambdas, auto, static_assert) , Part 2 (rvalue references) , Part 3 (decltype) | Write Games | Fix Your Timestep!

And just for fun, I tried a more compact version(in perl I've done similar text processing where I would put a regular expression in the if statement and then extract the data from the expression result...):
void SetupWorld(){    std::fstream filein("Data/World.txt", std::fstream::in);    if ( !filein.is_open() ) abort();    int vert = 0;    TRIANGLE triangle;    do    {        std::string line;        std::getline( filein , line ) ;        if ( line.find("NUMPOLLIES") == 0)        {            std::stringstream ss( line.substr(10) );            int numtriangles = 0;            ss >> numtriangles ;            sector1.triangle.reserve(numtriangles);        }        else if ( !line.empty() ) {}        else if ( line.find("/") == 0) {}        else // theoretically line contains 5 floats        {            std::stringstream ss(line);            float x, y, z, u, v;            ss >> x >> y >> z >> u >> v ;            triangle.vertex[vert].x = x;            triangle.vertex[vert].y = y;            triangle.vertex[vert].z = z;            triangle.vertex[vert].u = u;            triangle.vertex[vert].v = v;            vert++;            if(vert == 3)            {                sector1.triangle.push_back(triangle);                vert = 0;            }        }    }while(!filein.eof())}

C++: A Dialog | C++0x Features: Part1 (lambdas, auto, static_assert) , Part 2 (rvalue references) , Part 3 (decltype) | Write Games | Fix Your Timestep!

You could also load your objects using Daabli, instead of writing code to parse and load them manually.

Let's say your Vertex, Triangle and Sector structures were defined like so:
#include "Daabli.h"struct Vertex{    float x, y, z;    float u, v;    const bool Read(Daabli::Reader &r)    {        return            r.Read(x) && r.Read(y) && r.Read(z) &&            r.Read(u) && r.Read(v);    }};struct Triangle{    Vertex vertex[3];    const bool Read(Daabli::Reader &r)    {        return            r.Read(vertex[0]) &&            r.Read(vertex[1]) &&            r.Read(vertex[2]);    }};struct Sector{    int numtriangles;    std::vector<Triangle> triangle;    const bool Read(Daabli::Reader &r)    {        if( !r.ReadExpectedToken("NUMPOLLIES") || !r.Read(numtriangles) )            return false;        triangle.resize( numtriangles );        for(int i=0; i < numtriangles; ++i)            if( !r.Read( triangle ) )                return false;        return true;    }};

Then you could load your data using the following code:
int main(int /*argc*/, char * /*argv*/[]){    Daabli::Reader r;    if( !r.FromFile( "world.txt" ) )        return -1;    Sector sector;    if( !r.Read( sector ) )        return -1;    // Use sector here...    return 0;}

For more information on using Daabli, see this article:
http://www.gamedev.net/reference/programming/features/cppObjLoad/

Daabli will ignore C/C++ style comments in the input, skip leading/trailing whitespace, and provide you with error information.
Thank you very much for the replies. They're very helpful. I'll clean up the code a bit based on your suggestions. It is good to know I got the general idea right, at least.

Regarding file handling, I understand file input/output takes a lot of time compared to everything else. Is it best to try to read a file in one go (or in chunks as big as you can manage without exhausting your memory for large files) and then work with the contents from memory, or to do several inputs as you need them?

About some of the comments;

// Why return anything if you're just going to hard-code 0 or abort?

It was something of a hackish job. I was more interested in getting it working. All the aborts where also a way of getting some quick feedback. If it just hung, I didn't know what was failing. If it aborted, I knew where it might have been failing. As for why to return 0, I tend to avoid void functions. On a more serious project I would remove the aborts, and might either make it void or provide more return options.

// You should really add some error checking

I really should. But as I mentioned above, I didn't want to waste time thinking of error possibilities and adding a ton of error checking code. My priority was to find a good way of loading the data from the file first, get it running second. I knew the file was going to be well formatted, so error checking was the lowest priority.

// filein will close automatically at the end of this function

I prefer being explicit. Just the way I was taught, I guess. Is it actually detrimental to the program in any way if you close the file manually as oppossed to leaving it up to the function to do it when it is done?

// You could also load your objects using Daabli

I found links to a couple of libraries that offered that functionality while I researched a solution, but I wanted to solve the problem on my own first. File I/O is something basic that needs to be dealt with, and I needed the practice. That said, I'll look into it for future projects. No need to reinvent the wheel when there's a car available.
It would simplify parsing to remove the "NUMPOLLIES" keyword. The format of the file is unambiguous with just a single number on a line by itself, representing the number of triangles, and then lines with the vertex data. You can use comments to explain, if you need to.
Quote:Original post by Kian
Regarding file handling, I understand file input/output takes a lot of time compared to everything else. Is it best to try to read a file in one go (or in chunks as big as you can manage without exhausting your memory for large files) and then work with the contents from memory, or to do several inputs as you need them?

It depends. The standard streams will do some buffering by default. You can get fancy and mess with the buffer too. Sometimes it is easy to read the entire file into a string in one step, or in very large chunks at a time.

You can then parse the string in memory using std::stringstream. In fact, I believe you might even be able to read the data directly into the stringstream:
bool loadFile(const std::string &fileName, std::stringstream &contents){   std::ifstream in(fileName.c_str());   if(contents << in.rdbuf())   {       return true;   }   // I haven't worked on a low level with std::streams in a while   // This is supposed to copy the error flags from the file to the stream, so   // if you try to read from the stream it will fail too.   contents.setstate(in.rdstate());   return false;}


Quote:
// Why return anything if you're just going to hard-code 0 or abort?

It was something of a hackish job. I was more interested in getting it working. All the aborts where also a way of getting some quick feedback. If it just hung, I didn't know what was failing. If it aborted, I knew where it might have been failing. As for why to return 0, I tend to avoid void functions. On a more serious project I would remove the aborts, and might either make it void or provide more return options.

Even something like throwing std::runtime_error() with a descriptive string or returning some kind of error code would be better, IMO. A simple abort() only tells you that something is failing, not where and why.

Quote:
I really should. But as I mentioned above, I didn't want to waste time thinking of error possibilities and adding a ton of error checking code. My priority was to find a good way of loading the data from the file first, get it running second. I knew the file was going to be well formatted, so error checking was the lowest priority.

The only thing is that people have a tendency to get it working and then forget to come back and implement error checking. Then a few weeks (or even hours!) later they try to load a corrupt file, and the system either breaks in an unhelpful manner that gives no clues as to what goes on, or worse still loads the file but starts working on the invalid data. That is a problem you will waste a lot more time trying to figure out.

Quote:
I prefer being explicit. Just the way I was taught, I guess. Is it actually detrimental to the program in any way if you close the file manually as oppossed to leaving it up to the function to do it when it is done?

No, it will be fine. But the idiom in C++ is to use RAII, which is heavily in favour of being implicit about cleanup. This way you cannot forget it, either by inserting additional return statements, throwing an exception (or calling another function that ends up throwing) or otherwise moving around the code.


Quote:
Regarding file handling, I understand file input/output takes a lot of time compared to everything else. Is it best to try to read a file in one go (or in chunks as big as you can manage without exhausting your memory for large files) and then work with the contents from memory, or to do several inputs as you need them?


If speed is really what you're concerned about, maybe consider dropping the whole "text file" idea in exchange for a binary file? If you have to constantly parse lines, convert strings to floats, check line formats, you would get a huge speedup from just converting the file to binary. I used to try to parse large obj files that would take 10+ seconds to load, but if I convert them to a binary format I can load the file almost instantly.

Just store the number of polygons in the first four bytes of the file, and then just store floats in the rest of the file. Then when you're ready to read the file just do (in pseudo-c++, dont remember all the appropriate function names):
int numpollies;file.open();file.read((char*)&numpollies, sizeof(int));pollies = new float[5*3*numpollies]file.read((char*)pollies, sizeof(float)*5*3*numpollies);file.close()


And then you have an array of all of your information that you can format into your structures as need be. No need for strings, stringstreams, and much much faster. Only disadvantage is that its no longer easily human readable, and theres no comments. But if you want you can use your human readable format, then convert to a binary format once its working properly. Then you don't have to parse the text file every time you start the game.
[size=2]My Projects:
[size=2]Portfolio Map for Android - Free Visual Portfolio Tracker
[size=2]Electron Flux for Android - Free Puzzle/Logic Game
0) Be consistent about indentation. A tab is not four, nor eight, nor any number of spaces, no matter what your editor tells you or what you tell it.

1) Don't comment obvious things. Comments are a warning flag that something tricky is going on. If you repeatedly warn that something tricky is going on when it isn't, you run the risk of becoming the proverbial boy who cried wolf.

2a) Write the function in such a way that it you can tell it what SECTOR to initialize, instead of just assuming one.

There are two possibilities for this. One is to decide that the purpose of this function is to create a SECTOR; i.e. there's no particular reason a SECTOR has to exist (in an uninitialized state) before the file has loaded. In this case, you can rewrite the function as a constructor. This is the approach I have shown.

The other way is to decide that the purpose is to modify an existing SECTOR, in which case you would pass that SECTOR in as a parameter, by reference.

2b) Similarly, write the function in such a way that you can tell it what file to load from, instead of just assuming one. This is straightforward: add a parameter for the file name.

3) The purpose of the return value of a function is to return something meaningful. If you have nothing meaningful to return, then don't. The 'void' return type exists for a reason.

Since I'm writing the function as a constructor, there can't be a return type anyway. This is not a problem.

4) Handle errors at the point where you have the information needed to handle them. The calling code might have a better idea of what to do if the file isn't found or whatever. Maybe it doesn't, but it's still the calling code's responsibility. So report the error, by throwing an exception. (If you wrote the function to modify an existing SECTOR, you could return a status code instead; but exceptions are generally considered cleaner.)

5) There is no reason to call .close() manually on a file stream in normal cases. The underlying file will be closed automatically by the destructor of the fstream object.

6) Don't specify the 'in' file mode for an fstream; instead, just use an ifstream (which is always in in-mode). The file mode flags are provided for when you need to do something special (e.g. open for both input and output, or use a different mode such as appending, or specify binary file mode...). Similarly, use the constructor instead of .open(), when you can.

7) Don't do your own memory management. Instead of maintaining a length count and a dynamically allocated array, replace both with a std::vector instance (which keeps track of its length automatically).

8) Simplify your logic. Recognize impossible cases. For example, a line that contains "NUMPOLLIES" can't possibly be empty, and in particular a string that begins with "NUMPOLLIES" can't also begin with '/'.

9) Don't hard-code magic numbers, such as the length of the magic word "NUMPOLLIES". There is a simpler solution here: use the stringstream to parse the entire string.

10) Scope variables as tightly as possible, and initialize them with initial values, where possible. And when you do initialize, use a value that's useful. Initializing 'numtriangles' to 0 is not useful because we are going to overwrite it anyway, and we could overwrite it with 0. But initializing it to -1 is useful, because we are supposed to overwrite it with a non-negative value, and this helps us simplify the "did we read a good value?" logic.

11) You can often simplify code (reduce indentation) by turning conditions into "guard clauses". That is, instead of saying "if (something) { lots of code }", you say "if (not something) { GTFO of here }; lots of code" - and then the 'lots of code' is not indented. In the current case, instead of doing part of the 'vert' loop conditionally on reading a non-empty line, I set up a separate inner loop to read until we get a non-empty line, and then process it unconditionally. This also allows for using the 'for' loop construct in the natural way.

12) You need error checking in case data for a given triangle is corrupt, or in case you run out of file data while reading the triangles.

13) Don't use 'ss << line' to try to put the whole line into a stringstream, because it won't work - it will only read the first word. A stringstream is a stream, after all - the usual rules for reading apply. (That is wrong, but the design advice still applies:) But again, you can avoid this by re-creating the stringstream object, which happens naturally when you scope your variables properly.

14) Read into data members of a struct directly, instead of copying from local variables. To avoid repeated code to refer to the struct, alias it using a reference. (Better yet, write a helper function to overload operator<< for the struct; I can show this upon request.)

SECTOR::SECTOR(const std::string& filename) {	int numtriangles = -1;	std::string line;	std::ifstream filein(filename.c_str());		if (!filein.is_open()) { throw std::runtime_error("File not found"); }	while (std::getline(filein, line)) {		std::stringstream ss(line);		std::string word;		ss >> word;		if (word == "NUMPOLLIES") {			ss >> numtriangles; 			break;		} 	}	if (numtriangles < 0) { throw std::runtime_error("Polygon count in file negative or missing"); }	triangles.resize(numtriangles);	for (int loop = 0; loop < numtriangles; loop++) {		for (int vert = 0; vert < 3; vert++) {			do {				if (!std::getline(filein, line)) {					throw std::runtime_error("Premature end of file");				}			} while (line[0] == '/' || line.empty());			std::stringstream ss(line);			Vertex& v = triangles[loop].vertex[vert];			if (!(ss >> v.x >> v.y >> v.z >> v.u >> v.v)) {				throw std::runtime_error("Corrupt triangle data");			}		}	}}


[Edited by - Zahlman on February 26, 2010 7:27:16 PM]
Quote:Original post by Zahlman
13) Don't use 'ss << line' to try to put the whole line into a stringstream, because it won't work - it will only read the first word. A stringstream is a stream, after all - the usual rules for reading apply. But again, you can avoid this by re-creating the stringstream object, which happens naturally when you scope your variables properly.

Using << on strings with spaces will work with std::stringstream. Were you thinking of >> ?

This topic is closed to new replies.

Advertisement