Parsing / splitting a HUGE file?

Started by
38 comments, last by pragma Fury 18 years, 10 months ago
May I suggest a fopen/fscanf approach, since you know the format of the csv file.

Here's a quick bit of code I whipped up to read in a text file containing 100 csv entries composed of 10 comma-separated integers.

FILE *pFile = fopen("test.txt","r");// temporary storage for the values.int n1,n2,n3,n4,n5,n6,n7,n8,n9,n10;while(!feof(pFile)){   fscanf(pFile,"%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\r\n",      &n1,&n2,&n3,&n4,&n5,&n6,&n7,&n8,&n9,&n10);      // do something with the data.}fclose(pFile);/* Sample data:24464,26962,29358,11478,15724,19169,26500,6334,18467,415436,4827,11942,2995,491,9961,16827,23281,28145,570519895,19718,18716,17421,12382,292,153,3902,14604,323919894,17035,26299,25667,19912,1869,11538,14771,21726,54476868,28253,7711,15141,4664,17673,30333,31322,23811,28703778,27529,9741,8723,12859,20037,32757,32662,27644,25547*/


I dunno how fscanf would compare to reading in a line and tokenizing it using something like strtok, but it's gotta be faster than going char by char.
Advertisement
Another thing to look into is something like this:

#include <fstream>#include <iostream>#include <string>#include <vector>#include <sstream>using std::cout;using std::endl;std::ifstream fin("histY.csv.txt");std::string buff, elem;std::vector<std::string> fields;// Split by LFwhile( std::getline( fin, buff, char(10) ) ){	std::stringstream ss;	ss << buff;	while( std::getline( ss, elem, ',') )	{		// Erase the leading space if one is there		if( elem[0] == ' ' )			elem.erase( elem.begin() );		// Erase the trailing space if one is there		if( elem[elem.size() - 1] == ' ' )			elem.erase( --elem.end() );		// Show the element		cout << elem << endl;                // Now do whatever you want with each element here, probabally add it all the the vector	}        // When you get here, you have a vector of fileds, now you send it to the DB how you were doing it.}fin.close();


I'm still learning this standard C++ library stuff, so if I've mangled anything or there's a better way, anyone feel free to point it out, but that does compile and run. The test file used was:
first name,last name, address, city, state, month1, month2, month3first name,last name, address, city, state, month1, month2, month3
Quote:Original post by graveyard filla
Here is the function which takes the file and dumps it into the database. It worked fine with a 400 meg file (dumped 3.5 mill records). But I have another file that is 900 megs, and this is the one that gave me the "low on virtual memory" dealy. BTW, this code needs to be changed slightly to do an insert into 2 tables to be fully normalized.

This is for a power company.... this table has the power usage for like 8 million customers.. The problem is the schema they had was HORRIBLE, there table looked like this:

First Name, Last Name, Address, City, (etc), MonthUsage1, MonthUsage2, MonthUsage3, MonthUsage4, MonthUsage5.... etc. Not only did they cram 12 fields into the one table, but they made it RELATIVE, so MonthUsage1 isnt January, its actually August.

*** Source Snippet Removed ***


!

Well, it seems you still might have some minor normalization issues to sort out with the database in regards to storing redundent customer information. That and you've committed an affront to God with that code. Any chance you could add a rope ladder that you can pull up after you?
Thanks everyone for the replies... OK, I have some better ideas now, but I'm having problems understanding your guys suggestions..

Drew, from your example of getline(), it doesn't seem to fit with the reference i found for getline()... I'm guessing I should do something like fin.getline(&some_string,999999,"'") ? I put 999999 there because I want it to keep reading untill it finds the deliminator.

Michaelson, check the post right above yours, I posted the code.

Pragma, I'm trying to understand your example, but how does fscanf() know how many bytes / characters to read? And how does it know to ignore commas, etc.?

Thanks again.
FTA, my 2D futuristic action MMORPG
Quote:Original post by Michalson
Quote:Original post by graveyard filla
Here is the function which takes the file and dumps it into the database. It worked fine with a 400 meg file (dumped 3.5 mill records). But I have another file that is 900 megs, and this is the one that gave me the "low on virtual memory" dealy. BTW, this code needs to be changed slightly to do an insert into 2 tables to be fully normalized.

This is for a power company.... this table has the power usage for like 8 million customers.. The problem is the schema they had was HORRIBLE, there table looked like this:

First Name, Last Name, Address, City, (etc), MonthUsage1, MonthUsage2, MonthUsage3, MonthUsage4, MonthUsage5.... etc. Not only did they cram 12 fields into the one table, but they made it RELATIVE, so MonthUsage1 isnt January, its actually August.

*** Source Snippet Removed ***


!

Well, it seems you still might have some minor normalization issues to sort out with the database in regards to storing redundent customer information. That and you've committed an affront to God with that code. Any chance you could add a rope ladder that you can pull up after you?



lol... yes, as i mentioned before, i still need to add another table to make it fully normalized. It will be like this:

table1
t1_id,name, address, etc...

table2
t2_id, month, usage, t1_id(FK)
FTA, my 2D futuristic action MMORPG
Quote:Original post by Drew_Benton
Another thing to look into is something like this:

*** Source Snippet Removed ***

I'm still learning this standard C++ library stuff, so if I've mangled anything or there's a better way, anyone feel free to point it out, but that does compile and run. The test file used was:
first name,last name, address, city, state, month1, month2, month3first name,last name, address, city, state, month1, month2, month3


wow, thats a lot nicer then my code [grin]. one thing that i don't get, whats with the 2 lines where you trim the whitespace off the ends of the field? why did you do that?
FTA, my 2D futuristic action MMORPG
How does this database store information for more than 12 months? or is it even supposed to?


Quote:Original post by graveyard filla
Pragma, I'm trying to understand your example, but how does fscanf() know how many bytes / characters to read? And how does it know to ignore commas, etc.?


Y'know, I'm not really sure. The internals of fscanf are something of a mystery... All I really know is that the scanf functions work just like the printf functions, just in reverse. Here is the MSDN documentation on the method though, hope it helps.
It doesn't have to store data for more then 12 months
FTA, my 2D futuristic action MMORPG
Quote:Original post by graveyard filla
It doesn't have to store data for more then 12 months


Ah I see. Ok.

Here is another document titled "The Finer Points of scanf".
Quote:Original post by graveyard filla
Drew, from your example of getline(), it doesn't seem to fit with the reference i found for getline()... I'm guessing I should do something like fin.getline(&some_string,999999,"'") ? I put 999999 there because I want it to keep reading untill it finds the deliminator.


Unless you are using VS 6, the code I used for getline should work in dev and vs7. If you notice what I did, I am not using getline for the ifstream class, but rather the std::getline implementation, it is very different [wink]. In that example, what it does is first get an entire line, which is separted by a LF (character 10). Now after it get's a line, it sends it to a stringstream, which converts the entire line into a stream. I then pass that stream into the std::getline and it will get each element that is seperated by a ','. So if you have: 1,2, 3 ,4 5, 6 You will get:
12  3   4   5 6


Quote:
wow, thats a lot nicer then my code . one thing that i don't get, whats with the 2 lines where you trim the whitespace off the ends of the field? why did you do that?


I did that so if you have, like you said something messy, if you did have:
First Name ,Last Name , etc...

It would take care of that. Of course in my example it will only take away one of those spaces on each side, but you just use a while(string.begin()==' ') and that takes care of it. I was just 'trimming' the string [wink].

This topic is closed to new replies.

Advertisement