Parsing / splitting a HUGE file?

Started by graveyard filla
38 comments, last by pragma Fury 18 years, 10 months ago
Hi, I have a huge text file, it's about 900 megs of data that's comma-delimited (.csv). It's basically just the dump of a huge table from a database. Anyway, the schema they were using was horrible, so I whipped up a little C++ program that parses the file and adds that data to a new (normalized) database. I left it running for a few hours and when I came back, my PC was bogged down and it said "The system is low on virtual memory...".

Anyway, has anyone ever parsed a huge file like this before? Any advice on a better way to do it? Should I do the whole thing at once? I was thinking of splitting the file into chunks... but how could I do such a thing? Is there any easy way to split this big bastard? I guess I could read say 50 lines into a buffer, spit it into a new file, clear the buffer, read 50 more lines, etc... then after doing this 10,000 times I start dumping to a new file. Thanks for any advice.
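A minimal sketch of the 50-lines-at-a-time splitting idea described above (the input name, output names, and chunk size are all placeholders; the replies below suggest streaming the file a line at a time instead, which avoids splitting entirely):

#include <fstream>
#include <sstream>
#include <string>

// Splits a big CSV into numbered chunk files, N lines per chunk.
int main()
{
    std::ifstream in("bighugefile.csv");
    const std::size_t linesPerChunk = 500000;   // pick whatever chunk size you like

    std::ofstream out;
    std::string line;
    std::size_t lineCount = 0;
    int chunkIndex = 0;

    while (std::getline(in, line))
    {
        if (lineCount % linesPerChunk == 0)     // time to start a new chunk file
        {
            if (out.is_open())
                out.close();

            std::ostringstream name;
            name << "chunk_" << chunkIndex++ << ".csv";
            out.open(name.str().c_str());
        }

        out << line << '\n';
        ++lineCount;
    }

    return 0;
}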
FTA, my 2D futuristic action MMORPG
When you were dumping the data into the new database, were you committing your transaction periodically? Some databases will keep a record of all changes you make during your transaction, so that should an error occur it can discard your changes and no harm is done to the database.

Unfortunately, the database has to store the transaction info somewhere until it's committed, and that's probably in system memory. Committing periodically will flush the changes out of the transaction cache.

The other issue may be that you're trying to load in the entire file. You should be able to use IO Streams to access data anywhere in the file without having to actually suck the whole thing into the heap.
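A minimal sketch of that streaming approach, assuming a hypothetical process_row() that stands in for "parse the line and run the inserts" -- only one line ever lives in memory at a time:

#include <fstream>
#include <string>

// process_row() is hypothetical -- it stands in for the parsing and inserts.
void process_row(const std::string& line);

void import_file(const char* path)
{
    std::ifstream in(path);
    std::string line;

    // Only the current line is held in memory; the rest of the
    // 900 MB file stays on disk until it is needed.
    while (std::getline(in, line))
        process_row(line);
}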
Well, your error suggests you are simply retaining too much in memory. What method are you using to read the file? You allude to dumping the data to a new file, even though earlier you said it was going into a database. Perhaps you could better explain what output format you are using. What operations are you doing to normalize the information? Are you storing any kind of tables in memory for lookup? (For example, if you are normalizing cities, you might keep the list of cities in memory as it is built, for fast lookup.)

As for the file itself, you seem to be a bit lost on how to stream information (I might guess that your problem stems from trying to load the entire 1 GB file at once). At the most basic level you should be familiar with how to use your compiler/file system library to properly open a file for reading. At a more advanced, performance-enhancing level, you should be filling a fixed buffer, extracting as many complete lines as possible, shifting the remaining data (the partial line) to the start of the buffer, and filling in the remainder from the file.
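A sketch of that fixed-buffer approach (the 32 KB buffer size is arbitrary, handle_line() is a hypothetical stand-in for the per-row work, and it assumes no single line exceeds the buffer):

#include <cstring>
#include <fstream>
#include <string>

// handle_line() is a hypothetical stand-in for "parse this row and do the inserts".
// Note: on Windows line endings a trailing '\r' is left in the string here.
void handle_line(const std::string& line);

void buffered_read(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    char buffer[32768];          // fixed-size buffer; assumes no line is longer than this
    std::size_t leftover = 0;    // bytes of a partial line carried over from the last fill

    for (;;)
    {
        in.read(buffer + leftover, sizeof(buffer) - leftover);
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0)
            break;

        std::size_t total = leftover + got;
        std::size_t start = 0;

        // Hand off every complete line currently in the buffer.
        for (std::size_t i = 0; i < total; ++i)
        {
            if (buffer[i] == '\n')
            {
                handle_line(std::string(buffer + start, i - start));
                start = i + 1;
            }
        }

        // Shift the trailing partial line to the front; the next read() fills in behind it.
        leftover = total - start;
        std::memmove(buffer, buffer + start, leftover);
    }

    if (leftover > 0)
        handle_line(std::string(buffer, leftover)); // last line had no trailing newline
}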

EDIT: pragma Fury raises a good point about transactions, though I had assumed that if you were having trouble with a basic upload you probably wouldn't be using them for batch inserts as you normally should (by default, most databases, when not told to explicitly start a transaction, will treat each command sent as an individual transaction).
[edit]Whoa I'm a slow typer

Well, first you will probably need to allocate more VM [wink]. When I was messing around with writing an XML parser back in December, I made ludicrous multi-GB files and worked with those to test it out. However, in your case, I don't see what the problem is with this.

If you have all your data comma-delimited, then basically you should be able to read in each element and send that to the new database, so it sounds like the problem is your new database and not the file itself you are parsing. I mean, you could split it into smaller files, but that's not going to help you any unless you are reading the entire file into memory.

So what you should do is basically either read lines in and send them to the database, or read into a buffer, process that, then continue on. File I/O with C/C++ is very fast, so unless you are getting data character by character, there shouldn't be too big of a problem [wink].

So for ideas, I'll give you one, but I'd need you to explain more on how you are getting data from the big file, then sending it to the new database. Can you show a quick excerpt from the file that shows how the data is organized, just like a few complete entries?

First idea that comes to mind from what you have said:

/* Format: Name, Date, Size, ... */
std::ifstream IF("bighugefile.csv");
std::string buffer;

while (std::getline(IF, buffer, ','))
{
    // buffer contains the first item, Name
    std::string name = buffer;

    std::getline(IF, buffer, ',');
    std::string date = buffer;

    std::getline(IF, buffer, ',');
    std::string size = buffer;

    // (note: the last field on a row is terminated by a newline, not a comma)
    SendToDataBase(name, date, size);
}


That will be very slow though if you go through each element individually, so optimization might be needed.
Quote:Original post by pragma Fury
When you were dumping the data into the new database, were you committing your transaction periodically? Some databases will keep a record of all changes you make during your transaction, so that should an error occur it can discard your changes and no harm is done to the database.

Unfortunately, the database has to store the transaction info somewhere until it's committed, and that's probably in system memory. Committing periodically will flush the changes out of the transaction cache.


I actually never commit anything... in fact, I'm confused about that. In all my programming/SQL experience, I've never even typed the word "commit". Am I missing something here? I'm using SQL Server, BTW.

Quote:
The other issue may be that you're trying to load in the entire file. You should be able to use IO Streams to access data anywhere in the file without having to actually suck the whole thing into the heap.


Well, I need the entire file to be parsed and added to a table(s)... Maybe I should split the file first, and then run the processing on each individual file?


Quote:
Well, your error suggests you are simply retaining too much in memory. What method are you using to read the file? You allude to dumping the data to a new file, even though earlier you said it was going into a database. Perhaps you could better explain what output format you are using. What operations are you doing to normalize the information? Are you storing any kind of tables in memory for lookup? (For example, if you are normalizing cities, you might keep the list of cities in memory as it is built, for fast lookup.)

As for the file itself, you seem to be a bit lost on how to stream information (I might guess that your problem stems from trying to load the entire 1 GB file at once). At the most basic level you should be familiar with how to use your compiler/file system library to properly open a file for reading. At a more advanced, performance-enhancing level, you should be filling a fixed buffer, extracting as many complete lines as possible, shifting the remaining data (the partial line) to the start of the buffer, and filling in the remainder from the file.

EDIT: pragma Fury raises a good point about transactions, though I had assumed that if you were having trouble with a basic upload you probably wouldn't be using them for batch inserts as you normally should (by default, most databases, when not told to explicitly start a transaction, will treat each command sent as an individual transaction).


I am taking the data from a .csv (text) file and dumping it into a SQL Server table... I'm using ADO/COM.

Let me better explain how I'm doing this. My program *should* be using very little memory... I only hold a single row of data in memory at one time. Basically it works like this:

-While fin.get(c)
-Parse a row of data into a std::vector<std::string>
-Do about 13 inserts with that new data

Thanks a lot for any more help.
FTA, my 2D futuristic action MMORPG
Quote:Original post by Drew_Benton
[edit]Whoa I'm a slow typer

Well, first you will probably need to allocate more VM [wink]. When I was messing around with writing an XML parser back in December, I made ludicrous multi-GB files and worked with those to test it out. However, in your case, I don't see what the problem is with this.

If you have all your data comma-delimited, then basically you should be able to read in each element and send that to the new database, so it sounds like the problem is your new database and not the file itself you are parsing. I mean, you could split it into smaller files, but that's not going to help you any unless you are reading the entire file into memory.

So what you should do is basically either read lines in and send them to the database, or read into a buffer, process that, then continue on. File I/O with C/C++ is very fast, so unless you are getting data character by character, there shouldn't be too big of a problem [wink].

So for ideas, I'll give you one, but I'd need you to explain more on how you are getting data from the big file, then sending it to the new database. Can you show a quick excerpt from the file that shows how the data is organized, just like a few complete entries?

First idea that comes to mind from what you have said:

*** Source Snippet Removed ***

That will be very slow though if you go through each element individually, so optimization might be needed.


Heh... actually, I AM reading character by character, using fin.get(c)... I didn't know about getline(). I'll have to look into it, but could getline() really be any faster than reading char by char? Surely it must be doing what I'm doing under the hood anyway?

FTA, my 2D futuristic action MMORPG
Quote:Original post by graveyard filla
-While fin.get(c)


Oh good lord no! [lol] You definitely do not want to use that. Sure, your program uses little memory, but at the cost of time. I made that same mistake when I was working with large XML files. When you go from get() to getline() you will see over a 50% speed increase. From getline() to read() you will see over a 200% increase (rough figures), but it's true. Just read in large power-of-two chunks (around 32 KB) and you will blaze through that data.
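For the getline() route, a line-at-a-time parse of the CSV might look roughly like this (a sketch only: read_rows() is a made-up name, and the quote handling from the character-by-character version is omitted):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

void read_rows(const char* path)
{
    std::ifstream fin(path);
    std::string row;

    while (std::getline(fin, row))            // one whole CSV row per call
    {
        std::vector<std::string> fields;
        std::istringstream ss(row);
        std::string field;

        while (std::getline(ss, field, ','))  // split the row on commas
            fields.push_back(field);

        // fields now holds one record, ready for the inserts
        // (quote/escape handling omitted here).
    }
}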

So what you need to do is read in chunks and then process from that. Do you have a sample of the data available?
If your file doesn't require a manual join, you could probably do something like:

file = openmyfile()
count = 0
while( got_data() ) {
    line = next_line_from_file()
    record = extract_record_from_line()
    if( count == 0 ) execute_sql( "begin transaction" )
    insert_record_into_database()
    if( count++ == 50 ) {
        execute_sql( "commit transaction" )
        count = 0
    }
}
if( count ) {
    execute_sql( "commit transaction" )
}
closemyfile()


The theory is to only read a little bit at a time from the input, and process that, then re-use the buffers from that operation when you do the next part. You can do this in one linear program -- there's no need to actually split the file initially. The other idea is to batch your operations into transactions, because committing N operations within a single transaction is faster than committing N separate operations outside (which means they implicitly have their own transactions).
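To make the batching concrete in C++ against SQL Server, it might look roughly like this (a sketch only: execute_sql() and next_insert_statement() are hypothetical stand-ins for the ADO calls, and the batch size of 50 is arbitrary):

#include <string>

// Hypothetical stand-ins: execute_sql() would send one statement through ADO,
// and next_insert_statement() would return false once the CSV is exhausted.
bool execute_sql(const std::string& sql);
bool next_insert_statement(std::string& sql);

void import_with_batches()
{
    const int batchSize = 50;           // arbitrary; tune as needed
    int rowsInBatch = 0;
    std::string insertSql;

    execute_sql("BEGIN TRANSACTION");   // T-SQL syntax for SQL Server

    while (next_insert_statement(insertSql))
    {
        execute_sql(insertSql);

        if (++rowsInBatch == batchSize)
        {
            execute_sql("COMMIT TRANSACTION");  // flush this batch
            execute_sql("BEGIN TRANSACTION");   // start the next one
            rowsInBatch = 0;
        }
    }

    execute_sql("COMMIT TRANSACTION");  // commit the trailing partial batch
}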

If you don't know what transactions do for you, though, you probably shouldn't be doing professional database programming... it's one of the core concepts of data storage, integrity, and multi-user operations (in that order).
enum Bool { True, False, FileNotFound };
Here is the function that takes the file and dumps it into the database. It worked fine with a 400 MB file (it dumped 3.5 million records), but I have another file that is 900 MB, and this is the one that gave me the "low on virtual memory" error. BTW, this code needs to be changed slightly to do an insert into two tables to be fully normalized.

This is for a power company... this table has the power usage for something like 8 million customers. The problem is that the schema they had was HORRIBLE; their table looked like this:

First Name, Last Name, Address, City, (etc), MonthUsage1, MonthUsage2, MonthUsage3, MonthUsage4, MonthUsage5... etc. Not only did they cram 12 usage fields into the one table, but they made them RELATIVE, so MonthUsage1 isn't January, it's actually August.

//this function takes the .csv file containing the data
//and turns it into rows in the usage_history table
if (!db.Connect("127.0.0.1", "admin", "***", "webapp"))
{
    cout << "database failed to connect" << endl;
    Bail();
}

std::ifstream fin("histY.csv");
char c;
std::vector<std::string> fields;
std::string buff;

while (fin.get(c))
{
    if (c == ',')               // end of a field
    {
        cout << buff << endl;
        fields.push_back(buff);
        buff.clear();
    }
    else if (c == 10)           // LF, end of a row
    {
        fields.push_back(buff);
        buff.clear();

        for (size_t i = 0; i < fields.size(); ++i)
            cout << fields[i] << endl;

        // Do the inserts. The usage columns are relative:
        // fields.at(12) is "month 1" of the dump, which is really August (8),
        // fields.at(13) is September (9), ... wrapping around so that
        // fields.at(17) is January (1) and fields.at(23) is July (7).
        static const char* months[12] =
            { "8", "9", "10", "11", "12", "1", "2", "3", "4", "5", "6", "7" };

        for (int m = 0; m < 12; ++m)
        {
            std::string query =
                "insert into usage_history(esiid,first_name,last_name,month,usage) values("
                + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ","
                + months[m] + "," + fields.at(12 + m) + ")";

            cout << query << endl;
            db.Run_Query(query);
        }

        fields.clear();
    }
    else if (c == '"')          // turn double quotes into single quotes in the text fields
    {
        if (fields.size() < 10)
            buff.push_back(39);  // '
    }
    else
    {
        buff.push_back(c);
    }
}

cout << "\nFinished Successfully!" << endl;
system("PAUSE");
FTA, my 2D futuristic action MMORPG
Quote:Original post by graveyard filla
I actually never commit anything... in fact, I'm confused about that. In all my programming/SQL experience, I've never even typed the word "commit". Am I missing something here? I'm using SQL Server, BTW.


As I suspected. At the moment you are actually doing a transaction for each and every command, which is very slow but somewhat safe (so long as there are no integrity issues, i.e. if this were run more than once you'd have a bad thing).

Quote:Original post by graveyard filla
Well, I need the entire file to be parsed and added to a table(s)... Maybe I should split the file first, and then run the processing on each individual file?


I think you *really* need to explain or show us your file code. It doesn't seem like you understand the difference between reading a file from disk as needed and reading the entire file into memory at once. You really need to show us what code you are trying to do this with.

Quote:Original post by graveyard filla
I am taking the data from a .csv (text) file and dumping it into a SQL Server table... I'm using ADO/COM.

Let me better explain how I'm doing this. My program *should* be using very little memory... I only hold a single row of data in memory at one time. Basically it works like this:

-While fin.get(c)
-Parse a row of data into a std::vector<std::string>
-Do about 13 inserts with that new data

Thanks a lot for any more help.


Do you properly clean up the vector object? This could be a memory leak issue.

