Parsing / splitting a HUGE file?

ah ok np

I typically use vector.resize(0) in VC7.1 to set the size to 0 but leave the capacity intact, so repeatedly filling the buffer like Toolmaker suggested doesn't cost destruction and re-creation of the buffers.

This was a recommendation from the "Common C++ Performance Mistakes in Games" GDC paper, which can be found at http://www.gdconf.com/archives/2004/
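A minimal sketch of that pattern, assuming a simple char buffer refilled across several passes (the sizes here are made up for illustration):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<char> buffer;
    buffer.reserve(4096);                 // one up-front allocation

    for(int pass = 0; pass < 3; pass++)
    {
        buffer.resize(0);                 // size drops to 0, capacity is untouched
        for(int i = 0; i < 1000; i++)     // refill without any reallocation
            buffer.push_back('x');

        printf("pass %d: size=%u capacity=%u\n",
               pass, (unsigned)buffer.size(), (unsigned)buffer.capacity());
    }
    return 0;
}

The reserve happens once; every later pass reuses the same allocation, which is the whole point.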
Quote:Original post by DrEvil
I typically use vector.resize(0) in VC7.1 to set the size to 0 but leave the capacity intact, so repeatedly filling the buffer like Toolmaker suggested doesn't cost destruction and re-creation of the buffers.


Yeah, if I remember correctly that should do the same with all implementations. I'm actually surprised by the results of VC++ 7.1 and the difference in VC++ 8.0, considering it does this for basic_string and not vector.
Thanks for the replies, everyone.

OK, I'm still working on this... One problem I came across is apostrophes... SQL will choke if I try to feed it a string like 'Bob's Corner Store', because of the apostrophe... so I have to change it to 'Bob''s Corner Store'.

Anyway, I'm just trying to write a simple function which replaces all occurrences of ' with '', but I can't get it to work. It either writes over the next character or gets caught in an infinite loop.

Here's the code (it currently turns "Bob's Corner Store" into "Bob'' Corner Store" -- the 's' gets eaten):
std::string Change(std::string &str)
{
    std::string::size_type pos = str.find("'");
    while(pos != std::string::npos)
    {
        // this replaces TWO characters (the apostrophe plus whatever
        // follows it) with '', which is why the 's' disappears
        str.replace(pos,2,"''");
        pos+=2;
        pos = str.find("'",pos);
    }
    return str;
}


Thanks for any help, and please excuse any stupid mistakes... my brain is mush, it's almost 10:30 pm and I've been up and working since 8 AM... starting to get hard to type [grin].
FTA, my 2D futuristic action MMORPG
If you simply want to add in an ' to any preexisting ', then you can do something like this:

#include <iostream>
#include <string>

using std::cout;
using std::endl;

void Change(std::string &str)
{
    std::string::size_type pos = str.find("'");
    while(pos != std::string::npos)
    {
        str.insert( pos, "'" );
        // jump past both apostrophes so the new one isn't found again
        pos = str.find( "'", pos + 2 );
    }
}

int main(int argc, char* argv[])
{
    std::string str = "Drew's Example's";
    cout << str << endl;
    Change(str);
    cout << str << endl;
    return 0;
}


Which results in:
Drew's Example's
Drew''s Example''s

So rather than replace, simply add in!! [smile]
Back to the original question:

Why are you leaking memory? Do you know which process is leaking? Can you watch it in the process manager?

Maybe it's your database adapter layer that's leaking memory. If so, you have no choice but to split the original file and run it in batches.

However, I would still try inserting 50 records at a time inside a transaction -- even wrapping the inner part of your main loop in a transaction would be an improvement. I.e., where you do:

db.Run_Query(query1);
db.Run_Query(query2);
db.Run_Query(query3);
db.Run_Query(query4);
db.Run_Query(query5);
db.Run_Query(query6);
db.Run_Query(query7);
db.Run_Query(query8);
db.Run_Query(query9);
db.Run_Query(query10);
db.Run_Query(query11);
db.Run_Query(query12);


Just wrap this in a transaction ("begin transaction;" before, and "commit transaction;" after). It may improve both runtime and memory usage.
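A minimal sketch of that wrapping, reusing the db wrapper and query variables from the snippet above (the wrapper is graveyard filla's own, so treat this as illustrative rather than exact):

db.Run_Query("begin transaction;");

db.Run_Query(query1);
db.Run_Query(query2);
// ... query3 through query11 as before ...
db.Run_Query(query12);

db.Run_Query("commit transaction;");

With one transaction around the whole batch, the server can do a single log flush for all twelve inserts instead of a separate flush per autocommitted statement.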
enum Bool { True, False, FileNotFound };
OK.. for fun, I thought I'd run some tests. I created a 1 GB CSV file, which contained 5,003,205 lines of the form:
#,firstname_#,lastname_#,,,,,,,,,,Aug_#,Sept_#,Oct_#,Nov_#,Dec_#,Jan_#,Feb_#,Mar_#,Apr_#,May_#,Jun_#,Jul_#

Where each "#" was replaced with the entry number.
I omitted the database calls and cout statements.
Oh.. I did have to replace clear() with erase(buff.begin(),buff.end()) and push_back() with += because those two methods are not available in VC6.
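In other words, something like this (buff standing in for the std::string buffer in his code):

#include <string>

int main()
{
    std::string buff;

    buff.erase(buff.begin(), buff.end());  // VC6 substitute for buff.clear()
    buff += 'x';                           // VC6 substitute for buff.push_back('x')
    return 0;
}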

My dev/test computer is an Athlon 64 3500+ with 2GB PC4000 RAM and 10K rpm SATA drives in RAID-1.


Using graveyard filla's original code, the elapsed time was 457.1448 seconds, and the application's memory usage stayed at 692K throughout the whole test.

Switching to my fopen/fscanf implementation did the same work in 210.5727 seconds and only used 580K. (I should note that I replaced all the query=blah+blah+blah+blah lines with snprintf statements.)

And then for the heck of it, I replaced fscanf with a call to fgets() and used strtok to parse out the string:
#include <stdio.h>
#include <string.h>

int main()
{
   const char *szFormat = "insert into usage_history(esiid,first_name,last_name,month,usage) values(%s,%s,%s,%d,%s)";
   FILE *pFile = fopen("F:\\Projects\\tempdata\\testdata.csv","r");

   // array of query strings
   char aszQueries[12][256];
   // array of field values
   char aszFields[24][64];
   // temporary line storage
   char szLineTemp[256];

   // zero out our buffers
   memset(aszFields,0,sizeof(aszFields));
   memset(aszQueries,0,sizeof(aszQueries));
   memset(szLineTemp,0,sizeof(szLineTemp));

   char *szToken;
   char cTemp;
   int nField, nTemp;

   // fgets() returns NULL at end-of-file, which avoids the classic
   // while(!feof()) mistake of processing the last line twice
   while(fgets(szLineTemp,255,pFile))
   {
      // trim whitespace from the end of the line
      for(int nIdx=strlen(szLineTemp); nIdx>0; nIdx--)
      {
         cTemp = szLineTemp[nIdx-1];
         if(' '==cTemp || '\r'==cTemp || '\n'==cTemp)
            szLineTemp[nIdx-1]='\0';
         else
            break;
      }

      // tokenize the string on commas. strtok treats a run of consecutive
      // commas as one delimiter, so the empty CSV fields collapse and the
      // twelve month values land in aszFields[3] through aszFields[14]
      nField=0;
      szToken = strtok(szLineTemp,",");
      while(szToken)
      {
         strncpy(aszFields[nField],szToken,63);
         szToken = strtok(NULL,",");
         nField++;
      }

      // build our queries
      for(int nQuery=0; nQuery<12; nQuery++)
      {
         // deal with the weird month offset: Aug..Dec are months 8..12,
         // Jan..Jul are months 1..7
         nTemp = nQuery>4?nQuery-4:nQuery+8;
         _snprintf(aszQueries[nQuery], 255, szFormat, aszFields[0], aszFields[1], aszFields[2], nTemp, aszFields[nQuery+3]);
      }
   }
   fclose(pFile);
   return 0;
}


And that took 196.2625 seconds.

That was fun [smile]
Now, I'm not sure if I proved anything, other than that graveyard filla's file input code shouldn't leak... and whatever is using up his memory is elsewhere.

In any case, it was fun to do [smile]

EDIT: I know.. I know.. All the C++ purists are screaming in rage at me. "WHY ARE YOU USING SO MANY CHAR ARRAYS!!!!"
All I can say is: Because it's faster than working with string classes. Normally I wouldn't do this, but I was going for speed.

[Edited by - pragma Fury on June 3, 2005 1:55:51 AM]
Quote:Original post by pragma Fury
My dev/test computer is an Athlon 64 3500+ with 2GB PC4000 RAM and 10K rpm SATA drives in RAID-1.


Showoff [razz][grin] Anyways, great job! Nice to see those results. I might have to work up my own numbers for the getline and read versions, but eh, I'll think about it [wink]
Quote:Original post by Drew_Benton
Quote:Original post by pragma Fury
My dev/test computer is an Athlon 64 3500+ with 2GB PC4000 RAM and 10K rpm SATA drives in RAID-1.


Showoff [razz][grin] Anyways, great job! Nice to see those results. I might have to work up my own numbers for the getline and read versions, but eh, I'll think about it [wink]


Yeah.. I know.. I figure I paid enough for it, I might as well brag it up a bit. [embarrass]

Here's the code I used to create my test file, in case you wanted to do your own benchmarks... it takes a couple minutes.
Of course, as we're running the tests on different computers, we won't be able to compare results directly.
#include <stdio.h>

int main()
{
   FILE* pFile = fopen("F:\\projects\\tempdata\\testdata.csv","w");
   long MAX_FILE_SIZE = 1024000000; // ~1 GB
   int nRecord=1;
   bool bDone = false;

   while(!bDone)
   {
      fprintf(pFile, "%d,firstname_%d,lastname_%d,,,,,,,,,,Aug_%d,Sept_%d,Oct_%d,Nov_%d,Dec_%d,Jan_%d,Feb_%d,Mar_%d,Apr_%d,May_%d,Jun_%d,Jul_%d\r\n",
         nRecord, nRecord, nRecord, nRecord, nRecord,
         nRecord, nRecord, nRecord, nRecord, nRecord,
         nRecord, nRecord, nRecord, nRecord, nRecord);
      nRecord++;

      // stop once the file reaches the target size
      long lPos = ftell(pFile);
      if(lPos >= MAX_FILE_SIZE)
         bDone = true;
   }
   fclose(pFile);
   return 0;
}
Thanks for the replies everyone. Pragma, that code is pretty slick...

OK, I got the parsing to work properly, but the memory problem is still here... after inserting about 500k rows, I get a pop-up saying "Windows is low on virtual memory"... so I hit Ctrl-Alt-Delete and see SQL Server is using 256 megs of RAM!! My program was barely using any at all though... so it's not a problem with my program, but an issue with SQL Server...

I AM doing a "begin txn", inserting 50 rows, and then doing a commit. Also, I'm inserting each row one at a time, not throwing all 50 insert statements into one string and feeding it that string...

This is obviously some issue with SQL Server, something it is doing behind the scenes that I don't know about... I believe it's caching memory or something, but I can't figure out how to turn it off so I can do these inserts...


What's REALLY weird is, last night I let the thing run for a while... it inserted 19 million records... today I changed the tables around to be more normalized, and in fact made it so that instead of inserting 13 records it only has to insert 1 (I crammed all 12 usage months into the one row, because there's only a 1:1 ratio of usage months to customers, so there's no point in having those extra 12*8 million records around). Anyway, after changing the tables around, my PC started choking at 500k records... so how did I make it to 19 million with no problem before? So weird...

Thanks again for any help.
FTA, my 2D futuristic action MMORPG
Here is a Microsoft info page about SQL Server memory usage. Apparently there is an option where you can tell SQL Server to cap its memory usage at a certain point.
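For reference, a sketch of what setting that cap might look like, issued through the same db wrapper as the inserts (the sp_configure option name is from SQL Server's documentation, but double-check it against your server version):

// cap SQL Server's memory usage at 256 MB via sp_configure
db.Run_Query("EXEC sp_configure 'show advanced options', 1;");
db.Run_Query("RECONFIGURE;");
db.Run_Query("EXEC sp_configure 'max server memory (MB)', 256;");
db.Run_Query("RECONFIGURE;");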

And here is a short article about SQL Server memory leaks. I guess SP1 has leaks when you do bulk inserts.

When all else fails, [google] is your friend. [smile]
