graveyard filla

Parsing / splitting a HUGE file?


Hi, I have a huge text file, about 900 megs of comma-delimited (.csv) data. It's basically just the dump of a huge table from a database. The schema they were using was horrible, so I whipped up a little C++ program that parses the file and adds the data to a new (normalized) database. I left it running for a few hours and when I came back, my PC was bogged down and it said "The system is low on virtual memory...". Has anyone ever parsed a huge file like this before? Any advice on a better way to do it? Should I do the whole thing at once? I was thinking of splitting the file into chunks... but how would I do that? Is there an easy way to split this big bastard? I guess I could read, say, 50 lines into a buffer, spit it into a new file, clear the buffer, read 50 more lines, etc., then after doing this 10,000 times start dumping to a new file. Thanks for any advice.

pragma Fury    343
When you were dumping the data into the new database, were you committing your transaction periodically? Some databases will keep a record of all changes you make during your transaction, so that should an error occur it can discard your changes and no harm is done to the database.

Unfortunately, the database has to store the transaction info somewhere until it's committed, and that's probably in system memory. Committing periodically will flush the changes out of the transaction cache.

The other issue may be that you're trying to load in the entire file. You should be able to use IO Streams to access data anywhere in the file without having to actually suck the whole thing into the heap.

Michalson    1657
Well, your error suggests you are simply retaining too much in memory. What method are you using to read the file? You allude to dumping the data to a new file, even though earlier you said it was a database. Perhaps you could better explain what output format you are using. What operations are you doing to normalize the information? Are you storing any kind of tables in memory for lookup? (For example, if you are normalizing cities, you might keep the cities list in memory as it is built, for fast lookup.)

As for the file itself, you seem to be a bit lost on how to stream information (I might guess that your problem stems from trying to load the entire 1GB file at once). At the most basic level you should be familiar with how to use your compiler/file system library to properly open a file for reading. At a more advanced, performance-enhancing level you should be filling a fixed buffer, getting as many complete lines as possible, shifting the remaining data (the partial line) to the start of the buffer and filling in the remainder from the file.
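Something along these lines is a minimal sketch of that fixed-buffer approach, assuming one record per line; ProcessRow and the 64 KB buffer size are just stand-ins, not anything from the poster's code:

#include <stdio.h>
#include <string.h>
#include <string>

void ProcessRow(const std::string& line); // hypothetical: parse one row and do the inserts

void StreamFile(const char* path)
{
    FILE* f = fopen(path, "rb");
    if (!f) return;

    char buf[64 * 1024];
    size_t held = 0; // bytes of a partial line carried over from the previous fill

    size_t got;
    while ((got = fread(buf + held, 1, sizeof(buf) - held, f)) > 0)
    {
        size_t total = held + got;
        size_t start = 0;

        // hand off every complete line currently in the buffer
        for (size_t i = 0; i < total; ++i)
        {
            if (buf[i] == '\n')
            {
                ProcessRow(std::string(buf + start, i - start));
                start = i + 1;
            }
        }

        // shift the trailing partial line to the front and refill behind it
        held = total - start;
        memmove(buf, buf + start, held);
    }

    if (held) // the file didn't end with a newline
        ProcessRow(std::string(buf, held));

    fclose(f);
}

Only the buffer and one row are ever in memory at a time, no matter how large the file is.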

EDIT: pragma Fury raises a good point about transactions, though I had assumed that if you were having trouble with a basic upload you probably wouldn't be using them for batch inserts as you normally should (by default, most databases, when not told to explicitly start a transaction, will treat each command sent as an individual transaction).

Drew_Benton    1861
[edit] Whoa, I'm a slow typist.

Well, first you will probably need to allocate more VM [wink]. When I was messing around with writing an XML parser in December, I made ludicrous multi-GB files and worked with those to test it out. However, in your case, I don't see what the problem is.

If you have all your data comma-delimited, then basically you should be able to read in each element and send it to the new database, so it sounds like the problem is your new database and not the file you are parsing. You could split it into smaller files, but that's not going to help you any unless you are reading the entire file into memory.

So what you should do is basically either read lines in and send them to the database, or read into a buffer, process that, then continue on. File I/O with C/C++ is very fast, so unless you are getting data character by character, there shouldn't be too much of a problem [wink].

As for ideas, I'll give you one, but I'd need you to explain more about how you are getting data from the big file and sending it to the new database. Can you show a quick excerpt from the file that shows how the data is organized, just a few complete entries?

First idea that comes to mind from what you have said:


/* Format
   Name, Date, Size, Name ...
*/

std::ifstream IF("bighugefile.csv");
std::string buffer;

while( std::getline(IF, buffer, ',') )
{
    // buffer contains the first item, Name
    std::string name = buffer;

    std::getline(IF, buffer, ',');
    std::string date = buffer;

    // if each record sits on its own line, read this last field with the
    // default '\n' delimiter instead, so the row ends at the newline
    std::getline(IF, buffer, ',');
    std::string size = buffer;

    SendToDataBase(name, date, size);
}




That will be very slow though if you go through each element individually, so optimization might be needed.

graveyard filla
Quote:
Original post by pragma Fury
When you were dumping the data into the new database, were you committing your transaction periodically? Some databases will keep a record of all changes you make during your transaction, so that should an error occur it can discard your changes and no harm is done to the database.

Unfortunately, the database has to store the transaction info somewhere until it's committed, and that's probably in system memory. Committing periodically will flush the changes out of the transaction cache.


I actually never commit anything... in fact, I'm confused about that. In all my programming / SQL experience, I never even typed the word "commit". Am I missing something here? I'm using SQL Server, BTW.

Quote:

The other issue may be that you're trying to load in the entire file. You should be able to use IO Streams to access data anywhere in the file without having to actually suck the whole thing into the heap.


Well, I need the entire file to be parsed and added to a table(s)... Maybe I should split the file first, and then run the processing on each individual file?


Quote:

Well, your error suggests you are simply retaining too much in memory. What method are you using to read the file? You allude to dumping the data to a new file, even though earlier you said it was a database. Perhaps you could better explain what output format you are using. What operations are you doing to normalize the information? Are you storing any kind of tables in memory for lookup? (For example, if you are normalizing cities, you might keep the cities list in memory as it is built, for fast lookup.)

As for the file itself, you seem to be a bit lost on how to stream information (I might guess that your problem stems from trying to load the entire 1GB file at once). At the most basic level you should be familiar with how to use your compiler/file system library to properly open a file for reading. At a more advanced, performance-enhancing level you should be filling a fixed buffer, getting as many complete lines as possible, shifting the remaining data (the partial line) to the start of the buffer and filling in the remainder from the file.

EDIT: pragma Fury raises a good point about transactions, though I had assumed that if you were having trouble with a basic upload you probably wouldn't be using them for batch inserts as you normally should (by default, most databases, when not told to explicitly start a transaction, will treat each command sent as an individual transaction).


I am taking the data from a .csv (text) file and dumping it into an SQL Server table... I'm using ADO / COM.

Let me better explain how I'm doing this. My program *should* be using very little memory... I only hold a single row of data in memory at one time. Basically it works like this:

-While fin.get(c)
-Parse a row of data into a std::vector<std::string>
-Do about 13 inserts with that new data

Thanks a lot for any more help.

graveyard filla
Quote:
Original post by Drew_Benton
[edit] Whoa, I'm a slow typist.

Well, first you will probably need to allocate more VM [wink]. When I was messing around with writing an XML parser in December, I made ludicrous multi-GB files and worked with those to test it out. However, in your case, I don't see what the problem is.

If you have all your data comma-delimited, then basically you should be able to read in each element and send it to the new database, so it sounds like the problem is your new database and not the file you are parsing. You could split it into smaller files, but that's not going to help you any unless you are reading the entire file into memory.

So what you should do is basically either read lines in and send them to the database, or read into a buffer, process that, then continue on. File I/O with C/C++ is very fast, so unless you are getting data character by character, there shouldn't be too much of a problem [wink].

As for ideas, I'll give you one, but I'd need you to explain more about how you are getting data from the big file and sending it to the new database. Can you show a quick excerpt from the file that shows how the data is organized, just a few complete entries?

First idea that comes to mind from what you have said:

*** Source Snippet Removed ***

That will be very slow though if you go through each element individually, so optimization might be needed.


Heh... actually, I AM reading character by character, using fin.get(c)... I didn't know about getline(). I'll have to look into it, but could getline() really be any faster than reading char by char? Surely it must be doing what I'm doing under the hood anyway?

Drew_Benton    1861
Quote:
Original post by graveyard filla
-While fin.get(c)


Oh good lord no! [lol] You definitely do not want to use that. Sure your program uses little memory, but at the cost of time. I made that same mistake when I was working with large XML files. When you go from get() -> getline() you will see over a 50% speed increase. From getline -> read you will see over a 200% increase (rough figures), but it's true. Just use large power-of-two chunks (e.g. 32768-byte chunks) and you will blaze through that data.

So what you need to do is read in chunks and then process from that. Do you have a sample of the data available?

hplus0603    11356
If your file doesn't require a manual join, you could probably do something like:


file = openmyfile()
count = 0
while( got_data() ) {
    line = next_line_from_file()
    record = extract_record_from_line()
    if( count == 0 ) execute_sql( "begin transaction" )
    insert_record_into_database()
    if( count++ == 50 ) {
        execute_sql( "commit transaction" )
        count = 0
    }
}
if( count ) {
    execute_sql( "commit transaction" )
}
closemyfile()


The theory is to only read a little bit at a time from the input, and process that, then re-use the buffers from that operation when you do the next part. You can do this in one linear program -- there's no need to actually split the file initially. The other idea is to batch your operations into transactions, because committing N operations within a single transaction is faster than committing N separate operations outside (which means they implicitly have their own transactions).

If you don't know what transactions do for you, though, you probably shouldn't be doing professional database programming... it's one of the core concepts of data storage, integrity, and multi-user operations (in that order).
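In C++ against SQL Server, the same batching could look roughly like the sketch below. Database, ParseRow and BuildInsert are hypothetical stand-ins for the ADO wrapper, the CSV parsing and the query building; BEGIN/COMMIT TRANSACTION is plain T-SQL.

#include <string>
#include <vector>

struct Database { void Run_Query(const std::string& sql); }; // assumed ADO/COM wrapper
bool ParseRow(std::vector<std::string>& row);                // reads the next CSV row, false at EOF
std::string BuildInsert(const std::vector<std::string>& row, int month);

void LoadFile(Database& db)
{
    std::vector<std::string> row;
    int pending = 0;

    db.Run_Query("BEGIN TRANSACTION");
    while (ParseRow(row))
    {
        for (int month = 1; month <= 12; ++month)
            db.Run_Query(BuildInsert(row, month));

        if (++pending >= 50) // commit every ~50 rows, then start the next batch
        {
            db.Run_Query("COMMIT TRANSACTION");
            db.Run_Query("BEGIN TRANSACTION");
            pending = 0;
        }
    }
    db.Run_Query("COMMIT TRANSACTION"); // flush whatever is left in the final batch
}

The batch size is arbitrary; the point is simply not to pay the per-statement transaction overhead twelve times per row.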

graveyard filla
Here is the function which takes the file and dumps it into the database. It worked fine with a 400 meg file (dumped 3.5 million records), but I have another file that is 900 megs, and that's the one that gave me the "low on virtual memory" deal. BTW, this code needs to be changed slightly to insert into 2 tables to be fully normalized.

This is for a power company... this table has the power usage for like 8 million customers. The problem is the schema they had was HORRIBLE; their table looked like this:

First Name, Last Name, Address, City, (etc.), MonthUsage1, MonthUsage2, MonthUsage3, MonthUsage4, MonthUsage5... etc. Not only did they cram 12 fields into the one table, but they made them RELATIVE, so MonthUsage1 isn't January, it's actually August.


// this function takes the .csv (text) file containing the data
// and turns it into rows in the new table; db is the ADO/COM wrapper

if(!db.Connect("127.0.0.1","admin","***","webapp"))
{
    cout << "database failed to connect" << endl;
    Bail();
}

std::ifstream fin("histY.csv");

char c;
std::vector<std::string> fields;
std::string buff;

while(fin.get(c))
{
    if(c == ',')
    {
        cout << buff << endl;
        fields.push_back(buff);
        buff.clear();
    }
    else if(c == 10) // LF, new row
    {
        fields.push_back(buff);
        buff.clear();

        for(int i = 0; i < fields.size(); ++i)
        {
            cout << fields[i] << endl;
        }

        // do the inserts here
        // fields.at(12) is month 1, i.e. August (month 8)
        std::string query1 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",8,"
            + fields.at(12) + ")";

        cout << query1 << endl;

        std::string query2 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",9,"
            + fields.at(13) + ")";

        cout << query2 << endl;

        std::string query3 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",10,"
            + fields.at(14) + ")";

        std::string query4 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",11,"
            + fields.at(15) + ")";

        std::string query5 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",12,"
            + fields.at(16) + ")";

        std::string query6 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",1,"
            + fields.at(17) + ")";

        std::string query7 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",2,"
            + fields.at(18) + ")";

        std::string query8 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",3,"
            + fields.at(19) + ")";

        std::string query9 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",4,"
            + fields.at(20) + ")";

        std::string query10 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",5,"
            + fields.at(21) + ")";

        std::string query11 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",6,"
            + fields.at(22) + ")";

        std::string query12 = "insert into usage_history(esiid,first_name,last_name,month,usage) values("
            + fields.at(0) + "," + fields.at(1) + "," + fields.at(2) + ",7,"
            + fields.at(23) + ")";

        //cout << query << endl;
        db.Run_Query(query1);
        db.Run_Query(query2);
        db.Run_Query(query3);
        db.Run_Query(query4);
        db.Run_Query(query5);
        db.Run_Query(query6);
        db.Run_Query(query7);
        db.Run_Query(query8);
        db.Run_Query(query9);
        db.Run_Query(query10);
        db.Run_Query(query11);
        db.Run_Query(query12);

        fields.clear();
    }
    else
    {
        if(c == '"')
        {
            // 39 == '\'' -- convert the CSV's double quotes to single quotes
            // so the first 10 (text) fields come out quoted for SQL
            if(fields.size() < 10)
                buff.push_back(39);
        }
        else
            buff.push_back(c);
    }
}

cout << " Finished Successfully! " << endl;
system("PAUSE");


Michalson    1657
Quote:
Original post by graveyard filla
I actually never commit anything... in fact, I'm confused about that. In all my programming / SQL experience, I never even typed the word "commit". Am I missing something here? I'm using SQL Server, BTW.


As I suspected. At the moment you are actually doing a transaction for each and every command, which is very slow but somewhat safe (so long as there are no integrity issues, i.e. if this import were run more than once you'd have a problem).

Quote:
Original post by graveyard filla
Well, I need the entire file to be parsed and added to a table(s)... Maybe I should split the file first, and then run the processing on each individual file?


I think you *really* need to explain or show us your file code. It doesn't seem like you understand the difference between reading a file from disk as needed and reading the entire file into memory at once. You really need to show us what code you are trying to do this with.

Quote:
Original post by graveyard filla
I am taking the data from a .csv (text) file and dumping it into an SQL Server table... I'm using ADO / COM.

Let me better explain how I'm doing this. My program *should* be using very little memory... I only hold a single row of data in memory at one time. Basically it works like this:

-While fin.get(c)
-Parse a row of data into a std::vector<std::string>
-Do about 13 inserts with that new data

Thanks a lot for any more help.


Do you properly clean up the vector object? This could be a memory leak issue.

pragma Fury    343
May I suggest a fopen/fscanf approach, since you know the format of the csv file.

Here's a quick bit of code I whipped up to read in a text file containing 100 csv entries composed of 10 comma-separated integers.


FILE *pFile = fopen("test.txt","r");

// temporary storage for the values.
int n1,n2,n3,n4,n5,n6,n7,n8,n9,n10;

// loop as long as all 10 values convert; checking the return value of fscanf
// avoids the usual feof() loop re-processing the last line at end of file
while( fscanf(pFile,"%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\r\n",
              &n1,&n2,&n3,&n4,&n5,&n6,&n7,&n8,&n9,&n10) == 10 )
{
    // do something with the data.
}

fclose(pFile);

/* Sample data:
24464,26962,29358,11478,15724,19169,26500,6334,18467,41
5436,4827,11942,2995,491,9961,16827,23281,28145,5705
19895,19718,18716,17421,12382,292,153,3902,14604,32391
9894,17035,26299,25667,19912,1869,11538,14771,21726,5447
6868,28253,7711,15141,4664,17673,30333,31322,23811,28703
778,27529,9741,8723,12859,20037,32757,32662,27644,25547
*/




I dunno how fscanf would compare to reading in a line and tokenizing it using something like strtok, but it's gotta be faster than going char by char.

Drew_Benton    1861
Another thing to look into is something like this:


#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <sstream>

using std::cout;
using std::endl;

int main()
{
    std::ifstream fin("histY.csv.txt");

    std::string buff, elem;
    std::vector<std::string> fields;

    // Split by LF
    while( std::getline( fin, buff, char(10) ) )
    {
        std::stringstream ss;
        ss << buff;

        while( std::getline( ss, elem, ',' ) )
        {
            // Erase the leading space if one is there
            if( !elem.empty() && elem[0] == ' ' )
                elem.erase( elem.begin() );

            // Erase the trailing space if one is there
            if( !elem.empty() && elem[elem.size() - 1] == ' ' )
                elem.erase( --elem.end() );

            // Show the element
            cout << elem << endl;

            // Now do whatever you want with each element here, probably push it onto the vector
        }
        // When you get here, you have a vector of fields; now you send it to the DB how you were doing it.
    }
    fin.close();

    return 0;
}




I'm still learning this standard C++ library stuff, so if I've mangled anything or there's a better way, anyone feel free to point it out, but that does compile and run. The test file used was:

first name,last name, address, city, state, month1, month2, month3
first name,last name, address, city, state, month1, month2, month3

Michalson    1657
Quote:
Original post by graveyard filla
Here is the function which takes the file and dumps it into the database. It worked fine with a 400 meg file (dumped 3.5 million records), but I have another file that is 900 megs, and that's the one that gave me the "low on virtual memory" deal. BTW, this code needs to be changed slightly to insert into 2 tables to be fully normalized.

This is for a power company... this table has the power usage for like 8 million customers. The problem is the schema they had was HORRIBLE; their table looked like this:

First Name, Last Name, Address, City, (etc.), MonthUsage1, MonthUsage2, MonthUsage3, MonthUsage4, MonthUsage5... etc. Not only did they cram 12 fields into the one table, but they made them RELATIVE, so MonthUsage1 isn't January, it's actually August.

*** Source Snippet Removed ***


!

Well, it seems you still might have some minor normalization issues to sort out with the database in regard to storing redundant customer information. That, and you've committed an affront to God with that code. Any chance you could add a rope ladder that you can pull up after you?

graveyard filla
Thanks everyone for the replies... OK, I have some better ideas now, but I'm having problems understanding you guys' suggestions.

Drew, from your example of getline(), it doesn't seem to fit with the reference I found for getline()... I'm guessing I should do something like fin.getline(&some_string,999999,"'")? I put 999999 there because I want it to keep reading until it finds the delimiter.

Michalson, check the post right above yours; I posted the code.

Pragma, I'm trying to understand your example, but how does fscanf() know how many bytes / characters to read? And how does it know to ignore commas, etc.?

Thanks again.

graveyard filla
Quote:
Original post by Michalson
Quote:
Original post by graveyard filla
Here is the function which takes the file and dumps it into the database. It worked fine with a 400 meg file (dumped 3.5 million records), but I have another file that is 900 megs, and that's the one that gave me the "low on virtual memory" deal. BTW, this code needs to be changed slightly to insert into 2 tables to be fully normalized.

This is for a power company... this table has the power usage for like 8 million customers. The problem is the schema they had was HORRIBLE; their table looked like this:

First Name, Last Name, Address, City, (etc.), MonthUsage1, MonthUsage2, MonthUsage3, MonthUsage4, MonthUsage5... etc. Not only did they cram 12 fields into the one table, but they made them RELATIVE, so MonthUsage1 isn't January, it's actually August.

*** Source Snippet Removed ***


!

Well, it seems you still might have some minor normalization issues to sort out with the database in regard to storing redundant customer information. That, and you've committed an affront to God with that code. Any chance you could add a rope ladder that you can pull up after you?



lol... yes, as I mentioned before, I still need to add another table to make it fully normalized. It will be like this:

table1
t1_id,name, address, etc...

table2
t2_id, month, usage, t1_id(FK)

graveyard filla
Quote:
Original post by Drew_Benton
Another thing to look into is something like this:

*** Source Snippet Removed ***

I'm still learning this standard C++ library stuff, so if I've mangled anything or there's a better way, anyone feel free to point it out, but that does compile and run. The test file used was:

first name,last name, address, city, state, month1, month2, month3
first name,last name, address, city, state, month1, month2, month3


Wow, that's a lot nicer than my code [grin]. One thing I don't get: what's with the two lines where you trim the whitespace off the ends of the field? Why did you do that?

pragma Fury    343
How does this database store information for more than 12 months? Or is it even supposed to?


Quote:
Original post by graveyard filla
Pragma, I'm trying to understand your example, but how does fscanf() know how many bytes / characters to read? And how does it know to ignore commas, etc.?


Y'know, I'm not really sure. The internals of fscanf are something of a mystery... All I really know is that the scanf functions work just like the printf functions, just in reverse. Here is the MSDN documentation on the method though, hope it helps.
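For what it's worth, the mechanics are standard C rather than anything compiler-specific: each conversion specifier consumes as many characters as match it (%d stops at the first non-digit), and any literal character in the format string, like the commas above, must match the next input character and is consumed too. A tiny self-contained illustration, using sscanf on a string so it runs by itself:

#include <stdio.h>

int main(void)
{
    int a, b, c;

    // %d reads digits until it hits the ',', then the literal ',' in the
    // format eats the comma, and the next %d continues from there.
    int converted = sscanf("12,345,6789", "%d,%d,%d", &a, &b, &c);

    printf("%d values: %d %d %d\n", converted, a, b, c); // prints: 3 values: 12 345 6789
    return 0;
}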

Drew_Benton    1861
Quote:
Original post by graveyard filla
Drew, from your example of getline(), it doesn't seem to fit with the reference I found for getline()... I'm guessing I should do something like fin.getline(&some_string,999999,"'")? I put 999999 there because I want it to keep reading until it finds the delimiter.


Unless you are using VS 6, the code I used for getline should work in Dev and VS7. If you notice what I did, I am not using getline from the ifstream class, but rather the std::getline implementation; it is very different [wink]. In that example, what it does is first get an entire line, which is separated by a LF (character 10). After it gets a line, it sends it to a stringstream, which turns the entire line into a stream. I then pass that stream to std::getline and it will get each element that is separated by a ','. So if you have: 1,2, 3 ,4 5, 6 you will get:

1
2
3
4 5
6


Quote:

Wow, that's a lot nicer than my code. One thing I don't get: what's with the two lines where you trim the whitespace off the ends of the field? Why did you do that?


I did that so that if you had something messy, like you said, say:
First Name ,Last Name , etc...

it would take care of it. Of course, in my example it will only take away one space on each side, but you can just loop, e.g. while( !elem.empty() && elem[0] == ' ' ) elem.erase( elem.begin() );, and that takes care of it. I was just 'trimming' the string [wink].

pragma Fury    343
I modified my fscanf example to read in strings, ignoring any leading or trailing whitespace characters on any of the fields, using scansets (described here).

My test file contained the following entries:
firstname1,lastname1,month1, month2 , month3 ,  month4
firstname2,lastname2 , month1,month2 ,month3 ,month4
firstname3, lastname3 ,month1, month2 , month3 , month4


Output to the console was:
firstname1,lastname1,month1,month2,month3,month4
firstname2,lastname2,month1,month2,month3,month4
firstname3,lastname3,month1,month2,month3,month4



FILE *pFile = fopen("test.txt","r");

char szFirstName[16],
     szLastName[16],
     szMonth1[16],
     szMonth2[16],
     szMonth3[16],
     szMonth4[16];

// scan out the fields, ignoring any whitespace and ending on a newline.
// The %15 widths keep the 16-byte buffers from overflowing, and checking
// the return value stops the loop cleanly at end of file.
while( fscanf(pFile," %15[^, ] , %15[^, ] , %15[^, ] , %15[^, ] , %15[^, ] , %15s \r\n",
              szFirstName,
              szLastName,
              szMonth1,
              szMonth2,
              szMonth3,
              szMonth4) == 6 )
{
    // do something with the data.
    // I'll just dump it to the console for now.
    printf("%s,%s,%s,%s,%s,%s\r\n",
           szFirstName,
           szLastName,
           szMonth1,
           szMonth2,
           szMonth3,
           szMonth4);
}

fclose(pFile);


snk_kid    1312
Quote:
Original post by graveyard filla
I left it running for a few hours and when I came back, my PC was bogged down and it said "The system is low on virtual memory... ".


hmmm looking at the code:

Quote:
Original post by graveyard filla


while looping for hours & hours do
// ....
fields.push_back(buff);
// ....
buff.push_back(c);
// ....
buff.clear();
// ....
fields.clear();




[grin] Typically, clear() for vector/basic_string does not de-allocate memory; it just destroys all the elements, for efficiency reasons. Also, push_back for vector/basic_string is typically implemented with an exponential-growth strategy [lol].

Don't believe me? Do some push_backs, then clear, then push_back again, and lastly check the results of std::vector/basic_string::size and std::vector/basic_string::capacity; capacity will be greater than size.

So what does that mean? Your vector and/or string buffer keeps on growing exponentially, lol. You're doing clear, but it's only destroying elements, not deallocating memory.
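A quick way to check that on your own compiler (the exact numbers are implementation-specific, but most library implementations of the time keep the capacity after clear()):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(1000, 0);

    v.clear();

    std::cout << "size = " << v.size()            // 0
              << ", capacity = " << v.capacity()  // typically still >= 1000
              << std::endl;
    return 0;
}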

Quote:
Original post by Drew_Benton
Quote:
Original post by graveyard filla
Drew, from your example of getline(), it doesn't seem to fit with the reference I found for getline()... I'm guessing I should do something like fin.getline(&some_string,999999,"'")? I put 999999 there because I want it to keep reading until it finds the delimiter.


Unless you are using VS 6, the code I used for getline should work in Dev and VS7. If you notice what I did, I am not using getline from the ifstream class, but rather the std::getline implementation; it is very different [wink].


I'm pretty sure VC++ 6.0 has it; it's standard anyway. getline for std::basic_string is a free function declared in the header <string>, while the member function getline for basic_istream deals with C-style strings.
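A rough side-by-side of the two getlines being discussed (just a sketch; the 256-byte buffer size is an arbitrary choice):

#include <fstream>
#include <string>

void Demo(std::ifstream& fin)
{
    std::string s;
    std::getline(fin, s, ',');          // free function from <string>: grows the std::string as needed

    char buf[256];
    fin.getline(buf, sizeof(buf), ','); // basic_istream member: fills a fixed-size C-style buffer
}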

Toolmaker    967
Grave, I think you could solve your first problem (running out of memory) by calling clear() on the vector after each "round", just as Michalson suggested.

The best approach is to read line by line (I assume each line is going to contain one customer's data), shove it into the database, and clear the buffers (or even better, make them local in a compound statement so the destructors handle it for you).

That's all there is to it. Perhaps to speed things up you could read 50 lines at a time, split them, start an SQL transaction, send the 50 inserts, commit the transaction, and destroy all the used buffers. This won't increase memory usage a lot and it will definitely speed things up considerably, probably by a few hours, as pushing 50 inserts per transaction is going to be way faster than one insert at a time.

Toolmaker

DrEvil    1148
Quote:
Original post by snk_kid
[grin] Typically, clear() for vector/basic_string does not de-allocate memory, ...


Just FYI, in VC++ 7.1 vector.clear() frees the memory; that is, size() and capacity() end up being zero.

pragma Fury    343
Quote:
Original post by snk_kid
[grin] Typically, clear() for vector/basic_string does not de-allocate memory; it just destroys all the elements, for efficiency reasons. Also, push_back for vector/basic_string is typically implemented with an exponential-growth strategy [lol].

Don't believe me? Do some push_backs, then clear, then push_back again, and lastly check the results of std::vector/basic_string::size and std::vector/basic_string::capacity; capacity will be greater than size.


Except... his char buffer will eventually reach the size at which its capacity won't need to be increased anymore, and his field buffer will always be 23 elements in size. I don't see the leak here.

I ran some test code on VC7.1, VC6, and BCB5 and his buffers shouldn't grow past the maximum size needed.

[Edited by - pragma Fury on June 2, 2005 7:34:48 PM]
