Parsing / splitting a HUGE file?

Started by
38 comments, last by pragma Fury 18 years, 10 months ago
I modified my fscanf example to read in strings, using scansets to ignore any leading or trailing whitespace characters on any of the fields.

My test file contained the following entries:
firstname1,lastname1,month1, month2 , month3 ,  month4
firstname2,lastname2  ,  month1,month2     ,month3 ,month4
firstname3,       lastname3       ,month1, month2 , month3 ,  month4


Output to the console was:
firstname1,lastname1,month1,month2,month3,month4
firstname2,lastname2,month1,month2,month3,month4
firstname3,lastname3,month1,month2,month3,month4


FILE *pFile = fopen("test.txt", "r");
char szFirstName[16],
     szLastName[16],
     szMonth1[16],
     szMonth2[16],
     szMonth3[16],
     szMonth4[16];

if (pFile != NULL)
{
   // Scan out the fields, ignoring any whitespace around them.
   // Loop on fscanf's return value instead of feof(), which only turns true
   // after a read has already failed. The %15 widths keep each field inside
   // its 16-byte buffer.
   while (fscanf(pFile, " %15[^, ] , %15[^, ] , %15[^, ] , %15[^, ] , %15[^, ] , %15s",
                 szFirstName,
                 szLastName,
                 szMonth1,
                 szMonth2,
                 szMonth3,
                 szMonth4) == 6)
   {
      // do something with the data.
      // I'll just dump it to the console for now.
      printf("%s,%s,%s,%s,%s,%s\n",
             szFirstName,
             szLastName,
             szMonth1,
             szMonth2,
             szMonth3,
             szMonth4);
   }
   fclose(pFile);
}
Quote:Original post by graveyard filla
I left it running for a few hours and when I came back, my PC was bogged down and it said "The system is low on virtual memory... ".


hmmm looking at the code:

Quote:Original post by graveyard filla
while looping for hours & hours do
   // ....
   fields.push_back(buff);
   // ....
   buff.push_back(c);
   // ....
   buff.clear();
   // ....
   fields.clear();



[grin] Typically, clear for vector/basic_string does not deallocate memory; it just destroys all the elements, for efficiency reasons. Also, push_back for vector/basic_string is typically implemented with an exponential-growth strategy [lol].

Don't believe me? Do some push_backs, then clear, then push_back again, and finally check the results of std::vector/basic_string::size and std::vector/basic_string::capacity: capacity will be greater than size.

So what does it mean? Your vector and/or string buffer keeps on growing exponentially, lol. You're calling clear, but it's only destroying elements, not deallocating memory.
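If the goal is to actually hand the memory back rather than just empty the container, one common idiom is to swap with an empty temporary. This is only a minimal sketch: release_storage, capacity_after_clear and capacity_after_release are made-up names for illustration, and what clear() leaves in capacity() is up to the implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: empties v and gives its storage back by swapping it
// with a temporary, default-constructed vector. The temporary takes over the
// old buffer and frees it when it is destroyed at the end of the statement.
template <typename T>
void release_storage(std::vector<T>& v) {
    std::vector<T>().swap(v);
}

// Hypothetical helper: fills a vector with n fields, calls clear(), and
// reports the capacity that is left. The result is implementation-defined
// and is often still >= n.
std::size_t capacity_after_clear(std::size_t n) {
    std::vector<std::string> fields;
    for (std::size_t i = 0; i < n; ++i) {
        fields.push_back("field");
    }
    fields.clear();
    return fields.capacity();
}

// Same as above, but uses the swap idiom instead of clear().
std::size_t capacity_after_release(std::size_t n) {
    std::vector<std::string> fields;
    for (std::size_t i = 0; i < n; ++i) {
        fields.push_back("field");
    }
    release_storage(fields);
    return fields.capacity();
}
```

On the common standard library implementations, a default-constructed vector has a capacity of zero, so capacity_after_release(100) should come back as 0 there, while capacity_after_clear(100) depends on the library.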

Quote:Original post by Drew_Benton
Quote:Original post by graveyard filla
Drew, from your example of getline(), it doesn't seem to fit with the reference I found for getline()... I'm guessing I should do something like fin.getline(&some_string,999999,"'") ? I put 999999 there because I want it to keep reading until it finds the delimiter.


Unless you are using VS 6, the code I used for getline should work in dev and vs7. If you notice what I did, I am not using getline for the ifstream class, but rather the std::getline implementation. It is very different [wink].


I'm pretty sure VC++ 6.0 has it; it's standard anyway. getline for std::basic_string is a free function declared in the header <string>, while the member function getline of basic_istream deals with C-style strings.
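To make the difference concrete, here is a small sketch of both calls reading one delimited field. The function names read_field_free and read_field_member and the 64-byte buffer size are just assumptions for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Free function std::getline from <string>: reads into a std::string, which
// grows as needed, so no maximum length has to be guessed.
std::string read_field_free(std::istream& in, char delim) {
    std::string field;
    std::getline(in, field, delim);
    return field;
}

// Member function basic_istream::getline: reads into a fixed-size char array.
// If the field does not fit in the buffer, the stream's failbit is set.
std::string read_field_member(std::istream& in, char delim) {
    char buffer[64];  // size chosen arbitrarily for this sketch
    in.getline(buffer, sizeof buffer, delim);
    return std::string(buffer);
}
```

With the free function there is no need for a magic number like 999999, since the string simply grows until the delimiter is found. Note also that the delimiter is a single char, so it would be written '\'' rather than "'".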
Grave, I think you could solve your first problem (running out of memory) by calling clear() on the vector after each "round", just as Michalson suggested.

The best approach is to read line by line (I assume each line is going to contain one customer's data), shove it into the database, and clear the buffers (or even better, make them local in a compound statement so the destructors handle it for you).

That's all there is. Perhaps to speed things up you could read 50 lines at a time, split them, start an SQL transaction, send 50 inserts, commit the transaction and destroy all used buffers. This won't increase memory usage a lot and it will definitely speed things up considerably, probably by a few hours, as pumping over 50 SQL inserts at once is going to be way faster than 1 insert at a time.
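A rough sketch of that batching loop might look like the following. FakeDatabase is a made-up stand-in that only counts calls, since the actual database API isn't shown in this thread; load_in_batches is a hypothetical name too, and a real insert_batch would wrap the inserts in a begin/commit pair.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a database connection. It only counts what it is
// given; a real version would start a transaction, send one INSERT per line,
// and commit.
struct FakeDatabase {
    std::size_t transactions;
    std::size_t rows;

    FakeDatabase() : transactions(0), rows(0) {}

    void insert_batch(const std::vector<std::string>& lines) {
        if (lines.empty()) {
            return;  // nothing to send, so no transaction is opened
        }
        ++transactions;
        rows += lines.size();
    }
};

// Reads the input line by line and hands the lines to the database in groups
// of batch_size (which should be at least 1). The batch vector is reused
// between rounds, so memory use stays bounded by one batch.
void load_in_batches(std::istream& in, FakeDatabase& db, std::size_t batch_size) {
    std::vector<std::string> batch;
    batch.reserve(batch_size);

    std::string line;
    while (std::getline(in, line)) {
        batch.push_back(line);
        if (batch.size() == batch_size) {
            db.insert_batch(batch);
            batch.clear();
        }
    }
    db.insert_batch(batch);  // flush whatever is left over at end of input
}
```

Splitting each line into fields would happen just before the insert; it is left out here to keep the batching logic visible.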

Toolmaker

Quote:Original post by snk_kid
[grin] typically clear for vector/basic_string does not de-allocate memory, ...


Just FYI, in VC++ 7.1 vector.clear() frees the memory; that is, size() and capacity() both end up being zero.
Quote:Original post by snk_kid
[grin] Typically, clear for vector/basic_string does not deallocate memory; it just destroys all the elements, for efficiency reasons. Also, push_back for vector/basic_string is typically implemented with an exponential-growth strategy [lol].

Don't believe me? Do some push_backs, then clear, then push_back again, and finally check the results of std::vector/basic_string::size and std::vector/basic_string::capacity: capacity will be greater than size.


Except.. his char buffer will eventually reach a size at which its capacity won't need to be increased anymore. And his field buffer will always be 23 elements in size.
I don't see the leak here.

I ran some test code on VC7.1, VC6, and BCB5 and his buffers shouldn't grow past the maximum size needed.
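For anyone who wants to check this themselves, something along these lines should do. It is only a sketch, and capacity_after_rounds is a made-up name: it fills and clears the same string over and over and reports the capacity seen in the last round.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: repeatedly fills one string with line_length
// characters and clears it again, like the parsing loop does with buff.
// Returns the capacity observed at the end of the last round.
std::size_t capacity_after_rounds(std::size_t rounds, std::size_t line_length) {
    std::string buff;
    std::size_t capacity = 0;
    for (std::size_t r = 0; r < rounds; ++r) {
        for (std::size_t i = 0; i < line_length; ++i) {
            buff.push_back('x');
        }
        capacity = buff.capacity();  // measured while the buffer is full
        buff.clear();
    }
    return capacity;
}
```

If the buffer really stops growing, capacity_after_rounds(1, 200) and capacity_after_rounds(1000, 200) return the same value. The exact number depends on the library's growth strategy.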

[Edited by - pragma Fury on June 2, 2005 7:34:48 PM]
Quote:Original post by DrEvil
Quote:Original post by snk_kid
[grin] typically clear for vector/basic_string does not de-allocate memory, ...


Just FYI, in VC++ 7.1 vector.clear() frees the memory; that is, size() and capacity() both end up being zero.


example:

#include <cstdlib>    // std::rand
#include <algorithm>
#include <iterator>
#include <vector>
#include <string>
#include <iostream>

struct foo { int i; foo(int j = 0): i(j) {} };

int main() {
   // fill, clear, then report capacity/size before and after refilling
   std::vector<foo> f;
   std::generate_n(std::back_inserter(f), 15, std::rand);
   f.clear();
   std::cout << "before vector: capacity = " << f.capacity() << ", size = " << f.size();
   std::generate_n(std::back_inserter(f), 5, std::rand);
   std::cout << "\nafter vector: capacity = " << f.capacity() << ", size = " << f.size();

   // same experiment with a string
   std::string s;
   s.push_back('a');
   s.push_back('b');
   s.push_back('c');
   s.push_back('d');
   s.clear();
   std::cout << "\n\nbefore string: capacity = " << s.capacity() << ", size = " << s.size() << std::endl;
   s.push_back('a');
   s.push_back('b');
   std::cout << "after string: capacity = " << s.capacity() << ", size = " << s.size() << std::endl;
}


results:

VC++ 7.1:

before vector: capacity = 0, size = 0
after vector: capacity = 6, size = 5

before string: capacity = 15, size = 0
after string: capacity = 15, size = 2

VC++ 8.0:

before vector: capacity = 19, size = 0
after vector: capacity = 19, size = 5

before string: capacity = 15, size = 0
after string: capacity = 15, size = 2

GCC 3.4.2:

before vector: capacity = 16, size = 0
after vector: capacity = 16, size = 5

before string: capacity = 4, size = 0
after string: capacity = 4, size = 2

Interesting results [grin].

GCC 3.4.2 > VC++ 8.0
VC++ 8.0 > VC++ 7.1

Quote:Original post by pragma Fury
Except.. his char buffer will eventually reach a size at which its capacity won't need to be increased anymore. And his field buffer will always be 15 elements in size.


You have a good point there

Quote:Original post by pragma Fury
I don't see the leak here. though I'm running VC7.1, so maybe I'm missing it.


I never said it was going to leak memory.

[Edited by - snk_kid on June 2, 2005 8:02:56 PM]
Quote:Original post by snk_kid
Quote:Original post by pragma Fury
I don't see the leak here. though I'm running VC7.1, so maybe I'm missing it.


I never said it was going to leak memory.


Sorry, I made an assumption about where we were going with this... I stand corrected.
Quote:Original post by Toolmaker
The best approach is to read line by line (I assume each line is going to contain one customer's data), shove it into the database, and clear the buffers (or even better, make them local in a compound statement so the destructors handle it for you).


Uh.. except that reinitializing the buffer objects, reallocating the buffer space, and then deallocating it all for every line is going to impose more overhead than simply clear()-ing them.

More bulletproof, yes. Slower, yes.


I'm not sure if your example was an attempt to prove me wrong or not, but add a cout after the f.clear() and capacity and size will be 0 in VC7.1, as I originally posted.

vector.clear() frees the buffer (capacity and size go to 0)
string.clear() doesn't free the buffer

This is only in VC7.1; it looks like the behavior has changed in 8.0. Good to know, thx.
Quote:Original post by DrEvil
I'm not sure if your example was an attempt to prove me wrong or not, but add a cout after the f.clear() and capacity and size will be 0 in VC7.1, as I originally posted.

vector.clear() frees the buffer (capacity and size go to 0)
string.clear() doesn't free the buffer


Sorry, please have a look at my last post again, as I was in the middle of updating and correcting some stuff.

Yeah, that wasn't an attempt to prove you right/wrong. As I was saying earlier, it's "typically" done, but it is an implementation detail, as you can see from the results.

In any case, as pragma Fury pointed out, I overlooked a minor issue, so my post doesn't apply.
