Large (I mean absolutely huge) File Parsing (Solved)

Started by doodah2001 · 16 comments, last by doodah2001 19 years, 11 months ago
So I have this file; it's actually a text file, over 200 megs, that has a bunch of information I need from it. Each line has information that I am gathering statistics on. I wrote a nice little parser in C/C++, however it would only parse approximately 25331 lines. That's, oh, about 10 percent of the file line-wise. I know the entire file exists; yes, I did actually open it and went to the end without any problems. I have tried using ifstream, FILE*, Windows handles, etc., on both g++ and VC++ to open the file, but I still get the same problem of stopping at 25331. I think the problem is they try to load the entire file at one time, except it's just too big.

I was wondering if anybody had any suggestions on being able to parse the entire file. I only need a line at a time. I'm open to using other languages; maybe something like Perl would do the trick, but I don't have much experience with that. Any ideas?

Mat

[edited by - doodah2001 on May 18, 2004 8:23:59 PM]
Mat | Doodah2001@hotmail.com | Life is only as fun as you make it!!!
Well, you did not give many details, but splitting the file into smaller files would be the obvious solution...
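For what it's worth, a minimal C++ sketch of that approach, splitting the log into fixed-size chunks of lines (the file names and the 500,000-line chunk size are just placeholders):

#include <fstream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("access.log");   // placeholder input name
    std::string line;
    std::ofstream out;
    long lineCount = 0;
    int chunkIndex = 0;

    while (std::getline(in, line))
    {
        // start a fresh chunk file every 500,000 lines
        if (lineCount % 500000 == 0)
        {
            if (out.is_open())
                out.close();
            std::ostringstream name;
            name << "chunk" << chunkIndex++ << ".log";
            out.open(name.str().c_str());
        }
        out << line << '\n';
        ++lineCount;
    }
}

Each chunk then stays comfortably inside whatever limit the parser is tripping over.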

Edit: spelling

[edited by - xMcBainx on May 18, 2004 6:23:35 PM]
I teleported home one night; With Ron and Sid and Meg; Ron stole Meggie's heart away; And I got Sydney's leg. <> I'm blogging, emo style
That sounds odd...

The first thing I thought of is a flag in Windows' CreateFile() function. Don't know if it'd help any, although I would hope that it could.
Quote (from the MSDN docs for CreateFile()):
FILE_FLAG_SEQUENTIAL_SCAN
Indicates that the file is to be accessed sequentially from beginning to end. The system can use this as a hint to optimize file caching. If an application moves the file pointer for random access, optimum caching may not occur; however, correct operation is still guaranteed.

Specifying this flag can increase performance for applications that read large files using sequential access. Performance gains can be even more noticeable for applications that read large files mostly sequentially, but occasionally skip over small ranges of bytes.
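For reference, a minimal sketch of what using that flag looks like (the file name is a placeholder and error handling is trimmed):

#include <windows.h>

int main()
{
    // Open with the sequential-scan hint from the MSDN quote above.
    HANDLE h = CreateFileA("access.log", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    char buffer[64 * 1024];
    DWORD bytesRead = 0;

    // ReadFile does no text-mode translation, so the whole 200 MB file
    // comes through chunk by chunk exactly as it is on disk.
    while (ReadFile(h, buffer, sizeof(buffer), &bytesRead, NULL) && bytesRead > 0)
    {
        // scan buffer for '\n' here and hand complete lines to the parser
    }

    CloseHandle(h);
    return 0;
}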
"We should have a great fewer disputes in the world if words were taken for what they are, the signs of our ideas only, and not for things themselves." - John Locke
Well, it's a standard Apache web log file that contains the IP addresses, dates, and pages viewed. It's in a standard text format. I need it to do some statistical stuff for some research I'm doing. I know the format of each line, so parsing it is not a problem. What I do is: read in each line, parse the line, get the next line. The problem is that after so many lines, it reads in some weird information "\xb1\sb\xb1\sb\..." and then it reads in a bunch of empty lines "" for the rest of the file after I reach that point. That makes me think the file can't be loaded entirely at once.
Mat | Doodah2001@hotmail.com | Life is only as fun as you make it!!!
It sounds like you had one of those buffer overflow attacks hit your server, and the attack data is now doing a number on your log analyzer program. I'd examine your analyzer program to make sure that it isn't getting buffer overflowed itself.
This sounds like it's been solved many times before; why don't you just grab one of the many log analyzers available online?

I think SF alone should have many that'll do what you want, if not a whole lot more. Even if they don't, they'll at least have the parsing done and you can go straight to the statistics.
I've looked at several log analyzers before I started the project because I didn't want to reinvent the wheel. The sad part is that none of them did what I needed, or they were expensive. Some of the analysis that I needed I've already done using those same analyzers; they had no problem reading through the same log file. One actually specifically mentioned it was written in C/C++, so I don't understand why I'm having a problem. I've tried it on several machines to see if there is something machine-specific; there isn't. It's the same count each time. I'll try FILE_FLAG_SEQUENTIAL_SCAN and see if that works.

mat
Mat | Doodah2001@hotmail.com | Life is only as fun as you make it!!!
How are you dealing with the file in code? I mean, I assume that you're not trying to read the whole file in one go? Perhaps you'd care to post a snippet / overview of what you're doing so that we can make suggestions (obviously not the analysis part ... just the parts where you're dealing with the file).

AFAIK, the fread/f**** functions maintain a small memory cache close to where you're working with the file (I think in some (older?) MSVC versions it was on the order of 4 KB), and I wouldn't be surprised if the same applies to the ifstream stuff.
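If that small stdio buffer is a concern, setvbuf() can swap in a bigger one; a quick sketch (the 1 MB size and the file name are arbitrary):

#include <stdio.h>

int main()
{
    FILE* fp = fopen("access.log", "r");   // placeholder file name
    if (!fp)
        return -1;

    // setvbuf must be called before the first read; this replaces the
    // default (often 4 KB) buffer with a 1 MB one so fgets hits the
    // disk far less often on a 200 MB file.
    static char bigBuffer[1024 * 1024];
    setvbuf(fp, bigBuffer, _IOFBF, sizeof(bigBuffer));

    char line[1024];
    while (fgets(line, sizeof(line), fp))
    {
        // parse one line here
    }

    fclose(fp);
    return 0;
}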

I suppose that you aren't seeking the file every time that you need a new line ... it might be easy to inadvertently cause a bug doing something like that...
Are you sure that there are more than 25331 lines? Make sure you don't have any word wrapping enabled in the text viewer when checking that.

Also, how do you do the parsing? Maybe you have some kind of memory leak, or you allocate too much memory at once; on Windows a program can't allocate more than 2 GB of memory. But I'm not sure without looking it up whether that's 2 GB total or a single buffer bigger than 2 GB.

And if there is a file size limit somewhere, it's at 2 GB or bigger.
Ok, here's the basic outline of the code. This is the fstream version; however, I have done basically the same thing with FILE* and with the Windows-specific CreateFile().

#include <fstream>

using namespace std;

int main()
{
    const char* filename = "access.log";  // placeholder name
    ifstream in(filename);

    if (!in.is_open())
        return -1;

    char line[1024];

    while (!in.eof())
    {
        in.getline(line, 1024);

        // call all the line parsing stuff here that just
        // deals with the line, that's it
    }
}

It's nothing complicated; it's very simple. I have run it without doing any of my analysis functions but still get caught up on the same line. I have tried increasing the line size, but that didn't work; I still get caught up. (See the end of the thread for a read loop that sidesteps the likely pitfalls here.)

mat

edit: the way Apache saves its log file, each entry (or hit) is on a new line. I know there are at least 2 million lines, if not more. In fact, the log file runs from late Feb until current. I just need the info from March and April, but line 25331 only gets me to some day in Feb.

[edited by - doodah2001 on May 18, 2004 8:01:08 PM]
Mat | Doodah2001@hotmail.com | Life is only as fun as you make it!!!
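For anyone hitting the same wall: two things can produce exactly this symptom. First, the while(!in.eof()) loop is a classic pitfall: if any single line does not fit in the 1024-byte buffer, getline() sets the stream's fail bit, every later getline() returns an empty string, and eof() is never reached. Second, on Windows a stream opened in text mode treats a stray 0x1A (Ctrl-Z) byte as end-of-file, and binary garbage in a log (like that "\xb1..." run) can easily contain one. A sketch of a loop that sidesteps both (file name is a placeholder):

#include <fstream>
#include <string>

int main()
{
    // ios::binary avoids the text-mode Ctrl-Z early-EOF behaviour on
    // Windows; std::string getline has no fixed line-length limit, so an
    // over-long garbage line cannot wedge the stream in a fail state.
    std::ifstream in("access.log", std::ios::in | std::ios::binary);
    if (!in.is_open())
        return -1;

    std::string line;
    while (std::getline(in, line))            // test the read, not eof()
    {
        // binary mode leaves the '\r' of CRLF line endings; strip it
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);

        // call all the line parsing stuff here, as before
    }
    return 0;
}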

