Large (I mean absolutely huge) File Parsing (Solved)

So I have this file; it's actually a text file over 200 megs that has a bunch of information I need from it. Each line has information that I am gathering statistics on. I wrote a nice little parser in C/C++, but it would only parse approximately 25331 lines. That's, oh, about 10 percent of the file line-wise. I know the entire file exists; yes, I did actually open it and went to the end without any problems. I have tried using ifstream, FILE*, Windows handles, etc., on both g++ and VC++ to open the file, but I still get the same problem of stopping at 25331. I think the problem is they try to load the entire file at one time, except it's just too big. I was wondering if anybody had any suggestions on being able to parse the entire file. I only need a line at a time. I'm open to using other languages; maybe something like Perl would do the trick, but I don't have much experience with that. Any ideas?

Mat

[edited by - doodah2001 on May 18, 2004 8:23:59 PM]

That sounds odd...

The first thing I thought of is a flag in Windows' CreateFile() function. Don't know if it'd help any, although I would hope that it could.
quote:
From MSDN Docs for CreateFile()
FILE_FLAG_SEQUENTIAL_SCAN
Indicates that the file is to be accessed sequentially from beginning to end. The system can use this as a hint to optimize file caching. If an application moves the file pointer for random access, optimum caching may not occur; however, correct operation is still guaranteed.

Specifying this flag can increase performance for applications that read large files using sequential access. Performance gains can be even more noticeable for applications that read large files mostly sequentially, but occasionally skip over small ranges of bytes.
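
For reference, here's a minimal sketch of opening a file with that flag and reading it in chunks (the file name and buffer size are just placeholders, not your code):

#include <windows.h>

int main()
{
    // FILE_FLAG_SEQUENTIAL_SCAN is only a caching hint; the read loop itself is unchanged.
    HANDLE h = CreateFileA("access.log", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    char buffer[64 * 1024];
    DWORD bytesRead = 0;
    // Read fixed-size chunks; line splitting would happen on top of this buffer.
    while (ReadFile(h, buffer, sizeof(buffer), &bytesRead, NULL) && bytesRead > 0)
    {
        // ... scan buffer for '\n' and hand complete lines to the parser ...
    }

    CloseHandle(h);
    return 0;
}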

Well, it's a standard Apache web log file that contains the IP addresses, dates, and pages viewed. It's in a standard text format. I need it for some statistical stuff for research I'm doing. I know the format of each line, so parsing it is not a problem. What I do is read in a line, parse the line, and get the next line. The problem is that after so many lines, it reads in some weird data ("\xb1\sb\xb1\sb\...") and then a bunch of empty lines ("") for the rest of the file after I reach that point. That makes me think the file can't be loaded entirely at once.
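
For context, a typical line in Apache's common log format looks roughly like this (a made-up example, not an actual entry from my file), and one way to pull the fields apart might be:

#include <cstdio>

int main()
{
    // Hypothetical entry in Apache's Common Log Format.
    const char *line =
        "192.168.0.1 - - [18/May/2004:20:01:08 -0500] \"GET /index.html HTTP/1.1\" 200 4523";

    char ip[64], date[64], request[512];
    int status = 0, bytes = 0;

    // %[^]] reads up to the closing bracket, %[^"] up to the closing quote.
    if (std::sscanf(line, "%63s %*s %*s [%63[^]]] \"%511[^\"]\" %d %d",
                    ip, date, request, &status, &bytes) == 5)
    {
        std::printf("ip=%s date=%s request=%s status=%d bytes=%d\n",
                    ip, date, request, status, bytes);
    }
    return 0;
}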

It sounds like you had one of the buffer overflow attacks hit your server and it's now doing a number on your log analyzer program. I'd examine your analyzer program to make sure that it isn't getting buffer overflowed.

This sounds like it's been solved many times before; why don't you just grab one of the many log analyzers available online?

I think SourceForge alone should have plenty that'll do what you want, if not a whole lot more. Even if they don't, they'll at least have the parsing done and you can go straight to the statistics.

I looked at several log analyzers before I started the project because I didn't want to reinvent the wheel. The sad part is that none of them did what I needed, or they were expensive. Some of the analysis I needed I've already done using those same analyzers, and they had no problem reading through the same log file. One actually mentioned specifically that it was written in C/C++, so I don't understand why I'm having a problem. I've tried it on several machines to see if it's something machine-specific; it's not. It's the same count each time. I'll try FILE_FLAG_SEQUENTIAL_SCAN and see if that works.

mat

How are you dealing with the file in code? I mean, I assume that you're not trying to read the whole file in one go? Perhaps you'd care to post a snippet / overview of what you're doing so that we can make suggestions (obviously not the analysis part ... just the parts where you're dealing with the file).

AFAIK, the fread/f* family of functions maintains a small memory cache around the part of the file you're working with (I think in some (older?) MSVC versions it was on the order of 4 KB), and I wouldn't be surprised if the same applies to the ifstream stuff.

I assume that you aren't seeking through the file every time you need a new line ... it would be easy to inadvertently cause a bug doing something like that...

Are you sure there are more than 25331 lines? Make sure you don't have word wrapping enabled in the text viewer when checking that.

Also, how do you do the parsing? Maybe you have some kind of memory leak or you allocate too much memory at once; on 32-bit Windows a process can't use more than 2 GB of user-mode address space by default. I'm not sure without looking it up whether that's 2 GB total or a single buffer bigger than 2 GB.

And if there is a file-size limit, it's at 2 GB or bigger.

OK, here's the basic outline of the code. This is the fstream version; I have done basically the same thing with FILE* and with the Windows-specific CreateFile().

int main()
{
    ifstream in(filename);

    if (!in.is_open())
        return -1;

    char line[1024];

    while (!in.eof())
    {
        in.getline(line, 1024);

        // call all the line parsing stuff here that just
        // deals with the line, that's it
    }
}

It's nothing complicated; it's very simple. I have run it without any of my analysis functions and it still gets caught up on the same line. I have tried increasing the line size, but that didn't work; I still get caught up.

mat

edit: the way Apache saves its log file, each entry (or hit) is on a new line. I know there are at least 2 million lines, if not more. In fact, the log file runs from late Feb until now. I just need the info from March and April, but line 25331 only gets me to some day in Feb.

[edited by - doodah2001 on May 18, 2004 8:01:08 PM]

Don't solve the problem; side-step it. Use memory-mapped files. Here is my code to open a file in memory-mapped mode. Once open, you address it like it's part of memory, i.e. no fseek or anything.


void CFile::OpenReadFile(void)
{
    HANDLE FileHandle = CreateFile(m_Name.c_str(), GENERIC_READ, FILE_SHARE_READ,
                                   NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (FileHandle == INVALID_HANDLE_VALUE)
        throw FileUtilities::FileNotFound();

    m_Size = (unsigned int)GetFileSize(FileHandle, NULL);
    if (m_Size == 0)
    {
        // Nothing to map; don't leak the file handle.
        CloseHandle(FileHandle);
        return;
    }

    m_Handle = CreateFileMapping(FileHandle, NULL, PAGE_READONLY, 0, m_Size, NULL);
    if (m_Handle == NULL)
    {
        CloseHandle(FileHandle);
        throw std::runtime_error("Couldn't create file mapping");
    }

    // The mapping object keeps the file open, so the original handle can go now.
    if (!CloseHandle(FileHandle))
        throw std::runtime_error("Couldn't close file handle");

    m_Data = (char*)MapViewOfFile(m_Handle, FILE_MAP_READ, 0, 0, 0);
    if (m_Data == NULL)
        throw std::runtime_error("Couldn't map view of file");
}
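
Once the file is mapped, scanning lines is just pointer walking. A rough sketch of the read loop (parse() is a stand-in for whatever per-line work you do, and this assumes <cstring> for memchr):

// Walk the mapped buffer and hand each line to a parser; no per-line I/O calls.
const char *p   = m_Data;
const char *end = m_Data + m_Size;
while (p < end)
{
    const char *nl = (const char *)memchr(p, '\n', (size_t)(end - p));
    size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);
    parse(p, len);   // hypothetical per-line handler
    p += len + 1;    // step past the newline (or past the end on the final line)
}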

edit: breaking tables bad

[edited by - SiCrane on May 18, 2004 8:25:41 PM]

also, though this doesn't solve the problem at hand, there are several freely available programs that you can plug into apache to get it to spit out separate log files for each day/month/year/whatever. having a separate log file for each month is a good idea anyway b/c then you can delete old log files and not have to worry about the bloat. just google around and i'm sure you'll find something.
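
for example, apache ships with rotatelogs, which you can pipe the access log through; something like this in httpd.conf (the paths here are just an example, adjust to your install):

# start a new access log file every 24 hours (86400 seconds)
CustomLog "|/usr/local/apache/bin/rotatelogs /var/log/apache/access_log 86400" common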

-me

Is line 25331 longer than 1022 characters? If you don't want to worry about keeping the right buffer size, you could read character by character into a string until you hit '\n'.

If you get the exact same problem with different C++ runtimes, I doubt it has anything to do with being able to read the entire file. If you have grep or any other program that can display a specific line from a file, take a look at that line.
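
As a rough sketch (the file name is a placeholder and the parsing step is left empty):

#include <fstream>
#include <string>

int main()
{
    std::ifstream in("access.log");   // hypothetical file name
    if (!in.is_open())
        return -1;

    std::string line;
    char c;
    // Grow the string one character at a time; the line can be any length.
    while (in.get(c))
    {
        if (c == '\n')
        {
            // ... parse `line` here ...
            line.clear();
        }
        else
        {
            line += c;
        }
    }
    // Handle a final line with no trailing newline.
    if (!line.empty())
    {
        // ... parse `line` here ...
    }
    return 0;
}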

[edited by - igni ferroque on May 18, 2004 8:24:08 PM]

OK, I have it working now. Apparently it was a "SEARCH" request, rather than a "GET" or "POST", that had over 300000 characters in it. I increased the line size to a million characters, an extreme amount, and it seems to work. Thanks for all the help anyway, everyone.

quote:
Original post by doodah2001
OK, here's the basic outline of the code. This is the fstream version; I have done basically the same thing with FILE* and with the Windows-specific CreateFile().

int main()
{
    ifstream in(filename);

    if (!in.is_open())
        return -1;

    char line[1024];

    while (!in.eof())
    {
        in.getline(line, 1024);

        // call all the line parsing stuff here that just
        // deals with the line, that's it
    }
}

It's nothing complicated; it's very simple. I have run it without any of my analysis functions and it still gets caught up on the same line. I have tried increasing the line size, but that didn't work; I still get caught up.

mat


I'm fairly sure your problem occurs when the line is longer than 1023 characters.

Note that when the line doesn't fit in the buffer, getline sets the failbit on the stream, so the stream isn't valid after that. !in.eof() will not stop the loop, since that only tests for EOF, so every getline after that point just gives you empty lines, which matches what you saw.

You could either use std::getline, which is much safer because it operates on std::strings, or you could use some code like this:

if ( in.rdstate() & ios::failbit )
{
    // the line was too long for the buffer

    // turn off the stream's ios::failbit
    in.clear( in.rdstate() & ~ios::failbit );

    // read and discard any unread characters, up to
    // and including the '\n' delimiter
    while ( in.good() && (in.get() != '\n') )
        ;
}
else
{
    // process the line
}

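For completeness, the std::getline route might look roughly like this (the file name is a placeholder; std::string grows as needed, so even a 300000-character line is fine, and using the call itself as the loop condition avoids the !in.eof() pitfall):

#include <fstream>
#include <string>

int main()
{
    std::ifstream in("access.log");   // hypothetical file name
    if (!in.is_open())
        return -1;

    std::string line;
    while (std::getline(in, line))
    {
        // ... parse `line` here ...
    }
    return 0;
}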

quote:
Original post by doodah2001
OK, I have it working now. Apparently it was a "SEARCH" request, rather than a "GET" or "POST", that had over 300000 characters in it. I increased the line size to a million characters, an extreme amount, and it seems to work. Thanks for all the help anyway, everyone.


Hehe. That's the problem with fixed-size buffers. They're never big enough =) Glad you fixed the problem!

[edited by - igni ferroque on May 18, 2004 8:26:54 PM]

Fredizzimo,

That's a good idea; I didn't think of that. It seems it only happens on "SEARCH" requests, so I can check for that and just discard the line if I have to, since I don't care about "SEARCH" requests anyway.

mat
