Archived

This topic is now archived and is closed to further replies.

pacrugby

reading large files

Recommended Posts

pacrugby    122
I'm just wondering about the best way to read large text files. Some of my files are 10 MB+. I use Visual C++ 6.0 with MFC on a brand new Dell at 1.8 GHz, so I'm expecting it to happen quite fast. However, the algorithm I'm using now takes a few seconds. Essentially the file is made up of 50,000+ lines with 13 or more numbers to be read in from each line. Thanks in advance.

griffenjam    193
Well, I don't know if this is the best way, but you could do something like this. It's general and assumes that you do something with the data (copy it to another buffer, transmit it over the Internet, process it...) before the next read.

#include <stdio.h>

#define BLOCK_SIZE 2048
...
char sBuffer[BLOCK_SIZE];
size_t nRet;

FILE *fp = fopen("yourfile.txt", "rb");

// fread returns the number of items actually read, so loop until it
// comes back empty instead of testing feof() up front.
while ((nRet = fread(sBuffer, sizeof(char), BLOCK_SIZE, fp)) > 0)
{
    // Do something with the data in the buffer; you have nRet
    // characters in there.
}
fclose(fp);
...

I don't think this is anything new, but this is how I wrote a file transfer program. It was able to transfer a 10 MB file over a network in a couple of seconds (well, maybe a couple more than a couple). You can play with the block size to see if you get different results; keep it a power of 2, though.

Also, you can check to see what the file size is, allocate a buffer of that size, and do one read operation, but something just tells me that I shouldn't trust reading 10 MB in one call.
But that might just be me.



Jason Mickela
ICQ : 873518
E-Mail: jmickela@pacbell.net
------------------------------
"Evil attacks from all sides
but the greatest evil attacks
from within." Me
------------------------------

pacrugby    122
The way I was doing it was reading each number into my program one at a time, which was slow. Then I started reading line by line and using sscanf to get the numbers, which sped it up a lot. I was just wondering if there is an even faster way, like reading the entire file into memory and then reading the numbers out of that. Not too sure how reading is taken care of.
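
For reference, a minimal sketch of the line-by-line sscanf approach described above (the file name "data.txt" and the 13-numbers-per-line layout are assumptions):

#include <stdio.h>

int main()
{
    FILE *fp = fopen("data.txt", "r");
    if (fp == NULL)
        return 1;

    char line[512];
    double v[13];

    // One fgets per line, one sscanf per line to pull out the 13 numbers.
    while (fgets(line, sizeof(line), fp) != NULL)
    {
        int n = sscanf(line,
            "%lf %lf %lf %lf %lf %lf %lf %lf %lf %lf %lf %lf %lf",
            &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6],
            &v[7], &v[8], &v[9], &v[10], &v[11], &v[12]);
        if (n == 13)
        {
            // ... use the 13 values from this line ...
        }
    }
    fclose(fp);
    return 0;
}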

Guest Anonymous Poster   
Guest Anonymous Poster
This is not directly related, but for a college project I have to write an algorithm that searches through large (200+ MB) DNA text files, to be executed on a supercomputer. The problem is we do not know the exact size of each file. You mentioned something about checking the file size first, then allocating enough memory for the text. How exactly do I check the file size, and how do I allocate a non-constant amount of memory?

Any help appreciated.

Guest Anonymous Poster   
Guest Anonymous Poster
You can read the entire file into a buffer with fopen and fread(buffer, ...). Look in the documentation for those functions.

Guest Anonymous Poster   
Guest Anonymous Poster
You can check the file size by doing fseek(file, 0, SEEK_END) and then position = ftell(file); position will hold the number of bytes in the file.
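
For example (a minimal sketch; the file name is an assumption):

#include <stdio.h>

int main()
{
    FILE *file = fopen("some_file.txt", "rb");   // example file name
    if (file == NULL)
        return 1;

    fseek(file, 0, SEEK_END);      // jump to the end of the file
    long position = ftell(file);   // current offset == size in bytes
    fseek(file, 0, SEEK_SET);      // rewind before doing the real reads

    printf("%ld bytes\n", position);
    fclose(file);
    return 0;
}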

Guest Anonymous Poster   
Guest Anonymous Poster
If you include <fstream>, you can use std::ifstream for input and std::ofstream for output. Open it with ios_base::in for input or ios_base::out for output. Read n characters with infile->read(myBuffer, n);

For the DNA file, you can get file info by calling the function
stat(fname, &file_stat_struct).
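
A minimal sketch of that stream approach (the file name and 4 KB buffer size are just assumptions):

#include <fstream>

int main()
{
    // Open the file for binary input.
    std::ifstream infile("some_file.txt", std::ios_base::in | std::ios_base::binary);
    if (!infile)
        return 1;

    const int n = 4096;
    char myBuffer[n];

    // read() pulls up to n characters; gcount() says how many actually arrived.
    while (infile.read(myBuffer, n) || infile.gcount() > 0)
    {
        std::streamsize got = infile.gcount();
        // ... process 'got' characters from myBuffer ...
        (void)got;
    }
    return 0;
}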

HSZuyd

no way    122
If you are willing to sacrifice 10 MB of (temporary) memory, then yes, you can read the entire file into memory in one go.

Like people said before, first open the file, then fseek() to the end and ftell() to get the size of the file.
Then allocate the buffer for reading, like char *hugebuf = (char *)malloc(filelen);
Now rewind the file (fseek() to the beginning) and use one fread() to read the entire contents.
Now, in whatever code you had for parsing the file, just replace the fscanf(infile, ...) and fgets(..., infile) calls with the corresponding sscanf(hugebuf, ...) calls and you should be good to go.

Don't forget to free(hugebuf) immediately afterwards!
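
A rough sketch of that whole approach (the file name is an assumption, and since sscanf(hugebuf, ...) always scans from the start of the buffer, this sketch walks the buffer with strtod instead so each number is only scanned once):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *infile = fopen("numbers.txt", "rb");   // example file name
    if (infile == NULL)
        return 1;

    // fseek/ftell to find the size, then rewind.
    fseek(infile, 0, SEEK_END);
    long filelen = ftell(infile);
    fseek(infile, 0, SEEK_SET);

    // One buffer for the whole file (+1 so it can be null-terminated).
    char *hugebuf = (char *)malloc(filelen + 1);
    if (hugebuf == NULL)
    {
        fclose(infile);
        return 1;
    }

    size_t got = fread(hugebuf, 1, filelen, infile);
    hugebuf[got] = '\0';
    fclose(infile);

    // Walk through the buffer pulling out one number at a time; strtod
    // reports where it stopped, so the next call picks up from there.
    char *p = hugebuf;
    char *end;
    for (;;)
    {
        double value = strtod(p, &end);
        if (end == p)          // no more numbers in the buffer
            break;
        // ... do something with 'value' ...
        p = end;
    }

    free(hugebuf);             // give the memory back when done parsing
    return 0;
}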

Mithrandir    607
Take advantage of the hardware features:

Every hard disk nowadays has at least a 256 KB cache on it: when reading data, the drive first fills the cache and then dumps the data to memory, and vice versa for writes (it stores data in the cache, then writes it all to disk at once).

Knowing this, then, the most efficient way of accessing data is to do it in chunks.

Keep a 128-512K chunk of memory open, and load the file 128-512K at a time. Since this method uses the cache much more efficiently, you'll see a significant speed enhancement.


My guess is that you are loading the file one byte at a time, and this slows down performance by a great deal, because with byte access the computer goes to the disk, requests a byte, waits for that byte to come back, then requests another one. DO NOT LOAD A FILE THIS WAY.


Example: I created a class that generates two checksums (using two different algorithms) for a file. My first version read the file byte by byte. This was fine for small files... but anything larger than a meg choked the system (this was a few years ago). I then switched my algorithm so that I loaded 64K at a time, then generated the checksum from that, and it could do 1 MB files at least 20 times faster.
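
A minimal sketch of that kind of chunked processing (the file name and the trivial additive checksum are placeholders, not the actual algorithms mentioned above):

#include <stdio.h>

int main()
{
    FILE *fp = fopen("bigfile.bin", "rb");   // example file name
    if (fp == NULL)
        return 1;

    static char chunk[64 * 1024];            // 64K at a time, as described above
    unsigned long checksum = 0;
    size_t got;

    // Read and process the file chunk by chunk instead of byte by byte.
    while ((got = fread(chunk, 1, sizeof(chunk), fp)) > 0)
    {
        for (size_t i = 0; i < got; ++i)
            checksum += (unsigned char)chunk[i];   // toy additive checksum
    }

    printf("checksum: %lu\n", checksum);
    fclose(fp);
    return 0;
}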

Wain    122
Disk access is one of the slowest things your computer can possibly do, so the object of making a fast loader is to minimize the number of read/write calls you actually execute.

The first thing is to make sure you're doing your own buffering (using fread). Sure, it adds a couple of lines of extra code for the maintenance, but anything that does the buffering for you will often waste more time (it depends on your compiler and exactly how you are reading the data).

The next step would be to do a bulk read in chunks as large as you can possibly handle into separate buffers (or one large buffer if your machine has the available memory).

Process from there. If there's still file left to be read, load the next chunk.

Do NOT use the standard stream I/O routines (scanf/printf, getchar/putchar, cin/cout, etc.); these were not designed for large block reads and will slow down your program significantly.

Your best option speed-wise will probably (depending slightly on your compiler) be to use fread into malloc'ed or new'ed memory and read in the largest chunks possible. fread is also usually faster than read because of the buffering issues.

Anything where the compiler, BIOS, or OS does the buffering for you is a waste of time when you're reading files that large.


~~~~
I can put my foot in my mouth...WANNA SEE????


Edited by - wain on October 18, 2001 12:43:38 PM

DerekSaw    243
If you are using the stdio.h library, you can use the setvbuf() function to attach a buffer to your opened file. I experienced faster access, especially with lots of fscanf() and fgetc() calls.


Try it and see if it helps.
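
For example (a sketch; the file name and the 64 KB buffer size are assumptions):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *fp = fopen("data.txt", "r");   // example file name
    if (fp == NULL)
        return 1;

    // Give stdio a 64 KB buffer for this stream; setvbuf must be called
    // right after fopen(), before the first read.
    char *iobuf = (char *)malloc(64 * 1024);
    setvbuf(fp, iobuf, _IOFBF, 64 * 1024);

    // ... lots of fscanf()/fgetc() calls now work out of the big buffer ...

    fclose(fp);
    free(iobuf);   // only free the buffer after the stream is closed
    return 0;
}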

Guest Anonymous Poster   
Guest Anonymous Poster
Thanks for all the advice. The fseek() and ftell() functions especially have proved useful. But now I need to know how to declare arrays to hold these text files, whose size is known only at runtime. The supercomputer has 64 GB of memory, so that shouldn't be a problem.

Red Ant    471
If you want to dynamically allocate an array of, say, 1000 chars, do this:

char *text;

text = new char[1000];
if (text == NULL)   // failed to allocate memory
    return 0;

...

Guest Anonymous Poster   
Guest Anonymous Poster
new defaults to throwing an exception. You have to specify 'nothrow' when you use new if you want it to return NULL instead.

#include <new>        // bad_alloc, nothrow
#include <iostream>
using namespace std;

// using the default new:
char *text1;
try
{
    text1 = new char[1000];
}
catch (bad_alloc &a)
{
    cout << "Threw a bad_alloc exception" << endl;
}

// using the nothrow alternative:
char *text2;
text2 = new(nothrow) char[1000];
if (text2 == NULL)   // now new returns NULL on failure
    return 0;

Guest Anonymous Poster   
Guest Anonymous Poster
(DNA guy again)

Yes, that's great, but what if I don't know how big the text file is when I'm declaring the array? You can't declare an array like this:

char text[n];

where n is a variable, right?

Red Ant    471
Just do it like this:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main()
{
    FILE *stream;
    char *buffer;
    int size;
    struct _stat Stats;
    char filename[] = "some_file.bla";

    // retrieve file info
    if (_stat(filename, &Stats) != 0)
        return 0;

    stream = fopen(filename, "rb");
    if (stream == NULL)
        return 0;

    size = Stats.st_size;

    // Next allocate enough memory so that you can put
    // 'size' chars into the buffer array
    buffer = new char[size];

    // Now let's read the file into the buffer
    fread(buffer, sizeof(char), size, stream);
    fclose(stream);

    // do stuff with your buffer array
    ...
    ...

    // don't forget to free the memory for buffer when you're done
    delete [] buffer;

    return 1;
}


Hope I didn't forget anything.

Edited by - Red Ant on October 19, 2001 1:38:30 PM

NuffSaid    122
quote:
Original post by Anonymous Poster
new defaults to throwing an exception. You have to specify 'nothrow' when you use new if you want it to return NULL instead.

#include <new>        // bad_alloc, nothrow
#include <iostream>
using namespace std;

// using the default new:
char *text1;
try
{
    text1 = new char[1000];
}
catch (bad_alloc &a)
{
    cout << "Threw a bad_alloc exception" << endl;
}

// using the nothrow alternative:
char *text2;
text2 = new(nothrow) char[1000];
if (text2 == NULL)   // now new returns NULL on failure
    return 0;



I don't think that applies to MSVC 6. AFAIK, unless you define your own new handler, new just returns NULL if it fails to allocate the required amount of memory.
