[.net] Reading text files faster?

6 comments, last by Nytegard 19 years, 6 months ago
Alright, I admit I'm a novice to C#. I have a 650 MB text file that I need parsed into a specific file structure in memory, but that's not important right now. Here's the code right now, minus all the error checking I have to do, to keep this short.

StreamReader objReader =  new StreamReader(strFileName);
string strLine = "";
string strHeader;
string strDefinition;
int nIndexofSplit;

while (null != (strLine = objReader.ReadLine()))
{
  //nIndexofSplit = strLine.IndexOf(".");
  //strHeader = strLine.Substring(0, nIndexofSplit);
  //strDefinition = strLine.Substring(nIndexofSplit + 1, strLine.Length - (nIndexofSplit + 1) );
}

This takes roughly 20 seconds. Now if I uncomment those lines, it takes 1 minute to process. This isn't even including all the processing of the line that needs to be done later. As a simple test, I rewrote this in C (error checking included, and the lines uncommented), and it takes 13 seconds instead of 1 minute. What's a quicker way to read the file in and parse the string so it averages ~15 seconds, with 20 seconds as the worst case?
Well, first off, a big problem you have is your use of strings. Each time you reassign a string, a whole new string object has to be created (strings are immutable). Instead you should be using a StringBuilder.

Secondly, don't read in a line at a time. Read in a chunk of lines. Reads are more efficient with more data, up to a point of course.
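For what it's worth, here's a rough sketch of that chunked approach, using StreamReader.Read into a char buffer instead of ReadLine; the 1 MB buffer size is just a guess, and you'd still have to split the lines yourself:

using System.IO;

StreamReader reader = new StreamReader(strFileName);
char[] buffer = new char[1024 * 1024]; // read ~1 MB of characters at a time
int charsRead;

while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
{
    // Scan buffer[0..charsRead) for '\n' and '.' yourself, carrying any
    // partial line at the end of the buffer over to the next chunk.
}
reader.Close();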

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

As Washu said, use StringBuilder instead of String. Also, read your file in chunks of several MBs.

If you're storing the entire parsed version of the data in memory (as opposed to reading, parsing, using, discarding), you could write a class that reads the entire file into a single block of memory. Then go through that block and store indices to indicate where each strHeader and strDefinition begins and ends in that block.
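Something along these lines, say -- only a minimal sketch of the single-block-plus-indices idea, assuming the whole file actually fits in memory (the record layout here is made up for illustration):

using System.IO;
using System.Collections;

byte[] block;
using (FileStream fs = new FileStream(strFileName, FileMode.Open, FileAccess.Read))
{
    block = new byte[(int)fs.Length];
    fs.Read(block, 0, block.Length); // a real version should loop until all bytes are in
}

// Each record is stored as { headerStart, headerLength, definitionStart, definitionLength }.
ArrayList records = new ArrayList();
int lineStart = 0, dot = -1;
for (int i = 0; i < block.Length; i++)
{
    if (block[i] == (byte)'.' && dot < 0)
    {
        dot = i; // first '.' on this line splits header from definition
    }
    else if (block[i] == (byte)'\n')
    {
        if (dot >= 0)
            records.Add(new int[] { lineStart, dot - lineStart, dot + 1, i - (dot + 1) });
        lineStart = i + 1;
        dot = -1; // ('\r' handling omitted to keep the sketch short)
    }
}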
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
The class library provides a BufferedStream wrapper that should improve performance by reading chunks as joanusdmentia suggested. I would be wary about reading the entire thing, as exceeding memory limits can cause a very severe slowdown -- much more severe than not buffering. Beyond a point the size of the buffer makes little difference: on my system, reading 1024-byte chunks is much faster than reading 128-byte chunks, but reading 4096 is not much faster than 1024.

If you are working with a whole lot of data, consider working with the bytes directly instead of converting everything to strings.
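Something like this, say -- just a short sketch of the BufferedStream idea working on raw bytes (the 4096-byte buffer follows the numbers above):

using System.IO;

FileStream file = new FileStream(strFileName, FileMode.Open, FileAccess.Read);
BufferedStream buffered = new BufferedStream(file, 4096);

int b;
while ((b = buffered.ReadByte()) != -1)
{
    // Compare b against (byte)'.' and (byte)'\n' directly, recording offsets,
    // without ever building a string for the whole line.
}
buffered.Close();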

What more can you tell us about the file? Will the data you're looking for appear randomly in the file? Are you reading the entire 650 MB into RAM ("...need parsed into a specific file structure in memory") or just a few bytes?

As others suggested, reading binary will be more efficient.

For such large files you could *consider* using memory-mapped files. That would likely give much better performance. However, there are several issues involved. It's harder, and you need more Win32 knowledge to get it working and efficient. You need to go unsafe, which is a pain if you plan to deploy this widely (but not a problem if it's a tool only you use yourself). I would do it in a Managed C++ assembly if I were to do it.
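To give a rough idea of what's involved from C#, here is only a sketch, assuming hand-written P/Invoke declarations for CreateFileMapping/MapViewOfFile and a project compiled with /unsafe (error checking omitted; this is not something the class library hands you directly):

using System;
using System.IO;
using System.Runtime.InteropServices;

class MappedFileScan
{
    const uint PAGE_READONLY = 0x02;
    const uint FILE_MAP_READ = 0x0004;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaxSizeHigh, uint dwMaxSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hMapping, uint dwDesiredAccess,
        uint dwOffsetHigh, uint dwOffsetLow, UIntPtr dwNumberOfBytesToMap);

    [DllImport("kernel32.dll")]
    static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll")]
    static extern bool CloseHandle(IntPtr hObject);

    public static unsafe void Scan(string path)
    {
        FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        IntPtr mapping = CreateFileMapping(fs.Handle, IntPtr.Zero, PAGE_READONLY, 0, 0, null);
        // Asking for 0 bytes maps the whole file; this reserves address space,
        // not physical memory, so 650 MB is workable in a 32-bit process.
        IntPtr view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, UIntPtr.Zero);
        try
        {
            byte* p = (byte*)view.ToPointer();
            // ... walk p[0 .. fs.Length) looking for '.' and '\n' here ...
        }
        finally
        {
            UnmapViewOfFile(view);
            CloseHandle(mapping);
            fs.Close();
        }
    }
}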
Memory mapping is very seldom a big win, unless you want to do sparse I/O on the file. The reason is that you have to page fault to read in each little chunk of the file, which causes a disk seek per memory mapping page granule (which usually is bigger than a single physical page).

Some memory mapping implementations will optimize the case you're getting (scanning the file linearly) by using read-ahead, but you're likely to be able to do much better using asynchronous double-buffered I/O yourself.

I.e., if you need to process N megabytes, do something like:

curRead = queue 2 MB read at offset 0;
readOffset = 2 MB;
processOffset = 0;
while( processOffset < size ) {
  prevRead = curRead;
  if( readOffset < size ) {
    curRead = queue 2 MB read at offset readOffset;
    readOffset += 2 MB;
  }
  wait for completion of prevRead;
  process at processOffset;
  processOffset += 2 MB;
}


(You need a little extra handling of the end if your file isn't neatly aligned to a multiple of 2 MB in size, but the extra work is simple)
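In C# that could look roughly like the following -- a sketch only, using FileStream.BeginRead/EndRead with two 2 MB buffers and the stream opened with the async flag; ProcessChunk stands in for whatever parsing you do:

using System;
using System.IO;

class DoubleBufferedReader
{
    const int ChunkSize = 2 * 1024 * 1024; // 2 MB per read

    public static void Process(string path)
    {
        // The last constructor argument requests asynchronous (overlapped) I/O.
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, 4096, true))
        {
            byte[][] buffers = { new byte[ChunkSize], new byte[ChunkSize] };
            int current = 0;

            // Queue the first read.
            IAsyncResult pending = fs.BeginRead(buffers[current], 0, ChunkSize, null, null);

            while (true)
            {
                // Wait for the read queued on the previous pass.
                int bytesRead = fs.EndRead(pending);
                if (bytesRead == 0)
                    break; // end of file

                int finished = current;
                current = 1 - current;

                // Queue the next chunk *before* processing the one just read,
                // so the disk stays busy while the CPU parses.
                pending = fs.BeginRead(buffers[current], 0, ChunkSize, null, null);

                ProcessChunk(buffers[finished], bytesRead);
            }
        }
    }

    static void ProcessChunk(byte[] data, int length)
    {
        // ... scan data[0..length) for '.' and line breaks here ...
    }
}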
enum Bool { True, False, FileNotFound };
Quote: Original post by Washu
Instead you should be using a StringBuilder.

Actually, look at the shared source: StringBuilder does it too. However, the StringBuilder's buffer is large enough that it just copies the new data over to the end. Each time it runs out of room, it doubles its size. There was an article about how to do it correctly in Managed C++... (reassigning the pointer).
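So if you know roughly how much text you'll be appending, it can be worth giving it a capacity up front. A tiny, hypothetical illustration (the 4096 is just a placeholder for your expected size):

using System.Text;

// Pre-sizing avoids the repeated grow-and-copy when the buffer doubles.
StringBuilder sb = new StringBuilder(4096);
sb.Append(strHeader);
sb.Append('.');
sb.Append(strDefinition);
string rebuilt = sb.ToString();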
OK, the 650 MB file is used to store certain information I need for work, and the file format they decided on is text. Anyways, I do need the entire file, but not stored in memory all at one time. Actually, currently, I cannot read the entire file into memory without serious page swapping, which slows everything down. There are roughly 3.4 million lines in this file that I do need stored in RAM at all times, though. To cut down on size, 1.7 million of these lines need to be sorted, which results in roughly 46000 different objects (instead of 1.7 million). These are the keys used to access the other 1.7 million objects. The other 30 million lines of the text file I'm parsing into my own format and writing out as 2 files. One file is a temporary file of my own format which compresses the 650 MB file into 170 MB, and the other file is a list of offsets used to seek to the specific record I need in that file, out of the 1.7 million compressed records.

Currently, on a 1.8 GHz P4, I have a C version which will read, sort, and write the temporary files in roughly 35 seconds, and it maxes out at 68 MB of RAM. But the GUI, where the results of the processing need to be displayed, is in C#. And I'll admit right now, looking at the C code is doable, but it's extremely ugly. It's filled with tons of pointer arithmetic, mallocs, reallocs, frees, memcpys, memmoves, and memsets all over the place, and in case I ever discovered a bug or memory leak (and I'll be honest, I'm not a perfect programmer), trying to track it down wouldn't exactly be easy. The C# version, on the other hand, is very simple and very clean.

That's pretty much my goal. Read, sort, and write the temporary files in under 80 MB of RAM usage and in less than 45 seconds.

