Working with huge text files...

Started by
10 comments, last by Viral_Fury 15 years, 4 months ago
I have a text file and I'm using it to make a word game. The thing is, though, the file is 5 MB and 179k lines. What would be the best way to go about working with this? Should I load it all into a vector/array? I figure looping through the whole file each time I'm trying to check for a word would be a lot slower than loading it all into memory first. Will looping through 179k lines go slow? I haven't gotten a chance to try it yet; I'm just trying to think this through before I tackle it.

Here are two lines from the file as an example:

PROGRAM to arrange in a plan of proceedings [v -GRAMED, -GRAMING, -GRAMS or -GRAMMED, -GRAMMING, -GRAMS]
PROJECT to extend outward [v -ED, -ING, -S]

Thanks for any advice. (The source for the text file is zyzzyva.com, if anyone's curious. It's a Scrabble dictionary or something like that.)
Maybe load the words into a separate vector/map based on the starting letter? So all the A's go in one, all the B's in another. Then when you need to search for a word, you can check the starting letter and search only the matching container.
Quote:Original post by Jaqen
I have a text file and I'm using it to make a word game. The thing is, though, the file is 5 MB and 179k lines.


This is not "huge".

Quote:What would be the best way to go about working with this? Should I load it all into a vector/array? I figure looping through the whole file each time I'm trying to check for a word would be a lot slower than loading it all into memory first.


What does "check for a word" mean for you?

Quote:Will looping through 179k lines go slow?


It might go much slower than is called for, depending on what you are trying to do. What are you trying to do?
Oh, well I just finished writing up code to print out every line in the file and the program took 43.859 seconds. I mean, that seems too long to me so I'm betting someone knows a way that makes more sense, which is what I'm looking for.

The main part of it is you're going to be given an anagram, and then you type in words you can get out of it. It'll check the list to see if it exists.
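The "words you can get out of it" part of that check can be sketched separately from the dictionary lookup. This is only a rough sketch under assumptions: `can_form` and `letter_index` are hypothetical names, and it assumes plain A-Z letters with case ignored. It answers whether a typed word uses only the letters of the given anagram; whether the word exists in the list is a second, separate lookup.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: maps a letter to 0..25 (case-insensitive),
// or returns -1 for anything that is not A-Z.
int letter_index(unsigned char c) {
    int upper = std::toupper(c);
    return (upper >= 'A' && upper <= 'Z') ? upper - 'A' : -1;
}

// Hypothetical helper: true if `candidate` can be spelled from the letters
// of `rack`, using each letter no more often than it appears in `rack`.
bool can_form(const std::string& candidate, const std::string& rack) {
    std::array<int, 26> available{};  // letter counts, all zero to start
    for (unsigned char c : rack) {
        int i = letter_index(c);
        if (i >= 0) ++available[i];
    }
    for (unsigned char c : candidate) {
        int i = letter_index(c);
        if (i < 0) return false;                  // not a letter
        if (--available[i] < 0) return false;     // letter used too often
    }
    return true;
}
```

A guess would then count only if `can_form(guess, anagram)` is true and the guess is also found in the word list.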
The big question is what platform you are developing for. I'm going to assume that you are working with the PC as your platform, in which case loading everything into memory is not a big deal. If you need to look words up frequently, I would not use a plain vector or array; use a data structure called a hash table instead, since it is quite efficient at retrieving information from large data sets (lookups take roughly constant time on average).
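In C++ the standard hash table for this kind of membership test is `std::unordered_set`. The following is a minimal sketch, not a definitive implementation: `load_words` and `is_word` are hypothetical names, and it assumes each line starts with the headword followed by a space, as in the two sample lines from the original post. It stores only the headword, so inflected forms such as the `-ED, -ING, -S` suffixes are not expanded.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper: reads one dictionary entry per line and keeps only
// the headword (the text before the first space), e.g. "PROGRAM".
std::unordered_set<std::string> load_words(std::istream& in) {
    std::unordered_set<std::string> words;
    std::string line;
    while (std::getline(in, line)) {
        // find() returns npos when there is no space, so substr() then
        // yields the whole line.
        std::string word = line.substr(0, line.find(' '));
        if (!word.empty()) words.insert(word);
    }
    return words;
}

// Hypothetical helper: average constant-time membership test.
bool is_word(const std::unordered_set<std::string>& words,
             const std::string& candidate) {
    return words.count(candidate) > 0;
}
```

Taking a `std::istream&` means the same function works with a `std::ifstream` opened on the real file or with a `std::istringstream` holding a few test lines.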
Yeah, I am using PC though it'll probably be cross platform. I know nothing about this area of programming so I'll go look up hash tables and see what I can come up with...
Quote:Original post by Jaqen
Oh, well I just finished writing up code to print out every line in the file and the program took 43.859 seconds. I mean, that seems too long to me so I'm betting someone knows a way that makes more sense, which is what I'm looking for.


The question is, was reading the file the slow part, or was printing out each line the slow part? If this was inside a Windows console (e.g. run from cmd.exe) then I would guess the printing was significantly slower than the reading part. Simply reading a file line by line should take a fraction of a second for a relatively small file like you've described (5MB is not big).

Try taking out the bit where you actually print each line and see what the difference is.
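One way to separate the two costs is to time the reading on its own and print a single number at the end. This is a rough sketch under assumptions: `read_lines` and `seconds_taken` are hypothetical helper names, and the timing uses `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads every line into memory without printing anything.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in seconds.
template <typename F>
double seconds_taken(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

To use it, open the file with `std::ifstream`, wrap the `read_lines` call in a lambda passed to `seconds_taken`, and print only the returned number. Timing a second lambda that writes each line to `std::cout` shows how much of the total is console output.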
When you say it took nearly 50 secs to write the data to screen, to be honest this doesn't seem too outlandish at all if you're dumping it to standard out (console). Writing to console isn't exactly a performance-friendly operation. For example, iterating over and assigning to every value in a std::vector of 180000 integers takes a blink of an eye, whereas printing them to std::cout takes about 20 secs. I know you're not using ints, but the point is don't start optimising until you're sure you need to.

ninja'd, dammit!


You guys are right... I commented out the cout line and it took all of .406s to execute. Wondering how much longer it'll take to put it into a vector on top of that... according to you, not much time at all.

Sorry, I guess I got ahead of myself if I really don't need all this stuff.
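Put it in a vector and do a binary search on it. As long as the entries are sorted in alphabetical order, it will take fewer than 19 comparisons to see whether one of the 179k lines has the word you are looking for, since each comparison halves the remaining range and 2^18 is already larger than 179,000.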
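The standard library already provides this as `std::binary_search`. Here is a minimal sketch, assuming the vector holds just the headwords and is sorted in the order `std::string`'s `operator<` defines; `contains_word` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: looks up `candidate` in a vector that is already
// sorted. The assert is a debug-only sanity check and costs a full pass
// over the vector, so it disappears when NDEBUG is defined.
bool contains_word(const std::vector<std::string>& sorted_words,
                   const std::string& candidate) {
    assert(std::is_sorted(sorted_words.begin(), sorted_words.end()));
    return std::binary_search(sorted_words.begin(), sorted_words.end(),
                              candidate);
}
```

If there is any doubt that the file's ordering matches plain byte-wise string comparison, calling `std::sort` once after loading removes the doubt.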
