Working with huge text files...


I have a text file that I'm using to make a word game. The thing is, though, the file is 5 MB and 179k lines. What would be the best way to go about working with this? Should I load it all into a vector/array? I figure looping through the whole file each time I check for a word would be a lot slower than loading it all into memory first. Will looping through 179k lines be slow? I haven't had a chance to try it yet; I'm just trying to play this out before I tackle it. Here are two lines from the file as an example:

PROGRAM to arrange in a plan of proceedings [v -GRAMED, -GRAMING, -GRAMS or -GRAMMED, -GRAMMING, -GRAMS]
PROJECT to extend outward [v -ED, -ING, -S]

Thanks for any advice. (The source for the text file is zyzzyva.com, if anyone's curious. It's a Scrabble dictionary or something like that.)

Maybe load the words into separate vectors/maps based on their starting letter? So all the A's go in one, all the B's in another. Then when you need to search for a word, you can check its starting letter and search only the matching map.
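That idea might look something like this in C++ (a rough sketch; the function names and sample words are mine, not from the thread):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Bucket each word under its first letter, so a lookup only scans
// the words that share that letter (assumes uppercase input, as in
// the word list shown above).
std::map<char, std::vector<std::string>> bucketByFirstLetter(
        const std::vector<std::string>& words) {
    std::map<char, std::vector<std::string>> buckets;
    for (const std::string& w : words)
        if (!w.empty())
            buckets[w[0]].push_back(w);
    return buckets;
}

bool contains(const std::map<char, std::vector<std::string>>& buckets,
              const std::string& word) {
    if (word.empty()) return false;
    auto it = buckets.find(word[0]);
    if (it == buckets.end()) return false;
    const std::vector<std::string>& bucket = it->second;
    return std::find(bucket.begin(), bucket.end(), word) != bucket.end();
}
```

Each lookup still does a linear scan, but only over roughly 1/26th of the list.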

Quote:
Original post by Jaqen
I have a text file and I'm using it to make a word game. The thing is, though, the file is 5 mb and 179k lines.


This is not "huge".

Quote:
What would be the best way to go about working with this? Should I load it all into a vector/array? I figure looping through the whole file each time I'm trying to check for a word would be a lot slower than loading it all into memory first.


What does "check for a word" mean for you?

Quote:
Will looping through 179k lines go slow?


It might go much slower than is called for, depending on what you are trying to do. What are you trying to do?

Oh, well, I just finished writing code to print out every line in the file, and the program took 43.859 seconds. That seems too long to me, so I'm betting someone knows a way that makes more sense, which is what I'm looking for.

The main part of it is that you're given an anagram and then you type in words you can make from it. The program will check the list to see if each word exists.
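For the anagram side of that, a dictionary lookup alone isn't enough: the typed word also has to use only the letters you were given. A minimal sketch of that check, assuming uppercase A-Z input (the function name is made up):

```cpp
#include <array>
#include <cassert>
#include <string>

// Returns true if `candidate` can be spelled using only the letters
// in `rack`, each letter used at most as many times as it appears
// there. Assumes both strings contain only uppercase A-Z.
bool canBuildFrom(const std::string& candidate, const std::string& rack) {
    std::array<int, 26> counts{};          // letter counts in the rack
    for (char c : rack) ++counts[c - 'A'];
    for (char c : candidate)
        if (--counts[c - 'A'] < 0)         // used a letter too many times
            return false;
    return true;
}
```

For example, "GRAM" can be built from "PROGRAM", but "GRAMMAR" cannot (it needs two A's).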

The big question is what platform you are developing for. I'm going to assume you are working with the PC as your platform, in which case loading everything into memory is not a big deal. You should not use a vector or array if you need to access this information frequently; use a data structure called a hash table instead, as it is quite efficient at retrieving information from large data sets.
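For reference, loading the word list into a hash table might be sketched like this (assuming C++11 or later, where the standard library provides one as std::unordered_set; the function name and filename are placeholders):

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Load the first token of each line (the word itself; the rest of
// the line is the definition) into a hash set, giving average-case
// O(1) lookups per word.
std::unordered_set<std::string> loadWords(std::istream& in) {
    std::unordered_set<std::string> dict;
    std::string line, word;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        if (iss >> word) dict.insert(word);
    }
    return dict;
}

// Usage (filename is a placeholder):
//   std::ifstream file("words.txt");
//   std::unordered_set<std::string> dict = loadWords(file);
//   bool known = dict.count("PROJECT") != 0;
```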

Yeah, I am using a PC, though it'll probably be cross-platform. I know nothing about this area of programming, so I'll go look up hash tables and see what I can come up with...

Quote:
Original post by Jaqen
Oh, well I just finished writing up code to print out every line in the file and the program took 43.859 seconds. I mean, that seems too long to me so I'm betting someone knows a way that makes more sense, which is what I'm looking for.


The question is, was reading the file the slow part, or was printing out each line the slow part? If this was inside a Windows console (e.g. run from cmd.exe) then I would guess the printing was significantly slower than the reading part. Simply reading a file line by line should take a fraction of a second for a relatively small file like you've described (5MB is not big).

Try taking out the bit where you actually print each line and see what the difference is.
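One way to separate the two costs is to time just the read, with no printing (a sketch assuming C++11's std::chrono; the filename is a placeholder):

```cpp
#include <chrono>
#include <fstream>
#include <sstream>
#include <string>

// Count lines without printing them, so the measurement reflects
// only the read and not console output.
long countLines(std::istream& in) {
    std::string line;
    long n = 0;
    while (std::getline(in, line)) ++n;
    return n;
}

// Time the read on its own and report the elapsed milliseconds.
long timeReadMs(const std::string& path, long& linesOut) {
    auto start = std::chrono::steady_clock::now();
    std::ifstream file(path);
    linesOut = countLines(file);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        end - start).count();
}

// Usage: long lines = 0; long ms = timeReadMs("words.txt", lines);
```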

When you say it took nearly 50 secs to write the data to screen, to be honest this doesn't seem too outlandish at all if you're dumping it to standard out (console). Writing to console isn't exactly a performance-friendly operation. For example, iterating over and assigning to every value in a std::vector of 180000 integers takes a blink of an eye, whereas printing them to std::cout takes about 20 secs. I know you're not using ints, but the point is don't start optimising until you're sure you need to.

ninja'd, dammit!

You guys are right... I commented out the cout line and it took all of 0.406 s to execute. I'm wondering how much longer it'll take to put it into a vector on top of that... according to you, not much time at all.

Sorry, I guess I got ahead of myself if I really don't need all this stuff.

Just to make this a positive example, this is a perfect situation to point out the importance of studying data structures.

Think about implementing it as either a linked list, or maybe a buffer with a bunch of descriptors.

Loading it all into memory and then working on it is pretty much the only way to go; disk reads are too slow to do that frequently.

And the key to searching a large list is sorting.

Quote:
Original post by godsenddeath
Just to make this a positive example, this is a perfect situation to point out the importance of studying data structures


This I agree with completely. Also take the time to look up algorithms, specifically searches and sorts.

Quote:
think about implementing it as either a linked list, or maybe a buffer with a bunch of descriptors.


Please... why would you use a linked list? With a linked list you would have to walk through every single, EVERY single element leading up to the word you are looking for. An array or vector is a much better option, but a linear search will still take time. Take some time to look at binary search. It assumes your list is in order, but it should be; if it's not, take the time to sort it. The O(n log n) to O(n²) preprocessing cost (depending on the sort) is well worth the O(log n) vs. O(n) search time.
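A sketch of that approach with the standard library (assuming the list is sorted once up front; the helper name is mine):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Assumes `words` has already been sorted once with std::sort
// (a one-time O(n log n) cost). Each lookup is then O(log n) via
// binary search instead of an O(n) linear scan.
bool inSortedList(const std::vector<std::string>& words,
                  const std::string& w) {
    return std::binary_search(words.begin(), words.end(), w);
}
```

With 179k words, that is roughly 18 comparisons per lookup instead of up to 179,000.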

As viperman1271 said, the hash table is your BEST option: O(1) average-case retrieval.


