• Advertisement
Sign in to follow this  

Fastest way to replace words in a text file

Recommended Posts

Hi,

I have following problem: I need to inspect each line of a text file, and replace all words that are recognized, with alternative words. For that, I have a correspondency list like in the example below:

  • word1,alternativeWord1,
  • word2,alternativeWord2,
  • ...
  • wordN,alternativeWordN

Above list has above 1000 lines. The text file has more than 10'000 lines.

My current C++ algorithm is as follows:

for each line in text file:
    for each word in correspondency list:
        std::find(line,word). If found, replace.

The correspondency list can be preprocessed offline if needed, not the text file.

Above works fine but is too slow. Do you have a better suggestion?

Thanks

Edited by codingJoe

Share this post


Link to post
Share on other sites
Advertisement

You could try loading the entire file into a vector of strings, then using std::replace_if with an appropriate predicate to do the replacement for you. For your predicate, you could use a lambda that captures your alternative words list, and then checks the current string in the vector and replaces it if needed.

Share this post


Link to post
Share on other sites

Why not:

for each line in file:
    for each word in line:
        if word in correspondency list:
            write replacement word to output
        else
            write word to output

I'd also use something like unordered_map for the correspondency list rather than iterating through it.

Share this post


Link to post
Share on other sites
The fastest way is probably building a state machine from the list of words to replace, but it's not a trivial problem. I would stick with the simple algorithm you have or what Kylotan suggested, unless its performance is not satisfactory. Edited by �lvaro

Share this post


Link to post
Share on other sites

This REALLY looks like homework, but you've got other replies, so throwing mine into the ring.

Looks like you've got a fancy form of a regular expression, one data file with about 1000 lines of replacements, another file to be replaced.

 

My recommendation is loading the first file and turning the values into a large regular expression using a std::regex object. Sure the regex will be a few kilobytes long, but that's small potatoes to the engine. Then load the second file into a buffer and use regex_replace() [edit: regex_match()] on it, storing the result in your output stream. Looks like you're parsing about 500K of data, that's easily traversed.

Cppreference.com has an example doing it in 3 lines, but they've got a simple direct string instead of your more complicated data sources.  Yours would probably be closer to 10 lines since you've got the formatted source file.

If it is homework you'll need to explain a few things to your teacher, but for a real world environment that's going to likely be your fastest solution unless you want to hand craft something.

 

Barring that solution (maybe your peers tremble at the power of regular expressions), a hash table mapping the words together and a stream-based process of your second file is probably going to be your next fastest solution. You'll want to be certain you scan for words in place so you avoid unnecessary allocations, that becomes a single pass through the file with fast constant-time lookup of each of the 10K lines, so it should run in a few milliseconds. It will be slower to read and write the result.

Edited by frob
Tried it out for grins and giggles

Share this post


Link to post
Share on other sites

I had a few minutes to spare and figured I'd "eat my own dog food", as it were, since it has been a while since I used the regex engine.

You'll want to use regex_match instead of regex_replace as I incorrectly suggested, but with my sample data (a short file of 50 words and a 12,654 word Wikipedia article) it still runs in the blink of an eye using 9 lines of code.

Fun little project and a diversion from an enormous compile (lasting 40 minutes and counting, might end up being a full rebuild) so thanks for the diversion.

Share this post


Link to post
Share on other sites

I would use a trie built from an english dictionary. Then you should be able to loop over your input and grab words in linear time. From there a hash multimap can be used to key words to alternative words. The hard part sounds like manually constructing the hash multimap.

Share this post


Link to post
Share on other sites
If all of the 'words to find' have at least a certain length (maybe 4 characters or so), I would use a modified Boyer-Moore algorithm to do a fast scan for the initial N length where N is the minimum of all search string lengths, then use a trie for the rest. Edited by Nypyren

Share this post


Link to post
Share on other sites

I had a few minutes to spare and figured I'd "eat my own dog food", as it were, since it has been a while since I used the regex engine.
You'll want to use regex_match instead of regex_replace as I incorrectly suggested, but with my sample data (a short file of 50 words and a 12,654 word Wikipedia article) it still runs in the blink of an eye using 9 lines of code.
Fun little project and a diversion from an enormous compile (lasting 40 minutes and counting, might end up being a full rebuild) so thanks for the diversion.

I wouldn't go with RegEx for this.

A few points that may not even apply to this specific situation:
  • RegEx searches tend to be recursive and put you at risk of either a stack overflow or an exception inside the <regex> implementation if you do not set a macro sufficiently large enough.  May not be a problem for simple text matches.
  • Visual Studio provides a full implementation.  GCC and Clang do not.  You should expect wildly varying results between platforms, or valid expressions that are simply not accepted on some platforms.  May not be a problem for simple text matches and if only Windows is to be targeted.
These may not be a problem here but everyone should be aware of them.
Here, this biggest problem is performance.
I found it faster to do simple text searches by searching for the first letter of my word myself and then only triggering a RegEx starting from there once found.  If simple text searches were all I was doing I would not have used RegEx (which makes sense as to why I am recommending against it).
My datasets were around 0.5 megabytes, and my initial simple passes (just letting <regex> do its thing) were insanely slow.  I had to do the above hack and some others to shave off a significant amount of time, and I wouldn't expect it to be your fastest route even with the simpler matches you will be using.

I expect Kylotan's suggestion to be the fastest and easiest to implement.


L. Spiro

Share this post


Link to post
Share on other sites

Wait... is the computation actually the expensive part of this? I'd expect any reasonably optimal solution to be limited by the performance of reading/writing from disk.

There's no point waiting for the entire file to be read before you start computation - otherwise your CPU is just sitting idle while data moves all the way from disk to main memory. And there's no point waiting till the entire computation is complete to start writing it back out again, for the same reason.

If we are going for maximal performance, I'd advocate reading in blocks of 4kb-16kb at a time, performing the replacement on the block, and then writing it out immediately.

Share this post


Link to post
Share on other sites

a diversion from an enormous compile (lasting 40 minutes and counting, might end up being a full rebuild)

 

Jeebs. We've been complaining about a mere 20 minute full rebuild time and are getting ready to throw engineering resources at dropping it back closer to 5 minutes, and I'd still consider 5 minutes to be excessive. :)

Share this post


Link to post
Share on other sites

I wouldn't go with RegEx for this.
 

Sorry about that, I meant fastest as in "It is nine lines of code to write".  

If it were a regex with lookahead elements or many *'s and +'s or big classes of characters, then I fully it would be slow because those are a lot of work. As it about 1000 short literal values, those are fast to match and replace.

The other solution (which several of us mentioned, building a lookup table and handling the streams as they go) will be faster for the CPU.  It will be more than 9 lines of code, but yes it will be even easier on the CPU.

 

Wait... is the computation actually the expensive part of this? I'd expect any reasonably optimal solution to be limited by the performance of reading/writing from disk.

That is speculation based on the initial post.  He is likely doing tens of thousands of allocations from main memory, possibly more depending on how he is reading and writing data.

I'd still consider 5 minutes to be excessive.

I consider it excessive too, but it is the third generation of a major game, the full build goes through about 150GB of intermediate files. Distributing on Incredibuild helps, but sometimes the nasty evil interdependencies prevent that from happening. It gives me an opportunity to write comments on the board. 

Share this post


Link to post
Share on other sites

RegEx searches tend to be recursive and put you at risk of either a stack overflow or an exception inside the <regex> implementation if you do not set a macro sufficiently large enough.  May not be a problem for simple text matches.

There's only a problem with blowing the stack if you're using backreferences. Unlikely to be a problem with this application.

 

  • Visual Studio provides a full implementation.  GCC and Clang do not.  You should expect wildly varying results between platforms, or valid expressions that are simply not accepted on some platforms.  May not be a problem for simple text matches and if only Windows is to be targeted.

 

That hasn't been true for years. I mentored two summers of GSoC students to get a full and complete implementation in to GCC by 2013 and clang has had a full implementation since about the same time.

These may not be a problem here but everyone should be aware of them.
Here, this biggest problem is performance.
I found it faster to do simple text searches by searching for the first letter of my word myself and then only triggering a RegEx starting from there once found.  If simple text searches were all I was doing I would not have used RegEx (which makes sense as to why I am recommending against it).

Perhaps that's an issue with the Microsoft implementation works, but the GCC implementation already does exactly that: scan from the beginning of the haystack until the first expression matches, then start the NFA or DFA (depending on flags) to evaluate the regex. The boost implementation also works that way.

That said, simple full-string literal matches are not the best use of regexes and are likely to be outperformed by other approaches, but using std::sregex_token_iterator with std::transform means you can do the whole thing with 3 lines of code. You gotta love the terseness and elegance of that.

Share this post


Link to post
Share on other sites

Thanks for all the good and interesting replies!

I ended up by putting the replacement words into a map (with the original words as keys). And having a recursive function that splits the text into 3 parts that will be reassembled again after the middle part was replaced. Before reassembling the 3 parts, the last part will be used in a call to the same recursive function.

Above works really good if the searched words have a common prefix.

Thanks again!

Share this post


Link to post
Share on other sites

Thanks for all the good and interesting replies!

I ended up by putting the replacement words into a map (with the original words as keys). And having a recursive function that splits the text into 3 parts that will be reassembled again after the middle part was replaced. Before reassembling the 3 parts, the last part will be used in a call to the same recursive function.

Above works really good if the searched words have a common prefix.

Thanks again!

 

Is this a homework, by any chance?  If so, what course is it?

Just curious.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this  

  • Advertisement