Fastest way to replace words in a text file

Started by
14 comments, last by alnite 6 years, 10 months ago

Hi,

I have following problem: I need to inspect each line of a text file, and replace all words that are recognized, with alternative words. For that, I have a correspondency list like in the example below:

  • word1,alternativeWord1,
  • word2,alternativeWord2,
  • ...
  • wordN,alternativeWordN

Above list has above 1000 lines. The text file has more than 10'000 lines.

My current C++ algorithm is as follows:


for each line in text file:
    for each word in correspondency list:
        std::find(line,word). If found, replace.

The correspondency list can be preprocessed offline if needed, not the text file.

Above works fine but is too slow. Do you have a better suggestion?

Thanks

Advertisement

You could try loading the entire file into a vector of strings, then using std::replace_if with an appropriate predicate to do the replacement for you. For your predicate, you could use a lambda that captures your alternative words list, and then checks the current string in the vector and replaces it if needed.

Why not:


for each line in file:
    for each word in line:
        if word in correspondency list:
            write replacement word to output
        else
            write word to output

I'd also use something like unordered_map for the correspondency list rather than iterating through it.

The fastest way is probably building a state machine from the list of words to replace, but it's not a trivial problem. I would stick with the simple algorithm you have or what Kylotan suggested, unless its performance is not satisfactory.

I need to inspect each line of a text file, and replace all words that are recognized, with alternative words.
Use 'sed', not sure what it does, but likely it is more optimized than you will ever make.

https://en.wikipedia.org/wiki/Sed

This REALLY looks like homework, but you've got other replies, so throwing mine into the ring.

Looks like you've got a fancy form of a regular expression, one data file with about 1000 lines of replacements, another file to be replaced.

My recommendation is loading the first file and turning the values into a large regular expression using a std::regex object. Sure the regex will be a few kilobytes long, but that's small potatoes to the engine. Then load the second file into a buffer and use regex_replace() [edit: regex_match()] on it, storing the result in your output stream. Looks like you're parsing about 500K of data, that's easily traversed.

Cppreference.com has an example doing it in 3 lines, but they've got a simple direct string instead of your more complicated data sources. Yours would probably be closer to 10 lines since you've got the formatted source file.

If it is homework you'll need to explain a few things to your teacher, but for a real world environment that's going to likely be your fastest solution unless you want to hand craft something.

Barring that solution (maybe your peers tremble at the power of regular expressions), a hash table mapping the words together and a stream-based process of your second file is probably going to be your next fastest solution. You'll want to be certain you scan for words in place so you avoid unnecessary allocations, that becomes a single pass through the file with fast constant-time lookup of each of the 10K lines, so it should run in a few milliseconds. It will be slower to read and write the result.

I had a few minutes to spare and figured I'd "eat my own dog food", as it were, since it has been a while since I used the regex engine.

You'll want to use regex_match instead of regex_replace as I incorrectly suggested, but with my sample data (a short file of 50 words and a 12,654 word Wikipedia article) it still runs in the blink of an eye using 9 lines of code.

Fun little project and a diversion from an enormous compile (lasting 40 minutes and counting, might end up being a full rebuild) so thanks for the diversion.

I would use a trie built from an english dictionary. Then you should be able to loop over your input and grab words in linear time. From there a hash multimap can be used to key words to alternative words. The hard part sounds like manually constructing the hash multimap.

If all of the 'words to find' have at least a certain length (maybe 4 characters or so), I would use a modified Boyer-Moore algorithm to do a fast scan for the initial N length where N is the minimum of all search string lengths, then use a trie for the rest.

I had a few minutes to spare and figured I'd "eat my own dog food", as it were, since it has been a while since I used the regex engine.
You'll want to use regex_match instead of regex_replace as I incorrectly suggested, but with my sample data (a short file of 50 words and a 12,654 word Wikipedia article) it still runs in the blink of an eye using 9 lines of code.
Fun little project and a diversion from an enormous compile (lasting 40 minutes and counting, might end up being a full rebuild) so thanks for the diversion.

I wouldn't go with RegEx for this.

A few points that may not even apply to this specific situation:
  • RegEx searches tend to be recursive and put you at risk of either a stack overflow or an exception inside the <regex> implementation if you do not set a macro sufficiently large enough. May not be a problem for simple text matches.
  • Visual Studio provides a full implementation. GCC and Clang do not. You should expect wildly varying results between platforms, or valid expressions that are simply not accepted on some platforms. May not be a problem for simple text matches and if only Windows is to be targeted.
These may not be a problem here but everyone should be aware of them.
Here, this biggest problem is performance.
I found it faster to do simple text searches by searching for the first letter of my word myself and then only triggering a RegEx starting from there once found. If simple text searches were all I was doing I would not have used RegEx (which makes sense as to why I am recommending against it).
My datasets were around 0.5 megabytes, and my initial simple passes (just letting <regex> do its thing) were insanely slow. I had to do the above hack and some others to shave off a significant amount of time, and I wouldn't expect it to be your fastest route even with the simpler matches you will be using.

I expect Kylotan's suggestion to be the fastest and easiest to implement.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

This topic is closed to new replies.

Advertisement