Back to General and Gameplay Programming

Fastest way to replace words in a text file

codingJoe · 2017-05-25T08:43:05

Hi, I have following problem: I need to inspect each line of a text file, and replace all words that are recognized, with alternative words. For that, I have a correspondency list like in the example below: word1,alternativeWord1, word2,alternativeWord2, ... wordN,alternativeWordN Above list has above 1000 lines. The text file has more than 10'000 lines. My current C++ algorithm is as follows: for each line in text file: for each word in correspondency list: std::find(line,word). If found, replace. The correspondency list can be preprocessed offline if needed, not the text file. Above works fine but is too slow. Do you have a better suggestion? Thanks

General and Gameplay Programming Programming

Started by codingJoe May 24, 2017 02:38 PM

14 comments, last by alnite 6 years, 10 months ago

swiftcoder

18,997

May 24, 2017 09:38 PM

Wait... is the computation actually the expensive part of this? I'd expect any reasonably optimal solution to be limited by the performance of reading/writing from disk.

There's no point waiting for the entire file to be read before you start computation - otherwise your CPU is just sitting idle while data moves all the way from disk to main memory. And there's no point waiting till the entire computation is complete to start writing it back out again, for the same reason.

If we are going for maximal performance, I'd advocate reading in blocks of 4kb-16kb at a time, performing the replacement on the block, and then writing it out immediately.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

SeanMiddleditch

17,596

May 24, 2017 09:49 PM

a diversion from an enormous compile (lasting 40 minutes and counting, might end up being a full rebuild)

Jeebs. We've been complaining about a mere 20 minute full rebuild time and are getting ready to throw engineering resources at dropping it back closer to 5 minutes, and I'd still consider 5 minutes to be excessive. :)

Sean Middleditch – Game Systems Engineer – Join my team!

frob

46,221

May 24, 2017 10:47 PM

I wouldn't go with RegEx for this.

Sorry about that, I meant fastest as in "It is nine lines of code to write".

If it were a regex with lookahead elements or many *'s and +'s or big classes of characters, then I fully it would be slow because those are a lot of work. As it about 1000 short literal values, those are fast to match and replace.

The other solution (which several of us mentioned, building a lookup table and handling the streams as they go) will be faster for the CPU. It will be more than 9 lines of code, but yes it will be even easier on the CPU.

Wait... is the computation actually the expensive part of this? I'd expect any reasonably optimal solution to be limited by the performance of reading/writing from disk.

That is speculation based on the initial post. He is likely doing tens of thousands of allocations from main memory, possibly more depending on how he is reading and writing data.

I'd still consider 5 minutes to be excessive.

I consider it excessive too, but it is the third generation of a major game, the full build goes through about 150GB of intermediate files. Distributing on Incredibuild helps, but sometimes the nasty evil interdependencies prevent that from happening. It gives me an opportunity to write comments on the board.

Bregma

9,461

May 25, 2017 02:20 AM

RegEx searches tend to be recursive and put you at risk of either a stack overflow or an exception inside the <regex> implementation if you do not set a macro sufficiently large enough. May not be a problem for simple text matches.

There's only a problem with blowing the stack if you're using backreferences. Unlikely to be a problem with this application.

Visual Studio provides a full implementation. GCC and Clang do not. You should expect wildly varying results between platforms, or valid expressions that are simply not accepted on some platforms. May not be a problem for simple text matches and if only Windows is to be targeted.

That hasn't been true for years. I mentored two summers of GSoC students to get a full and complete implementation in to GCC by 2013 and clang has had a full implementation since about the same time.

These may not be a problem here but everyone should be aware of them.
Here, this biggest problem is performance.
I found it faster to do simple text searches by searching for the first letter of my word myself and then only triggering a RegEx starting from there once found. If simple text searches were all I was doing I would not have used RegEx (which makes sense as to why I am recommending against it).

Perhaps that's an issue with the Microsoft implementation works, but the GCC implementation already does exactly that: scan from the beginning of the haystack until the first expression matches, then start the NFA or DFA (depending on flags) to evaluate the regex. The boost implementation also works that way.

That said, simple full-string literal matches are not the best use of regexes and are likely to be outperformed by other approaches, but using std::sregex_token_iterator with std::transform means you can do the whole thing with 3 lines of code. You gotta love the terseness and elegance of that.

Stephen M. Webb
Professional Free Software Developer

codingJoe

193

Author

May 25, 2017 08:33 AM

Thanks for all the good and interesting replies!

I ended up by putting the replacement words into a map (with the original words as keys). And having a recursive function that splits the text into 3 parts that will be reassembled again after the middle part was replaced. Before reassembling the 3 parts, the last part will be used in a call to the same recursive function.

Above works really good if the searched words have a common prefix.

Thanks again!

alnite

3,454

May 25, 2017 08:43 AM

Thanks for all the good and interesting replies!

I ended up by putting the replacement words into a map (with the original words as keys). And having a recursive function that splits the text into 3 parts that will be reassembled again after the middle part was replaced. Before reassembling the 3 parts, the last part will be used in a call to the same recursive function.

Above works really good if the searched words have a common prefix.

Thanks again!

Is this a homework, by any chance? If so, what course is it?

Just curious.

Fastest way to replace words in a text file

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Fastest way to replace words in a text file

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines