strtok() performance

Started by
5 comments, last by antareus 19 years, 6 months ago
I'm using strtok with a multithreaded runtime library (I have to) and it is proving to be a major bottleneck in my code. I don't have exact numbers, but the profiler shows that about 60% of the time is spent inside the strtok function. I use strtok inside an inner loop (I tokenize 100+ MB log files at work) and I'd really like to improve performance as much as possible. Any ideas what I could do? Perhaps a better tokenizer implementation? AFAIK the Boost implementation is significantly slower than the C runtime. Thanks.
you could test this out, just modify it to suit your needs:

#include <string>
#include <deque>
#include <iterator>
#include <algorithm>
#include <iostream>

template< typename Container >
void stringtok(Container& container, const std::string& in,
               const char * const delimiters = " \t\n") {
    const std::string::size_type len = in.length();
    std::string::size_type i = 0;

    while (i < len) {
        // eat leading whitespace
        i = in.find_first_not_of(delimiters, i);
        if (i == std::string::npos)
            return;   // nothing left but white space

        // find the end of the token
        std::string::size_type j = in.find_first_of(delimiters, i);

        // push token
        if (j == std::string::npos) {
            container.push_back(in.substr(i));
            return;
        } else
            container.push_back(in.substr(i, j - i));

        // set up for next loop
        i = j + 1;
    }
}

int main() {
    std::deque<std::string> tokens;
    std::string sentence;

    std::cout << "Enter a sentence:\n";
    std::getline(std::cin, sentence);

    stringtok(tokens, sentence);
    std::copy(tokens.begin(), tokens.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
Well, what sort of data are you parsing? I realize you may not be able to be totally specific, but maybe there are some characteristics of the data that lend themselves to more optimal solutions.
--God has paid us the intolerable compliment of loving us, in the deepest, most tragic, most inexorable sense.- C.S. Lewis
Just arbitrary-size strings (30 bytes - 1 KB), separated by commas and containing nine tokens each. I realize I could write a more efficient custom solution, I just find it surprising that strtok is performing so poorly.
Hmm, I replaced strtok with my own implementation (a simple for loop, really) and the bottleneck shifted to a completely different function. I wonder why the MSVC implementation is so slow...
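
For readers wondering what such a "simple for loop" replacement might look like, here is a minimal sketch for comma-separated lines. It is not the poster's actual code: the name split_fields and its signature are made up for illustration. Note one behavioral difference that is an assumption about what is wanted here: unlike strtok, this version keeps empty fields and leaves the input untouched.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the original poster's code): splits `line` on a
// single delimiter character in one pass over the input.
// Unlike strtok, it keeps empty fields and does not modify its argument.
std::vector<std::string> split_fields(const std::string& line, char delim = ',') {
    std::vector<std::string> fields;
    std::size_t start = 0;  // index where the current field begins

    for (std::size_t i = 0; i <= line.size(); ++i) {
        // Treat the end of the string as a final delimiter.
        if (i == line.size() || line[i] == delim) {
            fields.push_back(line.substr(start, i - start));
            start = i + 1;
        }
    }
    return fields;
}
```

Calling split_fields on a log line gives one std::string per field. If the copies made by substr turn out to matter in a profile, storing offsets into the original buffer instead of strings would likely be the next thing to try.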

On a different note, profilers are cool [smile]
strtok uses some static storage. Hence there is an old rule: never, ever, ever, ever, ever use strtok in a multithreaded environment. Use stringstreams - they're inefficient, but clean.
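
The stringstream approach suggested here could be sketched roughly as follows, using std::getline with a delimiter. The function name split_with_stream is hypothetical. This keeps all parsing state in local objects, so nothing is shared between calls or threads, but it makes no claim about speed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `line` with std::istringstream and
// std::getline. All state lives in local variables, so concurrent calls
// on different strings do not interfere with each other.
std::vector<std::string> split_with_stream(const std::string& line, char delim = ',') {
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;

    // getline reads up to (and discards) the next delimiter each time.
    while (std::getline(stream, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}
```

Empty fields in the middle of the line are preserved, which differs from strtok, where consecutive delimiters are collapsed.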
-- Single player is masturbation.
Quote:Original post by Pxtl
strtok uses some static storage. Hence there is an old rule: never, ever, ever, ever, ever use strtok in a multithreaded environment. Use stringstreams - they're inefficient, but clean.

Don't be so certain. I checked MSDN and it looks like they use thread-local storage for strtok:
Quote:Each function uses a static variable for parsing the string into tokens. If multiple or simultaneous calls are made to the same function, a high potential for data corruption and inaccurate results exists. Therefore, do not attempt to call the same function simultaneously for different strings and be aware of calling one of these functions from within a loop where another routine may be called that uses the same function. However, calling this function simultaneously from multiple threads does not have undesirable effects.
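
The caveat in that quote about calling the function from within a loop that also tokenizes still applies within a single thread. One way around it is a variant that takes an explicit context pointer: strtok_s on MSVC or strtok_r on POSIX systems. The sketch below assumes one of those two is available; the TOKENIZE_R macro and the split_nested function are made-up names for illustration, as is the choice of ';' and ',' as delimiters.

```cpp
#include <string.h>
#include <string>
#include <vector>

// MSVC's strtok_s and POSIX's strtok_r take the same three arguments:
// (string or NULL, delimiters, pointer to caller-owned context).
#ifdef _MSC_VER
#define TOKENIZE_R strtok_s
#else
#define TOKENIZE_R strtok_r
#endif

// Hypothetical helper: splits records on ';' and, inside that loop, splits
// each record on ','. With plain strtok the inner call would overwrite the
// outer call's hidden state; here each loop owns its own context pointer.
std::vector<std::string> split_nested(std::string text) {  // by value: tokenizing writes into the buffer
    std::vector<std::string> tokens;
    char* outer_ctx = nullptr;

    for (char* record = TOKENIZE_R(&text[0], ";", &outer_ctx);
         record != nullptr;
         record = TOKENIZE_R(nullptr, ";", &outer_ctx)) {
        char* inner_ctx = nullptr;
        for (char* field = TOKENIZE_R(record, ",", &inner_ctx);
             field != nullptr;
             field = TOKENIZE_R(nullptr, ",", &inner_ctx)) {
            tokens.push_back(field);
        }
    }
    return tokens;
}
```

Because the context is a local variable, the same function can also be called from several threads on different strings without relying on the runtime's thread-local storage.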

