Write ToLower As Efficient, Portable As Possible

Started by
24 comments, last by SiCrane 8 years, 12 months ago

Hello

I am challenging myself to try to attempt to write a function that is as efficient, portable and failsafe as possible. The function is very simple and just converts a string (or wstring) to lower case. Are there any improvements you think I could make and any errors I have done?


#ifdef UNICODE
typedef std::wstring tstring;
typedef std::wstring::npos tnpos;
#else
typedef std::string tstring;
//typedef std::string::npos tnpos;  // throws compiler error?!
#endif


tstring toLower(const tstring& str)
{
    // Alternative 1: Returns a copy of a local string so not that efficient.
	
    unsigned int len = str.length(); 	// use unsigned int so I only have 4 bytes as opposed to 16 for int
    tstring lower(len, 'a'); 		// ensure string is correct size to avoid dynamic resizing. Reserve at construction; performing 2 steps at once defintion and reserving - is faster right?
    // or use lower.reserve(len);
	
    for (unsigned int i=0; i<len; i++)      // iterate using indexes. Maybe iterators could be more fail safe/efficient/etc.?
	lower[i] = tolower(str[i]);     
		
    return lower;
}

void toLower(const tstring& inStr, tstring& outStr)
{
    // Alternative 2: Have user define string and pass by reference (this would be faster?)
	
    unsigned int len = inStr.length();	// use unsigned int so I only have 4 bytes as opposed to 16 for int. Store length in local variable because we look it up again in the below for loop
    outStr.resize(len); 			// ensure string is correct size to avoid dynamic resizing.
	
    for (unsigned int i=0; i<len; i++)      // iterate using indexes. Maybe iterators could be more fail safe/efficient/etc.?
	outStr[i] = tolower(inStr[i]);
}
Advertisement
There are existing "tolower" functions that operate on chars, so I'd be careful on names.

For most portable and efficient, I'd go with this (untested, written in the forum editor so probably bad):


const char LowerCaseLookupTable[ ] = {0, 1, 2, 3 4, ... } ;

void ToLowerInPlace( char* str ) {
while (*str) {
*str = LowerCaseLookupTable[*str];
++str;
}
}

Since allocation needs to generally follow a source/sink model or a smart object model, I'd also consider this variant requiring the caller to manage memory lifetimes:

void ToLowerString( const char* source, char* dest) {
while(*source) {
*dest++ = LowerCaseLookupTable[*source++];
}

If you're planning on representing your strings as Unicode then your code is not correct. Not only can you not do a case conversion per code unit (which is what you're doing), but you can in general not even do it per code point. There are special cases where one or more characters can get converted to a single character and vice versa (ß <-> SS in German is the most famous example of this).

Hello

I am challenging myself to try to attempt to write a function that is as efficient, portable and failsafe as possible. The function is very simple and just converts a string (or wstring) to lower case. Are there any improvements you think I could make and any errors I have done?

It's a linear operation. There's nothing inheritly "fast" or "slow" about converting a string to lowercase. In fact, SPEED is not what I would focus on here, utility is, proper handling of unicode or similar characters is far more important than "speed" in this case. Iterators vs indices: There is literally no difference.... additionally:

#ifdef UNICODE
typedef std::wstring tstring;
typedef std::wstring::npos tnpos;
#else
typedef std::string tstring;
//typedef std::string::npos tnpos;  // throws compiler error?!
#endif

npos is not a type.

tstring toLower(const tstring& str)
{
    // Alternative 1: Returns a copy of a local string so not that efficient.

void toLower(const tstring& inStr, tstring& outStr)
{
    // Alternative 2: Have user define string and pass by reference (this would be faster?)

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

	
    unsigned int len = str.length(); 	// use unsigned int so I only have 4 bytes as opposed to 16 for int

On what platform are you where INT is 16 bytes, and UNSIGNED INT is 4 bytes.

tstring lower(len, 'a'); // ensure string is correct size to avoid dynamic resizing. Reserve at construction; performing 2 steps at once defintion and reserving - is faster right?
// or use lower.reserve(len);

Why are you creating a string of length "len" initialized to 'a'? You realize that this requires an iteration of the entire contents of the string to fill it with 'a', right?

    for (unsigned int i=0; i<len; i++)      // iterate using indexes. Maybe iterators could be more fail safe/efficient/etc.?
	lower[i] = tolower(str[i]);     

Iterators vs Indices: The great non-debate. Use iterators, the code will be far clearer.

Additionally, if you must use a copy:
std::string dest;
dest.reserve(data.length()); //if you MUST avoid the allocation, although, to be honest, for small strings this is unnecessary and for long strings the allocation overhead isn't something to worry about (since you should be doing in place modification instead of copying in those cases)
std::transform(data.begin(), data.end(), std::back_inserter(dest), ::tolower);
Do note how i'm not bothering to deal with unicode strings, as that's a complex bag of worms best dealt with by using a proper strings/language library.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

If you're planning on representing your strings as Unicode then your code is not correct. Not only can you not do a case conversion per code unit (which is what you're doing), but you can in general not even do it per code point. There are special cases where one or more characters can get converted to a single character and vice versa (ß <-> SS in German is the most famous example of this).

It gets worse than that. In some languages, lower casing a letter can be context sensitive. The example I usually see is that the Greek capital letter ? is lower cased as ? in the final letter of the word and ? in other locations.

Thanks for the advice. So regarding unicode strings what libraries do you guys use? Boost C++?

Generally, I use the ICU library's C bindings. I didn't like C++ bindings that ICU provides, and by the time the boost::locale wrapper for ICU got introduced, I already had my own C++ wrappers that do what I want.

Out of curiosity, where in your game did a profiler say tolower was a performance issue?

I agree with the above that it is best left to the standard library.

Another alternative is to lowercase your strings once and keep one always lower cased version for comparison and the original version for display.

It takes more ram but it is faster.

You could also.look at comparing hashes of lowercase strings instead (this can be even more efficient still).

Anyhow beware the premature optimisation monster!


Out of curiosity, where in your game did a profiler say tolower was a performance issue?

To be fair to the OP, he never said he was planning to use this in a game.


I am challenging myself to try to attempt to write a function that is as efficient, portable and failsafe as possible.

I think it's more along the lines of a programming exercise.

Kinda like writing your own linked list; everyone should write one once and then never ever use it. :)

if you think programming is like sex, you probably haven't done much of either.-------------- - capn_midnight

"Portable" and "Efficient" is almost a polar opposite of each other.

Efficiency makes certain assumptions and/or optimizations based on current architecture/environment/encoding (e.g. ASCII only).

Portable means that it must be usable in as many architectures and environments as possible.

This topic is closed to new replies.

Advertisement