Sign in to follow this  
HiMK

Write ToLower As Efficient, Portable As Possible

Recommended Posts

Hello

 

I am challenging myself to try to attempt to write a function that is as efficient, portable and failsafe as possible. The function is very simple and just converts a string (or wstring) to lower case. Are there any improvements you think I could make and any errors I have done?

#ifdef UNICODE
typedef std::wstring tstring;
typedef std::wstring::npos tnpos;
#else
typedef std::string tstring;
//typedef std::string::npos tnpos;  // throws compiler error?!
#endif


tstring toLower(const tstring& str)
{
    // Alternative 1: Returns a copy of a local string so not that efficient.
	
    unsigned int len = str.length(); 	// use unsigned int so I only have 4 bytes as opposed to 16 for int
    tstring lower(len, 'a'); 		// ensure string is correct size to avoid dynamic resizing. Reserve at construction; performing 2 steps at once defintion and reserving - is faster right?
    // or use lower.reserve(len);
	
    for (unsigned int i=0; i<len; i++)      // iterate using indexes. Maybe iterators could be more fail safe/efficient/etc.?
	lower[i] = tolower(str[i]);     
		
    return lower;
}

void toLower(const tstring& inStr, tstring& outStr)
{
    // Alternative 2: Have user define string and pass by reference (this would be faster?)
	
    unsigned int len = inStr.length();	// use unsigned int so I only have 4 bytes as opposed to 16 for int. Store length in local variable because we look it up again in the below for loop
    outStr.resize(len); 			// ensure string is correct size to avoid dynamic resizing.
	
    for (unsigned int i=0; i<len; i++)      // iterate using indexes. Maybe iterators could be more fail safe/efficient/etc.?
	outStr[i] = tolower(inStr[i]);
}
Edited by HiMK

Share this post


Link to post
Share on other sites

Generally, I use the ICU library's C bindings. I didn't like C++ bindings that ICU provides, and by the time the boost::locale wrapper for ICU got introduced, I already had my own C++ wrappers that do what I want.

Share this post


Link to post
Share on other sites
Out of curiosity, where in your game did a profiler say tolower was a performance issue?

I agree with the above that it is best left to the standard library.

Another alternative is to lowercase your strings once and keep one always lower cased version for comparison and the original version for display.

It takes more ram but it is faster.

You could also.look at comparing hashes of lowercase strings instead (this can be even more efficient still).

Anyhow beware the premature optimisation monster!

Share this post


Link to post
Share on other sites


Out of curiosity, where in your game did a profiler say tolower was a performance issue?

To be fair to the OP, he never said he was planning to use this in a game.

 


I am challenging myself to try to attempt to write a function that is as efficient, portable and failsafe as possible.

 

I think it's more along the lines of a programming exercise.

 

Kinda like writing your own linked list; everyone should write one once and then never ever use it. :)

Share this post


Link to post
Share on other sites

"Portable" and "Efficient" is almost a polar opposite of each other.

 

Efficiency makes certain assumptions and/or optimizations based on current architecture/environment/encoding (e.g. ASCII only).

Portable means that it must be usable in as many architectures and environments as possible.

Share this post


Link to post
Share on other sites

At this level it really becomes a matter writing your code as plainly as possible, so that the compiler can oprimize it.

 

What you want your compiler to do here is really to inline toLower and vectorize it, so it can apply it to up to 32 bytes at a time at the same time, for large strings. (This is like packing 4 characters in an int, and having a special tolower function for that). For short strings, you may not want to vectorize (pack into large chuncks), but they are a special case, because in that case the string is stored in the string object itself.

 

If you were a library writer with a particular string implementation and compiler locked in, you could do better. You could hand write the vectorized toLower, and take advantage of the fact that you can write past the end of string, up to it's capalicy which would always be a multiple of the vector width (vector width is the size interger you're packing your string into for your special tolower operation).

 

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

I think it could make a big difference because by passing in a reference the code introduces the posibility that the input and output aliases. So passing in a reference could actually be slower. Unless, that is, you explicitly tell the compiler that the feilds cannot alias, with the restrict keyword or other language extentions.

Share this post


Link to post
Share on other sites

At this level it really becomes a matter writing your code as plainly as possible, so that the compiler can oprimize it.
 
What you want your compiler to do here is really to inline toLower and vectorize it, so it can apply it to up to 32 bytes at a time at the same time, for large strings. (This is like packing 4 characters in an int, and having a special tolower function for that). For short strings, you may not want to vectorize (pack into large chuncks), but they are a special case, because in that case the string is stored in the string object itself.
 
If you were a library writer with a particular string implementation and compiler locked in, you could do better. You could hand write the vectorized toLower, and take advantage of the fact that you can write past the end of string, up to it's capalicy which would always be a multiple of the vector width (vector width is the size interger you're packing your string into for your special tolower operation).
 

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

I think it could make a big difference because by passing in a reference the code introduces the posibility that the input and output aliases. So passing in a reference could actually be slower. Unless, that is, you explicitly tell the compiler that the feilds cannot alias, with the restrict keyword or other language extentions.

This is a great idea if you want to throw all portability out of the window. I can imagine vectorizing it through a 64 Bit integer processing 8 bytes of a string per loop iteration. If you're processing huge strings ("ropes") this has massive benefit but for most uses will you notice the difference of nanoseconds when processing the average length of say 32 byte ascii string?

Share this post


Link to post
Share on other sites
Here is how I'd do it.

Precalculate the tolower "map":

/** Case insensitive map, ASCII rules.
 * That is;
 * A == a, B == b.
 */
unsigned const char ascii_case_insensitive_map[256] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,                                   /* 0-19 */
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,                         /* 20-39 */
        40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,                         /* 40-59 */
        60, 61, 62, 63, 64, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,             /* 60-79 */
        112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 91, 92, 93, 94, 95, 96, 97, 98, 99,              /* 80-99 */
        100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,     /* 100-119 */
        120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,     /* 120-139 */
        140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,     /* 140-159 */
        160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,     /* 160-179 */
        180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199,     /* 180-199 */
        200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219,     /* 200-219 */
        220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,     /* 220-239 */
        240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255                          /* 240-255 */
};
Then just iterate the string in place until *c++==0, replacing *c with ascii_case_insensitive_map[*c].

This assumes ascii and nothing more complex like utf-8 or unicode...

Share this post


Link to post
Share on other sites

Here's another:
 

for (char* c = str; *c; ++c)
{
    if (*c >= 'A' && *c <= 'Z')
        *c = *c & 0xDF;
}

This was my old favourite on 8 bit assembler platforms.

You can make it toupper() by changing the & 0xDF to | 0x20 too... Neat eh? smile.png

Edited by braindigitalis

Share this post


Link to post
Share on other sites

 

At this level it really becomes a matter writing your code as plainly as possible, so that the compiler can oprimize it.
 
What you want your compiler to do here is really to inline toLower and vectorize it, so it can apply it to up to 32 bytes at a time at the same time, for large strings. (This is like packing 4 characters in an int, and having a special tolower function for that). For short strings, you may not want to vectorize (pack into large chuncks), but they are a special case, because in that case the string is stored in the string object itself.
 
If you were a library writer with a particular string implementation and compiler locked in, you could do better. You could hand write the vectorized toLower, and take advantage of the fact that you can write past the end of string, up to it's capalicy which would always be a multiple of the vector width (vector width is the size interger you're packing your string into for your special tolower operation).
 

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

I think it could make a big difference because by passing in a reference the code introduces the posibility that the input and output aliases. So passing in a reference could actually be slower. Unless, that is, you explicitly tell the compiler that the feilds cannot alias, with the restrict keyword or other language extentions.

This is a great idea if you want to throw all portability out of the window. I can imagine vectorizing it through a 64 Bit integer processing 8 bytes of a string per loop iteration. If you're processing huge strings ("ropes") this has massive benefit but for most uses will you notice the difference of nanoseconds when processing the average length of say 32 byte ascii string?

 

The idea is not to throw portability out the window, but to use an autovectorizing compiler. Clang, gcc, and msvc will all auto-vectorize, although I cannot say how well they will do in this case.

 

The bottom line is your compiler can optimize better than you can imagine, so focus on writing clear code, and let the compiler handle the micro-optimization.

 

 

Re:"but for most uses will you notice the difference of nanoseconds when processing the average length of say 32 byte ascii string?"

I imagine the most performance relevant use of toLower is a case insesitive compare over a large container or database of records. That could be noticable, even if each string is short. My guess is that even for short strings, vectorization would make a difference if you apply toLower to the capacity instead of up to the size of the string. (but admittedly, that wouldn't follow the above maxim)

Edited by King Mir

Share this post


Link to post
Share on other sites

I imagine the most performance relevant use of toLower is a case insesitive compare over a large container or database of records. That could be noticable, even if each string is short.

I doubt it would be noticeable in that case. Once you start talking about a large number of different strings, you get into the realm where memory dominates computation speeds. You might shave off a few clock cycles in your computation, but your overall processing time is still going to look pretty much identical to memory used divided by memory bandwidth.

Share this post


Link to post
Share on other sites

 

I imagine the most performance relevant use of toLower is a case insesitive compare over a large container or database of records. That could be noticable, even if each string is short.

I doubt it would be noticeable in that case. Once you start talking about a large number of different strings, you get into the realm where memory dominates computation speeds. You might shave off a few clock cycles in your computation, but your overall processing time is still going to look pretty much identical to memory used divided by memory bandwidth.

 

That's a good point, you're probably right.

Share this post


Link to post
Share on other sites

 

If you're planning on representing your strings as Unicode then your code is not correct. Not only can you not do a case conversion per code unit (which is what you're doing), but you can in general not even do it per code point. There are special cases where one or more characters can get converted to a single character and vice versa (ß <-> SS in German is the most famous example of this).

It gets worse than that. In some languages, lower casing a letter can be context sensitive. The example I usually see is that the Greek capital letter ? is lower cased as ? in the final letter of the word and ? in other locations.

That is a much better example for this case. The German ß is not such a great example for lowercasing since there is no capital ß, so there is no way of needing to lowercase it ever -- unless someone writes incorrect German and uses ß in a capitalized word -- but in that case it would be correct to just leave the ß as-is (which works fine with a 1:1 mapping).

 

On the other hand, uppercasing either ß or ss would give SS, so yes, there might be a problem with a simple 1:1 mapping. But I am not even sure if uppercasing is legal/possible in all languages, anyway.

 

Ligatures come to mind, too -- again, should not be an issue with lowercasing (since I'm not aware of any real, existing, upper-case ligatures). Uppercasing them might be a real challenge, especially in some middle-east languages where ligatures are non-optional (but also with the easy ff ffi ffl stuff it's hard enough).

Share this post


Link to post
Share on other sites


The German ß is not such a great example for lowercasing since there is no capital ß, so there is no way of needing to lowercase it ever -- unless someone writes incorrect German and uses ß in a capitalized word -- but in that case it would be correct to just leave the ß as-is (which works fine with a 1:1 mapping).

 

There actually is a capital ?; http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

 

The situation is a bit complicated though because there's a lot of legacy usage of SS out there (official street names in Germany use or used to to use SS for example) and the Unicode specification still has ß mapping to SS: http://unicode.org/faq/casemap_charprop.html#11

Share this post


Link to post
Share on other sites
As a native German speaker living in Germany, I'd like to say that I have never heard of that before. Note that the Wikipedia article calls it 'contestable' right in the first line and makes clear it's far from accepted in the rest of the article as well.

Share this post


Link to post
Share on other sites

Others have said it with more words, but I think it should be underlined a few times:

If you want to support more languages then simple 7bit ascii, you simply do not want to write this yourself, it is way to complicated for any single dev that has no intention of becoming a full time unicode AND language expert. Use a library.

 

Then the next question, why the lowercasing?

If it is to search for keys, and you want to support more then simple english 7bit ascii, just lowercasing is not enough.

You also need to handle language dependant rules of what characters match with others.

A unicode library will provide also this for you, by creating a "folded" version of the string.

Share this post


Link to post
Share on other sites

If you want to support more languages then simple 7bit ascii, you simply do not want to write this yourself, it is way to complicated for any single dev that has no intention of becoming a full time unicode AND language expert.

You don't really need to become a language expert. There are standards committees that put together case transformation rules. Admittedly, once you're able to understand and implement the casing standards, you'll know more about Unicode than 99.9% of all other software developers out there, which will make you a de facto Unicode expert.

You also need to handle language dependant rules of what characters match with others.

Actually, the Unicode rules for normalization and case folding are language independent (unlike the rules for case mapping, which are language dependent). For case folding, the same transformations get applied no matter what the source language is. However, the outputs of case folding are explicitly stated not to be valid for display purposes - only look ups and comparisons.

Share this post


Link to post
Share on other sites

As a native German speaker living in Germany, I'd like to say that I have never heard of that before. Note that the Wikipedia article calls it 'contestable' right in the first line and makes clear it's far from accepted in the rest of the article as well.

Same here, I've never seen this letter in my life, and I'm inclined to think whoever wrote that Wiki page made it up... biggrin.png

Share this post


Link to post
Share on other sites
Well, there is a German version of that same page which is more verbose. It also mentions that there was talk about that letter from the end of the 19th century on. Personally I wouldn't hold my breath that thing is ever actually used outside isolated incidents and with formal orthographic backing.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this