Write ToLower As Efficient, Portable As Possible

Started by
24 comments, last by SiCrane 8 years, 12 months ago

"Portable" and "Efficient" is almost a polar opposite of each other.


Nah, you can have both. Most of your lines-of-code become #ifdef directives, though. sad.png
Advertisement

At this level it really becomes a matter writing your code as plainly as possible, so that the compiler can oprimize it.

What you want your compiler to do here is really to inline toLower and vectorize it, so it can apply it to up to 32 bytes at a time at the same time, for large strings. (This is like packing 4 characters in an int, and having a special tolower function for that). For short strings, you may not want to vectorize (pack into large chuncks), but they are a special case, because in that case the string is stored in the string object itself.

If you were a library writer with a particular string implementation and compiler locked in, you could do better. You could hand write the vectorized toLower, and take advantage of the fact that you can write past the end of string, up to it's capalicy which would always be a multiple of the vector width (vector width is the size interger you're packing your string into for your special tolower operation).

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

I think it could make a big difference because by passing in a reference the code introduces the posibility that the input and output aliases. So passing in a reference could actually be slower. Unless, that is, you explicitly tell the compiler that the feilds cannot alias, with the restrict keyword or other language extentions.

At this level it really becomes a matter writing your code as plainly as possible, so that the compiler can oprimize it.

What you want your compiler to do here is really to inline toLower and vectorize it, so it can apply it to up to 32 bytes at a time at the same time, for large strings. (This is like packing 4 characters in an int, and having a special tolower function for that). For short strings, you may not want to vectorize (pack into large chuncks), but they are a special case, because in that case the string is stored in the string object itself.

If you were a library writer with a particular string implementation and compiler locked in, you could do better. You could hand write the vectorized toLower, and take advantage of the fact that you can write past the end of string, up to it's capalicy which would always be a multiple of the vector width (vector width is the size interger you're packing your string into for your special tolower operation).

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

I think it could make a big difference because by passing in a reference the code introduces the posibility that the input and output aliases. So passing in a reference could actually be slower. Unless, that is, you explicitly tell the compiler that the feilds cannot alias, with the restrict keyword or other language extentions.

This is a great idea if you want to throw all portability out of the window. I can imagine vectorizing it through a 64 Bit integer processing 8 bytes of a string per loop iteration. If you're processing huge strings ("ropes") this has massive benefit but for most uses will you notice the difference of nanoseconds when processing the average length of say 32 byte ascii string?
Here is how I'd do it.

Precalculate the tolower "map":

/** Case insensitive map, ASCII rules.
 * That is;
 * A == a, B == b.
 */
unsigned const char ascii_case_insensitive_map[256] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,                                   /* 0-19 */
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,                         /* 20-39 */
        40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,                         /* 40-59 */
        60, 61, 62, 63, 64, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,             /* 60-79 */
        112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 91, 92, 93, 94, 95, 96, 97, 98, 99,              /* 80-99 */
        100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,     /* 100-119 */
        120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,     /* 120-139 */
        140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,     /* 140-159 */
        160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,     /* 160-179 */
        180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199,     /* 180-199 */
        200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219,     /* 200-219 */
        220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,     /* 220-239 */
        240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255                          /* 240-255 */
};
Then just iterate the string in place until *c++==0, replacing *c with ascii_case_insensitive_map[*c].

This assumes ascii and nothing more complex like utf-8 or unicode...

Here's another:


for (char* c = str; *c; ++c)
{
    if (*c >= 'A' && *c <= 'Z')
        *c = *c & 0xDF;
}

This was my old favourite on 8 bit assembler platforms.

You can make it toupper() by changing the & 0xDF to | 0x20 too... Neat eh? smile.png

At this level it really becomes a matter writing your code as plainly as possible, so that the compiler can oprimize it.

What you want your compiler to do here is really to inline toLower and vectorize it, so it can apply it to up to 32 bytes at a time at the same time, for large strings. (This is like packing 4 characters in an int, and having a special tolower function for that). For short strings, you may not want to vectorize (pack into large chuncks), but they are a special case, because in that case the string is stored in the string object itself.

If you were a library writer with a particular string implementation and compiler locked in, you could do better. You could hand write the vectorized toLower, and take advantage of the fact that you can write past the end of string, up to it's capalicy which would always be a multiple of the vector width (vector width is the size interger you're packing your string into for your special tolower operation).

Move construction and such will eliminate the temporary. Returning vs passing in a reference really provides no difference in this case.

I think it could make a big difference because by passing in a reference the code introduces the posibility that the input and output aliases. So passing in a reference could actually be slower. Unless, that is, you explicitly tell the compiler that the feilds cannot alias, with the restrict keyword or other language extentions.

This is a great idea if you want to throw all portability out of the window. I can imagine vectorizing it through a 64 Bit integer processing 8 bytes of a string per loop iteration. If you're processing huge strings ("ropes") this has massive benefit but for most uses will you notice the difference of nanoseconds when processing the average length of say 32 byte ascii string?

The idea is not to throw portability out the window, but to use an autovectorizing compiler. Clang, gcc, and msvc will all auto-vectorize, although I cannot say how well they will do in this case.

The bottom line is your compiler can optimize better than you can imagine, so focus on writing clear code, and let the compiler handle the micro-optimization.

Re:"but for most uses will you notice the difference of nanoseconds when processing the average length of say 32 byte ascii string?"

I imagine the most performance relevant use of toLower is a case insesitive compare over a large container or database of records. That could be noticable, even if each string is short. My guess is that even for short strings, vectorization would make a difference if you apply toLower to the capacity instead of up to the size of the string. (but admittedly, that wouldn't follow the above maxim)

I imagine the most performance relevant use of toLower is a case insesitive compare over a large container or database of records. That could be noticable, even if each string is short.

I doubt it would be noticeable in that case. Once you start talking about a large number of different strings, you get into the realm where memory dominates computation speeds. You might shave off a few clock cycles in your computation, but your overall processing time is still going to look pretty much identical to memory used divided by memory bandwidth.

I imagine the most performance relevant use of toLower is a case insesitive compare over a large container or database of records. That could be noticable, even if each string is short.

I doubt it would be noticeable in that case. Once you start talking about a large number of different strings, you get into the realm where memory dominates computation speeds. You might shave off a few clock cycles in your computation, but your overall processing time is still going to look pretty much identical to memory used divided by memory bandwidth.

That's a good point, you're probably right.

If you're planning on representing your strings as Unicode then your code is not correct. Not only can you not do a case conversion per code unit (which is what you're doing), but you can in general not even do it per code point. There are special cases where one or more characters can get converted to a single character and vice versa (ß <-> SS in German is the most famous example of this).

It gets worse than that. In some languages, lower casing a letter can be context sensitive. The example I usually see is that the Greek capital letter ? is lower cased as ? in the final letter of the word and ? in other locations.

That is a much better example for this case. The German ß is not such a great example for lowercasing since there is no capital ß, so there is no way of needing to lowercase it ever -- unless someone writes incorrect German and uses ß in a capitalized word -- but in that case it would be correct to just leave the ß as-is (which works fine with a 1:1 mapping).

On the other hand, uppercasing either ß or ss would give SS, so yes, there might be a problem with a simple 1:1 mapping. But I am not even sure if uppercasing is legal/possible in all languages, anyway.

Ligatures come to mind, too -- again, should not be an issue with lowercasing (since I'm not aware of any real, existing, upper-case ligatures). Uppercasing them might be a real challenge, especially in some middle-east languages where ligatures are non-optional (but also with the easy ff ffi ffl stuff it's hard enough).


The German ß is not such a great example for lowercasing since there is no capital ß, so there is no way of needing to lowercase it ever -- unless someone writes incorrect German and uses ß in a capitalized word -- but in that case it would be correct to just leave the ß as-is (which works fine with a 1:1 mapping).

There actually is a capital ?; http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

The situation is a bit complicated though because there's a lot of legacy usage of SS out there (official street names in Germany use or used to to use SS for example) and the Unicode specification still has ß mapping to SS: http://unicode.org/faq/casemap_charprop.html#11

This topic is closed to new replies.

Advertisement