Write ToLower As Efficiently and Portably As Possible

24 comments, last by SiCrane 9 years ago

Others have said it with more words, but I think it should be underlined a few times:

If you want to support more languages than simple 7-bit ASCII, you simply do not want to write this yourself; it is way too complicated for any single dev who has no intention of becoming a full-time Unicode AND language expert. Use a library.

Then the next question, why the lowercasing?

If it is to search for keys, and you want to support more than simple English 7-bit ASCII, just lowercasing is not enough.

You also need to handle language-dependent rules about which characters match which others.

A Unicode library will also provide this for you, by creating a "folded" version of the string.
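To sketch what using folded keys looks like in practice, here's a minimal example using Python's built-in str.casefold() as a stand-in for a full Unicode library (a real library such as ICU does more; the normalization step and the helper names here are just illustration):

```python
# Case-insensitive key lookup via case folding, sketched with
# Python's built-in str.casefold() (Unicode full case folding).
# A real implementation would lean on a full Unicode library;
# normalizing before folding is shown here as one common step.
import unicodedata

def fold_key(s: str) -> str:
    """Produce a canonical key: normalize, then case-fold."""
    return unicodedata.normalize("NFC", s).casefold()

table = {}

def put(key: str, value) -> None:
    table[fold_key(key)] = value

def get(key: str):
    return table.get(fold_key(key))

put("Straße", 1)
# Plain lowercasing would miss this match: "STRASSE".lower() is
# "strasse" but "Straße".lower() is "straße". Folding maps both
# spellings to "strasse", so the lookup succeeds.
print(get("STRASSE"))  # -> 1
```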

If you want to support more languages than simple 7-bit ASCII, you simply do not want to write this yourself; it is way too complicated for any single dev who has no intention of becoming a full-time Unicode AND language expert.

You don't really need to become a language expert. There are standards committees that put together case transformation rules. Admittedly, once you're able to understand and implement the casing standards, you'll know more about Unicode than 99.9% of all other software developers out there, which will make you a de facto Unicode expert.

You also need to handle language-dependent rules about which characters match which others.

Actually, the Unicode rules for normalization and case folding are language-independent (unlike the rules for case mapping, which are language-dependent). For case folding, the same transformations get applied no matter what the source language is. However, the outputs of case folding are explicitly stated not to be valid for display purposes, only for lookups and comparisons.
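Python's standard library happens to expose both operations, which makes the distinction easy to demonstrate (str.lower() is a case mapping, str.casefold() is Unicode full case folding; this is just an illustration, not a substitute for a full Unicode library):

```python
# Case mapping (str.lower) vs. Unicode full case folding (str.casefold)
# on the German sharp s.
print("ß".lower())     # "ß"  - mapping to lowercase leaves it alone
print("ß".casefold())  # "ss" - folding expands it
# Folded output is for lookups/comparisons only: two different German
# words, "Maße" (measures) and "Masse" (mass), fold to the same string,
# so the folded form must never be shown to the user.
print("Maße".casefold() == "Masse".casefold())  # True, both "masse"
```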

As a native German speaker living in Germany, I'd like to say that I have never heard of that before. Note that the Wikipedia article calls it 'contestable' right in the first line and makes clear it's far from accepted in the rest of the article as well.

Same here, I've never seen this letter in my life, and I'm inclined to think whoever wrote that Wiki page made it up...

Well, there is a German version of that same page which is more verbose. It also mentions that the letter has been discussed since the end of the 19th century. Personally, I wouldn't hold my breath waiting for that thing to ever be used outside isolated incidents, or to get formal orthographic backing.

The character is in the Unicode specification so you should be prepared to handle it if you're dealing with Unicode text. The point is that case mapping characters is non-trivial so the correct result depends a lot on what you want to achieve. The official Unicode consortium website has an informative FAQ on the subject: http://unicode.org/faq/casemap_charprop.html

If you want to implement the Unicode specification, then use the mapping tables provided by it. If you want to normalize text for a case-insensitive string comparison, then you should be doing case folding instead. If you want a grammatically correct result, then you first need to define the grammar rules. The sharp S (ß) from the German alphabet is a good example of the complexities involved, since it could map to SS, SZ, S-S, ẞ, or itself depending on the grammar rules used, and the reverse operation is generally not possible. I can't think of a real-world example of when you would ever have to do a case mapping to lower case like this (instead of case folding), so it's mostly a theoretical problem.
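The irreversibility is easy to see with Python's default case mapping, which uses the ß → SS mapping from the Unicode data files (locale- or grammar-specific tailorings would of course differ):

```python
# Default Unicode case mapping sends ß to SS, and the round trip
# loses information: afterwards there is no way to recover whether
# the original spelling used ß or ss.
word = "Straße"
upper = word.upper()
print(upper)                  # "STRASSE"
print(upper.lower())          # "strasse" - the ß is gone for good
print(upper.lower() == word)  # False
```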

The character is in the Unicode specification so you should be prepared to handle it if you're dealing with Unicode text.
That only shows (once again) how useless Unicode is in practice. Look, it's not just that BitMaster and I happen to be unfamiliar with that character; it simply doesn't exist in practice. I've shown this character to about a dozen people between 30 and 70 in my neighbourhood and not one of them took me seriously; they all thought I was joking. There's no way of typing it on a German keyboard either (well, short of using an Alt+numpad code).

This article from 2014 even argues that, strictly according to the "new writing reform" (which isn't so new anymore), using ß in street names at all is entirely wrong, since the reform mandates ss.

Maybe some nutjob made that capital ß up as an April Fools' joke, then three or four other nutters tried to make it "fashionable" without success, and finally the Unicode consortium took it into their standard because someone had smoked weed and still deemed it useful. And since then it's "official". Or something.

There does indeed exist at least one "real, no-joke" site that I could find (a Hessian community-driven online newspaper) which uses that letter in its name (Gießener Zeitung), but they are being mocked (even on German typesetting sites) for "proudly identifying themselves as the only ones using the ß-versal". Also, funnily, note how their URL is written (IDN, anyone?). Imagine visitors having to type that letter!

I could also find a Bavarian newspaper article from 2008 which says "yeah, this is no joke, typographers are trying to push this" as well as "the DDR had it on the cover of their great lexicon in the '50s already, but it didn't get accepted" (though the photos that I could find seem to suggest that isn't true). They also mock the fact that "all keyboards will have to be replaced".

Anyway, being a valid code point in Unicode is of course an issue because someone might use it, but as often with Unicode, it's merely an annoyance and a mostly theoretical problem, not a really practical one.

A lot of things in Unicode do not make much sense. Heck, there are languages in the BMP which haven't been spoken by any living person for a thousand years, or which only appear in the ancient religious texts of some minority most people have never heard of, and those symbols have lower code points than one very major language which is spoken by two million people every day. Seeing how higher code points encode to more bytes in UTF-8 and more code units in UTF-16, this is nonsensical. Don't even get me started on normalization and things like "numbers in circles" or "small letters" or the varieties of superscript (heck, that's markup, who needs extra symbols for that?) for fear of going totally OT.

The point is that case mapping characters is non-trivial so the correct result depends a lot on what you want to achieve.
That, however, is very true. It's not just non-trivial, it's annoyingly complex (and getting worse with every year).

Well, Wikipedia was right in that I can type it with Shift+AltGr+ß => ẞ on Windows, which I never knew.

I guess I saw a street sign once that had it, but really no one uses it, or people don't know of it and therefore think it's just the lowercase one in between uppercase letters, as they look very similar.

Though the main point of Unicode is having ALL letters, even the most obscure, ancient, or unused ones, so that text can be stored without degradation. What you do with it, whether rendering it on screen or sometimes treating different code points the same, is a follow-up problem you should probably avoid dealing with (never advertise case-insensitivity; require the input to be in the correct form).
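For what it's worth, the capital sharp S is U+1E9E in the standard case-mapping data, so off-the-shelf implementations already handle it; a quick check with Python's built-ins (just an illustration of the standard mappings):

```python
# U+1E9E is LATIN CAPITAL LETTER SHARP S (ẞ). Unicode stores it as
# its own code point; the standard case operations know about it.
cap = "\u1E9E"         # ẞ
print(cap.lower())     # "ß"  - case-maps down to U+00DF
print(cap.casefold())  # "ss" - folds the same way lowercase ß does
# Storage preserves the code point exactly; only the (optional)
# case transformations are lossy. Note the default uppercase of ß:
print("ß".upper())     # "SS" - NOT "\u1E9E"
```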

For case folding, the same transformations get applied no matter what the source language is.

I'm going to have to take this back. According to the Unicode website, and a number of other places such as the ICU documentation, case folding is language- and locale-independent. However, there is a case folding "option" for treating the dotted and dotless I specially that exists just for the Turkish alphabet. So it's language-independent iff you don't consider Turkish to be a language. Admittedly, if the Unicode consortium didn't consider Turkish to be a language, it would explain a lot of things.
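The Turkish special case is visible even with standard-library tools: Python's str.casefold() implements only the default (non-Turkic) folding, so a locale-aware library such as ICU would be needed for the Turkic variant. A small demonstration:

```python
# Default (non-Turkic) case folding of the Turkish I variants.
print("I".casefold())         # "i" - but Turkish expects dotless "ı" (U+0131)
print("İ".casefold())         # "i" + U+0307 COMBINING DOT ABOVE
print("ı".casefold())         # "ı" - dotless i folds to itself
# With default folding, dotted İ does not compare equal to plain i:
print("İ".casefold() == "i")  # False
```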

This topic is closed to new replies.
