I see, true, if you are not careful.
Personally, I wouldn't do UTF-8/16 stuff with bare C-strings, but instead use a class that has an operator[] of constant complexity and which always returns (and internally stores) 32-bit values. Maybe that was why I couldn't imagine an example :)
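A minimal sketch of what such a class might look like (the name `U32String` and the details are hypothetical, and it assumes well-formed UTF-8 input): decode once at construction into 32-bit code points, so indexing is a plain array access rather than a byte scan.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: decode UTF-8 up front into 32-bit code points,
// so operator[] is constant-time. Assumes the input is well-formed UTF-8
// (no validation is done here).
class U32String {
public:
    explicit U32String(const std::string& utf8) {
        for (std::size_t i = 0; i < utf8.size(); ) {
            unsigned char b = static_cast<unsigned char>(utf8[i]);
            std::uint32_t cp;
            std::size_t len;
            if      (b < 0x80) { cp = b;        len = 1; } // 1-byte sequence
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; } // 2-byte sequence
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; } // 3-byte sequence
            else               { cp = b & 0x07; len = 4; } // 4-byte sequence
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
            points_.push_back(cp);
            i += len;
        }
    }
    std::uint32_t operator[](std::size_t i) const { return points_[i]; } // O(1)
    std::size_t size() const { return points_.size(); }
private:
    std::vector<std::uint32_t> points_; // always stored as 32-bit values
};
```

Note that this gets you constant-time access to code points, not to user-perceived characters; combining marks are still separate entries.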
I just realised there is also a UTF-EBCDIC, yay.
Encoding is the devil.
Quote:Original post by phresnel
I see, true, if you are not careful.
Personally, I wouldn't do UTF-8/16 stuff with bare C-strings, but instead use a class that has an operator[] of constant complexity and which always returns (and internally stores) 32-bit values. Maybe that was why I couldn't imagine an example :)
I just realised there is also a UTF-EBCDIC, yay.
Encoding is the devil.
I concur. Encoding is the devil, and it is often masked with apparent simplicity, which makes it doubly evil.
Part of the problem, even with your relatively safe approach (compared to other naive approaches), is that it doesn't take combining marks into account.
A somewhat separate but related issue is that certain accented characters have multiple byte representations (a precomposed code point vs. a base character plus combining marks). That means even a simple "does this string contain this other string" comparison requires hairy string normalization first, to get characters into a consistent representation. "How many characters are in my string?" also becomes difficult to answer correctly if you accidentally count combining characters as unique characters.
It's a hard problem all around.
If you're interested in why even UTF-32 isn't safe from variable-length characters:
read this and follow links