C++ counter

Started by
20 comments, last by M2tM 13 years, 7 months ago
I see, true, if you are not careful.

Personally, I wouldn't do UTF-8/16 stuff with bare C-strings, but instead use a class that has an operator[] of constant complexity and which always returns (and internally stores) 32-bit values. Maybe that was why I couldn't imagine an example :)
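
For illustration, a class along those lines might look like the sketch below. The name `CodePointString` and its interface are assumptions made up for this example, not something posted in the thread. It decodes UTF-8 once up front, stores one 32-bit code point per element, and deliberately skips stricter validation (overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical class for illustration: decodes UTF-8 in the constructor and
// stores one 32-bit code point per element, so operator[] is O(1).
class CodePointString {
public:
    explicit CodePointString(const std::string& utf8) {
        std::size_t i = 0;
        while (i < utf8.size()) {
            const unsigned char lead = static_cast<unsigned char>(utf8[i]);
            char32_t cp = 0;
            std::size_t extra = 0;  // continuation bytes that must follow
            if (lead < 0x80)                { cp = lead; }
            else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
            else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
            else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
            else { throw std::invalid_argument("invalid UTF-8 lead byte"); }

            if (extra > utf8.size() - i - 1) {
                throw std::invalid_argument("truncated UTF-8 sequence");
            }
            for (std::size_t k = 1; k <= extra; ++k) {
                const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
                if ((next & 0xC0) != 0x80) {
                    throw std::invalid_argument("invalid UTF-8 continuation byte");
                }
                cp = (cp << 6) | (next & 0x3F);  // append 6 payload bits
            }
            points_.push_back(cp);
            i += extra + 1;
        }
    }

    // Constant-time access; always returns a full 32-bit code point.
    char32_t operator[](std::size_t index) const { return points_[index]; }

    // Number of code points (not bytes, and not user-perceived characters).
    std::size_t size() const { return points_.size(); }

private:
    std::u32string points_;
};
```

Constructing it from the UTF-8 bytes of "héllo" gives a size of 5, and index 1 yields U+00E9. Note that `size()` counts code points, which is not necessarily the same as the number of characters a reader would see.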

I just realised there is also a UTF-EBCDIC, yay.

Encoding is the devil.
Quote:Original post by phresnel
I see, true, if you are not careful.

Personally, I wouldn't do UTF-8/16 stuff with bare C-strings, but instead use a class that has an operator[] of constant complexity and which always returns (and internally stores) 32-bit values. Maybe that was why I couldn't imagine an example :)

I just realised there is also a UTF-EBCDIC, yay.

Encoding is the devil.


I concur. Encoding is the devil, and it is often masked by apparent simplicity, which makes it doubly evil.

Part of the problem, even with your relatively safe approach (compared to more naive ones), is that it doesn't take combining marks into account: a base character followed by a combining mark is two 32-bit values but displays as one character.

A somewhat separate but related issue is that certain accented characters can be encoded in more than one way (as a single precomposed character, or as a base character plus combining characters). That means a simple "does this string contain this other string" comparison requires hairy string normalization to get both strings into a consistent representation first. "How many characters are in my string" is also hard to get right if you accidentally count combining characters as separate characters.
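
To make both points concrete, here is a small sketch. The helper names `isCombiningDiacritic` and `countBaseCharacters` are invented for this example, and checking only the Combining Diacritical Marks block (U+0300 to U+036F) is a simplifying assumption; real code would use proper Unicode normalization and grapheme segmentation from a library.

```cpp
#include <cstddef>
#include <string>

// True for code points in the Combining Diacritical Marks block
// (U+0300..U+036F). Assumption: this one block stands in for the full set
// of combining characters, which is much larger in real Unicode data.
bool isCombiningDiacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts code points that are not combining diacritics, as a rough
// approximation of "characters the reader sees".
std::size_t countBaseCharacters(const std::u32string& text) {
    std::size_t count = 0;
    for (char32_t cp : text) {
        if (!isCombiningDiacritic(cp)) {
            ++count;
        }
    }
    return count;
}

// Two spellings of the same visible character "é":
// one precomposed code point versus 'e' plus COMBINING ACUTE ACCENT.
const std::u32string kPrecomposed = U"\u00E9";
const std::u32string kDecomposed  = U"e\u0301";
```

Even in UTF-32, `kPrecomposed == kDecomposed` is false and their `size()` values are 1 and 2, although both display as the same character. `countBaseCharacters` returns 1 for each, but only because this example stays inside the one block the helper knows about.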

It's a hard problem all around.

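If you're interested in why even UTF-32 isn't safe from variable-length characters (each code point is a fixed 32 bits, but one displayed character can still span several code points):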
read this and follow links
_______________________
"You're using a screwdriver to nail some glue to a ming vase." -ToohrVyk

