Quote:Original post by SiCrane
Quote:Original post by agh-ha-ha
Wide strings, I believe, contain 16-bit elements.
Wide strings in C++ can contain pretty much anything. They don't even have to be vaguely related to Unicode.
Quote:Usually each element represents a "codepoint"; however, some values above 0xFFFF can be encoded into two 16-bit string elements. These are called surrogate pairs.
Actually, most C++ wide character implementations are restricted to the basic multilingual plane subset of the Unicode implementation. They don't use or understand surrogate pairs at all.
In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third party library like ICU.
When using std::wstring on Windows the element size is 16 bits. On most Unix-like platforms wchar_t is 32 bits, and you could always define a (more wasteful) string type with 32-bit elements yourself using the std::basic_string template.
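As a rough sketch of the point about element sizes, the snippet below reports how wide a std::wstring element is on the current platform and defines a string type with 32-bit elements via std::basic_string. The names Utf32String and wideElementBits are made up for illustration (the standard library already provides the same type as std::u32string).

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical alias: a string type whose elements are at least 32 bits wide.
// This is the same type as std::u32string.
using Utf32String = std::basic_string<char32_t>;

// Reports how many bits one std::wstring element occupies on this platform.
std::size_t wideElementBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

With a Utf32String every codepoint fits in one element, at the cost of extra memory for text that is mostly in the basic multilingual plane.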
It is important to take surrogate pairs into account when writing your string handling code. This is definitely a time when a little work now will save a lot of work at 1am on a day when the build was due 6 hours ago for code complete, and if you don't get it done, no one gets paid (this situation is probably not going to happen, but it could :-) ).
Windows will render codepoints encoded as surrogate pairs, and anyone creating their own renderer will need to combine them when they come across them in order to find the correct glyph(s) to render.
Also, rendering is about the only time you would need to worry about them, since manipulating an internationalised string in any other way should be left to the translation team. This is because there are many things that can go wrong when messing with them.
Take converting case as an example:
1) Is there such a concept as case in the language you are using?
2) Does the concept of case work the same way as it does in English?
3) Etc....
It's just not worth messing with a localised string; render it as you find it. To render it properly you will need to combine surrogate pairs into codepoints when you come across them.
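Combining surrogate pairs for rendering could look something like the sketch below. It assumes the text is held as UTF-16 in a std::u16string (on Windows a std::wstring holds the same code units), and decodeUtf16 is a hypothetical helper name. Replacing unpaired surrogates with U+FFFD is my own choice here, not something a renderer is required to do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turns UTF-16 code units into Unicode codepoints,
// combining each high/low surrogate pair into a single value.
std::vector<char32_t> decodeUtf16(const std::u16string& text)
{
    std::vector<char32_t> codepoints;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;
        if (isHigh && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            // Each half carries 10 bits of the codepoint, offset by 0x10000.
            const char32_t high = unit - 0xD800;
            const char32_t low = text[i + 1] - 0xDC00;
            codepoints.push_back(0x10000 + (high << 10) + low);
            ++i; // the low surrogate has been consumed as well
        } else if (isHigh || isLow) {
            // Assumption: show the replacement character for a broken pair.
            codepoints.push_back(0xFFFD);
        } else {
            codepoints.push_back(unit);
        }
    }
    return codepoints;
}
```

The renderer can then look up one glyph per returned codepoint instead of one per string element.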
Another issue with localised strings is creating messages for the player that contain information generated at run time.
For example:
"Player 3 (Bob) wins the game"
could be constructed using something like
message << playerString << " " << playerNum << " (" << playerName << ") " << winMessage;
In another language that word order could be wrong, because readers might expect something more like
"The winner of the game is Player 3 (Bob)"
For this reason, you should use format strings (and the localisation people should position your %s, %d, etc. for you so that they are in the correct place). All of this works fine with surrogate pairs, like everything else that you will need to do, because, as I have said before, just about the only time you need to worry about surrogate pairs is when you come to render a string. (I have said that, haven't I?)
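One thing to watch: plain %s and %d are consumed in argument order, so if a translation needs the arguments in a different order you need placeholders that name their argument. Here is a minimal sketch of that idea using {0}, {1} style placeholders. formatMessage is a hypothetical helper, not a standard library function, and it only handles single-digit indices.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replaces {0}, {1}, ... in a translated pattern with
// the matching argument. Everything else is copied element by element, so
// surrogate pairs in the pattern or the arguments pass through untouched.
std::wstring formatMessage(const std::wstring& pattern,
                           const std::vector<std::wstring>& args)
{
    std::wstring result;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        // Recognise a single-digit placeholder of the form {N}.
        if (pattern[i] == L'{' && i + 2 < pattern.size()
            && pattern[i + 1] >= L'0' && pattern[i + 1] <= L'9'
            && pattern[i + 2] == L'}') {
            const std::size_t index =
                static_cast<std::size_t>(pattern[i + 1] - L'0');
            if (index < args.size()) {
                result += args[index];
                i += 2; // skip the digit and the closing brace
                continue;
            }
        }
        result += pattern[i];
    }
    return result;
}
```

The English pattern would be L"Player {0} ({1}) wins the game" and the translators are free to supply something shaped like L"The winner of the game is Player {0} ({1})", or to swap {0} and {1}, without any code changes.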
The only other time I can think of (at the moment :-) ) that surrogate pairs would be important is when you are figuring out how far along a string in memory the on-screen caret is. For every surrogate pair you would need to bump the string iterator 2 places rather than 1 whilst only moving the caret on screen 1 place. This is really to do with rendering, so it should be covered by my previous statement :-).
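The caret bookkeeping might be sketched like this, again assuming UTF-16 text in a std::u16string. caretToOffset is a hypothetical name, and the sketch assumes one caret step per codepoint, which ignores combining characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts a caret position (counted in codepoints)
// into an index into the UTF-16 string. A caret past the end of the text
// is clamped to the string length.
std::size_t caretToOffset(const std::u16string& text, std::size_t caret)
{
    std::size_t offset = 0;
    while (caret > 0 && offset < text.size()) {
        const bool isHigh = text[offset] >= 0xD800 && text[offset] <= 0xDBFF;
        const bool hasLow = offset + 1 < text.size()
            && text[offset + 1] >= 0xDC00 && text[offset + 1] <= 0xDFFF;
        // A surrogate pair moves the string index 2 places, the caret only 1.
        offset += (isHigh && hasLow) ? 2 : 1;
        --caret;
    }
    return offset;
}
```

So for a string holding 'a', one surrogate pair and 'b', a caret sitting after the second visible character maps to index 3 in memory.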