UTF-8 strings

Started by
14 comments, last by SiCrane 16 years, 6 months ago
Quote:Original post by Nitage
There's no type in the C++ standard library that can deal with variable width encodings.

Streams are built so that variable-width encodings can be used. C++ does not specify a Unicode encoding, but if you imbue a stream with a locale containing an appropriate codecvt facet, the stream is designed to convert between the external variable-width encoding and the internal character type.

This applies to file streams, string streams, and any other stream objects. It does not, however, apply to strings: the intention is that variable-width encodings are only used when reading from and writing to external sources.
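For example (a minimal sketch, assuming a C++11 compiler; std::codecvt_utf8 lives in <codecvt> and was later deprecated in C++17, and the file name here is made up), reading a UTF-8 encoded file into wide strings might look like this:

#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated in C++17)
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Hypothetical input file assumed to contain UTF-8 encoded text.
    std::wifstream in("example_utf8.txt");

    // Imbue a locale whose codecvt facet converts the external UTF-8
    // byte sequence into wide characters as the stream reads it.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

    std::wstring line;
    while (std::getline(in, line))
    {
        // 'line' now holds decoded wide characters, not raw UTF-8 bytes.
        std::wcout << line << L'\n';
    }
}

Note that std::codecvt_utf8 converts to UCS-2 or UCS-4 depending on how wide wchar_t is; with a 16-bit wchar_t it will not generate surrogate pairs (std::codecvt_utf8_utf16 is the facet for that).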
Quote:Original post by SiCrane
Quote:Original post by agh-ha-ha
Wide strings, I believe, contain 16-bit elements.

Wide strings in C++ can contain pretty much anything. They don't even have to be vaguely related to Unicode.

Quote:Usually each element represents a "codepoint"; however, it is possible to encode some values above 0xFFFF into two 16-bit string elements, which are called surrogate pairs.

Actually, most C++ wide character implementations are restricted to the Basic Multilingual Plane subset of Unicode. They don't use or understand surrogate pairs at all.

In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third party library like ICU.


When using std::wstring the element size is 16 bits (on all platforms I can think of, although you could define a wasteful string type that uses 32-bit elements using the std::basic_string template).
It is important to take surrogate pairs into account when writing your string handling code. This is definitely a time when a little work now will save a lot of work at 1am on a day when the build was due 6 hours ago for code complete, and if you don't get it done, no one gets paid (this situation is probably not going to happen, but it could :-) ).
Windows will render codepoints encoded as surrogate pairs, and anyone creating their own renderer will need to combine them if they come across them in order to find the correct glyph(s) to render.

Also, rendering is about the only time you would need to worry about them, since manipulating an internationalised string in any other way should be left to the translation team. This is because there are many things that can go wrong when messing with them.

Using case conversion as an example:

1) Is there such a concept of case in the language you are using?
2) Does the concept of case work the same way as it does in English?
3) Etc....

It's just not worth messing with a localised string; render it as you find it. To render it properly you will need to combine surrogate pairs into codepoints when you come across them.
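As a rough illustration of that last step (a sketch only; the function name is made up and it assumes the caller has already checked that it really has a high surrogate followed by a low one), combining a pair back into a code point is just some arithmetic:

#include <boost/cstdint.hpp>

// Combine a UTF-16 surrogate pair into a single Unicode code point.
// 'hi' must be a high surrogate (0xD800-0xDBFF) and 'lo' a low
// surrogate (0xDC00-0xDFFF); no validation is done here.
boost::uint32_t CombineSurrogatePair(boost::uint16_t hi, boost::uint16_t lo)
{
    return 0x10000u
         + ((static_cast<boost::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<boost::uint32_t>(lo) - 0xDC00u);
}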

Another issue with localised strings is creating messages for the player that contain information generated at run time.

For example:

"Player 3 (Bob) wins the game"

could be constructed using something like

message << playerString << " " << playerNum << " (" << playerName << ") " << winMessage;


In another language that word order could be wrong, because a reader might expect to read something more like

"The winer of the game is Player 3 (Bob)"

For this reason, you should use format strings (and the localisation people should position your %s, %d, etc. for you so that they end up in the correct place). All of which works fine with surrogate pairs (like everything else you will need to do, because, as I have said before, just about the only time you need to worry about surrogate pairs is when you come to render a string. I have said that, haven't I?).
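A minimal sketch of what that might look like (the function names and the second "language" are made up, and the %1$s/%2$d positional specifiers are a POSIX printf extension rather than standard C++; boost::format or ICU's MessageFormat give you the same reordering portably):

#include <cstdio>
#include <string>

// Returns the translated format string for the given language. The
// translators decide where the placeholders go, so word order is
// entirely up to them.
const char* GetWinFormat(const std::string& lang)
{
    if (lang == "en")
        return "Player %2$d (%1$s) wins the game";
    // Hypothetical second language that prefers a different word order.
    return "The winner of the game is Player %2$d (%1$s)";
}

std::string FormatWinMessage(const std::string& lang,
                             const std::string& playerName, int playerNum)
{
    char buffer[256];
    std::snprintf(buffer, sizeof(buffer),
                  GetWinFormat(lang), playerName.c_str(), playerNum);
    return buffer;
}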

The only other time I can think of (at the moment :-) ) that surrogate pairs would be important is when you are figuring out how far along a string in memory the caret on screen is: for every surrogate pair you would need to bump the string iterator two places rather than one, while only moving the caret on screen one place. That is really to do with rendering, so it should be covered by my previous statement :-).
Quote:Original post by agh-ha-ha
When using std::wstring the element size is 16 bits (on all platforms I can think of, although you could define a wasteful string type that uses 32-bit elements using the std::basic_string template).

Repeat after me: wchar_t is not necessarily 16 bits. wchar_t is not necessarily 16 bits. wchar_t is not necessarily 16 bits. Can you think of Linux as a platform? Linux uses a 32-bit wchar_t. Unless you're using a time-tunneling modem and are actually communicating from back in 2000. Quality I18N-aware software libraries like the Xerces XML library are configurable for use with 16-bit or 32-bit wide character support. You cannot assume 16 bits for wchar_t if you want your software to be portable. Many embedded systems can't even address anything smaller than 32 bits at a time, meaning that every single data type is at least 32 bits in size.
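If some piece of code genuinely does assume a 16-bit wchar_t (say, code written directly against the Windows WCHAR APIs), a compile-time check makes that assumption explicit instead of silently breaking on a platform with a 32-bit wchar_t. A sketch using the old negative-array-size trick (static_assert does the same job on a C++11 compiler):

#include <climits>

// Fails to compile on any platform where wchar_t is not exactly 16 bits
// wide (e.g. typical Linux/glibc builds, where it is 32 bits).
typedef int wchar_t_is_16_bits[(sizeof(wchar_t) * CHAR_BIT == 16) ? 1 : -1];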

Quote:
It is important to take surrogate pairs into account when writing your string handling code. This is definitely a time when a little work now will save a lot of work at 1am on a day when the build was due 6 hours ago for code complete, and if you don't get it done, no one gets paid (this situation is probably not going to happen, but it could :-) ).

If you were paying attention to what I was saying, you'd find that I said that many C++ implementations don't use or understand surrogate pairs. That's a separate issue from whether programmers should handle surrogate pairs. And the easiest way to do that is to use a third party library like ICU. I may have mentioned that already.

Quote:
Windows will render codepoints encoded as surrogate pairs, and anyone creating their own renderer will need to combine them if they come across them in order to find the correct glyph(s) to render.

Recent Windows versions will render code points encoded with surrogate pairs. Early Windows WCHAR support didn't grok surrogate pairs, which is understandable since the original versions of Unicode didn't use anything above the 16-bit mark. Surrogate pair support only emerged with Windows 2000, after China mandated that all computers sold in the country support GB18030. And even then, support required a special installation; it was only with XP that it became natively supported.

Quote:
Also, rendering is about the only time you would need to worry about them, since manipulating an internationalised string in any other way should be left to the translation team. This is because there are many things that can go wrong when messing with them.

Maybe this is true for your applications. But other people have applications that deal with Unicode strings involving things like collation, which is a remarkably common operation. Not to mention normalization before collation takes place. Or you know, simple things like validating input strings.
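For the sake of illustration, here is a rough sketch of locale-aware comparison with ICU, normalizing to NFC first (it assumes a reasonably recent ICU that provides icu::Normalizer2, the German locale is just an example, and error checking on 'status' is omitted):

#include <unicode/coll.h>        // icu::Collator
#include <unicode/normalizer2.h> // icu::Normalizer2
#include <unicode/unistr.h>      // icu::UnicodeString

bool LessThanInGerman(const icu::UnicodeString& a, const icu::UnicodeString& b)
{
    UErrorCode status = U_ZERO_ERROR;

    // Normalize both strings to NFC so that precomposed and decomposed
    // forms of the same character compare as equal.
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString na = nfc->normalize(a, status);
    icu::UnicodeString nb = nfc->normalize(b, status);

    // Locale-aware comparison; the result can differ from a plain code
    // point comparison for characters like umlauts.
    icu::Collator* coll =
        icu::Collator::createInstance(icu::Locale("de", "DE"), status);
    UCollationResult result = coll->compare(na, nb, status);
    delete coll;
    return result == UCOL_LESS;
}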

Quote:Original post by SiCrane
Repeat after me: wchar_t is not necessarily 16 bits. wchar_t is not necessarily 16 bits. wchar_t is not necessarily 16 bits. Can you think of Linux as a platform? Linux uses a 32-bit wchar_t. Unless you're using a time-tunneling modem and are actually communicating from back in 2000. Quality I18N-aware software libraries like the Xerces XML library are configurable for use with 16-bit or 32-bit wide character support. You cannot assume 16 bits for wchar_t if you want your software to be portable. Many embedded systems can't even address anything smaller than 32 bits at a time, meaning that every single data type is at least 32 bits in size.


C++ doesn't specify the exact bit size of most of its types, but how can you ever properly write a string that is exactly a UTF-8, a UTF-16 or a UTF-32 string, if there is no single type in C++ that is guaranteed to be, for example, 16 bits?

You just need a type that is at least N bits long. You can represent UTF-8 as a long list of 64-bit integers if you like; you'll just have a lot of zeroed bits in there. UTF-8 doesn't really mean "you shall represent such a string with a series of 8-bit types". It means "you only need an 8-bit type to represent such a string".

The same goes for any other encoding: 7-bit ASCII can be stored just fine in a series of 256-byte lumps if you don't mind wasting just over 255 bytes on each character. The fact that you get type mismatch errors is just a language interface issue.

Remember, they're just encodings, not data types. A nice way to think about it is that Arabic numerals and Roman numerals are just two different encodings of integer numbers. You can index into an Arabic numeral 'string' to get any particular power of 10 that you want, but you can't do that with a Roman numeral 'string'.
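To make that concrete, here is a bare-bones sketch of decoding UTF-8 bytes into 32-bit code points (no validation of malformed input; the point is only that the storage type, here uint32_t, is independent of the encoding itself):

#include <cstdint>  // or boost/cstdint.hpp on older compilers
#include <string>
#include <vector>

std::vector<uint32_t> DecodeUtf8(const std::string& bytes)
{
    std::vector<uint32_t> codePoints;
    for (std::size_t i = 0; i < bytes.size(); )
    {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        uint32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }  // 0xxxxxxx: ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
        else               { cp = b & 0x07; len = 4; }  // 11110xxx
        // Fold in the 10xxxxxx continuation bytes, 6 bits at a time.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3Fu);
        codePoints.push_back(cp);
        i += len;
    }
    return codePoints;
}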
Quote:Original post by Lode
C++ doesn't specify the exact bit size of most of its types, but how can you ever properly write a string that is exactly a UTF-8, a UTF-16 or a UTF-32 string, if there is no single type in C++ that is guaranteed to be, for example, 16 bits?


Quote:Original post by SiCrane
In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third party library like ICU.


In particular, ICU defines the UChar type, which typedefs the proper data type for the UTF-16 strings that ICU uses. Also, if you decide to forgo a library like ICU, boost/cstdint.hpp gives you access to typedefs like uint_least16_t, an unsigned integral type with at least 16 bits.
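For example (just a sketch; the typedef names are made up), a portable UTF-16 code unit can be spelled like this whether or not wchar_t happens to be 16 bits on the platform:

#include <boost/cstdint.hpp>  // boost::uint_least16_t
#include <vector>

// Guaranteed to be at least 16 bits wide on every platform, independent
// of the size of wchar_t.
typedef boost::uint_least16_t utf16_unit;
typedef std::vector<utf16_unit> utf16_string;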

This topic is closed to new replies.
