Lode

UTF-8 strings


In UTF-8 and UTF-16, the character size is variable. Do wstrings in C++ use such a system, or do all characters have the same size in C++? Do there exist string types in C++ that could use UTF-8, so that each character of the string could have a different size, but that using [x] on the string to get the x'th character would still return the x'th symbol in the string, not the x'th byte?

The way I understand it, UTF-8 and UTF-16 are both variable width, always. In UTF-8, a character may be represented by 1 to 4 bytes. In UTF-16, a character may be represented by one or two 16-bit code units. That means you can never index into a UTF-8/16 encoded string and be entirely sure of getting the character you want.
Correct me if I'm wrong.
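
For illustration, here's a rough sketch (my own helper, not from any library) of how the lead byte of a UTF-8 sequence tells you how many bytes that code point occupies:

#include <cstddef>

// Returns the number of bytes in the UTF-8 sequence that starts with 'lead',
// or 0 if 'lead' is a continuation byte or otherwise invalid as a lead byte.
std::size_t Utf8SequenceLength( unsigned char lead )
{
    if ( lead < 0x80 )           return 1; // 0xxxxxxx: plain ASCII
    if ( (lead & 0xE0) == 0xC0 ) return 2; // 110xxxxx
    if ( (lead & 0xF0) == 0xE0 ) return 3; // 1110xxxx
    if ( (lead & 0xF8) == 0xF0 ) return 4; // 11110xxx
    return 0;                              // 10xxxxxx (continuation) or invalid
}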

[Edited by - Red Ant on October 11, 2007 6:33:06 AM]

There's no type in the C++ standard library that can deal with variable width encodings.

Quote:

using [x] on the string to get the x'th character would still return the x'th symbol in the string, not the x'th byte?


In the real world, that isn't a useful thing to do. Finding the x'th code point would be an O(n) operation, making it pretty slow. Additionally, because a single code point doesn't always map to a single glyph (combining characters, for example), finding the x'th glyph remains an O(n) operation even when using fixed-width encodings (e.g. UTF-32).
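
To make the cost concrete, here's a small sketch (my own helper, not anything from the standard library) that finds the byte offset of the nth code point in a UTF-8 encoded std::string; note that it has to walk the string from the start:

#include <string>

// Returns the byte offset of the nth code point in a UTF-8 encoded string,
// or std::string::npos if the string contains fewer than n+1 code points.
// Linear in the length of the string, which is the point being made above.
std::string::size_type Utf8OffsetOfCodePoint( const std::string& s, std::string::size_type n )
{
    std::string::size_type count = 0;
    for ( std::string::size_type i = 0; i < s.size(); ++i )
    {
        // Continuation bytes look like 10xxxxxx; anything else starts a new code point.
        if ( (static_cast<unsigned char>( s[i] ) & 0xC0) != 0x80 )
        {
            if ( count == n )
                return i;
            ++count;
        }
    }
    return std::string::npos;
}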

Okay, I thought I had UNICODE figured out, but after thinking a bit more about this it has become clear to me that I really don't. =( I've just spent a couple of hours reading various docs at MSDN and other sites on the net, and the more I read the more confused I get (I think this is in no small part due to many authors who otherwise seem very knowledgeable on the subject using incorrect or misleading terminology).

Can someone with good knowledge of UNICODE please confirm or correct the following statement?

When UNICODE is defined and using std::basic_string< TCHAR >, the nth __character__ in any given string is always the same as the nth TCHAR, i.e. I can always index into the string and get exactly the character I want. Thus a TCHAR string is really a fixed-width representation.

Quote:
Original post by Red Ant
When UNICODE is defined and using std::basic_string< TCHAR >, the nth __character__ in any given string is always the same as the nth TCHAR, i.e. I can always index into the string and get exactly the character I want. Thus a TCHAR string is really a fixed-width representation.


The term "character" is not well-defined when dealing with Unicode. If you index a basic_string<TCHAR> with n, you get the nth TCHAR. Since TCHAR is 16 bits when UNICODE is defined (assuming we're talking about a Windows application build here), that TCHAR can be one half of a UTF-16 surrogate pair. A surrogate pair can also occur before the nth TCHAR, so the nth TCHAR may not refer to a complete code point, and even if it does, it may not be the nth code point.
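
To illustrate (a minimal sketch of my own, assuming a Windows build where TCHAR is a 16-bit wchar_t): indexing gives you a code unit, which you still have to test before treating it as a whole character:

// Returns true if the UTF-16 code unit is either half of a surrogate pair,
// i.e. it does not represent a complete code point on its own.
bool IsSurrogate( wchar_t unit )
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

// Usage: str[n] is the nth TCHAR; it is only a whole code point by itself
// if IsSurrogate( str[n] ) returns false.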

Thanks. What you said was pretty much my initial assumption, but I began to doubt it after having read this article, particularly this excerpt:

Quote:

"Wide character" or "wide character string" refers to text where each character is the same size (usually a 32-bit integer) and simply represents a Unicode character value ("code point"). This format is a known common currency that allows you to get at character values if you want to. The wprintf() family is able to work with wide character format strings, and the "%ls" format specifier for normal printf() will print wide character strings (converting them to the correct locale-specific multibyte encoding on the way out).

"where each character is the same size" is true for UTF-32. But Ive never come a cross a program that actually uses UTF-32.

I would recommend reading the Wikipedia entries for UTF-8, UTF-16 and Unicode. They explain very well the difference and pro/cons of each encoding.

As a side note to this topic does any one know of a unicode font that supports the whole range. Ive not been able to find one. Most only support UTF-16 without surrogates, ie only 2 bytes, not 4 byte characters.

Wide strings, I believe, contain 16-bit elements. Usually each element represents a "codepoint"; however, there is the ability to encode some values above 0xFFFF into two 16-bit string elements, called surrogate pairs. It is easy to figure out whether a given string element is one of these surrogates because it falls in one of two ranges (0xD800 to 0xDBFF for the "high surrogate" range and 0xDC00 to 0xDFFF for the "low surrogate" range).
The high surrogate always comes before the low in a UTF-16 string (if this is not the case the string is invalid).
Surrogate pairs can be split and combined using the following functions:

Note: in the code below, UnicodePoint is assumed to be a 32-bit unsigned integer and wchar_t is assumed to be a 16-bit type.


#include <cassert>

typedef unsigned int UnicodePoint; // assumed 32-bit; wchar_t is assumed to be 16-bit

// Returns the low (trailing) surrogate for a code point above 0xFFFF.
wchar_t UTF16_GetLowSurrogate( const UnicodePoint pCodepoint )
{
    assert( pCodepoint >= 0x10000 );
    UnicodePoint temp = pCodepoint - 0x10000;
    return static_cast<wchar_t>( (temp & 0x3FF) | 0xDC00 );
}

// Returns the high (leading) surrogate for a code point above 0xFFFF.
wchar_t UTF16_GetHighSurrogate( const UnicodePoint pCodepoint )
{
    assert( pCodepoint >= 0x10000 );
    UnicodePoint temp = pCodepoint - 0x10000;
    return static_cast<wchar_t>( ((temp >> 10) & 0x3FF) | 0xD800 );
}

// Recombines a surrogate pair into the original code point. Note the explicit
// parentheses: without them, + binds tighter than |.
UnicodePoint UTF16_CombineSurrogates( const wchar_t pLow, const wchar_t pHigh )
{
    return ( ((UnicodePoint( pHigh ) & 0x3FF) << 10) | (UnicodePoint( pLow ) & 0x3FF) ) + 0x10000;
}
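
For example (my own quick usage sketch of the functions above), round-tripping a code point outside the basic multilingual plane:

// U+1D11E (MUSICAL SYMBOL G CLEF) is above 0xFFFF, so it needs a surrogate pair.
UnicodePoint clef = 0x1D11E;
wchar_t high = UTF16_GetHighSurrogate( clef );            // 0xD834, stored first
wchar_t low  = UTF16_GetLowSurrogate( clef );             // 0xDD1E, stored second
UnicodePoint back = UTF16_CombineSurrogates( low, high ); // 0x1D11E again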





I wouldn't recommend using UTF-8 at runtime; however, most XML files are encoded as UTF-8, so it's a good idea to convert data from XML files into UTF-16 as you load them.

All the UTF-16 code is based on information from Wikipedia's UTF-16 page; if you want to understand why my code works, go there and trawl through it.

[Edited by - agh-ha-ha on October 11, 2007 10:42:10 AM]

Quote:
Original post by Red Ant
Thanks. What you said was pretty much my initial assumption, but I began to doubt it after having read this article, particularly this excerpt:

Quote:

"Wide character" or "wide character string" refers to text where each character is the same size (usually a 32-bit integer) and simply represents a Unicode character value ("code point"). This format is a known common currency that allows you to get at character values if you want to. The wprintf() family is able to work with wide character format strings, and the "%ls" format specifier for normal printf() will print wide character strings (converting them to the correct locale-specific multibyte encoding on the way out).


Okay, let's clarify a bit. There are multibyte strings, in which each glyph (or code point) is represented by one or more bytes. This category includes ASCII, ISO Latin-N, JIS, and Unicode UTF-8. There are also wide-character strings, in which each code point is represented by a single fixed-width value. Examples of this category include GB-2312 and Unicode UCS-4.

Then there are bizarre hybrid approaches in which you have "multibyte"-style wide-character strings, in which each code point is represented by one or more fixed-width values with a width greater than 8 bits, for example Microsoft's UNICODE variant of the Unicode UTF-16 character set.

Usually, "wide character" refers to text where each character is the same size. This is not true of the Unicode UTF-16 encoding, which was adopted by Microsoft as their UNICODE encoding. The 32-bit integer "wide character" referred to in the quoted text would be the standard Unicode UCS-4 set, which is not supported by Microsoft.

Does that help any or have I just muddied the waters?

Quote:
Original post by agh-ha-ha
Wide strings, I believe, contain 16-bit elements.

Wide strings in C++ can contain pretty much anything. They don't even have to be vaguely related to Unicode.

Quote:
Usually each element represents a "codepoint"; however, there is the ability to encode some values above 0xFFFF into two 16-bit string elements, called surrogate pairs.

Actually, most C++ wide character implementations are restricted to the basic multilingual plane subset of the Unicode implementation. They don't use or understand surrogate pairs at all.

In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third party library like ICU.
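
As a rough sketch of what that buys you (written from memory, so check the ICU documentation for the exact headers and spellings on your version), ICU's UnicodeString stores UTF-16 internally but gives you code-point-aware accessors:

#include <unicode/unistr.h> // icu::UnicodeString

int32_t CountCodePoints( const char* narrowText )
{
    // Constructed from the default codepage; stores UTF-16 code units internally.
    icu::UnicodeString s( narrowText );

    // length() would count 16-bit code units; countChar32() counts code points,
    // so a surrogate pair is counted once. char32At(i) reads a whole code point.
    return s.countChar32();
}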

Quote:
Original post by Nitage
There's no type in the C++ standard library that can deal with variable width encodings.

Streams are built so that it is possible to use variable-width encodings. C++ does not specify a Unicode encoding, but if you imbue a stream with a locale containing a suitable codecvt facet, the stream implementation is designed to handle variable-width encodings.

This applies to file streams and string streams, and any other stream objects. It does not, however, apply to strings. The intention is that you only use variable width encoding to read and write from/to external sources.
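
As a sketch of that intent (my own example; whether the named locale is available depends on the platform, glibc-based systems usually have it): imbuing a wide file stream with a UTF-8 locale makes the stream decode the file's UTF-8 bytes into wchar_t as it reads:

#include <fstream>
#include <locale>
#include <string>

void ReadUtf8File()
{
    std::wifstream in( "data.txt" );          // file name is made up
    in.imbue( std::locale( "en_US.UTF-8" ) ); // throws if the locale name is unknown
    std::wstring line;
    while ( std::getline( in, line ) )
    {
        // 'line' now holds wide characters decoded from the UTF-8 bytes on disk.
    }
}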

Quote:
Original post by SiCrane
Quote:
Original post by agh-ha-ha
Wide strings, I believe, contain 16-bit elements.

Wide strings in C++ can contain pretty much anything. They don't even have to be vaguely related to Unicode.

Quote:
Usually each element represents a "codepoint"; however, there is the ability to encode some values above 0xFFFF into two 16-bit string elements, called surrogate pairs.

Actually, most C++ wide character implementations are restricted to the basic multilingual plane subset of the Unicode implementation. They don't use or understand surrogate pairs at all.

In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third party library like ICU.


When using std::wstring the element size is 16-bit (on all platforms I can think of, although you could define a wasteful string type that uses 32-bit elements using the std::basic_string template).
It is important to take surrogate pairs into account when writing your string handling code. This is definitely a case where a little work now will save a lot of work at 1am on the day the build was due 6 hours ago for code complete, and if you don't get it done, no one gets paid (this situation is probably not going to happen, but it could :-) ).
Windows will render code points encoded as surrogate pairs, and anyone creating their own renderer will need to combine them when they come across them in order to find the correct glyph(s) to render.

Also, rendering is about the only time you would need to worry about them, since manipulating an internationalised string in any other way should be left to the translation team. This is because there are many things that can go wrong when messing with them.

Using case conversion as an example:

1) Is there such a concept as case in the language you are using?
2) Does the concept of case work the same way as it does in English?
3) Etc....

It's just not worth messing with a localised string; render it as you find it. To render it properly you will need to combine surrogate pairs into codepoints when you come across them.

Another issue with localised strings is creating messages for the player that contain information generated at run time.

For example:

"Player 3 (Bob) wins the game"

could be constructed using something like


message << playerString << " " << playerNum << " (" << playerName << ") " << winMessage;



In another language, that word order could be wrong, because a reader might expect to see something more like

"The winner of the game is Player 3 (Bob)"

For this reason, you should use format strings (and the localisation people should position your %s, %d, etc. for you so that they end up in the correct place). All of this works fine with surrogate pairs (like just about everything else you will need to do, because, as I have said before, just about the only time you need to worry about surrogate pairs is when you come to render a string. I have said that, haven't I?).
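
A quick sketch of that idea (mine; note that positional parameters like %1$d are a POSIX printf extension rather than standard C++, and the strings below are made up): the translator reorders the placeholders inside the format string, and the calling code never changes:

#include <cstdio>

void PrintWinMessage( const char* format, int playerNum, const char* playerName )
{
    // English format string:    "Player %1$d (%2$s) wins the game\n"
    // Hypothetical translation: "%2$s, Player %1$d, wins the game\n"
    // Either string works with the same call below.
    std::printf( format, playerNum, playerName );
}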

The only other time I can think of (at the moment :-) ) that surrogate pairs would be important is when you are figuring out how far along a string in memory the on-screen caret is: for every surrogate pair you need to bump the string iterator two places rather than one, whilst only moving the caret on screen one place. This is really to do with rendering, so it should be covered by my previous statement :-).
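
Here's a small sketch of that caret bookkeeping (my own illustration, assuming the 16-bit wchar_t from the code earlier in the thread): moving the caret one position may mean moving one or two code units through the string:

#include <string>

// Advances 'index' past one code point: two code units if str[index] starts a
// surrogate pair, otherwise one. The caret on screen moves one position either way.
void AdvanceOneCodePoint( const std::wstring& str, std::wstring::size_type& index )
{
    wchar_t unit = str[index];
    bool isHighSurrogate = unit >= 0xD800 && unit <= 0xDBFF;
    if ( isHighSurrogate && index + 1 < str.size() )
        index += 2;
    else
        index += 1;
}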

Quote:
Original post by agh-ha-ha
When using std::wstring the element size is 16-bit (on all platforms I can think of, although you could define a wasteful string type that uses 32-bit elements using the std::basic_string template).

Repeat after me: wchar_t is not necessarily 16-bits. wchar_t is not necessarily 16-bits. wchar_t is not necessarily 16-bits. Can you think of Linux as a platform? Linux has 32-bit wide character versions. Unless you're using a time-tunneling modem and are actually communicating from back in 2000. Quality I18N-aware software libraries like the Xerces XML library are configurable for use with 16 bit or 32 bit wide character support. You cannot assume 16 bits for wchar_t if you want your software to be portable. Many embedded systems can't even address anything less than 32-bits at a time, meaning that every single data type is at least 32-bits in size.
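
If you want to see what your toolchain actually gives you, a one-liner like this (my own) answers the question at run time; just don't bake the answer into your code:

#include <climits>
#include <cstdio>

int main()
{
    std::printf( "wchar_t is %u bits on this platform\n",
                 static_cast<unsigned>( sizeof( wchar_t ) * CHAR_BIT ) );
    return 0;
}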

Quote:

It is important to take surrogate pairs into account when writing your string handling code. This is definitely a case where a little work now will save a lot of work at 1am on the day the build was due 6 hours ago for code complete, and if you don't get it done, no one gets paid (this situation is probably not going to happen, but it could :-) ).

If you were paying attention to what I was saying, you'd find that I said that many C++ implementations don't use or understand surrogate pairs. This has nothing to do with the fact that programmers should handle surrogate pairs. And the easiest way to do that is to use a third party library like ICU. I may have mentioned that already.

Quote:

Windows will render code points encoded as surrogate pairs, and anyone creating their own renderer will need to combine them when they come across them in order to find the correct glyph(s) to render.

Recent Windows versions will render code points encoded with surrogate pairs. Early Windows WCHAR support didn't grok surrogate pairs, which is understandable, since the original versions of Unicode didn't use anything above the 16-bit mark. Surrogate pair support only emerged with Windows 2000, after China mandated that all computers sold in the country support GB18030, and even then support required a special installation. It was only with XP that it became natively supported.

Quote:

Also, rendering is about the only time you would need to worry about them, since manipulating an internationalised string in any other way should be left to the translation team. This is because there are many things that can go wrong when messing with them.

Maybe this is true for your applications. But other people have applications that deal with Unicode strings involving things like collation, which is a remarkably common operation. Not to mention normalization before collation takes place. Or you know, simple things like validating input strings.

Quote:
Original post by SiCrane
Repeat after me: wchar_t is not necessarily 16-bits. wchar_t is not necessarily 16-bits. wchar_t is not necessarily 16-bits. Can you think of Linux as a platform? Linux has 32-bit wide character versions. Unless you're using a time-tunneling modem and are actually communicating from back in 2000. Quality I18N-aware software libraries like the Xerces XML library are configurable for use with 16 bit or 32 bit wide character support. You cannot assume 16 bits for wchar_t if you want your software to be portable. Many embedded systems can't even address anything less than 32-bits at a time, meaning that every single data type is at least 32-bits in size.


C++ doesn't specify the exact bit size of most of its types, so how can you ever properly store a string that is exactly a UTF-8, UTF-16 or UTF-32 string if there is no single type in C++ that is guaranteed to be, for example, 16 bits?

You just need a type that is at least N bits long. You can represent UTF-8 as a long list of 64-bit integers if you like; you'll just have a lot of zeroed bits in there. UTF-8 doesn't really mean "you shall represent such a string with a series of 8-bit types". It means "you only need an 8-bit type to represent such a string".

The same goes for any other encoding: 7-bit ASCII can be stored just fine in a series of 256-byte lumps if you don't mind wasting just over 255 bytes on each character. The fact that you get type mismatch errors is just a language interface issue.

Remember, they're just encodings, not data types. A nice way to think about it is that Arabic numerals and Roman numerals are just 2 different encodings of integer numbers. You can index into an Arabic numeral 'string' to get any particular power of 10 that you want, but you can't do it with a Roman numeral 'string'.

Quote:
Original post by Lode
C++ doesn't specify the exact bit size of most of its types, so how can you ever properly store a string that is exactly a UTF-8, UTF-16 or UTF-32 string if there is no single type in C++ that is guaranteed to be, for example, 16 bits?


Quote:
Original post by SiCrane
In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third party library like ICU.


In particular, ICU defines the UChar type, which in turn typedefs the proper data type for use with the UTF-16 strings that ICU uses. Also, if you decide to forgo a library like ICU, boost/cstdint.hpp gives you access to typedefs like boost::uint_least16_t, an unsigned integral type with at least 16 bits.
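
For instance (a sketch; the typedef names are mine), you can build your own UTF-16 code unit type on top of boost/cstdint.hpp:

#include <boost/cstdint.hpp>
#include <vector>

// Guaranteed to be at least 16 bits wide; may be wider on some platforms.
typedef boost::uint_least16_t utf16_unit;

// A buffer of UTF-16 code units. std::vector is used here because
// std::basic_string would need a char_traits specialization for this type.
typedef std::vector<utf16_unit> utf16_buffer;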
