UTF-8 strings

Started by
14 comments, last by SiCrane 16 years, 6 months ago
In UTF-8 and UTF-16, the character size is variable. Do wstrings in C++ use such a system, or do all characters have the same size in C++? Do there exist string types in C++ that could use UTF-8, so that each character of the string could have a different size, but that using [x] on the string to get the x'th character would still return the x'th symbol in the string, not the x'th byte?
The way I understand it, UTF-8 and UTF-16 are both variable width, always. In UTF-8, a character may be represented by 1 to 4 bytes. In UTF-16, a character may be represented by one or two 16-bit words. That means you can never index into a UTF-8 or UTF-16 encoded string and be entirely sure to get the character you want.
Correct me if I'm wrong.

[Edited by - Red Ant on October 11, 2007 6:33:06 AM]
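To make the variable width concrete, here is a minimal sketch of how the lead byte of a UTF-8 sequence determines its length. The function names are hypothetical (they are not from any library), the code assumes a C++11 compiler and well-formed input, and it classifies bytes by bit pattern only, so it does not reject overlong or out-of-range sequences.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 0 for a continuation byte (10xxxxxx) or an unrecognized lead byte.
// Hypothetical helper: classifies by bit pattern only.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Counts code points by counting every byte that is not a continuation byte.
// Assumes the input is well-formed UTF-8.
inline std::size_t utf8_count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a string holding "a", "é", "€" and U+1D11E, size() reports 10 bytes while the count above reports 4 code points, which is why s[x] on a byte string cannot be the x'th symbol in general.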
There's no type in the C++ standard library that can deal with variable width encodings.

Quote:
using [x] on the string to get the x'th character would still return the x'th symbol in the string, not the x'th byte?


In the real world, that isn't a useful thing to do. Finding the x'th code point will be an O(n) operation, making it pretty slow. Additionally, because a single code point doesn't always map to a single glyph (combining characters, for example), finding the x'th glyph remains an O(n) operation even when using fixed-width encodings (i.e. UTF-32).
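As a rough illustration of the O(n) cost, the sketch below finds the byte offset of the n-th code point in a UTF-8 string by scanning from the start. The name utf8_offset_of is hypothetical, and the code assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Byte offset of the n-th code point (0-based) in a UTF-8 string,
// or std::string::npos if the string has n or fewer code points.
// Linear scan: there is no way to jump straight to the answer.
// Assumes the input is well-formed UTF-8.
inline std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, not a start
        if (seen == n) return i;
        ++seen;
    }
    return std::string::npos;
}
```

Note that this still only finds code points. The string "e" followed by U+0301 (combining acute accent) has two code points but is drawn as one glyph, so even this scan does not answer "where is the x'th visible symbol".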
Okay, I thought I had UNICODE figured out, but after thinking a bit more about this it has become clear to me that I really don't. =( I've just spent a couple of hours reading various docs at MSDN and other sites on the net, and the more I read the more confused I get (I think this is in no small part due to many authors who otherwise seem very knowledgeable on the subject using incorrect or misleading terminology).

Can someone with good knowledge of UNICODE please confirm or correct the following statement?

When UNICODE is defined and using std::basic_string< TCHAR >, the nth __character__ in any given string is always the same as the nth TCHAR, i.e. I can always index into the string and get exactly the character I want. Thus a TCHAR string is really a fixed-width representation.
Quote:Original post by Red Ant
When UNICODE is defined and using std::basic_string< TCHAR >, the nth __character__ in any given string is always the same as the nth TCHAR, i.e. I can always index into the string and get exactly the character I want. Thus a TCHAR string is really a fixed-width representation.


The term "character" is not well-defined when dealing with Unicode. If you index a basic_string<TCHAR> with n, you get the nth TCHAR. A TCHAR is 16 bits when UNICODE is defined (assuming we're talking about a Windows application build here), so the nth TCHAR can be one half of a UTF-16 surrogate pair, in which case it is not a valid code point on its own. A surrogate pair can also occur before the nth TCHAR, so even if the nth TCHAR is a valid code point, it may not be the nth code point.
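Both failure cases can be seen in a small sketch. It uses C++11 char16_t and std::u16string as a stand-in for a 16-bit TCHAR string (an assumption, chosen so the example behaves the same on every platform), and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// True if `u` is the first (high) half of a surrogate pair.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if `u` is the second (low) half of a surrogate pair.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points in a well-formed UTF-16 string by skipping the
// low half of each surrogate pair.
inline std::size_t utf16_count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (!is_low_surrogate(u)) ++n;
    }
    return n;
}
```

With s = u"a\U0001D11Eb", s.size() is 4 but the string holds 3 code points: s[1] and s[2] are the two halves of one surrogate pair, and the third code point, 'b', sits at index 3 rather than index 2.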
Thanks. What you said was pretty much my initial assumption, but I began to doubt it after having read this article, particularly this excerpt:

Quote:
"Wide character" or "wide character string" refers to text where each character is the same size (usually a 32-bit integer) and simply represents a Unicode character value ("code point"). This format is a known common currency that allows you to get at character values if you want to. The wprintf() family is able to work with wide character format strings, and the "%ls" format specifier for normal printf() will print wide character strings (converting them to the correct locale-specific multibyte encoding on the way out).
"where each character is the same size" is true for UTF-32. But I've never come across a program that actually uses UTF-32.

I would recommend reading the Wikipedia entries for UTF-8, UTF-16 and Unicode. They explain the differences and the pros and cons of each encoding very well.

As a side note to this topic, does anyone know of a Unicode font that supports the whole range? I've not been able to find one. Most only support UTF-16 without surrogates, i.e. only 2-byte characters, not 4-byte ones.
Wide strings, I believe, contain 16-bit elements. Usually each element represents a "code point"; however, there is the ability to encode some values above 0xFFFF into two 16-bit string elements, and these are called surrogate pairs. It is easy to figure out if a given string element is one of these surrogates because it falls in one of two ranges (0xD800 to 0xDBFF for the "high surrogate" range and 0xDC00 to 0xDFFF for the "low surrogate" range).
The high surrogate always comes before the low surrogate in a UTF-16 string (if this is not the case, the string is invalid).
Surrogate pairs can be split and combined using the following functions:

Note: UnicodePoint is a 32-bit int and wchar_t is a 16-bit int.

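#include <cassert>
#include <cstdint>

typedef std::uint32_t UnicodePoint;

// Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of (code point - 0x10000).
wchar_t UTF16_GetLowSurrogate( const UnicodePoint pCodepoint )
{
    assert( pCodepoint >= 0x10000 );
    UnicodePoint temp = pCodepoint - 0x10000;
    return static_cast<wchar_t>( (temp & 0x3FF) | 0xDC00 );
}

// High (leading) surrogate: 0xD800 plus the top 10 bits of (code point - 0x10000).
wchar_t UTF16_GetHighSurrogate( const UnicodePoint pCodepoint )
{
    assert( pCodepoint >= 0x10000 );
    UnicodePoint temp = pCodepoint - 0x10000;
    return static_cast<wchar_t>( ((temp >> 10) & 0x3FF) | 0xD800 );
}

// Rebuilds the code point from the two halves of a surrogate pair.
UnicodePoint UTF16_CombineSurrogates( const wchar_t pLow, const wchar_t pHigh )
{
    // Parenthesized explicitly, because '+' binds tighter than '|'.
    return ( (pLow & 0x3FF) | ((pHigh & 0x3FF) << 10) ) + 0x10000;
}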
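As a worked example of the arithmetic, take U+1D11E: subtracting 0x10000 leaves 0xD11E, whose top 10 bits (0x34) give the high surrogate 0xD834 and whose bottom 10 bits (0x11E) give the low surrogate 0xDD1E. The sketch below is a self-contained round trip of the same calculation. It is not the poster's code: the names to_surrogates and from_surrogates are hypothetical, and it uses C++11 char16_t instead of wchar_t so the element width does not depend on the platform.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Splits a supplementary-plane code point into (high, low) surrogates.
inline std::pair<char16_t, char16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20-bit value
    const char16_t high = static_cast<char16_t>(0xD800 | (v >> 10));
    const char16_t low  = static_cast<char16_t>(0xDC00 | (v & 0x3FF));
    return std::make_pair(high, low);
}

// Rebuilds the code point from a (high, low) surrogate pair.
inline std::uint32_t from_surrogates(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    const std::uint32_t hi_bits = static_cast<std::uint32_t>(high) & 0x3FF;
    const std::uint32_t lo_bits = static_cast<std::uint32_t>(low) & 0x3FF;
    return 0x10000u + (hi_bits << 10) + lo_bits;
}
```

Taking the high surrogate as the first argument mirrors the order in which the two units appear in a UTF-16 string.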


I wouldn't recommend using UTF-8 at runtime; however, most XML files are encoded using UTF-8, so it would be a good idea to convert data from XML files into UTF-16 as you load it.
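A load-time conversion of that kind could look roughly like the sketch below. The function name utf8_to_utf16 is hypothetical and the code assumes a C++11 compiler. It rejects bad lead bytes, truncated sequences and bad continuation bytes, but it does not reject overlong forms or UTF-8-encoded surrogates, so treat it as an illustration rather than a validating decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Converts a UTF-8 byte string to UTF-16 code units.
// Throws std::runtime_error on structurally invalid input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one code unit
        } else if (cp <= 0x10FFFF) {
            const std::uint32_t v = cp - 0x10000;      // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 | (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (v & 0x3FF)));
        } else {
            throw std::runtime_error("code point out of range");
        }
    }
    return out;
}
```

In production code a library such as ICU is likely a safer choice than a hand-written decoder.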

All the UTF-16 code is based on information from Wikipedia's UTF-16 page; if you want to understand why my code works, go there and trawl through it.

[Edited by - agh-ha-ha on October 11, 2007 10:42:10 AM]
Quote:Original post by Red Ant
Thanks. What you said was pretty much my initial assumption, but I began to doubt it after having read this article, particularly this excerpt:

Quote:
"Wide character" or "wide character string" refers to text where each character is the same size (usually a 32-bit integer) and simply represents a Unicode character value ("code point"). This format is a known common currency that allows you to get at character values if you want to. The wprintf() family is able to work with wide character format strings, and the "%ls" format specifier for normal printf() will print wide character strings (converting them to the correct locale-specific multibyte encoding on the way out).


Okay, let's clarify a bit. There are multibyte strings, in which each glyph (or code point) is represented by one or more bytes. This category includes ASCII, ISO Latin-N, JIS, and Unicode UTF-8. There are also wide-character strings, in which each code point is represented by a single fixed-width value. Examples of this category include GB-2312 and Unicode UCS-4.

Then there are bizarre hybrid approaches in which you have "multibyte"-style wide-character strings, where each code point is represented by one or more fixed-width values wider than 8 bits, for example Microsoft's UNICODE variant of the Unicode UTF-16 encoding.

Usually, "wide character" refers to text where each character is the same size. This is not true of the Unicode UTF-16 encoding, which was adopted by Microsoft as their UNICODE encoding. The 32-bit integer "wide character" referred to in the quoted text would be the standard Unicode UCS-4 set, which is not supported by Microsoft.
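For comparison, here is a small sketch of the fixed-width case. It assumes a C++11 compiler, where char32_t and std::u32string are available (these postdate much of the discussion here), and both function names are hypothetical. With one 32-bit element per code point, plain indexing does return the n-th code point, though still not necessarily the n-th glyph.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// With 32-bit elements, element n is code point n.
// at() throws std::out_of_range if n is past the end.
inline char32_t nth_code_point(const std::u32string& s, std::size_t n) {
    return s.at(n);
}

// Whether this platform's wchar_t is wide enough to hold any Unicode
// code point in a single element. The answer is implementation-defined.
inline bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

For U"a\U0001D11Eb", size() is 3 and index 1 yields 0x1D11E directly, with no surrogate handling needed.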

Does that help any or have I just muddied the waters?

Stephen M. Webb
Professional Free Software Developer

Quote:Original post by agh-ha-ha
Wide strings, I believe, contain 16-bit elements.

Wide strings in C++ can contain pretty much anything. They don't even have to be vaguely related to Unicode.

Quote:Usually each element represents a "code point"; however, there is the ability to encode some values above 0xFFFF into two 16-bit string elements, and these are called surrogate pairs.

Actually, most C++ wide character implementations are restricted to the Basic Multilingual Plane subset of Unicode. They don't use or understand surrogate pairs at all.

In practice, if you want to use Unicode in a C++ application, especially in anything resembling a portable manner, you should pretty much ignore the standard library facilities for localization and instead use a third-party library like ICU.

This topic is closed to new replies.
