Unicode storage

Started by
2 comments, last by Ryan_001 11 years ago

Hi,

Unicode can be stored using UTF8, UTF16 or UTF32.

Unicode can also be stored using wchar_t, which is UTF16 or UTF32 depending on the platform.

The advantage of UTF32 is that one character is one code point, but it needs more memory.

Is UTF32 still a problem nowadays for cross-platform code?

Thanks


A quick and dirty explanation:

UTF-8 is variable width, but is not endian-specific. Great for storage and transmission.

UTF-16 is variable width, and is endian-specific. Its limitations are also the reason why the Unicode standard restricts code points to 0x10FFFF and below. Avoid like the plague.

UTF-32 is fixed width, and is endian-specific. It is faster to iterate through an array of code points in both directions, but requires more space.
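
To make the width differences concrete, here is a minimal sketch of encoding a single code point as UTF-8 and as UTF-16. The function names encode_utf8 and encode_utf16 are made up for illustration and are not from any library. It also shows where the 0x10FFFF ceiling comes from: a surrogate pair carries 20 bits on top of 0x10000.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Illustrative helper: returns an empty string for surrogates
// and for values above 0x10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16: a single code unit for the BMP,
// or a high/low surrogate pair for everything above 0xFFFF.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return out;
    if (cp <= 0xFFFF) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // at most 20 bits remain, 10 per surrogate
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

With this sketch, U+0041 takes one byte in UTF-8 and one unit in UTF-16, while U+1F600 takes four bytes in UTF-8 and two units in UTF-16. In UTF-32 both take exactly one 32-bit unit.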


If you need more in-depth information, a dedicated guide would be best.

Also, if you are doing your own text handling, avoid wchar_t unless you are dealing with something very close to the system API. Not only does one code point not necessarily correspond to one character, but one wchar_t need not correspond to a whole code point, as with UTF-16. To make matters worse, wchar_t and wide-character strings aren't required to use UTF-16 or UTF-32 at all; the type only has to be able to hold every character the system supports. Symbian, for example, uses UCS-2 strings.
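
A small sketch of the "one unit is not one code point" problem, using the fixed-width char16_t type from C++11 rather than wchar_t. The names wchar_bits and count_code_points are invented for this example.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t on the current platform. It is commonly 16 bits on
// Windows and 32 bits on Linux and macOS, which is why code that assumes
// either one is not portable.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// Count code points in UTF-16 text. A high surrogate followed by a low
// surrogate is one code point; any other unit (including an unpaired
// surrogate) is counted on its own.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size() &&
                              s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) ++i;  // skip the low half of the pair
        ++n;
    }
    return n;
}
```

For the string u"a\U0001F600", size() reports 3 code units while count_code_points reports 2 code points. Even that is not a character count, since combining sequences span several code points.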

I use UTF-8 stored as a String<unsigned char> for all localizable strings and it works pretty well, given that I have a StringIterator class specialized for unsigned char which handles all of the special cases and outputs a series of full-width UTF-32 characters. Performance is never going to really be a problem unless you're doing tons of string manipulations or need random access for characters, in which case I'd suggest UTF-32.

And if you're using anything other than UTF-8, you'll have to handle the byte order mark (BOM, code point U+FEFF) to read generic text.
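
As a sketch of what handling those markers can look like, the function below inspects the first bytes of a buffer for an encoded U+FEFF. The Bom enum and detect_bom are names made up for this example, and real files often have no marker at all, so treat the result as a hint only.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Look for an encoded U+FEFF at the start of a raw byte buffer.
Bom detect_bom(const std::string& bytes) {
    const std::size_t n = bytes.size();
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    // UTF-32 must be tested before UTF-16: FF FE 00 00 also begins with
    // the UTF-16LE mark FF FE. (It could in theory be UTF-16LE text that
    // starts with U+0000; this sketch assumes that does not happen.)
    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00)
        return Bom::Utf32LE;
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF)
        return Bom::Utf32BE;
    if (n >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;  // optional signature, carries no byte order
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

UTF-8 files sometimes start with EF BB BF as well. It says nothing about byte order, but a reader usually wants to skip it.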


Good idea, I think that's similar to what they used in the D programming language. I wish Boost or the standard library had UTF-8 and UTF-16 iterator adaptors. When I get some time perhaps I shall build my own.

