Unicode storage

Started by Alundra April 06, 2013 02:01 AM

2 comments, last by Ryan_001 11 years ago

Alundra

2,325

Author

April 06, 2013 02:01 AM

Hi,

Unicode can be stored as UTF-8, UTF-16 or UTF-32.

It can also be stored in wchar_t, which is UTF-16 or UTF-32 depending on the platform.

The advantage of UTF-32 is that each code point fits in a single element, but it needs more memory.

Is UTF-32 still a problem nowadays for cross-platform code?

Thanks

Ectara

3,097

April 06, 2013 06:03 AM

A quick and dirty explanation:

UTF-8 is variable width, but is not endian-specific. Great for storage and transmission.

UTF-16 is variable width, and is endian-specific. Its limitations are also the reason why the Unicode standard restricts code points to no more than 0x10FFFF. Avoid like the plague.

UTF-32 is fixed width, and is endian-specific. It is faster to iterate through an array of code points in both directions, but requires more space.
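To make the width difference concrete, here is a minimal sketch that encodes one code point as UTF-8 and as UTF-16 (in UTF-32 it is simply one 32-bit unit). The helper names is_scalar_value, append_utf8 and append_utf16 are made up for illustration and are not from any library. The surrogate pair carries only 20 bits on top of 0x10000, which is where the 0x10FFFF ceiling comes from.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: true for values that can be encoded
// (at most 0x10FFFF and not in the surrogate range).
bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: append the UTF-8 bytes (1 to 4) for one code point.
bool append_utf8(std::string& out, std::uint32_t cp) {
    if (!is_scalar_value(cp)) return false;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Hypothetical helper: append the UTF-16 code units (1 or 2) for one code point.
bool append_utf16(std::u16string& out, std::uint32_t cp) {
    if (!is_scalar_value(cp)) return false;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 bits left to store
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return true;
}
```

The in-memory order of the bytes inside each char16_t (or char32_t) depends on the machine, which is why UTF-16 and UTF-32 are endian-specific once written to a file, while the UTF-8 byte sequence is the same everywhere.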

If you need more in-depth information, a dedicated guide would be best.

Also, if you are doing your own text handling, avoid wchar_t unless you are dealing with something very close to the system API. Not only does one code point not necessarily correspond to one character, but one wchar_t need not correspond to a whole code point, such as with UTF-16. To make matters worse, wchar_t and wide-char strings aren't required to use UTF-16 or UTF-32. wchar_t only has to be a unit that can hold all characters used by the system. Symbian, for example, uses UCS-2 strings.
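A quick way to see the wchar_t problem is to count how many elements a single code point outside the Basic Multilingual Plane occupies in a wide string. This is only a sketch; wide_units_for_emoji is a made-up name, and the result depends on how the compiler encodes wide literals.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of wchar_t elements the compiler used for U+1F600.
// Typically 2 where wchar_t is 16 bits (UTF-16, e.g. Windows) and
// 1 where wchar_t is 32 bits (UTF-32, e.g. most Linux toolchains).
std::size_t wide_units_for_emoji() {
    return std::wstring(L"\U0001F600").size();
}
```

Code that assumes size() is the number of code points will therefore behave differently from one platform to the next.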

Aressera

3,141

April 06, 2013 06:15 AM

I use UTF-8 stored as a String<unsigned char> for all localizable strings and it works pretty well, given that I have a StringIterator class specialized for unsigned char which handles all of the special cases and outputs a series of full-width UTF-32 characters. Performance is never going to really be a problem unless you're doing tons of string manipulations or need random access for characters, in which case I'd suggest UTF-32.
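For anyone wondering what that kind of decoding looks like, here is a minimal sketch. It is not the StringIterator class described above; decode_utf8 is a hypothetical function that converts a whole UTF-8 string to UTF-32 in one go and substitutes U+FFFD for malformed input. It assumes you don't need to reject overlong encodings or encoded surrogates, which a stricter decoder should also do.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            out += U'\uFFFD';  // stray continuation byte or invalid lead byte
            continue;
        }

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        out += ok ? static_cast<char32_t>(cp) : U'\uFFFD';
    }
    return out;
}
```

An iterator would do the same work one code point at a time instead of building a second string, which keeps the compact UTF-8 storage and avoids the extra allocation.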

And if you're using anything other than UTF-8, you'll have to handle byte order marks (the endianness marker code point, U+FEFF) to read generic text.
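As a rough sketch of that handling, the function below inspects the first bytes of a buffer for a byte order mark. The Encoding enum and detect_bom are hypothetical names, and the check is only a heuristic: text without a BOM comes back as Unknown, and FF FE 00 00 could in principle also be UTF-16LE text that starts with a NUL character.

```cpp
#include <cstddef>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: guess the encoding from a leading byte order mark (U+FEFF).
Encoding detect_bom(const unsigned char* data, std::size_t size) {
    // Check UTF-32 first: its little-endian BOM begins with the UTF-16LE BOM.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) return Encoding::Utf32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) return Encoding::Utf32BE;
    // UTF-8 has no byte order, but some tools still write EF BB BF as a signature.
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB &&
        data[2] == 0xBF) return Encoding::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return Encoding::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return Encoding::Utf16BE;
    return Encoding::Unknown;
}
```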

Ryan_001

3,477

April 06, 2013 04:15 PM

Good idea, I think that's similar to what they used in the D programming language. I wish Boost or the standard library had UTF-8 and UTF-16 iterator adaptors. When I get some time perhaps I shall build my own.
