Alundra

Unicode storage


Hi,

Unicode can be stored using UTF-8, UTF-16, or UTF-32.

It can also be stored using wchar_t, which is UTF-16 or UTF-32 depending on the platform.

The advantage of UTF-32 is that one code unit is one code point, but it needs more memory.

Is UTF-32 still a problem nowadays for cross-platform code?

 

Thanks

A quick and dirty explanation:

UTF-8 is variable width, but is not endian-specific. Great for storage and transmission.

UTF-16 is variable width, and is endian-specific. Its limitations are also the reason why the Unicode standard restricts code points to at most 0x10FFFF. Avoid like the plague.

UTF-32 is fixed width, and is endian-specific. It is faster to iterate through an array of code points in both directions, but requires more space.


If you need more in-depth information, a dedicated guide would be best.

 

Also, if you are doing your own text handling, avoid wchar_t unless you are dealing with something very close to the system API. Not only does one code point not necessarily correspond to one character, but one wchar_t need not correspond to a whole code point, as with UTF-16. To make matters worse, wchar_t and wide-char strings aren't required to use UTF-16 or UTF-32 at all; wchar_t only has to be a type that can hold all characters used by the system. Symbian, for example, uses UCS-2 strings.

Edited by Ectara


I use UTF-8 stored as a String<unsigned char> for all localizable strings, and it works pretty well: I have a StringIterator class specialized for unsigned char which handles all of the special cases and outputs a series of full-width UTF-32 characters. Performance is never really going to be a problem unless you're doing tons of string manipulations or need random access to characters, in which case I'd suggest UTF-32.

 

And if you're using anything other than UTF-8, you'll have to handle the byte-order mark (BOM) code point to read generic text.



Good idea; I think that's similar to what they used in the D programming language. I wish Boost or the standard library had UTF-8 and UTF-16 iterator adaptors. When I get some time, perhaps I shall build my own.

