Unicode storage


3 replies to this topic

#1 Alundra   Members   -  Reputation: 800


Posted 05 April 2013 - 08:01 PM

Hi,

Unicode can be stored as UTF-8, UTF-16 or UTF-32.

It can also be stored using wchar_t, which is UTF-16 or UTF-32 depending on the platform.

The advantage of UTF-32 is that one code unit is always one code point, but it needs more memory.

Is UTF-32 still a problem nowadays for cross-platform code?

Thanks




#2 Ectara   Crossbones+   -  Reputation: 2874


Posted 06 April 2013 - 12:03 AM

A quick and dirty explanation:

UTF-8 is variable width, but is not endian-specific. Great for storage and transmission.

UTF-16 is variable width, and is endian-specific. Its limitations are also the reason why the Unicode standard restricts code points to values no greater than 0x10FFFF. Avoid like the plague.

UTF-32 is fixed width, and is endian-specific. It is faster to iterate through an array of code points in both directions, but requires more space.
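
To make the size trade-off concrete, here is a minimal C++11 sketch (not from the original post) that counts how many code units one code point needs in UTF-8 and UTF-16, and splits a supplementary code point into a UTF-16 surrogate pair. The names `utf8_length`, `utf16_length`, `SurrogatePair` and `to_surrogates` are made up for illustration, and the functions assume the input is a valid code point.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed to store one code point in UTF-8.
// Assumes cp is a valid code point (not checked beyond the upper bound).
std::size_t utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// 16-bit units needed to store one code point in UTF-16.
std::size_t utf16_length(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
// The pair carries 20 bits of payload, which is where the 0x10FFFF ceiling comes from.
SurrogatePair to_surrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const char32_t v = cp - 0x10000;  // 20-bit value
    return { static_cast<char16_t>(0xD800 + (v >> 10)),
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) };
}
```

In UTF-32 every code point takes exactly one 32-bit unit, so the equivalent function would always return 1. The largest code point, 0x10FFFF, maps to the last possible pair (0xDBFF, 0xDFFF), so UTF-16 has no way to represent anything higher.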


If you need more in-depth information, a dedicated guide would be best.

 

Also, if you are doing your own text handling, avoid wchar_t unless you are dealing with something very close to the system API. Not only does one code point not necessarily correspond to one character, but one wchar_t need not hold a whole code point, as with UTF-16 surrogate pairs. To make matters worse, wchar_t and wide-char strings aren't required to use UTF-16 or UTF-32 at all; wchar_t only has to be a unit that can hold all characters used by the system. Symbian, for example, uses UCS-2 strings.
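
A rough sketch of why this matters in practice (again, not from the original post): the same wide string literal can have a different length on different platforms. The helper `count_code_points` below is hypothetical, and it assumes the wide string holds UTF-16 when wchar_t is 16 bits wide and UTF-32 otherwise, which is common but, as noted above, not guaranteed.

```cpp
#include <cstddef>
#include <string>

// Count code points in a wide string.
// ASSUMPTION: the string is UTF-16 if wchar_t is 2 bytes, UTF-32 otherwise.
std::size_t count_code_points(const std::wstring& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned long u = static_cast<unsigned long>(s[i]);
        // With 16-bit wchar_t, a high surrogate followed by a low surrogate
        // is two units but only one code point.
        if (sizeof(wchar_t) == 2 && u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
            const unsigned long next = static_cast<unsigned long>(s[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i;  // skip the low surrogate
            }
        }
        ++n;
    }
    return n;
}
```

With this, `std::wstring(L"\U0001F600").size()` is 2 where wchar_t is 16 bits and 1 where it is 32 bits, while `count_code_points` reports 1 in both cases under the stated assumption.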


Edited by Ectara, 06 April 2013 - 12:23 AM.


#3 Aressera   Members   -  Reputation: 1348


Posted 06 April 2013 - 12:15 AM

I use UTF-8 stored as a String<unsigned char> for all localizable strings and it works pretty well, given that I have a StringIterator class specialized for unsigned char which handles all of the special cases and outputs a series of full-width UTF-32 characters. Performance is never going to really be a problem unless you're doing tons of string manipulations or need random access for characters, in which case I'd suggest UTF-32.
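
The String and StringIterator classes mentioned above are not shown in the thread, so the following is only a minimal sketch of the same idea using the standard library: walk a UTF-8 byte string and emit UTF-32 code points. The names `next_code_point` and `decode_utf8` are hypothetical. The sketch replaces malformed sequences with U+FFFD and, as a simplifying assumption, does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Precondition: i < s.size(). Malformed input yields U+FFFD.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;  // single-byte (ASCII) case

    int extra;    // number of continuation bytes expected
    char32_t cp;  // payload bits collected so far
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;           // stray continuation or invalid lead byte

    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return 0xFFFD;           // truncated sequence
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;      // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}

// Convert a whole UTF-8 string to UTF-32.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ) {
        out.push_back(next_code_point(s, i));
    }
    return out;
}
```

Decoding on the fly like this keeps storage compact, at the cost of O(n) access to the n-th character; converting once with `decode_utf8` gives the random access that UTF-32 is suggested for.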

 

And if you're using anything other than UTF-8, you'll have to handle the byte order mark (the endianness marker code point, U+FEFF) to read generic text.
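
As an illustration of that last point, here is a small sketch that inspects the first bytes of a buffer for a byte order mark. The `Encoding` enum and `detect_bom` function are hypothetical names, and the sketch assumes that the bytes FF FE 00 00 mean UTF-32 little-endian, although they could also be a UTF-16 little-endian BOM followed by U+0000.

```cpp
#include <cstddef>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Guess the encoding from a leading byte order mark (U+FEFF), if present.
// Returns Encoding::Unknown when no BOM is found.
Encoding detect_bom(const unsigned char* p, std::size_t n) {
    // UTF-32 is checked first because its little-endian BOM (FF FE 00 00)
    // begins with the UTF-16 little-endian BOM (FF FE).
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Encoding::Utf32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Encoding::Utf32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Encoding::Utf8;  // optional UTF-8 signature, carries no byte order
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Encoding::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Encoding::Utf16BE;
    return Encoding::Unknown;
}
```

Text without a BOM comes back as `Encoding::Unknown`, so a real reader would still need a default or some out-of-band information about the encoding.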



#4 Ryan_001   Prime Members   -  Reputation: 1296


Posted 06 April 2013 - 10:15 AM

Aressera, on 06 April 2013 - 12:15 AM, said:

"I use UTF-8 stored as a String<unsigned char> for all localizable strings and it works pretty well, given that I have a StringIterator class specialized for unsigned char which handles all of the special cases and outputs a series of full-width UTF-32 characters. Performance is never going to really be a problem unless you're doing tons of string manipulations or need random access for characters, in which case I'd suggest UTF-32.

And if you're using anything other than UTF-8, you'll have to handle endianness marker code points to read generic text."

Good idea, I think that's similar to what they used in the D programming language. I wish boost or the standard library had UTF-8 and UTF-16 iterator adaptors. When I get some time perhaps I shall build my own.





