Unicode

OK, I'm really confused about the whole deal with Unicode... I thought it was like ASCII except it used 2 bytes instead of 1, and that the codes were standardized. But when I look it up there seem to be tons of character sets :( My main point being that it always used the same number of bytes, rather than 1 byte for some characters and 2 bytes for others like a bunch of the extended sets...
Quote: UTF-8 (which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters).
So anything from 1 to 4 bytes? So I still need to parse the entire thing to get the size?
Quote: UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard)
Quote: UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).
So 2 to 4 bytes???
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Unicode defines a set of code points (usually written as, say, U+0041 for the character 'A'). It does not define how to store those in memory.
That's where the UTF encodings come in. There's UTF-8, UTF-16 and UTF-32. UTF-8 uses 8 bits as its base unit. If a code point fits in 7 bits (i.e., U+0000 - U+007F), it'll just store it in one byte. If it does not fit in one byte, it'll use two, three or four bytes. The encoding scheme for this can be found on the Unicode website.
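Just to make that concrete, here's a rough sketch of the UTF-8 scheme in C++ (encode_utf8 is just a name I made up, and it assumes the input is already a valid code point no higher than U+10FFFF; check the real tables on the Unicode site before relying on it):

#include <cstdint>
#include <string>

// Sketch: encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                      // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}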
Similarly, UTF-16 will store a code point in one 16-bit word if it fits, otherwise it uses two words (a so-called surrogate pair). UTF-32 will always store a code point in one 32-bit word, because there are only about 1.1 million possible code points (the maximum is U+10FFFF), so it will always fit.
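And a similar sketch for UTF-16's surrogate pair trick (again, encode_utf16 is just an illustrative name, assuming a valid non-surrogate code point as input):

#include <cstdint>
#include <vector>

// Sketch: code points in the BMP fit in one 16-bit unit; anything
// above U+FFFF is split across two units (a surrogate pair).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<std::uint16_t>(cp));                    // one unit
    } else {
        cp -= 0x10000;                                                    // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));   // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF))); // low surrogate
    }
    return out;
}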

As you noticed, this means that for UTF-8 and UTF-16 the string length in characters differs from the string length in bytes. This is very annoying, which is where UCS-2 comes in: it's essentially UTF-16 without the surrogate pairs. If a character's code point fits in 16 bits, it's stored; otherwise the character simply isn't supported. This means you always get 2 bytes per character, but are unable to represent a number of rarer Unicode characters.

Also note that UTF-8 is backwards compatible with ASCII, in that any string made up of U+0000 - U+007F stored in UTF-8 or ASCII will look the same.

Also, with UTF-16 and UTF-32, endianness plays a role, so strings or text files may be preceded by a Byte Order Mark (BOM), the code point U+FEFF, which shows in what endianness the string was stored.
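Detecting the BOM is pretty simple; something like this (detect_utf16_bom is a made-up name, just to show the idea):

#include <cstddef>

enum class ByteOrder { BigEndian, LittleEndian, Unknown };

// Sketch: peek at the first two bytes of a UTF-16 stream and use the
// BOM (U+FEFF) to decide which byte order it was written in.
ByteOrder detect_utf16_bom(const unsigned char* data, std::size_t len) {
    if (len >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    }
    return ByteOrder::Unknown; // no BOM: you have to guess or be told
}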


Edit: and fpsgamer's link is a very good one too :)
Million-to-one chances occur nine times out of ten!
With UTF-8, you need to parse the entire string to figure out the size. Or store the size explicitly, of course.
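For example, counting characters in UTF-8 boils down to skipping continuation bytes; a rough sketch (utf8_length is a made-up name, and it assumes the input is valid UTF-8):

#include <cstddef>
#include <string>

// Sketch: count code points in a UTF-8 string. Every byte that is NOT
// of the form 10xxxxxx (a continuation byte) starts a new character.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}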
With UTF-16, the same is *technically* true, but it's able to represent the entire Basic Multilingual Plane (BMP) with 2 bytes per character. Anything beyond that is generally limited to historical (extinct) languages, musical or mathematical symbols and stuff like that. So as long as you're only dealing with plain text in *existing* languages, you might be able to assume that UTF-16 uses exactly 2 bytes per character.

If you want to make *absolutely* sure, there's UTF-32, which uses a fixed 4 bytes for everything.
Quote:Original post by Spoonbender
So as long as you're only dealing with plain text in *existing* languages, you might be able to assume that UTF-16 uses exactly 2 bytes per character.


Last I checked, Chinese was an existing, non-extinct language. There are a little more than 42,000 ideographs in the current Unicode standard that exist outside the Basic Multilingual Plane (code points 0x20000 - 0x2A6DF, along with some compatibility code points in the range 0x2F800 - 0x2F8BF, which are largely duplicate code point values).
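If you want to know whether a given UTF-16 string actually gets away with 2 bytes per character, you can scan it for surrogate units; a quick sketch (fits_in_ucs2 is a made-up name):

#include <cstddef>
#include <cstdint>

// Sketch: a UTF-16 string is one unit per character only if it contains
// no surrogate units (the reserved range 0xD800 - 0xDFFF).
bool fits_in_ucs2(const std::uint16_t* units, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (units[i] >= 0xD800 && units[i] <= 0xDFFF)
            return false; // part of a surrogate pair: code point outside the BMP
    }
    return true;
}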
So XP uses UCS-2, so 2 bytes for every character... Well, that's not too bad. I still wish they had just said "Unicode is 4 bytes" or something. There's always compression stuff for actually storing the string if 4 bytes is too heavy for what you're doing...
Quote:Original post by SiCrane
Quote:Original post by Spoonbender
So as long as you're only dealing with plain text in *existing* languages, you might be able to assume that UTF-16 uses exactly 2 bytes per character.


Last I checked, Chinese was an existing, non-extinct language. There are a little more than 42,000 ideographs in the current Unicode standard that exist outside the Basic Multilingual Plane (code points 0x20000 - 0x2A6DF, along with some compatibility code points in the range 0x2F800 - 0x2F8BF, which are largely duplicate code point values).


Ah, I thought Chinese was in the Basic Multilingual Plane too. Oops. :)
Some Chinese is, but only the most common characters. There are probably only about 2,000 characters you need to know to get through day-to-day life. Those, combined with the basic Japanese and Korean characters, are all in the BMP under the heading of CJK Unified Ideographs.
