I've been thinking this over for a while. Right now, my game engine and GUI do most things in UCS-2. UCS-2 is a 16-bit encoding of Unicode, and it can encode the values for all of the most common Unicode characters. I think these days, Windows supports UTF-16 as well. I started to use it because I do most of my work on Windows, and this fits perfectly with the Windows 16-bit wchar_t type. I had avoided UTF-8 because it's a variable-width encoding, which means some characters will use 8 bits while others will use 16, 24, or at most 32. Because of this, I would no longer be able to easily use string[index] to get a specific character.
But lately, I've been wondering if I made a good decision. Posts like this (link to comment) got me thinking. I've been so focused on a single platform that I haven't really considered some real problems. For one thing, wchar_t is 16 bits on Windows, but 32 on some other platforms. Also, when using wchar_t, and even with UTF-16 and UTF-32, you have to worry about endianness. So I've been considering using UTF-8 for all of my strings. This is possible because I use my own GUI anyway and will only have to convert when calling API-specific functions. If I can limit this, I shouldn't have any problems.
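To make the endianness worry concrete, here's a small sketch of how the same 16-bit value ends up as two different byte sequences depending on byte order. The function names are my own, made up for illustration; they aren't part of any library.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers: split one 16-bit code unit into two bytes.

// Little-endian: low byte first (what Windows on x86 stores in memory).
std::array<std::uint8_t, 2> to_bytes_le(std::uint16_t unit) {
    return { static_cast<std::uint8_t>(unit & 0xFF),
             static_cast<std::uint8_t>(unit >> 8) };
}

// Big-endian: high byte first.
std::array<std::uint8_t, 2> to_bytes_be(std::uint16_t unit) {
    return { static_cast<std::uint8_t>(unit >> 8),
             static_cast<std::uint8_t>(unit & 0xFF) };
}
```

For the value 20AC (hexadecimal), the first function gives the bytes AC 20 and the second gives 20 AC. UTF-8 is defined as a sequence of single bytes, so it doesn't have this problem.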
What is UTF-8 and how is it encoded?
A lot of the information in this section comes from http://www.fileformat.info/info/unicode/utf8.htm so you should look at it for more detailed information, but I'll try to explain it the best I can. Just one thing: when I say character, I'm referring to the numeric value of the character in Unicode (its code point), which goes from 0 to 1,114,111 (hexadecimal 10FFFF). Of those, 1,112,064 are valid for encoding (http://en.m.wikipedia.org/wiki/UTF-8), and zero can be used as a string terminator.
Like I said before, UTF-8 can encode characters into 1, 2, 3, or 4 bytes. 1-byte encodings are only for characters from 0 to 127, meaning a 1-byte encoding is equivalent to ASCII. 2-byte encodings are for characters from 128 to 2047. 3-byte encodings are for characters from 2048 to 65535, and 4-byte encodings are for characters from 65536 to 1,114,111. To understand how the encoding works, we'll need to examine the binary representation of each character's numeric value. To do this easily, I'll also use hexadecimal notation, as 1 hexadecimal digit always corresponds to a 4-bit nibble. Here's a quick table.
0 = 0000    8 = 1000
1 = 0001    9 = 1001
2 = 0010    A = 1010
3 = 0011    B = 1011
4 = 0100    C = 1100
5 = 0101    D = 1101
6 = 0110    E = 1110
7 = 0111    F = 1111
So 2C (hexadecimal) = (2 x 16) + (12 x 1) = 44 (decimal) and 0010 1100 (binary).
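If you'd rather let the computer do the digit-by-digit conversion, here's a minimal sketch that prints a value as 4-bit groups, one group per hexadecimal digit. The name to_nibbles is just something I picked for this example.

```cpp
#include <string>

// Hypothetical helper: render the low `hex_digits` hexadecimal digits of
// `value` as space-separated 4-bit groups, most significant digit first.
std::string to_nibbles(unsigned long value, int hex_digits) {
    std::string out;
    for (int d = hex_digits - 1; d >= 0; --d) {
        // Isolate one hexadecimal digit (4 bits).
        const unsigned nibble = static_cast<unsigned>((value >> (4 * d)) & 0xF);
        for (int bit = 3; bit >= 0; --bit)
            out += ((nibble >> bit) & 1u) ? '1' : '0';
        if (d > 0) out += ' ';
    }
    return out;
}
```

Calling to_nibbles(0x2C, 2) returns "0010 1100", which matches the hand conversion.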
I realize many readers know this already, but I want to make sure new programmers will be able to follow along.
The UTF-8 Format
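In UTF-8, the high-order bits of each byte are important. UTF-8 works by using the leading high-order bits of the first byte to tell how many bytes were used to encode the value. For single-byte encodings of values from 0 to 127, the high-order bit will always be zero. Because of this, if the high-order bit is zero, the byte will always be treated as a single-byte encoding. Therefore, all single-byte encodings have the following form: 0XXX XXXX
That leaves 7 bits available to code the number. Here is the format for all of the encodings:
1 byte (0 to 7F): 0XXX XXXX
2 bytes (80 to 7FF): 110X XXXX 10XX XXXX
3 bytes (800 to FFFF): 1110 XXXX 10XX XXXX 10XX XXXX
4 bytes (10000 to 10FFFF): 1111 0XXX 10XX XXXX 10XX XXXX 10XX XXXX
Every byte after the first starts with 10, so a continuation byte can never be mistaken for the start of a character.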
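Here's a small sketch of how the first byte can be tested to find the length of a sequence. It relies on the standard UTF-8 lead-byte patterns (0XXX XXXX, 110X XXXX, 1110 XXXX, 1111 0XXX); the function name is my own.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes in the sequence that starts with
// `lead`. Returns 0 if `lead` is a continuation byte (10XX XXXX) or does
// not match any of the four lead-byte patterns.
int utf8_sequence_length(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0XXX XXXX
    if ((lead & 0xE0) == 0xC0) return 2;  // 110X XXXX
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110 XXXX
    if ((lead & 0xF8) == 0xF0) return 4;  // 1111 0XXX
    return 0;
}
```

Each test masks off the bits that carry data and compares what's left against the marker bits. This only looks at the pattern; it doesn't check whether the full sequence is valid.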
Once you know the format of UTF-8, converting to and from it is fairly simple. To convert to UTF-8, you can easily see whether a value will encode to 1, 2, 3, or 4 bytes by checking its numerical range. Then copy the bits to the correct locations. After a few weeks, I'll post a class that puts all of this together and one that can convert between the different Unicode formats.
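As a rough sketch of those two steps (check the range, then copy the bits), an encoder might look something like this. This is not the class I mentioned, just an illustration, and utf8_encode is a made-up name. One assumption to be aware of: it accepts anything that fits in 21 bits, so it does not reject surrogates (D800 to DFFF) or values above 10FFFF the way a strict encoder should.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one value with the UTF-8 bit layout.
// Returns an empty string if the value needs more than 21 bits.
// Assumption: no Unicode validity checks are performed.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 0XXX XXXX
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 110X XXXX 10XX XXXX
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 1110 XXXX + 2 continuation bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x1FFFFF) {            // 1111 0XXX + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, utf8_encode(0x20AC) produces the three bytes E2 82 AC.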
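Let's try an example. I'm going to use hexadecimal value 1FACBD for this example. This isn't a real Unicode character (it's above the 10FFFF maximum), but it still fits in the 21 bits a 4-byte sequence can hold, so it'll help us see how to encode values. The number is greater than FFFF so it will require a 4-byte encoding. Let's see how it'll work. First, here's the value in binary.
0001 1111 1010 1100 1011 1101
This will be a 4-byte encoding so we'll need to use the following format.
1111 0XXX 10XX XXXX 10XX XXXX 10XX XXXX
Now converting to UTF-8 is as simple as copying the bits from right to left into the correct positions. The format has room for 21 bits, so the low 21 bits of the value are split into groups of 3, 6, 6, and 6:
111 111010 110010 111101
Dropping each group into its X positions gives:
1111 0111 1011 1010 1011 0010 1011 1101
In hexadecimal, that's the four bytes F7 BA B2 BD.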
Like I said, UTF-8 is a fairly straightforward format.
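Going the other way is just as mechanical: read the lead byte to find the length, then shift in 6 bits from each continuation byte. Here's a minimal sketch with a made-up function name. It assumes you only want the bit layout undone, so it does not reject overlong encodings, surrogates, or values above 10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode the sequence starting at s[pos].
// On success, stores the value in `out` and returns true.
// Returns false for a truncated sequence or a bad lead/continuation byte.
// Assumption: no checks for overlong forms or Unicode validity.
bool utf8_decode(const std::string& s, std::size_t pos, std::uint32_t& out) {
    if (pos >= s.size()) return false;
    const std::uint8_t lead = static_cast<std::uint8_t>(s[pos]);

    std::uint32_t value = 0;
    std::size_t extra = 0;  // number of continuation bytes expected
    if ((lead & 0x80) == 0x00)      { value = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { value = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { value = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { value = lead & 0x07; extra = 3; }
    else return false;              // continuation byte or invalid lead

    if (pos + extra >= s.size()) return false;  // sequence is cut off

    for (std::size_t i = 1; i <= extra; ++i) {
        const std::uint8_t b = static_cast<std::uint8_t>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return false;   // must look like 10XX XXXX
        value = (value << 6) | (b & 0x3F);
    }
    out = value;
    return true;
}
```

Feeding it the bytes F7 BA B2 BD gives back 1FACBD, the value from the example.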
Questions and Final Thoughts
Now, it's not difficult to convert to and from UTF-8, but is this really the best solution? Would it be better to continue to store things in a known string type and just be able to convert back and forth, using UTF-8 only to store things in files? Maybe I could continue to use wide character strings in the native OS format and just write some functions to do the conversions. If you have any preferred strategies, I'd like to hear them.