Using UTF-8 in Games
After working on my GUI system, I really started to explore using UTF-8 more in my game. I'm currently living in Korea, and that has made me see how important it is to give people games in their own language. The kids I teach are just crazy about video games. StarCraft, for example, was at one time so popular in Korea that it spawned sponsored professional players. Matches can even be seen on game-specific cable television networks. There is a large international demand for games, and developers should build their games with this in mind.
I've been exploring UTF-8 lately. I've written an article on it, and I've even made a video presentation based on that article. If you want source code, I'm currently writing a series on my other blog called Writing a STL-Style UTF-8 String Class. It will be a five-part series that I should finish by Friday. If it gets a nice reception, I'll post a summary article here on Gamedev. Even with all of this information, plus the countless articles on the Internet, how can someone actually use UTF-8 in a game?
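UTF-8 is a Unicode encoding. One good thing about UTF-8 is that it is a variable-width encoding built from 8-bit units: each code point takes between one and four bytes. This means that games developed using 7-bit ASCII (stored in the 8-bit "char" data type in C and C++) will not need to alter the file formats that store their text. This is because 7-bit ASCII is a subset of UTF-8, and UTF-8 works with null-terminated C-style strings, since a zero byte never appears inside a multi-byte sequence. The standard "char"-based string functions that operate on bytes, including search, concatenation, and comparison, will still work. The one catch is that they count bytes rather than characters, so a length function reports the number of bytes, not the number of code points.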
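As a rough illustration of the point about the standard string functions, here is a minimal sketch. The helper names `utf8_length` and `utf8_contains` are made up for this example, and both assume the input is valid, null-terminated UTF-8. Byte-wise search is safe because a valid UTF-8 sequence can never match in the middle of another character.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts code points instead of bytes.
// Assumes `s` is valid, null-terminated UTF-8.
std::size_t utf8_length(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: plain byte-wise search with std::strstr.
// Works unchanged on UTF-8 as long as both strings are valid UTF-8.
bool utf8_contains(const char* haystack, const char* needle) {
    assert(haystack != nullptr && needle != nullptr);
    return std::strstr(haystack, needle) != nullptr;
}
```

For a string holding two Korean syllables, `std::strlen` reports six bytes while `utf8_length` reports two code points, which is the difference to keep in mind when sizing buffers versus counting characters.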
To use UTF-8 with a GUI, the text can be stored in UTF-8 and then converted during rendering. This can be done by iterating through the string and decoding one code point at a time. For GUIs that render text that doesn't need further processing, this will work well. Edit boxes will be a little trickier since they require insertions of characters at arbitrary positions in the text field. This can be done using a good UTF-8 string class, but it would be better not to use UTF-8 for the edit buffer and to convert to UTF-8 later. This makes sense, as an edit control will probably have its own buffer for text anyway that will later be synchronized with the application.
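The iteration described above might look something like the following sketch. The function names `next_code_point` and `decode_utf8` are hypothetical, not taken from any library. The decoder substitutes U+FFFD for bad lead bytes, truncated sequences, and bad continuation bytes, but as a simplification it does not reject overlong encodings or encoded surrogates, which a production decoder should.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point starting at s[i] and
// advances i past the bytes it consumed. Returns U+FFFD on malformed input.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) {
        return lead;  // plain ASCII, one byte
    }

    int extra = 0;    // number of continuation bytes expected
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0) {
        extra = 1; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        extra = 2; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        extra = 3; cp = lead & 0x07;
    } else {
        return 0xFFFD;  // stray continuation byte or invalid lead byte
    }

    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) {
            return 0xFFFD;  // string ended mid-sequence
        }
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {
            return 0xFFFD;  // not a continuation byte; leave it for the next call
        }
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// Decodes a whole string, e.g. to hand code points to a glyph lookup.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        out.push_back(next_code_point(s, i));
    }
    return out;
}
```

A label or button could call `next_code_point` in its draw loop and look up a glyph for each result. An edit box could instead keep the `std::vector<char32_t>` from `decode_utf8` as its working buffer, where inserting at an arbitrary position is a simple index operation.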
Another thing UTF-8 is good for is sending data. Even when a game already uses wchar_t (16 bits on Windows), UTF-8 may still be used when communicating with other programs over a network. Many network libraries and many servers require requests to be sent as JSON or XML encoded in UTF-8, so even if the program stores its text internally as wchar_t, conversion to UTF-8 may still be needed. UTF-8 is good for sending data because it is a sequence of single bytes and is therefore endian-independent. One example of this is the RakNet master server implementation by Jenkins Software. To use it, applications should send data in JSON over HTTP. User names, chat messages, and room details can all be encoded in UTF-8.
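Here is a minimal sketch of that conversion step. It assumes the internal text is UTF-16 and uses `std::u16string` as a portable stand-in for 16-bit wchar_t on Windows. The names `append_utf8` and `utf16_to_utf8` are hypothetical. Unpaired surrogates are replaced with U+FFFD, which is one reasonable policy among several.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point.
// Out-of-range values and surrogates are replaced with U+FFFD.
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        cp = 0xFFFD;
    }
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 text to UTF-8, combining surrogate pairs into one code point.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        }
        // An unpaired low surrogate falls through and is replaced in append_utf8.
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting `std::string` holds the same bytes on any machine, so it can go straight into a JSON body without any byte-order handling. On Windows itself, the Win32 function `WideCharToMultiByte` with `CP_UTF8` performs the same conversion.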
Very interesting. I'm personally surprised you're going to work on edit support; finding the insertion point (potentially) requires consistent support for Unicode layout. Just thinking about it makes my head hurt.
Personally, I use UTF-8 for serialization purposes only.