Code Localization (UTF-8 vs UTF-16)


Hi guys,

I'm using C++ and have started working on the text system of my game. My idea is to support several languages other than English (ru, kr, ch, etc.).

Now I've run into the issue that Windows has the “UNICODE” macro, while all of my FileSystem (and most of the rest of my code) works with UTF-8, and I'm starting to wonder what I should do. I could enable “UNICODE” and adapt the code. Or there is “utf8cpp”, which allows converting UTF-16 to UTF-8, but that makes me wonder whether it has issues with things like folder names in Chinese (for example, the player has the game inside their own user folder, with a name in Chinese).

Does anyone have any thoughts on this?


In my codebase I use UTF-8 for all strings, except when interfacing with Windows functions. In that case, I convert to and from UTF-16. There is no information loss during conversion. UTF-8 has the advantage of smaller size, compatibility with ASCII, and compatibility with the various APIs that use char*, such as those on Linux/macOS. It also allows you to use regular std::string instead of wstring or whatever, though I use a custom string class with template specializations for each encoding (including ASCII and UTF-32).
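A minimal sketch of that boundary conversion on Windows, using MultiByteToWideChar / WideCharToMultiByte (the helper names are just examples, and error handling is reduced to returning an empty string):

#include <string>
#include <windows.h>

// UTF-8 (std::string) -> UTF-16 (std::wstring), for passing to the “W” Windows APIs.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    const int count = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    if (count <= 0) return std::wstring();
    std::wstring utf16(count, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], count);
    return utf16;
}

// UTF-16 (std::wstring) -> UTF-8 (std::string), for everything else in the codebase.
std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    const int count = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), nullptr, 0, nullptr, nullptr);
    if (count <= 0) return std::string();
    std::string utf8(count, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), &utf8[0], count, nullptr, nullptr);
    return utf8;
}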

Aressera said:

In my codebase I use UTF-8 for all strings, except when interfacing with Windows functions. In that case, I convert to and from UTF-16. There is no information loss during conversion. UTF-8 has the advantage of smaller size, compatibility with ASCII, and compatibility with the various APIs that use char*, such as those on Linux/macOS. It also allows you to use regular std::string instead of wstring or whatever, though I use a custom string class with template specializations for each encoding (including ASCII and UTF-32).

Thanks for the response.

But how do you handle Chinese names, for example? Or the text localization in the files?

One idea I had is that all files should be UTF-16; then I load the file and convert it to UTF-8.

You seem to be confusing the Unicode concept with encoding concepts.

The explanation that worked for me was

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

It's ancient, but I think it will make things more clear.

In short, Unicode is just a mapping between very many characters (“codepoints” in Unicode speak) and unique integers, nothing more. 65 ↔ ‘A’ is one of them; you can find them all at unicode.org. So a Unicode text is equivalent to a sequence of large numbers.

Now, storing such numbers in a computer or on a disk, or sending them to someone else over the Internet, can be a complicated problem, in particular if you expect the other end to be able to read the text as well. You need an agreement on how you encode the numbers for storage or transmission. This is what UTF-8, UTF-16, UTF-32 (and others) are about. They convert a sequence of codepoint numbers to a form that you can save or transmit, or use as a file name, or whatever you want to do with it. Conveniently, it is also possible to reverse the process and get the same sequence of code points back from the encoded data. This solves the storage problem, basically.

Note that all encodings basically do the same thing; they just make different choices in how to store the numbers. As a result, some encodings are better (more compact) for Western languages, and others are better for Asian languages, for example. However, in all cases you must know how the storage is encoded in order to retrieve the original content.

I hope you now start to realize that “Chinese names” are no different from “English names” or names from Spain. It's all Unicode code points. Only the codepoint numbers differ, since different languages tend to use different characters (glyphs, the graphical representation of a character) to express something.
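As a concrete illustration, here is the Latin letter ‘A’ (code point U+0041) next to the Chinese character 中 (code point U+4E2D) under the common encodings; only the byte patterns differ, the code points work the same way:

// The same two Unicode code points stored under different encodings.
const char     a_utf8[]   = { 0x41 };                                 // 'A': 1 byte in UTF-8
const char     zh_utf8[]  = { (char)0xE4, (char)0xB8, (char)0xAD };   // 中 : 3 bytes in UTF-8
const char16_t a_utf16[]  = { 0x0041 };                               // 'A': 1 code unit (2 bytes) in UTF-16
const char16_t zh_utf16[] = { 0x4E2D };                               // 中 : 1 code unit (2 bytes) in UTF-16
const char32_t a_utf32[]  = { 0x00000041 };                           // 'A': always 4 bytes in UTF-32
const char32_t zh_utf32[] = { 0x00004E2D };                           // 中 : always 4 bytes in UTF-32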

The problem, however, comes when you try to show that Chinese name on the screen. Then you need the right glyphs and have to paint them on the screen. This is where fonts come in. There are also large and complicated libraries for painting text. I have no experience with them, so I cannot tell you much about that.

Hopefully this clarifies things a bit.

I recommend avoiding UTF-16 text files, because they are unusual, inconvenient and typically almost twice as large as the same text encoded as UTF-8.

Text editors should support all encodings that include the characters you want to use equally well; as Alberth explains, the challenging part of showing Chinese text in your game is providing fonts (and, possibly, performing top to bottom rather than left to right layout).

Omae Wa Mou Shindeiru

Rewaz said:
But how do you handle Chinese names, for example? Or the text localization in the files?

Adding a tiny bit to Alberth's excellent answer: be careful that you're not conflating different things, and do your best to build your systems in a way that makes the distinction meaningless.

When you're programming, stop thinking of things like filenames or UI text as words and symbols. It is far better to think of them as generic blobs of data which happen to correspond to human-meaningful content.

If you're in generic C++, there are now system-agnostic types like std::filesystem::path that work automatically with 8-bit, 16-bit, and 32-bit string formats. You don't need to be bothered with the underlying structure, nor with whether they use a forward slash, a backslash, or other characters, nor with any other system-specific elements.
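A rough sketch of what that looks like (C++17; the directory name below is just an example, any Unicode name behaves the same way):

#include <filesystem>
#include <fstream>
#include <iostream>

int main()
{
    namespace fs = std::filesystem;

    // Build a path from UTF-8 text; the library converts to the platform's native
    // representation (UTF-16 on Windows, plain bytes on Linux/macOS) internally.
    const fs::path saveDir = fs::u8path(u8"saves/玩家一"); // “player one”, purely as an example

    fs::create_directories(saveDir);

    // The path object is passed around as an opaque blob; no manual encoding work needed.
    std::ofstream out(saveDir / "progress.dat", std::ios::binary);
    out << "checkpoint=3\n";

    // If you need UTF-8 back out (e.g. for logging), ask the path for it.
    std::cout << (saveDir / "progress.dat").u8string() << "\n";
    return 0;
}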

On Windows, that can also mean a generic 8-bit string for the “A” functions like CreateFileA(), FindFirstFileA(), CreateDirectoryA(), and related functions, or a 16-bit wchar_t string for the “W” functions like CreateFileW(), FindFirstFileW(), CreateDirectoryW(), and related functions. Windows has a bunch of functionality that handles them automatically, with Axxx and Wxxx variants, plus Txxx variants with rules to help move between them.

In an Apple macOS environment or a Linux environment, wchar_t is 32 bits and the blob of text for the file name is adjusted accordingly. Linux has had UTF-8 for filesystems almost from the very beginning, and it is perfectly capable of encoding Unicode file names.

But ultimately, if you've done it right, you don't care. Files are located “somewhere”. You can throw up a dialog box that lets the user point to “somewhere else”. Done well, you shouldn't care whether they're located on a hard drive, on a USB drive behind a chain of 17 USB hubs, or at a network location on a different continent. You don't care whether the file name is in English, Spanish, Korean, Kanji, or Klingon; all of them are a blob representing “somewhere”. You pass along the argument that says “somewhere”, and the functions know how to interpret “somewhere” and create a file pointer out of it.

When you are displaying blobs of text on the screen, allow translators to do their thing, artists to do their thing around fonts, and Unicode rendering to do its thing. You do that by not manipulating strings for UI. Let them be driven by string tables, and pass along whatever blob of text is in the string table. This way you don't know or care what's actually displayed; all you care about is the tag for the data.
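A bare-bones sketch of the idea, using an assumed in-memory table rather than any particular engine's API:

#include <string>
#include <unordered_map>

// UI code only ever sees tags; translators own the UTF-8 text behind them.
class StringTable
{
public:
    void Set(const std::string& tag, const std::string& utf8Text) { m_entries[tag] = utf8Text; }

    // Returns the localized blob for a tag, falling back to the tag itself so
    // missing entries are visible during development.
    std::string Get(const std::string& tag) const
    {
        const auto it = m_entries.find(tag);
        return (it != m_entries.end()) ? it->second : tag;
    }

private:
    std::unordered_map<std::string, std::string> m_entries;
};

// Usage: table.Set("menu.start", u8"開始"); then the UI draws table.Get("menu.start")
// without knowing or caring which language the blob happens to be in.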

In Unreal, they're all put into FText instances that can be customized by translation. Functions like FText::Format() allow translators to reposition strings, change tense based on the plurality or gender used, and more, but the programmer doesn't need to know or care about it; it's just an FText object that represents a blob of text meaningful to the user. As one of many examples in Unreal, following the simple rules lets translators do things like "You came {Place}{Place}|ordinal(one=st,two=nd,few=rd,other=th)!" to make "You came 1st!" or "You came 3rd!", or, for gendered languages, pick a male warrior “guerrero” or a female warrior “guerrera” based on tags.

The point of all this is: Build your systems in a way that it doesn't matter.

LorenzoGatti said:
I recommend avoiding UTF-16 text files, because they are unusual, inconvenient and typically almost twice as large as the same text encoded as UTF-8.

UTF-8 is very much optimized for mostly-ASCII text with a few non-ASCII characters mixed in; such languages are used in the Western world. If you instead happen to live in Asia, with languages that mainly use characters with much larger code-point values and hardly any ASCII characters (I am guessing), the story is quite different, and as far as I know UTF-16 wins there. Both the UTF-8 and UTF-16 encodings have the problem that encoded characters have varying lengths, which can be troublesome for some applications. UTF-32 solves that at the cost of some more memory.

Each encoding has some advantages and some disadvantages. People weigh those for their situation and pick the solution that works best for them. UTF-8 is not the clear winner in all cases, or the other encodings wouldn't be used anywhere in the world at all.

It seems I had the wrong concept of it; I was thinking of 1 byte vs 2 or more bytes.

That's where my confusion about how it works comes from, since in my head it's: okay, one Japanese letter has length 1 but can be 3 bytes, so how do I convert byte-wise and get the length of a word?

Rewaz said:
how do I convert byte-wise and get the length of a word?

You can't quite do that. One byte doesn't equal one letter. UTF-8, UTF-16, UTF-32, it doesn't matter: you can always find individual letters that span multiple in-memory bytes. A single code point might be one, two, three, or four bytes, and a user-perceived character built from several code points can take even more.

You can, however, test whether it's one of the recognized whitespace characters. There are currently 17 of them in the big list; someone has shortened the list to just those values. If you're doing it for word wrapping, be careful that you don't break on any of the no-break spaces. I understand there are languages where there must be a whitespace gap, but the word changes when the gap is broken up or expanded.
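A rough sketch of such a test over decoded code points; the set below is one plausible list of breakable spaces (it deliberately leaves out the no-break spaces such as U+00A0 and U+202F), so check the Unicode data files for the authoritative version:

// Breakable-whitespace test for word wrapping, operating on decoded code points.
bool IsBreakableSpace(char32_t cp)
{
    switch (cp)
    {
        case 0x0020:                                        // space
        case 0x1680:                                        // ogham space mark
        case 0x2000: case 0x2001: case 0x2002: case 0x2003:
        case 0x2004: case 0x2005: case 0x2006:
        case 0x2008: case 0x2009: case 0x200A:              // typographic spaces (U+2007 figure space is non-breaking)
        case 0x205F:                                        // medium mathematical space
        case 0x3000:                                        // ideographic space
            return true;
        default:
            return false;
    }
}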

Edit to add: When comparing strings, you also can't go byte-by-byte, for the same reasons as above. All UTF strings need to be processed as potentially multibyte, so you process them with string functions that know how to accommodate the encoding, advancing one symbol at a time rather than one byte at a time, and that know whether a character is one, two, three, or four bytes.

There are descriptions of how a codepoint gets encoded in bytes; you can simply find them on Wikipedia. They define where the bits of the codepoint go, and also how to recognize which bytes belong together.

For UTF-8, at https://en.wikipedia.org/wiki/UTF-8 you can see in the encoding section that the first byte has a different fixed sequence of high bits depending on how many bytes follow.
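A minimal sketch of using those leading-byte patterns to walk a UTF-8 string one code point at a time (it assumes well-formed input; real code should also validate the continuation bytes):

#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with this lead byte.
std::size_t Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx : plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx : 1 continuation byte follows
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx : 2 continuation bytes follow
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx : 3 continuation bytes follow
    return 1;                            // invalid lead byte; skip it
}

// Count code points (not bytes) in a UTF-8 encoded std::string.
std::size_t CodePointCount(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); i += Utf8SequenceLength((unsigned char)utf8[i]))
        ++count;
    return count;
}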

