Multi-lingual text?

8 comments, last by ChaosEngine 10 years, 11 months ago

Lately, I've been trying to add support for languages other than English to my game (e.g. Spanish, Danish, Hungarian, Chinese, Japanese, Korean). I'm more familiar and experienced with the first three; I never finished my Japanese course at uni, and I've never touched Chinese or Korean.

My assumption was that I could just use a dedicated .ttf font for the East Asian languages and the standard Segoe_UI.ttf for the European ones. The problem I have with the European languages is characters outside the standard English set. Examples: teljes képernyő (Hungarian), fuld skærm (Danish); the letters é, ő and æ do not show up, which breaks the translation. With Japanese, I use a Japanese .ttf font, and the majority of the characters do not show up, while the ones that do show up aren't the ones I need.

My question is: how would I implement these languages using .ttf fonts? For Asian languages, should I stop subtracting 32 from the value of each char? Any help is greatly appreciated!

Shogun.

What programming language are you using? Do you have Unicode support enabled?

What are you using to render the text? Platform, API, etc.?

"If you think programming is like sex, you probably haven't done much of either." - capn_midnight

Language: C++

Unicode support: I don't know.

Text rendering method: I'm using OpenGL and stb_truetype.h

Platform: Mac OS X (the code is highly portable).

IDE: Xcode

Anything else I should add?

Shogun.

Unicode support: I don't know.

This is the first thing to figure out then.

Where does your input text come from? You need to know which encoding it is stored in, such as UTF-8, UTF-16, etc.

stb_truetype seems to operate on 32-bit Unicode code points, so I assume that if your text is UTF-8, you need to decode each multi-byte UTF-8 sequence into a single UTF-32 code point before passing it to stb_truetype.
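To make that conversion step concrete, here is a minimal sketch of a UTF-8 to UTF-32 decoder. The function name DecodeUtf8 is made up for illustration, and replacing malformed input with U+FFFD is my own assumption about how you'd want errors handled; a library such as utf8proc or ICU would do the same job more thoroughly.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into UTF-32 code points.
// DecodeUtf8 is a hypothetical helper, not part of stb_truetype.
// Malformed input is replaced with U+FFFD instead of throwing.
std::vector<char32_t> DecodeUtf8(const std::string& utf8)
{
    const char32_t kReplacement = 0xFFFD;
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t kMin[] = { 0x0, 0x80, 0x800, 0x10000 };

    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead byte

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= utf8.size()) { ok = false; break; }  // truncated sequence
            const unsigned char c = static_cast<unsigned char>(utf8[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);  // each continuation byte carries 6 bits
            ++i;
        }

        // Reject overlong encodings, surrogates and values beyond U+10FFFF.
        if (!ok || cp < kMin[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = kReplacement;
        }
        out.push_back(cp);
    }
    return out;
}
```

Each resulting value can then be handed to stb_truetype as the code point, for example through stbtt_FindGlyphIndex, which takes the font info and an int Unicode code point.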

I'm in the same boat: trying to provide my own localizable string interface that's platform-independent.

I believe UTF-8 is an encoding for storing text in which a character can take multiple bytes to identify, yet it's meant to be backwards-compatible with ASCII. With that said, the high bits of each byte aren't used in calculating its Unicode value. Instead, they act as flags: a byte whose top bit is 0 is a plain ASCII character, a lead byte starting with 110, 1110 or 11110 announces a sequence of two, three or four bytes, and each following byte starts with 10 and contributes its low 6 bits to the value. Once those bits are combined, you have a 32-bit unsigned Unicode code point. Then you'd have a bitmapped font in which each glyph on the image is identified by a 32-bit unsigned value that matches the decoded code points of your localized text.

I think that's in line with what Hodgman was saying above. With that being said, you'd want to store the text you display onscreen as UTF-8, and use a library such as utf8proc to read those UTF-8 strings and convert them into strings of 32-bit Unicode code points that you'd use as look-up values in your localized bitmapped font when rendering text to the screen.

Since we're storing text in UTF-8 and XML parsers typically expect their text to be UTF-8, I store my strings in an XML schema like so:


 Hello! Hola!  New Game Nuevo Juego

I'd like to point out that my XML code doesn't appear here^ Looks like it was edited out :/



Then, in my code, I'd have a LocalizedPackage that loads an XML file of localized text, typically for an entire menu or for cut-scene dialog, containing a collection of localized strings described by my LocalizedString class. LocalizedPackage reads the XML file and, for each element in the XML, creates a LocalizedString instance. LocalizedString then creates and holds a LocalizedText instance for each element found. It reads each element, uses the 2-character code to determine which language it falls under, and labels that LocalizedText with that language. The XML parser reads the text in as a UTF-8 string, which is then converted to an array/vector/list of unsigned longs holding the 32-bit code points.


Then, you'd do something like this in your code:


fontString->SetText(localizedPackage->GetString("greeting_string"));


LocalizedPackage would keep track of the game's current language with this variable:


// in .h
static int currentLanguage;

// in .cpp
int LocalizedPackage::currentLanguage = LANGUAGE_ENGLISH; // set default language to English


LocalizedPackage would return the string for the current language as 32-bit Unicode code points, which my FontString class would know how to interpret. Of course, if you wanted to provide localized text in a specific language regardless of the engine's current language, you could always do this:


fontString->SetText(localizedPackage->GetString("greeting_string", LANGUAGE_SPANISH));


You would want to provide lots of error-checking so that GetString() returns NULL if something's invalid, and have SetText() check if it's receiving non-NULL data...
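To give a rough idea of the lookup side, here is a minimal sketch under some assumptions: the XML loading is replaced by a made-up AddString() helper, strings are held as std::u32string rather than a vector of unsigned longs, and the LocalizedString/LocalizedText layers are collapsed into a single map. Only LocalizedPackage, GetString, currentLanguage and the LANGUAGE_* constants come from the description above.

```cpp
#include <map>
#include <string>
#include <utility>

enum Language { LANGUAGE_ENGLISH, LANGUAGE_SPANISH };

class LocalizedPackage {
public:
    static int currentLanguage;

    // Hypothetical helper standing in for the XML loading step.
    void AddString(const std::string& key, int language,
                   const std::u32string& text) {
        strings_[std::make_pair(key, language)] = text;
    }

    // Look up a string in the game's current language.
    const std::u32string* GetString(const std::string& key) const {
        return GetString(key, currentLanguage);
    }

    // Look up a string in an explicit language.
    // Returns nullptr when the key or the language is unknown.
    const std::u32string* GetString(const std::string& key, int language) const {
        auto it = strings_.find(std::make_pair(key, language));
        return it == strings_.end() ? nullptr : &it->second;
    }

private:
    // (key, language) -> decoded code points
    std::map<std::pair<std::string, int>, std::u32string> strings_;
};

// Default language is English.
int LocalizedPackage::currentLanguage = LANGUAGE_ENGLISH;
```

Returning a pointer keeps the "NULL means invalid" convention, so SetText() only needs a single null check before it touches the data.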


I don't have this completely implemented yet, but I hope this gives you ideas!


EDIT: LocalizedPackage could be expanded to also load more than just localized text: images, sounds, music, etc.

Subtracting 32 from character codes is not something you would normally do, not even in western scripts (not for TTF, anyway -- you might do that if you use your own bitmap font where glyphs start at zero and the space character is your first defined glyph).
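If the 32 comes from the stb_truetype sample, which (as far as I remember) bakes 96 glyphs starting at code point 32 with stbtt_BakeFontBitmap and then indexes them with c - 32, that only works while every character falls inside that one dense range. A sketch of the alternative is a sparse table keyed by code point. The Glyph fields and the GlyphTable class below are hypothetical and just illustrate the idea:

```cpp
#include <unordered_map>

// Hypothetical per-glyph data; a real atlas stores whatever the renderer needs.
struct Glyph {
    float u0, v0, u1, v1;  // texture coordinates in the atlas
    float advance;         // horizontal advance in pixels
};

// Sparse glyph table keyed by Unicode code point, so no "c - 32" arithmetic
// is needed and gaps between scripts cost nothing.
class GlyphTable {
public:
    void Add(char32_t codePoint, const Glyph& glyph) {
        glyphs_[codePoint] = glyph;
    }

    // Returns nullptr when no glyph was baked for this code point,
    // which lets the renderer fall back to a placeholder box.
    const Glyph* Find(char32_t codePoint) const {
        auto it = glyphs_.find(codePoint);
        return it == glyphs_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<char32_t, Glyph> glyphs_;
};
```

With this layout, a missing ő or æ shows up as a failed lookup you can log, rather than as a wrong glyph pulled from the wrong array slot.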

For Asian fonts, you obviously must support Unicode in some way, since you'll be using considerably more than 255 different characters. Whatever you use is your decision, as long as you convert them to UTF-32 at the end (before passing it to stb_truetype). I'd go with UTF-8 for storage because it's straightforward and more efficient than UTF-32. Conversion routines are freely available too, so there's not much you need to think about.

UTF-16 is larger for most languages and has no advantages over UTF-8, yet it shares all of UTF-8's disadvantages (e.g. the length of a string does not map obviously to its character count); on top of that, it does not work with legacy string routines and isn't as "intuitive".

Note that using the "standard" Segoe_UI.ttf font will require you to buy a license from Monotype, unless you rely on it being installed with the operating system (which is not the case on Mac OS X; that'd be Lucida instead).

Also, you may need to redesign the UI, since different languages can have grossly different text lengths (up to 30-40% difference) and directions -- though I think you can legitimately write Japanese and Chinese left-to-right instead of top-down-right-to-left, as an alternative style.

You would want to write a localisation system that works off key strings which look up the localised string to display, which means that your input text is coming from a text file, an xls file or whatever you decide to store it in. After that, it all boils down to your text renderer, the bitmap fonts it loads and which characters are present in that map.

You shouldn't redesign the UI too much for different languages. Instead, design it while running with a longest-string setting from your localisation file; if that fits reasonably well, you are good. Framing is still important as well.

Unicode uses code points to identify which letter is which; a bitmapped font just links these code points to the correct glyph and glyph information for you. You will have to compile the TTF font with the right settings, though.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

Note that you will also need to do some amount of (potentially non-trivial) Unicode normalisation in order to correctly handle ligatures and diacritical marks.

For example, the character 'e' combined with a ` (grave accent) character is not necessarily the same glyph as the precomposed è.
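As a small illustration of the problem, 'e' followed by the combining grave accent (U+0300) and the precomposed è (U+00E8) are different code point sequences, so a glyph lookup keyed on U+00E8 will miss the first form. The sketch below composes a few such pairs using a tiny hand-written table; ComposeKnownPairs is a made-up name and the table is nowhere near complete. Real composition (NFC) should come from a library such as utf8proc or ICU.

```cpp
#include <string>

// Merge "base letter + combining mark" into one precomposed code point,
// but only for the handful of pairs listed here. Illustrative only.
std::u32string ComposeKnownPairs(const std::u32string& in)
{
    struct Pair { char32_t base, mark, composed; };
    static const Pair kPairs[] = {
        { U'e', 0x0300, 0x00E8 },  // e + combining grave        -> è
        { U'e', 0x0301, 0x00E9 },  // e + combining acute        -> é
        { U'o', 0x030B, 0x0151 },  // o + combining double acute -> ő
    };

    std::u32string out;
    for (char32_t c : in) {
        bool merged = false;
        if (!out.empty()) {
            for (const Pair& p : kPairs) {
                if (out.back() == p.base && c == p.mark) {
                    out.back() = p.composed;  // replace base with composed form
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) {
            out.push_back(c);
        }
    }
    return out;
}
```

Running text through a step like this before the glyph lookup means the font only has to contain the precomposed forms.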

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Okay, thanks for all the replies. This whole UTF-8/16/32 thing is kinda new to me and I never knew what made it useful. Do you mean I should use something like wchar_t instead of char? I'm still in the process of finding out how to do this in Xcode. I'm at work right now, so I'll have to wait until I get back today to read everything that was said in detail.

Shogun.

wchar_t isn't portably UTF-16 or UTF-32; it's an implementation-dependent type that is typically 16 bits on Windows and 32 bits on other platforms. My recommendation is to find a Unicode library and use its typedefs for Unicode character sizes. I like ICU, but there are other ones available.
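If you'd rather not pull in a library just for the types, C++11 also has char32_t and std::u32string built in. A minimal sketch, assuming a C++11 compiler; the alias names CodePoint and CodePointString and the WideCharBits helper are made up for illustration:

```cpp
#include <climits>
#include <string>

// char32_t must be able to hold any UTF-32 code unit on every platform,
// unlike wchar_t, whose width varies.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32,
              "char32_t must be at least 32 bits wide");

// Hypothetical aliases so the rest of the code never mentions wchar_t.
typedef char32_t CodePoint;
typedef std::u32string CodePointString;

// Width of wchar_t in bits on the platform this is compiled for.
inline int WideCharBits()
{
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}
```

Logging WideCharBits() on each target is a quick way to see the difference for yourself before deciding which type your string class should use.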

Also, it's not going to be as simple as finding a single font for all East Asian languages. Due to CJK (Han) unification, certain characters that are represented by the same code point are rendered differently in different languages (or differently between traditional Chinese and simplified Chinese). So you'll end up wanting a different font for each East Asian language you want to support.
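In practice that can be as simple as a table from language code to font file. A minimal sketch; the function name FontPathForLanguage, the language codes and every file path below are placeholders, not real fonts:

```cpp
#include <map>
#include <string>

// Pick a font file per language. All paths are hypothetical placeholders.
std::string FontPathForLanguage(const std::string& languageCode)
{
    static const std::map<std::string, std::string> kFonts = {
        { "ja",      "fonts/japanese.ttf" },
        { "ko",      "fonts/korean.ttf" },
        { "zh-Hans", "fonts/chinese_simplified.ttf" },
        { "zh-Hant", "fonts/chinese_traditional.ttf" },
    };

    auto it = kFonts.find(languageCode);
    if (it != kFonts.end()) {
        return it->second;
    }
    // Everything else (Spanish, Danish, Hungarian, ...) shares one font.
    return "fonts/default.ttf";
}
```

Switching the game's language then means reloading the font from the returned path as well as swapping the strings.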
