Cross-platform UTF-8 in C++

For Beginners

Started by iaminternets March 14, 2009 01:30 PM

21 comments, last by Codeka 15 years, 1 month ago

11,840

March 15, 2009 09:20 AM

Quote:Original post by Yann L
Quote:There are about 41,000 characters in CJK extension B that fall outside the BMP.

AFAIK, you can perfectly well write CJK using only the BMP. But I may be wrong.

Why do you think that the Unicode standard includes CJK extension B? To take up space for no reason? There's a reason why I included this sentence in my post:

Quote:In particular, the Chinese government requires that computer systems properly implement many characters outside the BMP and has since at least around 2000.

Quote:If you specifically target the Chinese market, then the SIP might be required.

SIP?

Quote:Can't disagree here. However there are many cases, especially for hobby programmers, where an 'intermediate' Unicode support level is entirely sufficient. Including a library such as ICU can be a daunting task for a beginner, and is not always justified.

I know where you're coming from, but just adding ICU to your application isn't much harder than adding a library like SDL or SFML, and these are tasks that we, in this forum, expect beginners and hobby programmers to be able to do.

In any case, this "intermediate" Unicode support level that you're advocating has two important disadvantages: it only really exists on Windows machines (non-Windows compilers tend to use a 32-bit wchar_t), and it relies on some of the most poorly documented features in the standard library. Fun challenge: try to find a list of locale names usable with MSVC.
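To make the wchar_t point concrete, here is a minimal sketch of what a 16-bit code unit has to do with a character outside the BMP, such as U+20000 (the first code point of CJK Extension B). The names Utf16Units and to_utf16 are made up for this illustration, and the function assumes its input is a valid Unicode scalar value (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch: one or two UTF-16 code units.
struct Utf16Units {
    char16_t units[2];
    int count;
};

// Encode one code point as UTF-16.
// Assumption: cp is a valid scalar value; no error handling is attempted.
constexpr Utf16Units to_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Inside the BMP: a single 16-bit unit is enough.
        return {{static_cast<char16_t>(cp), 0}, 1};
    }
    // Outside the BMP: split the 20 remaining bits into a surrogate pair.
    const char32_t v = cp - 0x10000;
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))}, 2};
}

// Whether a single wchar_t can hold any code point on this compiler.
// Typically false with MSVC (16-bit wchar_t) and true with GCC or Clang
// on Linux and macOS (32-bit wchar_t).
constexpr bool wchar_holds_any_code_point = sizeof(wchar_t) >= 4;
```

With a 16-bit wchar_t, code that assumes "one wchar_t is one character" breaks as soon as to_utf16 returns a count of 2, which is exactly the case for everything in Extension B.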

Bregma

9,461

March 15, 2009 10:52 AM

Quote:Original post by iaminternets
I'm trying to ensure that the chat feature of my game will work with any language. Should I not use C++ to deal with text?

I would think that a chat program isn't going to do much parsing of content, so C++ and std::string is ideal.
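As a rough illustration of why that works: a std::string holding UTF-8 is just a byte sequence, and sending or storing it only needs the byte count. The sketch below (C++17, with a made-up function name count_code_points) shows the one thing to keep in mind, which is that the byte count and the number of code points differ. It assumes the input is already valid UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string_view>

// Count code points in a UTF-8 byte sequence by counting every byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: s is valid UTF-8; malformed input is not detected.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For example, "h\xC3\xA9llo" (the word "héllo") is 6 bytes on the wire but 5 code points. A chat server that only relays messages can work with size() alone; limits such as "maximum message length" are where the distinction starts to matter.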

Your problem will be rendering and input methods. How are you going to handle right-to-left rendering with liaisons and positional significance (eg. Arabic) or compound rendering (eg. Hangul)? How about mixed language?

Your choices of text rendering library and input method library are far more important than what language you're going to use to send and receive null-terminated sequences of bytes.

Stephen M. Webb
Professional Free Software Developer

Codeka

1,239

March 15, 2009 06:22 PM

Quote:Original post by Bregma
compound rendering (eg. Hangul)

Just FYI, the vast majority (all?) of Korean IMEs output pre-composed Hangul. You only need to use the decomposed form for Old-Hangul, which is basically an historic script.

That being said, you'll still need to be able to handle decomposed base + combining character sequences if you want to support, say, Vietnamese. The hardest part of this is not so much the rendering (which is pretty much automatic if you're using Win32) but supporting proper cursor movement and selection (i.e. pressing the right arrow key just once should jump over multiple code points). Again, if you're using regular Win32 (i.e. an EDIT control) that stuff is handled automatically. If you're writing your own GUI, it's going to be a lot of work (interfacing with IMEs is no simple task, either).
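To give an idea of what "jumping over multiple code points" involves in a hand-written GUI, here is a deliberately simplified sketch (C++17) for UTF-8 text. All function names are made up for this example. It treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, and it assumes valid, complete UTF-8. A real implementation would follow the grapheme cluster rules in Unicode Standard Annex #29, or hand the job to a library such as ICU.

```cpp
#include <cstddef>
#include <string_view>

// Byte length of the UTF-8 sequence that starts with lead byte b.
// Assumption: b really is a lead byte of valid UTF-8.
constexpr std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b >> 5) == 0x6) return 2;
    if ((b >> 4) == 0xE) return 3;
    return 4;
}

// Decode the code point starting at byte offset pos (valid input assumed).
constexpr char32_t decode_at(std::string_view s, std::size_t pos) {
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    const std::size_t len = sequence_length(b0);
    char32_t cp = (len == 1) ? b0 : (b0 & (0xFF >> (len + 1)));
    for (std::size_t i = 1; i < len; ++i) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    }
    return cp;
}

// Simplification: only U+0300..U+036F counts as a combining mark here.
constexpr bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Byte offset the cursor should land on after one press of the right arrow:
// step over the base character, then over any combining marks attached to it.
constexpr std::size_t next_cursor_pos(std::string_view s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    pos += sequence_length(static_cast<unsigned char>(s[pos]));
    while (pos < s.size() && is_combining_mark(decode_at(s, pos))) {
        pos += sequence_length(static_cast<unsigned char>(s[pos]));
    }
    return pos;
}
```

Take the decomposed Vietnamese "ế" followed by "t": the bytes are 'e', U+0302 (CC 82), U+0301 (CC 81), 't'. Starting at offset 0, one call to next_cursor_pos skips all three code points and lands on offset 5, so the user never ends up with the cursor between the letter and its accents.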

War Worlds • Journal

This topic is closed to new replies.
