iaminternets

Cross-platform UTF-8 in C++


From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner. It doesn't appear Boost has anything (sadly), and I'm not particularly in the mood for handling the logic myself! Have I been searching for the wrong terms, or is there no well-tested library for handling Unicode in C++?

Quote:
Original post by iaminternets
From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner.


It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

If you want to translate UTF-8 into glyphs for your display in a cross-platform manner, you're out of luck because your display is platform-specific. C++ has no knowledge of fonts, rendering, input methods, glyphs, or video technology.

If you want to parse a specific native-language script stored in Unicode, it's true that there is no portable way to do that. Parsing Arabic is certainly different from parsing French.

What are you trying to do, maybe we can make suggestions on how to make it easier?

Quote:
Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and that you need to use std::wstring to support Unicode characters?

~Jonathan

Quote:
Original post by Twisol
Quote:
Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and that you need to use std::wstring to support Unicode characters?

~Jonathan


Under Win32, wchar_t is 16 bits wide and represents a UTF-16 code unit. On Unix-like systems wchar_t is commonly 32 bits wide and represents a UTF-32 code unit.

So even std::wstring isn't portable; you have to use different codecs to store encoded characters read from a text file.

Quote:
Original post by Twisol
Quote:
Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and that you need to use std::wstring to support Unicode characters?
Right, but UTF-8 is an 8-bit encoding of Unicode and so fits perfectly into std::string. As others have mentioned, std::wstring isn't really portable.

Also, just being able to "support" UTF-8 via std::string doesn't give you much. For example, you'll need to use a library to be able to properly handle things such as cursor movement in edit boxes, rendering of the glyphs, and so on.

Quote:
Original post by Bregma
It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

No, it doesn't work. In fact, doing this makes all hell break loose. std::string is meant to store strings where every character is encoded using the same amount of space in the byte stream. UTF-8 being a variable-length encoding, where individual characters may use different amounts of space in the byte stream, will absolutely not work as expected with std::string. All string manipulation methods provided by the std::string class will fail as soon as your string contains a character that is not part of the ASCII subset: length() or size() will return garbage, the erase, insert, etc. methods will fail (they'll operate with wrong character offsets), and operator[] will fail.

The worst part of it all: if you prototype your software in English, you might not even notice that your code contains tons of bugs that will only manifest themselves as soon as you start i18n'ing your app.

You will have to parse the entire content of the std::string char by char, decoding the UTF-8 as appropriate. And this for all types of string manipulation you have to do, including simply querying the length. Basically, std::string degrades back to a simple C char array.

UTF-16 with wstrings is a much better choice.

Quote:
Original post by Yann L
UTF-16 with wstrings is a much better choice.


I would say a marginally better choice. With wchar_t you've got a platform-dependent data size, people will think that somehow the wchar_t functions in the C and C++ standard library will actually work in a sane manner in a global environment, and many of your objections to UTF-8 with std::string apply equally to UTF-16 with std::wstring when wchar_t is 16 bits. For a much better choice, go with a third-party i18n library like ICU.

Quote:

No, it doesn't work. In fact, doing this makes all hell break loose.


Not true, most common operations actually work. For example, splitting a string at "," or any other character in the 0-127 range is exactly the same. Splitting text using a string as the delimiter is exactly the same (even single-character strings). String find-and-replace works as expected. Sorting works as expected.

However, UTF-8 needs to be converted during rendering (and while doing text layout).

Quote:
Original post by SiCrane
I would say a marginally better choice. With wchar_t you've got a platform-dependent data size, people will think that somehow the wchar_t functions in the C and C++ standard library will actually work in a sane manner in a global environment, and many of your objections to UTF-8 with std::string apply equally to UTF-16 with std::wstring when wchar_t is 16 bits. For a much better choice, go with a third-party i18n library like ICU.

A valid point. However, one should also weigh the pros and cons of adding a monster library like ICU, which is often complete overkill for a smaller project. 16-bit wchar_t based UTF-16 can reliably represent all codepoints in the entire Basic Multilingual Plane without variable-length encodings. This covers 99.9% of all languages currently spoken on the planet. So for most cases, even a 16-bit encoding is perfectly sufficient. Now, if you absolutely need full support for all planes, you can still use UTF-32.

Quote:
Original post by mzeo77
Not true, most common operations actually work. For example, splitting a string at "," or any other character in the 0-127 range is exactly the same.

Try erasing the first 5 characters without parsing the UTF-8.

Quote:
Original post by mzeo77
Splitting text using a string as the delimiter is exactly the same (even single-character strings).

No, since byte offsets do not equal character positions in a UTF-8 string. It only works if you parse the string beforehand to find the correct offsets and lengths.

Quote:
Original post by mzeo77
String find and replace works as expected.

Yes.

Quote:
Original post by mzeo77
Sorting works as expected.

Depends on how you want your strings sorted. Lexicographical sorting is much more complex in Unicode, and needs full decoding of the UTF-8.

The simple fact that you cannot reliably index the string anymore makes it entirely useless. std::string, by definition, is designed to contain strings. Letting it degenerate into a container for generic byte data, while breaking most of its functionality, is nonsense. There are other, more appropriate containers for that. std::string is clearly the wrong tool for the job, since it was absolutely not designed to represent UTF-8 streams. The same applies to wstring for UTF-16, as SiCrane mentioned, but in a much less pronounced way.

Quote:
Original post by Yann L
A valid point. However, one should also weigh the pros and cons of adding a monster library like ICU, which is often complete overkill for a smaller project. 16-bit wchar_t based UTF-16 can reliably represent all codepoints in the entire Basic Multilingual Plane without variable-length encodings. This covers 99.9% of all languages currently spoken on the planet. So for most cases, even a 16-bit encoding is perfectly sufficient.

I have 885,000,000 people speaking Mandarin Chinese who might disagree with you, not to mention people using other Chinese dialects. There are about 41,000 characters in CJK Extension B that fall outside the BMP. In particular, the Chinese government requires that computer systems properly implement many characters outside the BMP, and has since at least around 2000. Maybe the people of China only speak 0.1% of the languages of the world (though given the number of dialects of Chinese, I doubt it), but they still represent 20% of the world's population.

Even sticking with European languages, using wchar_t and the standard C and C++ facilities for i18n means that your code will be non-portable. wchar_t is commonly defined to be 16-bit on some platforms and 32-bit on others. You can't even portably use the same locale identifiers from compiler to compiler. Hell, just trying to find a listing of locales for different compilers is a non-trivial exercise.

Quote:
Now, if you absolutely need full support for all planes, you can still use UTF-32.
If you need full support for all planes, you're going to need to use a full feature language library anyways, so you might as well use a variable encoding and save memory.

Everyone, thank you for taking the time to write in depth replies! I appreciate it.

I'm trying to ensure that the chat feature of my game will work with any language. Should I not use C++ to deal with text?

A language like Ruby seems like a bit of a time-saver for something like this.

Quote:
Original post by SiCrane
I have 885,000,000 people speaking Mandarin Chinese who might disagree with you, not to mention people using other Chinese dialects. There are about 41,000 characters in CJK extension B that fall outside the BMP.

AFAIK, you can perfectly well write CJK using only the BMP. But I may be wrong. If you specifically target the Chinese market, then the SIP might be required.

Quote:
Original post by SiCrane
Quote:
Now, if you absolutely need full support for all planes, you can still use UTF-32.
If you need full support for all planes, you're going to need to use a full feature language library anyways, so you might as well use a variable encoding and save memory.

Can't disagree here. However there are many cases, especially for hobby programmers, where an 'intermediate' Unicode support level is entirely sufficient. Including a library such as ICU can be a daunting task for a beginner, and is not always justified.

Quote:
Original post by Yann L
AFAIK, you can perfectly well write CJK using only the BMP. But I may be wrong. If you specifically target the Chinese market, then the SIP might be required.
It's mostly proper names and the names of streets/locations that you can't write with just the BMP.

As for Ruby: as far as I'm aware, it's still rather lacking in terms of proper Unicode support. It's getting closer than it was a couple of years ago, but nowhere near the level of something like Java or C#/.NET.

Quote:

Quote:

Original post by mzeo77
Sorting works as expected.


Depends on how you want your strings sorted. Lexicographical sorting is much more complex in Unicode, and needs full decoding of the UTF-8.


No, UTF-8 is constructed in such a way that a lexicographical byte-by-byte comparison generates the same ordering as a codepoint-by-codepoint comparison. And if you want a case-independent sort, the complexity lies not in decoding Unicode, but in the different ways different languages handle case.

UTF-16 also contains some multi-unit characters (surrogate pairs), and their encoding is more complicated than UTF-8's. The UCS-2 encoding could be used instead (the same as UTF-16, but without surrogate pairs). Sorting a UTF-16 string does not generally produce a lexicographical ordering by codepoint (due to the surrogate pairs).

Quote:

Try erasing the first 5 characters without parsing the UTF-8.


OK, this is not doable without knowledge of UTF-8, but it can be done without fully decoding it. And I claim this is not something you would need to do often.

Quote:

The simple fact that you cannot reliably index the string anymore makes it entirely useless. std::string, by definition, is designed to contain strings. Letting it degenerate into a container for generic byte data, while breaking most of its functionality, is nonsense. There are other, more appropriate containers for that. std::string is clearly the wrong tool for the job, since it was absolutely not designed to represent UTF-8 streams. The same applies to wstring for UTF-16, as SiCrane mentioned, but in a much less pronounced way.


Yes, direct indexing is a problem, but there is more to string handling than that, like text search and replace and similar operations, even though some functions are dangerous to use. std::string could be wrapped to guard against the dangerous ones. std::string also has no knowledge of whether it holds ISO-8859-1 or ISO-8859-15 or whatever code page it is encoded in, which can also be dangerous. Many applications can handle UTF-8 strings without knowing about UTF-8; for example, MSVC and GCC can parse strings without knowing they are UTF-8.

Quote:

From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner.


It's not as hellish as you think; decoding and encoding UTF-8 is quite simple.

Quote:
Original post by mzeo77
UTF-16 also contains some multi-unit characters (surrogate pairs), and their encoding is more complicated than UTF-8's. The UCS-2 encoding could be used instead (the same as UTF-16, but without surrogate pairs). Sorting a UTF-16 string does not generally produce a lexicographical ordering by codepoint (due to the surrogate pairs).
Note that lexicographical sorting in UTF-16 with surrogate pairs is no more difficult than lexicographical sorting of base + combining characters (that is, you handle the weights the same way). And if your plan is "full" Unicode support, you have to support base + combining chars anyway. If you can handle base + combining chars, you can handle surrogate pairs as well.

The same is true of things like character movement, search, etc.

Quote:
Original post by Yann L
Quote:
There are about 41,000 characters in CJK extension B that fall outside the BMP.

AFAIK, you can perfectly well write CJK using only the BMP. But I may be wrong.

Why do you think that the Unicode standard includes CJK extension B? To take up space for no reason? There's a reason why I included this sentence in my post:
Quote:
In particular, the Chinese government requires that computer systems properly implement many characters outside the BMP and has since at least around 2000.

Quote:
If you specifically target the Chinese market, then the SIP might be required.
SIP?

Quote:
Can't disagree here. However there are many cases, especially for hobby programmers, where an 'intermediate' Unicode support level is entirely sufficient. Including a library such as ICU can be a daunting task for a beginner, and is not always justified.

I know where you're coming from, but just adding ICU to your application isn't much harder than adding a library like SDL or SFML, and these are tasks that we, in this forum, expect beginners and hobby programmers to be able to do.

In any case, this "intermediate" Unicode support level that you're advocating has the important disadvantages of only really existing on Windows machines (non-Windows compilers tend to use 32-bit wchar_t types) and relies on some of the most poorly documented features in the standard library. Fun challenge: try to find a list of locale names usable with MSVC.

Quote:
Original post by iaminternets
I'm trying to ensure that the chat feature of my game will work with any language. Should I not use C++ to deal with text?

I would think that a chat program isn't going to do much parsing of content, so C++ and std::string is ideal.

Your problem will be rendering and input methods. How are you going to handle right-to-left rendering with ligatures and positional forms (e.g. Arabic), or compound rendering (e.g. Hangul)? How about mixed languages?

Your choice of text rendering library and input method library is far more important than what language you're going to use to send and receive null-terminated sequences of bytes.

Quote:
Original post by Bregma
compound rendering (eg. Hangul)


Just FYI, the vast majority (all?) of Korean IMEs output pre-composed Hangul. You only need to use the decomposed form for Old-Hangul, which is basically an historic script.

That being said, you'll still need to be able to handle decomposed base + combining character combinations if you want to support, say, Vietnamese. The hardest part of this is not so much the rendering (which is pretty much automatic if you're using Win32) but supporting proper cursor movement and selection (i.e. pressing the arrow key just once jumps over multiple codepoints). Again, if you're using regular Win32 (i.e. an INPUT control) that stuff is handled automatically. If you're writing your own GUI, it's going to be a lot of work (interfacing with IMEs is no simple task, either).

