Cross-platform UTF-8 in C++

21 comments, last by Codeka 15 years, 1 month ago
From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner. Boost doesn't appear to have anything (sadly), and I'm not particularly in the mood for handling the logic myself! Have I been searching for the wrong terms, or is there no well-tested library for handling Unicode in C++?
I love the 'nets.
I haven't found one that I've liked, but you could use Unicode's standard C functions.
If you don't mind a very large library, then ICU is an option, though it does much more than just provide support for UTF-8.
I find this library very nice to use: http://utfcpp.sourceforge.net/.
Quote:Original post by iaminternets
From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner.


It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.
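
For what it's worth, here is a minimal sketch of the storage side of that claim, assuming the text is already valid UTF-8. The helper names (naive_utf8, greet, contains, storage_check) are made up for illustration. Operations that treat the string as an opaque run of bytes, such as copying, concatenating, comparing, and searching for a complete substring, never split a code point:

```cpp
#include <cassert>
#include <string>

// "naïve" written out as UTF-8 bytes so the example does not depend on the
// encoding of the source file. U+00EF (ï) is the two bytes C3 AF.
inline std::string naive_utf8() {
    return std::string("na") + "\xC3\xAF" + "ve";
}

// Concatenation just appends bytes, so valid UTF-8 in gives valid UTF-8 out.
inline std::string greet(const std::string& name) {
    return "Hello, " + name + "!";
}

// Searching for a complete UTF-8 substring is a plain byte search.
inline bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Small self-check; call it from main() if you want to try it.
inline void storage_check() {
    const std::string name = naive_utf8();
    assert(contains(greet(name), name));
}
```

Note that this only covers storing and passing text around. Anything that needs to know where one character ends and the next begins is a separate problem.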

If you want to translate UTF-8 into glyphs for your display in a cross-platform manner, you're out of luck because your display is platform-specific. C++ has no knowledge of fonts, rendering, input methods, glyphs, or video technology.

If you want to parse a specific native-language script stored in Unicode, it's true that there is no portable way to do that. Parsing Arabic is certainly different from parsing French.

What are you trying to do? Maybe we can make suggestions on how to make it easier.

Stephen M. Webb
Professional Free Software Developer

Quote:Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and you need to use std::wstring to support Unicode characters?

~Jonathan
Quote:Original post by Twisol
Quote:Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and you need to use std::wstring to support Unicode characters?

~Jonathan


Under Win32, wchar_t is 16 bits wide and represents a UTF-16 code unit. On Unix-like systems wchar_t is commonly 32 bits wide and represents a UTF-32 code unit.

So even std::wstring isn't portable: you have to use different codecs to store encoded characters read from a text file.
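
A quick way to see the difference on your own machine is the sketch below. The function names (wchar_bits, units_for_emoji, check_wstring_units) are hypothetical, and the result depends on the compiler and platform, so treat the comments as typical values rather than guarantees:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t in bits for the platform this is compiled on
// (typically 16 on Windows and 32 on Linux and macOS).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of wchar_t units used for U+1F600, a code point outside the BMP.
// Where wchar_t holds UTF-16 this is a surrogate pair (2 units);
// where it holds UTF-32 it is a single unit.
inline std::size_t units_for_emoji() {
    const std::wstring s = L"\U0001F600";
    return s.size();
}

// Small self-check; call it from main() if you want to try it.
inline void check_wstring_units() {
    const std::size_t units = units_for_emoji();
    assert(units == 1 || units == 2);
}
```

The same source line produces a string of a different length depending on the platform, which is the portability problem in a nutshell.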
I've been building a little UTF string library. I only started it fairly recently, though, so it's not well tested. There are probably better alternatives. http://code.google.com/p/easl/
Quote:Original post by Twisol
Quote:Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and you need to use std::wstring to support Unicode characters?

Right, but UTF-8 encodes Unicode as a sequence of 8-bit code units, so it fits perfectly into std::string. As others have mentioned, std::wstring isn't really portable.

Also, just being able to "support" UTF-8 via std::string doesn't give you much. For example, you'll need to use a library to be able to properly handle things such as cursor movement in edit boxes, rendering of the glyphs, and so on.
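
As a rough illustration of the cursor-movement part, here is a sketch that steps a byte offset forwards or backwards by one code point. It assumes the string is valid UTF-8 and that the offset already sits on a code point boundary. The names is_continuation, next_boundary, and prev_boundary are made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Byte offset of the code point following the one that starts at pos.
inline std::size_t next_boundary(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos == s.size()) return pos;  // already at the end
    ++pos;                            // step past the lead byte
    while (pos < s.size() && is_continuation(static_cast<unsigned char>(s[pos]))) {
        ++pos;                        // skip the rest of the sequence
    }
    return pos;
}

// Byte offset of the code point preceding the one that starts at pos.
inline std::size_t prev_boundary(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos == 0) return 0;           // already at the start
    --pos;
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;                        // walk back to the lead byte
    }
    return pos;
}
```

This moves by code point, not by what a user sees as one character. Combining marks and similar sequences span several code points, which is a large part of why a proper library is still needed for edit boxes.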
Quote:Original post by Bregma
It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

No, it doesn't work. In fact, doing this makes all hell break loose. std::string is meant to store strings where every character is encoded using the same amount of space in the byte stream. UTF-8 is a variable-length encoding, where individual characters may use different amounts of space in the byte stream, so it will absolutely not work as expected with std::string. Every string manipulation method std::string provides misbehaves as soon as your string contains a character outside the ASCII subset: length and size return the byte count rather than the character count; erase, insert, and friends take byte offsets, so they can cut a character in half; and operator[] returns a single byte rather than a character.
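
To make the mismatch concrete, here is a small sketch, assuming valid UTF-8 input. code_point_count and byte_vs_code_point_demo are hypothetical names, not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a continuation
// byte (10xxxxxx). Only meaningful for valid UTF-8.
inline std::size_t code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Shows where std::string's byte view and the character view diverge.
inline void byte_vs_code_point_demo() {
    const std::string euro = "\xE2\x82\xAC";  // U+20AC EURO SIGN, one code point

    assert(euro.size() == 3);                 // size() counts bytes
    assert(code_point_count(euro) == 1);      // but there is only one character

    // operator[] yields the lead byte, not the character.
    assert(static_cast<unsigned char>(euro[0]) == 0xE2);

    // erase() with a byte offset can cut a character in half.
    std::string truncated = euro;
    truncated.erase(2);                       // drops only the last byte
    assert(truncated.size() == 2);            // leaves an incomplete sequence
}
```

For pure ASCII text the two views agree, which is why these bugs tend to stay hidden during English-only testing.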

Worst of all, if you prototype your software in English, you might not even notice that your code contains tons of bugs that only manifest once you start i18n'ing your app.

You will have to parse the entire content of the std::string char by char, decoding the UTF-8 as appropriate. And you must do this for every kind of string manipulation, including simply querying the length. Basically, std::string degrades back to a simple C char array.
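
For reference, the byte-by-byte decoding described here looks roughly like the sketch below. It is deliberately simplified: malformed or truncated sequences become U+FFFD, but overlong encodings and surrogate code points are not rejected, so don't treat it as a validating decoder. utf8_decode, kReplacement, and utf8_decode_check are names invented for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr char32_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Decodes a UTF-8 byte string into a sequence of code points.
inline std::u32string utf8_decode(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes still expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {                          // stray continuation or invalid lead byte
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        ++i;
        while (extra > 0 && i < s.size()) {
            const unsigned char c = static_cast<unsigned char>(s[i]);
            if ((c & 0xC0) != 0x80) break;  // sequence ended early
            cp = static_cast<char32_t>((cp << 6) | (c & 0x3F));
            --extra;
            ++i;
        }
        out.push_back(extra == 0 ? cp : kReplacement);
    }
    return out;
}

// Small self-check; call it from main() if you want to try it.
inline void utf8_decode_check() {
    assert(utf8_decode("abc").size() == 3);
}
```

Once the text is decoded, the length in code points is simply the size of the resulting std::u32string, at the cost of a full pass over the input.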

UTF-16 with wstrings is a much better choice.

