Cross-platform UTF-8 in C++

For Beginners

Started by iaminternets March 14, 2009 01:30 PM

21 comments, last by Codeka 15 years, 1 month ago

iaminternets

187

Author

March 14, 2009 01:30 PM

From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner. It doesn't appear Boost has anything (sadly), and I'm not particularly in the mood for handling the logic myself! Have I been searching for the wrong terms, or is there no well-tested library for handling Unicode in C++?

I love the 'nets.

fastcall22

10,918

March 14, 2009 01:41 PM

I haven't found one that I've liked, but you could use Unicode's standard C functions.

let_bound

488

March 14, 2009 02:00 PM

If you don't mind a very large library, then ICU is an option, though it does much more than just provide support for UTF-8.

bubu LV

1,436

March 14, 2009 02:17 PM

I find this library very nice to use: http://utfcpp.sourceforge.net/.

Bregma

9,461

March 14, 2009 02:59 PM

Quote:Original post by iaminternets
From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner.

It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

If you want to translate UTF-8 into glyphs for your display in a cross-platform manner, you're out of luck because your display is platform-specific. C++ has no knowledge of fonts, rendering, input methods, glyphs, or video technology.

If you want to parse a specific native-language script stored in Unicode, it's true that there is no portable way to do that. Parsing Arabic is certainly different from parsing French.

What are you trying to do? Maybe we can suggest ways to make it easier.

Stephen M. Webb
Professional Free Software Developer

Twisol

468

March 14, 2009 03:29 PM

Quote:Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

I could've sworn that std::string was based on char, and you need to use std::wstring to support unicode characters?

~Jonathan

asdfg__12

362

March 14, 2009 04:25 PM

Quote:Original post by Twisol
Quote:Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

I could've sworn that std::string was based on char, and you need to use std::wstring to support unicode characters?

~Jonathan

Under Win32, wchar_t is 16 bits wide and represents a UTF-16 code unit. On Unix-like systems wchar_t is commonly 32 bits wide and represents a UTF-32 code unit.
So even std::wstring isn't portable; you have to use different codecs to store encoded characters read from a text file.
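To make the difference concrete, here is a minimal sketch (C++11 or later) that checks the width of wchar_t at compile time. The names kWideCharBits, WideEncoding and guess_wide_encoding are made up for illustration, and the mapping from width to encoding is only the rule of thumb described above, not something the C++ standard guarantees.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t kWideCharBits = sizeof(wchar_t) * CHAR_BIT;

// Hypothetical labels for what a wchar_t string most likely holds.
enum class WideEncoding { Utf16, Utf32, Unknown };

// Rule of thumb only: 16-bit wchar_t suggests UTF-16 code units,
// 32-bit wchar_t suggests UTF-32 code units.
constexpr WideEncoding guess_wide_encoding() {
    return kWideCharBits == 16 ? WideEncoding::Utf16
         : kWideCharBits == 32 ? WideEncoding::Utf32
                               : WideEncoding::Unknown;
}
```

Code that reads or writes text files would branch on this (or on a platform macro) to pick the matching conversion, which is exactly the portability cost being described.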

mackron

122

March 14, 2009 04:55 PM

I've been building a little UTF string library. Though, I only started it fairly recently, so it's not well tested. There are probably better alternatives. http://code.google.com/p/easl/

Codeka

1,239

March 14, 2009 05:02 PM

Quote:Original post by Twisol
Quote:Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

I could've sworn that std::string was based on char, and you need to use std::wstring to support unicode characters?

Right, but UTF-8 is an 8-bit encoding of Unicode and so fits perfectly into std::string. As others have mentioned, std::wstring isn't really portable.
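As a rough illustration of what "fits into std::string" means in practice, the sketch below stores UTF-8 as plain bytes. It assumes C++17 (for std::string_view), and the names kNaiveUtf8, is_ascii and to_owned are invented for this example. The point is only that the container holds the bytes untouched; it knows nothing about characters.

```cpp
#include <string>
#include <string_view>

// "naïve" written as explicit UTF-8 bytes: 'ï' (U+00EF) is 0xC3 0xAF.
// The literal is split so the hex escapes cannot swallow following letters.
constexpr std::string_view kNaiveUtf8 = "na" "\xC3\xAF" "ve";

// True if every byte is below 0x80, i.e. the text is plain ASCII.
constexpr bool is_ascii(std::string_view bytes) {
    for (unsigned char b : bytes) {
        if (b >= 0x80) return false;
    }
    return true;
}

// std::string stores the bytes unchanged; size() then reports bytes
// (6 for kNaiveUtf8), not the 5 characters a reader would count.
inline std::string to_owned(std::string_view utf8) {
    return std::string(utf8);
}
```

Copying, concatenating and comparing such strings works byte for byte, which is why storage and transport are the easy part.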

Also, just being able to "support" UTF-8 via std::string doesn't give you much. For example, you'll need to use a library to be able to properly handle things such as cursor movement in edit boxes, rendering of the glyphs, and so on.
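For a feel of what cursor movement involves, here is a small sketch that steps a byte offset to the next or previous code point boundary. It assumes C++17 and well-formed UTF-8, and the names (kCursorSample, is_continuation, next_boundary, prev_boundary) are hypothetical. A real edit box should move over whole grapheme clusters (for example a base letter plus combining marks), which needs Unicode data from a library such as ICU; this only shows the byte-level part.

```cpp
#include <cstddef>
#include <string_view>

// "a€b" as UTF-8 bytes: the euro sign (U+20AC) is 0xE2 0x82 0xAC.
constexpr std::string_view kCursorSample = "a" "\xE2\x82\xAC" "b";

// Continuation bytes in UTF-8 always look like 10xxxxxx.
constexpr bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Byte offset of the code point boundary after pos (clamped to size()).
constexpr std::size_t next_boundary(std::string_view s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    ++pos;
    while (pos < s.size() && is_continuation(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
    return pos;
}

// Byte offset of the code point boundary before pos (clamped to 0).
constexpr std::size_t prev_boundary(std::string_view s, std::size_t pos) {
    if (pos > s.size()) pos = s.size();
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

With kCursorSample, moving right from offset 1 lands on offset 4, skipping all three bytes of the euro sign in one step.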

War Worlds • Journal

Yann L

1,806

March 14, 2009 05:11 PM

Quote:Original post by Bregma
It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

No, it doesn't work. In fact, doing this makes all hell break loose. std::string is meant to store strings where every character is encoded using the same amount of space in the byte stream. UTF-8, being a variable-length encoding where individual characters may use different amounts of space in the byte stream, will absolutely not work as expected with std::string. All string manipulation methods provided by the std::string class will misbehave as soon as your string contains a character outside the ASCII subset: length and size will return byte counts rather than character counts; erase, insert, and similar methods will operate on the wrong character offsets; and operator[] will return individual bytes rather than characters.

The worst part of all this: if you prototype your software in English, you might not even notice that your code contains tons of bugs that will only manifest themselves once you start i18n'ing your app.

You will have to parse the entire content of the std::string byte by byte, decoding the UTF-8 as appropriate, and you will have to do this for every kind of string manipulation, including simply querying the length. Basically, std::string degrades back to a simple C char array.
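The byte-by-byte parsing described here could look roughly like the sketch below. It assumes C++17, and Decoded, count_code_points, decode_at and kDecodeSample are names made up for this example. The decoder is deliberately simple: it returns U+FFFD for a bad lead byte, a truncated sequence or a missing continuation byte, but it does not reject overlong encodings or surrogates, so treat it as an illustration rather than a validator.

```cpp
#include <cstddef>
#include <string_view>

// "a€b" as UTF-8 bytes: the euro sign (U+20AC) is 0xE2 0x82 0xAC.
constexpr std::string_view kDecodeSample = "a" "\xE2\x82\xAC" "b";

// Counts code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes well-formed UTF-8.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// A decoded code point and the number of bytes it occupied.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Decodes the sequence starting at byte offset pos.
constexpr Decoded decode_at(std::string_view s, std::size_t pos) {
    constexpr Decoded bad{0xFFFD, 1};  // replacement character, skip one byte
    if (pos >= s.size()) return {0xFFFD, 0};
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t length = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { length = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; }
    else return bad;                          // stray continuation or invalid lead
    if (pos + length > s.size()) return bad;  // sequence cut off
    for (std::size_t i = 1; i < length; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return bad;   // expected a continuation byte
        cp = (cp << 6) | (b & 0x3F);          // append the low six bits
    }
    return {cp, length};
}
```

To walk a whole string, call decode_at repeatedly and advance the offset by the returned length. Note that the sample holds 5 bytes but only 3 code points, which is the gap between size() and the "length" a user expects.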

UTF-16 with wstrings is a much better choice.

This topic is closed to new replies.
