iaminternets

Cross-platform UTF-8 in C++

This topic is 3416 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner. It doesn't appear that Boost has anything (sadly), and I'm not particularly in the mood for handling the logic myself! Have I been searching for the wrong terms, or is there no well-tested library for handling Unicode in C++?

Quote:
Original post by iaminternets
From what I've been reading, it's a hellish task to support UTF-8 in C++, especially in a cross-platform manner.


It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

If you want to translate UTF-8 into glyphs for your display in a cross-platform manner, you're out of luck because your display is platform-specific. C++ has no knowledge of fonts, rendering, input methods, glyphs, or video technology.

If you want to parse a specific native-language script stored in Unicode, it's true that there is no portable way to do that. Parsing Arabic is certainly different from parsing French.

What are you trying to do? Maybe we can suggest ways to make it easier.

Quote:
Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and that you need to use std::wstring to support Unicode characters?

~Jonathan

Quote:
Original post by Twisol
Quote:
Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and that you need to use std::wstring to support Unicode characters?

~Jonathan


Under Win32, wchar_t is 16 bits wide and represents a UTF-16 code unit. On Unix-like systems wchar_t is commonly 32 bits wide and represents a UTF-32 code unit.

So even std::wstring isn't portable: the encoding it holds differs by platform, and you have to use a different codec on each one to store characters read from a text file.
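
To see the difference on your own toolchain, here is a minimal sketch. The helper names wchar_bits and units_for_emoji are made up for illustration; the only assumption is a C++11 compiler.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of one wchar_t, in bits, on the platform this is compiled for.
// Commonly 16 with MSVC on Windows and 32 with GCC or Clang on Linux.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of wchar_t code units the compiler emits for U+1F600, a code
// point outside the Basic Multilingual Plane. With a 16-bit wchar_t the
// literal becomes a surrogate pair (2 units); with a 32-bit wchar_t it
// is a single unit.
std::size_t units_for_emoji() {
    const std::wstring emoji = L"\U0001F600";
    return emoji.size();
}
```

The same source line produces a std::wstring of a different length depending on the platform, which is why code that indexes into a wstring can behave differently after a port.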

Quote:
Original post by Twisol
Quote:
Original post by Bregma
Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.


I could've sworn that std::string was based on char, and that you need to use std::wstring to support Unicode characters?


Right, but UTF-8 encodes Unicode as a sequence of 8-bit code units, so it fits perfectly into std::string. As others have mentioned, std::wstring isn't really portable.

Also, just being able to "support" UTF-8 via std::string doesn't give you much. For example, you'll need to use a library to be able to properly handle things such as cursor movement in edit boxes, rendering of the glyphs, and so on.
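
As a rough illustration of what "fits into std::string" means in practice, here is a small sketch. The helpers make_cafe and contains are hypothetical names, and the UTF-8 bytes are written as explicit escapes so the result doesn't depend on the source file's encoding.

```cpp
#include <string>

// "café" encoded as UTF-8. The é (U+00E9) occupies two bytes, 0xC3 0xA9,
// so the string holds five bytes for four characters.
std::string make_cafe() {
    return "caf\xC3\xA9";
}

// Byte-wise substring search. std::string::find compares raw bytes and
// knows nothing about code points.
bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Storing, copying, concatenating, and searching for a whole UTF-8 substring all work on the raw bytes. Anything that needs to know where one character ends and the next begins is where a library comes in.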

Quote:
Original post by Bregma
It depends.

Supporting Unicode in C++ in a cross-platform manner is as easy as using std::string. It just works, you don't have to worry about it.

No, it doesn't work. In fact, doing this makes all hell break loose. std::string is meant to store strings where every character takes the same amount of space in the byte stream. UTF-8 is a variable-length encoding, where individual characters may take different amounts of space, so it will absolutely not work as expected with std::string. All the string manipulation methods of std::string go wrong as soon as your string contains a character outside the ASCII subset: length() and size() return the number of bytes rather than the number of characters; erase(), insert(), and the like operate on byte offsets rather than character offsets; and operator[] returns a single byte, which may be only part of a character.

The worst part of all this: if you prototype your software in English, you might not even notice that your code contains tons of bugs that will only show up once you start i18n'ing your app.

You will have to parse the entire content of the std::string char by char, decoding the UTF-8 as appropriate. And you have to do this for every kind of string manipulation, including simply querying the length. Basically, std::string degrades to a simple C char array.
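
The char-by-char walk described above can be sketched roughly like this. The names utf8_length and utf8_offset are invented for the example, and the code assumes the input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a UTF-8
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes s holds well-formed UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Returns the byte offset at which the code point with the given index
// starts, or s.size() if the string has fewer code points than that.
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == index) {
                return i;
            }
            ++seen;
        }
    }
    return s.size();
}
```

Both functions are linear in the length of the string, so a character-indexed erase or insert first has to translate its index through something like utf8_offset before calling the byte-based std::string method.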

UTF-16 with wstrings is a much better choice.
