C++ Handling Unicode

5 comments, last by Elarys 11 years, 9 months ago
I'm slowly learning C++ (so apologies for the newbie question), but having come from VB.net and C# I'm very used to having access to Unicode as standard. I haven't really got that far into C++ yet and I might be jumping ahead too fast. (Please tell me if I am)

In C# I just use a string datatype, it's supported, no problems encountered. But I would like to move away from C# and to C++.

So how should I best handle Unicode in C++? Ironically enough, looking for an answer online has confused me even more. I've read recommendations to use the Boost library or ICU for Unicode support. I've also read that C++11 supports Unicode as standard; if so, does that mean that once I upgrade to VS2012 I won't need to worry about the problem?

The current applications I develop read data from a Unicode MS SQL Server 2008 and the data contains multiple languages, sometimes even within the same string. English/Japanese/Korean is not uncommon in my data. If I can get over this one hurdle, that would be a big leap forward for me in being able to move to C++ and away from .net.

TL;DR:

With C++11 coming along, do I still need to learn Boost or ICU to support Unicode data in my applications?
Assuming you're using Visual Studio, all new C++ projects are set to Unicode by default.

Most Win32 functions that take strings come in two versions: an "A" version that accepts text in the system's ANSI code page, and a "W" version that accepts UTF-16. Which one a call resolves to depends on the project's character set setting.

Go to Project->Properties (Alt+F7), then Configuration Properties->General. Under Project Defaults in the right-hand pane you'll see "Character Set", which defaults to "Use Unicode Character Set".

Also note that there are numerous Unicode functions built into Windows, many of which are extremely useful for things like Unicode normalisation etc.

Be aware, though, that Unicode is quite a lot easier to work with in .NET languages. That said, using the Win32 functions isn't actually that hard; it just takes a little while to get used to.

EDIT - also search for things like "surrogate pairs" and "combining characters". These are very relevant when dealing with Unicode.
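To make "surrogate pairs" concrete, here is a minimal sketch of how a single code point turns into UTF-16 code units. The function name `encode_utf16` is made up for this illustration and is not part of Win32 or the standard library; it assumes the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points above U+FFFF do not fit in 16 bits, so they are split into a
// high surrogate (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF).
std::u16string encode_utf16(char32_t cp) {
    // Assumption: the caller passes a valid scalar value
    // (at most U+10FFFF and not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // one code unit is enough
    } else {
        const char32_t v = cp - 0x10000;           // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low 10 bits
    }
    return out;
}
```

With this, U+65E5 (a common kanji) comes out as one code unit, while U+1F600 (an emoji) comes out as the two units 0xD83D 0xDE00. That is why the length of a UTF-16 string is not always the number of code points in it.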
C++ has std::string, which holds UTF-8 just fine out of the box, and std::wstring, which (depending on your compiler and platform) will give you either UTF-16 or UTF-32: wchar_t is 16 bits with Visual C++ and typically 32 bits on Linux and macOS. For the most part, sticking with std::string and UTF-8 is the way to go.
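As a rough sketch of what "out of the box" means here: std::string simply stores the UTF-8 bytes and never interprets them. The helper names `nihongo_utf8` and `wide_unit_bytes` are invented for this example, and the text is spelled with hex escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the three characters U+65E5 U+672C U+8A9E as raw
// UTF-8 bytes. Each of them takes three bytes in UTF-8.
std::string nihongo_utf8() {
    return "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
}

// Hypothetical helper: size of one std::wstring element on this platform.
// This is 2 with Visual C++ and usually 4 with GCC or Clang on Linux.
constexpr std::size_t wide_unit_bytes() {
    return sizeof(wchar_t);
}
```

Here `nihongo_utf8().size()` is 9, not 3, because size() counts bytes. Copying, concatenating, and byte-wise searching all work unchanged; for example, `find` locates the second character's bytes at offset 3.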

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Thanks for the answers. I'm sticking with Visual Studio for now as I like the IDE, but I would like to look into cross-platform development in the future. That's not a priority while I'm learning, though.


@ApochPiQ: If std::string supports UTF-8 out of the box, what is the purpose of the Unicode support provided by Boost/ICU? Maybe I'm just misunderstanding the 'support' they provide.

Would you consider the amount of memory used to be irrelevant and a non-issue on modern computers?

From Wikipedia on UTF-8:
"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters"

Judging from that, and given that there is a chance 90% of my data will be Japanese in the near future, should I seriously consider UTF-16 over UTF-8? If so, would std::wstring be the better choice in my situation?
If all you want to do is move UTF8 strings around, std::string is fine. It's when you get into things like counting the number of characters in a string (not the same as code points or bytes!), or converting between encodings, and so on, that external libraries become important.
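To illustrate the difference between bytes, code points, and characters, here is a small sketch that counts code points in a UTF-8 string. `count_code_points` is a made-up name, and the function assumes its input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string.
// Assumption: the input is well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that does
// NOT match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string `"e\xCC\x81"` (the letter e followed by U+0301, a combining acute accent), size() reports 3 bytes and this function reports 2 code points, yet a reader sees a single character. Getting that last number right requires grapheme cluster rules, which is the kind of job a library such as ICU is meant for.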


As for the "extra space" needed by UTF8: unless and until you are handling millions of pages of text, this will make zero practical difference.
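For a sense of scale, here is a rough sketch that computes how many bytes a piece of text needs in each encoding. The names `utf8_bytes` and `utf16_bytes` are invented for this example, and both functions assume the input holds valid code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to store the text as UTF-8.
std::size_t utf8_bytes(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) {
        if (cp < 0x80)         total += 1;  // ASCII
        else if (cp < 0x800)   total += 2;
        else if (cp < 0x10000) total += 3;  // most CJK text lands here
        else                   total += 4;
    }
    return total;
}

// Hypothetical helper: bytes needed to store the text as UTF-16.
std::size_t utf16_bytes(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) {
        total += (cp < 0x10000) ? 2 : 4;    // 4 means a surrogate pair
    }
    return total;
}
```

For the three characters `U"\u65E5\u672C\u8A9E"` this gives 9 bytes in UTF-8 against 6 in UTF-16, while `U"Hello"` gives 5 against 10. Pure Japanese text is therefore about 50% larger in UTF-8, and any ASCII mixed in (digits, punctuation, English words) narrows the gap.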


Here's what you can currently do:
http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals
http://members.shaw.ca/akochoi/articles/unicode-processing-c++0x/
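As a quick taste of the new string literals described in the first link, here is a minimal sketch. It assumes a compiler that actually implements them (support arrived at different times in different compilers), and the constant names are made up for the example. The characters are written as universal character names so the source file encoding does not matter.

```cpp
#include <cstddef>
#include <string>

// The same three characters (U+65E5 U+672C U+8A9E) in the C++11 literal forms.

// u"..." produces UTF-16 code units (char16_t).
const std::u16string kUtf16Text = u"\u65E5\u672C\u8A9E";

// U"..." produces UTF-32 code units (char32_t), one per code point.
const std::u32string kUtf32Text = U"\u65E5\u672C\u8A9E";

// u8"..." produces UTF-8. Its element type is char up to C++17 and char8_t
// from C++20 on, so only the size is used here to stay portable.
// The "- 1" drops the terminating null.
constexpr std::size_t kUtf8ByteCount = sizeof(u8"\u65E5\u672C\u8A9E") - 1;

// A code point above U+FFFF needs \UXXXXXXXX and two UTF-16 code units.
const std::u16string kEmojiUtf16 = u"\U0001F600";
```

Here `kUtf16Text.size()` and `kUtf32Text.size()` are both 3, `kUtf8ByteCount` is 9, and `kEmojiUtf16.size()` is 2. The literals only fix the encoding of the stored data; the standard library still offers nothing for normalisation or character counting, which is where ICU or Boost come in.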
Excellent, thanks for the straight-up, easy and realistic advice, guys. :-)

Matt-D, the second article you linked is fantastic; it seems he ran into the same issues I had:
"A Web search of C++ and Unicode produces the standard recommendation to use ICU, Qt, or Boost."

This topic is closed to new replies.