Elarys

C++ Handling Unicode


Elarys    108
I'm slowly learning C++ (so apologies for the newbie question), but having come from VB.NET and C#, I'm very used to having Unicode available as standard. I haven't got very far into C++ yet, so I might be jumping ahead too fast. (Please tell me if I am.)

In C# I just use the string data type: Unicode is supported and I've encountered no problems. But I would like to move away from C# and over to C++.

So how should I best handle Unicode in C++? Ironically enough, looking for an answer online has confused me even more. I've read people recommend the Boost library, or ICU, if you want Unicode support. I've also read that C++11 supports Unicode as standard. If so, does that mean that once I upgrade to VS2012 I won't need to worry about the problem?

The applications I currently develop read data from an MS SQL Server 2008 database that stores Unicode text, and the data contains multiple languages, sometimes even within the same string. English/Japanese/Korean is not uncommon in my data. If I can get over this one hurdle, that would be a big leap forward for me in being able to move to C++ and away from .NET.

TL;DR:

With C++11 coming along, do I still need to learn Boost or ICU to support Unicode data in my applications?

mark ds    1786
Assuming you're using Visual Studio, new C++ projects are set to use Unicode by default.

Most Win32 functions that take strings come in two versions: an "A" version that accepts strings in the system's ANSI code page, and a "W" version that accepts Unicode as UTF-16. Which version the unsuffixed name maps to depends on compiler settings.

Go to Project -> Properties (Alt+F7), then Configuration Properties -> General. Under "Project Defaults" in the right-hand pane you'll see "Character Set", which by default is set to "Use Unicode Character Set".

Also note that Windows has numerous built-in Unicode functions, many of which are extremely useful for things like Unicode normalisation.

Be aware, though, that Unicode is quite a lot easier to work with in the .NET languages. That said, using the Win32 functions isn't actually that hard; it just takes a little while to get used to.

EDIT: Also search for things like "surrogate pairs" and "combining characters". These are very relevant when dealing with Unicode.

ApochPiQ    23004
C++ has std::string, which holds UTF-8 just fine out of the box (it simply stores bytes), and std::wstring, which (depending on your compiler and platform) will give you either UTF-16 or UTF-32 code units. For the most part, sticking with std::string and UTF-8 is the way to go.

Elarys    108
Thanks for the answers. I am sticking with Visual Studio for now as I like the IDE, but I would like to look into cross-platform development in the future. That's not a priority while I'm learning, though.


@ApochPiQ: If std::string supports UTF-8 out of the box, what is the purpose of the Unicode support provided by Boost/ICU? Maybe I'm just misunderstanding the 'support' they provide.

Would you consider the amount of memory used to be irrelevant and a non-issue on modern computers?

From Wikipedia on UTF-8:
[quote]Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters[/quote]

Judging from that, and given that there is a chance 90% of the data will be Japanese in the near future, should I seriously consider UTF-16 over UTF-8? In that case, would std::wstring be the better choice for my situation?

ApochPiQ    23004
If all you want to do is move UTF8 strings around, std::string is fine. It's when you get into things like counting the number of characters in a string (not the same as code points or bytes!), or converting between encodings, and so on, that external libraries become important.


As for the "extra space" needed by UTF8: unless and until you are handling millions of pages of text, this will make zero practical difference.

Matt-D    1574
Here's what you can currently do:
http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals
http://members.shaw.ca/akochoi/articles/unicode-processing-c++0x/

Elarys    108
Excellent, thanks for the straightforward, realistic advice, guys. :-)

Matt-D, the second article you linked is fantastic; it seems the author ran into the same issues I had.
[quote]A Web search of C++ and Unicode produces the standard recommendation to use ICU, Qt, or Boost.[/quote]
