C++ Handling Unicode


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

6 replies to this topic

#1 Elarys   Members   -  Reputation: 108


Posted 16 July 2012 - 02:55 PM

I'm slowly learning C++ (so apologies for the newbie question), but having come from VB.NET and C# I'm very used to having access to Unicode as standard. I haven't really got that far into C++ yet and I might be jumping ahead too fast. (Please tell me if I am.)

In C# I just use the string datatype: Unicode is supported and I've had no problems. But I would like to move away from C# and to C++.

So how should I best handle Unicode in C++? Ironically enough, looking for an answer online has confused me even more. I've read people recommending the Boost library, or ICU, if you want Unicode support. I've also read that C++11 supports Unicode as standard; if so, does that mean that once I upgrade to VS2012 I won't need to worry about the problem?

The current applications I develop read data from a Unicode MS SQL Server 2008 and the data contains multiple languages, sometimes even within the same string. English/Japanese/Korean is not uncommon in my data. If I can get over this one hurdle, that would be a big leap forward for me in being able to move to C++ and away from .net.

TL;DR >

With C++11 coming along, do I still need to learn Boost or ICU to support Unicode data in my applications?


#2 mark ds   Members   -  Reputation: 1069


Posted 16 July 2012 - 03:48 PM

Assuming you're using Visual Studio, new C++ projects use the Unicode character set by default.

Most Win32 functions that take strings come in two versions: an "A" version that takes text in the current ANSI code page, and a "W" version that takes UTF-16. Which one the unsuffixed name (for example, MessageBox) maps to depends on compiler settings.

Go to Project->Properties (alt-F7)...Configuration Properties->General. Under Project defaults in the right hand pane you'll see "character set", which will by default be set to "Use Unicode Character Set".

Also note that there are numerous Unicode functions built into Windows, many of which are extremely useful for things like Unicode normalisation.

Be aware, though, that the .NET languages make working with Unicode quite a lot easier. That said, using the Win32 functions isn't actually that hard; it just takes a little while to get used to.

EDIT - also search for things like "surrogate pairs" and "combining characters". These are very relevant when dealing with Unicode.
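
To make "surrogate pairs" concrete, here is a minimal sketch of how a single code point turns into UTF-16 code units. The function name to_utf16 is made up for this illustration, and it assumes the input is a valid code point (no error handling).

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Hypothetical helper for illustration; assumes cp is a valid code point
// (at most U+10FFFF and not itself in the surrogate range).
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Does not fit: split the remaining 20 bits across two code units.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this, U+65E5 (a common kanji) comes out as one code unit, while U+1F600 (an emoji) comes out as the pair 0xD83D 0xDE00. Combining characters are a separate issue: "e" followed by U+0301 is two code points in any encoding, yet displays as a single character.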

Edited by mark ds, 16 July 2012 - 03:53 PM.


#3 ApochPiQ   Moderators   -  Reputation: 14247


Posted 16 July 2012 - 04:21 PM

C++ has std::string, which handles UTF8 just fine out of the box, and std::wstring, which (depending on your compiler and platform) will give you either UTF16 or UTF32. For the most part, sticking with std::string and UTF8 is the way to go.
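
A small sketch of what "works out of the box" means here: std::string simply stores the UTF-8 bytes, and byte-oriented operations such as size(), concatenation and find() keep working. The helper name japanese_utf8 is invented for this example, and the text is spelled out as hex escapes so the result does not depend on the source file encoding.

```cpp
#include <string>

// Returns the three-character Japanese word "nihongo" as raw UTF-8 bytes.
// Hypothetical helper for illustration; each of the three characters
// (U+65E5, U+672C, U+8A9E) takes three bytes in UTF-8.
std::string japanese_utf8() {
    return "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
}
```

Note that japanese_utf8().size() is 9, not 3: std::string counts bytes. Searching for the second character's bytes, find("\xE6\x9C\xAC"), returns byte offset 3.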

#4 Elarys   Members   -  Reputation: 108


Posted 16 July 2012 - 04:57 PM

Thanks for the answers. For now I am sticking with Visual Studio as I like the IDE; I would like to look into cross-platform development in the future, but that's not a priority while I'm learning.


@ApochPiQ: If std::string supports UTF-8 out of the box, what is the purpose of the Unicode support provided by Boost/ICU? Maybe I'm just misunderstanding the 'support' they provide.

Would you consider the amount of memory used to be irrelevant and a non-issue on modern computers?

From Wikipedia on UTF-8:

"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters."


Judging from that, and given there is a chance that 90% of my data will be Japanese in the near future, should I seriously consider UTF-16 over UTF-8? Might std::wstring be the better choice in my situation?
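
The size trade-off in the Wikipedia quote can be checked with a few lines of arithmetic. This is only a sketch: the names utf8_size, utf16_size, EncodedSizes and encoded_sizes are invented for the example, and the input is assumed to hold valid code points.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_size(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // includes most CJK characters
    return 4;
}

// Bytes needed to encode one code point in UTF-16.
std::size_t utf16_size(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // 4 bytes means a surrogate pair
}

// Hypothetical result type for this illustration.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
};

// Total encoded size of a sequence of code points in both encodings.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes total{0, 0};
    for (char32_t cp : text) {
        total.utf8 += utf8_size(cp);
        total.utf16 += utf16_size(cp);
    }
    return total;
}
```

For three kanji, U"\u65E5\u672C\u8A9E", this gives 9 bytes in UTF-8 against 6 in UTF-16. As soon as ASCII is mixed in (markup, digits, spaces), the balance shifts: U"<b>\u65E5</b>" is 10 bytes in UTF-8 and 16 in UTF-16.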

#5 ApochPiQ   Moderators   -  Reputation: 14247


Posted 16 July 2012 - 05:18 PM

If all you want to do is move UTF8 strings around, std::string is fine. It's when you get into things like counting the number of characters in a string (not the same as code points or bytes!), or converting between encodings, and so on, that external libraries become important.
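
As a rough illustration of the bytes versus code points distinction, the sketch below counts code points in a UTF-8 std::string by skipping continuation bytes. The name count_code_points is made up for this example, and it assumes well-formed UTF-8 with no validation. It does not count user-perceived characters; that needs the Unicode segmentation rules, which is where a library like ICU comes in.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT of the form
// 10xxxxxx, so counting those lead bytes counts code points.
// Assumes well-formed UTF-8; malformed input gives a meaningless result.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the nine bytes "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" this returns 3. For "e\xCC\x81" (the letter e followed by the combining acute accent U+0301) it returns 2, even though the text displays as one character.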


As for the "extra space" needed by UTF8: unless and until you are handling millions of pages of text, this will make zero practical difference.

#6 Matt-D   Crossbones+   -  Reputation: 1410


Posted 16 July 2012 - 06:08 PM

Here's what you can currently do:
http://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals
http://members.shaw.ca/akochoi/articles/unicode-processing-c++0x/
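
For a quick taste of the new string literals the first link describes, here is a minimal sketch, assuming a compiler that implements the C++11 u"..." and U"..." literals. The function names are invented for the example, and the characters are written as \u escapes so the source file encoding does not matter.

```cpp
#include <string>

// C++11 adds UTF-16 (u"...") and UTF-32 (U"...") string literals,
// with matching std::u16string and std::u32string types.
// The function names below are hypothetical, for illustration only.
std::u16string nihongo_utf16() { return u"\u65E5\u672C\u8A9E"; }
std::u32string nihongo_utf32() { return U"\u65E5\u672C\u8A9E"; }

// A code point above U+FFFF: one element in UTF-32,
// two elements (a surrogate pair) in UTF-16.
std::u16string emoji_utf16() { return u"\U0001F600"; }
std::u32string emoji_utf32() { return U"\U0001F600"; }
```

Both nihongo functions return strings of size 3, while emoji_utf16() has size 2 and emoji_utf32() has size 1. There is also a u8"..." literal for UTF-8; it is left out here because its element type changed from char to char8_t in C++20.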

#7 Elarys   Members   -  Reputation: 108


Posted 17 July 2012 - 08:31 AM

Excellent, thanks for the straight-up, easy and realistic advice, guys. :-)

Matt-D, the second article you linked is fantastic; it seems he ran into the same issues I had:

A Web search of C++ and Unicode produces the standard recommendation to use ICU, Qt, or Boost.





