
a few questions on unicode, utf-8 and portability...




My aim is to write an app that can support Unicode character sets (including Chinese, Japanese etc.). I understand that UTF-8 is one of the most popular ways of encoding/storing Unicode data these days, and I've done some research, but I'm unclear on a few points. If someone could clarify/help with the following I'd be grateful...

1) My app needs to run on Windows, Linux and OSX (I'm using C++ with SDL and OpenGL). I understand Windows uses 2 bytes to store Unicode characters, but Unicode chars can potentially take up to 6 bytes. If I use UTF-8 as an external encoding (e.g. my config files) but use Windows' UTF-16 internally, won't I only be able to support a smaller subset of Unicode characters in my app?

2) If (1) is correct, can I still utilise most of the major languages/charsets in use today using just 16 bits to store each character (e.g. Chinese/Japanese/Korean etc.), or do they have many characters out of the 16-bit range?

3) As I'm developing for Windows, Linux and OSX, if someone has set a filename of a program to run (or config file to read in) containing Unicode characters, is there a standard way (in C++) to open/run these files in Win32/Linux/OSX, or will I have to write platform-specific code for each OS?

Thanks for your help,
jimbogd

Quote:
1) My app needs to run on Windows, Linux and OSX (I'm using C++ with SDL and OpenGL). I understand Windows uses 2 bytes to store Unicode characters, but Unicode chars can potentially take up to 6 bytes. If I use UTF-8 as an external encoding (e.g. my config files) but use Windows' UTF-16 internally, won't I only be able to support a smaller subset of Unicode characters in my app?


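No. UTF-8 and UTF-16 can represent the same set of characters: every Unicode code point.

They do this by being variable-length encodings. UTF-8 uses 1 to 4 bytes per code point (the older 5- and 6-byte forms are no longer valid), and UTF-16 uses one 16-bit unit for code points up to U+FFFF and a pair of 16-bit units (a surrogate pair) for everything above that.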

See:
http://unicode.org/faq/utf_bom.html#14
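To make the point concrete, here is a minimal sketch of both encoders. The function names `encode_utf8` and `encode_utf16` are made up for illustration, and the code assumes its input is a valid code point (no surrogate or range checks). It shows that a character outside the 16-bit range still fits in either encoding; it just takes more code units.

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-8 (1 to 4 bytes).
// Assumes cp is a valid code point; no validation is done here.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16 (1 or 2 units).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);  // fits in a single 16-bit unit
    } else {
        cp -= 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

For example, U+4E2D (a common Chinese character) takes 3 bytes in UTF-8 and one unit in UTF-16, while U+20BB7 (a CJK Extension B character) takes 4 bytes in UTF-8 and a surrogate pair in UTF-16. Neither encoding loses anything.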

Quote:
3) As I'm developing for Windows, Linux and OSX, if someone has set a filename of a program to run (or config file to read in) containing Unicode characters, is there a standard way (in C++) to open/run these files in Win32/Linux/OSX, or will I have to write platform-specific code for each OS?


Not all characters are legal in all file systems.

I haven't used it, but I'm guessing Boost.Filesystem is the closest thing to a portable answer:
http://www.boost.org/libs/filesystem/doc/index.htm

It isn't perfect: reading the path grammar, they note that characters which are illegal under Windows are left as-is, and the operation simply fails.

Portability:
http://www.boost.org/libs/filesystem/doc/portability_guide.htm
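If you'd rather not pull in a library, the platform-specific part is small. Below is a rough sketch, not a tested cross-platform implementation: `open_utf8` is a made-up helper name, and it assumes you keep file names as UTF-8 in your own code. On Windows it converts to UTF-16 with `MultiByteToWideChar` and calls `_wfopen`; elsewhere it assumes the file system accepts UTF-8 bytes as-is, which is the usual case on Linux and OSX.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: open a file whose name is given as UTF-8.
// Returns nullptr on failure, like std::fopen.
std::FILE* open_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    // Windows file APIs take UTF-16, so convert the name and the mode first.
    auto widen = [](const std::string& s) {
        int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, nullptr, 0);
        std::wstring w(n > 0 ? n : 1, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
        return w;
    };
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    // Assumption: the narrow-character API accepts UTF-8 names directly.
    return std::fopen(path.c_str(), mode);
#endif
}
```

Usage would be something like `open_utf8("\xE4\xB8\xAD\xE6\x96\x87.txt", "r")`. Characters that the target file system rejects will still make the open fail, so check the result.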

HTH.
