jimbogd

a few questions on unicode, utf-8 and portability...


My aim is to write an app that can support unicode character sets (including Chinese, Japanese etc.). I understand that UTF-8 is one of the most popular ways of encoding/storing unicode data these days, and I've done some research but I'm unclear on a few points. If someone could clarify/help with the following I'd be grateful...

1) My app needs to run on Windows, Linux and OSX (I'm using C++ with SDL and OpenGL). I understand windows uses 2 bytes to store unicode characters, but unicode chars can potentially take up to 6 bytes. If I use UTF-8 as an external encoding (e.g. my config files) but use Windows' UTF-16 internally, won't I only be able to support a smaller subset of unicode characters in my app?

2) If (1) is correct, can I still utilise most of the major languages/charsets in use today using just 16-bits to store each character (e.g. chinese/japanese/korean etc.), or do they have many characters out of the 16-bit range?

3) As I'm developing for Windows, Linux and OSX, if someone has set a filename of a program to run (or config file to read in) containing unicode characters, is there a standard way (in c++) to open/run these files in Win32/Linux/OSX, or will I have to write platform specific code for each OS?

Thanks for your help,
jimbogd

Quote:
1) My app needs to run on Windows, Linux and OSX (I'm using C++ with SDL and OpenGL). I understand windows uses 2 bytes to store unicode characters, but unicode chars can potentially take up to 6 bytes. If I use UTF-8 as an external encoding (e.g. my config files) but use Windows' UTF-16 internally, won't I only be able to support a smaller subset of unicode characters in my app?


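No. UTF-8 and UTF-16 can represent the same set of characters: every Unicode code point from U+0000 to U+10FFFF.

They do this with variable-length sequences. UTF-8 uses one to four bytes per code point (the original design allowed up to six, but the current standard caps it at four). UTF-16 uses a single 16-bit unit for code points up to U+FFFF and a surrogate pair (two 16-bit units) for everything above that, so it is not limited to what fits in 2 bytes.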

See:
http://unicode.org/faq/utf_bom.html#14
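To make the point concrete, here is a minimal sketch of both encodings for a single code point. The function names `encode_utf8` and `encode_utf16` are made up for illustration, and the code assumes its input is a valid code point (no surrogates, nothing above U+10FFFF), so treat it as a demonstration rather than a production encoder.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode one code point as UTF-8 (1 to 4 bytes).
// Assumes cp is a valid Unicode scalar value; no validation is done.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 (1 or 2 sixteen-bit units).
// Code points above U+FFFF become a high/low surrogate pair.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value, split 10 bits / 10 bits
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

For example, U+4E2D (a common Chinese character) comes out as three bytes in UTF-8 and one unit in UTF-16, while U+20000 (outside the 16-bit range) comes out as four bytes in UTF-8 and two units in UTF-16. Either way nothing is lost.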

Quote:
3) As I'm developing for Windows, Linux and OSX, if someone has set a filename of a program to run (or config file to read in) containing unicode characters, is there a standard way (in c++) to open/run these files in Win32/Linux/OSX, or will I have to write platform specific code for each OS?


Not all characters are legal in all file systems.

I haven't used it, but my guess is that Boost.Filesystem is the place to start:
http://www.boost.org/libs/filesystem/doc/index.htm

It isn't perfect: according to its path grammar, characters that are illegal under Windows are left as-is and the operation simply fails.

Portability:
http://www.boost.org/libs/filesystem/doc/portability_guide.htm
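If you end up doing it by hand instead, the platform-specific part can be kept quite small. The sketch below is only an illustration, not something taken from Boost: `open_utf8` is a hypothetical helper name, and it assumes you keep paths as UTF-8 in a `std::string`. On Windows it converts to UTF-16 with `MultiByteToWideChar` and calls `_wfopen`; elsewhere it assumes the file system accepts UTF-8 byte strings and passes the path straight to `fopen`.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: open a file whose path is given as UTF-8.
// Returns nullptr on failure, like fopen.
std::FILE* open_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    // First call asks for the required length, including the terminator.
    int n = MultiByteToWideChar(CP_UTF8, 0, path.c_str(), -1, nullptr, 0);
    if (n <= 0) {
        return nullptr;  // path was not valid UTF-8
    }
    std::wstring wpath(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, path.c_str(), -1, &wpath[0], n);

    // Mode strings such as "rb" are plain ASCII, so widening is trivial.
    std::string m(mode);
    std::wstring wmode(m.begin(), m.end());
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Assumption: the OS treats file names as bytes and UTF-8 is in use.
    return std::fopen(path.c_str(), mode);
#endif
}
```

Calling `open_utf8(name, "rb")` then looks the same on all three platforms. Note that some MSVC configurations flag `_wfopen` as deprecated in favour of `_wfopen_s`, so you may need to adjust for that.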

HTH.
