STL and Unicode in a cross-platform world

Nairou    430
Unicode is something I've always avoided because of the complications, but I'm at the point where I really need to figure it out, especially with the proliferation of UTF-8 files. Up until now I've been using std::string, avoiding char* as much as possible and only breaking down to char* when dealing with OS things like filenames. Obviously a plain char* doesn't work for any Unicode variant, and I seem to remember std::string not supporting anything either (can I get this confirmed? What about UTF-8?).

What I really want is UTF-8 support. I like keeping my native-English strings as small as possible, not having *every* character be two or more bytes. Also, wide characters mean dealing with endian issues and each OS's own Unicode variant; I'm guessing that OS support would be a lot simpler if left to UTF-8?

However, I still don't know how to handle Unicode or UTF-8 in my code. Do I assume all strings are UTF-8 and just look for extended-character bytes? Do I have to convert everything to wide characters internally, and convert back and forth to UTF-8 whenever I input or output text or files? Is there an STL method for using UTF-8 strings, or do I need to create my own string class for this? I've heard of std::wstring, but that appears to be wide characters only. I'm totally lost here; any input or direction would be greatly appreciated.

blaze02    100
For all my character conversions, I use MultiByteToWideChar or WideCharToMultiByte. The first param is a code page that informs the function how to translate the characters (I've only used CP_ACP). This works great for converting between char and WCHAR.

There may be more options for UTF-8 or anything else you need. Hope that helps.

Edit: Another code page is CP_UTF8. Seems like that will do any conversions you need.
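
For example, a pair of helpers (a rough sketch - Windows-only, the function names are my own, and I'm writing it from memory, so treat it as untested):

#include <windows.h>
#include <string>

// Ask the API how big the output needs to be, then convert for real.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

std::string WideToUtf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), &utf8[0], len, NULL, NULL);
    return utf8;
}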

Nairou    430
Yes, but those are Windows functions, aren't they? I need my method to be cross-platform, as I'll be doing a lot of testing on Linux.

Also, using those functions means you are internally using wide characters throughout your program, and just converting to another character set for I/O, right? I'd really like to find some pros and cons about using wide character strings versus UTF-8 (multibyte) strings throughout the program. Other than the encoding/decoding of UTF-8, it *seems* like it would be less hassle and more efficient than using wide characters everywhere. But I could be wrong.

Nitage    1107
std::string can store any UTF-8 text on a platform that uses 8-bit bytes, and so can a null-terminated char*, since the null byte only ever appears in UTF-8 as the encoding of the null character itself.

The thing you don't seem to be getting is that you don't really have to do anything special with UTF-8 strings until you come to display them graphically.

When you write a utf8 string to a file, you don't have to worry about how many glyphs you're writing - just how many bytes. Similarly, you don't have to worry when you read a file - you just read it into a string.
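
For example, something like this is all the file handling you need (a minimal sketch - the helper names are my own; binary mode keeps the runtime from touching the bytes):

#include <fstream>
#include <iterator>
#include <string>

std::string ReadFileUtf8(const char* path)
{
    // Binary mode: we want the raw UTF-8 bytes, untouched.
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

void WriteFileUtf8(const char* path, const std::string& text)
{
    std::ofstream out(path, std::ios::binary);
    out.write(text.data(), (std::streamsize)text.size());
}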

The difficult stuff begins when you want to write a UTF-8 string to std::wcout. The C++ standard doesn't specify what encoding std::wstring uses, so you'll need to convert the UTF-8 string to UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, or whatever other encoding is required - that bit has to be platform-specific (stupid C++ standard...).

Likewise, if you're using OpenGL or DirectX to render characters, you'll need UTF-8-aware functions to calculate which characters to display and how wide a string is.
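
If you do need the individual code points for glyph lookup, the decoding loop is short. A rough sketch (my own simplified version - it assumes the input is valid UTF-8 and does no error checking):

#include <string>
#include <vector>

std::vector<unsigned int> DecodeUtf8(const std::string& s)
{
    std::vector<unsigned int> codepoints;
    for (std::string::size_type i = 0; i < s.size(); )
    {
        unsigned char c = s[i];
        unsigned int cp;
        int extra; // number of continuation bytes that follow
        if      (c < 0x80) { cp = c;        extra = 0; } // 0xxxxxxx: plain ASCII
        else if (c < 0xE0) { cp = c & 0x1F; extra = 1; } // 110xxxxx: 2-byte sequence
        else if (c < 0xF0) { cp = c & 0x0F; extra = 2; } // 1110xxxx: 3-byte sequence
        else               { cp = c & 0x07; extra = 3; } // 11110xxx: 4-byte sequence
        for (int j = 0; j < extra; ++j)
            cp = (cp << 6) | ((unsigned char)s[++i] & 0x3F); // 10xxxxxx continuation bytes
        ++i;
        codepoints.push_back(cp);
    }
    return codepoints;
}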

Nairou    430
Interesting... So then, internally, I can use regular std::strings, leave all of my string constants unchanged, and only deal with UTF-8 when interpreting or generating strings for input or output? Huh, that almost seems too easy; I thought there was more management involved than that. It makes me wonder, then, why people use all of the wide character functions and wide character string types rather than doing this? If I stick with UTF-8 then I don't have to deal with any of that, do I? This is sounding exciting...

rollo    366
There are some more things to think about when you use UTF-8 in something designed around one byte = one character. A lot of methods on std::string won't return correct results, since some characters are encoded as multiple bytes: indexing into the string, getting the total length, extracting substrings, and so on.
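
For example (Utf8Length is my own helper, not anything in the STL - it relies on every continuation byte having the bit pattern 10xxxxxx):

#include <cstddef>
#include <iostream>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
std::size_t Utf8Length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if ((s[i] & 0xC0) != 0x80) // not a continuation byte => a new character starts here
            ++count;
    return count;
}

int main()
{
    std::string s = "na\xC3\xAFve";       // "naïve" - the ï is two bytes in UTF-8
    std::cout << s.size() << "\n";        // prints 6 (bytes)
    std::cout << Utf8Length(s) << "\n";   // prints 5 (characters)
}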

Antheus    2409
Quote:
Original post by Nairou
It makes me wonder, then, why people use all of the wide character functions and wide character string types rather than doing this? If I stick with UTF-8 then I don't have to deal with any of that, do I?


Variable-length characters are costlier for some manipulations.

Say you want to find or replace a character. With fixed-width characters it's just a matter of comparing native machine types (8-, 16-, or 32-bit ints); with variable-length characters you have to parse the string sequentially, checking each byte to see how long the current character is.

Replacing a character may require re-encoding (or at least re-allocating) the entire string.

Indexing into a UTF-8 string also requires walking it from the beginning.
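
Something like this, for illustration (my own sketch, assuming valid UTF-8):

#include <cstddef>
#include <string>

// Returns the byte offset of the n-th character (0-based) in a UTF-8 string.
// Note this is O(n) - unlike a fixed-width string, you can't jump straight there.
std::string::size_type Utf8Index(const std::string& s, std::size_t n)
{
    std::string::size_type i = 0;
    while (i < s.size() && n > 0)
    {
        ++i;
        // Skip the continuation bytes (10xxxxxx) of the character we just passed.
        while (i < s.size() && (s[i] & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i; // == s.size() if n was out of range
}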

Comparisons of UTF-8 strings can also be more costly.

UTF-8 is meant for storage, not manipulation. And even there, the cost of storage is insignificant in almost all cases, so the savings don't offset the run-time costs of working with a variable-length encoding.

And multi-platform software that does rely on Unicode will simply use whichever method fits its problem - coding time isn't the issue here. Define your own conversion functions, or your own Unicode storage class.

Nitage    1107
Quote:
Original post by rollo
A lot of methods on std::string won't return correct results, since some characters are encoded as multiple bytes: indexing into the string, getting the total length, extracting substrings, and so on.

std::string::size() will return the number of bytes in a UTF-8 string, not the number of characters or glyphs, and operator[] will index by byte, not by character - but you'd be surprised how rarely indexing by character is required. Substring replacement will still work.
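
For example (just an illustration - the escapes are the two UTF-8 bytes of 'é'; byte-level find works because the encoding of one character can never appear in the middle of another's bytes):

#include <iostream>
#include <string>

int main()
{
    std::string s = "caf\xC3\xA9 au lait";           // "café au lait"
    std::string::size_type pos = s.find("\xC3\xA9"); // find the é by its bytes
    if (pos != std::string::npos)
        s.replace(pos, 2, "e");                      // swap its 2 bytes for a plain 'e'
    std::cout << s << "\n";                          // prints "cafe au lait"
}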

