jonbell

UTF8 Problems


MultiByteToWideChar() is giving me strange results. When I pass a string containing the pound character (£), the pound symbol is removed:

strcpy(szStr, "a£a");
MultiByteToWideChar(CP_UTF8, 0, szStr, (int)(strlen(szStr) + 1), wsz, sizeof(wsz)/sizeof(wsz[0]));

After this, wsz contains "aa". Why?

When you write "a£a" in your source code, that pound symbol is not a Unicode (multi-byte) encoding for the pound symbol, but just some extended ASCII byte (single byte in range 128-255), which is rendered as a pound symbol in whatever ASCII encoding you're using (i.e. doing things the old-fashioned way with "code pages").

So when MultiByteToWideChar() tries to read that as a UTF-8 sequence, it sees some value > 128 which it probably figures is the first byte of a 2-byte sequence, but the next 'a' does not have high bits set in the way that indicates 'second byte of a 2-byte sequence'. So the pound character is concluded to be invalid.

Or so I'm guessing. I have no experience with the method in question (if that's what's happening, it *should* report an error instead, really); I just happen to know a thing or two about UTF-8 encoding. :)

Alternative: if your IDE supports it, use Unicode source files for such things. Literals containing non-ASCII characters (i.e. codes > 127) are not portable between different code pages and should be avoided. It's best not to hardcode such string literals at all; instead, save them in a separate resource file encoded as UTF-8.

This is probably the only portable and safe way that correctly addresses the problem.

The text is coming from a generated XML file which I have no control over. It was my understanding that the pound symbol was within the single-byte ASCII range.

I find it extremely bizarre that UTF-8 cannot handle this symbol. I am in the UK, although that surely shouldn't matter, because UTF-8 should contain symbols for every nation (except perhaps some Chinese symbols), right?

The function calls definitely do not fail. Any ideas how to solve this?

UTF-8 can handle the pound just fine; it handles all Unicode characters just fine. It simply represents all characters over 127 as multiple bytes (the pound symbol is character 163). Your string literal is encoded as ANSI, not UTF-8, so you're going to have problems if you try to treat it as UTF-8 and decode it. Passing CP_ACP instead of CP_UTF8 might give you better results.

If this is coming from an XML file, I'd check that the XML file is actually encoded as UTF-8. If it is, and if it really contains invalid character data, you'll have little choice left but to slap its author with a wet fish.

I am probably in danger of sounding stupid here, but why on earth does it treat values > 127 as 2 bytes when it only takes one byte to represent them?

Here's my current problem as it stands. I have UTF-8 encoded XML which I read back using TinyXML (which does support UTF-8). However, TinyXML returns C-style strings, which I then convert to WCHARs using MultiByteToWideChar(CP_UTF8, ...).

This seems to work fine for genuine multibyte chars, but I am losing the £ signs. Ideas?

Quote:
Original post by jonbell
I am probably in danger of sounding stupid here, but why on earth does it treat values > 127 as 2 bytes when it only takes one byte to represent them?
Because there are more than 256 characters that need representation? [smile]
There actually is a compression scheme for Unicode that results in around one byte per character, but UTF-8 isn't it.
Quote:
Original post by jonbell
I have UTF8 encoded XML which I read back using TinyXML (which does support UTF8). However TinyXML returns C style strings...
I'm not familiar with TinyXML, but I'd expect any character arrays to be in ANSI encoding unless returned by a function whose name suggests otherwise (like GetUTF8String() or such). Try calling MultiByteToWideChar with CP_ACP instead of CP_UTF8.

Guest Anonymous Poster
Quote:
Original post by jonbell
I am probably in danger of sounding stupid here, but why on earth does it treat values > 127 as 2 bytes when it only takes one byte to represent them?

It doesn't. There are 1,114,112 possible code points (U+0000 through U+10FFFF) in the Unicode system, where "code point" is roughly equivalent to a "character" in C terminology.

In UTF-8, a character may be between one and four bytes long.
Quote:

This seems to work fine for genuine multibyte chars but I am losing the £ signs. Ideas?

Because your £ isn't stored as a genuine multibyte character; it's a single extended-ASCII byte, which is not valid UTF-8.

In other words, whatever you're reading isn't UTF-8. You need to change CP_UTF8 to whatever code page it really is.

Just in case that wasn't clear...

