WarAmp

Converting UTF-8 to UTF-16

Does anyone here have any experience converting between UTF-8 and UTF-16 strings? I am getting data in UTF-8, but my font rendering library requires a UTF-16 string. I'm not sure what to google for, as most of my results are other forum posts around the web asking the same question with no good answers. Are there any Windows libraries or C/C++ library functions to do this? I have found MultiByteToWideChar, but passing CP_UTF8 as the code page either isn't working the way I expect (translating UTF-8 to UTF-16) or my incoming data is bad (I am assured that it is just fine). Anyone?

MultiByteToWideChar *should* (is meant to) do it, if you specify CP_UTF8 (65001) as the input code page.
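For reference, here is a minimal sketch of the usual two-call pattern: ask for the required length first, then convert into a buffer of that size. The helper name widen_utf8 is made up for illustration, and the #ifdef _WIN32 guard is only there because the API is Windows-only. One thing worth checking in your own call: with CP_UTF8 the flags argument must be 0 or MB_ERR_INVALID_CHARS, and anything else makes the call fail with ERROR_INVALID_FLAGS. This sketch assumes Windows Vista or later; as far as I know, older versions only accept 0 here.

```cpp
#ifdef _WIN32 // MultiByteToWideChar exists only on Windows
#include <windows.h>
#include <cassert>
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a UTF-8 std::string to a UTF-16 std::wstring.
std::wstring widen_utf8(const std::string& utf8)
{
    if(utf8.empty()) { return std::wstring(); }
    // The API takes the input length as an int.
    assert(utf8.size() <= static_cast<std::size_t>(INT_MAX));
    const int in_len = static_cast<int>(utf8.size());

    // First call: no output buffer, so the return value is the number of
    // UTF-16 code units required.
    const int out_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, NULL, 0);
    if(out_len == 0) { throw std::runtime_error("invalid UTF-8 or bad arguments"); }

    // Second call: performs the conversion into the sized buffer.
    std::wstring utf16(static_cast<std::size_t>(out_len), L'\0');
    if(MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                           utf8.data(), in_len, &utf16[0], out_len) == 0)
    {
        throw std::runtime_error("conversion failed");
    }
    return utf16;
}
#endif
```

Because an explicit input length is passed rather than -1, the output is not null-terminated by the API; std::wstring takes care of that itself. If the call returns 0, GetLastError() tells you whether the flags, the buffer size, or the data was the problem.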

Or you could convert it by hand, with something along the lines of (not tested):

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>
typedef std::uint8_t uint8;
typedef std::uint16_t uint16;
typedef std::uint32_t uint32;

// Returns the six payload bits of the continuation byte (10xxxxxx) at utf8[i], then advances i.
static uint32 continuation(const std::vector<uint8>& utf8, std::size_t& i)
{
    const uint8 byte(utf8[i++]);
    if((byte & 0xc0) != 0x80) { throw std::runtime_error("illegal continuation byte"); }
    return byte & 0x3f;
}

std::vector<uint16> utf8_to_utf16(const std::vector<uint8>& utf8)
{
    std::vector<uint16> utf16;
    utf16.reserve(utf8.size()); // worst case: each UTF-8 byte yields one UTF-16 code unit
    for(std::size_t i(0), end(utf8.size()); i < end;)
    {
        const uint8 byte(utf8[i]);
        uint32 code_point(0);
        // Note the parentheses: == binds tighter than &, so (byte & mask) must be grouped.
        if((byte & 0x80) == 0x00) // 0xxxxxxx
        {
            code_point = utf8[i++];
        }
        else if((byte & 0xe0) == 0xc0) // 110xxxxx 10xxxxxx
        {
            if(end - i < 2) { throw std::runtime_error("truncated string"); }
            code_point = (utf8[i++] & 0x1f) << 6; // keep the five payload bits of the lead byte
            code_point |= continuation(utf8, i);
            if(code_point < 0x80) { throw std::runtime_error("illegal encoding"); } // overlong form
        }
        else if((byte & 0xf0) == 0xe0) // 1110xxxx 10xxxxxx 10xxxxxx
        {
            if(end - i < 3) { throw std::runtime_error("truncated string"); }
            code_point = (utf8[i++] & 0x0f) << 12;
            code_point |= continuation(utf8, i) << 6;
            code_point |= continuation(utf8, i);
            if(code_point < 0x800) { throw std::runtime_error("illegal encoding"); } // overlong form
        }
        else if((byte & 0xf8) == 0xf0) // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        {
            if(end - i < 4) { throw std::runtime_error("truncated string"); }
            code_point = (utf8[i++] & 0x07) << 18;
            code_point |= continuation(utf8, i) << 12;
            code_point |= continuation(utf8, i) << 6;
            code_point |= continuation(utf8, i);
            if(code_point < 0x10000) { throw std::runtime_error("illegal encoding"); } // overlong form
        }
        else
        {
            // Stray continuation bytes end up here, as do the obsolete 5- and 6-byte forms
            // (lead bytes 0xf8 and up), which RFC 3629 no longer allows and which could
            // only encode values above 0x10ffff anyway.
            throw std::runtime_error("illegal lead byte");
        }
        if(code_point > 0x10ffff)
        {
            throw std::runtime_error("non-ISO character");
        }
        else if(code_point >= 0x10000)
        {
            // Surrogate pair; 0xd7c0 is 0xd800 - (0x10000 >> 10).
            utf16.push_back(static_cast<uint16>(0xd7c0 + (code_point >> 10)));
            utf16.push_back(static_cast<uint16>(0xdc00 | (code_point & 0x3ff)));
        }
        else if(code_point >= 0xd800 && code_point < 0xe000)
        {
            throw std::runtime_error("unpaired surrogate");
        }
        else
        {
            utf16.push_back(static_cast<uint16>(code_point));
        }
    }
    return utf16;
}
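
In case the 0xd7c0 constant in the surrogate branch looks odd: the textbook recipe subtracts 0x10000 from the code point, then adds 0xd800 to the top ten bits of the remainder and 0xdc00 to the bottom ten. Since 0xd800 - (0x10000 >> 10) is 0xd7c0, the subtraction can be folded into the constant, and the low ten bits are unaffected because 0x10000 is a multiple of 0x400. A small sketch comparing the two forms (the function and type names are mine, for illustration only):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

typedef std::pair<std::uint16_t, std::uint16_t> surrogate_pair; // (high, low)

// Textbook form: subtract 0x10000, then split the remaining 20 bits.
surrogate_pair surrogates_textbook(std::uint32_t code_point)
{
    assert(code_point >= 0x10000 && code_point <= 0x10ffff);
    const std::uint32_t v = code_point - 0x10000;
    return surrogate_pair(static_cast<std::uint16_t>(0xd800 + (v >> 10)),
                          static_cast<std::uint16_t>(0xdc00 + (v & 0x3ff)));
}

// Folded form, matching the constants used in utf8_to_utf16.
surrogate_pair surrogates_folded(std::uint32_t code_point)
{
    assert(code_point >= 0x10000 && code_point <= 0x10ffff);
    return surrogate_pair(static_cast<std::uint16_t>(0xd7c0 + (code_point >> 10)),
                          static_cast<std::uint16_t>(0xdc00 | (code_point & 0x3ff)));
}
```

Both give 0xd83d 0xde00 for U+1F600, and 0xd800 0xdc00 for U+10000, the first code point that needs a pair.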

I use the ICU library's u_strFromUTF8() function to convert from UTF-8 to UTF-16. ICU is available here. The project is open source with a very liberal license, so it should be possible to take the source out of the library and use it directly if you desire.

Guest Anonymous Poster
MultiByteToWideChar(CP_UTF8, ...) should work fine in your case. You will get UTF-16 out; for characters in the Basic Multilingual Plane that is identical to UCS-2.

Usage of MultiByteToWideChar is well documented: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp

If you have problems, post a short sample that anyone can copy/paste and try.
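
When putting such a sample together, it helps to include the exact input bytes, since data that is "assured to be fine" can still carry a byte order mark or turn out to be Latin-1. A rough sketch of a dump helper (the name hex_dump is made up here):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: formats each byte as two lowercase hex digits,
// separated by single spaces.
std::string hex_dump(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for(std::size_t i = 0; i < bytes.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if(i != 0) { out += ' '; }
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}
```

For example, "é" correctly encoded as UTF-8 dumps as c3 a9. A lone e9 means the data is Latin-1 (or Windows-1252) rather than UTF-8, and a leading ef bb bf is a UTF-8 byte order mark.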
