### #Actual Demos Sema

Posted 02 August 2013 - 12:07 PM

Oh Windows. wchar_t, wide strings, and so on are NOT the only way, or even the best way, to "do Unicode". In general you should expect a Unicode-aware strlen-type function to step through the whole string and compute the length as it goes. Even with wide characters, some code points take more than one wchar_t: on Windows a wchar_t is a 16-bit UTF-16 code unit, so anything outside the Basic Multilingual Plane is stored as a surrogate pair. The obvious and least painful solution is to just avoid taking string lengths. That is particularly good advice because some characters have length one or two depending on how you look at them; an accented letter, for example, can be a single precomposed code point or a base letter followed by a combining mark. Another option is to convert to UTF-32 strings (32 bits per code point), take the length, and convert back when you are done.
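To make the "step through the string" idea concrete, here is a minimal sketch for UTF-8. It assumes the input is already valid UTF-8 and does no validation, and the name `count_code_points` is made up for illustration. It counts every byte that is not a continuation byte (`10xxxxxx`), which works out to one count per code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumes valid UTF-8; malformed input is not detected.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

With this, the precomposed "é" (bytes C3 A9) counts as one code point, while "e" followed by the combining acute accent U+0301 (bytes 65 CC 81) counts as two, even though both render the same. That is the "length one or two" problem, and counting code points does not make it go away.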

In general I use Unicode mode in Windows just for the extra static checking, but IMMEDIATELY convert from wchar_t strings to UTF-8 narrow strings for use inside my code. On recent versions of Visual Studio you can use std::wstring_convert to do the narrowing/widening. If you need to use GCC you are going to have to use the C multibyte conversion functions (std::wcsrtombs and std::mbsrtowcs from <cwchar>). These convert according to the current C locale, so you only get UTF-8 out if the locale set with setlocale uses UTF-8. What follows is my implementation of widen and narrow; the commented-out code is the same function written with std::wstring_convert, and it likely has fewer bugs. The C version may well have overrun bugs and so on (but at least it ACTUALLY narrows/widens, unlike most examples on the net).

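    #include <cwchar>
    #include <stdexcept>
    #include <string>

    // Both functions convert according to the current C locale (LC_CTYPE).
    inline std::string narrow(const std::wstring& wstr) {
        std::mbstate_t state = std::mbstate_t();
        const wchar_t* src = wstr.c_str();
        // With a null destination this only reports the byte count, excluding the terminator.
        const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
        if (len == static_cast<std::size_t>(-1))
            throw std::runtime_error("narrow: character not representable in current locale");
        std::string nstr(len, '\0');
        state = std::mbstate_t();
        std::wcsrtombs(&nstr[0], &src, len, &state);
        return nstr;
        //this stuff does not work in GCC, FML
        //std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
        //return converter.to_bytes(wstr);
    }
    inline std::wstring widen(const std::string& nstr) {
        std::mbstate_t state = std::mbstate_t();
        const char* src = nstr.c_str();
        // With a null destination this only reports the wide character count, excluding the terminator.
        const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
        if (len == static_cast<std::size_t>(-1))
            throw std::runtime_error("widen: invalid multibyte sequence for current locale");
        std::wstring wstr(len, L'\0');
        state = std::mbstate_t();
        std::mbsrtowcs(&wstr[0], &src, len, &state);
        return wstr;
        //again does not work on GCC because reasons
        //std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
        //return converter.from_bytes(nstr);
    }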


See http://utf8everywhere.org/ for how to handle text and an explanation of how weird Windows is.

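Edit: I think the best bet for actually getting character lengths is to go ahead and widen your strings and then take the length of the wide string. Keep in mind that on Windows this counts 16-bit UTF-16 units, so code points outside the Basic Multilingual Plane still count as two. Yes, this is a lot of slow conversions, but strings are slow /anyways/.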
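If the wide string is UTF-16, as it is on Windows, its length counts code units, not code points. A rough sketch of a count that treats each surrogate pair as one code point could look like the following. It uses std::u16string so it behaves the same on every platform, it assumes well-formed UTF-16, and `count_code_points_utf16` is a made-up name for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string.
// Assumes well-formed UTF-16 (every high surrogate is followed by a low one).
inline std::size_t count_code_points_utf16(const std::u16string& text) {
    std::size_t count = 0;
    for (char16_t unit : text) {
        // Low surrogates (0xDC00-0xDFFF) are the second half of a pair; skip them.
        if (unit < 0xDC00 || unit > 0xDFFF) ++count;
    }
    return count;
}
```

For example, U+1F600 is stored as two UTF-16 units, so the string's size() is 2 while this function reports 1.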

