dmatter

Posted 02 August 2013 - 12:45 PM

wchar_t is just a wide (generally 2-byte, sometimes 4-byte) character type. It hasn't really got a lot to do with Unicode; it was inherited from C and predates widespread Unicode adoption.

 

Unicode is really a character set with a family of encodings (UTF-8, UTF-16 and UTF-32); each encoding is an algorithm that requires parsing and interpreting bytes.

UTF-8, for example, can be stored in regular chars. Some Unicode code points need up to 4 bytes in UTF-8; most common ones need 1 or 2.

 

ASCII allowed us to use a fixed-width system: 1 character == 1 char == 1 byte. Not so with Unicode, although UTF-32 gives you pretty much the same property at 4 bytes per code point. UTF-8 and UTF-16 are variable-width encodings: different characters require different numbers of bytes.
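
To illustrate, a minimal sketch (assuming a C++11-or-later compiler so u8 literals are available; the characters are just examples):

```cpp
#include <iostream>

int main()
{
    // sizeof of a string literal includes the terminating '\0', so subtract 1
    // to get the number of bytes the encoded character actually occupies.
    std::cout << sizeof(u8"A")          - 1 << '\n';  // 1 byte  (ASCII)
    std::cout << sizeof(u8"\u00E9")     - 1 << '\n';  // 2 bytes (é)
    std::cout << sizeof(u8"\u20AC")     - 1 << '\n';  // 3 bytes (euro sign)
    std::cout << sizeof(u8"\U0001F600") - 1 << '\n';  // 4 bytes (emoji, outside the BMP)
    return 0;
}
```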

 

char is fine for UTF-8. 

But of course, with Unicode, 1 char (or wchar_t) does not necessarily equal 1 character (code point).

 

As for string lengths, as Demos Sema said, you have to run the decoding algorithm, which means parsing the bytes - a sequential scan through the string. Similarly, you cannot just random-access the character at a specific index. Of course, if you can make some assumptions about the text (e.g. "I'm using UTF-8 but I know there won't be any multi-byte code points", which is practically equivalent to saying your string is ASCII-only), then you can just count the number of char elements and index randomly into the string.
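
As a rough idea of what that scan looks like, here is a minimal sketch that counts UTF-8 code points by skipping continuation bytes (it assumes the input is already well-formed UTF-8; a real decoder would also validate):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes.
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Note it still has to touch every byte - there is no O(1) way to jump straight to the Nth character of a variable-width string.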

 

There are essentially no Unicode-aware functions in the C++ Standard Library. Any strlen-type function, for example, just counts the number of chars in the array, which comes out larger than the number of code points whenever the string contains multi-byte characters.
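
For example (a small sketch, again assuming a pre-C++20 compiler, where u8 literals are plain char arrays):

```cpp
#include <cstring>
#include <iostream>

int main()
{
    // Two code points: 'h' (1 byte) and the euro sign U+20AC (3 bytes in UTF-8).
    const char* s = u8"h\u20AC";

    // strlen counts char elements (bytes) up to the '\0', not code points.
    std::cout << std::strlen(s) << '\n';  // prints 4, even though there are only 2 characters
    return 0;
}
```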

 

The width of wchar_t is implementation-defined, so it isn't really useful for Unicode unless you know something about the implementation. On Windows it is 2 bytes, which is good enough for holding UTF-16 code units; on most Unix-like platforms it is 4 bytes, which fits UTF-32.
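
You can always check what your implementation gives you; a trivial sketch:

```cpp
#include <iostream>

int main()
{
    // Implementation-defined: typically 2 on Windows (UTF-16 code units)
    // and 4 on Linux/macOS (UTF-32 code units).
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';

    // If code genuinely relies on a 2-byte wchar_t, make that assumption explicit:
    // static_assert(sizeof(wchar_t) == 2, "expected UTF-16-sized wchar_t");
    return 0;
}
```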

 

C++11 introduced char16_t and char32_t. These have a well-defined, fixed size, so if you're using UTF-16 and C++11 you are better off using char16_t instead of wchar_t.
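
A minimal C++11 sketch of the new types and their matching string literals:

```cpp
#include <iostream>
#include <string>

int main()
{
    // u"" literals give char16_t (UTF-16), U"" literals give char32_t (UTF-32).
    std::u16string utf16 = u"caf\u00E9";   // 4 code points -> 4 char16_t units
    std::u32string utf32 = U"caf\u00E9";   // 4 code points -> 4 char32_t units
    std::cout << utf16.size() << ' ' << utf32.size() << '\n';  // 4 4

    // char16_t still holds UTF-16 *code units*: a code point outside the BMP
    // (e.g. U+1F600) needs two of them (a surrogate pair).
    std::u16string emoji = u"\U0001F600";
    std::cout << emoji.size() << '\n';  // 2
    return 0;
}
```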

 

Overall I would leave all of that alone; if you're serious about using Unicode in C++, take a look at the de facto standard ICU library instead. There's also boost::locale.
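
As a flavour of what that looks like, a hedged sketch using Boost.Locale (this assumes Boost.Locale is installed, built with the ICU backend, and linked, e.g. with -lboost_locale; the calls shown are its documented case-mapping and conversion functions):

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main()
{
    // Build a locale object that knows how to do Unicode-aware text processing.
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");

    std::string utf8 = u8"gro\u00DF";  // "groß"

    // Unicode-aware upper-casing: with the ICU backend, ß maps to SS,
    // something a byte-oriented toupper loop can never do.
    std::cout << boost::locale::to_upper(utf8, loc) << '\n';

    // Convert between UTF encodings.
    std::u16string utf16 = boost::locale::conv::utf_to_utf<char16_t>(utf8);
    std::cout << utf16.size() << '\n';  // 4 UTF-16 code units
    return 0;
}
```

ICU's own API (icu::UnicodeString and friends) covers the same ground with finer control, at the cost of a bigger dependency.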

