Multibyte char or a regular char?


Hi guys, I want to ask about the multibyte format.

char a[] = "?????"; // some Japanese/Arabic/Chinese characters
char b[] = "abcd";

How does a program know that "a" should be treated as multibyte and "b" should not? Thanks in advance.

Quote:
Original post by VitaminCpp
How does a program know that "a" should be treated as multibyte and "b" should not?


Because you tell it to treat "a" that way...

Seriously though, C and C++ (and most other languages) won't automagically know how to deal with multi-byte character strings. You (the programmer) have to know in advance which strings will be ASCII and which will be Unicode (or whatever), and then call the functions designed to deal with that encoding.
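
For example, here is a minimal sketch (the byte values are my own illustration; "\xE3\x81\x82" is the UTF-8 encoding of a single Japanese kana) showing that strlen counts bytes, not characters, while wcslen counts wide characters:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    const char* ascii = "abcd";         /* 4 bytes, 4 characters */
    const char* utf8  = "\xE3\x81\x82"; /* 3 bytes, but only 1 character */
    const wchar_t* wide = L"abcd";      /* 4 wide characters */

    printf("%u\n", (unsigned)strlen(ascii)); /* prints 4 */
    printf("%u\n", (unsigned)strlen(utf8));  /* prints 3 - a byte count, not a character count */
    printf("%u\n", (unsigned)wcslen(wide));  /* prints 4 */
    return 0;
}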

Here is my case:
I have an array of strings with Japanese characters - a lot of them.
I need to make sure that each string is 8 characters long, and for every string shorter than 8 characters I need to add spaces (' ') in front of it.
So I need to know the length of every string, so I know how many spaces to add.
I haven't tried strlen yet, but I don't think it's going to work: as far as I know, strlen just iterates over the array of char until '\0' is found.

Basically, what I need is to count the characters in a string.
Is there a specific format for multibyte encodings?

Ok, if you're storing an array of Japanese (or other multi-byte) characters, you shouldn't be using the "char" datatype. A char can only hold a single-byte character.

I think you need to use "wchar_t" instead of "char", and put an L in front of your string literals. A wchar_t can hold a 2-byte character instead of a single-byte one (wchar_t is 2 bytes on Windows; on other platforms it can be 4).

E.g.
instead of this:
const char* str1 = "Blah blah.";
size_t length = strlen(str1);

you would write this:
const wchar_t* wstr1 = L"Blah blah.";
size_t length = wcslen(wstr1);
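
For the padding part of the question, a minimal sketch (assuming C++ and that the strings are held as std::wstring; padTo8 is a name I made up) that left-pads to 8 characters:

#include <string>

std::wstring padTo8(const std::wstring& s)
{
    if (s.size() >= 8)
        return s;
    return std::wstring(8 - s.size(), L' ') + s; // prepend the missing spaces
}

size() counts wchar_t units, which matches the character count for Japanese text, since it all lives in the basic multilingual plane.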

I found the function _mbstrlen on MSDN. I think this function will do.
I will try it first, and if it doesn't work, I will use wchar_t.
Thanks for your help.

EDIT:
By the way, I'm just curious: I'm currently using MySQL, and the library doesn't use wchar_t, only char. But somehow, when I pass it a Japanese string, MySQL stores it perfectly.

Quote:
Original post by VitaminCpp
I found the function _mbstrlen on MSDN. I think this function will do.
I will try it first, and if it doesn't work, I will use wchar_t.
Thanks for your help.

EDIT:
By the way, I'm just curious: I'm currently using MySQL, and the library doesn't use wchar_t, only char. But somehow, when I pass it a Japanese string, MySQL stores it perfectly.


In C++, 'char' doesn't actually represent a character of text; it represents a byte of data. (Although English-speaking programmers will often pretend chars represent characters of text, using ASCII, because it makes their lives easier than working with Unicode.) And *anything* on a computer can be looked at as a sequence of bytes - it's just that there might be more than one byte involved per character.

In fact, Unicode text is commonly represented as a sequence of bytes in several different ways, called "encodings". The way Unicode works is, every character is assigned a unique number, called a "code point". These are unique for everything, regardless of context - so you can mix text from different languages, and there will never be any ambiguity.*

However, there are many ways that the number could be represented in bytes. The obvious way is to just use an integer type of the appropriate size. 'wchar_t' is, basically, a 2-byte type, so it represents code points 0 up to 65,535 by just storing the appropriate value in those two bytes of memory.** Another way is to use a variable-length encoding called UTF-8. With this scheme, different code points use different numbers of bytes. Code points in the "basic multilingual plane" - the ones from 0 up to 65,535 - use anywhere from 1 up to 3 bytes in UTF-8. This saves space for text written in Western languages (and also allows English ASCII text to pretend to be Unicode text with a minimum of effort), but means that you can no longer easily find the nth character of a string (because you don't know how many bytes precede it, so you have to search through the string, interpreting the encoded values as you go, until you've seen enough characters).
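
As an illustration of that last point, here is a minimal sketch (mine, assuming valid UTF-8 input) that counts characters by skipping continuation bytes - every byte of the form 10xxxxxx belongs to a lead byte that came before it:

#include <stddef.h>

size_t utf8_length(const char* s)
{
    size_t count = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80) // not a 10xxxxxx continuation byte
            ++count;
    return count;
}

Notice that even this simple count has to walk the whole string; finding the nth character requires the same kind of scan, which is exactly the trade-off described above.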

For these reasons, it is usual to use a variable-length encoding when storing text on disk, and to convert it to a fixed-size encoding when it is read into memory.

* Except for ambiguity resulting from the fact that characters don't actually correspond exactly to *letters* anyway, nor to *glyphs* (letter-shapes). For example, when you write French text, you could represent the letter é either as a single "e-accent-aigu" character, or as a combination of a plain e and a "combining aigu accent" character.

** Actually, it is a little more complicated than that; [google] "unicode surrogate pairs".

Thanks for the explanation.
So if I write "abcdef", the string still uses 1 byte per character, right? Because it's still English ASCII.
But at the same time, the first byte can be used to carry encoding information for other languages.
If the first byte can be used for encoding information, does that mean not all ASCII character codes can be used?
Thanks again.


EDIT:
I have read http://en.wikipedia.org/wiki/UTF-8
I think I understand now.
There are only 128 valid ASCII codes, so the byte values above 127 are free to mark multibyte sequences.
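
To make that concrete, a minimal sketch (mine, not from the Wikipedia page) of how the first byte of a UTF-8 sequence announces its length:

int utf8_seq_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: one of the 128 ASCII codes
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return -1;                           // 10xxxxxx continuation byte, or invalid
}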


[Edited by - VitaminCpp on May 23, 2007 1:26:50 AM]

Quote:
Original post by Hodgman
Ok, if you're storing an array of Japanese (or other multi-byte) characters, you shouldn't be using the "char" datatype. A char can only hold a single-byte character.

I wrote an entire piece of software that handled Japanese strings in char* buffers. The reason this works is that Japanese has a specific encoding that fits in char strings - what Windows calls MBCS, the Multi-Byte Character Set (for Japanese, code page 932, i.e. Shift-JIS). This is different from Unicode: a character is represented as either one or two bytes, depending on the value of the first byte. On Windows, to find the number of characters in an MBCS string, use _mbslen(); to find the number of bytes, use strlen(). If you #include <tchar.h>, you can use _tcsclen(), which maps to strlen, _mbslen or wcslen depending on whether neither, _MBCS, or _UNICODE is defined.
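
A minimal sketch of the byte-count/character-count difference (Windows-only, assuming the MSVC CRT and that code page 932 is installed; the escaped bytes are the Shift-JIS encoding of two kanji):

#include <mbctype.h>
#include <mbstring.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    _setmbcp(932); // interpret MBCS strings as code page 932 (Shift-JIS)
    const char sjis[] = "\x93\xFA\x96\x7B"; // two kanji, four bytes
    printf("bytes: %u\n", (unsigned)strlen(sjis));                        // prints 4
    printf("chars: %u\n", (unsigned)_mbslen((const unsigned char*)sjis)); // prints 2
    return 0;
}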
