UNICODE and string confusion...


I want to move away from the ANSI/ASCII char stuff.

1: I want to use Unicode so I can start getting used to a more international concept, but I am already confused by the UTF-8, UTF-16, UTF-32, big-endian, little-endian chaos.

2: Should I use Win32 or C++ functions? Is <tchar.h> functional enough for Unicode tasks, or does it only move 8 or 16 bits of data without caring about the content?

3: std::string or my own object? There are so few examples out there using TCHAR and std::string... so few examples of anything.

4: When I compile with the latest MS Platform SDK, I get an error when I use sprintf to format my strings. Safety is nice, but should I use the MS strsafe library?

Maybe I should just make my own string object, but what should I include in it? I should be able to feed it ASCII or Unicode chars, and it should be able to return ASCII or Unicode chars. The internals could very well be just Unicode, but should I use UTF-16 or UTF-32? UTF-32 seems nice as an internal format, but it will demand a lot of conversion overhead all the time. I guess UTF-16 is the native format for Windows... big or little endian? How does Linux deal with Unicode? And how do I deal with buffer overflows and such?

It's really simple.

Use Unicode only if you intend to make your program multilingual or to allow support for multiple languages in the future. If your program will only use English, now and forever, then don't use Unicode.

maybe my fault...

I WILL USE UNICODE - period.

Should I base it on <tchar.h>, or on the Windows TCHAR?

I noticed that you have to define
#define UNICODE   // for the Windows headers
#define _UNICODE  // for <tchar.h>

Any comments regarding using those two different headers?
Any comments on using little or big endian?

Both are Windows concepts; neither is portable to other platforms. The TCHAR stuff is Windows-specific. The theory is portable, but not the types.

wchar_t is 16 bits on Windows and 32 bits on OS X, for instance. gcc will let you compile with the -fshort-wchar flag, but then you can't use the C++ standard library on OS X, as it is compiled with wchar_t as 4 bytes.

If all you need is Windows support, then use the TCHAR defs. Tens of thousands of programs have been written using them to successfully support Unicode on the PC.

Cheers
Chris

I would use the Win32 functions. They support UTF-16, so use unsigned short (or wchar_t) instead of char. You could easily write your own string class if you use the Win32 functions; just remember to use reference counting so that you can cheaply return a string from a function.

ex:

string Function0()
{
    string ret = L"pizza";
    return ret;
}

TCHAR is for apps that want to do both Unicode and non-Unicode depending on how they're compiled. TCHAR is really just an alias for char or WCHAR depending on your settings.

If you want to be Unicode on Win32, then use WCHAR. WCHAR is a 16-bit type (stored little-endian on x86). You still need to define UNICODE and _UNICODE so that you end up calling the Unicode versions of the API. In case you don't know, for any Win32 API that takes character data, say TextOut, there are really two versions - TextOutA and TextOutW. TextOut itself is just a #define that selects which of the 'A' and 'W' versions to use.
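
Simplified, the headers do something like this (the real definitions in <winuser.h> carry more decoration, but this is the pattern):

#ifdef UNICODE
#define TextOut  TextOutW
#else
#define TextOut  TextOutA
#endif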

When using Unicode, all of the standard C/C++ runtime functions like sprintf have Unicode equivalents - swprintf in this case. MSDN will tell you what the Unicode version is; it will usually be basically the same name, except with a 'w' in it someplace.

You can't really manipulate UTF-8 directly in Win32; pretty much none of the APIs will take it, as it's not a real code page in the Windows view of things. You can convert UTF-8 data using MultiByteToWideChar() with the CP_UTF8 code page.
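
For instance, a helper might look like this (the function name is mine; it is just a sketch of the usual call-twice pattern):

#include <windows.h>
#include <string>

std::wstring Utf8ToUtf16(const std::string &utf8)
{
    if (utf8.empty())
        return std::wstring();
    // First call: ask how many wchar_ts are needed (-1 = include the terminator).
    int chars = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    if (chars == 0)
        return std::wstring();   // conversion failed
    std::wstring wide(chars, L'\0');
    // Second call: do the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], chars);
    wide.resize(chars - 1);      // drop the terminator we asked for
    return wide;
}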

Quote:
Original post by raydog
It's really simple.

If your program will only use English, now and forever, then don't use Unicode.


How do you know that someone who doesn't have an English version of Windows won't use your program?

Make your programs as simple as possible and no simpler.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

It's important to note that Windows uses UTF-16, and this won't handle all Unicode characters in a single code unit. At one time it was thought it would :). This really isn't Microsoft's fault, as they did their Unicode work back in '91, way before Mac and Linux were thinking about this stuff.

Mac and Linux use UCS-4 (UTF-32). This encoding does handle all possible Unicode characters in a single code unit.

With UTF-16 you'll need to handle surrogate pairs if you want to handle all characters.
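
A sketch of what that handling involves (the helper names are mine):

// UTF-16 stores code points above U+FFFF as two code units: a high
// surrogate (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF).
bool IsHighSurrogate(wchar_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool IsLowSurrogate(wchar_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

unsigned long DecodeSurrogatePair(wchar_t hi, wchar_t lo)
{
    return 0x10000 + ((static_cast<unsigned long>(hi - 0xD800) << 10)
                      | static_cast<unsigned long>(lo - 0xDC00));
}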

Cheers
Chris

You don't need to define a custom string type. The standard library string is a class template: std::string is just a type alias for the real type, std::basic_string<char>. Two of its type parameters are the character type and the character traits, and there is also a type alias of basic_string for the wide character type wchar_t, namely std::wstring.

You can also do localization/internationalization with the standard IOStreams library. Look up locales, facets, and character traits; in particular, the facet std::codecvt (and std::codecvt_byname) performs code conversion to and from the internal and external representations specified by the character encoding schemes.

You need to decide what you're going to use for your internal representation (character type and character encoding scheme), and you need to know what character type and character encoding scheme are used for the external representation.
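
A small sketch of the above (the typedef spells out what std::wstring is; the locale call is where a codecvt facet gets involved):

#include <fstream>
#include <locale>
#include <string>

// std::wstring is just basic_string instantiated for wchar_t:
typedef std::basic_string<wchar_t, std::char_traits<wchar_t> > MyWideString;

int main()
{
    MyWideString s = L"wide";        // the same type as std::wstring
    std::wofstream out("test.txt");
    out.imbue(std::locale(""));      // the locale's codecvt facet converts
    out << s << L'\n';               // the internal wchar_ts on the way out
    return 0;
}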

Quote:
Original post by petewood
How do you know that someone who doesn't have an English version of Windows won't use your program?


I'm not saying someone from Japan, for example, can't use my English-only program on their version of Windows. They can. The text strings just won't be in Japanese.

What I'm saying is that I'm not going to waste time translating every English string into 10 different languages. If I were selling my program internationally, then perhaps I would, but I'm not, so it really doesn't matter.

Quote:
Original post by raydog
The text strings just won't be in Japanese.


What will they be?

Are you sure?

Quote:
Original post by petewood
Quote:
Original post by raydog
The text strings just won't be in Japanese.


What will they be?

Are you sure?


English, of course. The whole point of Unicode is to support multiple languages.

Really, if you're not targeting an international audience, then why bother with Unicode? Save yourself some memory.

Quote:
Original post by raydog
Quote:
Original post by petewood
Quote:
Original post by raydog
The text strings just won't be in Japanese.


What will they be?

Are you sure?


English, of course. The whole point of Unicode is to support multiple languages.

Really, if you're not targeting an international audience, then why bother with Unicode? Save yourself some memory.


Because some of us do like to use languages other than English, whether you like it or not? I don't like having to use Romaji when I could have used Hiragana or Kanji.

And whether your program supports it or not, I can type Japanese into any Windows control you use. It will drop to question marks when I'm done typing. Do you have any idea how annoying that is, and how much I want to severely maul the programmer who wrote the program? What is particularly annoying is when the program asks for a file and I want to use one that has a Kanji filename.

I recently changed a large project to Unicode, and here is what I used to figure things out.

Read this first; it's a very good primer that answers most upfront questions:
http://www.flipcode.com/articles/article_advstrings01.shtml

Then read this - actually, refer to it when you need info on specific string functions:
http://www.metagraphics.com/index.htm?page=pubs/mgct_language-portable-code.htm


Hope that helps.

Quote:
Original post by Erzengeldeslichtes
And whether your program supports it or not, I can type Japanese into any Windows control you use. It will drop to question marks when I'm done typing. Do you have any idea how annoying that is, and how much I want to severely maul the programmer who wrote the program? What is particularly annoying is when the program asks for a file and I want to use one that has a Kanji filename.


Yes, that's a very good point. Typing Unicode characters into a common dialog box that opens and saves files probably won't work as expected in a program that doesn't support Unicode.

Hey, if you're writing a program like Explorer that needs to display a Unicode system file directory, then yeah, it might be important, but I'm not. :)

Quote:
Original post by raydog
Quote:
Original post by Erzengeldeslichtes
And whether your program supports it or not, I can type Japanese into any Windows control you use. It will drop to question marks when I'm done typing. Do you have any idea how annoying that is, and how much I want to severely maul the programmer who wrote the program? What is particularly annoying is when the program asks for a file and I want to use one that has a Kanji filename.


Yes, that's a very good point. Typing Unicode characters into a common dialog box that opens and saves files probably won't work as expected in a program that doesn't support Unicode.

Hey, if you're writing a program like Explorer that needs to display a Unicode system file directory, then yeah, it might be important, but I'm not. :)

I'd rather not try to predict where the user will want to type Unicode text. Better to use it from the beginning than to hear complaints later on.

Quote:
Original post by raydog
Quote:
Original post by Erzengeldeslichtes
And whether your program supports it or not, I can type Japanese into any Windows control you use. It will drop to question marks when I'm done typing. Do you have any idea how annoying that is, and how much I want to severely maul the programmer who wrote the program? What is particularly annoying is when the program asks for a file and I want to use one that has a Kanji filename.


Yes, that's a very good point. Typing Unicode characters into a common dialog box that opens and saves files probably won't work as expected in a program that doesn't support Unicode.

Hey, if you're writing a program like Explorer that needs to display a Unicode system file directory, then yeah, it might be important, but I'm not. :)

It doesn't need to be like Explorer.

For example: Winamp uses Windows' Open dialog box to get the MP3s. From there, I can get to any of my songs with titles in Kanji. This works perfectly fine, and Winamp can open them fine, because Windows is doing all the work, and Windows supports Unicode. However, once they're in the playlist (since Winamp doesn't support Unicode) I can't tell which song is which; they're all gobbledygook. I wish to severely maul the programmer who wrote the text code they're using (according to the Winamp guys, they can only display characters in the current code page because the text rendering engine they're licensing only renders text in the system code page). I've been forced to use iTunes as my primary player.

Another example: AIM can't use Unicode, so I can only send/receive messages in ASCII. This is annoying. I do my best to get people to use MSN Messenger instead, but I still have to use AIM for those who don't. (AOL: "It's America! Just because it's a melting pot of the world and has people speaking and writing every conceivable language doesn't mean they want to communicate in anything but English!")

I'm in the United States. I doubt you'd consider me an "international" user. If you did, you'd be using Unicode anyway. And as a "non-international" user, I'm asking you to use Unicode. I'm not asking you to translate your English strings into 10 languages; I'm asking you to let me name things in 10 languages and have your program not care what language they're in. It's not that hard to do, and it doesn't take up that much memory. (Each string takes up twice as much space. If this increases your program size significantly, you have WAAAAAAYYYY too many strings. Compared to code, meshes, textures, sounds, and movies, the doubling or even quadrupling of string size should be insignificant.)

[Edited by - Erzengeldeslichtes on March 15, 2005 2:21:49 AM]

It is really annoying that the 128-character ASCII stuff is still around, with unsafe functions to manipulate it. But Unicode has made a mess out of something that should have been a nice standard, I feel, because you'll have to read a ton of docs and take care of a lot of formats...

It is not about translating your program into different languages, but about letting people with a different language type in their language. ...just think of Notepad...

The downfall of C++, open source, Linux and such could be as simple as a lack of support for, use of, and understanding of strings... 1-byte ASCII encoding must die. (Even in Unicode, the UTF-32 values from 0x00000000 to 0x0000007F are ASCII, I believe, so it should be simple.)

If I want to make my own string object, then the internal workings should be UTF-16 little endian, since this would need the least conversion when talking to Windows.

Today's question: the use of L"...", TEXT("..."), _T("...") - what are they really doing? Just pad the characters with 00 first?
So if a Greek person would write

TCHAR foo_str[] = TEXT("Some Greek letters I don't know of");

would it come out right? I guess not, since .h and .cpp files are really just ASCII .txt files... So L, TEXT and so on are just for those who want to convert ASCII to Unicode? By padding?

I think you're making yourself believe it's harder than it is.

Using Unicode on Windows is very easy. Just define UNICODE and _UNICODE; there are two definitions because one is for Microsoft's Platform SDK headers and the other is for the C runtime / <tchar.h>.

Then, instead of using char everywhere, just use TCHAR (or wchar_t if you know you'll be strictly Unicode) for all strings that should be affected by the encoding. Almost all will be; occasionally you'll have strings that should stay 8-bit ASCII.

Change all C string functions to their tchar equivalents, i.e. strcpy to _tcscpy.

Finally, change all compile-time string constants to use the _T() macro. You don't need to worry about encoding if you do this; you'll use UTF-16 internally.
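
A minimal before/after sketch (toy names, nothing more):

#include <tchar.h>

int main()
{
    // Before (ANSI only):
    //     char name[32];
    //     strcpy(name, "player");

    // After: compiles as ANSI or Unicode depending on UNICODE/_UNICODE.
    TCHAR name[32];
    _tcscpy(name, _T("player"));   // _tcscpy maps to strcpy or wcscpy
    return 0;
}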

I'm afraid it was me who confused you by mentioning that if you want to port your program, this approach will bring you a world of hurt. If it's just a Windows program, this will do.

Cheers
Chris

In ASCII you terminate strings with \0.

What is the termination code in Unicode, or when using wchar_t?

Another one:

int foo_function( wchar_t *str, size_t len )

If I use the function like this:

foo_function( TEXT("Some text"), sizeof( ??? ) );

how do I find the number of bytes sent?

Quote:
Original post by DarkSlayer
In ASCII you terminate strings with \0.

What is the termination code in Unicode, or when using wchar_t?


It's not an ASCII issue but a C string library issue. Use a 16-bit character constant (I'll let you look that up [grin]), or just plain int 0.

Quote:
Another one:
int foo_function( wchar_t *str, size_t len )

If I use the function like this:

foo_function( TEXT("Some text"), sizeof( ??? ) );

how do I find the number of bytes sent?


Ignoring the fact that, if it's a string, you should use strlen (or its wide twin, wcslen) and not sizeof, what's wrong with sizeof(TEXT("Some text")) ?
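
To spell that out (a sketch; the byte counts assume Win32's 2-byte wchar_t):

#include <wchar.h>

int main()
{
    size_t bytes = sizeof(L"Some text");            // 20: 10 wchar_ts * 2 bytes
    size_t chars = wcslen(L"Some text");            // 9: excludes the terminator
    size_t again = (chars + 1) * sizeof(wchar_t);   // 20 again
    // Caveat: once the literal has decayed to a wchar_t*, sizeof yields
    // the size of the pointer, not the size of the buffer.
    return 0;
}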

Quote:
English, of course. The whole point of Unicode is to support multiple languages.


How do you know that the characters you are using in your program aren't going to be interpreted/displayed as different characters on your client's machine? That is the whole point of Unicode - ensuring that the same (sequences of) bytes refer to the same glyphs. Even I remember the trouble code pages were (and still are, to some extent).

Hi,

I thought you might like to hear some general points that I have found whilst going through this process myself, which may or may not prove helpful to you. To provide a flame-retardant disclaimer, clearly the value of any of the below advice rather depends on the problem you're solving, but who knows, it might be handy :-)

1. wchar_t and std::wstring

Use "wchar_t" as the primary character type where Unicode characters are required. This is, as others have pointed out, unsigned short under Win32. If you're worried about wchar_t suddenly becoming a 32 bit value on some machines, do a compilation guard that checks that sizeof(wchar_t) == sizeof(unsigned short).

There is no need to roll your own Unicode string object. Use std::wstring. This is a wchar_t version of basic_string; so its ".c_str()" will return a const wchar_t *, and .length() will return the length of the string in wchar_ts.

I do not recommend using Microsoft's WCHAR; it's just typedef'd to wchar_t anyway. If you're writing code that may one day run on other platforms, there is a lot of sense in using the standard library stuff -- given you've got a wstring, a wchar_t and copies of all the strcpy()-style functions in wide versions (in the case of strcpy, wcscpy), you're well equipped to make your engine Unicode and still reasonably portable at its core.

I also do not recommend that anything TCHAR-related is used. This will cause you headaches (see 6, below), and the same applies to the TEXT macro. Obviously this is a slight generalisation and there are cases where they would be useful, but I have found that *UNLESS YOU NEED YOUR PROGRAM TO COMPILE IN 8-BIT CHAR *AND* UNICODE* you really don't need these macros (and macros generally cause more problems than they solve in C++ anyway because, amongst other things, they're not type safe). You then get the option of converting what you want to Unicode without being forced to do 100% of the application. Given that Unicode's first 256 code points match extended ASCII (0x00 -> 0xff), you shouldn't need them. If you're using standard library stuff, TCHAR and its associated _T/TEXT macros will drive you to tears in a surprisingly short period of time.
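
A tiny sketch of the above (the negative-array-size typedef is the classic compile-time guard trick; the names are mine):

#include <string>
#include <wchar.h>

// Fails to compile anywhere wchar_t is not 16 bits:
typedef char wchar_t_must_be_16_bits[sizeof(wchar_t) == sizeof(unsigned short) ? 1 : -1];

int main()
{
    std::wstring s = L"Hello";
    const wchar_t *p = s.c_str();            // const wchar_t *
    std::wstring::size_type n = s.length();  // length in wchar_ts
    wchar_t buffer[16];
    wcscpy(buffer, p);                       // wide twin of strcpy
    return 0;
}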

2. Watch the UNICODE definition; it *may* not be relevant to your problem

I elected not to define _UNICODE and to manage the process myself. This was because I was converting an existing application that was sensitive to data types (an MMO engine) and I wished to convert one part at a time. Likewise, not all of the application was going to be converted to Unicode. Under these circumstances, conversion is easy: use the W() forms of all calls where required. E.g.:


MessageBoxW(hWnd, L"Hello", L"Hello", MB_OK);




... is a Unicode version. If you have not already figured it out, the L in front of the string means a wide string (and yes, the character 'A' would be 0x41 in ASCII and is 0x0041 or L'A' in Unicode). The same applies to a single char:


wchar_t wideCharacter = L'A';




Incidentally, L'\0' is the zero terminator for Unicode (0x0000) -- so yes, you can slap a wchar_t with a value of zero on the end of a string to terminate it correctly.

Back to UNICODE -- personally, and I guess this is personal preference, I wanted to see which was which and to control which parts were Unicode and which were not, because it's a mixed application (there is little point in doubling the size of configuration files that do not and cannot contain Unicode characters by making them wchar_t for the sake of it). By using the actual function names (MessageBoxA() and MessageBoxW()) rather than the "auto" one (MessageBox -- which maps to either the A() or W() version via a macro, depending on whether UNICODE is defined), I had control over what was and what was not Unicode, and when. Which was nice.

3. Window procedures and controls

Some controls, such as list views, have W versions of their critical messages: there is an LVM_INSERTITEMW and the appropriate LVITEMW. This makes conversion of such controls easy, as you can insert Unicode strings into an otherwise ANSI control. Others, though, will NOT work in Unicode unless you create the dialog in Unicode:


CreateDialogW()




A combo box in a dialog created with CreateDialogW() will be Unicode, so its CB_INSERTSTRING call will expect a wchar_t string.

Bear in mind that the window procedure of a window registered with RegisterClassW() and created with CreateWindowW() will need to fall through to DefWindowProcW(), not DefWindowProcA(). If you don't get this bit right, your WM_CHAR messages will come through in MBCS rather than Unicode.

This, incidentally, is the stuff that _UNICODE "fixes" for you; have a look at any of the windows header files to see what they do with this definition.
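
A bare-bones sketch of the DefWindowProcW() fall-through mentioned above (a hypothetical window procedure, nothing project-specific):

#include <windows.h>

LRESULT CALLBACK MyWndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_CHAR:
        // For a class registered with RegisterClassW, wParam arrives here
        // as a UTF-16 code unit rather than an MBCS byte.
        return 0;
    }
    return DefWindowProcW(hWnd, msg, wParam, lParam);   // the W version, not A
}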

4. Win32 isn't all Unicode and there are bits that cause "problems" in the most unexpected places.

InternetReadFileW() is a fine example. It has been in the MSDN documentation for years, and yet it is actually just a stubbed, non-functioning call. This kind of thing can really burn your soul when doing the Unicode work, because the documentation is poor - you have to get the NOT_IMPLEMENTED return result back before you realise why your code isn't working. In general, you'll only bump into stuff like this if you're playing with SDK stuff that isn't part of the core Win32 base, such as the Shell, Internet stuff, etc.

MSDN does contain some Unicode gems, but don't expect to find too much in the way of non-TCHAR examples.

5. Watch the BOM

To read and write Unicode text files, you need to open files in BINARY MODE using the wide versions of the file open calls. The BOM is the biggest trap. The BOM is a byte ordering marker -- a single Unicode character (U+FEFF) whose byte order tells you whether the file is little or big endian (in a little-endian file it appears on disk as the bytes FF FE). It's a breeze to work with: just write it as the first character of the file and consume it when reading. If you get the BOM right, your Unicode text files can be loaded and edited in WordPad/Notepad, etc. without any problems. Incidentally, if you skip the BOM, your file is UTF-16BE or UTF-16LE rather than UTF-16.
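
For instance, writing a file that Notepad will recognise might look like this (a sketch assuming little-endian Win32 and a 2-byte wchar_t):

#include <stdio.h>
#include <wchar.h>

int main()
{
    FILE *f = fopen("out.txt", "wb");       // binary mode, as noted above
    if (!f)
        return 1;
    wchar_t bom = 0xFEFF;                   // U+FEFF byte order mark
    fwrite(&bom, sizeof(wchar_t), 1, f);    // lands on disk as FF FE
    const wchar_t *text = L"Hello";
    fwrite(text, sizeof(wchar_t), wcslen(text), f);
    fclose(f);
    return 0;
}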

6. Watch the locale, and Unicode-to-ANSI / ANSI-to-Unicode conversions

DBCS and MBCS are NOT your friends when working with Unicode. Make sure your locales (both the C run-time's, if you use it, and the system's) are set correctly, or your mbstowcs()/MultiByteToWideChar()-type calls will not function as you expect them to. Look all this up in MSDN and you'll see what I mean -- there is a lot of scope for things not working as you expect, especially if you're just starting.

Personally, I wrote my own Unicode-to-ANSI and vice versa calls. They are not rocket science, you'll be delighted to hear. To convert ANSI to Unicode, just go through the chars and convert them to wchar_ts, and the job is done. The only thing you need to watch out for is that in Visual Studio, BY DEFAULT, CHAR IS A SIGNED VALUE. This can cause ALL SORTS of problems. For European stuff, and to ease your pain during Unicode conversions, ensure that the compiler /J option is set (look it up and it'll explain all). This ensures that char is treated as "unsigned char", which removes lots of "issues" (most problems people have with this are because -54 is not a valid index into an array...). To convert Unicode to ANSI, just take each wchar_t; if it is > 255, replace it with a '?', otherwise take the least significant eight bits and slap that into the char. There are very few special cases and I've not run into any problems with this method. See the Unicode 3.0 specification for information on all this.
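
Roughly what those two calls boil down to (a lossy sketch that only round-trips Latin-1; for real code pages stick with MultiByteToWideChar/WideCharToMultiByte):

#include <string>

std::wstring AnsiToWide(const std::string &in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i)
        // The unsigned char cast sidesteps the signed-char sign extension
        // problem described above, without needing /J.
        out += static_cast<wchar_t>(static_cast<unsigned char>(in[i]));
    return out;
}

std::string WideToAnsi(const std::wstring &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::wstring::size_type i = 0; i < in.size(); ++i)
        out += (in[i] > 0xFF) ? '?' : static_cast<char>(in[i]);
    return out;
}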

7. Kiss goodbye to Windows 98 and Windows ME compatibility

The Microsoft Layer for Unicode on Windows 98/ME is not a complete solution; it does not do everything, and it generates extra shipping complications. If you don't need to worry about compatibility with 4.x Windows (98/ME), then this is great -- require 5.0 (Windows 2000) or higher and compile with WINVER defined as 0x0500. All will be nice under those circumstances.

8. You will get a performance increase but a memory hit

wchar_ts are faster under Windows XP and 2000 because the OS is internally Unicode (except for some of the surrounding gubbins such as InternetReadFileW, see above). But be warned that those 1024-character strings you were declaring on the stack left, right and centre are now double the number of bytes you think they are -- keep an eye on your stack size and usage. You may need to consider bumping it up if you declare a lot of character buffers (wchar_t errorMessage[256], resultString[512], tempString[512]; is 2.5 kilobytes of the stack sent to the moon).

9. Check all character size calculations

The buffer size in bytes for a zero-terminated string is sizeof(char) * (length + 1). Keeping the sizeof() in the expression means that when you change char to wchar_t, the calculation stays correct. Code that assumes a char is a byte will be problematic.
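
In code (a trivial sketch, again assuming Win32's 2-byte wchar_t):

#include <wchar.h>

int main()
{
    wchar_t buf[256];                              // 256 characters...
    size_t bytes = sizeof(buf);                    // ...is 512 bytes on Win32
    size_t chars = sizeof(buf) / sizeof(buf[0]);   // 256 characters again
    size_t n = wcslen(L"Some text");               // allocating for length n:
    size_t needed = sizeof(wchar_t) * (n + 1);     // the +1 sits inside the multiply
    return 0;
}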

That's about the size of it. It is enormously satisfying when you get all this working, as I discovered when my Japanese, Korean and Chinese test string went straight through the chat system, got logged, displayed, checked for content and displayed correctly at the other end. The greatest thing about Unicode is that it "just works". And whilst there are indeed many more Unicode characters than the 65,530 or so directly expressible in a single UTF-16 code unit under Win32, frankly, you won't often need the others, and there are special sequences (surrogate pairs, see above) that encode the remaining code points in UTF-16 streams.

Oh, and by the way: although Visual Studio .NET 2003 claims it does not support Unicode literals, it does - so long as the file (.h or .cpp) is actually loaded into an edit window rather than just sitting in the project; so you can do this:


const wchar_t *unicodetest__MysteryString1 =
    L"あいうえおかきくけこさしすせそ";



I'm not sure if that'll come out in the forums; I guess that depends on whether they support Unicode ;-). If it doesn't, just imagine there were some Japanese characters on the line above.

Anyway, sorry for waffling, hope that was of some assistance :)

CuttleFish.
(sorry for the edits; the first version of this was poorly formatted - it's been a while since I posted and I messed the tags up :-))

Quote:
Original post by Fruny
How do you know that the characters you are using in your program aren't going to be interpreted/displayed as different characters on your client's machine? That is the whole point of Unicode - ensuring that the same (sequences of) bytes refer to the same glyphs. Even I remember the trouble code pages were (and still are, to some extent).


Because the English alphabet is part of the ASCII character set, which is very standardized; every computer out there understands it. It's a given, Unicode or not.

Quote:
Original post by raydog
Quote:
Original post by petewood
What will they be?

Are you sure?


English, of course.

!

Quote:
Because the English alphabet is part of the ASCII character set, which is very standardized,

!
Quote:
every computer out there understands it.

!
Quote:
It's a given

!

On another note, you'd be surprised at how many non-speakers of English use English programs. When I visited my cousins in Vietnam, one of them had a computer and we played Road Rash on it. Could he read the text? No. Could he read the Windows UI? No. Years later we chatted with my Vietnamese uncle through Yahoo Messenger. As far as I know, a Vietnamese version of Yahoo Messenger does not exist, but luckily Vietnamese mostly uses the Latin alphabet. But what if we were Chinese? It would be annoying then to have to use the Latin alphabet.

In another example, I needed a simple program to join two MPEG files. The best one I found was in German. What if the program was in Chinese and allowed only Chinese characters for input? It's like when you search for solutions to an obscure hardware problem and only find French pages. Your program might be the only one of its kind, and maybe there's a Thai user who desperately needs to use Thai in it.

If you want, you can continue to assume that everyone who uses your program knows English, but you're only restricting yourself.
