unicode

Started by Quat
6 comments, last by Quat 16 years, 9 months ago
I'm reading the Unicode Chapter of Petzold's Programming Windows 5th Ed. He seems to say that Windows uses the Unicode character set for wchar_t's. However, wikipedia (http://en.wikipedia.org/wiki/Unicode) seems to suggest that there are different unicode character sets, and that 16-bits isn't enough. So what does Windows actually do? Where can I get a table of the number-->character mapping?
-----Quat
Quote:Original post by Quat
I'm reading the Unicode Chapter of Petzold's Programming Windows 5th Ed.

He seems to say that Windows uses the Unicode character set for wchar_t's. However, wikipedia (http://en.wikipedia.org/wiki/Unicode) seems to suggest that there are different unicode character sets, and that 16-bits isn't enough.


Yes, it is true that certain characters require more than 16 bits.

Quote:
So what does Windows actually do? Where can I get a table of the number-->character mapping?


Windows 2000 and newer versions are supposed to have full support for UTF-16; NT4 only had support for UCS-2 (Unicode characters up to 16 bits and no more). Here's a table, but if you are going to perform checks on valid characters or something, there are better (computational) ways. See the docs at unicode.org.
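
To make the UCS-2 limitation concrete: any code point above U+FFFF has to be split into two 16-bit code units (a surrogate pair) in UTF-16. A minimal sketch of that split, using U+1D11E (a musical symbol) purely as an example value:

#include <cstdio>

int main()
{
    // Split a code point above U+FFFF into a UTF-16 surrogate pair.
    unsigned long cp = 0x1D11E;                                // example code point
    unsigned long v  = cp - 0x10000;                           // 20-bit remainder
    unsigned short hi = 0xD800 | (unsigned short)(v >> 10);    // high surrogate
    unsigned short lo = 0xDC00 | (unsigned short)(v & 0x3FF);  // low surrogate
    std::printf("U+%05lX -> 0x%04X 0x%04X\n", cp, hi, lo);     // prints U+1D11E -> 0xD834 0xDD1E
    return 0;
}

A UCS-2-only system has no way to represent such a character as a single 16-bit value, which is why NT4 falls short.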
I recently did a deep investigation into the subject of Unicode and found out that - from a programming perspective at least - there is no real "unicode" equivalent to the good ol' ANSI character set.

The only thing "universal" about unicode is that every existing non-unicode character set can be translated into it and (in almost every case) vice versa.

If you're programming in C or C++, everything depends on your compiler and platform(s). For example:

char strings on Windows under Visual C++ 2005 are interpreted as the Windows-1252 character set when sent to the GUI and such (at least... on Western-language installations).

GCC on the other hand uses UTF-8 encoding for single-byte char strings.

wchar_t is UTF-16 (little-endian on x86) on Win/VC, but UTF-32 on GCC...

On Mac OS X systems, the "normal" single-byte character set is MacRoman, while most API functions take UTF-16.


If you want to write code that is portable across platforms and compilers, your best solution will be to hand-pick a character set that you will stick to internally in your program (perhaps only vary it per compiler so it matches one of the compiler's "native" character sets).

Any incoming or outgoing character data are then transcoded into the appropriate character set, based on the platform it is running on. On most GCC-based systems you can use the iconv library for that.
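
For what it's worth, here is a minimal iconv sketch; the encodings and buffer sizes are just example choices, real code needs proper error handling, and on some platforms iconv's input pointer parameter is declared const char** rather than char**:

#include <iconv.h>
#include <cstring>
#include <cstdio>

int main()
{
    // Convert a UTF-8 string to UTF-16LE using iconv.
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");      // to-encoding, from-encoding
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char in[]    = "Theta: \xCE\x98";                   // "\xCE\x98" is U+0398 in UTF-8
    char out[64] = {0};

    char*  inp     = in;
    char*  outp    = out;
    size_t inLeft  = std::strlen(in);
    size_t outLeft = sizeof(out);

    if (iconv(cd, &inp, &inLeft, &outp, &outLeft) == (size_t)-1)
        std::perror("iconv");

    std::printf("wrote %u bytes of UTF-16LE\n", (unsigned)(sizeof(out) - outLeft));
    iconv_close(cd);
    return 0;
}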

On Win32 systems I only found IBM's ICU library to be feature-rich enough. If you only need to take Windows into account, you might get away with using the "normal" wcsXXXX and mbsXXXX functions available in the standard library.
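
If the standard-library route is enough for you, the usual pattern is to set the locale first and then let mbstowcs/wcstombs convert between the locale's multibyte encoding and wchar_t. A rough sketch (the string is just a placeholder):

#include <clocale>
#include <cstdlib>
#include <cstdio>

int main()
{
    std::setlocale(LC_ALL, "");            // pick up the user's default locale

    const char* narrow = "example text";   // multibyte string in the locale's encoding
    wchar_t wide[64];

    size_t n = std::mbstowcs(wide, narrow, 64);
    if (n == (size_t)-1)
        std::printf("conversion failed\n");
    else
        std::printf("converted %u characters\n", (unsigned)n);
    return 0;
}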

The best pointer I can give you is to NOT do anything by hand in Unicode-land but use a library instead.
Simulation & Visualization Software
GPU Computing
www.organicvectory.com
Thanks for the info. One more question, I made a console application and a wchar_t array with some hex codes of unicode characters I looked up. When I called wcout, I didn't get any output. Does anyone know how I can output unicode characters to the console?
-----Quat
Quote:Original post by Quat
Thanks for the info. One more question, I made a console application and a wchar_t array with some hex codes of unicode characters I looked up. When I called wcout, I didn't get any output. Does anyone know how I can output unicode characters to the console?


The normal way of defining Unicode characters in source code is to use \u escapes instead of writing hex codes directly. If you look at the table I posted, each character has a number of the form U+XXXX. So if you would like the Greek character theta (U+0398), you would do:

wcout << L"Theta character: \u0398" << endl;


The advantage of that is that it works with whatever encoding the compiler uses for wchar_t.

Edit: Whether it will display in the console or not is another matter altogether. I doubt the console has support for all unicode glyphs.
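
One thing that may help on the Visual C++ runtime (I have not checked how far back it goes) is switching stdout into UTF-16 mode before using wcout; whether the glyph actually shows up still depends on the console font:

#include <io.h>       // _setmode, _fileno
#include <fcntl.h>    // _O_U16TEXT
#include <cstdio>
#include <iostream>

int main()
{
    // Put stdout into UTF-16 mode so wide characters pass through unmangled.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Theta character: \u0398" << std::endl;
    return 0;
}

(After that call, mixing in narrow output such as printf on the same stream is asking for trouble, so stick to wcout.)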
Quote:
Edit: Whether it will display in the console or not is another matter altogether. I doubt the console has support for all unicode glyphs.


That is the issue I am having. I tried to save the text to a text file using wofstream. But I got nothing in notepad, even though I changed the font to a Unicode one. Does Windows XP or Vista have a text editor that supports Unicode?
-----Quat
Quote:Original post by Quat
Quote:
Edit: Whether it will display in the console or not is another matter altogether. I doubt the console has support for all unicode glyphs.


That is the issue I am having. I tried to save the text to a text file using wofstream. But I got nothing in notepad, even though I changed the font to a Unicode one. Does Windows XP or Vista have a text editor that supports Unicode?


Do you have a hex editor of some sort? (If not, get one now and save face! [smile]) Verify the contents of the file - check that the data you expect is there, in some sort of valid Unicode encoding. If it still doesn't work, try adding a byte-order mark to the front (if there isn't already one there).

That is to say, notepad *is* a Unicode-supporting text editor, but it does have some quirks (without a BOM, it has to guess the encoding, and sometimes gets it wrong because it has a few rather strange rules). But it seems to me more likely that you somehow ended up with a blank file :)
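
If you want to take the guesswork away from Notepad entirely, you can write the bytes yourself through a binary stream and put a UTF-16LE byte-order mark in front. A sketch that assumes a 16-bit wchar_t (i.e. Visual C++):

#include <fstream>

int main()
{
    const wchar_t text[] = L"\u2620\u262D\u262F\r\n";

    std::ofstream out("data.txt", std::ios::binary);
    const unsigned char bom[] = { 0xFF, 0xFE };               // UTF-16LE byte-order mark
    out.write(reinterpret_cast<const char*>(bom), sizeof(bom));
    out.write(reinterpret_cast<const char*>(text),
              sizeof(text) - sizeof(wchar_t));                // drop the terminating L'\0'
    return 0;
}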
Okay, the file was blank, but I do not know why. The code is simply this:

#include <iostream>
#include <fstream>

int main()
{
    std::wofstream fout("data.txt");
    if( fout )
    {
        fout << L"\u2620\u262D\u262F" << std::endl;
        fout.close();
    }
}


If I use MessageBox in a Windows app to output L"\u2620\u262D\u262F", it works...

-----Quat
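
A likely explanation for the blank file: wofstream narrows every wide character through the imbued locale's codecvt facet before it reaches the disk, and the default "C" locale cannot convert code points like U+2620, so the stream drops into a fail state and writes nothing. A quick sketch to check that theory:

#include <fstream>
#include <iostream>

int main()
{
    std::wofstream fout("data.txt");
    fout << L"\u2620\u262D\u262F" << std::endl;
    if (fout.fail())
        std::cout << "write failed - the locale could not convert those characters\n";
    return 0;
}

MessageBox works because it takes the UTF-16 data as-is, with no narrowing step in between.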

This topic is closed to new replies.
