Working With Unicode in the Windows API


The issue of Unicode and character sets seems to come up quite a bit in the For Beginners forum (and elsewhere). Usually someone new to Windows programming will start a thread saying that the compiler barfs when it gets to their "MessageBox" call, and that they have no idea how to deal with it. I've spent a lot of time explaining what the problem is and how to fix it, and that usually involves explaining how the Windows API deals with strings. It happened again today, and this time it resulted in a write-up that I rather like, so I've decided to post it here so that I can simply link people to it when necessary. I hope that those who read it find it useful and get on their way with Windows programming. If there's anything hideously wrong, or anything you think could be added, please let me know.

Also, before or after you read this, you may want to consult the official MSDN documentation on Unicode and character sets. Everything I've explained here is in there; you may just have to wade through more reading to get to it. You can also find some more good information on the topic in this blog entry by Joel Spolsky.

------------------------------------------------------------------------------

The Windows API supports two kinds of strings, each using its own character type. The first kind is multi-byte strings: arrays of chars, where each character can be either a single byte (one char) or multiple bytes, and how the bytes are interpreted into characters depends on the ANSI code page being used. The "standard" code page for Windows in the US is windows-1252, known as "ANSI Latin 1; Western European". These strings are generally referred to as "ANSI" strings throughout the Windows documentation. The Windows headers typedef the type "char" to "CHAR", and also typedef pointers to these strings as "LPSTR" and "LPCSTR" (the second being a pointer to a constant string). String literals for this type simply use quotations, like in this example:


const char* ansiString = "This is an ANSI string!";
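
Incidentally, if you ever need to convert one of these code-page-dependent strings to the wide format described next, the Windows API provides the MultiByteToWideChar function. Here's a minimal sketch, assuming the system's default ANSI code page (CP_ACP) and a buffer size that's just illustrative:


#include <windows.h>

int main()
{
    const char* ansiString = "This is an ANSI string!";
    wchar_t wide[64];

    // Interpret the bytes according to the system's default ANSI code
    // page (CP_ACP) and produce a wide (UTF-16) copy of the string.
    // Passing -1 for the length tells it the input is null-terminated.
    MultiByteToWideChar(CP_ACP, 0, ansiString, -1, wide, 64);
    return 0;
}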


The second kind of string is what's referred to as a Unicode string. There are several Unicode encodings, but in the Windows API "Unicode" generally refers to the UTF-16 encoding. UTF-16 uses two-byte code units, and therefore in C and C++ the strings are represented as arrays of the type wchar_t (which is two bytes in size on Windows, and therefore referred to as a "wide" character). Note that UTF-16 is still a variable-length encoding: most common characters fit in a single two-byte code unit, but characters outside the Basic Multilingual Plane are encoded as a surrogate pair of two code units (four bytes). Unicode is a worldwide standard, and supports characters from many languages with one standard code page (with multi-byte strings you'd have to switch to a different code page if you wanted something like kanji). This is obviously a big improvement, which is why Microsoft encourages all newly-written apps to use Unicode exclusively (it's also why a new Visual C++ project defaults to Unicode). The Windows headers typedef the type "wchar_t" to "WCHAR", and also typedef pointers to Unicode strings as "LPWSTR" and "LPCWSTR". String literals for this type use quotations prefixed with an "L", like in this example:


const wchar_t* unicodeString = L"This is a Unicode string!";
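
To illustrate the variable-length point above (this follows from the UTF-16 encoding rules themselves, not anything Windows-specific), here's a small sketch using U+1D11E (MUSICAL SYMBOL G CLEF), an arbitrary character that lies outside the Basic Multilingual Plane:


#include <wchar.h>
#include <stdio.h>

int main()
{
    // U+1D11E lies outside the Basic Multilingual Plane, so UTF-16
    // encodes it as a surrogate pair (high surrogate, low surrogate).
    const wchar_t* clef = L"\xD834\xDD1E";

    // wcslen() counts wchar_t code units, not characters: this prints 2
    // even though the string contains a single character.
    printf("%u\n", (unsigned)wcslen(clef));
    return 0;
}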


Okay, so I said that the Windows API supports both the old ANSI strings and Unicode strings. It does this through a sort of polymorphic character type, and by using macros for the functions that take strings as parameters. Allow me to elaborate on the first part...

The Windows API defines a third character type, and consequently a third string type. This type is "TCHAR", and its definition looks something like this:


#ifdef UNICODE
typedef WCHAR TCHAR;
#else
typedef CHAR TCHAR;
#endif

typedef TCHAR* LPTSTR;
typedef const TCHAR* LPCTSTR;


So as you can see here, how the TCHAR type is defined depends on whether the "UNICODE" macro is defined. In this way, the "UNICODE" macro becomes a sort of switch that lets you say "I'm going to be using Unicode strings, so make my TCHAR a wide character." And this is exactly what Visual C++ does when you set the project's "character set" to Unicode: it defines UNICODE for you. So what you get out of this is the ability to write code that can compile to use either ANSI strings or Unicode strings depending on a macro definition or a compiler setting. This ability is further aided by the TEXT() macro, which will produce either an ANSI or Unicode string literal:


LPCTSTR tString = TEXT("This could be either an ANSI or Unicode string!");
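
For reference, TEXT()'s definition is conceptually the same trick as TCHAR's. Here's a simplified sketch of what the headers do (the real headers go through an extra level of indirection, but the effect is the same):


#ifdef UNICODE
#define TEXT(quote) L##quote  // paste the L prefix onto the literal
#else
#define TEXT(quote) quote     // leave the literal as a narrow string
#endif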


Now that you know about TCHARs, things might make a bit more sense when you look at the documentation for any Windows API function that accepts a string. For example, let's look at the documentation for MessageBox. The prototype shown on MSDN looks like this:


int MessageBox(
    HWND hWnd,
    LPCTSTR lpText,
    LPCTSTR lpCaption,
    UINT uType
);


As you can see, it asks for strings of TCHARs. This makes sense, since your app could be using either character type and the API doesn't want to force one on you. However, there's a big problem with this: the functions that make up the Windows API are implemented in precompiled DLLs. Since TCHAR is resolved at compile time, each function had to be compiled as either ANSI or Unicode. So how did Microsoft get around this? They compiled both!

See, the function prototype you see in the documentation isn't actually the prototype of any existing function. It's just a bit of sugar to make things look nice while you're learning how a function works, and to tell you how you should be using it. In actuality, every function that accepts strings has two versions: one with an "A" suffix that takes ANSI strings, and one with a "W" suffix that takes Unicode strings. When you call a function like MessageBox, you're actually invoking a macro that's defined as one of the two versions depending on whether the UNICODE macro is defined. This means the Windows headers have something that looks like this:


WINUSERAPI
int
WINAPI
MessageBoxA(
    __in_opt HWND hWnd,
    __in_opt LPCSTR lpText,
    __in_opt LPCSTR lpCaption,
    __in UINT uType);
WINUSERAPI
int
WINAPI
MessageBoxW(
    __in_opt HWND hWnd,
    __in_opt LPCWSTR lpText,
    __in_opt LPCWSTR lpCaption,
    __in UINT uType);
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif


Pretty tricky, eh? With these macros, the ugliness of having two functions is kept reasonably transparent to the programmer (with the disadvantage of causing some confusion among Windows newbies). Of course, these macros can be bypassed completely if you want, simply by calling one of the typed versions directly. This is important for programs that dynamically load functions from Windows DLLs at runtime using LoadLibrary and GetProcAddress: since macros like "MessageBox" don't actually exist in the DLL, you have to ask for one of the "real" functions by name, as in the sketch below.
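
Here's a minimal sketch of that scenario (the message text is just illustrative, and error handling is kept to a bare minimum): user32.dll is loaded at runtime, and the Unicode version of MessageBox is fetched by its real exported name.


#include <windows.h>

// Function pointer type matching the real MessageBoxW export.
typedef int (WINAPI *MessageBoxWPtr)(HWND, LPCWSTR, LPCWSTR, UINT);

int main()
{
    HMODULE user32 = LoadLibraryW(L"user32.dll");
    if (user32 == NULL)
        return 1;

    // Asking for "MessageBox" would fail -- only the "A" and "W"
    // versions are actually exported from the DLL.
    MessageBoxWPtr pMessageBoxW =
        (MessageBoxWPtr)GetProcAddress(user32, "MessageBoxW");
    if (pMessageBoxW != NULL)
        pMessageBoxW(NULL, L"Loaded at runtime!", L"Demo", MB_OK);

    FreeLibrary(user32);
    return 0;
}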

Anyway, that's a basic summary of how the Windows API handles Unicode. With this, you should be able to get started using Windows API functions, or at least know what kinds of questions to ask when you need something cleared up on the issue.

ADDITIONAL INFO:

The above refers specifically to how the Windows API handles strings. The Visual C++ C Run-Time library also supports its own _TCHAR type, which is defined in a manner similar to TCHAR except that it uses the _UNICODE macro. It also defines a _T() macro for string literals that works the same way as TEXT(). String functions in the CRT also use the _UNICODE macro, so if you're using them you must remember to define _UNICODE in addition to UNICODE (Visual C++ defines both if you set the character set to Unicode).
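
For example, here's a small sketch using the CRT's "generic-text" routines from tchar.h, where _tcslen compiles to wcslen in a _UNICODE build and strlen otherwise:


#include <tchar.h>

int main()
{
    // _T() picks the literal type, and _tcslen picks the matching
    // length function, based on whether _UNICODE is defined.
    const _TCHAR* message = _T("Hello, generic text!");
    size_t length = _tcslen(message);
    return (int)length;
}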

If you use Standard C++ Library classes that work with strings, such as std::string and std::ifstream, and you want to use Unicode, you can use the wide-char versions. These classes have a "w" prefix, such as std::wstring and std::wifstream. There are no standard classes that use the TCHAR type; however, if you'd like, you can simply define tstring or tifstream aliases yourself using the _UNICODE macro.
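
One possible way to do that, as a sketch (the alias names here are just suggestions, not anything the headers provide), mirroring how _TCHAR itself is defined:


#include <string>
#include <fstream>

#ifdef _UNICODE
typedef std::wstring   tstring;    // wide string in Unicode builds
typedef std::wifstream tifstream;  // wide input file stream
#else
typedef std::string    tstring;    // narrow string otherwise
typedef std::ifstream  tifstream;
#endif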


Comments

According to Wikipedia, UTF-16 is also variable-length, and thus not necessarily two bytes per glyph: some glyphs take more than two bytes in UTF-16.

The article mentions that the Unicode used by Windows is two bytes per glyph. Does it also support the UTF-16 glyphs that take more than two bytes, or does Windows only support the two-byte glyphs?

There are more than 65536 glyphs in Unicode.

Originally, Windows NT supported UCS-2. Support for surrogates (and therefore full UTF-16) is a bit spotty, but increasing. XP required US users to explicitly enable Uniscribe, which GDI needs in order to output surrogates. With Vista it's always on, and some fonts were added for non-BMP characters.

Joel Spolsky's post on Unicode is pretty good.

You're right, Lode; thank you for pointing that out so I can reword it for better accuracy. Also, thank you for linking to that blog, Anon Mike; I think I will link to it in my entry, as it's good further reading.

I wish I had read this article when I was new to Windows programming; it would have saved me a lot of hair-pulling. Thanks MJP!
