Using the Unicode character set, adding 'L' before strings, some stuff I don't understand

The DirectX examples that come with the SDK are set to use the Unicode character set, and trying to compile them with any other setting results in errors about conversions (such as 'LPSTR' to 'LPCWSTR' or 'const char [29]' to 'LPCWSTR'). To solve it I have to add an L before strings, so "string" becomes L"string". Why is this so? My second question is: how can I convert an LPSTR to an LPCWSTR? Also, LPSTR and LPCWSTR are typedefs, so what do they stand for exactly?
The 'L' tells the compiler that you want a string literal of type wchar_t*, as opposed to a string literal of type char*. Why an 'L'? I've got no idea, perhaps it's meant to signify long characters. Whatever the reason, getting it wrong means a world of hurt since there's no automatic conversion between strings of type wchar_t and char. IMHO string literals should default to either wchar_t or char according to a compiler option, but alas it's not so.
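For instance, here's a minimal sketch of what does and doesn't compile (MessageBoxW is just the explicitly wide version of the Win32 MessageBox call):

#include <windows.h>

int main()
{
    const char*    narrow = "Hello";   // string literal of type char[6]
    const wchar_t* wide   = L"Hello";  // the L prefix makes it wchar_t[6]

    // MessageBoxW expects LPCWSTR (const wchar_t*), so this is fine:
    MessageBoxW(NULL, wide, L"Caption", MB_OK);

    // This would NOT compile: there is no implicit conversion
    // from const char* to const wchar_t*.
    // MessageBoxW(NULL, narrow, L"Caption", MB_OK);

    (void)narrow; // only here to show the contrast
    return 0;
}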

As for what the typedefs and some other related macros stand for:

LPSTR = Long Pointer to a STRing (char*)
LPCSTR = Long Pointer to a Constant STRing (const char*)
LPWSTR = Long Pointer to a Wide STRing (wchar_t*)
LPCWSTR = Long Pointer to a Constant Wide STRing (const wchar_t*)
LPTSTR = Long Pointer to a TCHAR STRing (TCHAR*)
LPCTSTR = Long Pointer to a Constant TCHAR STRing (const TCHAR*)
TCHAR = either char or wchar_t, depending on whether UNICODE is defined

TEXT("My String") = either "My String" or L"My String", depending on whether UNICODE is defined
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
There are two types of character sets, "Multi-Byte" and "Unicode". The difference between the two is that "Multi-Byte" is your standard day-to-day ASCII format with its 255 bit patterns, while "Unicode" is an industry standard that allows text/symbols from all of the world's writing systems to be displayed. Before "Unicode", this was impossible due to the 255-bit-pattern ASCII limit and operating system support.

LPSTR, LPCWSTR and so on are Hungarian Notation typedefs. LPSTR stands for a pointer to a multi-byte string, and LPWSTR stands for a pointer to a Unicode string.

There are some macros that may be of use to you located in the header file <tchar.h>, such as the _T macro, which turns a string literal into the proper format based on the compiler's character set, and so on.
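A quick sketch of the tchar.h versions (note that these key off _UNICODE, with a leading underscore, while TEXT/TCHAR from <windows.h> key off UNICODE; Visual Studio normally defines both together):

#include <tchar.h>
#include <stdio.h>

int main()
{
    // _TCHAR and _T() resolve to wchar_t / L"..." when _UNICODE is
    // defined, and to char / "..." otherwise.
    const _TCHAR* name = _T("Shadow Wolf");

    // tchar.h also maps C runtime functions: _tcslen becomes wcslen
    // or strlen, and _tprintf becomes wprintf or printf.
    _tprintf(_T("%u characters\n"), (unsigned)_tcslen(name));
    return 0;
}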

You can read up and learn more about these topics here:
Unicode
ASCII
Hungarian Notation
MSDN Unicode Character Sets
Quote:Original post by Shadow Wolf
There are two types of character sets, "Multi-Byte" and "Unicode". The difference between the two is that "Multi-Byte" is your standard day-to-day ASCII format with its 255 bit patterns, while "Unicode" is an industry standard that allows text/symbols from all of the world's writing systems to be displayed. Before "Unicode", this was impossible due to the 255-bit-pattern ASCII limit and operating system support.

Multibyte (or UTF-8) allows that too. It just achieves it a bit more awkwardly.
Multibyte, as the name implies, uses multiple bytes to represent special characters. All 128 ASCII characters are represented by a single byte (so your ASCII strings will look *exactly* the same in UTF-8 encoding), and everything else will use two or more bytes, starting with a value above 127, to keep it separate from the ASCII chars.
The ASCII compatibility is nice in some situations, but this scheme also makes it very hard to figure out the number of characters in a string and other operations. (You need to iterate through the string to tell where each character begins and ends)
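As a rough sketch of that iteration (Utf8Length is a made-up name, not a library function): UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so you can count characters by skipping them:

#include <stdio.h>
#include <string.h>

// Counts code points in a UTF-8 string by skipping continuation
// bytes, which always look like 10xxxxxx.
size_t Utf8Length(const char* s)
{
    size_t count = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    return count;
}

int main()
{
    const char* text = "na\xC3\xAFve";  // "naïve": C3 AF is a two-byte character
    printf("bytes: %u, characters: %u\n",
           (unsigned)strlen(text), (unsigned)Utf8Length(text));
    // prints: bytes: 6, characters: 5
    return 0;
}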
Whilst you might have come across this via the DXSDK, it's not really a DirectX-specific issue; I'm going to move this to 'General Programming' where it's more suited [smile]

Quote:how can I convert an LPSTR to an LPCWSTR?
Have a look into MultiByteToWideChar() (its opposite function is linked at the bottom of that page).
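The usual pattern is to call it twice: first to ask for the required buffer size, then to do the actual conversion. A rough sketch (AnsiToWide is just a name I've made up; CP_ACP means the system's current ANSI code page):

#include <windows.h>
#include <string>

std::wstring AnsiToWide(const char* narrow)
{
    // Passing -1 for the source length includes the terminating null;
    // a null output buffer makes the call return the required length.
    int len = MultiByteToWideChar(CP_ACP, 0, narrow, -1, NULL, 0);
    if (len == 0)
        return std::wstring(); // conversion failed

    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, narrow, -1, &wide[0], len);
    wide.resize(len - 1); // drop the duplicate terminating null
    return wide;
}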

Cheers,
Jack

Jack Hoxley [ Forum FAQ | Revised FAQ | MVP Profile | Developer Journal ]

Thanks for explaining, everyone. I was pretty confused as to why they have separate character sets.
Quote:Original post by Shadow Wolf
The difference between the two is that "Multi-Byte" is your standard day-to-day ASCII format with its 255 bit patterns
Not at all.
Multi-Byte normally refers to the extensions to ASCII designed to cover huge character sets by using variable-length encoding.
On the Windows platform you may meet Big5 for Chinese and Shift JIS for Japanese.
If you do not care about Win98, you don't have to use Multi-Byte strings in your application.

Quote:Original post by Shadow Wolf
Before "Unicode", this was impossible due to the 255 ASCII bit pattern limit and operating system.
Well, those multi-byte encodings were developed to do exactly that before Unicode was even born.

Quote:Original post by Spoonbender
Multibyte (or UTF-8) allows that too.
Even if some functions that process multi-byte strings can deal with UTF-8, it's not fully supported: UTF-8 cannot be set as the system locale.

Quote:Original post by Spoonbender
The ASCII compatibility is nice in some situations, but this scheme also makes it very hard to figure out the number of characters in a string and other operations. (You need to iterate through the string to tell where each character begins and ends)
…or you may store the string's length in characters along with it.
Quote:Original post by Serge K
Quote:Original post by Shadow Wolf
The difference between the two is that "Multi-Byte" is your standard day-to-day ASCII format with its 255 bit patterns
Not at all.
Multi-Byte normally refers to the extensions to ASCII designed to cover huge character sets by using variable-length encoding.
On the Windows platform you may meet Big5 for Chinese and Shift JIS for Japanese.
If you do not care about Win98, you don't have to use Multi-Byte strings in your application.

I think he may have just been referring to what it's called in Visual Studio. The only two choices are "Multi-byte character set" and "Unicode", and in this case MBCS does just yield you regular ASCII chars (I've never done anything with any extended characters, so I'll defer to your knowledge on that). But thanks for clearing that up; I'd always wondered why it was called a multi-byte character set when a char was one byte.
moe.ron
