ibolcina

Getting confused about strings: char*, TCHAR, wchar_t, Unicode

Hi. I am confused about string types in C++. In the old days there was char* with strcpy, strcat and so on, but now there are so many: TCHAR, ATL macros, UNICODE functions, T() and _T() macros, and millions of others. I am currently using std::string, but I am often forced to cast it with the _T macro or T macro or LPSTR or whatever, and I am not sure what I am doing. Is there a tutorial with a complete overview of this? Which is the "write once, run everywhere" version for strings? I really don't like having a lot of #ifdef in my code. Which one do you use? Bye, Ivan

AFAIK, TCHAR, _T(), T() and similar macros exist to make code character-set neutral. That lets the same source compile for platforms that use different character sets, or that use two bytes per character instead of one. To tell you the truth, I hardly ever use those macros or types, because I don't expect my code to be cross-platform compatible, lol! You can use them if you want, but they can become a burden after a while.

I'd suggest sticking with std::string and using the member function string::c_str(), which returns a const char*. Although to be fair, I rarely use strings, so that may not be what you are looking for.
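
A minimal sketch of that suggestion, assuming the API you call takes a NUL-terminated const char*. The names legacy_length and greeting_length are made up for illustration; std::strlen stands in for whatever C-style function you actually need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for an older C-style API that expects a
// NUL-terminated const char* rather than a std::string.
std::size_t legacy_length(const char* text) {
    assert(text != nullptr);  // C APIs usually require a valid pointer
    return std::strlen(text);
}

// Build the text with std::string, then hand it over with c_str().
std::size_t greeting_length(const std::string& name) {
    const std::string greeting = "Hello, " + name;
    // The pointer from c_str() stays valid while 'greeting' is alive
    // and unmodified, which covers the duration of this call.
    return legacy_length(greeting.c_str());
}
```

Because c_str() hands out a const pointer, this only works for functions that read the text. A function that writes into the buffer needs its own writable copy.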

char is 1 byte.
wchar_t is 2 bytes on Windows (a typedef for unsigned short in older Visual C++ versions); other platforms may use 4 bytes.
tchar (you mean TCHAR, right?) is either char or wchar_t, depending on whether _UNICODE is defined.
string is a sequence of char.
wstring is a sequence of wchar_t.
And there's no tstring.
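
The list above can be checked by the compiler. This is only a sketch: the function names are invented for illustration, and the 2-or-4-byte check is an assumption about mainstream platforms, not something the language guarantees.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <type_traits>

// Guaranteed by the language: sizeof(char) is 1, and a byte has >= 8 bits.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// string is a sequence of char, wstring a sequence of wchar_t.
static_assert(std::is_same<std::string::value_type, char>::value,
              "std::string holds char");
static_assert(std::is_same<std::wstring::value_type, wchar_t>::value,
              "std::wstring holds wchar_t");

// Size of one wide character on the current platform.
std::size_t wide_char_size() { return sizeof(wchar_t); }

// Bytes of character data in a wide string (terminator not counted).
std::size_t wide_payload_bytes(const std::wstring& text) {
    return text.size() * sizeof(wchar_t);
}

// Assumption, not a guarantee: wchar_t is 2 bytes (Windows) or
// 4 bytes (most Linux and macOS toolchains).
void check_common_platform_assumption() {
    assert(wide_char_size() == 2 || wide_char_size() == 4);
}
```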

char* isn't a string; it is a pointer, and it is 4 bytes in size on a 32-bit system. So is wchar_t*.

"ABC" is a narrow (char) string literal, i.e. the ASCII/ANSI representation.
L"ABC" is a wide (wchar_t) string literal, i.e. the Unicode representation on Windows.
So _T("ABC") means either "ABC" or L"ABC", again depending on whether _UNICODE is defined.
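
To make the mechanism concrete, here is a portable sketch that imitates it. Everything prefixed with my_ or MY_ is a hypothetical stand-in, not the real Windows definition; the real TCHAR and _T() come from the Windows headers and are switched by the UNICODE/_UNICODE settings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the TCHAR / _T() machinery, so the sketch
// compiles without <tchar.h>. Define MY_UNICODE to switch to wide text.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(literal) L##literal
#else
typedef char my_tchar;
#define MY_T(literal) literal
#endif

// A do-it-yourself "tstring": basic_string over whichever character
// type was selected above.
typedef std::basic_string<my_tchar> my_tstring;

// The same source line works in both builds.
my_tstring make_title() {
    my_tstring title = MY_T("ABC");
    title += MY_T("DEF");
    return title;
}

void check_title() {
    const my_tstring title = make_title();
    assert(title.size() == 6);
    assert(title == MY_T("ABCDEF"));
}
```

The cost of this style is that every literal and every string type in the program has to go through the macros, which is the burden mentioned earlier in the thread.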

Hope that makes things clearer.

[edited by - DerekSaw on December 26, 2002 1:36:26 AM]

Guest Anonymous Poster
hi ..

16-bit Unicode characters/strings are only needed if you plan to release a version of your program for Far East or other Asian markets such as Japan, or if you are doing some kind of international web-based application, e.g. one server for 8-bit and 16-bit clients (but that's simply ANSI->Unicode conversion).

Microsoft exaggerates once again.

Porting a program from 8-bit to 16-bit without TCHAR is done in a few minutes.

TCHAR is only useful when developers from, e.g., Asia and the USA work together on the same source code.

You really don't need to care about all this stuff if you don't plan one of the things above.

This article might interest you: Strings the OLE Way.

From the article:

char                  An 8-bit signed character (an ANSI character).
wchar_t               A typedef to a 16-bit unsigned short (a Unicode character).
CHAR                  The Win32 version of char.
WCHAR                 The Win32 version of wchar_t.
OLECHAR               The OLE version of wchar_t.
_TCHAR                A generic character that maps to char or wchar_t.
LPSTR, LPCSTR         A Win32 character pointer. The version with C is const.
LPWSTR, LPCWSTR       A Win32 wide character pointer.
LPOLESTR, LPCOLESTR   An OLE wide character pointer.
LPTSTR, LPCTSTR       A Win32 generic character pointer.
_T(str), _TEXT(str)   Identical macros to create generic constant strings.
OLESTR(str)           OLE macro to create generic constant strings.


"Oh no, not again" - Agrajag

quote:
Original post by Anonymous Poster
16-bit Unicode characters/strings are only needed if you plan to release a version of your program for Far East or other Asian markets such as Japan, or if you are doing some kind of international web-based application, e.g. one server for 8-bit and 16-bit clients (but that's simply ANSI->Unicode conversion).



Most East Asian character sets are multi-byte character sets.

Unicode is used mainly to simplify supporting multiple character sets in the same application, and those don't have to be multi-byte character sets either. For example, you generally can't support Russian and Latin character sets at the same time with a single-byte character set; you need a lot of locale switching to handle collation and so forth (and even then, you wouldn't be able to collate a Russian string and a Latin string together, because they're encoded differently). With Unicode you can mix and match locales without worrying about character encoding, because everything gets encoded the same way.
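
As a rough illustration of that point, the sketch below keeps Cyrillic and Latin text in one std::wstring with no locale switching. The function names are invented for the example, and the Russian word is written with \u escapes so the source file itself stays plain ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One wide string mixing Cyrillic and Latin text. The two escapes are
// U+0414 and U+0430, the Russian word for "yes".
std::wstring mixed_text() {
    return std::wstring(L"\u0414\u0430") + L" / yes";
}

// Count characters outside the 7-bit ASCII range.
std::size_t count_non_ascii(const std::wstring& text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] > 0x7F) ++count;
    }
    return count;
}

void check_mixed_text() {
    const std::wstring text = mixed_text();
    assert(text.size() == 8);            // 2 Cyrillic + 6 Latin/punctuation
    assert(count_non_ascii(text) == 2);  // only the Cyrillic letters
}
```

Both scripts live in the same container and are indexed the same way. Collation is a separate problem that this sketch does not attempt.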

The TCHAR and _T macros are good because they allow you to release a single-byte version of your application and a double-byte version with just a change of compiler settings. This is useful because on Windows 9X all the system libraries are single-byte, while on Windows NT/2000 they're all double-byte. So if you run a single-byte app on Windows NT/2000, every system call that expects strings has to convert them to double-byte, whereas if you release a double-byte app on Windows 9X, every system call that expects strings would need them converted to a multi-byte character set. This all results in a bit of overhead.

Still, if you're only developing for Windows NT/2000/XP, then I wouldn't bother with TCHAR and _T and would just go Unicode everywhere, because it'll be faster anyway (ignoring the extra memory needed by Unicode strings...).

quote:

You really don't need to care about all this stuff if you don't plan one of the things above.


You don't need to care if you're only ever going to support one character set. If you're supporting multiple character sets (even multiple single-byte character sets) on both Windows NT/2000 and Windows 9X, then you might care.



If I had my way, I'd have all of you shot!


codeka.com - Just click it.

quote:
I am confused about string types in c++.


You needn't be. There are two basic types: char, which holds a narrow character such as an ASCII character, and wchar_t, which holds a wide character. char is guaranteed to be at least eight bits, so it can hold any character in the 128-character ASCII set. wchar_t is provided for character sets that don't fit in a char, like IIRC traditional Chinese. std::basic_string is not a basic type but a class template that can work with any of the basic character types. std::string is just a typedef for std::basic_string<char>.
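
A small sketch of what that buys you: code written against std::basic_string works for both std::string and std::wstring. The helper name repeat is made up for the example.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// std::string and std::wstring are typedefs over the same template.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Written once against the template, usable with either character type.
template <typename CharT>
std::basic_string<CharT> repeat(const std::basic_string<CharT>& text,
                                unsigned times) {
    std::basic_string<CharT> result;
    for (unsigned i = 0; i < times; ++i) {
        result += text;
    }
    return result;
}

void check_repeat() {
    assert(repeat(std::string("ab"), 3) == "ababab");
    assert(repeat(std::wstring(L"ab"), 2) == L"abab");
}
```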

quote:
In the old days there was char* with strcpy, strcat and so on, but now there are so many: TCHAR, ATL macros, UNICODE functions, T() and _T() macros, and millions of others. I am currently using std::string, but I am often forced to cast it with the _T macro or T macro or LPSTR or whatever, and I am not sure what I am doing.


The T stuff is non-standard. Basically, TCHAR is a Microsoft typedef that maps to either char or wchar_t depending on the build settings. Others have explained this in more detail, so I'll skip it...

quote:
I'd suggest sticking with std::string and using the member function string::c_str(), which returns a const char*.


I'll second that.

quote:
std::basic_string is not a basic type but actually a class template that can work with any of the basic types.


It also works with user-defined types, provided of course that they meet std::basic_string's requirements.
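
A full user-defined character type needs a fair amount of boilerplate, so the sketch below shows a smaller, related customization instead: a user-defined traits class on plain char, which is another of the things basic_string is parameterized on. The names ci_char_traits and ci_string are illustrative, and the comparison is ASCII-style case folding only.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// User-defined traits: like std::char_traits<char>, but comparisons
// ignore case. Everything not overridden is inherited unchanged.
struct ci_char_traits : public std::char_traits<char> {
    static bool eq(char a, char b) { return lower(a) == lower(b); }
    static bool lt(char a, char b) { return lower(a) < lower(b); }

    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }

private:
    // Cast through unsigned char so tolower never sees a negative value.
    static int lower(char c) {
        return std::tolower(static_cast<unsigned char>(c));
    }
};

// A string type whose comparisons and searches are case-insensitive.
typedef std::basic_string<char, ci_char_traits> ci_string;

void check_ci_string() {
    assert(ci_string("Hello") == ci_string("hELLO"));
    assert(ci_string("abc") != ci_string("abd"));
    assert(ci_string("Hello").find('L') == 2);
}
```

Note that ci_string is a different type from std::string, so the two do not convert to each other implicitly.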

Hi.
Thanks for clearing that up. It's a lot clearer now.

So I'll stick with std::string. But a lot of libraries need TCHAR, which is, as far as I understand, a macro over either char or wchar_t. So should I convert to the required type just before calling a library function, and otherwise std::string will be fine?


Thanks, Ivan
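
One way the "convert at the boundary" idea could look, as a sketch only. library_wide_length is a hypothetical stand-in for a library function compiled for wide characters, and widen_ascii assumes the text is 7-bit ASCII; anything else needs a real conversion routine (MultiByteToWideChar on Windows, for example).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a library function built for wide characters.
std::size_t library_wide_length(const wchar_t* text) {
    std::size_t n = 0;
    while (text[n] != L'\0') ++n;
    return n;
}

// Widen an ASCII-only std::string one character at a time.
// Assumption: every char is 7-bit ASCII, which the assert enforces.
std::wstring widen_ascii(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (std::size_t i = 0; i < narrow.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(narrow[i]);
        assert(c < 0x80);
        wide.push_back(static_cast<wchar_t>(c));
    }
    return wide;
}

// Keep std::string in your own code; convert only at the library call.
std::size_t call_library(const std::string& text) {
    // The temporary wstring lives until the end of this full expression,
    // so the pointer passed to the library stays valid during the call.
    return library_wide_length(widen_ascii(text).c_str());
}
```

If the library is built for narrow characters instead, no conversion is needed and text.c_str() can be passed directly.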
