I was browsing through help in VC++ .net, and I kept coming across this...

I was browsing through the "isalpha" function in the help for VC++ .net, and for every keyword, functions etc there was in C++, the was one with "w" added to it.. such as: w int isw alpha etc... What does the "w" mean, and how is it different?? BattleGuard Only questions raise questions. Questions are raised by people, by curiousity, the gift of nature to all human beings. And curiosity is satisfied by answers, which in turn raise questions, which lead to answers. And this curiosity is what keeps SCIENCE alive...
w stands for "wide" in this case.
I don't really know anything about the "wide" character set, but apparently they have a slightly different interpretation than ASCII.

check out:

http://www.opengroup.org/onlinepubs/007908799/xsh/iswalpha.html
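
To make the distinction concrete, here is a minimal sketch (my own example, not from the linked page) comparing the narrow isalpha() with the wide iswalpha(); it assumes nothing beyond the standard <cctype> and <cwctype> headers:

#include <cctype>
#include <cwctype>
#include <cstdio>

int main()
{
    char c = 'a';        // narrow character, usually 1 byte
    wchar_t wc = L'a';   // wide character, implementation-defined size

    // isalpha() expects an unsigned char value; iswalpha() takes a wide character.
    std::printf("isalpha:  %d\n", std::isalpha((unsigned char)c) != 0);
    std::printf("iswalpha: %d\n", std::iswalpha(wc) != 0);
    return 0;
}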
Thanks AP. Anyone know how the integration with wide variables works? What's the difference?

Battleguard


wide is for character sets that use 2 bytes, such as UTF-16.
--AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.
I'm using wide characters in my project right now, if for no other reason than to learn how to use them. It surprised me that support for them is built in throughout the C/C++ libraries: std::wstring, wcslen(), etc. I believe someone mentioned that wchar_t is actually a standard C++ type, even though MSVC++ makes it a typedef. The characters take up twice as much memory, but I'm not exactly concerned about that at this point in my project.
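
A rough sketch of what that built-in support looks like in practice (a hypothetical example, not from the poster's project), using std::wstring and wcslen() as mentioned above:

#include <string>
#include <cwchar>
#include <iostream>

int main()
{
    std::wstring ws = L"hello";

    std::wcout << L"wstring length: " << ws.size() << L"\n";
    std::wcout << L"wcslen:         " << std::wcslen(ws.c_str()) << L"\n";
    return 0;
}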
Wide characters are not necessarily 2 bytes. A wchar_t in C++ is a type large enough to store all the distinct codes in the extended character set supported by all the locales supported by the runtime library. Some compilers implement wchar_t the same as char, because they support no real locales. Some compilers implement wchar_t as a 32-bit value because their locales include support for some whacking huge character sets, such as those of most of the Far East languages.

Technically, it's a violation of the standard to use UTF-16 with wchar_t, because UTF-16 is a variable-length encoding. In general, a compiler that uses a 16-bit wchar_t has locales that use a UCS-2 encoding.

Basically, you would use wchar_t values the same as char values, except that when declaring literals you use the L prefix. Such as:
wchar_t ch = L'a';
wchar_t str[] = L"a string";
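
If you want to see what your own compiler does, a quick check like the following works; the 2-byte versus 4-byte split described above (e.g. MSVC versus many Unix compilers) is typical but, as noted, not guaranteed by the standard:

#include <cstdio>

int main()
{
    std::printf("sizeof(char)    = %u\n", (unsigned)sizeof(char));
    std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}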
Umm... thanks, but what is UTF-16? Sorry, I'm new to this...

Battleguard
UTF-16 and UCS-2 are different standards for storing characters in a computer.

Everyone knows about ASCII: it's the standard used to store English characters and symbols within 8 bits of data (technically 7, but let's not quibble over the details any further than that).

UTF-16 and UCS-2 are just different standards for storing different sets of characters. 256 slots are certainly not enough for, say, Japanese, which uses thousands of distinct characters in everyday newspaper text and tens of thousands overall.

Hope that clarifies. :-D
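
To illustrate the "256 slots" point, here is a small check (my example, not from the post) of how many values a plain char can represent versus the range of wchar_t on a given compiler; the exact numbers will vary by platform:

#include <climits>
#include <cwchar>
#include <cstdio>

int main()
{
    // A char has CHAR_BIT bits (almost always 8), so at most 256 distinct values.
    std::printf("char:    %d bits, %lu possible values\n", CHAR_BIT, 1UL << CHAR_BIT);

    // wchar_t's range is implementation-defined; WCHAR_MAX comes from <cwchar>.
    std::printf("wchar_t: max value %lu\n", (unsigned long)WCHAR_MAX);
    return 0;
}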


Remember: Silly is a state of Mind, Stupid is a way of Life.
-- Dave Butler

Globals are not evil. Singletons are evil.
Google "unicode", "UTF-8", "UTF-16", and "wide characters".

To tell the truth, I think I'm confused, because I always thought that "wide" characters were the ones that could be variable length, and that Unicode-16 was ALWAYS 16 bits per char (and hence simply couldn't store certain of the Unicode glyphs)... but I really just don't know enough about it...

The minimal thing you need to know is this: ASCII (or any 8-bit character format) is no longer sufficient for dealing with strings, because every major operating system now supports sets of languages which use more than 256 total unique characters...

So once upon a time, they invented variable-length encodings, where most characters stayed 1 byte, but certain bytes were used to say "also look at the next byte"... hence VARIABLE LENGTH encodings...

Then people said no, random access is too important, as is easily determining the number of bytes needed to store a string... so they decided on a pure 16-bit encoding, 16 bits for every character. But in fact you cannot store EVERY language in one set of 16-bit characters (64K characters), so they then decided to create a pure 32-bit mapping of characters... an official map (Unicode) standardized by ISO... and to define standard subsets and methods to map this down into 16-bit or 8-bit or variable-length versions for different uses...

THAT IS UNICODE ...
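
Here is a tiny sketch of that "also look at the next byte" idea, using UTF-8 as the example (this is my illustration, not something from the post): in UTF-8, continuation bytes have the bit pattern 10xxxxxx, so counting code points means counting only the bytes that are not continuation bytes.

#include <cstdio>

int main()
{
    // "héllo" encoded as UTF-8: the 'é' takes two bytes (0xC3 0xA9).
    const unsigned char utf8[] = { 'h', 0xC3, 0xA9, 'l', 'l', 'o', 0 };

    int codepoints = 0;
    for (const unsigned char* p = utf8; *p; ++p)
    {
        if ((*p & 0xC0) != 0x80)   // skip 10xxxxxx continuation bytes
            ++codepoints;
    }

    std::printf("bytes: %d, code points: %d\n", (int)(sizeof(utf8) - 1), codepoints);
    return 0;
}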
First, a nitpick: the Unicode organization is separate from ISO/IEC JTC1/SC2/WG2, the working group responsible for producing ISO 10646, which is the ISO document defining a universal character set. However, the latest Unicode standard and ISO 10646:2000 are largely compatible; the Unicode group and ISO/IEC JTC1/SC2/WG2 work closely together, much the same way the ANSI C++ committee and the ISO C++ committee work closely together.

Anyway, here's a brief (and somewhat oversimplified) version of what UTF-16, UCS-2, etc. are all about. (It's also written straight from memory, so some parts might not be as correct as what you'd find by actually going to the Unicode website.)

The first version of the Unicode standard was based on the idea that all characters in all languages could be squeezed into 16 bits. In hindsight this seems laughable, but at the time it seemed to make sense. After all, when the Unicode standard was first proposed, the most common forms of representing Far East characters, such as MBCS, seemed to obey the 16-bit limit.

So in the first versions of the standard, recording a Unicode character was as simple as storing a 16-bit value. However, given that most existing code processes character strings as 8-bit values, a different kind of encoding that would remain somewhat backwards compatible was proposed. The idea was that all characters from 0-127 would map to the same characters as defined by ASCII, and then two or more bytes with the high bit set could be used to assemble the 16-bit values. This encoding is called UTF-8 (Unicode Transformation Format, 8 bits). Storing the values in 16-bit chunks was called UTF-16. IIRC, the ISO committee then took UTF-16 and labeled it UCS-2 (Universal Character Set - 2 octets).

So a few revisions later, when it was figured out that 16 bits just wasn't going to cut it, the Unicode standard arrived at the following system: Unicode would encode "code points", which include what we would normally call characters, as well as other special formatting values, such as combining code points and augmenting code points. And instead of extending from 0 to 0xFFFF, the code points would go from 0 to 0x10FFFF (which fits into 21 bits). (And no, I don't know why it doesn't fill all 21 bits.)

UTF-8 was modified so that it could encode 21 bits instead of just 16. Then UTF-16 was rewritten so that it would encode 21 bits in a similar way. However, UCS-2 was not extended the same way as UTF-16; instead, UCS-2 refers only to code points that occupy the lower 65,536 code point positions.
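
For what it's worth, the "similar way" for UTF-16 is what are called surrogate pairs: a code point above 0xFFFF is split across two 16-bit units, one from the range 0xD800-0xDBFF and one from 0xDC00-0xDFFF. A small sketch of the arithmetic (my own example, using the well-known code point U+1D11E):

#include <cstdio>

int main()
{
    unsigned long cp = 0x1D11E;       // a code point beyond 16 bits
    unsigned long v  = cp - 0x10000;  // 20 bits remain after subtracting the offset

    unsigned int high = 0xD800 + (unsigned int)(v >> 10);    // top 10 bits
    unsigned int low  = 0xDC00 + (unsigned int)(v & 0x3FF);  // bottom 10 bits

    std::printf("U+%05lX -> 0x%04X 0x%04X\n", cp, high, low);
    return 0;
}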

Incidentally, random access wasn't the biggest reason for wanting a shift to a fixed-length character encoding. The biggest problems with working with, for example, SJIS were the overlap issues (which are related to the random access problem, but far more pernicious). In any case, UTF-16 deals with the problem by making the high, low and single values form disjoint sets.

So back to the original point. The C++ standard requires that a wchar_t, when pressed into duty as a wide character (as opposed to being used as a funny typedef for an integer type), must hold a single atomic character unit. Therefore, C++ locales cannot use a UTF-16 (or, for that matter, a UTF-8) encoding for their wide characters, because then passing a wchar_t to iswalpha() wouldn't make any sense: the wchar_t might be holding only half the bits that make up an actual Unicode code point. So usually, if the size of wchar_t is 16 bits, the locales use UCS-2, the lower 16 bits' worth of the Unicode code points.
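
To see why half a code point is a problem, here is a sketch assuming a 16-bit wchar_t (as on typical Windows compilers; the example is mine, not the poster's): a character beyond U+FFFF has to be stored as two wchar_t units, and passing either unit to iswalpha() is meaningless on its own.

#include <cwctype>
#include <cstdio>

int main()
{
    if (sizeof(wchar_t) == 2)
    {
        // U+1D11E stored as a UTF-16 surrogate pair; neither unit is a whole character.
        wchar_t pair[] = { (wchar_t)0xD834, (wchar_t)0xDD1E, 0 };
        std::printf("iswalpha(high surrogate) = %d (not meaningful)\n",
                    std::iswalpha(pair[0]) != 0);
    }
    else
    {
        std::printf("wchar_t is %u bytes here; one unit can hold the whole code point\n",
                    (unsigned)sizeof(wchar_t));
    }
    return 0;
}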

