
unsigned int == 4 chars?


I just came across this code and don't fully understand why it 'works' or behaves in this way.
unsigned int blah = 'blah';
Looking at the variable in memory shows:
halb
Which is 'blah' backwards. I see how 4 chars physically fit in an unsigned int, but it doesn't seem to make sense, especially since I'd expect only a single char between the '' and the value ends up stored backwards. Cheers. Edit: Forgot to mention, this is in C++.

The backwards part follows from the endianness of your CPU architecture, in this case little endian.

That letters can be used as numbers follows from the ASCII codes for those letters.

blah

b - 98
l - 108
a - 97
h - 104

b - 0x62
l - 0x6C
a - 0x61
h - 0x68

0x626C6168
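
As a rough sketch of the arithmetic above: each code goes into its own byte, first letter highest. The helper name `pack_chars` is made up for illustration, and the sketch assumes ASCII and 8-bit chars.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: puts the first character in the most significant
// byte, matching the 0x626C6168 layout written out above.
inline std::uint32_t pack_chars(unsigned char c0, unsigned char c1,
                                unsigned char c2, unsigned char c3) {
    return (static_cast<std::uint32_t>(c0) << 24) |
           (static_cast<std::uint32_t>(c1) << 16) |
           (static_cast<std::uint32_t>(c2) << 8)  |
            static_cast<std::uint32_t>(c3);
}

// Usage sketch: the decimal codes listed above combine into 0x626C6168.
inline void demo_pack_chars() {
    assert(pack_chars(98, 108, 97, 104) == 0x626C6168u);
}
```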

Quote:
Original post by yaustar
Edit: Forgot to mention, this is in C++.


That's not legal C++. Let's take a look at some actual legal C++:

unsigned int blah = 0x12345678;

In memory (as viewed in hex): 78 56 34 12

The problem you're encountering is the fact that the x86 architecture is "little endian" -- the littlest "digits" (bytes) are placed first in memory. Other architectures, such as PPC, often use "big endian" instead, which results in what you were probably expecting: 12 34 56 78
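
If you want to see the byte order for yourself without a memory viewer, something like the following should do. It is only a sketch: the function names are invented here, and it assumes a C++11 compiler that provides `std::uint32_t`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies the bytes of a 32-bit value into an array, in the order they
// sit in memory.
inline std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value) {
    std::array<unsigned char, 4> out{};
    std::memcpy(out.data(), &value, sizeof value);
    return out;
}

// True when the least significant byte is stored first (little endian).
inline bool is_little_endian() {
    return bytes_in_memory(1u)[0] == 1;
}

// Usage sketch: the first byte is 0x78 on a little-endian machine and
// 0x12 on a big-endian one.
inline void demo_byte_order() {
    const std::array<unsigned char, 4> b = bytes_in_memory(0x12345678u);
    assert(b[0] == (is_little_endian() ? 0x78 : 0x12));
}
```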

More information: http://en.wikipedia.org/wiki/Endianness

* Does some silent ranting *
Taking the endianness into account explains the order of the letters perfectly, cheers. How/why does the compiler convert 'blah' to its 'combined' ASCII value (0x626C6168) without warning? Or is it just another quirk of C++ that I have to accept?

Edit: Cancel the warning part, under g++ it does give a warning.

Relevant portions of the standard:

Quote:
2.13.2 - Character literals [lex.ccon]
-1- A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by the letter L, as in L'x'. A character literal that does not begin with L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.

Quote:
2.13.2 - Character literals [lex.ccon]
-2- A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set. The value of a wide-character literal containing multiple c-chars is implementation-defined.

The "c-char" term referred to is any member of the source character set (barring some exceptions, such as the single quote itself), an escape sequence, or a universal character name. The bits to be aware of are the closing sentences of each paragraph, which say that the value of a literal containing multiple c-chars is implementation-defined.
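
The type rules in the quoted paragraphs can be checked at compile time. This is a minimal sketch assuming a C++11 compiler; the function name is made up, and most compilers will warn about the multicharacter literal.

```cpp
#include <cassert>
#include <type_traits>

// One c-char gives char, several give int, and an L prefix gives wchar_t.
static_assert(std::is_same<decltype('x'), char>::value,
              "ordinary literal with one c-char has type char");
static_assert(std::is_same<decltype('ab'), int>::value,
              "multicharacter literal has type int");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,
              "wide-character literal has type wchar_t");

// Usage sketch: the sizes follow from the types checked above. Only the
// type is checked here; the value of 'ab' is implementation-defined.
inline void demo_literal_types() {
    assert(sizeof('x') == 1);
    assert(sizeof('ab') == sizeof(int));
}
```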

Quote:
Original post by MaulingMonkey
That's not legal C++.
No actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

Quote:
Original post by iMalc
They are certainly a less-used feature of C/C++ though.


Many people use them, especially in data files. They are human readable, they can be used in switch statements, they can be easily loaded then used in hash tables as lookup keys, and so on.

For example, IFF files -- used by many libraries out there -- use these simple codes.

You'll find them in the FourCC codes used in video streaming.
You'll find them in the PNG graphics format.
You'll find them in the JPEG-2000 format.
You'll find them in MIDI audio definitions.
You'll find them in PDF tagged documents.
You'll find them in headers for many audio file formats.
You'll even find them in the object files your compiler creates.

Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times if you convert the four bytes into a multi-char literal, they turn out to be useful 4c codes.
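
Here is a sketch of the switch-statement use mentioned above, with PNG's IHDR, IDAT and IEND chunk types as the tags. It builds the codes with a `constexpr` helper instead of a multicharacter literal, so the values do not depend on the compiler. `fourcc`, `read_tag`, `ChunkKind` and `classify_chunk` are names invented for this example, and it assumes C++11 and 8-bit bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: first character ends up in the most significant byte.
constexpr std::uint32_t fourcc(unsigned char a, unsigned char b,
                               unsigned char c, unsigned char d) {
    return (static_cast<std::uint32_t>(a) << 24) |
           (static_cast<std::uint32_t>(b) << 16) |
           (static_cast<std::uint32_t>(c) << 8)  |
            static_cast<std::uint32_t>(d);
}

// Builds the tag from four raw file bytes, independent of host byte order.
inline std::uint32_t read_tag(const unsigned char* p) {
    return fourcc(p[0], p[1], p[2], p[3]);
}

enum class ChunkKind { Header, Data, End, Unknown };

// The codes are constant expressions, so they can be used as case labels.
inline ChunkKind classify_chunk(const unsigned char* p) {
    switch (read_tag(p)) {
        case fourcc('I', 'H', 'D', 'R'): return ChunkKind::Header;
        case fourcc('I', 'D', 'A', 'T'): return ChunkKind::Data;
        case fourcc('I', 'E', 'N', 'D'): return ChunkKind::End;
        default:                         return ChunkKind::Unknown;
    }
}

// Usage sketch: four bytes as they would appear in a file.
inline void demo_classify_chunk() {
    const unsigned char raw[4] = {'I', 'E', 'N', 'D'};
    assert(classify_chunk(raw) == ChunkKind::End);
}
```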

Oh, and don't use unsigned int.

Sure, it will work fine if you are only working on 32-bit PCs, but don't expect that to hold in the distant future.

An unsigned int is at least large enough to hold a 16-bit number but it can be whatever size the compiler writers want to make it. An unsigned long is at least large enough to hold a 32-bit number. Not all machines have 32-bit (or bigger) int types, but an unsigned long is going to be at least long enough to hold it.
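
For what it's worth, the minimum widths described above can be written down as compile-time checks. On a C++11 or later compiler, a fixed-width type is another option. This is a sketch only, and the alias `FourCharCode` is a made-up name.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Minimum ranges inherited from C: unsigned int has at least 16 value bits
// and unsigned long at least 32.
static_assert(std::numeric_limits<unsigned int>::digits >= 16,
              "unsigned int holds at least 16 bits");
static_assert(std::numeric_limits<unsigned long>::digits >= 32,
              "unsigned long holds at least 32 bits");

// Hypothetical alias: std::uint32_t is exactly 32 bits where it exists.
typedef std::uint32_t FourCharCode;
static_assert(std::numeric_limits<FourCharCode>::digits == 32,
              "FourCharCode is exactly 32 bits wide");

// Usage sketch: all four bytes of the code fit, whatever the size of int.
inline void demo_code_width() {
    const FourCharCode code = 0x626C6168u;
    assert((code >> 24) == 0x62u);
    assert((code & 0xFFu) == 0x68u);
}
```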

Quote:
Original post by iMalc
Quote:
Original post by MaulingMonkey
That's not legal C++.
No actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.

Quote:
Original post by frob
Many people use them, especially in data files.


Data files don't contain multicharacter literals because they (generally speaking) don't contain C++ source code. They may contain sequences of bytes that are intended to be interpreted in this way, however.

Quote:
Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times if you convert the four bytes into a multi-char literal, they turn out to be useful 4c codes.


Sure, IF your platform's endianness matches the one for which the format was designed, AND IF sizeof(int) == 4 and CHAR_BIT == 8 on your platform (see below), AND IF your compiler defines the value of multicharacter literals in a sane way (note the phrase "implementation defined" in the quoted section of the Standard).
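
If code does lean on those assumptions, they can at least be spelled out so that a port to an unusual platform fails loudly. This is a rough sketch with invented function names. The "first character in the high byte" packing it tests for is what GCC documents, and it is assumed here rather than guaranteed for any other compiler.

```cpp
#include <cassert>
#include <climits>

// Compile-time guards for the size assumptions listed above.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");
static_assert(sizeof(int) == 4, "this sketch assumes a 32-bit int");

// True when the compiler packs 'abcd' with 'a' in the most significant
// byte (assuming ASCII). The Standard leaves the value up to the compiler.
inline bool multichar_packs_first_char_high() {
    return 'abcd' == 0x61626364;
}

// Hypothetical startup guard: stops a debug build if the packing differs.
inline void require_expected_packing() {
    assert(multichar_packs_first_char_high() &&
           "unexpected multicharacter literal layout");
}
```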

Quote:
Oh, and don't use unsigned int.

Sure, it will work fine if you are only working on 32-bit PCs, but don't expect that to hold in the distant future.

An unsigned int is at least large enough to hold a 16-bit number but it can be whatever size the compiler writers want to make it. An unsigned long is at least large enough to hold a 32-bit number. Not all machines have 32-bit (or bigger) int types, but an unsigned long is going to be at least long enough to hold it.


That's not even wrong.

First of all, the multicharacter literal has type int, as dictated by the standard, and the type of the variable to which you assign the result is totally irrelevant. C++ doesn't provide for overloading functions solely by return type or deducing a return type from the calling context; the rules are the same for built-ins.

Second, an unsigned int is only at least large enough to hold an 8-bit number. Sorry. In fact, an unsigned long is only at least large enough to hold an 8-bit number. Ints are only guaranteed to be *at least as* large as chars (and consist of an integral number of them), and likewise longs are *at least as* large as ints (and consist of an integral number of chars). (And of course, chars are guaranteed to provide *at least* 8 bits, but they can legally provide 9 or 16 or 64, as long as every bit of memory is addressable via a char* and bit arithmetic.)

Third, you seem to be under the assumption that 16-bit platforms will at some point in the distant future become popular again?

Quote:
Original post by Washu
Quote:
Original post by iMalc
Quote:
Original post by MaulingMonkey
That's not legal C++.
No actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.

Just when I'd started to think nobody could screw up C++ even more...

Quote:
Original post by Washu
Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.


Sure they are; jpetrie already quoted the relevant chapter and verse.

Quote:
Original post by Sharlin
Quote:
Original post by Washu
Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.


Sure they are; jpetrie already quoted the relevant chapter and verse.


"Implementation defined" is not much of a standard. Might as well be undefined behaviour altogether.

Yes, but the syntax is well-formed, so any C++ compiler should accept it. The semantics aren't very useful, though.

Quote:
Original post by Washu
Quote:
Original post by iMalc
Quote:
Original post by MaulingMonkey
That's not legal C++.
No actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.


They're in my copy of the standard...

Quote:
Original post by rip-off
Quote:
Original post by Sharlin
Quote:
Original post by Washu
Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.


Sure they are; jpetrie already quoted the relevant chapter and verse.


"Implementation defined" is not much of a standard. Might as well be undefined behaviour altogether.
Well, if you ask me I'd say they were just being lazy in not specifying the exact behaviour, perhaps due to the endianness issue.

Quote:
Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times if you convert the four bytes into a multi-char literal, they turn out to be useful 4c codes.
Heck even the main product I work on at work uses 4-char-code magic numbers, though they've been defined by their hexadecimal value at present instead of 4cc's.

Quote:
Original post by rip-off
Quote:
Original post by Sharlin
Quote:
Original post by Washu
Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.


Sure they are; jpetrie already quoted the relevant chapter and verse.


"Implementation defined" is not much of a standard. Might as well be undefined behaviour altogether.


Not at all.

Undefined behavior is completely unreliable. If your program has UB then it is a complete and utter bug. However, implementation defined behavior is always reliable on a given system.

Most classes in the STL have implementation defined types for their iterator members and related type definitions and differences.

The size of nearly all data types is implementation defined with a specific minimum. That was already briefly mentioned because an unsigned int isn't guaranteed to hold at least 32 bits, but an unsigned long will hold at least that much.

Many conversions, including mappings performed by reinterpret_cast<>, are implementation defined.

We can rely on these implementation defined behaviors, and people who develop cross-platform games (like me) rely heavily on knowing what is IB and making sure it is handled consistently between platforms.

Other things are a little more tricky, such as alignment, byte ordering, and structure padding. They are all implementation defined, but a little harder to take advantage of.

Yes, you are right that the value of a multicharacter literal is implementation defined. They are treated as int, and when converted to an unsigned long, are properly handled on all systems (although re-ordered for proper byte ordering). But that is not the same as undefined.


Relying on implementation-defined behavior is a good thing. It lets us know that we can reliably run on any processor, from a Z80 to an Arm9 to a dual OpteronX64. It means our game core will work fine if we port it from the X360 to the Wii to a PSP or DS, or even jump over to Windows Mobile.

Relying on undefined behavior is guaranteed catastrophic behavior for the 'real' world. Please don't confuse the two.
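
As a small illustration of implementation-defined behaviour being something you can query and depend on, take the signedness of plain `char`. It varies between platforms but is fixed and documented for any given one. This is only a sketch, and the function names are made up.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Whether plain char is signed is implementation-defined, but the answer
// is stable for a given compiler and target, so code can branch on it.
inline bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Usage sketch: CHAR_MIN is negative exactly when plain char is signed.
inline void demo_char_signedness() {
    assert(plain_char_is_signed() == (CHAR_MIN < 0));
}
```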

Quote:
Original post by frob
Quote:
Original post by rip-off
Quote:
Original post by Sharlin
Quote:
Original post by Washu
Still not standard. They might be adopted into 0x, but I haven't seen much forward movement on that front.


Sure they are; jpetrie already quoted the relevant chapter and verse.


"Implementation defined" is not much of a standard. Might as well be undefined behaviour altogether.


Not at all.

Undefined behavior is completely unreliable. If your program has UB then it is a complete and utter bug. However, implementation defined behavior is always reliable on a given system.

Most classes in the STL have implementation defined types for their iterator members and related type definitions and differences.

The size of nearly all data types is implementation defined with a specific minimum. That was already briefly mentioned because an unsigned int isn't guaranteed to hold at least 32 bits, but an unsigned long will hold at least that much.

Many conversions, including mappings performed by reinterpret_cast<>, are implementation defined.

We can rely on these implementation defined behaviors, and people who develop cross-platform games (like me) rely heavily on knowing what is IB and making sure it is handled consistently between platforms.

Other things are a little more tricky, such as alignment, byte ordering, and structure padding. They are all implementation defined, but a little harder to take advantage of.

Yes, you are right that the value of a multicharacter literal is implementation defined. They are treated as int, and when converted to an unsigned long, are properly handled on all systems (although re-ordered for proper byte ordering). But that is not the same as undefined.


Relying on implementation-defined behavior is a good thing. It lets us know that we can reliably run on any processor, from a Z80 to an Arm9 to a dual OpteronX64. It means our game core will work fine if we port it from the X360 to the Wii to a PSP or DS, or even jump over to Windows Mobile.

Relying on undefined behavior is guaranteed catastrophic behavior for the 'real' world. Please don't confuse the two.


I'm not confusing the two; I do understand the difference. Believe me, the term undefined behaviour has been burned into my skull, as it has for most people who use C++. [grin]

It is just that *in this case* I don't see much benefit. std::iterator types are useful despite their "implementation defined" members. We don't need to know the details of their members to use them, which is why these members can be implementation defined in the first place.

With multicharacter literals, not so much (IMO). They only have one use: they are a literal of integer type with an "implementation defined" value. I just happen not to see much use in that.

I probably should have ended my original post with "IMO" [smile].
