unsigned int == 4 chars?

Started by
16 comments, last by rip-off 16 years, 10 months ago
I just came across this code and don't fully understand why it 'works' or behaves this way.
unsigned int blah = 'blah';
Looking at the variable in memory shows:
halb
Which is 'blah' backwards. I see how 4 chars physically fit in the unsigned int, but it doesn't seem to make sense, especially since I expect to see a single char between '' and the characters end up stored backwards. Cheers. Edit: Forgot to mention, this is in C++.

Steven Yau

The backwards part follows from the endianness of your platform (a property of the CPU architecture rather than the operating system), in this case little endian.

That letters can be used as numbers follows from the ASCII codes for those letters.

blah

b - 98
l - 108
a - 97
h - 104

b - 0x62
l - 0x6C
a - 0x61
h - 0x68

0x626C6168
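
The arithmetic above can be checked by packing the four codes by hand. This is only a sketch: `pack_chars` is a made-up helper name, not something from the original code, and it assumes 8-bit chars and an ASCII execution character set.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original code): packs four characters
// into a 32-bit value, first character in the most significant byte.
constexpr std::uint32_t pack_chars(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8)  |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Assumes ASCII: b = 0x62, l = 0x6C, a = 0x61, h = 0x68.
static_assert(pack_chars('b', 'l', 'a', 'h') == 0x626C6168u,
              "98, 108, 97, 104 packed high byte first");
```

Because the shifts are explicit, the result does not depend on how the compiler treats 'blah' or on the byte order of the machine.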
"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
Quote:Original post by yaustar
Edit: Forgot to mention, this is in C++.


That's not legal C++. Let's take a look at some actual legal C++:

unsigned int blah = 0x12345678;

In memory (as viewed in hex): 78 56 34 12

The problem you're encountering is the fact that the x86 architecture is "little endian" -- the littlest "digits" (bytes) are placed first in memory. Other architectures, such as PPC, often use "big endian" instead, which results in what you were probably expecting: 12 34 56 78
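
If you want to see the byte order for yourself, something along these lines should do. It is a rough sketch, assuming a platform that provides `std::uint32_t`; the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies the object representation of a 32-bit value into out[0..3],
// in the order the bytes sit in memory.
inline void bytes_in_memory(std::uint32_t value, unsigned char* out)
{
    std::memcpy(out, &value, sizeof value);
}

// Returns true when the least significant byte is stored first.
inline bool is_little_endian()
{
    unsigned char b[4];
    bytes_in_memory(0x12345678u, b);
    // Mixed-endian layouts are not handled by this sketch.
    assert(b[0] == 0x78 || b[0] == 0x12);
    return b[0] == 0x78;
}
```

On x86 the four bytes come back as 78 56 34 12 and `is_little_endian()` returns true; on a big-endian machine they come back as 12 34 56 78.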

More information: http://en.wikipedia.org/wiki/Endianness
* Does some silent ranting *
Taking the endianness into account explains the order of the letters perfectly, cheers. How/why does the compiler convert 'blah' to its 'combined' ASCII value (0x626C6168) without warning? Or is it just another quirk that I have to accept in C++?

Edit: Cancel the warning part, under g++ it does give a warning.

Steven Yau

Relevant portions of the standard:

Quote:2.13.2 - Character literals [lex.ccon]
-1- A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by the letter L, as in L'x'. A character literal that does not begin with L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.

Quote:2.13.2 - Character literals [lex.ccon]
-2- A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set. The value of a wide-character literal containing multiple c-chars is implementation-defined.

The "c-char" term referred to is any member of the source character set -- barring some exceptions (such as the single quote itself) -- an escape sequence, or a universal character name. The bolded sections are the bits to be aware of.
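
The types named in those two paragraphs can be checked at compile time. This is a small sketch assuming a C++11 compiler (for `static_assert` and `decltype`); as noted earlier in the thread, g++ warns about the multicharacter literal but still accepts it.

```cpp
#include <type_traits>

// One c-char: ordinary character literal of type char.
static_assert(std::is_same<decltype('x'), char>::value,
              "single-character literal has type char");

// More than one c-char: multicharacter literal of type int.
// Its value is implementation-defined, so none is asserted here.
static_assert(std::is_same<decltype('ab'), int>::value,
              "multicharacter literal has type int");

// L prefix: wide-character literal of type wchar_t.
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,
              "wide-character literal has type wchar_t");
```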
Thanks very much for that.

Steven Yau

Quote:Original post by MaulingMonkey
That's not legal C++.
No, actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian, so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
They are certainly a less-used feature of C/C++ though.


Many people use them, especially in data files. They are human readable, they can be used in switch statements, they can be easily loaded then used in hash tables as lookup keys, and so on.

For example, IFF files -- used by many libraries out there -- use these simple codes.

You'll find them in the FourCC (4C) codes used in video streaming.
You'll find them in the PNG graphics format.
You'll find them in the JPEG-2000 format.
You'll find them in MIDI audio definitions.
You'll find them in PDF tagged documents.
You'll find them in headers for many audio file formats.
You'll even find them in the object files your compiler creates.

Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times if you convert the four bytes into a multi-char literal, they turn out to be useful 4c codes.
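
As a rough illustration of the switch-statement use, here is one way to read a tag from file bytes without relying on multicharacter literals at all. The names `fourcc`, `read_fourcc`, `ChunkKind` and `classify` are invented for this sketch; FORM and LIST are chunk IDs from the IFF format mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: builds a tag with explicit shifts, so the value does
// not depend on the compiler's multicharacter rule or the host byte order.
constexpr std::uint32_t fourcc(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8)  |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Reads a tag from the first four bytes of a buffer, in file order.
inline std::uint32_t read_fourcc(const unsigned char* p)
{
    assert(p != nullptr);
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

enum class ChunkKind { Form, List, Unknown };

// Dispatches on the tag; the case labels are compile-time constants.
inline ChunkKind classify(const unsigned char* header)
{
    switch (read_fourcc(header)) {
    case fourcc('F', 'O', 'R', 'M'): return ChunkKind::Form;
    case fourcc('L', 'I', 'S', 'T'): return ChunkKind::List;
    default:                         return ChunkKind::Unknown;
    }
}
```

Passing the four bytes 'F', 'O', 'R', 'M' to `classify` yields `ChunkKind::Form` on any byte order, since the bytes are combined in file order rather than reinterpreted in place.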

Oh, and don't use unsigned int.

Sure it will work fine if you are only working on 32-bit PCs, but don't expect it in the distant future.

An unsigned int is at least large enough to hold a 16-bit number but it can be whatever size the compiler writers want to make it. An unsigned long is at least large enough to hold a 32-bit number. Not all machines have 32-bit (or bigger) int types, but an unsigned long is going to be at least long enough to hold it.
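
The minimum ranges mentioned here can be stated as compile-time checks. This is a sketch assuming a C++11 compiler; `tag_type` is a hypothetical name, and `std::uint_least32_t` is offered as one possible alternative to spelling out `unsigned long`.

```cpp
#include <cstdint>
#include <limits>

// Minimum ranges inherited from the C limits: unsigned int must reach at
// least 65535, unsigned long at least 4294967295.
static_assert(std::numeric_limits<unsigned int>::max() >= 0xFFFFu,
              "unsigned int holds at least 16 bits");
static_assert(std::numeric_limits<unsigned long>::max() >= 0xFFFFFFFFul,
              "unsigned long holds at least 32 bits");

// Hypothetical alias: a type with at least 32 value bits for four-byte tags.
typedef std::uint_least32_t tag_type;
static_assert(std::numeric_limits<tag_type>::digits >= 32,
              "tag_type can hold four 8-bit characters");
```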
Quote:Original post by iMalc
Quote:Original post by MaulingMonkey
That's not legal C++.
No, actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian, so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

Still not standard. They might be adopted into C++0x, but I haven't seen much forward movement on that front.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Quote:Original post by frob
Many people use them, especially in data files.


Data files don't contain multicharacter literals because they (generally speaking) don't contain C++ source code. They may contain sequences of bytes that are intended to be interpreted in this way, however.

Quote:Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times if you convert the four bytes into a multi-char literal, they turn out to be useful 4c codes.


Sure, IF your platform's endianness matches the one for which the format was designed, AND IF sizeof(int) == 4 and CHAR_BIT == 8 on your platform (see below), AND IF your compiler defines the value of multicharacter literals in a sane way (note the phrase "implementation-defined" in the quoted section of the Standard).
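
Most of those conditions can at least be written down so that the build fails when they don't hold. This is a sketch assuming a C++11 compiler; the endianness condition is left out because it can't be tested in a portable constant expression before C++20, and the compiler may warn about the multicharacter literal.

```cpp
#include <climits>

// Each check fails at compile time on a platform that breaks the assumption.
static_assert(CHAR_BIT == 8, "tag code assumes 8-bit chars");
static_assert(sizeof(int) == 4, "tag code assumes a 4-char int");

// The value of a multicharacter literal is implementation-defined, so pin
// down the convention this code relies on: first character in the high byte.
static_assert('blah' == 0x626C6168, "unexpected multicharacter literal value");
```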

Quote:Oh, and don't use unsigned int.

Sure it will work fine if you are only working on 32-bit PCs, but don't expect it in the distant future.

An unsigned int is at least large enough to hold a 16-bit number but it can be whatever size the compiler writers want to make it. An unsigned long is at least large enough to hold a 32-bit number. Not all machines have 32-bit (or bigger) int types, but an unsigned long is going to be at least long enough to hold it.


That's not even wrong.

First of all, the multicharacter literal has type int, as dictated by the standard, and the type of the variable to which you assign the result is totally irrelevant. C++ doesn't provide for overloading functions solely by return type or deducing a return type from the calling context; the rules are the same for built-ins.

Second, an unsigned int is only at least large enough to hold an 8-bit number. Sorry. In fact, an unsigned long is only at least large enough to hold an 8-bit number. Ints are only guaranteed to be *at least as* large as chars (and consist of an integral number of them), and likewise longs are *at least as* large as ints (and consist of an integral number of chars). (And of course, chars are guaranteed to provide *at least* 8 bits, but they can legally provide 9 or 16 or 64, as long as every bit of memory is addressable via a char* and bit arithmetic.)

Third, you seem to be under the assumption that 16-bit platforms will at some point in the distant future become popular again?
