
unsigned int == 4 chars?

General and Gameplay Programming

Started by yaustar June 15, 2007 01:45 PM

16 comments, last by rip-off 16 years, 10 months ago

yaustar

1,022

Author

June 15, 2007 01:45 PM

I just came across this code and don't fully understand why it 'works' or behaves this way.

unsigned int blah = 'blah';

Looking at the variable in memory shows:

halb

Which is 'blah' backwards. I see how 4 chars physically fit in the unsigned int, but it doesn't make sense to me: I expected to see only a single char between the '' and the characters are copied backwards. Cheers. Edit: Forgot to mention, this is in C++.

Steven Yau
[Blog] [Portfolio]

LessBread

1,415

June 15, 2007 02:01 PM

The backwards part follows from the endianness of your CPU architecture, in this case little endian.

That letters can be used as numbers follows from the ASCII codes for those letters.

blah

b - 98
l - 108
a - 97
h - 104

b - 0x62
l - 0x6C
a - 0x61
h - 0x68

0x626C6168
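
The arithmetic above can be reproduced by hand. This is only a minimal sketch: it assumes an ASCII execution character set and a C++11 compiler, and `pack_chars` is a hypothetical helper name, not something the compiler exposes.

```cpp
#include <cstdint>

// Hypothetical helper: combines four characters into one 32-bit value,
// with the first character in the most significant byte.
// Assumes ASCII character codes ('b' == 0x62, 'l' == 0x6C, and so on).
constexpr std::uint32_t pack_chars(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8)  |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// 'b' = 0x62, 'l' = 0x6C, 'a' = 0x61, 'h' = 0x68
static_assert(pack_chars('b', 'l', 'a', 'h') == 0x626C6168u,
              "assumes ASCII character codes");
```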

"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man

MaulingMonkey

1,729

June 15, 2007 02:02 PM

Quote:Original post by yaustar
Edit: Forgot to mention, this is in C++.

That's not legal C++. Let's take a look at some actual legal C++:

unsigned int blah = 0x12345678;

In memory (as viewed in hex): 78 56 34 12

The problem you're encountering is the fact that the x86 architecture is "little endian" -- the littlest "digits" (bytes) are placed first in memory. Other architectures, such as PPC, often use "big endian" instead, which results in what you were probably expecting: 12 34 56 78

More information: http://en.wikipedia.org/wiki/Endianness
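
To see the byte order on a given machine, something along these lines should work. It is a sketch rather than a definitive test: `bytes_in_memory` and `is_little_endian` are hypothetical names, and `std::uint32_t` is assumed to be available (C++11 or later).

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Copies the object representation of a 32-bit value into a byte array,
// in the order the bytes actually sit in memory.
std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value)
{
    std::array<unsigned char, 4> bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// True when the least significant byte is stored at the lowest address.
bool is_little_endian()
{
    return bytes_in_memory(1u)[0] == 1;
}
```

On x86, `bytes_in_memory(0x12345678)` gives 78 56 34 12; on a big-endian machine it gives 12 34 56 78.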

yaustar

1,022

Author

June 15, 2007 02:11 PM

* Does some silent ranting *
Taking the endianness into account explains the order of the letters perfectly, cheers. How/why does the compiler convert 'blah' to its 'combined' ASCII values (0x626C6168) without a warning? Or is it just another quirk of C++ that I have to accept?

Edit: Cancel the warning part, under g++ it does give a warning.

Steven Yau
[Blog] [Portfolio]

jpetrie

13,220

June 15, 2007 02:16 PM

Relevant portions of the standard:

Quote:2.13.2 - Character literals [lex.ccon]
-1- A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by the letter L, as in L'x'. A character literal that does not begin with L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.

Quote:2.13.2 - Character literals [lex.ccon]
-2- A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set. The value of a wide-character literal containing multiple c-chars is implementation-defined.

The "c-char" term referred to is any member of the source character set (barring some exceptions, such as the single quote itself), an escape sequence, or a universal character name. The bits to be aware of are the closing sentences of each paragraph: a multicharacter literal has type int, and its value, narrow or wide, is implementation-defined.
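
The types described in the quoted paragraphs can be checked at compile time. A small sketch, assuming a C++11 or later compiler (for `decltype` and `static_assert`); `multichar_literal_is_int` is a hypothetical helper, and some compilers will warn about the multicharacter literal.

```cpp
#include <type_traits>

// An ordinary character literal with a single c-char has type char.
static_assert(std::is_same<decltype('x'), char>::value,
              "single-character literal is a char");

// A multicharacter literal has type int. Its value is implementation-defined,
// so only the type is checked here.
static_assert(std::is_same<decltype('ab'), int>::value,
              "multicharacter literal is an int");

// A wide-character literal has type wchar_t.
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,
              "wide-character literal is a wchar_t");

// Hypothetical helper so the same check can be made at run time.
bool multichar_literal_is_int()
{
    return std::is_same<decltype('ab'), int>::value;
}
```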

yaustar

1,022

Author

June 15, 2007 02:30 PM

Thanks very much for that.

Steven Yau
[Blog] [Portfolio]

iMalc

2,466

June 15, 2007 10:29 PM

Quote:Original post by MaulingMonkey
That's not legal C++.

No, actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian, so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

frob

46,223

June 15, 2007 11:45 PM

Quote:Original post by iMalc
They are certainly a less-used feature of C/C++ though.

Many people use them, especially in data files. They are human readable, they can be used in switch statements, they can be easily loaded then used in hash tables as lookup keys, and so on.

For example, IFF files -- used by many libraries out there -- use these simple codes.

You'll find them in the FourCC codes used for video streaming.
You'll find them in the PNG graphics format.
You'll find them in the JPEG-2000 format.
You'll find them in MIDI audio definitions.
You'll find them in PDF tagged documents.
You'll find them in headers for many audio file formats.
You'll even find them in the object files your compiler creates.

Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times, if you convert the four bytes into a multi-char literal, they turn out to be useful FourCC codes.
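
As a rough illustration of using such codes as switch keys, here is a sketch that classifies PNG chunk types. It builds the keys with shifts instead of multicharacter literals so that the result does not depend on the compiler or host byte order. `make_tag`, `read_tag`, `ChunkKind` and `classify_chunk` are hypothetical names; ASCII character codes and a C++11 compiler are assumed.

```cpp
#include <cstdint>

// Hypothetical helper: builds a 32-bit tag from four characters,
// with the first character in the most significant byte.
constexpr std::uint32_t make_tag(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8)  |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Reads four consecutive bytes from a buffer in the same order,
// regardless of the host's endianness.
inline std::uint32_t read_tag(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

enum class ChunkKind { Header, Data, End, Unknown };

// Classifies a PNG chunk by its four-byte type field.
ChunkKind classify_chunk(const unsigned char* type_field)
{
    switch (read_tag(type_field)) {
    case make_tag('I', 'H', 'D', 'R'): return ChunkKind::Header;
    case make_tag('I', 'D', 'A', 'T'): return ChunkKind::Data;
    case make_tag('I', 'E', 'N', 'D'): return ChunkKind::End;
    default:                           return ChunkKind::Unknown;
    }
}
```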

Oh, and don't use unsigned int.

Sure it will work fine if you are only working on 32-bit PCs, but don't expect it in the distant future.

An unsigned int is at least large enough to hold a 16-bit number but it can be whatever size the compiler writers want to make it. An unsigned long is at least large enough to hold a 32-bit number. Not all machines have 32-bit (or bigger) int types, but an unsigned long is going to be at least long enough to hold it.
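
The minimum sizes mentioned here can be written down as compile-time checks. A sketch, assuming `<cstdint>` is available (C++11, or C99's `<stdint.h>` on older compilers); `FourCC` and `bits_in_unsigned_int` are hypothetical names.

```cpp
#include <climits>
#include <cstdint>

// Minimum ranges that <climits> inherits from the C standard.
static_assert(UINT_MAX >= 65535u,
              "unsigned int holds at least a 16-bit number");
static_assert(ULONG_MAX >= 4294967295ul,
              "unsigned long holds at least a 32-bit number");

// Hypothetical alias: a fixed-width type states the 32-bit intent directly.
typedef std::uint32_t FourCC;
static_assert(sizeof(FourCC) * CHAR_BIT == 32,
              "uint32_t is exactly 32 bits wide");

// Reports how wide unsigned int actually is on this platform.
unsigned bits_in_unsigned_int()
{
    return static_cast<unsigned>(sizeof(unsigned int) * CHAR_BIT);
}
```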

Washu

7,836

June 16, 2007 12:08 AM

Quote:Original post by iMalc
Quote:Original post by MaulingMonkey
That's not legal C++.
No actually it is; it's a multicharacter literal! I came across them a fair bit in 680x0 Mac programming many years ago (big endian so it comes out in the right byte order).
They are certainly a less-used feature of C/C++ though.

Still not standard. They might be adopted into C++0x, but I haven't seen much forward movement on that front.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Zahlman

1,682

June 16, 2007 02:15 AM

Quote:Original post by frob
Many people use them, especially in data files.

Data files don't contain multicharacter literals because they (generally speaking) don't contain C++ source code. They may contain sequences of bytes that are intended to be interpreted in this way, however.

Quote:Frequently when people try to reverse engineer complex file formats they'll end up with a few magic numbers. Many times, if you convert the four bytes into a multi-char literal, they turn out to be useful FourCC codes.

Sure, IF your platform's endianness matches the one for which the format was designed, AND IF sizeof(int) == 4 and CHAR_BIT == 8 on your platform (see below), AND IF your compiler defines the value of multicharacter literals in a sane way (note the phrase "implementation-defined" in the quoted section of the Standard).

Quote:Oh, and don't use unsigned int.

Sure it will work fine if you are only working on 32-bit PCs, but don't expect it in the distant future.

An unsigned int is at least large enough to hold a 16-bit number but it can be whatever size the compiler writers want to make it. An unsigned long is at least large enough to hold a 32-bit number. Not all machines have 32-bit (or bigger) int types, but an unsigned long is going to be at least long enough to hold it.

That's not even wrong.

First of all, the multicharacter literal has type int, as dictated by the standard, and the type of the variable to which you assign the result is totally irrelevant. C++ doesn't provide for overloading functions solely by return type or deducing a return type from the calling context; the rules are the same for built-ins.

Second, an unsigned int is only at least large enough to hold an 8-bit number. Sorry. In fact, an unsigned long is only at least large enough to hold an 8-bit number. Ints are only guaranteed to be *at least as* large as chars (and consist of an integral number of them), and likewise longs are *at least as* large as ints (and consist of an integral number of chars). (And of course, chars are guaranteed to provide *at least* 8 bits, but they can legally provide 9 or 16 or 64, as long as every bit of memory is addressable via a char* and bit arithmetic.)

Third, you seem to be under the assumption that 16-bit platforms will at some point in the distant future become popular again?

This topic is closed to new replies.
