memory alignment in c - atrocious?

okonomiyaki    548
I was curious about C structures and how they are laid out in memory. I wrote the following program to show me a few things about structures.

#include <stdlib.h>
#include <stdio.h>

int type_banana = 1;
int type_orange = 2;
int type_apple = 3;

struct object {
	int type;
	char* size;
};
typedef struct object object;

struct banana {
	int type;
	char* size;
	float softness;
	int length;
};
typedef struct banana banana;

struct orange {
	int type;
	char* size;
	unsigned char peeled;
};
typedef struct orange orange;

struct apple {
	int type;
	char* size;
	short weight;
	char* color;
};
typedef struct apple apple;

banana* make_banana(char* size, float softness, int length) {
	banana* b = (banana*)malloc(sizeof(banana));
	b->type = type_banana;
	b->size = size;
	b->softness = softness;
	b->length = length;
	return b;
}

orange* make_orange(char* size, unsigned char peeled) {
	orange* o = (orange*)malloc(sizeof(orange));
	o->type = type_orange;
	o->size = size;
	o->peeled = peeled;
	return o;
}

apple* make_apple(char* size, short weight, char* color) {
	apple* a = (apple*)malloc(sizeof(apple));
	a->type = type_apple;
	a->size = size;
	a->weight = weight;
	a->color = color;
	return a;
}

int main(int argc, char** argv) {
	apple* a = make_apple("small", 2, "red");
	banana* b = make_banana("big", 1.5, 5);

	object* generic1 = (object*)a;
	object* generic2 = (object*)b;

	printf("generic1 size: %s\n", a->size);
	printf("generic2 size: %s\n", b->size);

	printf("\nLocations:\n");
	printf("apple:         %p\n", (void*)a);
	printf("apple->type:   %p\n", (void*)&a->type);
	printf("apple->size:   %p\n", (void*)&a->size);
	printf("apple->weight: %p\n", (void*)&a->weight);
	printf("apple->color:  %p\n", (void*)&a->color);

	printf("\nSizes:\n");
	printf("apple:         %zu\n", sizeof(apple));
	printf("apple->type:   %ld\n", (long)&a->size - (long)&a->type);
	printf("apple->size:   %ld\n", (long)&a->weight - (long)&a->size);
	printf("apple->weight: %ld\n", (long)&a->color - (long)&a->weight);
	printf("apple->color:  %ld\n", ((long)a + (long)sizeof(apple)) - (long)&a->color);

}

The output is:
generic1 size: small
generic2 size: big

Locations:
apple:         0x100120
apple->type:   0x100120
apple->size:   0x100124
apple->weight: 0x100128
apple->color:  0x10012c

Sizes:
apple:         16
apple->type:   4
apple->size:   4
apple->weight: 4
apple->color:  4

I compiled this with Apple's gcc 4.0.1 without any options. Obviously the memory is 4-byte aligned. Is memory alignment something that can be depended on, or does it vary so much between vendors/compilers/operating systems that you should never make any assumptions about it? If I made assumptions about memory alignment, I would have to ensure that all binaries (such as loadable libraries) were compiled accordingly. Seems pretty scary to me - but there are some definite optimizations that could be made. I'm just curious to hear from anyone who has already felt this out.

asp_    172
You can't make any assumptions about padding, AFAIK. It's purely a performance optimization. All compilers provide means of controlling the padding, though: MSVC uses #pragma pack, and GCC supports #pragma pack or __attribute__((packed)) if I remember correctly.

Evil Steve    2017
As far as I'm aware, there's no guarantee about alignment, but compilers will usually align members on 4-byte boundaries in an x86 build (I haven't checked an x64 build, but ISTR that the default alignment there is 8 bytes).
Generally, you'll have all of your modules compiled with the same compiler, so this isn't much of a concern. If you want proper safety, you can turn on packing for the structs you need (in MSVC that's #pragma pack(1); I believe GCC uses __attribute__((packed))).

okonomiyaki    548
Quote:
Original post by Evil Steve
Generally, you'll have all of your modules compiled with the same compiler, so this isn't much of a concern. If you want proper safety, you can turn on packing for the structs you need (in MSVC that's #pragma pack(1); I believe GCC uses __attribute__((packed))).


I'm glad to hear that it's not completely unreliable. Thanks.

I did some research on the reasons for memory alignment. The Wikipedia article seems like a good summary:

http://en.wikipedia.org/wiki/Memory_alignment

Essentially, a CPU which requires a 4-byte boundary is assuming that 4 bytes will hold the largest primitive datum, which would be your standard integer? Then all pages/caches/etc. deal with memory in 4-byte chunks, and make sure no datum is split across pages or memory boundaries.

Does that mean using `long long' variables (meaning 8 byte integers) is dramatically slower than 4 byte integers (on 32-bit architectures)?

I'm also curious about one of the statements in the above article:

"SSE2 instructions on x86 and x64 CPUs do require the data to be 128-bit (16-byte) aligned and there can be substantial performance advantages from using aligned data on these architectures."

16 bytes? Really?

(edit: clarifications)

swiftcoder    18432
Quote:
Original post by okonomiyaki
"SSE2 instructions on x86 and x64 CPUs do require the data to be 128-bit (16-byte) aligned and there can be substantial performance advantages from using aligned data on these architectures."

16 bytes? Really?
Ja. SSE2 and AltiVec both use 128-bit registers internally, meaning that you can operate on 4 floats or 2 doubles simultaneously.

OrangyTang    1298
Adding more addressing modes and formats means that the individual instructions get more complicated (and therefore slower). Since SSE is focused primarily on performance it makes sense to only support the fastest alignment/addressing method possible.

SnotBob    202
Quote:
Original post by okonomiyaki
Essentially, a CPU which requires a 4-byte boundary is assuming that 4 bytes will hold the largest primitive datum, which would be your standard integer? Then all pages/caches/etc. deal with memory in 4-byte chunks, and make sure no datum is split across pages or memory boundaries.

Pages and caches don't really have anything to do with this. They have different alignment requirements. X86 page size is 4k and they are always aligned with 4k boundaries in the physical memory. Caches tend to be aligned to boundaries equal to the size of cache lines.

Some CPUs, such as ARM, do require that 32-bit entities are aligned to 4-byte boundaries. Unlike x86, where unaligned dword accesses result in multiple memory accesses, ARM simply cannot access unaligned 32-bit data directly; there are no such instructions. Of course, you can always read 8 bits at a time and shift-and-OR them together to make a 32-bit value. BTW, on ARM it's faster to copy memory between buffers that are aligned the same way for this reason. E.g. it takes more time to copy between buffers that start at addresses 0xBEEF0001 and 0xDEAD0002 than between 0xBEEF0001 and 0xDEAD0001 (probably the same is true of x86, though).

Quote:

Does that mean using `long long' variables (meaning 8 byte integers) is dramatically slower than 4 byte integers (on 32-bit architectures)?

Yes, but not because of alignment issues.
There are no instructions for handling 64-bit integers atomically on 32-bit architectures (or at least on x86 and ARM). For the CPU core, it wouldn't make any difference if the two 32-bit words weren't even adjacent in memory. (Caching and paging is another thing.)

Quote:

16 bytes? Really?

I can't remember for sure and didn't bother checking Intel's docs, but I believe so.

Jan Wassenberg    999
Quote:
Since SSE is focused primarily on performance it makes sense to only support the fastest alignment/addressing method possible.

Note: there are SSE instructions (movups and movdqu) that exist solely for loading unaligned data. Depending on the microarchitecture, there may or may not be a penalty relative to the normal move instruction if the operand turns out to be aligned, and there definitely is one if it is unaligned.

Quote:
Pages and caches don't really have anything to do with this. They have different alignment requirements. X86 page size is 4k and they are always aligned with 4k boundaries in the physical memory. Caches tend to be aligned to boundaries equal to the size of cache lines.

Pages and caches are relevant here because it is much more expensive to additionally cross a page boundary rather than just a cache line (which 'only' seems to cause an L1d miss). Cache lines are indeed by definition (rather than "tend to be") aligned to their size, due to the way they are addressed.

