Alignment questions

4 comments, last by ajas95 19 years, 3 months ago
I posted about alignment a while ago, but that was more about why it's needed and its relevance to SSE instructions etc. I have a few more questions about alignment that I thought would be better in a new post.

1) Assuming that it is best to try to align all your data, what alignment do you shoot for? If you are aligning a variable that will be used in SSE code, for example, you would want 16-byte alignment. But what do you go for the rest of the time? I.e. if you have a structure you wanted to pad so it was aligned... what size padding? Is there an easy way (or reference) to work it out for your specific hardware? Do you target cache line size? Which cache would you aim for, etc.?

2) I presume that you only shoot for alignment in release builds since, in debug, the debug CRT will add debug headers and before-and-after blocks for buffer overrun detection. This will upset alignment, but is it OK not to worry about it in this situation? If you have any code that requires alignment, would you just guard that code with pre-processor #ifdefs and check for _DEBUG?

3) How much should you use the compiler to force alignment? For example, VC++ provides __declspec(align(#)), where you can specify a power-of-two value for alignment at variable declaration.
1) Well, alignment is a double-edged sword. If you have 2 or 4 data members that are often used in tandem, you want 16 (or 32) byte alignment so that when one is loaded into L1, the other is also. However, if you start aligning everything on 16 bytes that increases the amount of padding bytes that get filled into the cache, so your cache hits inevitably decrease... Personally, I align my SSE structures on 16, and use compiler default for everything else.

2) I would never intentionally use different memory layouts for debug and release builds. That means that if you're writing a debug physics library and you have a solid renderer, you can't build debug physics with release renderer if they depend on the same structures. I would strive at every turn to maintain compatibility between release and debug builds of my libraries. Of course, it's not that simple and there are big caveats... but as far as general principles go, debug/release compatibility is very useful.

3) Good luck with that. Simply declspec(align())'ing or pragma(pack())'ing your data will leave you very disappointed. It might work for objects on the stack, but if you try to 'new' one of these structures, you're rolling the dice. I use my own allocator library and new/delete overloads along with custom STL structures for anything that needs to be heapily allocated and guaranteed align(16)'ed.
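To make points 1) and 2) concrete, here is a minimal sketch, assuming a C++11 compiler where the standard `alignas` specifier plays the role of `__declspec(align(#))`. The `Vec4` and `Particle` types and the `is_aligned` helper are made up for illustration and are not from the posts above. It only covers the stack case; as point 3) warns, plain `new` may not honour the request on older compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical SSE-style vector, forced onto a 16-byte boundary.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Hypothetical "everything else" type, left at the compiler default.
struct Particle {
    char  alive;   // 1 byte; the compiler pads after it so 'mass' is aligned
    float mass;
};

// Compile-time checks: these fail the build in debug and release alike,
// so the two configurations cannot silently drift apart in layout.
static_assert(alignof(Vec4) == 16, "Vec4 must be 16-byte aligned");
static_assert(sizeof(Vec4) % alignof(Vec4) == 0, "size is a multiple of alignment");
static_assert(sizeof(Particle) % alignof(Particle) == 0, "padding keeps arrays aligned");

// True when p lies on an 'alignment'-byte boundary (alignment: power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Objects with automatic (stack) storage honour alignas.
inline bool stack_vec4_is_aligned() {
    Vec4 v{1.0f, 2.0f, 3.0f, 4.0f};
    return is_aligned(&v, alignof(Vec4));
}
```

The static_asserts are the part that speaks to point 2): the layout is pinned by the type itself, not by the build configuration.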
Hi, thanks very much for the post. I just have a few questions about some of the points...

1) You are opposed to using align() because of its limits in that it doesn't work for heap allocated objects. But, you do align your data (for SSE structures for example) to say 16 byte alignment, so how do you go about doing this? Is it just a case of making objects a multiple of 16 bytes in size? I have yet to find anything really good on this on the net.

2) My question about release/debug etc. was because in debug the runtime system automatically adds more memory onto objects you allocate. But I gather from 3) that you don't have to deal with this because you don't leave your memory allocation/deallocation up to the CRT; rather, you make all your calls to malloc etc. as and when you need, and maybe use memory pools to accurately provide memory for your classes. I take it you use STL allocator objects to assist you in doing this, then?

Thanks very much for the help.
Quote:Original post by BigBadBob
1) You are opposed to using align() because of its limits in that it doesn't work for heap allocated objects. But, you do align your data (for SSE structures for example) to say 16 byte alignment, so how do you go about doing this? Is it just a case of making objects a multiple of 16 bytes in size? I have yet to find anything really good on this on the net.

The thing you want to align is the address of the memory:
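#include <stdint.h>  // uintptr_t
#include <stdlib.h>  // malloc
const size_t ALIGN_SIZE = 16;  // must be a power of two
unsigned char* mem = (unsigned char*)malloc(size + ALIGN_SIZE - 1);  // over-allocate
unsigned char* alignedMem = (unsigned char*)(((uintptr_t)mem + ALIGN_SIZE - 1) & ~(uintptr_t)(ALIGN_SIZE - 1));  // round up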

(untested.. from the top of my head ;], might need lots of casting depending on compiler..)
Don't lose the original pointer, or you can't deallocate the memory.

A custom memory library/handler can be nice to have. ;]

Good luck!
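One way to avoid losing the original pointer is to stash it just in front of the aligned block, so the caller only ever holds one pointer. The following is a rough sketch of that idea, not a tested library; the names `aligned_malloc_sketch` and `aligned_free_sketch` are invented here, and it assumes the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns 'size' bytes aligned to 'alignment'
// (a power of two), or nullptr on failure.
void* aligned_malloc_sketch(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    // Worst case we skip alignment-1 bytes, plus room for the saved pointer.
    const std::size_t overhead = alignment - 1 + sizeof(void*);
    if (size > SIZE_MAX - overhead) return nullptr;   // overflow guard

    void* raw = std::malloc(size + overhead);
    if (raw == nullptr) return nullptr;

    // Leave space for one pointer, then round up to the next boundary.
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;

    // Remember what malloc returned, immediately before the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Frees a block obtained from aligned_malloc_sketch; nullptr is ignored.
void aligned_free_sketch(void* p) {
    if (p != nullptr) std::free(static_cast<void**>(p)[-1]);
}
```

The cost is one extra pointer plus up to alignment-1 bytes per allocation, which is why this tends to be used for a few large buffers and not for many small objects.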
Here is another thread where I talk about what I do. If you want to write for a specific platform, you can use _aligned_malloc on Windows and memalign with gcc, but to my knowledge there is no portable solution. I just decided to use Doug Lea's malloc library because I've heard lots of good things about it, and I only needed to change a #define to set memory alignment to 16. AND it let me replace the system memcpy with the specialized memcpy_amd.

The most important thing to me was that my solution be compatible with STL, and Doug Lea's malloc lib allowed me to just write the one custom allocator and that was it. I'd seen other people talk about solutions like gulgi's, but it wasn't immediately clear to me if that would work trivially with STL allocators (since you have to keep the original pointer around). I'm sure you could get it to work, but at that point I was frustrated that it took so much work and research just to get stupid 16-byte alignment, that dlmalloc was the only guaranteed thing I could think of.

I'd also be willing to bet that if you use Intel's compiler, setting alignment takes about 2.8 seconds :)
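For readers wondering what "the one custom allocator" might look like, here is a hedged sketch. It is not the dlmalloc-based allocator described above; it assumes C++11 and swaps in the platform calls `_aligned_malloc`/`_aligned_free` on Windows and `posix_memalign`/`free` elsewhere. The name `AlignedAllocator` is invented for this example. (C++17 later added aligned `operator new` to the standard, which postdates this thread.)

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>
#if defined(_WIN32)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#else
#include <stdlib.h>   // posix_memalign
#endif

// Hypothetical minimal STL allocator. Align must be a power of two and,
// for posix_memalign, a multiple of sizeof(void*).
template <class T, std::size_t Align = 16>
struct AlignedAllocator {
    static_assert((Align & (Align - 1)) == 0, "Align must be a power of two");
    using value_type = T;

    // Explicit rebind: the default one cannot see through the non-type Align.
    template <class U> struct rebind { using other = AlignedAllocator<U, Align>; };

    AlignedAllocator() noexcept = default;
    template <class U> AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        std::size_t bytes = n * sizeof(T);
        if (bytes == 0) bytes = 1;   // avoid implementation-defined zero-size results
        void* p = nullptr;
#if defined(_WIN32)
        p = _aligned_malloc(bytes, Align);
#else
        if (posix_memalign(&p, Align, bytes) != 0) p = nullptr;
#endif
        if (p == nullptr) throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(p) % Align == 0);
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
#if defined(_WIN32)
        _aligned_free(p);
#else
        std::free(p);
#endif
    }
};

// Stateless allocators with the same alignment are interchangeable.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }
```

Usage would be along the lines of `std::vector<float, AlignedAllocator<float, 16>> v(8);`, after which `v.data()` should sit on a 16-byte boundary. Because the platform calls track the original block themselves, the allocator does not need to keep a second pointer around.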
Here is the thread. Damn gd forums logging me out and then not even asking if I really want to post AP. Grrrrr..

