Forcing Alignment of SSE Intrinsics Types onto the Heap with Class Hierarchies

5 comments, last by l0k0 12 years, 3 months ago
I'm currently developing a Component-Entity Model, and am getting rather annoyed by my intrinsic __m128-based matrices and vectors breaking when they are allocated on the heap. I know they must be aligned on 16-byte boundaries, and everything works fine when they are allocated statically. Here is the basic hierarchy currently in place:


class Object {
    string name;
    uint32 instanceId;
};

class Component : public Object {
    GameObject *gameObject;
    uint32 typeId;
}; // 44 bytes total in size

class Transform : public Component {
    Vector2 localPosition; // 52 bytes
    Vector2 localScale; // 60 bytes
    Transform *parent; // 64 bytes
    // Are we not aligned on a 16 bytes boundary here?
    Matrix4 worldMatrix; // holds 4 vector 4s (which wrap __m128)
    std::unordered_map<tbdString, tbdTransform*> children;
    tbdVector2 worldPosition;
    tbdVector2 worldScale;
    float32 localRotation;
    float32 worldRotation;
    bool isDirty;
    uint8 __pad[15]; // -- total of 256 bytes
};


Unfortunately, despite my trying to ensure that worldMatrix is aligned on a proper boundary, it still breaks. I tried overriding new/delete and using an aligned version of malloc/free, but that would break in Object's constructor. Do I need to implement Object and Component as an interface or an abstract class so I can reorder the structure of the concrete class the way I want/need to? Also, why is this failing in its current form? Thanks in advance for the help. Intrinsics are still new to me.
<shameless blog plug>
A Floating Point
</shameless blog plug>
Classes and structs inherit the largest alignment of their members. However, the language is pretty weak with regard to alignment safety, as the alignment of an allocation is not passed into the allocator.

This forces you into several options:

  • provide user-defined global new/delete/new[]/delete[] operators that hardcode the worst-case alignment (16 bytes in all versions of SSE, 32 in AVX). This typically coincides with replacing the heap entirely (with DLMalloc, Hoard, or similar)
  • provide per-class new/delete/new[]/delete[] overloads that do the same thing but in a more limited scope (only where you need them)
  • use placement new on the classes and allocate the memory yourself


The first choice is preferable, but code that executes and allocates memory before main can be a real pain, especially if you replace the heap. You can hit really annoying problems, such as the global variables that manage the heap being constructed back to their default states after some memory has already been allocated.
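The second option (per-class overloads) could look roughly like the sketch below. It is only an illustration under stated assumptions: AllocAligned and FreeAligned are hypothetical helpers written here on top of std::malloc (over-allocate, round up, remember the raw pointer), Matrix4Like is a plain alignas(16) stand-in for a __m128-based matrix so the example stays portable, and AlignedTransform is a made-up class, not the Transform from the first post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical helper: returns memory aligned to `alignment`, which is assumed
// to be a power of two and at least sizeof(void*).
inline void* AllocAligned(std::size_t size, std::size_t alignment) {
    // Over-allocate so there is room to round up and to store the raw pointer.
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (raw == nullptr) {
        throw std::bad_alloc();
    }
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned =
        (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
    // Stash the pointer malloc returned just below the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Hypothetical helper: frees memory obtained from AllocAligned.
inline void FreeAligned(void* pointer) noexcept {
    if (pointer != nullptr) {
        std::free(static_cast<void**>(pointer)[-1]);
    }
}

// Stand-in for a matrix type that wraps four __m128 values.
struct alignas(16) Matrix4Like {
    float m[16];
};

// Made-up class showing per-class allocation overloads.
class AlignedTransform {
public:
    static void* operator new(std::size_t size) {
        return AllocAligned(size, alignof(AlignedTransform));
    }
    static void operator delete(void* pointer) noexcept { FreeAligned(pointer); }
    static void* operator new[](std::size_t size) {
        return AllocAligned(size, alignof(AlignedTransform));
    }
    static void operator delete[](void* pointer) noexcept { FreeAligned(pointer); }

    Matrix4Like worldMatrix;
    bool isDirty = false;
};
```

With this in place, `new AlignedTransform` goes through the class's own operator new, so the object (and therefore worldMatrix) starts on a 16-byte boundary regardless of what the default heap guarantees. Classes derived from it inherit the overloads, but the alignment used is that of AlignedTransform, so a derived class with stricter requirements would need its own.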
http://www.gearboxsoftware.com/

I tried using a global "must align on 16 bytes" allocator but that isn't working :/. Overriding new/delete just inside the structures that really need it isn't working either. It's getting to the point where I'm thinking of dropping intrinsics altogether, unless I can somehow get to the bottom of the problem. I'm using std::string, so do I need to make and set a custom allocator for it that uses my AllocAligned and FreeAligned functions? Also, could the vtable be the culprit of my alignment issues?
<shameless blog plug>
A Floating Point
</shameless blog plug>
A combination of an aligned allocator and compiler-specific alignment attributes should suffice. For Visual C++, look at __declspec(align). For GCC, look at __attribute__((aligned)).

My intrinsic wrappers assert the alignment in their constructors and copy constructors. It can be useful to leave these asserts enabled in release builds for a while, since on Windows allocations from the debug heap seem to be aligned sufficiently for SSE2.
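A minimal sketch of that kind of check, assuming a hypothetical Vec4 wrapper (a plain alignas(16) float array standing in for an __m128 member) and a hypothetical CHECK_ALIGNED16 macro. The macro deliberately avoids the standard assert so that it is not compiled out when NDEBUG is defined, which is one way to keep the check alive in release builds:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// True if `pointer` is a multiple of `alignment` (assumed to be a power of two).
inline bool IsAligned(const void* pointer, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(pointer) & (alignment - 1)) == 0;
}

// Hypothetical check that stays active even when NDEBUG is defined.
#define CHECK_ALIGNED16(ptr)                                              \
    do {                                                                  \
        if (!IsAligned((ptr), 16)) {                                      \
            std::fprintf(stderr, "misaligned object at %p\n",             \
                         static_cast<const void*>(ptr));                  \
            std::abort();                                                 \
        }                                                                 \
    } while (0)

// Hypothetical wrapper; a real one would hold an __m128 instead of floats.
struct alignas(16) Vec4 {
    float v[4];

    Vec4() : v{0.0f, 0.0f, 0.0f, 0.0f} { CHECK_ALIGNED16(this); }

    Vec4(const Vec4& other) {
        CHECK_ALIGNED16(this);
        for (int i = 0; i < 4; ++i) {
            v[i] = other.v[i];
        }
    }

    Vec4& operator=(const Vec4& other) = default;
};
```

If a Vec4 is ever constructed inside a heap block that is only 8-byte aligned, the check fires at construction time rather than later at the first aligned load or store.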

The SSE types __m128 and friends already have __declspec(align(16)) applied to them for you. Placing them as a member in a struct will promote the struct's alignment.

Looking at the original data structures from the first post, the compiler should be generating 12 bytes of padding before the worldMatrix member in struct Transform, and also padding 4 bytes between the struct and the base class (as their alignments are different).





struct TestA
{
    zBYTE bytey;
};
struct TestB : public TestA
{
    vfloat vecy;
};

zSIZE sizeA = sizeof(TestA);
zSIZE alignA = alignof(TestA);
zSIZE sizeB = sizeof(TestB);
zSIZE alignB = alignof(TestB);
zSIZE offset_bytey = offsetof(TestB, bytey);
zSIZE offset_vecy = offsetof(TestB, vecy);




watch window:

sizeA=1
alignA=1
sizeB=32
alignB=16
offset_bytey=0
offset_vecy=16
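The same numbers can be pinned down at compile time instead of being read from a watch window. This is a sketch under assumptions: Vec4f is a hypothetical alignas(16) stand-in for vfloat, unsigned char stands in for zBYTE, and the expected values are the ones reported above, which match the common x86 ABIs but are not guaranteed by the language. The offset is measured with pointer arithmetic because offsetof is only conditionally supported for a type with data members in both base and derived class.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for vfloat (a 16-byte, 16-byte-aligned vector type).
struct alignas(16) Vec4f {
    float v[4];
};

struct TestA {
    unsigned char bytey;  // stand-in for zBYTE
};

struct TestB : public TestA {
    Vec4f vecy;
};

// Compile-time versions of the watch-window values.
static_assert(sizeof(TestA) == 1, "TestA should be a single byte");
static_assert(alignof(TestA) == 1, "TestA should have byte alignment");
static_assert(alignof(TestB) == 16, "TestB should inherit the 16-byte alignment");
static_assert(sizeof(TestB) == 32, "TestB should be padded out to 32 bytes");

// Measures where vecy lands inside a TestB object.
inline std::size_t OffsetOfVecy() {
    TestB b{};
    return static_cast<std::size_t>(reinterpret_cast<const char*>(&b.vecy) -
                                    reinterpret_cast<const char*>(&b));
}
```

If a compiler or packing pragma ever changes the layout, the static_asserts fail the build instead of producing a crash at the first aligned access.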
http://www.gearboxsoftware.com/
Also, I haven't replaced the heap in my homebrew project yet, but I have written a fairly large SIMD library for it (FPU, multiple versions of SSE, and AVX supported), and the global heap is all I've had to hook:

This is ultimately Windows code, as it has an aligned heap available out of the box (_aligned_malloc).

mDEFAULT_ALIGNMENT is 16 in my codebase. Ideally you would pass in the alignof of the type here, but the C++ ABI only passes the size to new (and you don't have access to the type information either).


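#include <malloc.h> // _aligned_malloc, _aligned_free (Visual C++ runtime)
#include <new>      // std::bad_alloc

// zSIZE and mDEFAULT_ALIGNMENT (16) are defined elsewhere in the codebase.

// Allocates from the aligned heap and throws on failure, as operator new must.
void* mAlloc(zSIZE size, zSIZE alignment)
{
    void* pointer = _aligned_malloc(size, alignment);
    if (pointer == nullptr)
    {
        throw std::bad_alloc();
    }
    return pointer;
}

// Memory from _aligned_malloc must be released with _aligned_free.
void mFree(void* pointer)
{
    _aligned_free(pointer);
}

// Global hooks: every allocation gets the worst-case alignment.
void* operator new(zSIZE allocationSize)
{
    return mAlloc(allocationSize, mDEFAULT_ALIGNMENT);
}

void* operator new[](zSIZE allocationSize)
{
    return mAlloc(allocationSize, mDEFAULT_ALIGNMENT);
}

void operator delete(void* pointer)
{
    mFree(pointer);
}

void operator delete[](void* pointer)
{
    mFree(pointer);
}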
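Since the code above relies on the Visual C++ runtime, here is a rough sketch of how the two helpers might be written so they also build on POSIX systems. The names PortableAlloc and PortableFree are hypothetical, and the non-Windows branch assumes posix_memalign is available, which requires the alignment to be a power of two and a multiple of sizeof(void*).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

#if defined(_MSC_VER)
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

// Hypothetical portable counterpart of mAlloc.
inline void* PortableAlloc(std::size_t size, std::size_t alignment) {
#if defined(_MSC_VER)
    void* pointer = _aligned_malloc(size, alignment);
#else
    void* pointer = nullptr;
    // posix_memalign returns non-zero on failure and leaves no usable pointer.
    if (posix_memalign(&pointer, alignment, size) != 0) {
        pointer = nullptr;
    }
#endif
    if (pointer == nullptr) {
        throw std::bad_alloc();
    }
    return pointer;
}

// Hypothetical portable counterpart of mFree.
inline void PortableFree(void* pointer) noexcept {
#if defined(_MSC_VER)
    _aligned_free(pointer);
#else
    std::free(pointer);  // posix_memalign memory is released with free
#endif
}
```

The global operator new/delete hooks could then call these two functions instead of mAlloc and mFree without any other change.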
http://www.gearboxsoftware.com/
Thank you very much Zoner! I went from using my own aligned allocator to _aligned_malloc/_aligned_free and it seems to have resolved the issue. Intrinsics are still a huge pain in the ass but it looks like they're gonna stay in the project.
<shameless blog plug>
A Floating Point
</shameless blog plug>
