Matrix 16 byte alignment

12 comments, last by Ravyne 7 years, 5 months ago

Hey

I'm having some trouble with the DirectX 11 matrix functions.

Not 100%, but maybe 80% of the time, I'm getting the following assertion:

((uintptr_t)pSource & 0xF) == 0

It happens when I call the function:

XMLoadFloat4x4A

Relevant code


class Matrix4v4 {
public:
    __declspec(align(16)) float m[4][4];
};
 

void matrixMultiply( Matrix4v4 * p_Result, const Matrix4v4 * p_A, const Matrix4v4 * p_B ) {
    using namespace DirectX;
    XMMATRIX a = XMLoadFloat4x4A( (const XMFLOAT4X4A *)p_A );
    XMMATRIX b = XMLoadFloat4x4A( (const XMFLOAT4X4A *)p_B );
    XMMATRIX r = XMMatrixMultiply( a, b );
    XMStoreFloat4x4A( (XMFLOAT4X4A *)p_Result, r );
}

This must have something to do with the alignment of my data, I take it.

But the above code is just like they do it in UE4.

I don't really know what to do.
Is it a compiler setting? (I'm using VS2015.)
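For readers unfamiliar with the asserted expression: it tests whether the low four bits of the pointer are zero, which is true only for addresses that are multiples of 16. Below is a minimal sketch of that check in portable C++. The names isAligned16, requireAligned16, AlignedMatrix and demoAlignmentCheck are hypothetical and not part of DirectXMath; alignas(16) is used here as the standard C++11 counterpart of __declspec(align(16)).

```cpp
#include <cassert>
#include <cstdint>

// Returns true when p sits on a 16-byte boundary, i.e. the low four
// address bits are zero. This is the same test as the assertion above.
inline bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Hypothetical guard: fails in debug builds when p is not 16-byte aligned.
inline void requireAligned16(const void* p) {
    assert(isAligned16(p));
}

// Hypothetical stand-in for a 4x4 float matrix. alignas(16) on the type
// makes the compiler align every automatic and static instance.
struct alignas(16) AlignedMatrix {
    float m[4][4];
};

// Exercises the check on a buffer whose alignment is known.
inline bool demoAlignmentCheck() {
    alignas(16) unsigned char buffer[32] = {};
    const bool baseOk = isAligned16(buffer);        // start of an aligned buffer
    const bool offsetOk = isAligned16(buffer + 4);  // 4 bytes past a 16-byte boundary
    return baseOk && !offsetOk;
}
```

Calling requireAligned16 on a pointer just before casting it to an aligned type is one way to catch a misaligned object closer to where it was created.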


Ah...

So I can just use the function XMLoadFloat4x4 instead of XMLoadFloat4x4A.

Note the missing "A" at the end.

My problem has been resolved.

Your problem was resolved by making your code run slower (potentially a _lot_ slower, depending on the particular CPU and how badly it deals with unaligned SSE loads/stores).

The real problem is that you weren't allocating your Matrix4v4 objects with proper alignment. The code snippet you provided didn't show how those objects are being created, so it's impossible for me to say why specifically they were off.

Note that if you're using malloc or new to allocate data, the alignas specification will be utterly ignored (at least before C++17 added aligned new). Those functions don't know anything about a type's alignment requirements, so you must explicitly ask for allocations with the appropriate alignment using something like _aligned_malloc or an operator new overload.

Sean Middleditch – Game Systems Engineer – Join my team!
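The operator new overload mentioned above could look roughly like the following. This is only a sketch under some assumptions: it uses _aligned_malloc/_aligned_free on MSVC and std::aligned_alloc (C++17) elsewhere, and the names allocAligned16, freeAligned16, AlignedMatrix and newCameraIsAligned are hypothetical. The Camera here is a simplified stand-in, not the poster's actual class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#if defined(_MSC_VER)
#include <malloc.h>  // _aligned_malloc / _aligned_free
#endif

// Hypothetical helper: returns 16-byte-aligned storage or nullptr.
inline void* allocAligned16(std::size_t size) {
#if defined(_MSC_VER)
    return _aligned_malloc(size, 16);
#else
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    const std::size_t rounded = (size + 15) & ~static_cast<std::size_t>(15);
    return std::aligned_alloc(16, rounded);
#endif
}

// Hypothetical helper: releases storage obtained from allocAligned16.
inline void freeAligned16(void* p) {
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    std::free(p);
#endif
}

// Stand-in for a 16-byte-aligned 4x4 float matrix.
struct alignas(16) AlignedMatrix {
    float m[4][4];
};

class Camera {
public:
    // Class-specific operator new/delete: `new Camera` gets 16-byte-aligned
    // storage even where the default heap only guarantees 8 bytes.
    static void* operator new(std::size_t size) {
        void* p = allocAligned16(size);
        if (!p) throw std::bad_alloc();
        assert((reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0);
        return p;
    }
    static void operator delete(void* p) noexcept { freeAligned16(p); }

    AlignedMatrix view;
    AlignedMatrix projection;
};

// Allocates a Camera on the heap and reports whether it is 16-byte aligned.
inline bool newCameraIsAligned() {
    Camera* camera = new Camera();
    const bool ok = (reinterpret_cast<std::uintptr_t>(camera) & 0xF) == 0;
    delete camera;
    return ok;
}
```

Only the single-object forms are overloaded here; arrays created with new[] would need matching operator new[] and operator delete[] overloads.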

Your problem was resolved by making your code run slower (potentially a _lot_ slower, depending on the particular CPU and how badly it deals with unaligned SSE loads/stores).

Not in my experience, _mm_loadu_ps() was only a few % slower (maybe 1 cycle at most) than _mm_load_ps() when I did the benchmarks on Intel i7, and that extra cost is not even measurable when the address is aligned. Use aligned loads whenever you can ensure alignment, but it seems like more of a micro-optimization. You'll save more time by thinking carefully about how to lay out data for better cache utilization so that you don't pay tens of cycles each memory access.

YMMV (your mileage may vary). Expensive, power-hungry CPUs like the Intel i7 have the lowest penalty, but on certain architectures (Atom, AMD CPUs) the performance hit is big. This problem also comes back to bite you if you later port to other platforms (e.g. ARM).
Furthermore, how much slower it gets depends on how well the CPU masks the penalty of unaligned access. If you're hitting certain bottlenecks (such as bandwidth limits), the CPU won't be able to mask it well, and that 1% grows.

You'll save more time by thinking carefully about how to lay out data for better cache utilization so that you don't pay tens of cycles each memory access.

Ensuring correct alignment is part of thinking carefully about how to lay out the data. Furthermore, it takes literally seconds of programming work, if not less, and it doesn't make things unreadable or harder to maintain either.

Thanks for pointing this out.
Alignment is something new for me, so I'm still trying to figure it out to the best of my ability.

So, I have tried allocating using new and also just putting the Matrix4v4 on the stack; both result in the assertion going off.
This is the object I'm using that's causing the problem.

It in turn is allocated using "new Camera()":

class Camera {
 
    Matrix4v4 m_View;
    Matrix4v4 m_Projection;
    Matrix4v4 m_ViewProjection;
    Matrix4v4 m_InversedViewProjection;
 
public:
};

Thanks for pointing this out. Alignment is something new for me, so I'm still trying to figure it out to the best of my ability.

Dropping support for x86 in favor of x64 might save you a lot of headaches. :wink:

Thank you, I was indeed compiling for x86.
I recompiled my source for x64, and with the align directives I now seem to be able to use the "A" versions of the functions :)

Can you provide an explanation of why this change is so significant?

On x64, heap allocations are 16-byte aligned by default; on x86, they are only 8-byte aligned.

XMMATRIX and XMVECTOR must be 16-byte aligned in order to use them directly.

So on x64 they are implicitly aligned, and you can use them on the stack or with new.

One downside of 16-byte alignment that you should keep in mind: you can consume more memory if the layout of your data structures is not efficient, for example:


struct Foo
{
    XMMATRIX m1; //starts from 16 bytes implicitly
    bool     b1; //Increases sizeof structure by 16 bytes, because next member is 16-byte aligned
    XMMATRIX m2;
    bool     b2; //Extra 16 bytes
};

So this is better:


struct BetterFoo
{
    XMMATRIX m1; //starts from 16 bytes implicitly
    XMMATRIX m2;
    bool     b1; //b1+b2 add only 16 bytes, not 2*16
    bool     b2;  
};

As for me, I decided not to support x86 at all =)
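The padding cost in the two structs above can be checked at compile time. The sketch below uses a hypothetical Mat4 type as a stand-in for XMMATRIX (64 bytes, 16-byte aligned) so that it builds without DirectXMath; it assumes the usual 4-byte float and 1-byte bool.

```cpp
#include <cstddef>

// Hypothetical stand-in for XMMATRIX: 16 floats, 16-byte aligned.
struct alignas(16) Mat4 {
    float m[4][4];
};

// Interleaved layout: each bool is followed by 15 bytes of padding,
// either so the next matrix starts on a 16-byte boundary or so the
// struct size stays a multiple of 16.
struct Foo {
    Mat4 m1;
    bool b1;
    Mat4 m2;
    bool b2;
};

// Grouped layout: both bools share a single 16-byte tail.
struct BetterFoo {
    Mat4 m1;
    Mat4 m2;
    bool b1;
    bool b2;
};

static_assert(sizeof(Mat4) == 64, "matrix stand-in is 64 bytes");
static_assert(offsetof(Foo, m2) == 80, "b1 plus padding occupies 16 bytes");
static_assert(sizeof(Foo) == 160, "two matrices plus 2 * 16 bytes");
static_assert(sizeof(BetterFoo) == 144, "two matrices plus 16 bytes");
```

Ordering members from the largest alignment to the smallest is a simple rule of thumb that tends to minimize this kind of padding.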

From here:

Properly Align Allocations

The aligned versions of the SSE intrinsics underlying the DirectXMath Library are faster than the unaligned.

For this reason, DirectXMath operations using XMVECTOR and XMMATRIX objects assume those objects are 16-byte aligned.

This is automatic for stack based allocations, if code is compiled against the DirectXMath Library using the recommended Windows (see Use Correct Compilation Settings) compiler settings.

However, it is important to ensure that heap-allocation containing XMVECTOR and XMMATRIX objects, or casts to these types, meet these alignment requirements.

While 64-bit Windows memory allocations are 16-byte aligned, by default on 32 bit versions of Windows memory allocated is only 8-byte aligned.

Thank you so much for taking the time to explain.

Learning stuff every day, it seems :)
