Matrix 16-byte alignment


Hey,

I'm having some trouble with the DirectX 11 matrix functions.

Not 100% of the time, but maybe 80% of the time, I'm getting the following assertion failure:

 

((uintptr_t)pSource & 0xF) == 0

 

It happens when I call the function:

 

XMLoadFloat4x4A

 

Relevant code:

class Matrix4v4 {
public:
    __declspec(align(16)) float m[4][4];
};
 

void matrixMultiply( Matrix4v4 * p_Result, const Matrix4v4 * p_A, const Matrix4v4 * p_B ) {
    using namespace DirectX;
    XMMATRIX a = XMLoadFloat4x4A( (const XMFLOAT4X4A *)p_A );
    XMMATRIX b = XMLoadFloat4x4A( (const XMFLOAT4X4A *)p_B );
    XMMATRIX r = XMMatrixMultiply( a, b );
    XMStoreFloat4x4A( (XMFLOAT4X4A *)p_Result, r );
}

This must have something to do with the alignment of my data, I take it.

But the above code is just like they do it in UE4.

I don't know what to do, really.
Is it a compiler setting? (I'm using VS2015)

Your problem was resolved by making your code run slower (potentially a _lot_ slower, depending on the particular CPU and how badly it deals with unaligned SSE loads/stores).

The real problem is that you weren't allocating your Matrix4v4 objects with proper alignment. The code snippet you provided didn't show how those objects are being created, so it's impossible for me to say specifically why they were off.

Note that if you're using malloc or new to allocate data, the alignas specification will be utterly ignored - those functions don't know anything about a type's alignment requirements, so you must explicitly ask for allocations with the appropriate alignment using something like _aligned_malloc or an operator new overload.
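As an illustration of the operator new overload approach, here is a sketch of the Matrix4v4 from the question with class-level allocation functions. This is not the poster's actual code; it assumes _aligned_malloc on MSVC and posix_memalign elsewhere:

```cpp
#include <cstdlib>
#include <cstdint>
#include <new>
#if defined(_MSC_VER)
#include <malloc.h>  // _aligned_malloc / _aligned_free
#endif

// Sketch: pre-C++17 compilers ignore the member's alignment when you call
// plain `new`, so this class requests aligned storage explicitly.
class Matrix4v4 {
public:
    alignas(16) float m[4][4];

    static void* operator new(std::size_t size) {
        void* p = nullptr;
#if defined(_MSC_VER)
        p = _aligned_malloc(size, 16);
#else
        if (posix_memalign(&p, 16, size) != 0)
            p = nullptr;
#endif
        if (!p) throw std::bad_alloc();
        return p;
    }

    static void operator delete(void* p) noexcept {
#if defined(_MSC_VER)
        _aligned_free(p);
#else
        std::free(p);
#endif
    }
};
```

With something like this in place, `new Matrix4v4` returns a pointer whose low four bits are zero, which is exactly what the `((uintptr_t)pSource & 0xF) == 0` assertion checks. Note that any class that *contains* these matrices would need the same treatment for heap allocation.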


Your problem was resolved by making your code run slower (potentially a _lot_ slower, depending on the particular CPU and how badly it deals with unaligned SSE loads/stores).

 

Not in my experience. _mm_loadu_ps() was only a few percent slower (maybe 1 cycle at most) than _mm_load_ps() when I benchmarked on an Intel i7, and that extra cost isn't even measurable when the address happens to be aligned. Use aligned loads whenever you can ensure alignment, but it seems like more of a micro-optimization. You'll save more time by thinking carefully about how to lay out data for better cache utilization, so that you don't pay tens of cycles on each memory access.


Not in my experience. _mm_loadu_ps() was only a few percent slower (maybe 1 cycle at most) than _mm_load_ps() when I benchmarked on an Intel i7, and that extra cost isn't even measurable when the address happens to be aligned. Use aligned loads whenever you can ensure alignment, but it seems like more of a micro-optimization. You'll save more time by thinking carefully about how to lay out data for better cache utilization, so that you don't pay tens of cycles on each memory access.

YMMV (your mileage may vary). Expensive, power-hungry CPUs like the Intel i7 have the lowest penalty, but on certain architectures the performance hit is big (Atom, older AMD CPUs). This problem also comes back to bite you if you later port to other platforms (e.g. ARM).
Furthermore, how much slower depends on how well the CPU masks the penalty of unaligned access. If you're hitting certain bottlenecks (such as bandwidth limits), the CPU won't be able to mask it well, and that 1% grows.

You'll save more time by thinking carefully about how to lay out data for better cache utilization so that you don't pay tens of cycles each memory access.

Ensuring alignment is correct is part of carefully thinking about how to lay out the data. Furthermore, ensuring correct alignment takes literally seconds of programming work, if not less, and it doesn't make things unreadable or harder to maintain either.

Edited by Matias Goldberg


Your problem was resolved by making your code run slower (potentially a _lot_ slower, depending on the particular CPU and how badly it deals with unaligned SSE loads/stores).

The real problem is that you weren't allocating your Matrix4v4 objects with proper alignment. The code snippet you provided didn't show how those objects are being created, so it's impossible for me to say specifically why they were off.

Note that if you're using malloc or new to allocate data, the alignas specification will be utterly ignored - those functions don't know anything about a type's alignment requirements, so you must explicitly ask for allocations with the appropriate alignment using something like _aligned_malloc or an operator new overload.

 

Thanks for pointing this out.
Alignment is something new for me, so I'm still trying to figure it out to the best of my ability.

So, I have tried allocating using new and also just putting the Matrix4v4 on the stack; both result in the assertion going off.
This is the object I'm using that's causing the problem.

It in turn is allocated using "new Camera()":

class Camera {
 
    Matrix4v4 m_View;
    Matrix4v4 m_Projection;
    Matrix4v4 m_ViewProjection;
    Matrix4v4 m_InversedViewProjection;
 
public:
};

Thanks for pointing this out. Alignment is something new for me, so I'm still trying to figure it out to the best of my ability.

Dropping support of x86 in favor of x64 might solve a lot of headaches. :wink:

Edited by Happy SDE


 

Thanks for pointing this out. Alignment is something new for me, so I'm still trying to figure it out to the best of my ability.

Dropping support of x86 in favor of x64 might solve a lot of headaches. :wink:

 

 

Thank you, I was indeed compiling for x86.
I recompiled my source for x64, and with the align directives I now seem to be able to use the "A"-suffixed functions :)

Can you provide an explanation of why this change is so significant?

Can you provide an explanation of why this change is so significant?

 

x64 aligns heap data to 16 bytes; x86, only to 8.

XMMATRIX and XMVECTOR must be 16-byte aligned in order to be used directly.

So on x64 they are implicitly aligned, and you can use them on the stack or from new.

 

A downside of x64 to keep in mind: you can consume more memory if the alignment of your data structures is not efficient, like:

struct Foo
{
    XMMATRIX m1; //starts 16-byte aligned implicitly
    bool     b1; //increases sizeof by 16 bytes (1 byte + 15 padding), because the next member is 16-byte aligned
    XMMATRIX m2;
    bool     b2; //another 16 bytes (1 byte + 15 bytes tail padding)
};

So this is better:

struct BetterFoo
{
    XMMATRIX m1; //starts 16-byte aligned implicitly
    XMMATRIX m2;
    bool     b1; //b1+b2 add only 16 bytes total, not 2*16
    bool     b2;
};
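The padding cost can be checked at compile time. The sketch below uses a stand-in struct with XMMATRIX's size and alignment (64 bytes, 16-byte aligned) so it compiles without DirectXMath:

```cpp
#include <cstddef>

// Stand-in with the same size/alignment as DirectXMath's XMMATRIX.
struct alignas(16) Mat { float m[4][4]; };

struct Foo {          // interleaved: padding after every bool
    Mat  m1;
    bool b1;          // 1 byte + 15 bytes of padding before m2
    Mat  m2;
    bool b2;          // 1 byte + 15 bytes of tail padding
};

struct BetterFoo {    // matrices first, bools grouped at the end
    Mat  m1;
    Mat  m2;
    bool b1;          // b1 and b2 share a single 16-byte tail block
    bool b2;
};

static_assert(sizeof(Foo) == 160, "2*64 + 2*16");
static_assert(sizeof(BetterFoo) == 144, "2*64 + 16");
```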

As for me, I decided not to support x86 at all =)

 

From the DirectXMath programming guide:

Properly Align Allocations

The aligned versions of the SSE intrinsics underlying the DirectXMath library are faster than the unaligned versions.

For this reason, DirectXMath operations using XMVECTOR and XMMATRIX objects assume those objects are 16-byte aligned.

 

This is automatic for stack based allocations, if code is compiled against the DirectXMath Library using the recommended Windows (see Use Correct Compilation Settings) compiler settings.

However, it is important to ensure that heap allocations containing XMVECTOR and XMMATRIX objects, or casts to these types, meet these alignment requirements.

While 64-bit Windows memory allocations are 16-byte aligned, on 32-bit versions of Windows memory is by default only 8-byte aligned.

Edited by Happy SDE

These claims are both misleading.

Dropping support of x86 in favor of x64 might solve a lot of headaches

x64 aligns data by 16 bytes, x86 - by 8.

The default alignment is entirely implementation-defined. There is no guarantee that allocations are 16-byte aligned just because you're on the x64 architecture. There are absolutely platforms that use 8-byte new/malloc alignment even when compiled for x86_64.

OSX/iOS - guaranteed 16-byte alignment no matter which architecture is targeted.
Microsoft - guaranteed 16-byte alignment on x64, 8-byte alignment on x86.
Linux/Android - guaranteed 8-byte alignment only.
Other platforms - not sure off the top of my head.


Linux/Android - guaranteed 8-byte alignment only.

 
The GNU libc guarantees 16-byte alignment on x64 (http://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html).


I stand corrected. That hadn't been the case in the past.

I'd still be wary of relying on the behavior, though.

It's probably worth noting that just installing a custom allocator also solves the problem. Most serious game engines I've used drop in a custom allocator that makes alignment guarantees, either a flat 16-byte alignment guarantee or 16-byte alignment for blocks at least 16 bytes in size.

Edited by SeanMiddleditch

Share this post


Link to post
Share on other sites

Another potential performance impact of unaligned vectors and matrices is that your data can cross a cache-line boundary, increasing cache evictions and potentially wasting precious memory bandwidth.

A 4x4 single-precision matrix fills a cache line exactly on most current architectures, so you might even consider aligning static/long-lived matrices on 64-byte boundaries. For 4-wide single-precision vectors, aligning on 16-byte addresses removes the possibility of crossing a cache-line boundary, which in the worst case can cause your program to read 128 bytes of data to use only 16 bytes of it (though you probably shouldn't be operating on single small vectors anyway); it can also evict other useful data already in the cache.

I imagine small arrays of small vectors could also benefit from 64-byte alignment (of the array, not the individual vectors), but I'm not sure how quickly the prefetcher picks up on the array and kicks in -- this potential optimization would only help quite small arrays of vectors (I'd guess < 8 vectors for certain, < 16 probably) -- though it'll never hurt, AFAICT.
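For the 64-byte case: since a 4x4 float matrix is exactly one cache line on typical current cores, alignas(64) guarantees a long-lived matrix never straddles two lines. A minimal sketch (the type name is made up):

```cpp
#include <cstddef>

// A 4x4 single-precision matrix is exactly 64 bytes -- one cache line on
// most current x86/ARM cores. alignas(64) pins it to a line boundary so a
// load never touches two lines.
struct alignas(64) CacheLineMat {
    float m[4][4];
};

static_assert(sizeof(CacheLineMat) == 64, "fills exactly one 64-byte line");
static_assert(alignof(CacheLineMat) == 64, "starts on a line boundary");
```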

