Alignment requirements

29 comments, last by Starfox 10 years, 3 months ago
  • The application needs to provide a new flag when registering types that require 16-byte alignment, e.g. asOBJ_APP_ALIGN16 (asCScriptEngine::RegisterObjectType)

I didn't add a flag for 16-byte alignment (since we'll need more than just 16-byte aligned variables soon); instead, I added a new optional parameter to RegisterObjectType:

https://bitbucket.org/sherief/angelscript/commits/a6dd89ef276d09825c15a929dac6c6d2055d0cd6
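
For illustration, here is a minimal sketch of what an optional trailing alignment parameter could look like. It assumes a default of 4 (the default mentioned later in this thread) and power-of-two alignments. TypeRegistry, TypeRecord and this RegisterObjectType signature are hypothetical stand-ins. They are not the signature in the commit above or AngelScript's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical per-type record; NOT AngelScript's real asCObjectType.
struct TypeRecord {
    std::size_t size;
    unsigned    flags;
    std::size_t alignment;
};

// True when x is a non-zero power of two (1, 2, 4, 8, 16, 32, ...).
constexpr bool IsPowerOfTwo(std::size_t x) { return x != 0 && (x & (x - 1)) == 0; }

class TypeRegistry {
public:
    // Hypothetical signature: the trailing alignment parameter is optional and
    // defaults to 4, so existing callers compile and behave as before.
    // Returns 0 on success, -1 on an invalid alignment or a duplicate name.
    int RegisterObjectType(const std::string& name, std::size_t byteSize,
                           unsigned flags, std::size_t alignment = 4) {
        if (!IsPowerOfTwo(alignment)) return -1;  // 16, 32, 64 ... all accepted
        if (types_.count(name) != 0) return -1;
        types_[name] = TypeRecord{byteSize, flags, alignment};
        return 0;
    }

    // Returns the stored record, or nullptr if the type was never registered.
    const TypeRecord* Find(const std::string& name) const {
        auto it = types_.find(name);
        return it == types_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, TypeRecord> types_;
};
```

Because the parameter has a default, existing RegisterObjectType calls compile unchanged. Nothing in the registry is tied to 16 specifically.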

Holy crap I started a blog - http://unobvious.typepad.com/

  • I didn't add a flag for 16-byte alignment (since we'll need more than just 16-byte aligned variables soon); instead, I added a new optional parameter to RegisterObjectType:

What other alignment requirements are you in need of?

I'll review all the patches above as soon as I can and give my feedback.

AngelCode.com - game development and more - Reference DB - game developer references
AngelScript - free scripting library - BMFont - free bitmap font generator - Tower - free puzzle game

  • What other alignment requirements are you in need of?

32, for AVX. For future compatibility I want the changes to be able to support any alignment.
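
Supporting any alignment mostly comes down to one piece of rounding arithmetic wherever offsets or sizes are computed. The sketch below assumes power-of-two alignments. AlignUp is a hypothetical helper, not an existing AngelScript function.

```cpp
#include <cassert>
#include <cstddef>

// Rounds offset up to the next multiple of alignment.
// Valid for any power-of-two alignment (4, 16, 32, 64, ...), so one code path
// covers SSE (16) and AVX (32) types alike.
inline std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For example, placing a 32-byte aligned member after 12 bytes of other data gives AlignUp(12, 32) == 32. The same call with 16 serves SSE types.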

  • The engine must make sure to always allocate the memory for script objects on 16-byte aligned boundaries (asCScriptEngine::CallAlloc)

CallAlloc() relies on useralloc; should its signature be changed to add an alignment parameter? I can do this, and I'd also add a simple wrapper for the default malloc()/free() use case.
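
Here is one way such a wrapper around the default malloc()/free() could look, as a sketch. It over-allocates, rounds the pointer up, and keeps the original pointer just in front of the aligned block so the block can be freed. AlignedAllocWrapper and AlignedFreeWrapper are hypothetical names. Memory from the allocate function must only be released through the matching free function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical: layers arbitrary power-of-two alignment on top of malloc().
// The original malloc() pointer is stored in the slot just before the
// returned (aligned) address so AlignedFreeWrapper can recover it.
void* AlignedAllocWrapper(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (alignment < alignof(void*)) alignment = alignof(void*);
    // Worst case: alignment - 1 bytes of padding plus room for the stashed pointer.
    void* raw = std::malloc(size + alignment - 1 + sizeof(void*));
    if (!raw) return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;
    reinterpret_cast<void**>(aligned)[-1] = raw;  // stash for AlignedFreeWrapper
    return reinterpret_cast<void*>(aligned);
}

// Frees memory obtained from AlignedAllocWrapper; accepts nullptr like free().
void AlignedFreeWrapper(void* p) {
    if (p) std::free(static_cast<void**>(p)[-1]);
}
```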


I don't think AVX will require 32-byte alignment. Do you have a reference that says it will? In fact, the Wikipedia entry on AVX says the memory alignment requirement of SIMD instructions may be relaxed, so it is quite possible that not even 16-byte alignment will be needed (unfortunately, no source was given for this).

But, feel free to make the code generic to support any alignment requirement.

As for CallAlloc(): not all memory allocations need to be aligned, and you definitely do not want to force alignment of small allocations, as it may waste a lot of memory. I think you need to add a secondary method, e.g. CallAllocAligned(). This new method can optionally take an argument with the desired alignment (though for performance it may be better to just hardcode it to 16). The code that allocates memory must know whether the memory needs to be aligned or not.
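
To put rough numbers on the memory waste, here is a back-of-the-envelope model. It assumes an over-allocating wrapper that adds up to alignment - 1 bytes of padding plus one stored pointer per block, and it ignores malloc()'s own bookkeeping. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Worst-case extra bytes an over-allocating aligned wrapper requests per block:
// up to (alignment - 1) bytes of padding plus one stashed pointer.
// This models a malloc()-based wrapper only; other aligned allocators differ.
constexpr std::size_t WorstCaseOverhead(std::size_t alignment) {
    return (alignment - 1) + sizeof(void*);
}

// Overhead relative to the payload: 2.0 means twice the object size is wasted.
constexpr double OverheadRatio(std::size_t size, std::size_t alignment) {
    return static_cast<double>(WorstCaseOverhead(alignment)) /
           static_cast<double>(size);
}
```

On a 64-bit build, a 16-byte object aligned to 32 may cost up to 39 extra bytes, more than twice its own size. A 4 KiB object pays under 1%.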


vmovaps with a 256-bit ymm operand requires 32-byte alignment. We also see a performance benefit from aligning matrices on 32- or 64-byte boundaries due to cache behavior.
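
On the C++ side, the compiler can already be told about such requirements. The sketch below assumes C++17 or later. alignas(64) pins a 64-byte 4x4 float matrix to a cache line, and new-expressions then use the aligned operator new. This covers only objects the compiler allocates (locals, globals, new). Memory the engine obtains through its own malloc()-based allocator gets none of this, which is why the engine changes are needed. Mat4, MakeMat4 and IsAligned are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// A 4x4 float matrix is exactly 64 bytes. alignas(64) keeps each instance on
// one 64-byte line and also meets the 32-byte need of 256-bit aligned moves.
struct alignas(64) Mat4 {
    float m[16];
};
static_assert(sizeof(Mat4) == 64, "Mat4 should fill exactly one cache line");
static_assert(alignof(Mat4) == 64, "alignas(64) should be honored");

// Since C++17, new-expressions for over-aligned types call the
// std::align_val_t overload of operator new, so heap instances are aligned too.
inline std::unique_ptr<Mat4> MakeMat4() { return std::make_unique<Mat4>(); }

// True when p is a multiple of alignment.
inline bool IsAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```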


For a reference, see the Intel® Advanced Vector Extensions Programming Reference:

Table 2-4. Instructions Requiring Explicitly Aligned Memory

Require 16-byte alignment    Require 32-byte alignment
(V)MOVDQA xmm, m128          VMOVDQA ymm, m256
(V)MOVDQA m128, xmm          VMOVDQA m256, ymm
(V)MOVAPS xmm, m128          VMOVAPS ymm, m256
(V)MOVAPS m128, xmm          VMOVAPS m256, ymm
(V)MOVAPD xmm, m128          VMOVAPD ymm, m256
(V)MOVAPD m128, xmm          VMOVAPD m256, ymm
(V)MOVNTPS m128, xmm         VMOVNTPS m256, ymm
(V)MOVNTPD m128, xmm         VMOVNTPD m256, ymm
(V)MOVNTDQ m128, xmm         VMOVNTDQ m256, ymm
(V)MOVNTDQA xmm, m128        VMOVNTDQA ymm, m256

Table 2-5. Instructions Not Requiring Explicit Memory Alignment

(V)MOVDQU xmm, m128
(V)MOVDQU m128, xmm
(V)MOVUPS xmm, m128
(V)MOVUPS m128, xmm
(V)MOVUPD xmm, m128
(V)MOVUPD m128, xmm
VMOVDQU ymm, m256
VMOVDQU m256, ymm
VMOVUPS ymm, m256
VMOVUPS m256, ymm
VMOVUPD ymm, m256
VMOVUPD m256, ymm

Section 3.6.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf) states:

Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture.

An access to data unaligned on 64-byte boundary leads to two memory accesses and requires several µops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty, the cost of each stall generally are greater on machines with longer pipelines.
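
The split described in the quote can be counted directly. Here is a small sketch that assumes the 64-byte line size from the manual. CacheLinesTouched is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Line size taken from the quoted Intel manual (Pentium 4, Core microarchitecture).
constexpr std::size_t kCacheLine = 64;

// Number of 64-byte cache lines touched by an access of `size` bytes (size > 0)
// starting at address `addr`. A result of 2 for a small access is a cache line split.
constexpr std::size_t CacheLinesTouched(std::uintptr_t addr, std::size_t size) {
    return (addr + size - 1) / kCacheLine - addr / kCacheLine + 1;
}
```

A 32-byte ymm access at a 32-byte aligned address always stays within one line, because 64 is a multiple of 32. The same access at offset 48 touches two lines, which is exactly the case the manual warns about.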

^^ what he said.


:)

The changes you've done so far seem to be fine.

Are you testing against the test_feature app in SVN? It's the best way to make sure you don't accidentally break anything.


  • As for CallAlloc(): not all memory allocations need to be aligned, and you definitely do not want to force alignment of small allocations, as it may waste a lot of memory. I think you need to add a secondary method, e.g. CallAllocAligned(). This new method can optionally take an argument with the desired alignment (though for performance it may be better to just hardcode it to 16). The code that allocates memory must know whether the memory needs to be aligned or not.

Some types do need natural alignment, though, and the default allocator (which wraps malloc()) already aligns to the largest fundamental type alignment on the platform, since malloc() is guaranteed to do that. On OS X, my primary platform, malloc() is guaranteed to return 16-byte aligned data. On almost every platform, malloc() is required by the spec to return a pointer that is at least as aligned as a double (8 bytes).
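
The guarantee can be checked against alignof(std::max_align_t), the fundamental alignment that malloc() results are meant to satisfy. This is a sketch. kMallocAlignment, MallocSuffices and IsAligned are illustrative names. The 16-byte figure in the comment describes common 64-bit platforms and is not a language guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Alignment malloc() results are meant to satisfy: suitable for any fundamental
// type. Commonly 16 on x86-64 Linux and macOS; never less than alignof(double).
constexpr std::size_t kMallocAlignment = alignof(std::max_align_t);

// True when plain malloc() already meets a type's required alignment.
constexpr bool MallocSuffices(std::size_t requiredAlignment) {
    return requiredAlignment <= kMallocAlignment;
}

// True when p is a multiple of alignment.
inline bool IsAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Under this reading, types at the default alignment of 4 need no special handling. Neither do 16-byte types on platforms where max_align_t is 16-aligned. Only 32- or 64-byte requirements would.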

Also, the patches add an alignment value to the type info that defaults to 4, and that value is what gets passed to allocators when allocating memory for a type, so for existing code the behavior should be exactly the same.
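
Here is a sketch of how an allocator hook that receives the type's alignment could keep existing behavior identical. It assumes C++17 for the std::align_val_t overloads of operator new/delete. AllocForType and FreeForType are hypothetical, not AngelScript's memory functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical allocator hook receiving the type's alignment (a power of two).
// Types registered without an explicit alignment carry the default of 4, which
// malloc() already satisfies, so they take exactly the old malloc() path.
// Returns nullptr on failure in both branches.
void* AllocForType(std::size_t size, std::size_t alignment = 4) {
    if (alignment <= alignof(std::max_align_t))
        return std::malloc(size);  // unchanged behavior for existing types
    // C++17 nothrow aligned operator new for over-aligned types.
    return ::operator new(size, static_cast<std::align_val_t>(alignment), std::nothrow);
}

// Must be called with the same alignment that was used to allocate.
void FreeForType(void* p, std::size_t alignment = 4) {
    if (!p) return;
    if (alignment <= alignof(std::max_align_t))
        std::free(p);
    else
        ::operator delete(p, static_cast<std::align_val_t>(alignment));
}
```

With the default alignment of 4, both functions reduce to plain malloc()/free(). The caller has to pass the same alignment to the free function, which is one argument for keeping the value in the type info.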


