
# Alignment requirements


I'm having trouble registering a float3 class with a float3(float x, float y, float z) constructor as a scoped ref type. Calls specifying asBEHAVE_CONSTRUCT fail when the flags are asOBJ_REF | asOBJ_SCOPED and adding asOBJ_APP_CLASS fails when registering the object type. How would I do that?

If possible, I'd like to cast a vote for the alignment feature request described at http://www.gamedev.net/topic/606270-memory-aligned-objects/?gopid=4835684 - aligned vector types exist in every single game engine that I know of, and that option would significantly improve the ease of binding.


##### Share on other sites

You should use asBEHAVE_FACTORY for asOBJ_SCOPED objects. Consider coupling this with a memory pool to avoid doing dynamic allocations for each instance.

You'll also need to register the asBEHAVE_RELEASE behaviour to free the memory allocated with the factory, but not the asBEHAVE_ADDREF behaviour, as there is no reference counting with asOBJ_SCOPED objects.

Manual: Registering scoped reference types
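As a rough illustration of the factory/release pairing described above, here is a minimal sketch that also follows the memory pool suggestion. The `float3` layout, the pool size, and the names `Float3Factory` and `Float3Release` are assumptions made for this example. The registration calls appear only as comments, modelled on the scoped reference type section of the manual, so that the block compiles without the AngelScript headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical application type; alignas(16) pads it from 12 to 16 bytes.
struct alignas(16) float3 {
    float x, y, z;
    float3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
};

namespace {
// Fixed-size pool of 16-byte aligned slots (the size is an arbitrary choice).
constexpr std::size_t kPoolSize = 64;
alignas(16) unsigned char g_storage[kPoolSize * sizeof(float3)];
void* g_freeList[kPoolSize];
std::size_t g_freeCount = 0;
bool g_poolReady = false;

void InitPool() {
    for (std::size_t i = 0; i < kPoolSize; ++i)
        g_freeList[i] = g_storage + i * sizeof(float3);
    g_freeCount = kPoolSize;
    g_poolReady = true;
}
} // namespace

// Factory: constructs the object in a free pool slot, no dynamic allocation.
float3* Float3Factory(float x, float y, float z) {
    if (!g_poolReady) InitPool();
    if (g_freeCount == 0) return nullptr; // pool exhausted; a real binding would report an error
    void* slot = g_freeList[--g_freeCount];
    return new (slot) float3(x, y, z);
}

// Release: destroys the object and returns its slot to the pool.
void Float3Release(float3* v) {
    if (!v) return;
    v->~float3();
    g_freeList[g_freeCount++] = v;
}

// Registration, following the pattern in the manual (not compiled here):
//   engine->RegisterObjectType("float3", 0, asOBJ_REF | asOBJ_SCOPED);
//   engine->RegisterObjectBehaviour("float3", asBEHAVE_FACTORY,
//       "float3 @f(float, float, float)", asFUNCTION(Float3Factory), asCALL_CDECL);
//   engine->RegisterObjectBehaviour("float3", asBEHAVE_RELEASE,
//       "void f()", asFUNCTION(Float3Release), asCALL_CDECL_OBJLAST);
```

Because every slot is a multiple of 16 bytes into a 16-byte aligned buffer, each object handed to the script is aligned regardless of how the script stack is laid out.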

I agree that the alignment feature might make things a lot easier, but it is a feature that will require quite a bit of work to get right, not least updating the assembly code for the native calling conventions to support this type too. Needless to say, it will take a while before I get to start implementing this feature, though it is always in the back of my mind.

You may however want to consider if it is really wise to expose the aligned vector class to the script. They are great for math heavy calculations, but when mixed with other computation they have quite a bit of overhead as the loading and unloading of the SIMD registers will be performed even for trivial calculations. It will also use up 33% more memory than you usually need for a vector3 structure.

This is why for example DirectXMath has two separate vector classes. One for heavy duty math work (XMVECTOR), and one for normal work and storage (XMFLOAT3).
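The 33% figure follows directly from the padding. The sketch below uses two hypothetical structs (`Float3Storage` and `Float3Simd`, stand-ins rather than the DirectXMath types) and assumes a 4-byte `float`, which holds on common platforms.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical storage type: three packed floats with 4-byte alignment.
struct Float3Storage { float x, y, z; };

// Hypothetical SIMD-friendly type: padded to fill a 16-byte register.
struct alignas(16) Float3Simd { float x, y, z; };

static_assert(sizeof(Float3Storage) == 12, "packed vector is 12 bytes");
static_assert(sizeof(Float3Simd) == 16, "aligned vector is padded to 16 bytes");
static_assert(alignof(Float3Simd) == 16, "aligned vector needs 16-byte addresses");

// Extra memory of the aligned layout relative to the packed one, in whole percent.
constexpr std::size_t OverheadPercent() {
    return (sizeof(Float3Simd) - sizeof(Float3Storage)) * 100 / sizeof(Float3Storage);
}
```

Four extra bytes on top of twelve is one third, which is where the "33% more memory" comes from.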


##### Share on other sites

Math-heavy scripts are of course a bad idea, but for simple back and forth passing of parameters implementing a whole new class just for unaligned storage is overkill. If you accept contributions we could have someone do the work necessary if you can provide some outline assistance.


##### Share on other sites

I haven't made a complete mapping of all the changes that would be needed, but these are the high level changes:

• The script context must make sure to always allocate the local stack memory buffer on 16byte aligned boundaries (asCContext::ReserveStackSpace)
• The engine must make sure to always allocate the memory for the script objects on 16byte aligned boundaries (asCScriptEngine::CallAlloc)
• The application needs to specify a new flag when registering types that require 16byte alignment, e.g. asOBJ_APP_ALIGN16 (asCScriptEngine::RegisterObjectType)
• The script object type must make sure to align member properties of these types correctly (asCObjectType::AddPropertyToClass)
• Script global properties must allocate memory on 16byte boundaries if holding these types (asCGlobalProperty::AllocateMemory)
• The script compiler must make sure to allocate the local variables on 16byte boundaries (asCCompiler::AllocateVariable)
• The script compiler must add pad bytes on the stack for all function calls to guarantee that the stack position is 16byte aligned on entry in the called function (asCCompiler)
• The bytecode serializer must be capable of adjusting these pad bytes to guarantee platform independent saved bytecode. Remember that the registered type may not be 16byte aligned on all platforms (asCWriter & asCReader)
• The bytecode serializer must also be prepared to adjust the position of the local variables according to the need for 16byte alignment (asCWriter & asCReader)
• The code for the native calling conventions must be adjusted for all platforms that should support 16byte aligned types (as_callfunc...)
• When the context needs to grow the local stack memory it must copy the function arguments so that the stack entry position is 16byte aligned (asCContext::CallScriptFunction)
• When the context is prepared for a new call, it must set the initial stack position so the stack entry position is 16byte aligned (asCContext::Prepare)

There may be some other changes needed as well. If I remember anything else later I'll add to this list.

The bullets in red are the complex changes. The other changes should be quite trivial.
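Several of the items listed above (member properties, local variables, pad bytes before calls) come down to the same arithmetic: rounding an offset up to the required alignment. The following is a minimal sketch of that step. `AlignUp` and `Layout` are hypothetical helpers written for this example and are not part of the AngelScript code base.

```cpp
#include <cassert>
#include <cstddef>

// Round offset up to the next multiple of alignment (must be a power of two).
inline std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical layout helper: places members one after another, inserting
// pad bytes whenever a member needs a stricter alignment than the current offset.
struct Layout {
    std::size_t size = 0;      // bytes used so far
    std::size_t alignment = 4; // strictest member alignment seen so far

    // Adds a member and returns the offset it was placed at.
    std::size_t Add(std::size_t memberSize, std::size_t memberAlign) {
        std::size_t offset = AlignUp(size, memberAlign);
        size = offset + memberSize;
        if (memberAlign > alignment) alignment = memberAlign;
        return offset;
    }

    // Total size, padded so that arrays of this layout stay aligned.
    std::size_t TotalSize() const { return AlignUp(size, alignment); }
};
```

For example, a 4-byte member followed by a 16-byte aligned member lands the second one at offset 16, with 12 pad bytes in between.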


##### Share on other sites

Sounds good. I'll dive into the code as soon as I can.

One thing I noticed though: the alignment errors only happen in release builds, not debug builds. Is it possible that this is due to behavioral changes in malloc()? If so, can those changes be simulated for the release build? That sounds like a lower-effort fix.


##### Share on other sites

I don't think it is because of anything malloc does. More likely the debug build uses different instructions to load the SIMD registers, so the __m128 types don't require alignment. This is however something that would be in your application, and not in AngelScript.

You can set custom memory functions with asSetGlobalMemoryFunctions(). With this the application can use memory routines that are guaranteed to always return 16byte aligned memory to the script library. You probably don't want to use 16byte aligned allocations for everything though, as it will waste a lot of memory when the allocations are smaller than 16 bytes.

This gave me an idea. The code in as_memory.h can perhaps be enhanced to have a new macro for allocating 16byte aligned memory, e.g. asNEW16 and asNEWARRAY16. This macro can then call a new userAlloc16 global function. The pieces of code I mentioned above that need to guarantee 16byte aligned memory would then only have to call these macros instead of the existing ones to allocate the memory.
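One way such a userAlloc16 function could be built on top of plain malloc() is sketched below. `Alloc16` and `Free16` are hypothetical names. They keep the same shape as malloc()/free(), which is the shape the callbacks passed to asSetGlobalMemoryFunctions() are expected to have. The technique is ordinary over-allocation, not anything taken from the AngelScript sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical userAlloc16: returns memory aligned to 16 bytes.
// It over-allocates and stores the original malloc() pointer just
// in front of the aligned address so that Free16 can recover it.
void* Alloc16(std::size_t size) {
    const std::uintptr_t alignment = 16;
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (!raw) return nullptr;

    // Leave room for the stored pointer, then round up to 16 bytes.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned = (start + alignment - 1) & ~(alignment - 1);
    assert(aligned % alignment == 0);

    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Hypothetical counterpart: frees memory obtained from Alloc16.
void Free16(void* ptr) {
    if (!ptr) return;
    std::free(static_cast<void**>(ptr)[-1]);
}
```

The cost is up to 16 bytes of slack plus one pointer per allocation, which is why this path is best reserved for the allocations that actually need it.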


##### Share on other sites

FWIW, they currently both call userAlloc, which defaults to malloc(), which guarantees 16-byte alignment on OS X - so that's probably not it. Something else about the debug environment is causing variables created on the AS stack to always have 16-byte aligned addresses. If only I knew what it was, I would change it so that it's guaranteed to behave the way it does by accident under the debug environment...


##### Share on other sites

There is nothing in the AngelScript code to guarantee 16byte aligned addresses of local variables at the moment. Even if malloc() is guaranteed to return properly aligned memory buffers, the local variables in the script will be packed at 4byte boundaries.
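A small sketch of why an aligned buffer is not enough: once variables are packed at 4-byte offsets, only some of them happen to fall on 16-byte addresses. `ScriptStack` is a hypothetical stand-in for the context's stack buffer, used here only to make the addresses concrete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the address is a multiple of the given alignment.
inline bool IsAligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical stand-in for a script stack whose base is 16-byte aligned.
struct ScriptStack {
    alignas(16) unsigned char bytes[64];

    // Address of a variable placed at the given byte offset.
    const void* At(std::size_t offset) const { return bytes + offset; }
};
```

A variable at offset 4 satisfies 4-byte alignment but not 16-byte alignment, even though the buffer itself starts on a 16-byte boundary.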


##### Share on other sites

I haven't made a complete mapping of all the changes that would be needed, but these are the high level changes:

• The script context must make sure to always allocate the local stack memory buffer on 16byte aligned boundaries (asCContext::ReserveStackSpace)

I think I implemented that part, check out commits https://bitbucket.org/sherief/angelscript/commits/b9bca19ffa001ce628d106adc95669c666af4efb and https://bitbucket.org/sherief/angelscript/commits/c8070156e0141c28fad7e782f28947431834919a and let me know what you think.

Edited by Starfox

##### Share on other sites
• The application needs to inform a new flag when registering types that require 16byte alignment, e.g. asOBJ_APP_ALIGN16 (asCScriptEngine::RegisterObjectType)

I didn't add a flag for 16-byte alignment (since we'll need more than just 16 byte aligned variables soon), I added a new optional parameter to RegisterObjectType:

https://bitbucket.org/sherief/angelscript/commits/a6dd89ef276d09825c15a929dac6c6d2055d0cd6


##### Share on other sites

• I didn't add a flag for 16-byte alignment (since we'll need more than just 16 byte aligned variables soon), I added a new optional parameter to RegisterObjectType:

What other alignment requirements are you in need of?

I'll review all above patches as soon as I can and provide my feedback.


##### Share on other sites

What other alignment requirements are you in need of?

32, for AVX. For future compatibility I want the changes to be able to support any alignment.


##### Share on other sites
• The engine must make sure to always allocate the memory for the script objects on 16byte aligned boundaries (asCScriptEngine::CallAlloc)

CallAlloc() relies on useralloc - should the signature be changed to add an alignment parameter? I can do this, and I'd also add a simple wrapper for the default malloc() / free() use case.


##### Share on other sites

I don't think AVX will require 32 byte alignment. Do you have any reference that says it will? In fact the Wikipedia entry on AVX says the memory alignment requirement of SIMD instructions may be relaxed, so it is quite possible that not even 16byte alignment will be needed (unfortunately no reference for this was given).

But, feel free to make the code generic to support any alignment requirement.

As for CallAlloc(): not all memory allocations need to be aligned, and you definitely do not want to force alignment of small allocations as it may waste a lot of memory. I think you need to add a secondary method, e.g. CallAllocAligned(). This new method can optionally take an argument with the desired alignment (though, for performance it may be better just to hardcode it to 16). The code that allocates memory must know if the memory needs to be aligned or not.


##### Share on other sites

vmovaps requires 32-byte alignment when used with ymm operands. We also see a performance benefit from aligning matrices on 32 or 64 byte boundaries due to cache behavior.


##### Share on other sites

For a reference see Intel® Advanced Vector Extensions Programming Reference:

Table 2-4. Instructions Requiring Explicitly Aligned Memory

| Require 16-byte alignment | Require 32-byte alignment |
| --- | --- |
| (V)MOVDQA xmm, m128 | VMOVDQA ymm, m256 |
| (V)MOVDQA m128, xmm | VMOVDQA m256, ymm |
| (V)MOVAPS xmm, m128 | VMOVAPS ymm, m256 |
| (V)MOVAPS m128, xmm | VMOVAPS m256, ymm |
| (V)MOVAPD xmm, m128 | VMOVAPD ymm, m256 |
| (V)MOVAPD m128, xmm | VMOVAPD m256, ymm |
| (V)MOVNTPS m128, xmm | VMOVNTPS m256, ymm |
| (V)MOVNTPD m128, xmm | VMOVNTPD m256, ymm |
| (V)MOVNTDQ m128, xmm | VMOVNTDQ m256, ymm |
| (V)MOVNTDQA xmm, m128 | VMOVNTDQA ymm, m256 |

| Table 2-5. Instructions Not Requiring Explicit Memory Alignment |
| --- |
| (V)MOVDQU xmm, m128 |
| (V)MOVDQU m128, xmm |
| (V)MOVUPS xmm, m128 |
| (V)MOVUPS m128, xmm |
| (V)MOVUPD xmm, m128 |
| (V)MOVUPD m128, xmm |
| VMOVDQU ymm, m256 |
| VMOVDQU m256, ymm |
| VMOVUPS ymm, m256 |
| VMOVUPS m256, ymm |
| VMOVUPD ymm, m256 |
| VMOVUPD m256, ymm |

In http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf, we can read in section 3.6.4 that:

Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture.

[...] µops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty, the cost of each stall generally are greater on machines with longer pipelines.
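To make the cache line split concrete, here is a minimal sketch that checks whether an access of a given size straddles a 64-byte boundary. The 64-byte line size is taken from the quoted text. `SpansCacheLine` and `IsAlignedFor` are hypothetical helper names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cache line size as stated in the quoted manual text.
constexpr std::uintptr_t kCacheLine = 64;

// True when the byte range [address, address + size) touches more than one cache line.
inline bool SpansCacheLine(std::uintptr_t address, std::size_t size) {
    if (size == 0) return false;
    return address / kCacheLine != (address + size - 1) / kCacheLine;
}

// True when the address satisfies the alignment an aligned load or store demands,
// e.g. 16 for xmm operands and 32 for ymm operands per Table 2-4.
inline bool IsAlignedFor(std::uintptr_t address, std::size_t alignment) {
    return address % alignment == 0;
}
```

A naturally aligned 16-byte or 32-byte access can never split a 64-byte line, while a 16-byte access starting at offset 60 within a line always does.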


Edited by quarnster

^^ what he said.


##### Share on other sites

:)

The changes you've done so far seem to be fine.

Are you testing against the test_feature app in the svn? It's the best way to make sure you don't accidentally break anything.


##### Share on other sites

As for CallAlloc(): not all memory allocations need to be aligned, and you definitely do not want to force alignment of small allocations as it may waste a lot of memory. I think you need to add a secondary method, e.g. CallAllocAligned(). This new method can optionally take an argument with the desired alignment (though, for performance it may be better just to hardcode it to 16). The code that allocates memory must know if the memory needs to be aligned or not.

Some types do need natural alignment though, and the default allocator (which wraps malloc()) already aligns to the largest supported type alignment on the platform, since malloc() is guaranteed to do that. On OS X, my primary platform, malloc() is guaranteed to return 16-byte aligned data. On almost every platform, malloc() is required by the spec to return a pointer that is at least as aligned as a double (8 bytes).

Also, the patches add an alignment type to the type id info that defaults to 4, and that's what'll get passed to allocators when allocating memory for a type - so in existing code the behavior of the changes should be exactly identical.
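A rough sketch of how a per-type alignment with a default of 4 could leave existing behaviour unchanged: types whose alignment malloc() already satisfies take the plain path, and only over-aligned types pay for padding. `TypeInfo`, `Block`, `AllocForType` and `FreeBlock` are hypothetical names for this example and do not reflect the actual patches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical per-type record; the alignment defaults to 4.
struct TypeInfo {
    std::size_t size;
    std::size_t alignment;
    explicit TypeInfo(std::size_t s, std::size_t a = 4) : size(s), alignment(a) {}
};

// Result of an allocation: the caller keeps both addresses.
struct Block {
    void* ptr; // address handed to the object, aligned as requested
    void* raw; // address returned by malloc(), to be passed to free()
};

inline Block AllocForType(const TypeInfo& type) {
    assert(type.alignment != 0 && (type.alignment & (type.alignment - 1)) == 0);
    Block block = {nullptr, nullptr};

    // malloc() already satisfies every fundamental alignment: plain path.
    if (type.alignment <= alignof(std::max_align_t)) {
        block.raw = std::malloc(type.size);
        block.ptr = block.raw;
        return block;
    }

    // Over-aligned type: allocate slack and round the address up.
    block.raw = std::malloc(type.size + type.alignment - 1);
    if (block.raw) {
        std::uintptr_t mask = type.alignment - 1;
        std::uintptr_t address = (reinterpret_cast<std::uintptr_t>(block.raw) + mask) & ~mask;
        block.ptr = reinterpret_cast<void*>(address);
    }
    return block;
}

inline void FreeBlock(const Block& block) { std::free(block.raw); }
```

With the default alignment, `ptr` and `raw` are the same address, so code that never registers an over-aligned type sees exactly the allocations it saw before.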


##### Share on other sites

The changes you've done so far seem to be fine.

Are you testing against the test_feature app in the svn? It's the best way to make sure you don't accidentally break anything.

Not really, I can't get the test harness to work on my system. We test internally, but our tests aren't as comprehensive. Would it be too much to ask you to test my patches? Sorry.

