Sign in to follow this  
Juliean

Which alignment to use?

Recommended Posts

Hello,

 

in some places in my code, ie. in my entity-component system, I do some custom memory management. What this means is that I will allocate a customary block of memory, and put some objects in, like in this code sample:

template<typename Component>
class ComponentPool
{

public:
    Component* AddComponent(void)
    {
        auto pComponent = (Component*)(m_vMemory.back() + 1);
        m_vMemory.resize(m_vMemory.size() + sizeof(Component));

        new(pComponent) Component();

        return pComponent;
    }

private:

    std::vector<char> m_vMemory;
}

Thats not the full implementation, but you get the idea.

 

So now I'm wondering about alignment. I've heard up and low that you should optimally align your memory for performance, and I know how to implement it, but I've never actually been able to find out:

 

What should you actually choose for alignment? 4 byte, power of two, 16 byte? For simplicity, lets just talk about desktop platforms, 32 bit and the MSVC-compiler for now. I failed to find anything in google and stackoverflow, so does anyone have some advice on how to choose or find out the optimal alignment for my case? "Component" is a class/struct that can be anything from 8 Byte to a few kilobyte (less likely, but oh well).

 

Thanks!

Share this post


Link to post
Share on other sites

What should you actually choose for alignment? 4 byte, power of two, 16 byte?

 

For a memory manager, provide alignment as a parameter with a default that matches the build settings. Look up the platform documentation for default alignment requirements.

 

By default Visual Studio uses 8-byte alignment but it can be changed with compiler options and #pragma commands.

 

Certain functions and instructions require objects to be at specific alignments. For example, loading into 128-bit SIMD registers have both an unaligned operation (MOVDQU) and a much faster aligned operation (MOVDQA) which crashes if the data is not 16-byte aligned.  Allow the programmers the opportunity to require different alignment if they know their needs require it.

 

Note there are a few graphics buffers in recent APIs that have 4KB alignment requirements for certain buffers. When first introduced the size broke quite a few custom memory managers that used naive alignment strategies. 

Share this post


Link to post
Share on other sites

 

Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

 

Even if they are PODs, the reallocation also invalidates all the pointers to the Components you've created.

Share this post


Link to post
Share on other sites

Yeah as above, this is bad:

        auto pComponent = (Component*)(m_vMemory.back() + 1);
        m_vMemory.resize(m_vMemory.size() + sizeof(Component)); // this action invalidates the pComponent pointer that you just acquired (and all previous pointers too)
        new(pComponent) Component(); // this writes to memory that you don't own - a memory corruption bug

As for the correct alignment, you can use alignof(Component) since C++11 (and it's possible to emulate on earlier compilers if you need to).

 

[edit] Oops, I didnt' realise Sean already posted about alignof.

Edited by Hodgman

Share this post


Link to post
Share on other sites

Thanks everyone for the comments!

 

 

What should you actually choose for alignment? 4 byte, power of two, 16 byte?

 

For a memory manager, provide alignment as a parameter with a default that matches the build settings. Look up the platform documentation for default alignment requirements.

 

By default Visual Studio uses 8-byte alignment but it can be changed with compiler options and #pragma commands.

 

Certain functions and instructions require objects to be at specific alignments. For example, loading into 128-bit SIMD registers have both an unaligned operation (MOVDQU) and a much faster aligned operation (MOVDQA) which crashes if the data is not 16-byte aligned.  Allow the programmers the opportunity to require different alignment if they know their needs require it.

 

Note there are a few graphics buffers in recent APIs that have 4KB alignment requirements for certain buffers. When first introduced the size broke quite a few custom memory managers that used naive alignment strategies. 

 

About allowing a paramater: Not sure how much this is useful here, since this memory manager acts entirely in the background of the engine. Its true that the developer of the component is responsible for registering it and thus creating the pool, so I could allow it as a parameter, but since all this does is allocate components that are being used by the ECS exclusively, I don't think it makes sense her, but thanks for the pointer anyway (will need to think about more general memory management strategies at some point).

 

@SeanMiddleditch:

 

 

It's not just performance; it is also correctness. Many architectures simply cannot read unaligned data, period, and will signal a trap if you attempt to access unaligned data. x86 can, but indeed it will be slow. Some x86 instructions however do require proper alignment of memory addresses.

I was aware of that, for the time being I'm  only interested in x86 at most, if I ever wanted to go beyond that I guess I have a lot of code that would break anyways, so thats a cause I'll have to look into once it gets there :/
 

 

The correct option is to query the alignment of the types you're allocating into your pool. e.g. use the alignof operator.

Ah, so I kind of knew about that something like this exists, but is this (aside from being a correct one) also the most efficient alignment? Ie. I tested it for a class that is 88 byte in size, whereas alignof gives me 8. But I've also read that you should align for cache line size. Thats means with 88 size and an align of 8, and say a 16 byte cache-line the first object would fit entirely in 6 cache lines, while the second object would partially fit in the 6th and the rest in another 6 cache lines. Is this even true (that you should align for cache line size), or did I misunderstand something here? (Ie., does it even make a difference if objects are cross-cacheline boundaries for sequential access?)

 

Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

 

I'm aware this is not the safest of ways to do it, since components are anything but PoD. I don't have a problem specifically with copying the components this way as of now (did some heavy testing and had a lot of different crashes not related to this already). Can you elaborate on what you mean by back-references inserted by the compiler? I don't really seem to understand what you mean by that.

 

You might laught (or cry), but I started out with a vector<Component>, however I had one specific problem at some point. So "Component" is a user-defined class, derived of a BaseComponent-class. This BaseComponent has a member "m_parent" that stores a handle to the owner of the component, which needs to be preserved (ie. in the move-assignment-operator). Now some of the derived classes might have custom copy/move-operators/ctors. The problem turned out that if you override the move-operator in the derived class, the operator of the base-class is not called anymore, and so I ended up with having incorrect parent-handles in my components, specially after deleting one component from the middle.

Now dealing with memory myself seemed to be the easiest solution, as otherwise I would have to enforce the call to the base-operators/ctors in the derived components (which is usually not necessary and not something I want the engine to malfunction because someone forgot to do it in their components), or do some extra-processing to preserve the correct handles.

I'm not sure if there is not some better design overall, especially since I'm trying to make my components behaviour more like PoDs anyways. So I'll consider changing this at some point (any suggestions are very much welcomed), but I think I'm fine with how it is now, at least on MSVC this doesn't seem to produce any problems (yet).

 

@PinkHorror:

 

Even if they are PODs, the reallocation also invalidates all the pointers to the Components you've created.

 

 

Yeah, but thats something that cannot be migitated if you want contingious memory layouts anyway. Even if you know how much memory you use beforehands (which I can't), if someone deleted say 10 elements from the middle, I need/want to move the last components back to preserve the continous order.

But thats not much of a practical problem in my case, since I required users of my ECS not to store component-pointers over frames anyways. Only downside is that I need to update some pointers in my component-cache on reallocation/deletion, but thats just a minor overhead.

 

@Hodgman:

 

Whoups, my bad. Didn't realize the code I just typed down was broken, but be asured that the actual code I'm using doesn't have that problem.

Share this post


Link to post
Share on other sites

 

So "Component" is a user-defined class, derived of a BaseComponent-class. This BaseComponent has a member "m_parent" that stores a handle to the owner of the component, which needs to be preserved (ie. in the move-assignment-operator). Now some of the derived classes might have custom copy/move-operators/ctors. The problem turned out that if you override the move-operator in the derived class, the operator of the base-class is not called anymore, and so I ended up with having incorrect parent-handles in my components, specially after deleting one component from the middle.

 

After I understand your measures in more scale, which is quite a "big bounty" like, I  would stop speaking about pointers, pointers are guarenteed to point to a memory that does not change its position, a heap allocated T *p=new T().

That is not a case in your pooled objects in vector, but you definitely could achieve all those measures by using indexes of vector contained objects instead.

 

One viable way I see, is to use vector as "instead of heap manager" and use it with conjunction of a linked list of the very indexes , to achieve this hierarchy/arbitrary pointing structure you want to have.

 

You could even delay the erase operations of the vector, to not process memory movement of it at every trivial freeing, while still able to keep adding new objects, and call erase in some more-wise condition, like GC.

 

Something like this:

 

-LinkedList of structures that contain :[real index to vector, real index to the owner- in vector]

-erasing from LinkedList will trivialy remove the node, keeping all vector real indexes in objects in LinkedList still intact, since real removal is not called, +update the owned object owner index to magic value of none. Add the free index to an integer array Fr. (thus the linked list becomes the immediate structure of your program to access the objects, or manipulate them)

- If in FR array is a bigger continous amount of values (indexes), call erase with this one enough large interval (vector::erase has overload for interval removal)

- Before you do this real erase on vector- you are already being viable to update the linked list real indexes- you see my point, so you call linked list update on idexes first, then vector erase-with the interval.

-now you can have external references around your program, wheather in private or public-managed sections, even if real objects gets moved in vector management

 

Of course, the fact of linked list, though of very small structures, is a certain performance hit, but it provides this measure to safely use and access your objects even if they get moved by managing vector.

Share this post


Link to post
Share on other sites

 

It's not just performance; it is also correctness. Many architectures simply cannot read unaligned data, period, and will signal a trap if you attempt to access unaligned data. x86 can, but indeed it will be slow. Some x86 instructions however do require proper alignment of memory addresses.

I was aware of that, for the time being I'm  only interested in x86 at most, if I ever wanted to go beyond that I guess I have a lot of code that would break anyways, so thats a cause I'll have to look into once it gets there :/

 


That would surprise me. There should be very, very little architecture-dependent code in any sane codebase - even performance-sensitive game engines - these days. You may still have a lot of platform-specific code, but that should at least be pushed out into leaf libraries (your audio library, renderer, etc.).
 

 

The correct option is to query the alignment of the types you're allocating into your pool. e.g. use the alignof operator.

Ah, so I kind of knew about that something like this exists, but is this (aside from being a correct one) also the most efficient alignment? Ie. I tested it for a class that is 88 byte in size, whereas alignof gives me 8. But I've also read that you should align for cache line size. Thats means with 88 size and an align of 8, and say a 16 byte cache-line the first object would fit entirely in 6 cache lines, while the second object would partially fit in the 6th and the rest in another 6 cache lines. Is this even true (that you should align for cache line size), or did I misunderstand something here? (Ie., does it even make a difference if objects are cross-cacheline boundaries for sequential access?)

 


Maybe, maybe not. That really depends on how you're accessing them. If you're iterating over them linearly then not really; if you're accessing them randomly then it certainly can. You don't just want to align everything to the cachline size, though; you want the object to evenly fit in the cacheline and then ensure that your backing store is cacheline aligned. e.g., if you're allocating small 8 byte structures, you can fit eight of those in a single cacheline (on typical x86 architectures). You thus don't want to align them to the cacheline, you just want to make sure that they are evenly divisible into a cacheline.

Unless you are talking about data that is access by multiple threads. In which case you want to avoid false sharing (only one thread should be accessing a given cacheline) and you actually do want to guarantee each element is on its own cacheline.

Of course, depending on the data in question, this really doesn't even matter. Some game developers fetishize performance to the detriment of their game as a whole. :)
 

 

Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

I'm aware this is not the safest of ways to do it, since components are anything but PoD. I don't have a problem specifically with copying the components this way as of now (did some heavy testing and had a lot of different crashes not related to this already). Can you elaborate on what you mean by back-references inserted by the compiler? I don't really seem to understand what you mean by that.
 
You might laught (or cry), but I started out with a vector<Component>, however I had one specific problem at some point. So "Component" is a user-defined class, derived of a BaseComponent-class. This BaseComponent has a member "m_parent" that stores a handle to the owner of the component, which needs to be preserved (ie. in the move-assignment-operator). Now some of the derived classes might have custom copy/move-operators/ctors. The problem turned out that if you override the move-operator in the derived class, the operator of the base-class is not called anymore, and so I ended up with having incorrect parent-handles in my components, specially after deleting one component from the middle.

 


That is a general problem with C++. As you know, you just have to call your parents' operator= in the child's operator=.

 

Of course, even better is to just not write custom move operators - the auto-generated ones should do everything you need automatically, including invoking parents' operators. If you use RAII-style ownership for everything then you can usually follow the Rule of Zero.
 

 

Even if they are PODs, the reallocation also invalidates all the pointers to the Components you've created.

 
Yeah, but thats something that cannot be migitated if you want contingious memory layouts anyway. Even if you know how much memory you use beforehands (which I can't), if someone deleted say 10 elements from the middle, I need/want to move the last components back to preserve the continous order.

 


It can totally be mitigated, just not with raw pointers. Store a smart pointer that knows how to use an indirection table, which is exactly what component/entity ids end up doing for you anyway. :)

Share this post


Link to post
Share on other sites

@SeanMiddleditch:

 

That would surprise me. There should be very, very little architecture-dependent code in any sane codebase - even performance-sensitive game engines - these days. You may still have a lot of platform-specific code, but that should at least be pushed out into leaf libraries (your audio library, renderer, etc.).

Eh, nobody said my codebase was sane ;)

Joke aside, I probably did overexaggerate a bit, but I have at least a few places in mind where I (ab)use the fact the MSVC has no strict aliasing, and I don't even want to talk about 32 vs. 64 bit type size stuff (damn, at least thats something I'll have to look at sometimes soon...).
 

 

Maybe, maybe not. That really depends on how you're accessing them. If you're iterating over them linearly then not really; if you're accessing them randomly then it certainly can. You don't just want to align everything to the cachline size, though; you want the object to evenly fit in the cacheline and then ensure that your backing store is cacheline aligned. e.g., if you're allocating small 8 byte structures, you can fit eight of those in a single cacheline (on typical x86 architectures). You thus don't want to align them to the cacheline, you just want to make sure that they are evenly divisible into a cacheline.

 

Well ok, since I'm mostly iterating in a semi-linear fashion in the places where performance matters (with mostly linear I mean iterating via sort of an ordered indirection table with some holes), based on your explanation I think I don't need to align that way then. Specially since I'm not using multithreading yet.
 

 

Of course, depending on the data in question, this really doesn't even matter. Some game developers fetishize performance to the detriment of their game as a whole. :)

 

I wouldn't totally count me to that list, but I certainly favour engine design over game design now. Also I'm doing this optimization of my ECS-framework as part of an university assignment. We were told to choose a certain piece of software and optimized it, so figured since I had this in the back of my mind, might as well do it - and even if it doesn't matter currently whether my 2D game runs in 2300 or 3000 FPS, at least the assignment gives me a good excuse to generate some higher, or lower numbers (in frametime) ;)

 

That is a general problem with C++. As you know, you just have to call your parents' operator= in the child's operator=.

 

 

Funny, I somehow never realized that until now. Might be because I generally tend to use inheritance and explicit operator= sparcely, but I know this bit me in the ass at some places with copy-ctors. Come to think of it, I just fixed a certain bug yesterday due to me forgetting to call a parents copy-ctor in some class.

 

Of course, even better is to just not write custom move operators - the auto-generated ones should do everything you need automatically, including invoking parents' operators. If you use RAII-style ownership for everything then you can usually follow the Rule of Zero.

 

I agree with you, problem was that MSVC only recently adopted auto-generated move-constructors - I'm not even talking about "Class(Class&&) = default;", but before that you had to declare an explicit move constructor just for them to get generated (at least I belive this was the case), so I ended up writing a whole f***ton of simple move ctors, and I'm still in the process of setting them all to "default". 

 

On problem I see though is that default-move-ctors will not properly move PoD values like integers, which means that sometimes classes will be left in an invalid state. Not so much a problem for containers, in fact I'm not sure how much of a problem that actually is. I decided to keep some explicit move-ctors specifically due to this, but I might need to evaluate it again.

 

It can totally be mitigated, just not with raw pointers. Store a smart pointer that knows how to use an indirection table, which is exactly what component/entity ids end up doing for you anyway. :)

 

Why yes, I'm already doing that, what I meant is that it cannot be done with raw pointers.

I have an "EntityHandle" class that keeps an unique id as well as a pointer to the entity container, and than does a lookup int he entity storage (since due to the layout of this I can find any entitiy in mostly a few iterations of a linear search). But the idea is exactly this, I use this class for long-lasting access, eigther to store a reference to an entity or when a certain part of the code that is iterating over a bunch of entities could potentially modify the entity container meanwhile (think script calls). Since I'm already doing a delayed deletion of entities, I'm thinking about delaying creation of entities too (or at least insertion to the container), so that I can completly skip this step as well. But thats a different topic, just had that on my mind.

That being said I don't have such a system for components because I didn't need it yet. Component access is currently via the actual entity, and in my ongoing work to optimize the system I'm in the process of caching all components for an iteration-cycle in a specified structure, which will always be updated accordingly. Don't know, maybe I'll need such a specified "smart pointer" for components at some point, but for now doesn't look like it.

 

@JohnnyCode:
 

 

Of course, the fact of linked list, though of very small structures, is a certain performance hit, but it provides this measure to safely use and access your objects even if they get moved by managing vector.

 

 

Hm, I see what you mean, but as you see above I already have structure in place that take care of this, just that they take no performance hit where "unsafe" access can actually be done safely (uh, that sounds right). And since I'm currently in the process of optimizing heavly for performance, I don't think I'll do that yet, but I'll keep it in the back of my head if there ever comes a need.

Share this post


Link to post
Share on other sites

Thanks everyone for the comments!
 

 

What should you actually choose for alignment? 4 byte, power of two, 16 byte?

 
For a memory manager, provide alignment as a parameter with a default that matches the build settings.

 

About allowing a paramater: Not sure how much this is useful here, since this memory manager acts entirely in the background of the engine. Its true that the developer of the component is responsible for registering it and thus creating the pool, so I could allow it as a parameter, but since all this does is allocate components that are being used by the ECS exclusively, I don't think it makes sense her, but thanks for the pointer anyway (will need to think about more general memory management strategies at some point).

 

  
As it has been publicly released, you might look over the EASTL implementation. Note how they provide a method with default alignment, another accepting a parameter for custom alignment and custom offsets for arrays.

Share this post


Link to post
Share on other sites

On problem I see though is that default-move-ctors will not properly move PoD values like integers, which means that sometimes classes will be left in an invalid state. Not so much a problem for containers, in fact I'm not sure how much of a problem that actually is. I decided to keep some explicit move-ctors specifically due to this, but I might need to evaluate it again.

What do you think is the "proper" way to move an integer?

Share this post


Link to post
Share on other sites

For hardware there are three major categories

 

1. Hardware and CPU instructions that REQUIRE specific alignment for multi-byte data access, and will trap/assert/crash if this is violated.

2. Hardware and CPU instructions that PREFER specific alignment for multi-byte data access, and will operate with a performance penalty if violated.

3. Hardware and CPU instructions that DON'T CARE about specific alignment for multi-byte data access and will operate the same regardless.

 

For compiler concerns, there are also

 

4. Assumed padding inside structures (to make data members properly aligned)

5. Assumed padding at the end of structures (to make sequential array items properly aligned)

 

 

Whatever you are building your memory manager on, you need to understand what is going on and why for all five of those. 

 

While you don't specifically need to know what individual instructions are involved for 1-3, you need to understand that the exist, what you need to do about them, why you need to do it, and when you need to (or not) do anything about it.

 

Further, you need to understand how, when, and why data might be misaligned, and how, when, and why you need to make it aligned again (and when/why not).

 

 

 

This is not a matter of trying to slip out a tiny bit more performance.  This is a fairly fundamental set of decisions your compiler is depending on for nearly every line of code. 

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this