Which alignment to use?

Started by
11 comments, last by frob 7 years, 11 months ago

Hello,

in some places in my code, ie. in my entity-component system, I do some custom memory management. What this means is that I will allocate a customary block of memory, and put some objects in, like in this code sample:


template<typename Component>
class ComponentPool
{

public:
    Component* AddComponent(void)
    {
        auto pComponent = (Component*)(m_vMemory.back() + 1);
        m_vMemory.resize(m_vMemory.size() + sizeof(Component));

        new(pComponent) Component();

        return pComponent;
    }

private:

    std::vector<char> m_vMemory;
}

Thats not the full implementation, but you get the idea.

So now I'm wondering about alignment. I've heard up and low that you should optimally align your memory for performance, and I know how to implement it, but I've never actually been able to find out:

What should you actually choose for alignment? 4 byte, power of two, 16 byte? For simplicity, lets just talk about desktop platforms, 32 bit and the MSVC-compiler for now. I failed to find anything in google and stackoverflow, so does anyone have some advice on how to choose or find out the optimal alignment for my case? "Component" is a class/struct that can be anything from 8 Byte to a few kilobyte (less likely, but oh well).

Thanks!

Advertisement

What should you actually choose for alignment? 4 byte, power of two, 16 byte?

For a memory manager, provide alignment as a parameter with a default that matches the build settings. Look up the platform documentation for default alignment requirements.

By default Visual Studio uses 8-byte alignment but it can be changed with compiler options and #pragma commands.

Certain functions and instructions require objects to be at specific alignments. For example, loading into 128-bit SIMD registers have both an unaligned operation (MOVDQU) and a much faster aligned operation (MOVDQA) which crashes if the data is not 16-byte aligned. Allow the programmers the opportunity to require different alignment if they know their needs require it.

Note there are a few graphics buffers in recent APIs that have 4KB alignment requirements for certain buffers. When first introduced the size broke quite a few custom memory managers that used naive alignment strategies.

Which alignment to use?


I lean towards Lawful Neutral myself.

So now I'm wondering about alignment. I've heard up and low that you should optimally align your memory for performance, and I know how to implement it, but I've never actually been able to find out:


It's not just performance; it is also correctness. Many architectures simply cannot read unaligned data, period, and will signal a trap if you attempt to access unaligned data. x86 can, but indeed it will be slow. Some x86 instructions however do require proper alignment of memory addresses.

The correct option is to query the alignment of the types you're allocating into your pool. e.g. use the alignof operator.



Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

Sean Middleditch – Game Systems Engineer – Join my team!

Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

Even if they are PODs, the reallocation also invalidates all the pointers to the Components you've created.

Yeah as above, this is bad:


        auto pComponent = (Component*)(m_vMemory.back() + 1);
        m_vMemory.resize(m_vMemory.size() + sizeof(Component)); // this action invalidates the pComponent pointer that you just acquired (and all previous pointers too)
        new(pComponent) Component(); // this writes to memory that you don't own - a memory corruption bug

As for the correct alignment, you can use alignof(Component) since C++11 (and it's possible to emulate on earlier compilers if you need to).

[edit] Oops, I didnt' realise Sean already posted about alignof.

[...] if your Component types are not strictly PODs (as in is_pod_v<Component> == true).


Correct me if I'm wrong, but isn't the check for PODs too strong? std::is_trivially_copyable_v should be enough.

Thanks everyone for the comments!

What should you actually choose for alignment? 4 byte, power of two, 16 byte?

For a memory manager, provide alignment as a parameter with a default that matches the build settings. Look up the platform documentation for default alignment requirements.

By default Visual Studio uses 8-byte alignment but it can be changed with compiler options and #pragma commands.

Certain functions and instructions require objects to be at specific alignments. For example, loading into 128-bit SIMD registers have both an unaligned operation (MOVDQU) and a much faster aligned operation (MOVDQA) which crashes if the data is not 16-byte aligned. Allow the programmers the opportunity to require different alignment if they know their needs require it.

Note there are a few graphics buffers in recent APIs that have 4KB alignment requirements for certain buffers. When first introduced the size broke quite a few custom memory managers that used naive alignment strategies.

About allowing a paramater: Not sure how much this is useful here, since this memory manager acts entirely in the background of the engine. Its true that the developer of the component is responsible for registering it and thus creating the pool, so I could allow it as a parameter, but since all this does is allocate components that are being used by the ECS exclusively, I don't think it makes sense her, but thanks for the pointer anyway (will need to think about more general memory management strategies at some point).

@SeanMiddleditch:

It's not just performance; it is also correctness. Many architectures simply cannot read unaligned data, period, and will signal a trap if you attempt to access unaligned data. x86 can, but indeed it will be slow. Some x86 instructions however do require proper alignment of memory addresses.

I was aware of that, for the time being I'm only interested in x86 at most, if I ever wanted to go beyond that I guess I have a lot of code that would break anyways, so thats a cause I'll have to look into once it gets there :/

The correct option is to query the alignment of the types you're allocating into your pool. e.g. use the alignof operator.

Ah, so I kind of knew about that something like this exists, but is this (aside from being a correct one) also the most efficient alignment? Ie. I tested it for a class that is 88 byte in size, whereas alignof gives me 8. But I've also read that you should align for cache line size. Thats means with 88 size and an align of 8, and say a 16 byte cache-line the first object would fit entirely in 6 cache lines, while the second object would partially fit in the 6th and the rest in another 6 cache lines. Is this even true (that you should align for cache line size), or did I misunderstand something here? (Ie., does it even make a difference if objects are cross-cacheline boundaries for sequential access?)

Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

I'm aware this is not the safest of ways to do it, since components are anything but PoD. I don't have a problem specifically with copying the components this way as of now (did some heavy testing and had a lot of different crashes not related to this already). Can you elaborate on what you mean by back-references inserted by the compiler? I don't really seem to understand what you mean by that.

You might laught (or cry), but I started out with a vector<Component>, however I had one specific problem at some point. So "Component" is a user-defined class, derived of a BaseComponent-class. This BaseComponent has a member "m_parent" that stores a handle to the owner of the component, which needs to be preserved (ie. in the move-assignment-operator). Now some of the derived classes might have custom copy/move-operators/ctors. The problem turned out that if you override the move-operator in the derived class, the operator of the base-class is not called anymore, and so I ended up with having incorrect parent-handles in my components, specially after deleting one component from the middle.

Now dealing with memory myself seemed to be the easiest solution, as otherwise I would have to enforce the call to the base-operators/ctors in the derived components (which is usually not necessary and not something I want the engine to malfunction because someone forgot to do it in their components), or do some extra-processing to preserve the correct handles.

I'm not sure if there is not some better design overall, especially since I'm trying to make my components behaviour more like PoDs anyways. So I'll consider changing this at some point (any suggestions are very much welcomed), but I think I'm fine with how it is now, at least on MSVC this doesn't seem to produce any problems (yet).

@PinkHorror:

Even if they are PODs, the reallocation also invalidates all the pointers to the Components you've created.

Yeah, but thats something that cannot be migitated if you want contingious memory layouts anyway. Even if you know how much memory you use beforehands (which I can't), if someone deleted say 10 elements from the middle, I need/want to move the last components back to preserve the continous order.

But thats not much of a practical problem in my case, since I required users of my ECS not to store component-pointers over frames anyways. Only downside is that I need to update some pointers in my component-cache on reallocation/deletion, but thats just a minor overhead.

@Hodgman:

Whoups, my bad. Didn't realize the code I just typed down was broken, but be asured that the actual code I'm using doesn't have that problem.

So "Component" is a user-defined class, derived of a BaseComponent-class. This BaseComponent has a member "m_parent" that stores a handle to the owner of the component, which needs to be preserved (ie. in the move-assignment-operator). Now some of the derived classes might have custom copy/move-operators/ctors. The problem turned out that if you override the move-operator in the derived class, the operator of the base-class is not called anymore, and so I ended up with having incorrect parent-handles in my components, specially after deleting one component from the middle.

After I understand your measures in more scale, which is quite a "big bounty" like, I would stop speaking about pointers, pointers are guarenteed to point to a memory that does not change its position, a heap allocated T *p=new T().

That is not a case in your pooled objects in vector, but you definitely could achieve all those measures by using indexes of vector contained objects instead.

One viable way I see, is to use vector as "instead of heap manager" and use it with conjunction of a linked list of the very indexes , to achieve this hierarchy/arbitrary pointing structure you want to have.

You could even delay the erase operations of the vector, to not process memory movement of it at every trivial freeing, while still able to keep adding new objects, and call erase in some more-wise condition, like GC.

Something like this:

-LinkedList of structures that contain :[real index to vector, real index to the owner- in vector]

-erasing from LinkedList will trivialy remove the node, keeping all vector real indexes in objects in LinkedList still intact, since real removal is not called, +update the owned object owner index to magic value of none. Add the free index to an integer array Fr. (thus the linked list becomes the immediate structure of your program to access the objects, or manipulate them)

- If in FR array is a bigger continous amount of values (indexes), call erase with this one enough large interval (vector::erase has overload for interval removal)

- Before you do this real erase on vector- you are already being viable to update the linked list real indexes- you see my point, so you call linked list update on idexes first, then vector erase-with the interval.

-now you can have external references around your program, wheather in private or public-managed sections, even if real objects gets moved in vector management

Of course, the fact of linked list, though of very small structures, is a certain performance hit, but it provides this measure to safely use and access your objects even if they get moved by managing vector.

It's not just performance; it is also correctness. Many architectures simply cannot read unaligned data, period, and will signal a trap if you attempt to access unaligned data. x86 can, but indeed it will be slow. Some x86 instructions however do require proper alignment of memory addresses.

I was aware of that, for the time being I'm only interested in x86 at most, if I ever wanted to go beyond that I guess I have a lot of code that would break anyways, so thats a cause I'll have to look into once it gets there :/


That would surprise me. There should be very, very little architecture-dependent code in any sane codebase - even performance-sensitive game engines - these days. You may still have a lot of platform-specific code, but that should at least be pushed out into leaf libraries (your audio library, renderer, etc.).

The correct option is to query the alignment of the types you're allocating into your pool. e.g. use the alignof operator.

Ah, so I kind of knew about that something like this exists, but is this (aside from being a correct one) also the most efficient alignment? Ie. I tested it for a class that is 88 byte in size, whereas alignof gives me 8. But I've also read that you should align for cache line size. Thats means with 88 size and an align of 8, and say a 16 byte cache-line the first object would fit entirely in 6 cache lines, while the second object would partially fit in the 6th and the rest in another 6 cache lines. Is this even true (that you should align for cache line size), or did I misunderstand something here? (Ie., does it even make a difference if objects are cross-cacheline boundaries for sequential access?)


Maybe, maybe not. That really depends on how you're accessing them. If you're iterating over them linearly then not really; if you're accessing them randomly then it certainly can. You don't just want to align everything to the cachline size, though; you want the object to evenly fit in the cacheline and then ensure that your backing store is cacheline aligned. e.g., if you're allocating small 8 byte structures, you can fit eight of those in a single cacheline (on typical x86 architectures). You thus don't want to align them to the cacheline, you just want to make sure that they are evenly divisible into a cacheline.

Unless you are talking about data that is access by multiple threads. In which case you want to avoid false sharing (only one thread should be accessing a given cacheline) and you actually do want to guarantee each element is on its own cacheline.

Of course, depending on the data in question, this really doesn't even matter. Some game developers fetishize performance to the detriment of their game as a whole. :)

Also, be very wary of the code your example indicates you might have. A vector is a _very bad_ choice for the backing store of a generalized memory pool as it can and will reallocate and move objects in memory, which can lead to all kinds of undefined behavior if your Component types are not strictly PODs (as in is_pod_v<Component> == true). The reason is that when the vector resizes, it thinks its just moving char's, which are trivial to move. A general class however might have a complex move constructor. Even a seemingly-simple class can end up having back-references (hidden ones inserted by the compiler!) that will fail to be updated and lead to all kinds of crashes or other nastiness. Although in this particular example, since your ComponentPool is a class templated parameterized over Component, it makes far more sense to just have a vector<Component> and let vector deal with alignment and all that jazz.

I'm aware this is not the safest of ways to do it, since components are anything but PoD. I don't have a problem specifically with copying the components this way as of now (did some heavy testing and had a lot of different crashes not related to this already). Can you elaborate on what you mean by back-references inserted by the compiler? I don't really seem to understand what you mean by that.

You might laught (or cry), but I started out with a vector<Component>, however I had one specific problem at some point. So "Component" is a user-defined class, derived of a BaseComponent-class. This BaseComponent has a member "m_parent" that stores a handle to the owner of the component, which needs to be preserved (ie. in the move-assignment-operator). Now some of the derived classes might have custom copy/move-operators/ctors. The problem turned out that if you override the move-operator in the derived class, the operator of the base-class is not called anymore, and so I ended up with having incorrect parent-handles in my components, specially after deleting one component from the middle.


That is a general problem with C++. As you know, you just have to call your parents' operator= in the child's operator=.

Of course, even better is to just not write custom move operators - the auto-generated ones should do everything you need automatically, including invoking parents' operators. If you use RAII-style ownership for everything then you can usually follow the Rule of Zero.

Even if they are PODs, the reallocation also invalidates all the pointers to the Components you've created.


Yeah, but thats something that cannot be migitated if you want contingious memory layouts anyway. Even if you know how much memory you use beforehands (which I can't), if someone deleted say 10 elements from the middle, I need/want to move the last components back to preserve the continous order.


It can totally be mitigated, just not with raw pointers. Store a smart pointer that knows how to use an indirection table, which is exactly what component/entity ids end up doing for you anyway. :)

Sean Middleditch – Game Systems Engineer – Join my team!

@SeanMiddleditch:

That would surprise me. There should be very, very little architecture-dependent code in any sane codebase - even performance-sensitive game engines - these days. You may still have a lot of platform-specific code, but that should at least be pushed out into leaf libraries (your audio library, renderer, etc.).

Eh, nobody said my codebase was sane ;)

Joke aside, I probably did overexaggerate a bit, but I have at least a few places in mind where I (ab)use the fact the MSVC has no strict aliasing, and I don't even want to talk about 32 vs. 64 bit type size stuff (damn, at least thats something I'll have to look at sometimes soon...).

Maybe, maybe not. That really depends on how you're accessing them. If you're iterating over them linearly then not really; if you're accessing them randomly then it certainly can. You don't just want to align everything to the cachline size, though; you want the object to evenly fit in the cacheline and then ensure that your backing store is cacheline aligned. e.g., if you're allocating small 8 byte structures, you can fit eight of those in a single cacheline (on typical x86 architectures). You thus don't want to align them to the cacheline, you just want to make sure that they are evenly divisible into a cacheline.

Well ok, since I'm mostly iterating in a semi-linear fashion in the places where performance matters (with mostly linear I mean iterating via sort of an ordered indirection table with some holes), based on your explanation I think I don't need to align that way then. Specially since I'm not using multithreading yet.

Of course, depending on the data in question, this really doesn't even matter. Some game developers fetishize performance to the detriment of their game as a whole. :)

I wouldn't totally count me to that list, but I certainly favour engine design over game design now. Also I'm doing this optimization of my ECS-framework as part of an university assignment. We were told to choose a certain piece of software and optimized it, so figured since I had this in the back of my mind, might as well do it - and even if it doesn't matter currently whether my 2D game runs in 2300 or 3000 FPS, at least the assignment gives me a good excuse to generate some higher, or lower numbers (in frametime) ;)

That is a general problem with C++. As you know, you just have to call your parents' operator= in the child's operator=.

Funny, I somehow never realized that until now. Might be because I generally tend to use inheritance and explicit operator= sparcely, but I know this bit me in the ass at some places with copy-ctors. Come to think of it, I just fixed a certain bug yesterday due to me forgetting to call a parents copy-ctor in some class.

Of course, even better is to just not write custom move operators - the auto-generated ones should do everything you need automatically, including invoking parents' operators. If you use RAII-style ownership for everything then you can usually follow the Rule of Zero.

I agree with you, problem was that MSVC only recently adopted auto-generated move-constructors - I'm not even talking about "Class(Class&&) = default;", but before that you had to declare an explicit move constructor just for them to get generated (at least I belive this was the case), so I ended up writing a whole f***ton of simple move ctors, and I'm still in the process of setting them all to "default".

On problem I see though is that default-move-ctors will not properly move PoD values like integers, which means that sometimes classes will be left in an invalid state. Not so much a problem for containers, in fact I'm not sure how much of a problem that actually is. I decided to keep some explicit move-ctors specifically due to this, but I might need to evaluate it again.

It can totally be mitigated, just not with raw pointers. Store a smart pointer that knows how to use an indirection table, which is exactly what component/entity ids end up doing for you anyway. :)

Why yes, I'm already doing that, what I meant is that it cannot be done with raw pointers.

I have an "EntityHandle" class that keeps an unique id as well as a pointer to the entity container, and than does a lookup int he entity storage (since due to the layout of this I can find any entitiy in mostly a few iterations of a linear search). But the idea is exactly this, I use this class for long-lasting access, eigther to store a reference to an entity or when a certain part of the code that is iterating over a bunch of entities could potentially modify the entity container meanwhile (think script calls). Since I'm already doing a delayed deletion of entities, I'm thinking about delaying creation of entities too (or at least insertion to the container), so that I can completly skip this step as well. But thats a different topic, just had that on my mind.

That being said I don't have such a system for components because I didn't need it yet. Component access is currently via the actual entity, and in my ongoing work to optimize the system I'm in the process of caching all components for an iteration-cycle in a specified structure, which will always be updated accordingly. Don't know, maybe I'll need such a specified "smart pointer" for components at some point, but for now doesn't look like it.

@JohnnyCode:

Of course, the fact of linked list, though of very small structures, is a certain performance hit, but it provides this measure to safely use and access your objects even if they get moved by managing vector.

Hm, I see what you mean, but as you see above I already have structure in place that take care of this, just that they take no performance hit where "unsafe" access can actually be done safely (uh, that sounds right). And since I'm currently in the process of optimizing heavly for performance, I don't think I'll do that yet, but I'll keep it in the back of my head if there ever comes a need.

This topic is closed to new replies.

Advertisement