# Flexible particle system - The Container

By Bartlomiej Filipek | Published Jun 23 2014 11:49 PM in Graphics Programming and Theory
Peer Reviewed by (Servant of the Lord, Madhed, ivan.spasov)

cpp particles animation optimization simd

One of the most crucial parts of a particle system is the container for all the particles. It has to hold all the data that describes the particles, and it should be easy to extend and fast enough. In this post I will write about choices, problems and possible solutions for such a container.

# Introduction

What is wrong with this code?

    class Particle {
    public:
        bool m_alive;
        Vec4d m_pos;
        Vec4d m_col;
        float time;
        // ... other fields
    public:
        // ctors...

        void update(float deltaTime);
        void render();
    };


and then usage of this class:

    std::vector<Particle> particles;

    // update function:
    for (auto &p : particles)
        p.update(dt);

    // rendering code:
    for (auto &p : particles)
        p.render();


Actually one could say that it is OK. And for some simple cases indeed it is.
But let us ask several questions:

1. Are we OK with the Single Responsibility Principle (SRP) here?
2. What if we would like to add one field to the particle? Or have one particle system with pos/col and other with pos/col/rotations/size? Is our structure capable of such configuration?
3. What if we would like to implement a new update method? Should we implement it in some derived class?
4. Is the code efficient?

1. It looks like SRP is violated here. The Particle class is responsible not only for holding the data but also for updating, generating and rendering it. Maybe it would be better to have one configurable class for storing the data, some other systems/modules for its update, and another for rendering? I think that this option is much better designed.
2. With the Particle class built that way we cannot add new properties dynamically. The problem is that we use an AoS (Array of Structures) pattern here rather than SoA (Structure of Arrays). In SoA, when you want one more particle property, you simply create/add a new array.
3. As I mentioned in the first point: we are violating SRP, so it is better to have separate systems for updates and rendering. For simple particle systems our original solution will work, but when you want some modularity/flexibility/usability it will not be good enough.
4. There are at least three performance issues with the design:
    1. The AoS pattern might hurt performance.
    2. In the update code we have, for each particle, not only the computation code but also a (possibly virtual) function call. We will see almost no difference for 100 particles, but when we aim for 100k or more it will be visible for sure.
    3. The same problem goes for rendering. We cannot render each particle on its own; we need to batch them in a vertex buffer and make as few draw calls as possible.

All of the above problems must be addressed in the design phase.
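To make the SoA idea and the data/logic split a bit more concrete, here is a minimal sketch. The names `Vec4`, `ParticleData` and `updatePositions` are hypothetical and chosen only for illustration; this is not the implementation used later in the series, and the velocity array is an assumed extra property.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 4-component vector, standing in for the article's Vec4d.
struct Vec4 { float x, y, z, w; };

// Structure of Arrays: one array per particle property.
// Adding a property means adding one more array.
struct ParticleData {
    std::vector<Vec4>  pos;
    std::vector<Vec4>  vel;   // assumed extra property for the example
    std::vector<Vec4>  col;
    std::vector<float> time;

    explicit ParticleData(std::size_t count)
        : pos(count), vel(count), col(count), time(count) {}

    std::size_t size() const { return pos.size(); }
};

// Update logic lives outside the container: one call per system,
// not one call per particle.
inline void updatePositions(ParticleData &p, float dt) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p.pos[i].x += p.vel[i].x * dt;
        p.pos[i].y += p.vel[i].y * dt;
        p.pos[i].z += p.vel[i].z * dt;
    }
}
```

The container only stores data, so a renderer or another updater can be added as one more free function (or module) without touching `ParticleData`.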

It was not visible in the above code, but another important topic for a particle system is an algorithm for adding and killing particles:

    void kill(std::size_t particleID) { /* ?? */ }
    void wake(std::size_t particleID) { /* ?? */ }


How to do it efficiently?

## First thing: Particle Pool

It looks like particles need a dynamic data structure - we would like to dynamically add and delete particles. Of course we could use list or std::vector and change it every time, but would that be efficient? Is it good to reallocate memory often (each time we create a particle)? One thing that we can initially assume is that we can allocate one huge buffer that will contain the maximum number of particles. That way we do not need to have memory reallocations all the time.

We solved one problem: numerous buffer reallocations, but on the other hand we now face a problem with fragmentation. Some particles are alive and some of them are not. So how to manage them in one single buffer?

## Second thing: Management

We can manage the buffer in at least two ways:
• Use an alive flag and in the for loop update/render only active particles.
    • This unfortunately causes another problem with rendering, because there we need a contiguous buffer of things to render. We cannot easily check if a particle is alive or not. To solve this we could, for instance, create another buffer and copy alive particles to it every time before rendering.
• Dynamically move killed particles to the end so that the front of the buffer contains only alive particles.

When we decide that a particle needs to be killed, we swap it with the last active one, so the alive particles stay packed at the front of the buffer.

This method is faster than the first idea:
• When we update particles there is no need to check if it is alive. We update only the front of the buffer.
• No need to copy only alive particles to some other buffer
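The second approach can be sketched roughly as below. This is only an illustration under some assumptions: `ParticlePool` is a hypothetical name, a single `life` array stands in for all particle properties, and `wake` here activates the first dead slot instead of taking a particle ID.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity pool: the first `aliveCount` slots are alive,
// everything after them is dead and available for reuse.
struct ParticlePool {
    std::vector<float> life;      // example per-particle property
    std::size_t aliveCount = 0;

    // One allocation up front for the maximum number of particles.
    explicit ParticlePool(std::size_t maxCount) : life(maxCount, 0.0f) {}

    // Wake the first dead slot; returns false if the pool is full.
    bool wake(float initialLife) {
        if (aliveCount >= life.size()) return false;
        life[aliveCount++] = initialLife;
        return true;
    }

    // Kill particle `id` by swapping it with the last alive particle.
    void kill(std::size_t id) {
        if (id >= aliveCount) return;   // already dead or out of range
        std::swap(life[id], life[aliveCount - 1]);
        --aliveCount;
    }
};
```

Update and render loops then only iterate over the range `[0, aliveCount)`. With several property arrays, every array would need the same swap so that the indices stay in sync.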

# What's Next

In this article I've introduced several problems we can face when designing a particle container. Next time I will show my implementation of the system and how I solved the described problems.

BTW: do you see any more problems with the design? Please share your opinions in the comments.

14 Jun 2014: Initial version, reposted from Code and Graphics blog

Software developer trying to design and write some great code... plus share knowledge from time to time.

I went to your site and it is a really interesting series so far, but wouldn't it have made more sense to upload the start of the article series and link to the rest, rather than put the second part here and link to the previous and later articles? I know it is part one of the in-depth part of the series, but it skips the introduction. Just my two cents on it.

Overall I like the article, but... *puts on armor and jumps on horse* ...I need to ride to std::vector's rescue here.

Of course we could use std::list or std::vector and change it every time, but would that be efficient?

Yep! std::vector is very efficient.

Is it good to reallocate memory often (each time we create a particle)?

std::vector doesn't do that.

std::vector only reallocates memory when it runs out of reserved memory. It reserves more memory than you are currently using. In some implementations, like Microsoft's, every time std::vector reallocates it grows its capacity by roughly 50% for future growth. GCC's libstdc++ doubles it.

If that is undesirable, tell std::vector how much memory you'd like to reserve!

One thing that we can initially assume is that we can allocate one huge buffer that will contain the maximum number of particles.

That's a single function call with std::vector: std::vector::reserve.

That way we do not need to have memory reallocations all the time.

By proper use of std::vector, you really don't have to.
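As a small illustration of that point, the sketch below reserves room for the maximum number of particles and then checks that the underlying buffer never moves while the vector is filled. `bufferStaysPut` is a hypothetical helper name, and plain floats stand in for particle data.

```cpp
#include <cstddef>
#include <vector>

// Reserve once, then fill up to the reserved size and report whether the
// storage stayed at the same address (i.e. no reallocation happened).
inline bool bufferStaysPut(std::size_t maxParticles) {
    std::vector<float> particles;
    particles.reserve(maxParticles);            // single up-front allocation
    const float *before = particles.data();

    for (std::size_t i = 0; i < maxParticles; ++i)
        particles.push_back(0.0f);              // size never exceeds capacity

    return particles.data() == before;
}
```

The standard guarantees that no reallocation takes place after `reserve(n)` until an insertion would make the size greater than the capacity, so this should return true for any `maxParticles`.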

Want to remove an element from the middle of a vector without reallocating? No problem:

    //Swap-and-pop a specific element of a container, swapping the element with
    //the element at the back of the container and then popping it off the back.
    //This does not preserve the order of the elements.
    #include <cstddef> //std::size_t
    #include <utility> //std::swap

    template<typename ContainerType>
    void SwapAndPopAtIndex(ContainerType &container, std::size_t index)
    {
        //Don't touch the container if the index is out of range.
        if(index >= container.size())
            return;

        //Swap the two values, unless the element is already at the back.
        if((index + 1) != container.size())
            std::swap(container[index], container.back());

        //Pop the back of the container, deleting our old element.
        container.pop_back(); //No reallocations take place here.
    }


Now if you have other reasons for not using std::vector, that's fine! I'm not saying std::vector fits every situation.

And there are also legitimate reasons for separating out the member variables - for example, storing your particle positions in their own array specifically so you can pass the entire array to the video card with the positions being contiguous (though I don't see why you can't separate out the members and still use std::vector - a std::vector for the position data, a std::vector for the color data, and so on).

You clearly have legitimate reasons for separating out the individual member variables, which is cool - but the two reasons stated in the article for avoiding std::vector ("numerous" reallocations, and fragmentation of dead particles) are easily solvable using std::vector without creating a custom container class.

In fact, even if you didn't bother to reserve your memory, you still wouldn't have "numerous" reallocations.

If you do this:

    for(int i = 0; i < 10000; ++i)
    {
        array.push_back(MyType());
    }


In GCC 4.8.1, only 15 buffer reallocations occur! (out of ten thousand push_backs) And that's without telling the array to reserve the memory in advance (in which case, only the original allocation would occur).
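If you want to check this on your own compiler, a rough way is to count how often the capacity changes while pushing elements. This is just a sketch; `countReallocations` is a hypothetical helper, and the exact number it returns depends on the growth strategy of the standard library in use.

```cpp
#include <cstddef>
#include <vector>

// Count how many times the vector's capacity changes while `count`
// elements are pushed back without any prior reserve().
inline std::size_t countReallocations(std::size_t count) {
    std::vector<int> array;
    std::size_t reallocations = 0;
    std::size_t lastCapacity = array.capacity();

    for (std::size_t i = 0; i < count; ++i) {
        array.push_back(static_cast<int>(i));
        if (array.capacity() != lastCapacity) {  // storage was reallocated
            ++reallocations;
            lastCapacity = array.capacity();
        }
    }
    return reallocations;
}
```

Calling `countReallocations(10000)` should give a small number (a few dozen at most on common implementations), because capacity grows geometrically rather than by one element at a time.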

Apart from the std::vector tidbits, I enjoyed the article - I bookmarked one of the later ones to read it more thoroughly later, because they look interesting. It's definitely a good idea to have functions that can operate on an entire array of data instead of calling separate functions for each element. Currently, I'm reading over your Updaters article, which has some really cool ideas I've never thought of before. Thank you for sharing these!

Unless you are bound by other constraints, such as not being able to use exceptions, you should always begin with the standard containers. In many cases even naive use, without intentionally-pathological access patterns, is quite good enough -- certainly good enough to start with.

If it isn't performing well enough, profile and take a second look at efficient API use (things like reserve(), or using algorithms that add/remove items in groups, rather than individually). If it still isn't performing, then look at creating a custom allocator for the container to use. Only after you've exhausted those options should you start looking at 3rd party or custom container implementations.

In particular here, you're advocating a container that is effectively behaving as a pooled allocator -- it would be just as effective to create just an allocator and wire it up to std::vector, which takes template parameters to do just that. It's somewhat unknown, and even less understood, which is why people reach for custom containers, but it's not terribly difficult and the payoff is huge -- not the least of which is retaining the ability to make use of the standard algorithms.
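For readers who have not wired an allocator into std::vector before, here is a minimal sketch of the idea. It uses a simple bump ("arena") allocator rather than a full pooled allocator, and `Arena` and `ArenaAllocator` are hypothetical names made up for this example.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size arena: hands out memory linearly and never frees
// individual blocks; everything is released when the arena is destroyed.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), used_(0) {}

    void *allocate(std::size_t bytes, std::size_t align) {
        std::size_t start = (used_ + align - 1) / align * align;
        if (start + bytes > buffer_.size()) throw std::bad_alloc();
        used_ = start + bytes;
        return buffer_.data() + start;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// Minimal C++11-style allocator that forwards to an Arena.
template <typename T>
struct ArenaAllocator {
    using value_type = T;
    Arena *arena;

    explicit ArenaAllocator(Arena &a) noexcept : arena(&a) {}
    template <typename U>
    ArenaAllocator(const ArenaAllocator<U> &other) noexcept : arena(other.arena) {}

    T *allocate(std::size_t n) {
        return static_cast<T *>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T *, std::size_t) noexcept {}  // freed with the arena
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T> &a, const ArenaAllocator<U> &b) noexcept {
    return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T> &a, const ArenaAllocator<U> &b) noexcept {
    return !(a == b);
}
```

A vector is then declared as `std::vector<float, ArenaAllocator<float>> v(alloc);` where `alloc` is an `ArenaAllocator<float>` built from an `Arena` that outlives the vector. Note that a bump allocator like this wastes the old block whenever the vector regrows, so it works best together with `reserve()`.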

Servant and Ravyne, both very true comments! They don't really invalidate the article itself, though. Not much is said about the container for the pool of particles, I believe; vector can be used. Also, I would not recommend list for just about anything. Maybe a list is OK only when combined with two things: a fast pool allocator and a large type T.

Considering that most modern hardware has support for float textures, why not simulate it on the GPU?

You can update around 600k particles with a 2048x2048 texture or even 2.4M in a 4096 one.

Obviously you don't want every particle to issue a draw call. So instancing all particles to a single draw call is the way forward.

You need an instance buffer that you should update every frame. If each of the 100K particles is going to carry 29 bytes of information then we are looking at a ~2.76 MB instance buffer. If you are updating 120 times a second we are looking at 330 MB/s of bandwidth just for this one effect. Additionally you are using the CPU to calculate 330 MB/s worth of data; even if you spread it out on all cores this is, in my mind, a huge performance hit.

What if you just copied the instance buffer ONCE to device memory and let the GPU calculate the new positions? Let's say your instance buffer contains 100K particles, that only carry an initial state, such as spawn time, death time, a random chaos value, spawn position and velocity. Then just have one global time variable. Then in the shader calculate if it should be drawn and where?
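Written out as C++ instead of shader code, the idea might look roughly like this. It is only a sketch: `ParticleSeed`, `evaluate` and `uploadBytesPerSecond` are hypothetical names, straight-line motion is an assumption, and the random chaos value is left out.

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// Hypothetical immutable per-particle state, uploaded to the GPU once.
struct ParticleSeed {
    float spawnTime;
    float deathTime;
    Vec3  spawnPos;
    Vec3  velocity;
};

// What a shader could compute per instance from the seed and one global
// time value: whether the particle is visible, and where it is.
inline bool evaluate(const ParticleSeed &s, float globalTime, Vec3 &outPos) {
    if (globalTime < s.spawnTime || globalTime >= s.deathTime)
        return false;                       // not spawned yet or already dead
    const float age = globalTime - s.spawnTime;
    outPos = Vec3{s.spawnPos.x + s.velocity.x * age,
                  s.spawnPos.y + s.velocity.y * age,
                  s.spawnPos.z + s.velocity.z * age};
    return true;
}

// Bytes per second needed if the instance buffer is re-uploaded on every update.
constexpr std::size_t uploadBytesPerSecond(std::size_t particles,
                                           std::size_t bytesPerParticle,
                                           std::size_t updatesPerSecond) {
    return particles * bytesPerParticle * updatesPerSecond;
}
```

For the numbers above, `uploadBytesPerSecond(100000, 29, 120)` is 348,000,000 bytes per second, which is about 330 MB/s when counted in units of 1024 x 1024 bytes. With the seed approach only the global time changes per frame.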

Here is an example of this practice, which helps slow languages like JavaScript handle more particles:

http://haxor.thelaborat.org/demos/webgl/particle_sheet.html

@Servant of the Lord: Thanks for your positive feedback! Actually I did not use std::vector because it is very slow in Debug mode under Visual Studio. VS uses checked iterators or something... That way running a Debug version of the application would not be smooth. I've decided to use almost 'raw' memory.

Still I suggest at least starting with std::vector and in most cases it will work fine.

@eduardo_costa - you've just revealed the second part of my articles! In the future I plan to port my code to the GPU, but for now I thought playing with particles on the CPU + adding some CPU optimization might be a nice experiment.

Thank you all for the great comments! I am still working on the next parts of this series.

I have this task finished in my opensource WebGL engine.

Feel free to ask anything when you reach this step!

http://haxor.thelaborat.org

http://mercurial.thelaborat.org/haxor_engine

I like the article because it gives me ideas for my current way of treating big amounts of particles. But I don't get why it is bad to use two lists, e.g. HashSets (because they don't need to be sorted), for storing alive and stock instances, when you just add/delete references.

@eduardo_costa ok, I will definitely ask!

@PhilObyte hash_set or list will make your life easier, of course... but your particles would be scattered in memory. I would like to use one contiguous chunk of memory. That way performance should be better.

@fen

Feel free to follow here https://github.com/haxorplatform

Why do you believe SoA is more efficient than AoS in the case of particles? Can you explain the reasoning there to me? (All I see in my mind is massive cache thrashing.)

SoA does not solve all problems. It can, of course, be slower than AoS. Everything depends on memory access patterns.

If you have a simple particle you usually compute position (using prev pos, velocity, acceleration), then you compute color, then time. Using SoA you can split this computation and it will be easier to use vectorization code.

Another valuable point is that having SoA enables you to have one stream of position, one stream of color. They can be easily transferred to GPU for rendering.

I'm interested in how the cache usage would work in that case. What I'm concerned about is that - like you mentioned - updating the pos requires reading some of the other members. If you're using SoA then I would think that cache efficiency would vary depending on the size of each array. It would be possible to limit the number of arrays and try to keep it less than the number of cache lines, but that would restrict the design pretty severely. Is there a way of working around that? Something like 'SoAoS', where you're packing together similar data into substructs?

@Khatharr - I did not compare true AoS vs SoA in my system. You can look here for the implementation: http://www.bfilipek.com/2014/04/flexible-particle-system-container-2.html

So, this is a kind of SoAoS - I use arrays of 4d vectors.

Sup!

If you want to see what the compute shader for a GPU particle system looks like!

If calculating positions requires every other member in the structure, then it will be more cache efficient but it gets more complicated than that. If you start SIMD optimizing, you generally want your data to be 16 byte aligned, but a position is 3 floats which means 12 byte strides. If you then need 2 float[3]'s to calculate your position, they now need either a 4 byte padding, or pay the cost of an unaligned load into registers.
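The stride problem can be seen directly from the type sizes. The sketch below assumes the usual 4-byte float; `Float3`, `Float3Padded` and `byteOffset` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>

struct Float3 { float x, y, z; };                    // tightly packed
struct alignas(16) Float3Padded { float x, y, z; };  // padded for SIMD loads

static_assert(sizeof(Float3) == 12, "three packed 4-byte floats");
static_assert(sizeof(Float3Padded) == 16, "4 bytes of padding are added");

// Byte offset of element `index` inside an array of T.
template <typename T>
constexpr std::size_t byteOffset(std::size_t index) {
    return index * sizeof(T);
}
```

In an array of `Float3` the second element starts at byte 12, which is not a multiple of 16, so an aligned 16-byte load is not possible there. With `Float3Padded` every element starts on a 16-byte boundary, at the cost of 4 unused bytes per element.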

But in the SoA approach, you can get additional benefits like cutting 1/4 of your operations down. For example, when applying something like: position += velocity * dt,
the AoS approach would likely be to operate on each position and velocity separately.
But if the arrays are tightly packed in an SoA approach, then you can do the multiply/add across float4s and get the work done faster.

From a cache usage point of view, SoA is also pretty good for the most part as you have multiple cache lines. So having position, array, colour, velocity etc. in separate cache lines isn't a problem. There's also the benefit of each cache line being invalidated less frequently as the cache lines will contain more elements.
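A scalar sketch of that update over tightly packed arrays is shown below. It assumes positions and velocities are stored as flat float arrays (x0 y0 z0 x1 y1 z1 ...), and `integrate` is a hypothetical name. Because the same operation is applied to every float, a compiler (or hand-written SIMD code) can process four or more floats per instruction with no lanes wasted on padding.

```cpp
#include <cstddef>
#include <vector>

// position += velocity * dt over flat, tightly packed float arrays.
// Only the overlapping part is processed if the sizes differ.
inline void integrate(std::vector<float> &pos,
                      const std::vector<float> &vel,
                      float dt) {
    const std::size_t n = pos.size() < vel.size() ? pos.size() : vel.size();
    float *p = pos.data();
    const float *v = vel.data();
    for (std::size_t i = 0; i < n; ++i)
        p[i] += v[i] * dt;   // same operation on every element
}
```

Whether this ends up faster than an AoS layout still depends on the access pattern of the rest of the system, so it is worth measuring both.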
