# Writing an easily maintainable, powerful and flexible particle system (new: efficient particle data packing)


## Recommended Posts

I already have a particle system, but some of its functionality doesn't perform as well as I'd like, and there are a number of open questions whose answers would take considerable research time. That's why I think a public discussion on this might benefit more people than just myself.

What I need is:

- dedicated particle systems whose parameters can be set individually and programmatically
- emitter parameters throughout a system need to be accessible from a script, which requires them to be named
- independence of the rendering code from the iteration/stepping code
- post-fact extensibility (adding new particle system classes after the code is compiled)

As I'm currently doing everything in a simple way (e.g. I have a particle system class that manages a list of particles), I need to extend this a fair bit. My idea is to implement the particle system as a self-referential emitter class that accepts an emitter configuration class as a parameter.

```cpp
// this is loaded from an XML file
class EmitterConfiguration {
    float args[256];
    char* names[256];
    int iNumArgs;
    Shader* updateShader;
};

class ParticleEmitter {
    std::vector<ParticleEmitter> particles;
    EmitterConfiguration* cfg;
};

void ParticleEmitter::Update() {
    cfg->EnableUpdateShader(true);
    for (auto& p : particles)   // for all active particles
        cfg->UpdateParticle(p);
    cfg->EnableUpdateShader(false);
}

ParticleEmitter* i_am_a_particle_system;
```

This is all fine and dandy and should cover the extensibility and flexibility parts. However, I can see a number of speed bottlenecks here, which lead me to the following questions:

1) is moving particle updating off the CPU a good idea in a general sense? I mean, CPU cores are a dime a dozen on many newer systems and can be expected to only become more common.
2) how much of the iteration/updating should I move to the GPU? Everything? Everything except spawning?
3) how should I go about writing the GPU side of the system? The only solution I can see for storage is textures, but do they justify giving up the streaming cost?
4) how should I go about syncing between the CPU and the GPU? If most work is done on the GPU, I still need pretty detailed information about the system on the CPU and texture read-backs are probably the worst idea to opt for.
5) I'm getting no perceptible speed increase from using geometry shader billboarding - is it worth it to tie up additional GPU resources with it?
6) in systems with several particle textures, which would you recommend: sort the particles into individual per-texture lists (might require sorting in all cases, as the particles are no longer drawn sequentially); suck it up and draw the particles individually (uh oh); store them in a single array, but parse the array once per texture (same problem as in the first case); something else? (Actually, come to think of it, the easiest and cheapest way is probably to build a texture atlas for each system when it is first created.)
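To illustrate the atlas idea, here's a minimal sketch of mapping a tile index to its UV rectangle, assuming a square atlas with a fixed number of tiles per row (the names and the 4x4 layout are my assumptions, not from the post):

```cpp
#include <cassert>

// Hypothetical helper: map a tile index in a row-major texture atlas
// (e.g. 4x4 tiles -> indices 0-15) to that tile's UV rectangle.
struct UVRect { float u0, v0, u1, v1; };

UVRect AtlasTileUV(int tileIndex, int tilesPerRow = 4) {
    const float tileSize = 1.0f / tilesPerRow;
    const int col = tileIndex % tilesPerRow;
    const int row = tileIndex / tilesPerRow;
    return { col * tileSize,       row * tileSize,
             (col + 1) * tileSize, (row + 1) * tileSize };
}
```

With this, all particles of a system can be drawn in one batch regardless of which source texture they use; each particle just carries its tile index.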

There are probably a number of other questions that I can't think of right off the bat, but I'm really most curious about how people have managed to pull off the updating bit. Right now I'm getting a 5-20 FPS drop in debug mode for a handful of particles (a few hundred to a few thousand). I'm not sure if I'm fill-rate limited (as the particles are relatively large) or whether the bottleneck is the fact that I'm rendering them from a linked list instead of a fixed array.

##### Share on other sites
You should easily get a few thousand particles on "normal" consumer hardware nowadays without much optimization.

This leads me to believe that you have some kind of implementation problem rather than one in your concept.

PS: with newer hardware, every kind of particle work should be calculated on the GPU; the only downside is that the implementation is probably more difficult.
So if you just want some basic particle effects, do it on the CPU, but if you want to go to extremes (20 million particles or so), research GPU particle systems.

##### Share on other sites
I've personally found that the "shader instancing (with draw call batching)" technique described here is faster than either geometry shaders or hardware instancing for this kind of usage scenario. You can easily billboard on the GPU in a standard vertex shader using this, and the only downside is that your batch size is limited by how many constant registers you have available (but in practice the perf gains more than tip the balance to the other side).

Whether or not this translates into real-world improvements for you depends on where you're bottlenecked, and the same is true of whether you should move more calculations (such as position/velocity updating) to the GPU. The truth is that with a particle system that needs to draw lots of overlapping particles up close to the viewpoint (and notice how the bazillion-particle tech demos never really do this...), your main bottleneck is going to be fillrate. You can still aggressively tackle everything else to ensure that your bottleneck actually is fillrate, but once you get to that stage, further optimizations are going to be fairly worthless.

##### Share on other sites
I'd say you should definitely do as much as possible on the GPU; the strength of the GPU (doing the exact same calculation simultaneously for different input data) fits perfectly with particles. As for how you should store the results, I think using stream out to put them in a vertex buffer would work much better than using textures.

On the topic of fillrate I think you should experiment with trimming the particles something like described here:
http://www.humus.name/index.php?ID=266

I have implemented something similar and it gave us a noticeable performance increase. It will only help if you're fillrate limited, though.

##### Share on other sites
CPU particle systems are still very popular at the moment, because the PS3 has tonnes of CPU power but a really crappy GPU. Likewise, Xbox360 games are usually GPU bound, not CPU bound.

1) Depends on where your bottleneck is. As above, on consoles, the GPU is probably your bottleneck, so you'd want to make more use of those extra CPU cores.
Also, if you need the particles to interact with *highly complex* geometry, you're likely better off on the CPU.
2) If you're going with GPU, probably as much as possible (everything except input events).
3) What do you mean by "giving up the streaming cost"?
4) The CPU shouldn't need to read-back any data. It should be able to launch a draw-call in response to a gameplay event (e.g. to spawn some stuff), and then issue draw-calls to update things, and then finally draw-calls to render things.
5) Compared to what?
6) Yeah, use an atlas.
> the bottleneck is in the fact that I'm rendering them from a linked list instead of a fixed array.

Oh god yes, that is likely a huge problem. Your cache hates you right now.
CPU particle systems can be really slow if done wrong and insanely fast if done right (i.e. hundreds of times difference in performance between the bad and the good).
You probably want to lay out the data for an entire system (group of particles) in SoA format, and then iterate over it updating all particles at once. If you can use SIMD here it will likely give good gains as well.
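A minimal sketch of the SoA layout described above, with a single tight update loop (field names are illustrative, not from the post):

```cpp
#include <vector>

// Structure-of-Arrays layout: each attribute is stored contiguously,
// so the update loop streams through memory linearly. This is cache
// friendly and easy for the compiler to auto-vectorize with SIMD.
struct ParticleSoA {
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> life;

    void Spawn(float px, float py, float pz,
               float vx, float vy, float vz, float lifetime) {
        posX.push_back(px); posY.push_back(py); posZ.push_back(pz);
        velX.push_back(vx); velY.push_back(vy); velZ.push_back(vz);
        life.push_back(lifetime);
    }

    // One pass over the whole system: no pointer chasing, no per-particle
    // virtual calls, every array walked front to back.
    void Update(float dt) {
        const std::size_t n = life.size();
        for (std::size_t i = 0; i < n; ++i) {
            posX[i] += velX[i] * dt;
            posY[i] += velY[i] * dt;
            posZ[i] += velZ[i] * dt;
            life[i] -= dt;
        }
    }
};
```

Compare this to a linked list of particle objects: there, every step is a dependent load to a potentially distant heap address, which is exactly the access pattern caches punish hardest.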

##### Share on other sites
Another Humus article (and demo program) that may be of interest to you in comparing drawing techniques: http://www.humus.name/index.php?page=3D&ID=52

##### Share on other sites
Ok, that's a huge load of really useful feedback, guys! Thanks.

The truth is, I very likely need to move most of the code to the GPU anyway as I really want to decouple all particle system types from the main code (in that I want the available particle system types to be extensible without needing to update them in the main code).

I've never done instancing in GL before, but I'll definitely look into it. As I'm targeting the 200 series and up, I'm fairly confident support won't be a problem. Plus the links you posted seem to cover pretty much everything on it (at least in DX). The performance gain of instancing over a single draw call per particle on my GTX460 is actually really, really impressive (120k particles @ 12 vs 217 FPS)!

I'm not targeting consoles, so I'll assume the requirement of a 200+ series card coupled with a multicore CPU (2 cores minimum for now, I guess). I'm not even sure what's used in consoles these days (and I don't own any of them either).

Danny02 - in the current implementation I can quite easily pull off a few thousand particles without a noticeable change; however, the framerate number still plummets from 160 to 120-140 for a single system. Adding more particle effects only compounds the problem.

Hodgman:

> 1) Depends on where your bottleneck is. As above, on consoles, the GPU is probably your bottleneck, so you'd want to make more use of those extra CPU cores.

@the SoA link: here's a somewhat interesting performance comparison of AoS vs SoA, both with and without SSE.

##### Share on other sites
Hmm, that Humus article is pretty insane, and so simple! Why haven't I thought of that...

##### Share on other sites
I'd recommend you look at Bungie's GDC 2011 presentation about the Halo: Reach visual FX. They do everything on the GPU, including particle collision detection in screen space; it's a very interesting approach.

Also, there's a presentation from GDC 2010 about the visual FX in Brutal Legend; it's very good too.

Brutal Legend GDC 2010

##### Share on other sites

> As for how you should store the results I think using stream out to put it in a vertex buffer would work much better than using textures.

Ok, I'm having some trouble researching this - how do I update the vertex buffer from a shader in OpenGL? At best I can find examples that use a texture for the instance position and do an absolute transform on the current vertex. For a particle system I need to be able to do iterative transforms.

Edit: I'm guessing the term I'm looking for is transform feedback xD

##### Share on other sites
Alright then, I have transform feedback properly up and working now and I'm trying to figure out the most efficient way to pack my particle data in GPU memory. Here's what I have thus far:

Total: 3 + 4 + 3 = 10 floats = 40 bytes per particle (vec3, vec4, vec3)

I

vec3 = unpacked positional information

II

vec4 = packed source and destination colors: float0 = src color, float1 = src alpha, float2 = dst color, float3 = dst alpha

III

data packing is done as:

```glsl
// particle data packing into vec4

// float0
//  bits     function
//  0 - 4    texture index in the atlas (0-15)
//  4 - 8    stage
//  8 - 16   persistence (0-255) (essentially decay rate)
//  16 - 20  target scale (0-15)
//  20 - 24  current scale (0-15)
//  24 - 32  rotation (0-255)

// float1
//  stage age, not packed

// float2
//  velocity, not packed

// float3
//  direction vec3 packed into a float

// Rotation speed is deduced from particle stage and persistence.
```
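For reference, the float0 bit layout above could be packed and unpacked on the CPU side like this (a sketch; function and field names are mine, and the resulting 32-bit value would still need to be reinterpreted as a float for upload):

```cpp
#include <cstdint>

// Pack the float0 fields into a 32-bit integer, matching the bit
// layout above: [0,4) atlas index, [4,8) stage, [8,16) persistence,
// [16,20) target scale, [20,24) current scale, [24,32) rotation.
uint32_t PackParticleBits(uint32_t atlasIndex, uint32_t stage,
                          uint32_t persistence, uint32_t targetScale,
                          uint32_t currentScale, uint32_t rotation) {
    return  (atlasIndex   & 0xFu)
          | ((stage        & 0xFu)  << 4)
          | ((persistence  & 0xFFu) << 8)
          | ((targetScale  & 0xFu)  << 16)
          | ((currentScale & 0xFu)  << 20)
          | ((rotation     & 0xFFu) << 24);
}

// Per-field extraction mirrors the shifts above.
uint32_t UnpackAtlasIndex(uint32_t packed)  { return packed & 0xFu; }
uint32_t UnpackStage(uint32_t packed)       { return (packed >> 4) & 0xFu; }
uint32_t UnpackPersistence(uint32_t packed) { return (packed >> 8) & 0xFFu; }
uint32_t UnpackRotation(uint32_t packed)    { return (packed >> 24) & 0xFFu; }
```

One caveat: if these 32 bits end up stored in a float, only the 24-bit mantissa survives exactly, so the highest byte (rotation here) is the one most at risk of precision loss.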
The vec3->float->vec3 packing functions are:

```glsl
vec3 Float2Vec3(in float f) {
    vec3 color;
    f *= 256.0;
    color.x = floor(f);
    f -= color.x;
    f *= 256.0;
    color.y = floor(f);
    f -= color.y;
    color.z = floor(f * 256.0);
    return color * 0.00390625; // color / 256
}

float Vec32Float(vec3 color) {
    const vec3 byte_to_float = vec3(1.0, 1.0 / 256.0, 1.0 / (256.0 * 256.0));
    return dot(color, byte_to_float);
}
```
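A quick way to sanity-check the precision of this scheme is to port it to the CPU and test the round trip (a C++ port of the two functions above; note that three 8-bit channels need 24 bits, which exactly fills a single-precision mantissa, so values quantized to 1/256 steps should survive):

```cpp
#include <cmath>

// CPU port of the GLSL packing functions, for precision testing.
struct Vec3 { float x, y, z; };

// dot(color, vec3(1, 1/256, 1/65536))
float Vec32Float(Vec3 c) {
    return c.x + c.y * (1.0f / 256.0f) + c.z * (1.0f / (256.0f * 256.0f));
}

Vec3 Float2Vec3(float f) {
    Vec3 c;
    f *= 256.0f; c.x = std::floor(f); f -= c.x;
    f *= 256.0f; c.y = std::floor(f); f -= c.y;
    c.z = std::floor(f * 256.0f);
    const float inv = 0.00390625f; // 1/256
    return { c.x * inv, c.y * inv, c.z * inv };
}
```

With channels that are exact multiples of 1/256, the round trip reproduces the input bit-for-bit; arbitrary (non-quantized) channels will be truncated to that step size.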

If anyone has any suggestions on how to do this even better or how to increase precision, by all means please post some feedback!

##### Share on other sites

Oh, thanks for mentioning that! This shouldn't matter for storage, though, as I'm defining a fixed-length stream of N floats (which is numParticles * sizeof(particle) bytes, expressed in floats). That is to say, by using two streams of vec3's and one stream of vec4's, I can still save 8 bytes of storage per particle (which is my aim), even if the readback from each of the streams is done 4 floats at a time. Right?
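To make the bookkeeping explicit, the layout described in the earlier post works out like this (a sketch with hypothetical names, following the stated vec3 + vec4 + vec3 = 10 floats):

```cpp
#include <cstddef>

// Per-particle GPU layout: vec3 + vec4 + vec3 = 10 floats = 40 bytes.
struct ParticleGPU {
    float position[3];  // I:   unpacked position
    float colors[4];    // II:  packed src/dst color and alpha
    float packed[3];    // III: packed state (see bit layout above)
};

static_assert(sizeof(ParticleGPU) == 40,
              "10 tightly packed floats = 40 bytes per particle");

constexpr std::size_t BytesForParticles(std::size_t numParticles) {
    return numParticles * sizeof(ParticleGPU);
}
```

Whether the three attributes live interleaved in one buffer or as separate streams, the total storage per particle is the same 40 bytes.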

##### Share on other sites
> Xbox360 games are usually GPU bound, not CPU bound.

Indie developers that are forced to use XNA to develop on the Xbox 360 are typically bound by the CPU, as they have something like 1/2 - 1/10th the CPU power that one might have with a C++ devkit.

Microsoft jumps through hoops to make C# run in an environment where modification of code is prohibited for security reasons, and C#, being a managed language, gets some of its perf benefits from self-modification.

The xbox uses the compact .NET CLR, which is a piece of crap, but adapting the more robust desktop interpreter to the xbox OS is a big job.

That said, Halo: Reach has particles (computed on the GPU) that collide with each other and the environment, and their stated upper limit on these particles is something like 25000. There's no way that could be done on the CPU.