Writing an easily maintainable, powerful and flexible particle system (new: efficient particle data packing)

12 comments, last by IntegralKing 12 years, 10 months ago
I already have a particle system, but some of its functionality doesn't perform as well as I'd like, and there are a number of open questions whose answers would take a lot of research time to pin down. That's why I think a public discussion on this might benefit more people than just myself.

What I need is:

- dedicated particle systems whose parameters can be set individually and programmatically
- emitter parameters throughout a system need to be accessible from a script, which requires them to be named
- independence of the rendering code from the iteration/stepping code
- post-fact extensibility (adding new particle system classes after the code is compiled)
Right now I'm doing everything in a simple way (e.g. I have a particle system class that manages a list of particles), so I need to extend this a fair bit. My idea is to implement the particle system as a self-referential emitter class that accepts an emitter configuration class as a parameter.


//this is loaded from an XML file
class EmitterConfiguration {
public:
    float args[256];         // parameter values
    const char* names[256];  // parameter names, so scripts can address them
    int iNumArgs;

    Shader* updateShader;    // GPU program that steps this emitter's particles

    void EnableUpdateShader(bool enable);
    void UpdateParticle(class ParticleEmitter& p);
};

// Self-referential: each particle is itself an emitter, so systems can nest.
class ParticleEmitter {
public:
    void Update();

private:
    std::vector<ParticleEmitter> particles;
    EmitterConfiguration* cfg;
};

void ParticleEmitter::Update()
{
    cfg->EnableUpdateShader(true);
    for (ParticleEmitter& p : particles)  // all active particles
        cfg->UpdateParticle(p);
    cfg->EnableUpdateShader(false);
}

ParticleEmitter* i_am_a_particle_system;


This is all fine and dandy and should cover the extensibility and flexibility parts. However, I can see a number of speed bottlenecks here, which lead me to the following questions:

1) is moving particle updating off the CPU a good idea in a general sense? I mean, CPU cores are a dime a dozen on many newer systems and can only be expected to become more common.
2) how much of the iteration/updating should I move to the GPU? Everything? Everything except spawning?
3) how should I go about writing the GPU side of the system? The only solution I can see for storage is textures, but do they justify giving up the streaming cost?
4) how should I go about syncing between the CPU and the GPU? If most work is done on the GPU, I still need pretty detailed information about the system on the CPU and texture read-backs are probably the worst idea to opt for.
5) I'm getting no perceptible speed increase from using geometry shader billboarding - is it worth it to tie up additional GPU resources with it?
6) in systems with several particle textures, which would you recommend: sorting the particles into individual lists (might require sorting in all cases, as the particles are no longer drawn sequentially); sucking it up and drawing the particles individually (uh oh); storing them in a single array, but parsing the array once per texture (same problem as in the first case); something else? (Actually, come to think of it, the easiest and cheapest way is probably to build a texture atlas for each system when it is first created - see the sketch below.)
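
As a minimal sketch of that atlas idea (the names here are illustrative, assuming a regular grid of tiles):

// Hypothetical helper: map a tile index in a regular-grid atlas to the
// UV rectangle a particle quad should sample from.
struct UVRect { float u0, v0, u1, v1; };

UVRect AtlasRect(int tile, int tilesX, int tilesY)
{
    const float du = 1.0f / tilesX;
    const float dv = 1.0f / tilesY;
    const int col = tile % tilesX;
    const int row = tile / tilesX;
    UVRect r = { col * du, row * dv, (col + 1) * du, (row + 1) * dv };
    return r;
}

With that, one texture bind serves every particle in the system, so draw order no longer depends on which image each particle uses.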

There are probably a number of other questions that I can't think of right off the bat, but I'm really most curious about how people have managed to pull off the updating bit. Right now I'm getting a 5-20 FPS drop in debug mode for a handful of particles (a few hundred to a few thousand). I'm not sure if I'm fill-rate limited (the particles are relatively large) or whether the bottleneck is the fact that I'm rendering them from a linked list instead of a fixed array.
You should easily get a few thousand particles on "normal" consumer hardware nowadays without much optimization.


This leads me to believe that you have some kind of implementation problem rather than a problem with your concept.


PS: with newer hardware, every kind of particle work can be calculated on the GPU; the only downside is that the implementation is probably more difficult.
So if you just want some basic particle effects, do them on the CPU, but if you want to go to extremes (20 million particles or so), research GPU particle systems.
I've personally found that the "shader instancing (with draw call batching)" technique described here is faster than either geometry shaders or hardware instancing for this kind of usage scenario. You can easily billboard on the GPU in a standard vertex shader using this, and the only downside is that your batch size is limited by how many constant registers you have available (but in practice the perf gains more than tip the balance to the other side).
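
To sketch the same principle in GL terms (this is not from the linked article; the uniform names, batch size and corner trick are all illustrative): pack one vec4 per particle into a uniform array, render with an index buffer of six indices per quad (4i, 4i+1, 4i+2, 4i+2, 4i+1, 4i+3) and no per-vertex attributes, and let gl_VertexID select both the particle and the corner:

// Hypothetical vertex shader for constant-register ("shader instancing")
// billboarding; u_particles holds xyz = world position, w = half-size.
const char* kBillboardVS = R"glsl(
#version 330
uniform vec4 u_particles[256];  // batch size limited by uniform space
uniform mat4 u_view;
uniform mat4 u_proj;
const vec2 corners[4] = vec2[4](vec2(-1,-1), vec2(1,-1), vec2(-1,1), vec2(1,1));

void main()
{
    vec4 p = u_particles[gl_VertexID / 4];     // which particle
    vec2 c = corners[gl_VertexID % 4];         // which corner of its quad
    vec4 viewPos = u_view * vec4(p.xyz, 1.0);  // move to view space first...
    viewPos.xy += c * p.w;                     // ...then expand, so the quad faces the camera
    gl_Position = u_proj * viewPos;
}
)glsl";

The C++ side would then upload up to 256 particles per batch with glUniform4fv and issue one glDrawElements per batch.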

Whether or not this translates into real-world improvements for you depends on where you're bottlenecked, and the same goes for whether you should move more calculations (such as position/velocity updating) to the GPU. The truth is that with a particle system that needs to draw lots of overlapping particles up close to the viewpoint (and notice how the bazillion-particle tech demos never really do this...), your main bottleneck is going to be fillrate. You can still aggressively tackle everything else to ensure that your bottleneck actually is fillrate, but once you get to that stage further optimizations are going to be fairly worthless.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

I'd say you should definitely do as much as possible on the GPU; the strength of the GPU (doing the exact same calculation simultaneously for different input data) fits perfectly with particles. As for how you should store the results I think using stream out to put it in a vertex buffer would work much better than using textures.

On the topic of fillrate, I think you should experiment with trimming the particles, something like what's described here:
http://www.humus.name/index.php?ID=266

I have implemented something similar and it gave us a noticeable performance increase. It will only help if you're fillrate limited, though.
CPU particle systems are still very popular at the moment, because the PS3 has tonnes of CPU power but a really crappy GPU. Likewise, Xbox360 games are usually GPU bound, not CPU bound.

1) Depends on where your bottleneck is. As above, on consoles, the GPU is probably your bottleneck, so you'd want to make more use of those extra CPU cores.
Also, if you need the particles to interact with *highly complex* geometry, you're likely better off on the CPU.
2) If you're going with GPU, probably as much as possible (everything except input events).
3) What do you mean by "giving up the streaming cost"?
4) The CPU shouldn't need to read-back any data. It should be able to launch a draw-call in response to a gameplay event (e.g. to spawn some stuff), and then issue draw-calls to update things, and then finally draw-calls to render things.
5) Compared to what?
6) Yeah, use an atlas.
>> the bottleneck is in the fact that I'm rendering them from a linked list instead of a fixed array.
Oh god yes, that is likely a huge problem. Your cache hates you right now.
CPU particle systems can be really slow if done wrong and insanely fast if done right (i.e. hundreds of times difference in performance between the bad and the good).
You probably want to lay out the data for an entire system (group of particles) in SoA format, and then iterate over it updating all particles at once. If you can use SIMD here it will likely give good gains as well.
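
A minimal sketch of such a layout, assuming nothing beyond standard C++ (a decent compiler can often auto-vectorize this loop; an explicit SSE version would step through four floats at a time):

#include <vector>

// Hypothetical SoA storage: one contiguous array per component, so the
// update walks memory sequentially instead of chasing list nodes.
struct ParticleSoA {
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> life;
};

void Update(ParticleSoA& p, float dt)
{
    const std::size_t n = p.posX.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
        p.posZ[i] += p.velZ[i] * dt;
        p.life[i] -= dt;  // expired particles can be swap-removed afterwards
    }
}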
Another Humus article (and demo program) that may be of interest to you in comparing drawing techniques: http://www.humus.name/index.php?page=3D&ID=52

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Ok, that's a huge load of really useful feedback, guys! Thanks.

The truth is, I very likely need to move most of the code to the GPU anyway as I really want to decouple all particle system types from the main code (in that I want the available particle system types to be extensible without needing to update them in the main code).

I've never done instancing in GL before, but I'll definitely look into it. As I'm targeting the 200 series and up, I'm fairly confident support won't be a problem. Plus the links you posted seem to cover pretty much everything on it (at least in DX). The performance gain of instancing over a single draw call per particle on my GTX460 is actually really impressive (120k particles at 12 vs 217 FPS)!

I'm not targeting consoles, so I'll assume the requirement of a 200+ series card coupled with a multicore CPU (2 cores minimum for now, I guess). I'm not even sure what's used in consoles these days (and I don't own any of them either).

Danny02 - in the current implementation I can quite easily pull off a few thousand particles without a visible hitch; however, the framerate still drops from 160 to 120-140 for a single system. Adding more particle effects only compounds the problem.

Hodgman:

>> 1) Depends on where your bottleneck is. As above, on consoles, the GPU is probably your bottleneck, so you'd want to make more use of those extra CPU cores.

I think I'll implement a small number of common particle system types (fire/smoke, thruster, precipitation, etc.) in the main code and update these on the CPU if, say, 4+ cores are found. All other systems can then be dynamic and run on the GPU, and the built-in ones can as well, if the workstation is CPU-limited.

>> Also, if you need the particles to interact with *highly complex* geometry, you're likely better off on the CPU.

Yeah, I've thought about this; however, I think I can get rid of most interaction through simple depth look-ups on the GPU. I don't need to redirect a particle stream on impact or anything like that, so no actual physics are required (for now, at least).
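
For what it's worth, a sketch of what such a depth look-up could look like inside a GPU update shader - purely illustrative, the uniform names and helper are made up:

// Hypothetical GLSL (stored as a C++ string) for screen-space collision:
// project the particle, sample the scene depth buffer, compare.
const char* kDepthKillSnippet = R"glsl(
uniform sampler2D u_sceneDepth;  // depth buffer from the opaque pass
uniform mat4      u_viewProj;

// Returns true if the particle sits behind the already-rendered scene.
bool BehindScene(vec3 worldPos)
{
    vec4 clip = u_viewProj * vec4(worldPos, 1.0);
    vec3 ndc  = clip.xyz / clip.w;            // normalized device coords
    vec2 uv   = ndc.xy * 0.5 + 0.5;           // NDC -> texture coords
    float scene    = texture(u_sceneDepth, uv).r;
    float particle = ndc.z * 0.5 + 0.5;       // NDC z -> [0,1] depth range
    return particle > scene;
}
)glsl";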
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2]>> 2) If you're going with GPU, probably as much as possible (everything except input events).
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2]

[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

By 'input events' do you mean spawning new particles?
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2]>> 3) What do you mean by "giving up the streaming cost"?
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

This is poor wording on my part and was actually meant to mean "there's no streaming cost if the particles are kept on the GPU" :)
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2]>> 4) The CPU shouldn't need to read-back any data
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

A simple question: how do I know then a particle system dies? I suppose I could do sporadic once-per-second asynchronous read-backs from a persistent buffer, eg a texture, which is updated by all particle systems to determine which ones have died.
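
One readback-free option, assuming the particles are updated via stream out as suggested earlier in the thread: a primitives-written query counts how many particles survived the update pass, so the CPU learns about a dead system from a single integer rather than a texture. A sketch (RunParticleUpdatePass is a made-up name for the stream-out draw; error handling omitted):

// Hypothetical liveness check around the stream-out update pass.
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, query);
RunParticleUpdatePass();  // the draw call that steps the particles
glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);

// Poll on a later frame so the CPU never stalls waiting on the GPU.
GLuint available = 0;
glGetQueryObjectuiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint alive = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &alive);
    if (alive == 0) {
        // no particles were written back: the system has died
    }
}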
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2]>> 5) Compared to what
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2]

[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

Compared to software billboarding and streaming 4x more vertex data to the GPU than in the case of a geometry shader
[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]



[color="#cccccc"][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

[color=#CCCCCC][size=2][color=#000000]

@the SOA link. Here's a somewhat interesting performance comparison of AOS vs SOA both with and without SSE.

Hmm, that's HUMUS article is pretty insane and so simple! Why I haven't thought of that...
I would recommend looking at the Bungie GDC 2011 presentation about the Halo: Reach visual FX. They do everything on the GPU, including particle collision detection in screen space; it's a very interesting approach.

There is also a presentation from GDC 2010 about the visual FX in Brutal Legend, which is very good too.

Link: GDC Vault

Brutal Legend GDC 2010


>> As for how you should store the results I think using stream out to put it in a vertex buffer would work much better than using textures.


Ok, I'm having some trouble researching this - how do I update the vertex buffer from a shader in OpenGL? At best I can find examples that use a texture for the instance position and do an absolute transform on the current vertex. For a particle system I need to be able to do iterative transforms.

Edit: I'm guessing the term I'm looking for is transform feedback xD
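
For completeness, a rough sketch of that ping-pong update, assuming one point per particle and an update program whose outputs were declared with glTransformFeedbackVaryings before linking (SetupParticleAttribPointers is a made-up helper for the glVertexAttribPointer calls):

// Hypothetical update step: read particles from bufA, write the stepped
// versions into bufB via transform feedback, then swap for the next frame.
glUseProgram(updateProgram);
glEnable(GL_RASTERIZER_DISCARD);  // update only - no fragments needed

glBindBuffer(GL_ARRAY_BUFFER, bufA);  // source: last frame's particles
SetupParticleAttribPointers();

glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, bufB);  // destination

glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, particleCount);
glEndTransformFeedback();

glDisable(GL_RASTERIZER_DISCARD);
std::swap(bufA, bufB);  // bufB becomes next frame's source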

