Epic Optimization for Particles

Started by
18 comments, last by Muzzy A 11 years, 10 months ago
Hey, i was setting here fooling around with my particle manager for class before i went to class and I started testing how many particles I could Draw while keeping fps above 30.

My question is, would there be a way to get 100,000 particles moving around and drawing with a CPU and NOT a GPU at 30 fps? I know it varies from every CPU, but I consider mine to be decently fast.

I'm currently rending and updating 70,000 particles at 13 - 17 fps, my goal is to have 100,000 particles rendering above 30 fps.

I've found that i'm also limited by D3D's sprite manager. So would it be better for me to create vertex buffers instead?


// 70,000 particles - 13-17 fps

ID3DXSprite *spriteManager;

// Particle stuff
Vector3 Position;
Vector3 Velocity;
D3DXCOLOR Color;

// Draw
spriteManager->Draw(pTexture,NULL,&D3DXVECTOR3(0,0,0),&Position,Color);

Update(float dt)
{
Position += Velocity * dt;
}

Advertisement
It's most likely not the cpu performance that is the issue but that you somehow have to "move" all that data to the gpu every frame
either by transferring a big buffer containing all the particles or by having absurd amounts of draw calls.
It's been several years since I used D3DXSprite. It does use some form of batching, and is likely not the slowest way to draw sprites. Your current performance sounds decent.

If you want the best control of how the CPU->GPU communication and drawing is done, you can try batching the objects manually instead of using D3DXSprite. Be sure to use a ring buffer of vertex buffers, update the data in a single lock/unlock write loop, and pay particular attention to not doing any slow CPU operations. To get to 100k particles, you'll need to have a very optimized update inner loop that processes the particles in a good data cache optimized coherent fashion.

However, if you can, I would recommend investigating the option of doing pure GPU side particle system approaches, since they'll be an order of magnitude faster than anything you can do on the CPU. Of course, the downside is that it's more difficult to be flexible and some effects may be tricky to achieve in a GPU-based system (e.g. complex update functions, or interacting with scene geometry).
The biggest performance issue with particles is your cpu cache.

It looks like you are treating your particles as independent items. You cannot do that and still get good performance.

For cache purposes you want them to live in a continuous array of structures, with each node being 64 bytes or a multiple of 64 bytes. They need to be processed sequentially without any holes. They need to be processed and drawn in batch, not individually.

There are a few hundred other little details to improve performance, but keep the particle data small and tight are necessary for good performance.
In my engine 100.000 particles are rendering decently at 110-120 fps (and I have a 3- years old pc 2GHZ double core.. of course only 1 core is used for moving particles and a Radeon HD4570. ). Of course i'm just moving them around randomly. A more complex behaviour will have major hit on performance.

yes cache is bigger hit on performance. you particle data should be accessed ideally as an array (well in practice there are slightly faster solutions than an array but that is already a good deal). I'm speaking of a C-style array (not std::vector or similiars)

I'm using of course a VBO with streaming hint.

There are several really good opensource particle engines wich have good performance too.

Peace and love, now I understand really what it means! Guardian Angels exist! Thanks!

I can hit 500,000 particles and still maintain 60fps with my setup so it's definitely possible. Moving particles is just a moderately simple gravity/velocity equation that I can switch between running on the CPU or GPU, but am not noticing much in the way of performance difference - the primary bottlenecks are most definitely buffer updates and fillrate.

The D3DX sprite interface is just a wrapper around a dynamic vertex buffer so a naive switchover will just result in you rewriting ID3DXSprite in your own code. You need to tackle things a little more agressively.

One huge win I got on dynamic buffer updates was to use instancing - I fill a buffer with single vertexes as per-instance data then expand and billboard each single vertex on the GPU using per-vertex data. Instancing has it's own overhead for sure, but on balance this setup works well and comes out on the right side of the tradeoff.

Other things I explored included geometry shaders (don't bother, the overhead is too high) and pre-filling a static vertex buffer for each emitter type (but some emitter types use so few particles that it kills you on draw calls). I haven't yet experimented with a combination of instancing and pre-filled static buffers, but I do suspect that this would be as close to optimal as you're ever going to get with this particular bottleneck.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


I'm speaking of a C-style array (not std::vector or similiars)


Why that? iterating over a std::vector should be no slower than to iterate over a c-array. In release mode, that is.
The trick with fast particle simulation on the CPU is basically use SSE and thinking about your data flow and layout.


struct Particle
{
float x, y, z;
}


That is basically the worst to even consider doing it.


struct Emitter
{
float * x;
float * y;
float * z;
}


On the other hand, where each pointer points to large array of components is the most SSE friendly way of doing things.
(You'd need other data too, they should also be in seperate arrays).

The reason you want them in seperate arrays is due to how SSE works; it multiples/adds/etc across registers rather than between components.

So, with that arrangement you could read in four 'x' and four 'x direction' chunks of data and add them together with ease; if you had them packed as per the above structure then in order to use SSE you'd have to read in more data, massage it into the right SSE friendly format, do the maths and then sort it out again to write back to memory.

You also have to be aware of cache and how it interacts with memory and CPU prefetching.

Using this method of data layout, some SSE and TTB I have an old 2D (not fully optimised) particle simulator which, on my i7 920 using 8 threads, can simulate 100 emitters with 10,000 particles each in ~3.3ms.

[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similiars)


Why that? iterating over a std::vector should be no slower than to iterate over a c-array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similiars)


Why that? iterating over a std::vector should be no slower than to iterate over a c-array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.
[/quote]

But you'd have to allocate the c-array too

This topic is closed to new replies.

Advertisement