
Epic Optimization for Particles

General and Gameplay Programming

Started by Muzzy A June 12, 2012 03:03 PM

18 comments, last by Muzzy A 11 years, 10 months ago

Muzzy A

737

Author

June 12, 2012 03:03 PM

Hey, I was sitting here fooling around with my particle manager for class before I went to class, and I started testing how many particles I could draw while keeping the fps above 30.

My question is, would there be a way to get 100,000 particles moving around and drawing with a CPU and NOT a GPU at 30 fps? I know it varies from CPU to CPU, but I consider mine to be decently fast.

I'm currently rendering and updating 70,000 particles at 13 - 17 fps. My goal is to have 100,000 particles rendering above 30 fps.

I've found that I'm also limited by D3D's sprite manager. So would it be better for me to create vertex buffers instead?



// 70,000 particles - 13-17 fps

ID3DXSprite *spriteManager;

// Particle stuff (per-particle members)
Vector3 Position;
Vector3 Velocity;
D3DXCOLOR Color;

// Draw (one call per particle)
spriteManager->Draw(pTexture,NULL,&D3DXVECTOR3(0,0,0),&Position,Color);

// Update (one call per particle)
void Update(float dt)
{
	Position += Velocity * dt;
}

japro

887

June 12, 2012 03:18 PM

It's most likely not the CPU performance that is the issue, but that you somehow have to "move" all that data to the GPU every frame, either by transferring a big buffer containing all the particles or by issuing absurd amounts of draw calls.


clb

2,152

June 12, 2012 04:17 PM

It's been several years since I used D3DXSprite. It does use some form of batching, and is likely not the slowest way to draw sprites. Your current performance sounds decent.

If you want the best control over how the CPU->GPU communication and drawing is done, you can try batching the objects manually instead of using D3DXSprite. Be sure to use a ring buffer of vertex buffers, update the data in a single lock/unlock write loop, and pay particular attention to not doing any slow CPU operations. To get to 100k particles, you'll need a very optimized update inner loop that processes the particles in a coherent, data-cache-friendly fashion.
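The write pattern described above can be sketched without any Direct3D code. This is only an illustration under stated assumptions: `ParticleVertex`, `VertexRing`, and `writeBatch` are hypothetical names, a `std::vector` stands in for the locked vertex buffer memory, and a single buffer with a moving write cursor is used instead of several separate buffers. In D3D9 the two cases would roughly correspond to locking with `D3DLOCK_NOOVERWRITE` (append) and `D3DLOCK_DISCARD` (wrap).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout; not taken from the thread.
struct ParticleVertex {
    float x, y, z;
    unsigned color;
};

// Stand-in for a dynamic vertex buffer used as a ring.
// The std::vector plays the role of the memory returned by a lock.
class VertexRing {
public:
    explicit VertexRing(std::size_t capacity) : storage_(capacity), cursor_(0) {}

    // Reserve room for `count` vertices and return the start offset.
    // `wrapped` is true when the cursor had to restart at 0
    // (the case where a real buffer would be discarded).
    std::size_t reserve(std::size_t count, bool& wrapped) {
        assert(count <= storage_.size());
        wrapped = cursor_ + count > storage_.size();
        if (wrapped) cursor_ = 0;
        const std::size_t start = cursor_;
        cursor_ += count;
        return start;
    }

    ParticleVertex* data() { return storage_.data(); }

private:
    std::vector<ParticleVertex> storage_;
    std::size_t cursor_;
};

// Write a whole batch in one tight loop: one reserve, one pass over the
// data, and no per-particle API calls.
std::size_t writeBatch(VertexRing& ring, const float* xs, const float* ys,
                       std::size_t n, bool& wrapped) {
    const std::size_t start = ring.reserve(n, wrapped);
    ParticleVertex* out = ring.data() + start;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = ParticleVertex{xs[i], ys[i], 0.0f, 0xFFFFFFFFu};
    }
    return start;
}
```

The returned offset is what a draw call would use as its start vertex; the point of the sketch is only that each batch costs one reservation and one linear write.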

However, if you can, I would recommend investigating the option of doing pure GPU side particle system approaches, since they'll be an order of magnitude faster than anything you can do on the CPU. Of course, the downside is that it's more difficult to be flexible and some effects may be tricky to achieve in a GPU-based system (e.g. complex update functions, or interacting with scene geometry).

frob

46,221

June 12, 2012 05:02 PM

The biggest performance issue with particles is your CPU cache.

It looks like you are treating your particles as independent items. You cannot do that and still get good performance.

For cache purposes you want them to live in a contiguous array of structures, with each node being 64 bytes or a multiple of 64 bytes. They need to be processed sequentially without any holes. They need to be processed and drawn in batches, not individually.

There are a few hundred other little details that improve performance, but keeping the particle data small and tight is necessary for good performance.
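A minimal sketch of that layout, under assumptions that are not from the thread: the `Particle` fields and the `ParticlePool` name are hypothetical, and swap-and-pop removal is just one common way to keep the array free of holes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record, padded by hand to exactly 64 bytes.
struct Particle {
    float px, py, pz;   // position
    float vx, vy, vz;   // velocity
    float r, g, b, a;   // color
    float life;         // seconds remaining
    float pad[5];       // explicit padding: 16 floats = 64 bytes
};
static_assert(sizeof(Particle) == 64, "Particle should be 64 bytes");

struct ParticlePool {
    std::vector<Particle> items;  // live particles only, stored contiguously

    // One sequential pass over the array. Dead particles are replaced by
    // the last live one (swap-and-pop), so the array never has holes.
    void update(float dt) {
        std::size_t i = 0;
        while (i < items.size()) {
            Particle& p = items[i];
            p.life -= dt;
            if (p.life <= 0.0f) {
                items[i] = items.back();
                items.pop_back();
                continue;  // re-check slot i, which now holds another particle
            }
            p.px += p.vx * dt;
            p.py += p.vy * dt;
            p.pz += p.vz * dt;
            ++i;
        }
    }
};
```

Swap-and-pop changes the order of the particles, which is usually harmless for additive effects but would matter if they have to be drawn in a sorted order.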

Dario Oliveri

290

June 12, 2012 08:48 PM

In my engine, 100,000 particles render decently at 110-120 fps (and I have a 3-year-old PC: a 2 GHz dual core, with only one core used for moving particles, and a Radeon HD4570). Of course, I'm just moving them around randomly. More complex behaviour will have a major hit on performance.

Yes, cache is the bigger hit on performance. Your particle data should ideally be accessed as an array (well, in practice there are slightly faster solutions than an array, but that is already a good deal). I'm speaking of a C-style array (not std::vector or similar).

I'm using a VBO with a streaming hint, of course.

There are several really good open-source particle engines which have good performance too.

Peace and love, now I understand really what it means! Guardian Angels exist! Thanks!

21st Century Moose

13,459

June 12, 2012 09:04 PM

I can hit 500,000 particles and still maintain 60fps with my setup so it's definitely possible. Moving particles is just a moderately simple gravity/velocity equation that I can switch between running on the CPU or GPU, but am not noticing much in the way of performance difference - the primary bottlenecks are most definitely buffer updates and fillrate.

The D3DX sprite interface is just a wrapper around a dynamic vertex buffer, so a naive switchover will just result in you rewriting ID3DXSprite in your own code. You need to tackle things a little more aggressively.

One huge win I got on dynamic buffer updates was to use instancing - I fill a buffer with single vertexes as per-instance data, then expand and billboard each single vertex on the GPU using per-vertex data. Instancing has its own overhead for sure, but on balance this setup works well and comes out on the right side of the tradeoff.

Other things I explored included geometry shaders (don't bother, the overhead is too high) and pre-filling a static vertex buffer for each emitter type (but some emitter types use so few particles that it kills you on draw calls). I haven't yet experimented with a combination of instancing and pre-filled static buffers, but I do suspect that this would be as close to optimal as you're ever going to get with this particular bottleneck.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Madhed

4,095

June 12, 2012 09:20 PM

[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similiars)
[/quote]

Why that? Iterating over a std::vector should be no slower than iterating over a C array. In release mode, that is.

_the_phantom_

11,263

June 12, 2012 09:26 PM

The trick with fast particle simulation on the CPU is basically to use SSE and think about your data flow and layout.



struct Particle
{
	float x, y, z;
};

That is basically the worst way to even consider doing it.



struct Emitter
{
	float * x;
	float * y;
	float * z;
};

On the other hand, this layout, where each pointer points to a large array of components, is the most SSE-friendly way of doing things.
(You'd need other data too; it should also be in separate arrays.)

The reason you want them in separate arrays is due to how SSE works; it multiplies/adds/etc. across registers rather than between components.

So, with that arrangement you could read in four 'x' and four 'x direction' chunks of data and add them together with ease; if you had them packed as per the above structure then in order to use SSE you'd have to read in more data, massage it into the right SSE friendly format, do the maths and then sort it out again to write back to memory.

You also have to be aware of cache and how it interacts with memory and CPU prefetching.

Using this method of data layout, some SSE and TBB, I have an old 2D (not fully optimised) particle simulator which, on my i7 920 using 8 threads, can simulate 100 emitters with 10,000 particles each in ~3.3ms.
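As a rough illustration of the separate-arrays layout, here is a small sketch of a position update that handles four particles per SSE instruction. It is not the simulator mentioned above: `EmitterSoA` and `integrate` are hypothetical names, `std::vector` storage is an assumption made for brevity, and unaligned loads are used because `std::vector<float>` does not guarantee 16-byte alignment. The code is x86/x64 only and single-threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical emitter: one array per component (structure of arrays).
struct EmitterSoA {
    std::vector<float> x, y;    // positions
    std::vector<float> vx, vy;  // velocities
};

// position += velocity * dt, four particles at a time.
void integrate(EmitterSoA& e, float dt) {
    const std::size_t n = e.x.size();
    assert(e.y.size() == n && e.vx.size() == n && e.vy.size() == n);

    const __m128 vdt = _mm_set1_ps(dt);  // dt broadcast into all four lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Four consecutive x values and four consecutive x velocities
        // load straight into registers; no shuffling is needed.
        __m128 px = _mm_loadu_ps(&e.x[i]);
        __m128 py = _mm_loadu_ps(&e.y[i]);
        const __m128 vx = _mm_loadu_ps(&e.vx[i]);
        const __m128 vy = _mm_loadu_ps(&e.vy[i]);
        px = _mm_add_ps(px, _mm_mul_ps(vx, vdt));
        py = _mm_add_ps(py, _mm_mul_ps(vy, vdt));
        _mm_storeu_ps(&e.x[i], px);
        _mm_storeu_ps(&e.y[i], py);
    }
    // Scalar tail for the last n % 4 particles.
    for (; i < n; ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
    }
}
```

With 16-byte-aligned storage the loads and stores could become `_mm_load_ps` and `_mm_store_ps`. With the packed `x, y, z` struct, the same loop would first have to shuffle components into this shape, which is the extra work described above.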

21st Century Moose

13,459

June 12, 2012 09:55 PM

[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similiars)
[/quote]

Why that? iterating over a std::vector should be no slower than to iterate over a c-array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Madhed

4,095

June 12, 2012 10:13 PM

[quote name='21st Century Moose']
[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similiars)
[/quote]

Why that? iterating over a std::vector should be no slower than to iterate over a c-array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.
[/quote]

But you'd have to allocate the C array too.
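One way to read the allocation point: either container allocates once if it is sized up front, and a `std::vector` only reallocates when it grows past its capacity. The sketch below assumes a fixed particle budget; `ParticleBuffer` and its members are hypothetical names, not anything posted in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

// Hypothetical fixed-budget buffer: the only allocation happens in the
// constructor, just as it would for a C array of the same size.
class ParticleBuffer {
public:
    explicit ParticleBuffer(std::size_t maxParticles) : max_(maxParticles) {
        particles_.reserve(maxParticles);
    }

    // Refuses to grow past the budget, so push_back never reallocates.
    bool spawn(const Particle& p) {
        if (particles_.size() == max_) return false;
        particles_.push_back(p);
        return true;
    }

    // clear() destroys the elements but keeps the reserved storage.
    void clear() { particles_.clear(); }

    const Particle* data() const { return particles_.data(); }
    std::size_t size() const { return particles_.size(); }

private:
    std::vector<Particle> particles_;
    std::size_t max_;
};
```

Used like this, the pointer returned by `data()` stays the same for the lifetime of the buffer, and the storage can be iterated or copied exactly like a C array.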
