
Epic Optimization for Particles


Muzzy A    737
Hey, I was sitting here fooling around with my particle manager for class before I went to class, and I started testing how many particles I could draw while keeping the FPS above 30.

My question is, would there be a way to get 100,000 particles moving around and drawing with the CPU and NOT the GPU at 30 fps? I know it varies from CPU to CPU, but I consider mine to be decently fast.

I'm currently rendering and updating 70,000 particles at 13-17 fps; my goal is to have 100,000 particles rendering above 30 fps.

I've found that I'm also limited by D3D's sprite manager, so would it be better for me to create vertex buffers instead?

[code]
// 70,000 particles - 13-17 fps

ID3DXSprite *spriteManager;

// Particle stuff (Vector3 is assumed convertible to D3DXVECTOR3 for the Draw call)
Vector3   Position;
Vector3   Velocity;
D3DXCOLOR Color;

// Draw - one ID3DXSprite::Draw call per particle
spriteManager->Draw(pTexture, NULL, &D3DXVECTOR3(0, 0, 0), &Position, Color);

// Update
void Update(float dt)
{
    Position += Velocity * dt;
}
[/code]

japro    887
It's most likely not the CPU performance that is the issue, but that you somehow have to "move" all that data to the GPU every frame, either by transferring a big buffer containing all the particles or by issuing absurd numbers of draw calls.

clb    2147
It's been several years since I used D3DXSprite. It does use some form of batching, and is likely not the slowest way to draw sprites. Your current performance sounds decent.

If you want the best control over how the CPU->GPU communication and drawing are done, you can try batching the objects manually instead of using D3DXSprite. Be sure to use a ring buffer of vertex buffers, update the data in a single lock/unlock write loop, and pay particular attention to not doing any slow CPU operations in the process. To get to 100k particles, you'll need a very optimized update inner loop that processes the particles in a coherent, data-cache-friendly fashion.
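
A minimal sketch of that kind of manual batching, under D3D9 assumptions (the vertex format and names like ParticleVertex and DrawParticleBatch are illustrative, and a single DISCARD-locked dynamic buffer stands in here for the ring of vertex buffers clb mentions):

[code]
#include <cstring>
#include <d3d9.h>

struct ParticleVertex
{
    float    x, y, z;   // position
    D3DCOLOR color;     // diffuse colour
};
#define PARTICLE_FVF (D3DFVF_XYZ | D3DFVF_DIFFUSE)

// The buffer is created once at startup (not per frame), e.g.:
// device->CreateVertexBuffer(MAX_PARTICLES * sizeof(ParticleVertex),
//     D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY | D3DUSAGE_POINTS,
//     PARTICLE_FVF, D3DPOOL_DEFAULT, &vb, NULL);

void DrawParticleBatch(IDirect3DDevice9 *device, IDirect3DVertexBuffer9 *vb,
                       const ParticleVertex *src, unsigned count)
{
    // One Lock/Unlock for the whole frame; DISCARD hands back a fresh region
    // so the GPU can keep reading last frame's data without a stall.
    void *dst = NULL;
    vb->Lock(0, (UINT)(count * sizeof(ParticleVertex)), &dst, D3DLOCK_DISCARD);
    memcpy(dst, src, count * sizeof(ParticleVertex));
    vb->Unlock();

    device->SetFVF(PARTICLE_FVF);
    device->SetStreamSource(0, vb, 0, sizeof(ParticleVertex));
    device->DrawPrimitive(D3DPT_POINTLIST, 0, count);  // one draw call for every particle
}
[/code]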

However, if you can, I would recommend investigating the option of doing pure GPU-side particle system approaches, since they'll be an order of magnitude faster than anything you can do on the CPU. Of course, the downside is that it's more difficult to be flexible, and some effects may be tricky to achieve in a GPU-based system (e.g. complex update functions, or interacting with scene geometry).

frob    44978
The biggest performance issue with particles is your CPU cache.

It looks like you are treating your particles as independent items. You cannot do that and still get good performance.

For cache purposes you want them to live in a contiguous array of structures, with each node being 64 bytes or a multiple of 64 bytes. They need to be processed sequentially, without any holes, and they need to be processed and drawn in batches, not individually.
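
As a rough illustration of that layout (the fields and names here are made up for the example, not frob's), every particle lives in one contiguous allocation, each node is padded to exactly one 64-byte cache line, and the update walks the array front to back:

[code]
struct Particle              // exactly 64 bytes = one cache line (4-byte float/unsigned)
{
    float    px, py, pz;     // position
    float    vx, vy, vz;     // velocity
    float    life;           // remaining lifetime
    float    size;
    unsigned color;
    unsigned pad[7];         // pad the struct out to 64 bytes
};

Particle particles[100000];  // one contiguous block, no per-particle allocations

void Update(unsigned liveCount, float dt)
{
    for (unsigned i = 0; i < liveCount; ++i)   // sequential pass, no holes
    {
        particles[i].px += particles[i].vx * dt;
        particles[i].py += particles[i].vy * dt;
        particles[i].pz += particles[i].vz * dt;
        particles[i].life -= dt;
    }
}
[/code]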

There are a few hundred other little details that improve performance, but keeping the particle data small and tight is necessary for good performance.

Dario Oliveri    290
In my engine 100,000 particles render decently at 110-120 fps (and I have a 3-year-old PC: a 2 GHz dual core, of which only one core is used for moving particles, and a Radeon HD 4570). Of course I'm just moving them around randomly; a more complex behaviour will have a major hit on performance.

Yes, cache is the bigger hit on performance. Your particle data should ideally be accessed as an array (well, in practice there are slightly faster solutions than an array, but that is already a good deal). I'm speaking of a C-style array (not std::vector or similar).

I'm of course using a VBO with the streaming hint.

There are several really good open-source particle engines which have good performance too.

mhagain    13430
I can hit 500,000 particles and still maintain 60 fps with my setup, so it's definitely possible. Moving the particles is just a moderately simple gravity/velocity equation that I can switch between running on the CPU or the GPU, and I'm not noticing much in the way of performance difference - the primary bottlenecks are most definitely buffer updates and fillrate.

The D3DX sprite interface is just a wrapper around a dynamic vertex buffer, so a naive switchover will just result in you rewriting ID3DXSprite in your own code. You need to tackle things a little more aggressively.

One huge win I got on dynamic buffer updates was to use instancing: I fill a buffer with single vertices as per-instance data, then expand and billboard each single vertex on the GPU using per-vertex data. Instancing has its own overhead for sure, but on balance this setup works well and comes out on the right side of the tradeoff.
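
For what it's worth, here is a rough D3D9-style sketch of that instancing setup (mhagain doesn't say which API he's on, and all names here are illustrative): stream 0 holds the four corners of a unit quad as per-vertex data, stream 1 holds one vertex per particle as per-instance data, and a vertex shader is assumed to expand and billboard each corner from the instance data. D3D10/11 would use DrawInstanced instead.

[code]
#include <d3d9.h>

// All buffers and the vertex declaration are assumed to have been created elsewhere.
void DrawParticlesInstanced(IDirect3DDevice9 *device,
                            IDirect3DVertexBuffer9 *quadCornerVB,   // 4 corners of a unit quad
                            IDirect3DVertexBuffer9 *particleVB,     // 1 vertex per particle
                            IDirect3DIndexBuffer9  *quadIB,         // 6 indices, two triangles
                            IDirect3DVertexDeclaration9 *decl,
                            UINT cornerStride, UINT particleStride,
                            UINT particleCount)
{
    // Per-vertex data, replicated across every instance.
    device->SetStreamSource(0, quadCornerVB, 0, cornerStride);
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | particleCount);

    // Per-instance data: one vertex per particle, rewritten each frame.
    device->SetStreamSource(1, particleVB, 0, particleStride);
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

    device->SetVertexDeclaration(decl);
    device->SetIndices(quadIB);
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);

    // Reset stream frequencies so later non-instanced draws behave normally.
    device->SetStreamSourceFreq(0, 1);
    device->SetStreamSourceFreq(1, 1);
}
[/code]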

Other things I explored included geometry shaders (don't bother, the overhead is too high) and pre-filling a static vertex buffer for each emitter type (but some emitter types use so few particles that it kills you on draw calls). I haven't yet experimented with a combination of instancing and pre-filled static buffers, but I do suspect that this would be as close to optimal as you're ever going to get with this particular bottleneck.

Madhed    4095
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similar)
[/quote]

Why is that? Iterating over a std::vector should be no slower than iterating over a C array. In release mode, that is.

_the_phantom_    11250
The trick with fast particle simulation on the CPU is basically to use SSE and to think about your data flow and layout.

[CODE]
struct Particle
{
    float x, y, z;
};
[/CODE]

That is basically the worst layout you could consider for this.

[CODE]
struct Emitter
{
    float *x;
    float *y;
    float *z;
};
[/CODE]

A layout like this, on the other hand, where each pointer points to a large array of one component, is the most SSE-friendly way of doing things.
(You'd need other data too; that should also be in separate arrays.)

The reason you want them in separate arrays is due to how SSE works; it multiplies/adds/etc. across registers rather than between components.

So, with that arrangement you could read in four 'x' and four 'x direction' chunks of data and add them together with ease; if you had them packed as per the first structure, then in order to use SSE you'd have to read in more data, massage it into an SSE-friendly format, do the maths and then sort it out again to write back to memory.
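
As a hypothetical illustration of that (the names are made up), with x and the x velocity held in separate 16-byte-aligned arrays, four particles get integrated per iteration:

[CODE]
#include <xmmintrin.h>

// x and vx are separate, 16-byte-aligned arrays; count is assumed to be
// padded to a multiple of four.
void UpdateX(float *x, const float *vx, unsigned count, float dt)
{
    const __m128 vdt = _mm_set1_ps(dt);
    for (unsigned i = 0; i < count; i += 4)
    {
        __m128 px = _mm_load_ps(&x[i]);            // four x positions
        __m128 pv = _mm_load_ps(&vx[i]);           // four x velocities
        px = _mm_add_ps(px, _mm_mul_ps(pv, vdt));  // x += vx * dt, four at a time
        _mm_store_ps(&x[i], px);
    }
}
// The same loop runs over y/vy and z/vz.
[/CODE]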

You also have to be aware of cache and how it interacts with memory and CPU prefetching.

Using this method of data layout, some SSE, and TBB, I have an old 2D (not fully optimised) particle simulator which, on my i7 920 using 8 threads, can simulate 100 emitters with 10,000 particles each in ~3.3 ms.

mhagain    13430
[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similar)
[/quote]

Why is that? Iterating over a std::vector should be no slower than iterating over a C array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.

Madhed    4095
[quote name='mhagain' timestamp='1339538107' post='4948643']
[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similar)
[/quote]

Why is that? Iterating over a std::vector should be no slower than iterating over a C array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.
[/quote]

But you'd have to allocate the C array too.

Promit    13246
IF used correctly, std::vector should show equivalent performance to raw arrays IF optimizations are enabled and IF you have a sane implementation. There are caveats here to be wary of, especially if you're cross-platform. But knocking std::vector out of your code is not an optimization strategy.

Muzzy A    737
std::vector is my fav =) except when it breaks in places that leave you clueless =\.

It depends on the array that you make. If you allocate new memory for it, then it's going to be just a bit faster than the vector, not really anything to notice. But an array on the stack will be even faster; not much faster, but enough to notice an FPS change. Either way, they're still faster.

Anyways.

I ran Intel Parallel Amplifier on it, and out of 22.7 seconds, 14 seconds were spent rendering the particles. The math and everything else were each under a second.

That's why I was asking about the ID3DXSprite class. There's not really much optimizing left for me to do in the math; I've hard-coded several things just to try to achieve 100k particles at 30 fps.

I am using a std::vector, but I really don't think that changing it to an array on the stack would help much either.

ApochPiQ    23065
First of all, a stack-allocated array has to fit in the stack space; 100k particles is going to take a chunk of stack that you may well not want to be giving up.

Secondly, allocating an array on the heap and allocating on the stack will [i]emphatically not[/i] make any speed difference to accessing the memory. What [b]will[/b] make a difference is cache behavior. Your stack space may be in cache... and then again, if you're bloating your stack with particle data, it might not. Memory access latency on modern hardware is a complicated beast and you can't just assume that "oh hey stack is faster than freestore" because [i]that is not necessarily true[/i]. In fact, it should be trivial to construct an artificial benchmark where prefetched "heap" pages are accessible faster than stack pages due to the behavior of virtual memory paging and cache locality.

Third, allocating memory with "new foo[]" versus allocating the same memory in a std::vector will make [b]zero[/b] difference if your compiler's optimization passes are worth anything. Not a tiny bit, not depending on the array - zero, period. If you reserve() your vector correctly instead of letting it grow/copy up to full size (i.e. if you make a fair comparison) then basically all it does is a call to new foo[] under the covers. There is no [i]performance[/i] reason whatsoever to eschew std::vector, and stating that it makes any practical difference whatsoever is liable to mislead people into thinking that they shouldn't use it because "OMFG mai codez must be teh fastar!"
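
To make that concrete, a tiny sketch (the Particle type and Fill function are just stand-ins for this example):

[code]
#include <cstddef>
#include <vector>

struct Particle { float x, y, z; };             // minimal stand-in type

void Fill(std::size_t maxParticles)
{
    std::vector<Particle> particles;
    particles.reserve(maxParticles);             // one allocation up front, no growth copies
    for (std::size_t i = 0; i < maxParticles; ++i)
        particles.push_back(Particle());          // never reallocates inside the loop

    Particle *raw = new Particle[maxParticles];   // the equivalent single allocation
    delete[] raw;
}
[/code]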

Lastly, "I noticed an FPS change" does not constitute admissible data for performance measurement. I can run the [b]exact same code[/b] a dozen times on the [b]exact same hardware[/b] and get a deviation of framerate that is noticeable. Just comparing a couple runs of one implementation to a couple runs of something else doesn't give you a valid performance picture.

Krohm    5031
[quote name='DemonRad' timestamp='1339534091' post='4948625']
In my engine 100,000 particles render decently at 110-120 fps (and I have a 3-year-old PC: a 2 GHz dual core, of which only one core is used for moving particles, and a Radeon HD 4570). Of course I'm just moving them around randomly; a more complex behaviour will have a major hit on performance.
[/quote]I suggest [url="http://www.2ld.de/gdc2004/"]Building a million particle system by Lutz Latta[/url]. If memory serves, it used to do 100k particles on a GeForce 6600, albeit the performance was quite low - in the range of 20 fps, I believe.

[quote name='ApochPiQ' timestamp='1339554243' post='4948691']
Third, allocating memory with "new foo[]" versus allocating the same memory in a std::vector will make zero difference if your compiler's optimization passes are worth anything. Not a tiny bit, not depending on the array - zero, period. If you reserve() your vector correctly instead of letting it grow/copy up to full size (i.e. if you make a fair comparison) then basically all it does is a call to new foo[] under the covers. There is no performance reason whatsoever to eschew std::vector, and stating that it makes any practical difference whatsoever is liable to mislead people into thinking that they shouldn't use it because "OMFG mai codez must be teh fastar!"[/quote]I think I am misunderstanding everything here.
Perhaps it's just me, but it was my understanding that [font=courier new,courier,monospace]std::vector[/font] will not use GPU memory. I think the main point of mhagain was to use GPU memory directly through maps. Now, of course we can write an allocator to deal with the reallocations ourselves... besides the fact that that's not how vertex buffers are supposed to be used, especially when dealing with particle systems IMHO, once we have our custom allocator, can we still talk about a [font=courier new,courier,monospace]std::vector[/font]?
I'm afraid not.
Or perhaps we're suggesting computing everything in system RAM and then copying to GPU memory?
Please explain this clearly so I can understand.
Because perhaps it's just me, but I still have difficulty putting [font=courier new,courier,monospace]std::vector[/font] and GPU memory together.

Madhed    4095
Hm, seems I derailed the thread a little, sorry for that. I just tend to get defensive when someone recommends C-style arrays over std::vector for performance reasons. And of course Promit is right too: there are various implementations of the standard library, some pretty buggy/slow, and there are also platforms where one simply cannot use it. For most tasks on a standard PC, however, I would say that one should use the C++ standard library extensively.

[b]@Krohm[/b]
I guess you are talking about FooBuffer->Map()'ing in D3D10/11? Of course, you get a void* pointer from that function and you will have to copy data into/out of that memory region. I didn't mean to construct a vector from the returned pointer. If you mean something else, would you mind explaining?

Muzzy A    737
[quote name='ApochPiQ' timestamp='1339554243' post='4948691']
First of all, a stack-allocated array has to fit in the stack space; 100k particles is going to take a chunk of stack that you may well not want to be giving up.

Secondly, allocating an array on the heap and allocating on the stack will [i]emphatically not[/i] make any speed difference to accessing the memory. What [b]will[/b] make a difference is cache behavior. Your stack space may be in cache... and then again, if you're bloating your stack with particle data, it might not. Memory access latency on modern hardware is a complicated beast and you can't just assume that "oh hey stack is faster than freestore" because [i]that is not necessarily true[/i]. In fact, it should be trivial to construct an artificial benchmark where prefetched "heap" pages are accessible faster than stack pages due to the behavior of virtual memory paging and cache locality.

Third, allocating memory with "new foo[]" versus allocating the same memory in a std::vector will make [b]zero[/b] difference if your compiler's optimization passes are worth anything. Not a tiny bit, not depending on the array - zero, period. If you reserve() your vector correctly instead of letting it grow/copy up to full size (i.e. if you make a fair comparison) then basically all it does is a call to new foo[] under the covers. There is no [i]performance[/i] reason whatsoever to eschew std::vector, and stating that it makes any practical difference whatsoever is liable to mislead people into thinking that they shouldn't use it because "OMFG mai codez must be teh fastar!"

Lastly, "I noticed an FPS change" does not constitute admissible data for performance measurement. I can run the [b]exact same code[/b] a dozen times on the [b]exact same hardware[/b] and get a deviation of framerate that is noticeable. Just comparing a couple runs of one implementation to a couple runs of something else doesn't give you a valid performance picture.
[/quote]

I apologize, I haven't learned that in school yet. I was just taught that stack memory is faster than heap memory. I haven't gotten to the operating systems class yet, so I had no idea. You made me feel like an idiot lol.


Thanks for the link Krohm, I hope it's as helpful as it looks. I'm not giving up till I get that 100k mark lol. If I don't, then I'm pretty sure the game I'm working on is going to be pretty laggy.

Krohm    5031
[quote name='Madhed' timestamp='1339590825' post='4948792']I guess you are talking about FooBuffer->Map()'ing in D3D10/11? Of course, you get a void* pointer from that function and you will have to copy data into/out of that memory region. I didn't mean to construct a vector from the returned pointer. If you mean something else, would you mind explaining?[/quote]My only goal was to cool down the [font=courier new,courier,monospace]std::vector[/font] performance debate. While I trust it in general, I see no way to mix GPU memory with STL containers. It appears to me we either put the data into the vertex buffer from scratch, or we use std::vector and then copy to the VB. That's it.
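
For what it's worth, a small sketch of that second option under D3D9 assumptions (the vertex type and function names are illustrative): simulate into a std::vector in system RAM, then copy it into the locked vertex buffer once per frame.

[code]
#include <cstring>
#include <vector>
#include <d3d9.h>

struct ParticleVertex { float x, y, z; D3DCOLOR color; };

void UploadParticles(IDirect3DVertexBuffer9 *vb,
                     const std::vector<ParticleVertex> &cpuSide)
{
    if (cpuSide.empty())
        return;
    const UINT bytes = (UINT)(cpuSide.size() * sizeof(ParticleVertex));
    void *dst = NULL;
    vb->Lock(0, bytes, &dst, D3DLOCK_DISCARD);   // GPU memory stays outside the vector
    memcpy(dst, &cpuSide[0], bytes);             // system-RAM data copied in once per frame
    vb->Unlock();
}
[/code]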

Muzzy A    737
OK, I managed to get it to 85,000 particles at 25 fps. I think that's about as good as I'm going to get without using the GPU. I wish the CPU were just like the GPU; what's wrong with making computers have a GPU rather than a CPU? Things would be much faster.

_the_phantom_    11250
No they wouldn't.

A GPU is very good at doing a lot of tasks at once when you can generate enough work to hide the latency of memory accesses and jumps. They are high-throughput, high-latency devices.

However, not all workloads map well to a GPU, which is where the CPU comes in, with its advanced caches, branch prediction and the ability to execute work out of order. CPUs are low-latency, low-throughput devices.

Or to put it another way: if you tried to run Word on a GPU, it would run horribly compared to a CPU, because the workload isn't suited to the device.

This is why AMD, Intel and ARM are putting so much effort into devices which combine a CPU and a GPU on one die: so that workloads can be placed where they make sense.

Despite what nVidia would probably like you to believe, not every workload can be pushed massively parallel and run with a 10,000x speed-up on a GPU; CPUs still very much have their place and will do for some time yet.

Muzzy A    737
Well, I can't wait until the CPU is much faster than it is now and works with the GPU much more than it does now. I want some performance :P

