Muzzy A

Epic Optimization for Particles

19 posts in this topic

Hey, I was sitting here fooling around with my particle manager before class, and I started testing how many particles I could draw while keeping the FPS above 30.

My question is: would there be a way to get 100,000 particles moving around and drawing on the CPU, NOT the GPU, at 30 fps? I know it varies from CPU to CPU, but I consider mine decently fast.

I'm currently rendering and updating 70,000 particles at 13-17 fps; my goal is to have 100,000 particles rendering above 30 fps.

I've also found that I'm limited by D3DX's sprite manager. So would it be better for me to create vertex buffers instead?

[code]
// 70,000 particles - 13-17 fps

ID3DXSprite *spriteManager;

// Per-particle data
Vector3   Position;
Vector3   Velocity;
D3DXCOLOR Color;

// Draw one particle through the D3DX sprite interface (pTexture is the shared particle texture)
spriteManager->Draw(pTexture, NULL, &D3DXVECTOR3(0, 0, 0), &Position, Color);

// Update one particle
void Update(float dt)
{
    Position += Velocity * dt;
}
[/code]
It's most likely not CPU performance that's the issue, but that you somehow have to "move" all that data to the GPU every frame, either by transferring one big buffer containing all the particles or by issuing an absurd number of draw calls.
The biggest performance issue with particles is your CPU cache.

It looks like you are treating your particles as independent items. You cannot do that and still get good performance.

For cache purposes you want them to live in a contiguous array of structures, with each node being 64 bytes or a multiple of 64 bytes. They need to be processed sequentially, without any holes, and they need to be processed and drawn in batches, not individually.

There are a few hundred other little details that improve performance, but keeping the particle data small and tight is necessary for good performance.
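As a minimal sketch of that layout (the field names and the 64-byte budget here are just illustrative, not taken from the poster's code):

[CODE]
#include <vector>

// One particle packed into exactly 64 bytes (a common cache-line size),
// so a sequential pass touches whole cache lines and nothing else.
struct Particle
{
    float px, py, pz;      // position
    float vx, vy, vz;      // velocity
    float r, g, b, a;      // colour
    float age, lifetime;
    float pad[4];          // pad out to 64 bytes
};
static_assert(sizeof(Particle) == 64, "keep particles cache-line sized");

// All particles live in one contiguous array and are updated in a single
// sequential pass; the whole array is then handed to the renderer as one batch.
void UpdateAll(std::vector<Particle>& particles, float dt)
{
    for (Particle& p : particles)
    {
        p.px += p.vx * dt;
        p.py += p.vy * dt;
        p.pz += p.vz * dt;
        p.age += dt;
    }
}
[/CODE]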
In my engine, 100,000 particles render decently at 110-120 fps (and that's on a 3-year-old PC: a 2 GHz dual core, with only one core used for moving the particles, and a Radeon HD 4570). Of course, I'm just moving them around randomly; more complex behaviour will have a major hit on performance.

Yes, the cache is the bigger hit on performance. Your particle data should ideally be accessed as an array (in practice there are slightly faster layouts than a plain array, but that is already a good deal). I'm speaking of a C-style array (not std::vector or similar).

I'm using, of course, a VBO with a streaming hint.
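For reference, the streaming-VBO pattern being described usually looks something like this in OpenGL (the buffer, struct, and variable names here are placeholders, not the poster's code):

[CODE]
// Upload this frame's particles into a streaming VBO and draw them.
void DrawParticles(GLuint particleVBO, const ParticleVertex* vertexData,
                   int liveParticles, int maxParticles)
{
    glBindBuffer(GL_ARRAY_BUFFER, particleVBO);
    // Orphan the buffer: GL_STREAM_DRAW hints the contents are rewritten every frame,
    // and passing NULL lets the driver hand back fresh storage instead of stalling.
    glBufferData(GL_ARRAY_BUFFER, maxParticles * sizeof(ParticleVertex), NULL, GL_STREAM_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, liveParticles * sizeof(ParticleVertex), vertexData);
    glDrawArrays(GL_POINTS, 0, liveParticles);   // drawn as points just for the sketch
}
[/CODE]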

There are several really good open-source particle engines which have good performance too.
I can hit 500,000 particles and still maintain 60 fps with my setup, so it's definitely possible. Moving particles is just a moderately simple gravity/velocity equation that I can switch between running on the CPU or GPU, but I'm not noticing much in the way of performance difference - the primary bottlenecks are most definitely buffer updates and fill rate.

The D3DX sprite interface is just a wrapper around a dynamic vertex buffer, so a naive switchover will just result in you rewriting ID3DXSprite in your own code. You need to tackle things a little more aggressively.

One huge win I got on dynamic buffer updates was to use instancing - I fill a buffer with single vertices as per-instance data, then expand and billboard each single vertex on the GPU using per-vertex data. Instancing has its own overhead for sure, but on balance this setup works well and comes out on the right side of the tradeoff.

Other things I explored included geometry shaders (don't bother, the overhead is too high) and pre-filling a static vertex buffer for each emitter type (but some emitter types use so few particles that it kills you on draw calls). I haven't yet experimented with a combination of instancing and pre-filled static buffers, but I do suspect that this would be as close to optimal as you're ever going to get with this particular bottleneck.
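A rough sketch of how that instanced setup can look in D3D9 (QuadVertex, ParticleInstance, and the buffer/declaration names are hypothetical; the expansion and billboarding happen in the vertex shader):

[CODE]
// Draws particleCount billboards with one call. The vertex shader reads the quad
// corner from stream 0 and the particle position/colour/size from stream 1.
void DrawParticlesInstanced(IDirect3DDevice9* device,
                            IDirect3DVertexBuffer9* quadVB,      // 4 corner vertices
                            IDirect3DIndexBuffer9* quadIB,       // 6 indices (2 triangles)
                            IDirect3DVertexBuffer9* particleVB,  // one vertex per particle
                            IDirect3DVertexDeclaration9* particleDecl,
                            UINT particleCount)
{
    device->SetVertexDeclaration(particleDecl);

    device->SetStreamSource(0, quadVB, 0, sizeof(QuadVertex));
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | particleCount);

    device->SetStreamSource(1, particleVB, 0, sizeof(ParticleInstance));
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

    device->SetIndices(quadIB);
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);

    // Reset the stream frequencies so later draws behave normally.
    device->SetStreamSourceFreq(0, 1);
    device->SetStreamSourceFreq(1, 1);
}
[/CODE]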
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similar).
[/quote]

Why is that? Iterating over a std::vector should be no slower than iterating over a C-array. In release mode, that is.
The trick to fast particle simulation on the CPU is basically to use SSE and to think about your data flow and layout.

[CODE]
struct Particle
{
    float x, y, z;
};
[/CODE]

That is basically the worst way to even consider doing it.

[CODE]
struct Emitter
{
    float* x;
    float* y;
    float* z;
};
[/CODE]

On the other hand, this layout, where each pointer points to a large array of components, is the most SSE-friendly way of doing things. (You'd need other data too; it should also be in separate arrays.)

The reason you want them in separate arrays is due to how SSE works: it multiplies/adds/etc. component-wise across registers rather than horizontally between the components of one register.

So, with that arrangement you can read in four 'x' and four 'x direction' chunks of data and add them together with ease; if you had them packed as in the first struct, then in order to use SSE you'd have to read in more data, massage it into an SSE-friendly format, do the maths, and then shuffle it back out again to write it to memory.
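As a minimal illustration of that four-at-a-time update (the function and array names are made up; count is assumed to be a multiple of 4 and the arrays 16-byte aligned):

[CODE]
#include <cstddef>
#include <xmmintrin.h>

// SoA update for the x component: x[i] += vx[i] * dt, four particles per iteration.
void UpdateX(float* x, const float* vx, float dt, size_t count)
{
    const __m128 vdt = _mm_set1_ps(dt);
    for (size_t i = 0; i < count; i += 4)
    {
        __m128 pos = _mm_load_ps(&x[i]);      // four x positions
        __m128 vel = _mm_load_ps(&vx[i]);     // four x velocities
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, vdt));
        _mm_store_ps(&x[i], pos);
    }
}
// The same loop is repeated for the y and z arrays.
[/CODE]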

You also have to be aware of cache and how it interacts with memory and CPU prefetching.

Using this method of data layout, some SSE, and TBB, I have an old (not fully optimised) 2D particle simulator which, on my i7 920 using 8 threads, can simulate 100 emitters with 10,000 particles each in ~3.3 ms.
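A sketch of how that per-emitter work might be spread across threads with TBB (UpdateEmitterSSE stands in for the SSE loop above and is not from the original post):

[CODE]
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Each emitter owns its own SoA arrays, so emitters can be updated independently.
void UpdateAllEmitters(Emitter* emitters, size_t emitterCount, float dt)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, emitterCount),
        [=](const tbb::blocked_range<size_t>& range)
        {
            for (size_t i = range.begin(); i != range.end(); ++i)
                UpdateEmitterSSE(emitters[i], dt);   // SSE loop over that emitter's arrays
        });
}
[/CODE]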
[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similar).
[/quote]

Why is that? Iterating over a std::vector should be no slower than iterating over a C-array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.
[quote name='mhagain' timestamp='1339538107' post='4948643']
[quote name='Madhed' timestamp='1339536018' post='4948632']
[quote name='DemonRad' timestamp='1339534091' post='4948625']
I'm speaking of a C-style array (not std::vector or similar).
[/quote]

Why is that? Iterating over a std::vector should be no slower than iterating over a C-array. In release mode, that is.
[/quote]

Iterating isn't the problem - allocating is.
[/quote]

But you'd have to allocate the C-array too.
IF used correctly, std::vector should show performance equivalent to raw arrays, IF optimizations are enabled and IF you have a sane implementation. There are caveats here to be wary of, especially if you're cross-platform. But knocking std::vector out of your code is not an optimization strategy.
std::vector is my fav =) except when it breaks in places that leave you clueless =\.

It depends on the array that you make. If you allocate new memory for it, it's going to be just a bit faster than the vector, nothing you'd really notice. But an array on the stack will be even faster; not much faster, but enough to notice an FPS change. Either way they're still faster.

Anyways.

I ran Intel Parallel Amplifier on it, and out of 22.7 seconds, 14 seconds were spent rendering the particles. All of the math and other stuff was less than a second each.

That's why I was asking about the ID3DXSprite class. There's not really much optimizing left for me to do in the math; I've hard-coded several things just to try to reach 100k particles at 30 fps.

I am using a std::vector, but I really don't think that changing it to an array on the stack would help much either.
First of all, a stack-allocated array has to fit in the stack space; 100k particles is going to take a chunk of stack that you may well not want to be giving up.

Secondly, allocating an array on the heap and allocating on the stack will [i]emphatically not[/i] make any speed difference to accessing the memory. What [b]will[/b] make a difference is cache behavior. Your stack space may be in cache... and then again, if you're bloating your stack with particle data, it might not. Memory access latency on modern hardware is a complicated beast and you can't just assume that "oh hey stack is faster than freestore" because [i]that is not necessarily true[/i]. In fact, it should be trivial to construct an artificial benchmark where prefetched "heap" pages are accessible faster than stack pages due to the behavior of virtual memory paging and cache locality.

Third, allocating memory with "new foo[]" versus allocating the same memory in a std::vector will make [b]zero[/b] difference if your compiler's optimization passes are worth anything. Not a tiny bit, not depending on the array - zero, period. If you reserve() your vector correctly instead of letting it grow/copy up to full size (i.e. if you make a fair comparison) then basically all it does is a call to new foo[] under the covers. There is no [i]performance[/i] reason whatsoever to eschew std::vector, and stating that it makes any practical difference whatsoever is liable to mislead people into thinking that they shouldn't use it because "OMFG mai codez must be teh fastar!"
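For example, these two end up doing essentially the same single heap allocation (assuming a Particle struct like the ones elsewhere in the thread):

[CODE]
#include <vector>

void Compare()
{
    // One heap allocation, elements default-constructed:
    Particle* raw = new Particle[100000];

    // Also one heap allocation up front, with no growth/copies later:
    std::vector<Particle> vec;
    vec.reserve(100000);
    vec.resize(100000);      // construct the elements in place

    delete[] raw;
}
[/CODE]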

Lastly, "I noticed an FPS change" does not constitute admissible data for performance measurement. I can run the [b]exact same code[/b] a dozen times on the [b]exact same hardware[/b] and get a deviation of framerate that is noticeable. Just comparing a couple runs of one implementation to a couple runs of something else doesn't give you a valid performance picture.
[quote name='DemonRad' timestamp='1339534091' post='4948625']
In my engine, 100,000 particles render decently at 110-120 fps (and that's on a 3-year-old PC: a 2 GHz dual core, with only one core used for moving the particles, and a Radeon HD 4570). Of course, I'm just moving them around randomly; more complex behaviour will have a major hit on performance.
[/quote]I suggest [url="http://www.2ld.de/gdc2004/"]Building a million particle system by Lutz Latta[/url]. If memory serves, it used to do 100k particles on a GeForce 6600, albeit at quite low performance, in the range of 20 fps I believe.

[quote name='ApochPiQ' timestamp='1339554243' post='4948691']
Third, allocating memory with "new foo[]" versus allocating the same memory in a std::vector will make zero difference if your compiler's optimization passes are worth anything. Not a tiny bit, not depending on the array - zero, period. If you reserve() your vector correctly instead of letting it grow/copy up to full size (i.e. if you make a fair comparison) then basically all it does is a call to new foo[] under the covers. There is no performance reason whatsoever to eschew std::vector, and stating that it makes any practical difference whatsoever is liable to mislead people into thinking that they shouldn't use it because "OMFG mai codez must be teh fastar!"[/quote]I think I am misunderstanding everything here.
Perhaps it's just me, but it was my understanding that [font=courier new,courier,monospace]std::vector[/font] will not use GPU memory. I think mhagain's main point was to use GPU memory directly through maps. Now, of course, we could write an allocator to deal with the reallocations ourselves... but besides the fact that that's not how vertex buffers are supposed to be used (especially when dealing with particle systems, IMHO), once we have our custom allocator, can we still talk about a [font=courier new,courier,monospace]std::vector[/font]?
I'm afraid not.
Or perhaps we're suggesting to compute everything in system RAM and then copy to GPU memory?
Please explain this clearly so I can understand.
Because perhaps it's just me, but I still have difficulty putting [font=courier new,courier,monospace]std::vector[/font] and GPU memory together.
Hm, it seems I derailed the thread a little; sorry for that. I just tend to get defensive when someone recommends C-style arrays over std::vector for performance reasons. And of course Promit is right too: there are various implementations of the standard library, some pretty buggy/slow, and there are also platforms where one simply cannot use it. For most tasks on a standard PC, however, I would say that one should use the C++ standard library extensively.

[b]@Krohm[/b]
I guess you are talking about FooBuffer->Map()'ing in D3D10/11? Of course, you get a void* pointer from that function and you will have to copy data into/out of that memory region. I didn't mean to construct a vector from the returned pointer. If you mean something else, mind explaining?
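In D3D9 terms (the API the OP's code uses) that copy-from-system-RAM pattern looks roughly like this; ParticleVertex is a placeholder struct, and particleVB is assumed to be a dynamic (D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY) vertex buffer:

[CODE]
#include <cstring>
#include <vector>
#include <d3d9.h>

// Simulate in system RAM, then copy the results into the dynamic vertex buffer each frame.
void UploadParticles(IDirect3DVertexBuffer9* particleVB,
                     const std::vector<ParticleVertex>& cpuParticles)
{
    void* dst   = 0;
    UINT  bytes = UINT(cpuParticles.size() * sizeof(ParticleVertex));

    // D3DLOCK_DISCARD hands back a fresh region so we don't stall on the GPU's copy.
    if (SUCCEEDED(particleVB->Lock(0, bytes, &dst, D3DLOCK_DISCARD)))
    {
        memcpy(dst, &cpuParticles[0], bytes);
        particleVB->Unlock();
    }
}
// In D3D10/11 the equivalent is Map() with the WRITE_DISCARD flag, then Unmap().
[/CODE]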
[quote name='ApochPiQ' timestamp='1339554243' post='4948691']
First of all, a stack-allocated array has to fit in the stack space; 100k particles is going to take a chunk of stack that you may well not want to be giving up.

Secondly, allocating an array on the heap and allocating on the stack will [i]emphatically not[/i] make any speed difference to accessing the memory. What [b]will[/b] make a difference is cache behavior. Your stack space may be in cache... and then again, if you're bloating your stack with particle data, it might not. Memory access latency on modern hardware is a complicated beast and you can't just assume that "oh hey stack is faster than freestore" because [i]that is not necessarily true[/i]. In fact, it should be trivial to construct an artificial benchmark where prefetched "heap" pages are accessible faster than stack pages due to the behavior of virtual memory paging and cache locality.

Third, allocating memory with "new foo[]" versus allocating the same memory in a std::vector will make [b]zero[/b] difference if your compiler's optimization passes are worth anything. Not a tiny bit, not depending on the array - zero, period. If you reserve() your vector correctly instead of letting it grow/copy up to full size (i.e. if you make a fair comparison) then basically all it does is a call to new foo[] under the covers. There is no [i]performance[/i] reason whatsoever to eschew std::vector, and stating that it makes any practical difference whatsoever is liable to mislead people into thinking that they shouldn't use it because "OMFG mai codez must be teh fastar!"

Lastly, "I noticed an FPS change" does not constitute admissible data for performance measurement. I can run the [b]exact same code[/b] a dozen times on the [b]exact same hardware[/b] and get a deviation of framerate that is noticeable. Just comparing a couple runs of one implementation to a couple runs of something else doesn't give you a valid performance picture.
[/quote]

I apologize; I haven't learned that in school yet. I was just taught that stack memory is faster than heap memory. I haven't gotten to the operating systems class yet, so I had no idea. You made me feel like an idiot lol.


Thanks for the link Krohm, I hope it's as helpful as it looks. I'm not giving up till I get that 100k mark lol. If I don't, then I'm pretty sure the game I'm working on is going to be pretty laggy.
[quote name='Madhed' timestamp='1339590825' post='4948792']I guess you are talking about FooBuffer->Map()'ing in D3D10/11? Of course, you get a void* pointer from that function and you will have to copy data into/out of that memory region. I didn't mean to construct a vector from the returned pointer. If you mean something else, mind explaining?[/quote]My only goal was to cool down the [font=courier new,courier,monospace]std::vector[/font] performance debate. While I trust std::vector in general, I see no way to mix GPU memory with STL containers. It appears to me we either put the data there from scratch, or we use a std::vector and then copy to the VB. That's it.
OK, I managed to get it to 85,000 particles at 25 fps; I think that's about as good as I'm going to get without using the GPU. I wish the CPU was just like the GPU. What's wrong with making computers have a GPU rather than a CPU? Things would be much faster.
No they wouldn't.

A GPU is very good at doing a lot of tasks at once when you can generate enough work to hide the latency of memory accesses and jumps. GPUs are high-throughput, high-latency devices.

However, not all workloads map well to a GPU, which is where the CPU comes in, with its advanced caches, branch prediction, and the ability to execute work out of order. CPUs are low-latency, low-throughput devices.

Or to put it another way: if you tried to run Word on a GPU, it would run horribly compared to a CPU, because the workload isn't suited to the device.

This is why AMD, Intel, and ARM are putting so much effort into devices which combine a CPU and a GPU on one die: so that workloads can be placed where they make sense.

Despite what nVidia would probably like you to believe, not every workload can be made massively parallel and run with a 10,000x speedup on a GPU. CPUs still very much have their place, and will do for some time yet.
Well, I can't wait until the CPU is much faster than it is now and works with the GPU much more than it does now. I want some performance :P
