I also decided to grab a trial copy of Intel's Parallel Studio to have a play with it, see what its auto vectorisation is like and also try things like the Thread Building Blocks.
One of the things I need in this game are particles.
Lots of them.
Now, the obvious way to deal with it is the give it to my shiney HD5870 and let that sim things, however I've never been one to take the obvious route and I've got a core i7 sitting here which hardly gets a work out; time for particles on the CPU.
Having installed the Intel stuff I had a look over the TBB code and decided that the task based system would make a useful platform to build a multi-threaded game on; it's basically a C++ way of playing with a threadpool, the thread pool being hidden behind the scenes and shared out as required to tasks. You don't even have to be explicate with tasks they provide parallel_for loops which can subdivide work and share it out between threads, including task stealing. All in all pretty sweet.
After putting it off for a number of days I decided to finally knuckle down and get on with some performance testing code. I've decided to start simply for now, get a vague framework up and running, however it'll require some work to make it useful with the final goal being to throw it at some people and see what performance numbers I get back for emitters and particle count.
I've so far tested 5 different methods of updating the particle system;
The first is a simple serial processing on a single thread; so each emitter is processed in turn and each particle is processed one step at a time. Best I can hope for is some single floating point value SSE processing to help out. The particles themselves are structs stored in an std::vector (floats and arrays of floats in the struct).
The second is the first parallel system I went for; emitters and particles in parallel. This uses the parallel_for from the TBB to split the emitters over threads and then to break each emitter's particles up between the threads. Same data storage as the first test.
The third was me wondering if splitting the particles over threads was a good idea so I removed the parallel_for from the particle code and left it in the emitter processing. Same storage setup as before.
Having had a look at my basic processing of the particles and a bit of a read of the Intel TBB docs I realised that I had a bit of a dependancy going on;
momentum += velocity
position += momentum * time;
So, I decided to split it into two passes; the first works out the momentum and the second the position. Storage stayed the same as before.
At this point I had some timing data;
With 100 emitters, each with 10,000 particles and the particle system assuming an update rate of 16ms a frame and a particle lifetime of 5 seconds:
- serial processing took 5.06 seconds to complete, with the final frame taking 23ms to process
- Emitter and particle parallel took 2.44 seconds to complete, with the final frame time taking 8.6ms to complete.
- Emitter only parallel took 2.4 seconds to complete, with the final frame taking 8.53ms
- Two pass parallel processing took 2.44 seconds to complete, with the final frame taking 8ms
Clearly, single threaded processing wasn't going to win and speed prizes, however splitting the work over multiple threads had the desired effect, with the processing now taking ~8ms, or half a frame.
The final step; SSE.
I decided to build on the two pass parallel processing as it seemed a natural way to go.
The major change to SSE was dropping the std::vector of structures. There were two reasons for this;
- memory alignment is a big one; SSE likes things to be 16byte aligned
- SIMD. Structures of data don't lend themselves well to this as at any given point you are only looking at once chunk of data.
SIMD was a major one, as the various formulas being applied to the components could be applied to x and y seperately, so if we closely pack the data this means we can hold say 4 X positions and work on them at once.
So, the new storage was a bunch of aligned chunks of memory aligned to 16bytes. The particle counts were forced to be multiple of 4 so that we can sanely work with 4 components at a time.
The dispatch is the same as the previous versions; we split threads off into emitters and then into groups of particles and these particles are processed.
- Two pass SSE parallel processing took 1.09 seconds, with the final frame taking 4.9ms.
Thats a nicer number indeed, as it still leaves me 11ms to play with to hit 60fps.
There is still some tuning to do (min. particles per thread for example, currently 128) and it might even be better to do my own task dispatch rather than leave it to the parallel_for loop to setup. I've also not done any profiling with regards to stalling etc or how particles and emitter count effect the various methods of processing. Oh, and my own timing system needs a bit of work as well; taking all frame times and processing them for min/max/average might be a good idea.
However, for now at least I'm happy with todays work.
ToDo: Improve test bed.