Husbjörn

What kind of performance to expect from real-time particle sorting?



This is mostly a theoretical question.

Over the past few days I have implemented a simple GPU-driven particle system for use in my engine, and after finally tracking down some annoyingly elusive bugs it is running quite well.

However, I realized that sorting the individual particles so that they are drawn back-to-front (for proper alpha blending) seems to take significant processing time compared to just generating, updating and drawing them.

For example, with some 8,000 particles I see roughly a 35× frame rate drop when depth sorting is enabled (~2100 FPS without it and ~55 FPS with it, on a relatively high-end GPU).

Is this to be expected due to the way sorting has to be done on the GPU, where eventually you need a single thread to traverse the whole list to ensure everything is in proper order? Or is there some kind of in-place sorting algorithm that lets separate threads work on separate sub-sections of the particle list without a final step that puts all the pieces together? I have been unable to think of one.

 

My sorting algorithm is a pretty straightforward mergesort implementation: I first perform a swap sort to bring sub-lists of two elements into order, then I call a MergeSort shader program log2(numParticles) times, where each pass merges two adjacent pre-sorted sub-lists and the sub-list size starts at 2 and doubles each step. This obviously means I have lots of threads doing little work in the first passes, and then finally just one thread comparing all elements in the entire buffer at the end.
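For reference, a minimal CPU-side sketch of that pass structure (in Python, purely for illustration — the actual implementation is an HLSL compute shader). On the GPU, each merge() call within a pass would map to one thread, which is why the final pass ends up as a single thread walking the whole buffer:

```python
def merge(a, lo, mid, hi):
    """Merge two adjacent sorted sub-lists a[lo:mid] and a[mid:hi] in place."""
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:hi])
    a[lo:hi] = merged

def mergesort_passes(a):
    """Bottom-up mergesort: each pass doubles the sorted sub-list width.
    On a GPU, each merge() in a pass is one thread (one dispatch per pass),
    so the last pass is a single thread merging the entire buffer."""
    n = len(a)
    width = 1
    while width < n:
        for lo in range(0, n, 2 * width):
            merge(a, lo, min(lo + width, n), min(lo + 2 * width, n))
        width *= 2
    return a
```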

 

 

Ideally I would of course want to perform the sorting in a quicker way (I suppose the key to this is better work distribution, but as said I'm having a hard time figuring out how to accomplish that). If that is indeed not possible, I guess I'll have to tinker with sorting only parts of the buffer each frame, but that could of course let the occasional artifact slip through.

Yeah, I haven't implemented it either, but bitonic sort seems to be the default algorithm on the GPU side.

You can also try using an order-independent transparency system.
A simple one, which results in cutout alpha where particles overlap each other but nice alpha blending where they overlap the world, is described in the "Foliage Rendering in Pure" presentation.
Basically, you render your particles twice, to two offscreen buffers (but keeping the same depth buffer as the regular scene for depth testing). To one buffer you output the particles' colours, using alpha testing to get hard-edged transparency. The other buffer you clear to black and then render the particles' alpha values into it, using additive blending.
You then draw a full-screen quad with alpha blending enabled, outputting the first buffer's contents as RGB and the second buffer's contents as A.
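That final quad pass amounts to a standard "over" blend; a tiny Python sketch of the per-pixel math (illustrative only, not engine code — here `particle_rgb` is the colour buffer and `particle_alpha` the additively accumulated alpha buffer):

```python
def composite(scene_rgb, particle_rgb, particle_alpha):
    """Standard 'over' blend of the particle buffers onto the scene:
    out = src.rgb * src.a + dst.rgb * (1 - src.a), per channel."""
    return tuple(p * particle_alpha + s * (1.0 - particle_alpha)
                 for s, p in zip(scene_rgb, particle_rgb))
```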

AMD put out a presentation earlier this year at GDC 2014 specifically about particle system performance, combining several techniques. It includes a section on utilizing bitonic sort on the GPU (I think it's the same compute code as the talk from GDC 2013). It also has a pretty interesting section about using Forward+-style tiling to reduce overdraw during particle rendering. I haven't tried to implement it yet, but it looked worth using.

http://www.slideshare.net/DevCentralAMD/holy-smoke-faster-particle-rendering-using-direct-compute-by-gareth-thomas


Yeah, I imagined as much... bitonic sort keeps popping up, yet proper descriptions of it seem eerily lacking; mostly it's things like:

for n = a..b where n *= 2
    [PERFORM MAGIC GPU SORTING]
end for

 

Anyway, from what I can gather, it seems to essentially be a kind of merge sort where your two sub-lists are organized so that one decreases from left to right while the other increases. As such I imagine it might save one or two branching instructions per thread, but the overall problem domain (required number of dispatch calls and elements to traverse, including the fact that the entire list will eventually have to be put together by a single thread in the last dispatch) seems to remain the same.

Is this a correct assessment of the algorithm? And if so, will it really yield significantly better performance?

I guess I'll try to implement it like I outlined above, but it feels like something is missing; I cannot see it being more than maybe a few tenths faster than the standard merge sort if that is all there is to it.

 

I was also contemplating arranging the data in equally-sized buckets so that each bucket holds values in a certain range (all values in bucket 2 greater than or equal to any element in bucket 1, and so on), and then merge sorting the buckets in parallel. Might this be efficient? It would most likely require a large single-threaded operation at the beginning though.

 

 

As for that other clipping / alpha-blending approach: it does seem quite interesting, but I don't believe it is applicable to my current situation, as I have semi-transparent particles of different colours that in some situations need to be visible through other particles (i.e. they're not all white or some other consistent colour, and not solid-coloured in the middle, which is what those approaches sound like they're for). Thanks for the links nonetheless :)

Edited by Husbjörn

Nope, the point of using bitonic sort is that it can be completely parallel. E.g. in the Wikipedia network diagram, each horizontal line could be its own thread, performing (synchronized) comparisons at the vertical lines, resulting in a sorted buffer at the end:
[Image: Batcher's bitonic mergesort network for eight inputs]
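To make that parallelism concrete, here is a minimal bitonic sorting network in Python (a CPU sketch of my own, not taken from any of the linked samples). Every compare-and-swap within one (k, j) stage touches a disjoint pair of indices, so on the GPU each stage can be one fully parallel dispatch with one thread per pair:

```python
def bitonic_sort(a):
    """Bitonic sorting network for a power-of-two-length list, in place.
    The inner loop over i is data-parallel: within a fixed (k, j) stage
    every compare-and-swap touches a disjoint pair of elements."""
    n = len(a)
    assert n & (n - 1) == 0 and n > 0, "length must be a power of two"
    k = 2
    while k <= n:           # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:        # compare distance within this merge step
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0   # sort direction of this block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```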

Here's an explanation of the algorithm from a GPU point of view, but it's old, so their example implementation is a pixel shader, not a compute shader:
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html

I would assume that the NVIDIA and AMD sample collections probably include a compute shader implementation somewhere. If not, a quick Google search brings up some CUDA/OpenCL ones you could look at.

As for the bucket idea - you could try a counting sort / radix sort, which can also be parallelized.
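A CPU-side counting sort sketch in Python (illustrative only — function names are mine). On the GPU, the histogram and scatter passes parallelize across elements using atomics, and the prefix sum over the bucket counts is a standard parallel scan:

```python
from itertools import accumulate

def counting_sort_by_key(items, key, num_buckets):
    """Stable counting sort on integer keys in [0, num_buckets),
    e.g. particle depths quantized into buckets."""
    counts = [0] * num_buckets
    for it in items:                    # pass 1: histogram (parallel + atomics on GPU)
        counts[key(it)] += 1
    # Exclusive prefix sum over the counts = start offset of each bucket
    # (a parallel scan on the GPU).
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * len(items)
    for it in items:                    # pass 2: scatter (parallel + atomics on GPU)
        k = key(it)
        out[offsets[k]] = it
        offsets[k] += 1
    return out
```

A radix sort is essentially this same pass repeated once per digit of the key, which keeps the bucket count small regardless of key range.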


Ah, yes, that makes better sense (and proves I didn't understand it at all).

Won't that mean you cannot cram more than 1024 elements into a single thread group though? And as far as I know, you cannot synchronize access between multiple thread groups in any way other than making several dispatch calls.

 


Here's an explanation of the algorithm from a GPU point of view, but it's old, so their example implementation is a pixel shader, not a compute shader-

Yes, I've read through that, however I wrote it off as probably unnecessarily complicated, since he mentions making use of the rasterizer and vertex shader stages as optimizations in the final step (as far as I saw, the idea was never demonstrated without those rather specific considerations, which made it quite hard to follow). I'll give it another read though.

 


I would assume that the NVIDIA and AMD sample collections would probably include a compute-shader implementation somewhere.

There is an NVIDIA example I've found, but I think it has been tampered with because it won't even compile as-is. Not to mention that it lacks comments and contains really funky bitwise manipulations on group shared memory that just don't seem to make sense to me. I suppose I can see where the shared memory comes in now that you mention synchronization though. I'll have another look at it. Also, I didn't think about looking for CUDA examples, so cheers for pointing that out.

 


As for the bucket idea - you could try a counting sort / radix sort, which can also be parallelized.

Ah I see, thanks for the names.
