This is mostly a theoretical question.
In the recent days I have implemented a simple GPU-driven particle system for use in my engine and after finally tracking down some annoyingly elusive bugs it is running quite well.
However I realized that sorting the individual particles so that they are drawn back-to-front (for proper alpha blending) seems to take significant processing time compared to just generating, updating and drawing the individual particles.
For example I get a frame rate drop of about 35 times if I compare running some 8000 particles with and without depth sorting (I get ~2100 FPS witout it and ~55 with on a relatively high-end GPU).
Is this to be expected due to way in which sorting has to be done on the GPU; eventually you need a single thread to traverse the whole list to ensure everything is in proper order? Or is there some kind of in-place sorting algorithm that can have separate threads work on separate sub-sections of the particle list without having to have a final step that puts all the pieces together? I have been unable to think of one.
My sorting algorithm is a pretty straight forward mergesort implementation; I first perform a swap-sort to bring sub-lists of two elements in order, then I proceed by calling a MergeSort shader program that merges two adjacent pre-sorted sub-lists 1 through log2(numParticles) times where the sub-list size goes from 2 and doubles for each step. This obviously means I have lots of threads doing little work in the first passes and the finally just one thread comparing all elements in the entire buffer at the end.
Ideally I would of course want to be able to perform the sorting in a quicker way (I suppose the key to this is better work distribution but as said I'm having a hard time figuring out how to accomplish that). If that is indeed not possible I guess I'll have to tinker with sorting parts of the buffer in each frame, but that can of course permit the ocassional artifact to slip through.