I recently looked into GPU sorting algorithms as part of my goal of building a GPU compute particle system. I studied bitonic sort and the DXSDK implementation. It looks good for sorting a large number of particles, but I have one case where the particles need to be sorted and the list is relatively small (fewer than 100). It is for fire, where I use video textures stored in a volume map and therefore do not need many particles to get good results.
I'm sure I can modify the bitonic sort to handle particle counts smaller than the thread group size, but I'm wondering if I should just use a more brute-force sort like brick sort, also known as odd-even sort (http://en.wikipedia.org/wiki/Odd%E2%80%93even_sort).
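To make the brick sort option concrete, here is a rough CPU-side sketch in C++ of what I have in mind. It is only a reference for the logic, not compute shader code. The `Particle` struct, the `depth` key, and the back-to-front ordering are assumptions for illustration. On the GPU, I would expect each compare-and-swap in the inner loop to map to one thread, with a thread group barrier between phases.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical particle record: only the sort key matters for this sketch.
struct Particle {
    float depth;  // assumed sort key, e.g. distance from the camera
    int   id;     // placeholder for the rest of the particle data
};

// Odd-even transposition (brick) sort, ordering back to front
// (largest depth first). n phases are enough to sort n elements.
void brickSortBackToFront(std::vector<Particle>& p) {
    const std::size_t n = p.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Even phases compare pairs (0,1), (2,3), ...
        // Odd phases compare pairs (1,2), (3,4), ...
        // The pairs within one phase are disjoint, so on the GPU they
        // could run in parallel, with a barrier before the next phase.
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (p[i].depth < p[i + 1].depth) {
                std::swap(p[i], p[i + 1]);
            }
        }
    }
}
```

With fewer than 100 particles this is at most 100 phases of roughly 50 independent compares each, which should fit in a single thread group without any padding to a power of two.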
In either case, I'm not expecting huge gains from a GPU implementation, since the sort will only run on one thread group in this particular case. The goal is just to avoid CPU intervention.