I finally got around to implementing my compute shader (CS) particle system. I see that I can use CopyStructureCount to copy the number of "alive" particles into a constant buffer, and into a regular buffer (used as the indirect argument buffer) for drawing.
However, when it comes to dispatching thread groups, I need to use a formula like: NumThreadGroups = (NumAliveParticles + 255) / 256, where 256 is my thread group size. This way I only dispatch as many thread groups as I actually need.
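To make the rounding explicit, here is a minimal sketch of that formula in plain C++. The helper name numThreadGroups is just something I made up for illustration; it is not part of D3D11.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of thread groups of `groupSize` threads needed
// to cover `numAliveParticles` items (integer division, rounded up).
// Note: the addition can wrap for counts near the 32-bit maximum.
std::uint32_t numThreadGroups(std::uint32_t numAliveParticles,
                              std::uint32_t groupSize = 256) {
    assert(groupSize > 0);
    return (numAliveParticles + groupSize - 1) / groupSize;
}
```

With a group size of 256, anything from 1 to 256 alive particles needs one group, and 257 needs two.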
The problem is that I don't see a way to do this without CPU intervention. There is DispatchIndirect, but I only have NumAliveParticles in a D3D11 buffer, not the result of the calculation (NumAliveParticles + 255) / 256.
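For reference, as far as I understand it, DispatchIndirect reads three 32-bit unsigned group counts (X, Y, Z) from the argument buffer, matching the parameters of Dispatch. The sketch below only shows what values that buffer would need to hold; the struct and function names are my own, and it is written as ordinary CPU-side C++ purely for illustration. Producing these values without the CPU is exactly the open question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct mirroring the three group counts that
// DispatchIndirect reads, in X, Y, Z order.
struct DispatchArgs {
    std::uint32_t threadGroupCountX;
    std::uint32_t threadGroupCountY;
    std::uint32_t threadGroupCountZ;
};
static_assert(sizeof(DispatchArgs) == 12, "expected three packed 32-bit values");

// Hypothetical helper: fills the arguments for a 1D dispatch over the
// alive particles, rounding the group count up.
DispatchArgs makeDispatchArgs(std::uint32_t numAliveParticles,
                              std::uint32_t groupSize = 256) {
    assert(groupSize > 0);
    return DispatchArgs{(numAliveParticles + groupSize - 1) / groupSize, 1u, 1u};
}
```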
I noticed that in the Hieroglyph 3 ParticleStorm demo, the author dispatches enough thread groups to handle the "maximum" particle count. This results in "empty" thread groups whenever the particle system is not near maximum capacity. Is this a big deal or not? I assume the GPU overhead consists of loading the thread group onto the multiprocessor and evaluating a conditional to see whether any work needs to be done. If the thread group is "empty," all of its threads take the same branch (no work to do), so there is no divergence and the thread group finishes immediately. So the cost seems pretty negligible. But I wanted a second opinion, and I would also like to know whether there is a way to do a calculation like (NumAliveParticles + 255) / 256 without CPU intervention.
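To put a number on how many "empty" groups this means, here is a rough sketch in plain C++. The function name and the example particle counts are hypothetical, not taken from the demo; it only counts groups and says nothing about their actual GPU cost.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many dispatched thread groups have no alive
// particle to process when the dispatch is sized for `maxParticles`
// but only `numAliveParticles` are alive.
std::uint32_t emptyThreadGroups(std::uint32_t maxParticles,
                                std::uint32_t numAliveParticles,
                                std::uint32_t groupSize = 256) {
    assert(groupSize > 0);
    assert(numAliveParticles <= maxParticles);
    const std::uint32_t dispatched = (maxParticles + groupSize - 1) / groupSize;
    const std::uint32_t needed = (numAliveParticles + groupSize - 1) / groupSize;
    return dispatched - needed;
}
```

For example, with a made-up capacity of 65536 particles and 1000 alive, 256 groups are dispatched but only 4 contain work, leaving 252 that exit at the early-out check.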