ComputeShader Particle System DispatchIndirect

13 comments, last by michaelruecker 10 years, 7 months ago

I finally got around to implementing my CS particle system. I see that I can use CopyStructureCount to copy the number of "alive" particles into a constant buffer and into a regular buffer (used as the indirect argument buffer) for drawing.

However, when it comes to dispatching thread groups, I need to use a formula like: NumThreadGroups = (NumAliveParticles + 255) / 256, where 256 is my thread group size. This way I only dispatch as many thread groups as I actually need.
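The rounding-up division in that formula can be sketched on the CPU side as follows. This is only a minimal illustration, not code from the thread: `kThreadGroupSize` and `NumThreadGroups` are placeholder names, and 256 is simply the group size mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Placeholder name: the thread group size from the post (256 threads per group).
constexpr std::uint32_t kThreadGroupSize = 256;

// Ceiling division: the smallest number of groups whose threads cover
// every alive particle. Zero particles yields zero groups.
std::uint32_t NumThreadGroups(std::uint32_t numAliveParticles)
{
    // Guard against wrap-around in the addition below.
    assert(numAliveParticles <=
           std::numeric_limits<std::uint32_t>::max() - (kThreadGroupSize - 1));
    return (numAliveParticles + kThreadGroupSize - 1) / kThreadGroupSize;
}
```

For example, 257 alive particles need two groups, with most threads of the second group idle.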

The problem is that I don't really see a way to do this without CPU intervention. There is DispatchIndirect, but I only have NumAliveParticles in some D3D11 buffer, not the result of the calculation (NumAliveParticles + 255) / 256.

I noticed that the Hieroglyph 3 ParticleStorm demo dispatches enough thread groups to handle the "maximum" particle count. This results in "empty" thread groups if the particle system is not near maximum capacity. Is this a big deal or not? I assume the GPU overhead is loading the thread group onto the multiprocessor and evaluating a conditional to see if any work needs to be done. If the thread group is "empty," all threads take the same branch (no work to do), and the thread group finishes immediately. So it seems pretty negligible. But I wanted a second opinion, and I would also like to know whether there is a way to do a calculation like (NumAliveParticles + 255) / 256 without CPU intervention.

-----Quat

Just run a very simple compute shader with 1 thread that reads the buffer count, calculates the number of thread groups needed to process that number of particles, and outputs it to a buffer. Then you can use that buffer as the args for a DispatchIndirect.

I had originally tried reading back to the CPU what the particle count was, and then using that number to dispatch an appropriate amount of thread groups. However, that was predictably slow, and I ended up coming to the solution that you mentioned (sending a fixed number of thread groups regardless of how many particles are present).

This solution works for this example in particular, since the particles have a fixed lifespan and can be reliably counted as dead after a certain time period. The number of particles that are created is specifically throttled to ensure that this is true. So after a startup period, there is always going to be nearly a full set of particles and there won't be any wasted thread groups anymore.

If you are able to have similar control on your particle system (i.e. you can reliably model the number of particles on the CPU side) then I would recommend the method used in ParticleStorm. The method that MJP mentioned sounds like a good solution if you can't easily model the particles, and it only has a very small performance penalty of a single dispatch. I would be interested to hear your results once you get it up and running though, and hear how your experience turns out.

@MJP

I am having exactly the same problem. And I am trying to do what you suggested but I am still stuck since there is no example on the internet.

There are two things I do not understand yet.

1. What properties to set when creating the buffer for DispatchIndirect

2. I can imagine how to call the compute shader that calculates the thread groups, but what then? The thread group count is stored in that buffer, but how do I then dispatch the actual compute shader with this information?

It would be incredibly helpful if you could provide an example.

To point #1, read: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476406(v=vs.85).aspx

The buffer needs to be 12 bytes in size (one UINT each for group count X, Y and Z) and specify the MiscFlags value D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS.

To point #2, just call DispatchIndirect and pass in the 12 byte buffer created earlier and specify 0 for the offset to the args: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476406(v=vs.85).aspx

As long as the buffer has these 3 UINTs stored within it in a contiguous fashion by the time the GPU gets around to executing the DispatchIndirect event, you'll get that many thread groups being executed.
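For reference, the three UINTs described above can be modelled on the CPU like this. It is only a sketch of the layout and of the arithmetic the one-thread compute shader would perform; `DispatchArgs` and `MakeDispatchArgs` are hypothetical names, and the default group size of 256 is taken from the original post. In the real setup the values would be written by the shader on the GPU rather than computed in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU mirror of what DispatchIndirect reads from the args
// buffer: three contiguous 32-bit unsigned thread group counts.
struct DispatchArgs
{
    std::uint32_t threadGroupCountX;
    std::uint32_t threadGroupCountY;
    std::uint32_t threadGroupCountZ;
};

static_assert(sizeof(DispatchArgs) == 12, "args must be exactly 12 bytes");
static_assert(offsetof(DispatchArgs, threadGroupCountY) == 4, "Y follows X");
static_assert(offsetof(DispatchArgs, threadGroupCountZ) == 8, "Z follows Y");

// Models the work of the one-thread shader: round the particle count up
// to whole groups in X, and use 1 (not 0) for Y and Z, since a zero in
// any dimension means no groups run at all.
DispatchArgs MakeDispatchArgs(std::uint32_t numAliveParticles,
                              std::uint32_t threadGroupSize = 256)
{
    assert(threadGroupSize > 0);
    const std::uint32_t groupsX =
        numAliveParticles / threadGroupSize +
        (numAliveParticles % threadGroupSize != 0 ? 1u : 0u);
    return DispatchArgs{groupsX, 1u, 1u};
}
```

With 257 alive particles this yields (2, 1, 1), which is what the args buffer should contain by the time DispatchIndirect executes.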

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

^^^ what ajmiles said :)

I finally got it working. Thanks a lot!

Now I am stuck on a similar problem. My particles are stored in a StructuredBuffer, and when I actually draw them I bind an SRV to the vertex shader and use the deviceContext->Draw(?,0) call.

Here I have the same problem as above. I don't know how many particles to draw on the CPU side since they are spawned and destroyed purely in my ComputeShaders.

I thought about using DrawAuto(), but that requires the particles to be in a vertex buffer. And I think I can't create UAVs of a vertex buffer and manipulate it with the compute shaders.

DrawInstancedIndirect will do what you want. Copy the SB size into an indirect args buffer and pass that to the indirect draw method. (I mean copying the size to the specific location in the indirect args buffer for the argument you want to control: number of verts vs. number of instances, etc.)

Sorry for the late response, had a lot of things going on lately and no time to work on this project. Anyways...

I tried to use DrawInstancedIndirect with partial success. I am not 100% sure what data has to be stored in ID3D11Buffer *pBufferForArgs.

I have created the buffer with no specific initial data and tried to copy the structure count with:


m_pdevicecontext->CopyStructureCount(pbDrawIndirectArgs, 0, puavSimulationSateNew);

This draws nothing at all!

After that I tried to play a bit with the initial data of pbDrawIndirectArgs.


IndirectArgs indirectArgs;
indirectArgs._one = 0;
indirectArgs._two = 10;
indirectArgs._three = 0;
indirectArgs._four = 0;


D3D11_SUBRESOURCE_DATA InitData;
InitData.pSysMem = &indirectArgs;
InitData.SysMemPitch = 0;
InitData.SysMemSlicePitch = 0;


HRESULT result = m_pdevice->CreateBuffer(&bufferDesc, &InitData, &pbDrawIndirectArgs);

Now the strange thing happens. As soon as I set indirectArgs._two to anything but 0 it actually draws my particles.

After that I removed the CopyStructureCount call. And again I had a different behavior. Now the particles are blinking as if only a few at a time are drawn.

In conclusion, I guess CopyStructureCount does actually work, but only if I set indirectArgs._two to anything but 0.

This totally confuses me and I have no idea why...

The only resource that explains anything about that buffer structure is the book "Practical Rendering and Computation with Direct3D 11".

There it looks something like this:

Each of these numbers represents a 4-byte value:


0 = Aligned Byte Offset For Args (uint)
1 = Aligned Byte Offset For Args (uint)
2 = Aligned Byte Offset For Args (uint)
3 = Vertex Count Per Instance (uint)
4 = Instance Count (uint)
5 = Start Vertex Location (uint)
6 = Start Instance Location (uint)

0-2: Is this space available for whatever data I want? Can it be expanded arbitrarily? Is this the number of bytes I have to skip, which can then be passed to DrawInstancedIndirect as the second parameter (AlignedByteOffsetForArgs)?

3: This must be 1 for me since I am drawing 1 vertex per particle and will create a billboard in the geometry shader.

4: I guess this is the actual number of particles that are drawn.

5, 6: Well no idea about these two.
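Going by the parameter list of DrawInstanced, the indirect buffer holds four UINTs in the same order as that call's parameters, and AlignedByteOffsetForArgs is an argument to DrawInstancedIndirect itself rather than something stored in the buffer. The sketch below models that layout on the CPU and imitates CopyStructureCount writing the count at a byte offset. `DrawInstancedArgs`, `CopyCountTo` and `VerticesDrawn` are hypothetical names used only for illustration, and the alignment check is an assumption of this stand-in. Under that reading, copying the count to offset 0 fills the vertex count, so the instance count has to be non-zero (normally 1) for anything to be drawn, which would match the behaviour described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical CPU mirror of the four 32-bit values read by
// DrawInstancedIndirect, in the same order as DrawInstanced's parameters.
struct DrawInstancedArgs
{
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};

static_assert(sizeof(DrawInstancedArgs) == 16, "four UINTs, 16 bytes");

// CPU stand-in for CopyStructureCount: writes one 32-bit count at the
// given byte offset inside the args. The 4-byte alignment check is an
// assumption made for this sketch.
void CopyCountTo(DrawInstancedArgs& args, std::size_t dstByteOffset,
                 std::uint32_t count)
{
    assert(dstByteOffset % 4 == 0);
    assert(dstByteOffset + sizeof(count) <= sizeof(args));
    std::memcpy(reinterpret_cast<unsigned char*>(&args) + dstByteOffset,
                &count, sizeof(count));
}

// Total number of vertices the draw would process.
std::uint64_t VerticesDrawn(const DrawInstancedArgs& args)
{
    return std::uint64_t{args.vertexCountPerInstance} * args.instanceCount;
}
```

Two setups give one vertex per particle: initialise the buffer as {0, 1, 0, 0} and copy the count to offset 0, or initialise it as {1, 0, 0, 0} and copy the count to offset 4 (the instance count). With the instance count left at 10 and the count copied to offset 0, every particle would be drawn ten times.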

This topic is closed to new replies.
