
I finally got around to implementing my compute shader (CS) particle system. I see that I can use CopyStructureCount to copy the number of "alive" particles into a constant buffer and into a regular buffer (used as the indirect argument buffer) for drawing.

However, when it comes to dispatching thread groups, I need to use a formula like: NumThreadGroups = (NumAliveParticles + 255) / 256, where 256 is my thread group size.  This way I only dispatch as many thread groups as I actually need.
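As a minimal sketch of that formula (the function name `ThreadGroupCount` is hypothetical, not part of any API), the rounding-up division can be written as:

```cpp
#include <cassert>
#include <cstdint>

// Number of thread groups needed to cover `numAlive` particles when each
// group processes `groupSize` of them (integer ceiling division).
// Note: the addition wraps if numAlive is within groupSize of UINT32_MAX.
inline std::uint32_t ThreadGroupCount(std::uint32_t numAlive,
                                      std::uint32_t groupSize = 256)
{
    assert(groupSize != 0);
    return (numAlive + groupSize - 1) / groupSize;
}
```

With a group size of 256, 256 alive particles need one group, 257 need two, and zero particles need none.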

However, I don't really see a way to do this without CPU intervention.  There is DispatchIndirect, but I only have NumAliveParticles in some d3d11 buffer, not the result of the calculation (NumAliveParticles + 255) / 256.

I noticed that the Hieroglyph 3 ParticleStorm demo dispatches enough thread groups to handle the "maximum" particle count. This results in "empty" thread groups if the particle system is not near maximum capacity. Is this a big deal or not? I assume the GPU overhead is loading the thread group onto the multiprocessor and evaluating a conditional to see if any work needs to be done. If the thread group is "empty," all of its threads take the same branch, no work is done, and the thread group finishes right away. So it seems pretty negligible. But I wanted a second opinion, and also to know whether there is a way to do a calculation like (NumAliveParticles + 255) / 256 without CPU intervention.

##### Share on other sites

Just run a very simple compute shader with 1 thread that reads the buffer count, calculates the number of thread groups needed to process that number of particles, and outputs it to a buffer. Then you can use that buffer as the args for a DispatchIndirect.
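A rough sketch of such a one-thread pass is below. The HLSL is embedded as a C++ string, the way it might be handed to a shader compiler. All names and register slots are hypothetical. It assumes the alive count was copied into a constant buffer at b0 and that the args buffer is bound as a `uint` UAV at u0. The small C++ function mirrors the same arithmetic on the CPU so a readback can be checked against it.

```cpp
#include <cassert>
#include <cstdint>

// HLSL source for the one-thread pass (hypothetical names and slots).
// b0: constant buffer that received the alive count via CopyStructureCount.
// u0: UAV over the 12-byte indirect-args buffer.
static const char* const kBuildDispatchArgsHLSL = R"hlsl(
cbuffer ParticleCount : register(b0)
{
    uint NumAliveParticles;   // buffer is padded to 16 bytes on the CPU side
};

RWBuffer<uint> DispatchArgs : register(u0);

[numthreads(1, 1, 1)]
void CSMain()
{
    DispatchArgs[0] = (NumAliveParticles + 255) / 256; // ThreadGroupCountX
    DispatchArgs[1] = 1;                               // ThreadGroupCountY
    DispatchArgs[2] = 1;                               // ThreadGroupCountZ
}
)hlsl";

// CPU reference of the same computation.
struct DispatchArgs { std::uint32_t x, y, z; };

inline DispatchArgs BuildDispatchArgs(std::uint32_t numAlive,
                                      std::uint32_t groupSize = 256)
{
    assert(groupSize != 0);
    return DispatchArgs{ (numAlive + groupSize - 1) / groupSize, 1u, 1u };
}
```

The pass itself would be launched with a plain `Dispatch(1, 1, 1)`, followed by a DispatchIndirect on the buffer it wrote.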

##### Share on other sites

I had originally tried reading back to the CPU what the particle count was, and then using that number to dispatch an appropriate amount of thread groups.  However, that was predictably slow, and I ended up coming to the solution that you mentioned (sending a fixed number of thread groups regardless of how many particles are present).

This solution works in particular for this example, since the particles have a fixed lifespan and can be reliably counted as dead after a certain time period.  The number of particles that are created is specifically throttled to ensure that this is true.  So after a startup period, there is always going to be nearly a full set of particles and there won't be any wasted thread groups anymore.

If you are able to have similar control on your particle system (i.e. you can reliably model the number of particles on the CPU side) then I would recommend the method used in ParticleStorm.  The method that MJP mentioned sounds like a good solution if you can't easily model the particles, and it only has a very small performance penalty of a single dispatch.  I would be interested to hear your results once you get it up and running though, and hear how your experience turns out.

##### Share on other sites

@MJP

I am having exactly the same problem. And I am trying to do what you suggested but I am still stuck since there is no example on the internet.

There are two things I do not understand yet.

1. What properties to set when creating the buffer for DispatchIndirect

2. I can imagine how to call the compute shader that calculates the thread groups but what then? The thread group size is stored in that buffer but how am I dispatching the actual compute shader with this information then?

It would be incredibly helpful if you could provide an example.

##### Share on other sites

The buffer needs to be 12 bytes in size (one UINT each for group count X, Y and Z) and must specify D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS in its MiscFlags.

To point #2, just call DispatchIndirect and pass in the 12 byte buffer created earlier and specify 0 for the offset to the args: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476406(v=vs.85).aspx

As long as the buffer has these 3 UINTs stored within it in a contiguous fashion by the time the GPU gets around to executing the DispatchIndirect event, you'll get that many thread groups being executed.
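Put together, a sketch of the buffer setup might look like the following. The struct and function names are hypothetical, and the `D3D11_BIND_UNORDERED_ACCESS` bind flag is an assumption made so that a compute shader can write the counts. The Direct3D part is guarded so the layout check also compiles on other platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef _WIN32
#include <d3d11.h>
#endif

// The three UINTs that DispatchIndirect reads: group counts in X, Y, Z.
struct DispatchIndirectArgs
{
    std::uint32_t threadGroupCountX;
    std::uint32_t threadGroupCountY;
    std::uint32_t threadGroupCountZ;
};
static_assert(sizeof(DispatchIndirectArgs) == 12,
              "DispatchIndirect expects 12 contiguous bytes of arguments");

#ifdef _WIN32
// Creates the 12-byte args buffer, initialised to a single thread group.
inline HRESULT CreateDispatchArgsBuffer(ID3D11Device* device,
                                        ID3D11Buffer** outBuffer)
{
    assert(device && outBuffer);
    const DispatchIndirectArgs initial = { 1, 1, 1 };

    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(DispatchIndirectArgs);
    desc.Usage     = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS; // assumption: filled by a CS
    desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = &initial;
    return device->CreateBuffer(&desc, &init, outBuffer);
}

// Launches the bound compute shader with group counts read from the buffer.
inline void DispatchSimulation(ID3D11DeviceContext* context,
                               ID3D11Buffer* argsBuffer)
{
    context->DispatchIndirect(argsBuffer, 0); // 0 = byte offset of the args
}
#endif
```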

##### Share on other sites

^^^ what ajmiles said

##### Share on other sites

I finally got it working. Thanks a lot!

Yet I am stuck at the next similar problem. My particles are stored in a StructuredBuffer and when I am going to actually draw them I bind a SRV to the VertexShader and use the deviceContext->Draw(?,0) call.

Here I have the same problem as above. I don't know how many particles to draw on the CPU side since they are spawned and destroyed purely in my ComputeShaders.

I thought about using DrawAuto(), but that requires the particles to be in a vertex buffer, and I don't think I can create UAVs of a vertex buffer and manipulate it with the compute shaders.

##### Share on other sites

DrawInstancedIndirect will do what you want to do. Copy the SB size into an indirect args buffer and pass that to the indirect draw method. (I mean copying the size to the specific location of the arguments in the indirect args buffer that you want: this controls number of verts vs. number of instances, etc.)

##### Share on other sites

Sorry for the late response, had a lot of things going on lately and no time to work on this project. Anyways...

I tried to use DrawInstancedIndirect with half success. I am not 100% sure what data has to be stored in ID3D11Buffer *pBufferForArgs.

I have created the buffer with no specific initial data and tried to copy the structure count with:

m_pdevicecontext->CopyStructureCount(pbDrawIndirectArgs, 0, puavSimulationSateNew);

This draws nothing at all!

After that I tried to play a bit with the initial data of pbDrawIndirectArgs.

IndirectArgs indirectArgs;
indirectArgs._one = 0;
indirectArgs._two = 10;
indirectArgs._three = 0;
indirectArgs._four = 0;

D3D11_SUBRESOURCE_DATA InitData;
InitData.pSysMem = &indirectArgs;
InitData.SysMemPitch = 0;
InitData.SysMemSlicePitch = 0;

HRESULT result = m_pdevice->CreateBuffer(&bufferDesc, &InitData, &pbDrawIndirectArgs);

Now the strange thing happens. As soon as I set indirectArgs._two to anything but 0 it actually draws my particles.

After that I removed the CopyStructureCount call. And again I had a different behavior. Now the particles are blinking as if only a few at a time are drawn.

In conclusion, I guess CopyStructureCount does actually work, but only if I set indirectArgs._two to anything but 0.

This totally confuses me and I have no idea why...


##### Share on other sites

The only resource that explains anything about that buffer structure is the book "Practical Rendering and Computation with Direct3D 11".

There it is something like:

Each of these entries is 4 bytes in size:

0 = Aligned Byte Offset For Args (uint)
1 = Aligned Byte Offset For Args (uint)
2 = Aligned Byte Offset For Args (uint)
3 = Vertex Count Per Instance (uint)
4 = Instance Count (uint)
5 = Start Vertex Location (uint)
6 = Start Instance Location (uint)

0-2: Is this space available for whatever data I want? Can it be expanded arbitrarily? Is this the number of bytes I have to skip, which can then be passed to DrawInstancedIndirect as the second parameter (AlignedByteOffsetForArgs)?

3: This must be 1 for me since I am drawing 1 vertex per particle and will create a billboard in the geometry shader.

4: I guess this is the actual number of particles that are drawn.

5, 6: Well no idea about these two.


##### Share on other sites

Should be vertex count, instance count, 0,0 (startvertloc and start inst loc).

So let's imagine you have a quad and you want to instance it 10 times: your indirect args buffer should be 4, 10, 0, 0.

if you plan to expand each vert into quads, it would be 10, 1, 0, 0.

(At least that's what I remember off the top of my head, I can check tonight when I get home).
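As a sketch of that layout (the struct and helper names are hypothetical), the four UINTs follow the parameter order of DrawInstanced, and the two examples above become:

```cpp
#include <cassert>
#include <cstdint>

// The four UINTs read by DrawInstancedIndirect, in DrawInstanced order.
struct DrawInstancedIndirectArgs
{
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};
static_assert(sizeof(DrawInstancedIndirectArgs) == 16,
              "DrawInstancedIndirect reads 16 contiguous bytes");

// A 4-vertex quad drawn `quadCount` times with hardware instancing.
inline DrawInstancedIndirectArgs InstancedQuads(std::uint32_t quadCount)
{
    return DrawInstancedIndirectArgs{ 4u, quadCount, 0u, 0u };
}

// `pointCount` single vertices in one instance, each expanded to a quad later
// (for example in a geometry shader).
inline DrawInstancedIndirectArgs ExpandedPoints(std::uint32_t pointCount)
{
    return DrawInstancedIndirectArgs{ pointCount, 1u, 0u, 0u };
}
```

`InstancedQuads(10)` yields 4, 10, 0, 0 and `ExpandedPoints(10)` yields 10, 1, 0, 0.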

Inspecting the results of the buffer is more annoying than it should be; NSight refuses to show it to me. However, the VS2012 graphics debugger displays it with no problems (finally something it does well :) ), or you can copy it to a staging buffer and display it in your app.
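The staging-buffer route could look roughly like the sketch below. The function names are hypothetical, and it assumes the args buffer holds at least four UINTs at offset 0. The Direct3D part is guarded so the formatting helper also compiles on other platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#ifdef _WIN32
#include <d3d11.h>
#endif

// Formats the four draw arguments for logging or on-screen display.
inline std::string FormatDrawArgs(const std::uint32_t (&args)[4])
{
    return "vertexCountPerInstance=" + std::to_string(args[0]) +
           " instanceCount=" + std::to_string(args[1]) +
           " startVertexLocation=" + std::to_string(args[2]) +
           " startInstanceLocation=" + std::to_string(args[3]);
}

#ifdef _WIN32
// Copies the args buffer into a CPU-readable staging buffer and reads it back.
// This stalls until the GPU has finished, so it is for debugging only.
inline HRESULT ReadBackDrawArgs(ID3D11Device* device,
                                ID3D11DeviceContext* context,
                                ID3D11Buffer* argsBuffer,
                                std::uint32_t (&out)[4])
{
    assert(device && context && argsBuffer);
    D3D11_BUFFER_DESC desc = {};
    argsBuffer->GetDesc(&desc);
    if (desc.ByteWidth < sizeof(out)) return E_INVALIDARG;

    desc.Usage               = D3D11_USAGE_STAGING;
    desc.BindFlags           = 0;
    desc.MiscFlags           = 0;
    desc.CPUAccessFlags      = D3D11_CPU_ACCESS_READ;
    desc.StructureByteStride = 0;

    ID3D11Buffer* staging = nullptr;
    HRESULT hr = device->CreateBuffer(&desc, nullptr, &staging);
    if (FAILED(hr)) return hr;

    context->CopyResource(staging, argsBuffer);

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    hr = context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);
    if (SUCCEEDED(hr))
    {
        std::memcpy(out, mapped.pData, sizeof(out));
        context->Unmap(staging, 0);
    }
    staging->Release();
    return hr;
}
#endif
```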

##### Share on other sites

> if you plan to expand each vert into quads, it would be 10, 1, 0, 0.

I guess that is a typo and you mean 1,10,0,0?

But still it does not explain why it does not work if I set the initial value to 1,0,0,0 and use CopyStructureCount to update the count...

Hm... alright I am going to install vs 2012. (Since I also was not able to figure the buffer results out via NSight)

##### Share on other sites

> I guess that is a typo and you mean 1,10,0,0?

Both can work; it depends on how you set up your input layout.

> But still it does not explain why it does not work if I set the initial value to 1,0,0,0 and use CopyStructureCount to update the count

You're drawing 0 instances, so nothing gets drawn. If you want to change the second parameter with CopyStructureCount I think it should be:
`m_pdevicecontext->CopyStructureCount(pbDrawIndirectArgs, 4, puavSimulationSateNew);`

##### Share on other sites

> if you plan to expand each vert into quads, it would be 10, 1, 0, 0.
>
> I guess that is a typo and you mean 1,10,0,0?
>
> But still it does not explain why it does not work if I set the initial value to 1,0,0,0 and use CopyStructureCount to update the count...
>
> Hm... alright I am going to install vs 2012. (Since I also was not able to figure the buffer results out via NSight)

It could work both ways: if you wanted 10 verts that you would expand in the GS, it would be 10,1,0,0 (10 verts -> 10 quads * 1 instance of the 10). Or you can use HW instancing. I have noticed differences in performance when generating hundreds of thousands of quads, with instancing being slightly slower than a single instance with loads of expanded verts.

If you set initial value of 1,0,0,0, you are specifying 1 vertex, 0 instances. So as long as you update the second parameter with your structured count, it should work.
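The behavior seen earlier in the thread can be reproduced with a small CPU-side model. This is only a sketch: `WriteCountAt` is a hypothetical stand-in for what CopyStructureCount does to the args buffer (write one UINT at a 4-byte-aligned byte offset), and it makes no Direct3D calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct DrawInstancedIndirectArgs
{
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};

// Models CopyStructureCount(dst, byteOffset, srcView): one UINT is written
// at `byteOffset` and the other fields keep their previous values.
inline void WriteCountAt(DrawInstancedIndirectArgs& args,
                         std::size_t byteOffset, std::uint32_t count)
{
    assert(byteOffset % 4 == 0);
    assert(byteOffset + sizeof(count) <= sizeof(args));
    std::memcpy(reinterpret_cast<unsigned char*>(&args) + byteOffset,
                &count, sizeof(count));
}

// A draw produces output only if both counts are non-zero.
inline bool DrawsAnything(const DrawInstancedIndirectArgs& a)
{
    return a.vertexCountPerInstance > 0 && a.instanceCount > 0;
}
```

Starting from 1,0,0,0 and writing the count at offset 0 leaves the instance count at 0, so nothing is drawn. Starting from 0,10,0,0 with offset 0 gives (count),10,0,0, which is why a non-zero second value appeared to fix things. Writing at offset 4 into 1,0,0,0 gives 1,(count),0,0.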

##### Share on other sites

Oh now it makes sense!

Visual Studio 2012 was a good idea as well!

And it works! Thank you all very very much!
