Do IncrementCounter, DecrementCounter, and atomic operations work properly across overlapping compute shaders?


Hey Guys,

Assume we have a job queue buffer, and we have two compute shaders (cs1, cs2) adding jobs into this job queue buffer by calling IncrementCounter on its counter (DX12). To make this work properly, we can first dispatch cs1, then set a barrier on the job queue buffer's counter buffer, and then dispatch cs2, to avoid a race condition where cs1 is still updating the job queue while cs2 starts to run.

But I was thinking about how IncrementCounter works under the hood. If the atomicity is guaranteed by GPU hardware, I think we probably don't need the barrier between dispatching cs1 and cs2. But I have no idea how the GPU guarantees the atomicity of IncrementCounter, so it would be nice if someone could explain what happens when overlapping compute shaders call IncrementCounter on the same buffer. Also, does the same concept apply to all atomic operations? (If we have overlapping compute shaders using atomic operations to update the same buffer, do we have to serialize those compute shaders to ensure correctness?)

Also, what's the difference between an Append/Consume buffer and a StructuredBuffer used with IncrementCounter/DecrementCounter as a stack-style buffer? I feel they are essentially the same thing (I may be wrong, though).

Thanks


For my second question about the differences between Append/Consume buffers and using a StructuredBuffer with IncrementCounter/DecrementCounter to achieve the same functionality, I found this page from Jason Z. In his post, he said there is a hidden counter in an Append/Consume buffer that indicates the number of items in the buffer. Does that mean they are actually the same thing?

Jason Z also mentioned that the counter is a dumb counter: it will underflow if you call Consume on an empty buffer instead of notifying you that there is nothing in the buffer...

I can think of a naive way to write my shader to implement correct stack behavior by adding an empty-stack check (sketched below), but that requires extra atomic operations, which seems pretty expensive. So it would be great if someone could share how they implemented/designed their GPU stack buffer~~
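
Roughly, what I have in mind is something like this (just a sketch with made-up names; it keeps its own signed count in a RWByteAddressBuffer, since HLSL doesn't let you inspect or test the hidden counter):

RWStructuredBuffer<uint> StackBuffer;   // stack storage
RWByteAddressBuffer StackCounter;       // signed item count at byte offset 0

bool TryPop(out uint value)
{
    value = 0;

    // Speculatively decrement; InterlockedAdd returns the value the
    // counter held *before* the add.
    uint prev;
    StackCounter.InterlockedAdd(0, uint(-1), prev);

    // Read the old value as signed: 0 means the stack was empty, and a
    // negative value means another thread already raced it below zero.
    if (asint(prev) <= 0)
    {
        // Undo our decrement and report failure.
        uint unused;
        StackCounter.InterlockedAdd(0, 1, unused);
        return false;
    }

    // prev items were on the stack, so the top element is at prev - 1.
    // (This assumes pushes and pops don't overlap within one dispatch.)
    value = StackBuffer[prev - 1];
    return true;
}

The failure path costs a second atomic to undo the decrement, which is exactly the overhead I'd like to avoid.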

Thanks in advance.

P.S. For anyone who is willing to answer my first question: big thanks, too.

Yes, the counter used by Increment/DecrementCounter and Append/Consume is the same hidden counter. Append/Consume are just a bit of syntactic sugar that wraps usage of the hidden counter. You can implement the same functionality yourself with Increment/DecrementCounter:

uint outputIdx = OutputBuffer.IncrementCounter(); // atomically bumps the hidden counter, returns the pre-increment value
OutputBuffer[outputIdx] = outputData;             // write into the slot we just allocated

If you look at the generated bytecode for a shader that calls Append, you'll see that it does the equivalent of the above code. Let's try it on this simple compute shader:

AppendStructuredBuffer<uint> OutputBuffer;
 
[numthreads(64, 1, 1)]
void CSMain(in uint3 ThreadID : SV_GroupThreadID)
{
    OutputBuffer.Append(ThreadID.x);
}

Here's the resulting bytecode generated by FXC:

cs_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u0, 4
dcl_input vThreadIDInGroup.x
dcl_temps 1
dcl_thread_group 64, 1, 1
imm_atomic_alloc r0.x, u0
store_structured u0.x, r0.x, l(0), vThreadIDInGroup.x
ret

The imm_atomic_alloc performs the atomic add on the hidden counter, and the value of the counter prior to that add is used as the index for writing into the structured buffer.
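
Consume is the mirror image: it compiles to imm_atomic_consume, and you can reproduce it yourself with DecrementCounter. Note the asymmetry: IncrementCounter returns the pre-increment value, while DecrementCounter returns the post-decrement value, so it hands you the index of the item to read directly. A minimal sketch (InputBuffer is a placeholder for a counter-enabled RWStructuredBuffer):

uint inputIdx = InputBuffer.DecrementCounter(); // returns the post-decrement value
uint inputData = InputBuffer[inputIdx];         // read the element that was on top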

Different GPU architectures have different ways of implementing this functionality. The most straightforward way to do it would be for the GPU to support a global atomic add on a memory location. Then the "hidden counter" is just a 4-byte value somewhere in GPU-accessible memory. On the GPU architectures that I'm familiar with (AMD's GCN family), this sort of global atomic operation is supported and could be used to implement the counter. However, these GPUs actually have a bit of on-chip memory called "GDS" that the driver uses for this purpose. Using GDS allows the operation to be kept entirely on-chip without having to go out to caches or external memory.
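
Conceptually, the hidden counter behaves as if the driver had set up something like this for you (purely illustrative HLSL with hypothetical names, minus the on-chip GDS fast path):

RWByteAddressBuffer CounterMem;  // the "hidden counter": 4 bytes of GPU memory

uint AllocSlot()
{
    // One global atomic add; the returned pre-add value is the index
    // of the slot this thread just claimed.
    uint prev;
    CounterMem.InterlockedAdd(0, 1, prev);
    return prev;
}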

On AMD GPUs, the way these operations work would allow the counter to be accessed concurrently by two different shaders with no issues. I would assume it would also work on other GPUs, but I'm not too familiar with Intel chips, and Nvidia is notoriously tight-lipped about the specifics of their hardware. Either way, I'm not sure exactly what guarantees are made by the D3D12 API. D3D11 didn't even have the notion of multiple shaders executing in parallel, and essentially forced sync points between sequential dispatches (as if you always used a UAV barrier). Since most of the shader-centric documentation is from the D3D11 era, it doesn't have any information about accessing counters across multiple dispatches. The docs for D3D12_RESOURCE_UAV_BARRIER do say this:


You don't need to insert a UAV barrier between 2 draw or dispatch calls that only read a UAV. Additionally, you don't need to insert a UAV barrier between 2 draw or dispatch calls that write to the same UAV if you know that it's safe to execute the UAV accesses in any order.

I would think that "safe to execute UAV accesses in any order" would apply in the case of an atomic append, since you're expecting the results to be unordered. But perhaps someone more familiar with the specs or documentation could clear that up further. Either way, I would be surprised if what you described didn't work in practice.

Thanks MJP! Then I guess there isn't any reason to use an Append/Consume buffer over a StructuredBuffer, since the latter is much more flexible :-)
