Direct3D11+ComputeShader: Reference device bug with AllMemoryBarrier()?

Started by JB2009. 10 comments, last by JB2009 13 years, 6 months ago.
I'm finding that the reference device appears to ignore AllMemoryBarrier(), or has some other problem reading buffer values that were written earlier in a given compute shader.

Compute shader:
RWStructuredBuffer<float> DataA;
RWStructuredBuffer<float> DataB;
int NumItems;

[numthreads(1,2,1)]
void Method(uint3 i_DispatchThreadID:SV_DispatchThreadID)
{
  int Index;
  Index=i_DispatchThreadID.y;
  DataA[Index]=1.0;
  AllMemoryBarrier(); // Wait for adjacent values of DataA to be set.
  [branch] if (Index!=NumItems-1)
    DataB[Index]=DataA[Index+1];
}


Called by:
Dispatch(1,ceil(NumItems/ThreadGroupSize.y),1);
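
A minimal C++ sketch of that call (the variable names are mine, and Context is assumed to be the ID3D11DeviceContext):

const UINT ThreadGroupSizeY=2; // Must match [numthreads(1,2,1)] in the shader.
const UINT NumGroupsY=(NumItems+ThreadGroupSizeY-1)/ThreadGroupSizeY; // Integer ceiling division.
Context->Dispatch(1,NumGroupsY,1);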


Before the code is called, DataA contains all 0.0s.

The code writes 1.0 to each element of DataA (1 item for each thread in the thread group), waits for the writes to complete, then sets each element of DataB from the next element of DataA.

With 8 items, 2 threads per group, and 4 groups, the results are:

GPU:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0}; // Correct.

Reference device:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0}; // Wrong!

With 8 items, 8 threads per group, and 1 group, the results are:

GPU:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0}; // Correct.

Reference device:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0}; // Wrong!

Any ideas?

JB.

[Edited by - JB2009 on November 9, 2010 6:07:57 AM]
If you comment out the AllMemoryBarrier(), do the GPU and reference solutions match?
Chris,

> If you comment out the AllMemoryBarrier(), do the GPU and reference solutions match?

The results are the same as above. NumItems is less than the warp size for the GPU I'm using, so for this (small) test case AllMemoryBarrier() is not required and its presence has no effect.

I don't know what the warp size is for the reference device.

JB.
Have you tried the AllMemoryBarrierWithGroupSync() instead? I think you need this to guarantee that all of the threads have been started by that point. Since the reference device is probably running on your CPU it will have a significantly smaller 'wavefront' size than the GPU, making it very possible that some threads haven't started yet by the time the AllMemoryBarrier() is called.

As an aside about *MemoryBarrier(): I have been trying for a couple of months now to find out what this instruction is actually good for. I have asked contacts at the IHVs, Microsoft, and the public at large, and I have not found a single instance where it is more beneficial than the version with group sync... I suspect it has something to do with Append/Consume buffer usage with mixed threads (i.e. threads having different workloads), but I can't say for sure. If you know of such a situation, please let me know!
Hi Jason,

> Have you tried the AllMemoryBarrierWithGroupSync() instead?

Thanks for that. With AllMemoryBarrier() replaced by AllMemoryBarrierWithGroupSync(), the reference device gives the correct result. Without the AllMemoryBarrierWithGroupSync(), the reference device was exhibiting a wavefront/warp size of 4 (on a quad-core CPU), which as you say is much smaller than a typical GPU's. My example used 2 threads per group with NumItems=8 because I did not realize that only the threads within a group can be synchronized, not all threads across all groups as I was trying to do.
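
For reference, the only change to the shader above is the barrier line:

  DataA[Index]=1.0;
  AllMemoryBarrierWithGroupSync(); // All threads in the group reach this point before any continue.
  [branch] if (Index!=NumItems-1)
    DataB[Index]=DataA[Index+1];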

The MSDN documentation at http://msdn.microsoft.com/en-us/library/ff471350(v=VS.85).aspx
says that AllMemoryBarrier() will guarantee that (all?) outstanding memory operations have completed. I assumed that meant all outstanding memory writes would have been both issued and completed. However, John Rapp (http://forums.create.msdn.com/forums/p/29933/169809.aspx) says (as you do) that without the "WithGroupSync" suffix some threads may not have reached their memory write instructions yet.

I need to handle 4096 items, and the number of threads per group is limited to 1024 (in Direct3D 11). As each item depends on values written to memory by its neighbors, I need to wait for the neighbors to complete their memory writes before reading. Am I right in thinking that I either need to (a) call Dispatch() multiple times (to synchronize all threads in all groups), or (b) have a single group and process multiple items per thread?

Both approaches have drawbacks.

JB.
I remember reading somewhere that there is a way to make the memory barriers act across all thread groups by adding something like a 'globally coherent' attribute to the UAV declaration in HLSL. That might take care of your problem right off the bat...
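
If I remember correctly, the declaration would look something like this (untested sketch):

globallycoherent RWStructuredBuffer<float> DataA; // Non-atomic writes should become visible to other groups.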

But you could also easily process multiple items per thread. I don't think you will get that big of a throughput problem... especially if you aren't doing much math - you will already be memory bound.
Hi Jason,

Adding "globallycoherent" does not appear to ensure that all groups have been run - i.e. it does not change the function of AllMemoryBarrierWithGroupSync() to "AllMemoryBarrierWithAllGroupsSync()". Therefore some groups access the UAV before it has been written to.

What I'd like to happen is for the GPU to run all groups up to a given point in the code before crossing each barrier.

Consider this GPU driver/hardware pseudo code. Assume there are the same number of stream processors in the GPU as threads in each group:
globallycoherent RWStructuredBuffer<float> SharedUAV;

for (GroupIndex=0;GroupIndex<NumGroups;++GroupIndex)
{
  RunGroupOnStreamProcessors_FirstPartOfProgram(GroupIndex);
  SaveStateOfStreamProcessors(GroupIndex);
}

AllMemoryBarrierWithGroupSync(); // Wait for all elements in SharedUAV to be set by all thread groups.

for (GroupIndex=0;GroupIndex<NumGroups;++GroupIndex)
{
  RestoreStateOfStreamProcessors(GroupIndex);
  RunGroupOnStreamProcessors_SecondPartOfProgram(GroupIndex);
  SaveStateOfStreamProcessors(GroupIndex);
}

// etc.

Clearly this would be difficult to implement, and I initially assumed it was not possible. However, I guess that something like this has already been implemented, as the mobility HD5850 has only 800 stream processors and Direct3D 11 allows 1024 threads per group. My guess is that they do need to save the stream processor (or SIMD, etc.) state when there are more than 800 threads per group.

So, is this sort of behaviour possible with DirectCompute?

The annoying thing is that if the GPU can emulate more threads per group than it has stream processors available, then MS did not need to limit the number of threads per group.

(Of course it may be quicker to get each thread to process multiple items than to do the above).

---

> But you could also easily process multiple items per thread. I don't think you will get that big of a throughput problem

The full code is several hundred lines (a fluid flow simulation in a highly detailed and bendy tube) and has to process up to 4096 items more than 100 times per final visual frame, so currently performance is an issue.

I've had some problems getting each thread to process multiple items. The CS code looks something like this:
for (IterationIndex=0;IterationIndex<NumIterations;++IterationIndex)
{
  AllMemoryBarrierWithGroupSync();
  for (ItemIndex=0;ItemIndex<NumItemsPerThread;++ItemIndex)
    DoSomething_0(ItemIndex);

  AllMemoryBarrierWithGroupSync();
  for (ItemIndex=0;ItemIndex<NumItemsPerThread;++ItemIndex)
    DoSomething_1(ItemIndex);

  AllMemoryBarrierWithGroupSync();
  for (ItemIndex=0;ItemIndex<NumItemsPerThread;++ItemIndex)
    DoSomething_2(ItemIndex);
}

I think the outer loop has to be unrolled because the compiler doesn't like AllMemoryBarrierWithGroupSync() inside loops. However, I've had situations where it fails to unroll the code but doesn't give a reason.
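
If the iteration count is a compile-time constant (a #define or literal rather than a constant-buffer value), the unroll can at least be forced explicitly - a sketch:

#define NUM_ITERATIONS 4 // Hypothetical; must be known at compile time for the unroll to succeed.
[unroll]
for (IterationIndex=0;IterationIndex<NUM_ITERATIONS;++IterationIndex)
{
  AllMemoryBarrierWithGroupSync();
  for (ItemIndex=0;ItemIndex<NumItemsPerThread;++ItemIndex)
    DoSomething_0(ItemIndex);
  // ...remaining phases as above...
}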

Also, my original code computes many local variables before the IterationIndex outer loop, and these variables apply to a particular item. If more than one item is processed per thread, they either need to be stored in arrays or recalculated inside the loop(s), both of which are messy.
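
As a sketch of the array form (the names here are made up):

float ItemState[NUM_ITEMS_PER_THREAD]; // Hypothetical per-item local; NUM_ITEMS_PER_THREAD is a compile-time constant.
for (ItemIndex=0;ItemIndex<NUM_ITEMS_PER_THREAD;++ItemIndex)
  ItemState[ItemIndex]=ComputeInitialState(ItemIndex); // Hypothetical helper; runs once before the IterationIndex loop.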

JB.
DirectCompute doesn't contain any simple mechanism to force groups to sync to the same point. You'd have to implement something yourself, and that would be difficult since there's no way to make a group 'wait' until it's been flagged. You could spin, but that would be lost productivity, and there might not be enough register space to have all groups running concurrently, at which point the algorithm would never complete.

globallycoherent ensures that non-atomic writes to UAVs will be visible between groups. Otherwise those writes may only be visible to the current group.

Atomic writes will always be visible to all groups.

*WithGroupSync means that the threads within a group will be synchronized, but NOT that all threads across all groups reach the same sync point before any proceed.

AllMemoryBarrier() does ensure that writes by each currently running thread are visible to the entire group. Groups with large thread counts will require more than one warp/wavefront, and I don't think there is any requirement for warps/wavefronts to run in lock step with each other -- which is where *WithGroupSync comes in handy. Algorithms that don't use *WithGroupSync will be more complex and less obvious, since there would be inter-thread communication but no guarantee of when all the threads run, other than that sibling threads of the same warp/wavefront are running.
Also, it'd be great if you could post code for anything that doesn't compile and doesn't indicate why. Those would be good compiler bugs to know about.
If you really need to have all 4096 items processed simultaneously, then I would suppose you could just break the shader up into sections that would normally be separated by the memory barriers. You would use more bandwidth by reading and writing to the UAV that way, but if I understand the scenario correctly then you are doing this already within the single shader invocation.
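
As a sketch of what I mean, using your original example and with everything else assumed to stay the same, the split could look something like this, with one Dispatch per pass (writes from the first Dispatch are visible to the second because the runtime orders dependent dispatches):

// Pass 0: write DataA. Dispatched first.
[numthreads(1,2,1)]
void Pass0(uint3 id:SV_DispatchThreadID)
{
  DataA[id.y]=1.0;
}

// Pass 1: read the neighbouring DataA value. Dispatched second.
[numthreads(1,2,1)]
void Pass1(uint3 id:SV_DispatchThreadID)
{
  int Index=id.y;
  [branch] if (Index!=NumItems-1)
    DataB[Index]=DataA[Index+1]; // Sees values written by every group in Pass0.
}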

Did you try doing multiple passes over the data instead of everything all at once yet?

