Direct3D11+ComputeShader: Reference device bug with AllMemoryBarrier()?

JB2009    100
I'm finding that the reference device appears to ignore AllMemoryBarrier(), or has some other problem reading buffer values that were written earlier in a given compute shader.

Compute shader:

RWStructuredBuffer<float> DataA;
RWStructuredBuffer<float> DataB;
int NumItems;

[numthreads(1,2,1)]
void Method(uint3 i_DispatchThreadID : SV_DispatchThreadID)
{
    int Index;

    Index = i_DispatchThreadID.y;

    DataA[Index] = 1.0;

    AllMemoryBarrier(); // Wait for adjacent values of DataA to be set.

    [branch] if (Index != NumItems - 1)
        DataB[Index] = DataA[Index + 1];
}


Called by:

Dispatch(1, (NumItems + ThreadGroupSize.y - 1) / ThreadGroupSize.y, 1); // Integer ceiling; note that ceil(NumItems/ThreadGroupSize.y) would truncate first with integer operands.


Before the code is called, DataA contains all 0.0s.

The code writes 1.0 to each element of DataA (1 item for each thread in the thread group), waits for the writes to complete, then sets each element of DataB from the next element of DataA.

With 8 items, 2 threads per group, and 4 groups, the results are:

GPU:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0}; // Correct.

Reference device:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0}; // Wrong!

With 8 items, 8 threads per group, and 1 group, the results are:

GPU:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0}; // Correct.

Reference device:

DataA={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0};
DataB={1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0}; // Wrong!

Any ideas?

JB.

[Edited by - JB2009 on November 9, 2010 6:07:57 AM]

JB2009    100
Chris,

> If you comment out the AllMemoryBarrier(), do the GPU and reference solutions match?

The results are the same as above. NumItems is less than the warp size of the GPU I'm using, so for this (small) test case AllMemoryBarrier() is not required and its presence has no effect.

I don't know what the warp size is for the reference device.

JB.

Jason Z    6434
Have you tried the AllMemoryBarrierWithGroupSync() instead? I think you need this to guarantee that all of the threads have been started by that point. Since the reference device is probably running on your CPU it will have a significantly smaller 'wavefront' size than the GPU, making it very possible that some threads haven't started yet by the time the AllMemoryBarrier() is called.

As an aside about *MemoryBarrier(): I have been trying for a couple of months now to find out what this instruction is actually good for. I have asked contacts at the IHVs, Microsoft, and the public at large, and I have not found a single instance where this instruction is more beneficial than the version with group sync... I suspect it has something to do with Append/Consume buffer usage with mixed threads (i.e. threads having different workloads), but I can't say for sure. If you know of such a situation, please let me know!

JB2009    100
Hi Jason,

> Have you tried the AllMemoryBarrierWithGroupSync() instead?

Thanks for that. With AllMemoryBarrier() replaced by AllMemoryBarrierWithGroupSync(), the reference device gives the correct result. Without it, the reference device was exhibiting a wavefront/warp size of 4 (on a quad-core CPU), which as you say is much smaller than a typical GPU's. My example used 2 threads per group with NumItems=8 because I did not realize that only the threads within a group can be synchronized, not all threads across all groups as I was trying to achieve.

The MSDN documentation at http://msdn.microsoft.com/en-us/library/ff471350(v=VS.85).aspx
says that AllMemoryBarrier() will guarantee that (all?) outstanding memory operations have completed. I assumed that meant all memory writes would have been both issued and completed. However, John Rapp (http://forums.create.msdn.com/forums/p/29933/169809.aspx) says (as you do) that without the "WithGroupSync" suffix, some threads may not have reached their memory write instructions yet.

I need to handle 4096 items, and the number of threads per group is limited to 1024 (in Direct3D 11). As each item depends on values written to memory by its neighbors, I need to wait for the neighbors to complete their memory writes before reading. Am I right in thinking that I either need to (a) call Dispatch() multiple times (to synchronize all threads in all groups), or (b) have a single group and process multiple items per thread?

Both approaches have drawbacks.
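For illustration, (a) might look something like the following sketch - splitting the original shader at the barrier into two entry points, so that the Dispatch() boundary itself provides the all-groups sync. The entry point names and thread counts here are hypothetical:

```hlsl
// Sketch of option (a): two Dispatch() calls instead of one barrier.
// D3D11 guarantees that UAV writes from one Dispatch() are visible to
// the next, so the boundary acts as a sync across all thread groups.
RWStructuredBuffer<float> DataA;
RWStructuredBuffer<float> DataB;
int NumItems;

[numthreads(1,1024,1)]
void Pass0(uint3 i_DispatchThreadID : SV_DispatchThreadID)
{
    DataA[i_DispatchThreadID.y] = 1.0; // First half of the original shader.
}

[numthreads(1,1024,1)]
void Pass1(uint3 i_DispatchThreadID : SV_DispatchThreadID)
{
    int Index = i_DispatchThreadID.y;
    [branch] if (Index != NumItems - 1)
        DataB[Index] = DataA[Index + 1]; // Second half; all DataA writes are complete here.
}
```

On the host side this would be a CSSetShader()/Dispatch() pair per section per iteration, which is exactly the per-iteration overhead that makes (a) unattractive with 100+ iterations per frame.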

JB.

Jason Z    6434
I remember reading somewhere that there is a way to make the memory barriers act across all thread groups by adding something like a 'globally coherent' attribute to the UAV declaration in HLSL. That might take care of your problem right off the bat...

But you could also easily process multiple items per thread. I don't think you will get that big of a throughput problem... especially if you aren't doing much math - you will already be memory bound.

JB2009    100
Hi Jason,

Adding "globallycoherent" does not appear to ensure that all groups have been run - i.e. it does not change the function of AllMemoryBarrierWithGroupSync() to "AllMemoryBarrierWithAllGroupsSync()". Therefore some groups access the UAV before it has been written to.

What I'd like to happen is for the GPU to run all groups up to a given point in the code before crossing each barrier.

Consider this GPU driver/hardware pseudo code. Assume there are the same number of stream processors in the GPU as threads in each group:

globallycoherent RWStructuredBuffer<float> SharedUAV;

for (GroupIndex=0; GroupIndex<NumGroups; ++GroupIndex)
{
    RunGroupOnStreamProcessors_FirstPartOfProgram(GroupIndex);
    SaveStateOfStreamProcessors(GroupIndex);
}

AllMemoryBarrierWithGroupSync(); // Wait for all elements in SharedUAV to be set by all thread groups.

for (GroupIndex=0; GroupIndex<NumGroups; ++GroupIndex)
{
    RestoreStateOfStreamProcessors(GroupIndex);
    RunGroupOnStreamProcessors_SecondPartOfProgram(GroupIndex);
    SaveStateOfStreamProcessors(GroupIndex);
}

etc

Clearly this would be difficult to implement, and I initially assumed it was not possible. However, I guess something like it has already been implemented, as the mobility HD5850 has only 800 stream processors while Direct3D 11 allows 1024 threads per group. My guess is that the driver does need to save the stream processor (or SIMD etc.) state when there are more threads per group than stream processors.

So, is this sort of behaviour possible with DirectCompute?

The annoying thing is that if the GPU can emulate more threads per group than it has stream processors available, then MS did not need to limit the number of threads per group.

(Of course it may be quicker to get each thread to process multiple items than to do the above).

---

> But you could also easily process multiple items per thread. I don't think you will get that big of a throughput problem

The full code is several hundred lines (a fluid flow simulation in a highly detailed and bendy tube) and has to process up to 4096 items more than 100 times per final visual frame, so currently performance is an issue.

I've had some problems getting each thread to process multiple items. The CS code looks something like this:

for (IterationIndex=0; IterationIndex<NumIterations; ++IterationIndex)
{
    AllMemoryBarrierWithGroupSync();

    for (ItemIndex=0; ItemIndex<NumItemsPerThread; ++ItemIndex)
        DoSomething_0(ItemIndex);

    AllMemoryBarrierWithGroupSync();

    for (ItemIndex=0; ItemIndex<NumItemsPerThread; ++ItemIndex)
        DoSomething_1(ItemIndex);

    AllMemoryBarrierWithGroupSync();

    for (ItemIndex=0; ItemIndex<NumItemsPerThread; ++ItemIndex)
        DoSomething_2(ItemIndex);
}

I think the outer loop has to be unrolled because the compiler doesn't like AllMemoryBarrierWithGroupSync() inside loops. However, I've had situations where it fails to unroll the code without giving a reason.
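For reference, the shape I'm aiming for is something like the following sketch - the outer trip count as a compile-time constant with [unroll] requested explicitly, which is my understanding of what the compiler wants before it will accept barriers in a loop. All names and the placeholder work are hypothetical:

```hlsl
// Sketch only: barriers inside a loop whose trip count is a
// compile-time constant, with unrolling requested explicitly.
#define NUM_ITERATIONS 4
#define NUM_ITEMS_PER_THREAD 4

RWStructuredBuffer<float> Data;

void DoSomething_0(int ItemIndex) { Data[ItemIndex] += 1.0; } // Placeholder per-item work.

[numthreads(1,1024,1)]
void Method(uint3 i_DispatchThreadID : SV_DispatchThreadID)
{
    [unroll]
    for (int IterationIndex = 0; IterationIndex < NUM_ITERATIONS; ++IterationIndex)
    {
        AllMemoryBarrierWithGroupSync();

        for (int ItemIndex = 0; ItemIndex < NUM_ITEMS_PER_THREAD; ++ItemIndex)
            DoSomething_0(ItemIndex);
    }
}
```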

Also, my original code computes many local variables before the IterationIndex outer loop. These variables apply to a particular item. If more than one item is being processed then they need to be stored in arrays, or need to be calculated each time in the loop(s), both of which are messy.

JB.

DieterVW    724
DirectCompute doesn't contain any simple mechanisms to force groups to sync to the same point. You'd have to implement something yourself and this would be difficult since there's no way to get a group to 'wait' until it's been flagged. You could spin but that would be lost productivity, plus there might not be enough register space to have all groups running concurrently, at which point the algorithm would never complete.

GloballyCoherent ensures that non-atomic writes to UAVs will be visible between groups. Otherwise these writes may only be visible to the current group.

Atomic writes will always be visible to all groups.

*WithGroupSync means that all threads in a group will be synchronized, but NOT that all threads across all groups reach the same sync point before any proceed.

AllMemoryBarrier() does ensure that writes by each currently running thread are visible to the entire group. Groups with large thread counts will require more than one warp/wavefront, and I don't think there is any requirement for warps/wavefronts to run in lock step with each other -- which is where *WithGroupSync comes in handy. Algorithms not using *WithGroupSync will be more complex and less readily apparent, since there would be inter-thread communication but no guarantee of when all threads run - other than that sibling threads of the same warp/wavefront are running.
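To make the coherency point concrete, a minimal sketch (buffer names invented):

```hlsl
// globallycoherent makes non-atomic UAV writes visible across groups,
// rather than only within the writing group. Names are hypothetical.
globallycoherent RWStructuredBuffer<float> SharedData;
RWByteAddressBuffer Counters;

[numthreads(64,1,1)]
void Method(uint3 i_DispatchThreadID : SV_DispatchThreadID)
{
    // Non-atomic write; globallycoherent makes it visible to other groups.
    SharedData[i_DispatchThreadID.x] = 1.0;

    // Atomic write; visible to all groups regardless of the declaration.
    uint Old;
    Counters.InterlockedAdd(0, 1, Old);
}
```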

DieterVW    724
Also, it'd be great if you could post code for anything that doesn't compile and doesn't indicate why. Those would be good compiler bugs to know about.

Jason Z    6434
If you really need to have all 4096 items processed simultaneously, then I would suppose you could just break the shader up into sections that would normally be separated by the memory barriers. You would use more bandwidth by reading and writing to the UAV that way, but if I understand the scenario correctly then you are doing this already within the single shader invocation.

Did you try doing multiple passes over the data instead of everything all at once yet?

JB2009    100
Hi Dieter,

I did try the manual inter-group sync, but aborted the work due to the need for the spin wait. Thanks for the warning about register space.

Your explanation of AllMemoryBarrier() agrees exactly with my experience with testing AllMemoryBarrier() and the reference device. Other posts (e.g. [D3D11] Compute Shader Memory Barriers) imply that AllMemoryBarrier() somehow ensures that all previous shared memory writes in the code have been completed, which is why I had the problem that started this thread. I think that (erroneous) idea comes from a misinterpretation of the MSDN documentation, but I think the documentation is ambiguous because many of us have made the same mistake. It says "Blocks execution of all threads in a group until all memory accesses have been completed". Perhaps it should say "Blocks execution of all threads in a group until all memory accesses that have already started have completed"?

I was hoping to find a way to overcome the 1024 "Num threads per group" limit. My compute shaders require up to 4096 calculations, where each calculation depends on its neighbors, so it is not OK to use multiple groups. I suspect (from a quick glance at the ATI Stream Computing document) that the hardware can achieve this, but the 1024 numthreads Direct3D 11 limit prevents that being available to us.

Many thanks for your help,

JB.

JB2009    100
Hi Jason,

> Did you try doing multiple passes over the data instead of everything all at once yet?

Each of the 4096 items must be processed once per iteration, and each item depends on values from its two adjacent items from the previous iteration.

Therefore it is one indivisible problem.

I've got the "4 items per thread" code working, though I'll probably need multiple versions of each shader for different numbers of items (e.g. using macros) as the number of items varies from 6 to 4096.
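The macro approach I have in mind is roughly the following sketch - one shader source compiled once per item count, passing NUM_ITEMS as a preprocessor define (e.g. via the pDefines argument of D3DCompile). All names are hypothetical and the per-item work is a placeholder:

```hlsl
// Sketch: one source, several compiled variants selected by NUM_ITEMS.
#ifndef NUM_ITEMS
#define NUM_ITEMS 4096
#endif

#define THREADS_PER_GROUP 1024
// Integer ceiling so small item counts (e.g. 6) still get one item per thread.
#define ITEMS_PER_THREAD ((NUM_ITEMS + THREADS_PER_GROUP - 1) / THREADS_PER_GROUP)

RWStructuredBuffer<float> Items;

[numthreads(1, THREADS_PER_GROUP, 1)]
void Method(uint3 i_DispatchThreadID : SV_DispatchThreadID)
{
    [unroll]
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int Index = i_DispatchThreadID.y * ITEMS_PER_THREAD + i;
        [branch] if (Index < NUM_ITEMS)
            Items[Index] += 1.0; // Placeholder per-item work.
    }
}
```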

Thanks for your help.

JB.

[Edited by - JB2009 on November 12, 2010 5:54:58 AM]
