maxest

DX11 DirectCompute: sync within warp


In countless CUDA-related sources I've found that, when operating within a single warp, one can skip __syncthreads because all instructions execute synchronously within that warp. I followed that advice and applied it in DirectCompute (I use NV's GPU). I wrote this code, which does nothing more than a good old prefix sum of 64 elements (64 is the size of my block):

	groupshared float errs1_shared[64];
groupshared float errs2_shared[64];
groupshared float errs4_shared[64];
groupshared float errs8_shared[64];
groupshared float errs16_shared[64];
groupshared float errs32_shared[64];
groupshared float errs64_shared[64];
	
void CalculateErrs(uint threadIdx)
{
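    // Reduction tree: each level halves the number of active threads and sums adjacent pairs
    // of the previous level, so errs64_shared[0] ends up holding the sum of all 64 inputs.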
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
}
	

This works flawlessly. I noticed that I have bank conflicts here, so I changed the code to this:

	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
}
	

And to my surprise this one causes race conditions. Is it because I should not rely on that behavior (implicit sync within a warp) when working with DirectCompute instead of CUDA? That hurts my performance by a measurable margin: the first version, with bank conflicts, is still around 15-20% faster than the second, conflict-free version, because for the latter I have to add GroupMemoryBarrierWithGroupSync between each assignment.
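
For clarity, the barriered variant I'm comparing against looks roughly like this (a sketch reconstructed from the description above; a barrier sits between the levels so every level reads fully written data):

void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
}

Note that the barriers sit outside the conditionals, so all threads of the group reach them, as DirectCompute requires.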


First, why do you use multiple arrays for intermediate results? Here is some code that avoids this:

    if (lID<32) _lds[(((lID >> 0) << 1) | (lID &  0) |  1) ]    += _lds[(((lID >> 0) << 1) |  0) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 1) << 2) | (lID &  1) |  2) ]    += _lds[(((lID >> 1) << 2) |  1) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 2) << 3) | (lID &  3) |  4) ]    += _lds[(((lID >> 2) << 3) |  3) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 3) << 4) | (lID &  7) |  8) ]    += _lds[(((lID >> 3) << 4) |  7) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 4) << 5) | (lID & 15) | 16) ]    += _lds[(((lID >> 4) << 5) | 15) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 5) << 6) | (lID & 31) | 32) ]    += _lds[(((lID >> 5) << 6) | 31) ];    memoryBarrierShared(); barrier();

... maybe you get better / different results because this way it becomes easier for the compiler to figure things out? (The snippet is GLSL; memoryBarrierShared(); barrier(); roughly corresponds to GroupMemoryBarrierWithGroupSync() in HLSL.)

AMD's advice is: use barriers in any case; the compiler will remove them if they are not necessary.

But I have never tested whether they (and NV) really do.

 

Do you really get a performance drop when adding barriers to your first snippet? (You didn't make this clear, but I'd be very disappointed if so.)

 


 


Oh, now I see: your initial conditions make your threads divergent (<32, <16, <8, ...). In my case it's always <32, so for a warp you could even remove the branch.

I believe it is the divergence that causes the race conditions. Probably you can only remove barriers if the code is guaranteed to have the same control flow for each thread.


Oh, I completely forgot that I can't have divergent branches if I want to make use of that "assumption". But I've tried this code before as well:

	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32)
    {
        errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
        errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
        errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
        errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
        errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
        errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
    }
}
	

And it also causes race conditions, even though there is no divergence within the warp.

"Do you really get a performance drop when adding barriers in your first snippet? (You did'n made this clear, but i'd be very disappointed.)"

The *second* snippet, yes. When I add barriers to the second snippet the code is slower than the one from the first snipper.

Edited by maxest


Probably two options:

Without the barrier it is not guaranteed that the data has been written, even though the threads run in lockstep.

Or the threads do not really run in lockstep. I remember that since Kepler only subgroups of 16 or 8 threads run in lockstep (not sure). And in CUDA they have to disable this behavior because of their previous (stupid) advice that you can assume lockstep for waves?

No clue. But let me know if you try my snippet without multiple arrays.

 

Side note: on AMD I have submitted driver bug reports for malfunctioning prefix sums in OpenCL and, one year later, in Vulkan (not sure whether the latter has been fixed yet). Also, in OpenCL I've had a case where adding the useless <64 branch in front improved performance a lot, because otherwise the compiler decided to waste registers like crazy :) It seems these simple prefix sums are a bit of a compiler stress test.

 


Implementation using one array:

	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 32];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 16];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 8];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 4];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 2];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 1];
}
	

It works, but you might be surprised that it runs slower than when I used this:

	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32)
    {
        errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
        errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
        errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    }
}
	

This one is a modification of my first snippet (from the first post) that ping-pongs between two arrays. And here again, it is faster by 15-20% than the one-array version. So my guess is that it's the barriers that cost time. Please note that I call CalculateErrs 121 times in my shader, which runs for every pixel, so that is a lot.

I would be perfectly fine with *not* relying on warp size to avoid barriers, because maybe DirectCompute simply does not allow this "trick" since it's not NV-only. But what bugs me is that when I run the bank-conflicted second snippet from this post, or the first snippet from the first post, it works like a charm, and I save performance by not having to use barriers.

Edited by maxest

3 minutes ago, maxest said:

This one is a modification of my first snippet that ping-pongs between two arrays. And here again, it is faster by 15-20% than the first snippet in this post. So my guess is that it's the barriers that cost time.

And the first snippet does NOT work if you remove the barriers?

And the second becomes slower if you add barriers?

 

Regarding performance, this can still have a lot of other causes:

On NV, using more LDS can actually improve performance (but in practice we never have enough, so we usually use as little as possible, if only to improve occupancy).

The memory access pattern is different.

The first snippet has more 'useless' initial branches (which may also affect whether barriers are required).


D3D/DirectCompute has no concept of a "warp". It follows a model where every thread is considered to be completely independent, and where synchronization can only be performed across a thread group (which is different than a warp). This is starting to change a bit with SM6.0 and wave-level intrinsics, but even then there aren't necessarily guarantees about accessing shared memory in lock-step. You should really just insert the appropriate barriers if you want your code to be robust across different hardware.

The technique you're trying to use here is called "warp-synchronous programming" in the CUDA world, and you should know that Nvidia no longer recommends using it (in fact they now explicitly recommend that you *don't* use it). At some point they surely realized that requiring execution and memory access to be synchronized across exactly 32 threads painted them into a corner hardware-wise, and they've already made some changes that don't work with the old warp-synchronous programming examples that they provided.
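
For reference, a rough SM6.0 sketch of the wave-intrinsic route mentioned above (an illustration only, not code from this thread); it computes just the final total rather than the per-level partial sums of the earlier snippets, and it assumes the wave size evenly divides the group size of 64:

groupshared float waveSums_shared[64]; // at most 64 / WaveGetLaneCount() entries are used

float GroupTotal64(float err, uint threadIdx)
{
    // Sum across the lanes of the current wave (32 lanes on NVIDIA, 64 on GCN AMD).
    float waveSum = WaveActiveSum(err);
    uint laneCount = WaveGetLaneCount();

    // One lane per wave writes its partial sum to shared memory.
    if (WaveIsFirstLane())
        waveSums_shared[threadIdx / laneCount] = waveSum;

    // A single group-wide barrier is still needed before combining the partials.
    GroupMemoryBarrierWithGroupSync();

    // Thread 0 adds up the per-wave partial sums.
    float total = 0.0f;
    if (threadIdx == 0)
    {
        for (uint i = 0; i < 64 / laneCount; ++i)
            total += waveSums_shared[i];
    }
    return total; // meaningful only in thread 0
}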


I wasn't aware NVIDIA doesn't recommend warp-synchronous programming anymore. Good to know.

I checked my GPU's warp size with a CUDA sample that prints device info. That size is 32 for my GeForce GTX 1080, which does not surprise me, as NV's GPUs have long used this number (I think AMD's is 64).

I have two more listings for you. Actually, I had to change my code to operate on 16x16 = 256-pixel blocks instead of 8x8 = 64-pixel blocks, which forced me to use barriers. My first attempt:

	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
}
	

And the second attempt:

	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
}
	

I dropped a few barriers because from some point on I'm working with <= 32 threads. Both of these listings produce exactly the same outcome. If I skip one more barrier, the race condition appears.

Performance differs between the two listings: the second one is around 15% faster.

