DirectCompute: sync within warp


In countless CUDA-related sources I've found that, when operating within a warp, one can skip syncthreads because all instructions execute synchronously within a single warp. I followed that advice and applied it in DirectCompute (I use an NV GPU). I wrote this code that does nothing but a good old prefix sum of 64 elements (64 is the size of my block):


	groupshared float errs1_shared[64];
groupshared float errs2_shared[64];
groupshared float errs4_shared[64];
groupshared float errs8_shared[64];
groupshared float errs16_shared[64];
groupshared float errs32_shared[64];
groupshared float errs64_shared[64];
	
void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
}
	

This works flawlessly. I noticed that I have bank conflicts here, so I changed the code to this:


	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
    if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
    if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
    if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
    if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
}
	

And to my surprise this one causes race conditions. Is it because I should not rely on that functionality (auto-sync within a warp) when working with DirectCompute instead of CUDA? That hurts my performance by a measurable margin: the first version, with bank conflicts, is still around 15-20% faster than the second version, which is conflict-free but where I have to add GroupMemoryBarrierWithGroupSync between each assignment.
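
For reference, the barriered variant of that second snippet would look roughly like this (just a sketch; the barriers have to sit outside the branches, since HLSL does not allow a sync barrier inside flow control that varies per thread):

    // Illustrative barriered version of the conflict-free snippet above.
    void CalculateErrs(uint threadIdx)
    {
        if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
        GroupMemoryBarrierWithGroupSync();
        if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
        GroupMemoryBarrierWithGroupSync();
        if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
        GroupMemoryBarrierWithGroupSync();
        if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
        GroupMemoryBarrierWithGroupSync();
        if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
        GroupMemoryBarrierWithGroupSync();
        if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
    }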


First, why do you use multiple arrays for intermediate results? Here is some code avoiding this:

    if (lID<32) _lds[(((lID >> 0) << 1) | (lID &  0) |  1) ]    += _lds[(((lID >> 0) << 1) |  0) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 1) << 2) | (lID &  1) |  2) ]    += _lds[(((lID >> 1) << 2) |  1) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 2) << 3) | (lID &  3) |  4) ]    += _lds[(((lID >> 2) << 3) |  3) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 3) << 4) | (lID &  7) |  8) ]    += _lds[(((lID >> 3) << 4) |  7) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 4) << 5) | (lID & 15) | 16) ]    += _lds[(((lID >> 4) << 5) | 15) ];    memoryBarrierShared(); barrier();
    if (lID<32) _lds[(((lID >> 5) << 6) | (lID & 31) | 32) ]    += _lds[(((lID >> 5) << 6) | 31) ];    memoryBarrierShared(); barrier();

... maybe you get better/different results because this way it becomes easier for the compiler to figure things out?
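
(The snippet above is GLSL; its memoryBarrierShared()/barrier() pair roughly corresponds to GroupMemoryBarrierWithGroupSync() in HLSL. A minimal HLSL sketch of the same single-array idea could look like this - lds and PrefixSum64 are made-up names, not from any post in this thread:)

    groupshared float lds[64];

    void PrefixSum64(uint lID)
    {
        // Each step k folds the sum of the lower half of a 2^(k+1)-wide block
        // into every element of the upper half; 32 threads handle the 32
        // upper-half slots per step.
        [unroll]
        for (uint k = 0; k < 6; ++k)
        {
            uint halfSize = 1u << k;                 // 1, 2, 4, 8, 16, 32
            uint base = (lID >> k) << (k + 1);       // start of this thread's block
            if (lID < 32)
                lds[base | (lID & (halfSize - 1)) | halfSize] += lds[base | (halfSize - 1)];
            GroupMemoryBarrierWithGroupSync();       // uniform: every thread reaches it
        }
    }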

AMD's advice is: use barriers in any case; the compiler will remove them if they are not necessary.

But I never tested whether they (and NV) really do.

 

Do you really get a performance drop when adding barriers to your first snippet? (You didn't make this clear, but I'd be very disappointed.)

 


 

Oh, now I see: your initial conditions make your threads divergent (<32, <16, <8... In my case it's always <32, so for a warp you could even remove the branch).

I believe it is the divergence that causes the race conditions. Probably you can only remove barriers if the code is guaranteed to have the same control flow for each thread.

Oh, I completely forgot that I can't have divergent branches if I want to make use of that "assumption". But I've tried this code before as well:


	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32)
    {
        errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
        errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
        errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
        errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
        errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
        errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
    }
}
	

And it also causes race conditions even though there are no branches within the warp.

"Do you really get a performance drop when adding barriers in your first snippet? (You did'n made this clear, but i'd be very disappointed.)"

The *second* snippet, yes. When I add barriers to the second snippet, the code is slower than the first snippet.

There are probably two options:

Without the barrier it is not guaranteed that the data has been written, even though the threads run in lockstep (see the barrier sketch below).

Or the threads do not really run in lockstep. I remember that since Kepler only subgroups of 16 or 8 threads run in lockstep (not sure). And in CUDA, didn't they have to drop this feature despite their previous (unwise) advice that you can assume lockstep for waves?

No clue. But let me know if you try my snippet without multiple arrays.
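
Regarding the first option: HLSL has two flavours of the group barrier, and only the full sync covers both problems. A small illustrative sketch (not the original poster's code):

    // GroupMemoryBarrier(): a memory fence only - waits for outstanding
    // groupshared accesses, but does not force every thread of the group to
    // reach this point before others continue.
    // GroupMemoryBarrierWithGroupSync(): memory fence plus execution sync for
    // the whole group - the portable choice whenever a later step reads values
    // written by other threads.
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 32];
    GroupMemoryBarrierWithGroupSync(); // a plain GroupMemoryBarrier() would not be enough if lockstep cannot be assumed
    if (threadIdx < 16) errs1_shared[threadIdx] += errs1_shared[threadIdx + 16];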

 

Side note: on AMD I have submitted driver bug reports for malfunctioning prefix sums, for OpenCL and a year later for Vulkan (not sure if the latter has been fixed yet). Also, in OpenCL I've had a case where adding the useless <64 branch in front improved performance a lot, because otherwise the compiler decided to waste registers like crazy :) It seems those simple prefix sums are a bit of a compiler stress test.

 

Implementation using one array:


	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 32];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 16];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 8];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 4];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 2];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 1];
}
	

It works, but you might be surprised that it runs slower than when I used this:


	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32)
    {
        errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
        errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
        errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
        errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    }
}
	

This one is a modification of my first snippet (from the first post) that ping-pongs two arrays. And here again, this one is faster by 15-20% than the one-array version. So my guess is it's the barriers that cost time. Please note that I run CalculateErrs 121 times in my shader, which runs for every pixel, so that is a lot.

I would be perfectly fine with *not* relying on the warp size to avoid barriers, because maybe DirectCompute simply doesn't allow this "trick", as it's not NV-only. But what bugs me is that when I run the bank-conflicted second snippet from this post, or the first snippet from the first post, it works like a charm. And I save performance by not having to use barriers.

3 minutes ago, maxest said:

This one is a modification of my first snippet that ping-pongs two arrays. And here again, this one is faster by 15-20% than the first snippet in this post. So my guess is it's the barriers that cost time.

And the first snippet does NOT work if you remove the barriers?

And the second becomes slower if you add barriers?

 

About performance, this can still have a lot of other causes:

On NV using more LDS can actually improve performance (but in practice we never have enough, so we usually use as little as possible if only to improve occupancy).

The memory access pattern is different.

The first snippet has more 'useless' initial branches (which may also affect the need for barriers).

D3D/DirectCompute has no concept of a "warp". It follows a model where every thread is considered to be completely independent, and where synchronization can only be performed across a thread group (which is different from a warp). This is starting to change a bit with SM6.0 and wave-level intrinsics, but even then there aren't necessarily guarantees about accessing shared memory in lockstep. You should really just insert the appropriate barriers if you want your code to be robust across different hardware.

The technique you're trying to use here is called "warp-synchronous programming" in the CUDA world, and you should know that Nvidia no longer recommends using it (in fact they now explicitly recommend that you *don't* use it). At some point they surely realized that requiring execution and memory access to be synchronized across exactly 32 threads painted them into a corner hardware-wise, and they've already made some changes that don't work with the old warp-synchronous programming examples that they provided.

Unless you're using D3D12 with Shader Model 6, in which case CheckFeatureSupport for D3D12_FEATURE_DATA_D3D12_OPTIONS1 exposes your wave / warp size.
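
To illustrate the wave-intrinsics route, a group-wide sum might be sketched like this (a sketch only; ReduceGroup64 and waveSums are made-up names, and it assumes SM6 wave ops are available on a 64-thread group - the cross-wave combine still needs groupshared memory and a group barrier, because a thread group can span several waves):

    groupshared float waveSums[64]; // generously sized; at most 64 / WaveGetLaneCount() entries are used

    float ReduceGroup64(float v, uint threadIdx)
    {
        // Sum within this thread's wave - no barrier needed for this part.
        float waveSum = WaveActiveSum(v);
        uint laneCount = WaveGetLaneCount();           // e.g. 32 on current NV hardware
        if (WaveIsFirstLane())
            waveSums[threadIdx / laneCount] = waveSum; // one partial sum per wave
        GroupMemoryBarrierWithGroupSync();             // make the partials visible to the whole group

        // Combine the per-wave partials; only the lanes of the first wave load real
        // values, the rest contribute zero within their own wave.
        uint waveCount = (64 + laneCount - 1) / laneCount;
        float partial = (threadIdx < waveCount) ? waveSums[threadIdx] : 0.0f;
        return WaveActiveSum(partial);                 // the full sum, valid at least in the first wave
    }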

I wasn't aware NVIDIA doesn't recommend warp-synchronous programming anymore. Good to know.

I checked my GPU's warp size simply with a CUDA sample that prints device info. That size is 32 for my GeForce GTX 1080, which does not surprise me, as NV's GPUs have long been characterized by this number (I think AMD's is 64).

I have two more listings for you. Actually I had to change my code to operate on 16x16 = 256-pixel blocks instead of 8x8 = 64 pixels, which forced me to use barriers. My first attempt:


	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
}
	

And the second attempt:


	void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1]; GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
}
	

I dropped a few barriers since from some point on I'm working with <= 32 threads. Both of these listings produce exactly the same outcome. If I skip one more barrier, the race condition appears.

Performance differs between the two listings; the second one is around 15% faster.

