Mr_Fox

Why InterlockedAdd is much faster than InterlockedCompareStore under high contention


Hey Guys

 

I have a compute shader with a threadgroup size of 8*8*8 (512 threads) trying to update a single 32-bit uint (a UAV, set to 0 before the CS runs).

    // Update brick structure. u3BlockIdx is the same for every thread in one threadgroup.
    if (true || fSDF < vParam.fTruncDist) { // for testing purposes I made this always true, so every thread does the update
        //tex_uavFlagVol[u3BlockIdx] = 1;                              // case 1: fastest; not atomic, only useful when we just want to know whether any thread updated it
        //InterlockedAdd(tex_uavFlagVol[u3BlockIdx], 1);               // case 2: takes twice as long as case 1; useful when we need to know how many threads actually updated it
        //InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);   // case 3: takes almost twice as long as case 2; not very useful, just tested for fun
    }

The result looks straightforward, but it goes against my expectation: under such contention (512 threads trying to update the same address) I expected case 3 to be the fastest, since it does 512 serialized reads and compares but only 1 write, while case 1, although it doesn't do an atomic write, actually performs 512 writes; with bank conflicts it shouldn't be much faster than case 3. Case 2 really confuses me: it does 512 serialized writes, so why is it twice as fast as case 3, where there is only 1 write?

 

My understanding is that a serialized read should be no slower than a serialized write, and the extra 512 compares should be negligible, so case 3 really shouldn't be that much slower than case 2.

 

But apparently I must be getting something wrong. Thanks to anyone who can enlighten me on this.

 

 

 


I got more results...

 

The previous tests were run on my GTX 680M.

I just ran my test on a desktop with a GTX 1080, with higher memory pressure to get a measurable perf delta (the total memory accessed is 8 times larger: 8 times more threadgroups, but per-threadgroup memory reads/writes should be the same). Case 1 is the fastest as before, case 2 is almost 3 times slower, and case 3 is only slightly slower than case 2 (by 3%).

 

 

I also added another test on both platforms:

    // Case 4 uses a groupshared variable, so instead of atomic operations on the UAV this only
    // operates on registers (am I wrong?). The result for this threadgroup is then read after a
    // group memory barrier, and during the final stage the UAV is updated atomically by only 1
    // thread in each threadgroup, so contention is much, much lower.
    groupshared uint uFlag; // groupshared variables can't have initializers in HLSL, so clear it from one thread instead
...
...
    if (all(GTid == uint3(0, 0, 0))) uFlag = 0;
    GroupMemoryBarrierWithGroupSync();
    if (true || fSDF < vParam.fTruncDist) {
        InterlockedCompareStore(uFlag, 0, 1);
    }
    GroupMemoryBarrierWithGroupSync();
    if (all(GTid == uint3(0, 0, 0)) && uFlag) {
        InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);
    }

For case 4: on the GTX 680M it is 2 times slower than case 3... On the GTX 1080 it's almost 10% faster than case 1!!! My guess is that on the GTX 1080 contention on the UAV is the bottleneck; even the non-atomic write of case 1 gets bottlenecked, so case 4 can beat case 1 by heavily reducing contention on the UAV (but don't we just get contention on uFlag instead?). As always, my guess may be totally wrong though... I'm also still confused by the result on the GTX 680M...

Could anyone try to explain this? Thanks.



 

"while case 1 didn't do atomic write, it actually doing 512 write, and with bank conflict, it should not be much faster than case 3"

 

Case 1: The GPU may have a mechanism to determine that all threads write the same value to the same memory (although that's more likely for reads than writes; GCN does that).

 

"For case 2, this really confuses me: it's doing 512 serialized write, why its twice fast as case 3 where we only have 1 write?"

 

The implementation of case 2 is probably something like this: read the memory once, add the number of active threads, write once.

Case 3 is more expensive: read once, do all the comparisons and changes in order, write once.
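A sketch of that aggregation, written with Shader Model 6.0 wave intrinsics (the driver probably does the equivalent internally even without them):

    // One atomic per wave instead of one per thread: count the lanes that
    // took the branch, then let a single lane add the whole count at once.
    if (true || fSDF < vParam.fTruncDist) {
        uint uLaneCount = WaveActiveCountBits(true); // lanes active in this branch
        if (WaveIsFirstLane())
            InterlockedAdd(tex_uavFlagVol[u3BlockIdx], uLaneCount); // one atomic per wave
    }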

 

 

My experience on GCN is: use case 4 whenever possible; it's usually more than 10 times faster than atomics to global memory.

I do not understand why you get such bad results on the 680M, and 10% on the GTX 1080 also seems far too little. (Maybe the compiler already does it internally for you?)

Some comments:

 

    if (true || fSDF < vParam.fTruncDist) {
        InterlockedCompareStore(uFlag, 0, 1); // why compare AND store? Usually in such a case you only want to set the flag no matter what its current state is.
    }
    GroupMemoryBarrierWithGroupSync();
    if (all(GTid == uint3(0,0,0)) && uFlag) { // that's 3 comparisons just for the thread ID; you should use only one to identify the first thread
        InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);
    }
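Putting both suggestions together, a sketch could look like this (SV_GroupIndex gives you the flattened thread index; u3BlockIdx, fSDF and vParam come from the rest of your shader as in your code):

    // uFlag is set with a plain store, since every writer stores the same
    // value, so atomicity buys nothing; thread 0 is found with a single
    // comparison on the flattened SV_GroupIndex.
    groupshared uint uFlag;

    [numthreads(8, 8, 8)]
    void main(uint uGI : SV_GroupIndex /* plus your other inputs */)
    {
        ... // compute u3BlockIdx, fSDF etc. as before
        if (uGI == 0) uFlag = 0; // one thread clears the flag
        GroupMemoryBarrierWithGroupSync();
        if (true || fSDF < vParam.fTruncDist)
            uFlag = 1;
        GroupMemoryBarrierWithGroupSync();
        if (uGI == 0 && uFlag) // one comparison instead of three
            InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);
    }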


Thanks JoeJ, really appreciate it.

"why compare AND store? Usually in such a case you only want to set the flag no matter what its current state is."
 

I did that only for testing purposes; I just wanted to check whether it is faster than InterlockedAdd :) Should InterlockedCompare* be faster than Interlocked* in the general case under high contention, since the former can avoid a lot of writes?
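For example, something like this hypothetical variant, which skips the atomic with a plain read first (I haven't profiled it):

    // A plain read skips the atomic once the flag is already set, so only
    // threads that still see 0 attempt the compare-and-store.
    if (tex_uavFlagVol[u3BlockIdx] == 0)
        InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);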

 

Also, in the case where all threads update only the same global memory, should that memory stay in cache the whole time during the interlocked updates? I was wondering why case 4 is so much faster.

 

Thanks a lot.


"Should InterlockedCompare* be faster than Interlocked* in the general case under high contention, since the former can avoid a lot of writes?"

 

I assume the comparison should be slower, but I never tested this. I assume that, at the latest, the memory controller performs atomic operations in a way that the final write happens only once per warp, or even just once across all simultaneous warps; I can't imagine it's done for each single thread.

 

"Also, in the case where all threads update only the same global memory, should that memory stay in cache the whole time during the interlocked updates? I was wondering why case 4 is so much faster."

 

Not sure about the cache (still a black box to me). But I know why case 4 should be faster:

Here most of the atomic operations happen in LDS, the on-chip memory. IIRC on GCN that's considered 14 times faster in bandwidth than global memory, and the latency is so small that LDS operations are not blocking (blocking means switching to another task while waiting for global memory operations to finish).

Additionally, having only one thread per warp write the final value to global memory should be a lot faster than having all of them do it.

Those rules should apply whether the operation is atomic or not.

But I'm not sure whether this matters as much on NV as on AMD.

 

Maybe someone with better knowledge can give better answers.

 

But whatever you hear, keep profiling and testing such issues all the time.

Results may vary a lot depending on the current shader (and of course the GPU model, so I keep a lot of those things configurable via the preprocessor to maintain model-specific optimizations).

I use global atomics a lot for debug output; mostly it doubles execution time, but in rare cases it has almost no effect, even when I assumed the opposite.
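That debug output is basically this pattern (buffer names here are hypothetical): one global atomic reserves a slot, then the payload is written to it.

    // uav_DebugCount / uav_DebugData are hypothetical debug UAVs.
    uint uSlot;
    InterlockedAdd(uav_DebugCount[0], 1, uSlot); // reserve a unique slot
    uav_DebugData[uSlot] = fValueToInspect;      // write the value to inspect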
