Compute shader memory barrier fun


I've got myself somewhat confused regards required memory barriers in a compute shader where invocations within the same dispatch access a list stored in a SSBO.

I have the SSBO declared coherent and I protect the list with an atomic-exchange-based spinlock (taking care to take the lock only from the lead thread of each warp).

If I do the following in order in the shader:

1) Acquire lock

2) Modify list items and list item count

3) Release lock

Where do I need a memory barrier? I think at best I only need a memoryBarrierBuffer() between (2) & (3)?

The atomic operation in (3), I believe, does not guarantee that the changes made in (2) have become visible from the point of view of other invocations within the dispatch, right? So placing the memoryBarrierBuffer() between (2) and (3) should ensure that other invocations within the same dispatch see the buffer changes correctly after they acquire the lock?

Also, if I call memoryBarrierBuffer() only in the lead warp thread, I assume that's not good?

So what I should really have is maybe(?) something like:

if(leadWarpThread) AcquireLock();

ModifyBufferListItemsAndCount();

memoryBarrierBuffer();

if(leadWarpThread) ReleaseLock();
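For concreteness, the sort of buffer I have in mind (layout and names purely illustrative):

layout(std430, binding = 0) coherent buffer ListBuf
{
    uint lock;      // spinlock word: 0 = free, 1 = held
    uint itemCount; // number of valid entries in items[]
    uint items[];   // the list itself
};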


I see one barrier is missing:

if(leadWarpThread) AcquireLock();

barrier(); // ensure no other thread begins changing buffer before locking

ModifyBufferListItemsAndCount();

memoryBarrierBuffer();

if(leadWarpThread) ReleaseLock();

Thanks, but the execution barrier() isn't needed because of the atomic-based spinlock.

How do you implement AcquireLock/ReleaseLock without synchronizing with global memory?

Also, how do you ensure that one GLSL work group maps to exactly one hardware wavefront?

Yep, if the lead thread is modifying global memory for the lock (a buffer or whatever), you need not just a barrier, but also a memoryBarrier() afterwards.

And those barriers can't be in AcquireLock(), because all threads must execute them - so it seems you're missing them.

Currently, other threads may start calling ModifyBufferListItemsAndCount() before the lock is taken, yes?
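Something like this is what I have in mind (just a sketch reusing your function names, and assuming one work group maps to one warp - otherwise warps contending for the same lock could deadlock on the barrier):

if(leadWarpThread) AcquireLock();

barrier();       // every thread must execute this, so it can't be hidden inside AcquireLock()
memoryBarrier(); // and a full memory barrier afterwards, to be safe

ModifyBufferListItemsAndCount();

memoryBarrierBuffer();
barrier();       // nobody releases until everyone has finished writing

if(leadWarpThread) ReleaseLock();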

Other than that I agree with your initial statements, but it smells like there should be a way to do things more efficiently.

I use atomics in global memory for debugging and it's really slow.

The spinlock is usually something like:


layout(std430, binding = 0) coherent buffer LockBuf { uint lock; }; // the lock lives in global memory

if(leadThread) while(atomicCompSwap(lock, 0u, 1u) != 0u); // spin until the lock is acquired
// ... critical section ...
if(leadThread) atomicExchange(lock, 0u);                  // release the lock

They are done on a variable in an SSBO, which is the global memory. With the ARB shader ballot etc. extensions you can now do warp-synchronous programming in GL. But I am cheating a little bit and switching some constants between 32/64 threads per warp/wavefront on NVIDIA/AMD, as although you can get that value in the shader via those extensions, there seems to be no way to query it client side with glGetInteger().
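Roughly the kind of thing I mean (WARP_SIZE here standing in for the constant I switch per vendor; gl_SubGroupInvocationARB comes from GL_ARB_shader_ballot):

#extension GL_ARB_shader_ballot : require

#ifndef WARP_SIZE
#define WARP_SIZE 32u   // injected per vendor by the app: 32 on NVIDIA, 64 on AMD GCN
#endif

bool leadThread = (gl_SubGroupInvocationARB == 0u);
// gl_SubGroupSizeARB is also available in the shader, just not queryable client side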

The other threads in the warp can't call ModifyBufferListItemsAndCount() until after the lock, as they are in the same warp as the lead thread taking it - so they have to wait for their instruction stream to re-converge, I believe. But this might not be guaranteed by the GL spec even if it is by today's hardware? It's not clear to me from the ballot and vote extensions whether this is guaranteed. So to conform to the GL spec maybe it would need a barrier()? I guess really they should have added a warpExecutionBarrier() to the ballot extension, as it all seems very thorny.

Really, right now I don't trust any of this, as it's all poorly explained and documented, with broken examples all over the internet (and it seemingly working is not a guarantee of it being correct!). Also, the CUDA guides changed at some point to recommend against warp-synchronous stuff (though seemingly everyone in CUDA land still does it), and now NVIDIA/AMD are pushing warp-synchronous GL extensions/intrinsics...

So from my current world view I still don't think I need an execution barrier(), as I know all threads in a warp wait on their lead thread to grab the lock. And I think the behaviour of the atomics guarantees the rest of it well enough?

As to the speed of atomics: yes, not ideal, but in some places they are the only way to do the required syncs, so I try to buffer local lists to amortise their costs.
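The local buffering is roughly this sort of pattern (a sketch, not my actual code - globalItemCount, globalItems and someValue are stand-ins):

shared uint s_items[64];   // per-workgroup staging list; size = workgroup size here
shared uint s_count;
shared uint s_base;

if(gl_LocalInvocationIndex == 0u) s_count = 0u;
barrier();

// append locally first - shared-memory atomics are much cheaper than global ones
uint slot = atomicAdd(s_count, 1u);
s_items[slot] = someValue;
barrier();

// then a single global atomic per workgroup reserves a range, and everyone copies out
if(gl_LocalInvocationIndex == 0u) s_base = atomicAdd(globalItemCount, s_count);
barrier();
for(uint i = gl_LocalInvocationIndex; i < s_count; i += gl_WorkGroupSize.x)
    globalItems[s_base + i] = s_items[i];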

Maybe I should mention that my slimmed down reasoning case is actually more like:


if(leadWarpThread) AcquireLock();

ModifyBufferListItems();

if(leadWarpThread) ModifyBufferListItemCount();

memoryBarrierBuffer();

if(leadWarpThread) ReleaseLock();

As only the lead thread should alter the final list count - which makes three divergent if()s. It also gets a little confusing with regard to divergence-avoidance best practices (there is no good recent documentation on this), but for now I have been assuming it's usually better to if() around 'one-line' blocks on booleans - letting warp threads that don't contribute to that branch execute effective no-ops - rather than to if() around larger divergent chunks (roughly the shape sketched below).
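i.e. I've been leaning towards the first shape rather than the second (made-up variable names, purely to show what I mean):

// predicate individual statements - non-contributing lanes just no-op each line:
if(leadWarpThread) listCount += warpItemCount;
if(leadWarpThread) listFlags |= LIST_DIRTY_BIT;

// rather than wrapping a larger lead-only chunk in one divergent branch:
if(leadWarpThread)
{
    listCount += warpItemCount;
    listFlags |= LIST_DIRTY_BIT;
    // ... more lead-only work ...
}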

So you assume all threads in a Warp/Wavefront are guaranteed to execute in lockstep - do the specs mention such a guarantee?

AFAIK there is already some hardware where this is not necessarily the case (Maxwell?).

Personally I put barriers everywhere and leave it to the compiler to optimize them away when possible.

I second your problems with workgroup functions:

There is NO documentation on the GCN shader extensions (and no answer when asking for it),

their compiler breaks if the shader contains math functions (https://github.com/GPUOpen-LibrariesAndSDKs/VkMBCNT/issues/2),

and the Vulkan validation layers complain about them (because the SPIR-V extensions are not yet standardized by Khronos).

I'm afraid I'll have to wait for the next Vulkan version until things get better; probably it's the same for OpenGL.

EDIT: Just updated glslang (https://github.com/KhronosGroup/glslang) and see the AMD extensions have been integrated :)

No more need for AMD's compiler; all issues fixed.

The glslang source should also help figure out how to use the undocumented extensions...

Awesome! :D :D :D

You have some CUDA experience?

Did you ever compare CUDA <-> GLSL performance?

Two years back I made a large project in both OpenGL compute shaders and OpenCL 1.x.

OpenCL was almost two times faster on Nvidia, and slightly faster on AMD.

Today things have changed - Vulkan compute shader seems always faster than OpenCL :)

I've only done a small number of shaders so far, tested on AMD only, but for now Vulkan is about 1.5x faster.

The assumption that all warps/wavefronts execute in lockstep is something I am unsure of the far-reaching truth of. It seemed solid until sometime after Kepler, when mixed messages were given and the CUDA docs began to advise against it, but I can no longer find that in the current docs. A lot of CUDA libraries rely on that behaviour, and there seem to be conflicting opinions on the matter, with neither NVIDIA nor AMD clarifying the current state of play. Of course the CUDA spec != the OpenGL spec, so yeah, even more confusion and room for things not to be quite as assumed, even on the same hardware, as one spec may give more leeway.

You mentioning that Maxwell may not execute warps in lockstep is the first I have heard of that - and I cannot find any reference to this so far on google? Where did you hear that!? I am intending to try this stuff on Pascal over the next week so you have worried me greatly! :-)

I have only written basic CUDA stuff, having tried to stick with GL for cross-platform work with a mind to switch completely to Vulkan when it's 'ready' - but it looks like I may hold off until Vulkan Next shows up now.

I do keep getting tempted to switch to GL + CUDA/OpenCL currently due to debugging issues (I've crashed my system a lot recently, mostly on NVIDIA) - but currently CUDA debugging/profiling on Pascal isn't really working either, so it's a bit awkward anyway.

I suspect CUDA must still be faster than GL compute in some ways for straight compute, but I haven't compared - I thought the general consensus was that they had caught up. I have seen some strange things - such as surprisingly close performance in GL compute between a RX470 and GTX1080 - and I did intend to test in CUDA to see whether that was a GL compute inefficiency on NVIDIA (or, more likely, me doing things wrong). The latest NVIDIA drivers for Pascal seemed to speed up compute since I last tested that, though. But I haven't got that far yet, so no, I don't really know :-)

Really, the quicker everyone can just focus on one (Vulkan!) API the better - and then I am perfectly fine with vendor-specific extensions, given there are only two vendors that really matter for this stuff.

So if I can find something that definitely says relying on warp lockstep execution is not safe, then I will litter the code with a warpBarrier() macro that can be defined as barrier() to be safe. So this:


#define WARPBARRIER barrier

bool leadThread = (gl_SubGroupInvocationARB == 0);

if(leadThread) while(atomicCompSwap(lock, 0, 1) != 0); // acquire
WARPBARRIER(); // if lockstep can't be relied on, the other lanes must wait for the lock here too

ModifyBufferListItems();

if(leadThread) ModifyBufferListItemCount();

memoryBarrierBuffer(); // make the list writes visible before the lock is released
WARPBARRIER();

if(leadThread) atomicExchange(lock, 0); // release

EDIT: As a note to others who may read this: there are lots of similar warnings to be found via Google regarding the dangers of relying on warp lockstep execution, such as:

https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp-/

But I am not really convinced how valid that currently is, given NVIDIA/AMD recently posting stuff that seemingly encourages it, such as:

https://developer.nvidia.com/reading-between-threads-shader-intrinsics

ONE... interesting note is that I think the ballot instructions are defined to intrinsically sync divergent warp threads - so in theory if you use one or more of those types of instructions (which I actually do) you would maybe not need the WARPBARRIER macro I have in that example above - as you could be sure they have re-converged at that point.

But it would be nice to know for sure whether you can always rely on warp lockstep behaviour in GLSL with the recent extensions on all hardware...
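To be explicit, the ballot-as-sync idea would look something like this - assuming ballotARB() really is a re-convergence point, which I'm not certain the spec guarantees (needs GL_ARB_shader_ballot, plus GL_ARB_gpu_shader_int64 for the uint64_t result):

#extension GL_ARB_shader_ballot : require
#extension GL_ARB_gpu_shader_int64 : require

bool leadThread = (gl_SubGroupInvocationARB == 0u);

if(leadThread) while(atomicCompSwap(lock, 0u, 1u) != 0u);
uint64_t mask = ballotARB(true); // result unused - the call is just the (hoped-for) sync point

ModifyBufferListItems();
if(leadThread) ModifyBufferListItemCount();

memoryBarrierBuffer();
mask = ballotARB(true);          // same idea before the release

if(leadThread) atomicExchange(lock, 0u);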

You mentioning that Maxwell may not execute warps in lockstep is the first I have heard of that - and I cannot find any reference to this so far on google? Where did you hear that!? I am intending to try this stuff on Pascal over the next week so you have worried me greatly! :-)

Can't remember a related marketing buzzword to search for - I heard it when the chip was launched, probably from those 'new architecture' PDFs NV released.

I'm not sure if it was Maxwell at all, maybe Kepler.

strange things - such as surprisingly close performance in GL compute between a RX470 and GTX1080

Two years back my R9-280X was 1.9 times faster than a Titan, and this factor was pretty constant across all shaders (10-15), including complex tree traversals, ray tracing and simpler image-filter stuff. The code was optimized individually for both (which is probably not true of public benchmarks and makes a big difference).

NV is fast at rasterization, AMD is fast at compute.

I have access to a 1070 now but haven't tested it yet. I'm curious how things have changed, but it seems not by much if you're right.

ONE... interesting note is that I think the ballot instructions are defined to intrinsically sync divergent warp threads

Agree, but don't know for sure either :)

