# Compute shader memory barrier fun


## Recommended Posts

I've got myself somewhat confused regarding the memory barriers required in a compute shader where invocations within the same dispatch access a list stored in an SSBO.

I have the SSBO declared coherent, and I protect the list with an atomic-exchange-based spinlock (taking care to lock only in the lead thread of a warp).

If I do the following in order in the shader:

1) Acquire lock

2) Modify list items and list item count

3) Release lock

Where do I need a memory barrier? I think at best I only need a memoryBarrierBuffer() between (2) and (3)?

I believe the atomic operation in (3) does not guarantee that the changes made in (2) are visible to other invocations within the dispatch, right? So placing the memoryBarrierBuffer() between (2) and (3) should ensure that the other invocations within the same dispatch see the buffer changes correctly after they acquire the lock?

Also if I call memoryBarrierBuffer() only in the lead warp thread I assume that's not good?
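For reference, the layout being described is roughly this (a hypothetical sketch; the names are illustrative, not from the original post):

```glsl
// Hypothetical sketch of the setup under discussion: a coherent SSBO
// holding a spinlock word plus the list it protects.
layout(std430, binding = 0) coherent buffer ListBuffer
{
    uint lock;       // spinlock word, driven by atomicCompSwap/atomicExchange
    uint itemCount;  // protected by the lock
    uint items[];    // protected by the lock
};
```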

So what I should really have is maybe(?) something like:

```glsl
if (leadWarpThread) AcquireLock();

ModifyBufferListItemsAndCount();

memoryBarrierBuffer();

if (leadWarpThread) ReleaseLock();
```

---

I see one barrier is missing:

```glsl
if (leadWarpThread) AcquireLock();

barrier(); // ensure no other thread begins changing the buffer before the lock is taken

ModifyBufferListItemsAndCount();

memoryBarrierBuffer();

if (leadWarpThread) ReleaseLock();
```

---

Thanks, but the execution barrier() isn't needed because of the atomic-based spinlock.

---

How do you implement AcquireLock/ReleaseLock without synchronizing with global memory?

Also, how do you ensure that one GLSL work group maps to exactly one hardware wavefront?

---

Yep, if the lead thread is modifying global memory for the lock (buffer or whatever), you need not just a barrier() but also a memoryBarrier() afterwards.

And those barriers can't be inside AcquireLock(), because all threads must execute them - so it seems you are missing them.

Currently, other threads may start calling ModifyBufferListItemsAndCount() before the lock is taken, yes?

Other than that I agree with your initial statements, but it smells like there should be a way to do things more efficiently.

I use atomics in global memory for debugging and it's really slow.

---

The spinlock is usually something like:

```glsl
// 'lock' is a uint living in an SSBO (see below), not a local variable
if (leadThread) while (atomicCompSwap(lock, 0u, 1u) != 0u); // locks
if (leadThread) atomicExchange(lock, 0u);                   // unlocks
```


They are done on a variable in an SSBO, which is the global memory. With the ARB shader ballot etc. extensions you can do warp-synchronous programming in GL now. But I am cheating a little and switching some constants between 32/64 threads per warp/wavefront on NVIDIA/AMD, because although you can get that value in the shader via those extensions, there seems to be no way to query it client-side with glGetIntegerv().
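As a workaround sketch (my own suggestion, assuming GL_ARB_shader_ballot is available), the shader-side gl_SubGroupSizeARB value can be written to an SSBO once by a one-off 1x1x1 dispatch and read back on the client:

```glsl
#version 450
#extension GL_ARB_shader_ballot : require

layout(local_size_x = 1) in;

// the host maps this buffer after the dispatch to learn the subgroup size
layout(std430, binding = 0) writeonly buffer SizeOut { uint subgroupSize; };

void main()
{
    subgroupSize = gl_SubGroupSizeARB; // typically 32 on NVIDIA, 64 on GCN
}
```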

The other threads in the warp can't call ModifyBufferListItemsAndCount() until after the lock, as they are in the same warp as the lead thread taking the lock - so they have to wait to re-converge their instruction stream, I believe. But this might not actually be guaranteed by the GL spec, even if it is by today's hardware? It's not clear to me from the ballot and vote extensions whether this is guaranteed. So to conform to the GL spec, maybe it would need a barrier()? I guess they really should have added a warpExecutionBarrier() to the ballot extension, as it all seems very thorny.

Really, right now I don't trust any of this, as it's all poorly explained and documented, with broken examples all over the internet (and it seemingly working is not a guarantee of it being correct!). The CUDA guides also changed at some point to recommend against warp-synchronous tricks (though seemingly everyone in CUDA land still does them), while NVIDIA/AMD are now pushing warp-synchronous GL extensions/intrinsics...

So from my current world view, I still don't think I need an execution barrier(), as I know all threads in a warp wait on their lead thread to grab the lock. And I think the behaviour of the atomics guarantees the rest of it, more or less?

As to the speed of atomics: yes, not ideal, but in some places they are the only way to do the required syncs, so I try to buffer local lists to amortise their cost.

Edited by MMONewbie

---

Maybe I should mention that my slimmed-down reasoning case is actually more like:

```glsl
if (leadWarpThread) AcquireLock();

ModifyBufferListItems();

if (leadWarpThread) ModifyBufferListItemCount();

memoryBarrierBuffer();

if (leadWarpThread) ReleaseLock();
```

As only the lead thread should alter the final list count - which makes three divergent if()s. It also gets a little confusing regarding divergence-avoidance best practices (there is no good recent documentation on this), but for now I have been assuming it's usually better to if() around 'one line' blocks on booleans, executing effective no-ops on warp threads that don't contribute to that divergent branch, rather than to if() around larger divergent chunks.

---

So you assume all threads in a warp/wavefront are guaranteed to execute in lockstep - do the specs mention such a guarantee?

AFAIK there is already some hardware where this is not necessarily the case (Maxwell?).

Personally I put barriers everywhere and leave it to the compiler to optimize them away when possible.

I second your problems with workgroup functions:

There is NO documentation on the GCN shader extensions (and no answer when asking for it),

their compiler breaks if the shader contains math functions (https://github.com/GPUOpen-LibrariesAndSDKs/VkMBCNT/issues/2),

and the Vulkan validation layers complain about them (because the SPIR-V extensions are not yet standardized by Khronos).

I'm afraid I'll have to wait for the next Vulkan version until things get better; probably it's the same for OpenGL.

EDIT: Just updated glslang (https://github.com/KhronosGroup/glslang) and see the AMD extensions have been integrated :)

No more need for AMD's compiler; all issues fixed.

The glslang source should also help in figuring out how to use the undocumented extensions...

Awesome! :D :D :D

You have some CUDA experience?

Did you ever compare CUDA <-> GLSL performance?

Two years back I made a large project in both OpenGL compute shaders and OpenCL 1.x.

OpenCL was almost two times faster on NVIDIA, and slightly faster on AMD.

Today things have changed - the Vulkan compute shader seems always faster than OpenCL :)

I've only done a small number of shaders so far, tested on AMD only, but for now Vulkan is about 1.5x faster.

Edited by JoeJ

---

The assumption that all warps/wavefronts execute in lockstep is something whose far-reaching truth I am unsure of. It seemed solid until sometime after Kepler, when mixed messages were given and the CUDA docs began to advise against it - but I can no longer find that in the current docs. A lot of CUDA libraries rely on that behaviour, and there seem to be conflicting opinions on the matter, with neither NVIDIA nor AMD clarifying the current state of play. Of course the CUDA spec != the OpenGL spec, so there is even more confusion and room for things not to be quite as assumed, even on the same hardware, as one spec may give more leeway.

Your mentioning that Maxwell may not execute warps in lockstep is the first I have heard of that - and I cannot find any reference to it so far on Google? Where did you hear that!? I am intending to try this stuff on Pascal over the next week, so you have worried me greatly! :-)

I have only written basic CUDA stuff, having tried to stick with GL for cross-platform work with a mind to switch completely to Vulkan when it's 'ready' - but it looks like I may hold off until Vulkan Next shows up now.

I do keep getting tempted to switch to GL + CUDA/OpenCL due to debugging issues (I've crashed my system a lot recently, mostly on NVIDIA) - but CUDA debugging/profiling on Pascal isn't really working either at the moment, so it's a bit awkward anyway.

I suspect CUDA must still be faster than GL compute in some ways for straight compute, but I haven't compared - though I thought the general consensus was that they had caught up. I have seen some strange things, such as surprisingly close GL compute performance between an RX 470 and a GTX 1080, and I did intend to test in CUDA to see whether that was a GL compute inefficiency on NVIDIA (or, more likely, me doing things wrong). The latest NVIDIA drivers for Pascal seemed to speed up compute since I last tested, though. But I haven't got that far yet, so no, I don't really know :-)

Really, the quicker everyone can just focus on one API (Vulkan!) the better - and then I am perfectly fine with vendor-specific extensions, given there are only two vendors that really matter for this stuff.

So if I can find something that definitely says relying on warp lockstep execution is not safe, then I will litter the code with a warpBarrier() macro that can be defined as a barrier() to be safe. So this:

```glsl
#define WARPBARRIER barrier

bool leadThread = (gl_SubGroupInvocationARB == 0);

if (leadThread) while (atomicCompSwap(lock, 0u, 1u) != 0u);

ModifyBufferListItems();

if (leadThread) ModifyBufferListItemCount();

memoryBarrierBuffer();
WARPBARRIER();

if (leadThread) atomicExchange(lock, 0u);
```

EDIT: As a note to others that may read this: there are lots of warnings to be found via Google regarding the dangers of relying on warp lockstep execution, such as:

https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp-/

But I am not really convinced how valid that is currently, given NVIDIA/AMD recently posting material that seemingly encourages it, such as:

https://developer.nvidia.com/reading-between-threads-shader-intrinsics

ONE... interesting note is that I think the ballot instructions are defined to intrinsically sync divergent warp threads - so in theory, if you use one or more of those instructions (which I actually do), you would maybe not need the WARPBARRIER macro in the example above, as you could be sure the threads have re-converged at that point.

But it would be nice to know for sure whether you can always rely on warp lockstep behaviour in GLSL with the recent extensions on all hardware...
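If the ballot instructions really do force re-convergence (an assumption - the GL specs don't spell this out), the WARPBARRIER macro from the earlier example could be toggled between the two interpretations. A sketch:

```glsl
// Assumption: ballotARB() is a subgroup operation that all active lanes
// must reach, so it may act as an implicit warp re-convergence point.
// This is NOT stated by the GL specs - hence the conservative default.
#ifdef TRUST_BALLOT_RECONVERGENCE         // hypothetical build switch
    #define WARPBARRIER() ballotARB(true) // subgroup op as implicit sync
#else
    #define WARPBARRIER() barrier()       // full workgroup execution barrier
#endif
```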

Edited by MMONewbie

---

> Your mentioning that Maxwell may not execute warps in lockstep is the first I have heard of that - and I cannot find any reference to it so far on Google? Where did you hear that!? I am intending to try this stuff on Pascal over the next week, so you have worried me greatly! :-)

I can't remember a related marketing buzzword to search for - I heard it when the chip was launched, probably from those 'new architecture' PDFs NV released.

I'm not sure it was Maxwell at all; maybe Kepler.

> strange things - such as surprisingly close performance in GL compute between a RX470 and GTX1080

Two years back my R9 280X was 1.9 times faster than a Titan, and this factor was pretty constant across all shaders (10-15), including complex tree traversals, ray tracing, and simpler image filter stuff. The code was optimized individually for both (which is probably not true of public benchmarks and makes a big difference).

NV is fast at rasterization, AMD is fast at compute.

I have access to a 1070 now but haven't tested it yet. I'm curious how things have changed, but it seems not so much, if you're right.

> ONE... interesting note is that I think the ballot instructions are defined to intrinsically sync divergent warp threads

Agreed, but I don't know for sure either :)

---

I just came across this: https://asc.llnl.gov/DOE-COE-Mtg-2016/talks/1-10_NVIDIA.pdf

If you look at page 18, 'coop groups', it seems to be targeting part of this confusion: they have a sync(this_warp()) construct to do a warp execution barrier.

But I'm unsure where that leaves extended OpenGL right now!

I wonder if they originally intended to introduce, say, Volta earlier - perhaps it does not guarantee these things, so they warned people in preparation, but then pushed it back. Now with CUDA 8 they are putting the machinery in place, for CUDA at least? Unless it already breaks on Maxwell/Pascal...

---

Yep, that's the nice stuff we wanna have, although I'm most interested in exchanging registers between threads - it could reduce LDS usage a lot.

Let's hope they find ways to expose most of this vendor-independently, and soon...

At least for the next few months I'll focus on what we already have; that codepath probably remains necessary for older hardware anyway.

I'm really happy with Vulkan - it's great and it's ready right now. I get better performance than expected, so I don't need to squeeze every bit using extensions.

If you plan to migrate to Vulkan anyway, maybe it's better to spend time on that now, and the problem of unavailable / badly documented extensions solves itself in the meantime.

---

The persistence of register state surprised me and is quite interesting - I don't know if AMD can do that in their new generation? I suppose it's a Pascal-only feature, but it could be quite powerful indeed.

I'm not sure what impact that would have on techniques like persistent threads - maybe that would require something like a GPU-side multi-dispatch indirect, but then CUDA has things like dynamic parallelism that I haven't looked into properly.

On Vulkan, I still read things on the AMD and NVIDIA forums that make me hesitant, despite having got as far as writing initial setup/shutdown code already! I was going to judge the right time by Doom's Vulkan support overtaking its OpenGL performance by a lot :P

I get the feeling AMD/NVIDIA are still scrambling to get this all working properly, but to be fair there have been a lot of changes and many APIs to support currently.

---

> I get the feeling AMD/NVIDIA are still scrambling to get this all working properly, but to be fair there have been a lot of changes and many APIs to support currently.

Vulkan feels very solid, not the early beta kind of thing I expected. Probably because it's similar to the existing DX12/Mantle and requires only a fraction of the complexity of an OpenGL driver.

The downside is that initially you need a LOT of code to get even simple things going, but it does not hide the hardware behind a black box like OpenGL, so ultimately development is faster (less guessing and trial-and-error to find the fastest of many possible ways).

The validation layers tell you pretty much everything you do wrong, in plain English. How often did I get stuck with OpenGL not rendering anything, with no clue why - this never happens with Vulkan.

The only thing I miss is profiler support - AMD CodeXL does not yet work with Vulkan, which is why I still develop in OpenCL first and port to GLSL aided by a preprocessor.

(I don't think RenderDoc shows the interesting things like occupancy, register usage, etc.)

Maybe it's better on NVIDIA; I don't know.

> The persistence of register state surprised me and is quite interesting - I don't know if AMD can do that in their new generation?

I'd guess any hardware can do this to some extent, but I doubt it will be exposed in the next Vulkan or SM 6.0.

Probably it's used internally for things like keeping static shader input parameters in registers, keeping the texture cache in LDS for the next pixel, etc.

The same goes for similar things like device-side enqueue (OpenCL 2.0) - AFAIK this is in Mantle, but in neither DX12 nor Vulkan.

---

> I'd guess any hardware can do this to some extent, but I doubt it will be exposed in the next Vulkan or SM 6.0.

I don't know about persisting state from one pixel to the next -- pretty much all GPU architectures are based around shader invocations being fairly isolated...

The SM6 preview documentation is up though, and includes instructions to communicate between threads within a wavefront/warp - basically exposing each GPU core as a SIMD processor. There are also DX11 extensions to allow intra-wavefront communication in HLSL SM5 :)

Edited by Hodgman

---

The SM6 docs seem to indicate D3D finally catching up with the GL extensions. They also seem to suggest that it's safe to rely on warp/wavefront threads executing in lockstep without execution barriers... yet NVIDIA says this is not safe in CUDA, and as previously mentioned has introduced explicit functions for it in CUDA 8 (which makes me think NVIDIA will break this assumption before AMD does, if either do).

Granted, CUDA/GL/Vulkan/Direct3D have 'different specs', but given the underlying hardware is the same, this is a little worrying if, say, NVIDIA introduces a Volta that breaks the assumption - especially as these intrinsics are being pushed in GL/Vulkan/D3D right now. Unless in Direct3D/GL/Vulkan there is a button to press to turn on non-synchronous warp behaviour for whatever reason, I guess :-/

---

> The SM6 preview documentation is up though, and includes instructions to communicate between threads within a wavefront/warp - basically exposing each GPU core as a SIMD processor.

Nice!

...although I still miss some goodies for accessing registers.

GCN is quite powerful here: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

We could do things like solving an n-body problem with 64 bodies, or sorting two size-32 arrays in parallel...

All with registers only.

With SM 6.0 we can speed things up using the quad shuffles, but we still need the same LDS amount to work around the limitations.

Of course that's an AMD example, so subject to extensions, but I was hoping for a little bit more... at least register rotation... nag, nag :)

Edit:

AMD has silently fixed the extension issues - see my edited post above :)

Going to swizzle registers...

Edited by JoeJ

---

> GCN is quite powerful here: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
>
> We could do things like solving an n-body problem with 64 bodies, or sorting two size-32 arrays in parallel...
>
> All with registers only.

Ah, that link is quite interesting in that it talks about GCN splitting its wavefront into rows. I hadn't dug into GCN specifics yet. So I guess there are already some wavefront operations on AMD that don't guarantee 'warp synchronous' behaviour, which might be exposed in future GL/Vulkan extensions. So I definitely should be putting in WARPBARRIER macro placeholders at least, then.

I looked into OpenCL 2 on AMD just now and found out they don't support C++ yet (but do in a 1.2 extension!), as I was getting curious about attempting a Vulkan + hacked-around CUDA/OpenCL 2 interim combo before we can get at least a reliably decent C++ subset into Vulkan shaders (I would be quite happy with just C plus templates, plus little things like C++11 list initialisers, really). I guess next year will be pretty good for all of this!

---

No, GCN wavefronts execute in lockstep. You can consider that 64/16 splitting a single op that takes four cycles.

Vulkan has no data sharing yet, so you can't use it with CUDA/CL efficiently - you would need to download the data to the host and upload it to the GPU again so graphics can see it.

Fortunately, data sharing has been announced for the next Vulkan.

I would not demand C++ features yet; I would be happy enough with proper C.

E.g. with OpenCL's C language you can use the same LDS memory for an integer array, and later cast it to be reused as an array of floats, vectors or whatever.

With GLSL you can't, because there are no pointers, so you have to allocate twice the LDS amount due to a stupid language limitation.
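As a partial workaround (my suggestion, not from the posts above): for scalar float/uint data, GLSL's built-in floatBitsToUint()/uintBitsToFloat() can reinterpret values in a single shared array, though it doesn't help for vectors or structs:

```glsl
shared uint scratch[256]; // one LDS allocation, reused for two types

// phase 1: use scratch as a uint array directly
void storeIndex(uint i, uint v) { scratch[i] = v; }

// phase 2 (after a barrier()): reuse the same LDS as a float array
void storeWeight(uint i, float v) { scratch[i] = floatBitsToUint(v); }
float loadWeight(uint i)          { return uintBitsToFloat(scratch[i]); }
```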

I really hope for an alternative to GLSL that fixes those things. HLSL already seems to have support, but I guess it has similar or even worse limitations.

Something like OpenCL C would be just right.

This file contains all the GCN extensions: https://github.com/KhronosGroup/glslang/blob/master/glslang/MachineIndependent/Initialize.cpp

Already beyond SM 6.0 :)

Edit: Arghh - no, I tried, but the extensions don't work; same for the older ARB extensions :(

Edited by JoeJ

---

Oh, thanks for that further clarification and info - knowing about the data sharing has saved me some short-term heartache then!

I had actually hoped that from all the GPU compute languages we would get a 'better C/C++' that was closer to C with a lot of cruft removed, plus the GLSL convenience vector types etc. I am in no hurry to see virtual functions or multiple inheritance fully supported in any of this, personally! Really it's templates I would miss right now - or rather generic functions/types, not specifically the C++ way of doing them.

At the moment I have an unholy mess of #ifdefs around code that is a mix of GLSL and C/C++ in the same files. It will be nice when I can remove more of that. But such is this transition period.

Nice to know Vulkan/GL are still keeping a lead over D3D then, too :-) Now I wish NVIDIA would expose things like max3/min3 etc., as from PTX I believe their hardware has them too.

---

Actually, I have tried a few different variations of warp-level spinlocks and they all crash the NVIDIA display driver :-(

This:

```glsl
if (gl_SubGroupInvocationARB == 0) while (atomicCompSwap(lock, 0u, 1u) != 0u);
```

Hacking around, I got this not to crash, but I am not convinced it works:

```glsl
bool locked = false;

do
{
    if (gl_SubGroupInvocationARB == 0) locked = (atomicCompSwap(lock, 0u, 1u) == 0u);

    barrier(); // will crash without this barrier
}
while (ballotARB(locked) == 0);
```

I tried some of the other GLSL spinlocks from Google (non-warp-level) and they seem to freeze due to warp lockstep and divergence issues.

Neither AMD's nor NVIDIA's tools let me step through shaders. I'm contemplating switching API, but it seems that is an issue with Vulkan too, on both AMD and NVIDIA. The OpenCL debugger for AMD won't let you step through kernels with atomics right now. It seems only CUDA has working debugging tools at the moment :-/

It seems hopeless to debug something like this without using CUDA right now.



