The assumption that all warps/wavefronts execute in lockstep is something I'm unsure how far you can trust. It seemed solid until sometime after Kepler, when mixed messages appeared and the CUDA docs began to advise against relying on it - though I can no longer find that in the current docs. A lot of CUDA libraries rely on that behaviour, yet there seem to be conflicting opinions on the matter, with neither nVidia nor AMD clarifying the current state of play. And of course CUDA spec != OpenGL spec, so there's even more room for things not to be quite as assumed, even on the same hardware, as one spec may give more leeway than the other.
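To illustrate what I mean by relying on lockstep, here is a minimal GLSL-style sketch (ComputeSomething() and UseValue() are just placeholder names) of the kind of code in question:

shared uint warpValue;

// Only the lead invocation of the warp/subgroup writes the value...
if(gl_SubGroupInvocationARB == 0) warpValue = ComputeSomething();
// ...and everyone else reads it straight away with no barrier(), relying on
// the whole warp/wavefront reaching this point together and seeing the write.
UseValue(warpValue);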
You mentioning that Maxwell may not execute warps in lockstep is the first I have heard of it - and I cannot find any reference to it on Google so far. Where did you hear that!? I am intending to try this stuff on Pascal over the next week, so you have worried me greatly! :-)
I have only written basic CUDA stuff and have tried to stick with GL for cross-platform work, with a mind to switch completely to Vulkan when it's 'ready' - but it looks like I may hold off until Vulkan Next shows up now.
I do keep getting tempted to switch to GL + CUDA/OpenCL due to debugging issues (I've crashed my system a lot recently, mostly on nvidia) - but CUDA debugging/profiling on Pascal isn't really working at the moment either, so it's a bit awkward anyway.
I suspect CUDA must still be faster than GL compute in some ways for straight compute, but I haven't compared - though I thought the general consensus was that GL had caught up. I have seen some strange things, such as surprisingly close GL compute performance between an RX470 and a GTX1080, and I did intend to test in CUDA to see whether that was a GL compute inefficiency on nvidia (or, more likely, me doing things wrong). The latest nvidia drivers for Pascal also seemed to speed up compute since I last tested, though. But I haven't got that far yet, so no, I don't really know :-)
Really, the quicker everyone can just focus on one API (Vulkan!) the better - and then I am perfectly fine with vendor-specific extensions, given there are only two vendors that really matter for this stuff.
So if I can find something that definitely says relying on warp lockstep execution is not safe, then I will litter the code with a WARPBARRIER() macro that can be defined as barrier() to be safe. So this:
#define WARPBARRIER barrier

// 'lock' is assumed to be a coherent uint in a buffer (SSBO).
// The lead invocation of the warp/subgroup acquires the spin lock for the whole warp.
bool leadThread = (gl_SubGroupInvocationARB == 0);
if(leadThread) while(atomicCompSwap(lock, 0, 1) != 0);
WARPBARRIER(); // make sure the lock is held before anyone touches the list
ModifyBufferListItems();
if(leadThread) ModifyBufferListItemCount(); // only the lead invocation updates the count
memoryBarrierBuffer(); // make the buffer writes visible before the lock is released
WARPBARRIER(); // make sure every invocation has finished before releasing
if(leadThread) atomicExchange(lock, 0);
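And if lockstep does turn out to be safe to rely on, the macro can simply compile away to nothing instead, so it costs nothing where the assumption holds:

// Safe fallback: sync the whole workgroup at each point.
//#define WARPBARRIER barrier
// Lockstep-trusting build: WARPBARRIER() expands to nothing.
#define WARPBARRIER()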
EDIT: As a note to others who may read this: there are lots of similar warnings to be found via Google regarding the dangers of relying on warp lockstep execution, such as:
https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp-/
But I am not really convinced how valid that is currently, given that nvidia/AMD have recently posted material seemingly encouraging it, such as:
https://developer.nvidia.com/reading-between-threads-shader-intrinsics
One interesting note is that I think the ballot instructions are defined to intrinsically sync divergent warp threads - so in theory, if you use one or more of those types of instruction (which I actually do), you would maybe not need the WARPBARRIER macro in the example above, as you could be sure the threads have re-converged at that point.
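So the end of the example above could perhaps look like this instead (just a sketch, assuming GL_ARB_shader_ballot / GL_ARB_gpu_shader_int64, and assuming that re-convergence behaviour really is guaranteed - which I have not confirmed):

memoryBarrierBuffer();
// The ballot itself is the sync point here: all active invocations of the
// warp/subgroup take part in it, so in theory they have re-converged by now.
// The returned mask is not actually needed for this.
uint64_t activeMask = ballotARB(true);
if(leadThread) atomicExchange(lock, 0);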
But it would be nice to know for sure whether you can always rely on warp lockstep behaviour in GLSL with the recent extensions on all hardware...