[D3D11] Compute Shader Memory Barriers

Yup, all threads in a group will run concurrently in lock step. However, groups do not necessarily run concurrently. For example, Group A and B may run concurrently, but not Group C and D.

I do agree with Jason Z that the Device*** and All*** barriers tend to be geared towards large GPGPU programs (maybe at the gather stage of a GPGPU program). Most of the time we would try to use local thread registers (limited to 16k of 32-bit registers per SIMD core) or group shared memory (max 32 KB in DX11), and only touch device memory when we really have to.
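A minimal HLSL sketch of that idea, assuming a made-up reduction kernel (the buffer names and the 256-thread group size are just for illustration): intermediate results stay in registers and group shared memory, and the UAV is only written once at the end.

// Hypothetical reduction kernel: per-thread work stays in registers,
// intermediate results stay in GSM, and device memory is touched once.
groupshared float gPartial[256];               // well under the 32 KB cs_5_0 limit

StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

[numthreads(256, 1, 1)]
void CSMain(uint gi : SV_GroupIndex, uint3 groupId : SV_GroupID,
            uint3 dtid : SV_DispatchThreadID)
{
    float v = gInput[dtid.x];                  // register work
    gPartial[gi] = v * v;                      // stage the result in GSM

    GroupMemoryBarrierWithGroupSync();         // GSM writes visible to the whole group

    if (gi == 0)
    {
        float sum = 0.0f;
        for (uint i = 0; i < 256; ++i)
            sum += gPartial[i];
        gOutput[groupId.x] = sum;              // single write to device memory
    }
}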

Hope this helps.
Yeah, you make a good point re: warp/wave front (AMD's term) sizes vs. max group size.

I tend to keep my group sizes to warp/wave front sizes (or half of them), so I guess you'd have to rig up a test with a 'large' group size and see what happens.

To get the most telling results you'll probably have to introduce something which performs plenty of memory lookups and writes, so that latency has a chance to break things.
After thinking about this some more, it actually makes some sense to use GroupMemoryBarrier() instead of GroupMemoryBarrierWithGroupSync() in some situations, like DieterVW mentioned. When a thread group is larger than a warp, and the shader program performs all of its writes to GSM followed by some extra calculations (to fill the time while those writes complete), then it would make more sense not to block with a group sync. That would just force you to wait until all warps had executed up to that point.

To be honest, I can't think of a reason that you would ever want to use the GroupSync() version of the functions... Is that something to allow for bypassing some bad implementation of the thread scheduling or something? If you can only communicate between threads with one of these memory pools, then why would you need to sync the execution too? Maybe I'm missing something here - do you guys see why you would need to sync execution???
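A rough HLSL sketch of the pattern in question (the buffer names, the 128-thread group and the 'independent math' are invented for illustration): stage data in GSM, do unrelated ALU work to cover the write latency, then barrier without a sync before reading. Whether the non-sync barrier is actually sufficient once the group spans more than one warp is exactly what gets debated below.

groupshared float gData[128];

StructuredBuffer<float>   gIn  : register(t0);
RWStructuredBuffer<float> gOut : register(u0);

[numthreads(128, 1, 1)]
void CSMain(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    gData[gi] = gIn[dtid.x];                // write to GSM

    float extra = sin((float)dtid.x);       // independent work that never touches GSM,
                                            // filling time while the writes complete

    GroupMemoryBarrier();                   // waits on pending GSM writes only; does
                                            // not synchronise execution of the warps
    // GroupMemoryBarrierWithGroupSync();   // alternative: also stalls until every
                                            // thread in the group reaches this line

    gOut[dtid.x] = gData[gi ^ 1] + extra;   // reads a slot written by another thread
}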
Well, it all depends on the usage model;

- for small, warp/wave front sized groups you could probably use the non-sync version, as it won't matter - both versions will effectively mean the same thing.

- for groups larger than a warp/wave front you could use non-sync iff you know for certain that threads outside of a warp/wf won't be reading memory before the writes are done, but I'd really want to be 100% sure of that.

- groups larger than a warp/wf which DO have common-memory read-after-write, however, would require the 'sync' version to ensure everything was done in time for that next read.

So, yeah, most of the time you won't want to use sync, at least not on current 'lock step' hardware anyway. And it's probably not worth worrying about future hardware which might introduce non-lock-step execution, simply because I'm willing to bet we are still a little way off anything like that anyway.
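The first bullet above as a sketch, with a heavy caveat: the 64-thread group is an assumption (warp/wave front width is hardware-specific - 32 on NVIDIA parts of that era, 64 on AMD), and the claim that the two intrinsics behave the same here is the post's argument, not a spec guarantee.

groupshared float gTile[64];

StructuredBuffer<float>   gIn  : register(t0);
RWStructuredBuffer<float> gOut : register(u0);

[numthreads(64, 1, 1)]                // assumed to map to a single warp/wave front
void CSMain(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    gTile[gi] = gIn[dtid.x];

    GroupMemoryBarrier();             // per the post: equivalent to the sync version
                                      // when the whole group runs in lock step

    gOut[dtid.x] = gTile[63 - gi];    // cross-thread read within the same warp/wf
}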
Quote: Original post by phantom
Well, it all depends on the usage model;

- for small, warp/wave front sized groups you could probably use the non-sync version, as it won't matter - both versions will effectively mean the same thing.

- for groups larger than a warp/wave front you could use non-sync iff you know for certain that threads outside of a warp/wf won't be reading memory before the writes are done, but I'd really want to be 100% sure of that.

- groups larger than a warp/wf which DO have common-memory read-after-write, however, would require the 'sync' version to ensure everything was done in time for that next read.

So, yeah, most of the time you won't want to use sync, at least not on current 'lock step' hardware anyway. And it's probably not worth worrying about future hardware which might introduce non-lock-step execution, simply because I'm willing to bet we are still a little way off anything like that anyway.

These memory barrier functions only ensure that all writes up to that point in the shader program have actually been written to memory for the entire thread group. Whether that consists of 1 or 10 or 100 warps shouldn't matter at all - if you perform a memory barrier then all of the warps have to respect that all of the writes are completed. If you put any accesses to the shared memory after the memory barrier, then I don't see any situation where you would also want to have all threads in the thread group (from all warps!) execute to the same point and then resume.

Since the memory (consisting of either GSM or resources) is the only possible communication mechanism between threads, then a memory barrier is enough to synchronize - there shouldn't be any need to sync the execution too (at least that I can see...).
OK, let's take 'group shared memory' as an example.

You have a group consisting of, say, 5 warps' worth of threads, and depending on the thread the work load is different.

warp 1 + 2 are executed at the same time and issue a bunch of memory writes and then hit a wait point without a group sync.
They get swapped out.
warp 3 + 4 are executed BUT due to some logic differences they don't issue all their writes before they are swapped out (if statements, dependent reads, pick a reason)
warp 1 + 2 get swapped back in and their issued writes are completed; however, the writes for warp 3 and 4 aren't done, but as those were issued AFTER warp 1 and 2 hit the wait point they don't count towards the wait, so warp 1 and 2 go off on their way and read back from a shared location warp 3 and/or 4 were writing to
warp 3 and 4 get back in, complete their writes and hit a wait
warp 1 and 2 get back in and are now working on incorrect data.

Now, IF they had been using GroupSync in that situation warp 1 and 2 would have stalled until 3 and 4 had hit their GroupSync, all data would have been flushed and everything would have been fine.

The point is, if the barrier only waits on currently pending memory operations then this kind of problem can occur, where a different warp/wf/context cuts in and, without the forced sync, memory race conditions could happen.

Now, it's a very unlikely situation, certainly in the graphics area where you are likely to work with multiple warp/wf-sized groups in an overall dispatch and so get around this problem, but for GPGPU-type applications which might have large multi-warp/wf groups this could become a problem.
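A hypothetical HLSL shape of that scenario (the 160-thread group - 5 warps of 32 - and the access pattern are invented for illustration): each thread reads a GSM slot written by a thread in a different warp, which is exactly the case where the sync matters.

groupshared float gShared[160];

StructuredBuffer<float>   gIn  : register(t0);
RWStructuredBuffer<float> gOut : register(u0);

[numthreads(160, 1, 1)]                 // 5 warps of 32, sizes assumed
void CSMain(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    float v = gIn[dtid.x];
    if (v > 0.0f)                       // divergent work load per thread, as above
        v = sqrt(v);
    gShared[gi] = v;

    GroupMemoryBarrierWithGroupSync();  // every warp must reach this point and have
                                        // its writes completed before anyone reads

    uint other = (gi + 32) % 160;       // slot owned by a thread in another warp
    gOut[dtid.x] = gShared[other];
}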
I was under the impression that the determination of the write instructions to wait on is made at compile time, not at runtime. If a memory barrier instruction just flushes the pending write instructions at that moment in time, then you could never use it, because you don't know which warps would be run first or how many there would be (since it depends on the card you are using)...

If it is a compile-time decision, then the runtime behavior would still be deterministic regardless of how large the hardware warps are. I don't think the decision is made at runtime - it wouldn't provide any benefit if it just flushed the write queue, right?
Since changing the pipeline has a non-zero cost, it may be worth using a sync between algorithms in the same shader program instead of running several programs one after another. GSM is also very fast, pretty much like a very large cache. So if you can avoid saving the entire GSM out to a resource, only to read it all back in again for the next shader program, then you should get a win.

This approach to shader writing could have other problems though, like increasing register pressure if the compilers have any trouble reclaiming registers.

Phantom's scenario of warps/wf getting out of sync, or even of one warp completing before another has started, should theoretically be possible.

We've got some OIT algorithms that do sorts and blends in the same program, which requires syncs.
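Not the actual OIT code being referred to - just a generic sketch of chaining phases inside one shader program with syncs between them, rather than writing the GSM out to a resource and dispatching a second program (names and sizes are made up).

groupshared float gVals[256];
groupshared float gGroupMax;

StructuredBuffer<float>   gIn  : register(t0);
RWStructuredBuffer<float> gOut : register(u0);

[numthreads(256, 1, 1)]
void CSMain(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    // Phase 1: load into GSM.
    gVals[gi] = gIn[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // Phase 2: produce a group-wide value in GSM (a naive stand-in for the sort).
    if (gi == 0)
    {
        float m = 0.0f;
        for (uint i = 0; i < 256; ++i)
            m = max(m, gVals[i]);
        gGroupMax = m;
    }
    GroupMemoryBarrierWithGroupSync();

    // Phase 3: every thread consumes the phase-2 result (stands in for the blend).
    gOut[dtid.x] = gVals[gi] / max(gGroupMax, 1e-6f);
}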
I don't think it can be done at compile time simply because until you start doing something you can't know the state of the memory sub-system and the related write details.

It isn't a matter of saying 'have all writes been issued', it's a matter of 'wait until the data which is currently waiting to write has been physically written'; it only flushes the write queue in the sense that no more data is added to it from the threads currently waiting in the warp/wf which issued the barrier. It effectively suspends the thread until the memory subsystem gives the 'ok' on data having been written.

The only way your suggestion would work would be to make the non-sync version an implicit sync version: even if you could determine when all the writes had been issued, every thread would still have to wait, when it hit a barrier, for all other possible writes at that point to complete, which means we have to wait on everyone else in the group anyway.

A good example would be fetching and writing to threadgroup local memory;
(only 1 warp at a time in this model)
- warp 1 executes a read request for a chunk of data
- warp 2 executes a read request for a chunk of data
- warp 1 is woken up when that data is read, writes it to threadgroup local memory and issues a barrier without sync
- warp 2 is woken up when that data is read BUT due to an 'if' statement one or more of the threads issues another read request

And already we have a problem;
- as every thread in warp one took the same 'if' path to avoid the extra read they got to the barrier earlier than warp two's threads
- warp two is now asleep until its data comes back; this could be fast if it's in the cache, it could be slow if it has to round trip to memory for something

So how do we wait?
We know a write is coming 'soon' but how soon?
When will it be done?
To be safe, as it appears you are suggesting, the compiler would have to assume that warp 1 could read from the data warp 2 will be writing 'soon' so it has to stall for even longer until it knows warp 2 is done writing data; bam! implicit GroupSync.

You are right in that it's only 'safe' to do a non-groupsync barrier IFF* you know for sure that reads will never overlap writes in a non-synced area. If nothing else I could call it an optimisation, as you are trading the safety of knowing the group is a certain size against the stalling of more threads than required due to redundant waiting on data which never overlaps.

*(I'm assuming those reading know that 'iff' isn't a typo and stands for 'if and only if' btw)
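A rough HLSL shape of that example (two warps assumed to be 64 threads; the 'if' condition and the second buffer are invented): some threads take an extra dependent read before they ever write to GSM, so a barrier that only waits on already-issued writes can't cover them.

groupshared float gStage[64];

StructuredBuffer<float>   gIn       : register(t0);
StructuredBuffer<float>   gIndirect : register(t1);
RWStructuredBuffer<float> gOut      : register(u0);

[numthreads(64, 1, 1)]                   // assumed: two warps of 32
void CSMain(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    float v = gIn[dtid.x];               // first read request

    if (v < 0.0f)                        // only some threads take this path...
        v = gIndirect[dtid.x];           // ...and issue a second, dependent read

    gStage[gi] = v;                      // the write the barrier is meant to cover

    GroupMemoryBarrierWithGroupSync();   // without the GroupSync, the warp that skipped
                                         // the extra read could clear this point before
                                         // the slower warp has written anything at all

    gOut[dtid.x] = gStage[63 - gi];      // cross-warp read of the staged data
}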
Ugh, I think I am getting more confused the more I go through this... Alright, let me get a couple of things straight:

- Is the memory barrier to flush reads, writes, or both?

- If the flush is only for the currently active reads/writes (as phantom describes), how could a thread group consisting of more than 1 warp load some texture data into the GSM and then do a memory barrier before all threads in the group use it? If some warps haven't even been run yet, then how can it be called a memory barrier? The shared memory would only be partially loaded even though the threads were supposed to be synchronized with a memory barrier.

If it is just a memory flush, then I don't see the use case for using just an XXXMemoryBarrier(). You can't be certain about which parts of a thread group have actually run, so you don't have any way to reliably tell which threads may or may not have read from or written to shared memory. Perhaps I'm still missing the point - does anyone have an example of using just the XXXMemoryBarrier() that could help clear it up?
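For reference, the pattern being described, written out as a generic sketch (a blur-style tile prefetch invented for illustration, not code from this thread): every thread loads one texel into GSM, then reads neighbouring texels loaded by other threads. If the group spans several warps and the barrier did not wait for all of them, the tile would indeed be only partially populated at the read.

Texture2D<float4>   gSrc : register(t0);
RWTexture2D<float4> gDst : register(u0);

groupshared float4 gTile[8][8];

[numthreads(8, 8, 1)]                       // 64 threads, i.e. more than one warp on some hardware
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    gTile[gtid.y][gtid.x] = gSrc[dtid.xy];  // cooperative load into GSM

    GroupMemoryBarrierWithGroupSync();      // the whole tile must be resident before
                                            // anyone reads a neighbour's texel

    uint nx = (gtid.x + 1) & 7;             // neighbour loaded by another thread
    gDst[dtid.xy] = 0.5f * (gTile[gtid.y][gtid.x] + gTile[gtid.y][nx]);
}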

Thanks again for the help - I understand much more already than when I started out...




