[D3D11] Compute Shader Memory Barriers

In the past, I have used GroupMemoryBarrierWithGroupSync to synchronize all of the threads in a group after filling the group shared memory with a bunch of texture data loaded by the individual threads. After that call, I was assured that the writing to the GSM was completed and that all threads in the group came to this point. It has worked out well in all instances that I have used it.

While reading through the documentation today, I found that there are actually six different variants of these sync intrinsics:

AllMemoryBarrier, AllMemoryBarrierWithGroupSync
DeviceMemoryBarrier, DeviceMemoryBarrierWithGroupSync
GroupMemoryBarrier, GroupMemoryBarrierWithGroupSync

From this, I have a couple of general questions to which I haven't been able to find an answer online or in the docs. Hopefully someone out there has clarified this already - here goes:

1. What is the difference between with and without GroupSync? The docs say this ensures all threads hit this point in their execution before any are allowed to proceed, but then what is the behavior of the version without the GroupSync?

2. What precisely is the difference between the All***, Device***, and Group*** variants? They appear to be different thread scope specifiers, but only the threads within a thread group can communicate with one another (except via a UAV, I suppose), so why would there be other synchronization scopes?

Thanks in advance to anyone that can help clarify the topic!

Hmmm, my take on it would be:

The difference between GroupSync and non-GroupSync is that, with GroupSync, all threads in the thread group have to reach that instruction before any thread in the group can proceed. The version without GroupSync only ensures that, past that particular barrier point, all memory accesses to the region are completed.

The difference between GroupMemory, DeviceMemory, and AllMemory basically refers to the memory location. Group memory resides only within a thread group, while device memory spans all groups - a bit similar to the CUDA threading model of warps, blocks and grids. AllMemory refers to both group and device memory. The reason for having thread groups is to help reduce synchronization across all memory: if we can get away with synchronizing only the group memory, there is no need to synchronize the device memory.

regards.

[Edited by - littlekid on August 29, 2010 3:50:46 PM]

Thanks for the response. So, regarding the versions without GroupSync, this means that all writes that occur prior to this point in the shader must be completed? Can anyone think of a situation where this would be more appropriate than also synchronizing the group?

One other point - when you mention device memory, are you referring to resources such as buffers and textures? That would make sense, since they would also allow communication across threads (as well as across thread groups, I suppose...).

Thanks again for the help!

Typically:
local scalars and local arrays are in the scope of thread memory;
groupshared variables fall in the scope of group memory;
constants, shader resource views, and UAVs are device memory, and are global.
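
To put that in HLSL terms, here is a minimal sketch of where each scope shows up (the names and register bindings are purely illustrative, not from this thread):

// Illustrative only - hypothetical names and bindings.
cbuffer Params : register(b0) { float gScale; };  // device memory: constants
Texture2D<float> gInput : register(t0);           // device memory: SRV
RWBuffer<float>  gOutput : register(u0);          // device memory: UAV

groupshared float sTile[64];                      // group memory

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    float local = gInput[dtid.xy] * gScale;       // thread memory: local scalar
    sTile[gtid.x] = local;                        // write to group memory
    GroupMemoryBarrierWithGroupSync();            // make the group writes visible
    gOutput[dtid.x] = sTile[63 - gtid.x];         // read another thread's value, write device memory
}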

The XXXMemoryBarrier intrinsics are useful because they guarantee that all accesses to a memory are completed and thus visible to other threads. Even though we may, say, update a particular value, that does not necessarily mean that when another thread attempts to read it, the value has already been updated - the GPU may queue the write or defer it until slightly later. What a memory barrier does is ensure that all updates to the memory are done before we access it.

For example (off the top of my head), we may have a match_counter, which increases by one whenever a worker thread finds a matching datum and does some independent work on it. We then have another thread which performs a gather operation, looping through all the matching data. Rightfully, before looping through the matched data, we would have to call GroupMemoryBarrier, as we would want a stable state of match_counter. If not, a worker thread might still have a pending increment to match_counter, and the gather operation would see a wrong state.
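
A minimal sketch of that counter idea, assuming a 64-thread group and a stand-in IsMatch() test (this is illustrative, not littlekid's actual code; note that later posts in this thread argue the pre-gather barrier needs the WithGroupSync variant once a group spans more than one warp):

#define GROUP_SIZE 64

groupshared uint match_counter;
groupshared uint matched[GROUP_SIZE];

bool IsMatch(uint i) { return (i % 3) == 0; }    // stand-in for the real test

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    if (gtid.x == 0)
        match_counter = 0;
    GroupMemoryBarrierWithGroupSync();           // zeroed counter visible to all

    if (IsMatch(gtid.x))
    {
        uint slot;
        InterlockedAdd(match_counter, 1, slot);  // reserve a slot atomically
        matched[slot] = gtid.x;                  // record the match
    }

    GroupMemoryBarrier();                        // stabilise match_counter/matched, as described
    if (gtid.x == 0)
    {
        for (uint i = 0; i < match_counter; ++i)
        {
            // ... gather pass over matched[i] ...
        }
    }
}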

The benefit of XXXMemoryBarrier is that it only enforces a memory barrier. Hence only the accesses to memory need to be synchronized, not the actual thread instructions.

P.S. I have not tested this and will try it out some time later, but I believe XXXMemoryBarrier can be placed in divergent code, whereas XXXMemoryBarrierWithGroupSync cannot, as it would stall instead.

I see - so XXXMemoryBarrier is intended to ensure that all the writes to a memory have been executed before moving on. I would presume that this only covers writes located before this statement in the shader code, right? If so, then any later writes to that memory could again be barrier'ed with another call?

Also, for DeviceMemoryBarrier, this means that all the resource writes have been completed up to this point, right? This must be intended to synchronize across multiple thread groups - so if the memory barrier is halfway through a shader, then every thread group would execute up to that point (at least until all of their memory accesses have completed) and then they could all continue? This seems like a very heavy operation if all of the thread groups have to halt until the writes are completed...

Thanks again for your help, this is very informative. Have you had the opportunity to use these instructions (without the GroupSync) in an actual algorithm? I'd love to hear about it if you have!

Quote:
Original post by Jason Z
I see - so XXXMemoryBarrier is intended to ensure that all the writes to a memory have been executed before moving on. I would presume that this only covers writes located before this statement in the shader code, right? If so, then any later writes to that memory could again be barrier'ed with another call?


Yes; the memory barriers only apply to outstanding memory operations.

As such, having completed one memory barrier, there will be no more outstanding operations once the next instruction executes; when you hit the next memory barrier it will stall until the next set of operations is done, and so on and so forth.

While the question clearly wasn't directed at me, I'll just throw out that I've not found a reason to use the versions sans 'GroupSync' yet; really, my only use thus far has been to set up a dataset to be shared among a group, with some threads doing a little more work than others to read in a 'skirt' around the area of operation. At that point, of course, GroupSync is needed to ensure we've got all our data for later operations.

(This is, of course, a very common optimisation for compute shaders which need groups to operate on shared data; the saving in memory bandwidth alone makes it worthwhile in high-bandwidth situations.)
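
For illustration, a sketch of that skirt-load pattern, assuming a hypothetical 8x8 tile with a 1-texel border over a 1024x1024 input (the names and sizes are assumptions, not from the post itself):

#define TILE   8
#define BORDER 1
#define CACHE  (TILE + 2 * BORDER)   // 10x10 region: tile plus skirt

Texture2D<float4>   gInput  : register(t0);
RWTexture2D<float4> gOutput : register(u0);

groupshared float4 sCache[CACHE][CACHE];

[numthreads(TILE, TILE, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID,
            uint3 dtid : SV_DispatchThreadID)
{
    int2 origin = int2(gid.xy) * TILE - BORDER;
    uint flat   = gtid.y * TILE + gtid.x;

    // 64 threads fill 100 cache cells, so some threads load twice.
    for (uint i = flat; i < CACHE * CACHE; i += TILE * TILE)
    {
        int2 p = clamp(origin + int2(i % CACHE, i / CACHE), 0, 1023);
        sCache[i / CACHE][i % CACHE] = gInput.Load(int3(p, 0));
    }

    // Every thread must see the fully-populated cache before filtering.
    GroupMemoryBarrierWithGroupSync();

    // ... e.g. a 3x3 filter over sCache[gtid.y+BORDER+dy][gtid.x+BORDER+dx],
    // written to gOutput[dtid.xy] ...
}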

Hi Phantom,

I have also used the ***GroupSync versions to save on bandwidth, for ambient occlusion and for my water simulation shaders. They seem quite intuitive and easier to understand than the other variants. I just tried running my water simulation code (which uses the barrier in two different places) with GroupMemoryBarrier() instead of GroupMemoryBarrierWithGroupSync(), and all appears to be working properly...

I wonder if the version without GroupSync is more performant - stand by, I'll rebuild my engine in release mode and see if there is a noticeable difference (the demo is completely GPU-bound, so hopefully this will show whether there is some scheduling efficiency in using the shorter version)...

Having now looked at the D3D docs on these instructions, the only difference seems to be, as noted above, the conditions under which a thread 'unblocks'.

So, with 'GroupSync', all threads are forced to wait until every thread in the group has hit the same point in the shader; sans 'GroupSync', any thread is free to continue once all the memory operations it issued before hitting the barrier have completed.

Now, on hardware which executes in 'lock step' fashion, I'm pretty sure there will be no difference at all in the outcome when it comes to group-memory-level operations.

As all threads are executing the same instructions at the same time, they will all hit the memory barrier at the same point, which means they will all unblock at the same point once all pending memory operations have completed.

I'm pretty sure this 'group lock step' applies even to the most recent DX11 hardware; at least nothing in my memory is screaming at me that this isn't the case (I might have to go check some write-ups in a bit, however, if anyone thinks differently).

As such, I'll be very surprised if using one over the other matters at a group level, unless there are some strange micro-coding differences at the GPU level, of course.

As for 'all' and 'device', well I suspect using one over the other could result in strange data race conditions depending on the resources being written/read and how the hardware sets up the various threads to be executed.

That would be my take on it anyway...

From my testing, it's pretty much a draw, or the GroupSync version is slightly faster (but the difference is well within the margin of error...). I think your statements about the lock-step processors are correct, but I honestly haven't kept up to speed on the latest GPU architectures to know if that is how they work or not :(

In any case, I feel like I have a good understanding of what is going on now. I still think the Device*** and All*** are most likely geared toward gigantic GPGPU workloads, since they would essentially stall out multiple thread groups until every dispatched group had been processed to that point... it seems like quite a 'macro' level synchronization.

I believe that all threads running the same shader program on a warp will run in lock step. However, that doesn't force other warps running the same program to run in lock step with each other, and the largest group size can be 1024 threads, which won't fit on a single warp. IHVs might be adding extra syncs or changing something in the driver's shader compiler to ensure correctness, which might be limiting the performance difference between sync and non-sync. I'm not sure how thorough their optimizations will be in this area, given that it's still a bit young.

Yup, all threads in a group will run concurrently in lock step. However, groups do not necessarily run concurrently. For example, groups A and B may run concurrently, but not groups C and D.

I do agree with Jason Z that Device*** and All*** tend to be geared towards large GPGPU programs (maybe at the gather stage of a GPGPU program). Most of the time we would try to use local thread registers (limited to 16K 32-bit registers per SIMD core) or groupshared memory (max 32 KB in DX11), and only touch device memory when we really have to.

Hope this helps.

Yeah, you make a good point re: warp/wavefront (AMD's term) vs. max group size.

I tend to keep my group sizes to warp/wavefront sizes (or half of them), so I guess you'd have to rig up a test with a 'large' group size and see what happens.

Probably, to find the best results, you'll have to introduce something which performs plenty of memory lookups and writes, in order to allow latency to break things.

After thinking about this some more, it actually makes sense to use GroupMemoryBarrier() instead of GroupMemoryBarrierWithGroupSync() in some situations like DieterVW mentioned. When a thread group is larger than a warp, and the shader program performs all its writes to GSM followed by some extra calculations (to fill the time and hide the latency of the memory writes), then it would make more sense not to block with a group sync, which would just force you to wait until all warps had executed up to that point.

To be honest, I can't think of a reason that you would ever want to use the GroupSync version of the functions... Is that something to allow for bypassing a bad implementation of the thread scheduling, or something similar? If you can only communicate between threads via one of these memory pools, then why would you need to sync the execution too? Maybe I'm missing something here - do you guys see why you would need to sync execution?

Well, it all depends on the usage model;

- small, warp/wavefront-sized groups: you can probably use the non-sync version, as it won't matter and both versions will effectively mean the same thing.

- larger-than-warp/wavefront-sized groups: you could use non-sync iff you know for certain that threads outside a warp/wf won't be reading the memory before the writes are done, but I'd really want to be 100% sure of that.

- larger-than-warp/wf-sized groups which DO have common memory read-after-write, however, would require the 'sync' version to ensure everything is done in time, ready for that next read.

So, yeah, most of the time you won't want to use sync, at least not on current 'lock step' hardware anyway. And it's probably not worth worrying about future hardware which might introduce non-lock-step execution, simply because I'm willing to bet we are still a little way off anything like that anyway.
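
As a hypothetical illustration of the middle case - assuming a 64-wide warp/wavefront and a 128-thread group, where each warp only reads back slots its own threads wrote, so on this argument the non-sync barrier is enough:

groupshared float sData[128];
RWBuffer<float> gOut : register(u0);

[numthreads(128, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    sData[gtid.x] = (float)gtid.x * 0.5f;   // illustrative write

    GroupMemoryBarrier();                   // no group-wide execution sync

    // Stay within this thread's own 64-wide warp/wavefront region:
    uint base = gtid.x & ~63u;
    gOut[gtid.x] = sData[base + ((gtid.x + 1) & 63u)];
}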

Quote:
Original post by phantom
Well, it all depends on the usage model; ...

These memory barrier functions only ensure that all writes up to that point in the shader program have actually been written to memory, for the entire thread group. Whether that consists of 1 or 10 or 100 warps shouldn't matter at all - if you perform a memory barrier, then all of the warps have to respect that all of the writes are completed. If you put any accesses to the shared memory after the memory barrier, then I don't see any situation where you would also want all threads in the thread group (from all warps!) to execute to the same point and then resume.

Since memory (either GSM or resources) is the only possible communication mechanism between threads, a memory barrier should be enough to synchronize - there shouldn't be any need to sync the execution too (at least that I can see...).

OK, let's take 'group shared memory' as an example.

You have a group consisting of, say, 5 warps' worth of threads, and depending on the thread the workload is different.

warp 1 + 2 are executed at the same time, issue a bunch of memory writes, and then hit a wait point without a group sync.
They get swapped out.
warp 3 + 4 are executed BUT due to some logic differences they don't issue all their writes before they are swapped out (if statements, dependent reads, pick a reason).
warp 1 + 2 get swapped back in and their issued writes are completed; however, the writes for warps 3 and 4 aren't done, but as those were issued AFTER the wait point was hit by warps 1 and 2 they don't count towards the wait, so warps 1 and 2 go off on their way and read back from a shared location warp 3 and/or 4 were writing to.
warp 3 and 4 get back in, complete their writes and hit a wait.
warp 1 and 2 get back in and are now working on incorrect data.

Now, IF they had been using GroupSync in that situation, warps 1 and 2 would have stalled until 3 and 4 had hit their GroupSync, all data would have been flushed, and everything would have been fine.

The point is, if the barrier only waits on currently pending memory operations, then this kind of problem can occur, where a different warp/wf/context cuts in and, without the forced sync, memory race conditions can happen.

Now, it's a very unlikely situation - certainly in the graphics area, where you are likely to work with multiple warp/wf-sized groups in an overall dispatch and so get around this problem - but for GPGPU-type applications, which might have large multi-warp/wf groups, this could become a problem.
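
A contrived HLSL sketch of that hazard, assuming 64-wide warps/wavefronts and a 128-thread group (hypothetical code; SomeWork() is a stand-in):

groupshared float sData[128];
RWBuffer<float> gOut : register(u0);

float SomeWork(uint i) { return sqrt((float)i); }   // stand-in workload

[numthreads(128, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    sData[gtid.x] = SomeWork(gtid.x);

    // Only waits on memory operations already issued when *this* warp
    // arrives; another warp may not have issued its write yet.
    GroupMemoryBarrier();

    // Reads a slot owned by a thread 64 away - a different warp - so
    // without GroupSync this read can race with that warp's write.
    gOut[gtid.x] = sData[(gtid.x + 64) & 127];
}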

I was under the impression that the determination of the write instructions is done somewhere at compile time, not at runtime. If a memory barrier instruction just flushes the writes pending at that moment in time, then you could never use it, because you don't know which warps would run first or how many there would be (since it depends on the card you are using)...

If it is a compile-time decision, then the runtime behavior would still be deterministic regardless of how large the hardware warps are. I don't think the decision is made at runtime; it wouldn't provide any benefit if it just flushed the write queue, right?

Since changing the pipeline has a non-zero cost, it may be worth using a sync between algorithms in the same shader program instead of running several programs one after another. GSM is also very fast, pretty much like a very large cache. So if you can avoid saving the entire GSM out to a resource, only to read it all back in again for the next shader program, then you should get a win.

This approach to shader writing could have other problems though, like increasing register pressure if the compilers have any trouble reclaiming registers.

Phantom's scenario of warps/wfs getting out of sync, or even of one warp completing before another has started, should theoretically be possible.

We've got some OIT algorithms that do sorts and blends in the same program, which requires syncs.

I don't think it can be done at compile time, simply because until you start executing you can't know the state of the memory subsystem and the related write details.

It isn't a matter of saying 'have all writes been issued'; it's a matter of 'wait until the data which is currently waiting to write has been physically written'. It only flushes the write queue in the sense that no more data is added to it from the threads currently waiting in the warp/wf which issued the barrier. It effectively suspends the thread until the memory subsystem gives the 'OK' that the data has been written.

The only way your suggestion would work would be to make the non-sync version an implicit sync version: even if you could determine when all the writes had been issued, every thread would still have to wait at the barrier for all other possible writes at that point to complete, which means we have to wait on everyone else in the group anyway.

A good example would be fetching and writing to threadgroup local memory
(only 1 warp at a time in this model):
- warp 1 executes a read request for a chunk of data
- warp 2 executes a read request for a chunk of data
- warp 1 is woken up when its data arrives, writes it to threadgroup local memory, and issues a barrier without sync
- warp 2 is woken up when its data arrives BUT, due to an 'if' statement, one or more of its threads issues another read request

And already we have a problem:
- as every thread in warp 1 took the same 'if' path to avoid the extra read, they got to the barrier earlier than warp 2's threads
- warp 2 is now asleep until its data comes back; this could be fast if it's in the cache, or slow if it has to round-trip to memory

So how do we wait?
We know a write is coming 'soon', but how soon?
When will it be done?
To be safe, as you appear to be suggesting, the compiler would have to assume that warp 1 could read from the data warp 2 will be writing 'soon', so it has to stall for even longer until it knows warp 2 is done writing data; bam! implicit GroupSync.

You are right in that it's only 'safe' to do a non-GroupSync barrier IFF* you know for sure that reads will never overlap writes in a non-synced area. If nothing else, I could call it an optimisation: you are trading the safety of knowing the group is a certain size against stalling more threads than required, due to redundant waiting on data which never overlaps.

*(I'm assuming those reading know that 'iff' isn't a typo and stands for 'if and only if' btw)

Ugh, I think I am getting more confused the more I go through this... Alright, let me get a couple of things straight:

- Is the memory barrier to flush reads, writes, or both?

- If the flush only covers the currently active reads/writes (as phantom describes), how could a thread group consisting of more than one warp load some texture data into the GSM and then do a memory barrier before all threads in the group use it? If some warps haven't even run yet, then how can it be called a memory barrier? The shared memory would only be partially loaded, even though the threads were supposed to be synchronized by the memory barrier.

If it is just a memory flush, then I don't see the use case for a bare XXXMemoryBarrier(). You can't be certain which parts of a thread group have actually run, so you have no way to reliably tell which threads may or may not have read/written a shared memory. Perhaps I'm still missing the point - does anyone have an example of using just XXXMemoryBarrier() that could help clear it up?

Thanks again for the help - I understand much more already than when I started out...




A barrier doesn't really 'flush' anything; it simply suspends the threads until outstanding memory operations are completed. Based on the wording, it would seem to cover both reads and writes, but generally it's only the writes you are going to worry about syncing.

The problem you have described is the reason the 'GroupSync' versions exist. A memory barrier isn't really a system to sync up threads; it's just a method of saying 'wait until memory operations are completed' - a barrier which blocks until that condition becomes true.

And you are right; with larger-than-warp/wf groups you wouldn't be able to use a non-group-sync memory barrier and guarantee correct access to memory shared between multiple warps/wavefronts, since you have no information on the order in which they run.

Put simply:
- XXXMemoryBarrier stalls until outstanding memory operations, active at the time of the call, have completed
- XXXMemoryBarrierWithGroupSync stalls until outstanding memory operations, active at the time of the call, have completed AND all threads in the group have hit the instruction

Really, I would consider XXXMemoryBarrier an optimisation for larger-than-warp/wf thread groups; for situations when you know for certain that you won't be accessing shared memory outside of a warp/wf, so it's only important that writes are completed within that group.

Yes, you have to be careful but there might be cases where you don't want to stall everything in order to wait for memory access to be completed across the whole group.

Thanks for sticking with me - it is becoming clearer. I thought over the potential uses today, and I have devised a few tests to see exactly what the various combinations of access (reads and writes) do with the various memory pools (group shared and device) and memory barriers (with and without sync). I believe the tests will help me understand more precisely what these instructions currently do.

A few more comments based on your latest responses:
Quote:
Really, I would consider XXXMemoryBarrier an optimisation for larger-than-warp/wf thread groups; for situations when you know for certain that you won't be accessing shared memory outside of a warp/wf, so it's only important that writes are completed within that group.
Based on our current understanding, I don't think that is really practical unless you have a database of cards with their warp/wavefront sizes. What if someone develops an ultra-cheap DX11 card that has only 8x8 warps (or even smaller)? You wouldn't be able to blindly rely on a particular minimum warp size...
Quote:
Yes, you have to be careful but there might be cases where you don't want to stall everything in order to wait for memory access to be completed across the whole group.
This is precisely what the XXXMemoryBarrier() function does - it suspends the entire thread group until the memory accesses are completed. The difference with XXXYYYWithGroupSync() is that all threads have to make it to the instruction before the group is resumed.

Further to the point - if you have a mixed threading situation where some threads are writing to a shared area and others are not (doing some calculations, perhaps), then it would be necessary to use XXXMemoryBarrier(), since the non-writing threads would never hit a XXXYYYWithGroupSync() instruction. In this scenario, the writing threads write their data and then call XXXMemoryBarrier(). This stalls the entire group, including the computation threads, until the writes are completed; then everyone gets started back up and continues on.
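
As a hypothetical sketch of that mixed scenario, if the barrier behaves as described here (Produce() and DoIndependentWork() are stand-ins, and the divergent-flow placement leans on littlekid's untested note earlier in the thread):

groupshared float sShared[32];

float Produce(uint i) { return (float)i * 2.0f; }          // stand-in
void DoIndependentWork(uint i) { /* pure computation */ }  // stand-in

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    if (gtid.x < 32)
    {
        sShared[gtid.x] = Produce(gtid.x);
        GroupMemoryBarrier();        // placed in divergent flow - legal per the
                                     // earlier P.S.; WithGroupSync here would not be
    }
    else
    {
        DoIndependentWork(gtid.x);   // these threads never see a barrier
    }
    // ... all threads continue from here ...
}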

Perhaps that is the ultimate purpose of these instructions - to allow heterogeneous work batches. I still have difficulty believing that XXXMemoryBarrier() can give different results depending on the size of the warp/wavefront, but my experiments tonight should help clarify that topic (in my head, anyway [grin])! I'll certainly post the results of my testing once it is working properly.

