Jason Z

[D3D11] Compute Shader Memory Barriers


In the past, I have used GroupMemoryBarrierWithGroupSync to synchronize all of the threads in a group after filling the group shared memory with texture data loaded by the individual threads. After that call, I was assured that the writes to the GSM had completed and that all threads in the group had reached that point. It has worked out well in every instance I have used it.
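For reference, the pattern looks roughly like this (a minimal, untested sketch - the 16x16 group size and resource names are just for illustration):

```hlsl
Texture2D<float4> InputTexture : register(t0);
RWTexture2D<float4> OutputTexture : register(u0);

// Group shared memory: one texel per thread in the 16x16 group.
groupshared float4 SharedData[16][16];

[numthreads(16, 16, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID, uint3 DTid : SV_DispatchThreadID)
{
    // Each thread loads its own texel into the GSM.
    SharedData[GTid.y][GTid.x] = InputTexture[DTid.xy];

    // Guarantees both that the GSM writes have completed and that every
    // thread in the group has reached this point.
    GroupMemoryBarrierWithGroupSync();

    // Any thread may now safely read data written by another thread.
    float4 here = SharedData[GTid.y][GTid.x];
    float4 left = SharedData[GTid.y][max(GTid.x, 1u) - 1];
    OutputTexture[DTid.xy] = 0.5f * (here + left);
}
```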

While reading through the documentation today, I found that there are actually six different variants of these sync intrinsic functions:

AllMemoryBarrier, AllMemoryBarrierWithGroupSync
DeviceMemoryBarrier, DeviceMemoryBarrierWithGroupSync
GroupMemoryBarrier, GroupMemoryBarrierWithGroupSync

From this, I have a couple of general questions to which I haven't been able to find an answer online or in the docs. Hopefully someone out there has clarified this already - here goes:

1. What is the difference between the versions with and without GroupSync? The docs say GroupSync ensures all threads hit this point in their execution before any are allowed to proceed, but then what is the behavior of the version without it?

2. What precisely is the difference between the All***, Device***, and Group*** variants? They appear to be different thread-scope specifiers, but only the threads within a thread group can communicate with one another (except via a UAV, I suppose), so why would there be other scopes of synchronization?

Thanks in advance to anyone that can help clarify the topic!

Hmmm, my take on it would be:

the difference between GroupSync and non-GroupSync is that with GroupSync, all threads in the thread group have to reach that instruction before any thread in the group can proceed. The version without GroupSync only ensures that, past the barrier point, all memory accesses to the region have completed.

The difference between GroupMemory, DeviceMemory, and AllMemory basically refers to the memory location. Group memory resides only within a thread group, while device memory spans all groups - a bit similar to the CUDA threading model of warps, blocks, and grids. AllMemory refers to both group and device memory. The reason for having thread groups is to help reduce synchronization across all memory: if we can get away with synchronizing only the group memory, there is no need to synchronize the device memory.

regards.

[Edited by - littlekid on August 29, 2010 3:50:46 PM]

Thanks for the response. So regarding the versions without GroupSync: this means that all writes occurring prior to that point in the shader must be completed? Can anyone think of a situation where this would be more appropriate than also synchronizing the group?

One other point - when you mention device memory, are you referring to resources such as buffers and textures? That would make sense, since they would also allow communication across threads (as well as across thread groups, I suppose...).

Thanks again for the help!

Typically:
local scalars and local arrays are in the scope of thread memory.
shared (groupshared) variables fall in the scope of group memory.
constants, shader resource views, and UAVs are device memory, and are global.
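In HLSL declaration terms, it maps out roughly like this (a hypothetical sketch):

```hlsl
// Device memory (UAV): global, visible to the whole dispatch.
RWStructuredBuffer<float> DeviceData : register(u0);

// Group memory: visible only to threads inside one group.
groupshared float GroupData[64];

[numthreads(64, 1, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID, uint3 DTid : SV_DispatchThreadID)
{
    float threadScalar = DeviceData[DTid.x]; // thread memory: private to
                                             // this thread (registers)
    GroupData[GTid.x] = threadScalar;        // write to group memory
    GroupMemoryBarrierWithGroupSync();       // group-scope synchronization
    DeviceData[DTid.x] = GroupData[63 - GTid.x]; // write to device memory
}
```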

XXXMemoryBarrier is useful as it guarantees that all accesses to a memory have completed and are thus visible to other threads. Although we may, say, update a particular value, it does not necessarily mean that when another thread attempts to read it, the value has already been updated - the GPU may queue the write or defer it until slightly later. What a memory barrier does is ensure that all updates to the memory are done before we access it.

For example (off the top of my head), we may have a match_counter which increases by one whenever a worker thread finds a matching piece of data and does some independent work on it. We then have another thread which performs a gather operation, looping through all of the matching data. Rightfully, before looping through the matched data, we would have to call GroupMemoryBarrier, as we want a stable state of match_counter. If not, a worker thread might still have a pending increment to match_counter, and the gather operation would read a wrong state.
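A hypothetical sketch of that match_counter idea (untested; note I've used the WithGroupSync flavour here, since the gather thread also needs the workers to have actually reached the barrier, not just to have their writes flushed):

```hlsl
RWStructuredBuffer<float> Data : register(u0);
RWStructuredBuffer<float> Result : register(u1);

groupshared uint match_counter;
groupshared uint match_indices[64];

[numthreads(64, 1, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID, uint3 Gid : SV_GroupID)
{
    if (GTid.x == 0)
        match_counter = 0;
    GroupMemoryBarrierWithGroupSync(); // everyone sees the zeroed counter

    // Worker threads: record the index of each "matching" element.
    uint index = Gid.x * 64 + GTid.x;
    if (Data[index] > 0.5f) // some match criterion
    {
        uint slot;
        InterlockedAdd(match_counter, 1, slot);
        match_indices[slot] = index;
    }

    // Ensure every pending write to match_counter / match_indices is
    // visible before the gather thread loops over them.
    GroupMemoryBarrierWithGroupSync();

    // Gather thread: loop through all of the matched data.
    if (GTid.x == 0)
    {
        float sum = 0.0f;
        for (uint i = 0; i < match_counter; ++i)
            sum += Data[match_indices[i]];
        Result[Gid.x] = sum;
    }
}
```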

The benefit of XXXMemoryBarrier is that it only enforces a memory barrier. Hence only the accesses to a memory need to be synchronized, not the actual thread instructions.

P.S. I have not tested this and will try it out some time later, but I believe XXXMemoryBarrier can be placed in divergent code, whereas XXXMemoryBarrierWithGroupSync cannot, as it would stall instead.

I see - so XXXMemoryBarrier is intended to ensure that all the writes to a memory have been executed before moving on. I would presume that this only covers writes located before the statement in the shader code, right? If so, then any later writes to that memory could again be barrier'ed with another call?

Also, for DeviceMemoryBarrier, this means that all the resource writes have been completed up to this point, right? This must be intended to synchronize across multiple thread groups - so if the memory barrier is halfway through a shader, then every thread group would execute up to that point (at least until all of their memory accesses have completed) and then they could all continue? This seems like a very heavy operation if all of the thread groups have to halt until the writes are completed...

Thanks again for your help, this is very informative. Have you had the opportunity to use these instructions (without the GroupSync) in an actual algorithm? I'd love to hear about it if you have!

Quote:
Original post by Jason Z
I see - so XXXMemoryBarrier is intended to ensure that all the writes to a memory have been executed before moving on. I would presume that this only covers writes located before the statement in the shader code, right? If so, then any later writes to that memory could again be barrier'ed with another call?


Yes; the memory barriers only apply to outstanding memory operations.

As such, having completed one memory barrier, there will be no outstanding operations left when the next instruction executes; when you hit the next memory barrier, it will stall until the next set of operations is done, and so on.

While the question clearly wasn't directed at me, I'll just throw out that I've not found a reason to use the versions sans 'GroupSync' yet; really, my only use thus far has been to set up a dataset to be shared among a group, with some threads doing a little more work than others to read in a 'skirt' around the area of operation. At that point, of course, GroupSync is needed to ensure we've got all our data for later operations.

(This is, of course, a very common optimisation for compute shaders whose groups need to operate on shared data; the saving in memory bandwidth alone makes it worthwhile in high-bandwidth situations.)
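For the curious, the skirt pattern looks something like this (a hypothetical, untested sketch for a 3x3 filter over a 16x16 tile, so an 18x18 shared region with a one-texel border):

```hlsl
Texture2D<float4> Input : register(t0);
RWTexture2D<float4> Output : register(u0);

// 16x16 output tile plus a one-texel skirt on every side.
groupshared float4 Tile[18][18];

[numthreads(16, 16, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID, uint3 Gid : SV_GroupID,
            uint3 DTid : SV_DispatchThreadID)
{
    int2 tileOrigin = int2(Gid.xy) * 16 - 1; // top-left of the 18x18 region

    // 18*18 = 324 texels for 256 threads: each thread loads one texel and
    // some threads load a second one ("a little more work than others").
    uint threadIndex = GTid.y * 16 + GTid.x;
    for (uint i = threadIndex; i < 18 * 18; i += 256)
    {
        int2 coord = tileOrigin + int2(i % 18, i / 18);
        // Clamp negative coords; reads past the far edge return zero.
        Tile[i / 18][i % 18] = Input[max(coord, int2(0, 0))];
    }

    GroupMemoryBarrierWithGroupSync(); // all skirt data is now in place

    // 3x3 box filter read entirely from shared memory.
    float4 sum = 0.0f;
    for (int y = -1; y <= 1; ++y)
        for (int x = -1; x <= 1; ++x)
            sum += Tile[GTid.y + 1 + y][GTid.x + 1 + x];

    Output[DTid.xy] = sum / 9.0f;
}
```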

Hi Phantom,

I have also used the ***WithGroupSync versions to save on bandwidth for ambient occlusion and for my water simulation shaders. They seem quite intuitive and easier to understand than the other variant. I just tried running my water simulation code, which uses the barrier in two different places, with GroupMemoryBarrier() instead of GroupMemoryBarrierWithGroupSync(). All appears to be working properly...

I wonder if it is more performant to use the version without GroupSync - stand by, I'll rebuild my engine in release mode and see if there is a noticeable difference (the demo is completely GPU-bound, so hopefully this will show whether there is some scheduling efficiency in using the shorter version)...

Having now looked at the D3D docs on the instructions, the only difference seems to be, as noted above, the conditions under which a thread 'unblocks'.

So, with 'GroupSync', all threads are forced to wait until every thread has hit the same point in the shader; sans 'GroupSync', any thread is free to continue once all the memory operations it issued before hitting the barrier have completed.

Now, on hardware which executes in 'lock step' fashion, I'm pretty sure there is going to be no difference at all in the outcome when it comes to group-memory-level operations.

As all threads are executing the same instructions at the same time, they will all hit the memory barrier at the same point, which means they will all unblock at the same point once all pending memory operations have completed.

I'm pretty sure this 'group lock step' applies even to the most recent DX11 hardware; at least, nothing in my memory is screaming at me that this isn't the case (I might have to go check some write-ups in a bit, however, if anyone thinks differently).

As such, I'll be very surprised if using one over the other matters at the group level, unless there are some strange micro-coding differences at the GPU level, of course.

As for 'all' and 'device', well, I suspect using one over the other could result in strange data races, depending on the resources being written/read and on how the hardware sets up the various threads to be executed.

That would be my take on it anyway...

From my testing, it's pretty much a draw, or the version with GroupSync is slightly faster (but the difference is well within the margin of error...). I think your statements about the lockstep processors are correct, but I honestly haven't kept up to speed on the latest GPU architectures to know whether that is how they work or not :(

In any case, I feel like I have a good understanding of what is going on now. I still think the Device*** and All*** variants are most likely geared toward gigantic GPGPU workloads, since they would essentially stall multiple thread groups until every dispatched group had been processed to that point... it seems like quite a 'macro'-level synchronization.

I believe that all threads running the same shader program on a warp will run in lock step. However, that doesn't force other warps running the same program to run in lock step with each other, since the largest group size can be 1024 threads, which won't fit on a single warp. IHVs might be adding extra syncs or changing something in the driver's shader compiler to ensure correctness, which might be limiting the performance difference between the sync and non-sync versions. I'm not sure how thorough their optimizations will be in this area, given that it's still a bit young.
