Mr_Fox

Are XXXMemoryBarrierWithGroupSync essentially equivalent?


Hey Guys,

 

I came across this question when I needed to sync all the threads in one thread group. After reading the related MSDN pages, I feel like GroupMemoryBarrierWithGroupSync, DeviceMemoryBarrierWithGroupSync and AllMemoryBarrierWithGroupSync are all equivalent. They target different memory scopes, but in the end they all ensure that every thread in the thread group has reached this exact call, which (to my understanding) should guarantee that all previous shader statements (including all memory accesses) have finished. I know the plain XXXMemoryBarrier calls differ from one another, but how do the XXXMemoryBarrierWithGroupSync variants make any difference?

 

I feel I must have missed some key part of how these sync functions work; it would be great if someone could point it out.

 

Thanks


Even if all threads have executed the barrier, there is no guarantee that the memory writes they initiated earlier have finished as well, so there is no guarantee that thread X sees the value written by thread Y.

Thus we need both execution and memory barriers.

 

Edit:

For example, suppose we have a loop in which any thread may write to device memory and may set a shared flag to indicate there's work left and the loop should continue.

If you know you don't read back values from device memory, you likely only want a GroupMemoryBarrierWithGroupSync before checking and after updating the flag.

Using an additional device memory barrier would just slow things down and is not necessary.
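
A minimal sketch of one round of that pattern (buffer names and the per-thread work are made up for illustration; the real shader would loop while the flag stays set):

// Illustrative only: one round of the "shared flag" pattern described above.
StructuredBuffer<uint>   buf_srvInput  : register(t0);
RWStructuredBuffer<uint> buf_uavOutput : register(u0);

groupshared uint g_uWorkLeft;   // flag: some thread produced more work

[numthreads(64, 1, 1)]
void CSMain(uint3 u3DTID : SV_DispatchThreadID, uint uGIdx : SV_GroupIndex)
{
    if (uGIdx == 0)
        g_uWorkLeft = 0;                    // reset the shared flag
    GroupMemoryBarrierWithGroupSync();      // every thread sees the reset

    // Each thread writes its result to device memory and raises the flag
    // if there would be more work for a following round.
    uint uResult = buf_srvInput[u3DTID.x] >> 1;
    buf_uavOutput[u3DTID.x] = uResult;      // device write, never read back here
    if (uResult != 0)
        InterlockedOr(g_uWorkLeft, 1);

    // A group sync is enough before checking the flag: the flag lives in LDS
    // and the device writes above are not read back inside this dispatch,
    // so no DeviceMemoryBarrier is needed.
    GroupMemoryBarrierWithGroupSync();

    // Every thread can now safely read g_uWorkLeft; in the loop described
    // above, the group would run another round while it is non-zero.
}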



Thanks JoeJ for being so helpful :-)

So in case I want some kind of device memory for inter-threadgroup communication, is DeviceMemoryBarrierWithGroupSync my only choice? Also, I was wondering: if I do an atomic device memory write and then have only a GroupMemoryBarrierWithGroupSync after it, are all other threads in this thread group guaranteed to see that write? Basically, does an atomic write bypass the cache, go directly to memory and invalidate the related caches?

 

Thanks again


So in case I want some kind of device memory for inter-threadgroup communication, is DeviceMemoryBarrierWithGroupSync my only choice?

Yes. There may be (or may be coming) NV-specific extensions I'm not aware of (persistent threads) that could help, depending on what you're trying to do.

 

Also, I was wondering: if I do an atomic device memory write and then have only a GroupMemoryBarrierWithGroupSync after it, are all other threads in this thread group guaranteed to see that write? Basically, does an atomic write bypass the cache, go directly to memory and invalidate the related caches?

No, GroupMemory means LDS only. So if you don't use a DeviceMemoryBarrier there is no guarantee (even if such things may seem to work on one particular GPU).
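
A minimal sketch of the distinction, with made-up names: a group that wants to read back a value one of its threads wrote to a UAV needs the device-scope barrier, not the LDS-scope one.

// Illustrative only: the group wants to read back a value one of its threads
// wrote to a UAV, so the device-scope barrier is required.
RWStructuredBuffer<uint> buf_uavCounter : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint uGIdx : SV_GroupIndex)
{
    if (uGIdx == 0)
        InterlockedAdd(buf_uavCounter[0], 1);   // atomic write to device memory

    // GroupMemoryBarrierWithGroupSync() would only cover groupshared memory;
    // to be sure every thread of the group sees the UAV write above, use the
    // device-scope variant.
    DeviceMemoryBarrierWithGroupSync();

    uint uSeen = buf_uavCounter[0];             // safe to read within the group
    buf_uavCounter[1 + uGIdx] = uSeen;          // (just so the read is used)
}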

 

I don't know how atomics and caches are related. I know GPUs can disable the cache for writes (which is what I would want for my current work, because I never read back written memory within a dispatch), but I doubt current APIs expose this.

 

 

I'm curious what your goal is that makes sync on device memory so important, because the latency is so high (around 300 cycles) that it becomes a serious problem.

A typical solution may be: use a lightweight compute shader that builds a list of work in LDS and finally copies it to device memory with a single atomic add, also updating the buffer containing the indirect dispatch counts for a following heavy shader that does the actual work independently.
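
A rough sketch of that pattern, with all names made up and only meant to illustrate the idea:

// Illustrative sketch only: collect work in LDS, then reserve space in the
// device list with a single InterlockedAdd per group and bump the indirect
// dispatch group count for the heavy follow-up pass.
#define GROUP_SIZE 64

StructuredBuffer<uint>   buf_srvInput        : register(t0);
RWStructuredBuffer<uint> buf_uavWorkList     : register(u0); // [0] = item count
RWStructuredBuffer<uint> buf_uavIndirectArgs : register(u1); // [0] = group count

groupshared uint g_uLocalWork[GROUP_SIZE];
groupshared uint g_uLocalCount;
groupshared uint g_uGlobalOffset;

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 u3DTID : SV_DispatchThreadID, uint uGIdx : SV_GroupIndex)
{
    if (uGIdx == 0)
        g_uLocalCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each thread decides whether its element needs the heavy follow-up pass
    // and, if so, appends it to the LDS list.
    uint uItem = buf_srvInput[u3DTID.x];
    if (uItem != 0)
    {
        uint uSlot;
        InterlockedAdd(g_uLocalCount, 1, uSlot);
        g_uLocalWork[uSlot] = uItem;
    }
    GroupMemoryBarrierWithGroupSync();

    // One device atomic per group instead of one per thread.
    if (uGIdx == 0)
    {
        uint uOffset;
        InterlockedAdd(buf_uavWorkList[0], g_uLocalCount, uOffset);
        g_uGlobalOffset = uOffset;
        // Simplified: add enough groups for this group's items. The heavy
        // shader must still range-check against buf_uavWorkList[0].
        InterlockedAdd(buf_uavIndirectArgs[0],
            (g_uLocalCount + GROUP_SIZE - 1) / GROUP_SIZE);
    }
    GroupMemoryBarrierWithGroupSync();

    // Copy the LDS list into its reserved range of the device list.
    if (uGIdx < g_uLocalCount)
        buf_uavWorkList[1 + g_uGlobalOffset + uGIdx] = g_uLocalWork[uGIdx];
}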



 
Thanks JoeJ, I will try to explain what I want to achieve. Think about voxelizing your dynamic scene every frame into a TSDF volume (truncated signed distance field, where each voxel stores the truncated distance to the nearest surface) with a resolution of 1024^3. In that case I need to maintain a structured buffer of all non-empty voxels (voxels whose value is not truncated, so they are very close to a surface). Due to the structured buffer size limitation (2^27) and performance considerations, you have to instead maintain a volume of non-empty blocks (each block containing 32^3 voxels). So besides the 1024^3 TSDF volume, I also have a 32^3 block volume in which each entry indicates whether that block contains non-empty voxels or not. Also, for a smaller memory footprint and better performance, each block entry only has 8 bits.
 
As the dynamic scene changes, voxels will change from non-empty to empty or vice versa, and the block volume needs to change accordingly. So when your shader updates the 1024^3 TSDF volume (each thread updating one voxel), you also need to update the block volume.
 
Blocks changing from empty to non-empty are easy: any thread that finds its voxel is non-empty just does blockVolume[u3BlockIdx] = 0 (so let's use 0 to indicate non-empty).
 
But blocks changing from non-empty to empty are tricky. One could say that any thread that finds its voxel is empty does InterlockedAdd(blockVolume[u3BlockIdx], 1), so at the end blockVolume[u3BlockIdx] == 32^3 means the block is empty, and non-empty otherwise. But we only have 8 bits per block, so this won't work.
 
Well, we could use groupshared memory to count how many voxels are empty, and after a GroupMemoryBarrierWithGroupSync update blockVolume[u3BlockIdx] based on the groupshared value (see the sketch below). It works, but only if each thread group updates exactly one block of voxels.
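A minimal sketch of that simple case, assuming for illustration an 8^3 block so that a single thread group covers the whole block (names are made up and IsVoxelEmpty is just a placeholder for the real update logic):

RWTexture3D<uint> tex_uavBlockVol : register(u0);

#define BLOCK_DIM 8             // assumed: block == thread group in this sketch
groupshared uint g_uEmptyVoxels;

bool IsVoxelEmpty(uint3 u3VoxelIdx) { return true; }    // placeholder

[numthreads(BLOCK_DIM, BLOCK_DIM, BLOCK_DIM)]
void CSMain(uint3 u3GID : SV_GroupID, uint3 u3GTID : SV_GroupThreadID,
    uint uGIdx : SV_GroupIndex)
{
    if (uGIdx == 0)
        g_uEmptyVoxels = 0;
    GroupMemoryBarrierWithGroupSync();

    uint3 u3VoxelIdx = u3GID * BLOCK_DIM + u3GTID;
    if (IsVoxelEmpty(u3VoxelIdx))
        InterlockedAdd(g_uEmptyVoxels, 1);

    GroupMemoryBarrierWithGroupSync();

    // One thread writes the per-block flag: 0 marks a non-empty block as in
    // the convention above, anything else marks an empty block.
    if (uGIdx == 0)
        tex_uavBlockVol[u3GID] =
            (g_uEmptyVoxels == BLOCK_DIM * BLOCK_DIM * BLOCK_DIM) ? 1u : 0u;
}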
 
So if our block contains 32^3 voxels and our thread group is 8^3 (I was told that having a thread group size as big as 32^3 is not recommended, and that it should be kept to a small multiple of 32/64; what's the caveat?), how do we do it? That is the reason I need a device memory barrier: one thread in each thread group will call InterlockedAdd(blockVolume[u3BlockIdx], 1) if all voxels within this thread group are empty, and after a device memory barrier I know that if blockVolume[u3BlockIdx] == uThreadGroupPerBlock the block is empty. So here DeviceMemoryBarrierWithGroupSync is definitely needed, since multiple thread groups write to the same UAV location and all later operations depend on that data.
 
 
However, I think I found a workaround to avoid this device memory barrier, but I haven't tested the code yet. I'll just post my scratch code here (there may be silly errors, but please focus on the method).
 

RWTexture3D<int> tex_uavBlockStateVol : register(u2);

RWStructuredBuffer<uint> buf_uavNewOccupiedBlocksBuf : register(u3);
RWStructuredBuffer<uint> buf_uavFreedOccupiedBlocksBuf : register(u4);
StructuredBuffer<uint> buf_srvUpdateBlockQueue : register(t3);

//------------------------------------------------------------------------------
// Compute Shader
//------------------------------------------------------------------------------
// Each thread updates one voxel; one block contains uThreadGroupPerBlock
// thread groups.
// groupshared variables cannot be initialized at declaration; thread 0
// zeroes them at the top of main() below.
groupshared uint uOccupiedTG;
groupshared uint uEmptyThread;
[numthreads(THREAD_X, THREAD_Y, THREAD_Z)]
void main(uint3 u3GID : SV_GroupID, uint3 u3GTID : SV_GroupThreadID,
    uint uGIdx : SV_GroupIndex)
{
    // Zero the groupshared counters and sync before any thread uses them.
    if (uGIdx == 0) {
        uOccupiedTG = 0;
        uEmptyThread = 0;
    }
    GroupMemoryBarrierWithGroupSync();

    uint uWorkQueueIdx = u3GID.x / uThreadGroupPerBlock;
    uint uThreadGroupIdxInBlock = u3GID.x % uThreadGroupPerBlock;
    uint3 u3ThreadGroupIdxInBlock =
        MakeU3Idx(uThreadGroupIdxInBlock, uThreadGroupBlockRatio);
    uint3 u3BlockIdx = UnpackedToUint3(buf_srvUpdateBlockQueue[uWorkQueueIdx]);
    uint3 u3VolumeIdx = u3BlockIdx * vParam.uVoxelBlockRatio +
        u3ThreadGroupIdxInBlock * THREAD_X + u3GTID;

    // The code above just computes this thread group's and this thread's
    // offsets within the block.

    bool bEmpty = true;
    // uNumEmptyTG values for blocks in the update queue are guaranteed to be
    // either 0 (fully occupied) or uThreadGroupPerBlock (empty) right before
    // this pass.
    uint uNumEmptyTG =
        tex_uavBlockStateVol[u3BlockIdx] & BLOCKSTATEMASK_OCCUPIED;
    // bool UpdateVoxel(uint3 voxelIdx, out bool IsVoxelEmpty) returns false
    // if the update is not performed.
    if (UpdateVoxel(u3VolumeIdx, bEmpty)) {
        if (bEmpty) {
            InterlockedAdd(uEmptyThread, 1);
        } else {
            uOccupiedTG = 1;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // Need only one thread in threadgroup to do the following, so early out
    if (uGIdx != 0) {
        return;
    }

    // Occupied Block is freed, add block to free block queue
    if (uNumEmptyTG == 0
        && uEmptyThread == THREAD_X * THREAD_Y * THREAD_Z) {
        uint uOrig = 0;
        InterlockedAdd(tex_uavBlockStateVol[u3BlockIdx], 1, uOrig);
        if (uOrig == uThreadGroupPerBlock - 1) {
            uint uFreeQueueIdx =
                buf_uavFreedOccupiedBlocksBuf.IncrementCounter();
            buf_uavFreedOccupiedBlocksBuf[uFreeQueueIdx] =
                PackedToUint(u3BlockIdx);
        }
    }
    // Empty Block found surface, add block to new occupied block queue
    if (uNumEmptyTG == uThreadGroupPerBlock && uOccupiedTG) {
        uint uOrig = 1;
        InterlockedAdd(tex_uavBlockStateVol[u3BlockIdx], -1, uOrig);
        if (uOrig == uThreadGroupPerBlock) {
            uint uNewQueueIdx = buf_uavNewOccupiedBlocksBuf.IncrementCounter();
            buf_uavNewOccupiedBlocksBuf[uNewQueueIdx] =
                PackedToUint(u3BlockIdx);
        }
    }
    // Reset block's update status for next update iteration
    if (uThreadGroupIdxInBlock == 0 && uGIdx == 0) {
        InterlockedAnd(
            tex_uavBlockStateVol[u3BlockIdx], ~BLOCKSTATEMASK_UPDATE);
    }
} 

I feel this shader should work, but it seems unnecessarily complicated. Any suggestions or comments would be appreciated.

 

Thanks


I was told that having a thread group size as big as 32^3 is not recommended, and that it should be kept to a small multiple of 32/64; what's the caveat?

 

For AMD the maximum group size is 1024 in GL/VK, but only 256 in OpenCL.

A high-end GPU has about 4000 threads, a mid-range one about 2000 and an entry-level one about 1000.

So even if a GPU could sync all of its threads to form one giant group, a number as big as 32^3 (= 32768) would be impossible.

The minimum group size is 32 on NV (a warp) and 64 on AMD (a wavefront). If you use less, the remaining threads of the warp/wavefront go idle.

 

In practice you will likely need to use work group sizes of 64, 128 or 256.

E.g. a Fiji GPU with 4096 threads can execute 16 workgroups of size 256 at the same time.

But we need more than 16 to keep the GPU really busy, because if there is a pending read from device memory the GPU will switch to another workgroup instead of waiting for 300 cycles.

GCN can have 10 wavefronts in flight to switch between, so this is good for hiding device memory latency.

 

The caveat here is: All of those 10 potential wavefronts need to share the same LDS memory and register space of a compute unit.

One CU on GCN (executing 64 threads) has 64 kB LDS and 256 kB of VGPR registers (divided by 64 = 4 kB per thread).

If you want to have 10 wavefronts ready, you need to divide those numbers by 10!

That means the more registers or LDS your shader needs, the fewer wavefronts are available to hide latency.

The term used here is 'occupancy' (100% for all 10 wavefronts, 50% for just 5).

This leads to the not-so-obvious conclusion: if you need more LDS, you may want to increase the work group size.
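
A rough illustration with made-up numbers, using the figures above:

// Hypothetical shader: a group of 128 threads declaring 16 kB of LDS.
// With 64 kB of LDS per CU, at most 64/16 = 4 such groups can be resident
// per CU, no matter how few registers the shader uses, which caps the number
// of wavefronts available to hide memory latency.
groupshared float4 g_f4Cache[1024];     // 1024 * 16 bytes = 16 kB of LDS

[numthreads(128, 1, 1)]
void CSMain(uint uGIdx : SV_GroupIndex)
{
    g_f4Cache[uGIdx] = float4(0, 0, 0, 0);  // dummy use of the LDS
    GroupMemoryBarrierWithGroupSync();
}

With the same 16 kB per group but 256 threads instead of 128, each resident group brings twice as many wavefronts along, which is why needing more LDS can be a reason to increase the group size.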

 

On NV you can only assume it's similar; they don't publish detailed specs.

 

In any case it's very good to have a profiling tool that shows register / LDS usage, occupancy, etc.

 

 

 

I've had no luck in understanding your algorithm :)

I don't know how you voxelize the scene, whether you build the low- and high-resolution volumes at the same time, or how a work queue is involved... I have no general picture of it - just confusion.

