Mr_Fox

DX12 uav barrier and atomic operation


Hey Guys,

 

Think about the following case:

You have multiple dispatch calls of a compute shader, each launching tons of threads, and every thread does an atomic add on the same memory location (the same address for all threads from all dispatch calls). Since DX12 allows dispatches to overlap, all the related documentation suggests putting a UAV barrier between dispatches, because they all operate on the same buffer and have a data dependency.

 

I can understand the reasoning for normal reads/writes: they are cached at multiple levels, so to get a correct result you need to flush your caches and do a bunch of related cache invalidation. That's why we need a UAV barrier between dispatches. (Is my understanding correct?)

 

But what if all my writes are atomic, as in the case above? In my experience, an atomic write is immediately visible to all other threads from the same dispatch, even if the threads are on different execution units (I am not entirely sure about that, since no doc specifically states it), which, to me, means all cached data is synced across the entire GPU. If that is the case, I think it would be safe to remove the UAV barrier between dispatches, since the data caches have already been taken care of and there is no out-of-date data.

 

And from my experiments, it seems to work correctly on my GTX 680M without the barrier. But no official documentation confirms this, so I have no idea whether it is officially safe, or whether it causes undefined behavior or corrupted data and I am just getting the correct result by accident.

 

Please correct me if my assumption is wrong. It would be great if someone could explain how atomic operations work on the GPU (especially those that return the original value, since it seems atomic reads/writes bypass all caches).

 

Thanks in advance


I don't think UAV barriers flush any caches; they just ensure an execution order. (I could be wrong, but IIRC UAVs are implemented on the standard cache hierarchy, which is coherent.) That leads to point two: if the order of execution does not matter, then don't use UAV barriers.

 

edit - oh, and IIRC atomics are implemented in the L2 cache, which is shared by the different execution units.

Edited by Infinisearch


"oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units."

Thanks Infinisearch. So does that mean all atomic reads/writes (on a UAV) bypass the L1 cache (which is only shared within an EU?), which would make all reads/writes behave roughly like std::memory_order_seq_cst?

BTW, I only know this (atomics being implemented in the L2 cache) for AMD's GCN architecture.

 

 

 

As for whether atomics bypass the L1 cache: I don't know if they bypass it, but an atomic acts as a cache miss (since the L1 and L2 are coherent and inclusive). I'm not really familiar with std::memory_order_seq_cst, so I can't really comment on it, but how do you guarantee execution order within a wave/warp? I don't think you can, so if what you're doing depends on an order, it won't work. Only coarse-grained ordering using barriers is possible.

 

edit - nevermind that last part.

Edited by Infinisearch


This might clear things up for you.

From this page: https://msdn.microsoft.com/en-us/library/windows/desktop/dn903898(v=vs.85).aspx

 

  • D3D12_RESOURCE_UAV_BARRIER - Unordered access view barriers indicate all UAV accesses (read or writes) to a particular resource must complete before any future UAV accesses (read or write) can begin. The specified resource may be NULL. It is not necessary to insert a UAV barrier between two draw or dispatch calls which only read a UAV. Additionally, it is not necessary to insert a UAV barrier between two draw or dispatch calls which write to the same UAV if the application knows that it is safe to execute the UAV accesses in any order. The resource can be NULL (indicating that any UAV access could require the barrier).
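To make the "safe to execute the UAV accesses in any order" part concrete, here is a minimal CPU-side sketch in plain C++. It is not D3D12 or HLSL code; `Dispatch`, `accumulate_adds` and `last_store` are hypothetical names made up for illustration. It models each dispatch as a list of per-thread values and compares an add (the InterlockedAdd case from the question) with a plain overwrite when the dispatches run in either order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: a "dispatch" is just the list of values its threads contribute.
using Dispatch = std::vector<std::uint32_t>;

// Every thread adds its value to one shared counter (stand-in for InterlockedAdd).
// Addition is commutative and associative, so the order of dispatches cannot
// change the final value.
std::uint32_t accumulate_adds(const std::vector<Dispatch>& order) {
    std::uint32_t counter = 0;
    for (const Dispatch& d : order)
        for (std::uint32_t v : d) counter += v;
    return counter;
}

// Every thread overwrites one shared slot (stand-in for a plain UAV store).
// The final value depends on which write happens last, so order matters.
std::uint32_t last_store(const std::vector<Dispatch>& order) {
    std::uint32_t slot = 0;
    for (const Dispatch& d : order)
        for (std::uint32_t v : d) slot = v;
    return slot;
}
```

With dispatches `a = {1, 2, 3}` and `b = {10, 20}`, `accumulate_adds` returns 36 for both `{a, b}` and `{b, a}`, while `last_store` returns 20 for one order and 3 for the other. Only the second kind of access would need a UAV barrier to pin down the result.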


If I'm not mistaken, it implies there's no flushing of the cache necessary for writes to become visible to subsequent dispatches.

 

Yup, I have kept that in mind. My read happens in a much later pass, and I transition the resource to an SRV just for reading. Before that, all dispatches only do atomic accesses (though these include atomic reads like InterlockedAdd, which should be safe according to some answers in my other post), so it should be fine.
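The pattern described here (overlapping dispatches that only do atomic adds, then one synchronization point before the read) can be sketched with a CPU analogy using std::atomic. This is only an analogy under assumptions: CPU threads stand in for GPU threads, `run_overlapping_dispatches` is a hypothetical helper, and joining the threads plays the role of the barrier/state transition before the SRV read. It says nothing about what a particular GPU or driver guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Starts `dispatches` groups of `threads_per_dispatch` CPU threads with no
// synchronization between the groups, standing in for overlapping Dispatch calls.
// Each thread performs `adds_per_thread` relaxed atomic increments on one counter.
std::uint32_t run_overlapping_dispatches(int dispatches,
                                         int threads_per_dispatch,
                                         int adds_per_thread) {
    std::atomic<std::uint32_t> counter{0};
    std::vector<std::thread> workers;
    for (int d = 0; d < dispatches; ++d) {
        for (int t = 0; t < threads_per_dispatch; ++t) {
            workers.emplace_back([&counter, adds_per_thread] {
                for (int i = 0; i < adds_per_thread; ++i)
                    counter.fetch_add(1, std::memory_order_relaxed);
            });
        }
    }
    // Analogue of the barrier/transition before the later read pass:
    // all writers must have finished before the value is read.
    for (std::thread& w : workers) w.join();
    return counter.load(std::memory_order_relaxed);
}
```

Calling `run_overlapping_dispatches(4, 8, 1000)` gives 32000 no matter how the threads interleave, because each increment is atomic and the read only happens after every writer has finished. Relaxed ordering is enough in this sketch since nothing reads the counter mid-flight; the join is what orders the final read.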
