DX12 uav barrier and atomic operation


Hey Guys,

 

Think about the following case:

You have multiple dispatch calls with a compute shader that launches tons of threads, and each thread does an atomic add on the same memory location (the same address for all threads from all dispatch calls). Since DX12 allows dispatches to overlap, all the related documentation suggests putting a UAV barrier between each dispatch, since they all operate on the same buffer and have a data dependency.

 

I can understand the reasoning for normal reads/writes: since your reads/writes will be cached at multiple levels, to get a correct result you need to flush your caches and do a bunch of related cache invalidation. So that's why we need a UAV barrier between dispatches. (Is my understanding correct?)

 

But what if all my writes are atomic, as described in the above case? In my experience, an atomic write is immediately visible to all other threads from the same dispatch, even if the threads are in different execution units (I am not super sure about that, since no doc specifically states it...), which, to me, means all cached data is synced across the entire GPU. If that is the case, I think it will be safe to remove the UAV barrier between dispatches, since the data cache has already been taken care of and there is no out-of-date data.
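As a rough CPU-side analogy for the case above (a sketch only: it says nothing about how any particular GPU implements UAV atomics), the following C++ runs several hypothetical "dispatches" concurrently, with nothing ordering them against each other, and has every worker do a relaxed atomic add on the same counter. The names `fake_dispatch` and `run_overlapped` are made up for illustration. Because addition is commutative and each add is atomic, the final total does not depend on how the adds interleave.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one dispatch: `threads` workers each add 1
// to the shared counter `adds_per_thread` times.
void fake_dispatch(std::atomic<std::uint32_t>& counter, int threads, int adds_per_thread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, adds_per_thread] {
            for (int i = 0; i < adds_per_thread; ++i) {
                // Relaxed ordering: atomicity only, no ordering of other memory.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
}

// Launches `dispatches` fake dispatches that are free to overlap,
// waits for all of them, and returns the final counter value.
std::uint32_t run_overlapped(int dispatches, int threads, int adds_per_thread) {
    std::atomic<std::uint32_t> counter{0};
    std::vector<std::thread> launched;
    for (int d = 0; d < dispatches; ++d) {
        launched.emplace_back(fake_dispatch, std::ref(counter), threads, adds_per_thread);
    }
    for (auto& th : launched) th.join();  // only sync point: before the final read
    return counter.load();
}
```

Calling `run_overlapped(4, 4, 1000)` should always give 16000, whatever the interleaving. Note that the read happens only after every worker has been joined; the analogy does not cover reading the counter while adds are still in flight.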

 

From my experiment, it seems to work correctly on my GTX 680M without the barrier. But no official document confirms it, so I have no idea whether it is officially safe, or whether it causes undefined behavior or corrupted data and I am just getting the correct result by accident.

 

Please correct me if my assumption is wrong. It would be great if someone could explain how atomic operations work on the GPU (especially those that support reading back the original data, since it seems atomic reads/writes bypass all caches).

 

Thanks in advance


I don't think UAV barriers flush any caches; they just ensure an execution order (I could be wrong, but IIRC UAVs are implemented on the standard cache hierarchy, which is coherent). Which leads to point two: if the order of execution does not matter, then don't use UAV barriers.

 

edit - oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units.

Edited by Infinisearch


oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units.
 

Thanks Infinisearch. So does that mean all atomic reads/writes (on a UAV) bypass the L1 cache (which is only shared within an EU?), which gives all reads/writes behavior kind of equivalent to std::memory_order_seq_cst?

edit - oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units.

BTW I only know this about AMD's GCN architecture

 

 

 

Thanks Infinisearch. So does that mean all atomic reads/writes (on a UAV) bypass the L1 cache (which is only shared within an EU?), which gives all reads/writes behavior kind of equivalent to std::memory_order_seq_cst?

 

I don't know if it bypasses it, but it acts as a cache miss (since the L1 and L2 are coherent and inclusive). I'm not really familiar with std::memory_order_seq_cst so I can't really comment on it, but how do you guarantee execution order within a wave/warp? I don't think you can, so if what you're doing has an order to it, it won't work. Only coarse-grained ordering using barriers is possible.

 

edit - nevermind that last part.

Edited by Infinisearch


This might clear things up for you.

From this page: https://msdn.microsoft.com/en-us/library/windows/desktop/dn903898(v=vs.85).aspx

 

  • D3D12_RESOURCE_UAV_BARRIER - Unordered access view barriers indicate all UAV accesses (read or writes) to a particular resource must complete before any future UAV accesses (read or write) can begin. The specified resource may be NULL. It is not necessary to insert a UAV barrier between two draw or dispatch calls which only read a UAV. Additionally, it is not necessary to insert a UAV barrier between two draw or dispatch calls which write to the same UAV if the application knows that it is safe to execute the UAV accesses in any order. The resource can be NULL (indicating that any UAV access could require the barrier).
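To make "safe to execute the UAV accesses in any order" a bit more concrete, here is a small sketch in plain C++ (not D3D12 code). Each hypothetical `Pass` models the net effect of one whole dispatch on a single UAV location, and `order_independent` checks, for one starting value only, whether swapping two passes changes the result. The type and function names are assumptions made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical model: the net effect of one dispatch on a single value.
using Pass = std::function<std::uint32_t(std::uint32_t)>;

// Applies the passes one after another, as if a barrier separated them.
std::uint32_t run_in_order(std::uint32_t initial, const std::vector<Pass>& passes) {
    std::uint32_t value = initial;
    for (const auto& pass : passes) value = pass(value);
    return value;
}

// True if running a-then-b and b-then-a give the same result for `initial`.
// This is a spot check for one input, not a proof for all inputs.
bool order_independent(std::uint32_t initial, const Pass& a, const Pass& b) {
    return run_in_order(initial, {a, b}) == run_in_order(initial, {b, a});
}
```

Two passes that only add to the value commute, so the check passes for them. A pass that adds 5 and a pass that doubles the value do not: starting from 1, add-then-double gives 12 while double-then-add gives 7. In that second situation the order matters, which is the case the quoted text says a UAV barrier is for.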


Thanks Infinisearch, I should have read the doc more carefully, but it's good to know that in my case it is safe to remove the UAV barrier B-)


You're welcome... but did you read into that quote?  If I'm not mistaken it implies there's no flushing of the cache necessary for writes to become visible to subsequent dispatches.


 If I'm not mistaken it implies there's no flushing of the cache necessary for writes to become visible to subsequent dispatches.

 

Yup, I have kept that in mind. My read will be in a much later pass, and I transition the resource to an SRV just for reading. Before that, all dispatches do only atomic accesses (though these include atomic reads like InterlockedAdd, which should be safe according to some answers in my other post), so it should be fine.
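On the "atomic read" side of InterlockedAdd: HLSL's InterlockedAdd can hand back the value that was in the destination before the add, much like `fetch_add` in C++. The sketch below is a CPU analogy under that assumption, with a hypothetical helper `collect_original_values`; it is not a statement about GPU cache behavior. Every worker gets a distinct original value, but which worker gets which value is not determined.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker performs one fetch_add(1) and records the value the counter
// held just before its own add. Returns the recorded values, sorted.
std::vector<std::uint32_t> collect_original_values(int workers) {
    std::atomic<std::uint32_t> counter{0};
    std::vector<std::uint32_t> seen(workers);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&counter, &seen, w] {
            // Each worker writes only its own slot, so `seen` needs no lock.
            seen[w] = counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    // Sort because the worker-to-value assignment is arbitrary.
    std::sort(seen.begin(), seen.end());
    return seen;
}
```

With 8 workers the sorted result is 0 through 7 with no duplicates and no gaps. That uniqueness is what makes the returned value usable for things like claiming a slot, as long as nothing depends on which thread claims which slot.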
