DX12: do IncrementCounter, DecrementCounter, and atomic operations work properly across overlapped compute shaders?


Hey Guys,


Assume we have a job queue buffer and two compute shaders (cs1, cs2) that add jobs to it by calling IncrementCounter on its counter (DX12). To make this work properly, we can dispatch cs1, set a barrier on the job queue's counter buffer, and then dispatch cs2. That avoids a race condition in which cs1 is still updating the job queue when cs2 starts to run.


But I was wondering how IncrementCounter works under the hood. If the GPU hardware guarantees atomicity, we probably don't need the barrier between the cs1 and cs2 dispatches. I have no idea how the GPU guarantees the atomicity of IncrementCounter, though, so it would be nice if someone could explain what happens when overlapped compute shaders call IncrementCounter on the same buffer. Does the same concept apply to all atomic operations? That is, if overlapped compute shaders update the same buffer with atomic operations, do we have to serialize those shaders to ensure correctness?


Also, what's the difference between an Append/Consume buffer and a StructuredBuffer used as a stack-style buffer with IncrementCounter/DecrementCounter? I feel they are essentially the same thing (I may be wrong, though).




For my second question, about the difference between an Append/Consume buffer and a StructuredBuffer with IncrementCounter/DecrementCounter used to achieve the same functionality, I found this page from Jason Z. In his post he says an Append/Consume buffer has a hidden counter that tracks the number of items in the buffer. Does that mean the two are actually the same?


Jason Z also mentions that the counter is a dumb counter: if you call Consume on an empty buffer, it underflows instead of notifying you that there is nothing in the buffer.

I can think of a naive way to write my shader so that it implements correct stack behavior by adding an empty-stack check, but that requires extra atomic operations, which seem pretty expensive. So it would be great if someone could share how they designed and implemented their GPU stack buffer.
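
To make the underflow and the naive check concrete, here is a minimal CPU-side sketch in C++. It is only an analogy: std::atomic<uint32_t> stands in for the hidden counter, and the names StackCounterModel, unchecked_pop_index and checked_pop_index are made up for illustration. The guarded variant uses a compare-exchange loop, which is the "extra atomic operations" cost mentioned above. In a shader the rough equivalent would presumably be an InterlockedCompareExchange loop on a counter kept in an ordinary UAV, since the hidden counter itself is only reachable through IncrementCounter and DecrementCounter.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the hidden UAV counter of a stack-style buffer.
struct StackCounterModel {
    std::atomic<uint32_t> count{0};

    // Mirrors the "dumb" counter: decrements unconditionally and returns the
    // post-decrement value, wrapping to 0xFFFFFFFF if the stack was empty.
    uint32_t unchecked_pop_index() {
        return count.fetch_sub(1) - 1;
    }

    // Guarded pop: only decrements while the counter is non-zero.
    // Returns false (and leaves the counter untouched) on an empty stack.
    bool checked_pop_index(uint32_t& index) {
        uint32_t current = count.load();
        while (current != 0) {
            // On failure, compare_exchange_weak reloads `current`, so the
            // loop retries with the value another thread just wrote.
            if (count.compare_exchange_weak(current, current - 1)) {
                index = current - 1; // slot that was on top of the stack
                return true;
            }
        }
        return false;
    }
};
```

With two items pushed, checked_pop_index yields indices 1 and then 0, and a third call returns false instead of wrapping the counter.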


Thanks in advance. 


P.S. Big thanks, too, to anyone who is willing to answer my first question.


Yes, the counter used by Increment/DecrementCounter and Append/Consume is the same hidden counter. Append/Consume are just a bit of syntactic sugar that wraps usage of the hidden counter. You can implement the same functionality yourself with Increment/DecrementCounter:


uint outputIdx = OutputBuffer.IncrementCounter();
OutputBuffer[outputIdx] = outputData;


If you look at the generated bytecode for a shader that calls Append, you'll see that it does the equivalent of the above code. Let's try it on this simple compute shader:


AppendStructuredBuffer<uint> OutputBuffer;
[numthreads(64, 1, 1)]
void CSMain(in uint3 ThreadID : SV_GroupThreadID)
{
    OutputBuffer.Append(ThreadID.x); // body reconstructed from the bytecode that follows
}


Here's the resulting bytecode generated by FXC:


dcl_globalFlags refactoringAllowed
dcl_uav_structured u0, 4
dcl_input vThreadIDInGroup.x
dcl_temps 1
dcl_thread_group 64, 1, 1
imm_atomic_alloc r0.x, u0
store_structured u0.x, r0.x, l(0), vThreadIDInGroup.x


The imm_atomic_alloc performs the atomic add on the hidden counter, and the value of the counter prior to that add is used as the index for writing into the structured buffer.
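
The same two-step pattern can be modelled on the CPU. This is a rough C++ sketch, not a description of what any driver does: AppendBufferModel is a hypothetical type, std::atomic::fetch_add plays the role of imm_atomic_alloc (it returns the value before the add), and the plain array write plays the role of store_structured.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side model of an append buffer:
// element storage plus a separate "hidden" counter.
struct AppendBufferModel {
    std::vector<uint32_t> data;
    std::atomic<uint32_t> counter{0};

    explicit AppendBufferModel(std::size_t capacity) : data(capacity, 0) {}

    // Appends a value and returns the slot it was written to,
    // which is the counter value prior to the atomic add.
    uint32_t append(uint32_t value) {
        uint32_t slot = counter.fetch_add(1); // analogous to imm_atomic_alloc
        assert(slot < data.size());           // this model assumes the caller sized the buffer
        data[slot] = value;                   // analogous to store_structured
        return slot;
    }
};
```

Appending 10 and then 20 to an empty model writes slots 0 and 1 and leaves the counter at 2.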


Different GPU architectures implement this functionality in different ways. The most straightforward way to do it would be for the GPU to support a global atomic add on a memory location. Then the "hidden counter" is just a 4-byte value somewhere in GPU-accessible memory. On the GPU architectures that I'm familiar with (AMD's GCN family), this sort of global atomic operation is supported and could be used to implement the counter. However, these GPUs actually have a bit of on-chip memory called "GDS" that the driver uses for this purpose. Using GDS allows the operation to be kept entirely on-chip without having to go out to caches or external memory.


On AMD GPUs, the way these operations work would allow the counter to be accessed concurrently by two different shaders with no issues. I would assume it would also work on other GPUs, but I'm not too familiar with Intel chips, and Nvidia is notoriously tight-lipped about the specifics of their hardware. Either way, I'm not sure exactly what guarantees are made by the D3D12 API. D3D11 didn't even have the notion of multiple shaders executing in parallel, and essentially forced sync points between sequential dispatches (as if you always used a UAV barrier). Since most of the shader-centric documentation is from the D3D11 era, it doesn't have any information about accessing counters across multiple dispatches. The docs for D3D12_RESOURCE_UAV_BARRIER do say this:


You don't need to insert a UAV barrier between 2 draw or dispatch calls that only read a UAV. Additionally, you don't need to insert a UAV barrier between 2 draw or dispatch calls that write to the same UAV if you know that it's safe to execute the UAV accesses in any order.


I would think that the "safe to execute the UAV accesses in any order" condition would apply in the case of an atomic append, since you're expecting the results to be unordered. But perhaps someone more familiar with the specs or documentation could clear that up further. Either way, I would be surprised if what you described didn't work in practice.
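
As a loose CPU analogy for two overlapped dispatches appending to one queue, here is a small C++ sketch. Everything in it is an assumption made for illustration: JobQueueModel and run_overlapped are hypothetical names, each std::thread stands in for one dispatch, and std::atomic stands in for the hidden counter. It says nothing about what the D3D12 spec guarantees. It only shows why an atomic counter makes the unordered case safe: every fetch_add hands out a distinct slot, so no job is lost or overwritten even though the final order is unspecified.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical model of the job queue: storage plus an atomic counter.
struct JobQueueModel {
    std::vector<uint32_t> jobs;
    std::atomic<uint32_t> counter{0};

    explicit JobQueueModel(std::size_t capacity) : jobs(capacity, 0) {}

    void push(uint32_t job) {
        // Each caller receives a unique slot, whatever the interleaving.
        uint32_t slot = counter.fetch_add(1, std::memory_order_relaxed);
        if (slot < jobs.size()) {
            jobs[slot] = job;
        }
    }
};

// Runs two "dispatches" concurrently with no barrier between them and
// returns the number of jobs recorded once both have finished.
uint32_t run_overlapped(JobQueueModel& queue, uint32_t jobs_per_dispatch) {
    auto dispatch = [&queue, jobs_per_dispatch](uint32_t first_id) {
        for (uint32_t i = 0; i < jobs_per_dispatch; ++i) {
            queue.push(first_id + i);
        }
    };
    std::thread cs1(dispatch, 0u);                // jobs 0 .. n-1
    std::thread cs2(dispatch, jobs_per_dispatch); // jobs n .. 2n-1
    cs1.join();
    cs2.join();
    return queue.counter.load();
}
```

Note that whatever consumes the queue afterwards still has to wait for both producers to finish. The join calls play that role here, and on the GPU a barrier before the consuming dispatch would.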


Thanks MJP. Then I guess there isn't any reason to use an Append/Consume buffer over a StructuredBuffer, since the latter is much more flexible :-)
