
DX12 UAV Counters



In DX12, CreateUnorderedAccessView now accepts a new optional ID3D12Resource parameter called the counter resource, and MSDN has a page that briefly mentions UAV counters, but online material on the topic is very sparse. I'd greatly appreciate it if someone could elaborate: what is a UAV counter, in which scenarios would we want to associate a counter buffer with a UAV, which rendering techniques or GPGPU algorithms benefit most from UAV counters, and how do UAV counters compare with the alternatives in terms of performance?

 

Sorry for asking a lot of questions, but this really bothers me

 

Thanks

 

Peng


I've asked this before and am similarly bothered.

 

I'm sort of assuming that vendors can optionally implement some fancy technique to speed them up -- for example, a card with fast shared memory atomics might use a different implementation than a card that doesn't. If you're writing HLSL shaders and don't know which card it's going to run on, you'd need to write two different shaders and switch based on the gpu at runtime which sucks. 

 

But it's still kind of odd because one might want different techniques depending on the usage case -- e.g. if every thread increments a counter vs very few vs multiple varying increments per thread, and it seems like this can't possibly address every situation. 

 

Anyway having used them before I learned a few things:

  • You can use the same buffer as both the UAV buffer and its counter buffer
  • Counters have a really big alignment, 4096 bytes. Putting other stuff in the remaining 4092 bytes seems fine. Also the counter buffer can be just 4 bytes big if you want. IDGI.
  • Putting the counter at the very beginning of a buffer works and is kind of convenient
  • The counter value can be accessed like any other data
  • It ran about as fast as using InterlockedAdd, but again it might be doing some optimization on other cards that would be better than a global interlocked add.

 

Most of this is speculation.
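
For anyone who wants to try the single-buffer layout from the list above, here is a minimal sketch of the offset arithmetic. It is not actual D3D12 setup code: `BufferLayout`, `MakeCounterFirstLayout` and `MakeCounterLastLayout` are hypothetical helper names, and `kCounterAlignment` is a local constant assumed to mirror the 4096-byte counter alignment mentioned above. In a real application the resulting values would presumably go into the `FirstElement`, `NumElements`, `StructureByteStride` and `CounterOffsetInBytes` fields of the buffer UAV description.

```cpp
#include <cstdint>

// Assumption: mirrors the 4096-byte UAV counter placement alignment.
constexpr std::uint64_t kCounterAlignment = 4096;
constexpr std::uint64_t kCounterSize = sizeof(std::uint32_t);  // 4-byte counter

// Round value up to the next multiple of alignment (alignment must be > 0).
constexpr std::uint64_t AlignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical description of one buffer holding both data and counter.
struct BufferLayout {
    std::uint64_t counterOffsetBytes;  // must be a multiple of kCounterAlignment
    std::uint64_t firstElement;        // index of the first data element
    std::uint64_t totalBytes;          // minimum buffer size
};

// Counter at offset 0, data starting at the first whole element after it.
// stride must be non-zero.
constexpr BufferLayout MakeCounterFirstLayout(std::uint64_t stride,
                                              std::uint64_t numElements) {
    const std::uint64_t firstElement = (kCounterSize + stride - 1) / stride;
    return {0, firstElement, (firstElement + numElements) * stride};
}

// Data at offset 0, counter at the next 4096-byte boundary after the data.
constexpr BufferLayout MakeCounterLastLayout(std::uint64_t stride,
                                             std::uint64_t numElements) {
    const std::uint64_t counterOffset = AlignUp(stride * numElements, kCounterAlignment);
    return {counterOffset, 0, counterOffset + kCounterSize};
}
```

With a 16-byte stride and 100 elements, the counter-first layout wastes only one element slot, while the counter-last layout pads the data out to the next 4096-byte boundary, which is probably why putting the counter at the start feels convenient.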

 

---

 

To compare to other techniques, you can use a scan algorithm or histopyramid to do a lot of tasks that counters do. Mainly compaction of sparse data in buffers, filtering, or even just counting occurrences of something. Off the top of my head, marching cubes can use any of these three techniques (counters, scan, histop) to list occupied voxels in a contiguous array.
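
To make the comparison concrete, here is a rough CPU-side sketch of the two simplest options, counter-based and scan-based compaction, applied to a list of voxel occupancy flags. It is only an analogy under stated assumptions: the loops run sequentially where a GPU would run one thread per voxel, `std::atomic` stands in for the UAV counter, and the function names are made up for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counter-style compaction: every occupied voxel reserves an output slot
// by atomically bumping a shared counter. On a GPU the slots would be
// handed out in nondeterministic order; this sequential loop keeps them sorted.
std::vector<std::uint32_t> CompactWithCounter(const std::vector<std::uint8_t>& occupied) {
    std::vector<std::uint32_t> out(occupied.size());
    std::atomic<std::uint32_t> counter{0};
    for (std::size_t i = 0; i < occupied.size(); ++i) {
        if (occupied[i]) {
            const std::uint32_t slot = counter.fetch_add(1u);
            out[slot] = static_cast<std::uint32_t>(i);
        }
    }
    out.resize(counter.load());
    return out;
}

// Scan-style compaction: an exclusive prefix sum over the 0/1 flags gives
// each occupied voxel its output index, so no atomics are needed and the
// output order is always the input order.
std::vector<std::uint32_t> CompactWithScan(const std::vector<std::uint8_t>& occupied) {
    std::vector<std::uint32_t> offsets(occupied.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < occupied.size(); ++i) {
        offsets[i] = running;              // exclusive: sum of flags before i
        running += occupied[i] ? 1u : 0u;
    }
    std::vector<std::uint32_t> out(running);
    for (std::size_t i = 0; i < occupied.size(); ++i) {
        if (occupied[i]) {
            out[offsets[i]] = static_cast<std::uint32_t>(i);
        }
    }
    return out;
}
```

Both functions return the indices of the occupied voxels in a contiguous array. The practical difference is that the counter version contends on a single value, while the scan version does extra passes but keeps a stable order.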

Edited by Dingleberry



Thanks, Dingleberry, for sharing your experience and thoughts on counter buffers. What I'm most curious about is the design purpose of the counter buffer: since we can do counting and similar things with just a buffer and atomic ops, why does DirectX have dedicated UAV counter support at the API level? There must be some cases where a normal UAV and atomic ops cannot do the job...


Another poster on here told me some drivers will just implement it as an interlocked add. I'm nearly certain an interlocked add can functionally do everything a counter can. I'm guessing it just sometimes isn't done that way, as if Microsoft told vendors "the counter has these requirements but doesn't need to work in any specific way".

 

Thinking it through further, I don't think many people use them so it creates a chicken and egg problem where vendors aren't going to care about improving its performance and then no one uses them because they're not significantly faster than atomics.

 

Again I could be really wrong, since GPUs can do a lot of things that aren't exposed directly in DX12.
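
The claim that an interlocked add can stand in for a counter can be sketched on the CPU like this. `EmulatedUavCounter` is a hypothetical type, not a D3D12 or HLSL API; it only mimics the return-value convention of the HLSL `IncrementCounter` and `DecrementCounter` methods (pre-increment value and post-decrement value respectively) on top of plain atomic adds.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical CPU stand-in for a UAV counter, built only from atomic adds.
struct EmulatedUavCounter {
    std::atomic<std::uint32_t> value{0};

    // Like IncrementCounter(): returns the value before the increment,
    // which is the slot an append would write to.
    std::uint32_t Increment() { return value.fetch_add(1u); }

    // Like DecrementCounter(): returns the value after the decrement,
    // which is the slot a consume would read from.
    std::uint32_t Decrement() { return value.fetch_sub(1u) - 1u; }
};
```

Two increments hand out slots 0 and 1, and a following decrement returns slot 1 again, so append and consume pair up like a stack. Whether a driver actually implements the counter this way is exactly the open question here.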

Edited by Dingleberry

AFAIK it's equivalent to D3D11's append/consume buffers, which had a magic hidden counter. This is the same thing, except the counter is no longer hidden.
Some GPUs might have special hardware for them, but most modern ones likely use general-purpose hardware to implement them, like you've guessed.
