DX12 Are there some memory limits to consider when writing a shader?

Recommended Posts

Would it be a problem to create ~50 uninitialized arrays of ~300,000 cells each in HLSL and then use them for my algorithm? (That is what I currently do in C++, where the large arrays gave me stack overflow problems.)
Everything is internal to the shader: the shader creates the arrays at the start, uses them, and then no longer needs them. It neither takes data for the arrays from the outside world nor gives any data from them back. Nothing is shared.
My question is not very specific; it is about memory consumption considerations when writing shaders in general, because my algorithm still has to be polished. I will leave the HLSL port for when the algorithm is totally finished and working (I expect writing HLSL to be just as unpleasant as GLSL). Still, it is useful to know beforehand what problems to consider.
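For concreteness, the pattern I am asking about would look roughly like the sketch below (the entry point and array names are made up, and only two of the ~50 arrays are shown); whether something like this even compiles, and how it performs, is exactly what I am unsure about:

    // Hypothetical illustration only: large arrays local to a single shader
    // invocation, never read from or written back to any resource.
    float4 PSMain(float4 pos : SV_Position) : SV_Target
    {
        float cells0[300000];   // ~50 arrays of ~300,000 cells each
        float cells1[300000];

        // ... the algorithm would fill and consume these arrays here ...
        cells0[0] = pos.x;
        cells1[0] = pos.y;

        return float4(cells0[0], cells1[0], 0.0f, 1.0f);
    }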


If I'm understanding you correctly, what you're suggesting is probably the worst thing you could be trying to do. Shaders don't have a stack or any general-purpose local memory to play with; they have a few (~256) registers and that's it. Compute Shaders can use LDS / Group Shared Memory, but I don't think that's what you're using here.

If you're writing a Pixel Shader that uses large arrays local to each thread then you're in for a world of hurt I'm afraid. The compilers will probably find a way to run what you've written by spilling back to main memory a lot, but performance will be dire.

Do you have an example of the types of situations where you'd want a thread on the GPU to have access to 300,000 'cells' of data and for that data to not live in a Resource such as a Buffer or a Texture?
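For reference, here is a rough sketch of what Group Shared Memory looks like in a compute shader (the names and sizes are made up for illustration). Note that groupshared storage is itself tiny (32 KiB per thread group in cs_5_0), so it is nowhere near big enough for 50 arrays of 300,000 cells either:

    // Sketch only: group shared memory (LDS) in a compute shader.
    groupshared float tile[256];                  // shared by all threads in the group

    RWStructuredBuffer<float> gResult : register(u0);

    [numthreads(256, 1, 1)]
    void CSMain(uint3 tid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
    {
        tile[gi] = (float)tid.x;                  // each thread writes one slot
        GroupMemoryBarrierWithGroupSync();        // make the writes visible to the whole group
        gResult[tid.x] = tile[(gi + 1) % 256];    // read a neighbouring thread's value
    }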


If you're only trying to use the GPU to get around your stack overflow problem, the simpler solution would be to allocate the memory on the heap with new/malloc.

If you're trying to speed up an algorithm, you could simply bounce data between image buffers - 300,000 cells fit in a 600*600 buffer (360,000 texels) - depending on what you define as a 'cell'.

Without more information it's hard to understand what it is you're trying to achieve.
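As a rough sketch of that idea (the resource name and the R32_FLOAT format are assumptions), 300,000 single-float cells fit comfortably in a 600x600 single-channel UAV texture, addressed by flattening the cell index:

    // Sketch: storing ~300,000 float "cells" in a 600x600 UAV texture.
    RWTexture2D<float> gCells : register(u0);    // created as 600x600, R32_FLOAT

    static const uint CELL_DIM = 600;

    float LoadCell(uint index)                   // index < 360,000
    {
        return gCells[uint2(index % CELL_DIM, index / CELL_DIM)];
    }

    void StoreCell(uint index, float value)
    {
        gCells[uint2(index % CELL_DIM, index / CELL_DIM)] = value;
    }

    [numthreads(64, 1, 1)]
    void CSMain(uint3 tid : SV_DispatchThreadID)
    {
        float v = LoadCell(tid.x);               // read one cell,
        StoreCell(tid.x, v + 1.0f);              // do some work, write it back
    }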

 


I am designing my algorithm in C++ now; later I want to adapt it to HLSL. It is working, but it is huge and I have to optimize it more, because I know the CPU each shader uses is not as fast as the main CPU where C++ runs. I could even redesign it completely if needed, but I don't think there is a way to do it with only 256 registers.

I could say I wrote a genetic algorithm, but I don't want to get into discussions about what counts as a genetic algorithm / ML / big data / etc. What I know for sure is that my data grows iteratively and then iteratively shrinks. I think it is a genetic algorithm, but I don't want to argue about it.

Should I use a RWBuffer*, or is something else more appropriate than an array?

(a "cell" meaning "a small room in which prisoner data is locked up", not "the smallest structural and functional unit of an organism")

16 minutes ago, NikiTo said:

I know the CPU each shader uses is not as fast as the main CPU where C++ runs

It's not just that they're slower - they're different. Shader processors use SIMD to run on multiple pixels/vertices/compute-threads at the same time.

On the CPU we have SSE instructions, which operate on four values at the same time. GPU processors operate on up to 64 values at the same time! You need to be aware of this when structuring your algorithms. For example, branching on the CPU is fine, but branching on the GPU can come at a huge cost: if only one pixel needs to take the branch, you'll be running instructions where only 1/64th of the computational power is being utilized!
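As a toy illustration of that cost (the texture, sampler and the work inside the branch are all made up): if even one pixel in a 64-wide wave enters the 'if', the whole wave runs the expensive path with the other lanes masked out.

    // Toy example of branch divergence in a pixel shader.
    Texture2D    gMask    : register(t0);
    SamplerState gSampler : register(s0);

    float4 PSMain(float2 uv : TEXCOORD0) : SV_Target
    {
        float  mask  = gMask.Sample(gSampler, uv).r;
        float3 color = float3(0.2f, 0.2f, 0.2f);      // cheap, common path

        if (mask > 0.99f)                             // taken by very few pixels
        {
            // Expensive path: every wave containing such a pixel pays this cost.
            for (int i = 0; i < 256; ++i)
                color += 0.001f * sin(uv.x * i);
        }
        return float4(color, 1.0f);
    }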


I used to use SIMD a lot in assembler, so I'm used to it. For some reason it is harder for me to explore the GPU the way I did the CPU. Intel has tons of very explicit documentation; in ASM I know what goes where and where everything is at any moment. On the GPU, although it should be very similar, it is not so clear to me. I will try to redesign the algorithm to run on Intel's SIMD first. That should help me orient myself, and it should be easier to port from Intel SIMD to the GPU than from my C++ code, which is full of branches. (For some reason I had totally forgotten about branches; if you asked me "are branches good for the GPU?" I would say no, yet I still forgot about them.)


@Hodgman Why do you refer to it as SIMD instead of SIMT?  To me SIMT is more accurate.

As far as your C++ version goes, either, as was mentioned before, don't allocate the arrays on the stack (use the heap or, dare I say it, globals), or increase the size of the stack.

Also, as alluded to in my reply to Hodgman, GPUs are SIMT, not SIMD; it's similar but not the same. And the number of registers should have no bearing on whether you can implement your algorithm, although it will affect performance.

10 hours ago, Infinisearch said:

@Hodgman Why do you refer to it as SIMD instead of SIMT?  To me SIMT is more accurate.

Because I'm imprecise when I talk most of the time :D There's not too much difference between them. You're right though.
Shader code is SIMD if your thread-group size matches the hardware vector instruction width. If your thread-group size is bigger, then it's SIMT, as the GPU is juggling multiple thread-groups at the same time (something akin to "hyperthreading").
I.e. the instruction set is SIMD, but the shader code environment built on top of it is SIMT.
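As a concrete illustration of that distinction (the wave width is hardware-dependent, e.g. 64 lanes on AMD GCN and 32 on NVIDIA), the only thing that changes is the declared group size:

    // On a GPU with 64-wide waves, this group maps onto exactly one hardware vector:
    [numthreads(64, 1, 1)]
    void CSMain_OneWave(uint3 tid : SV_DispatchThreadID)
    {
        // ...
    }

    // This group spans four such waves, which the GPU schedules/juggles together:
    [numthreads(256, 1, 1)]
    void CSMain_FourWaves(uint3 tid : SV_DispatchThreadID)
    {
        // ...
    }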

3 hours ago, Hodgman said:

Shader code is SIMD if your thread-group size matches the hardware vector instruction width.

I don't agree with this either... SIMD doesn't allow for branch divergence, SIMT does. You can have branch divergence within a single thread group whose size matches the hardware vector instruction width. At least that's my understanding of it.

22 minutes ago, Infinisearch said:

I don't agree with this either... SIMD doesn't allow for branch divergence, SIMT does. You can have branch divergence within a single thread group whose size matches the hardware vector instruction width. At least that's my understanding of it.

Many vector instruction sets offer masked operations nowadays, and with instructions like movemask you can make sure that only the branches actually in use are evaluated. So it really is more of a programming-model thing than an actual difference in hardware.
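On the shader side, Shader Model 6.0 exposes the same idea directly through wave intrinsics. A sketch (the buffer name and the condition are made up; WaveActiveBallot, WaveActiveAnyTrue and countbits are real SM 6.0 / HLSL intrinsics):

    // WaveActiveBallot returns the per-wave lane mask, much like movemask on the CPU,
    // and WaveActiveAnyTrue lets the whole wave skip a block no lane needs.
    RWStructuredBuffer<uint> gOut : register(u0);

    [numthreads(64, 1, 1)]
    void CSMain(uint3 tid : SV_DispatchThreadID)
    {
        bool expensive = (tid.x % 17u) == 0u;            // made-up per-lane condition

        uint4 laneMask = WaveActiveBallot(expensive);    // one bit per lane that set the flag

        if (WaveActiveAnyTrue(expensive))                // skip the block if no lane needs it
        {
            // e.g. record how many lanes in this wave took the branch
            gOut[tid.x] = countbits(laneMask.x) + countbits(laneMask.y);
        }
        else
        {
            gOut[tid.x] = 0u;
        }
    }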

