Stencil vs Discard performance


I have a problem. My shaders are huge, meaning they have a lot of code inside. Many of my pixels should be completely discarded. I could put a comparison and a discard at the very beginning of the shader, but as far as I understand, the discard statement does not save any workload, as the discarded pixel has to stall until its long, heavy neighbor shaders complete.
Initially I wanted to use the stencil to discard pixels before execution even enters the shader, before the GPU distributes/allocates resources for it, avoiding stalls in the pixel shader execution flow. I assumed that the depth/stencil test rejects pixels before the pixel shader, but I now see that it happens in the very last Output Merger stage. It seems extremely inefficient to render, say, a small mirror in a scene with a big viewport that way. Why did they put the stencil test in the Output Merger anyway? Handling of the stencil is so limited compared to other resources. Do people use stencil functionality at all in games, or do they prefer discard/clip?

Will the GPU stall the pixel if I issue a discard at the very beginning of the pixel shader, or will it immediately start using the freed-up resources to render another pixel?!
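
For reference, this is roughly the kind of early-out I mean (just a minimal HLSL sketch; the mask texture, sampler and threshold are placeholders, not my real code):

    Texture2D<float4> gMask : register(t0);   // placeholder mask/alpha texture
    SamplerState gSampler   : register(s0);

    float4 main(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target0
    {
        // Early out: reject the pixel before the heavy code below.
        // The question is whether this actually frees the lane, or whether
        // it just sits idle until the rest of the quad/wave finishes.
        if (gMask.Sample(gSampler, uv).a < 0.5f)
            discard;

        // ... hundreds of lines of heavy shading would follow here ...
        return float4(1.0f, 0.0f, 1.0f, 1.0f);
    }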



 

1 hour ago, NikiTo said:

But as far as I understand, the discard statement does not save any workload, as the discarded pixel has to stall until its long, heavy neighbor shaders complete.

At least you save memory operations, which can be a huge win. I assume distributing threads only to 'used' pixels would be less efficient. AFAIK, one reason GPUs process pixels in 2x2 quads is to calculate gradients by accessing neighboring threads' registers directly (the origin of the SM6 quad shuffle), and this would no longer work. Other reasons may be efficient memory access, color compression, depth buffer decompression, etc.
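
(Just to illustrate the quad point - a rough SM6 sketch, not tied to any particular shader; QuadReadAcrossX/Y are real SM 6.0 intrinsics, the helper function is made up:)

    // Every pixel-shader lane belongs to a 2x2 quad and can read its
    // neighbors' registers directly.
    float2 QuadGradients(float value)
    {
        // Difference with the horizontal / vertical neighbor in the quad.
        // Up to sign (depending on which lane you are), this is what
        // ddx_fine / ddy_fine return.
        float dx = QuadReadAcrossX(value) - value;
        float dy = QuadReadAcrossY(value) - value;
        return float2(dx, dy);
    }

A discarded pixel's lane still has to participate so these neighbor reads stay defined, which is one reason a lone discarded pixel can't simply be dropped.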

I don't know for sure whether rejecting a full 2x2 quad is possible with the stencil test, but I'd guess so. So if the stencil covers larger, solidly filled areas, it's probably better than discard. (Not sure - curious what others have to say...)

1 hour ago, NikiTo said:

I have a problem. My shaders are huge, meaning they have a lot of code inside.

Sounds like you might have an occupancy problem? Tackling that could be worth fixing...

 

 

 

Logically, the depth/stencil test is in the Output Merger (OM) stage, which runs after the pixel shader. But that pipeline is really an abstracted, virtual pipeline: it doesn't dictate how the hardware actually works, only how results should appear to the application. This means GPUs are free to apply optimizations that don't match the virtual pipeline, as long as the results match what the specification says should be produced. So in practice, all GPUs have some sort of early depth/stencil test that they can use to keep the pixel shader from running whenever they can determine it's safe to do so. The two big exception cases are using discard, and exporting depth/stencil from the pixel shader. In both of those cases the pixel shader has to run in order to know how depth/stencil testing and writing should happen, so the hardware is limited in how it can optimize. Some hardware will still use early Z with discard: it basically does early Z and then re-updates the Z value after the pixel shader if the pixel was discarded. Other hardware just gives up in this case. Full depth output from the PS will always disable early Z, but there are "conservative" depth output modes that still allow the hardware to do a subset of the early Z operations.
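
For example, a conservative depth export looks roughly like this in HLSL (SV_DepthGreaterEqual is a real SM 5.0+ semantic; the rest of the names are placeholders, and this assumes a standard, non-reversed depth range):

    struct PSOutput
    {
        float4 color : SV_Target0;
        // Promise to the runtime: the exported depth is always >= the
        // rasterized depth, so the hardware can still do a conservative
        // early-Z reject instead of disabling early Z entirely.
        float  depth : SV_DepthGreaterEqual;
    };

    PSOutput ConservativeDepthPS(float4 pos : SV_Position, float2 uv : TEXCOORD0)
    {
        PSOutput output;
        output.color = float4(uv, 0.0f, 1.0f);   // placeholder shading
        output.depth = pos.z + 0.0001f;          // only ever push the depth further away
        return output;
    }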

One trick that's used in a lot of games to speed up alpha testing is to run a depth pre-pass for the alpha-tested geometry, and then re-render that geometry with the full shader, but with depth writes off and the depth test set to EQUAL. Basically you pay the discard cost for your simpler pre-pass shader, but then your full shader will only run for the non-discarded pixels (and won't need a 'discard' operation, so the hardware can still use early Z).
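
A minimal HLSL sketch of that two-pass idea (the texture and names are placeholders; the depth state - writes on for pass 1, writes off and DepthFunc = EQUAL for pass 2 - is set on the API side and not shown here):

    Texture2D<float4> gAlphaTex : register(t0);   // hypothetical alpha-test texture
    SamplerState gSampler       : register(s0);

    // Pass 1: depth pre-pass. Depth writes ON, no color target needed.
    // Its only job is to lay down depth for the non-discarded pixels.
    void PrePassPS(float4 pos : SV_Position, float2 uv : TEXCOORD0)
    {
        float alpha = gAlphaTex.Sample(gSampler, uv).a;
        clip(alpha - 0.5f);   // discard the "hole" pixels
    }

    // Pass 2: full shading. Depth writes OFF, depth test EQUAL, no discard,
    // so early-Z can reject every pixel the pre-pass threw away.
    float4 MainPS(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target0
    {
        // ... the long, heavy shading code goes here ...
        return gAlphaTex.Sample(gSampler, uv);
    }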

It's really worth measuring what MJP suggested -- doing a pre-pass where your PS samples the alpha texture and does a discard (and nothing else), and then a final Z-equal pass with the final PS without any discard. If your VS is cheap and you have at least some overdraw, it should be a win.

A nice side effect of the Z pre-pass is occlusion by "big" geometry (regardless of whether it has alpha holes or not), as long as the VS is cheap.

If you want to use the stencil, you have to produce it somehow, which means running a simple pre-pass that samples the alpha anyway, no?

 

Thank you, all!

I am going to do what @MJP suggested. Plus, I can decide which pixels are discarded in the previous render pass, so I have no need for a dedicated pass just for the Z values. MJP's solution is purrfect for me.

It's just that, when I made my design decisions, I assumed the stencil would be easier to use. Now I have to use a pure depth format, excluding the stencil byte, and use the depth functionality for stenciling instead. I don't understand why the stencil is so tricky to use, forcing people to emulate the same functionality with the depth buffer. Is the stencil somehow deprecated?!

Stencil isn't deprecated yet. There are very specialised circuits to support very fast stencil + depth (HiZ, HiS, Depth-block caches, etc.) on all GPUs.

In most usages I've seen, the draw call specifies a test value and/or a write value (operation) for the stencil, and ALL the pixels use it. Typically you mask out whole objects or regions (sky, foreground, characters, water, ...) and then run some full-screen pass only on the respective parts. You normally don't want to read or write a specific stencil value from within the shader; you normally don't do that with depth either.

The win is that the GPU is able to quickly reject (or accept) a whole block of (typically 8x8) pixels from execution if the test (stencil, depth or both) doesn't match. Blocks which won't run don't consume any resources, which is much faster than scheduling threads for execution, then fetching the addresses of texture descriptors, then sampling those textures and realising that we actually didn't want to render there. It all depends on whether it's faster to prepare the stencil/depth "filter" or not.
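
As a rough illustration of that masking pattern (a sketch only - the stencil ref/func/op values live in the API's depth-stencil state, and all the names here are made up):

    // Pass 1: mask pass - rasterize the masked geometry (mirror, water, sky, ...)
    // with a trivial PS. The API-side stencil state writes a reference value
    // for every covered pixel (e.g. StencilPassOp = REPLACE, ref = 1).
    void MaskPS(float4 pos : SV_Position)
    {
        // intentionally empty - only the stencil write matters
    }

    Texture2D<float4> gSceneColor : register(t0);   // hypothetical input
    SamplerState gSampler         : register(s0);

    // Pass 2: full-screen (or per-object) pass with StencilFunc = EQUAL, ref = 1.
    // Blocks whose stencil doesn't match are rejected before this shader is
    // ever scheduled, so the heavy code only runs inside the mask.
    float4 MaskedPS(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target0
    {
        // ... expensive shading for the masked region ...
        return gSceneColor.Sample(gSampler, uv);
    }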

So I'd say it's still pretty useful. Having small holes, small triangles or otherwise divergent execution within those (typically 8x8) blocks of pixels will hurt performance, and using the stencil for tiny 1-pixel holes is exactly that - a bit of a waste :)

Imagine that I have big triangles with a static TV (white noise) texture. I need to discard only the black pixels.

52 minutes ago, NikiTo said:

Imagine that I have big triangles with a static TV (white noise) texture. I need to discard only the black pixels.

If you discard only about 1/255 of the pixels (assuming you mean black == 0.f), then there won't be any performance benefit from discard or stenciling, as the GPU dispatches pixels in groups of at least 2x2 (yes, fragments that are already rejected will still be processed). Therefore the only speedup you'll get is in cases where a whole 2x2 block of pixels is black and is also aligned exactly to the 2x2 pixel grid - roughly (1/255)^4, i.e. about 1 in 4 billion.

Hence, if this is only about performance, go for discard; it at least has less setup overhead.

Thank you for the answer, @Krypt0n! It's clearer to me now. I will use discard and not worry about it for now, and I will leave depth/stencil aside as a final optimization.
