Full screen pass optimization on G70 hardware

25 comments, last by Krypt0n 13 years, 9 months ago
Quote:Original post by Zemedelec
This will discard fragments, *after* the pixel shader is executed for them.
It will not discard any fragments when you draw your scene, but that is what you want, too. You do want to draw those fragments.

When you draw your 5 fullscreen quads, it will discard the sky fragments after the depth test, but *before* running the pixel shader, and this is what you want. It would of course be even more efficient if you could exploit the z buffer for that, but this isn't easy, and it would probably be slower due to some necessary state changes and an extra pass (albeit double-speed).

There used to be a "Humus Demo" a few years ago which used the stencil buffer technique to implement dynamic branching on hardware that per se was unable to do so, reducing the calculations for many small, local lights.
Personally, I didn't like the unjustified contra-nVidia, pro-ATI propaganda that came with the demo, but the technique in the demo nevertheless remains entirely valid.
Quote:Original post by samoth
When you draw your 5 fullscreen quads, it will discard the sky fragments after the depth test, but *before* running the pixel shader, and this is what you want.

G70 has no early-stencil, so afaik no stencil test can cull out pixel shader execution on that hardware. G80+ feature.

About the Humus demo - was it ATI-targeted or what? I don't see how stencil can be used to emulate dynamic branching on pre-G80 NVidia hardware.

P.S: The Humus demo uses an ATI feature - they have had early-stencil for many years now, while NV has had it working only since G80.
Strange, I had respect for NVidia for their optimization techniques, and now I realize it is quite the contrary - early-z/stencil arrived late, and DBT is unable to stop the pixel shader from running...
>>Performance is in single digit range (1-2 fps) <<

mate, say for argument's sake it doesn't have early-z

if it did, in theory (in reality it will be less) you'll gain twice the fps, thus fps will be ~3fps
3fps is still an order of magnitude (at least) too slow
i.e. you need to look at a different solution
Quote:Original post by zedz
if it did, in theory (in reality it will be less) youll gain twice the fps thus fps will be ~3fps


Twice the fps? In theory? I think it can go way over 'twice the fps', easily.
Where did you get those numbers?
Quote:Original post by Zemedelec
P.S: The Humus demo uses an ATI feature - they have had early-stencil for many years now, while NV has had it working only since G80.
The card I tried this with, years ago, was a 6600 LE, and it very noticeably increased the framerate. Maybe the early models didn't have an all-aggressive hierarchical z-culling, but they certainly discard fragments before running the pixel shader. Heck, it wouldn't make sense to do anything else. :-)
Quote:Original post by samoth
Quote:Original post by Zemedelec
P.S: The Humus demo uses an ATI feature - they have had early-stencil for many years now, while NV has had it working only since G80.
The card I tried this with, years ago, was a 6600 LE, and it very noticeably increased the framerate. Maybe the early models didn't have an all-aggressive hierarchical z-culling, but they certainly discard fragments before running the pixel shader. Heck, it wouldn't make sense to do anything else. :-)


Are you sure the increase in frame rate was due to pixel shader culling, and not just to the cut in bandwidth? A 6600 LE sounds like it would strongly benefit from both...

Also, let me repeat - earlier hardware has early HZ (Z-Cull), not a fine-grained per-pixel test (Early-Z). So, the speed-up was probably due to the hierarchical tests.
Quote:Original post by zedz
>>Performance is in single digit range (1-2 fps) <<

mate, say for argument's sake it doesn't have early-z

if it did, in theory (in reality it will be less) you'll gain twice the fps, thus fps will be ~3fps
3fps is still an order of magnitude (at least) too slow
i.e. you need to look at a different solution

zedz is right on this point. I still don't understand why early-z or early stencil should increase your performance. Even if it doubles the fps by only affecting half the pixels, it is still too slow.

I once implemented CSM with 4 layers on a 6600 with a forward renderer without really having trouble with the performance.

I would guess that you either have a sub-optimal implementation or one of your shaders falls back to software mode.
Quote:Original post by Zemedelec
Quote:Original post by Krypt0n
early-z exists since gf3, like mentioned before. it is disabled if you
-enable alpha test
-use kill/clip in pixelshader
-change compare func

And it is disabled only for the DIP that violated these rules, not later on - right? Since my screen space quads are rendered after the alpha-tested vegetation - they are solid and ok, but are rendered... after the vegetation.
that depends on what rule you break: some will partly disable the optimization, some will just reduce the efficiency of it, and some will switch it off until you clear the surface.
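
As a rough illustration of the checklist quoted above, here is a minimal Python sketch that walks a frame's draw calls in submission order and reports which ones break one of the three listed rules. `DrawState`, `broken_rules`, and the draw names are hypothetical, made up for this example. The sketch only flags offenders; what the hardware actually does afterwards (partial disable, reduced efficiency, or off until the next clear) depends on the rule and the chip, as noted above, and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class DrawState:
    """Hypothetical per-draw render state, limited to the three rules in the thread."""
    name: str
    alpha_test: bool = False
    shader_uses_kill: bool = False
    compare_func_changed: bool = False


def broken_rules(draw):
    """Return the list of early-z rules (from the quoted checklist) this draw breaks."""
    rules = []
    if draw.alpha_test:
        rules.append("alpha test enabled")
    if draw.shader_uses_kill:
        rules.append("kill/clip in pixel shader")
    if draw.compare_func_changed:
        rules.append("compare func changed")
    return rules


# Draw order as described in the thread: scene first, fullscreen quads last.
frame = [
    DrawState("terrain"),
    DrawState("vegetation", alpha_test=True),
    DrawState("fullscreen lighting quad"),
]

report = {d.name: broken_rules(d) for d in frame}
# First draw that breaks a rule; everything submitted after it may be affected.
first_offender = next((d.name for d in frame if broken_rules(d)), None)

for name, rules in report.items():
    print(name, "->", rules or "ok")
print("first offender:", first_offender)
```

The point of tracking the first offender is the ordering question raised above: the fullscreen quads themselves break no rule, but they are submitted after the alpha-tested vegetation.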

Quote:
Quote:Original post by Krypt0n
in order to get speed again on G70, you need to work around your alpha-testing.

Well, you mean to render the scene without alpha testing? That's kinda not an option - the vegetation can not be rendered without alpha testing...

not even when you sort it and render it with alphablend to fake alphatest (setting alpha to either 0.f or 1.f) ?

Quote:G70 has no early-stencil, so afaik no stencil test can cull out pixel shader execution on that hardware. G80+ feature.
it has some stencil optimization within the z-cull. you need to clear the stencil buffer, mask the area you want to use in the stencil buffer, and then draw on this area (not changing any stencil states) using your lighting pass batch.
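
The order of operations described here (clear, mask, then draw with the stencil states left alone) can be sketched on the CPU. This is plain Python, not GPU code: the buffer size, the rectangle, and the function names are assumptions chosen for illustration, and it says nothing about whether a given chip really rejects these pixels before the pixel shader runs.

```python
WIDTH, HEIGHT = 8, 4  # tiny hypothetical framebuffer


def clear_stencil(width, height, value=0):
    """Step 1: clear the stencil buffer to a known value."""
    return [[value] * width for _ in range(height)]


def mark_region(stencil, x0, y0, x1, y1, ref=1):
    """Step 2: masking pass, writes ref into the stencil for the covered rectangle."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            stencil[y][x] = ref


def lighting_pass(stencil, ref=1):
    """Step 3: fullscreen pass with stencil func EQUAL ref; stencil is only read.

    Returns how many pixels would reach the pixel shader if rejection
    happens before shading.
    """
    return sum(1 for row in stencil for s in row if s == ref)


stencil = clear_stencil(WIDTH, HEIGHT)
mark_region(stencil, 2, 1, 5, 3)   # hypothetical lit area: 3 x 2 pixels
shaded = lighting_pass(stencil)

print(f"shaded {shaded} of {WIDTH * HEIGHT} pixels")
```

Note that `lighting_pass` never writes to the stencil, which mirrors the "not changing any stencil states" condition.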
Quote:Original post by Ashaman73
I still don't understand why early-z or early stencil should increase your performance ?

Early-Z prevents the pixel shader from executing for rejected fragments. With the pixel shader as the bottleneck, Early-Z can remove some of this 'bottleneck' work from execution.
So, clearly it will speed things up. And the 'two times' is not entirely correct here, because the execution time for different fragments is not the same in the general case (dependent reads, texture fetches based on fragment information).

Quote:Original post by Ashaman73
Even if it doubles the fps by only affecting half the pixels, it is still too slow.

In reality, each cascade can cover a very small number of pixels, and early-stencil + early-z cull the rest nicely. When they are not present, or work only after the PS, the number of pixels processed is much larger: not twice, but many times bigger.
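
To put rough numbers on "not twice, but many times", here is a back-of-the-envelope cost model in Python. The resolution and the per-cascade coverage percentages are hypothetical, and the model assumes the passes are purely pixel-shader bound with the same cost for every fragment, which, as noted above, is not true in the general case.

```python
def shaded_pixels(width, height, coverage_percent, early_reject):
    """Count pixel-shader invocations over a set of fullscreen passes.

    coverage_percent: one integer percentage per pass, the share of the
    screen that pass actually needs to shade (hypothetical values).
    early_reject: True if rejected fragments never reach the pixel shader.
    """
    total = width * height
    if early_reject:
        # Only the covered pixels of each pass are shaded.
        return sum(total * pct // 100 for pct in coverage_percent)
    # Without early rejection every pass shades the whole screen.
    return total * len(coverage_percent)


# Five fullscreen quads, as in the thread; coverage values are made up.
coverage_percent = [5, 10, 15, 20, 10]

late = shaded_pixels(1280, 720, coverage_percent, early_reject=False)
early = shaded_pixels(1280, 720, coverage_percent, early_reject=True)
speedup = late / early

print(f"late: {late}, early: {early}, ratio: {speedup:.2f}x")
```

With these made-up numbers the ratio is well above 2x, because the saving scales with how little of the screen each pass covers rather than with a fixed "half the pixels".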

Quote:Original post by Ashaman73
I once implemented CSM with 4 layers on a 6600 with a forward renderer without really having trouble with the performance.

I have that one working too, but my concern is with the deferred renderer in that case.
Quote:Original post by Krypt0n
that depends on what rule you break: some will partly disable the optimization, some will just reduce the efficiency of it, and some will switch it off until you clear the surface.

Alpha test.

Quote:Original post by Krypt0n
not even when you sort it and render it with alphablend to fake alphatest (setting alpha to either 0.f or 1.f) ?

I do not want to sort all the triangles of all the vegetation - this is not practical, since there are tons of geometry there.

Quote:Original post by Krypt0n
it has some stencil optimization within the z-cull. you need to clear the stencil buffer, mask the area you want to use in the stencil buffer, and then draw on this area (not changing any stencil states) using your lighting pass batch.

I do just the same, but with only one Clear(), before the scene rendering.
Will try to clear only the stencil after the scene and before shadow rendering!

This topic is closed to new replies.
