Cypher19

Thoughts on scissor/stencil test as optimizations

As some of you know, I'm currently working with omnidirectional shadow mapping. One optimization I've been turning over in my head is using scissor or stencil tests when generating the SMs. However, this image from the MSDN suggests that even using both would not be enormously practical.

If it helps any, here's the optimization I'm thinking of: on the CPU, calculate a rectangle that bounds the area the view frustum covers in image space for the current SM face, and run the scissor test on that. In addition, the view frustum would be rendered to the SM face before anything else, to mask any pixels that lie outside of it (a finer, secondary test). Also (just thought of this while writing), render the back parts of the view frustum to define the max Z value in that area as well (a tertiary test).

So, do you guys think this would help reduce fill? I haven't worked with scissoring/stencil at all, and I'm not sure how big a speedup the Z test would give either. My biggest concern is that it would be almost pointless, because the extra fill used by the view frustum rendering and the extra checks would offset the fill/pixel processing saved by doing these tests.
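A minimal sketch of the CPU-side scissor rectangle described in this post. It assumes the eight camera-frustum corners have already been transformed into the clip space of the current shadow-map face, and that the face is a square render target with Direct3D-style pixel coordinates (origin at top-left). The names `Vec4`, `Rect`, and `scissorForFace` are hypothetical, not from any library. Corners behind the light's projection plane are not clipped here; the sketch simply falls back to the full face in that case.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <limits>

struct Vec4 { float x, y, z, w; };
// Field order mirrors a Win32 RECT; right and bottom are exclusive.
struct Rect { int left, top, right, bottom; };

// Hypothetical helper: conservative scissor rectangle for one square face.
// clipCorners: frustum corners in the face's clip space (before divide by w).
Rect scissorForFace(const std::array<Vec4, 8>& clipCorners, int size) {
    assert(size > 0);
    const Rect full{0, 0, size, size};
    const Rect empty{0, 0, 0, 0};
    const float big = std::numeric_limits<float>::max();
    float minX = big, maxX = -big, minY = big, maxY = -big;

    for (const Vec4& c : clipCorners) {
        // A corner at or behind the projection plane cannot be projected
        // safely, so give up on tightening and use the whole face.
        if (c.w <= 1e-6f) return full;
        const float nx = c.x / c.w;
        const float ny = c.y / c.w;
        minX = std::min(minX, nx); maxX = std::max(maxX, nx);
        minY = std::min(minY, ny); maxY = std::max(maxY, ny);
    }

    // All corners are in front of the light, so a bounding box that misses
    // the [-1, 1] square means the frustum misses this face entirely.
    if (maxX < -1.0f || minX > 1.0f || maxY < -1.0f || minY > 1.0f) return empty;

    minX = std::max(minX, -1.0f); maxX = std::min(maxX, 1.0f);
    minY = std::max(minY, -1.0f); maxY = std::min(maxY, 1.0f);

    // NDC to pixels: x grows to the right, y is flipped (NDC +y is up).
    Rect r;
    r.left   = static_cast<int>(std::floor((minX * 0.5f + 0.5f) * size));
    r.right  = static_cast<int>(std::ceil((maxX * 0.5f + 0.5f) * size));
    r.top    = static_cast<int>(std::floor((0.5f - 0.5f * maxY) * size));
    r.bottom = static_cast<int>(std::ceil((0.5f - 0.5f * minY) * size));
    return r;
}
```

On Direct3D 9 the result would presumably be copied into a `RECT` and passed to `IDirect3DDevice9::SetScissorRect`, with `D3DRS_SCISSORTESTENABLE` turned on; an empty rectangle means the face can be skipped altogether.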

I have seen that pipeline diagram too and it never made sense to me...It seems like it's easy enough for the hardware to do the various tests before the pixel shader is executed - except for the alpha test, and the depth test in the case of a pixel shader which writes to oDepth. And, even if pixels are being processed in parallel, it will be common for all of them to fail the z test, for instance.

In practice it does seem like it is doing tests before invocation of the pixel shader, since I have implemented stencil tests and scissor tests in various situations to save per-pixel work and have noticed significant speedups - more than I would expect from just frame buffer bandwidth. I guess I can't know for sure though. I'd go ahead and try your optimization, I think there's a good chance it will speed things up.

The card manufacturers often advise using a Z Fill pass and I think they imply that it saves pixel shader invocations.

I'd be very interested in hearing some definitive information about this from someone who knows though...

Quote:
Original post by ganchmaster
The card manufacturers often advise using a Z Fill pass and I think they imply that it saves pixel shader invocations.


Actually, because this is being applied to shadow maps, there is only a single pass being done, so a Z Fill pass is pretty much out of the question. It'd be absolutely redundant, since once I conduct all of these optimizations I'll be rendering things to the shadow map in front-to-back.

Yeah, I'm sure I've read in nVidia documentation you're meant to avoid changing depth values in pixel shaders if at all possible, exactly because it stops depth being tested before the pixel shader is run and stops hierarchical depth buffers working. As for that pipeline I think that's the theoretical one (with the exception of the scissor test, I can't see why that has to come after the pixel shader), i.e. it may not be possible to do Z/stencil test before pixel shading.

Quote:
Original post by Cypher19
Quote:
Original post by ganchmaster
The card manufacturers often advise using a Z Fill pass and I think they imply that it saves pixel shader invocations.


Actually, because this is being applied to shadow maps, there is only a single pass being done, so a Z Fill pass is pretty much out of the question. It'd be absolutely redundant, since once I conduct all of these optimizations I'll be rendering things to the shadow map in front-to-back.


Yeah. I meant that they wouldn't be as keen on everyone using a Z fill pass if it didn't help avoid spending cycles on pixel shading. I was arguing that it is likely that the Z test actually occurs before the pixel shader, but the order in which I wrote the statements was confusing. Certainly it seems pointless for you to use a Z fill pass when rendering shadow maps.

Let us know how your stencil and scissor optimizations come out...I bet the scissor will result in a net savings, but I'm not sure about the stencil.

Actually, I did more looking around regarding stencil/scissor/z test locations in pipelines, and here's the conclusion I came to:

1) Scissor is a definite yes
2) Stencil is a no, but requires more research. It seems like this varies from card to card...
3) But who cares, because I can do an early Z pass with the back of the cam's view frustum, which can act like a stencil! Basically, clear the Z buffer to a value of, say, 0, then with no Z testing being done, render the back half of the view frustum at its normal depth, and then render the SM normally (with <= Z comparisons). It should be noted that basically EVERY card out there does early hierarchical Z culling, allowing large chunks to be quickly dumped or accepted before the shader. However, not every card does early stencil (why not just boggles the mind, imo).
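The masking trick in point 3 can be illustrated with a toy software depth buffer. This is a CPU simulation of the depth logic only, not GPU code; `DepthRow` and its methods are hypothetical names, depth is assumed to lie in [0, 1] with smaller values nearer the light, and a single row of texels stands in for a shadow-map face.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy depth buffer for one row of shadow-map texels (illustration only).
struct DepthRow {
    std::vector<float> z;

    // Step 1: clear Z to 0, so with a <= test nothing in front of the
    // light can pass until the mask has been written.
    explicit DepthRow(std::size_t width) : z(width, 0.0f) {}

    // Step 2: write the depth of the frustum's far side with the depth
    // test disabled (every write lands, like an "always" compare).
    void writeMask(std::size_t x, float backDepth) {
        assert(x < z.size());
        assert(backDepth >= 0.0f && backDepth <= 1.0f);
        z[x] = backDepth;
    }

    // Step 3: normal shadow-map rendering with a <= comparison.
    // Returns true if the fragment survives the depth test.
    bool drawFragment(std::size_t x, float depth) {
        assert(x < z.size());
        if (depth <= z[x]) {
            z[x] = depth;
            return true;
        }
        return false;
    }
};
```

Texels that the frustum never covers keep their cleared value of 0, so any fragment with depth greater than 0 is rejected there, while covered texels accept only fragments at or in front of the frustum's back side. On Direct3D 9 the three steps would presumably map to a Z clear of 0, `D3DRS_ZFUNC` set to `D3DCMP_ALWAYS` for the mask, and `D3DCMP_LESSEQUAL` for the real geometry. Whether the hardware then rejects fragments before shading depends on the card, as discussed in this thread.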

Scissor is logically done late, but in reality it's done early. I believe most modern cards won't even generate fragments outside the scissor region, so that's a definite win.

Often shadow maps are not b/w bound or fill bound, but setup, vertex or attribute bound, b/c there is very little for the rasterizer or shader to do, so I suspect that doing the Z-only frustum, while a cool idea, won't buy you much.

Early Z and early stencil optimizations often rely on limited on-chip resources to be allocated by the driver in order to function. These are typically allocated to the main back buffer's z buffer first, so you may get limited or no fast Z or stencil culling on shadow maps rendered to an off-screen texture.

Have you tried NVPerfHUD to verify that you are fillbound on this part of the scene?

No, I haven't, but I can tell you that I DID get a sweet performance boost by not rendering the convex object (big fat room) to the SM.

I just felt that the fill rate is going to be an issue either now or later.

Quote:
Original post by SimmerD
Have you tried NVPerfHUD to verify that you are fillbound on this part of the scene?


Maybe I'm missing something here, but how would you use NVPerfHUD to verify that you are fillbound at a particular part of the scene? You can use it to look at the fill implications for your entire scene by turning on and off rasterization, but how for a specific part of the scene?

I could comment out all of the render calls to the main scene itself, leaving SM generation as the only part of the render function?

I suppose you don't even need perfhud. Varying the size of the shadow map should give you a good idea.

Quote:
Original post by SimmerD
I suppose you don't even need perfhud. Varying the size of the shadow map should give you a good idea.


Ok, I just wanted to make sure that there wasn't some powerful feature of NVPerfHUD that I was missing.

The only thing about varying the size of the shadow map, BTW, is that making it smaller will also speed up sampling from it...I guess to really do it right, you'd have to render to it but not from it, and then test the effects of the size variation.
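One way to turn the resolution test into a number, assuming you time only the shadow-map generation pass at two different sizes. The helper name `fillScalingExponent` and the rule of thumb in the comments are assumptions for illustration, not something from NVPerfHUD or any vendor tool.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: how strongly the pass time follows texel count.
// Returns about 1.0 when time grows in proportion to the number of texels
// (suggesting fill or pixel work dominates) and about 0.0 when time does
// not change with resolution (suggesting setup or vertex work dominates).
double fillScalingExponent(int sizeA, double msA, int sizeB, double msB) {
    assert(sizeA > 0 && sizeB > 0 && sizeA != sizeB);
    assert(msA > 0.0 && msB > 0.0);
    const double texelsA = static_cast<double>(sizeA) * sizeA;
    const double texelsB = static_cast<double>(sizeB) * sizeB;
    return std::log(msB / msA) / std::log(texelsB / texelsA);
}
```

For example, if going from 512x512 to 1024x1024 quadruples the pass time, the exponent comes out at 1; if the time stays the same, it comes out at 0. Real measurements will land somewhere in between and are noisy, so treat the value as a rough hint only.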

Quote:
Original post by SimmerD
I suppose you don't even need perfhud. Varying the size of the shadow map should give you a good idea.


Hmm, I hadn't considered that. Doing the tests at piss-slow 2048x2048x6 should do the trick juuust fiiine, and certainly help accentuate any optimizations I get.

My experience with this is from OpenGL, not DirectX, but the theory should be the same: in every case where I've used them, scissoring sped things up for me, while stenciling slowed my app down 99% of the time. (My test programs were always very fillrate limited, BTW.)

I think that part of it is due to the fact that when working with a stencil buffer you are still taking a hit to your fillrate simply because you have to fill the stencil buffer first. In addition to that, the stencil must be tested on a per pixel basis like a depth buffer, with whatever equation you give it, while a scissor test only checks against a bounding box (much faster).

In either case, I don't understand why it would run through the pixel shader first, as that seems like a massive waste of fragment processing. I wouldn't doubt that's how it happens, though. In my test app, I noticed that if I rendered only in the area that was stenciled I would get nearly double the framerate of drawing a full-screen quad against the stencil buffer.
