Occlusion queries and shader state changes.

4 comments, last by Ashkan 16 years, 10 months ago
In the renderer I am writing I want to use the occlusion culling algorithm outlined in "GPU Gems 2: Occlusion Queries Made Useful" together with my shader management system. I am going to organise my scene into either an octree or a BSP of some variety to get front-to-back sorting for the occlusion queries, but I also need to sort the polygon sets by shader. The occlusion culling algorithm requires rendering close geometry before far geometry is even tested for occlusion, which ruins any chance of sorting by shader except within leaves themselves. How do I get the occlusion queries and the shader/state sorting to work together smoothly?

I read in the GPU Gems 2 chapter about doing a depth render pass first with no shader usage, which would make further passes more efficient. Surely if only depth geometry is rendered then the occlusion queries would cause stalls, as the rendering is so fast and it could not be spread out over the rest of the frame. Also, how do I go about using the depth information from a depth pass to skip pixels in further passes (as is also mentioned in the chapter)? Or does this just refer to the early outs in the graphics hardware pipeline related to the z-buffering? (i.e. no overdraw in the remaining passes)
That's an interesting question you brought up, and it made me curious too. As for the second part:

Quote:I read in the GPU Gems 2 chapter about doing a depth render pass first with no shader usage, which would make further passes more efficient. Surely if only depth geometry is rendered then the occlusion queries would cause stalls, as the rendering is so fast and it could not be spread out over the rest of the frame. Also, how do I go about using the depth information from a depth pass to skip pixels in further passes (as is also mentioned in the chapter)? Or does this just refer to the early outs in the graphics hardware pipeline related to the z-buffering? (i.e. no overdraw in the remaining passes)


You don't need to add any extra logic to take advantage of the early z cull feature. The early z cull is just extra logic and circuitry built into the GPU so as to increase the chances of detecting fragments that will be overdrawn later. You don't even need to do a depth render pass first to take advantage of it: a great many fragments fail the depth test anyway, and as soon as a given pixel's depth is initialized, all later fragments arriving at that pixel can be rejected early. The fragments that do pass the test then update the value in the depth buffer. Rendering a depth pass first simply maximizes the feature's effectiveness, since it initializes the depth buffer to its final values, so every occluded fragment is discarded as early as possible (as opposed to the single-pass approach, where some occluded fragments can pass the test only to be overdrawn later).

Of course, rendering a depth pass first is not free, but it increases performance when pixel shaders are the bottleneck in your application. Note that GeForce FX, GeForce 6 Series and GeForce 7 Series GPUs render at double speed when rendering only depth or stencil values, but to enable this feature you need to disable color writes and alpha testing. The active depth-stencil surface must not be multisampled either.
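In OpenGL terms, the state setup for such a z-only pass might look something like this sketch (fixed-function-era calls; the Direct3D 9 equivalents would be the D3DRS_COLORWRITEENABLE and D3DRS_ALPHATESTENABLE render states):

```cpp
// Depth-only pass: disable everything except depth writes so the GPU's
// double-speed z-only path can kick in.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE); // no color writes
glDisable(GL_ALPHA_TEST);                            // no alpha test
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);                                // depth writes on

// ... render the scene with a trivial (or no) pixel shader ...

// Color pass: re-enable color writes, make the depth buffer read-only,
// and let LEQUAL pass exactly the fragments laid down by the depth pass.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthFunc(GL_LEQUAL);
glDepthMask(GL_FALSE);
```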
Quote:
You don't need to add any extra logic to take advantage of the early z cull feature. The early z cull is just extra logic and circuitry built into the GPU so as to increase the chances of detecting fragments that will be overdrawn later.

Okay, that's kind of what I was beginning to work out by the end of writing my question :)
Quote:
Note that the GeForce FX, GeForce 6 Series, and GeForce 7 series GPUs render at double speed when rendering only depth or stencil values, but to enable this feature you need to disable the color writes and alpha tests.

Surely the speed-up depends on the cost of the shaders you are using? Or do you mean twice the base speed, with no materials or lighting?
Thanks for the reply!
Any ideas on getting the front-to-back sorted rendering required for occlusion culling to work with shader/state sorting? Or is doing the occlusion pass first and then the render pass the only way?
It's been quite a while since I read that article, so I picked up the book and flipped through the pages, when it suddenly struck me...

Why would anyone need to go through that hell, traversing the hierarchy and issuing all those queries, twice? That's the whole point of laying down the depth first.

Let's forget about all those tricky algorithms for a moment and start with the simple naive one presented in the earlier sections of the article:
1. Issue occlusion query for the node.
2. Stop and wait for the result of the query.
3. If the node is visible:
   a. If it is an interior node:
      i. Sort the children in front-to-back order.
      ii. Call the algorithm recursively for all children.
   b. If it is a leaf node, RENDER THE OBJECTS CONTAINED IN THE NODE.

The part written in capitals is where I'd like to draw your attention. We don't need to run this whole algorithm again for every subsequent pass just to come up with the very same list of objects. So what we basically do is write a depth shader, use this simple occlusion algorithm to traverse the spatial hierarchy, find the visible objects and
1) STORE THEM IN A RENDER QUEUE for use in subsequent passes
2) render them for the current depth pass

In the actual full-color pass, where we want to render whatever effect we like, we sort the render queue so as to minimize state changes and render the geometry with whatever effect we have in mind.

The good news is that:
1) We don't need to sort by anything in the first pass, since all objects are rendered with the same depth shader.
2) As mentioned before, you can render this pass at double speed if you enable the feature by disabling color writes, alpha tests and multisampling (this last one matters when using render-to-texture and can be ignored in this context). This is twice as fast as rendering with the same shader with the feature disabled (i.e. with color writes and alpha tests on).
3) We can play it smart and use multistreaming to our advantage. All we need in this pass is the vertex positions, so we can reduce bus traffic and avoid polluting the cache by sending only the positions down the pipeline.

In the subsequent passes we can then sort the obtained render queue and issue the render calls.

The naive algorithm can now be replaced by the more efficient one presented in the article.
That is pretty much what the chapter says!
What is confusing me a bit is:
The point of the algorithm presented in the chapter is to do useful rendering work while waiting for occlusion queries to complete. But if all you can render while waiting is depth, then that rendering will complete much more quickly, and the algorithm will be stalled by the occlusion queries before it has drawn any actual geometry. Or am I misjudging the relative times the depth render and the occlusion queries take?
It doesn't really matter; I'm going to try it anyway at some point. I'm just looking for a heads-up on implementing this, really.
If that's what the chapter says, then that's what the chapter says. I'm not the one claiming this algorithm yields performance gains. The authors do, and they provided enough proof both for the editors to include their article in the book and for readers to justify the effort.

This topic is closed to new replies.
