 Occlusion Culling Using DirectX 9
Nice timing!

When I first read this article on your (circlesoft) website, I could not run the demo for I did not have hardware that supported DX9 features. Now I have, sweeeeeet!

I'm not back. I'm just visiting.

User Rating: 1469

Good intro, but doesn't say much more than the docs do.

Maybe a Part II that covers how to use this more efficiently (rendering for real while waiting for the card to respond).

User Rating: 1118

Occlusion culling as implemented by the drivers is notoriously slow and tends to introduce bubbles in the pipeline, because the programmer has to query the pixel count explicitly. I would like to propose an alternative.
What if the triangle rasterization function kept a small counter in a GPU register?
Let me try to explain:

set the pixel count register to 0

draw whatever you like, 2k triangles or a single triangle,
using normal glVertex3f / display lists or whatever

read back the register holding the number of pixels written

No query, no double rendering; only a variable is read back.

The counter is a register that is incremented whenever a pixel
is visible, and it lives inside the very same
triangle rasterization routine.

I don't see any problem that would prevent this from being considered
at the next ARB meeting.

Am I thinking about this wrong?

User Rating: 1021

I think what you said, ketek, is exactly what is going on in the DX occlusion query; that is why it is called a hardware occlusion query. Issuing the query sets the hardware counter to zero and enables counting of accepted pixels; ending the query stops the counting and tells the GPU to send the data to the driver as soon as it is available, which is when the card finishes rendering the data between query start and stop. The while loop with query->GetData simply waits in software until the data is available…

Probably if the whole process was implemented by the drivers it would be called software occlusion query…

Perhaps I am wrong, but I always thought that the problem with hardware occlusion queries was CPU/GPU parallelism. Since the GPU buffers commands, it can even be rendering several frames behind the CPU. In such cases, waiting for the query to return its result forces the CPU and GPU to sync, and while waiting, the CPU loses precious cycles. DX allows asynchronous queries, but in that case they are rather useless, especially if the occlusion data can arrive several frames late…

Can anyone support this theory?


User Rating: 1046

Maybe my concept is not so clear, sorry; English is not my native language. Let me try to explain better.
In occlusion culling you first have to render a bounding box, bounding sphere, or whatever for the object, and then issue the query. Now imagine that, by default, a register is incremented during normal triangle rasterization whenever a pixel is visible.
I'm not talking about a query; I'm talking about the normal rasterization cycle. You have to perform only two operations:
reset this register and read it back, at any point in the pipeline. I'm not talking about two separate methods with concurrent CPU/GPU work, just a simple register incremented during the normal rasterization function for a triangle.
Any thoughts?

User Rating: 1021

Nice article - straightforward and to the point. I think this is a nice introduction to the concept.

One thing that comes to mind, however, is some possible optimizations using a hierarchy or PVS style of structure. I believe there were some discussions a while back on this topic with the venerable YannL (all hail!!) in which an adaptive binary tree was used for this technique. I believe he said that they hand-wrote an ASM rendering engine for this specific purpose that they execute in parallel with the GPU.

As an optimization, this makes much more sense to me than trying to optimize at the pixel level (could be wrong though...). It would be possible to cull an entire tree of geometry (occluders and all) because I know that a given occluder hides other occluders from a given vantage point, which implies that the geometry they represent could be culled too. I think that would steer you away from the hardware solution, but it would be interesting to see if, and under what conditions, you would get a speedup. I would predict both techniques are useful at different times, depending on the potential scene content.

I have not thought this entirely through (as well as it being late) but I just thought I would toss my hat into the ring, as well as compliment the author.


Cheers -

#dth-0





User Rating: 1121

Quote:
Original post by ketek
Maybe my concept is not so clear, sorry; English is not my native language. Let me try to explain better.
In occlusion culling you first have to render a bounding box, bounding sphere, or whatever for the object, and then issue the query. Now imagine that, by default, a register is incremented during normal triangle rasterization whenever a pixel is visible.
I'm not talking about a query; I'm talking about the normal rasterization cycle. You have to perform only two operations:
reset this register and read it back, at any point in the pipeline. I'm not talking about two separate methods with concurrent CPU/GPU work, just a simple register incremented during the normal rasterization function for a triangle.
Any thoughts?


Halting the pipeline halfway usually equates to a pipeline flush, which is very costly. The occlusion query pass will not cause such a flush, and it is fast because it skips a lot of the operations that the normal pipeline performs. We need to understand that the purpose of the occlusion query is to cut down the geometric data that we are going to pump into the full rasterizing pipeline, not the other way round. Hence, doing the occlusion query during the full rasterizing process is probably out of the question due to the obvious performance penalties.


-Hun Yen Kwoon

User Rating: 1052

Hey guys,

This article is actually quite old. I wrote it back in March, and since then, I've learned a thing or two about hardware occlusion culling.

So, I've been thinking about writing a Part II, going over some optimizations. The technique presented in this article is actually very naive.

Here are some of the preliminary things I've been thinking about:

(1) Use an array of IDirect3DQuery9's (100-200 of them seems to work well, depending on the number of objects you are using). First, draw the bounding meshes of all the objects, and use a separate query for each. After all objects are drawn, go back and read the query data. At this point, it should be ready, and you don't have to waste any CPU cycles in the while loop.

(2) Batch render all of the bounding meshes for the preliminary render. This will save many DrawIndexedPrimitive (DIP) calls.

(3) Couple this occlusion technique with another culling technique, such as octrees. That way, you aren't wasting valuable DIP calls for an object that isn't even close to being in the scene.

(4) Use occlusion culling in an offline method, coupled with PVS. The occlusion tool will take the world, render all possible viewpoints, and export the PVS data. The engine will then load the PVS data, and use it accordingly. This is nice, because it's very accurate (you won't experience overdraw/underdraw), and it replaces some complicated offline algorithms. It should also be very flexible for both indoor and outdoor environments. However, this is a little complicated to implement in an article.

(5) Only test the objects that are above a certain triangle count. This is to make sure that actually rendering the full object is not faster than doing all of the occlusion stuff.
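
To make idea (1) a bit more concrete, here is a minimal sketch of the two-pass pattern: issue every query first, then read the results back afterwards. FakeQuery, batchedOcclusionTest, and the drawBounds callback are hypothetical names invented for illustration, not part of Direct3D. In a real renderer, begin() and end() would roughly correspond to IDirect3DQuery9::Issue with D3DISSUE_BEGIN and D3DISSUE_END, and tryGetPixels() to a GetData call; the fake below just lets the sketch compile without the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for IDirect3DQuery9 (an assumption for this sketch).
// simulatedPixels plays the role of the GPU's visible-pixel counter.
struct FakeQuery {
    unsigned simulatedPixels = 0;  // pixels that "passed" the depth test
    bool finished = false;         // true once the fake GPU has a result

    void begin() { finished = false; }
    void end()   { finished = true; }  // the fake completes immediately

    // Non-blocking read: returns false while the result is still pending.
    bool tryGetPixels(unsigned& out) const {
        if (!finished) return false;
        out = simulatedPixels;
        return true;
    }
};

// Pass 1 issues one query per bounding mesh; pass 2 reads the results back.
// drawBounds(i) is a caller-supplied callback that draws bounding mesh i.
template <typename Query, typename DrawBounds>
std::vector<bool> batchedOcclusionTest(std::vector<Query>& queries,
                                       std::size_t objectCount,
                                       DrawBounds drawBounds) {
    assert(queries.size() >= objectCount);  // one query per object

    // Pass 1: wrap each bounding-mesh draw in its own query, no readback yet.
    for (std::size_t i = 0; i < objectCount; ++i) {
        queries[i].begin();
        drawBounds(i);
        queries[i].end();
    }

    // Pass 2: collect results. The hope is that the early queries have
    // completed by now; a query that is still pending spins here.
    std::vector<bool> visible(objectCount, true);
    for (std::size_t i = 0; i < objectCount; ++i) {
        unsigned pixels = 0;
        while (!queries[i].tryGetPixels(pixels)) {
        }
        visible[i] = pixels > 0;
    }
    return visible;
}
```

Calling batchedOcclusionTest(queries, objectCount, drawFn) returns one flag per object, and only objects flagged true would get their full mesh drawn. The size of the query array (100-200 above) is left to the caller.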

The biggest killer for this method is all of the DIP calls it sucks up. The rendered meshes actually take up 3 calls apiece, which is *definitely* not good.

If anybody has any suggestions, I'd love to hear them. Thanks for the compliments.


Dustin Franklin ( circlesoft :: KBase :: Mystic GD :: ApolloNL )

User Rating: 1712

Part 2 would be a very good idea IMHO...

Anyway, I was also thinking about a possible optimization for this - wouldn't it be sufficient to render the bounding meshes only once provided they would be sorted according to their distance from the observer?

You would render them starting from the nearest to the most distant. For each mesh you could test its occlusion immediately after first render since all objects that can occlude it would already be rasterized. Of course you would start the queries from the second object since the nearest one could not be occluded ;)
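
A minimal sketch of that ordering step, assuming each object can be approximated by a single center point. Vec3, Object, and sortFrontToBack are made-up names for illustration only. Squared distance is used because only the relative order matters.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical minimal types for this sketch.
struct Vec3 { float x, y, z; };
struct Object { int id; Vec3 center; };

inline float distanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Sorts objects so that the one nearest to the eye position comes first.
inline void sortFrontToBack(std::vector<Object>& objects, const Vec3& eye) {
    auto nearer = [&eye](const Object& a, const Object& b) {
        return distanceSquared(a.center, eye) < distanceSquared(b.center, eye);
    };
    std::sort(objects.begin(), objects.end(), nearer);
    // Debug-only sanity check that the ordering holds.
    assert(std::is_sorted(objects.begin(), objects.end(), nearer));
}
```

After sorting, objects[0] is the nearest and would be drawn without a query. Note that ordering by center point is only an approximation: large or overlapping objects can still occlude each other in ways the center distance does not capture.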

What do you think?

User Rating: 1046

That could be good, provided you sort the static objects at load-time and not run-time. However, as soon as you introduce dynamic objects, you have to sort every time an object is moved. Maybe you could only re-sort the objects close to the changed object, though. That way, you aren't rebuilding the entire list every time one object is changed.

User Rating: 1712

Turn off color writes. All you need to know is if a pixel would have been rendered or not, so color writes are superfluous. Maybe you did turn them off, but I didn't notice it...

User Rating: 1046

I modified your code; now it issues up to 100 bounding mesh queries before fetching the results, and the FPS is much better. I have a Mobility Radeon 9600, and before this modification the FPS was WORSE with the queries than with brute force only. Now, when I move the cam to a corner and look at all the trees, with the nearest tree occluding everything, I get about 78 fps with culling and 44 fps without. This can be further refined by introducing some view frustum and hierarchical culling (octree/ABT/whatever). However, I like the results.

User Rating: 1015

What framerates are you guys getting on the example? I'm getting about 20 to 30 fps on a GeForce FX 5200.

User Rating: 1033

Quote:
Original post by Ardor
I modified your code; now it issues up to 100 bounding mesh queries before fetching the results, and the FPS is much better. I have a Mobility Radeon 9600, and before this modification the FPS was WORSE with the queries than with brute force only. Now, when I move the cam to a corner and look at all the trees, with the nearest tree occluding everything, I get about 78 fps with culling and 44 fps without. This can be further refined by introducing some view frustum and hierarchical culling (octree/ABT/whatever). However, I like the results.


Would you please post the modifications here? I'm interested in seeing what you changed.

Thanks.

-Nick

User Rating: 1033

Yeah, even rendering a couple hundred queried objects and then checking the results wouldn't keep the pipeline from stalling. A better trick (as presented in GPU Gems) is to check it all next frame. You'll get a one-frame delay in showing objects, but that's not a big deal as long as you're keeping track of them effectively.
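
One way to read "check it all next frame" is a small visibility cache that is updated with the previous frame's pixel counts before the current frame decides what to draw. This is only a sketch under that assumption, not the GPU Gems implementation; DeferredVisibility and its methods are hypothetical names, and how the pixel counts are fetched from the queries is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache of per-object visibility, one frame behind the queries.
class DeferredVisibility {
public:
    // Everything starts visible so nothing is hidden before the first results.
    explicit DeferredVisibility(std::size_t objectCount)
        : visible_(objectCount, true) {}

    // Call at the start of frame N+1 with the pixel counts of the queries
    // that were issued during frame N (one count per object).
    void applyResults(const std::vector<unsigned>& pixelCounts) {
        assert(pixelCounts.size() == visible_.size());
        for (std::size_t i = 0; i < visible_.size(); ++i) {
            visible_[i] = pixelCounts[i] > 0;
        }
    }

    // Decides whether the full mesh of object i is drawn this frame.
    bool shouldDraw(std::size_t i) const {
        assert(i < visible_.size());
        return visible_[i];
    }

private:
    std::vector<bool> visible_;
};
```

Objects that are currently hidden still need their bounding mesh queried every frame; otherwise they would never become visible again once the camera moves.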

User Rating: 1015

The article says that the bounding meshes are all rendered to the depth buffer, and then you do occlusion tests against that depth buffer.

This is a problem, because the bounding volumes are, by definition, bigger than the real object. Since they are bigger, they can potentially occlude some objects that the real object does not occlude.

You need to compare bounding volumes against the real depth buffer, one computed by rendering the true geometry. See GPU Gems 2 for a good algorithm for making this efficient ("Hardware Occlusion Queries Made Useful").

User Rating: 1015
