From my point of view, occlusion culling is, just like frustum culling, a way to trade some CPU cycles for GPU cycles. Back in the early days of 3D, frustum culling on the CPU was a real time consumer (on the first game I made, I think it was 20%+ of the frame time), probably more than occlusion culling costs nowadays, as games usually don't occupy 100% of all cores; but with just one core and culling down to the polygon level, you really saw the impact of frustum culling.
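To make the CPU side concrete, here is a minimal sketch of per-object frustum culling (sphere against the six frustum planes). The structs and names are invented for illustration; a real engine would also test AABBs and walk a spatial hierarchy instead of a flat list:

```cpp
#include <cassert>
#include <cstddef>

// Plane in the form nx*x + ny*y + nz*z + d >= 0, normal pointing inward.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, r; };

// Returns true if the sphere is completely outside at least one of the
// six planes and can therefore be culled.
bool cullSphere(const Plane planes[6], const Sphere& s) {
    for (std::size_t i = 0; i < 6; ++i) {
        const Plane& p = planes[i];
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.r)
            return true;   // fully behind this plane -> outside the frustum
    }
    return false;          // inside or intersecting all six planes
}
```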
That's why I prefer the CPU solution in general. If you were writing a molecular simulator that occupies 100% of the CPU and just shows some OpenGL spheres, I wouldn't recommend the software version.
The GPU solutions have quite a few issues, in my experience.
Solution 1: every drawcall is predicated by a bounding box, and based on its visibility the actual drawcall is executed or skipped by the hardware.
usage:
- frustum culling (works on the PSP)
- occlusion culling on newer consoles
problems:
- 1. The command buffer is a stream of states that rely on previous settings: you can skip the actual drawcalls, but you still have to process all the state commands. That can generate "bubbles" in the GPU pipeline: the GPU may be chewing through a lot of command buffer, setting shaders, constants, states etc., while skipping all the drawcalls. The ALUs etc. stay idle and you can become command-buffer bound, which is something you really don't want. (With frustum culling on the PSP it can be 80% of the scene that you reject, if you move frustum culling to the GPU.)
- 2. You might have quite a lot of CPU overhead to set all that up (e.g. skinning), even though the HW will just ignore the drawcalls (that would be especially bad with PC APIs).
- 3. For occlusion culling you might need to sort objects front to back, but that can lead to a lot of unnecessary state switches, which might become the bottleneck, as you'd usually sort by state.
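Problem 1 can be illustrated with a toy CPU-side model (the command types and the whole setup are invented, this is not real command buffer code): even when every drawcall is predicated off, the front end still parses the entire stream of state commands while the ALUs sit idle.

```cpp
#include <cassert>
#include <vector>

// Invented command types; a real command buffer is far more varied.
enum class Cmd { SetShader, SetConstants, SetState, Draw };

struct Load {
    int frontEnd;        // commands the command processor has to parse
    int drawsExecuted;   // draws that actually feed the ALUs
};

// drawsVisible == false models "every drawcall predicated off by its bounding box".
Load run(const std::vector<Cmd>& cb, bool drawsVisible) {
    Load l{0, 0};
    for (Cmd c : cb) {
        ++l.frontEnd;                        // every command is still processed
        if (c == Cmd::Draw && drawsVisible)
            ++l.drawsExecuted;               // only non-skipped draws create ALU work
    }
    return l;
}
```

With everything predicated off, `frontEnd` stays at the full command count while `drawsExecuted` drops to zero — that is the bubble.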
Solution 2: clustering objects.
usage:
- you start a query, draw the AABB of a cluster of objects, and get the result next frame. Works on most PC APIs.
problems:
- clustering: it's not trivial to decide which objects to cluster. I've seen some empirical models that generate clusters on developer machines while the developers are testing, submit them to some central PC, and then use those clusters like a PVS, just with dynamic occlusion queries. That makes me wonder why you wouldn't use a real PVS in that case; it would be more deterministic and simpler to implement. Either way, it only works with static objects.
- "ping pong" effects, clustering a big amount of objects leads to cheap test, but if just ONE pixel is visible, you draw all objects, as the testing has a latency and just splitting your cluster into smaller clusters that you want to test can lead to several frames of latency until something is visible. not only would that result in ugly popping of whole chunks of the level, you'd also miss a lot of occluder, which would lead to a lot of drawcalls that will be detected as "hidden" in some of the next frames. so, usually there is no hierarchy and you switch between those two states where all or nothing is visible. with clusters it's common that quite some area is empty, you have no tight volumes around objects, it might lead to ping-pong every frame and to stuttering frames although visually nothing changes. not fun to debug that.
- fillrate: when I implemented this on my GeForce 3 back then, I really saved quite a few drawcalls and especially triangles/vertices, but I became fillrate limited. GPUs won't stop drawing your tons of boxes just because some pixels already passed; they finish the whole job and then give you the pixel count, and that can be a real penalty.
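The "ping pong" point can be sketched with a toy model (struct and numbers invented): the query result is a frame old, so the cluster draws all of its objects one frame late, all or nothing.

```cpp
#include <cassert>

struct Cluster {
    int objectCount;
    bool visibleLastFrame = false;   // query result arrives one frame late
};

// queryPixelsThisFrame: what the AABB query of the cluster returns this frame.
// Returns how many objects are drawn this frame.
int drawCluster(Cluster& c, int queryPixelsThisFrame) {
    int drawn = c.visibleLastFrame ? c.objectCount : 0;  // all or nothing
    c.visibleLastFrame = (queryPixelsThisFrame > 0);     // consumed next frame
    return drawn;
}
```

An AABB that flips between one visible pixel and zero makes the cluster draw all 100 objects on exactly the frames where nothing is visible, and nothing on the frames where a pixel is: visually nothing changes, but the frame cost ping-pongs.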
Solution 3: one object, one query, use the result for the next frame.
usage: I think I saw something like that in UE3
problems:
- a query isn't free: some hardware limits the number of queries it can buffer in a frame, and also the number of queries that can be 'in flight' in the pipeline. Having a lot of cheap objects, each behind its own query, might make you hit this limit. (And that is really an arse of an issue: if you cull more, it gets faster; cull less, everything gets slower; remove culling entirely and it's even slower -> everything is fine? Just until some GPU vendor writes you a mail about how stupid you are ;) ). But I think the UE3 guys use this only for big chunks of the level, with few drawcalls, and there it seems to be fine.
- similar to the ping-pong problem: one frame the query says an object is occluded, so you skip it the next frame; but by then it might not be occluded anymore, and since you didn't draw it, it also doesn't occlude anything behind it -> you draw everything the frame after. In UE3 you can see it e.g. in "Shangri La" (part of the demo): there is a fence, and if you strafe left/right as spectator, the objects behind the fence flicker.
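That feedback loop can be sketched with a two-object toy model (fence plus one object behind it; everything here is simplified and invented): objects are drawn based on last frame's query, and new queries are tested only against what was actually drawn this frame.

```cpp
#include <cassert>

struct Scene {
    bool fenceVisibleQ  = true;   // last frame's query result for the fence
    bool behindVisibleQ = true;   // ... for the object behind the fence
};

// fenceOccludedByWorld: whether something else hides the fence this frame
// (e.g. you strafe behind a pillar for a moment).
// Returns whether the object behind the fence is drawn this frame.
bool frame(Scene& s, bool fenceOccludedByWorld) {
    bool fenceDrawn  = s.fenceVisibleQ;     // draw based on last frame's queries
    bool behindDrawn = s.behindVisibleQ;
    // issue new queries against what was actually drawn this frame:
    s.fenceVisibleQ  = !fenceOccludedByWorld;
    s.behindVisibleQ = !fenceDrawn;         // only occluded if the fence was drawn
    return behindDrawn;
}
```

A single frame where the fence itself is hidden is enough: two frames later the object behind it pops up for one frame even though the fence is back, which is exactly the flicker you see when strafing.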
Solution 4: D3D11, using unordered buffers and DrawIndirect.
usage: just a crazy idea I had; I think nobody uses it yet. You draw the simple meshes that you would usually tessellate in a "prepass", each writing out its drawcall ID to the framebuffer. In the next pass you use a kernel that sets bits in an unordered buffer based on the drawcall IDs; a third pass creates the buffer that is passed to draw indexed instanced indirect, based on the 0s and 1s in the unordered buffer.
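A CPU emulation of what the second and third passes would compute (the struct mirrors the 5-uint argument layout of DrawIndexedInstancedIndirect; the visibility bits are a plain array standing in for the unordered buffer, and on the GPU the compaction would be a prefix-sum kernel rather than a loop; note the real D3D11 call also consumes one argument struct per call):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Matches the argument layout of D3D11 DrawIndexedInstancedIndirect.
struct DrawIndexedArgs {
    uint32_t indexCountPerInstance;
    uint32_t instanceCount;
    uint32_t startIndexLocation;
    uint32_t baseVertexLocation;   // INT in the real API
    uint32_t startInstanceLocation;
};

// visible[i] is the bit the kernel set for drawcall ID i in the prepass.
// Keeps only the visible draws, in order (stream compaction).
std::vector<DrawIndexedArgs> buildIndirectArgs(
        const std::vector<bool>& visible,
        const std::vector<DrawIndexedArgs>& allDraws) {
    std::vector<DrawIndexedArgs> out;
    for (std::size_t i = 0; i < allDraws.size(); ++i)
        if (visible[i])
            out.push_back(allDraws[i]);
    return out;
}
```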
pro/contra: I have no experience with that; unordered buffers seem to work quite fine in general.
There are quite a few smart D3D11 users here who might want to give it a try; no idea if it will work out at all ;)
@Tordin
I accept the challenge ;). If I draw just one triangle per object, I get 49 fps with 135,000 drawcalls per frame. Then we start to be API/kernel limited, I guess, unless we want to use instancing. Btw, my benchmarks are from the OpenGL demo, but I have a D3D9 version running as well.
If you intend to compare something more, you have to specify more precisely how the scene and camera are set up, so I can give you numbers.
there is no preview button here :/