Is my frustum culling slow ?

Started by
44 comments, last by lipsryme 11 years ago

Oh my god, I've finally done it :) ... using the btAlignedArray from the Bullet Physics SDK.

And it is blazingly fast. The culling alone takes about 0.03 ms for 10k AABBs.

50k AABBs take around 0.53 ms, and 100k AABBs take 1.25 ms.
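For readers following along, here is a minimal scalar sketch of the kind of flat, brute-force loop being timed above. It is not the OP's code: the names `Plane`, `Aabb`, `Frustum`, `isVisible` and `cullFlat` are made up for illustration, the center/half-extent box layout is an assumption, and a plain `std::vector` stands in for the Bullet container. A SIMD version would test several boxes per iteration, but the logic is the same.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plane nx*x + ny*y + nz*z + d = 0, with the normal pointing into the frustum.
struct Plane { float nx, ny, nz, d; };

// Hypothetical box layout: center + half-extents, padded to two 16-byte rows.
struct alignas(16) Aabb {
    float cx, cy, cz, pad0;
    float ex, ey, ez, pad1;
};

struct Frustum { Plane planes[6]; };

// Conservative test: returns false only when the box lies entirely behind
// at least one plane. Boxes near frustum corners may be reported visible.
inline bool isVisible(const Frustum& f, const Aabb& b) {
    for (const Plane& p : f.planes) {
        // Signed distance from the box center to the plane.
        const float dist = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d;
        // Half-size of the box projected onto the plane normal.
        const float radius = std::fabs(p.nx) * b.ex
                           + std::fabs(p.ny) * b.ey
                           + std::fabs(p.nz) * b.ez;
        if (dist < -radius) return false;
    }
    return true;
}

// Writes the indices of all boxes that pass the test into `visible`.
inline void cullFlat(const Frustum& f,
                     const std::vector<Aabb>& boxes,
                     std::vector<std::size_t>& visible) {
    visible.clear();
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        if (isVisible(f, boxes[i])) visible.push_back(i);
    }
}
```

The loop touches the boxes linearly in memory, which is a large part of why a flat pass like this can be so cheap per box.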

Big thanks to everyone in here!

Remember, even the fastest frustum culling loop will be no match for hierarchical culling.

Most of a world is made of static geometry, so clustering it and merging the clusters' AABBs into some kind of hierarchical structure is only a preprocessing step. An octree can be a good choice: each node splits into 8 sub-nodes (neat, that is two 4-box SIMD culling tests). You then get useful information for traversal: a node is fully visible, invisible, or partially culled. For example, you can push the primitives of a fully visible node without culling them individually.
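A rough sketch of that three-state traversal, under some assumptions: the `Node` layout (child indices plus a list of primitive ids), the names `classify`, `collectAll` and `cullHierarchy`, and the center/half-extent `Aabb` are all hypothetical and only there to show the idea. A real octree would have a fixed 8 children per node and would test them with SIMD.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Plane { float nx, ny, nz, d; };          // normal points into the frustum
struct Aabb { float cx, cy, cz, ex, ey, ez; };  // center + half-extents (assumed)
struct Frustum { Plane planes[6]; };

enum class Visibility { Outside, Intersecting, Inside };

// Classifies a box as fully outside, fully inside, or straddling the frustum.
inline Visibility classify(const Frustum& f, const Aabb& b) {
    bool fullyInside = true;
    for (const Plane& p : f.planes) {
        const float dist = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d;
        const float radius = std::fabs(p.nx) * b.ex
                           + std::fabs(p.ny) * b.ey
                           + std::fabs(p.nz) * b.ez;
        if (dist < -radius) return Visibility::Outside;  // entirely behind a plane
        if (dist < radius) fullyInside = false;          // crosses this plane
    }
    return fullyInside ? Visibility::Inside : Visibility::Intersecting;
}

// Hypothetical node: bounds, child node indices, and primitive ids it owns.
struct Node {
    Aabb bounds;
    std::vector<std::size_t> children;
    std::vector<std::size_t> primitives;
};

// Pushes every primitive under `idx` without any further testing.
inline void collectAll(const std::vector<Node>& nodes, std::size_t idx,
                       std::vector<std::size_t>& out) {
    const Node& n = nodes[idx];
    out.insert(out.end(), n.primitives.begin(), n.primitives.end());
    for (std::size_t c : n.children) collectAll(nodes, c, out);
}

inline void cullHierarchy(const Frustum& f, const std::vector<Node>& nodes,
                          const std::vector<Aabb>& primBounds, std::size_t idx,
                          std::vector<std::size_t>& out) {
    const Node& n = nodes[idx];
    switch (classify(f, n.bounds)) {
    case Visibility::Outside:
        return;                          // skip the whole subtree
    case Visibility::Inside:
        collectAll(nodes, idx, out);     // accept the whole subtree untested
        return;
    case Visibility::Intersecting:
        break;                           // need to look closer
    }
    for (std::size_t p : n.primitives) {
        if (classify(f, primBounds[p]) != Visibility::Outside) out.push_back(p);
    }
    for (std::size_t c : n.children) cullHierarchy(f, nodes, primBounds, c, out);
}
```

Only partially culled nodes pay for per-primitive tests; fully visible and fully invisible subtrees are resolved with a single box test.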

And you still seem to misunderstand the difference between type size and memory alignment, so study them again and again. Also add a bunch of asserts every time you do an aligned load or write, to catch bugs as soon as possible.
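To make the size-versus-alignment point concrete, here is a small sketch. The names `Vec4`, `AlignedVec4`, `isAligned` and `assumeAligned16` are invented for this example. Both structs have the same size, but only the second one is guaranteed to start on a 16-byte boundary, which is what an aligned SIMD load needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Four floats: 16 bytes in size, but typically only 4-byte aligned.
struct Vec4 { float x, y, z, w; };

// Same four floats, same size, but every instance starts on a 16-byte boundary.
struct alignas(16) AlignedVec4 { float x, y, z, w; };

static_assert(sizeof(Vec4) == sizeof(AlignedVec4), "size is unchanged by alignas here");
static_assert(alignof(AlignedVec4) == 16, "alignment is what alignas changes");

// True if the address is a multiple of `alignment` (in bytes).
inline bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Call this right before an aligned load or store so that a misaligned
// pointer fails loudly in debug builds instead of crashing somewhere obscure.
inline const float* assumeAligned16(const float* p) {
    assert(isAligned(p, 16) && "aligned SIMD access needs a 16-byte aligned address");
    return p;
}
```

A pointer into the middle of an aligned buffer (say `buf + 1` for a `float` array) has the right type and points at valid data, yet fails the check, which is exactly the kind of bug the asserts are meant to catch.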

Hierarchical culling can be overkill in many cases, and the added overhead and complexity can then be a net loss.
In the Battlefield 3 culling paper, their parallel brute-force algorithm was 3 times faster than the hierarchical one, the code size was 80% smaller, and because of this simplicity further optimizations were easier.
http://dice.se/publications/culling-the-battlefield-data-oriented-design-in-practice/
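As a loose illustration of what "parallel brute force" can look like (this is not DICE's implementation, which is described in the linked talk), the flat array can be split into contiguous chunks with one worker per chunk. Everything here is an assumption made for the sketch: the names `isVisible` and `cullParallel`, the use of `std::thread`, the AABB test, and the per-worker output lists that are merged at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

struct Plane { float nx, ny, nz, d; };          // normal points into the frustum
struct Aabb { float cx, cy, cz, ex, ey, ez; };  // center + half-extents (assumed)
struct Frustum { Plane planes[6]; };

inline bool isVisible(const Frustum& f, const Aabb& b) {
    for (const Plane& p : f.planes) {
        const float dist = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d;
        const float radius = std::fabs(p.nx) * b.ex
                           + std::fabs(p.ny) * b.ey
                           + std::fabs(p.nz) * b.ez;
        if (dist < -radius) return false;
    }
    return true;
}

// Splits `boxes` into contiguous chunks, culls each chunk on its own thread,
// then concatenates the per-chunk results in index order.
inline std::vector<std::size_t> cullParallel(const Frustum& f,
                                             const std::vector<Aabb>& boxes,
                                             std::size_t workerCount) {
    workerCount = std::max<std::size_t>(1, workerCount);
    const std::size_t chunk = (boxes.size() + workerCount - 1) / workerCount;

    // One output list per worker, sized up front, so threads never share a vector.
    std::vector<std::vector<std::size_t>> partial(workerCount);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);

    for (std::size_t w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(w * chunk, boxes.size());
        const std::size_t end = std::min(begin + chunk, boxes.size());
        workers.emplace_back([&f, &boxes, &partial, w, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                if (isVisible(f, boxes[i])) partial[w].push_back(i);
            }
        });
    }
    for (std::thread& t : workers) t.join();

    std::vector<std::size_t> visible;
    for (const auto& p : partial) visible.insert(visible.end(), p.begin(), p.end());
    return visible;
}
```

Spawning threads per call is only for brevity; an engine would hand the chunks to an existing job system instead.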

I know that talk perfectly well, but the 360/PS3 way is no longer relevant. Being in-order RISC processors, they were a pain in the ass for code size and branching. Things that frustum culling needs as much as possible, like the float compare instruction, were really costly too. Add the need for a simple data layout to avoid saturating the DMA communication with the SPUs, and the Battlefield choice may be legitimate in their context.

But today's and tomorrow's engines will again scale up the number of primitives to manage and draw by a good amount. Hierarchical layouts will easily strike back on more modern hardware that handles branching more efficiently.

And being hierarchical does not mean we have to go down to single-primitive leaves; everything is in the balance between the overhead of the hierarchy and the raw test. Maybe instead of storing hundreds of primitives in leaves, we will store thousands and dispatch each leaf to a separate thread. In the end, only profiling sessions and testing will give the answer, an answer that depends on the context of each game.

And strangely, on a previous game I worked on, with a lot of instances to manage, one optimisation I did was to stop the hierarchical split of the culling once nodes reached too small a size, because it was more efficient. So no, I am not pro-hierarchical, I am just pro-performance :)

@OP - Could you share what hardware you are running this on? Those are some awesome results.

Core i5 2.8 GHz quad core (I think it's a Nehalem).
Mind you, the frustum cull was basically the only thing inside my render loop, and the results might not have been measured 100% perfectly.
