I have implemented a tilebased and clustered deferred shading pipeline and am currently profiling and optimizing.
I am constructing a LBVH for all the lightsources every frame from scratch.
I'm doing this like described here: http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/
(1) compute a conservative AABB for all lights
(2) calculate Morton codes for each light
(3) sort the lights along their corresponding Morton codes
(4) construct a LBVH using a sparse tree representation
This works pretty well and uses all together 1 - 1.5 ms for up to 100 000 lights.
On the other hand the traversal of the LBVH consumes lots of time, primarily because I have to do this 2 times. 1st time to calculate the number of lights in each cluster, and the 2nd time after I have partitioned my indices texture to put the actual light indices into.
I have several different implementations for the traversal...
traversal with stack: http://pastebin.com/GZnzGrPw
stackless traversal: http://pastebin.com/4NE5UqVG
there are more variants to the stackless traversal (with less texture memory access) but I think they are not relevant for the time being.
My question is now, why is the stackless traversal faster (2 times as fast!) than the one with stack, even if there are AL LOT more texture memory reads. I figured the order in which the nodes are traversed is the same.
My theory goes as follows:
GPUs utilize fast context switches to gain performance. The stack (the 32 field array) uses up a lots of registers and the context switch is actually pretty slow. Although this is pure guessing an I have no proof what so ever.
I tried to squeeze in as much info without bloating the post, so thanks to all those who have read up to this line
I'm very interested in your explanations as well.
Edit: I should have told you how the tree is represented in memmory...
It's a sparse tree representation with a texture holding the 2 children for the node N at the position N
The same with the parents, the parent of Node N is found by accessing the texture at position N