Simple Alternative to Clustered Shading for Thousands of Lights


In my experience with clustered, you store all the lights in the buffer in a very similar manner, except each cluster has a linear section of the buffer.

Yes, I think that this would be the more common way of doing it. I currently use the uniform way to do it in my engine.
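Roughly, the layout being described looks like this (a minimal sketch; the names and the offset/count encoding are my illustration, not either engine's actual code):

```cpp
#include <cstdint>
#include <vector>

// Illustrative clustered layout: each cluster owns one contiguous slice of a
// global light-index list, so per-pixel iteration is a simple linear scan.
struct Cluster {
    uint32_t offset;  // first entry in lightIndices for this cluster
    uint32_t count;   // number of lights affecting this cluster
};

struct ClusteredLights {
    std::vector<Cluster>  clusters;      // one entry per 3D cluster cell
    std::vector<uint32_t> lightIndices;  // concatenated per-cluster slices
};
```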


So the main additional cost is doing a bounding sphere test per light that you visit, and iterating through the light array as a linked-list rather than a simpler linear array traversal.

It is more like a tree structure, isn't it? Traversing a tree will, even if you narrow down to the candidates quite quickly, still need to jump around quite often. The virtual nodes in this tree (which do not represent a light) add at least some extra cost, and if you need to traverse partway through a tree with 3000 leaves, you could get a few cache misses, and this for every pixel.

I would be interested in the expected overhead: would it be roughly x1.5, x2.0, x3.0? The system could perform better in a more realistic game setup (a few hundred visible lights) with a much flatter tree structure.

Nevertheless, I find this approach very sexy. I could think about using it to add some extra eye candy to the forward-rendered stuff (particles) as an option for people with high-performance hardware.

PS: if you do the light calculation in the vertex shader, you can push out lit particles with ease and with much better performance.

PPS: @fries: how about posting your approach as an article on gamedev?


Yes, I think that this would be the more common way of doing it. I currently use the uniform way to do it in my engine.

Sorry I'm taking the thread off topic here!
Do you have a single UBO containing an array with info on all clusters, and then you index that array?

AFAIK, on modern cards, that array indexing will be very similar (if not equal) in cost to fetching from a buffer.

Traversing a tree will, even if you narrow down to the candidates quite quickly, still need to jump around quite often. The virtual nodes in this tree (which do not represent a light) add at least some extra cost, and if you need to traverse partway through a tree with 3000 leaves, you could get a few cache misses, and this for every pixel.

Ah yeah, I forgot about the non-leaf nodes. The nice thing about the stackless traversal is that it's still linear in the best-case scenario, and the worst case just adds more gaps/skips into the linear iteration. I imagine it will still cache somewhat well.
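To make the gaps/skips idea concrete, here is a minimal C++ sketch of such a stackless traversal, assuming nodes stored in depth-first order where each node carries a skip index (the layout and names are illustrative, not fries' actual code):

```cpp
#include <cstdint>
#include <vector>

// Nodes stored in depth-first order; each carries a "skip" index that jumps
// past its entire subtree when the bounds test fails.
struct BvhNode {
    float   cx, cy, cz, radius;  // bounding sphere
    int32_t lightIndex;          // >= 0 for leaves, -1 for internal nodes
    int32_t skip;                // next node to visit on a miss
};

inline bool sphereContains(const BvhNode& n, float px, float py, float pz) {
    const float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
    return dx * dx + dy * dy + dz * dz <= n.radius * n.radius;
}

// A hit falls through to the next node in depth-first order (its first
// child); a miss jumps forward via skip. Best case is a pure linear scan;
// the worst case just adds forward jumps. There is never any backtracking,
// hence no stack.
template <typename ShadeFn>
void traverse(const std::vector<BvhNode>& nodes,
              float px, float py, float pz, ShadeFn&& shadeLight) {
    int32_t i = 0;
    const int32_t count = static_cast<int32_t>(nodes.size());
    while (i < count) {
        const BvhNode& n = nodes[i];
        if (sphereContains(n, px, py, pz)) {
            if (n.lightIndex >= 0) shadeLight(n.lightIndex);  // leaf: shade it
            ++i;                                              // descend
        } else {
            i = n.skip;                                       // skip subtree
        }
    }
}
```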


Sorry I'm taking the thread off topic here!
Do you have a single UBO containing an array with info on all clusters, and then you index that array?

I'm still using OGL2.1, though UBOs are available as an ARB extension. Up until now I have targeted more urgent buffer usages (most recently PBO texture streaming... yeah, I know, but time is a very valuable resource). UBOs in conjunction with animation data or light data sound interesting, especially due to concurrently filling the buffers with multiple threads/jobs... *making mental note for next engine improvement*.

I'm still somewhat curious about glNext and what they will show off next month, so I'm a little bit careful with my coding time, hoping to add support for a modern, slim, cross-platform, open API :)

Hi,

Good to see that you're all talking about my technique :)

I'll have some time this weekend to work on a comparison demo against clustered shading, and I will release the code. It will be targeted at OpenGL. BVH-accelerated shading should work fine without modern GPU features like structured buffers etc., as you should have no problem encoding the light BVH into a texture. Since the traversal is stackless, it should map well to older hardware (no thread-local storage required for a stack).
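For what it's worth, here is a rough sketch of one possible texture packing on GL 2.x class hardware; the two-texels-per-node layout is my assumption, and it relies on GL_ARB_texture_float being available:

```cpp
// Needs <windows.h> before <GL/gl.h> on Windows, plus an extension loader
// for the GL_ARB_texture_float enum on GL 2.1.
#include <GL/gl.h>
#include <cstdint>
#include <vector>

#ifndef GL_RGBA32F_ARB
#define GL_RGBA32F_ARB 0x8814  // from GL_ARB_texture_float
#endif

struct BvhNode {
    float   cx, cy, cz, radius;
    int32_t lightIndex;  // >= 0 for leaves, -1 for internal nodes
    int32_t skip;        // next node index on a failed bounds test
};

// Two RGBA32F texels per node: texel 0 = (center.xyz, radius),
// texel 1 = (lightIndex, skip, 0, 0) stored as floats. Indices up to 2^24
// survive the float round-trip exactly.
GLuint uploadBvhTexture(const std::vector<BvhNode>& nodes) {
    std::vector<float> texels;
    texels.reserve(nodes.size() * 8);
    for (const BvhNode& n : nodes) {
        const float row[8] = { n.cx, n.cy, n.cz, n.radius,
                               float(n.lightIndex), float(n.skip),
                               0.0f, 0.0f };
        texels.insert(texels.end(), row, row + 8);
    }
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    // Width 2 (one node per row) keeps shader-side addressing trivial.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, 2,
                 (GLsizei)nodes.size(), 0, GL_RGBA, GL_FLOAT, texels.data());
    return tex;
}
```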

As for the performance/bandwidth of BVH accelerated vs clustered, we won't really know until I work up a comparison demo, but there are some things to consider:

1. With clustered, for each pixel you loop over a light list in which not every light contributes to every pixel in the cluster; with BVH accelerated, the culling is exact per pixel.
2. With 1000 lights at 20 bytes per light, you would have roughly 24k of node data in the tree (less if you pull the light info out into its own buffer and store only spatial information in the tree). So that's 24k to upload to the GPU every frame for 1000 fully dynamic lights; if they don't move, you don't have to upload anything at all. Either way, 24k is nothing for a modern GPU; your vertex buffers are bigger than that. (See the sketch after this list for the arithmetic.)
3. Clustered shading has setup costs (pixel clustering, light assignment) that BVH accelerated shading does not.
4. Clustered shading potentially jumps around the light buffer (Buffer<Lights> lights in Hodgman's code above) as much as BVH accelerated jumps around the tree data, possibly equalising the bandwidth and cache issues of the two methods even more.
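A quick back-of-the-envelope check of point 2, under the assumption of a binary tree and a 20-byte packed node (the exact total depends on branching factor and on what you keep per node, so treat the number as a ballpark):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed 20-byte packed node: a bounding sphere plus one index.
struct PackedNode {
    float   cx, cy, cz, radius;  // 16 bytes
    int32_t lightOrSkip;         //  4 bytes: light index (leaf) or skip offset
};
static_assert(sizeof(PackedNode) == 20, "packing assumption");

int main() {
    const int lights = 1000;
    const int nodes  = 2 * lights - 1;  // a binary BVH over N leaves has 2N-1 nodes
    const std::size_t bytes = nodes * sizeof(PackedNode);
    std::printf("%d nodes -> %zu bytes (~%zu KB) per full upload\n",
                nodes, bytes, bytes / 1024);  // ~39 KB with this layout
    return 0;
}
```

Whether it comes out at 24k as in the post or ~39 KB as in this layout, it is a tens-of-kilobytes upload either way, which supports the point being made.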

So I'm hoping (lots of hand waving, I don't really know yet...) that with the coherency of nearby pixels traversing similar paths of the BVH, the bandwidth consumption and cache misses will not be too much worse than with clustered.

Hodgman: your idea of using the BVH for light assignment in a grid for clustered is exactly what I was planning to do next! :) I always wondered why they brute-force the light assignment in a lot of those clustered shading demos/papers; using a hierarchy would make it much simpler, and possibly faster: just throw a thread at each cell in the grid and have it traverse the BVH using the grid cell's bounding frustum. Maybe it is the cost of keeping the BVH up to date that prevents people from doing this, but for a scene with mostly static lights, i.e. most games, the BVH update should be insanely fast.
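A sketch of that assignment pass, one cell at a time; for brevity it tests the lights' bounding spheres against a cell AABB rather than the exact frustum, and it reuses the illustrative node layout from the traversal sketch above:

```cpp
#include <cstdint>
#include <vector>

struct Aabb { float min[3], max[3]; };

struct BvhNode {
    float   center[3], radius;
    int32_t lightIndex;  // >= 0 for leaves, -1 for internal nodes
    int32_t skip;        // next node index on a failed bounds test
};

// Standard closest-point test between a sphere and a box.
inline bool sphereTouchesAabb(const float c[3], float r, const Aabb& box) {
    float d2 = 0.0f;
    for (int k = 0; k < 3; ++k) {
        if (c[k] < box.min[k]) {
            const float d = box.min[k] - c[k]; d2 += d * d;
        } else if (c[k] > box.max[k]) {
            const float d = c[k] - box.max[k]; d2 += d * d;
        }
    }
    return d2 <= r * r;
}

// One cell's worth of the assignment pass; on the GPU each cell would run
// this loop in its own thread.
void assignLightsToCell(const std::vector<BvhNode>& nodes, const Aabb& cell,
                        std::vector<int32_t>& outLights) {
    int32_t i = 0;
    const int32_t count = static_cast<int32_t>(nodes.size());
    while (i < count) {
        const BvhNode& n = nodes[i];
        if (sphereTouchesAabb(n.center, n.radius, cell)) {
            if (n.lightIndex >= 0) outLights.push_back(n.lightIndex);
            ++i;
        } else {
            i = n.skip;
        }
    }
}
```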

Ashaman73: I don't know how to post an article on here, but maybe I'll do that at some point, if they're interested in me doing that.


So I'm hoping (lots of hand waving, I don't really know yet...) that with the coherency of nearby pixels traversing similar paths of the BVH, the bandwidth consumption and cache misses will not be too much worse than with clustered.

I've done some testing with a quick BVH implementation to check the number of nodes read. I will not present my results, because the implementation is quick and dirty and would most likely not be representative of your work. Nevertheless, the question that arose was how many nodes you need to touch per pixel on average. E.g. the spheres in a sphere tree get quite big really fast, and all the spheres not tied to a relevant light increase the overhead for every pixel, even unlit ones. So, if you have a pixel lit by 100 lights and an overhead of X nodes, it is quite cheap, but if you have a pixel lit by none or only one light, X nodes is really expensive. A clustered/cell-based shader fits more tightly around the variable number of lights influencing pixel areas. The question is: how high is X under different setups?

Just for measuring this number, I would implement the same search algorithm from the shader on the CPU and do some rendering simulation to get quick statistics.
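Along those lines, a small harness for that simulation might look like this, counting visited nodes per query so X can be histogrammed (again using the illustrative node layout from the traversal sketch above):

```cpp
#include <cstdint>
#include <vector>

// Same illustrative layout as in the traversal sketch above.
struct BvhNode {
    float   cx, cy, cz, radius;
    int32_t lightIndex;  // >= 0 for leaves, -1 for internal nodes
    int32_t skip;        // next node index when the sphere test fails
};

struct PixelStats {
    int nodesVisited = 0;  // total nodes touched by this query
    int lightsHit    = 0;  // leaves whose sphere contains the point
};

// CPU mirror of the shader loop, instrumented with counters.
PixelStats simulatePixel(const std::vector<BvhNode>& nodes,
                         float px, float py, float pz) {
    PixelStats s;
    int32_t i = 0;
    const int32_t count = static_cast<int32_t>(nodes.size());
    while (i < count) {
        ++s.nodesVisited;
        const BvhNode& n = nodes[i];
        const float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
        if (dx * dx + dy * dy + dz * dz <= n.radius * n.radius) {
            if (n.lightIndex >= 0) ++s.lightsHit;
            ++i;
        } else {
            i = n.skip;
        }
    }
    return s;
}

// Run simulatePixel over a grid of sample positions (or exported G-buffer
// positions) and average nodesVisited - lightsHit to estimate the overhead X.
```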


Ashaman73: I don't know how to post an article on here, but maybe I'll do that at some point, if they're interested in me doing that.

Check this out to get some info :)

