
Simple Alternative to Clustered Shading for Thousands of Lights


Nice! I do something similar per-mesh on the CPU in my engine: compute the light intensity at the mesh's position, sort the lights by decreasing intensity, and submit the first N to the mesh's shader. Your technique is per-pixel though...
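For readers who want to try that per-mesh approach, here is a minimal CPU-side sketch; the struct names and the simple inverse-square falloff are assumptions for illustration, not the poster's actual engine code:

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };
struct Light { Vec3 position; float intensity; };

// Approximate a light's contribution at a point with inverse-square falloff.
static float IntensityAt(const Light& l, const Vec3& p) {
    float dx = l.position.x - p.x, dy = l.position.y - p.y, dz = l.position.z - p.z;
    float d2 = dx * dx + dy * dy + dz * dz;
    return l.intensity / (1.0f + d2); // +1 avoids division by zero at the light itself
}

// Sort lights by decreasing intensity at the mesh's position and keep the
// first N, which are then submitted to the mesh's shader as uniforms.
std::vector<Light> SelectTopNLights(std::vector<Light> lights, Vec3 meshPos, size_t n) {
    std::sort(lights.begin(), lights.end(), [&](const Light& a, const Light& b) {
        return IntensityAt(a, meshPos) > IntensityAt(b, meshPos);
    });
    if (lights.size() > n) lights.resize(n);
    return lights;
}
```

A partial sort (`std::partial_sort` or `std::nth_element`) would avoid fully ordering lights that are going to be discarded anyway.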


Nice! I do something similar per-mesh on the CPU in my engine: compute the light intensity at the mesh's position, sort the lights by decreasing intensity, and submit the first N to the mesh's shader. Your technique is per-pixel though...

I think that's how I was doing it a while ago. But BVH Accelerated Shading has the added bonus of not having to switch state between objects, so you can potentially batch a lot of objects together in one draw call.

 

Also, if you already have a hierarchy of lights, implementing BVH Accelerated Shading shouldn't take very long ;)

Edited by fries


Glad you guys like it.

 

I'm going to be working on a clustering comparison demo with code. It should be available in the next few days.


Very interesting and good work :)

 

When taking a look at your technique, you would need to sample memory quite often, compared to a clustered/cell deferred rendering pipeline, where the lights are stored as uniforms/parameters. So, my questions are: how much bandwidth is your approach using? How does it perform on a low-bandwidth system? How much time does it take to rebuild the BVH and upload it per frame (dynamic light sources)? How does it scale with the number of lights? How does it scale with the size of the render buffer?

 

That it works with deferred and forward rendering, and a mix of the two, makes it really interesting, especially for lit particles and transparent surfaces, which are notoriously hard to handle in a deferred shader.


When taking a look at your technique, you would need to sample memory quite often, compared to a clustered/cell deferred rendering pipeline, where the lights are stored as uniforms/parameters.

In my experience with clustered, you store all the lights in the buffer in a very similar manner, except each cluster has a linear section of the buffer.
e.g. pseudo-shader:

struct ClusterLightRange { int start, size; };
struct LightInfo { float3 position; /* radius, colour, etc. */ };

Buffer<ClusterLightRange> clusters;
Buffer<LightInfo> lights;

ClusterLightRange c = clusters[ClusterIdxFromPixelPosition(pixelPosition)];
for( int i = c.start, end = c.start + c.size; i != end; ++i )
{
  LightInfo l = lights[i];
  DoLight(l);
}

So the main additional cost is doing a bounding sphere test per light that you visit, and iterating through the light array as a linked-list rather than a simpler linear array traversal. On modern hardware it should do pretty well: worse than clustered, but probably not by much, especially if the "DoLight" function above is expensive.
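For the curious, the stackless traversal being discussed can be sketched on the CPU like this; the node layout (depth-first order with a per-node skip offset) and all field names are my assumptions, not necessarily the article's exact format:

```cpp
#include <vector>

struct Node {
    float cx, cy, cz, radius; // bounding sphere of this subtree
    int lightIndex;           // >= 0 for a leaf, -1 for an internal node
    int skip;                 // index of the next node when this subtree is culled
};

// Collect indices of all lights whose bounding spheres contain point p.
// Nodes are stored in depth-first order, so a hit simply advances to the
// next array element; a miss jumps forward via 'skip' -- no stack required.
std::vector<int> GatherLights(const std::vector<Node>& nodes, float px, float py, float pz) {
    std::vector<int> hits;
    int i = 0;
    while (i < (int)nodes.size()) {
        const Node& n = nodes[i];
        float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
        if (dx * dx + dy * dy + dz * dz <= n.radius * n.radius) {
            if (n.lightIndex >= 0) hits.push_back(n.lightIndex);
            ++i;        // descend into children (next in DFS order)
        } else {
            i = n.skip; // cull the whole subtree
        }
    }
    return hits;
}
```

Because a hit just moves to the next array element and a miss jumps forward, the loop needs no per-thread stack, which is what makes this traversal attractive in a pixel shader.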

It would be interesting to do some comparisons between the two using (A) a simple Lambert lighting model, and (B) a very complex Cook-Torrance/GGX/Smith/etc. fancy new lighting model.

 

Also, you could merge this technique with clustered shading:

A) CPU creates the BVH as described, and uploads into a GPU buffer.

B) A GPU compute shader traverses the BVH for each cluster, and generates the 'Lights' buffer in my pseudo example above.

C) Lighting is performed as in clustered shading (as in my pseudo example above).
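Step (B)'s output can be flattened into exactly the two buffers the pseudo-shader above indexes. A CPU-side mock of that packing (all names assumed), whichever culling method produced the per-cluster lists:

```cpp
#include <vector>

struct ClusterLightRange { int start, size; };

// Flatten per-cluster light index lists into one range per cluster plus a
// shared linear index array -- the layout the clustered pseudo-shader reads.
void PackClusters(const std::vector<std::vector<int>>& perCluster,
                  std::vector<ClusterLightRange>& clusters,
                  std::vector<int>& lightIndices) {
    clusters.clear();
    lightIndices.clear();
    for (const auto& list : perCluster) {
        clusters.push_back({ (int)lightIndices.size(), (int)list.size() });
        lightIndices.insert(lightIndices.end(), list.begin(), list.end());
    }
}
```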


In my experience with clustered, you store all the lights in the buffer in a very similar manner, except each cluster has a linear section of the buffer.

Yes, I think that this would be the more common way of doing it. I currently use the uniform way to do it in my engine.

 

 

 


So the main additional cost is doing a bounding sphere test per light that you visit, and iterating through the light array as a linked-list rather than a simpler linear array traversal.

It is more like a tree structure, isn't it? Traversing a tree will, even if you narrow down to the candidates quite quickly, still need to jump around quite often. The internal nodes of this tree (which do not represent a light) add at least some extra cost, and if you need to traverse partway through a tree with 3000 leaves, you could have a few cache misses, and this for every pixel.

 

I would be interested in the expected overhead: would it be roughly x1.5, x2.0, x3.0? The system could perform better on a more realistic game setup (a few hundred visible lights) with a much flatter tree structure.

 

Nevertheless, I find this approach very sexy. I could think about using it to add some extra eye candy to the forward-rendered stuff (particles), as an option for people with high-performance hardware.

 

PS: if you do the light calculation in the vertex shader you could push out lit particles with ease and with much better performance.

 

PPS: @fries: how about posting your approach as an article on gamedev?

Edited by Ashaman73


Yes, I think that this would be the more common way of doing it. I currently use the uniform way to do it in my engine.

Sorry I'm taking the thread off topic here!
Do you have a single UBO containing an array with info on all clusters, and then you index that array?

AFAIK, on modern cards, that array indexing will be very similar (if not equal) in cost to fetching from a buffer.

Traversing a tree will, even if you narrow down to the candidates quite quickly, still need to jump around quite often. The internal nodes of this tree (which do not represent a light) add at least some extra cost, and if you need to traverse partway through a tree with 3000 leaves, you could have a few cache misses, and this for every pixel.

Ah yeah, I forgot about the non-leaf nodes. The nice thing about the stackless traversal, though, is that it's still linear in the best case, and the worst case just adds more gaps/skips into the linear iteration. I imagine it will still cache somewhat well.



Sorry I'm taking the thread off topic here!
Do you have a single UBO containing an array with info on all clusters, and then you index that array?

I'm still using OGL2.1, though UBOs are available as an ARB extension. Up until now I have targeted more urgent buffer usages (most recently PBO texture streaming... yeah, I know, but time is a very valuable resource). UBOs in conjunction with animation data or light data sound interesting, especially due to concurrently filling the buffers with multiple threads/jobs... *makes mental note for next engine improvement*.

I'm still somewhat curious about nextgl and what they will show off next month, so I'm a little bit careful with my coding time, hoping to add support for a modern, slim, cross-platform, open API :)


Hi,

Good to see that you're all talking about my technique :)

I'll have some time this weekend to work on a comparison demo against clustered shading, and I will release the code. It will be targeted at OpenGL. BVH accelerated shading should work fine without modern GPU features like structured buffers etc., as you should have no problem encoding the light BVH into a texture. Since the traversal is stackless, it should map well to older hardware (no thread-local storage required for a stack).

As for the performance/bandwidth of BVH accelerated vs clustered, we won't really know until I work up a comparison demo, but there are some things to consider:

1. With clustered, for each pixel you loop over a light list in which not every light contributes to every pixel in the cluster; with BVH accelerated shading, the culling is per-pixel exact.
2. With 1000 lights at 20 bytes per light, you would have roughly 24k of node data in the tree (less if you pull the light info out into its own buffer and store only spatial information in the tree). So that's roughly 24k to upload to the GPU every frame for 1000 lights, and if they don't move, you don't have to upload anything at all. Either way, 24k is nothing for a modern GPU; your vertex buffers are bigger than that.
3. Clustered shading has setup costs (pixel clustering, light assignment) that BVH accelerated shading does not.
4. Clustered shading potentially jumps around the light buffer (the lights buffer in Hodgman's code above) as much as BVH accelerated shading jumps around the tree data, possibly equalising the bandwidth and cache behaviour of the two methods even more.
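As a back-of-the-envelope check on point 2: a binary BVH with one leaf per light has exactly 2n - 1 nodes, so the upload size is easy to bound. The per-node byte counts below are assumptions; the exact figure depends on how much you pack into each node:

```cpp
#include <cstddef>

// A binary tree with one leaf per light has exactly 2n - 1 nodes.
constexpr std::size_t NodeCount(std::size_t lightCount) {
    return 2 * lightCount - 1;
}

// Total bytes to upload for a given per-node footprint.
constexpr std::size_t TreeBytes(std::size_t lightCount, std::size_t bytesPerNode) {
    return NodeCount(lightCount) * bytesPerNode;
}
```

At 1000 lights that is 1999 nodes; at 20 bytes per node it comes to roughly 40 KB, and less if internal nodes store only spatial data. Whichever footprint you assume, the whole tree stays in the tens of kilobytes, which is the point being made above.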

So I'm hoping (lots of hand waving, I don't really know yet...) that with the coherency of nearby pixels traversing similar paths through the BVH, the bandwidth consumption and cache misses will not be much worse than with clustered.

Hodgman: your idea of using the BVH for light assignment in a grid for clustered shading is exactly what I was planning to do next! :) I always wondered why so many of those clustered shading demos/papers brute-force the light assignment; using a hierarchy would make it much simpler, and possibly faster: just throw a thread at each cell in the grid and have it traverse the BVH using the grid cell's bounding frustum. Maybe it is the cost of keeping the BVH up to date that prevents people from doing this, but for a scene with mostly static lights, i.e. most games, the BVH update should be insanely fast.

Ashaman73: I don't know how to post an article on here, but maybe I'll do that at some point, if they're interested in me doing that.

Edited by fries


So I'm hoping (lots of hand waving, I don't really know yet...) that with the coherency of nearby pixels traversing similar paths through the BVH, the bandwidth consumption and cache misses will not be much worse than with clustered.

I've done some testing with a quick BVH implementation to check the number of nodes read. I will not present my results, because the implementation is quick and dirty and would most likely not represent your work. Nevertheless, the question that arose was how many nodes you need to touch per pixel on average. E.g. the spheres in a sphere tree get quite big really fast, and all the spheres not related to a light will add overhead to every pixel, even unlit ones. So, if you have a pixel lit by 100 lights and an overhead of X nodes, it is quite cheap, but if you have a pixel lit by none or only one light, X nodes is really expensive. A clustered/cell-based shader would fit more tightly around the variable number of lights influencing pixel areas. The question is, how high is X under different setups?

 

Just to test this number, I would implement the same search algorithm from the shader on the CPU and do some rendering simulation to get quick statistics.
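That simulation can be as small as instrumenting the traversal to count node visits per shading point. Everything below is a hypothetical harness (assuming a DFS skip-pointer node layout), not code from this thread:

```cpp
#include <vector>

struct Node {
    float cx, cy, cz, radius; // bounding sphere of this subtree
    int lightIndex;           // >= 0 for a leaf, -1 for an internal node
    int skip;                 // next node index when this subtree is culled
};

// Traverse the DFS-ordered tree for one shading point and report how many
// nodes were touched -- the "X" overhead figure discussed above.
int CountNodesTouched(const std::vector<Node>& nodes, float px, float py, float pz) {
    int touched = 0;
    int i = 0;
    while (i < (int)nodes.size()) {
        ++touched;
        const Node& n = nodes[i];
        float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
        if (dx * dx + dy * dy + dz * dz <= n.radius * n.radius)
            ++i;        // descend (next node in DFS order)
        else
            i = n.skip; // skip the subtree
    }
    return touched;
}

// Average over a virtual pixel grid to gather per-frame statistics.
double AverageNodesTouched(const std::vector<Node>& nodes, int w, int h, float worldSize) {
    long long total = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            total += CountNodesTouched(nodes, worldSize * x / w, worldSize * y / h, 0.0f);
    return double(total) / (double(w) * h);
}
```

Sweeping the grid over different light layouts and tree depths would give exactly the X-per-pixel statistics asked about above.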

 

 

 


Ashaman73: I don't know how to post an article on here, but maybe I'll do that at some point, if they're interested in me doing that.

Check this out to get some info :)

Edited by Ashaman73
