
Clustered shading - why deferred?

This topic is 405 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.


Hello,

 

I've decided to implement clustered shading, as used by the Avalanche development team, as a way of getting back into working on my own 3D renderer.

 

http://www.humus.name/Articles/PracticalClusteredShading.pdf

 

Now they mention that it works with both forward and deferred shading, and that they used deferred shading themselves. But why?

They mentioned two pro-points for deferred rendering:

 

- Screen Space decals

- Performance

 

I know nothing about Screen-Space decals, so ok. But what about performance? I fail to see how deferred rendering would be faster with clustered shading than forward rendering. The per-pixel operations are pretty much the same for both forward and deferred implementations.

 

So am I missing something? Can somebody explain to me why deferred could be any faster than forward rendering when using clustered shading as described there? As far as I can see, it only adds an additional screen pass for the lighting calculations and uses up bandwidth for G-buffer creation and reading.


Probably because not all objects necessarily fit in one cluster (they're only 64x64 pixels on-screen, so pretty small).

 

I was going to post an image, but bleh.... forum software says I'm not allowed to use that image format (JPEG)

 

Anyway... with forward, you may have to render some objects twice or more often (if they cover 2 or 3 clusters), with deferred you never render them more than exactly once.

Edited by samoth


Anyway... with forward, you may have to render some objects twice or more often (if they cover 2 or 3 clusters), with deferred you never render them more than exactly once.

 

That would be true, if you were rendering the clusters separately. In clustered shading, you don't (from what I've been able to gather, and I've gone over this many many times).

 

The clusters are only used to build the light lists. Naively, you check every light against every cluster and build a list of all the lights that touch each cluster. (In reality this is optimized and packed, but a brute-force version might just allocate num_clusters*max_lights_per_cluster entries.)
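To make the brute-force variant concrete, here is a minimal sketch. It assumes each cluster is approximated by a view-space axis-aligned box and each light by a sphere; the names `Cluster`, `PointLight`, `sphere_hits_aabb` and `build_light_lists` are made up for illustration and are not taken from the Avalanche paper, whose clusters are frustum-shaped and whose assignment is optimized.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    lo: tuple  # min corner (x, y, z) of an assumed view-space box
    hi: tuple  # max corner (x, y, z)


@dataclass
class PointLight:
    pos: tuple     # view-space centre
    radius: float  # range of influence


def sphere_hits_aabb(light, cluster):
    # Squared distance from the sphere centre to the closest point of the box.
    d2 = 0.0
    for p, lo, hi in zip(light.pos, cluster.lo, cluster.hi):
        nearest = min(max(p, lo), hi)
        d2 += (p - nearest) ** 2
    return d2 <= light.radius ** 2


def build_light_lists(clusters, lights):
    # Brute force: test every light against every cluster.
    return [
        [i for i, light in enumerate(lights) if sphere_hits_aabb(light, c)]
        for c in clusters
    ]
```

Note that nothing here touches geometry: the result is one list of light indices per cluster, built once per frame, which is why objects do not need to be drawn once per cluster.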

 

Then, when rendering, you use the fragment position to look up which cluster the fragment is in.

In forward, you would render the object, and do this lookup in the pixel shader, along with the lighting.

In deferred, you render a fullscreen-pass, and do the lookup in the pixel shader with the lighting.

 

So unless I'm mistaken about how the clusters are handled (but I'm pretty sure that's how it works), there should be no overhead, since you only have to render every object once anyway.


In clustered shading, you don't (from what I've been able to gather, and I've gone over this many many times).
Is that so? (I'm really asking, not being cynical.)

 

I understand that clustered forward is of course an optimization to non-clustered. But I fail to see how you can do without inevitably having to draw some geometry (not all, but some... occasionally) twice or more often. Unless your max_lights is huge (but then, if it's huge, why do you need clustered at all, just render the whole scene in one pass with 1,000 active lights!) or objects and lights always fit snugly into a single cluster, or lights are perfectly, evenly distributed across the screen so there are very few overlaps (never exceeding the max lights number), I believe you have no other choice than to render at least some objects twice, every now and then. How would it otherwise work? You have to render them somehow, and there's a limit on how many lights you can do at once.

 

Maybe I'm just not understanding this...


 

Also, for forward the shaders are bigger and likely to use more registers, which reduces occupancy, which in turn means the GPU is more likely to be blocked on memory reads.
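A rough way to see the register/occupancy link is the toy model below. The two hardware constants are assumptions, loosely modelled on figures AMD has published for GCN-class hardware; real GPUs also apply allocation granularity and other limits that are ignored here, and `waves_in_flight` is a hypothetical helper name.

```python
REGS_PER_LANE = 256  # assumed vector registers available per SIMD lane
MAX_WAVES = 10       # assumed hardware cap on resident waves per SIMD


def waves_in_flight(regs_per_thread):
    # Every resident wave needs its own copy of the shader's registers,
    # so the register file divides evenly among the waves that fit.
    return min(MAX_WAVES, REGS_PER_LANE // regs_per_thread)
```

Under these assumptions a shader using 24 registers keeps 10 waves resident, one using 64 keeps 4, and one using 128 keeps only 2, leaving far fewer other waves to switch to while a memory read is outstanding.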

 

Finally for the deferred case you can guarantee wavefronts of pixel data are rendered which hit one screen space tile only. This can improve caching and allow for some operations to run wavefront wide rather than pixel wide. 
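As a simplified illustration of why it matters that a wavefront stays inside one tile, the sketch below models a wave in which every lane loops over its own tile's light list. The model is an assumption made for illustration (the wave keeps iterating until its busiest lane is done, idle lanes are wasted work); `wave_utilisation` is a hypothetical helper, not a real API.

```python
def wave_utilisation(light_counts):
    """light_counts: number of lights each lane of one wave loops over."""
    iterations = max(light_counts)  # the wave runs until the busiest lane finishes
    if iterations == 0:
        return 1.0                  # nothing to loop over, nothing wasted
    useful = sum(light_counts)      # lane-iterations doing real work
    return useful / (iterations * len(light_counts))
```

If all 64 lanes belong to the same tile and see 8 lights each, utilisation is 1.0. If half the lanes sit in a tile with 2 lights and half in a tile with 8, it drops to 0.625 in this model.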

 

Aye, these are two of the big ones, register occupancy can become a bottleneck very very quickly. As well, things thought of as traditional advantages to forward such as more material types and anti aliasing can now be done in deferred with some work and cleverness.

 

But if you're looking at writing a new rendering structure take a look at deferred texture/visibility buffers

 

There are some things you might want to do that can't be done straightforwardly anymore, such as accessing the complete material property/g-buffer for deferred decals and the like. And right now there's no good "practical" guide to an implementation in a tried and shipped title. You'd have to go digging around for things like volume-encoded UV maps or the above hacks to get UV mapping right, and so on.

 

But the efficiency gains can be incredible. From managing to squeeze the entire g-buffer into the Xbox One's ESRAM, to practically unlimited material types, to fantastic poly efficiency, to incredibly fast tessellation performance since you only tessellate visible triangles, you can get a heck of a lot out of it.

Edited by Frenetic Pony

I understand that clustered forward is of course an optimization to non-clustered. But I fail to see how you can do without inevitably having to draw some geometry (not all, but some... occasionally) twice or more often. Unless your max_lights is huge (but then, if it's huge, why do you need clustered at all, just render the whole scene in one pass with 1,000 active lights!) or objects and lights always fit snugly into a single cluster, or lights are perfectly, evenly distributed across the screen so there are very few overlaps (never exceeding the max lights number), I believe you have no other choice than to render at least some objects twice, every now and then. How would it otherwise work? You have to render them somehow, and there's a limit on how many lights you can do at once.

 

Yeah, it's hard to comprehend at first, I tried multiple times before I finally (think I) got it :)

 

The thing is, you put all visible lights in one big list. This can be a cbuffer, a texture, etc. Going by their paper, this is apparently enough to hold all the lights they needed (2xfloat2 for point, 2xfloat3 for spotlight).

 

Then you compile, for each cluster, a list of the lights that affect it. You first have a lookup list (again, texture or cbuffer) with one (Offset, NumPoints, NumSpots) entry per cluster. This offset points into another buffer, which contains indices into the light list.
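A small sketch of that layout, under stated assumptions: point-light indices are stored before spot-light indices within each cluster's run, and the two kinds of light live in separate arrays. The function names `pack_light_lists` and `lights_for_cluster` are hypothetical, and plain Python lists stand in for the textures or cbuffers.

```python
def pack_light_lists(per_cluster):
    """per_cluster: one (point_indices, spot_indices) pair per cluster."""
    lookup = []        # one (offset, num_points, num_spots) entry per cluster
    index_buffer = []  # light indices of all clusters, stored back to back
    for points, spots in per_cluster:
        lookup.append((len(index_buffer), len(points), len(spots)))
        index_buffer.extend(points)  # assumed order: points first...
        index_buffer.extend(spots)   # ...then spots
    return lookup, index_buffer


def lights_for_cluster(cluster_id, lookup, index_buffer):
    # First indirection: cluster -> (offset, counts).
    offset, n_points, n_spots = lookup[cluster_id]
    # Second indirection: offset -> indices into the global light lists.
    points = index_buffer[offset:offset + n_points]
    spots = index_buffer[offset + n_points:offset + n_points + n_spots]
    return points, spots
```

Because each cluster only stores an offset and two counts, empty clusters cost one lookup entry and no index-buffer space.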

 

Then, for each fragment, you calculate which cluster it is in and do the lookup. This is a double indirection, but apparently fast enough. It is independent of forward or deferred; you just have a different source for the screen position and z-coordinate used to calculate the cluster.
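The cluster calculation itself might look roughly like this. The 64-pixel tile size comes from earlier in the thread; the number of depth slices, the near/far range and the exponential slice spacing are assumptions chosen for illustration and may not match the paper's exact scheme.

```python
import math

TILE_SIZE = 64               # pixels per tile side, as mentioned in the thread
NUM_SLICES = 16              # assumed number of depth slices
Z_NEAR, Z_FAR = 0.1, 1000.0  # assumed view-space depth range


def cluster_coords(px, py, view_z):
    # Screen position picks the tile; view-space depth picks the slice.
    tx = int(px) // TILE_SIZE
    ty = int(py) // TILE_SIZE
    z = min(max(view_z, Z_NEAR), Z_FAR)
    # Assumed exponential spacing: slices get thicker with distance.
    t = math.log(z / Z_NEAR) / math.log(Z_FAR / Z_NEAR)
    depth_slice = min(int(t * NUM_SLICES), NUM_SLICES - 1)
    return tx, ty, depth_slice


def cluster_index(px, py, view_z, tiles_x, tiles_y):
    # Flatten (tile x, tile y, slice) into one index for the lookup list.
    tx, ty, depth_slice = cluster_coords(px, py, view_z)
    return (depth_slice * tiles_y + ty) * tiles_x + tx
```

A forward shader would feed this the fragment's own position and depth, while a deferred pass would feed it the pixel coordinate and the depth read back from the G-buffer; the function is the same either way.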

 

So you can have potentially infinite (= a very high number of) lights, limited only by memory, though in practice they cap it at a certain amount, determined by profiling a worst-case scenario. Still no need to render anything twice, even with forward rendering.

 

@AliasBinman:

 

Ah, thanks, that was about what I wanted to hear. So I think I'll just stick with deferred rendering for now, and maybe implement forward rendering at some point to profile & compare, especially since the two seem fairly easy to swap out with this technique.

 

@Frenetic Pony:

 

Thanks, sounds really awesome, I'll have a look at it and see how hard it would be to implement alongside. Time is a bit of a concern for now, since I'm doing this alongside my master's thesis at university, and one thing I liked about clustered shading is that, once you comprehend it, it seems really easy to implement. So I'll try to keep things clean and expandable, and apply such highly advanced techniques at some point, eventually :)

Edited by Juliean

Everyone's pretty much covered it, but my guesses would be:

* pixel quad efficiency / the small triangle problem. GPUs run pixel shaders on 2x2 pixel areas (a pixel-quad) and discard any pixels that aren't required. This means that triangle edges are almost always over-shaded. GPUs also want a triangle to ideally cover about 16 pixel-quads, which becomes a big problem if using tessellation...
Clustered deferred incurs this same penalty during gbuffer generation, but not at all during shading. Most people use a compute shader, which can run very efficiently on nice square tile areas.

* shader complexity. Forward shaders *can* be more complex than a deferred shader. e.g. when using parallax mapping, it appears in your forward shader and your gbuffer shader, but not your deferred-shading shader.
This is important because shader complexity affects "occupancy" - the number of parallel threads the GPU can run at a time. Simpler shaders allow the GPU to run more threads, which makes memory appear to be lower latency.
Sometimes a shader of 100 instructions may be way slower than two shaders of 60 instructions each!
This is very situation-dependent though, and the opposite can also be true.

* branch coherency. In tiled-deferred (the 2D variant of clustered) you can get 100% branch coherency within a tile's thread group, which makes your branching/looping effectively free. The cost of loading the cluster's light list from RAM into LDS can also be shared among the thread group instead of being repeated per pixel.
In forward, pixels within a warp/wavefront are likely to be part of different tiles/clusters and thus encounter divergent branches.
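To put numbers on the pixel-quad point from the first bullet, here is a toy rasterizer that counts how many pixels a triangle covers versus how many the GPU would shade if it always launches whole 2x2 quads. It is a sketch under simplifying assumptions: coverage is tested at pixel centres, pixels exactly on an edge count as covered (real hardware uses a stricter fill rule), and `edge` and `quad_cost` are illustrative names.

```python
def edge(ax, ay, bx, by, px, py):
    # Signed area test: tells which side of the edge a->b the point p is on.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def quad_cost(tri, width, height):
    """Return (covered_pixels, shaded_pixels) for one triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    covered = 0
    quads = set()  # every 2x2 quad that has at least one covered pixel
    for y in range(height):
        for x in range(width):
            cx, cy = x + 0.5, y + 0.5  # sample at the pixel centre
            w0 = edge(x1, y1, x2, y2, cx, cy)
            w1 = edge(x2, y2, x0, y0, cx, cy)
            w2 = edge(x0, y0, x1, y1, cx, cy)
            all_pos = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_neg = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_pos or all_neg:  # accept either winding order
                covered += 1
                quads.add((x // 2, y // 2))
    return covered, 4 * len(quads)  # a touched quad shades all 4 pixels
```

In this model a triangle that covers a single pixel still pays for four shader invocations, while a triangle spanning half of an 8x8 region covers 36 pixels and shades 40, so the waste shrinks as triangles grow. A full-screen or compute lighting pass sidesteps this because it does not depend on triangle edges at all.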
