Clustered shading - why deferred?


Hello,

I've decided to implement clustered shading, as used by the Avalanche development team, as a way of getting back into my own 3D renderer.

http://www.humus.name/Articles/PracticalClusteredShading.pdf

Now they mention that it works with both forward and deferred shading, and that they used deferred shading themselves. But why?

They mention two points in favor of deferred rendering:

- Screen Space decals

- Performance

I know nothing about screen-space decals, so OK. But what about performance? I fail to see how deferred rendering with clustered shading would be faster than forward rendering. The per-pixel operations are pretty much the same for both forward and deferred implementations.

So am I missing something? Can somebody explain to me why deferred could be any faster than forward rendering when using clustered shading as described here? As far as I can see, it only adds an extra screen pass for the lighting calculations, and uses up bandwidth for G-buffer creation/reading.


Probably because not all objects necessarily fit in one cluster (they're only 64x64 pixels on-screen, so pretty small).

I was going to post an image, but bleh.... forum software says I'm not allowed to use that image format (JPEG)

Anyway... with forward, you may have to render some objects twice or more often (if they cover 2 or 3 clusters), with deferred you never render them more than exactly once.

Anyway... with forward, you may have to render some objects twice or more often (if they cover 2 or 3 clusters), with deferred you never render them more than exactly once.

That would be true, if you were rendering the clusters separately. In clustered shading, you don't (from what I've been able to gather, and I've gone over this many many times).

The clusters are only used to build the light list. You (naively) check every light against every cluster and build a list of all the lights for each cluster (in reality this is optimized & packed, but brute force might be num_clusters * max_lights_per_cluster).
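
As a rough illustration, the brute-force build might look like this C++ sketch (all names are my own invention, and each cluster is treated as a view-space AABB for simplicity; real clusters are frustum-shaped sub-volumes and real implementations pack the lists far more cleverly):

```cpp
#include <cstdint>
#include <vector>

struct Light { float x, y, z, radius; };  // hypothetical point light
struct AABB  { float min[3], max[3]; };   // simplified cluster bounds

// Sphere-vs-AABB overlap test (closest-point method).
static bool intersects(const Light& l, const AABB& b) {
    float d2 = 0.0f;
    const float c[3] = { l.x, l.y, l.z };
    for (int i = 0; i < 3; ++i) {
        if (c[i] < b.min[i]) { float d = b.min[i] - c[i]; d2 += d * d; }
        if (c[i] > b.max[i]) { float d = c[i] - b.max[i]; d2 += d * d; }
    }
    return d2 <= l.radius * l.radius;
}

// Naive O(num_clusters * num_lights) light-list build.
void buildLightLists(const std::vector<AABB>&  clusters,
                     const std::vector<Light>& lights,
                     std::vector<std::vector<uint32_t>>& lightLists) {
    lightLists.assign(clusters.size(), std::vector<uint32_t>());
    for (size_t c = 0; c < clusters.size(); ++c)
        for (uint32_t l = 0; l < lights.size(); ++l)
            if (intersects(lights[l], clusters[c]))
                lightLists[c].push_back(l);  // light l affects cluster c
}
```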

Then when rendering, you use the fragment position to look up which cluster the fragment is in.

In forward, you would render the object, and do this lookup in the pixel shader, along with the lighting.

In deferred, you render a fullscreen pass and do the lookup in its pixel shader, along with the lighting.
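
For example, the fragment-to-cluster mapping could look something like this (a sketch only: the 64x64 tile size matches the paper, but the exponential depth slicing and all names are assumptions on my part):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical cluster grid: 64x64-pixel tiles, exponential depth slices.
struct ClusterGrid {
    uint32_t dimX, dimY, dimZ;  // e.g. 30 x 17 x N for 1920x1080
    float    zNear, zFar;
};

// Map a fragment (pixel coords + view-space depth) to a flat cluster index.
// The inputs can come from a forward pixel shader or from the depth buffer
// in a deferred pass; the function itself doesn't care.
uint32_t clusterIndex(const ClusterGrid& g, float px, float py, float viewZ) {
    uint32_t x = static_cast<uint32_t>(px) / 64;
    uint32_t y = static_cast<uint32_t>(py) / 64;
    // Exponential slicing: constant depth ratio between slice planes.
    float zc = viewZ < g.zNear ? g.zNear : viewZ;
    float s  = std::log(zc / g.zNear) / std::log(g.zFar / g.zNear);
    uint32_t z = static_cast<uint32_t>(s * g.dimZ);
    if (z >= g.dimZ) z = g.dimZ - 1;
    return (z * g.dimY + y) * g.dimX + x;
}
```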

So unless I'm mistaken about how the clusters are handled (but I'm pretty sure that's how it works), there should be no overhead, since you'd have to render every object only once anyway.

There are a few reasons for performance gains. Whether the gains outweigh the losses depends on the situation, however, so it is not always a net win overall. I presume that with their data sets it was.

Consider that all pixel shading has two phases. The first phase evaluates the material properties of the closest surface for the particular pixel. The second phase uses these material parameters, along with all the lighting, to evaluate the BRDF and derive a final color. For forward shading, the material evaluation and the BRDF evaluation are done in the same shader. For deferred, the material evaluation is performed first and the results are written to GBuffers; later passes then read in the GBuffers and evaluate the BRDF. Typically material evaluation is cheaper than lighting evaluation.
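
To make the split concrete, here is a minimal C++-style sketch of the two phases (the struct fields and function names are invented for illustration, not from any particular engine):

```cpp
// Illustrative only: the two phases of pixel shading, pulled apart.
struct MaterialSample { float albedo[3]; float normal[3]; float roughness; };
struct Color { float r, g, b; };

// Phase 1: sample textures etc. to produce surface parameters (stubbed here).
MaterialSample evaluateMaterial() { return {}; }

// Phase 2: run the BRDF against the lights (stubbed here).
Color evaluateBRDF(const MaterialSample&, int /*numLights*/) { return {}; }

// Forward: both phases run in the same pixel shader...
Color forwardPixel(int numLights) {
    MaterialSample m = evaluateMaterial();  // ...so overdrawn fragments pay
    return evaluateBRDF(m, numLights);      // for material AND lighting work.
}

// Deferred, phase 1: the geometry pass writes material data to the GBuffer.
void gbufferPixel(MaterialSample& gbufferTexel) {
    gbufferTexel = evaluateMaterial();      // overdraw wastes only this part
}

// Deferred, phase 2: a later fullscreen pass lights each pixel exactly once.
Color lightingPassPixel(const MaterialSample& gbufferTexel, int numLights) {
    return evaluateBRDF(gbufferTexel, numLights);
}
```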

If you are rendering at say 1920x1080 resolution you have about 2M pixels to evaluate.

When you render a scene made up of lots of triangles, there is a certain amount of inefficiency that can cause you to ultimately process more than 2M pixels. Some of this is due to dead pixels inside pixel quads, overdraw (which will still happen even with Hi-Z, due to quantisation issues) and non-full wavefronts.

So with deferred you will evaluate materials more often than needed, but light each pixel exactly once. With forward you will evaluate materials and light each pixel more than once.

Also, with forward the shaders are bigger and likely to use more registers, which reduces occupancy, which in turn means the GPU is more likely to be blocked on memory reads.

Finally, in the deferred case you can guarantee that wavefronts of pixel data hit only one screen-space tile. This can improve caching and allow some operations to run wavefront-wide rather than per-pixel.

In clustered shading, you don't (from what I've been able to gather, and I've gone over this many many times).
Is that so? (I'm really asking, not being cynical.)

I understand that clustered forward is of course an optimization over non-clustered. But I fail to see how you can avoid inevitably having to draw some geometry (not all, but some... occasionally) twice or more. Unless your max_lights is huge (but then, if it's huge, why do you need clustered at all? Just render the whole scene in one pass with 1,000 active lights!), or objects and lights always fit snugly into a single cluster, or lights are perfectly, evenly distributed across the screen so there are very few overlaps (never exceeding the max light count), I believe you have no choice but to render at least some objects twice, every now and then. How would it otherwise work? You have to render them somehow, and there's a limit on how many lights you can do at once.

Maybe I'm just not understanding this...

Also, with forward the shaders are bigger and likely to use more registers, which reduces occupancy, which in turn means the GPU is more likely to be blocked on memory reads.

Finally, in the deferred case you can guarantee that wavefronts of pixel data hit only one screen-space tile. This can improve caching and allow some operations to run wavefront-wide rather than per-pixel.

Aye, these are two of the big ones; register occupancy can become a bottleneck very quickly. As well, things traditionally thought of as advantages of forward, such as more material types and anti-aliasing, can now be done in deferred with some work and cleverness.

But if you're looking at writing a new rendering architecture, take a look at deferred texturing/visibility buffers.

There are some things you might want to do that can't be done straightforwardly anymore, such as accessing the complete material properties/G-buffer for deferred decals and the like. And right now there's no good "practical" guide to an implementation from a tried and shipped title. You'd have to go digging around for things like volume-encoded UV maps or the above hacks etc. to get UV mapping right.

But the efficiency gains can be incredible. From managing to squeeze the entire G-buffer into the Xbox One's ESRAM, to practically unlimited material types, to fantastic poly efficiency, to incredibly fast tessellation performance (as you only tessellate visible triangles), you can get a heck of a lot out of it.

I understand that clustered forward is of course an optimization over non-clustered. But I fail to see how you can avoid inevitably having to draw some geometry (not all, but some... occasionally) twice or more. Unless your max_lights is huge (but then, if it's huge, why do you need clustered at all? Just render the whole scene in one pass with 1,000 active lights!), or objects and lights always fit snugly into a single cluster, or lights are perfectly, evenly distributed across the screen so there are very few overlaps (never exceeding the max light count), I believe you have no choice but to render at least some objects twice, every now and then. How would it otherwise work? You have to render them somehow, and there's a limit on how many lights you can do at once.

Yeah, it's hard to comprehend at first; I tried multiple times before I finally (think I) got it :)

The thing is, you put all visible lights in one big list. This can be a cbuffer, a texture, etc. From their paper, this is apparently enough to hold all the lights they needed (2xfloat2 for point, 2xfloat3 for spotlight).

Then you compile a list of the lights affecting each cluster. You first have a lookup list (again, a texture or cbuffer) with (Offset, NumPoints, NumSpots) per cluster. The offset points into another buffer, which contains indices into the light list.

Then, for each fragment, you calculate which cluster it is in and do the lookup. This is a double indirection, but apparently fast enough. It is independent of forward vs. deferred; you just have a different source of input for the position + z-coordinate used to calculate the cluster.
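
A small C++ sketch of that double indirection (the layout and names are my guesses at the idea, not the paper's actual packing):

```cpp
#include <cstdint>
#include <vector>

struct PointLight { float pos[3], radius, color[3]; };  // hypothetical layout
struct Color { float r = 0, g = 0, b = 0; };

// Per-cluster lookup entry: (Offset, NumPoints, NumSpots).
struct ClusterEntry { uint32_t offset; uint16_t numPoints, numSpots; };

struct LightEnv {
    std::vector<ClusterEntry> clusterTable;  // one entry per cluster
    std::vector<uint32_t>     lightIndices;  // indices into the big light list
    std::vector<PointLight>   pointLights;   // the one big list of visible lights
};

// Stub for the per-light BRDF; a real shader does the actual math here.
static Color evalPointLight(const PointLight&) { return {}; }

// Per-fragment loop; identical for forward and deferred - only where the
// fragment's position/depth comes from differs between the two.
Color shadeFragment(const LightEnv& env, uint32_t clusterIdx) {
    Color result;
    const ClusterEntry& e = env.clusterTable[clusterIdx];
    for (uint32_t i = 0; i < e.numPoints; ++i) {
        uint32_t id = env.lightIndices[e.offset + i];   // 1st indirection
        Color c = evalPointLight(env.pointLights[id]);  // 2nd indirection
        result.r += c.r; result.g += c.g; result.b += c.b;
    }
    // Spot lights follow at offset + numPoints in the same index buffer.
    return result;
}
```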

So you can have a potentially unlimited (= very high) number of lights, limited only by memory, though in practice they cap it at an amount determined by profiling a worst-case scenario. Still no need to render anything twice, even with forward rendering.

@AliasBinman:

Ah, thanks, that was about what I wanted to hear. So I think I'll just stick with deferred rendering for now, and maybe implement forward rendering at some point to profile & compare, especially since the two seem fairly easy to swap in this scheme.

@Frenetic Pony:

Thanks, that sounds really awesome; I'll have a look at it and see how hard it would be to implement alongside. I'm a bit pressed for time at the moment, since I'm doing this alongside my master's thesis at university, and one thing I liked about clustered shading is that, once you comprehend it, it seems really easy to implement. So I'll try to keep things clean and expandable, and apply such highly advanced techniques at some point, eventually :)

Everyone's pretty much covered it, but my guesses would be:

* pixel-quad efficiency / the small-triangle problem. GPUs run pixel shaders on 2x2 pixel areas (a pixel quad) and discard any pixels that aren't required. This means that triangle edges are almost always over-shaded. GPUs also want a triangle to ideally cover about 16 pixel quads, which becomes a big problem if using tessellation...
Clustered deferred incurs this same penalty during gbuffer generation, but not at all during shading. Most people use a compute shader, which can run very efficiently on nice square tile areas.

* shader complexity. Forward shaders *can* be more complex than a deferred shader. e.g. when using parallax mapping, it appears in your forward shader and your gbuffer shader, but not your deferred-shading shader.
This is important because shader complexity affects "occupancy" - the number of parallel threads the GPU can run at a time. Simpler shaders allow the GPU to run more threads, which makes memory appear to be lower latency.
Sometimes a shader of 100 instructions may be way slower than two shaders of 60 instructions each!
This is very situation-dependent though, and the opposite can also be true.

* branch coherency. In tiled-deferred (the 2D variant of clustered), you can get 100% branch coherency within a tile's thread group, which makes your branching/looping free. The cost of loading the cluster's light list from RAM into LDS can also be shared among the thread group instead of repeated per pixel.
In forward, pixels within a warp/wavefront are likely to be part of different tiles/clusters and thus encounter divergent branches.
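
To illustrate that, here is a CUDA-flavored sketch of a tiled lighting pass (CUDA standing in for a D3D/GL compute shader, __shared__ for LDS/groupshared; the tile size, per-tile light cap and all names are assumptions of mine):

```cuda
#include <cstdint>

struct PointLight { float pos[3], radius, color[3]; };

#define TILE 16             // 16x16 threads = one screen tile per thread group
#define MAX_TILE_LIGHTS 64  // assumed per-tile cap; counts pre-clamped to it

__global__ void shadeTiles(const uint32_t*   tileLightCount,    // one per tile
                           const uint32_t*   tileLightIndices,  // MAX_TILE_LIGHTS per tile
                           const PointLight* lights,            // the big light list
                           float3* output, int width, int tilesX)
{
    // One thread group == one screen tile, so every warp/wavefront in the
    // group sees exactly the same light list.
    int tile = blockIdx.y * tilesX + blockIdx.x;
    int t    = threadIdx.y * TILE + threadIdx.x;

    __shared__ PointLight tileLights[MAX_TILE_LIGHTS];  // LDS / groupshared
    __shared__ uint32_t   count;

    if (t == 0) count = tileLightCount[tile];
    __syncthreads();

    // Load the tile's lights into LDS cooperatively, once per group,
    // instead of every pixel fetching them from memory on its own.
    for (uint32_t i = t; i < count; i += TILE * TILE)
        tileLights[i] = lights[tileLightIndices[tile * MAX_TILE_LIGHTS + i]];
    __syncthreads();

    int px = blockIdx.x * TILE + threadIdx.x;  // assumes the resolution is
    int py = blockIdx.y * TILE + threadIdx.y;  // a multiple of TILE

    // Every thread loops over the same 'count' lights: the loop/branch is
    // fully coherent across the warp/wavefront.
    float3 c = make_float3(0.f, 0.f, 0.f);
    for (uint32_t i = 0; i < count; ++i) {
        // Stand-in for real work: read the GBuffer at (px, py) and
        // evaluate the BRDF against tileLights[i].
        c.x += tileLights[i].color[0];
        c.y += tileLights[i].color[1];
        c.z += tileLights[i].color[2];
    }
    output[py * width + px] = c;
}
```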

