# Revival of Forward Rendering?

## Recommended Posts

I saw AMD's "Leo" tech demo the other day and was pretty amazed by it. I was also surprised to see that it hadn't yet been mentioned on gamedev.net (that I can find). Here are the links.

AMD's Website

They say that they are using a forward renderer that uses compute shaders to cull lights on a tile basis, which reminded me of this thread I saw on this site a while back.

I'm really interested to see just how feasible this is and how it stacks up head to head against tile based deferred rendering for a strictly D3D11 based engine.

##### Share on other sites
Did it ever die? There are specific use cases for which each type of renderer is more suited than the other. What's most interesting about this is the use of a CS for work that would be more traditionally done CPU-side, but I wonder how well that would balance out in a real in-game scene. Thanks for the heads-up anyway, I'll be checking this out in more detail later on.

##### Share on other sites
You may be interested in this paper: http://www.cse.chalmers.se/~olaolss/papers/tiled_shading_preprint.pdf
It spends some time comparing a tile-based deferred shading implementation with a tile-based forward shading implementation.

##### Share on other sites
The short answer is, it doesn't, not from a development point of view. Tile based light culling was originally developed for deferred shading, which is only getting better. And with the next generation of consoles no doubt having plenty of RAM and memory bandwidth there's no reason (at the moment) to waste rendering the geometry twice when you can just shove all you need through in G-Buffers and render it once.

In fact deferred lighting is probably going the same way for the same reason: both forward and deferred lighting were only ever used to minimize memory and bandwidth use, a precious commodity on the 360 and PS3. But with even mobile devices shoving past them now, what's the point? Might as well use up that available RAM and double your polycount, throw a ton of shadow mapped lights, or something else of the kind.

As for AA, there was a recent paper (very recent) on minimal cost MSAA while doing deferred shading, similar to the cost of forward rendering. And of course you can also use temporal and/or morphological AA as well. I'm sure you could forward render transparency stuff using the same tile based scheme while you're going deferred, but deferred shading definitely seems to me to be the way to go.

##### Share on other sites

The short answer is, it doesn't, not from a development point of view. Tile based light culling was originally developed for deferred shading, which is only getting better. And with the next generation of consoles no doubt having plenty of RAM and memory bandwidth there's no reason (at the moment) to waste rendering the geometry twice when you can just shove all you need through in G-Buffers and render it once.

Let's forget for a moment the GBuffer is typically bigger. NV40 could render about 60 pointlights per pass, I have difficulty understanding the need to render twice.

The only major advantage of deferred tech is the modularity of light processing compared to material rendering but this comes at a considerable cost: no one I've talked to understood the need to write shaders putting stuff in different buffers... and don't even get me started on packing.

So, in my opinion, the advantages are still unproven. I suppose UE3 and Samaritan make this clear. Flexible Forward can emulate Deferred at a reduced cost... not vice versa.

I look forward to reading what mhagain will write on this.

##### Share on other sites
So I've downloaded the full demo and run it a few times. I deliberately chose a fairly low-specced machine to see how viable the technique is on the kind of hardware that would be considered commodity today, and the short answer is - it's not.

Reminds me of the time I first got a 3DFX and - naturally - immediately grabbed GLQuake to test it out. Of course I neglected to pop in the 3DFX opengl32.dll file so I ended up drawing through the Windows software implementation at 1fps.

Obviously AMD feel that they've got something special with their new kit, and they want to show off its capabilities to the best by taking a sub-optimal technique and making it realtime. More power to them, I wish them well with it. Maybe in 2 years time when this level of hardware is a commonplace average this might be an approach to think of using, but for now there seem to be better things to burn your GPU budget on.

##### Share on other sites
"Specifically, this demo uses DirectCompute [...] per-pixel or per-tile list of lights that forward-render based shaders use for lighting each pixel."

Now call me a cynic, but is this not pretty much exactly what Damian Trebilco did 5 years ago on hardware three generations older, using nothing but the CPU and some stencil buffer trick?

Admittedly, Damian's demo with that horse model inside a room was not quite as artistic. The ATI demo sure is kind of funny, with a nice story, well done animations, and it looks quite good, but honestly I couldn't tell it really looks a class better than a thousand other demos (in fact, all of the characters are quite "plastic like" though of course the ATI guys will claim that this is intentional). In contrast, unlike a thousand other demos, it requires the latest, fastest hardware to run...

##### Share on other sites
I've done some simple tests (Sponza with 128 unshadowed point lights) and it's definitely feasible. On a GTX 570 at 1920x1080 I get 1.1ms for filling the G-Buffer and 3.5ms for the lighting with a tiled deferred approach, while with an indexed deferred approach I get 0.75ms for computing the lights per tile and 4.5ms for forward rendering (both using a z-only prepass). So at 4.6ms vs. 5.25ms it's not too far off in terms of performance. But of course you really need to know how well it scales with:

1. More lights
2. More sophisticated light types (spot, directional, shadows, gobos, etc.)
3. Different BRDFs/material types
4. Dense, complex geometry
5. MSAA (in my test scene it brings the forward rendering pass up to 5.25ms for 4xMSAA)

You'd also want to compare how well a deferred renderer scales with lots of G-Buffer terms, especially with dense geometry. Unfortunately I don't have the time at the moment to thoroughly evaluate all of those things, but it does at least seem like a viable alternative to traditional deferred rendering. But I'm not sure if it would ever beat a good tiled deferred implementation outright. It would definitely make certain things a lot easier, since you wouldn't have to worry about packing things into a G-Buffer or special-case handling of different material types in the lighting shader.

@mhagain: there's a lot more going on in that demo than just indexed deferred rendering. For instance they use PTex, and a VPL-based GI solution.

@samoth: it's a variant of light indexed deferred, and they even say as much in their presentation. The technique just becomes a lot more practical when you can generate the light list per-tile in a compute shader, rather than having to do all of the nasty hacks required by the original implementation.
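To make the per-tile light list idea concrete, here is a rough CPU-side C++ sketch of how a forward shading pass might consume such a list. The `TileLightGrid` layout, the field names, and the use of a plain intensity sum in place of a real BRDF are all assumptions made for illustration; they are not taken from AMD's demo or from any sample app mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: one (offset, count) pair per tile pointing into a
// flat array of light indices. All names here are illustrative only.
struct TileLightGrid {
    int tileSize;                        // tile edge in pixels, e.g. 16
    int tilesX;                          // number of tiles per row
    std::vector<uint32_t> offsets;       // per tile: start in lightIndices
    std::vector<uint32_t> counts;        // per tile: number of lights
    std::vector<uint32_t> lightIndices;  // concatenated per-tile lists
};

// Sums a per-light intensity for the pixel at (x, y), touching only the
// lights listed for the tile that contains the pixel.
float shadePixel(const TileLightGrid& grid,
                 const std::vector<float>& lightIntensity,
                 int x, int y)
{
    assert(grid.tileSize > 0 && x >= 0 && y >= 0);
    const std::size_t tile =
        static_cast<std::size_t>(y / grid.tileSize) * grid.tilesX + x / grid.tileSize;
    float sum = 0.0f;
    for (uint32_t i = 0; i < grid.counts[tile]; ++i) {
        const uint32_t light = grid.lightIndices[grid.offsets[tile] + i];
        sum += lightIntensity[light];    // stand-in for a real BRDF evaluation
    }
    return sum;
}
```

The point of the layout is that the shading loop only ever walks the handful of lights assigned to its own tile, regardless of how many lights exist in the scene.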

##### Share on other sites

@mhagain: there's a lot more going on in that demo than just indexed deferred rendering. For instance they use PTex, and a VPL-based GI solution.

The bounced lights are the most interesting thing in it to me; my obsession with lighting is not shadows but brightness, and that one tickles my fancy.

##### Share on other sites
Sorry for the bump, but I put up a blog post with some performance numbers and a sample app. Feel free to use it for your own experiments.

##### Share on other sites
Hmmm, I've given your test app a quick spin on my 7970 and I noticed the MSAA times were a little inconsistent, which is a tad strange.

Anyway, I ran the code 'out of the box' so I don't know what light settings you had set up, but the results were:

Forward Render Time:
No MSAA : 0.8ms
2x MSAA : 0.9ms to 1.6ms (?)
4x MSAA: 1.0ms to 1.8ms (?) - seems to spend more time at 1.0ms however.

Even Light Time Computation seems to bounce around a bit, varying between 0.2 and 0.4ms on all AA modes - sometimes NoAA and 2xAA sit around 0.36ms while 4xAA comes in at 0.26ms.

If I unlock v-sync then the total render times come in at;
NoAA : 1.8ms
2xAA: 2.0ms
4xAA: 2.3ms

In comparison the tiled deferred stacks in at;
(G-buffer pass, light pass computation, total render time)
(With z-pass)
NoAA: 0.41, 0.53, 1.9ms
2xAA: 0.49, 1.3, 2.82ms
4xAA: 0.55, 1.9, 3.5ms

(Without z-pass)
NoAA: 0.6, 0.53, 1.9ms
2xAA: 0.7, 1.3, 2.82ms
4xAA: 0.8, 1.9, 3.6ms

(720p rendering, in a window, stock clocked i7 920, but judging by the CPU graph from the task manager it wasn't CPU limited so that shouldn't change things.)

##### Share on other sites
It would seem to me that there is promise in a CS powered light indexed deferred renderer. Even on AMD hardware the performance isn't bad, and apparently on nV it's better. Especially if you want MSAA and weren't planning on using thousands of lights.

##### Share on other sites
Thanks for posting those, phantom! I've noticed that the queries I use for profiling can get messed up a bit when VSYNC is enabled. For the numbers I gathered, I just used the total frame time with VSYNC disabled.

I also realized that it was pretty dumb and lazy of me to leave the number of lights hard-coded, so I uploaded a new version that lets you switch the number of lights at runtime.

##### Share on other sites
I am noticing no change at all when enabling or disabling Z prepass when in light indexed mode. I would have thought it would have a large impact on forward shading time.

##### Share on other sites

Even on AMD hardware the performance isn't bad, and apparently on nV it's better. Especially if you want MSAA and weren't planning on using thousands of lights.

Well, the performance delta is better on the 680 vs tiled deferred, but I suspect that is down to a reduction in memory bandwidth (the 680 runs slightly higher clocked but has 2/3 the bus size of the 7970); overall the 7970 seems to like it better render-time-wise (which is interesting, as most game benchmarks have the 680 winning across the board).

Anyway, I'm going to try out the new version and report back in a bit regarding the various light counts.

##### Share on other sites
OK, 7970 results as promised;

(Forward, Light Tile, Total Frame Time)

256 lights:
Index Deferred:
No AA: 1.4ms, 0.196ms, 2.5ms
2xAA: 1.5ms, 0.22ms, 2.7 -> 2.9ms
4xAA: 1.7ms, 0.27ms, 3.0 -> 3.4ms

Tiled Deferred:
No AA: 0.416ms, 0.914ms, 2.3ms
2x AA: 0.49ms, 1.86ms, 3.3ms (spike 3.5ms)
4x AA: 0.55ms, 2.53ms, 4.2ms (4.4ms spike)

512 lights:
Index Deferred:
No AA: 2.2ms, 0.2ms, 3.3ms
2xAA: 2.59ms, 0.22ms, 3.8ms
4xAA: 2.85ms, 0.28ms, 4.2ms

Tiled Deferred:
No AA: 0.416ms, 1.4ms, 2.9ms
2x AA: 0.49ms, 2.65ms, 4.2ms
4x AA: 0.55ms, 3.5ms, 5.2ms

1024 lights:
Index Deferred:
No AA: 4.8ms, 0.2ms, 5.9ms
2xAA: 5.45ms, 0.25ms, 6.7ms
4xAA: 5.99ms, 0.315ms, 7.4ms

Tiled Deferred:
No AA: 0.416ms, 3.08ms, 4.5ms
2x AA: 0.49ms, 4.88ms, 6.4ms
4x AA: 0.55ms, 6.16ms, 7.8ms (8ms spike)

Side note: GPU-Z reports back;
1690MB VRAM used
420MB 'dynamic' ram used

That's.... a lot

So, if we try to make some sense out of this test

It would seem that without AA the Tiled Deferred has the edge, but as soon as you throw AA into the mix things swing towards the Index Deferred method (1024, 2xAA being the notable exception to that rule).

TD shows the normal deferred characteristic of stable G-buffer pass times, but the tile lighting phase begins to get very expensive for it.
By contrast, ID has a pretty constant lighting phase, but the forward render phase shows the same kind of increase as TD's lighting phase.

##### Share on other sites
phantom,

Can you please post the results for 128 lights?

Thanks!
-= Dave

I am noticing no change at all when enabling or disabling Z prepass when in light indexed mode. I would have thought it would have a large impact on forward shading time.

The Z prepass is always enabled for light indexed mode, because a depth buffer is necessary to figure out the list of lights that intersect each tile. If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.
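A minimal CPU-side C++ sketch of why the depth buffer matters here: each tile's min/max depth is reduced from the prepass, and a light is kept only if its depth extent overlaps that range. It assumes view-space depths that are positive and grow away from the camera, and it covers only the depth part of the test (a full implementation would also test the tile's side planes). `DepthBounds`, `tileDepthBounds` and `lightOverlapsDepth` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DepthBounds { float minZ; float maxZ; };

// Reduces a row-major buffer of view-space depths (assumed positive, growing
// away from the camera) to one min/max pair per tile.
std::vector<DepthBounds> tileDepthBounds(const std::vector<float>& depth,
                                         int width, int height, int tileSize)
{
    assert(tileSize > 0 && depth.size() == static_cast<std::size_t>(width) * height);
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<DepthBounds> bounds(static_cast<std::size_t>(tilesX) * tilesY,
                                    DepthBounds{1e30f, 0.0f});
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t tile =
                static_cast<std::size_t>(y / tileSize) * tilesX + x / tileSize;
            const float z = depth[static_cast<std::size_t>(y) * width + x];
            bounds[tile].minZ = std::min(bounds[tile].minZ, z);
            bounds[tile].maxZ = std::max(bounds[tile].maxZ, z);
        }
    }
    return bounds;
}

// A light whose sphere spans [lightZ - radius, lightZ + radius] in view depth
// can only affect a tile if that interval overlaps the tile's depth range.
bool lightOverlapsDepth(const DepthBounds& tile, float lightZ, float radius)
{
    return lightZ + radius >= tile.minZ && lightZ - radius <= tile.maxZ;
}
```

Without the prepass, every tile would have to use the camera's near/far planes as its range, so far more lights would pass this test.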

##### Share on other sites
Thanks for the updated numbers phantom! It definitely seems as though light indexed removes a lot of bandwidth dependence, which is pretty cool. Ethatron posted some numbers on my blog where he showed that performance scaled pretty directly with shader core clock speed. I would suspect that overall light indexed would scale down pretty well to lower-class hardware with lower bandwidth ratings.

##### Share on other sites

...If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.

I did look at that in my paper 'Tiled Shading' that someone posted a link to above. And the short answer is that no indeed, it does not end too well.

On the other hand, I imagine that it would be a useful technique simply to manage lights in an environment with not too many lights in any given location and limited views (e.g. RTS camera or so), as the limited depth span makes the depth range optimization less effective.

I've got an OpenGL demo too, which builds grids entirely on the CPU (so it's not very high performance, just there to demo the techniques).

Btw, one thing I noticed that could affect your results is that you use atomics to reduce the min/max depth. Shared memory atomics on NVIDIA hardware serialize on conflicts, so using them to perform a reduction this way is less efficient than just using a single thread in the CTA to do the work (at least then you don't have to run the conflict detection steps involved). This step gets a lot faster with a SIMD parallel reduction, which is fairly straightforward. I don't have time to dig out a good link, sorry, so I'll just post a CUDA variant I've got handy. It handles 32 threads (a warp) but scales up with appropriate barrier syncs; sdata is a pointer to a 32-element shared memory buffer (is that local memory in compute shader lingo? Anyway, the on-chip variety).

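    // Sum-reduction across one warp (32 threads). sdata points to a 32-element
    // shared memory buffer. For depth bounds, replace += with min/max.
    __device__ uint32_t warpReduce(uint32_t data, uint32_t index, volatile uint32_t *sdata)
    {
        unsigned int tid = index;
        sdata[tid] = data;
        // Relies on the warp executing in lockstep; architectures with
        // independent thread scheduling need explicit __syncwarp() calls.
        if (tid < 16)
        {
            sdata[tid] += sdata[tid + 16];
            sdata[tid] += sdata[tid + 8];
            sdata[tid] += sdata[tid + 4];
            sdata[tid] += sdata[tid + 2];
            sdata[tid] += sdata[tid + 1];
        }
        return sdata[0];
    }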

Same goes for the list building, where a prefix sum could be used. Here it'd depend on the rate of collisions. Anyway, thinking this might be a difference between NVIDIA and AMD (Where I don't have a clue how atomics are implemented).
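As a sketch of the prefix sum idea for list building, the C++ below runs the scan serially on the CPU purely to show what each thread would end up with; on the GPU the scan itself would be done in parallel. It assumes one flag per light (1 if the light hit the tile, 0 otherwise); `exclusiveScan` and `compactLights` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum: result[i] is the sum of values[0..i-1].
std::vector<uint32_t> exclusiveScan(const std::vector<uint32_t>& values)
{
    std::vector<uint32_t> result(values.size());
    uint32_t running = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        result[i] = running;
        running += values[i];
    }
    return result;
}

// Compacts the indices of lights that passed the tile test. The scanned flag
// gives every passing light its own output slot, so no shared counter (and
// therefore no atomic increment) is needed to append to the list.
std::vector<uint32_t> compactLights(const std::vector<uint32_t>& flags)
{
    const std::vector<uint32_t> slots = exclusiveScan(flags);
    const uint32_t total = flags.empty() ? 0u : slots.back() + flags.back();
    std::vector<uint32_t> list(total);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        assert(flags[i] <= 1u);
        if (flags[i]) list[slots[i]] = static_cast<uint32_t>(i);
    }
    return list;
}
```

Whether this beats atomics in practice would depend on the collision rate, as noted above.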

As a side note, it's much more efficient to work out the screen space bounds of each light before running the per tile checks, saves constructing identical planes for tens of tiles, etc.
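One way to picture the screen-space bounds approach is the C++ sketch below. It assumes the light's pixel-space bounding box has already been computed once per light (how to project the light volume is left out), and it only shows the step of turning that box into a range of tiles, so that tiles outside the range never see the light. `TileRect` and `tilesForLight` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Inclusive range of tiles; the range is empty if x1 < x0 or y1 < y0.
struct TileRect { int x0, y0, x1, y1; };

// Converts a light's pixel-space bounding box into the range of tiles it
// touches, clamped to the screen.
TileRect tilesForLight(float minX, float minY, float maxX, float maxY,
                       int width, int height, int tileSize)
{
    assert(tileSize > 0 && width > 0 && height > 0);
    // Clamp in floating point first so off-screen parts are ignored.
    const float fx0 = std::max(minX, 0.0f);
    const float fy0 = std::max(minY, 0.0f);
    const float fx1 = std::min(maxX, static_cast<float>(width - 1));
    const float fy1 = std::min(maxY, static_cast<float>(height - 1));
    if (fx0 > fx1 || fy0 > fy1) {
        return TileRect{0, 0, -1, -1};  // entirely off screen
    }
    return TileRect{static_cast<int>(fx0) / tileSize, static_cast<int>(fy0) / tileSize,
                    static_cast<int>(fx1) / tileSize, static_cast<int>(fy1) / tileSize};
}
```

Only the tiles inside the returned range would then need the finer per-tile depth test.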

Anyway, fun to see some activity on this topic! And I'm surprised at the good results for tiled forward.

Cheers
.ola

##### Share on other sites
Yeah, given that ALU power continues to increase while bandwidth doesn't, the removal of that constraint is a very nice factor for LI.

I tried to spin the app up in AMD's GPU PerfStudio last night but it became a crash fest. I wanted to see what kind of utilisation the various stages were getting on the ALUs and bandwidth-wise.

##### Share on other sites
How easy would it be to implement transparency in a LI renderer? The z prepass suffers from the same issue as a geometry pass in a deferred shading renderer.

##### Share on other sites
It would be the same as a normal forward lighting system; render transparent objects back to front. You'd just get early rejection for objects which are behind the laid-down z-pass.

The other option is an order-independent transparency system, but those still seem to be pretty heavy on the hardware.
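For the back-to-front option, a minimal C++ sketch of the ordering step might look like this. `Drawable`, its `viewDepth` field (assumed to grow with distance from the camera) and the function names are hypothetical; a real renderer would sort whatever draw records it already has.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Drawable {
    int id;
    float viewDepth;  // distance along the view direction; larger is farther (assumed)
};

// Comparator that places farther objects first.
bool fartherFirst(const Drawable& a, const Drawable& b)
{
    return a.viewDepth > b.viewDepth;
}

// Orders transparent objects farthest-first so that nearer surfaces blend
// over farther ones when drawn in sequence.
void sortBackToFront(std::vector<Drawable>& transparent)
{
    std::stable_sort(transparent.begin(), transparent.end(), fartherFirst);
    assert(std::is_sorted(transparent.begin(), transparent.end(), fartherFirst));
}
```

The sorted objects are then drawn with depth testing against the existing z-pass, which is where the early rejection comes from.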

##### Share on other sites

It would be the same as a normal forward lighting system; render transparent objects back to front. You'd just get early rejection for objects which are behind the laid-down z-pass.

Just note that the restriction applies to lights as well, so when you build the grid you can only reject lights entirely behind the scene (only use max depth). Obviously one could elaborate on this with a min depth buffer, but before you know it we'll have implemented depth peeling.

Otherwise I think the fact that you can reuse the entire pipeline, including shader functions to access the grid, is one of the really strong features of the tiled deferred-forward combo. It's easy to do tiled deferred for opaque objects and then add tiled forward for transparent ones, if that is what works. It is very easy to move between tiled deferred and forward shading, and this has got to be good for compatibility/scaling/adapting to platforms.
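The difference between the two culling rules can be sketched in a few lines of C++. This assumes view-space depths that grow away from the camera and treats each light as a depth interval of `lightZ` plus or minus `radius`; both function names are hypothetical.

```cpp
#include <cassert>

// Opaque geometry: the light must overlap the tile's [minZ, maxZ] depth range.
bool keepForOpaque(float tileMinZ, float tileMaxZ, float lightZ, float radius)
{
    assert(tileMinZ <= tileMaxZ && radius >= 0.0f);
    return lightZ + radius >= tileMinZ && lightZ - radius <= tileMaxZ;
}

// Transparent geometry may sit anywhere in front of the opaque surfaces, so
// only lights lying entirely behind the farthest opaque depth can be rejected.
bool keepForTransparent(float tileMaxZ, float lightZ, float radius)
{
    assert(radius >= 0.0f);
    return lightZ - radius <= tileMaxZ;
}
```

A light floating in front of the opaque surfaces is dropped by the first test but has to be kept by the second, which is why the transparent grid ends up with longer lists.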

##### Share on other sites

Just note that the restriction applies to lights as well, so when you build the grid you can only reject lights entirely behind the scene (only use max depth). Obviously one could elaborate on this with a min depth buffer, but before you know it we'll have implemented depth peeling.

Otherwise I think the fact that you can reuse the entire pipeline, including shader functions to access the grid, is one of the really strong features of the tiled deferred-forward combo. It's easy to do tiled deferred for opaque objects and then add tiled forward for transparent ones, if that is what works. It is very easy to move between tiled deferred and forward shading, and this has got to be good for compatibility/scaling/adapting to platforms.

But then you lose the benefit of easily being able to use different BRDFs, unless you resort to storing a material ID in your G-buffer and then using branching in your lighting shader.
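To illustrate what that branching looks like, here is a small C++ stand-in for a deferred lighting shader that switches on a stored material ID. The two material types, the `MaterialId` enum and the simplified lighting terms are assumptions chosen only to show the shape of the branch; in a forward renderer each material would instead have its own shader and no branch would be needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical material IDs as they might be stored in a G-buffer channel.
enum class MaterialId : uint8_t { Lambert = 0, BlinnPhong = 1 };

// Evaluates one light for a surface sample, branching on the stored material
// ID. nDotL and nDotH are the usual clamped cosine terms; shininess is an
// illustrative parameter.
float evaluateBrdf(MaterialId id, float nDotL, float nDotH, float shininess)
{
    assert(nDotL >= 0.0f && nDotH >= 0.0f);
    switch (id) {
    case MaterialId::Lambert:
        return nDotL;                                // diffuse only
    case MaterialId::BlinnPhong:
        return nDotL + std::pow(nDotH, shininess);   // diffuse plus specular lobe
    }
    return 0.0f;  // unknown ID: contribute nothing
}
```

Every additional material type adds another case to this switch, and any extra parameters it needs have to be packed into the G-buffer as well.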