Revival of Forward Rendering?

Hmmm, I've given your test app a quick spin on my 7970 and I noticed the MSAA times were a little inconsistent, which is a tad strange.

Anyway, I ran the code 'out of the box' so I don't know what light settings you had set up, but the results were:

Forward Render Time:
No MSAA : 0.8ms
2x MSAA : 0.9ms to 1.6ms (?)
4x MSAA: 1.0ms to 1.8ms (?) - seems to spend more time at 1.0ms however.

Even the light computation time seems to bounce around a bit, varying between 0.2ms and 0.4ms across all AA modes - sometimes NoAA and 2xAA sit around 0.36ms while 4xAA comes in at 0.26ms.

If I unlock v-sync then the total render times come in at:
NoAA : 1.8ms
2xAA: 2.0ms
4xAA: 2.3ms

In comparison, the tiled deferred stacks up at:
(G-buffer pass, light pass computation, total render time)
(With z-pass)
NoAA: 0.41ms, 0.53ms, 1.9ms
2xAA: 0.49ms, 1.3ms, 2.82ms
4xAA: 0.55ms, 1.9ms, 3.5ms

(Without z-pass)
NoAA: 0.6ms, 0.53ms, 1.9ms
2xAA: 0.7ms, 1.3ms, 2.82ms
4xAA: 0.8ms, 1.9ms, 3.6ms

(720p rendering, in a window, stock-clocked i7 920; judging by the CPU graph in Task Manager it wasn't CPU limited, so that shouldn't change things.)
It would seem to me that there is promise in a CS-powered light indexed deferred renderer. Even on AMD hardware the performance isn't bad, and apparently on nV it's better. Especially if you want MSAA and weren't planning on using thousands of lights.
Thanks for posting those phantom! I've noticed that the queries I use for profiling can get messed up a bit when VSYNC is enabled. For the numbers I gathered, I just used the total frame time with VSYNC disabled.

I also realized that it was pretty dumb and lazy of me to leave the number of lights hard-coded, so I uploaded a new version that lets you switch the number of lights at runtime.
I am noticing no change at all when enabling or disabling the Z prepass when in light indexed mode. I would have thought it would have a large impact on forward shading time.

Even on AMD hardware the performance isn't bad, and apparently on nV it's better. Especially if you want MSAA and weren't planning on using thousands of lights.


Well, the performance delta is better on the 680 vs tiled deferred, but I suspect that is down to a reduction in memory bandwidth (the 680 runs slightly higher clocked but has 2/3 the bus width of the 7970); overall the 7970 seems to like it better in terms of render time (which is interesting, as most game benchmarks have the 680 winning across the board).

Anyway, I'm going to try out the new version and report back in a bit regarding the various light counts.
OK, 7970 results as promised:

(Forward or G-buffer pass, light tile computation, total frame time)

256 lights:
Index Deferred:
No AA: 1.4ms, 0.196ms, 2.5ms
2xAA: 1.5ms, 0.22ms, 2.7 -> 2.9ms
4xAA: 1.7ms, 0.27ms, 3.0 -> 3.4ms

Tiled Deferred:
No AA: 0.416ms, 0.914ms, 2.3ms
2x AA: 0.49ms, 1.86ms, 3.3ms (spike 3.5ms)
4x AA: 0.55ms, 2.53ms, 4.2ms (4.4ms spike)

512 lights:
Index Deferred:
No AA: 2.2ms, 0.2ms, 3.3ms
2xAA: 2.59ms, 0.22ms, 3.8ms
4xAA: 2.85ms, 0.28ms, 4.2ms

Tiled Deferred:
No AA: 0.416ms, 1.4ms, 2.9ms
2x AA: 0.49ms, 2.65ms, 4.2ms
4x AA: 0.55ms, 3.5ms, 5.2ms

1024 lights:
Index Deferred:
No AA: 4.8ms, 0.2ms, 5.9ms
2xAA: 5.45ms, 0.25ms, 6.7ms
4xAA: 5.99ms, 0.315ms, 7.4ms

Tiled Deferred:
No AA: 0.416ms, 3.08ms, 4.5ms
2x AA: 0.49ms, 4.88ms, 6.4ms
4x AA: 0.55ms, 6.16ms, 7.8ms (8ms spike)

Side note: GPU-Z reports back:
1690MB VRAM used
420MB 'dynamic' ram used

That's... a lot :o

So, if we try to make some sense out of this test :D

It would seem that without AA the Tiled Deferred has the edge, but as soon as you throw AA into the mix things swing towards the Index Deferred method (1024 lights at 2xAA being the notable exception to that rule).

TD shows the normal deferred characteristic of stable G-buffer pass times, but the tile lighting phase begins to get very expensive for it.
By contrast, ID has a pretty constant lighting phase but the forward render phase shows the same kind of increase as TD's lighting phase.
phantom,

Can you please post the results for 128 lights?

Thanks!
-= Dave
Graphics Programmer - Ready At Dawn Studios

I am noticing no change at all when enabling or disabling the Z prepass when in light indexed mode. I would have thought it would have a large impact on forward shading time.


The Z prepass is always enabled for light indexed mode, because a depth buffer is necessary to figure out the list of lights that intersect each tile. If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.
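
To make the idea concrete, here's a rough CUDA-style sketch of per-tile light culling (this is not the demo's actual compute shader - the Light struct, buffer names, tile size and depth format are all assumptions made for the example): one thread group per 16x16 tile reduces the tile's min/max view-space depth, then each thread tests a subset of the lights against that range and appends the survivors to a shared list.

// Rough sketch only, not the actual demo shader - Light, buffer names and the
// 16x16 tile size are assumptions. One thread group (CTA) per screen tile.
#include <stdint.h>

struct Light { float3 posView; float radius; };

#define TILE_SIZE 16
#define MAX_LIGHTS_PER_TILE 256

__global__ void cullLightsPerTile(const float* viewDepth, int width, int height,
                                  const Light* lights, int numLights,
                                  uint32_t* tileLightLists, uint32_t* tileLightCounts)
{
    __shared__ uint32_t sMinZ, sMaxZ, sCount;
    __shared__ uint32_t sList[MAX_LIGHTS_PER_TILE];

    int x   = blockIdx.x * TILE_SIZE + threadIdx.x;
    int y   = blockIdx.y * TILE_SIZE + threadIdx.y;
    int tid = threadIdx.y * TILE_SIZE + threadIdx.x;

    if (tid == 0) { sMinZ = 0xFFFFFFFFu; sMaxZ = 0; sCount = 0; }
    __syncthreads();

    // Tile depth bounds via shared-memory atomics on the float bit patterns
    // (valid because view-space depth is non-negative).
    if (x < width && y < height)
    {
        uint32_t z = __float_as_uint(viewDepth[y * width + x]);
        atomicMin(&sMinZ, z);
        atomicMax(&sMaxZ, z);
    }
    __syncthreads();

    float minZ = __uint_as_float(sMinZ);
    float maxZ = __uint_as_float(sMaxZ);

    // Each thread tests a subset of the lights against the tile's depth range.
    // A full implementation would also test the tile's four frustum side planes.
    for (int i = tid; i < numLights; i += TILE_SIZE * TILE_SIZE)
    {
        Light l = lights[i];
        if (l.posView.z + l.radius >= minZ && l.posView.z - l.radius <= maxZ)
        {
            uint32_t slot = atomicAdd(&sCount, 1u);
            if (slot < MAX_LIGHTS_PER_TILE)
                sList[slot] = (uint32_t)i;
        }
    }
    __syncthreads();

    // Write the per-tile light list to global memory for the shading pass.
    uint32_t count = min(sCount, (uint32_t)MAX_LIGHTS_PER_TILE);
    int tile = blockIdx.y * gridDim.x + blockIdx.x;
    if (tid == 0)
        tileLightCounts[tile] = count;
    for (uint32_t i = tid; i < count; i += TILE_SIZE * TILE_SIZE)
        tileLightLists[tile * MAX_LIGHTS_PER_TILE + i] = sList[i];
}

The shading pass (forward or deferred) then only has to loop over the list stored for its tile.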
Thanks for the updated numbers phantom! It definitely seems as though light indexed removes a lot of bandwidth dependence, which is pretty cool. Ethatron posted some numbers on my blog where he showed that performance scaled pretty directly with shader core clock speed. I would suspect that overall light indexed would scale down pretty well to lower-class hardware with lower bandwidth ratings.

...If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.


I did look at that in my paper 'Tiled Shading' that someone posted a link to above, and the short answer is that, no, it does not end well.

On the other hand, I imagine it would be a useful technique simply to manage lights in an environment without too many lights in any given location and with limited views (e.g. an RTS camera or similar), as the limited depth span makes the depth range optimization less effective anyway.

I've got an OpenGL demo too, which builds the grids entirely on the CPU (so it's not very high performance, just there to demo the techniques).

Btw, one thing I noticed that may affect your results is that you make use of atomics to reduce the min/max depth. Shared memory atomics on NVIDIA hardware serialize on conflicts, so using them to perform a reduction this way is less efficient than just using a single thread in the CTA to do the work (at least then you don't have to run the conflict detection steps involved). So this step gets a lot faster with a SIMD parallel reduction, which is fairly straightforward. I don't have time to dig out a good link, sorry - I'll just post a CUDA variant I've got handy, for 32 threads (a warp), but it scales up with appropriate barrier syncs. sdata is a pointer to a 32-element shared memory buffer (is that local memory in compute shader lingo? Anyway, the on-chip variety).

// uint32_t needs <stdint.h>, and the function needs __device__ to compile as CUDA device code.
#include <stdint.h>

// Warp-synchronous sum reduction over 32 values: sdata points to a 32-element
// on-chip (shared memory) buffer and index is the thread's lane id (0..31).
// Relies on lockstep execution within a warp, so no explicit syncs are needed.
__device__ uint32_t warpReduce(uint32_t data, uint32_t index, volatile uint32_t *sdata)
{
    unsigned int tid = index;
    sdata[tid] = data;
    if (tid < 16)
    {
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
    return sdata[0];    // every lane gets the full sum
}
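
For the min/max depth reduction mentioned above, the same warp-synchronous pattern should work with min()/max() in place of the addition - roughly (an untested sketch, same assumptions as the sum version):

// Min variant of the same warp-synchronous reduction (max is identical with
// max() in place of min()); same assumptions: 32 threads, volatile 32-element buffer.
__device__ uint32_t warpReduceMin(uint32_t data, uint32_t index, volatile uint32_t *sdata)
{
    unsigned int tid = index;
    sdata[tid] = data;
    if (tid < 16)
    {
        sdata[tid] = min(sdata[tid], sdata[tid + 16]);
        sdata[tid] = min(sdata[tid], sdata[tid + 8]);
        sdata[tid] = min(sdata[tid], sdata[tid + 4]);
        sdata[tid] = min(sdata[tid], sdata[tid + 2]);
        sdata[tid] = min(sdata[tid], sdata[tid + 1]);
    }
    return sdata[0];
}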


The same goes for the list building, where a prefix sum could be used; here it'd depend on the rate of collisions. Anyway, I'm thinking this might be a difference between NVIDIA and AMD (where I don't have a clue how atomics are implemented).
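
For reference, a warp-level exclusive scan to drive that compaction could look roughly like this (again an untested warp-synchronous sketch; lightIntersectsTile, tileList and listCount are made-up names): each thread contributes a 0/1 flag for "my light passed the tile test", the scan gives it an output slot, and the last element gives the total appended that round.

// Warp-synchronous exclusive prefix sum over 32 flags (0 or 1), relying on the
// same lockstep assumptions as the reduction above. Returns each lane's output
// slot; sdata[31] ends up holding the total number of set flags.
__device__ uint32_t warpScanExclusive(uint32_t flag, uint32_t tid, volatile uint32_t *sdata)
{
    sdata[tid] = flag;
    if (tid >= 1)  sdata[tid] += sdata[tid - 1];
    if (tid >= 2)  sdata[tid] += sdata[tid - 2];
    if (tid >= 4)  sdata[tid] += sdata[tid - 4];
    if (tid >= 8)  sdata[tid] += sdata[tid - 8];
    if (tid >= 16) sdata[tid] += sdata[tid - 16];
    return sdata[tid] - flag;   // inclusive sum minus own flag = exclusive sum
}

// Hypothetical usage, 32 lights tested per iteration:
//   uint32_t pass = lightIntersectsTile(lightIndex, tile) ? 1u : 0u;
//   uint32_t slot = listCount + warpScanExclusive(pass, tid, sdata);
//   if (pass) tileList[slot] = lightIndex;
//   listCount += sdata[31];   // total appended this iteration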

As a side note, it's much more efficient to work out the screen-space bounds of each light before running the per-tile checks; it saves constructing identical planes for tens of tiles, etc.
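
A rough sketch of what I mean (made-up names; assumes view space with +z forward and a standard symmetric perspective projection, with p00 and p11 being the [0][0] and [1][1] entries of the projection matrix): compute a conservative NDC rectangle for each light's bounding sphere once, then the per-tile test is just a rectangle overlap check against the tile's own NDC rectangle.

// Sketch only: conservative screen-space (NDC) rectangle for a view-space
// bounding sphere, computed once per light instead of per tile.
#include <math.h>

struct ScreenRect { float xMin, yMin, xMax, yMax; };

__host__ __device__ ScreenRect lightScreenRect(float3 c, float r,
                                               float p00, float p11, float zNear)
{
    ScreenRect rect = { -1.f, -1.f, 1.f, 1.f };      // full-screen NDC fallback
    if (c.z - r < zNear)                             // sphere reaches the near plane:
        return rect;                                 // treat it as covering everything

    // Project the 8 corners of the sphere's view-space AABB; for a convex box
    // entirely in front of the camera, the min/max of the projected corners is
    // its exact screen AABB, and therefore conservative for the sphere inside it.
    rect.xMin = rect.yMin =  1e30f;
    rect.xMax = rect.yMax = -1e30f;
    for (int i = 0; i < 8; ++i)
    {
        float x = c.x + ((i & 1) ? r : -r);
        float y = c.y + ((i & 2) ? r : -r);
        float z = c.z + ((i & 4) ? r : -r);
        float ndcX = x * p00 / z;                    // assumes a symmetric
        float ndcY = y * p11 / z;                    // perspective projection
        rect.xMin = fminf(rect.xMin, ndcX);  rect.xMax = fmaxf(rect.xMax, ndcX);
        rect.yMin = fminf(rect.yMin, ndcY);  rect.yMax = fmaxf(rect.yMax, ndcY);
    }
    return rect;
}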

Anyway, fun to see some activity on this topic! And I'm surprised at the good results for tiled forward.

Cheers
.ola
