Crossbones+ - Reputation: 3911
Posted 23 January 2013 - 02:23 AM
There is a thing I don't understand.
There seems to be a persistent belief that plain forward can only do 8-10 lights per pass. How? In the past I've had quite some success encoding light data in textures and looping over them on entry-level SM3 hardware. Perhaps I'm not seeing the whole picture, but in SM4, with the much higher resource limits and the unlimited dynamic instruction count... shouldn't we easily get into the thousands? Of course we'll need a z-only pass first.
So I guess there are additional practical reasons to stay in the 8-10 range.
At the top of page 2, I read about extra pressure and lower execution efficiency. I understand.
But, as much as I love the lighting modularity that comes with deferred, as a DDR3 card owner I still don't understand how the improved processing makes up for the bandwidth increase required. The trend on bandwidth is set. It looks to me like we will want to spend compute rather than bandwidth in the future.
Crossbones+ - Reputation: 4977
Posted 28 January 2013 - 12:49 PM
In shader model 4.0 you can have up to 4096 float4 entries in a constant buffer (which would limit lights to ~1024 if each one has position, direction, diffuse and specular stored as one float4 apiece). Or you can use texture buffers and have near limitless lights.
Let's say you use the latter, so no worries about the light count. And today with SM 5.0 we really don't need to worry about loop count limits either. So we're good on that front.
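As a quick sanity check of the constant buffer budget mentioned above, here is a minimal sketch. It assumes each light is packed as four entries (position, direction, diffuse, specular) and shows the result both when the 4096 entries are counted as four-component vectors and when they are counted as scalar floats. The function name `max_lights` is hypothetical.

```python
CBUFFER_ENTRIES = 4096  # per-constant-buffer limit quoted in the post


def max_lights(entries: int, entries_per_light: int) -> int:
    """Number of whole lights that fit in one constant buffer."""
    return entries // entries_per_light


# Assumed layout: position, direction, diffuse, specular.
# If an entry is a float4, a light costs 4 entries.
lights_if_vec4 = max_lights(CBUFFER_ENTRIES, 4)

# If an entry is a single float, the same light costs 16 entries.
lights_if_scalar = max_lights(CBUFFER_ENTRIES, 16)

print(lights_if_vec4, lights_if_scalar)
```

Either way the ceiling is far below "thousands", which is why texture buffers are the practical route for large light counts.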
Indeed you can loop through 1000 lights on SM4+ hardware. But let's say I'm running at 1920x1080 resolution and the whole screen is covered.
1920x1080 x 1000 lights = 2.073.600.000 light evaluations per frame.
Not to mention some BRDFs are expensive (e.g. Cook-Torrance). Framerate would be sloooooooooow. So slow, in fact, that it could trigger the Windows watchdog, which would believe the GPU is stalled and restart the driver.
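The brute-force figure above is easy to reproduce. This is just the arithmetic, with one added assumption that is not in the post: a 60 fps target, used only to show what the per-second load would look like. The name `light_evaluations` is hypothetical.

```python
def light_evaluations(width: int, height: int, lights_per_pixel: int) -> int:
    """Light evaluations per frame when every pixel loops over the same light count."""
    return width * height * lights_per_pixel


per_frame = light_evaluations(1920, 1080, 1000)

# Assumption for illustration only: a 60 fps target.
ASSUMED_FPS = 60
per_second = per_frame * ASSUMED_FPS

print(f"{per_frame:,} evaluations per frame")
print(f"{per_second:,} evaluations per second at {ASSUMED_FPS} fps")
```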
The secret behind Deferred shaders (or Forward+) is that even though there are thousands of lights, they're not covering the whole screen at the same time.
In other words: many small lights = few big lights.
It's typical that a single region of the screen isn't lit by more than 4-20 lights, maybe 5 on average. Let's be pessimistic and say 10.
1920x1080 x 10 lights = 20.736.000 light evaluations per frame
That's a lot more reasonable for a GPU to perform in real time. In such a scenario every region of pixels (called a tile) would only have to loop through 10 lights (on average), not 1000, instead of wasting GPU time on 990 lights that aren't needed.
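To make the tile idea concrete, here is a rough CPU-side sketch, not how any particular engine does it. It assumes lights have already been projected to screen-space circles `(cx, cy, radius)` in pixels, uses a hypothetical 16-pixel tile, and bins conservatively by each circle's bounding box. Real implementations usually do this on the GPU and also cull against per-tile depth bounds.

```python
import random

TILE = 16  # assumed tile size in pixels


def bin_lights(width, height, lights, tile=TILE):
    """Build per-tile light index lists from screen-space circles (cx, cy, radius)."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = [[[] for _ in range(tiles_x)] for _ in range(tiles_y)]
    for i, (cx, cy, r) in enumerate(lights):
        # Conservative test: every tile touched by the circle's bounding box.
        x0 = max(int((cx - r) // tile), 0)
        x1 = min(int((cx + r) // tile), tiles_x - 1)
        y0 = max(int((cy - r) // tile), 0)
        y1 = min(int((cy + r) // tile), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty][tx].append(i)
    return bins


def total_evaluations(bins, width, height, tile=TILE):
    """Sum of (pixels in tile) x (lights in that tile's list) over all tiles."""
    total = 0
    for ty, row in enumerate(bins):
        for tx, light_list in enumerate(row):
            w = min(tile, width - tx * tile)   # edge tiles may be partial
            h = min(tile, height - ty * tile)
            total += w * h * len(light_list)
    return total


# Illustrative scene: 1000 small lights scattered over a 1080p screen.
rng = random.Random(0)
scene = [(rng.uniform(0, 1920), rng.uniform(0, 1080), 40.0) for _ in range(1000)]
tiled = total_evaluations(bin_lights(1920, 1080, scene), 1920, 1080)
brute = 1920 * 1080 * len(scene)
print(f"tiled: {tiled:,}  brute force: {brute:,}")
```

The point of the sketch is only that each pixel loops over its tile's short list instead of the full light array; the radius and light distribution are made up.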
Members - Reputation: 768
Posted 29 January 2013 - 02:10 AM
Edited by CryZe, 29 January 2013 - 02:20 AM.
Moderators - Reputation: 14022
Posted 29 January 2013 - 02:22 AM
Also, with more traditional forward rendering you would typically have a stage (performed either offline or online) where you determine which lights will affect a given mesh, so that you only apply those lights when rendering it. Once again, the only major difference is your granularity, and when/where you cull your lights at that level of granularity. Doing everything on the GPU lets you achieve very fine granularity (per-tile or per-pixel) with relatively simple code, which is the primary draw of deferred techniques.
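For comparison, per-mesh culling in a traditional forward renderer can be sketched as a simple sphere-versus-sphere test. This assumes point lights described by `(position, range)` and meshes described by a bounding sphere; the function name and data layout are hypothetical.

```python
from math import dist


def lights_affecting(mesh_center, mesh_radius, lights):
    """Indices of lights whose range sphere overlaps the mesh's bounding sphere."""
    return [
        i
        for i, (light_pos, light_range) in enumerate(lights)
        if dist(mesh_center, light_pos) <= mesh_radius + light_range
    ]


lights = [
    ((0.0, 0.0, 0.0), 5.0),    # near the mesh
    ((100.0, 0.0, 0.0), 5.0),  # far away
]
print(lights_affecting((3.0, 0.0, 0.0), 1.0, lights))
```

The granularity here is one list per mesh, so a large mesh touched by many small lights still pays for all of them on every pixel it covers.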
Crossbones+ - Reputation: 4977
Posted 29 January 2013 - 01:18 PM
But what if you tested inside this loop whether a light should actually cover the individual pixel, and skipped all the lights that shouldn't? The only difference from a tile-based deferred renderer would be that the light culling is performed per pixel instead of per tile. But you wouldn't have all the BRDF, transparency, bandwidth and anti-aliasing issues. It would basically be a worse version of a light-indexed deferred renderer, because the list of lights is not precomputed.
You could do that, but GPUs suck at branch-heavy applications, especially if there isn't good branch coherency within the tile block (pixel shaders are run in blocks).
And even if coherency were good, a tile-based deferred renderer is MUCH more efficient at performing this culling.
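The branch coherency point can be illustrated with a deliberately simplified model, which is an assumption and not a description of any specific GPU: pixels in a block run in lockstep, so the block executes the shading branch for a light if any pixel in it needs that light, and the other pixels sit idle for that iteration. The names below are hypothetical.

```python
def lockstep_cost(lane_lights):
    """Shading iterations the whole block executes: one per light needed by any lane."""
    return len(set().union(*lane_lights))


def utilization(lane_lights):
    """Fraction of issued lane-iterations that did useful shading work."""
    useful = sum(len(s) for s in lane_lights)
    issued = lockstep_cost(lane_lights) * len(lane_lights)
    return useful / issued if issued else 1.0


# Coherent block: every pixel passes the per-pixel test for the same two lights.
coherent = [{1, 2}, {1, 2}, {1, 2}, {1, 2}]

# Divergent block: each pixel passes the test for a different light.
divergent = [{0}, {1}, {2}, {3}]

print(lockstep_cost(coherent), utilization(coherent))
print(lockstep_cost(divergent), utilization(divergent))
```

On top of this, every pixel still has to run the cheap per-light test against the full light array, which is the part a precomputed per-tile list avoids.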