
Branching & picking lighting technique in a Deferred Renderer


16 replies to this topic

#1 spek   Prime Members   -  Reputation: 993


Posted 17 May 2012 - 07:39 AM

We all know one of the problems with deferred rendering is using various lighting models (Lambert, Phong, Oren, Ward, et cetera). Yet I wonder if branching is still a DON'T these days. I mean, having an "if-then-else" usually doesn't give me a noticeable performance difference. But in this case, the branch could become relatively large:
float4 psLightShader( float2 texcoords : TEXCOORD0 ) : COLOR
{
   // Lighting-technique ID previously written into a G-buffer channel
   int lightingTechnique = (int)tex2D( gBuffer, texcoords ).x;

   if      ( lightingTechnique == 0 )  return doLambert_Blinn( texcoords );
   else if ( lightingTechnique == 1 )  return doLambert_Phong( texcoords );
   else if ( lightingTechnique == 2 )  return doOrenNayar( texcoords );
   else if ( lightingTechnique == 3 )  return doWard( texcoords );
   ...
}
I think there will be 10 to 20 different options in the end, although most pixels will use a default (Lambert + Blinn) technique.

Would this be a bad idea, and if so, is there a smart way around it? There was something about combining a deferred with a forward pipeline recently, but I'm afraid that's a bit too much of a change. Honestly, using different BRDFs isn't giving THAT much of an improvement either, so if it costs too much, I'll fall back on just a few standard lighting techniques.

Rick


#2 Waterlimon   Crossbones+   -  Reputation: 2436


Posted 17 May 2012 - 08:06 AM

I believe it won't be a problem if the threads (are they called threads...?) branch the same way. So if one part of the screen uses lighting technique X and another part uses Y, it wouldn't be bad, but if every pixel used a random technique it would really slow things down a lot.

Though I don't really know how GPUs work. Maybe modern GPUs don't need all of them branching the same way; maybe they can sort the work so that pixels requiring the same branch are processed at the same time.

Anyway, I think it won't be a problem if they all generally branch the same way.



#3 spek   Prime Members   -  Reputation: 993


Posted 17 May 2012 - 08:10 AM

Well, luckily it should be somewhat coherent indeed. A typical example would be a room that uses default shading on all contents, except for the walls using Oren-Nayar and a sofa using a velvet lighting method. In other words, pixels using technique X are clustered together.

Yet I'm still a bit afraid of having such a big branch. Or doesn't it matter much whether a shader has to check for 2 or 20 cases?

#4 Waterlimon   Crossbones+   -  Reputation: 2436


Posted 17 May 2012 - 08:25 AM

I googled a bit, and at least for some modern GPUs it seems that if multiple branches are taken within a collection of threads, it evaluates all of those branches for all of those threads. So it doesn't seem that bad. Not sure if there is other cache-related stuff that would make it slower, but if there are just 1-2 different branches to take and they're not randomly scattered all over the place, it could do pretty well.



#5 Radikalizm   Crossbones+   -  Reputation: 2802


Posted 17 May 2012 - 08:25 AM

Why not try to select only a small number of BRDFs to use and go from there? I myself use 2 more or less physically based BRDFs (an isotropic and an anisotropic one), which can create a wide range of realistic and good-looking materials in my deferred setup. I use a branch just like you suggested and notice no performance drop whatsoever in my profiling tests.

I can't imagine a normal scene rendered with a deferred renderer requiring 20 BRDFs to be available at once, though. If you really want to render with tons of different BRDFs (e.g. when you want to do physically correct rendering), you should probably rethink your decision to use a deferred renderer, instead of trying to cram tons of techniques into a pipeline that really wasn't designed to handle them.



#6 InvalidPointer   Members   -  Reputation: 1407


Posted 17 May 2012 - 08:50 AM

Well, luckily it should be somewhat coherent indeed. A typical example would be a room that uses default shading on all contents, except for the walls using Oren-Nayar and a sofa using a velvet lighting method. In other words, pixels using technique X are clustered together.

Yet I'm still a bit afraid of having such a big branch. Or doesn't it matter much whether a shader has to check for 2 or 20 cases?

It's still going to add about 19+ spurious ALU ops that may or may not be scheduled concurrently with useful work, depending on the target GPU architecture and a handful of other things. In the non-coherent branch case, you're very likely going to be shading all 20+ BRDF models and then doing a predicated move to pick the 'right' result -- *any* sort of boundary is going to be disproportionately expensive to render. I guess what I'm trying to say here is that your question gets asked a lot and the answer hasn't really changed much :(
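A minimal Cg-style sketch of what that flattening amounts to, reusing the hypothetical doLambert_Blinn/doOrenNayar helpers from the first post (the constant return values here are dummies):

sampler2D gBuffer;

// Dummy stand-ins for the BRDF helpers from the first post.
float4 doLambert_Blinn( float2 uv ) { return float4( 0.5, 0.5, 0.5, 1 ); }
float4 doOrenNayar( float2 uv )     { return float4( 0.4, 0.4, 0.4, 1 ); }

float4 psFlattened( float2 texcoords : TEXCOORD0 ) : COLOR
{
   int lightingTechnique = (int)tex2D( gBuffer, texcoords ).x;

   // Both BRDFs are evaluated for every pixel; the ?: compiles down to a
   // predicated register copy rather than a jump that skips work.
   float4 lambert = doLambert_Blinn( texcoords );
   float4 oren    = doOrenNayar( texcoords );
   return ( lightingTechnique == 0 ) ? lambert : oren;
}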

If you want flexible BRDFs, you have a few options. You can just use standard, expressive BRDFs like Oren-Nayar/Minnaert or Kelemen/Szirmay-Kalos for everything and store some additional material parameters in your G-buffers; this is in general a workable base for most scenes. More esoteric surfaces could be handled via forward shading (and you may be doing this anyway for things like hair, seeing as it's partially transparent and all) and composited into the final render.

You could also aim for more general BRDF solutions like Lafortune or Ashikhmin-Shirley and encode their parameters too. This should be sufficient to represent pretty much any material you can think of.

Lastly, you can also give tiled forward rendering a go. If you're starting off from a deferred renderer this may not be that hard to switch over to, though you'll need to do some work on the CPU side (namely light binning and culling) if you're just using a D3D9 feature set. It should still be viable, however.
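A sketch of what "additional material parameters in your G-buffers" could look like -- the MRT layout and the uniform names here are made up purely for illustration:

sampler2D albedoMap;
float materialRoughness;      // e.g. an Oren-Nayar sigma
float materialSpecIntensity;
float materialSpecPower;      // stored as power/255 to fit an 8-bit channel

struct GBufferOut
{
   float4 rt0 : COLOR0;  // rgb = albedo,             a = roughness
   float4 rt1 : COLOR1;  // rgb = normal * 0.5 + 0.5, a = specular intensity
   float4 rt2 : COLOR2;  // r = spec power / 255,     gba = free for more parameters
};

GBufferOut psFillGBuffer( float2 uv : TEXCOORD0, float3 normal : TEXCOORD1 )
{
   GBufferOut o;
   o.rt0 = float4( tex2D( albedoMap, uv ).rgb, materialRoughness );
   o.rt1 = float4( normalize( normal ) * 0.5 + 0.5, materialSpecIntensity );
   o.rt2 = float4( materialSpecPower / 255.0, 0, 0, 0 );
   return o;
}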

#7 Waterlimon   Crossbones+   -  Reputation: 2436


Posted 17 May 2012 - 08:55 AM

... In the non-coherent branch case, you're very likely going to be shading all 20+ BRDF models...


Wait, if there are, let's say, 3 different techniques used in the group of pixels being processed, doesn't it just calculate those 3? Or does it really calculate all the branches? Or did you mean the worst-case situation where all the techniques are used and they all happen to end up in the same thread-group-thing?



#8 Hodgman   Moderators   -  Reputation: 28632


Posted 17 May 2012 - 09:14 AM

N.B. you can express branching in other ways, such as stencil masking or tile classification.

e.g. say you're trying to do a full-screen lighting pass with 2 BRDFs, say iso/aniso --
You could first do a pass that reads your G-buffer material ID value and outputs a mask - red for iso and green for aniso. Then you could down-sample this mask, say, 32 times smaller. When down-sampling, if red and green are blended together at all, output blue instead. This then gives you 3 tile masks -- red tiles contain only iso pixels, green tiles only aniso, and blue tiles contain both.
Instead of drawing a full-screen quad with a branching shader, you can now draw 3 full-screen quad-grids, whose vertices are associated with the texels in your mask texture -- the first grid uses an iso-lighting pixel shader and a vertex shader which rejects vertices that don't belong to a red tile. Same for the 2nd grid, but with an aniso shader and green-tile vertex checking. The last grid checks for blue vertices and uses your branching shader supporting multiple BRDFs.
On older hardware with bad branching performance, you now only perform branching on the tiles that actually need it, and areas where the tiles are coherent are only shaded by the appropriate branchless shader.
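A rough Cg-style sketch of the classification and vertex-rejection passes described above. The sampler names, the down-sample chain, and the "push the vertex outside the clip volume" rejection trick are all illustrative assumptions, not a specific engine's code:

sampler2D materialIdBuffer;  // G-buffer channel holding the BRDF ID
sampler2D prevMaskLevel;     // previous (larger) level of the mask chain
sampler2D tileMask;          // final 32x down-sampled mask, read by the grids
float2 prevTexelSize;        // texel size of prevMaskLevel

// Pass 1: classify every pixel -- red = iso, green = aniso.
float4 psClassify( float2 uv : TEXCOORD0 ) : COLOR
{
   int id = (int)tex2D( materialIdBuffer, uv ).x;
   return ( id == 0 ) ? float4( 1, 0, 0, 0 ) : float4( 0, 1, 0, 0 );
}

// Pass 2, repeated until the mask is 32x smaller: 2x2 box down-sample.
// A tile that ends up containing both colors (or any blue) becomes blue.
float4 psDownsample( float2 uv : TEXCOORD0 ) : COLOR
{
   float4 s = tex2D( prevMaskLevel, uv )
            + tex2D( prevMaskLevel, uv + float2( prevTexelSize.x, 0 ) )
            + tex2D( prevMaskLevel, uv + float2( 0, prevTexelSize.y ) )
            + tex2D( prevMaskLevel, uv + prevTexelSize );
   if ( s.b > 0 || ( s.r > 0 && s.g > 0 ) )
      return float4( 0, 0, 1, 0 );  // mixed tile
   return ( s.r > 0 ) ? float4( 1, 0, 0, 0 ) : float4( 0, 1, 0, 0 );
}

// Pass 3: vertex shader for the iso quad-grid. Vertices whose tile isn't
// pure red are pushed outside the clip volume, so their quads are culled.
void vsIsoGrid( float4 pos : POSITION, float2 tileUv : TEXCOORD0,
                out float4 oPos : POSITION, out float2 oUv : TEXCOORD0 )
{
   // Needs vertex texture fetch (SM3+ hardware).
   float4 mask = tex2Dlod( tileMask, float4( tileUv, 0, 0 ) );
   oPos = ( mask.r > 0 ) ? pos : float4( 0, 0, -9999, 1 );
   oUv = tileUv;
}

The aniso and branching grids would use the same vertex shader with the mask check swapped to green and blue respectively.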

#9 Crowley99   Members   -  Reputation: 178


Posted 17 May 2012 - 09:31 AM

My biggest concern is that your shader's register usage will be determined by your most complex lighting function, which will reduce your warp occupancy even for the simpler shading paths. The approach Hodgman mentioned avoids this pitfall (which badly affects many engines out there that have tried it).

#10 spek   Prime Members   -  Reputation: 993


Posted 17 May 2012 - 10:11 AM

Probably you guys are right that only a few BRDFs will be sufficient, but at this early stage it's pretty hard to tell which ones will be useful and which won't. I guess the final version will just use basic Lambert + Blinn, Oren-Nayar for the many matte surfaces I have, and 1 or 2 anisotropic variants.

Doing multiple steps like Hodgman explains sounds interesting. And since tiled deferred rendering is also on the to-do list, I'm heading that way sort of anyway. Doing steps with specific shaders kills the branching issue, but of course also introduces some other overhead (shader switches, making the mask, downscaling). Hard to say what would be the best choice.


Instead of branching, it still might be a good idea to encode multiple lighting models into 2D (or 3D?) textures and simply use the texcoords to pick one. Probably the fastest way, although encoding more complex models such as Ward or Cook-Torrance into textures may get a bit difficult, or require quite a lot of texture reads.
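For what it's worth, a minimal sketch of that lookup-texture idea, assuming a 3D LUT indexed by (N.L, N.H) with the technique ID selecting the slice. All names, the parameterization, and the single directional light are illustrative assumptions:

sampler2D normalBuffer;  // G-buffer world-space normals, packed 0..1
sampler2D gBuffer;       // .x holds the technique ID, normalized to 0..1
sampler3D brdfLut;       // (N.L, N.H, technique) -> pre-integrated BRDF value
float3 lightDir;         // normalized; one directional light assumed
float3 halfVec;          // normalized half vector for that light
float3 lightColor;

float4 psLutLighting( float2 uv : TEXCOORD0 ) : COLOR
{
   float3 N = normalize( tex2D( normalBuffer, uv ).xyz * 2 - 1 );
   float  technique = tex2D( gBuffer, uv ).x;
   float  NdotL = saturate( dot( N, lightDir ) );
   float  NdotH = saturate( dot( N, halfVec ) );

   // One dependent texture read replaces the whole if/else chain.
   float3 brdf = tex3D( brdfLut, float3( NdotL, NdotH, technique ) ).rgb;
   return float4( brdf * lightColor, 1 );
}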

#11 InvalidPointer   Members   -  Reputation: 1407


Posted 17 May 2012 - 12:36 PM

N.B. you can express branching in other ways, such as stencil masking or tile classification.

e.g. say you're trying to do a full-screen lighting pass with 2 BRDFs, say iso/aniso --
You could first do a pass that reads your G-buffer material ID value and outputs a mask - red for iso and green for aniso. Then you could down-sample this mask, say, 32 times smaller. When down-sampling, if red and green are blended together at all, output blue instead. This then gives you 3 tile masks -- red tiles contain only iso pixels, green tiles only aniso, and blue tiles contain both.
Instead of drawing a full-screen quad with a branching shader, you can now draw 3 full-screen quad-grids, whose vertices are associated with the texels in your mask texture -- the first grid uses an iso-lighting pixel shader and a vertex shader which rejects vertices that don't belong to a red tile. Same for the 2nd grid, but with an aniso shader and green-tile vertex checking. The last grid checks for blue vertices and uses your branching shader supporting multiple BRDFs.
On older hardware with bad branching performance, you now only perform branching on the tiles that actually need it, and areas where the tiles are coherent are only shaded by the appropriate branchless shader.

Battlefield 3 does something like this, so the approach is definitely workable. D3D11 makes this vastly easier if you use append/consume buffers.


... In the non-coherent branch case, you're very likely going to be shading all 20+ BRDF models...


Wait, if there are, let's say, 3 different techniques used in the group of pixels being processed, doesn't it just calculate those 3? Or does it really calculate all the branches? Or did you mean the worst-case situation where all the techniques are used and they all happen to end up in the same thread-group-thing?

Nope, you're going to take all paths and pick the result you would get at the end. CPUs can generally jump around in the instruction stream to skip work, but that's not really functionality GPUs have -- you just have to settle for a conditional move that copies a register value if a certain condition evaluates to true. Obviously, this ends up being rather inexpressive.

You could use dynamic branching, but that pretty much just rearranges the masking so that you run each group of BRDF'd pixels in sequence. It's better, but still skirts the 'unacceptable' performance range. GPUs work best when computing lots of instruction/data streams in parallel, and as soon as you start introducing sequencing (transparent as it may be in the initial code) you reduce opportunities for parallel work.

#12 Crowley99   Members   -  Reputation: 178


Posted 17 May 2012 - 06:08 PM

Nope, you're going to take all paths and pick the result you would get at the end. CPUs can generally jump around in the instruction stream to skip work, but that's not really functionality GPUs have -- you will have to settle for a conditional move that copies a register value if a certain condition evaluates to true. Obviously, this ends up being rather inexpressive.

You could use dynamic branching, but that pretty much just rearranges the masking so that you run each group of BRDF'd pixels in sequence. It's better, but still skirts the 'unacceptable' performance range. GPUs work best when computing lots of instruction/data streams in parallel, and as soon as you start introducing sequencing (transparent as it may be in the initial code) you reduce opportunities for parallel work.


GPUs can and do use dynamic branching to skip work. If only 3 possibilities are realized within a warp, those will be the only 3 that are executed. Predicated moves will only occur if the branched code is short enough that it doesn't justify the jump.

#13 spek   Prime Members   -  Reputation: 993


Posted 18 May 2012 - 12:58 PM

Thanks for all the info guys!
One last question. I have little understanding how threading / branching exactly works inside the deep layers of the GPU, but it seems to make sense parallel architecture benefits from doing "the same" as much as possible.

That leaves me with a question "Tiled Deferred Rendering". AFAIK, each screen-tile will get a number of lights to loop through. I guess that means:
- define a light-count (x pointlights, y shadowed spotlights, z cascaded lights, et cetera)
- make an array of lights, or indices that can be used to lookup in one big light-array (Uniform Buffer Object)
- Read G-Buffer stuff (once) and loopieloop through those arrays

But... since the loopnumbers differ for each tile, won't that hurt the parallel processing as well? Or... is it possible to explicitely tell which thread does which tile(s)? To make sure a thread isn't hopping between 2 or 4 tiles with different parameters. Using OpenGL and Cg here btw.

#14 MJP   Moderators   -  Reputation: 10648


Posted 18 May 2012 - 03:00 PM

GPUs work in terms of groups of threads, where every thread in the group works on a SIMD hardware unit and shares the same instruction stream. If all of the threads in such a group go the same way in a branch then there's no problem, since they can all still execute the same instruction. But if some of the threads take a branch and some don't, then you have divergence, and both sides of the branch must be executed. On Nvidia hardware these groups of threads are called "warps" and have 32 threads, and on AMD hardware they're called "wavefronts" and have 64 threads. GPUs will always execute entire warps/wavefronts at a time, so they're basically the smallest level of granularity for the hardware.

Pixel/fragment shaders will (in general) assign the threads of a warp/wavefront to a group of contiguous pixels in screen space, based on the coarse rasterization performed by the hardware. This is why you'll see people say that you want your branching to be coherent in screen space: threads in a warp/wavefront will be next to each other on screen.

When you see people talk about tile-based deferred rendering, they're generally going to be using compute shaders, where thread assignment is more explicit. With compute shaders/OpenCL/CUDA you explicitly split up your kernel into "thread groups", where a thread group is made up of several warps/wavefronts all executing on the same hardware unit (Shader Multiprocessor in Nvidia terminology, Compute Unit in AMD terminology). With compute shaders it's up to you to decide how to assign threads to pixels or vertices or whatever it is you're processing. In the case of deferred rendering, the common way is to have thread groups of 16x16 threads working on a 16x16 square of pixels. Each thread group first performs culling to create a per-tile list of lights to process, and then each thread runs through the list and applies each light one by one. There's no divergence in this case, since every warp/wavefront uses the same per-tile list (all threads in a warp/wavefront always belong to the same thread group), so you don't need to worry about that.
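To make that pattern concrete, here's a bare-bones HLSL compute shader sketch of the 16x16 tiling described above. The light struct, the intersection test, and the shading function are placeholder stand-ins, not any particular engine's code:

#define TILE_SIZE 16
#define MAX_LIGHTS_PER_TILE 256

struct PointLight
{
    float3 position;
    float  radius;
    float3 color;
};

cbuffer Constants
{
    uint numLights;
};

StructuredBuffer<PointLight> allLights;
RWTexture2D<float4> outputTexture;

// Placeholder stand-ins: a real version tests against the tile's frustum
// and evaluates a proper BRDF using the G-buffer.
bool   LightIntersectsTile( PointLight light, uint2 pixel ) { return true; }
float3 ShadePixel( uint2 pixel, PointLight light )          { return light.color; }

groupshared uint tileLightCount;
groupshared uint tileLightIndices[MAX_LIGHTS_PER_TILE];

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void TiledLightingCS( uint3 dispatchId : SV_DispatchThreadID,
                      uint  groupIndex : SV_GroupIndex )
{
    if ( groupIndex == 0 )
        tileLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Culling: the tile's 256 threads each test a slice of the global list.
    for ( uint i = groupIndex; i < numLights; i += TILE_SIZE * TILE_SIZE )
    {
        if ( LightIntersectsTile( allLights[i], dispatchId.xy ) )
        {
            uint slot;
            InterlockedAdd( tileLightCount, 1, slot );
            if ( slot < MAX_LIGHTS_PER_TILE )
                tileLightIndices[slot] = i;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // Shading: every thread (one pixel) walks the same per-tile list,
    // so warps/wavefronts never diverge on the light loop.
    float3 color = 0;
    for ( uint j = 0; j < min( tileLightCount, MAX_LIGHTS_PER_TILE ); ++j )
        color += ShadePixel( dispatchId.xy, allLights[tileLightIndices[j]] );
    outputTexture[dispatchId.xy] = float4( color, 1 );
}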

If you're looking for a nice intro to how GPUs work in terms of threading and ALUs, then this presentation is a good read.

Edited by MJP, 18 May 2012 - 03:01 PM.


#15 spek   Prime Members   -  Reputation: 993


Posted 19 May 2012 - 01:49 AM

Thanks!

>>

the common way to do it is to have thread groups of around 16x16 threads working on a 16x16 square of pixels


I don't quite understand how the division works. You mean the screen gets divided into 16x16-pixel tiles, and then one group of 256 threads works on each tile, thus 1 thread per pixel? But warps/wavefronts only have 32 or 64 threads? I guess I got it wrong.


What is the OpenGL "Compute Shader" variant (if there is one)? All new to me! I did a quick Google, but mainly found articles related to OpenCL (Apple) or DX11.
>>edit: Seems OpenCL is also available for Windows

And in case there isn't one (yet), or my hardware is too old, is it also possible to render such a tiled grid in a smart way to take advantage of the parallel processing? If I just render a quad-grid (16x16 pixels per quad) and apply the light parameters for each quad individually, does the whole threading tactic get screwed up?

Edited by spek, 19 May 2012 - 11:47 AM.


#16 MJP   Moderators   -  Reputation: 10648


Posted 19 May 2012 - 01:17 PM

A thread group (using compute shader terminology) is made up of multiple warps/wavefronts, so a thread group with 256 threads will have 8 warps, or 4 wavefronts. The warp/wavefront thing is actually transparent to the programmer; it's just an implementation detail of the hardware. So for instance you can have a thread group with 60 threads, but the hardware will end up executing a full wavefront and just masking off the last 4 threads.

OpenGL has no true equivalent of a compute shader; instead it allows you to interop with OpenCL. OpenCL has the same capabilities as compute shaders, it's just not as tightly integrated with the rest of the graphics pipeline.

You can render quads and process them with a fragment shader, and the scheduling should be coherent if you use a grid of equal-sized quads. The one thing that you can't do compared to a compute shader is make use of thread group shared memory, which is on-chip memory shared among the threads within a thread group. The cool thing you can do with that in deferred rendering is build up a shared list of light indices as you cull the lights per-tile, which lets you cull the lights in parallel and then process the list per-pixel. You can't do that with a fragment shader, so instead it would probably make sense to build the per-tile lists ahead of time on the CPU.
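A Cg-style sketch of that fragment-shader fallback, assuming the CPU has already binned the lights per tile and uploads each tile's list as uniforms before drawing that tile's quad (the names, light layout, and simple attenuation are made up for illustration):

#define MAX_TILE_LIGHTS 32

// Filled in by the CPU binning pass before each tile's quad is drawn.
float4 tileLightPosRadius[MAX_TILE_LIGHTS];  // xyz = position, w = radius
float3 tileLightColor[MAX_TILE_LIGHTS];
int    tileLightCount;

sampler2D positionBuffer;  // G-buffer world positions
sampler2D normalBuffer;    // G-buffer normals, packed 0..1

float4 psTileLighting( float2 uv : TEXCOORD0 ) : COLOR
{
   float3 P = tex2D( positionBuffer, uv ).xyz;
   float3 N = normalize( tex2D( normalBuffer, uv ).xyz * 2 - 1 );

   // Every pixel of the quad runs the same loop count, so the
   // warps/wavefronts covering this tile stay coherent.
   float3 color = 0;
   for ( int i = 0; i < tileLightCount; ++i )
   {
      float3 toLight = tileLightPosRadius[i].xyz - P;
      float  atten = saturate( 1.0 - length( toLight ) / tileLightPosRadius[i].w );
      color += tileLightColor[i] * atten * saturate( dot( N, normalize( toLight ) ) );
   }
   return float4( color, 1 );
}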

EDIT: see this for more info

Edited by MJP, 19 May 2012 - 01:55 PM.


#17 spek   Prime Members   -  Reputation: 993


Posted 19 May 2012 - 01:54 PM

This doesn't happen too often, but it seems there are even Delphi 7 headers + an example + a handy AMD link for OpenCL:
http://www.heatlab.cz/OpenCLforDelphi.html
To get a real understanding of what you are saying, I just have to dive in myself, I'm afraid. It never stops, those nasty graphics :P But it will probably be pretty refreshing to get into the "deeper layers", which may also give more insight into how a GPU actually works, like the presentation you posted shows.

In the meanwhile, I'll just start with an equal-sized quad-grid, using UBOs and indices (calculated by the CPU) that hold all light parameters. And to come back to the original question about BRDFs... either I'll do it in a few steps as Hodgman suggested, to prevent branching on which BRDF to choose, or I'll just use a single BRDF and use hacks for the (rare) materials that require anisotropic or other advanced techniques (velvet, human skin, subsurface scattering, et cetera). Shoot me if I'm wrong, but after some playing around, I don't think Oren-Nayar or Cook-Torrance will add that much. Using the right textures and a good Blinn specular intensity/shininess already helps a lot.



