Why use Uber Shaders? Why not one giant shader with some ifs?

20 comments, last by Hodgman 11 years, 11 months ago
I'm not sure how it works in the world of shaders, but in CUDA, when you have branches but all threads take the same branch, as I understand it there is no slowdown. The slowdown comes from divergence: some of the processors have to sit idle while the others execute the branch they did not take.

So say I was doing a renderer in CUDA.

I could say, enable light 0 and make it a point light.
Disable light 1.
Disable light 2.

Then I'd have the entire program loaded and just run something like this:

for (int light = 0; light < maxLights; light++) {
    if (gl_light[light].enabled) {
        switch (gl_light[light].type) {
            case LIGHT_SPOT:
                // do spot light stuff
                break;
            case LIGHT_POINT:
                // do point light stuff
                break;
            case LIGHT_DIRECTIONAL:
                // do directional light stuff
                break;
        }
    }
}


There are branches, but all processors take the same path in the branches and don't cause divergence.

In the world of Shaders I've seen people recommending something like uber shaders to handle different permutations of rendering states to avoid branches since they are supposed to be slower. But is it really slower in shaders when all of them take the same path in the code? You end up having to compile many different shaders and changing which one is loaded based on what you are rendering, which causes some slowdown.

Is there a reason to not just have these much bigger shaders with some if statements?
all threads take the same branch, as I understand there is no slow down
No. When all threads take the same branch, there is no extra slowdown on top of the regular number of cycles it takes to process the condition and branch instructions, but every instruction still has a cost (unless it's optimised out; N.B. some graphics drivers can optimise out these kinds of non-divergent branches when you issue your draw call, if they are able to determine that all threads will take the same path).
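
A rough way to picture this is the C++ sketch below. It is a toy cost model, not real GPU code: the warp width, the cycle counts and the names (`kBranchCost`, `warpCost`, and so on) are all made up for illustration. The assumption is that lanes in a warp run in lockstep, so the warp pays for a path once if any lane takes it, and always pays for evaluating the condition.

```cpp
#include <array>
#include <cstddef>

// Toy cost model with assumed numbers; real hardware costs differ.
constexpr std::size_t kWarpSize = 32;  // assumed warp width
constexpr int kBranchCost = 2;         // assumed cost of the condition + branch instructions
constexpr int kPathACost = 40;         // assumed cost of path A
constexpr int kPathBCost = 10;         // assumed cost of path B

// Returns the modelled cycle cost for one warp.
// takesA[i] is true when lane i takes path A, false when it takes path B.
int warpCost(const std::array<bool, kWarpSize>& takesA) {
    bool anyA = false;
    bool anyB = false;
    for (bool a : takesA) {
        if (a) anyA = true;
        else   anyB = true;
    }
    int cost = kBranchCost;         // paid even when every lane agrees
    if (anyA) cost += kPathACost;   // path A runs once if any lane needs it
    if (anyB) cost += kPathBCost;   // path B runs once if any lane needs it
    return cost;
}
```

With these made-up numbers, a warp where every lane takes path A costs 42 rather than 40 (the branch itself is not free), and a warp with a single stray lane on path B costs 52, which is the divergence penalty.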

But is it really slower in shaders when all of them take the same path in the code?

At least the calculation and testing of the conditions is unnecessary work in this case.


PS: Ahh... too slow. Hodgman is lurking in the shadows all day.
You also have to watch out for increased register pressure from having too many branches, since the compiler will need to allocate enough registers to handle both sides of the branch.

Anyway, you have to realize that a lot of advice regarding graphics is going to come from the era before DX10/CUDA-capable GPUs. This is because old information hangs around the internet instead of dying out, and also because that level of hardware is still prevalent in consoles, mobile devices, and PCs. Before DX10 hardware, branching was generally a much less appealing proposition.
On older hardware such as the Xbox 360 you can go so far as to explain to the compiler exactly what you are trying to do with the branching.
It's far safer to go the permutation route with older hardware.
Also why would you want to do that anyway? Seems like an antipattern to me.

You also have to watch out for increased register pressure from having too many branches, since the compiler will need to allocate enough registers to handle both sides of the branch.


Just highlighting this. If you mix shaders that do a lot of complex lighting math with shaders that are relatively simple, the register requirements of the complex case will kill your warp occupancy (and hence performance) in the simple case.
There is the increased instruction count, but I was thinking maybe it would be balanced out by not having to constantly switch the loaded shader as you are drawing different materials.

I can see the register count being a problem... But if you have a complex shader using many registers, and a branch that is simpler and uses fewer, is that really made any worse, since there are times when you would have the more complex shader loaded and be using a lot of registers anyway?

When you take a branch, I'm pretty sure it uses the same set of registers. I'm not sure how it is on the GPU, but on the CPU branch A won't have registers 1-5 reserved and branch B won't have registers 6-10 reserved.

With good optimizations it should work more like this: branch A would say it needs 3 registers and branch B would say it needs 10. So if you take branch A you use registers 1-3, and if you take branch B you use registers 1-10.
So the provided example may not issue branch instructions at all. This is a candidate for uniform branching, in which case the runtime or driver may choose to produce multiple compilations of the shader where all of the branches have been resolved and the loops unrolled. This was really common before we had hardware branching, in the Shader Model 2.x days. I'm not sure to what extent it's still used now, but you can hint the compiler to unroll loops and avoid branches.

Uber shaders give you much more precise control over compilation though.
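
As a CPU-side analogy only (this is C++, not shader code, and every name and constant in it is hypothetical), the difference between a runtime branch and a resolved permutation can be sketched like this. The runtime version decides the light type on every call; the template version fixes it at compile time, so each instantiation contains a single path, much like one compiled permutation of a shader.

```cpp
enum class LightType { Spot, Point, Directional };

// Placeholder per-type math; the constants are made up for illustration.
inline float shadeSpot(float x)        { return x * 0.5f; }
inline float shadePoint(float x)       { return x * 0.25f; }
inline float shadeDirectional(float x) { return x; }

// "Giant shader" style: the light type is a runtime value, so the switch
// is evaluated on every call.
float shadeRuntime(LightType type, float x) {
    switch (type) {
        case LightType::Spot:        return shadeSpot(x);
        case LightType::Point:       return shadePoint(x);
        case LightType::Directional: return shadeDirectional(x);
    }
    return 0.0f;  // not reached for valid enum values
}

// Permutation style: the light type is a compile-time parameter, and each
// specialization contains only its own path with no branch on the type.
template <LightType Type>
float shadePermutation(float x);

template <>
float shadePermutation<LightType::Spot>(float x) { return shadeSpot(x); }

template <>
float shadePermutation<LightType::Point>(float x) { return shadePoint(x); }

template <>
float shadePermutation<LightType::Directional>(float x) { return shadeDirectional(x); }
```

Both forms give the same result for the same light type. The trade-off is the one discussed in this thread: the permutation form removes the per-call test but means the caller has to pick (and switch between) the right compiled variant.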

There is the increased instruction count, but I was thinking maybe it would be balanced out by not having to constantly switch the loaded shader as you are drawing different materials.

I can see the register count being a problem... But if you have a complex shader using many registers, and a branch that is simpler and uses fewer, is that really made any worse, since there are times when you would have the more complex shader loaded and be using a lot of registers anyway?

When you take a branch, I'm pretty sure it uses the same set of registers. I'm not sure how it is on the GPU, but on the CPU branch A won't have registers 1-5 reserved and branch B won't have registers 6-10 reserved.

With good optimizations it should work more like this: branch A would say it needs 3 registers and branch B would say it needs 10. So if you take branch A you use registers 1-3, and if you take branch B you use registers 1-10.


The way registers work on the GPU is that on each hardware unit there is a fixed-size register file, and the number of registers used by a shader determines how many threads can be in flight simultaneously. So for example, if you had 10k registers and you were running a shader that used 10 registers, then you could have 1k threads in flight. Those threads don't all run concurrently of course, but having lots of threads allows the hardware to swap out threads stalled on memory accesses for other threads that can perform ALU work.

The reason this is a problem with branching is that the shader has to allocate for the worst case. So if you have a complex path that's rarely taken and requires 20 registers and a simple path that only requires 4, each thread will allocate 20 registers for its entire lifetime even if none of the threads ever take that branch. This means your occupancy is always determined by your worst case. In a permutation scenario you could draw your simple-case objects with a simple shader and they would have good occupancy, and only the objects requiring the complex shader path would suffer the performance effects of having high register pressure.
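
The arithmetic above can be written out as a small C++ sketch. It assumes the same simplified model as the post (threads in flight = register file size / registers per thread) and ignores everything else real hardware does, such as allocation granularity and other occupancy limits. The function names are hypothetical.

```cpp
// Threads that fit in flight under the simplified model:
// register file size divided by registers needed per thread.
constexpr int threadsInFlight(int registerFileSize, int registersPerThread) {
    return registerFileSize / registersPerThread;
}

// A single shader containing both paths has to reserve registers for the
// larger of the two, whichever path a thread actually takes.
constexpr int combinedRegisters(int simplePath, int complexPath) {
    return simplePath > complexPath ? simplePath : complexPath;
}

// Numbers taken from the example in the post above.
constexpr int kRegisterFile = 10000;
constexpr int kSimpleRegs = 4;
constexpr int kComplexRegs = 20;

// One big branching shader: every draw is limited by the complex path.
constexpr int kBranchingOccupancy =
    threadsInFlight(kRegisterFile, combinedRegisters(kSimpleRegs, kComplexRegs));

// Permutations: the simple objects get their own, cheaper shader.
constexpr int kSimplePermutationOccupancy = threadsInFlight(kRegisterFile, kSimpleRegs);
constexpr int kComplexPermutationOccupancy = threadsInFlight(kRegisterFile, kComplexRegs);
```

Under this model the branching shader gets 500 threads in flight for everything, while the permutation approach gets 2500 for the simple objects and 500 only for the complex ones.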
