Why is GPU branching slow?

11 comments, last by Kylotan 17 years, 4 months ago
Simple title, simple question... though I'm not so sure the answer is as simple. Why is branching in shaders considered a bottleneck? Why does it slow performance down so much?
Basically you have a pipeline, each stage of which performs some operation on a pixel or a vertex. If the pipeline is, say, 5 stages long, each stage performs part of the work of executing a shader instruction.

The key to performance is throughput: ideally you want every stage of the pipeline full, which is easy when you know what operation needs to be executed next. If, however, you don't know the next instruction because it depends on the result of a branch, you have to wait for the branch instruction to finish. This means parts of the pipeline sit empty while no more instructions are pushed in. When the result is known, instructions are pushed into the pipeline again and it fills back up.
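The stall described above can be put in rough numbers. Here is a toy Python model of an in-order pipeline (the 5-stage depth and cycle counts are illustrative, not taken from any real GPU): normally one instruction completes per cycle, but each unresolved branch inserts a bubble the full depth of the pipe.

```python
# Toy model of a 5-stage in-order pipeline (numbers are illustrative).
# A branch blocks fetching the next instruction until it resolves,
# leaving "bubbles" (empty stages) behind it.

PIPELINE_DEPTH = 5

def cycles_to_execute(instructions):
    """Count cycles to drain a stream of 'op' / 'branch' markers.

    Without branches, an N-instruction stream takes N + depth - 1
    cycles (one instruction completes per cycle once the pipe is
    full).  Each branch adds depth - 1 bubble cycles while fetch
    waits for the outcome.
    """
    cycles = PIPELINE_DEPTH - 1  # fill latency for the first instruction
    for instr in instructions:
        cycles += 1              # one result per cycle at full throughput
        if instr == 'branch':
            cycles += PIPELINE_DEPTH - 1  # stall until the branch resolves
    return cycles

straight = ['op'] * 20
branchy  = ['op'] * 16 + ['branch'] * 4

print(cycles_to_execute(straight))  # 24 cycles
print(cycles_to_execute(branchy))   # 40 cycles: 4 branches add 16 bubbles
```

Same instruction count, but the branchy stream takes two thirds longer purely from waiting on branch results.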

Now GPUs can try to predict the likely outcome so as to avoid an empty pipeline, but if they get it wrong, the throughput drops. GPUs also have lots of pipelines and can compute the true result in one and the false result in another.

It's something like that.

Dave
I'm not 100% certain of this answer, but my understanding is that the shader pipes execute each shader instruction in lock-step. Meaning if you have 16 pipes, each shader unit will execute instruction 1, then 2, then 3, etc...

So if at any point one unit needs to jump to a different instruction, all the shader units would have to go to that new instruction... which kind of defeats the purpose. So to get around that, both branches of instructions are executed on modern GPUs.

Note that this is only for dynamic branching, static branching doesn't have the same limitation.
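A minimal sketch of that "execute both branches" idea (the function and example here are mine, not from any GPU API): every element in the lock-step group runs the instructions for both sides of the `if`, and a per-element mask selects which result is kept. This is often called predication.

```python
# Simplified model of lock-step execution: all elements in a group run
# the same instruction stream, so an `if` is handled by computing BOTH
# sides for every element and selecting per element with a mask.

def predicated_if(values, condition, then_fn, else_fn):
    """Run both branches over the whole group, keep one result each."""
    mask         = [condition(v) for v in values]  # per-element predicate
    then_results = [then_fn(v) for v in values]    # everyone pays for this...
    else_results = [else_fn(v) for v in values]    # ...and for this
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

# Example: halve even numbers, triple odd ones -- all four "pipes"
# execute both the halve and the triple; the mask picks the answer.
out = predicated_if([1, 2, 3, 4],
                    condition=lambda v: v % 2 == 0,
                    then_fn=lambda v: v // 2,
                    else_fn=lambda v: v * 3)
print(out)  # [3, 1, 9, 2]
```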
-----------------------------------Indium Studios, Inc.
Dave, that's not quite it.

See the following threads:

is dynamic branching really dynamic branching under SM3.0 shader cards?

Need explanation of efficiency issues with dynamic branching

Hope that helps

GPUs make heavy use of SIMD multiprocessing: for every running instance of a given shader program, several fragments or vertices are processed in parallel by that program. Clearly, because the code is only run once, this can only work if the execution path is the same for each element. If there are branches whose outcome differs from element to element, those paths have to be processed separately, stalling the pipeline.
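The cost of that per-group behaviour can be modelled in a few lines of Python (a sketch under my own simplifying assumptions, not a description of any specific GPU): if every element in a group takes the same side of a branch, only that side's instructions are issued; if the group diverges, both sides are issued one after the other.

```python
# Cost model for one SIMD group at a branch: a side of the branch is
# only issued if at least one element in the group needs it, so a
# divergent group pays for both sides serially.

def branch_cost(predicates, then_cost, else_cost):
    """Instruction-issue cost of one branch for a group of elements."""
    any_then = any(predicates)        # someone takes the 'then' side
    any_else = not all(predicates)    # someone takes the 'else' side
    return then_cost * any_then + else_cost * any_else

uniform   = [True] * 8                # coherent group: all take 'then'
divergent = [True] * 4 + [False] * 4  # mixed group

print(branch_cost(uniform, then_cost=10, else_cost=30))    # 10
print(branch_cost(divergent, then_cost=10, else_cost=30))  # 40
```

This is why branching is cheap when nearby pixels or vertices tend to take the same path, and expensive when the condition varies within a group.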
Thanks for the fast answers.
So basically, each shader must run a single set of code, and a branch means two sets of code have to be executed on the GPU.
What are dynamic branching and static branching? This is the first time I've heard of those terms..
The principle of branching being slow that I described applies when an instruction depends on the result of the branch and therefore has to wait for it to complete before it knows what to do.

My description (although crude) is applicable to CPUs as well.

Dave
Hi,

I'm not an expert on this, but I can assure you GPUs weren't designed to be general processing chips. At first they weren't even stream processors. But that's where things are heading (take a look at nVidia's latest offering, which lets you code the individual stream processors, and AMD's stream processing initiative using ATi's (ATi now being part of AMD) X1900 cards). Anyway, the reason is that the first GPUs were fixed function. Then came static shaders. And only recently (well, not really in terms of GPU generations) was dynamic branching added. IIRC, they internally implemented a simple execute/don't-execute mechanism for specific blocks of pixels running a specific shader. Only recently has every "pipe" or "stream processor" had its own dynamic branching unit (I'm not sure, but I think some still have to share them).

From a more architectural view, the GPU is a large, super-parallel processing chip. Only recently has their design shifted from a more SIMD approach (where one instruction operates on a lot of data simultaneously) to a more MIMD approach (giving shader processors more "independence", like, for example, ATi/AMD's thread control logic on their newest cards), but their essence is still SIMD. To allow more independence between shaders, you need more control logic and consequently a larger transistor budget, and for the last few years that independence wouldn't have benefited the game industry as much as more pipes did. Also, I assume the FPUs/ALUs/FMACs inside the shader processors are heavily pipelined to allow greater throughput, hence a conditional branch creates a very large bubble inside the pipeline.

Again, I'm just a fan of computers, so don't bet your life on this info. For further reading I suggest Wikipedia's articles about SIMD, MIMD, pipelines, out-of-order execution, GPUs, FPUs, and whatever links they point to. Also, a great source of information is Jon "Hannibal" Stokes' articles about CPU architecture on Ars Technica. Forgive my terrible English, it's been a while. Hope this helps,

JVFF
ThanQ, JVFF (Janito Vaqueiro Ferreira Filho)
Man I'm slow to type... 0_0

Anyway, static branching is unconditional and dynamic branching is conditional. IOW, static is a jump instruction, while dynamic is an "if" instruction — but we only know which way the "if" goes after we have the result, which means it has a dependency, while the other one doesn't. Once you get to an "if" you can a) stop execution until you get the result, or b) guess the result (true or false); if you're right, woohoo, if not, you have to discard everything you've done after the guess. Hope this helps,
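Option (b) above — guess, keep going, and discard on a wrong guess — can be sketched as a small cost model in Python (the penalty value and the predictor are purely illustrative assumptions of mine):

```python
# Guess-and-discard model of branch prediction: a correct guess costs
# nothing extra; a wrong guess throws away the speculated work,
# modelled here as a fixed misprediction penalty.

MISPREDICT_PENALTY = 9  # cycles of discarded work; illustrative number

def run_with_prediction(outcomes, predict):
    """Total penalty cycles for a stream of branch outcomes.

    `outcomes` is a list of booleans (taken / not taken);
    `predict` maps the branch history so far to a guess.
    """
    penalty, history = 0, []
    for taken in outcomes:
        if predict(history) != taken:
            penalty += MISPREDICT_PENALTY  # flush the wrong-path work
        history.append(taken)
    return penalty

# Trivial predictor: guess the same outcome as last time.
last_time = lambda history: history[-1] if history else True

print(run_with_prediction([True] * 8, last_time))         # 0: always right
print(run_with_prediction([True, False] * 4, last_time))  # 63: alternating
```

A predictable branch (e.g. a loop that almost always repeats) costs nearly nothing, while an alternating one mispredicts every time after the first.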

JVFF
ThanQ, JVFF (Janito Vaqueiro Ferreira Filho)
The amount of information you've supplied me with is amazing — thanks for the detailed replies, this is extremely useful to me.
Can you say that, for instance, a vertex arrives at a vertex processor, the "if" condition doesn't hold for that vertex, so it "returns" to the pool of unprocessed vertices for another run? Is that explanation applicable?
