Why is GPU branching slow?

11 comments, last by Kylotan 17 years, 4 months ago
Simple title, simple question... though I'm not so sure the answer is as simple. Why is branching in shaders considered a bottleneck? Why does it slow performance down so much?
Basically you have a pipeline, each stage of which performs some operation on a pixel or a vertex. If the pipeline is, say, 5 stages long, each stage performs part of the work of executing a shader instruction.

The key to performance is throughput: ideally you want every stage of the pipeline full, which is easy when you know what operation needs to be executed next. If, however, you don't know the next instruction because it depends on the result of a branch, you have to wait for the branch instruction to finish. This means parts of the pipeline sit empty while no more instructions are pushed in. When the result is known, instructions are pushed into the pipeline again and it fills back up.
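The stall described above can be put in rough numbers. Here is a toy Python model of an in-order pipeline (the 5-stage depth and cycle counts are illustrative, not taken from any real GPU): normally one instruction completes per cycle, but each unresolved branch inserts a bubble the full depth of the pipe.

```python
# Toy model of a 5-stage in-order pipeline (numbers are illustrative).
# A branch blocks fetching the next instruction until it resolves,
# leaving "bubbles" (empty stages) behind it.

PIPELINE_DEPTH = 5

def cycles_to_execute(instructions):
    """Count cycles to drain a stream of 'op' / 'branch' markers.

    Without branches, an N-instruction stream takes N + depth - 1
    cycles (one instruction completes per cycle once the pipe is
    full).  Each branch adds depth - 1 bubble cycles while fetch
    waits for the outcome.
    """
    cycles = PIPELINE_DEPTH - 1  # fill latency for the first instruction
    for instr in instructions:
        cycles += 1              # one result per cycle at full throughput
        if instr == 'branch':
            cycles += PIPELINE_DEPTH - 1  # stall until the branch resolves
    return cycles

straight = ['op'] * 20
branchy  = ['op'] * 16 + ['branch'] * 4

print(cycles_to_execute(straight))  # 24 cycles
print(cycles_to_execute(branchy))   # 40 cycles: 4 branches add 16 bubbles
```

Same instruction count, but the branchy stream takes two thirds longer purely from waiting on branch results.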

Now GPUs can try to predict the likely outcome so as to avoid an empty pipeline, but if they get it wrong, the throughput drops. GPUs also have lots of pipelines and can compute the true result in one and the false result in another.

It's something like that.

Dave
I'm not 100% certain of this answer, but my understanding is that the shader pipes execute each shader instruction in lock-step. Meaning if you have 16 pipes, each shader unit will execute instruction 1, then 2, then 3, etc...

So if at any point one unit needs to jump to a different instruction, all the shader units would have to go to that new instruction... which kind of defeats the purpose. So to get around that, both branches of instructions are executed on modern GPUs.

Note that this is only for dynamic branching, static branching doesn't have the same limitation.
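A minimal sketch of that "execute both branches" idea (the function and example here are mine, not from any GPU API): every element in the lock-step group runs the instructions for both sides of the `if`, and a per-element mask selects which result is kept. This is often called predication.

```python
# Simplified model of lock-step execution: all elements in a group run
# the same instruction stream, so an `if` is handled by computing BOTH
# sides for every element and selecting per element with a mask.

def predicated_if(values, condition, then_fn, else_fn):
    """Run both branches over the whole group, keep one result each."""
    mask         = [condition(v) for v in values]  # per-element predicate
    then_results = [then_fn(v) for v in values]    # everyone pays for this...
    else_results = [else_fn(v) for v in values]    # ...and for this
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

# Example: halve even numbers, triple odd ones -- all four "pipes"
# execute both the halve and the triple; the mask picks the answer.
out = predicated_if([1, 2, 3, 4],
                    condition=lambda v: v % 2 == 0,
                    then_fn=lambda v: v // 2,
                    else_fn=lambda v: v * 3)
print(out)  # [3, 1, 9, 2]
```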
-----------------------------------Indium Studios, Inc.
Dave, that's not quite it.

See the following threads:

is dynamic branching really dynamic branching under SM3.0 shader cards?

Need explanation of efficiency issues with dynamic branching

Hope that helps

GPUs make heavy use of SIMD multiprocessing: for every running instance of a given shader program, several fragments or vertices are processed in parallel by that program. Clearly, because the code is only run once, this can only work if the execution path is the same for each element. If there are branches whose outcome differs from element to element, those paths have to be processed separately, stalling the pipeline.
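The cost of that per-group behaviour can be modelled in a few lines of Python (a sketch under my own simplifying assumptions, not a description of any specific GPU): if every element in a group takes the same side of a branch, only that side's instructions are issued; if the group diverges, both sides are issued one after the other.

```python
# Cost model for one SIMD group at a branch: a side of the branch is
# only issued if at least one element in the group needs it, so a
# divergent group pays for both sides serially.

def branch_cost(predicates, then_cost, else_cost):
    """Instruction-issue cost of one branch for a group of elements."""
    any_then = any(predicates)        # someone takes the 'then' side
    any_else = not all(predicates)    # someone takes the 'else' side
    return then_cost * any_then + else_cost * any_else

uniform   = [True] * 8                # coherent group: all take 'then'
divergent = [True] * 4 + [False] * 4  # mixed group

print(branch_cost(uniform, then_cost=10, else_cost=30))    # 10
print(branch_cost(divergent, then_cost=10, else_cost=30))  # 40
```

This is why branching is cheap when nearby pixels or vertices tend to take the same path, and expensive when the condition varies within a group.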
Thanks for the fast answers.
So basically, each shader must run a single set of code, and a branch means two sets of code have to be executed on the GPU.
What are dynamic branching and static branching? This is the first time I've heard of those terms..
The principle of branching being slow that I described applies when an instruction depends on the result of the branch and therefore has to wait for it to complete before it knows what to do.

My description (although crude) is applicable to CPUs as well.

Dave
Hi,

I'm not an expert on this, but I can assure you GPUs weren't designed to be general processing chips. At first they weren't even stream processors. But that's where things are heading (take a look at nVidia's latest offering, which lets you code the individual stream processors, and AMD's stream processing initiative using ATi's (ATi now being part of AMD) X1900 cards). Anyway, the reason is that the first GPUs were fixed function. Then came static shaders. And only recently (well, not really in terms of GPU generations) was dynamic branching added. IIRC, they internally implemented a simple execute/don't-execute mechanism for specific blocks of pixels running a specific shader. Only recently has every "pipe" or "stream processor" had its own dynamic branching unit (I'm not sure, but I think some still have to share them).

From a more architectural view, the GPU is a large, super-parallel processing chip. Only recently has their design shifted from a more SIMD approach (where one instruction operates on a lot of data simultaneously) to a more MIMD approach (giving shader processors more "independence", like, for example, ATi/AMD's thread control logic on their newest cards), but their essence is still SIMD. To allow more independence between shaders, you need more control logic and consequently a larger transistor budget, and for the last few years that independence wouldn't have benefited the game industry as much as more pipes did. Also, I assume the FPUs/ALUs/FMACs inside the shader processors are heavily pipelined to allow greater throughput, hence a conditional branch creates a very large bubble inside the pipeline.

Again, I'm just a fan of computers, so don't bet your life on this info. For further reading I suggest Wikipedia's articles about SIMD, MIMD, pipelines, out-of-order execution, GPUs, FPUs, and whatever links they point to. Also, a great source of information is Jon "Hannibal" Stokes' articles about CPU architecture on Ars Technica. Forgive my terrible English, it's been a while. Hope this helps,

JVFF
ThanQ, JVFF (Janito Vaqueiro Ferreira Filho)
Man I'm slow to type... 0_0

Anyway, static branching is unconditional and dynamic branching is conditional. IOW, static is a jump instruction, while dynamic is an "if" instruction — but we only know which way the "if" goes after we have the result, which means it has a dependency, while the other one doesn't. Once you get to an "if" you can a) stop execution until you get the result, or b) guess the result (true or false); if you're right, woohoo, if not, you have to discard everything you've done after the guess. Hope this helps,
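Option (b) above — guess, keep going, and discard on a wrong guess — can be sketched as a small cost model in Python (the penalty value and the predictor are purely illustrative assumptions of mine):

```python
# Guess-and-discard model of branch prediction: a correct guess costs
# nothing extra; a wrong guess throws away the speculated work,
# modelled here as a fixed misprediction penalty.

MISPREDICT_PENALTY = 9  # cycles of discarded work; illustrative number

def run_with_prediction(outcomes, predict):
    """Total penalty cycles for a stream of branch outcomes.

    `outcomes` is a list of booleans (taken / not taken);
    `predict` maps the branch history so far to a guess.
    """
    penalty, history = 0, []
    for taken in outcomes:
        if predict(history) != taken:
            penalty += MISPREDICT_PENALTY  # flush the wrong-path work
        history.append(taken)
    return penalty

# Trivial predictor: guess the same outcome as last time.
last_time = lambda history: history[-1] if history else True

print(run_with_prediction([True] * 8, last_time))         # 0: always right
print(run_with_prediction([True, False] * 4, last_time))  # 63: alternating
```

A predictable branch (e.g. a loop that almost always repeats) costs nearly nothing, while an alternating one mispredicts every time after the first.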

JVFF
ThanQ, JVFF (Janito Vaqueiro Ferreira Filho)
The amount of information you've supplied me with is amazing — thanks for the detailed replies, this is extremely useful to me.
Can you say that, for instance, a vertex arrives at a vertex processor, the "if" condition doesn't hold for that vertex, so it "returns" to the pool of unprocessed vertices for another run? Is that explanation applicable?
