Why is GPU branching slow?

This topic is 4022 days old, which is more than the 365-day threshold we allow for new replies. Please post a new topic.



Simple title, simple question... though I'm not so sure the answer is as simple. Why is branching in shaders considered a bottleneck? Why does it slow performance down so much?

Basically you have a pipeline, each stage of which performs some operation on a pixel or a vertex. If the pipeline is, say, 5 stages long, each stage performs one part of the work of executing a shader instruction.

The key to performance is throughput: ideally you want every stage of the pipeline full, and that is easy when you know what operation needs to be executed next. If, however, you don't know the next instruction because it depends on the result of a branch, you have to wait for the branch instruction to finish. This means that parts of the pipeline become empty (while no more instructions are pushed in). When the result is known, instructions are pushed into the pipeline again and it fills up.
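The stall described above can be sketched as a toy cost model. This is just an illustration, not a real GPU's timing: it assumes a hypothetical 5-stage pipeline where one instruction is issued per cycle, and where issue stops after a branch until the branch has drained out of the pipeline.

```python
# Hypothetical 5-stage pipeline model: one instruction issues per cycle,
# but after a branch, issue stalls until the branch leaves the pipeline
# and its outcome is known (a "bubble"). Numbers are illustrative only.

PIPELINE_DEPTH = 5

def cycles(instructions):
    """Count total cycles to run a list of instructions.

    Each entry is 'op' (ordinary instruction) or 'br' (a branch whose
    outcome the next instruction depends on).
    """
    cycle = 0
    for instr in instructions:
        cycle += 1                       # issue one instruction this cycle
        if instr == 'br':
            cycle += PIPELINE_DEPTH - 1  # bubble: wait for the branch to resolve
    return cycle + (PIPELINE_DEPTH - 1)  # let the last instruction drain out

print(cycles(['op'] * 8))                                      # -> 12 (pipeline stays full)
print(cycles(['op', 'br', 'op', 'br', 'op', 'op', 'op', 'op']))  # -> 20 (two bubbles)
```

Same eight instructions in both cases; the two branches alone add eight wasted cycles in this model.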

Now, GPUs can try to predict the likely outcome so as to avoid having an empty pipeline, but if they get it wrong, the throughput drops. GPUs also have lots of pipelines, and can put the true path in one and the false path in another.

It's something like that.

Dave

I'm not 100% certain of this answer, but my understanding is that each shader pipe executes each shader instruction in lock-step, meaning that if you have 16 pipes, every shader unit will execute instruction 1, then 2, then 3, and so on...

So if at any point one of them needs to jump to a different instruction, all the shader units would have to jump with it... which kind of defeats the purpose. To get around that, both branches of instructions are executed on modern GPUs.

Note that this only applies to dynamic branching; static branching doesn't have the same limitation.
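The "execute both branches" idea above can be sketched in plain Python. This is a toy model of lock-step execution, not a real GPU ISA: every lane runs both sides of the `if`, and a per-lane mask selects which result survives.

```python
# Sketch of how a lock-step SIMD unit might handle
# "if (x > 0) y = then(x); else y = else(x);":
# ALL lanes execute BOTH branch bodies, and a per-lane mask
# picks which result each lane keeps. Illustrative only.

def simd_select(xs, then_fn, else_fn):
    mask = [x > 0 for x in xs]            # per-lane condition
    then_vals = [then_fn(x) for x in xs]  # every lane runs the 'then' side
    else_vals = [else_fn(x) for x in xs]  # ...and every lane runs the 'else' side
    # the mask selects the surviving result per lane
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

print(simd_select([-2, 3, 0, 5], lambda x: x * 10, lambda x: -x))
# -> [2, 30, 0, 50]
```

Note that all lanes pay for both branch bodies, which is exactly why divergent branching costs more than straight-line code.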

GPUs make heavy use of SIMD multiprocessing: for every running instance of a given shader program, several fragments or vertices are processed in parallel by that program. Clearly, because the code is only run once, this can only work if the execution path is the same for every element. If there are branches that differ from element to element, these have to be processed separately, stalling the pipeline.
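A rough way to see the cost of that separate processing: if all elements in a SIMD group take the same side of a branch, only that side runs; if they diverge, both sides run back to back. The model below is illustrative only, with made-up instruction costs.

```python
# Toy cost model for SIMD divergence: a group of lanes shares one
# program counter. Uniform branch -> only one side executes.
# Divergent branch -> both sides execute, one after the other,
# with lanes masked off in the side they didn't take.

def group_cost(lane_conditions, then_cost, else_cost):
    runs_then = any(lane_conditions)        # at least one lane took 'then'
    runs_else = not all(lane_conditions)    # at least one lane took 'else'
    return then_cost * runs_then + else_cost * runs_else

print(group_cost([True] * 32, 10, 40))        # uniform: -> 10
print(group_cost([False] * 32, 10, 40))       # uniform: -> 40
print(group_cost([True, False] * 16, 10, 40)) # divergent: -> 50
```

A single divergent lane is enough to make the whole group pay for both sides.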

Thanks for the fast answers.
So basically, each shader must run a single set of code, and a branch means two sets of code when it comes to executing them on the GPU.
What are dynamic branching and static branching? This is the first time I've heard of them...

The reason branching is slow, as I described it, is that one instruction depends on the result of the branch and therefore has to wait for the branch to complete before the pipeline knows what to do next.

My description (although crude) applies to CPUs as well.

Dave

Hi,

I'm not an expert on this, but I can assure you GPUs weren't designed to be general-purpose processing chips. At first they weren't even stream processors, but that's where things are heading (take a look at nVidia's latest offering, which lets you code the individual stream processors, and AMD's stream-processing initiative using ATi's (ATi is now part of AMD) X1900 cards). Anyway, the reason is that the first GPUs were fixed-function. Then came static shaders, and only recently (well, not really in terms of GPU generations) was dynamic branching added. IIRC, they internally implemented a simple execute/don't-execute mechanism for specific blocks of pixels of a specific shader. Only recently does every "pipe" or "stream processor" have its own dynamic branching unit (I'm not sure, but I think some still have to share them).

From a more architectural view, the GPU is a large, massively parallel processing chip. Only recently has their design shifted from a pure SIMD approach (where one instruction operates on a lot of data simultaneously) toward a more MIMD approach (giving shader processors more "independence", like for example ATi/AMD's thread-control logic on their newest cards), but their essence is still SIMD. To allow more independence between shaders you need more control logic, and consequently a larger transistor budget, and for the last few years that independence wouldn't have benefited the game industry as much as more pipes would. Also, I assume the FPUs/ALUs/FMACs inside the shader processors are heavily pipelined to allow greater throughput, hence a conditional branch creates a very large bubble inside the pipeline.

Again, I'm just a fan of computers, so don't bet your life on this info. For further reading I suggest Wikipedia's articles on SIMD, MIMD, pipelines, out-of-order execution, GPUs, and FPUs, and whatever links they point to. Another great source is Jon "Hannibal" Stokes' articles about CPU architecture on Ars Technica. Forgive my terrible English, it's been a while. Hope this helps,

JVFF

Man I'm slow to type... 0_0

Anyway, static branching is unconditional and dynamic branching is conditional. IOW, static is a jump instruction, while dynamic is an "if" instruction: we only know which way the "if" goes after we have the result, which means it has a dependency, while the jump doesn't. Once you get to an "if" you can a) stop execution until you get the result, or b) guess the result (true or false); if you're right, woohoo, if not, you have to discard everything you've done after the guess. Hope this helps,
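Option (b), guessing, can be sketched as a toy model too. The flush penalty below is a made-up number, not any real chip's figure: a correct guess costs nothing extra, a wrong guess throws away the speculative work done after the guess.

```python
# Toy model of branch prediction: guess the outcome and keep executing.
# A correct guess is free; a wrong guess discards the speculative work
# (a "flush"). FLUSH_PENALTY is an illustrative number, not real hardware.

FLUSH_PENALTY = 14  # cycles of discarded work per misprediction

def branch_cost(predicted_taken, actually_taken):
    return 0 if predicted_taken == actually_taken else FLUSH_PENALTY

# Always predict "taken" against a run of actual outcomes:
outcomes = [True, True, False, True]
wasted = sum(branch_cost(True, o) for o in outcomes)
print(wasted)  # -> 14 (one misprediction out of four branches)
```

The better the predictor matches the real outcome pattern, the less work is thrown away, which is why CPUs invest so much silicon in prediction.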

JVFF

The amount of information you've supplied me with is amazing, thanks for the detailed replies, this is extremely useful to me.
Can you say that, for instance, a vertex arrives at a vertex processor, the "if" condition doesn't hold for that vertex, so it "returns" to the pool of unprocessed vertices for another run? Does that explanation hold up?

I'm not really sure about the old vertex processors (the new ones are more similar to pixel processors), but yes, I think you can. IIRC, given a batch of pixels, the ones that don't follow the same path (the guessed one, I think) don't save their results (although they execute the same instructions) and are set aside for a later re-execution along the other path (I'm not sure how smart the control logic is, but I wouldn't be surprised if on older hardware the same batch were sent for execution even if some pixels were done). Hope this helps,

JVFF

It's worth bearing in mind that CPU branching is slow, too! It's just that CPUs, being general-purpose devices, have had a lot more investment in alleviating this problem, in terms of predicting which branch is going to be taken and so on. A GPU is a much more specialised, linear device, so these optimisations are harder to perform while keeping costs down and performance up.


