Need explanation of efficiency issues with dynamic branching

7 comments, last by Spoonbender 17 years, 7 months ago
Hi! I've heard that you should be careful with the dynamic branching that can be done in SM3 pixel shaders. But to what extent? What happens in the pipeline when you put if statements in the shader code? Naturally there will be a disturbance in the coherence if different pixels processed at the same time follow different paths, but exactly HOW does this affect efficiency?

And if I were to write something like this:

if(condition)
    result = float4(1,1,1,1);
else
    result = float4(0,1,0,1);

Would this disturb the coherence? My intuition tells me that because the two assignments here should take equal time to complete, the shaders would be running the rest of the code (after the if) in parallel again. Is this right?

Thanks
Not sure what you mean by coherence. You're guaranteed that the output is correct, even if different pixels end up taking different code paths. And they don't have to run 100% synchronized or anything; they're assigned to separate shader units executing the shader.

The problem is a classic pipelining problem, and CPUs suffer from the same (except they dedicate more resources to reducing the impact, and they have shorter pipelines, lessening the impact as well).

Quite simply, a pipelined CPU doesn't wait for one instruction to complete before starting on the next. Instead, once one instruction has been loaded, it is passed to the next pipeline stage, which may consist of actually executing the instruction, and when that is done, the result is passed to the stage responsible for storing the result back into the registers. (That's an unrealistically simple model, but it should serve to explain my point.)

Now, once instruction i has been read, it is passed to the execute stage. This leaves the read stage idle, so it starts reading instruction i+1. In the next clock cycle, i will be at the write stage, while i+1 is being executed, and again, the read stage is free to begin on another instruction, i+2.

That's all fine, as long as there are no dependencies between instructions. Let's say i+1 uses the result computed in i. When i+1 is at the execute stage, i is at the write stage, but it hasn't written its result back yet. So i+1 has to wait one cycle for the data it needs to be available.
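To make that concrete, here's a made-up cycle-by-cycle picture of this toy 3-stage pipeline (R = read, E = execute, W = write), with i+1 depending on i's result:

        cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
i          R         E         W
i+1                  R       (wait)      E         W

i+1 sits in a wait state for one cycle, until i's result has actually been written back.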

That can usually be solved by forwarding the result directly, without waiting for it to be stored to the register file.

A bigger problem is branches. How does the read stage know which instruction to read next, if we don't yet know whether the previous instruction was a branch?
It doesn't. So it has to make a guess - trying to predict whether the previous instruction was a branch, whether the branch is actually taken, and if so, where to jump to. In the worst case, you might have to sit idle for several cycles, just waiting to be told which instruction you should read next. That's a nasty performance hit.
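In the same toy notation, the branch case looks like this - the read stage can't fetch anything until the branch b has been resolved in the execute stage:

        cycle 1   cycle 2   cycle 3   cycle 4
b          R         E         W
next               (idle)      R         E

That idle slot is the "bubble": on a 3-stage pipeline you lose a cycle, on a very deep pipeline you lose dozens of cycles per mispredicted branch.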

Now replace my primitive 3-stage pipeline (where you might lose one or two cycles here or there) with the ~15 stages of an Athlon 64. Or the 30+ of a Pentium 4. Or the several hundred (the exact number isn't known outside NVidia/ATI, but a figure I heard some years ago was 250-ish) stages in a modern GPU. Imagine if you encounter a branch, and you have to wait 50 cycles just to find out which instruction to read next.

Branch prediction can be done very accurately on CPUs (to the extent that something like 98% of all branches may be guessed correctly).
On a GPU, that's trickier. First, you don't really have an accurate history to refer to. You could look at what this shader unit did on the last, say, 10 pixels it computed, but they took very different input, and might just mislead you. Or you could try to look at the branch history of all shader units, which would be just as misleading.
Or you could look at the last 10 times you encountered this instruction in the current pixel, but shader programs are very short, so most of the time you won't encounter any specific branch more than a few times.

Furthermore, branch prediction isn't free. It requires a good chunk of the die, transistors which could have been used on other things.

Basically, GPUs have looong pipelines, which makes branch mispredictions expensive. They also don't have very advanced branch prediction units, which means they can't really do much to minimize this branch penalty.

Hope this makes sense
Thank you for your answer. I pretty much got it, but just to make sure...

What you are saying is that the penalty doesn't come from the fact that the pixels processed in parallel may take different paths, but from the branch instruction itself?

Is there a way to help the GPU with the branch prediction, for example by making pixels close to each other take the same branch? (I'm just processing data from one texture to a rendertarget texture (General Purpose Computation on the GPU), so I could organize it.)
The coherency issue comes in cases like this:

if(testResult)
    outColor = DoCheapThing();
else
    outColor = DoExpensiveThing();

The way this is implemented on the GPU is that all the pixels in a region of the screen (around 30x30 pixels for most current GPUs) take the same "path" through the program. So unless 'testResult' is the same (either TRUE or FALSE) for all the pixels in a block, you won't see any speed-up, as all the pixels end up executing the code for DoCheapThing AND DoExpensiveThing.

Edit - I mean all the pixels in a block, not almost all. As long as MOST blocks are either all true or all false (i.e. only a few blocks of pixels end up executing both paths) you should see a speed-up. The classic example of where this kind of algorithm is used is PCF shadow filtering - see the sketch below.
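For illustration, a rough sketch of that PCF pattern (the sampler, texel size and tap counts are all invented for the example): take a few cheap taps first, so only the blocks of pixels that straddle a shadow edge pay for the full filter.

sampler2D shadowMap;
float2 shadowTexelSize; // 1 / shadow map resolution

float ShadowPCF(float2 uv, float depth)
{
    // Cheap path: four corner taps (tex2Dlod so the same fetch also
    // works inside the divergent flow below).
    float4 taps;
    taps.x = tex2Dlod(shadowMap, float4(uv + shadowTexelSize * float2(-1,-1), 0, 0)).r > depth;
    taps.y = tex2Dlod(shadowMap, float4(uv + shadowTexelSize * float2( 1,-1), 0, 0)).r > depth;
    taps.z = tex2Dlod(shadowMap, float4(uv + shadowTexelSize * float2(-1, 1), 0, 0)).r > depth;
    taps.w = tex2Dlod(shadowMap, float4(uv + shadowTexelSize * float2( 1, 1), 0, 0)).r > depth;
    float avg = (taps.x + taps.y + taps.z + taps.w) * 0.25f;

    // Fully lit or fully shadowed: the cheap answer is good enough.
    [branch]
    if (avg == 0.0f || avg == 1.0f)
        return avg;

    // Expensive path: full 6x6 PCF, paid only by blocks on the edge.
    float sum = 0;
    [loop]
    for (int y = -2; y <= 3; ++y)
    {
        [loop]
        for (int x = -2; x <= 3; ++x)
            sum += tex2Dlod(shadowMap, float4(uv + shadowTexelSize * float2(x, y), 0, 0)).r > depth;
    }
    return sum / 36.0f;
}

If the shadow edges only cross a few blocks of the screen, most blocks take the early return and you get close to the cheap cost; if the condition flips pixel by pixel, every block pays for both paths.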
Quote:Original post by fortia
Hi!

I've heard that you should be careful with the dynamic branching that can be done in SM3 pixel shaders. But to what extent? What happens in the pipeline when you put if statements in the shader code?

Naturally there will be a disturbance in the coherence if different pixels processed at the same time follow different paths, but exactly HOW does this affect efficiency?

And if I were to write something like this:

if(condition)
    result = float4(1,1,1,1);
else
    result = float4(0,1,0,1);

Would this disturb the coherence? My intuition tells me that because the two assignments here should take equal time to complete, the shaders would be running the rest of the code (after the if) in parallel again. Is this right?

Thanks


This is a complex question to answer. First, your example isn't a good one, because in such a simple situation it's faster to evaluate both sides of the branch than to use flow control (the compiler will choose for you), but I'll talk about the case where you actually have hefty work. In the new HLSL compiler you can actually micro-manage what the compiler does (because the choice depends on coherence expectations) by putting [branch] or [flatten] in front of the if statement.
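For reference, a minimal sketch of those two attributes (the condition, interpolants and sampler are made up for the example):

sampler2D expensiveMap;

float4 PS(float2 uv : TEXCOORD0, float mask : TEXCOORD1) : COLOR0
{
    float4 result;

    // [flatten]: evaluate both sides, then select - no real flow
    // control, good when both sides are as cheap as this.
    [flatten]
    if (mask > 0.5f)
        result = float4(1,1,1,1);
    else
        result = float4(0,1,0,1);

    // [branch]: real dynamic flow control - worth it when the guarded
    // work is heavy and nearby pixels mostly agree on the condition.
    // (ps_3_0 wants tex2Dlod/tex2Dgrad rather than plain tex2D inside
    // a dynamic branch, since gradients aren't defined under divergence.)
    [branch]
    if (mask > 0.5f)
        result *= tex2Dlod(expensiveMap, float4(uv, 0, 0));

    return result;
}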

Part1)
Because pixel shaders run in 2x2 groups minimum (often much bigger), and in lockstep (the instruction pointer must be at the same place), you always pay the max cost of any pixel in the group. This means if one pixel goes down path a, and the other goes down path b, you pay the cost a+b on all of them. You gain something only when the pixels all go down the same path. The level of coherence depends on the hardware, although I would expect finer granularity in the future than we have today. The same is true for loops - you pay the cost of the pixel with the biggest loop, as in the sketch below.
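A hypothetical illustration of the loop case (stepMap, dataMap and texelSize are invented names): every pixel in the group effectively runs until the pixel with the largest trip count has finished.

sampler2D stepMap;
sampler2D dataMap;
float2 texelSize;

float4 PS(float2 uv : TEXCOORD0) : COLOR0
{
    // Per-pixel trip count - the whole group keeps stepping until the
    // pixel with the biggest 'steps' is done.
    int steps = (int)(tex2Dlod(stepMap, float4(uv, 0, 0)).r * 64);

    float4 sum = 0;
    [loop]
    for (int i = 0; i < steps; ++i)
        sum += tex2Dlod(dataMap, float4(uv + texelSize * i, 0, 0));

    return sum / max(steps, 1);
}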

Part2)
GPUs have an extremely different model for hiding latency than CPUs. Where CPUs use out-of-order instruction execution and rely on huge caches, GPUs take the opposite approach: the caches are tiny (around the 32K range), and every time a memory access occurs, the pixel goes to sleep if the data isn't there, and the GPU goes on to another pixel. Though you'll never pry the exact number from NVIDIA or ATI, the number of pixels suspended and executing at any point in time is _vast_ - in the hundreds, and likely thousands on newer cards.

How does this affect you? When a pixel shader goes to sleep it has to save off all its current state, which amounts to the registers it is currently using. This means that a pixel shader's performance can be heavily dependent on the number of values in flight, i.e. the number of registers used. The GPU has a giant register file (instead of a cache), already in the half-MB range - it effectively is the cache. The more registers you use in a pixel shader, the fewer pixels can be in flight, and the more likely the GPU is to stall when a memory access (texture read) happens.

Again, how does this affect branching? If you have two branches a and b, then your potential register load is the MAX of the two branches. Thus you can easily blow out your register load on a shader by branching, and then start hitting memory stalls because the GPU can no longer keep enough pixels in flight.
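A sketch of that register-pressure point (all the samplers and names are invented): the shader is allocated registers for its worst-case path, so even pixels that skip the branch reduce how many pixels fit in flight.

sampler2D baseMap;
sampler2D detail0;
sampler2D detail1;
sampler2D detail2;
sampler2D detail3;

float4 PS(float2 uv : TEXCOORD0, float needDetail : TEXCOORD1) : COLOR0
{
    float4 color = tex2Dlod(baseMap, float4(uv, 0, 0));

    [branch]
    if (needDetail > 0.5f)
    {
        // Several extra live temporaries: the register allocation is
        // sized for this path on every pixel, taken or not.
        float4 t0 = tex2Dlod(detail0, float4(uv, 0, 0));
        float4 t1 = tex2Dlod(detail1, float4(uv, 0, 0));
        float4 t2 = tex2Dlod(detail2, float4(uv, 0, 0));
        float4 t3 = tex2Dlod(detail3, float4(uv, 0, 0));
        color += 0.25f * (t0 + t1 + t2 + t3);
    }
    return color;
}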


Conclusion:
You simply have to experiment with the correct level of dynamic branching. It is highly dependent on the content, and on the hardware it's running on. A small number of instructions hidden in a branch statement won't buy you much, so it's more useful when you have giant chunks of code.

I hope this helps.

EvilDecl81
Quote:Original post by fortia
Thank you for your answer. I pretty much got it, but just to make sure...

What you are saying is that the penalty doesn't come from the fact that the pixels processed in parallel may take different paths, but from the branch instruction itself?

Is there a way to help the GPU with the branch prediction, for example by making pixels close to each other take the same branch? (I'm just processing data from one texture to a rendertarget texture (General Purpose Computation on the GPU), so I could organize it.)


GPUs have always worked on batches of pixels, and when a branch occurs it can occur anywhere in that batch. For example, if half the pixels in a batch take one path (pixels A) and the other half take another path (pixels B), then the second half (pixels B) still gets the first half's (pixels A) instructions executed, but then they become another batch themselves to execute their own instructions. In the end the results that don't apply get masked out, so you still get the correct output.

So you end up carrying out about 1.5 to 2 times the work for the batch.
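A made-up example of that arithmetic: say path A costs 40 cycles and path B costs 60. A coherent batch pays 40 or 60, but a divergent batch pays roughly 40 + 60 = 100 - about 1.7x the cost of taking the expensive path everywhere, and the closer the two paths are in cost, the closer you get to the full 2x.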

Now, the smaller the batches, the more efficiently the code will run. On current ATi cards the batch size is 4x4 pixels on the R520 (I think 4x12 on the R580), so dynamic branching should be very efficient. They also have a dedicated branch prediction/execution unit so that ALU cycles aren't wasted.

NVIDIA on the other hand has a fairly large batch size of around 1024 pixels (IIRC) in the G70/71 (higher in the NV40...), so a branch can cost quite a bit to execute.

Hope that clears some stuff up. Someone please correct me if I am wrong (it's been a while since I looked at any of this stuff in detail).

Do a search for reviews of the R520; IIRC they covered branching quite a bit, since it was a pretty big selling feature for ATi (the Ultra Threaded Shader Dispatch processor, etc.), and some benchmarks did show ATi running about twice as fast with branching compared to without (while NVIDIA took a hit with branching).

btw do you have a source/link for the 250 stage pipeline... that doesn't sound correct but I could be wrong.

Guys, where do you learn this sort of thing? Thank you so much! You've been very helpful.
Quote:Original post by fortia
Guys, where do you learn this sort of thing? Thank you so much! You've been very helpful.


I know I'm not one of the "guys", but some of the best places to learn are:
NVidia documentation
ATI's technical publications
ATI's presentations

Then there are a lot of papers/presentations floating around the net (SIGGRAPH, GDC). Also the ShaderX and GPU Gems series of books contain lots of valuable information. And if you look a little more on NVidia's and ATI's sites I'm sure you can find some more information. You might even want to look into general techniques used for "processing units" (CPU/GPU/PPU/etc.), such as branch prediction, out-of-order execution, etc.
Quote:Original post by fortia
Guys, where do you learn this sort of thing? Thank you so much! You've been very helpful.


My part was just basic CPU architecture - learned that in one of my computer science classes. (And unfortunately, I completely forgot about some of the GPU quirks mentioned by the people after me, like the fact that they process big batches of pixels at a time.)
So yeah, the penalty comes both from problems with branches on any pipelined architecture, and from the GPUs running large batches of pixels through the same code path.

Quote:btw do you have a source/link for the 250 stage pipeline... that doesn't sound correct but I could be wrong.

Nope, just found an article about it a while back... Think it might have been on Ars Technica, but honestly can't remember. But keep in mind that they have to be long enough to be able to hide the latency from texture accesses and such. Of course, it's possible that this figure referred to old fixed-function pipelines, and today's shader-based ones are shorter. I don't know - wish I could find the article again, so I could at least check how old it was. In any case, GPU pipelines are long. But hard to say *how* long. [grin]

