[HLSL] Dynamic branching performance

Started by
7 comments, last by _the_phantom_ 12 years, 11 months ago
How can I know when dynamic branching ruins performance?

In my shaders I often use if() and for(). Are there any "rules" to make dynamic branching faster?
If you can put it in your vertex shader rather than your pixel shader it won't have as much overhead. That's a good general principle with any shader operation: where possible move stuff back to the vertex shader.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


How can I know when dynamic branching ruins performance?


Profile. Do it often. AMD's GPU PerfStudio can give you statistics about how many threads take a branch, or don't, and how much wasted work is done because of the branches. I'm not sure if Parallel Nsight can do the same, but I would assume it has some useful info.


In my shaders I often use if() and for(). Are there any "rules" to make dynamic branching faster?


Well first of all, using an "if" or a "for" does not automatically mean you're getting a branch in your shader. The compiler can flatten branches and unroll loops if it's possible to do so. You should check the shader assembly to verify. You can also use attributes to force the compiler to flatten or unroll.
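As a minimal sketch of those attributes (the resource names, entry point, and thresholds are made up for illustration and are not from the poster's shader): [branch] and [flatten] go in front of an if, and [loop] and [unroll] go in front of a for. These are requests; the compiler can warn or fail if it cannot honor one, so checking the assembly is still the way to confirm what you got.

```hlsl
// Hypothetical resources; names and registers are placeholders.
Texture2D    gHeightMap : register(t0);
SamplerState gSampler   : register(s0);

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float result = 0.0f;

    // [branch]: request a real dynamic branch (if/endif in the assembly).
    // SampleLevel is used because it needs no screen-space derivatives,
    // which are a problem inside divergent flow control.
    [branch]
    if (uv.x > 0.5f)
    {
        result = gHeightMap.SampleLevel(gSampler, uv, 0).r;
    }

    // [flatten]: request that both sides be evaluated and one result selected.
    [flatten]
    if (uv.y > 0.5f)
    {
        result *= 2.0f;
    }

    // [unroll]: request that the loop be expanded into straight-line code.
    // [loop] in the same position would request a real loop instead.
    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        float2 offsetUv = uv + float2(0.01f * i, 0.0f);
        result += 0.25f * gHeightMap.SampleLevel(gSampler, offsetUv, 0).r;
    }

    return float4(result, result, result, 1.0f);
}
```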

Anyway, the number 1 rule for dynamic branching is coherence. You need lots of adjacent pixels to take the same branch (the hardware runs pixels in lockstep groups, typically 32 or 64 threads covering a small block of the screen); otherwise you end up with all of those pixels taking both branches and doing wasted work. So if you're trying to use dynamic branching as an optimization, only do it for things where the branch will be the same for large portions of the screen. It also helps to use a branch only to skip large sections of code rather than small ones, since the branch itself adds some fixed overhead.
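To illustrate the kind of branch that tends to pay off, here is a sketch in which the condition is the same over large areas of the screen (surfaces facing away from a light) and the skipped section is large (16 texture fetches). Everything in it is hypothetical: the shadow-map setup, the fixed light direction, the 1024x1024 texel size, and all names are assumptions chosen only to make the example self-contained.

```hlsl
// Hypothetical resources and constants, for illustration only.
Texture2D    gShadowMap    : register(t0);
SamplerState gPointSampler : register(s0);

static const float2 kTexelSize = float2(1.0f / 1024.0f, 1.0f / 1024.0f);

float4 PSMain(float4 pos         : SV_Position,
              float3 normal      : NORMAL,
              float3 shadowCoord : TEXCOORD0) : SV_Target
{
    // Assumed fixed directional light.
    const float3 lightDir = normalize(float3(0.3f, 1.0f, 0.2f));
    float nDotL = dot(normalize(normal), lightDir);

    float lit = 0.0f;

    // Back-facing pixels usually come in large contiguous regions, so
    // neighbouring pixels mostly agree on this condition (coherence).
    [branch]
    if (nDotL > 0.0f)
    {
        // The expensive part worth skipping: a 4x4 manual shadow filter.
        float sum = 0.0f;
        [unroll]
        for (int y = 0; y < 4; ++y)
        {
            [unroll]
            for (int x = 0; x < 4; ++x)
            {
                float2 offset = (float2(x, y) - 1.5f) * kTexelSize;
                float depth = gShadowMap.SampleLevel(
                    gPointSampler, shadowCoord.xy + offset, 0).r;
                sum += (shadowCoord.z <= depth) ? 1.0f : 0.0f;
            }
        }
        lit = nDotL * sum / 16.0f;
    }

    return float4(lit, lit, lit, 1.0f);
}
```

If the condition instead flipped from pixel to pixel (noise, fine checker patterns), most groups would end up running the filter anyway and paying the branch overhead on top.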
What does it mean to flatten a branch?

The assembly of one of my shaders that uses if() looks like this:

if_nz r2.y
mov r3.x, r1.z
mov r3.z, r1.w
sample_l r4.xyzw, r3.xzxx, t0.xyzw, s0, l(0.000000)
mov r4.x, r4.x
add r3.y, r2.x, r3.z
sample_l r5.xyzw, r3.xyxx, t0.xyzw, s0, l(0.000000)
add r2.y, r4.x, r5.x
mov r2.z, -r2.x
add r3.w, r2.z, r3.z
sample_l r4.xyzw, r3.xwxx, t0.xyzw, s0, l(0.000000)
add r2.y, r2.y, r4.x
add r3.y, r2.z, r3.x
sample_l r4.xyzw, r3.yzyy, t0.xyzw, s0, l(0.000000)
add r2.y, r2.y, r4.x
add r3.x, r2.x, r3.x
sample_l r3.xyzw, r3.xzxx, t0.xyzw, s0, l(0.000000)
add r2.x, r2.y, r3.x
else
mov r2.x, l(-100000.000000)
endif


I guess this code is branched, right? What would flattened code look like?
It would not have if/else instructions. It would probably have a lerp call.
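As a rough sketch of the difference, here is HLSL that could plausibly compile to something like the posted assembly, next to a flattened equivalent. The original source was not posted, so the function names, parameters, and resource names are guesses reconstructed from the assembly. The flattened form uses the ?: operator on a value computed beforehand, which compilers normally turn into a conditional move rather than a branch; a lerp with a 0/1 weight would be another way to write it.

```hlsl
// Hypothetical reconstruction; names are not from the poster's shader.
Texture2D    gHeightMap : register(t0);
SamplerState gSampler   : register(s0);

// Sum of the centre texel and its four neighbours at distance d,
// matching the five sample_l instructions in the posted assembly.
float SampleCross(float2 uv, float d)
{
    float sum = gHeightMap.SampleLevel(gSampler, uv, 0).r;
    sum += gHeightMap.SampleLevel(gSampler, uv + float2(0.0f,  d), 0).r;
    sum += gHeightMap.SampleLevel(gSampler, uv + float2(0.0f, -d), 0).r;
    sum += gHeightMap.SampleLevel(gSampler, uv + float2(-d, 0.0f), 0).r;
    sum += gHeightMap.SampleLevel(gSampler, uv + float2( d, 0.0f), 0).r;
    return sum;
}

// Branched: the five fetches are skipped entirely when 'valid' is false.
float HeightBranched(bool valid, float2 uv, float d)
{
    float h;
    [branch]
    if (valid)
    {
        h = SampleCross(uv, d);
    }
    else
    {
        h = -100000.0f;
    }
    return h;
}

// Flattened: the fetches always run, then one of the two values is selected.
float HeightFlattened(bool valid, float2 uv, float d)
{
    float sum = SampleCross(uv, d);
    return valid ? sum : -100000.0f;
}
```

The flattened version always pays for the five fetches, so it only makes sense when the condition is incoherent or the skipped work is cheap.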
-----Quat
"Flattened" means that there are no branch instructions, and some other means is used to calculate the correct value. Typically this is done with a conditional select (the cmp instruction in older shader models, movc in assembly like yours), but it can be done in other ways. Your assembly has "if" and "else" instructions, which are the branching instructions.
In the case of the posted code I would just reverse the 'if' condition as that would likely be a 'win'... of course what I'd be more concerned about given that code is what looks like a lot of stalling from using a texture sample result right away...

In the case of the posted code I would just reverse the 'if' condition as that would likely be a 'win'... of course what I'd be more concerned about given that code is what looks like a lot of stalling from using a texture sample result right away...


I have to sample the texture multiple times to smooth the terrain...
I wasn't questioning your need to sample the texture. I was pointing out that, as it currently stands (unless my asm reading is very, very rusty), you are doing this:

- sample texture
- use value right away
- sample texture
- use value right away

etc etc


If the data from the sample instruction isn't ready when you want to use it (texture fetches have high latency when the data is not in cache), the thread stalls. If the GPU can't find other threads to run while yours are stalled, GPU cycles go to waste while it waits for the data to come back.
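One way to write the source so that the fetches do not depend on each other's results is to issue them all first and combine them afterwards. This is only a sketch with made-up names, and it is a hint at best: the HLSL compiler may reorder instructions, and the driver compiles the D3D bytecode again for the actual hardware, so the order seen in the posted assembly is not necessarily the order the GPU executes.

```hlsl
// Hypothetical resources; names are placeholders.
Texture2D    gHeightMap : register(t0);
SamplerState gSampler   : register(s0);

float SampleCrossBatched(float2 uv, float d)
{
    // Issue all five fetches up front; none depends on another's result.
    float c = gHeightMap.SampleLevel(gSampler, uv, 0).r;
    float n = gHeightMap.SampleLevel(gSampler, uv + float2(0.0f,  d), 0).r;
    float s = gHeightMap.SampleLevel(gSampler, uv + float2(0.0f, -d), 0).r;
    float w = gHeightMap.SampleLevel(gSampler, uv + float2(-d, 0.0f), 0).r;
    float e = gHeightMap.SampleLevel(gSampler, uv + float2( d, 0.0f), 0).r;

    // Consume the results only after every fetch has been issued,
    // giving the hardware room to overlap their latencies.
    return c + n + s + w + e;
}
```

Profiling before and after is the only reliable way to tell whether a change like this made any difference.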

This topic is closed to new replies.
