[HLSL] Dynamic branching performance



How can I know when dynamic branching ruins performance?

In my shaders I often use if() and for(). Are there any "rules" for making dynamic branching faster?


How can I know when dynamic branching ruins performance?

Profile. Do it often. AMD's GPU PerfStudio can give you statistics about how many threads take a branch, or don't, and how much wasted work is done because of the branches. I'm not sure if Parallel Nsight can do the same, but I would assume it has some useful info.

In my shaders I often use if() and for(). Are there any "rules" for making dynamic branching faster?

Well first of all, using an "if" or a "for" does not automatically mean you're getting a branch in your shader. The compiler can flatten branches and unroll loops if it's possible to do so. You should check the shader assembly to verify. You can also use attributes to force the compiler to flatten or unroll.
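As a minimal sketch of those attributes (the texture, sampler, and function names are made up for illustration and are not from anyone's shader here): [branch] and [flatten] go on an if, while [unroll] and [loop] go on a for. They are requests to the compiler, so the assembly is still worth checking afterwards.

```hlsl
// Hypothetical resources, named only for this example.
Texture2D    gHeightMap : register(t0);
SamplerState gSampler   : register(s0);

float SampleRow(float2 uv, float texel, bool enabled)
{
    float sum = 0.0f;

    // [branch] requests a real dynamic branch (if/endif in the assembly).
    [branch]
    if (enabled)
    {
        // [unroll] requests that the loop body be expanded three times.
        // Writing [loop] instead would request an actual loop instruction.
        [unroll]
        for (int i = -1; i <= 1; ++i)
        {
            // SampleLevel takes an explicit mip level, so it does not depend
            // on screen-space derivatives inside the branch.
            sum += gHeightMap.SampleLevel(gSampler, uv + float2(i * texel, 0.0f), 0.0f).x;
        }
    }

    // [flatten] requests that both sides be evaluated and the result selected,
    // with no branch instructions.
    float bias;
    [flatten]
    if (sum > 1.0f)
    {
        bias = 1.0f;
    }
    else
    {
        bias = 0.5f;
    }

    return sum + bias;
}
```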

Anyway, the number one rule for dynamic branching is coherence. You need lots of adjacent pixels (usually within a 32x32 or 64x64 block) to take the same branch; otherwise all of those pixels end up executing both branches and doing wasted work. So if you're trying to use dynamic branching as an optimization, only do it where the branch goes the same way for large portions of the screen. It also helps to use a branch only to skip large sections of code rather than small ones, since the branch itself adds some fixed overhead.
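To make the coherence point concrete, here is a rough sketch contrasting a condition that is identical for every pixel of a draw with one that flips between neighbouring pixels. All names (gUseDetail, gNoise, ExpensiveDetail and so on) are hypothetical, and the actual cost depends on the hardware, so treat it as an illustration rather than a benchmark.

```hlsl
// Hypothetical resources and constants, named only for this example.
Texture2D    gDetail  : register(t0);
Texture2D    gNoise   : register(t1);
SamplerState gSampler : register(s0);

cbuffer PerDraw : register(b0)
{
    uint gUseDetail;   // assumed per-draw flag: the same value for every pixel
};

// Stand-in for a large block of work that is worth skipping.
float3 ExpensiveDetail(float2 uv)
{
    float3 acc = 0.0f;
    [loop]
    for (uint i = 0; i < 16; ++i)
    {
        acc += gDetail.SampleLevel(gSampler, uv * (1.0f + i), 0.0f).rgb;
    }
    return acc / 16.0f;
}

// Coherent: every pixel agrees, so whole groups of threads skip the work.
float3 ShadeCoherent(float2 uv, float3 base)
{
    [branch]
    if (gUseDetail != 0)
    {
        base += ExpensiveDetail(uv);
    }
    return base;
}

// Incoherent: the condition comes from a high-frequency noise texture, so
// neighbouring pixels disagree and most groups pay for the expensive side
// plus the branch overhead.
float3 ShadeIncoherent(float2 uv, float3 base)
{
    float n = gNoise.SampleLevel(gSampler, uv * 64.0f, 0.0f).x;
    [branch]
    if (n > 0.5f)
    {
        base += ExpensiveDetail(uv);
    }
    return base;
}
```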

What does it mean to flatten a branch?

The assembly of one of my shaders that uses if() looks like this:
 if_nz r2.y
   mov r3.x, r1.z
   mov r3.z, r1.w
   sample_l r4.xyzw, r3.xzxx, t0.xyzw, s0, l(0.000000)
   mov r4.x, r4.x
   add r3.y, r2.x, r3.z
   sample_l r5.xyzw, r3.xyxx, t0.xyzw, s0, l(0.000000)
   add r2.y, r4.x, r5.x
   mov r2.z, -r2.x
   add r3.w, r2.z, r3.z
   sample_l r4.xyzw, r3.xwxx, t0.xyzw, s0, l(0.000000)
   add r2.y, r2.y, r4.x
   add r3.y, r2.z, r3.x
   sample_l r4.xyzw, r3.yzyy, t0.xyzw, s0, l(0.000000)
   add r2.y, r2.y, r4.x
   add r3.x, r2.x, r3.x
   sample_l r3.xyzw, r3.xzxx, t0.xyzw, s0, l(0.000000)
   add r2.x, r2.y, r3.x
 else
   mov r2.x, l(-100000.000000)
 endif

I guess this code is branched, right? What would flattened code look like?

It would not have if/else instructions. It would probably have a lerp call.
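As a rough source-level picture of that, here is a sketch in which both sides are computed unconditionally and one result is then selected. The resource and function names are invented for the example; the -100000 constant is the value from the else side of the posted assembly. What the compiler actually emits for this is worth confirming in the assembly.

```hlsl
// Hypothetical resources, named only for this example.
Texture2D    gHeightMap : register(t0);
SamplerState gSampler   : register(s0);

float HeightOrSentinel(float2 uv, bool valid)
{
    // Both sides are evaluated for every pixel, whatever 'valid' is...
    float sampled  = gHeightMap.SampleLevel(gSampler, uv, 0.0f).x;
    float sentinel = -100000.0f;

    // ...and then one of them is selected. This usually compiles to a
    // select-style instruction rather than if/else/endif.
    return valid ? sampled : sentinel;

    // An equivalent lerp-based selection would be:
    //   return lerp(sentinel, sampled, valid ? 1.0f : 0.0f);
}
```

The trade-off is that the flattened form always pays for the texture fetch, which is why flattening suits cheap branches and real branches suit expensive, coherent ones.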

"Flattened" means that there are no branch instructions, and some other means is used to calculate the correct value. Typically this is done with a cmp (compare) instruction, but it can be done in other ways. Your assembly has "if" and "else" instructions, which are the branching instructions.

In the case of the posted code I would just reverse the 'if' condition as that would likely be a 'win'... of course what I'd be more concerned about given that code is what looks like a lot of stalling from using a texture sample result right away...


In the case of the posted code I would just reverse the 'if' condition as that would likely be a 'win'... of course what I'd be more concerned about given that code is what looks like a lot of stalling from using a texture sample result right away...

I have to sample texture multiple times to smooth the terrain...

I wasn't questioning your need to sample the texture. I was pointing out that, as it currently stands (unless my asm reading is very, very rusty), you are saying:

- sample texture
- use value right away
- sample texture
- use value right away

etc etc

If the data from the sample instruction isn't ready when you want to use it (texture fetches have high latency when the data isn't in cache), the thread stalls. If the GPU can't find other threads to run while yours are stalled, GPU cycles go to waste while it waits for the data to come back.
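One way to express that in HLSL is to issue all the independent fetches first and only then combine them. The sketch below guesses at the source from the posted assembly, which appears to sample a five-tap cross (centre plus one offset in each direction) and sum the x channels. The names gHeightMap, gSampler, and SumCross, and the parameter d for the offset, are assumptions. The compiler and driver may already reorder the fetches on their own, so check the assembly and profile before assuming this changes anything.

```hlsl
// Hypothetical resources, named only for this example.
Texture2D    gHeightMap : register(t0);
SamplerState gSampler   : register(s0);

float SumCross(float2 uv, float d)
{
    // Issue all five fetches up front. None of them depends on the result
    // of another, so they can all be in flight at the same time.
    float centre = gHeightMap.SampleLevel(gSampler, uv,                    0.0f).x;
    float up     = gHeightMap.SampleLevel(gSampler, uv + float2(0.0f,  d), 0.0f).x;
    float down   = gHeightMap.SampleLevel(gSampler, uv + float2(0.0f, -d), 0.0f).x;
    float left   = gHeightMap.SampleLevel(gSampler, uv + float2(-d, 0.0f), 0.0f).x;
    float right  = gHeightMap.SampleLevel(gSampler, uv + float2( d, 0.0f), 0.0f).x;

    // Only now consume the results, so the first use comes after every
    // fetch has been issued.
    return centre + up + down + left + right;
}
```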