Aqua Costa

[HLSL] Dynamic branching performance

mhagain    13430
If you can put it in your vertex shader rather than your pixel shader it won't have as much overhead, since the vertex shader typically runs far fewer times per frame than the pixel shader. That's a good general principle with any shader operation: where possible, move stuff back to the vertex shader.

MJP    19786
[quote name='TiagoCosta' timestamp='1305806660' post='4812962']
How can I know when dynamic branching ruins performance?
[/quote]

Profile. Do it often. AMD's GPU PerfStudio can give you statistics about how many threads take a branch, or don't, and how much wasted work is done because of the branches. I'm not sure if Parallel Nsight can do the same, but I would assume it has some useful info.

[quote name='TiagoCosta' timestamp='1305806660' post='4812962']
In my shaders I often use if() and for(). Are there any "rules" to make dynamic branching faster?
[/quote]

Well, first of all, using an "if" or a "for" does not automatically mean you're getting a branch in your shader. The compiler can flatten branches and unroll loops if it's possible to do so. You should check the shader assembly to verify. You can also use attributes ([branch] and [flatten] on an "if", [loop] and [unroll] on a "for") to force the compiler to branch, flatten, or unroll.
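
As a rough sketch of how those attributes are written in HLSL (the function, resource, and parameter names below are made up for illustration and are not from the original poster's shader):

[code]
Texture2D    HeightMap    : register(t0);   // hypothetical resource names
SamplerState PointSampler : register(s0);

float AverageHeight(float2 uv, float texelSize, bool doSmooth)
{
    float h = HeightMap.SampleLevel(PointSampler, uv, 0).x;

    // [branch] requests a real branch instruction (if/endif in the assembly).
    // Writing [flatten] here instead requests that both sides be evaluated
    // and the result selected afterwards.
    [branch]
    if (doSmooth)
    {
        // [unroll] requests that the loop body be replicated four times;
        // [loop] would request an actual loop instruction instead.
        [unroll]
        for (int i = 1; i <= 4; ++i)
        {
            float2 offset = float2(i * texelSize, 0.0f);
            h += HeightMap.SampleLevel(PointSampler, uv + offset, 0).x;
        }
        h /= 5.0f;
    }
    return h;
}
[/code]

Whichever attribute you pick, it is still worth checking the disassembly to confirm what the compiler actually emitted.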

Anyway the number 1 rule for dynamic branching is coherence. You need lots of adjacent pixels (usually within a 32x32 or 64x64 block) to take the same branch, otherwise you end up with all of those pixels taking both branches and doing wasted work. So if you're trying to use dynamic branching as an optimization, only do it for things where the branch will be the same for large portions of the screen. Also it helps to only use a branch to try to skip large sections of code as opposed to smaller ones, since having a branch adds some fixed overhead itself.
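
A minimal sketch of a branch that tends to be coherent, assuming a simple point light with a hard radius (all names are hypothetical). Pixels outside the radius usually cover large contiguous areas of the screen, so neighbouring pixels mostly agree on the condition:

[code]
float3 ShadePointLight(float3 worldPos, float3 normal,
                       float3 lightPos, float lightRadius, float3 lightColor)
{
    float3 toLight = lightPos - worldPos;
    float  dist    = length(toLight);

    // Early out: skips everything below for pixels the light cannot reach.
    [branch]
    if (dist >= lightRadius)
        return float3(0.0f, 0.0f, 0.0f);

    // Stand-in for the expensive part. In a real shader this would be a much
    // larger block (shadow lookups, specular, and so on), otherwise the fixed
    // cost of the branch may outweigh the work it skips.
    float3 dirToLight = toLight / max(dist, 1e-5f);
    float  ndotl      = saturate(dot(normal, dirToLight));
    float  atten      = saturate(1.0f - dist / lightRadius);
    return lightColor * ndotl * atten;
}
[/code]

Whether this is a win depends on the scene and the hardware, so profile it both with and without the branch.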

Aqua Costa    3692
What does it mean to flatten a branch?

The assembly of one of my shaders that uses if() looks like this:
[code]
if_nz r2.y
mov r3.x, r1.z
mov r3.z, r1.w
sample_l r4.xyzw, r3.xzxx, t0.xyzw, s0, l(0.000000)
mov r4.x, r4.x
add r3.y, r2.x, r3.z
sample_l r5.xyzw, r3.xyxx, t0.xyzw, s0, l(0.000000)
add r2.y, r4.x, r5.x
mov r2.z, -r2.x
add r3.w, r2.z, r3.z
sample_l r4.xyzw, r3.xwxx, t0.xyzw, s0, l(0.000000)
add r2.y, r2.y, r4.x
add r3.y, r2.z, r3.x
sample_l r4.xyzw, r3.yzyy, t0.xyzw, s0, l(0.000000)
add r2.y, r2.y, r4.x
add r3.x, r2.x, r3.x
sample_l r3.xyzw, r3.xzxx, t0.xyzw, s0, l(0.000000)
add r2.x, r2.y, r3.x
else
mov r2.x, l(-100000.000000)
endif
[/code]

I guess this code is branched, right? What would flattened code look like?

MJP    19786
"Flattened" means that there are no branch instructions, and some other means is used to calculate the correct value. Typically this is done with a conditional-select instruction (cmp in shader model 3 assembly, movc in shader model 4 and later), but it can be done in other ways. Your assembly has "if_nz", "else" and "endif" instructions, which are the branching instructions.
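
To illustrate the difference at the HLSL level, here is a sketch with hypothetical names (it is not the original poster's shader). The first function requests a real branch; the second always does the fetch and then selects the result:

[code]
Texture2D    HeightMap    : register(t0);   // hypothetical resource names
SamplerState PointSampler : register(s0);

// Branched: only pixels that take the 'if' pay for the texture fetch.
float HeightBranched(float2 uv, bool valid)
{
    float result = -100000.0f;
    [branch]
    if (valid)
        result = HeightMap.SampleLevel(PointSampler, uv, 0).x;
    return result;
}

// Flattened: the fetch always runs, then a select picks the final value.
float HeightFlattened(float2 uv, bool valid)
{
    float sampled = HeightMap.SampleLevel(PointSampler, uv, 0).x;
    return valid ? sampled : -100000.0f;
}
[/code]

The compiler is free to transform either form, so the disassembly is the only reliable way to see which one you ended up with.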

_the_phantom_    11250
In the case of the posted code I would just reverse the 'if' condition as that would likely be a 'win'... of course what I'd be more concerned about given that code is what looks like a lot of stalling from using a texture sample result right away...

Aqua Costa    3692
[quote name='phantom' timestamp='1305844801' post='4813221']
In the case of the posted code I would just reverse the 'if' condition as that would likely be a 'win'... of course what I'd be more concerned about given that code is what looks like a lot of stalling from using a texture sample result right away...
[/quote]

I have to sample the texture multiple times to smooth the terrain...

_the_phantom_    11250
I wasn't questioning your need to sample the texture. I was pointing out that, as it currently stands (unless my asm reading is very, very rusty), you are saying:

- sample texture
- use value right away
- sample texture
- use value right away

etc etc


If the data from the sample instruction isn't ready when you want to use it (texture fetches have high latency when the data isn't in cache), the thread stalls. If the GPU can't find other threads to fill in while yours are stalled, GPU cycles go to waste while it waits for the data to come back.
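
One way to act on this in HLSL, sketched under assumptions (hypothetical names, and the same five-tap sum that the posted assembly appears to compute), is to issue all the fetches first and only combine them afterwards, so that no add depends on the fetch immediately before it:

[code]
Texture2D    HeightMap    : register(t0);   // hypothetical resource names
SamplerState PointSampler : register(s0);

float SummedHeight(float2 uv, float offset)
{
    // Issue every fetch up front; none depends on another fetch's result.
    float center = HeightMap.SampleLevel(PointSampler, uv, 0).x;
    float up     = HeightMap.SampleLevel(PointSampler, uv + float2(0.0f,  offset), 0).x;
    float down   = HeightMap.SampleLevel(PointSampler, uv + float2(0.0f, -offset), 0).x;
    float left   = HeightMap.SampleLevel(PointSampler, uv + float2(-offset, 0.0f), 0).x;
    float right  = HeightMap.SampleLevel(PointSampler, uv + float2( offset, 0.0f), 0).x;

    // Combine only after all five fetches have been issued.
    return center + up + down + left + right;
}
[/code]

The HLSL compiler and the driver may reorder instructions on their own, so the source order is only a hint. Check the generated assembly and profile before assuming this helps.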
