cavatina

How to speed up branch instructions?


Hi, all! I have a view-dependent constraint in my pixel shader, written in GLSL. See below:

...
if(theta > PI*7/4 && theta < PI/4)
... do something1 ...
else if( abs(theta - PI_2) < PI_4)
... do something2 ...
else if( abs(theta - PI) < PI_4)
... do something3 ...
else
... do something4 ...

The shader with these branch instructions lowers the FPS considerably, to only about 1/3 of the FPS of the original non-branching shader. Does anyone know how to encode such a branch structure more efficiently, or have any suggestions regarding graphics hardware? I use an nVidia 6 series card, but I have heard that ATI's Shader Model 3.0 implementation is much better than nVidia's. Is that true? I hope someone can help... thanks a lot!!! :)

cavatina

It really depends. Given that GPU pipelines are very long, branching becomes tricky. What can be said, however, is that ATI is better at branching than nVidia.

Regarding your code, it depends on what kind of code is executed in each branch (for optimal performance the branches should be about equally expensive) and on how coherent your fragments' execution paths are (from your code it seems that this should be OK).

Guest Anonymous Poster
Well, what you are checking in the ifs is constant:
[quote]if(theta > PI*7/4 && theta < PI/4)[/quote]
if(theta > PI*1.75 && theta <

Guest Anonymous Poster
missing from post:
So why not make them constants?

If you are going to calculate them more than once (why would you?), then change to


if(theta > PI*1.75 && theta < PI*0.25)
...

I can only suggest reducing the number of FP operations, e.g.:

// convert angle into quadrant number
int i = int((theta + PI_4) * RECIP_PI_2); // RECIP_PI_2 = 2 / PI, or 1 / (PI/2)
if (i == 1)
..
else if (i == 2)
..
else if (i == 3)
..
else // i == 0 || i == 4
..


Skizz
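
The quadrant mapping above can be checked on the CPU with a small C sketch. The function name `sector_index` and the constant `PI_F` are made up for this illustration, and the code assumes theta lies in [0, 2*PI). Note that the first test in the original post joins its two bounds with &&, which no single angle can satisfy; the wrap-around sector presumably needs ||, and the index form sidesteps that by folding i == 4 back onto 0.

```c
#define PI_F 3.14159265358979f

/* Map an angle in [0, 2*PI) to a sector index (assumed numbering):
   0 = around 0 (wraps past 2*PI), 1 = around PI/2,
   2 = around PI, 3 = around 3*PI/2. */
static int sector_index(float theta)
{
    /* Shift by PI/4 so each sector starts on a multiple of PI/2,
       then scale by 2/PI. Truncation equals floor for non-negative input. */
    int i = (int)((theta + 0.25f * PI_F) * (2.0f / PI_F));
    return i & 3; /* i == 4 wraps back to sector 0 */
}
```

In GLSL the same expression would need an explicit int(...) conversion, and the result could drive the if/else chain or index a small table of constants.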

The most important thing for the performance of dynamic branching is coherency (as stated). If N fragments in a spatial region each take a different path, you may end up running the program N times for *every* one of the pixels in the region, with 99% of the results simply discarded. Thus it is *very* important to make sure that spatially coherent regions all take the same path through the control flow.

As mentioned, ATI's region size is on the order of 16 pixels (4x4, or 4x12 on 1900 series) and NVIDIA's is somewhere in the neighborhood of 64x64. It's easy to see where differences in performance will occur. Note that next generation cards from both vendors will probably perform similarly if not better than ATI's current cards with respect to dynamic branching.
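
To make the coherency argument concrete, here is a toy cost model in C. It is only a sketch under a simplifying assumption, not a description of any real GPU: every pixel in a region pays for every distinct branch that at least one pixel in that region takes. The names `region_cost_per_pixel`, `path`, and `cost` are hypothetical.

```c
/* Toy model: path[i] is the branch taken by pixel i of one region,
   cost[p] is the cost of executing branch p. Returns the cost each
   pixel pays if the hardware runs every branch that any pixel needs. */
static int region_cost_per_pixel(const int *path, int n_pixels,
                                 const int *cost, int n_paths)
{
    int total = 0;
    for (int p = 0; p < n_paths; ++p) {
        int taken = 0;
        for (int i = 0; i < n_pixels; ++i) {
            if (path[i] == p) { taken = 1; break; }
        }
        if (taken) total += cost[p]; /* whole region executes branch p */
    }
    return total;
}
```

Under this model a region whose pixels all take the same branch pays for one branch only, while a region that mixes three branches pays for all three, which is why a larger region size tends to hurt more.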

Are your branches very big?
If not, better to use static branching and also not even use the "else" conditional.

It's simple. You are trying to perform conditionals in a sequential order. Have you thought about doing it in the reverse order? This method will remove the need to use "else" conditionals.

Secondly, sometimes you will find it better to abstract the instructions out of the "if" conditionals. Meaning try not to nest computations inside. This saves quite a lot of instructions.

Thirdly, replacing the "&&" operator with small functions saves one instruction each. Naturally, as your code grows, you will find you can save quite a number of instructions this way too.

Guest Anonymous Poster
Quote:

As mentioned, ATI's region size is on the order of 16 pixels (4x4, or 4x12 on 1900 series) and NVIDIA's is somewhere in the neighborhood of 64x64.


Can you elaborate on what that means please? What's a region size? And nVidia's region size is much bigger?

Quote:
Original post by Anonymous Poster
Quote:

As mentioned, ATI's region size is on the order of 16 pixels (4x4, or 4x12 on 1900 series) and NVIDIA's is somewhere in the neighborhood of 64x64.


Can you elaborate on what that means please? What's a region size? And nVidia's region size is much bigger?


It's the number of fragments that the GPU processes simultaneously.

Quote:
Original post by Anonymous Poster
Quote:

As mentioned, ATI's region size is on the order of 16 pixels (4x4, or 4x12 on 1900 series) and NVIDIA's is somewhere in the neighborhood of 64x64.


Can you elaborate on what that means please? What's a region size? And nVidia's region size is much bigger?


See this thread and this thread. Search might turn up more.



You also might consider moving branches up the pipeline. You mentioned that your shader is computing view dependent effects. You could implement a deferred shading scheme where instead of writing colors to the framebuffer, you write several parameters for each pixel to one of several textures. Then, for the final pass, you draw a screen aligned quad and shade it using your final shader that takes as input the parameters that were stored in textures from the earlier rendering pass. So what does this have to do with branching?

Well, you can write several shaders, one for each control flow condition. Then, you tile several quads across the screen, with different quads corresponding to the different regions where you wanted a certain control flow path to occur. This way, you can change control flow by changing shaders, which moves the branch up the pipeline, so to speak.

This isn't always going to be a good or graceful solution, but if you're having lots of performance problems caused by branching, this is a way to prevent pixels from taking a bunch of different control flow paths that they don't need to take.
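
One way to organise the tiled-quad idea on the CPU side is to group screen tiles by the shader variant they need, so each shader is bound once and all of its quads are drawn together. The C sketch below is a hypothetical illustration: `bucket_tiles`, `N_VARIANTS`, and the assumption that a whole tile wants a single control-flow path (with values in 0..N_VARIANTS-1) are not from the thread.

```c
#define N_VARIANTS 4 /* assumed: one shader per control-flow path */

/* variant[t] is the shader variant wanted by tile t (0..N_VARIANTS-1).
   On return, count[v] holds the number of tiles using variant v and
   order[] lists tile indices grouped by variant (a counting sort). */
static void bucket_tiles(const int *variant, int n_tiles,
                         int *order, int *count)
{
    int start[N_VARIANTS];
    int sum = 0;

    for (int v = 0; v < N_VARIANTS; ++v) count[v] = 0;
    for (int t = 0; t < n_tiles; ++t) count[variant[t]]++;

    /* First output slot for each variant. */
    for (int v = 0; v < N_VARIANTS; ++v) {
        start[v] = sum;
        sum += count[v];
    }
    /* Place each tile in its variant's next free slot. */
    for (int t = 0; t < n_tiles; ++t) order[start[variant[t]]++] = t;
}
```

The renderer would then walk `order`, switching shaders only when the variant changes. Tiles that straddle two regions would still need some fallback, which this sketch does not cover.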

Quote:
Original post by cwhite
You also might consider moving branches up the pipeline.

Good point - theoretically any control flow can be emulated using multiple passes and predication (in particular, z-cull does a really efficient job of this). This may even be more efficient on some current cards, although those days are numbered.

Thanks for all the kind words...
I will try to tune my shader program based on this advice and post the results as soon as possible. thx :)

cavatina

Quote:
Original post by cwhite
You also might consider moving branches up the pipeline. You mentioned that your shader is computing view dependent effects. You could implement a deferred shading scheme where instead of writing colors to the framebuffer, you write several parameters for each pixel to one of several textures. Then, for the final pass, you draw a screen aligned quad and shade it using your final shader that takes as input the parameters that were stored in textures from the earlier rendering pass. So what does this have to do with branching?

Well, you can write several shaders, one for each control flow condition. Then, you tile several quads across the screen, with different quads corresponding to the different regions where you wanted a certain control flow path to occur. This way, you can change control flow by changing shaders, which moves the branch up the pipeline, so to speak.

This isn't always going to be a good or graceful solution, but if you're having lots of performance problems caused by branching, this is a way to prevent pixels from taking a bunch of different control flow paths that they don't need to take.


Thank you.
This seems a better solution: it uses different passes running different shaders and keeps the branch instructions away. I hope I haven't misunderstood you... :)

cavatina

Quote:
Original post by edwinnie
Are your branches very big?
If not, better to use static branching and also not even use the "else" conditional.

It's simple. You are trying to perform conditionals in a sequential order. Have you thought about doing it in the reverse order? This method will remove the need to use "else" conditionals.

Secondly, sometimes you will find it better to abstract the instructions out of the "if" conditionals. Meaning try not to nest computations inside. This saves quite a lot of instructions.

Thirdly, replacing the "&&" operator with small functions saves one instruction each. Naturally, as your code grows, you will find you can save quite a number of instructions this way too.


Hi, I tried moving the actual computation out of the "if" conditions, and the FPS rose from 30 to 50. Additionally, I removed the "else" conditions and rewrote the code as:

... do something4 ...
if(...) do something1 ...
if(...) do something2 ...
if(...) do something3 ...

The FPS rose from 50 to 55.
These guidelines helped me a lot. Thanks :)

cavatina
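
The "default first, then independent ifs" layout can be sketched in C as follows. The thread elides the actual conditions and branch bodies, so the tests here are borrowed from the first post (with the wrap-around case written as ||, an assumption) and each "do somethingN" is replaced by a placeholder value N. The names `shade`, `absf`, and `PI_F` are hypothetical.

```c
#define PI_F 3.14159265358979f

/* Local absolute value so the sketch needs no math library. */
static float absf(float x) { return x < 0.0f ? -x : x; }

/* Returns a placeholder value 1..4 standing in for "do something1..4". */
static float shade(float theta)
{
    const float quarter = 0.25f * PI_F;

    float result = 4.0f; /* "do something4" runs unconditionally as the default */

    /* Independent tests with no else; each one overrides the default. */
    if (theta > 7.0f * quarter || theta < quarter) result = 1.0f;
    if (absf(theta - 2.0f * quarter) < quarter)    result = 2.0f;
    if (absf(theta - PI_F) < quarter)              result = 3.0f;

    return result;
}
```

This matches the original if/else chain only because the three conditions are mutually exclusive; if they could overlap, the last matching test would win rather than the first.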

