jcabeleira

Shader branching ruins performance

29 posts in this topic

We all know that dynamic branching hurts performance, but the size of the hit is quite ridiculous. Some time ago I discovered that a single "if" statement in one of my shaders was cutting the frame rate in half! I replaced it with a multiply and everything was fine.

Today I was experimenting with some simple raytracing calculations on the GPU, actually a kind of Ambient Occlusion based on the intersection of rays with spheres, and the two nested loops I used (for each ray, check for collision against each sphere) completely ruin the shader's performance. Currently I'm doing this in a GLSL pixel shader: for each pixel, cast 8 rays that are checked for intersection against 10 spheres. No big deal, the code is short and clean, but it is also slow as hell!

I also tried moving the calculation to the vertex shader (my scene contains about 16,000 vertices) and you know what? I got exactly the same performance as when it was running per pixel (which accounts for 1,700,000 pixels)! When I saw this, I realized that the complexity of the calculation was not the reason for the poor performance. Then I moved back to the pixel shader and unrolled the two nested loops by hand, which resulted in awfully lengthy code that, surprisingly, runs really fast!

So my question is this: is there any way to get the GLSL compiler to unroll the goddamn loops, instead of making the GPU use dynamic branching or making me unroll them by hand? Thanks

- Could you show some code ?
- Are you using constants to define iterations or are you using uniforms to define number of iterations ?
- What is your performance impact ?
- Are you sure, that your shader is not falling back to software mode ?
- What hardware are you using ?

--
Ashaman73

If you haven't already, try prefixing the if statement with [branch]

Check the disassembly in both cases, you may see an unrolled loop without it and an explicit if with it

[edit: just noticed you're using GLSL, this might not be applicable outside HLSL]

Quote:

What video card do you use ?


Nvidia GTX 260. It's good hardware.

Quote:

- Could you show some code ?
- Are you using constants to define iterations or are you using uniforms to define number of iterations ?
- What is your performance impact ?
- Are you sure, that your shader is not falling back to software mode ?
- What hardware are you using


Here is the code. The rays and spheres are two constant arrays that I didn't include here for the sake of post length.


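float occlusion = 0.0;

// For each ray, test every sphere for an intersection.
for (int ray = 0; ray < 8; ++ray) {
    vec3 rayDirection = rays[ray];

    for (int sphere = 0; sphere < 8; ++sphere) {
        // Point along the ray direction closest to the sphere center
        // (assumes rayDirection is normalized).
        vec3 sphereVector = spheres[sphere].position - position;
        float d = dot(rayDirection, sphereVector);
        vec3 nearestPoint = position + rayDirection * d;

        // The ray hits the sphere if that point lies within the radius.
        if (length(nearestPoint - spheres[sphere].position) <= spheres[sphere].radius)
            occlusion += 1.0 * dot(rayDirection, normal);
    }
}

occlusion /= 8.0;
gl_FragColor = vec4(1.0 - occlusion);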





With dynamic branching I get 2 fps but with unrolled loops I get 30 fps.
As I said before, I get the same performance whether I run in per pixel or per vertex.

It definitely sounds like it's running in software mode for whatever reason...

As far as unrolling goes, I believe that most graphics drivers will automatically unroll a loop or branch if they can (that is, when it does not depend on a variable). Not sure where I read that though.

Quote:
Original post by jcabeleira
Nvidia GTX 260. It's good hardware.


Indeed, your shader should run well on this kind of card.

Quote:
Original post by jcabeleira
With dynamic branching I get 2 fps but with unrolled loops I get 30 fps.
As I said before, I get the same performance whether I run in per pixel or per vertex.


Is the problem the dynamic branching or the loops ?

If you keep the loop but remove the if and always execute the occlusion operation, how does it affect the framerate ?

Do you have recent drivers ?

Quote:

Is the problem the dynamic branching or the loops ?

If you keep the loop but remove the if and always execute the occlusion operation, how does it affect the framerate ?

Do you have recent drivers ?


The loops are the problem. I tried replacing the "if" with a multiply, but performance didn't change.

Yes I have the most recent drivers.
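
For reference, the "multiply instead of if" variant discussed above can be sketched as follows. This is only an illustration built on the posted loop; the declarations of `rays`, `spheres`, `position` and `normal` are hypothetical (the thread never shows them, and the original keeps the arrays as constants rather than uniforms). The comparison is turned into a 0.0/1.0 factor with the built-in `step`:

```glsl
#version 120

// Hypothetical declarations; the thread does not show the real ones.
struct Sphere {
    vec3 position;
    float radius;
};

const int NUM_RAYS = 8;
const int NUM_SPHERES = 8;

uniform vec3 rays[NUM_RAYS];
uniform Sphere spheres[NUM_SPHERES];

varying vec3 position; // surface position
varying vec3 normal;   // surface normal

void main()
{
    float occlusion = 0.0;

    for (int ray = 0; ray < NUM_RAYS; ++ray) {
        vec3 rayDirection = rays[ray];
        float weight = dot(rayDirection, normal);

        for (int sphere = 0; sphere < NUM_SPHERES; ++sphere) {
            vec3 sphereVector = spheres[sphere].position - position;
            float d = dot(rayDirection, sphereVector);
            vec3 nearestPoint = position + rayDirection * d;
            float dist = length(nearestPoint - spheres[sphere].position);

            // step(edge, x) is 1.0 when x >= edge and 0.0 otherwise,
            // so hit is 1.0 exactly when dist <= radius.
            float hit = step(dist, spheres[sphere].radius);
            occlusion += hit * weight;
        }
    }

    occlusion /= float(NUM_RAYS);
    gl_FragColor = vec4(1.0 - occlusion);
}
```

The loops and the indexed array accesses are untouched in this variant, which would be consistent with it making no difference to the frame rate.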

This may be a dumb idea but what happens if you allocate all of your local variables (like sphereVector, nearestPoint, and maybe even the loop counters) outside the loop?

I'm thinking maybe it is unrolling your loop internally but doing so requires it to create more temporary variables than it has room for?

Would you mind showing a screen shot of your result? How does AO look with only 8 samples?

- don't use doubles, use floats
- don't index on interpolators or local variables
- avoid branching


if you break those rules, you can easily get 30 times slower. why? simply because the hardware runs 32 threads along the same code path (nvidia calls them warps). doing something the hardware isn't optimized for will probably make it something like 32 times slower (1/32nd of the performance).

the latter two points can be fixed by unrolling. if you assume your compiler doesn't unroll and you don't want to unroll it by hand, make the inner loop a function and pass the iteration parameter.
once you've done that, just call the function 8 times instead of doing that in a loop, passing the iteration param. this way it stays maintainable and will be "unrolled".
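
a rough sketch of that idea, under some assumptions: the declarations are hypothetical (the thread doesn't show them), the function names `sphereOcclusion` and `rayOcclusion` are made up for illustration, and the array element is passed instead of the iteration index so that every array access uses a literal index. whether a given compiler then produces branch-free code is not guaranteed.

```glsl
#version 120

// Hypothetical declarations; the thread does not show the real ones.
struct Sphere {
    vec3 position;
    float radius;
};

uniform vec3 rays[8];
uniform Sphere spheres[8];

varying vec3 position;
varying vec3 normal;

// Body of the inner loop: occlusion added by one sphere for one ray.
float sphereOcclusion(vec3 rayDirection, Sphere s)
{
    float d = dot(rayDirection, s.position - position);
    vec3 nearestPoint = position + rayDirection * d;

    if (length(nearestPoint - s.position) <= s.radius)
        return dot(rayDirection, normal);
    return 0.0;
}

// Inner loop replaced by eight calls, each with a literal array index.
float rayOcclusion(vec3 rayDirection)
{
    return sphereOcclusion(rayDirection, spheres[0])
         + sphereOcclusion(rayDirection, spheres[1])
         + sphereOcclusion(rayDirection, spheres[2])
         + sphereOcclusion(rayDirection, spheres[3])
         + sphereOcclusion(rayDirection, spheres[4])
         + sphereOcclusion(rayDirection, spheres[5])
         + sphereOcclusion(rayDirection, spheres[6])
         + sphereOcclusion(rayDirection, spheres[7]);
}

void main()
{
    // Outer loop replaced the same way.
    float occlusion = rayOcclusion(rays[0]) + rayOcclusion(rays[1])
                    + rayOcclusion(rays[2]) + rayOcclusion(rays[3])
                    + rayOcclusion(rays[4]) + rayOcclusion(rays[5])
                    + rayOcclusion(rays[6]) + rayOcclusion(rays[7]);

    occlusion /= 8.0;
    gl_FragColor = vec4(1.0 - occlusion);
}
```

this keeps the intersection math in one place, at the cost of 16 call lines instead of 64 hand-unrolled copies of the loop body.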

Quote:
Original post by ZenoGD
This may be a dumb idea but what happens if you allocate all of your local variables (like sphereVector, nearestPoint, and maybe even the loop counters) outside the loop?

I'm thinking maybe it is unrolling your loop internally but doing so requires it to create more temporary variables than it has room for?


I tried that, but got no improvement.

Quote:

Would you mind showing a screen shot of your result? How does AO look with only 8 samples?


Looks pretty good actually. In fact, it looks just like SSAO. The only difference is that it calculates occlusion from objects outside the view frustum too.

Quote:

- don't use doubles, use floats
- don't index on interpolators or local variables
- avoid branching


if you break those rules, you can easily get 30 times slower. why? simply because the hardware runs 32 threads along the same code path (nvidia calls them warps). doing something the hardware isn't optimized for will probably make it something like 32 times slower (1/32nd of the performance).

the latter two points can be fixed by unrolling. if you assume your compiler doesn't unroll and you don't want to unroll it by hand, make the inner loop a function and pass the iteration parameter.
once you've done that, just call the function 8 times instead of doing that in a loop, passing the iteration param. this way it stays maintainable and will be "unrolled".


Yes, I'm aware of all that. First, I'm using floats only.
Second, I could unroll the shader, but I wanted to make it flexible enough to handle an arbitrary number of spheres.
Third, even if I used a fixed number of spheres like 8 spheres, that still gives me 64 loop iterations to unroll (8 rays X 8 spheres) which is not maintainable.

Quote:
Original post by jcabeleira
First, I'm using floats only.

occlusion/= 8.0;
gl_FragColor= vec4(1.0- occlusion);

those are doubles. GTX cards from nvidia are capable of processing them, but they run at 1/8th of the float performance. I cannot guarantee that nvidia's glsl compiler really treats them as doubles, but it's possible it does.

Quote:
Original post by Krypt0n
Quote:
Original post by jcabeleira
First, I'm using floats only.

occlusion/= 8.0;
gl_FragColor= vec4(1.0- occlusion);

those are doubles. GTX cards from nvidia are capable of processing them, but they run at 1/8th of the float performance. I cannot guarantee that nvidia's glsl compiler really treats them as doubles, but it's possible it does.


I think you're mixing this up with C++, which takes "1.0" as a double and "1.0f" as a float. I think all shader languages take any real number without the "f" postfix as a float, not a double.

But even if what you're saying is true, the calculations you mentioned are performed once per pixel, which is negligible compared to the 64 iterations of ray-sphere intersection code.

[EDIT]: I added the "f" postfix to those values like you suggested but got no performance improvement.

Quote:
Original post by jcabeleira
Quote:
Original post by Krypt0n
Quote:
Original post by jcabeleira
First, I'm using floats only.

occlusion/= 8.0;
gl_FragColor= vec4(1.0- occlusion);

those are doubles. GTX cards from nvidia are capable of processing them, but they run at 1/8th of the float performance. I cannot guarantee that nvidia's glsl compiler really treats them as doubles, but it's possible it does.


I think you're mixing this up with C++, which takes "1.0" as a double and "1.0f" as a float. I think all shader languages take any real number without the "f" postfix as a float, not a double.

But even if what you're saying is true, the calculations you mentioned are performed once per pixel, which is negligible compared to the 64 iterations of ray-sphere intersection code.

[EDIT]: I added the "f" postfix to those values like you suggested but got no performance improvement.

i wasn't sure about the way glsl handles it. but there are sometimes awkward cases where the optimizer screws something up. usually, with constants for the loop start and end, everything should be just one simple shader with no branching at all.
you could maybe use ati's rendermonkey to check your glsl code (i'm not sure if it runs on nvidia cards, and nvidia's fx composer wasn't supporting glsl last time i checked, but maybe it's not too hard to convert your glsl shader to hlsl for fx composer; there you could check the shader assembly output, including hardware-specific static performance analysis).

sorry for not being much help with the unrolling thing. but i'm pretty sure it's not the dynamic branching that hurts you, it's rather the usage of variables, either because of too many temporaries or because of indexing.
one old "trick" was to move those indexable things into textures. sampling floats isn't that fast either, but it won't make the framerate drop from 30 to 2.

RenderMonkey does support NVIDIA cards.
FX Composer supports ATI cards but it does not support GLSL so it cannot be used.
To disassemble your shader, use a special utility called "nvemulate", search it on google. Once you run it, select "Write shader assembly", then check your program's working directory and the disassembly should all be there. It's in the NVvp/fp language.

Quote:
Original post by Momoko_Fan
RenderMonkey does support NVIDIA cards.
FX Composer supports ATI cards but it does not support GLSL so it cannot be used.
To disassemble your shader, use a special utility called "nvemulate", search it on google. Once you run it, select "Write shader assembly", then check your program's working directory and the disassembly should all be there. It's in the NVvp/fp language.


I took your advice and looked at the disassembled shader.
It has nothing particularly surprising in it. As expected, the shader does perform the loops instead of unrolling them; it contains a couple of nested REP/ENDREP instructions that do it.
The only thing that looked a little weird was that the shader's variables are all initialized with MOVs. Since I have 32 rays and 10 spheres declared, that resulted in a lot of moves.

Now here is the funny part:
Of the 32 declared rays I was only using 8, so I removed the unnecessary rays from the array declaration, and the frame rate rose from 2 fps to 4 fps. When I checked the disassembled code, I realized that the compiler had decided to unroll the outer loop and eliminate the array of rays, which was no longer necessary. The unrolled outer loop explains the strange increase in frame rate.

Wow! I can't believe what my eyes are seeing:
I was playing with NVemulate and I set the GLSL compilation profile to force NV40 compatibility. Since NV40 has no support for dynamic branching (I suppose), the compiler had to unroll all the loops.
This way, my shader runs at 30 fps instead of 2 fps!!!

So, now that it has been shown that the slowness comes from the loops and that unrolling them solves the issue, I only need a way to force the loop unrolling from code. But let me guess, there's no way to do it, right?

If you use the latest drivers, then I would suggest rolling back to previous drivers to see if it makes a difference. Your problem is really starting to look like a driver problem to me.

Other than that, maybe you could give a try to storing your data in 1D textures instead of an array of constants. I've seen strange behaviors when accessing an array of constant uniforms in the past, although that'd be surprising on a GTX 260.
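
A minimal sketch of the texture idea, under several assumptions: the sampler names `rayTexture` and `sphereTexture` and the helper `fetchTexel` are hypothetical, the application is assumed to upload the data into 1D floating-point textures with nearest filtering, and the packing (direction in rgb, sphere center in rgb plus radius in alpha) is just one possible layout. The loops stay as they are; only the indexed constant reads become texture reads.

```glsl
#version 120

const int NUM_RAYS = 8;
const int NUM_SPHERES = 8;

// Hypothetical 1D float textures filled by the application:
//   rayTexture:    rgb = ray direction
//   sphereTexture: rgb = sphere center, a = sphere radius
uniform sampler1D rayTexture;
uniform sampler1D sphereTexture;

varying vec3 position;
varying vec3 normal;

// Read texel `index` of a 1D texture that is `size` texels wide,
// sampling at the texel center.
vec4 fetchTexel(sampler1D tex, int index, int size)
{
    return texture1D(tex, (float(index) + 0.5) / float(size));
}

void main()
{
    float occlusion = 0.0;

    for (int ray = 0; ray < NUM_RAYS; ++ray) {
        vec3 rayDirection = fetchTexel(rayTexture, ray, NUM_RAYS).xyz;

        for (int sphere = 0; sphere < NUM_SPHERES; ++sphere) {
            vec4 s = fetchTexel(sphereTexture, sphere, NUM_SPHERES);
            vec3 center = s.xyz;
            float radius = s.w;

            float d = dot(rayDirection, center - position);
            vec3 nearestPoint = position + rayDirection * d;

            if (length(nearestPoint - center) <= radius)
                occlusion += dot(rayDirection, normal);
        }
    }

    occlusion /= float(NUM_RAYS);
    gl_FragColor = vec4(1.0 - occlusion);
}
```

A side effect of this layout is that the number of spheres could later come from a uniform instead of a constant, since the data no longer lives in a fixed-size array.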

Then your last hope would be to make a minimal program that reproduces the problem and send it to NVidia, and hope they have a look at the program.

Y.

Quote:
Original post by jcabeleira
Since NV40 has no support for dynamic branching (I suppose), the compiler had to unroll all the loops.
As a side note, I am pretty sure PS3.0 has full branching support...
EDIT: Anyway, this is just ugly.

Quote:
Original post by Ysaneya
If you use the latest drivers, then I would suggest rolling back to previous drivers to see if it makes a difference. Your problem is really starting to look like a driver problem to me.


It shouldn't be a driver problem. I've tested the shader on two different computers. One of them is a laptop with an Nvidia GTX 260, the other is a PC with an Nvidia 9800. Those two computers use different but recent drivers, and the shader performance problem happens on both of them.

Quote:

Other than that, maybe you could give a try to storing your data in 1D textures instead of an array of constants. I've seen strange behaviors when accessing an array of constant uniforms in the past, although that'd be surprising on a GTX 260.


Yes, that could be a good idea. Thanks.

Quote:
Original post by Krohm
As a side note, I am pretty sure PS3.0 has full branching support...
EDIT: Anyway, this is just ugly.


It most definitely does support dynamic branching.

With HLSL you can control things like unrolling and dynamic branching using attributes or compiler flags. I have no idea if GLSL supports such things.

I assume that rays and spheres are constant registers? Then that's the problem! I had the same issue with either a 7800 or a 9800, don't remember, but this hardware does not support constant register indexing in a pixel shader! At least not in an efficient way. It does in a vertex shader, though. In a pixel shader, the indexing code rays[ray] is basically unrolled into
if (ray == 0) return rays[0];
else if (ray == 1) return rays[1];

and so on. As you can imagine, this is sub-optimal to say the least. So if the for-loops are unrolled, the indexing is also done explicitly and you don't have that issue. So it's not dynamic branching, it's array indexing.

To test this hypothesis, try this: Keep the for-loop, but replace rays[ray] with rays[0] and the same with spheres. If the speed increases, indexing is indeed the problem.
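
A sketch of that test, with hypothetical declarations (the thread doesn't show the real ones) and a made-up `CONSTANT_INDEX_TEST` switch so both variants can be timed from the same source. The picture will be wrong with the test enabled; only the frame rate matters.

```glsl
#version 120

// Set to 1 to run the indexing test, 0 for the original behavior.
#define CONSTANT_INDEX_TEST 1

#if CONSTANT_INDEX_TEST
#define RAY_INDEX(i)    0
#define SPHERE_INDEX(i) 0
#else
#define RAY_INDEX(i)    (i)
#define SPHERE_INDEX(i) (i)
#endif

// Hypothetical declarations; the thread does not show the real ones.
struct Sphere {
    vec3 position;
    float radius;
};

uniform vec3 rays[8];
uniform Sphere spheres[8];

varying vec3 position;
varying vec3 normal;

void main()
{
    float occlusion = 0.0;

    for (int ray = 0; ray < 8; ++ray) {
        // With the test enabled this always reads rays[0].
        vec3 rayDirection = rays[RAY_INDEX(ray)];

        for (int sphere = 0; sphere < 8; ++sphere) {
            // With the test enabled this always reads spheres[0].
            Sphere s = spheres[SPHERE_INDEX(sphere)];

            float d = dot(rayDirection, s.position - position);
            vec3 nearestPoint = position + rayDirection * d;

            if (length(nearestPoint - s.position) <= s.radius)
                occlusion += dot(rayDirection, normal);
        }
    }

    occlusion /= 8.0;
    gl_FragColor = vec4(1.0 - occlusion);
}
```

One caveat: once the indices are constant, the compiler may simplify the loops much further, so a speed-up here is a hint that indexing is involved rather than proof.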

To fix the performance issues, use textures. Crytek & co are all using textures as well.

@Krypt0n: Doubles are an SM5 feature, G260s are SM4, though.

Try replacing this:
  vec3 rayDirection= rays[ray];

with this:
vec3 getRay(int idx)
{
    if (idx == 0) return rays[0];
    else if (idx == 1) return rays[1];
    else if (idx == 2) return rays[2];
    ...
}

vec3 rayDirection = getRay(ray);

Do the same with spheres.

btw I'm sure you're aware, but nvemulate has an option to dump the ASM from a GLSL file,
which can give you an idea of what's causing the difference in speed between two different methods