Shader branching ruins performance

28 comments, last by jcabeleira 14 years, 2 months ago
Quote:Original post by Ysaneya
If you use the latest drivers, then I would suggest rolling back to previous drivers to see if it makes a difference. Your problem really starts to look like a driver problem to me.


It shouldn't be a driver problem. I've tested the shader on two different computers. One of them is a laptop with an Nvidia GTX 260, the other is a PC with an Nvidia 9800. Those two computers use different but recent drivers, and the shader performance problem happens on both of them.

Quote:
Other than that, maybe you could give a try to storing your data in 1D textures instead of an array of constants. I've seen strange behaviors when accessing an array of constant uniforms in the past, although that'd be surprising on a GTX 260.


Yes, that could be a good idea. Thanks.
Quote:Original post by Krohm
As a side note, I am pretty sure PS3.0 has full branching support...
EDIT: Anyway, this is just ugly.


It most definitely does support dynamic branching.

With HLSL you can control things like unrolling and dynamic branching using attributes or compiler flags. I have no idea if GLSL supports such things.
I assume that rays and spheres are constant registers? Then that's the problem! I had the same issue with either a 7800 or a 9800, don't remember, but this hardware does not support constant register indexing in a pixel shader! At least not in an efficient way. It does in a vertex shader, though. In a pixel shader, the indexing code rays[ray] is basically unrolled into
if (ray == 0) return rays[0];
else if (ray == 1) return rays[1];

and so on. As you can imagine, this is sub-optimal to say the least. If the for-loops are unrolled, every index becomes a compile-time constant, the indexing is done explicitly, and you don't have that issue. So it's not dynamic branching, it's array indexing.

To test this hypothesis, try this: Keep the for-loop, but replace rays[ray] with rays[0] and the same with spheres. If the speed increases, indexing is indeed the problem.
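A minimal sketch of that test, assuming a shader shaped roughly like the one discussed. The original source is not shown here, so `NUM_RAYS`, the array contents, the `normal` input and the loop body are all hypothetical placeholders. The rendered image will be wrong; only the frame time matters.

```glsl
#version 120

// Hypothetical stand-in for the thread's ray table; size and values are assumptions.
const int NUM_RAYS = 4;
const vec3 rays[NUM_RAYS] = vec3[NUM_RAYS](
    vec3( 1.0,  0.0, 0.0),
    vec3( 0.0,  1.0, 0.0),
    vec3(-1.0,  0.0, 0.0),
    vec3( 0.0, -1.0, 0.0));

varying vec3 normal; // assumed per-pixel input so the loop body does real work

void main()
{
    float sum = 0.0;
    for (int ray = 0; ray < NUM_RAYS; ++ray)
    {
        // Original form (suspected slow path): index depends on the loop counter.
        // vec3 rayDirection = rays[ray];

        // Test form: same loop, but a fixed index.
        vec3 rayDirection = rays[0];

        sum += max(dot(rayDirection, normalize(normal)), 0.0);
    }
    gl_FragColor = vec4(vec3(sum / float(NUM_RAYS)), 1.0);
}
```

Keep in mind that with a fixed index the compiler may hoist or fold part of the loop body, so a speed-up here points at indexing as the likely culprit but is not an exact measurement of its cost.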

To fix the performance issues, use textures. Crytek & co are all using textures as well.

@Krypt0n: Doubles are an SM5 feature, G260s are SM4, though.
Try replacing this:
  vec3 rayDirection= rays[ray];

with this:
vec3 getRay(int idx)
{
  if (idx == 0) return rays[0];
  else if (idx == 1) return rays[1];
  else if (idx == 2) return rays[2];
  ...
}
vec3 rayDirection = getRay(ray);

Do the same with spheres.
By the way, I'm sure you're aware, but NVemulate has an option to dump the ASM generated from a GLSL file, which can give you an idea of what's causing the difference in speed between two different methods.
I think Lutz is on to something - my SSAO loop runs a very similar type of calculation (two nested for loops with some arithmetic in the center), and even on an 8600M it runs significantly faster than 30 fps for a similar number of iterations. I've also used dynamic branching with for loops in a parallax occlusion calculation and there weren't any big problems with performance... Have you tried his suggested test out yet?
Quote:Original post by Jason Z
I think Lutz is on to something - my SSAO loop runs a very similar type of calculation (two nested for loops with some arithmetic in the center), and even on an 8600M it runs significantly faster than 30 fps for a similar number of iterations. I've also used dynamic branching with for loops in a parallax occlusion calculation and there weren't any big problems with performance... Have you tried his suggested test out yet?


Not yet, I'm busy with other stuff right now, but I'll try it very soon.
Thanks for all your help, guys.
As already mentioned, Nvidia cards can't do register indexing. IIRC they use scratch memory to do it, which explains all those movs.
Quote:Original post by Lutz
I assume that rays and spheres are constant registers? Then that's the problem! I had the same issue with either a 7800 or a 9800, don't remember, but this hardware does not support constant register indexing in a pixel shader! At least not in an efficient way.


You were absolutely right, Lutz. It was the array indexing that was causing the major slowdown. I replaced the arrays with textures holding the same information and the array indexing with texture reads, and now I get 30 FPS.
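For readers wanting to try the same change, here is a rough sketch of a texture-based lookup. It is not the shader from this thread: `raysTex`, `numRays`, `normal` and the loop body are hypothetical names, and it assumes GLSL 1.30 or later and a floating-point 1D texture with one ray direction per texel, so no decode step is needed.

```glsl
#version 130

uniform sampler1D raysTex; // hypothetical: one texel per ray, xyz = direction
uniform int numRays;       // hypothetical: number of valid texels

in vec3 normal;            // assumed per-pixel input
out vec4 fragColor;

void main()
{
    float sum = 0.0;
    for (int ray = 0; ray < numRays; ++ray)
    {
        // texelFetch reads texel `ray` at mip level 0 with no filtering,
        // replacing the dynamic index into a constant array.
        vec3 rayDirection = texelFetch(raysTex, ray, 0).xyz;
        sum += max(dot(rayDirection, normalize(normal)), 0.0);
    }
    fragColor = vec4(vec3(sum / float(max(numRays, 1))), 1.0);
}
```

If the texture uses an 8-bit normalized format instead, the directions would have to be remapped on read (for example `* 2.0 - 1.0`), which is the kind of encode/decode overhead a float format avoids.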

But what scares me the most is that I already had several shaders making heavy use of array indexing for the SSAO and soft shadowing effects. Although their performance wasn't bad, I wonder how much I could increase the frame rate of my application just by removing the array indexing.
Hmmm, I overcomplicated the solution. Instead of using textures I'm now using a uniform array of vec3s, which also works fine and saves me the trouble of encoding/decoding the needed information in textures.
What is really surprising is that the GPU accepts the uniform array as a fast input method, but doesn't handle a simple constant array declared in the shader the same way.
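For comparison, a sketch of what the uniform-array variant might look like. The names, the array size and the loop body are assumptions, not the actual shader from this thread.

```glsl
#version 120

const int NUM_RAYS = 4;       // hypothetical size; must match what the application uploads

// Slow variant reported in this thread: table baked into the shader.
// const vec3 rays[NUM_RAYS] = vec3[NUM_RAYS](...);

// Variant that reportedly ran fast: same table, supplied by the application.
uniform vec3 rays[NUM_RAYS];

varying vec3 normal;          // assumed per-pixel input

void main()
{
    float sum = 0.0;
    for (int ray = 0; ray < NUM_RAYS; ++ray)
    {
        // The indexing expression is unchanged; only the storage qualifier differs.
        vec3 rayDirection = rays[ray];
        sum += max(dot(rayDirection, normalize(normal)), 0.0);
    }
    gl_FragColor = vec4(vec3(sum / float(NUM_RAYS)), 1.0);
}
```

On the application side such an array can be filled with a single `glUniform3fv` call on the location of `rays`. Whether this is faster than a constant array will depend on hardware and driver, so it is worth measuring on each target.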
