kRogue

unroll loops in fragment shader: good or bad?


On GeForce 6 and higher, the fragment shader can handle genuine dynamic branching; however, the GLSL and Cg compilers like to unroll a for-loop whenever they can. Now the question: is that always a good idea? For example, I have the following shader:
void main()
{
  // Simple diffuse lighting, no specular.
  // pixel_data, light_data, and the unpack_*/number_lights/light_vector
  // helpers are defined elsewhere in the shader source.
  float i, numberLights;
  pixel_data pData;   // GLSL declares struct variables without the 'struct' keyword
  light_data lData;
  vec3 color, lightVector;
  float lightVector_mag, dot_value, final_value, att_value, directional_att;

  unpack_pixel_data(pData);
  numberLights = number_lights(pData.light_group);

  for(color = vec3(0.0), i = 0.0; i < numberLights; i += 1.0)
    {
      unpack_light_data(pData.light_group, i, lData);

      // vector from pixel to light:
      lightVector = light_vector(pData, lData);

      lightVector_mag = length(lightVector);
      lightVector /= lightVector_mag;

      dot_value = max(0.0, dot(lightVector, pData.normal_in_room_coordinates));

#ifdef LIGHT_DIRECTIONAL
      // note: clamp takes (x, minVal, maxVal); the value to clamp comes first
      directional_att = clamp(dot(lightVector, lData.light_direction_in_room_coordinates), 0.0, 1.0);

      directional_att = smoothstep(lData.cone_angle_end_cosine,
                                   lData.cone_angle_begin_lessening_cosine,
                                   directional_att);
#endif

#ifdef LIGHT_ATTENUATION
      // constant + linear + quadratic attenuation, evaluated Horner-style:
      att_value = lData.attenuation_factors.x
        + lightVector_mag*(lData.attenuation_factors.y
                           + lightVector_mag*lData.attenuation_factors.z);

#ifdef LIGHT_DIRECTIONAL
      final_value = dot_value*directional_att/att_value;
#else
      final_value = dot_value/att_value;
#endif
#else  // no LIGHT_ATTENUATION
#ifdef LIGHT_DIRECTIONAL
      final_value = dot_value*directional_att;
#else
      final_value = dot_value;
#endif
#endif

      color += lData.color*final_value;
    }

  gl_FragColor.rgb = color.rgb * pData.raw_color;
  gl_FragColor.a = 1.0;
}


Just so you know: the system is a deferred shading system; the various unpack functions pull the needed data out of textures (pixel data has things like position, normal, and color; light data has things like color, direction, etc. for the lights). At any rate, if I make the number of lights the same for each pixel, i.e. number_lights() just returns a constant known at compile time, the loop gets completely unrolled, which makes the shader very big (almost 300 instructions). But if I make number_lights() return a uniform or a value from a texture, then the shader is only 55 instructions. So now the question: is it better for performance for the loop to be unrolled or not? My first reflex is *yes, of course!*, but in the back of my mind are doubts... anyone have comments?
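For concreteness, here is a minimal sketch of the two number_lights() variants described above (u_number_lights is a made-up uniform name, and only one variant would actually be compiled into the shader). The first lets the compiler unroll the lighting loop completely; the second forces a genuine dynamic loop:

// Variant 1: the bound is a compile-time constant, so the
// compiler will typically unroll the lighting loop completely.
float number_lights(float light_group)
{
  return 8.0;
}

// Variant 2: the bound comes from a uniform, so the compiler
// must emit a real dynamic loop.
uniform float u_number_lights;
float number_lights(float light_group)
{
  return u_number_lights;
}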

No matter what instruction count the shader has: it's the number of instructions executed that determines performance. And with dynamic branching that number is always higher than with an unrolled loop; the looping shader just looks shorter.
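To put rough, purely illustrative numbers on that, take the two shaders from the original post: the unrolled version is about 300 instructions and executes each of them once. If roughly 40 of the looped version's 55 instructions form the loop body, and each iteration adds a few instructions of counter and branch overhead, then eight lights execute on the order of 8 × (40 + 4) ≈ 350 instructions. The looped shader is far smaller on paper, yet it can easily execute more instructions than the unrolled one.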

Bye, Thomas

That is exactly what I figured, but here is the doubt in my mind: if the shader is very long, is there a performance loss due to, for lack of a better word, cache misses on shader instructions? On a CPU we usually don't care about executable size, but in some hardware situations one wants the whole instruction stream to fit in the processor. I figured that for older GPUs the entire shader is "in the GPU" while it is in use (whereas for CPUs the whole running program is not "in the CPU", so the OS and CPU work together to stream the instructions in). But what about GPUs? The entire system is a black box of driver and hardware; I don't know what it does with really long shaders. Will future GPUs also stream the instructions in as shaders get really huge?

I highly doubt there is a max length for a shader, if that's what you're asking. The code gets converted to machine language eventually, which doesn't have for loops or any other high-level programming constructs, so the only thing you'll be saving is a KB or two on the file size of the shader.

Quote:
Original post by xerodsm
I highly doubt there is a max length for a shader, if that's what you're asking. The code gets converted to machine language eventually, which doesn't have for loops or any other high-level programming constructs, so the only thing you'll be saving is a KB or two on the file size of the shader.


Shaders do have maximum lengths, but I'm pretty sure that was only ever a problem with SM2.0 or earlier; with modern cards you will never have to worry about length. I'm also 90% certain that the length of a shader will never be the performance problem except in the simplest shaders; the cost is dominated by texture lookups, exp() calls, and the like.

Also, machine language most certainly does have loops. How else would you implement a loop based on a non-constant condition? Take something like:


for(int i = 0; i < 5; i++)
    do_something(i);




        movl  $0, %eax
foo:    pushl %eax
        call  _do_something    # let's pretend this won't trash %eax
        addl  $4, %esp
        incl  %eax
        cmpl  $5, %eax
        jl    foo




My assembler syntax is amazingly rusty so I'd be amazed if that was right, but you get the picture - it's still a loop. It would probably also be changed to be a jnz by the compiler, but whatever.

With GLSL, the driver knows best, so if it unrolls the loop, then unrolling is better: the GPU avoids the jump instruction and whatever penalty goes with it.
It's the same with CPUs: if the compiler can unroll a loop, that is better; if it can inline, that is also better... but only up to a certain limit.
The only way to be sure is to profile the code.


"The code gets converted to machine language eventually, which doesn't have for loops"

Incorrect. The GeForce 6 is capable of loops in the fragment shader; it's an SM 3 feature.
SM 2 cards could do dynamic loops in the vertex shader only.

When I said machine language doesn't have any loops, I was referring to 1s and 0s, which is true machine language; there are no words at all, so a loop in 1s and 0s is just more 1s and 0s. Of course I agree that assembly language has loops, or you'd be right that dynamic loops couldn't be implemented. Basically, anywhere you have a loop in GLSL you could potentially just rewrite the code inside the loop over and over, as many times as the loop will execute, and that is what gets generated. Of course, if you have a dynamic count of something, then it's a much better idea to use a loop in your GLSL code than a bunch of if statements checking how many times to execute the code.

Basically I'm saying loops are fine and there's no need to worry about them, since they'll just be executed like normal code, over and over.
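A trivial sketch of what that rewriting looks like (the sampler name and the fixed three-iteration bound are made up for illustration):

uniform sampler2D u_lights; // hypothetical light-data texture

vec3 sum_as_written(void)
{
  vec3 sum = vec3(0.0);
  for(int i = 0; i < 3; ++i)
    sum += texture2D(u_lights, vec2(float(i)/3.0, 0.5)).rgb;
  return sum;
}

// What the compiler effectively produces when it unrolls the loop:
vec3 sum_unrolled(void)
{
  vec3 sum = vec3(0.0);
  sum += texture2D(u_lights, vec2(0.0/3.0, 0.5)).rgb;
  sum += texture2D(u_lights, vec2(1.0/3.0, 0.5)).rgb;
  sum += texture2D(u_lights, vec2(2.0/3.0, 0.5)).rgb;
  return sum;
}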

Even machine language has loops. A loop is basically a jump to an earlier instruction, so when the processor, whether CPU or GPU, sees a certain sequence of bytes, it knows it is a jump instruction and knows which address to go to.
For, while, and do-while are basically the same thing.

Alright, that's true. I'm just saying that when it gets down to the hardware level, it's all 1s and 0s; code is code, and having a for loop or having the same code repeated the same number of times will yield the same result.

OK, now I'm going to repeat myself: there is a difference in the result. Loops are slower; it takes at minimum one clock cycle to execute a jump.

That matters for shaders; you can lose a few FPS. By default, the nVidia compiler will unroll a loop if its bound is static. I'm sure ATI does the same.

The '0's and '1's are irrelevant.
