Is dynamic branching really dynamic branching on SM3.0 shader cards?

8 comments, last by rapunzel 17 years, 7 months ago
Hey, I am wondering if dynamic branching is really what it promises to be on SM3.0 cards. I am working on a GeForce 6600GT, which promises Shader Model 3.0 functionality and therefore dynamic branching. As I understand it, dynamic branching means skipping the execution of an if-block when the boolean expression in the if-clause evaluates to false. But based on some tests I made, it seems that this is not happening; instead the calculation is done and the result is just ignored. See the following example in (pseudo) fragment shader code:

vec4 calculateColor()
{
  vec4 color = get the color for the fragment somehow (texture - whatever)
  
  if (color.a > 0.5)
  {
    color = result of a difficult lighting calculation
  }
  else
  {
    color = vec4(0.5);
  }

  return color;
}

void main()
{
  gl_FragColor = calculateColor();
}



So my geometry is rendered and all fragments that have alpha > 0.5 get the lighting calculation. If I make sure that no fragment has an alpha > 0.5, I get only uniform grey pixels everywhere. That works fine, but it should also be faster, because the complicated lighting calculation is completely skipped. But nothing gets faster, not at all :-/ To make sure that the speed problem really lies in the fragment shader, I deleted the if-else block and just left "color = vec4(0.5)", so there is no lighting calculation in the code anymore and everything is always grey. And voila: performance rises clearly and noticeably. So what is happening? If the execution always runs into the else part and always skips the code in the if block, I should get almost the same higher performance, shouldn't I? At least if dynamic branching really works.

a confused rapunzel
So here's the thing. A GPU is heavily pipelined. At any given time, there are up to hundreds of pixels in flight at once. (Not just the number of pipes you have.) These pixels are processed in large groups, the sizes of which I forget. These groups all have to pretty much proceed in lock step through the same instruction sequences. So dynamic branching is only helpful performance-wise if you're branching based on a uniform; if you're branching on a quantity that varies per pixel, every pixel is going to end up following every path and discarding results. This is basically what's happening in your case. It only takes one pixel going the other way to force the dozens of others in the same group into executing every instruction on both sides of the branch.
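To make the lock-step point concrete, here is a toy cost model in Python. It is only a sketch of the reasoning above, not a description of how any real GPU schedules work: the batch size, the per-side costs and the function name `batch_cost` are all made up for illustration, and the model ignores any fixed overhead of the branch itself. A batch pays for one side of the branch as soon as a single pixel in it needs that side.

```python
def batch_cost(takes_if, batch_size, cost_if, cost_else):
    """Toy cost of a branch over pixels that run in lock-step batches.

    takes_if: one bool per pixel (True = the pixel wants the if-side).
    A batch pays for a side if at least one of its pixels takes that side.
    All numbers are hypothetical units, not measured cycles.
    """
    total = 0
    for start in range(0, len(takes_if), batch_size):
        batch = takes_if[start:start + batch_size]
        if any(batch):        # at least one pixel needs the if-side
            total += cost_if
        if not all(batch):    # at least one pixel needs the else-side
            total += cost_else
    return total


# Hypothetical setup: 8 pixels, batches of 4, if-side costs 100, else-side 1.
all_else = [False] * 8
one_stray = [False] * 7 + [True]
print(batch_cost(all_else, 4, 100, 1))   # 2
print(batch_cost(one_stray, 4, 100, 1))  # 102
```

In this model a single stray pixel makes its whole batch pay for the expensive side, while a batch in which every pixel agrees only pays for the side it takes.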
Quote:Original post by Promit
These pixels are processed in large groups, the sizes of which I forget. These groups all have to pretty much proceed in lock step through the same instruction sequences.


The numbers 16 for ATI's latest generation and 64 for NV's spring to mind... the numbers might not be right, but I do know that ATI's hardware works on significantly smaller blocks than NV's, which is (part of) the reason for its faster benchmark results with SM3.0 shaders.

Wow, that's kinda bad news :-/ So basically, even if no pixel ever causes the if-clause to be true, so that the result is always the one from the else-part, all pixels will go through all calculations because the driver/graphics card cannot predict whether one of the many pixels in a processing block might need the if-result?

So branching works for uniforms because the value of the uniform is known ahead of time, while branching based on fragments doesn't work because the value is known only when the fragment is processed, and by then the path it has to run through is already kinda fixed for that pixel-processing block?

I mean, I thought modern graphics cards really run these little fragment programs on little fragment processors that can do real dynamic branching, meaning skipping code that should not be executed, but obviously it's not like that O.o, is it?

Does anyone have some good reading tips about how modern graphics cards work? I'd really like to understand it better now O.o

And thanks for the answers.

greetz
rapunzel
Quote:Original post by phantom
The numbers 16 for ATI's latest generation and 64 for NV's spring to mind...

These numbers are wrong :-P ;-)
Correct numbers (hopefully I am not wrong now :-D ):
ATI X1800XT: 16
ATI X1900XT: 48
Nvidia: 256

This article explains some things.

Quote:Original post by rapunzel
I mean, I thought modern graphics cards really run these little fragment programs on little fragment processors that can do real dynamic branching, meaning skipping code that should not be executed, but obviously it's not like that O.o, is it?

Oh, it is. It just doesn't work well on Nvidia cards. ATI's cards (X1k series) perform much better with dynamic branching in shaders. Try "Dynamic branching 3" from Humus. With dynamic branching, the FPS number increases by (at least) 60% on my ATI X1800XT, and nothing happens on my GeForce 6600GT.

Quote:Does anyone have some good reading tips about how modern graphics cards work? I'd really like to understand it better now O.o

The Holy Grail (for Germans at least) is http://www.3dcenter.de. I don't know any better site for those things (except Gamedev.net sometimes :-D ).
--
Quote:Original post by rapunzel
Does anyone have some good reading tips about how modern graphics cards work? I'd really like to understand it better now O.o


GPU Gems 2, "The GeForce 6 Series GPU Architecture".

Note that because dynamic branching has a fixed cost, it is sometimes faster to just use a predicate. A predicate acts as a write mask based on some condition, so it cannot skip work the way dynamic branching can, but it does not incur the additional cost of branching.
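As a rough illustration of that trade-off, the Python sketch below compares the average per-pixel cost of a real branch with the cost of predication. Everything in it is an assumption made for the example: the function names, the overhead of 6 units and the per-side costs are hypothetical, not measured figures for any GPU, and the branch is assumed to be perfectly coherent within each batch.

```python
def expected_branch_cost(p_if, cost_if, cost_else, overhead):
    """Average per-pixel cost of a real branch.

    Pays a fixed overhead plus only the side that is taken.
    p_if is the fraction of pixels that take the if-side.
    """
    return overhead + p_if * cost_if + (1.0 - p_if) * cost_else


def predicate_cost(cost_if, cost_else):
    """Predication runs both sides and masks the write: no branch overhead."""
    return cost_if + cost_else


OVERHEAD = 6  # hypothetical fixed cost of the branch instructions

# Two cheap sides: the overhead dominates, so predication is cheaper.
print(expected_branch_cost(0.5, 2, 1, OVERHEAD), predicate_cost(2, 1))

# Expensive if-side that is rarely taken: the branch is cheaper.
print(expected_branch_cost(0.1, 100, 1, OVERHEAD), predicate_cost(100, 1))
```

Under these assumptions the branch only pays off when the work it skips is clearly larger than its fixed overhead.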

Also, there is sometimes a faster alternative to dynamic branching in the pixel shader: stencil culling. It is faster, but it needs to be "massaged" on GeForce cards (otherwise it won't skip any work).

LeGreg
In my experience, using if/else statements to skip texture lookups has increased my performance. But I am also fillrate limited, so that would make sense. IMO it depends on where your bottleneck is. If you are CPU bound, nothing you change in the shaders will speed things up, if I am correct?
Quote:Original post by Enrico
These numbers are wrong :-P ;-)
Correct numbers (hopefully I am not wrong now :-D ):
ATI X1800XT: 16
ATI X1900XT: 48
Nvidia: 256


The ATI numbers are correct.

AFAIK there have been no official numbers from NVIDIA about the batch sizes of GeForce cards (please correct me if I am wrong). When the G70 (7800) was released, a few sites ran tests for the batch sizes and branching performance.

linky

The above graph puts the batch size of the G70 around 1024 pixels (the point where it gets faster than without branching) and the NV40 at 4096 pixels. The above test only checks at what point (batch size) you start gaining performance with dynamic branching, so the actual batch sizes may be smaller...
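The idea behind such a test can be sketched with a small Python model: vary the size of the coherent blocks in the branch condition and watch for the point where branching starts to pay off. This is not the linked benchmark. It is a one-dimensional toy with made-up numbers (a batch size of 64, costs of 100 and 1) and hypothetical function names, and its stripes are perfectly aligned with the batches, which real screen-space tiles usually are not.

```python
def lockstep_cost(takes_if, batch_size, cost_if, cost_else):
    """A batch pays for a side of the branch if any of its pixels takes it."""
    total = 0
    for start in range(0, len(takes_if), batch_size):
        batch = takes_if[start:start + batch_size]
        if any(batch):
            total += cost_if
        if not all(batch):
            total += cost_else
    return total


def striped_mask(num_pixels, run_length):
    """Alternating runs of run_length pixels: else-side first, then if-side."""
    return [(i // run_length) % 2 == 1 for i in range(num_pixels)]


def sweep(num_pixels, batch_size, run_lengths, cost_if=100, cost_else=1):
    """Map each run length to its cost relative to always running both sides."""
    num_batches = -(-num_pixels // batch_size)  # ceiling division
    both_sides = num_batches * (cost_if + cost_else)
    return {
        run: lockstep_cost(striped_mask(num_pixels, run),
                           batch_size, cost_if, cost_else) / both_sides
        for run in run_lengths
    }


# Hypothetical hardware: 1024 pixels processed in batches of 64.
for run, ratio in sweep(1024, 64, [16, 32, 64, 128, 256]).items():
    print(run, ratio)
```

In this model the relative cost stays at 1.0 while the runs are shorter than a batch and drops to 0.5 once a run covers a whole batch, which is the kind of knee such a benchmark looks for.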

If you want to compare the dynamic branching efficiency of the X1000 cards and the 7000 cards then see this page (second image).

Quote:Original post by nts
The above graph puts the batch size of the G70 around 1024 pixels (the point where it gets faster than without branching) and the NV40 at 4096 pixels.

Now that you say it, these numbers are more familiar to me ;-)


Quote:If you want to compare the dynamic branching efficiency of the X1000 cards and the 7000 cards then see this page (second image).

Mhm, they only test Shadow Mapping, which is quite slow and CPU-heavy on ATI-cards :-(
But it shows the direction quite well :-)

[Edited by - Enrico on September 15, 2006 9:29:17 AM]
--
Thanks for all your answers and the informative links!

Things are much clearer to me now. I tested the Humus demo, and with or without dynamic branching it really didn't change anything on my Nvidia card. I'd really like to test my program on an ATI card now. It's a volume renderer, and there are a lot of voxels to which I assign alpha = 0 (for the stuff I don't want to see), while the others get a nice lighting calculation. So skipping pixels/voxels on a per-fragment basis should speed that up a lot with real dynamic branching ...

This topic is closed to new replies.