Early return in a GPU shader does not speed things up?

Started by
1 comment, last by menohack 10 years, 11 months ago

I am working on a particle system in the geometry shader, and I thought that doing a check at the beginning and returning early would speed up the whole project.

I found that it made no speed improvement.

I even tried a modulo test so that the shader returns early for every particle except every 1000th one, and it still made no difference.

What can I do so that an early return actually speeds up the process?


This is fairly standard: all executions of a shader run for the same time as the longest invocation of that shader in a processing group. The exact size and properties of the groups vary based on hardware and shader type. Essentially early-out doesn't help at all unless many -- ideally most -- of the elements using that shader hit the early out. Even then the difference may be trivial or undetectable due to other factors. I'm sure someone will be happy to chime in on the hardware concepts that underlie this behavior, and I'll see if I can find a useful paper or two.

All of this means that the way to optimize shaders is to shorten the longest shader execution, because everyone else is bottlenecked on the slow guy.
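To make the "bottlenecked on the slow guy" rule concrete, here is a toy cost model in Python. It is only a sketch: the group size of 32 and the cost numbers are assumptions, and `lockstep_cost` is a made-up helper, not anything a real GPU or driver exposes. It charges each group the cost of its slowest invocation.

```python
def lockstep_cost(costs, group_size):
    """Toy model: each group of invocations takes as long as its slowest member."""
    total = 0
    for start in range(0, len(costs), group_size):
        total += max(costs[start:start + group_size])
    return total

FULL, EARLY = 10, 1   # assumed costs: full shader body vs. early return
GROUP = 32            # assumed group size; the real value depends on the hardware

no_early_out = [FULL] * 64
some_early_out = [FULL if i % 4 == 0 else EARLY for i in range(64)]
most_early_out = [FULL] + [EARLY] * 63

cost_none = lockstep_cost(no_early_out, GROUP)    # 20: both groups run the full body
cost_some = lockstep_cost(some_early_out, GROUP)  # 20: each group still has a slow member
cost_most = lockstep_cost(most_early_out, GROUP)  # 11: one whole group gets out early
```

In this model, letting three out of four invocations return early saves nothing, because every group still contains at least one full-length invocation. The saving only shows up once an entire group hits the early-out.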


Threads in a group run in SIMD lock-step, meaning that every thread in the group executes the exact same instruction at the same time. Obviously we have if statements where half the threads will go one way and the other half will go another. What the GPU does is execute BOTH branches, with the threads that failed the condition effectively executing nops (no operations). If half the threads are false and go to the else, they execute nops while the other half runs the code in the if block; then they switch roles, and the first half performs the else operations while the other half executes nops.
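The if/else behavior described above can be mimicked with a small Python simulation. This is a simplified sketch, not how any particular GPU is implemented: `run_branch_lockstep` is a hypothetical helper, the step counts are made up, and it assumes a branch side is skipped entirely when no thread in the group takes it.

```python
def run_branch_lockstep(conditions, if_steps, else_steps):
    """Simulate one lock-step group hitting an if/else.

    Returns (issued, useful): the steps the whole group spends on the
    branch, and the steps of real work done by each lane.
    """
    issued = 0
    useful = [0] * len(conditions)
    # "if" side: issued only when at least one lane takes it; other lanes idle.
    if any(conditions):
        issued += if_steps
        for lane, taken in enumerate(conditions):
            if taken:
                useful[lane] += if_steps
    # "else" side: issued only when at least one lane failed the condition.
    if not all(conditions):
        issued += else_steps
        for lane, taken in enumerate(conditions):
            if not taken:
                useful[lane] += else_steps
    return issued, useful

# Half the lanes take the if (5 steps), half take the else (3 steps).
mixed = run_branch_lockstep([True, True, False, False], 5, 3)
# Every lane agrees, so only one side is issued.
uniform = run_branch_lockstep([True, True, True, True], 5, 3)
# Early return modeled as a 0-step "if"; one lane does 100 steps of work.
early = run_branch_lockstep([True] * 31 + [False], 0, 100)
```

With mixed conditions the group pays for both sides (8 steps) even though no single lane does more than 5 steps of useful work. In the early-return case the group still spends 100 steps, although 31 of its 32 lanes had nothing to do.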

Now imagine you have 999 threads that exit early and 1 that does not. Every thread that shares a lock-step group with that one thread has to wait for the longest path!

NVIDIA GPUs break the problem into thread groups (warps of 32 threads) that execute together. If you can get the early terminators into the same thread group, the computation will actually speed up. If you can't, there is no performance increase.
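A rough way to see why grouping matters is to reorder the work so that the early terminators sit next to each other. The Python sketch below reuses the "a group costs as much as its slowest thread" assumption; `total_cost`, the costs, and the idea that you control the ordering (for example by sorting the particle buffer) are all illustrative assumptions, not something the geometry shader stage guarantees.

```python
GROUP_SIZE = 32  # warp size on NVIDIA hardware; other GPUs may differ

def total_cost(costs, group_size=GROUP_SIZE):
    """Toy model: sum over groups of the slowest thread's cost in each group."""
    return sum(max(costs[i:i + group_size])
               for i in range(0, len(costs), group_size))

# 128 threads; every 4th does the full work (cost 10), the rest exit early (cost 1).
interleaved = [10 if i % 4 == 0 else 1 for i in range(128)]

# Same work, reordered so all the slow threads are adjacent.
compacted = sorted(interleaved, reverse=True)

interleaved_cost = total_cost(interleaved)  # 40: every group contains a slow thread
compacted_cost = total_cost(compacted)      # 13: one slow group, three fast ones
```

The amount of work is identical in both cases. Only the layout changes, which is the whole point: the early-out pays off when entire groups can take it.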

