Threads run in SIMD lock-step, meaning that every thread in a group needs to execute the exact same instruction at the same time. Obviously we have if statements where half the threads will go one way and the other half will go another. What the GPU does is execute BOTH branches and mask off the threads that failed the condition, so for those threads each instruction is effectively a nop (no operation). If half the threads are false and go to the else, they sit idle while the other half runs the code in the if block; then they switch roles, and the first half performs the else operations while the other half idles.
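To make the masking concrete, here is a rough sketch in plain Python (not GPU code) of one lock-step group working through an if/else. The function name, the even/odd condition and the two toy operations are all made up for illustration; the only point is that the group pays for both sides of the branch whenever its threads disagree.

```python
def run_group_lockstep(values):
    """Toy model of one lock-step group executing an if/else.

    Hypothetical illustration only: the condition (even/odd) and the two
    branch bodies are invented. Returns (results, branch_passes).
    """
    takes_if = [v % 2 == 0 for v in values]  # branch condition, one per lane
    results = list(values)
    passes = 0

    # Pass 1: the 'if' side. Lanes whose condition is false are masked off.
    if any(takes_if):
        for lane, active in enumerate(takes_if):
            if active:
                results[lane] = values[lane] * 2
        passes += 1

    # Pass 2: the 'else' side. The roles swap.
    if not all(takes_if):
        for lane, active in enumerate(takes_if):
            if not active:
                results[lane] = values[lane] + 1
        passes += 1

    return results, passes


print(run_group_lockstep([0, 1, 2, 3]))  # mixed group: pays for both passes
print(run_group_lockstep([0, 2, 4, 6]))  # uniform group: pays for one pass
```

Note that when every lane agrees on the condition, the other side of the branch is skipped entirely, which is why it matters how the threads are grouped.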
Now imagine you have 999 threads that exit early and 1 that does not. If they all ran in lock-step together, all 1000 threads would have to wait for the longest path!
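NVIDIA GPUs break the problem into groups of 32 threads, called warps, which execute together. Lock-step only applies within a warp. If you can get the early terminators into the same warps, those warps finish early and the computation will actually speed up. If the early terminators are scattered so that every warp still contains a thread on the long path, there is no performance increase.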
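A back-of-the-envelope model of that last point, again in plain Python rather than GPU code. It assumes groups of 32 threads and assumes each group costs as much as its slowest thread; the names `total_group_steps`, `LONG` and `SHORT` and the workload numbers are invented for the sketch. It ignores scheduling and memory effects, so treat the numbers as relative only.

```python
GROUP_SIZE = 32  # assumption: threads per lock-step group

def total_group_steps(work_per_thread, group_size=GROUP_SIZE):
    """Sum, over consecutive groups, of the longest path in each group.

    Toy cost model: a group cannot finish before its slowest thread does.
    """
    total = 0
    for start in range(0, len(work_per_thread), group_size):
        group = work_per_thread[start:start + group_size]
        total += max(group)
    return total


LONG, SHORT = 100, 1  # hypothetical step counts for the two paths

# Scattered: exactly one long-running thread lands in every group.
scattered = [LONG if i % GROUP_SIZE == 0 else SHORT for i in range(1024)]

# Grouped: same threads, reordered so the long-running ones sit together.
grouped = sorted(scattered)

print(total_group_steps(scattered))  # 3200: every group waits on a slow thread
print(total_group_steps(grouped))    # 131: only one group runs the long path
```

Both lists contain the same 32 slow threads and 992 fast ones; only the arrangement differs. In practice that usually means sorting or compacting your work items before launch so that similar ones end up next to each other.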
Edited by menohack, 20 May 2013 - 05:42 PM.