Thanks for taking time to wrestle through my code pieces Joe!
>> you forget to do a memory barrier on shared memory as well.
All right. So adding "memoryBarrierShared()" in addition to "barrier()" would do the job (to ensure the index array is done filling before the second half starts)?
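To make sure I understand the pattern, something like this is what I have in mind. A rough sketch only; the names (tileLightIndices, MAX_LIGHTS_PER_TILE, workgroup size) are placeholders, not my actual code:

```glsl
#version 430
layout(local_size_x = 16, local_size_y = 16) in;

#define MAX_LIGHTS_PER_TILE 64
shared uint tileLightIndices[MAX_LIGHTS_PER_TILE];
shared uint tileLightCount;

void main()
{
    // First half: each invocation tests lights and appends the visible
    // ones to the shared index array (details omitted):
    //   uint slot = atomicAdd(tileLightCount, 1u);
    //   tileLightIndices[slot] = lightIndex;

    memoryBarrierShared(); // make the shared-memory writes visible to the group
    barrier();             // wait until every invocation has reached this point

    // Second half: all invocations can now safely read tileLightIndices.
}
```

So memoryBarrierShared() handles visibility of the shared writes, and barrier() handles the execution sync, right?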
Btw, besides crashes, can bad or missing barrier usage like that cause such a huge slowdown? Like I said, on my computer everything seems fine, and another machine works as expected as well, just very slowly.
>> because OpenCL was two times faster on Nvidia and slightly faster on AMD 1-2 years ago
Now that concerns me. Especially because I used OpenCL before, removed it completely from the engine, and swapped it for GLSL compute shaders (easier integration, more consistency)... Doh!
Is it safe to assume that modern/future cards will overcome these performance issues? Otherwise I can turn my Deferred Rendering approach back to an "old" additive style. Does anyone have experience with whether Tiled Deferred Rendering is that much of a win? And then I'm talking about indoor scenes which have relatively many lights, but certainly not hundreds or thousands.
The crappy part is that I'm adapting code to support older cards now, even though I'm far away from a release, so maybe I shouldn't put too much energy into that and bet on future hardware instead.
I suppose that (unrolling) can't happen if the size isn't hardcoded? counts.x comes from a CPU-side variable.
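For reference, this is roughly how the loop looks on my end (simplified; counts and the light-evaluation step are stand-ins for the real code):

```glsl
uniform ivec4 counts; // counts.x = light count, uploaded from the CPU

vec3 shadeTile(vec3 accum)
{
    // The bound is only known at run time, so (I assume) the compiler
    // can't fully unroll this loop at compile time:
    for (int i = 0; i < counts.x; ++i)
    {
        // accum += evaluateLight(i); // per-light accumulation, omitted
    }
    return accum;
}
```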
Well, let's try the shared-memory barrier, a different workgroup size, and avoiding unrolling, and see if these video cards start smiling... But I'm afraid not, hehe.