I've started optimizing graphics algorithms by porting them to compute shaders and improving them by sharing memory between threads and synchronizing them. This way I could improve the runtime of my Bloom from O(n) to O(log n) per pixel.
But that was only the theoretical runtime. In reality, the algorithm performed much worse than the original linear algorithm. I'm pretty sure I know the reason. Instead of, let's say, 32 read operations and 1 write operation per pixel, the algorithm now needs 1 read operation from VRAM, 5 read operations from groupshared memory, 5 write operations to groupshared memory and 1 write operation to VRAM.
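To make those counts concrete, here is a rough tally in Python (not shader code). It assumes a kernel width of 32 samples and that each doubling pass costs one groupshared read and one groupshared write per thread. The function names are made up for illustration, and the cost model is a simplification rather than a description of the actual shader.

```python
def linear_ops(kernel_width):
    """Per-pixel memory operations of the original linear pass."""
    return {"vram_reads": kernel_width, "vram_writes": 1,
            "shared_reads": 0, "shared_writes": 0}


def log_ops(kernel_width):
    """Per-pixel operations of the groupshared variant (assumed cost model)."""
    # Number of doubling passes: ceil(log2(kernel_width)), computed with integers.
    passes = (kernel_width - 1).bit_length()
    return {"vram_reads": 1, "vram_writes": 1,
            "shared_reads": passes, "shared_writes": passes}


print(sum(linear_ops(32).values()))  # 33 operations, all of them to VRAM
print(sum(log_ops(32).values()))     # 12 operations, only 2 of them to VRAM
```

On paper that is 12 operations instead of 33, and only 2 of them touch VRAM, which is why the measured slowdown is so surprising.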
Overall, groupshared memory, being L1 cache, should be way faster than 32 read operations from VRAM, and there are also far fewer operations because the algorithm has logarithmic runtime. Yet it's way slower (8 ms instead of 0.5 ms). The slowdown could be caused by memory bank conflicts. But could they really cause such an enormous slowdown?
To me it looks like my graphics card might not have an actual on-chip L1 cache serving as groupshared memory at all. It performs just as badly as a UAV residing in VRAM would. So maybe they simply wrote a driver that uses 32 KB of reserved VRAM as groupshared memory. Could that be the case, or is it the bank conflicts?
I wish there were tools that could shed more light on such problems. Graphics cards and their tools should be more transparent about what's actually going on, so that developers could improve their algorithms even further.
Update: After reading through NVIDIA's CUDA documentation, it turns out my shaders don't cause any bank conflicts at all. Each half warp (16 threads) always accesses 16 different memory banks. Only a whole block (1024 threads) accesses the same banks multiple times, which is normal and has nothing to do with bank conflicts.
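The check I did by hand can be sketched in Python like this. It assumes 16 banks, with successive 32-bit words mapped to successive banks, and one half warp of 16 threads issuing a request together. The bank count of a given GPU may differ, `conflict_degree` is just a name I picked, and broadcast of a shared word is not modelled beyond counting distinct words.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=16):
    """Largest number of distinct 32-bit words that land in the same bank.

    1 means conflict-free; n means that bank is accessed n times in a row.
    """
    banks = defaultdict(set)
    for word in word_indices:
        banks[word % num_banks].add(word)
    return max((len(words) for words in banks.values()), default=0)


half_warp = range(16)
print(conflict_degree([t for t in half_warp]))       # stride 1  -> 1 (no conflict)
print(conflict_degree([2 * t for t in half_warp]))   # stride 2  -> 2 (2-way conflict)
print(conflict_degree([16 * t for t in half_warp]))  # stride 16 -> 16 (fully serialized)
```

My access pattern corresponds to the first case, so under this model even the worst case (16-way) would not apply here.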
Edited by CryZe, 10 September 2012 - 08:19 AM.