I have a few questions regarding ThreadGroupSize and performance.
1. No matter how many threads are in one thread group, it will always be executed by a single SIMD / SMX (split into wavefronts / warps)? So let's say I only need 1024 threads to process something and I launch them all in one group: have I wasted performance, since I could have split the work into smaller groups and had multiple SIMDs / SMXs working on it?
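To make question 1 concrete, here is a sketch in CUDA terms of the two dispatch options I mean (the kernel and its contents are just made up for illustration):

```cuda
// Hypothetical kernel that processes 1024 elements.
__global__ void process(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1024)
        data[i] *= 2.0f;
}

// Option A: one group of 1024 threads.
// The whole group is resident on a single SM / SIMD.
process<<<1, 1024>>>(data);

// Option B: the same 1024 threads split into 8 groups of 128.
// Now the scheduler is free to spread the groups over several SMs / SIMDs.
process<<<8, 128>>>(data);
```

So the question is whether option A really leaves all but one SM / SIMD without work from this dispatch.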
2. In case the above assumption is correct: if I dispatch only one thread group, are the other SIMDs / SMXs blocked? Or do they work on other stuff, like pixel processing, vector operations etc.? In other words: do all of them have to work on the same task, or will they mix different kinds of work to stay occupied?
3. Someone wrote this:
Maintaining performance and correctness across devices becomes harder:
- Code hardwired to 32 threads per warp, when run on AMD hardware with 64-thread wavefronts, will waste execution resources
- Code hardwired to 64 threads per warp, when run on Nvidia hardware, can lead to races and affects the local memory budget
The first statement makes perfect sense. But the second... well, I don't get it. I assume "local memory" here means the main memory on the graphics card, not the thread-group shared memory? And could anyone explain what exactly happens to make these races occur?
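If I try to turn question 3 into code, is this the kind of race that is meant? A CUDA-style sketch (the reduction kernel is made up by me, not from the quoted text):

```cuda
// Hypothetical reduction that is hardwired to 64 threads per "warp":
// it assumes all 64 threads advance in lock-step, which is true for
// one 64-wide AMD wavefront but NOT for Nvidia's 32-wide warps.
__global__ void reduce64(const float* in, float* out)
{
    __shared__ float partial[64];
    int t = threadIdx.x;

    partial[t] = in[t];
    // Deliberately no __syncthreads() here -- the code relies on
    // implicit lock-step execution of all 64 threads.

    if (t < 32)
        partial[t] += partial[t + 32]; // reads values written by threads 32..63

    // On Nvidia hardware the group is TWO 32-wide warps. The warp with
    // threads 0..31 may reach the line above before the second warp has
    // finished writing partial[32..63] -> read of stale data, i.e. a race.

    if (t == 0)
    {
        for (int i = 1; i < 32; ++i)
            partial[0] += partial[i];
        *out = partial[0];
    }
}
```

So is the problem simply that the implicit synchronization inside one 64-wide wavefront disappears once the same group is executed as two separate 32-wide warps?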