So I've read about modern GPU architecture and about the compute-shader threading model, but what I don't fully understand yet is how the two relate to each other.
From what I understand, the GPU has several SIMD lanes (or pipes), each featuring several SIMD processing units (shader cores / threads / ...), which also contain some fetch/decode logic, ALUs, and execution contexts.
I've seen a simple compute-kernel example where they multiply each value of a 2000-element array by 2 and store the results back.
They set up the threading model by calling Dispatch(20, 1, 1) and declaring [numthreads(100, 1, 1)], which gives 20 thread groups of 100 threads each, every thread executing the compute kernel once.
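To check my understanding, here's a little Python sketch of how I picture the index math. The names are mine, not from the actual sample, and I'm assuming HLSL-style SV_DispatchThreadID semantics:

```python
# Mental model of Dispatch(20, 1, 1) with [numthreads(100, 1, 1)]:
# 20 thread groups x 100 threads per group = 2000 invocations total.
GROUPS = 20
THREADS_PER_GROUP = 100

data = [1.0] * 2000  # the 2000-element array from the example

for group_id in range(GROUPS):
    for group_thread_id in range(THREADS_PER_GROUP):
        # SV_DispatchThreadID.x = groupID * numthreads + groupThreadID,
        # so every element gets touched exactly once.
        dispatch_thread_id = group_id * THREADS_PER_GROUP + group_thread_id
        data[dispatch_thread_id] *= 2.0
```

Is that roughly the right picture of how the 20 x 100 invocations get mapped onto the array?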
Now what I don't quite understand is the relationship between the number of thread groups (wavefronts/warps?) and threads (SIMD units?).
If this is a 1:1 mapping, does that mean I can manually tell the GPU how many threads a wavefront should have? That sounds unlikely...
If that's not the case, then by what criteria do I choose these numbers for whatever I'm trying to accomplish with the compute shader?
Using the example above, why not use 200 thread groups of 10 threads each? Or something more arbitrary, like 5 thread groups of 400 threads each? How do I know the most efficient split between thread groups and threads for a given task, and how does that map to the actual hardware?
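For what it's worth, here's the sizing heuristic I've gathered so far, as a Python sketch. The 32/64 figures are the warp/wavefront widths I've seen quoted for NVIDIA and AMD (GCN) respectively; I haven't verified them for every GPU, so treat them as assumptions:

```python
import math

def groups_needed(num_elements, group_size):
    # Host side: dispatch enough groups to cover all elements,
    # rounding up when the array size isn't a multiple of the group size.
    return math.ceil(num_elements / group_size)

# All of these configurations cover the same 2000 elements...
print(groups_needed(2000, 100))  # -> 20  (the sample's choice)
print(groups_needed(2000, 10))   # -> 200
print(groups_needed(2000, 400))  # -> 5

# ...but as I understand it, the hardware runs each group in fixed-size
# bundles (warps/wavefronts), so a group of 10 threads would still occupy
# a whole 32- or 64-wide bundle with most lanes idle.
WARP = 32       # NVIDIA (assumed)
WAVEFRONT = 64  # AMD GCN (assumed)
group_size = 64  # a multiple of both bundle widths
print(group_size % WARP == 0 and group_size % WAVEFRONT == 0)  # -> True
```

Is that the right way to think about it, i.e. pick a group size that's a multiple of the wavefront width and then dispatch ceil(N / group_size) groups?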