While I have had a quick look at GPU releases over the last few years, since I've focused on Development (as opposed to Research) I haven't had the time to deeply inspect GPU performance patterns.
On this forum I see a lot of people dealing with graphics who are still optimizing heavily in terms of data packing and the like. It seems very little has changed so far, though on this forum we admittedly care about a widespread installed base.
In an attempt to bring my knowledge up to date, I've joined a friend of mine in learning CL, and we're studying various publicly available kernels.
In particular, there's one with the following "shape":
[attachment=19307:MemoryUsage.png]
The kernel is extremely linear up to a certain point, where it starts using a lot of temporary registers. After a while, that massive set of values is only read back and then becomes irrelevant.
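To make that shape concrete, here is a minimal OpenCL C sketch of what I mean; the names, loop counts, and temporary-array size are all invented for illustration, not taken from the actual kernel:

```c
// Hypothetical sketch of the kernel "shape" described above.
__kernel void three_phase(__global const float *in,
                          __global float *out,
                          const uint n)
{
    const uint gid = get_global_id(0);
    if (gid >= n) return;

    /* Phase 1: extremely linear, low register pressure. */
    float acc = 0.0f;
    for (uint i = 0; i < 64u; ++i)
        acc += in[gid] * (float)i;

    /* Phase 2: suddenly needs a lot of live temporaries, driving up
       register pressure. The compiler will typically unroll this and
       keep tmp in registers, or spill it to scratch memory. */
    float tmp[32];
    for (uint i = 0; i < 32u; ++i)
        tmp[i] = acc * (float)(i + 1);

    /* Phase 3: the temporaries are only read back (low arithmetic
       intensity), then become irrelevant. */
    float result = 0.0f;
    for (uint i = 0; i < 32u; ++i)
        result += tmp[i];

    out[gid] = result;
}
```

The interesting part is the middle phase: those 32 live values per work-item are what set the register budget for the whole kernel.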
What I expect to happen is that the various kernel instances will:
- be instantiated in numbers that fit the execution clusters, according to the amount of register memory consumed (see the back-of-the-envelope sketch after this list);
- happily churn along on the other ALUs in the same cluster until memory starts being used, at which point they will starve one after the other due to the low arithmetic intensity;
- therefore be swapped massively by the scheduler every time they stall on bandwidth;
- possibly stop being swapped once a "thread" nears the final, compact phase.
It is unclear to me whether the compiler/scheduler is currently smart enough to figure out that the kernel is in fact made of three phases with different performance behavior.
However, back when CL was not even in the works and GPGPU was the way to do this, the goal was to make "threads" somehow regular. The whole point was that scheduling was very weak and the ALUs essentially had to work conceptually "in lockstep". This spurred discussions about the "pixel batch size" back in the day.
Now I am wondering whether simplifying the scheduling could improve performance on modern architectures such as GCN (or Kepler).
The real bet I'm making with my friend is that the slowdown introduced by the increased communication (which is highly coherent) will be smaller than the benefit given by the improved execution flow; a sketch of what I mean follows below.
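Concretely, the idea would be something like splitting the single kernel into per-phase kernels that hand their data over through a global scratch buffer, so each kernel on its own has a flat, predictable register budget. Again a hypothetical sketch, reusing the invented shape from above:

```c
// Hypothetical split of the single kernel into per-phase kernels; the
// phases now communicate through a global scratch buffer (32 * n floats)
// instead of registers. All names and sizes are invented.
__kernel void phase1_linear(__global const float *in,
                            __global float *scratch,
                            const uint n)
{
    const uint gid = get_global_id(0);
    if (gid >= n) return;

    float acc = 0.0f;
    for (uint i = 0; i < 64u; ++i)
        acc += in[gid] * (float)i;

    /* What used to be 32 temporaries in registers now goes to global
       memory, laid out so consecutive work-items write consecutive
       addresses (coalesced, i.e. the "highly coherent" traffic). */
    for (uint i = 0; i < 32u; ++i)
        scratch[i * n + gid] = acc * (float)(i + 1);
}

__kernel void phase2_compact(__global const float *scratch,
                             __global float *out,
                             const uint n)
{
    const uint gid = get_global_id(0);
    if (gid >= n) return;

    float result = 0.0f;
    for (uint i = 0; i < 32u; ++i)
        result += scratch[i * n + gid];

    out[gid] = result;
}
```

Phase 1 alone could then run at full occupancy, while phase 2 becomes a pure streaming read; the open question is whether the extra round trip through memory eats the gain.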
Unfortunately, we don't have easy access to either GCN or Kepler systems, so all this is pure speculation. Do you think it still makes sense to think in those terms?
Edit: punctuation.