How efficient are current GPU schedulers?


While I've had a quick look at GPU releases over the last few years, since I've focused on development (as opposed to research) I haven't had the time to deeply inspect GPU performance patterns.

On this forum I see a lot of people dealing with graphics are still optimizing heavily in terms of data packing and such. It seems very little has changed so far, yet on this forum we care about a widespread installed base.

In an attempt to bring my knowledge up to date, I've joined a friend of mine in learning CL, and we're studying various publicly available kernels.

In particular, there's one having the following "shape":

[attachment=19307:MemoryUsage.png]

The kernel is extremely linear up to a certain point, where it starts using a lot of temporary registers. After a while, that massive set of values is only read and then becomes irrelevant.
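Rather than pasting the whole kernel, here is a hypothetical OpenCL sketch of that shape (every name and size below is invented purely for illustration, it is not the real code):

    // Hypothetical sketch of the three-phase "shape" described above;
    // names and sizes are made up, this is not the real kernel.
    __kernel void three_phase(__global const float *in, __global float *out)
    {
        const int gid = get_global_id(0);

        // Phase 1: long, linear stretch with only a couple of live values.
        float acc = 0.0f;
        for (int i = 0; i < 256; ++i)
            acc += in[gid * 256 + i];

        // Phase 2: a large number of temporaries suddenly becomes live,
        // driving up register pressure (or spilling to private memory).
        float tmp[64];
        for (int i = 0; i < 64; ++i)
            tmp[i] = acc * in[gid * 256 + i];

        // Phase 3: those temporaries are only read back once and then die.
        float result = 0.0f;
        for (int i = 0; i < 64; ++i)
            result += tmp[i];
        out[gid] = result;
    }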

What I expect to happen is that the various kernel instances will

  1. be instanced in a number that fits the execution clusters, according to the amount of memory consumed (see the query sketch right after this list).
    What happens to the other ALUs in the same cluster?
  2. Happily churn along until memory starts to be used. At this point, they will starve one after the other due to the low arithmetic intensity.
  3. The scheduler will therefore swap the "threads" massively every time they stall on bandwidth.
  4. When a "thread" is near its final, compact phase, the swapping will possibly end.
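Regarding point 1, I suppose we can at least ask the runtime what the compiler decided for a built kernel. A minimal host-side query sketch (assuming the kernel and device objects already exist):

    /* Host-side sketch: query what the compiler decided for a built kernel.
     * Assumes "kernel" and "device" have already been created. */
    #include <CL/cl.h>
    #include <stdio.h>

    void print_kernel_resources(cl_kernel kernel, cl_device_id device)
    {
        size_t   wg_size = 0, preferred = 0;
        cl_ulong private_mem = 0, local_mem = 0;

        /* largest work-group this kernel can be launched with */
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(wg_size), &wg_size, NULL);
        /* warp/wavefront-friendly multiple (OpenCL 1.1+) */
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred), &preferred, NULL);
        /* private (register/spill) bytes per work-item */
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                                 sizeof(private_mem), &private_mem, NULL);
        /* local/shared bytes per work-group */
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                 sizeof(local_mem), &local_mem, NULL);

        printf("max WG %zu, preferred multiple %zu, private %llu B, local %llu B\n",
               wg_size, preferred,
               (unsigned long long)private_mem, (unsigned long long)local_mem);
    }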

It is unclear to me whether the compiler/scheduler is currently smart enough to figure out that the kernel is in fact made of three phases with different performance behavior.

However, back when CL was not even in the works and GPGPU was the way to do this, the goal was to make "threads" somehow regular. The whole point was that scheduling was very weak and the ALUs were conceptually supposed to work "in lockstep". This spurred discussions about the "pixel batch size" back in the day.

Now I am wondering whether simplifying the scheduling could improve performance on modern architectures such as GCN (or Kepler).

The real bet I'm making with him is that the slowdown introduced by the increased communication (which is highly coherent) will be smaller than the benefit given by the improved execution flow.

Unfortunately, we don't have easy access to either GCN or Kepler systems, so all this is pure speculation. Do you think it still makes sense to think in those terms?

Edit: punctuation.

Previously "Krohm"


1. The instancing is specified by you: in OpenCL, as in CUDA, the programmer specifies how many work-items are grouped and scheduled together (see the launch sketch after point 4). Scheduling depends not only on the code, but largely on the access pattern of the kernels. That cannot be determined without proper benchmarking, and that's why YOU, the programmer, have to find the best solution.

2. If you are referring to registers, then no: the distribution of your kernels is kept constant, based on the way you've launched them. If you are talking about main memory, then it depends on how much data you access, how coherently, in what way... (optimizing the access pattern is actually a big part of optimizing compute kernels). You can access a lot of data and still be compute bound, and you can access little data with a very random access pattern on a big data set and become memory bound.

3. The scheduler swaps threads on every memory access. The feedback loop from the memory controller has too high a latency to keep the same thread running while waiting for the reply; it's cheaper to run another thread, as your memory access needs some address translation etc. anyway.

4. The 'swapping' will continue as long as you're accessing data. Swapping is essentially 'free', so it's simpler to just swap the threads than to evaluate whether an existing thread can keep running.
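To make point 1 concrete, here is a minimal host-side launch sketch (the queue, kernel and buffer arguments are assumed to be set up elsewhere); the local size is the knob you get to turn:

    /* Launch sketch for point 1: the programmer picks both the total work size
     * and the work-group size. "queue" and "kernel" are assumed to exist. */
    #include <CL/cl.h>

    cl_int launch_1d(cl_command_queue queue, cl_kernel kernel, size_t global_size)
    {
        /* work-items per group; usually a multiple of the warp/wavefront size.
         * passing NULL instead lets the runtime choose. */
        size_t local_size = 64;

        return clEnqueueNDRangeKernel(queue, kernel,
                                      1,            /* work dimensions */
                                      NULL,         /* global offset   */
                                      &global_size,
                                      &local_size,
                                      0, NULL, NULL);
    }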

You'd need to elaborate on your bet, I'm not sure what you mean ;)


If you are referring to registers, then no: the distribution of your kernels is kept constant, based on the way you've launched them. If you are talking about main memory, then it depends on how much data you access, how coherently, in what way... (optimizing the access pattern is actually a big part of optimizing compute kernels). You can access a lot of data and still be compute bound, and you can access little data with a very random access pattern on a big data set and become memory bound.

I cannot quite understand what kind of memory would be used, as no current architecture has enough per-cluster memory to hold all those values, not even for a single work item.

I'm afraid I cannot elaborate more on what I'm betting with him, because I still have to look at the host code instancing the kernels, but my idea was essentially to move the swap granularity to a coarser level. Maybe I should just play with the multidimensional work size settings for a start. We have indeed noticed that a lot of people seem to just play with those settings, and I've seen that happening in some presentations as well, dealing for example with tiled lighting.
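For a first experiment, what I have in mind is something as simple as a 2D launch with one work-group per tile; this is just a hypothetical sketch (the 16x16 tile size is a guess, and the queue/kernel handles are assumed to exist):

    /* Hypothetical 2D launch with one 16x16 work-group per screen tile, the
     * kind of "multidimensional work size" experiment mentioned above. */
    #include <CL/cl.h>

    cl_int launch_tiled(cl_command_queue queue, cl_kernel kernel,
                        size_t width, size_t height)
    {
        size_t local[2]  = { 16, 16 };   /* one work-group per tile */
        size_t global[2] = {
            (width  + local[0] - 1) / local[0] * local[0],  /* round up to  */
            (height + local[1] - 1) / local[1] * local[1]   /* whole tiles  */
        };
        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global, local, 0, NULL, NULL);
    }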

I still think I'll want to play a bit more with that, however, as the pattern is incredibly stable in the "growing" phase.

edit: more thinking aloud.

Previously "Krohm"

You don't have to hold all the values, as you cannot work on all of them at the same time. You just need to hold enough data to be able to load what's needed next while other threads occupy the ALUs.

Sometimes, on initialization, all threads load values from main memory, store those values into shared/local memory (e.g. 48 KB), and then work on that data, which usually has very low latency; once the work is done, all threads simultaneously write out the output. So while you're doing extra work (the load-copy on initialization and the store-copy on finalization), the kernel actually runs faster while doing the heavy math at peak rates.
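As a sketch of that pattern (the math is a placeholder, only the load/barrier/compute/store structure matters; the work-group size of 256 is just an example):

    // Staging sketch: load into local memory, synchronize, compute, write out.
    #define TILE 256

    __attribute__((reqd_work_group_size(TILE, 1, 1)))
    __kernel void staged(__global const float *in, __global float *out)
    {
        __local float tile[TILE];         // shared/local memory, well under 48 KB

        const int lid = get_local_id(0);
        const int gid = get_global_id(0);

        tile[lid] = in[gid];              // 1. every thread loads one value
        barrier(CLK_LOCAL_MEM_FENCE);     //    wait until the whole tile is in

        float acc = 0.0f;                 // 2. the heavy math reads from local
        for (int i = 0; i < TILE; ++i)    //    memory at low latency
            acc += tile[i] * tile[lid];

        out[gid] = acc;                   // 3. everyone writes its result out
    }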

These 48 KB are limiting, of course. But that's why I wrote that a lot of the optimization work is in how you handle data. Sometimes some 'extra' work is needed to perform the other part of the work in a faster way, and that 'extra' work is 'caching', 're-layouting', 'decompression', 'transformation' of data to a more friendly format.
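As one made-up example of the 're-layouting' kind of extra work, here is a tiny pass that turns an interleaved array-of-structs into separate arrays, so the main kernel can read them coalesced:

    // Hypothetical re-layout pass: AoS (x,y,z interleaved) -> SoA, so that
    // neighbouring work-items read neighbouring addresses in the main kernel.
    __kernel void aos_to_soa(__global const float *aos,   // x0,y0,z0,x1,y1,z1,...
                             __global float *xs,
                             __global float *ys,
                             __global float *zs)
    {
        const int i = get_global_id(0);
        xs[i] = aos[3 * i + 0];
        ys[i] = aos[3 * i + 1];
        zs[i] = aos[3 * i + 2];
    }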

And if you wonder about some optimization you come up with, it's indeed best if you just try and profile it. Even experts cannot really predict the results, and if they are really compute experts, they'll profile it and tell you what's best, as they wouldn't want you to rely on crystal balls, but on engineering tools.

That might be different in some other engineering areas (e.g. writing assembler code for in-order CPUs) where you could statically predict what's better, but with so many dependencies, so many units running in parallel, and sometimes really minor changes that can vastly affect performance (e.g. by sampling just one more value in a loop you could exceed the cache size and thrash the cache constantly), it's best to profile :)
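If you do get hold of a device to test on, OpenCL event profiling is a cheap way to start measuring. A minimal sketch (the queue must have been created with CL_QUEUE_PROFILING_ENABLE; everything else is assumed to exist):

    /* Time one kernel launch with OpenCL events (timestamps are in nanoseconds). */
    #include <CL/cl.h>

    double time_launch_ms(cl_command_queue queue, cl_kernel kernel,
                          size_t global_size, size_t local_size)
    {
        cl_event evt;
        cl_ulong start = 0, end = 0;

        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(evt);

        return (double)(end - start) * 1e-6;   /* milliseconds */
    }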

I am very well aware of the latency-hiding strategies involving block read/write.

There's no heavy math in the memory-heavy section. I don't understand what you're saying.


while you're doing extra work ... the kernel actually runs faster while doing the heavy math at peak rates

I don't understand what kind of extra work you are referring to: changing the layout or decompressing/transforming data appears to be something I'd have to do, not the HW.


if you wonder about some optimization you come up with, it's indeed best if you just try and profile it.

Are you reading my posts? We don't have easy access to either GCN or Kepler devices. I'm sorry to write this, but so far I haven't read anything I didn't already know, and I'm starting to realize I am not able to express my thoughts.

Previously "Krohm"

I am very well aware of the latency-hiding strategies involving block read/write.

There's no heavy math in the memory-heavy section. I don't understand what you're saying.


while you're doing extra work ... the kernel actually runs faster while doing the heavy math at peak rates

I don't understand what kind of extra work you are referring to: changing the layout or decompressing/transforming data appears to be something I'd have to do, not the HW.

and that 'extra' work is 'caching', 're-layouting', 'decompression', 'transformation' of data to a more friendly format.


if you wonder about some optimization you come up with, it's indeed best if you just try and profile it.

Are you reading my posts?

That's why I try my best to give helpful replies to your vague ideas.

We don't have easy access to either GCN or Kepler devices. I'm sorry to write this, but so far I haven't read anything I didn't already know

I've been trying my best to reply to your questions and to explain the reasoning behind my replies. Sorry for not being much of a help; I didn't know you already knew the answers.

and I'm starting to realize I am not able to express my thoughts.

I'm not the only one here, maybe someone else has more helpful replies ;)

