Krohm

How efficient are current GPU schedulers?

5 posts in this topic

While I've had a quick look at GPU releases over the last few years, since I've focused on development (as opposed to research) I haven't had the time to deeply inspect GPU performance patterns.

On this forum I see that a lot of people dealing with graphics are still optimizing heavily in terms of data packing and the like. It seems very little has changed so far, though admittedly on this forum we care about a widespread installed base.

 

In an attempt to bring my knowledge up to date, I've joined a friend of mine in learning OpenCL, and we're studying various publicly available kernels.

In particular, one of them has the following "shape":

[attachment=19307:MemoryUsage.png]

The kernel is extremely linear up to a certain point, where it starts using a lot of temporary registers. After a while, that massive set of values is only read and then becomes irrelevant.

 

What I expect to happen is that the various kernel instances will

  1. be instantiated in whatever number fits the execution clusters, according to the amount of memory consumed.
    (What happens to the other ALUs in the same cluster?)
  2. Happily churn along until the memory-heavy phase begins. At that point they will starve one after the other due to the low arithmetic intensity.
  3. The scheduler will therefore swap the "threads" massively every time they stall on bandwidth.
  4. When a "thread" reaches the final, compact phase, the swapping will possibly end.

It is unclear to me whether the compiler/scheduler is currently smart enough to figure out that the kernel is in fact made of three phases with different performance behavior.
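To make the three phases concrete, a purely hypothetical OpenCL C kernel with roughly this shape might look like the sketch below. It is not the kernel we're studying, just an illustration of the register-pressure profile, and every name in it is made up:

// Hypothetical sketch of the three-phase "shape" described above.
__kernel void three_phase(__global const float *in,
                          __global float *out,
                          const uint n)
{
    const uint gid = get_global_id(0);
    if (gid >= n) return;

    // Phase 1: linear, low register pressure, streaming ALU work.
    float acc = 0.0f;
    for (uint i = 0u; i < 64u; ++i)
        acc += in[(gid + i) % n] * 0.5f;

    // Phase 2: many temporaries live at once (high register pressure,
    // possibly spilling), with low arithmetic intensity per value.
    float tmp[32];
    for (uint i = 0u; i < 32u; ++i)
        tmp[i] = in[(gid * 32u + i) % n] + acc;

    // Phase 3: compact phase, the temporaries are only read and then die.
    float result = 0.0f;
    for (uint i = 0u; i < 32u; ++i)
        result += tmp[i];

    out[gid] = result;
}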

 

However, back when OpenCL was not even in the works and GPGPU was the way to do this, the goal was to make "threads" somehow regular. The whole point was that scheduling was very weak and the ALUs were conceptually supposed to work "in lockstep". This spurred discussions about the "pixel batch size" back in the day.
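"Lockstep" here means SIMD execution within a warp/wavefront: work items in the same hardware batch share one instruction stream, so when they take different branches both paths are executed with the inactive lanes masked off. A tiny, made-up OpenCL C illustration of the kind of irregularity that hurt on those older designs:

// Within one warp/wavefront, lanes taking different paths are serialized:
// both branch bodies cost time, even for the lanes that are masked off.
__kernel void divergent(__global const float *in, __global float *out)
{
    const uint gid = get_global_id(0);
    if (in[gid] > 0.0f)
        out[gid] = sqrt(in[gid]);       // some lanes execute this...
    else
        out[gid] = -in[gid] * in[gid];  // ...the rest execute this, serialized
}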

 

Now I am wondering whether simplifying the scheduling could improve performance on modern architectures such as GCN (or Kepler).

The real bet I'm making with him is that the slowdown introduced by the increased communication (which is highly coherent) will be smaller than the benefit given by the improved execution flow.

 

Unfortunately, we don't have easy access to either GCN or Kepler systems, so all this is pure speculation. Do you think it still makes sense to think in those terms?

 

Edit: punctuation.

Edited by Krohm

> If you are referring to registers, then no. The distribution of your kernels is kept constant, based on the way you've launched them. If you are talking about main memory, then it depends on how much data you access, how coherently, in what way... (optimizing the access pattern is actually a big part of optimizing compute kernels). You can access a lot of data and still be compute bound, and you can access very little data, with a very random access pattern on a big data set, and become memory bound.

I cannot quite understand what kind of memory will be used, as no current architecture has enough per-cluster memory to hold all the values, not even for a single work item.

I'm afraid I cannot elaborate further on what I'm betting with him, because I still have to look at the host code instancing the kernels; but my idea was essentially to move the swap granularity to a coarser level. Maybe I should just play with the multidimensional work-size settings for a start. We have indeed noticed that a lot of people seem to just play with those settings, and I've seen that happening in some presentations as well, dealing for example with tiled lighting.
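For what it's worth, the experiment I have in mind boils down to something like the host-side sketch below. The queue/kernel/device objects and the candidate work-group sizes are placeholders, not taken from the actual host code (which, as I said, I still have to read):

/* Query what the compiled kernel actually consumes, then try a few
 * local work-group sizes and compare (time externally or with events). */
#include <CL/cl.h>
#include <stdio.h>

static void try_local_sizes(cl_command_queue queue, cl_kernel kernel,
                            cl_device_id device, size_t global_size)
{
    size_t   max_wg    = 0;
    cl_ulong local_mem = 0, private_mem = 0;

    /* What the compiler/driver reports for this kernel on this device. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);
    printf("max WG %zu, local mem %llu B, private mem %llu B\n",
           max_wg, (unsigned long long)local_mem,
           (unsigned long long)private_mem);

    /* Different local sizes make the scheduler pack work-groups
     * differently, which is exactly the knob being discussed. */
    const size_t candidates[] = { 32, 64, 128, 256 };
    for (int i = 0; i < (int)(sizeof candidates / sizeof candidates[0]); ++i) {
        size_t local = candidates[i];
        if (local > max_wg) continue;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local, 0, NULL, NULL);
        clFinish(queue);
    }
}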

 

I still think I'll want to play a bit more with that, though, as the pattern is incredibly stable in the "growing" phase.

 

edit: more thinking aloud.

Edited by Krohm

You don't have to hold all the values, as you cannot work on all of them at the same time anyway. You just need to hold enough data to be able to load what's needed next while other threads occupy the ALUs.

Sometimes, on initialization, all threads load values from main memory, store them into shared/local memory (e.g. 48 KB), and then work on that data, which has very low latency; once the work is done, all threads simultaneously write out the output. So while you're doing extra work (the load-copy on initialization and the store-copy on finalization), the kernel actually runs faster while doing the heavy math at peak rates.

This 48 KB is limiting, of course, but that's why I wrote that a lot of optimization work is about how you handle data. Sometimes some 'extra' work is needed to perform the rest of the work in a faster way, and that 'extra' work is caching, re-layouting, decompression or transformation of data into a more friendly format.
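A minimal sketch of that pattern in OpenCL C (a made-up kernel, just to show the load-to-local / barrier / compute / write-out structure; it assumes a local work size of 256):

// Made-up illustration of the staging pattern: copy to __local memory,
// synchronize, do the math on low-latency local data, write out.
// Assumes the kernel is launched with a local work size of 256.
__kernel void staged(__global const float *in, __global float *out)
{
    __local float tile[256];              // per-work-group scratch
    const uint lid = get_local_id(0);
    const uint gid = get_global_id(0);

    tile[lid] = in[gid];                  // load-copy on initialization
    barrier(CLK_LOCAL_MEM_FENCE);         // whole group sees the tile

    float v = 0.0f;                       // the "heavy math" part
    for (uint i = 0u; i < 256u; ++i)
        v += tile[(lid + i) & 255u] * tile[i];

    out[gid] = v;                         // store-copy on finalization
}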

 

And if you wonder about some optimization you've come up with, it's indeed best if you just try it and profile it. Even experts cannot really predict the results, and if they are real compute experts they'll profile it and tell you what's best, as they wouldn't want you to rely on crystal balls instead of engineering tools.

That might be different in some other engineering areas (e.g. writing assembler code for in-order CPUs) where you can statically predict what's better, but with so many dependencies, so many units in parallel, and sometimes really minor changes that can vastly affect performance (e.g. by sampling just one more value in a loop you could exceed the cache size and thrash the cache constantly), it's best to profile :)
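Even without vendor tools you can get basic numbers from OpenCL event timestamps. A rough host-side sketch (it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and the kernel and work sizes are placeholders):

/* Time a single launch with OpenCL event profiling. */
#include <CL/cl.h>
#include <stdio.h>

static void profile_launch(cl_command_queue queue, cl_kernel kernel,
                           size_t global, size_t local)
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;   /* timestamps in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
    clReleaseEvent(evt);
}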


I am very well aware of the latency-hiding strategies involving block read/write.

There's no heavy math in the memory-heavy section. I don't understand what you're saying:


> while you're doing extra work ... the kernel actually runs faster while doing the heavy math at peak rates

I don't understand what kind of extra work you are referring to: changing the layout or decompressing/transforming data seems like something I'd have to do, not the HW.


> if you wonder about some optimization you've come up with, it's indeed best if you just try it and profile it.

Are you reading my posts? We don't have easy access to either GCN or Kepler devices. I'm sorry to write this, but so far I haven't read anything I didn't already know, and I'm starting to realize I am not able to express my thoughts.

 

> I am very well aware of the latency-hiding strategies involving block read/write.
>
> There's no heavy math in the memory-heavy section. I don't understand what you're saying:
>
>> while you're doing extra work ... the kernel actually runs faster while doing the heavy math at peak rates
>
> I don't understand what kind of extra work you are referring to: changing the layout or decompressing/transforming data seems like something I'd have to do, not the HW.

And that 'extra' work is caching, re-layouting, decompression or transformation of data into a more friendly format.

>> if you wonder about some optimization you've come up with, it's indeed best if you just try it and profile it.
>
> Are you reading my posts?

That's why I try my best to give helpful replies to your vague ideas.

> We don't have easy access to either GCN or Kepler devices. I'm sorry to write this, but so far I haven't read anything I didn't already know

I've been trying my best to reply to your questions and to elaborate the reasons for my replies. Sorry for not being much of a help; I didn't know you knew the replies already.

> and I'm starting to realize I am not able to express my thoughts.

I'm not the only one here; maybe someone else has more helpful replies ;)

