DirectCompute. How many threads can run at the same time?


Recommended Posts


This is the information I found on MSDN:

  • The maximum number of threads per group is limited to 1024 (the product of the X, Y, and Z sizes in [numthreads]).
  • The maximum dispatch size is limited to 65535 groups per dimension.

So, in theory, one dispatch can launch 65535 * 65535 * 65535 groups of 1024 threads each. That is a huge number, and I bet no GPU can run that many in parallel. But what happens if I make a call like this? Will the groups execute in order? I mean, if there are no "free" thread groups, will processing stall until some group finishes? And what is the maximum number of groups that I can run simultaneously?


But what is the maximum number of groups that I can run simultaneously?
The number actually in-flight at once depends highly on the GPU, but also on the complexity of your shader...

On a high-end GPU, using a very complex shader, probably around 1024... or when using a very simple shader, probably 10x more -- around 10240.


All that really matters is that the thread group size is a multiple of 64 (AMD's SIMD size), and then you dispatch the appropriate number of threadgroups, i.e. (AmountOfWork+63)/64 to cover all your work items (AmountOfWork).

No, 1 group of 1024 threads, or 16 groups of 64 threads.

If you dispatch more, then the GPU will get to them when it gets to them.

It's a good idea to give the GPU more than enough work to do, because it's "hyperthreaded". It doesn't run a workgroup from beginning to end without interruption before moving onto the next. It will continually do a bit of work on one group, then a little bit on another, constantly jumping around between half-finished groups.
It does this because it's a great way to hide memory latency.
e.g. with a pixel shader - say you have:
return tex2D(texture, uv * 2) * 4;
For each pixel, it has to do some math (uv*2) then fetch some memory (tex2d), then do some more math (*4).
A regular processor will reach the memory-fetch, and stall, waiting potentially hundreds of cycles until that value arrives from memory before continuing... So it takes 2 cycles to do the math, plus 400 cycles wasted on waiting for memory! Resulting in 402 cycles per pixel.

To avoid this, when a GPU runs into that kind of waiting situation, it just switches to another thread. So it will do the "uv*2" math for 400 different pixels, by which time the memory fetches will start arriving, so it can do the final "result*4" math for 400 pixels, with the end result that it spends zero time waiting for memory! Resulting in 3 cycles per pixel (assuming the GPU can handle 400 simultaneous work groups....)

For the GPU to be able to hide memory latency like this, you want your thread group size to be a multiple of 64, your dispatch size to be greater than 10, and your shaders to use as few temporary variables as possible (these form the 'state' of a thread, which must be stored somewhere while the GPU switches between threads).

