Compute Threads


Just started reading up on compute shaders, but the author of this book, as well as an nvidia slide, seemed to brush off how Thread Groups and Num Threads work.

They stated that 32 threads will usually run concurrently. Not groups, but threads.

If I have a 1920x1080 image, and I simply want to add 1 to each pixel.

I could have (1,1,1) thread groups and (1920,1080,1) threads per group? Why would this be good or bad?

I could have (32,32,1) thread groups; what would my number of threads be? Wouldn't it have to be (1920/32, 1080/32, 1) to make sure it hits all pixels?

What about the case of a texture that is 33x33? How would you have to lay out threads and thread groups in that case? Or does the layout depend more on the algorithm itself, grouping things based on what the algorithm needs?

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


So there's a difference between the hardware and the software abstraction.

In the shader you specify a thread-group size, e.g. numthreads(8, 8, 1) declares that for each thread-group that is dispatched, you want 64 threads to be launched.

On the CPU side, the Dispatch call says how many thread-groups to launch, e.g. Dispatch(240, 135, 1) will launch 32400 thread-groups. With the above shader, that ends up launching (240*8) * (135*8) = 1920*1080 = 2073600 threads, i.e. one per pixel of a 1920x1080 image.
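To make that concrete, here's a minimal HLSL sketch of the "add 1 to each pixel" case from the question (gImage and CSMain are made-up names for the example):

    // Compute shader (HLSL): one thread per pixel, 8x8 = 64 threads per group.
    RWTexture2D<float> gImage : register(u0);

    [numthreads(8, 8, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID)
    {
        // SV_DispatchThreadID is this thread's global (x, y) pixel coordinate.
        gImage[dtid.xy] += 1.0f;
    }

    // CPU side (D3D11): 1920/8 = 240 groups in X, 1080/8 = 135 groups in Y.
    // context->Dispatch(240, 135, 1);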

On the hardware side:
AMD compute units always run 64 threads at once.
NVidia compute units run 32 threads at once.
Intel compute units run 8 threads at once.
So - for good performance across all hardware, you should try to make your shader's thread-group sizes a multiple of 64.

You can't make your thread-group size too big, because there's not enough memory inside a compute-unit to maintain state for thousands of threads at a time (a few hundred is a practical limit, and D3D11 itself caps a single thread-group at 1024 threads in cs_5_0).
The point of a thread-group is:
A) they can communicate with each other by using group-shared memory (see the sketch after this list).
B) they should somewhat match the way that the hardware works (8-64 "threads" being executed in lockstep on a SIMD architecture).
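As a quick illustration of (A), here's a hypothetical sketch of group-shared memory plus a barrier (the buffer and kernel names are made up):

    // Each group of 64 threads stages its values in on-chip shared memory,
    // syncs, then reads a neighbour's value -- only legal within one group.
    RWStructuredBuffer<float> gData : register(u0);

    groupshared float sharedVals[64];

    [numthreads(64, 1, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID,
                uint groupIndex : SV_GroupIndex)
    {
        sharedVals[groupIndex] = gData[dtid.x];

        // Wait until every thread in the group has written its slot.
        GroupMemoryBarrierWithGroupSync();

        // Now it's safe to read a value written by another thread in the group.
        gData[dtid.x] = sharedVals[63 - groupIndex];
    }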

You also don't want just a single mega-thread-group of size 1920*1080, because GPUs have more than one compute-unit.
A GPU might have 32 CUs, which can each be working on 10 thread-groups at once, each made up of 64 threads, meaning you can have 20480 threads in flight at once across all CUs -- but a single thread-group might be limited to something like 640 threads max.

So you want your thread-group size to be small enough so that your workload is spread out across all the CU's on the GPU, and you want it to be large enough so that you're not wasting any of the SIMD abilities of each CU. As a rule of thumb, 64 or 128 are good sizes to use.

Note that on AMD, a 128-sized thread-group will run on a SIMD-64 machine, which lets AMD's internal compiler basically unroll the code to work on two sets of SIMD-64 registers at once.

On NVidia, a 128-sized thread-group will run on a SIMD-32 machine, which lets NVidia's internal compiler unroll the code to work on four sets of SIMD-32 registers at once.

Sometimes this unrolling is helpful as it helps reduce memory stalls, and sometimes it's harmful as you're increasing register pressure... It takes a lot of profiling to find the optimal thread-group size for a particular GPU :(
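One practical way to do that profiling, assuming you control shader compilation, is to make the group size a preprocessor constant and time the same kernel built at several sizes (a sketch, not anyone's production setup):

    // Compile this kernel several times, e.g. with fxc's /D GROUP_SIZE=32|64|128|256,
    // and time each variant -- the best size varies per GPU and per shader.
    #ifndef GROUP_SIZE
    #define GROUP_SIZE 64
    #endif

    RWStructuredBuffer<float> gData : register(u0);

    [numthreads(GROUP_SIZE, 1, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID)
    {
        gData[dtid.x] *= 2.0f;
    }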

What about the case of a texture that is 33x33? How would you have to lay out threads and thread groups in that case? Or does the layout depend more on the algorithm itself, grouping things based on what the algorithm needs?

If you defined your thread-group size to be 8x8, you'd dispatch 33/8 x 33/8 thread-groups (rounding up to 5). That's enough threads to process a 40x40 sized texture, so a bunch of your threads will end up doing nothing.
HLSL is helpful here: out-of-bounds texture reads return 0, and out-of-bounds texture writes are a NOP -- so often your algorithm can be unaware that it's writing outside the bounds of your target texture. In other cases, you need an if statement to 'early out' on any out-of-bounds threads.
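For the cases where you do need the explicit check, the early-out looks something like this (a sketch; the 33x33 size is passed in via a hypothetical constant buffer):

    cbuffer Params : register(b0)
    {
        uint2 gTextureSize; // (33, 33) in this example
    };

    RWTexture2D<float> gOutput : register(u0);

    [numthreads(8, 8, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID)
    {
        // 5x5 groups of 8x8 threads cover 40x40, so threads past the
        // 33x33 edge must bail out before touching the texture.
        if (dtid.x >= gTextureSize.x || dtid.y >= gTextureSize.y)
            return;

        gOutput[dtid.xy] += 1.0f;
    }

    // CPU side: ceil-divide so partial groups are included.
    // (33 + 7) / 8 = 5, so: context->Dispatch(5, 5, 1);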

On the hardware side:
AMD compute units always run 64 threads at once.
NVidia compute units run 32 threads at once.
Intel compute units run 8 threads at once.
So - for good performance across all hardware, you should try to make your shader's thread-group sizes a multiple of 64.

That's only true for one (or a few) specific hardware generations. You have to check the execution-unit count per compute unit for each GPU type. There is a high amount of variance across different generations.

For example, when NVidia rolled out Kepler, they suddenly increased the SIMD count of the execution unit to 192, increasing the effectiveness of typical game shaders but severely crippling generic compute workloads. That's why you had lots of people sticking with their 580s for compute work even after the 680 series was out.

shaken, not stirred

That's only true for one (or a few) specific hardware generations. You have to check the execution-unit count per compute unit for each GPU type. There is a high amount of variance across different generations.

Actually IIRC it's been like that since unified shaders were introduced.

-potential energy is easily made kinetic-

Nvidia has had 32-thread warps since G80, and AMD has had 64-thread wavefronts for as long as I can remember. These numbers are not 1:1 with the configuration of physical execution units: the warp/wavefront count just means that those threads execute in lock-step from a software point of view (which allows for sharing of context, and also some forms of cross-thread communication), and it also dictates the minimum granularity for dispatching threads.

The thing about DX11 thread groups is that they are also not directly 1:1 with warp/wavefront size, and they're also not 1:1 with physical execution units on the hardware. In concept they're similar to a warp or wavefront: they determine your dispatch granularity, and they determine the set of threads that support cross-thread communication (synchronization/barriers, and thread-group local storage). Note that this is a bit different from what's dictated by a warp/wavefront, and so it's up to the driver to determine how to map the semantics of a thread group for a particular shader onto the GPU's hardware-specific functionality.

So for example, let's take a simple compute shader that doesn't use any cross-thread sync or group-shared memory. In this case every thread is completely independent, which means the driver is relatively unhindered in how it chooses to run that on the GPU. All that matters in the end is that all of your threads get executed with the correct SV_GroupID/SV_GroupThreadID/etc. parameters passed in, and so it may choose to split up your threads however it likes among its execution units. For instance, if you have a thread-group size of 1024, it may split that up into 16 different wavefronts, and those wavefronts might end up on totally different GPU cores.

However, if you decided to use shared memory, the driver now has an additional restriction. All of the threads in a group need to be able to see the same shared memory, which probably means they all need to end up executing on the same GPU core so that they share the same physical on-chip memory. So in this case using 1024 threads in a group might be bad, because it may lead to your threads "clumping up" on the GPU cores instead of getting spread out across all available execution units. It may also lead to cores being underutilized in the case that sirpalee mentioned, since there may be more ALU available but it can't be used, because you're bottlenecked by limited per-core shared memory.
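For reference, the relationship between the system values mentioned above is fixed, regardless of how the driver maps groups onto warps/wavefronts (hypothetical kernel, just to show the semantics):

    [numthreads(8, 8, 1)]
    void CSMain(uint3 groupId   : SV_GroupID,          // which group within the Dispatch grid
                uint3 threadId  : SV_GroupThreadID,    // which thread within its group
                uint  flatIndex : SV_GroupIndex,       // threadId flattened to 0..63
                uint3 dtid      : SV_DispatchThreadID) // global thread coordinate
    {
        // These identities always hold, whatever the hardware does:
        // dtid      == groupId * uint3(8, 8, 1) + threadId
        // flatIndex == threadId.y * 8 + threadId.x   (for an 8x8x1 group)
    }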

