GPU Compute threading model and its relationship to hardware


So I've read about modern GPU architecture and about the compute shader threading model, but what I don't fully understand yet is their relationship to each other.

From what I understand, the GPU has several SIMD lanes (or pipes), each featuring several SIMD processing units (shader cores / threads / ...), which also have fetch/decode units, ALUs, and execution contexts in them.

I've seen a simple example of a compute kernel where they multiply the values of an array of 2000 elements by 2 and store them back again.

The way they used the threading model was launching with Dispatch(20, 1, 1) and declaring [numthreads(100, 1, 1)], which would be 20 threadgroups, each having 100 threads that execute the compute kernel.
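Something like this minimal HLSL sketch, I assume (the buffer name and register binding are my guesses, not from the original example):

RWStructuredBuffer<float> data : register(u0);

[numthreads(100, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Dispatch(20, 1, 1) x numthreads(100, 1, 1) = 2000 threads,
    // so dtid.x runs from 0 to 1999, one thread per array element.
    data[dtid.x] *= 2.0f;
}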

Now what I don't quite understand is the relationship between the number of threadgroups (wavefronts/warps?) and threads (SIMD units?).

If these are a 1:1 mapping, does that mean that I can manually tell the GPU how many threads a wavefront should have? Sounds unlikely...

If that's not the case, then by what criteria do I calculate these numbers for what I'm trying to accomplish with the compute shader?

Using the example above, why not use 200 threadgroups with 10 threads each? Or how about something more random, like 5 threadgroups with 400 threads each? How do I know the most efficient number of threadgroups/threads for a certain task? And how does this map to the actual hardware?


does that mean that I can manually tell the GPU how many threads a wavefront should have? Sounds unlikely...

You can -- threadgroups × threads per group = # of threads per wave (e.g. 20 × 100 = 2000 in your example), up to some maximum # of groups and a maximum # of threads per group.


by what criteria do I calculate these numbers for what I'm trying to accomplish with the compute shader

You will probably have to try different settings and profile across your range of inputs, though you can make some educated guesses ahead of time based on memory needs, register usage, etc.


why not use 200 threadgroups with 10 threads each? Or how about something more random, like 5 threadgroups with 400 threads each? How do I know the most efficient number of threadgroups/threads for a certain task?

You want to fully utilize the hardware (fast memory, registers, and the instruction pipeline) to minimize the time needed to do the work. It depends on how many cores your GPU has, how much cache, how many available registers... and other things. You can only run so many threads in parallel, so you want those threads to be utilizing the resources allocated to them, otherwise you are leaving hardware idle (i.e., use fewer threads that each do more work). But if the threads require too many resources, then fewer threads can actually run in parallel and speed takes a big hit (so use more threads that each do less). You have to find the right balance, probably through testing. Also, whether or not threads need to pass data to each other will factor strongly into the decision, since inter-group communication is problematic.

And how does this map to the actual hardware?

A few years ago, I found some great diagrams that showed this... keep searching, especially on the sites of companies that produce GPUs.


OK, some terminology needs clearing up here I think.

Warp and wavefront are the terms used by NVidia and AMD respectively to describe the number of threads which are working in lock step. This number is fixed by the hardware; you can't change it. AMD uses a wavefront size of 64 and NVidia a warp size of 32. This means that at any given time, on AMD hardware, 64 threads are executing the SAME instruction at the same time in lock step.

Threadgroups, on the other hand, do what the name implies: a threadgroup is a group of threads working together on a task. This can indeed be any number of threads (up to given limits); however, the hardware can only execute threads in warp/wavefront-sized chunks. This means if you ask for 100 threads you'll effectively be executing 128 threads (2 wavefronts, or 4 warps), with the 'unused' 28 threads being masked out.
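To make the rounding concrete, here's a small HLSL-style sketch (WavesPerGroup is an illustrative helper of mine, not a real API):

// Illustrative helper: how many hardware waves a threadgroup occupies.
// waveSize is 64 on AMD (wavefront), 32 on NVidia (warp).
uint WavesPerGroup(uint groupSize, uint waveSize)
{
    return (groupSize + waveSize - 1) / waveSize; // round up to whole waves
}
// WavesPerGroup(100, 64) == 2 -> 128 lanes scheduled, 28 masked out (AMD)
// WavesPerGroup(100, 32) == 4 -> 128 lanes scheduled, 28 masked out (NVidia)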

When I say 'executing 128 threads': while they are in theory executing at the same time, in practise the hardware executes them in warp/wavefront-sized blocks. So on AMD hardware it would execute instructions for the first 64 threads, then swap and do the next 64, and so on; on NV it would work the same way but, with a warp size of 32, it would pick from 4 groups of threads.

So while you can ask for 200 groups of 10 threads, this means your hardware will be under-utilised: on AMD hardware, say, 54 of the 64 threads in each wavefront would be 'idle' and doing no useful work.

When it comes to working out what you need to set to do the work, well, this is a combination of knowing the hardware, knowing the work you need to do, experience, and profiling.

So they are actually more like virtual threads, and the hardware is using its own partitioning into wavefronts/warps to do the work...?

And in the case of AMD hardware, a good number of threads to use would then be a multiple of 64? What about the group size? Is that just a case of trying to make it fit?

They are no more 'virtual' than the threads you create on the CPU; if you have a 4-core CPU but create 12 threads, the OS has to select among them. GPUs work the same way, except that threads come in groups and it's the hardware selecting which threads to run.

'Group size' is the number of threads you control; ideally, yes, this will be a multiple of the number of threads which are executing at once (so 64 on AMD). It doesn't have to be, but it's a good place to start in most cases.
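As a sketch of that starting point, using a 2000-element buffer like the earlier example (names and bindings are my own, for illustration): pick a group size of 64, round the dispatch count up, and guard the leftover threads:

cbuffer Params : register(b0)
{
    uint numElements; // e.g. 2000; Dispatch((2000 + 63) / 64, 1, 1) = 32 groups
};
RWStructuredBuffer<float> data : register(u0);

[numthreads(64, 1, 1)] // one full AMD wavefront, two NVidia warps, per group
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= numElements)
        return; // the last group's extra 48 threads do no work
    data[dtid.x] *= 2.0f;
}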


This means that at any given time, on AMD hardware, 64 threads are executing the SAME instruction at the same time in lock step.

Correct, but I believe that's per compute unit, not per GPU.

A given GPU typically has several compute units; a 7970 has 32 of them. Inside the compute unit, currently, those 64 threads are organized as 4 x 16-wide SIMD execution units. NV's SMX units are similar in concept but arranged differently. This is also the level at which many resources are shared (memory controller access, texture units, ROPs), so while leveraging this information to run code better may risk over-tuning things, it's important to understand for some performance characteristics.


Each CU in a 7970 actually executes 4 wavefronts concurrently. Each wavefront is mapped to a 16-wide SIMD unit, and instructions are executed 16 threads at a time. Not that this changes much, but who doesn't like being pedantic. :P

Anyway, these would be my TL;DR guidelines for thread groups:

1. Always prefer thread group sizes that are a multiple of the warp/wavefront size of your hardware. If you're targeting both AMD and Nvidia, a multiple of 64 is safe for both.

2. You usually want more than one wavefront/warp per thread group. Having more lets the hardware swap warps/wavefronts in and out to hide latency. 128-256 threads per group is usually a good place to aim.

3. Don't use too much shared memory; it kills occupancy. (See the sketch below.)
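As an illustration of all three points (my own sketch, with assumed names and bindings): a group size of 128 is a multiple of 64 and gives multiple waves per group, and the groupshared array is kept small:

RWStructuredBuffer<float> data : register(u0);
groupshared float scratch[128]; // 512 bytes; large groupshared arrays reduce occupancy

[numthreads(128, 1, 1)] // 2 AMD wavefronts / 4 NVidia warps per group
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    scratch[gi] = data[dtid.x];        // stage through fast on-chip memory
    GroupMemoryBarrierWithGroupSync(); // make staged values visible to the whole group
    data[dtid.x] = scratch[gi] * 2.0f; // trivial work, just for illustration
}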

Correct, but I believe that's per compute unit, not per GPU.


Yes, I should have been clearer about that - good catch :)

One of the key reasons why a compute unit can be assigned more threads than a warp/wavefront is latency hiding. Texture fetches, memory accesses, even certain data dependencies, etc., can take more than one cycle. The compute unit can swap out a warp/wavefront that has stalled for one that is ready to go.

