Compute shader: varying the number of threads


Hi. I have a structured buffer which is indexed by the flattened ID of a thread. The size of a thread group has to be stated before the shader is compiled. The actual size is not known at compile time, but is acquired at runtime, before compiling the shader. I can vary the total size dynamically by setting the number of thread groups in the dispatch call, meaning there would be just one thread in each group, or I can make some small groups of, say, 10-50 threads each, which is probably more beneficial since I can use the shared memory within these groups, but then it becomes impossible to dispatch the exact total number of threads. Is there a way to cull some threads if their ID exceeds a certain range? I mean, there are totals (odd numbers, for example) that just can't be produced by multiplying the number of groups by the number of threads per group.
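
Roughly, this is the setup I mean (the names are made up; GROUP_SIZE stands for the value picked at runtime and baked in when the shader is compiled):

// GROUP_SIZE is supplied when the shader is compiled at runtime,
// e.g. as a compiler macro
StructuredBuffer<float> srcData;   // made-up buffer names
RWStructuredBuffer<float> dstData;

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 DTid : SV_DispatchThreadID)
{
    // DTid.x is the flattened thread ID used to index the buffer
    dstData[DTid.x] = srcData[DTid.x] * 2.0f;
}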


You can branch the shader logic based on the thread id. In fact, this is very common in algorithms that require some kind of setup within the shader that is required to run serially on a single thread (compare thread id to 0 and execute if true, then call global memory barrier to sync with other threads).
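
A minimal sketch of that pattern (the shared variable is made up):

groupshared float gSum; // made-up value that a single thread initializes

[numthreads(64, 1, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID)
{
    if (GTid.x == 0)
        gSum = 0.0f;                   // serial setup runs on thread 0 only
    GroupMemoryBarrierWithGroupSync(); // the other threads wait here until the write is visible
    // ... the rest of the shader runs on all 64 threads ...
}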

Be aware that the "culled" threads still effectively have to wait for the threads that go do the actual work. That is, the "idling" shader units need to waste their time.

Niko Suni

The thing is, I try not to branch my code (replacing branches using mix, etc.). Is the impact of branching still noticeable in Shader Model 5.0 shaders, or has it been mitigated in some way?

When you branch, the threads in the same group need to effectively execute all the instructions, but the memory reads and writes are no-ops on the threads that do not follow the active branch.

If you can model the logic using other constructs, then by all means do so (and therefore avoid the "idling"). But in this particular case, I believe it can be very difficult to replace actual branching. If you develop an elaborate path of mixes and such in an effort to avoid branching, the cost of such avoidance may be greater than the cost of branching itself.
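
For reference, this is the kind of replacement being discussed; GLSL's mix is lerp in HLSL. A made-up helper, just to illustrate:

float Shade(float x, float a, float b)
{
    // branching version: if (x >= 0.5f) return a; else return b;
    // branchless version below: step(0.5f, x) is 1.0 when x >= 0.5, else 0.0.
    // Note that a and b are both evaluated either way.
    return lerp(b, a, step(0.5f, x));
}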

It is also worth noting that in most real-world applications, maintainability is a very big factor in determining the actual return on investment. You really don't want to sacrifice maintainability for trivial reasons; low maintainability == high maintenance cost.

Niko Suni

AFAIK, modern AMD GPUs run 64 threads at once and nVidia GPUs run 32 threads at once (per wave, per processor, per shader unit, per GPU).

If there are 64 threads in a wave, and only 10 of them take the true branch of an if statement, then you still pay for the full wave. You're wasting (64-10=)54 threads' time.

If every thread takes the same branch, then there's only the cost of the if instruction itself to worry about.

On modern GPUs, the cost of the if instruction itself is basically free. On older GPUs, it would cost about the same as a dozen basic arithmetic instructions.

If you set your shader's thread group size to 10, then on nVidia hardware you're always wasting (32-10=)22 threads' time per thread group. If you dispatch 100 thread groups, you're wasting 2200 threads' time.

The same example on AMD hardware is (64-10=)54 threads / 5400 threads.

Say you need to process 1000 items (continuing with modern AMD hardware in this example):

You can set the thread group size to 10, then dispatch 100 thread groups as above, but this is immensely wasteful (5400 HW threads wasted).

You can set the thread group size to 64, then dispatch 16 thread groups, which equals 1024 total threads. You can then add an if statement to your code, as suggested by the posters above. This means that for 15 of the thread groups, the branch does nothing (and is pretty much free, performance-wise), and for the last thread group you waste (1024-1000=)24 threads' time.

24 wasted threads is much nicer than 5400 :D
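
Concretely, the guarded version looks something like this (buffer names made up, item count hard-coded for the example):

static const uint ITEM_COUNT = 1000;

StructuredBuffer<float> srcData;   // made-up buffer names
RWStructuredBuffer<float> dstData;

[numthreads(64, 1, 1)]
void CSMain(uint3 DTid : SV_DispatchThreadID)
{
    if (DTid.x >= ITEM_COUNT)
        return; // the 24 excess threads in the last group exit here
    dstData[DTid.x] = srcData[DTid.x] * 2.0f;
}

// Host side: dispatch (1000 + 63) / 64 = 16 thread groups.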

But apart from memory sharing, are there any reasons (not logical ones that help you organize the code, but rather from a hardware point of view) to group threads, or is it OK to just dispatch groups with a single thread in them?

The hardware is pretty important -
If you dispatch groups of 1 thread, then on AMD you'll be running at 1/64th speed, and 1/32nd speed on NVidia, which is a huge penalty.
These chips are SIMD processors, where one register holds 32/64 floats, so one instruction operates on 32/64 floats. These specific architectures are designed to run lots of threads within each group - specifically, multiples of 32(NVidia) or 64(AMD) threads.

If you know you'll always be working on 100 items, there's no harm in declaring that in your shader... But while declaring one thread (and then dispatching the true number) is easy to develop/maintain, it will kill performance. You have to balance maintainability with hardware-specific choices... ;(
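
For contrast, the one-thread-per-group version of the same made-up kernel:

StructuredBuffer<float> srcData;
RWStructuredBuffer<float> dstData;

[numthreads(1, 1, 1)]
void CSMain(uint3 Gid : SV_GroupID)
{
    // A thread group cannot share a wave with another group, so only
    // 1 of the 32/64 lanes in each wave does any work here.
    dstData[Gid.x] = srcData[Gid.x] * 2.0f;
}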

Sounds funny. So, the best scenario would be (at least in the 1D case) to make groups with a number of threads that is a multiple of 32 and then dispatch enough of these groups. Do they state explicitly in their documentation that performance can be hurt that much just because you don't guess the right size? Because you might think that as long as your shader is OK, you can expect more or less the same performance in differently scaled cases.

Tools like AMD CodeXL helped me to understand why choosing another thread group size can affect performance.

For Nvidia there is something similar, Nsight, which I have not tried yet.

The bad thing is that this depends on the chip type, so it might make sense to compile the shader with the group size dependent on that.

There are, however, cases where you don't have a choice anyway.

For example: you have nodes with a varying number of neighbours; after the neighbour count has been found, you would bin the nodes into groups of <=64, <=128 and <=256.

Then you launch the same shader, compiled three times with thread group sizes of 64, 128 and 256, which does something like:

if (threadID < numNeighbours) LDS[threadID] = neighbourDataFromGlobalMemory[binGroupIndicesBuffer[globalID]]; // move all the data to LDS memory so all threads have fast access to it

GroupMemoryBarrierWithGroupSync(); // make the LDS writes visible to the whole group before using them

// do some work like sorting by distance...

Downsides of this are:

On average you have only about 3/4 of the threads busy.

You need to invoke 3 shaders instead of one (and your API / hardware may not allow executing them simultaneously).

But because you can solve the problem in fast LDS, it will be the fastest way to do it, if you have enough nodes to process.
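
Putting that together, a minimal sketch of one such variant (buffer names and the per-node neighbour-count lookup are made up; NUM_THREADS would be defined as 64, 128 or 256 for each compiled variant):

groupshared float4 LDS[NUM_THREADS];

StructuredBuffer<float4> neighbourDataFromGlobalMemory; // made-up global buffers
StructuredBuffer<uint> binGroupIndicesBuffer;
StructuredBuffer<uint> neighbourCountPerNode;
RWStructuredBuffer<float4> result;

[numthreads(NUM_THREADS, 1, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID,
            uint3 Gid : SV_GroupID,
            uint3 DTid : SV_DispatchThreadID)
{
    uint numNeighbours = neighbourCountPerNode[Gid.x]; // one node per group

    if (GTid.x < numNeighbours)
        LDS[GTid.x] = neighbourDataFromGlobalMemory[binGroupIndicesBuffer[DTid.x]];
    GroupMemoryBarrierWithGroupSync();

    // ... sort by distance etc. entirely in LDS, then write the results out ...
    if (GTid.x < numNeighbours)
        result[DTid.x] = LDS[GTid.x];
}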

Do they state explicitly in their documentation that performance can be hurt that much just because you don't guess the right size?

The wavefront size of each IHV's hardware has been mentioned in a few talks and presentations, e.g. (slide 22).

Choosing a thread group size that is a multiple of 64 should eliminate any wasted threads on both NV and AMD hardware (unless you use branches).

