Bacterius

DX11 DirectX Compute Shader Help


Hi,
first: I have an HD6950 (DX11 capable), which apparently has 1408 shader units.

Now I have a compute shader which is as follows:

[numthreads(1024, 1, 1)]
void main(uint thread : SV_GroupIndex)
{
    // lots of number crunching
}


My dispatch call on the CPU looks like this (C#, SlimDX), where count is the number of groups to execute. The query is there to make the call block until the GPU has finished working (for profiling):

public void Dispatch(int count)
{
    device.ImmediateContext.Dispatch(count, 1, 1);
    device.ImmediateContext.End(eventQuery);
    while (!device.ImmediateContext.IsDataAvailable(eventQuery)) { }
}


Executing 200 000 groups (of 1024 threads each) takes 0.511 ms on average on my computer.

However, executing the same number of groups but with 512 threads each (by changing it to numthreads(512, 1, 1)) takes 0.259 ms, i.e. half as long.
With 256 threads per group it takes 0.132 ms, and so on.

It was my understanding that the GPU performs the compute shader computations in parallel. Therefore, shouldn't each group take the same amount of time, regardless of the number of threads in the group (up to the number of shader cores available on the GPU)? Or does the GPU already use all available shader cores and distribute the total load over all of them equally (which would explain the timings)? And if the latter, what is the use of the "numthreads" attribute?

I am really confused. This whole "thread group" concept is convoluted and difficult to grasp, and there aren't many helpful resources about compute shaders available online.

The "shader units" terminology is some marketing thing. It doesn't directly (but rather indirectly) relate to the number of threads you can run in your card. Same goes for "shader cores" in NVDIA's.
Sometimes these "core/units" are grouped together and the actual number of thread reduces instantly. For example (hypothetically) if your Radeon has organized 4 "cores" per thread, that means you can run up to 352 threads in parallel.

Nvidia and AMD both use their own terminology, which makes things a little confusing at first glance.
Kayvon Fatahalian gave a Siggraph talk that explains quite well how GPUs work (including the interleaved processing of warps and so on). The slides are here.

Threads in a thread group will not all be executed simultaneously. GPUs execute threads in small groups called "warps" (Nvidia) or "wavefronts" (AMD). A wavefront has 64 threads, and the threads within a wavefront are executed concurrently. Each core on the GPU will process one or more warps/wavefronts at a time, and will periodically switch out a warp/wavefront for another whenever a stall occurs (usually due to a memory access). It does this to hide the latency of the memory access, so that it can keep the ALUs busy.

When you specify a thread group in a compute shader, it will be split up into warps or wavefronts (16 wavefronts in your case of a 1024-thread group), and all of those wavefronts will be restricted to a single core of the GPU so that they can share the same shared memory unit. Having lots of warps/wavefronts can be good or bad, depending on the hardware and the workload. If you don't have enough, then the GPU may not be able to effectively hide the latency of memory accesses. If you have too many, then you may underutilize the GPU cores (this is more likely if you use a lot of shared memory, since the amount is limited). You'll have to profile to know what gives you the best performance for a given workload and GPU. For most things I tend to find that 128-256 is a pretty sweet spot, so I usually start around there.

One thing that you should always do is make sure that your thread group size is a multiple of the warp/wavefront size. Since the hardware can only execute entire warps/wavefronts, you'll just waste threads if your thread group size isn't evenly divisible by 32 or 64.
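
For illustration, here is a quick sketch of that point in HLSL (the entry point names are mine, and the wavefront/warp sizes are the ones described above):

// Assuming a 64-thread wavefront (AMD) or a 32-thread warp (Nvidia), as above.

// 100 threads still occupy two full wavefronts (128 hardware lanes), so 28 lanes
// in every group do nothing.
[numthreads(100, 1, 1)]
void WastefulCS(uint3 id : SV_DispatchThreadID)
{
    // ...
}

// 128 threads map exactly onto two wavefronts (or four warps); no lanes are wasted.
[numthreads(128, 1, 1)]
void PackedCS(uint3 id : SV_DispatchThreadID)
{
    // ...
}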

Ah I see. The slides are helpful. So am I actually meant to query the number of threads I can theoretically run in parallel, or does the GPU already deal with that and distribute the work? I suppose the latter to make it scalable.

I'm asking because I am really worried about my compute shader implementation's performance: a trivial compute shader with, say, 256 threads, each thread incrementing its own member of a structured buffer by 1, reaches a performance of 350M additions per second, which seems awfully low - my processor could beat that with a single thread. This leads me to believe I must be doing something horribly wrong, but the DirectX debug layer tells me there are no obvious issues.

I think I'm either failing really hard at understanding the relationship between Dispatch() and numthreads(), or there's some crippling memory synchronization issue in my compute shader. Such as this one (the one I use to test):


struct stuff
{
    uint4 data;
};

RWStructuredBuffer<stuff> buffer;

[numthreads(256, 1, 1)]
void main(uint thread : SV_GroupIndex)
{
    buffer[thread].data += uint4(1, 1, 1, 1);
}


And if I make too many Dispatch calls on the CPU side, my driver freezes as well.

[quote]When you specify a thread group in a compute shader, it will be split up into warps or wavefronts (16 wavefronts in your case of a 1024-thread group), and all of those wavefronts will be restricted to a single core of the GPU so that they can share the same shared memory unit. [...][/quote]
I think this might explain the low performance I see, if only a single core is being used because of the shared memory. But then how do I make all my cores work? Is there a way for my compute shader to run on every single GPU core, considering it doesn't need memory synchronization or anything? Each thread is completely self-contained within its structured buffer index. I just want all my cores busy doing number crunching, and the results put back into the structured buffer when finished.

Is there a way to be certain of when the GPU is done executing all the thread groups I asked it to dispatch? I use an event query at the moment but maybe it doesn't apply to compute shaders.

Your compute shader doesn't take the thread group ID into account. This means that all of the thread groups you dispatch will access the same 256 elements of your structured buffer, which is a surefire way to cause contention, since different thread groups will run concurrently on different cores of the GPU. It's undoubtedly being made worse by the fact that you do a write AND a read. If you use the .x component of SV_DispatchThreadID instead of SV_GroupIndex, then it will take the group index into account and all of your thread groups will write to different areas of memory. Alternatively, you can use SV_GroupID to come up with the buffer index yourself.

And yes, if a single Dispatch (or Draw call, for that matter) takes too long, Windows will assume the driver is hung and will restart it. Breaking up the Dispatch into multiple calls will alleviate the problem. If that is not possible, you can set registry keys that control how long it takes for a timeout to occur.
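
For illustration, a minimal sketch of that change applied to the test shader above (it assumes the structured buffer is resized to hold one element per dispatched thread):

struct stuff
{
    uint4 data;
};

RWStructuredBuffer<stuff> buffer;

[numthreads(256, 1, 1)]
void main(uint3 threadID : SV_DispatchThreadID)
{
    // SV_DispatchThreadID.x = SV_GroupID.x * 256 + SV_GroupThreadID.x, so every
    // thread in every group touches a distinct element and there is no contention.
    buffer[threadID.x].data += uint4(1, 1, 1, 1);
}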

[quote]Your compute shader doesn't take the thread group ID into account. This means that all of the thread groups you dispatch will access the same 256 elements of your structured buffer, which is a surefire way to cause contention [...] If you use the .x component of SV_DispatchThreadID instead of SV_GroupIndex, then it will take the group index into account and all of your thread groups will write to different areas of memory.[/quote]
But my structured buffer only has 256 elements. Basically I want the threads to repeatedly do the same number crunching over and over again, reading the input from their structured buffer slot and writing it back to the same place, in an iterative manner. I want the shaders to repeat the same calculations over and over with feedback, at high speed. But if I do a for loop inside the compute shader (so that I only dispatch a single group and read/write memory only once per thread), the dispatch call takes too long and the driver dies on me (then Windows restarts it).

So what I should do is get a bigger structured buffer and do more parallel computations with all the groups to hide memory latencies, right (using SV_DispatchThreadID to index my buffer)? So essentially I will get more work done, but fewer overall shader iterations per group because of memory access latency. This makes more sense.
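
For what it's worth, here is a rough sketch of that "bounded loop per dispatch" idea (the iteration count and the crunch() placeholder are my own; they stand in for the actual hash work):

struct stuff
{
    uint4 data;
};

RWStructuredBuffer<stuff> buffer;

// Placeholder for the real per-iteration hash round.
uint4 crunch(uint4 v)
{
    return v * 1664525u + 1013904223u;
}

#define ITERATIONS_PER_DISPATCH 1024

[numthreads(256, 1, 1)]
void main(uint3 threadID : SV_DispatchThreadID)
{
    // Each thread owns one buffer slot: read it once, iterate a bounded number of
    // times, and write it back so the next Dispatch can continue where this one stopped.
    uint4 state = buffer[threadID.x].data;

    for (uint i = 0; i < ITERATIONS_PER_DISPATCH; ++i)
        state = crunch(state);

    buffer[threadID.x].data = state;
}

Keeping each Dispatch bounded like this, and issuing several of them, is what keeps a single call short enough to avoid the driver timeout mentioned earlier.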

Well, the first thing you should do is eliminate the race conditions. The way you're doing it right now won't even produce correct results, since the memory accesses between different thread groups won't be synchronized. Instead you could use atomic increments, using a ByteAddressBuffer or a typed R32_UINT buffer.
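
As a minimal sketch of what such an atomic increment can look like (the buffer name and indexing here are my own assumptions):

RWByteAddressBuffer counters;   // 256 uints, viewed as raw bytes

[numthreads(256, 1, 1)]
void main(uint3 threadID : SV_DispatchThreadID)
{
    // Atomically add 1 to the uint at byte offset slot * 4. This stays correct even
    // when many thread groups hammer the same 256 slots concurrently.
    uint slot = threadID.x % 256;
    uint previous;
    counters.InterlockedAdd(slot * 4, 1, previous);
}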

After that, you can tackle performance. Atomics aren't the quickest, especially if you need them globally across multiple thread groups. To get performance you'll need to structure things so that you can spread the work across a lot of threads without requiring any global communication or atomic instructions. Nvidia has a bunch of whitepapers on data-parallel algorithms, including one on parallel reduction. So you could have each thread group calculate a single value of the output buffer, with N threads in the thread group performing the parallel reduction to sum the values. This lets you saturate the GPU and avoids any atomics or global synchronization. If you want a compute shader example, you can look at this sample that I made. It computes the average luminance of a 1024x1024 texture using two 32x32 parallel reductions.
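
As a rough sketch of what such a reduction can look like in a compute shader (the buffer names and the 256-thread group size are my own choices, not taken from the sample):

#define GROUP_SIZE 256

StructuredBuffer<float> Input;      // GROUP_SIZE elements per dispatched group
RWStructuredBuffer<float> Output;   // one partial sum per group

groupshared float SharedSums[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void ReduceCS(uint3 dispatchID : SV_DispatchThreadID,
              uint groupIndex : SV_GroupIndex,
              uint3 groupID : SV_GroupID)
{
    // Each thread loads one value into shared memory.
    SharedSums[groupIndex] = Input[dispatchID.x];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: halve the number of active threads each step.
    [unroll]
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (groupIndex < s)
            SharedSums[groupIndex] += SharedSums[groupIndex + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes the per-group result; no atomics or global sync required.
    if (groupIndex == 0)
        Output[groupID.x] = SharedSums[0];
}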

I will look into it. I doubt I will be able to parallelize the algorithm, however, because I'm doing cryptographic hash computations, which are inherently nonlinear. So I think the approach of doing more parallel, independent work instead of sharing the same task across many threads might be better suited to me. I will need to watch out for synchronization, however. For now I am using the "simple" approach, which keeps the GPU at 100%, and I'm getting fairly good performance out of it, but I can probably get more if I organize my threads and manage my buffers better. It's quite a change porting CPU work to the GPU; you realize how many subtle things we take for granted when working on the CPU.

On a different note - sorry for the change of topic - is there a way to use hardware-specific instructions in HLSL? For instance, newer AMD cards have the bitalign instruction, which essentially performs a 32-bit bitwise rotation directly in hardware. Does HLSL let me inline this instruction (since it doesn't have a corresponding intrinsic), or how would I go about doing that?

There's no way to inline assembly in HLSL, and definitely no way to inline vendor-specific microcode instructions. It's possible that the JIT compiler in the driver will pick up on certain patterns and replace them with the bitalign instruction... you could use GPU ShaderAnalyzer to find out for sure.
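
For instance, the usual shift-and-OR rotate pattern below is the kind of thing the driver's compiler might recognize (whether it actually becomes a single bitalign instruction is hardware and driver dependent):

// Rotate x left by n bits (0 < n < 32). The driver's shader compiler may or may
// not turn this into a single hardware rotate/bitalign instruction.
uint rotl32(uint x, uint n)
{
    return (x << n) | (x >> (32u - n));
}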
