# [DirectCompute] computing and indexing large buffers


## Recommended Posts

Hi,

I'm currently trying to solve a parallel computation problem with the help of compute shaders. The problem I'm facing is that the number of computed elements can be quite high (1-2 million), and I'm not entirely sure how to run a compute shader over such large buffers with reasonable performance.

My shader setup basically looks like this...

```hlsl
struct InputData
{
    ...
};

struct OutputData
{
    ...
};

StructuredBuffer<InputData> InputBuffer;
RWStructuredBuffer<OutputData> OutputBuffer;

OutputData CS_Calculation(InputData i)
{
    // the computational workload happens here
    ...
}

[numthreads(1, 1, 1)]   // one thread per group, matching Dispatch(element_count, 1, 1)
void CSMain(uint3 DTid : SV_DispatchThreadID)   // entry point name assumed
{
    // the number of elements in the input and output buffers is the same,
    // each element in the output buffer is computed from its matching element in the input buffer
    InputData input = InputBuffer[DTid.x];
    OutputBuffer[DTid.x] = CS_Calculation(input);
}
```


For each element of the input buffer I want to run a calculation and store the result in the output buffer (both are structured buffers).

As you can see, I address the elements in the buffers via the X component of the SV_DispatchThreadID semantic. The problem with this is that I'm then limited to 65535 elements, the maximum number of thread groups per dispatch dimension (see http://msdn.microsoft.com/en-us/library/windows/desktop/ff476405(v=vs.85).aspx), since I'm calling it with parameters like the following...

```cpp
dc->Dispatch(element_count, 1, 1);
```


What I figured is that I should somehow use a different combination of the [numthreads(x, y, z)] definition and the Dispatch(x, y, z) parameters to support higher element counts, but I'm unable to figure out how to correctly get the index of the current input/output element in the shader if I do so.

I have already tried several ways to use the other available CS semantics (http://msdn.microsoft.com/en-us/library/windows/desktop/ff471566(v=vs.85).aspx), but I never got the wanted result. Most of the time, certain elements in the output buffer were not written at all.

Could someone please help me out with how to correctly combine the numthreads shader declaration, the Dispatch call, and the corresponding CS semantics to achieve the linear indexing that I need?

Thanks


##### Share on other sites
I wouldn't say 1-2 million elements is that high for GPU workloads, if you consider a 1920x1080 image is just over 2 million pixels. Most modern GPUs have no problem rendering that many pixels multiple times per frame, at many frames per second, using pixel shaders.

The key to getting the same performance with compute shaders is providing the GPU with workloads that utilise it just as well as normal rendering does.

Perhaps someone more experienced could comment here, but I don't believe there's any performance difference between using numthreads(64,1,1) and numthreads(8,8,1); they both run 64 threads in a group. The dimensions just make it easier to access multi-dimensional data. For example, with a 2D array or an image, it's easier to index into that data per-thread when the thread ID is also in 2 dimensions. Another reason to use multiple dimensions is to get around the per-dimension limits when you want to run a lot of threads; in D3D11 (cs_5_0), a group is limited to 1024 threads in total, with at most 1024 in X or Y and 64 in Z.

Which brings us to dispatching threadgroups, which is where you hit that exact dimension limit: you could only dispatch a maximum of 65535 threadgroups in the X dimension, with each group running a single thread. If you now change your numthreads to (64, 1, 1), the dispatch becomes (element_count / 64) rather than element_count alone, since each group you dispatch now runs 64 threads at once rather than a single thread.

So 2 million / 64 = ~32k, which is within the dispatch dimension limit and you could just use:

```cpp
UINT dispatchX = element_count / 64;  //element_count over numthreads per threadgroup
dc->Dispatch(dispatchX, 1, 1);
```
You should be careful of integer rounding issues though: when element_count isn't a multiple of 64, you could skip the processing of those last <64 elements, e.g. 137 / 64 = 2 when dealing with integers. The solution is to add (numthreads - 1) to your element count before dividing, so (137 + 63) / 64 = 3.

You might notice that the thread count is now being used in multiple places (inside the compute shader, and twice when figuring out the threadgroup dimensions), so it might be easier to make it a define that is shared between the .cpp and .hlsl files. So now the code would be:

```
//some file like shaderdefs.h
#define THREADS_DIM 64

//in the compute shader hlsl file
#include "shaderdefs.h"
[numthreads(THREADS_DIM, 1, 1)]

//in your cpp file with the dispatch
#include "shaderdefs.h"
UINT dispatchX = (element_count + THREADS_DIM - 1) / THREADS_DIM;
dc->Dispatch(dispatchX, 1, 1);
```
So that deals with using the hardware more efficiently to improve performance; now you need to index into your array with each thread, within each threadgroup. I'm assuming your input and output are 1D arrays, by the way, hence the 1-dimensional numthreads and dispatch call, since you want to index into a 1-dimensional resource.

You should be able to just use the SV_DispatchThreadID semantic as you have been. It's the exact equivalent of (SV_GroupID * numthreads) + SV_GroupThreadID, where SV_GroupID is the current threadgroup from the dispatch call and SV_GroupThreadID is the current thread within that threadgroup. Using SV_DispatchThreadID just saves you computing it manually for every thread.

Bear in mind that dispatching with numthreads(64,1,1) isn't necessarily ideal; it's just a safe minimum thread count that utilises most GPU hardware well. You may find that changing THREADS_DIM to 128 or 256 is faster, depending on what you're actually doing in the main compute shader, the data access patterns, how much threadgroup-shared memory you're using, etc. That's another good reason to make the numthreads value a shared define: you can experiment with it quickly rather than editing it in separate files each time.

Here's some more advanced tips on compute performance: http://developer.amd.com/wordpress/media/2012/10/DirectCompute%20Performance.ppsx

Good luck.

##### Share on other sites

Thanks a lot for such a detailed answer.

Yes, I forgot to mention that they are one-dimensional resources, which is why I'm using y=1, z=1 in the numthreads definition.

I had already figured that I might need to do something like that (dividing into thread groups), but I didn't really understand the concept in relation to the hardware. Your explanation helps me a lot there.

Indeed, 1-2 million is not that much when it comes to GPU computing ... it's just that, relative to 2^16 (65535), it was quite a step that I didn't know how to take.

Thanks for the great help (+rep)


##### Share on other sites
No problem at all.

It'd be a lot quicker to only say "you just need to do this and it'll work, here's the code", but I doubt I'm alone in preferring to gain some understanding of the how and why, as well.
