DirectCompute indexing

Started by gfxCahd February 26, 2014 01:05 PM

1 comment, last by gfxCahd 10 years, 1 month ago

234

Author

February 26, 2014 01:05 PM

So I wrote this really simple compute shader. I'm processing a buffer of ints and outputting the result to an RWStructuredBuffer.

I had trouble with converting the 3D coordinates of a GPU thread to a 1D coordinate of my buffer.

The way I got around this was by passing the dispatch dimensions to the compute shader through a cbuffer.

cbuffer dispatchParams : register(b0)
{
int2 DispatchSize;
};

Then, the 1D coordinate was calculated in the shader as

[numthreads(size_x, size_y, 1)]

.

.

int index = DispatchThreadID.x + DispatchThreadID.y * (DispatchSize.x * size_x) + DispatchThreadID.z * (DispatchSize.x * size_x) * (DispatchSize.y * size_y);

Is this how it's done? It just seems counterintuitive, since the compute shader gets all this data automatically (SV_DispatchThreadID, SV_GroupID, etc.). I kinda expected that there would be a better way of doing this. Any opinions?
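For anyone who wants to check the formula above outside of a shader, here is a minimal CPU-side sketch in C++. It assumes x varies fastest, then y, then z, as in the expression in the post. UInt3, FlattenDispatchThreadID and CoversEveryIndexOnce are hypothetical names made up for this illustration and are not part of any Direct3D API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type for this sketch; it only mimics HLSL's uint3.
struct UInt3 { uint32_t x, y, z; };

// Flattens a 3D dispatch-thread ID into a 1D index (x fastest, then y, then z).
// threadsPerGroup mirrors [numthreads(...)]; groupCount mirrors the Dispatch() arguments.
uint32_t FlattenDispatchThreadID(UInt3 id, UInt3 threadsPerGroup, UInt3 groupCount)
{
    const uint32_t width  = groupCount.x * threadsPerGroup.x; // total threads along x
    const uint32_t height = groupCount.y * threadsPerGroup.y; // total threads along y
    assert(id.x < width && id.y < height);
    return id.x + id.y * width + id.z * width * height;
}

// Returns true if every thread in the dispatch maps to its own index in [0, total).
bool CoversEveryIndexOnce(UInt3 threadsPerGroup, UInt3 groupCount)
{
    const uint32_t width  = groupCount.x * threadsPerGroup.x;
    const uint32_t height = groupCount.y * threadsPerGroup.y;
    const uint32_t depth  = groupCount.z * threadsPerGroup.z;
    std::vector<bool> seen(static_cast<std::size_t>(width) * height * depth, false);
    for (uint32_t z = 0; z < depth; ++z)
        for (uint32_t y = 0; y < height; ++y)
            for (uint32_t x = 0; x < width; ++x)
            {
                const uint32_t index =
                    FlattenDispatchThreadID({x, y, z}, threadsPerGroup, groupCount);
                if (index >= seen.size() || seen[index])
                    return false; // out of range, or two threads hit the same slot
                seen[index] = true;
            }
    return true;
}
```

With 8x8x1 threads per group and a 4x2x1 dispatch, thread (3, 2, 0) lands on index 3 + 2 * 32 = 67, and the coverage check confirms that no two threads share an index.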

MJP

20,295

February 26, 2014 10:55 PM

First of all, if you're working in 1D then you can set up your threads and dispatches in 1D as well and make your life easier. Just pick the number of threads that you want per thread group (64-256 is a good choice) and use that to set up the numthreads attribute:


static const uint NumThreads = 256;

[numthreads(NumThreads, 1, 1)]
void CS()
{
}

You should never need to pass anything via a constant buffer to figure out an index, whether your index is 1D, 2D, or 3D. All of the information you need can be computed using the number of threads per thread group, and values that can be obtained using system-value semantics (SV_GroupID, SV_GroupIndex, SV_GroupThreadID, and SV_DispatchThreadID).
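The relationships between those system values can be mirrored on the CPU, which makes them easier to reason about. The following is only a sketch: UInt3, DispatchThreadID and GroupIndex are hypothetical helper names invented for this illustration, and numThreads stands for the three values given to the numthreads attribute.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for this sketch; it only mimics HLSL's uint3.
struct UInt3 { uint32_t x, y, z; };

// Mirrors SV_DispatchThreadID: per component, group ID times the group size
// plus the thread's position inside its group (SV_GroupThreadID).
UInt3 DispatchThreadID(UInt3 groupID, UInt3 groupThreadID, UInt3 numThreads)
{
    assert(groupThreadID.x < numThreads.x &&
           groupThreadID.y < numThreads.y &&
           groupThreadID.z < numThreads.z);
    return UInt3{ groupID.x * numThreads.x + groupThreadID.x,
                  groupID.y * numThreads.y + groupThreadID.y,
                  groupID.z * numThreads.z + groupThreadID.z };
}

// Mirrors SV_GroupIndex: the thread's position inside its group, flattened
// with x fastest, then y, then z.
uint32_t GroupIndex(UInt3 groupThreadID, UInt3 numThreads)
{
    return groupThreadID.z * numThreads.x * numThreads.y
         + groupThreadID.y * numThreads.x
         + groupThreadID.x;
}
```

For example, with 8x8x1 threads per group, thread (3, 5, 0) of group (2, 1, 0) has dispatch thread ID (19, 13, 0) and group index 43.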

Let's walk through a simple example so that you can understand how it all works: let's say you have a shader that's using 256 threads per group like the one I described above, and you need it to operate on a buffer with 5000 elements. Each thread is going to read a single value from this buffer, do some operation on it, then write out a single value to an output buffer that also has 5000 elements. On the CPU side of things, you need to decide how many thread groups to dispatch. Basically you need to dispatch the minimum number of thread groups required to cover the entire buffer. You can compute this number easily using a function like this:


UINT DispatchSize(UINT ThreadGroupSize, UINT NumElements)
{
    return (NumElements + (ThreadGroupSize - 1)) / ThreadGroupSize;
}
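To make the rounding concrete, here is the same function applied to the 5000-element example, as a self-contained C++ sketch. The UINT alias stands in for the Windows typedef so the snippet compiles anywhere, and IdleThreads is a hypothetical helper added only for illustration.

```cpp
#include <cassert>
#include <cstdint>

using UINT = std::uint32_t; // stand-in for the Windows UINT typedef (assumption)

// Same rounding-up division as above: the smallest number of groups that
// covers NumElements when each group handles ThreadGroupSize elements.
UINT DispatchSize(UINT ThreadGroupSize, UINT NumElements)
{
    assert(ThreadGroupSize > 0);
    return (NumElements + (ThreadGroupSize - 1)) / ThreadGroupSize;
}

// Hypothetical helper: how many dispatched threads have no element to process.
UINT IdleThreads(UINT ThreadGroupSize, UINT NumElements)
{
    return DispatchSize(ThreadGroupSize, NumElements) * ThreadGroupSize - NumElements;
}
```

DispatchSize(256, 5000) gives 20 groups, which is 5120 threads in total, so 120 threads in the last group fall past the end of the buffer.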

Now in your shader, you need to compute a "flattened" index that you'll use as an address for reading from your input buffer and writing to your output buffer. For a 1D case, you basically want (GroupID * ThreadGroupSize) + ThreadID, where GroupID is the index of the thread group (so it will have the range [0, 19], since 5000 elements need 20 groups of 256) and ThreadID is the index of the thread within a thread group (so it will have the range [0, 255]). GroupID is available via SV_GroupID and ThreadID is available via SV_GroupThreadID, which means you can calculate this manually if you wish. However, you can also have it provided automatically by using SV_DispatchThreadID:
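The claim that (GroupID * ThreadGroupSize) + ThreadID touches every element exactly once can be checked with a small CPU-side loop. This is a sketch under stated assumptions rather than anything from the D3D API: FlatIndex, CountHitsPerElement and AllHitExactlyOnce are hypothetical names, and threads whose index falls past the end of the buffer are simply skipped here.

```cpp
#include <cstdint>
#include <vector>

// The 1D flattened index described above.
uint32_t FlatIndex(uint32_t groupID, uint32_t threadGroupSize, uint32_t threadID)
{
    return groupID * threadGroupSize + threadID;
}

// Walks every (group, thread) pair of a 1D dispatch and counts how many
// threads land on each element. Out-of-range threads are ignored in this sketch.
std::vector<uint32_t> CountHitsPerElement(uint32_t threadGroupSize,
                                          uint32_t numGroups,
                                          uint32_t numElements)
{
    std::vector<uint32_t> hits(numElements, 0);
    for (uint32_t groupID = 0; groupID < numGroups; ++groupID)
        for (uint32_t threadID = 0; threadID < threadGroupSize; ++threadID)
        {
            const uint32_t index = FlatIndex(groupID, threadGroupSize, threadID);
            if (index < numElements)
                ++hits[index];
        }
    return hits;
}

// True if every element was visited by exactly one thread.
bool AllHitExactlyOnce(const std::vector<uint32_t>& hits)
{
    for (uint32_t h : hits)
        if (h != 1)
            return false;
    return true;
}
```

With 20 groups of 256 threads and 5000 elements, every element is hit once, and the last thread of the last group gets index 19 * 256 + 255 = 5119. Dropping to 19 groups leaves the tail of the buffer unprocessed.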

StructuredBuffer<int> InputBuffer;
RWStructuredBuffer<int> OutputBuffer;

static const uint NumThreads = 256;

[numthreads(NumThreads, 1, 1)]
void CS(in uint3 DispatchIdx : SV_DispatchThreadID)
{
    // DispatchIdx.x is already (GroupID.x * NumThreads) + GroupThreadID.x
    int num = InputBuffer[DispatchIdx.x];
    num = num * 8 + 4;
    OutputBuffer[DispatchIdx.x] = num;
}
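If you want something to compare the GPU output against, a CPU reference of the same operation is easy to write. The sketch below is an assumption-laden stand-in, not the shader itself: ReferenceCS is a hypothetical name, and the explicit bounds check is a choice made for the CPU version so that threads past the end of the buffer do nothing.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for the shader above: one simulated thread per dispatch index,
// each reading one input value and writing num * 8 + 4 to the same slot.
std::vector<int> ReferenceCS(const std::vector<int>& input,
                             uint32_t numThreads,
                             uint32_t numGroups)
{
    std::vector<int> output(input.size(), 0);
    const std::size_t totalThreads = static_cast<std::size_t>(numThreads) * numGroups;
    for (std::size_t dispatchIdx = 0; dispatchIdx < totalThreads; ++dispatchIdx)
    {
        if (dispatchIdx >= input.size())
            continue; // assumption for this sketch: extra threads are skipped
        const int num = input[dispatchIdx];
        output[dispatchIdx] = num * 8 + 4;
    }
    return output;
}
```

Elements that no simulated thread reaches keep their initial value of 0, which makes an undersized dispatch easy to spot when comparing results.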

gfxCahd

234

Author

February 27, 2014 10:05 AM

Wow, thanks for the very detailed post.

So, I think your point is that I should create thread groups and their threads in the dimensions that suit my problem, and thus the conversion from 2D "thread space" to 1D "data space" becomes moot (as, if I'm not wrong, for a 2D or 3D thread to determine which 1D data to work on, it still needs the dispatch size to be passed in the form of a cbuffer).

And after all, the arrangement of x, y, z dimensions doesn't actually correspond to anything physical in the GPU; it's just an option for the programmer to use the convention most suitable to the problem, correct?

Thanks again.

This topic is closed to new replies.
