The next couple of Direct3D 11 tips that I'll be writing about will discuss the compute shader, which I personally consider the single most important addition to the D3D API since programmable shaders - which says a lot about what you can do with it! The general concept of the compute shader is to give the programmer greater control over the threading model of modern GPUs, and to break down some of the barriers that isolate threads from one another. This allows a significant amount of general-purpose programming to be performed on the GPU - if there is any reason to start working with D3D11, this is it! So, this first tip is going to be more of an introduction to the compute shader than a tip per se, but I'm going to keep rolling with the same format.
Compute Shaders Overview
To get things started, let's discuss what the compute shader is. It is a programmable shader stage that is not part of the rendering pipeline. It is conceptually a standalone processing element that has access to the majority of the functionality available in the common shader core, but with some important additional functionality. The two most important concepts are that you can now control precisely how threads are used in processing data, and that some of those threads are allowed to synchronize with each other and share memory - which is a big departure from the other shader stages.
Before diving into more detail, it may help to take a high-level overview of what we have access to. The first difference you will notice from the application's point of view is that you execute a compute shader with a 'Dispatch' call instead of one of the 'Draw' calls. Dispatch takes three UINT arguments, each one specifying the number of 'thread groups' along one of the three principal axes. A 'thread group' represents one invocation of the compute shader, meaning that if you call:
m_pContext->Dispatch( 4, 4, 2 );
then your compute shader will be instantiated a total of 32 times. You can visualize these thread groups as blocks within a cube. In this example, there would be four blocks across, four blocks up, and two blocks deep.
The individual invocations of the compute shader (each block from our visualization) have access to identification numbers derived from the Dispatch call, which can be used to have the compute shader select different portions of the input or output data structures to operate on. More detail on this identification information comes a little later, but this represents the extent of the work that the application needs to do to execute the compute shader, aside from binding the required resources and constant buffers...
Compute Shader Resources
Resources present another difference in how the application supplies an input/output path to the compute shader stage. To allow even more general processing functionality, D3D11 provides a new resource view called an 'Unordered Access View', or UAV. This type of view allows the compute shader (as well as the pixel shader) to have both read and write access to a resource (a resource being either a texture or a buffer). This is also a big departure from the normal shader paradigm - typically you can either read from or write to a resource at any given time, but not both. UAVs change this, and give the compute shader a lot of freedom.
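As a quick sketch of what this looks like on the HLSL side (the buffer name and register slot here are illustrative), a resource bound through a UAV is declared with one of the read/write resource types:

```
// A structured buffer bound through a UAV at slot u0; the compute
// shader can both read and write its elements.
RWStructuredBuffer<float> OutputBuffer : register( u0 );
```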
Compute Shaders in HLSL
Now we move on to the actual compute shader, and how it is implemented in HLSL. The first portion of the compute shader declaration that we'll discuss is the 'numthreads' declaration. Here is a sample declaration that appears prior to the shader function:
[numthreads( 32, 32, 1 )]
This specifies that 32x32x1 threads will be created and executed for each thread group instantiated. This indirectly allows the application to specify how many threads will be executed overall, since the number of thread groups is specified in the Dispatch call as described above. Notice that there are three parameters here as well; they identify the individual threads for use within the shader itself.
Alright, so now we can execute several thread groups with a specified number of threads - the actual shader execution is the next topic to consider. A shader is declared with the following syntax:
void CSMAIN( uint3 GroupID : SV_GroupID,
             uint3 DispatchThreadID : SV_DispatchThreadID,
             uint3 GroupThreadID : SV_GroupThreadID,
             uint GroupIndex : SV_GroupIndex )
Each of the arguments passed into the shader function is a system value semantic parameter. These parameters serve as the identifiers I mentioned above - each one identifies the thread and/or thread group that this invocation of the shader function represents. The system values are generated as follows:
SV_GroupID: a uint3 identifying the current thread group, with each component ranging from 0 up to one less than the corresponding argument of the Dispatch call. Thus it gives you a 3D thread group index.
SV_GroupThreadID: a uint3 similar to SV_GroupID, except that this system value identifies the thread within the current thread group. It is a 3-tuple, with each component ranging from 0 up to one less than the corresponding 'numthreads' dimension.
SV_DispatchThreadID: a uint3 identifying the current thread across the complete dispatch call - that is, a global thread ID over all thread groups. It is equal to SV_GroupID * numthreads + SV_GroupThreadID, computed per component.
SV_GroupIndex: a uint giving a flattened index of the current thread within its group. This is essentially the index you would use to address a 3D array stored in a 1D array, which is useful for different addressing schemes in the shader.
These system values, along with the new resource types, give you almost everything you need to implement just about any algorithm you want. Next time around, we'll look at how to declare the various resource types in the application as well as in HLSL, and how to use these handy new system values to index and operate on them for your benefit!