Hi,
I'm currently developing a tile-based forward renderer (like Forward+) for a university project, and this week I began work on the light culling stage using a compute shader. I am a little inexperienced with compute shaders and parallel programming, but I do know that you should avoid dynamic branching as much as possible.
The following code is what I have so far. In the application code (not shown), I launch enough thread groups to cover the entire screen, with each thread group containing (n, n, 1) threads (n is 8 in my example, although you can change this). Thus, there is one thread per pixel.
The idea is for each thread in a thread group to sample the depth buffer (an MSAA depth buffer is supported) and store the value in shared memory. The shader then loops through all of these depth values and works out which is the largest. (I am supporting transparency with the trivial solution of fixing the minimum depth at 0. This was suggested in GPU Pro 4 as a potential solution for the time being. I have an idea which uses 2 depth buffers to better support transparency, but for now we will stick with what I've got.)
However, in order to do this, I have had to add an if statement. This if statement checks the group thread ID to ensure that only the first thread in every thread group executes the code, or at least that was the idea (EDIT: bold and enlarged text didn't work; the line you are looking for is "if (groupThreadID.x == 0 && groupThreadID.y == 0 && groupThreadID.z == 0)"):
//Num threads per thread group. One thread per pixel. This is a 2D thread group. Shared
//memory will be used (shared between threads in the same thread group) to cache the
//depth value from the depth buffer. For this pass, we have one thread group per tile
//and a thread per pixel in the tile.
[numthreads (TILE_PIXEL_RESOLUTION, TILE_PIXEL_RESOLUTION, 1)]
void CSMain(
in int3 groupID : SV_GroupID, //Uniquely identifies each thread group
in int3 groupThreadID : SV_GroupThreadID, //Uniquely identifies a thread inside a thread group.
in int3 dispatchThreadID : SV_DispatchThreadID, //Uniquely identifies a thread relative to ALL threads generated in a Dispatch() call
uniform bool useMSAA) //MSAA Enabled? Sample MSAA DEPTH Buffer
{
//Stage 1 - We sample the depth buffer and work out what the maximum Z value is for every tile.
//This is done by looping through all the depth values of the pixels that share the same
//tile and comparing them.
//
//We then write this data to the MaxZTileBuffer RWBuffer (Optional). This data is handy
//for stage 2 where we can cull more lights based on this maximum depth value.
//Load value to sample the depth buffer.
int3 sampleVal = int3( (dispatchThreadID.x), (dispatchThreadID.y), 0);
//This is the sampled depth value from the depth buffer for this given thread.
//If MSAA is used (i.e. an MSAA-enabled depth buffer), this will represent the average
//of the 4 samples.
float sampledDepth = 0.0f;
//Sample MSAA buffer if MSAA is enabled
[flatten]
if (useMSAA)
{
//Sample the buffer (4 times)
float s0 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 0).r;
float s1 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 1).r;
float s2 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 2).r;
float s3 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 3).r;
//Average out.
sampledDepth = (s0 + s1 + s2 + s3) / 4.0f;
}
//Sample standard buffer
else
sampledDepth = ZPrePassDepthBuffer.Load(sampleVal).r;
//Write to the (thread group) shared memory and wait for threads to complete their work.
depthCache[groupThreadID.x][groupThreadID.y] = sampledDepth;
GroupMemoryBarrierWithGroupSync();
//Only one thread in the thread group should perform this check and then the
//write to our MaxZTileBuffer.
if (groupThreadID.x == 0 && groupThreadID.y == 0 && groupThreadID.z == 0)
{
//Loop through the shared pool (essentially a 2D array) and workout what the maximum
//value is for this thread group (Tile).
//Store the maximum value in the following floating point variable - Init to 0.0f.
float maxDepthVal = 0.0f;
//Unroll - the loop bounds are known at compile time - The compiler will happily
//do this for us, but just in case.
[unroll]
for (int i = 0; i < TILE_PIXEL_RESOLUTION; i++)
{
for (int j = 0; j < TILE_PIXEL_RESOLUTION; j++)
{
//Extract value from the depth cache.
float depthToTest = depthCache[i][j];
//Test and update if larger than the already stored value.
if (depthToTest > maxDepthVal)
maxDepthVal = depthToTest;
}//End for j
}//End for i
//Write to Max Z Tile Buffer for use in the second pass - Only one thread in a thread
//group should do this.
//
//Note, we can turn this feature off, since buffer writes are expensive and this
//write is not actually required (though it is needed if we want to visualise the
//tiles' max depth values). A #define is used to enable/disable
//the buffer write.
#ifdef SHOULD_WRITE_TO_MAX_Z_TILE_BUFFER
int tilesX = (int)ceil(rtWidth / (float)TILE_PIXEL_RESOLUTION);
int maxZTileIndex = groupID.x + (groupID.y * tilesX);
MaxZTileBuffer[maxZTileIndex] = maxDepthVal;
#endif
//Stage 2 - In this stage, we will build our LLIB (Light List Index Buffer - essentially
//a list which indexes into the Light List Buffer and tells us which lights affect
//a given tile) and our LLSEB (Light List Start End Buffer - a list which indexes
//into the LLIB).
}//End if(...)
}//End CSMain()
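One way to avoid the single-thread loop above is to let every thread feed a groupshared atomic instead. The sketch below is an assumption about how that could look, not the code from this project: the register bindings, the PerFrame constant buffer layout and the entry point name CSMaxDepth are made up for illustration, and the MSAA path is left out for brevity. It relies on depth values being non-negative, in which case the bit pattern returned by asuint() orders the same way as the float itself.

```hlsl
// Hypothetical variant of stage 1: every thread contributes to the tile
// maximum through a groupshared atomic, so no thread loops over the tile.
#define TILE_PIXEL_RESOLUTION 8

Texture2D<float> ZPrePassDepthBuffer : register(t0); // assumed binding
RWBuffer<float>  MaxZTileBuffer      : register(u0); // assumed binding

cbuffer PerFrame : register(b0) // assumed layout
{
    uint rtWidth;
    uint rtHeight;
};

// Tile maximum, stored as the raw bits of a non-negative float.
groupshared uint tileMaxDepthBits;

[numthreads(TILE_PIXEL_RESOLUTION, TILE_PIXEL_RESOLUTION, 1)]
void CSMaxDepth(
    uint3 groupID          : SV_GroupID,
    uint3 dispatchThreadID : SV_DispatchThreadID,
    uint  groupIndex       : SV_GroupIndex)
{
    // One thread resets the shared value; everyone waits until it is visible.
    if (groupIndex == 0)
        tileMaxDepthBits = 0;
    GroupMemoryBarrierWithGroupSync();

    // Clamp so that threads in partial tiles at the screen edge re-read a
    // valid texel instead of going out of bounds.
    uint2 coord = min(dispatchThreadID.xy, uint2(rtWidth - 1, rtHeight - 1));
    float depth = ZPrePassDepthBuffer.Load(int3(coord, 0));

    // For non-negative floats, comparing the bit patterns as unsigned
    // integers gives the same result as comparing the floats.
    InterlockedMax(tileMaxDepthBits, asuint(depth));
    GroupMemoryBarrierWithGroupSync();

    // A single thread writes the result for the tile.
    if (groupIndex == 0)
    {
        uint tilesX = (rtWidth + TILE_PIXEL_RESOLUTION - 1) / TILE_PIXEL_RESOLUTION;
        MaxZTileBuffer[groupID.x + groupID.y * tilesX] = asfloat(tileMaxDepthBits);
    }
}
```

The two remaining `if (groupIndex == 0)` branches each guard a single store, so the other threads in the group sit idle for very little work.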
Now, my limited understanding of dynamic branching in shaders suggests this may not be a good move: each thread will execute the code within the block and then decide later whether the result should be kept or discarded (in order to preserve parallelism?). That is not ideal, particularly when I am going to do >3000 sphere/frustum intersections in stage 2.
Or, since all but one thread in a thread group will not actually execute the code (63 threads skip it in our example), does the hardware actually do a pretty good job of handling this?
(My test GPU is a 650M (the laptop that I work on at uni) or a 570 (at home; I will be upgrading to a 770/680 in the near future). I am led to believe that on modern GPUs dynamic branching is less of a concern, although I don't really understand why :P)
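For the sphere/frustum tests in stage 2, a commonly used pattern is to split the light list across the threads of the group rather than have one thread test every light. The following is only a sketch under assumptions: PointLight, Lights, TileFrustumPlanes, CullConstants, MAX_LIGHTS_PER_TILE, SphereInsideFrustum and CSCullLights are hypothetical names, and the tile frustum planes are assumed to be precomputed elsewhere (four inward-facing side planes per tile, in view space).

```hlsl
#define TILE_PIXEL_RESOLUTION 8
#define THREADS_PER_TILE (TILE_PIXEL_RESOLUTION * TILE_PIXEL_RESOLUTION)
#define MAX_LIGHTS_PER_TILE 256 // assumed cap, not from the original post

struct PointLight // hypothetical layout
{
    float3 positionVS; // view-space centre
    float  radius;
};

StructuredBuffer<PointLight> Lights            : register(t1); // assumed
StructuredBuffer<float4>     TileFrustumPlanes : register(t2); // assumed, 4 per tile

cbuffer CullConstants : register(b1) // assumed layout
{
    uint numLights;
    uint tilesX;
};

groupshared uint tileLightCount;
groupshared uint tileLightIndices[MAX_LIGHTS_PER_TILE];

// True when the sphere is not completely outside any of the four planes.
// Each plane is (xyz = inward normal, w = distance).
bool SphereInsideFrustum(float3 centre, float radius, float4 planes[4])
{
    bool inside = true;
    [unroll]
    for (uint p = 0; p < 4; ++p)
        inside = inside && (dot(planes[p].xyz, centre) + planes[p].w >= -radius);
    return inside;
}

[numthreads(TILE_PIXEL_RESOLUTION, TILE_PIXEL_RESOLUTION, 1)]
void CSCullLights(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        tileLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Fetch this tile's four side planes.
    uint tileIndex = groupID.x + groupID.y * tilesX;
    float4 planes[4];
    [unroll]
    for (uint k = 0; k < 4; ++k)
        planes[k] = TileFrustumPlanes[tileIndex * 4 + k];

    // Thread t tests lights t, t + 64, t + 128, ... so the whole group
    // shares the work instead of one thread doing all of it.
    for (uint i = groupIndex; i < numLights; i += THREADS_PER_TILE)
    {
        PointLight light = Lights[i];
        if (SphereInsideFrustum(light.positionVS, light.radius, planes))
        {
            uint slot;
            InterlockedAdd(tileLightCount, 1, slot);
            if (slot < MAX_LIGHTS_PER_TILE) // drop lights beyond the cap
                tileLightIndices[slot] = i;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // tileLightIndices now holds min(tileLightCount, MAX_LIGHTS_PER_TILE)
    // entries; writing them out to the LLIB / LLSEB is omitted here.
}
```

With 64 threads per tile and about 3000 lights, each thread would test roughly 47 lights rather than one thread testing all of them.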
Many thanks,
Dan.