Light Culling on the GPU for a Tile Based Forward Renderer.


Hi,

I'm currently in the process of developing a tile-based forward renderer (like Forward+) for a university project, and this week I have begun work on the light culling stage, utilising the compute shader. I am a little inexperienced with regard to the compute shader and parallel programming, but I do know that you should avoid dynamic branching as much as possible.

The following code is what I have so far. In the application code (not shown), I launch enough thread groups to cover the entire screen, with each thread group containing n,n,1 threads (n is 8 in my example, although you can change this). Thus, a thread per pixel.

The idea is for each thread in a thread group to sample the depth buffer (MSAA depth buffer supported) and store the result in shared memory, then loop through all these depth values and work out which is the largest. (I am supporting transparency with the trivial solution of treating the minimum depth as 0. This was suggested in GPU Pro 4 as a potential solution for the time being. I have an idea which uses two depth buffers to better support transparency, but for now we will stick with what I've got.)

However, in order to do this, I have had to add an if statement. This if statement checks the group thread ID to ensure that only the first thread in every thread group executes the code - or at least, that was the idea (EDIT: Bold and Enlarge didn't work - you are hunting for this line: "if (groupThreadID.x == 0 && groupThreadID.y == 0 && groupThreadID.z == 0)"):


//Num threads per thread group. One thread per pixel. This is a 2D thread group. Shared
//memory will be used (shared between threads in the same thread group) to cache the
//depth value from the depth buffer. For this pass, we have one thread group per tile
//and a thread per pixel in the tile.
[numthreads (TILE_PIXEL_RESOLUTION, TILE_PIXEL_RESOLUTION, 1)]
void CSMain(
    in int3 groupID            : SV_GroupID,           //Uniquely identifies each thread group
    in int3 groupThreadID      : SV_GroupThreadID,     //Uniquely identifies a thread inside a thread group.
    in int3 dispatchThreadID   : SV_DispatchThreadID,  //Uniquely identifies a thread relative to ALL threads generated in a Dispatch() call
    uniform bool useMSAA)                              //MSAA Enabled? Sample MSAA DEPTH Buffer
{
    //Stage 1 - We sample the depth buffer and work out what the maximum Z value is for every tile.
    //This is done by looping through all the depth values of the pixels that share the same
    //tile and comparing them.
    //
    //We then write this data to the MaxZTileBuffer RWBuffer (Optional). This data is handy
    //for stage 2 where we can cull more lights based on this maximum depth value.
    
    //Load value to sample the depth buffer.
    int3 sampleVal = int3( (dispatchThreadID.x), (dispatchThreadID.y), 0);
    
    //This is the sampled depth value from the depth buffer for this given thread.
    //If msaa is used (Ie, MSAA enabled depth buffer), this will represent the average
    //of the 4 samples.
    float sampledDepth = 0.0f;

    //Sample MSAA buffer if MSAA is enabled
    [flatten]
    if (useMSAA)
    {
        //Sample the buffer (4 times)
        float s0 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 0).r;
        float s1 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 1).r;
        float s2 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 2).r;
        float s3 = ZPrePassDepthBufferMSAA.Load(sampleVal.xy, 3).r;
        
        //Average out.
        sampledDepth = (s0 + s1 + s2 + s3) / 4.0f;
    }
    //Sample standard buffer  
    else
        sampledDepth = ZPrePassDepthBuffer.Load(sampleVal).r;

    //Write to the (thread group) shared memory and wait for threads to complete their work.
    depthCache[groupThreadID.x][groupThreadID.y] = sampledDepth;
    GroupMemoryBarrierWithGroupSync();

    //Only one thread in the thread group should perform this check and then the
    //write to our MaxTileZBuffer.
    if (groupThreadID.x == 0 && groupThreadID.y == 0 && groupThreadID.z == 0)
    {
        //Loop through the shared pool (essentially a 2D array) and workout what the maximum
        //value is for this thread group (Tile).
        //Store the maximum value in the following floating point variable - Init to 0.0f.
        float maxDepthVal = 0.0f;
        //Unroll - i and j are known at compile time - the compiler will happily
        //do this for us, but just in case.
        [unroll]
        for (int i = 0; i < TILE_PIXEL_RESOLUTION; i++)
        {  
            for (int j = 0; j < TILE_PIXEL_RESOLUTION; j++)
            {
                //Extract value from the depth cache.
                float depthToTest = depthCache[i][j];
                //Test and update if larger than the already stored value.
                if (depthToTest > maxDepthVal)
                    maxDepthVal = depthToTest;
            }//End for j
        }//End for i

        //Write to the Max Z Tile Buffer for use in the second pass - only one thread in a
        //thread group should do this.
        //
        //Note, we can turn this feature off (buffer writes are expensive). Since this is
        //not actually required - though it is needed if we want to visualise the tiles'
        //max depth values - a #define has been used to enable/disable the buffer write.
#ifdef SHOULD_WRITE_TO_MAX_Z_TILE_BUFFER
        int tilesX = ceil( (rtWidth  / (float)TILE_PIXEL_RESOLUTION) );
        int maxZTileIndex = groupID.x + (groupID.y * tilesX);
        MaxZTileBuffer[maxZTileIndex] = maxDepthVal;  
#endif

        //Stage 2 - In this stage, we will build our LLIB (Light List Index Buffer - essentially
        //a list which indexes in to the Light List Buffer and tells us which lights affect
        //a given tile) and our LLSEB (Light List Start End Buffer - a list which indexes
        //in to the LLIB).

    }//End if(...)
}//End CSMain()

Now, my limited understanding of dynamic branching in shaders suggests this may not be a good move - every thread will execute the code within the block and the results will simply be kept or discarded afterwards (in order to keep the threads in lock-step??). Not ideal, particularly when I am going to do >3000 sphere/frustum intersections in stage 2.

Or, since all but one thread in a thread group will not actually execute the code, does the hardware actually do a pretty good job of handling this? (63 threads not doing it in our example.)

(My test GPUs are a 650M (the laptop I work on at uni) and a 570 (at home - I will be upgrading to a 770/680 in the near future). I am led to believe that on modern GPUs dynamic branching is less of a concern, although I don't really understand why :P)

Many thanks,

Dan.


There are a lot of cases where it's useful or required to use dynamic branching in this way, where one thread needs to perform some operations. But I'm not sure this is what you really want to be doing here, since you are using a single thread to loop over the entire group's shared data. Have you looked into InterlockedMax? You should be able to compute your max depth without any looping over the data, or any cache beyond a single groupshared uint. Just remember you will need to convert your sampled depth to uint first, and then convert the final max depth back to a float before storing it in your max z buffer.
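
Something like this, as a rough sketch (the names here are placeholders, not taken from your code):


//Rough sketch of the InterlockedMax approach (all names assumed). Depth values
//are non-negative, so reinterpreting the float bits as a uint preserves their
//ordering, which is what lets an integer atomic compute a float maximum.
groupshared uint gsMaxDepthUint;

//...inside CSMain, once each thread has sampled its depth into sampledDepth...
if (groupThreadIndex == 0)
    gsMaxDepthUint = 0;                      //groupshared memory starts undefined
GroupMemoryBarrierWithGroupSync();

InterlockedMax(gsMaxDepthUint, asuint(sampledDepth));
GroupMemoryBarrierWithGroupSync();           //wait for every thread's atomic

float maxTileDepth = asfloat(gsMaxDepthUint);    //back to float before writing out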

Edit:

As far as the branching goes, it's perfectly ok to have the first thread do some work if it's going to be shared across the group. For example, computing the frustum for each tile would just need to be done for the one thread. Obviously only do this in cases where it only makes sense for one thread to do it.
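
For the tile frustum example, here's a rough sketch of what that one-thread setup might look like (everything here is assumed: gsTileFrustumPlanes, ConvertTileCornerToViewSpace(), and the corner/winding convention - flip the cross products if your normals come out facing the wrong way):


//Rough sketch (assumed names): one thread builds the tile's four side planes.
//In view space the eye sits at the origin, so each side plane passes through
//the origin and two of the tile's corner directions; w (the plane distance)
//is therefore 0 and the normal is just a cross product.
if (groupThreadIndex == 0)
{
    uint2 tileOrigin = groupID.xy * TILE_PIXEL_RESOLUTION;

    float3 corners[4];
    corners[0] = ConvertTileCornerToViewSpace(tileOrigin + uint2(0, 0));
    corners[1] = ConvertTileCornerToViewSpace(tileOrigin + uint2(TILE_PIXEL_RESOLUTION, 0));
    corners[2] = ConvertTileCornerToViewSpace(tileOrigin + uint2(TILE_PIXEL_RESOLUTION, TILE_PIXEL_RESOLUTION));
    corners[3] = ConvertTileCornerToViewSpace(tileOrigin + uint2(0, TILE_PIXEL_RESOLUTION));

    [unroll]
    for (uint p = 0; p < 4; ++p)
        gsTileFrustumPlanes[p] = float4(normalize(cross(corners[p], corners[(p + 1) % 4])), 0.0f);
}
GroupMemoryBarrierWithGroupSync();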

Based on the comments it looks like you have your light list building code inside that if statement. This can be distributed as well. I believe one of the tile based shading articles in GPU Pro 4 shows how to do this, and I'm pretty sure it's done in this sample. Something like this:


// Get a list of the lights that affect the tile
for(uint i=0; i<g_NumLights; i+=GROUP_SIZE)
{
    uint lightIndex = groupIndex + i;
    if( lightIndex < g_NumLights )
    {
        // Check if the light affects this tile (placeholder - typically a
        // sphere vs. tile-frustum test; see the sketch after this snippet)
        if( LightIntersectsTile(lightIndex) )
        {
            // Update the counter
            uint currentTileLightIndex;
            InterlockedAdd(sharedTileLightCounter, 1, currentTileLightIndex);

            // Add the light to the list
            if(currentTileLightIndex < MAX_LIGHTS)
                g_LightList[currentTileLightIndex] = lightIndex;
        }
    }
}
GroupMemoryBarrierWithGroupSync();
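
That visibility check would typically be a sphere vs. tile-frustum test. A rough sketch of one (all of these names - gsTileFrustumPlanes, gsTileMinZ/gsTileMaxZ, g_Lights - are assumed, and the planes are expected in view space with normals pointing into the frustum):


//Rough sketch (assumed names) of the per-tile light visibility test.
bool LightIntersectsTile(uint lightIndex)
{
    float3 centre = g_Lights[lightIndex].positionView;   //view-space position
    float  radius = g_Lights[lightIndex].radius;

    //Cheap depth-bounds rejection first.
    if (centre.z + radius < gsTileMinZ || centre.z - radius > gsTileMaxZ)
        return false;

    //Then test the sphere against the four side planes of the tile frustum.
    //If the sphere lies entirely outside any plane, it cannot touch the tile.
    [unroll]
    for (uint p = 0; p < 4; ++p)
    {
        if (dot(gsTileFrustumPlanes[p].xyz, centre) + gsTileFrustumPlanes[p].w < -radius)
            return false;
    }
    return true;
}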

Thank you very much for your reply.

I'll be looking right into the Interlocked functions - I have seen source code for a tile-based forward renderer that made use of them and had a solution which seemed extremely clean (as one would expect from the developers over at AMD) and didn't have my issues above.

Also, thanks for the link to the paper/source code - I haven't had a look yet, but I would be surprised if it didn't feature somewhere in the report at the very least!

Once again, thanks for the help! :)

Okay a few things...

1. You need the max and min depth per-tile to perform optimal culling for opaques. Without the min depth you could have lights that are "floating in air" that never touch a surface, but you'll still include them in your per-tile lists. Using the min depth won't work for transparents, but for transparents you can generate a second list per-tile so that the opaques are more efficient.

2. Like Pachanoi mentioned, having one thread determine the min/max depth for a tile is a really inefficient way to do it. It will be slow, since one thread has to read a lot of values out of shared memory and perform some math, and you'll also use a lot of shared memory, which can hurt overall shader performance. A simpler way to do it is to use InterlockedMax and InterlockedMin on a shared memory variable, as Pachanoi suggests. That sample he linked to does this, and you'll notice it has to do a few tricks since InterlockedMin/Max only works on integers. Atomics still aren't that great though, since they effectively serialize the memory access among all threads. To avoid using them, you can use a parallel reduction instead (a rough sketch follows at the end of this post). However, a parallel reduction requires more shared memory, so it may not always be a win.

3. Don't take the average of your MSAA values within a pixel, that doesn't make any sense. Compute the min and max depth values, and use them as the initial min and max depth that you use to compare against the other thread values.

To clear up your misconception about branching: the hardware doesn't always execute both sides of a branch. It only does this when threads within a nearby group of threads don't all take the same path. This "nearby group of threads" is called a warp on Nvidia hardware and is made up of 32 threads, and is called a wavefront on AMD hardware and is made up of 64 threads. So for example, let's take your branch where you only do some work if the thread index is 0. I'll assume that your tile size is 16x16, which means you have 256 threads in your thread group. On Nvidia hardware, the first warp will have 1 thread take the branch and 31 not take it. This means the other 31 still have to step through the work inside the branch. However, for all subsequent warps in the thread group, no threads take the branch, which means those warps can skip it entirely.
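
For reference, here is a rough sketch of the parallel reduction version (nothing here is tied to your actual code - NUM_THREADS, pixelMinDepth and pixelMaxDepth are assumed, and NUM_THREADS must be a power of two for this form):


//Rough sketch (assumed names) of a min/max parallel reduction over a tile.
groupshared float gsMinDepth[NUM_THREADS];   //NUM_THREADS = TILE_SIZE * TILE_SIZE
groupshared float gsMaxDepth[NUM_THREADS];

//...inside CSMain: each thread first writes its own pixel's min/max (for MSAA,
//the min/max of that pixel's sub-samples rather than their average)...
gsMinDepth[groupThreadIndex] = pixelMinDepth;
gsMaxDepth[groupThreadIndex] = pixelMaxDepth;
GroupMemoryBarrierWithGroupSync();

//Each step halves the number of active threads, combining pairs of values,
//until the result for the whole tile ends up in element 0.
[unroll]
for (uint s = NUM_THREADS / 2; s > 0; s >>= 1)
{
    if (groupThreadIndex < s)
    {
        gsMinDepth[groupThreadIndex] = min(gsMinDepth[groupThreadIndex], gsMinDepth[groupThreadIndex + s]);
        gsMaxDepth[groupThreadIndex] = max(gsMaxDepth[groupThreadIndex], gsMaxDepth[groupThreadIndex + s]);
    }
    GroupMemoryBarrierWithGroupSync();
}

float tileMinDepth = gsMinDepth[0];
float tileMaxDepth = gsMaxDepth[0];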

Hi,

Thanks for your reply.

In response to 1): the project is only meant to take 400 hours total, which includes a 25k-word report, all the preliminary education (D3D11) and the application framework. I'm lucky enough to have put in some time over the summer to learn D3D11 and build a decent enough framework (for this purpose, anyway), so I don't have any time issues yet, but I have still had to make some decisions to simplify the program - this being one of them.

I will be using the suggestion from GPU Pro 4 - just assuming the minimum depth is 0.0f, so each tile's bounding volume extends from a depth of zero up to the maximum depth of the opaque objects (thus supporting transparency, although inefficiently, as you say).

However, I'm still yet to decide which marking scheme to go with - the other places more emphasis on the product (i.e. a more complete product and more design documentation, but a much smaller report) and requires that we follow an accepted development path (e.g. an iterative method). There isn't anything stopping me changing mark schemes for another few weeks. So if I develop the simple solution now (which is more than enough for the original mark scheme - as long as it can demonstrate support for transparency and MSAA, we're golden, and the method you suggested (individual tile lists for opaque and transparent objects) can go into my report as part of the improvements section), I can then extend the functionality at a later date (I'll claim this as an iterative method of some sort).

I'm going to give InterlockedMax a proper look tomorrow. I have some source code from an AMD demo on Forward+ (I think) which makes use of InterlockedMax and InterlockedMin, and demonstrates how to convert from float to uint and back again. If that solution is good enough for AMD, it should be fine for my needs. :)

Thanks for clearing up the branching - that has helped a lot! :)

Also, I'm a rather big fan of (I presume) your book - Practical Rendering and Computation. It's actually one of the major influences in my decision to do a rendering system as my final year project, after implementing the deferred renderer in the book!

Lot of good stuff in that book I must say! :)

Yes, that is indeed my book, although Jason wrote most of it! I'm glad that you find it useful. :)

Thanks to the suggestions above and some source code from AMD, I have redesigned my code a little this morning:


//
//Shared Memory
//
//Groupshared memory which holds the maximum depth value of all pixels within a tile -
//which, conveniently, is also one of our thread groups. We use the atomic InterlockedMax
//function to write to this in a thread safe manner. Note this is a uint - the atomic
//functions only work with uints or ints and NOT floats. Happily, asuint() and asfloat()
//will convert between the types with no extra work (they reinterpret the bit pattern,
//which preserves ordering for non-negative depths).
//
//Groupshared memory is memory that is read/write between threads within a single
//thread group. Relatively fast - especially compared to resource reads/writes.
//
//Stored in the [0,1] range. (NDC z is often quoted as [-1,1], but the depth buffer
//stores it in the [0,1] range, as do we. Plus, my
//ConvertSampledDepthFromDepthBufferToViewSpace() function expects it in the
//[0,1] range.)
//
//Note: groupshared variables cannot be given an initialiser, so this is zeroed by
//the first thread in CSMain before any atomics are issued.
groupshared uint MaxGroupSharedDepthNDC;

//
//Compute Shader helper functions
//
//Converts a sampled depth value back to view space depth. Please note: The depth
//value passed in to this function (z), should be taken directly from the depth buffer.
//NDC space has the range [-1,1] whilst the depth buffer stores things to [0,1]. Our
//function handles this.
//
//With view space depth acquired, it is possible (given screen space xy coords) to
//reconstruct the full 3D position (view space) of any pixel in our scene.
//
//Credit to the Forward+ Source Code for this solution. I did actually test it out
//in my little maths demo application and it looks right.
float ConvertSampledDepthFromDepthBufferToViewSpace(float z)
{
    //From [0,1] to [-1,1]
    float depth = z * 2.0f - 1.0f;
    //Calculate
    float viewDepth = 1.0f / (depth*invProjection._34 + invProjection._44);
    //And return
    return viewDepth;
}

//Function that samples the depth buffer (Or MSAA Depth buffer) and uses
//the InterlockedMax atomic function to store it (Note: As a UINT - HLSL
//functions asuint and asfloat are useful for storing data here and then
//converting the return from uint to float. These functions work on the
//bit patterns).
void ThreadCalculateMaxDepth(int3 dispatchID, uniform bool msaa, uint msaaSampleIndex)
{    
    //Load value to sample the depth buffer.
    int3 sampleVal = int3( (dispatchID.x), (dispatchID.y), 0);

    //uint - we will sample our depth buffer and convert it to a uint. This can
    //then be stored (via the atomic function InterlockedMax()) in our group shared
    //memory.
    uint samp = 0;

    [flatten]
    if (msaa)
        //MSAA Enabled. Sample the buffer using msaaSampleIndex as the 3rd param
        //in the Load function.
        samp = asuint(ZPrePassDepthBufferMSAA.Load(sampleVal.xy, msaaSampleIndex).r);
    else
        //MSAA disabled - Sample the standard buffer.
        samp = asuint(ZPrePassDepthBuffer.Load(sampleVal).r);

    //write to shared memory.
    InterlockedMax(MaxGroupSharedDepthNDC, samp);
}

//
//Compute Shader Entry functions.
//

//Num threads per thread group. One thread per pixel. This is a 2D thread group. Shared
//memory will be used (shared between threads in the same thread group) to cache the
//depth value from the depth buffer. For this pass, we have one thread group per tile
//and a thread per pixel in the tile.
[numthreads (TILE_PIXEL_RESOLUTION, TILE_PIXEL_RESOLUTION, 1)]
void CSMain(
    in int3 groupID            : SV_GroupID,           //Uniquely identifies each thread group
    in int3 groupThreadID      : SV_GroupThreadID,     //Uniquely identifies a thread inside a thread group.
    in int3 dispatchThreadID   : SV_DispatchThreadID,  //Uniquely identifies a thread relative to ALL threads generated in a Dispatch() call
    uniform bool useMSAA)                              //MSAA Enabled? Sample MSAA Depth Buffer
{
    //Stage 1 - We sample the depth buffer and work out what the maximum Z value is for every tile.
    //This is done by looping through all the depth values of the pixels that share the same
    //tile and comparing them.
    //
    //We then write this data to the MaxZTileBuffer RWBuffer (Optional). The shared data is handy
    //for stage 2 where we can cull more lights based on this maximum depth value. (Assume
    //min depth is 0.0f so that we can accommodate transparent objects - there is a much
    //more efficient solution available which I will explore, but if this comment is still
    //here... I left it out for whatever reason - check my report, I will have made some
    //comment regarding said feature.)
    
    //GroupThreadIndex - 1D index uniquely identifying threads within a thread
    //group - can be used to determine if a thread in the thread group
    //is the first (0,0)
    int groupThreadIndex = groupThreadID.x + (groupThreadID.y * TILE_PIXEL_RESOLUTION);

    //Zero the group shared maximum (groupshared variables cannot be initialised
    //at declaration) and make sure every thread sees the zeroed value before
    //any InterlockedMax calls are issued.
    if (groupThreadIndex == 0)
        MaxGroupSharedDepthNDC = 0;
    GroupMemoryBarrierWithGroupSync();
    
    //Sample depth buffer and write to shared memory.
    [flatten]
    if (useMSAA)  
        [unroll]
        for(uint i = 0; i < 4; i++)
            //Sample MSAA buffer - Note, we could actually ask the texture
            //what MSAA count said resource has been created with, but since I only
            //support 4xMSAA, I'm not too fussed...
            ThreadCalculateMaxDepth(dispatchThreadID, true, i);
    else
        //Sample standard buffer  
        ThreadCalculateMaxDepth(dispatchThreadID, false, 0);
    
    //Wait for the threads to complete their work (i.e. work out their
    //max depth value and perform the write. This is important to do
    //before we write to the max depth buffer (optional) or use
    //it in our calculations).
    //
    //If we don't have this, we end up with some very odd flashing,
    //as we write to the buffer before all the threads have completed
    //their InterlockedMax call.
    GroupMemoryBarrierWithGroupSync();

    //Write to the Max Z Tile Buffer for visualisation if the option is enabled.
    //
    //Note, we can turn this feature off (buffer writes are very, very, very
    //expensive). Since this is not actually required by the algorithm - though it is
    //needed if we want to visualise the tiles' max depth values - a #define has been
    //used to enable/disable the buffer write.
    //
    //NOTE: Only the first thread does this, to reduce buffer writes. (Well, the first
    //warp in reality - all 32 threads of the first warp on Nvidia cards, or 64 in a
    //wavefront on AMD, will step through this branch, but only thread 0 performs the
    //write. Every other warp in the thread group skips the branch entirely.)
    //
    //This advice came from the author of Practical Rendering and Computation
    //with Direct3D11 on GameDev.net, so I will not be questioning his wisdom. Much
    //thanks for your time MJP!
#ifdef SHOULD_WRITE_TO_MAX_Z_TILE_BUFFER
    if (groupThreadIndex == 0)
    {
        //Work out the index in to our buffer. One entry in the
        //buffer per tile:
        //
        //------------------->
        //------------------->
        //------------------->
        int tilesX = ceil( (rtWidth  / (float)TILE_PIXEL_RESOLUTION) );
        int maxZTileIndex = groupID.x + (groupID.y * tilesX);
        //Reinterpret as a floating point value.
        MaxZTileBuffer[maxZTileIndex] = asfloat(MaxGroupSharedDepthNDC);  
    }
#endif  
 
    //Stage 2 - In this stage, we will build our LLIB (Light List Index Buffer - essentially
    //a list which indexes in to the Light List Buffer and tells us which lights affect
    //a given tile) and our LLSEB (Light List Start End Buffer - a list which indexes
    //in to the LLIB)

}//End CSMain()

(Sorry for the wall of red text - as this is an assignment, we're expected to write decent comments.)

Just wondering if anyone has any extra feedback on the solution (apart from the god-awful spelling in some of my comments :P - I've changed a few of the inaccurate comments just now)? And if it is an acceptable solution, hopefully the above code can become somewhat useful for others.

Cheers.

