Jump to content

  • Log In with Google      Sign In   
  • Create Account


Member Since 08 Sep 2012
Offline Last Active May 25 2014 04:17 PM

Posts I've Made

In Topic: How to transform BoundingShapes to screen space ?

21 May 2014 - 08:02 PM



Firstly, could I ask the OP what paper he is reading? I didn't actually find anything on the tile based deferred system whilst hunting around the internet about a year back and it's something I wish I knew more about (past the BF3 powerpoint presentation)


Secondly, I recently did a "Tile Based Forward Renderer" for my final year project at uni and found the book: GPU Pro 4 to be extreamly handy (2 chapters on the system, one from the guys at AMD). Theres also some very useful source code on the AMD website (they call it Forward+).


My system (and the AMD system) did these light (bounding spheres) to tile intersection tests in view space.

In Topic: Creating a SRV to a ID3D11Buffer resource.

11 October 2013 - 06:43 AM

If you look at the actual structure definition for D3D11_BUFFER_SRV, you'l lsee that ElementWidth and NumElements are in a union. So you can't actually set these values separately, since the occupy the same place in memory. Honestly I have no idea why ElementWidth is in there, since I've never used it for anything. You'll want to just set NumElements to the appropriate value, and leave ElementWidth alone.


Thank you MJP (Again :P), cleared things up perfectly and plugged a knowledge gap of mine!



In Topic: Light Culling on the GPU for a Tile Based Forward Renderer.

07 October 2013 - 07:41 AM

Thanks to the suggestions above and some source code from AMD, I have redesigned my code a little this morning:

//Shared Memory
//Groupshared memory which holds our maximum depth value of all pixels within a tile -
//which conveniantly, is also one of our thread groups. We use the atomic InterlockedMax
//function to write to this in a thread safe manner. Note this is a uint - the atomic
//functions only work with uints or ints and NOT floats. Happily, asuint() and asfloat()
//will convert between data types with no extra work (They work on a bit level)
//GroupShared memory is memory that is read/write between threads within a single
//thread group. Relaivly fast - especially compared to resource read/writes.
//Soted in NDC space - [0,1] (Actually, NDC space is [-1,1] but the hardware
//automatically bungs depths in to the [0,1] range as do we. Plus, my
//ConvertSampledDepthFromDepthBufferToViewSpace() function expects it in the
//[0,1] range.
groupshared uint MaxGroupSharedDepthNDC = 0;

//Compute Shader helper functions
//Converts a sampled depth value back to view space depth. Please note: The depth
//value passed in to this function (z), should be taken directly from the depth buffer.
//NDC space has the range [-1,1] whilst the depth buffer stores things to [0,1]. Our
//function handles this.
//With view space depth aquired, it is possible (given screen space xy coords) to
//reconstruct the full 3D position (view space) of any pixel in our scene.
//Credit to the Forward+ Source Code for this solution. I did actually test it out
//in my little maths demo application and it looks right.
float ConvertSampledDepthFromDepthBufferToViewSpace(float z)
    //From [0,1] to [-1,1]
    float depth = z * 2.0f - 1.0f;
    float viewDepth = 1.0f / (depth*invProjection._34 + invProjection._44);
    //And return
    return viewDepth;

//Function that samples the depth buffer (Or MSAA Depth buffer) and uses
//the InterlockedMax atomic function to store it (Note: As a UINT - HLSL
//functions asuint and asfloat are useful for storing data here and then
//converting the return from uint to float. These functions work on the
//bit patterns).
void ThreadCalculateMaxDepth(int3 dispatchID, uniform bool msaa, uint msaaSampleIndex)
    //Load value to sample the depth buffer.
    int3 sampleVal = int3( (dispatchID.x), (dispatchID.y), 0);

    //uint - we will sample our depth buffer and convert it to a uint. This can
    //then be stored (via the atomic funcion interlockedmax()) in our group shared
    uint samp = 0;

    if (msaa)
        //MSAA Enabled. Sample the buffer using msaaSampleIndex as the 3rd param
        //in the Load function.
        samp = asuint(ZPrePassDepthBufferMSAA.Load(sampleVal.xy, msaaSampleIndex).r);
        //MSAA disabled - Sample the standard buffer.
        samp = asuint(ZPrePassDepthBuffer.Load(sampleVal).r);

    //write to shared memory.
    InterlockedMax(MaxGroupSharedDepthNDC, samp);

//Compute Shader Entry functions.

//Num threads per thread group. One thread per pixel. This is a 2D thread group. Shared
//memory will be used (shared between threads in the same thread group) to cache the
//depth value from the depth buffer. For this pass, we have one thread group per tile
//and a thread per pixel in the tile.
void CSMain(
    in int3 groupID            : SV_GroupID,           //Uniquely identifies each thread group
    in int3 groupThreadID      : SV_GroupThreadID,     //Uniquely identifies a thread inside a thread group.
    in int3 dispatchThreadID   : SV_DispatchThreadID,  //Uniquely identifies a thread relative to ALL threads generated in a Dispatch() call
    uniform bool useMSAA)                              //MSAA Enabled? Sample MSAA Depth Buffer
    //Stage 1 - We sample the depth buffer and work out what the maximum Z value is for every tile.
    //This is done by looping through all the depth values of the pixels that share the same
    //tile and comparing them.
    //We then write this data to the MaxZTileBuffer RWBuffer (Optional). The shared data is handy
    //for stage 2 where we can cull more lights based on this maximum depth value. (Assume
    //min depth is 0.0f so that we can accomodate transparant objects - there is a much
    //more efficient solution available which I will explore, but if this comment is still
    //here... I left it out for whatever reason - check my report, I will have made some
    //comment regarding said feature)
    //GroupThreadIndex - 1D index uniquly identifying threads within a thread
    //group - can be used to determin if a thread in the thread group
    //is the first (0,0)
    int groupThreadIndex = groupThreadID.x + (groupThreadID.y * TILE_PIXEL_RESOLUTION);
    //Sample depth buffer and write to shared memory.
    if (useMSAA)  
        for(uint i = 0; i < 4; i++)
            //Sample MSAA buffer - Note, we could actually ask the texture
            //what MSAA count said resource has been created with, but since I only
            //support 4xMSAA, I'm not too fussed...
            ThreadCalculateMaxDepth(dispatchThreadID, true, i);
        //Sample standard buffer  
        ThreadCalculateMaxDepth(dispatchThreadID, false, 0);
    //Wait for the threads to complete there work (Ie, work out there
    //max depth value and preform the write. This is important to do
    //before we write to the max depth buffer (optional) or use
    //it in our calculations)
    //If we dont have this, we end up with some very odd flashing
    //as we write to the buffer before all the threads have completed
    //there InterlockMax call.

    //Write to Maz Z Tile Buffer for visualisation if option enabled.
    //Note, we can turn this feature off (buffer writes are very, very, very
    //expensive. Since this is actually not required for our algorithm - though needed if we want
    //to visualise the tiles max depth values - a #define has been used to enable/disable
    //the buffer write)
    //NOTE: The first thread only to reduce buffer writes. (well, first thread warp in reality.
    //32 threads will actually run this code block (32 threads in a warp on nvidia cards.
    //64 on ATI. Every other thread in a thread group will not do this.(well, any thread in
    //a differnt thread warp for this thread group anyway...)
    //This advice came from the authorof Practical Rendering and Computation
    //with Direct3D11 on GameDev.net so I will not be questioning his wisdom. Much
    //thanks for your time MJP!)
    if (groupThreadIndex == 0)
        //Work out the index in to our buffer. One entry in the
        //buffer per tile:
        int tilesX = ceil( (rtWidth  / (float)TILE_PIXEL_RESOLUTION) );
        int maxZTileIndex = groupID.x + (groupID.y * tilesX);
        //Reinterpret as a floating point value.
        MaxZTileBuffer[maxZTileIndex] = asfloat(MaxGroupSharedDepthNDC);  
    //Stage 2 - In this stage, we will build our LLIB (Light List Index Buffer - essentially
    //a list which indexes in to the List List Buffer and tells us which lights affect
    //a given tile) and our LLSEB (Light List Start End Buffer - a list which indexes
    //in to the LLIB)

}//End CSMain()

(Sorry for the wall of red text - Being an assignment, you are expected to make decent comments)


Just wondering if anyone has any extra feedback on the solution (Apart from the God awful spelling in some of my comments tongue.png - I've changes a few of the inaccurate comments just now)? And if it is an acceptable solution, hopefully the above code can become somewhat useful for others.



In Topic: Light Culling on the GPU for a Tile Based Forward Renderer.

06 October 2013 - 11:35 AM

Okay a few things...

1. You need the max and min depth per-tile to perform optimal culling for opaques. Without the min depth you could have lights that are "floating in air" that never touch a surface, but you'll still include them in your per-tile lists. Using the min depth won't work for transparents, but for transparents you can generate a second list per-tile so that the opaques are more efficient.

2. Like Pachanoi mentioned, having one thread determine the min/max depth for a tile is a really inefficient way to do it. It will be slow since one thread will have to read a lot of values out of shared memory and perform some math, and you'll also use a lot of shared memory which can hurt overall shader performance. A simpler way to do it is to use InterlockedMax and InterlockedMin on a shared memory variable, as Pachanoi suggests. That sample he linked to do does this, and you'll notice it has to do a few tricks to do it since InterlockedMin/Max only works on integers. Atomics still aren't that great though, since they effectively serialize the memory access among all threads. To avoid using them, you can use a parallel reduction instead. However a parallel reduction requires more shared memory, so it may not always be a win.

3. Don't take the average of your MSAA values within a pixel, that doesn't make any sense. Compute the min and max depth values, and use them as the initial min and max depth that you use to compare against the other thread values.

To clear up your misconception about branching, the hardware doesn't always execute both sides of the branch. It only does this when threads within a nearby group of threads don't take the same path in the branch. This "nearby group of threads" is called a warp on Nvidia hardware and is made up of 32 threads, and is called a wavefront on AMD hardware is made up of 64 threads. So for example, lets take your branch where you only do some work if the thread index is 0. I'll assume that your tile size is 16x16, which means you have 256 threads in your thread group. On Nvidia hardware, the first warp will have 1 thread take the branch and 31 not take the branch. This means the other 31 still have to do the work inside the branch. However for all subsequent warps in the thread group, no threads will take the branch. This means that those warps can skip the branch.




Thanks for your reply.


In response to 1). The project is only meant to take 400 hours total, which includes a 25k word report, all the preliminary education (D3D11) and applicaion framework. I'm lucky enough to have put in some time over the summer to learn D3D11 and build a decent enough framework (For this purpose, anyway) - so I dont have any time issues yet, but, I still have had to make some decisions to simplify the program - this being one.


I will be using the suggestion from GPU Pro 4 - Just assuming the minimum depth is 0.0f, thus our bounding volume per tile will extend from the origin and up untill the maximum value of the opaque objcts (thus supporting transparancy - allthough inefficiently as you say).


However, I'm still yet to decide on which marking shceme to go with  - the other places more emphersis on the product (IE, a more complete product. More design documentation, etc. But a much smaller report) and that we have to follow an accepted development path (Eg, an itterative method). So there isnt anything stopping me changing mark schemes for another few weeks - Therefore, If I develop the simple solution now (Which is more than enough for the original mark scheme. So long as it can demonstrate support for transparancy and MSAA, were golden. Plus, the method you suggested (Individual tile list for opaque and transparant objects) can go in to my report as part of the imporovments section), I can then extending functionality at a later date (I'll claim this as an Itterative method of some sort).


I'm going to give a proper look in to InterlockedMax tomorrow. I have some source code from an AMD demo on Forward+ (I think) which makes use of InterlockedMax and Min. And demonstrates the ways to convert from float to uint and back again. If this solution is good enough for AMD, it should be fine for my needs. :)


Thanks for clearing up the branching - that has helped alot! :)


Also, I'm a rather big fan of (I presume) your book - Practical Rendering and Computation. Its actually one of the major influences in my decision to do a rendering system as my final year project after implementing the deferred renderer in the book!


Lot of good stuff in that book I must say! :)

In Topic: Light Culling on the GPU for a Tile Based Forward Renderer.

05 October 2013 - 10:35 AM


There are a lot of cases where it's useful or required to use dynamic branching in this way, where one thread needs to perform some operations. But I'm not sure if this is what you really want to be doing here, since you are using a single thread to loop over the entire group's shared data.  Have you looked into InterlockedMax?  You should be able to compute your max depth without doing any sort of looping over the data, or need for the cache other than a single groupshared uint. Just remember you will need to convert your sampled depth to uint first, and then convert the final max depth back to a float before storing in your max z buffer.





As far as the branching goes, it's perfectly ok to have the first thread do some work if it's going to be shared across the group.  For example, computing the frustum for each tile would just need to be done for the one thread. Obviously only do this in cases where it only makes sense for one thread to do it.


Based on the comments it looks like you have your light list building code inside that if statement. This can be distributed as well.  I believe one of the tile based shading articles in GPU Pro 4 shows how to do this, and I'm pretty sure it's done in this sample.  Something like this:


// Get a list of the lights that affect the tile
for(uint i=0; i<g_NumLights; i+=GROUP_SIZE)
    uint lightIndex = groupIndex + i;
    if( lightIndex < g_NumLights )
        // Check if the light affects this tile
        if( light is visible to the tile )
            // Update the counter
            uint currentTileLightIndex;
            InterlockedAdd(sharedTileLightCounter, 1, currentTileLightIndex);

            // Add the light to the list
            if(currentTileLightIndex < MAX_LIGHTS)
                g_LightList[currentTileLightIndex] = lightIndex;



Thank you very much for your reply.


I'll be looking right in to Interlocking functions - I have seen source code for a tile based forward renderer that made use of it and had a solution which seemed extreamly clean (as one would expect from developers over at AMD) and didnt have my issues above.


Also thanks for the link to a paper/source code - I havnt had a look yet, but I would be surprised if it didnt feature somewhere in the report at the very least!


Once again, thanks for the help! :)