vprat

DX10 - Dynamic Selection of the Render Target


vprat
Hi, I am currently implementing point light shadow mapping. So far I have successfully implemented it in two ways:

- Single-pass cube map rendering using a geometry shader (my first implementation)
- Multi-pass rendering (my second implementation)

I implemented the second one because it lets me cull per cube face. I am seeing great improvements with this method, and I want to optimize it further.

I am trying to minimise render target changes. With the first method there is only one change per light (worst case). With the second method there are six (worst case; fewer if culling returns no object for a face).

So my idea is to set the 6 render targets at once (just like the first method) and use an effect parameter to select the render target to write to. I could not do it with SV_TARGETn, because that is static: you need to know at compile time which target you will be writing to. I also tried if ( face==N ) output.colorN = ...; but I don't like having six if cases. What I would need in the pixel shader is something like output.color[ face ] = ...; where face is an effect variable set just before pass.apply().

Is this possible?

DieterVW
The render target being drawn to must be known before the pixel shader executes. As far as I know, you are restricted either to choosing the render target up front or to doing your culling in the GS. This means you'd have to either change the pixel shader, which can be slow for the driver, or change render targets, which can also be slow. It's hard to know how bad this is: your render targets are probably the same format and size, so switching between them may not be as bad as switching a shader program. Profiling will be your best friend here, and results may vary between hardware vendors.

The VS/GS stages are set up to help get rid of unnecessary primitives. You probably already know that you can set a viewport for each render target array slice. You can also use SV_ClipDistance and SV_CullDistance to eliminate polygons that would otherwise be rasterized and sent to the pixel shader. I think the ATI and NVIDIA white papers both recommend doing this anyway.
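To make the SV_CullDistance suggestion concrete: for each cube face, the GS could write, per vertex, the signed distances (relative to the light) to the four side planes of that face's 90-degree frustum, and the rasterizer then discards any primitive whose vertices are all negative in the same component. The C++ below is only a CPU-side emulation of that rule so the math can be checked; the helper names and the face order (+X, -X, +Y, -Y, +Z, -Z, the Direct3D cube face order) are assumptions, not code from this thread.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Hypothetical helper: signed distances from a light-relative position p to the
// four side planes of one cube face's 90-degree frustum. The planes are not
// normalized, which is fine because only the sign matters for culling.
// Assumed face order (Direct3D): 0=+X, 1=-X, 2=+Y, 3=-Y, 4=+Z, 5=-Z.
std::array<float, 4> faceCullDistances(const Vec3& p, int face)
{
    assert(face >= 0 && face < 6);
    float major = 0.0f, a = 0.0f, b = 0.0f;
    switch (face) {
        case 0:  major =  p.x; a = p.y; b = p.z; break;
        case 1:  major = -p.x; a = p.y; b = p.z; break;
        case 2:  major =  p.y; a = p.x; b = p.z; break;
        case 3:  major = -p.y; a = p.x; b = p.z; break;
        case 4:  major =  p.z; a = p.x; b = p.y; break;
        default: major = -p.z; a = p.x; b = p.y; break;
    }
    // p lies inside the face frustum iff |a| <= major and |b| <= major,
    // i.e. iff all four values below are >= 0.
    return { major - a, major + a, major - b, major + b };
}

// CPU emulation of the SV_CullDistance rule: the triangle is discarded when,
// for at least one component, all three of its vertices are negative.
bool triangleCulledForFace(const Vec3 tri[3], int face)
{
    std::array<float, 4> d[3];
    for (int v = 0; v < 3; ++v)
        d[v] = faceCullDistances(tri[v], face);
    for (int c = 0; c < 4; ++c)
        if (d[0][c] < 0.0f && d[1][c] < 0.0f && d[2][c] < 0.0f)
            return true;
    return false;
}
```

The test is conservative: a triangle that lies outside a frustum without being entirely behind any single side plane is kept, and the rasterizer's clipping then removes it anyway.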

[Edited by - DieterVW on November 29, 2009 12:47:17 PM]

vprat
Hi,

Thanks for the reply. I'll keep my current multi-pass rendering as it is for now, and I will try to further optimize the single-pass rendering instead.

So if I understand correctly, this is what could be done for one point light:


set the cube map as render target (all 6 faces at once)
get objects within light range
for each object within range
    do a frustum test against the 6 frusta of the cube map
    set flags in the effect to indicate which faces to render to
    render the object to the cube map in a single pass


And the geometry shader could be something like this:


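// The input type name and the function signature are assumed; writeToFace[6],
// kgfx_matViewProj[6] and kgfx_vLightPosition are effect variables declared elsewhere.
[maxvertexcount(18)] // 6 faces x 3 vertices
void GS_CubePL( triangle VS_CUBEPL_OUTPUT input[3], inout TriangleStream<GS_CUBEPL_OUTPUT> CubeMapStream )
{
    // For each face of the cube
    for( int f = 0; f < 6; ++f )
    {
        // Only emit the triangle for faces flagged by the CPU-side frustum test
        if ( writeToFace[f] )
        {
            GS_CUBEPL_OUTPUT output = (GS_CUBEPL_OUTPUT) 0;

            // Assign the triangle to the render target array slice of this cube face
            // (RTIndex presumably carries the SV_RenderTargetArrayIndex semantic)
            output.RTIndex = f;
            for( int v = 0; v < 3; v++ )
            {
                output.position = mul( input[v].position, kgfx_matViewProj[f] );
                output.light = input[v].position.xyz - kgfx_vLightPosition;
                CubeMapStream.Append( output );
            }
            CubeMapStream.RestartStrip();
        }
    }
}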
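The CPU side of the steps listed above (testing each object's bounding sphere against the six face frusta and deciding which writeToFace flags to set) could look roughly like this. It is a sketch under assumptions: the sphere center is expressed relative to the light position, only the four side planes of each face are tested (the light-range test is assumed to have been done already), and the function name and face order (the Direct3D order +X, -X, +Y, -Y, +Z, -Z) are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Hypothetical CPU-side test: returns a 6-bit mask where bit f is set when the
// bounding sphere (center relative to the light) may overlap cube face f.
// Assumed face order: 0=+X, 1=-X, 2=+Y, 3=-Y, 4=+Z, 5=-Z.
std::uint32_t cubeFaceMask(const Vec3& center, float radius)
{
    assert(radius >= 0.0f);
    const float invSqrt2 = 1.0f / std::sqrt(2.0f);
    // Per face: the coordinate along the face's axis (signed toward the face)
    // and the two remaining coordinates.
    const float major[6] = { center.x, -center.x, center.y, -center.y, center.z, -center.z };
    const float a[6]     = { center.y,  center.y, center.x,  center.x, center.x,  center.x };
    const float b[6]     = { center.z,  center.z, center.z,  center.z, center.y,  center.y };

    std::uint32_t mask = 0;
    for (int f = 0; f < 6; ++f) {
        // Normalized signed distances to the four 45-degree side planes of face f.
        const float d[4] = { (major[f] - a[f]) * invSqrt2, (major[f] + a[f]) * invSqrt2,
                             (major[f] - b[f]) * invSqrt2, (major[f] + b[f]) * invSqrt2 };
        bool outside = false;
        for (float dist : d)
            if (dist < -radius) { outside = true; break; }
        if (!outside)
            mask |= 1u << f;
    }
    return mask;
}
```

Bit f of the result would then be copied into writeToFace[f] before applying the pass (for example through the effect's bool array variable). A sphere that contains the light sets all six bits, which is the expected worst case.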


However, I don't really see where I would use SV_ClipDistance or SV_CullDistance. Is there a better way than the above using those techniques? Would you happen to have links to the white papers you mentioned (or their titles, or any other info to help me find them)?

I found a paper describing how to do frustum culling inside the GS. However, doing it per primitive looks rather inefficient compared to testing a bounding sphere beforehand.

[Edited by - vprat on November 29, 2009 7:40:16 AM]

vprat
Update: I gave it a try and finally implemented a solution that can switch between single-pass and multi-pass rendering of the cube map.

I must say that the single-pass rendering algorithm still wins (75 vs 45 frames per second), even though I do culling and output only the required primitives in the geometry shader.
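Comparing frame times rather than frame rates makes the gap easier to reason about when profiling, assuming nothing else in the frame changed between the two runs:

$$
t = \frac{1000\ \mathrm{ms}}{\mathrm{FPS}}, \qquad t_{75} = \frac{1000}{75} \approx 13.3\ \mathrm{ms}, \qquad t_{45} = \frac{1000}{45} \approx 22.2\ \mathrm{ms}, \qquad \Delta t \approx 8.9\ \mathrm{ms\ per\ frame}
$$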

More about the implementations on the Kourjet Engine blog

Could this be because my graphics card is not so great? (NVIDIA GeForce 9M series, Acer laptop)

DieterVW
NVIDIA papers:
- GPU Programming Guide
- DX10 Performance and usage considerations

AMD papers:
- Introduction to DirectX's Direct3D 10
- Harnessing The Power Of DirectX 10

The shader you posted will not perform optimally because each GS invocation emits more than 4 vertices. Both IHVs' hardware is optimized for the case of one GS invocation emitting 1 to 4 vertices; beyond that, performance drops in successive powers of two as more vertices are emitted. Instancing in the VS or GS should be better suited, since it helps you stay in the 1 to 4 range per GS invocation. The output size also needs to be fairly small, though I don't recall how small. I don't know what is in your vertex shader right now (probably the model matrix mul), so if you don't have GS instancing available (it's a DX11 feature), you'll have to experiment to see whether VS instancing, despite doing more repetitive math, can outperform your current GS amplification scheme. If your models have a lot of vertices, you may get a boost by uploading a complete model * view * projection matrix per model, so that only one matrix mul is done on the GPU per vertex instance instead of two.
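The last suggestion (one combined matrix per model) amounts to folding the world matrix into each face's view-projection matrix on the CPU. A minimal sketch, assuming the row-vector convention implied by mul( position, matrix ) in the GS above, so that the combined matrix is World * ViewProj; the Mat4 type and the function names are hypothetical:

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Row-vector convention (p' = p * M): applying World then ViewProj is World * ViewProj.
Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Hypothetical per-object step: combine the object's world matrix with each of the
// six face view-projection matrices once, so the GPU does one mul per vertex per face.
std::array<Mat4, 6> composeFaceMatrices(const Mat4& world, const std::array<Mat4, 6>& faceViewProj)
{
    std::array<Mat4, 6> out{};
    for (int f = 0; f < 6; ++f)
        out[f] = multiply(world, faceViewProj[f]);
    return out;
}

// Transforms a point (w = 1) by m using the same row-vector convention.
std::array<float, 4> transformPoint(const std::array<float, 3>& p, const Mat4& m)
{
    std::array<float, 4> r{};
    for (int j = 0; j < 4; ++j)
        r[j] = p[0] * m[0][j] + p[1] * m[1][j] + p[2] * m[2][j] + m[3][j];
    return r;
}
```

With this, object-space positions could go straight into a single mul per face. One thing the sketch does not cover: the light vector in the GS above is computed from world-space positions, so it would then need to be derived differently, for example from an object-space light position.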

You could upload an inverse-transformed image plane and then do a back-face test on each triangle before deciding to do the full multiply and emit. This could be beneficial if groups of triangles are well organized. Mileage will vary per IHV, though, since at least one of them requires batches of 64 invocations to take the same branch in order to see an execution benefit. That doesn't tell you what you would save by reducing the emitted memory bandwidth with this technique, however. The same idea could be applied to all sides of each frustum to cull some triangles. There's just a lot of experimentation you'll need to do to find the best method. These techniques will probably be fairly sensitive to the polygon ordering algorithm used for your models.
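For a point light, every cube face shares the same eye position (the light), so a back-face test done in object space does not depend on the face and could run once per triangle, before the six-face loop. This uses the light position instead of an image plane, a substitution suited to the point-light case. A sketch under these assumptions: the world transform is rigid (rotation R plus translation t, so its inverse is R transposed applied after removing t), the front side is defined by the winding (v1 - v0) x (v2 - v0), and all names are hypothetical.

```cpp
struct Vec3 { float x, y, z; };

Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

Vec3 cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Brings the world-space light position into object space for a rigid transform
// p_world = R * p_obj + t, where r0, r1, r2 are the rows of R: p_obj = R^T (p_world - t).
Vec3 lightToObjectSpace(const Vec3& lightWorld, const Vec3& r0, const Vec3& r1,
                        const Vec3& r2, const Vec3& t)
{
    const Vec3 d = sub(lightWorld, t);
    return { r0.x * d.x + r1.x * d.y + r2.x * d.z,
             r0.y * d.x + r1.y * d.y + r2.y * d.z,
             r0.z * d.x + r1.z * d.y + r2.z * d.z };
}

// Positive when the triangle's front side (assumed winding) faces the light,
// negative when it faces away, zero when the light lies in the triangle's plane.
float facingLight(const Vec3& v0, const Vec3& v1, const Vec3& v2, const Vec3& lightObj)
{
    const Vec3 n = cross(sub(v1, v0), sub(v2, v0));
    return dot(n, sub(lightObj, v0));
}
```

Whether the triangles facing the light or those facing away should be dropped depends on which faces are rendered into the shadow map.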

[Edited by - DieterVW on November 29, 2009 1:52:30 PM]

vprat
Hi,

Thanks for the papers; I read them and found them very interesting.

So, following your advice, and since you gave me a new idea, today I tried a third possible implementation using geometry instancing (basically sending the whole mesh plus an instance buffer containing the cube face index; a GS then chooses the render target). I have also optimised the other two implementations.
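The instancing variant could also reuse a per-object face mask, so that only the faces an object can touch get an instance. A sketch, assuming a 6-bit mask (bit f set when face f may be touched) and an instance buffer holding one face index per instance; the function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: one instance per cube face whose bit is set in faceMask.
// Each instance carries only its face index, which the GS would forward to the
// render target array index output.
std::vector<std::uint32_t> buildFaceInstances(std::uint32_t faceMask)
{
    assert(faceMask < 64u); // only the low 6 bits are meaningful
    std::vector<std::uint32_t> faces;
    for (std::uint32_t f = 0; f < 6; ++f)
        if (faceMask & (1u << f))
            faces.push_back(f);
    return faces; // faces.size() is the instance count for the draw call
}
```

The returned size would be the InstanceCount passed to ID3D10Device::DrawIndexedInstanced. The alternative is a static buffer holding 0 to 5 and always drawing six instances, letting the GS skip masked faces; that avoids updating the instance buffer per object but runs the shaders for faces that emit nothing.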

My conclusion on my small test scene is that the two single-pass algorithms are a bit faster than the multi-pass one (as expected, and unlike the previous unoptimized implementations).

I had never thought about using geometry instancing to render a cube map in a single pass. That's a neat application (though not applicable to DX9, because you need the GS to direct each primitive to the right render target).

For anyone who would like more info, see the Kourjet dev blog.

I will leave the other experiments you mention (clipping planes and so on) for later, in order to concentrate on directional lights and spot lights now :)

Thanks a lot, regards,
