[DirectX 10] Improving terrain performance

8 comments, last by _the_phantom_ 12 years, 10 months ago
Hi,
I'm trying to improve my terrain shaders, because it's taking ~8 ms per frame to render the terrain, and I'm not even rendering it to the shadow map cascades yet (which will probably cost a few more ms per frame).

Shader to render the terrain to the GBuffer:

VS_OUT GBufferVS(VS_IN vIn)
{
    VS_OUT vOut;

    float2 worldPos = vIn.posL.xz * float2(vIn.mSpacing, vIn.mSpacing) + vIn.mTransform;

    float elevation = -100000.0f;

    float2 texC = float2(worldPos.x/HSIZEX, worldPos.y/HSIZEY) * 0.5f + 0.5f;

    float dx = vIn.mSpacing * DX;
    float dy = vIn.mSpacing * DY;

    if( texC.x > 0.0f && texC.x < 1.0f && texC.y > 0.0f && texC.y < 1.0f)
    {
        elevation = gElevationMap.SampleLevel(gTriLinearSam, texC, 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y+dy), 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y-dy), 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x-dx, texC.y), 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x+dx, texC.y), 0);
    }

    elevation /= 5;

    float4 posVS = mul(float4(worldPos.x, elevation * HSCALE + HOFFSET, worldPos.y, 1.0f), gView);

    vOut.posH = mul(posVS, gProj);

    vOut.posWSTexC.xy = worldPos;
    vOut.posVS = posVS;
    vOut.posWSTexC.zw = texC;

    return vOut;
}

PS_OUT GBufferPS(VS_OUT pIn)
{
    PS_OUT pOut;

    float et = gElevationMap.Sample(gTriLinearSam, float2(pIn.posWSTexC.z, pIn.posWSTexC.w+DY)) * HSCALE + HOFFSET;
    float eb = gElevationMap.Sample(gTriLinearSam, float2(pIn.posWSTexC.z, pIn.posWSTexC.w-DY)) * HSCALE + HOFFSET;
    float el = gElevationMap.Sample(gTriLinearSam, float2(pIn.posWSTexC.z-DX, pIn.posWSTexC.w)) * HSCALE + HOFFSET;
    float er = gElevationMap.Sample(gTriLinearSam, float2(pIn.posWSTexC.z+DX, pIn.posWSTexC.w)) * HSCALE + HOFFSET;

    float3 tanZ = float3(0.0f, (et-eb)*0.4, 1.0f);
    float3 tanX = float3(1.0f, (er-el)*0.4, 0.0f);

    float3 normal = cross(tanZ, tanX);

    pOut.normal.rgb = 0.5f * normalize(normal) + 0.5f;
    pOut.normal.a = 0.1f;
    pOut.depthVS = pIn.posVS.z / gFarClipDistance;

    return pOut;
}


Second geometry pass shader:

SGeometry_VS_OUT SGeometryVS(VS_IN vIn)
{
    SGeometry_VS_OUT vOut;

    float2 worldPos = vIn.posL.xz * float2(vIn.mSpacing, vIn.mSpacing) + vIn.mTransform;

    float elevation = -100000.0f;

    float2 texC = float2(worldPos.x/HSIZEX, worldPos.y/HSIZEY) * 0.5f + 0.5f;

    float dx = vIn.mSpacing * DX;
    float dy = vIn.mSpacing * DY;

    if( texC.x > 0.0f && texC.x < 1.0f && texC.y > 0.0f && texC.y < 1.0f)
    {
        elevation = gElevationMap.SampleLevel(gTriLinearSam, texC, 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y+dy), 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y-dy), 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x-dx, texC.y), 0);
        elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x+dx, texC.y), 0);
    }

    elevation /= 5;

    float4 posVS = mul(float4(worldPos.x, elevation * HSCALE + HOFFSET, worldPos.y, 1.0f), gView);

    vOut.posH = mul(posVS, gProj);
    vOut.texC = texC;

    return vOut;
}

float4 SGeometryPS(SGeometry_VS_OUT pIn) : SV_Target
{
    // Convert the diffuse texel from gamma space to linear space.
    float4 diffuse = pow(gDiffuseMap.Sample(gTriLinearSam, pIn.texC), 2.2);
    clip(diffuse.a - 0.25f);

    float3 color = float3(0.0f, 0.0f, 0.0f);

    // SV_Position holds pixel coordinates in the pixel shader.
    float2 sPos = pIn.posH.xy;
    float4 light = gLightBuffer.Load(int3(sPos.x, sPos.y, 0));

    color += light.rgb * diffuse.rgb + light.aaa + (gAmbientLight * diffuse.rgb);

    return float4(color, 1.0f);
}


The terrain is made of about 700k triangles, rendered twice (once to the GBuffer and once in the second geometry pass).

Can you see any improvement that can be made? Should I reduce the number of triangles?

P.S. I'm already doing frustum culling to reduce the number of triangles.
Also, these shaders use ~74% of VS and 16% of PS
This code here

if( texC.x > 0.0f && texC.x < 1.0f && texC.y > 0.0f && texC.y < 1.0f)
{
    elevation = gElevationMap.SampleLevel(gTriLinearSam, texC, 0);
    elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y+dy), 0);
    elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y-dy), 0);
    elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x-dx, texC.y), 0);
    elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x+dx, texC.y), 0);
}

Can be changed to

elevation = gElevationMap.SampleLevel(gTriLinearSam, texC, 0);
elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y+dy), 0);
elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x, texC.y-dy), 0);
elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x-dx, texC.y), 0);
elevation += gElevationMap.SampleLevel(gTriLinearSam, float2(texC.x+dx, texC.y), 0);

Just create the sampler with wrap or clamp addressing; you do not need an if statement to enforce that.
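
As a rough illustration of what those two addressing modes do to an out-of-range coordinate, here is a minimal C++ sketch. The function names are made up for this example, and the real sampler does this in hardware. One thing worth checking before dropping the branch: with either mode, vertices outside the heightmap get a valid edge (or tiled) height instead of the -100000 sentinel the original code leaves in place, so the terrain will look different beyond the map borders.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical CPU-side emulation of two texture address modes on a
// normalized coordinate; the real sampler applies these in hardware.
float clampAddress(float u)
{
    // Clamp: coordinates outside [0, 1] stick to the nearest edge.
    return std::min(std::max(u, 0.0f), 1.0f);
}

float wrapAddress(float u)
{
    // Wrap: only the fractional part is kept, so the texture tiles.
    return u - std::floor(u);
}
```

For example, a coordinate of 1.25 becomes 1.0 under clamp and 0.25 under wrap.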

What's the purpose of the geometry buffer pass?

I have gone down the same road you are on, and I can see what you are trying to do. You are trying to offload work to the GPU, which can be a good thing, but most of the work is redundant and can be done once on the CPU and then rendered as static geometry, which is much faster than what you are currently doing.

Build your terrain as grids on the CPU, then send them to the GPU to draw as static geometry; this will reduce your draw times by a lot. Also, the geometry shader sucks at doing a lot of work; its purpose is small stuff.
Wisdom is knowing when to shut up, so try it.
--Game Development http://nolimitsdesigns.com: Reliable UDP library, Threading library, Math Library, UI Library. Take a look, its all free.
What batch sizes are you doing?

Are you performing any LODing on the terrain 'patches'?

How are you getting those GPU usage numbers? Does the tool give you any idea of any shader bottlenecks?
I forgot to mention that this is a Geometry Clipmaps implementation, so I can't use static geometry.

I'm not using the geometry shader... I'm rendering the geometry to the GBuffer and then a second time to the Light Pre-Pass second geometry pass buffer.

Also, I'm instancing most of the geometry, so I make just 5 draw calls (10 draw calls in total, because I draw the terrain twice).

I'm using NVIDIA PerfHUD to get usage numbers...

@phantom what do you mean by shader bottlenecks? I just know that my game is GPU bound, because the driver is sleeping about 6 ms per frame. Is that what you're asking?
Numbers as to where your shaders are spending their time: ALU bound or sample bound, for example? Or maybe even raster operation bound (writing the pixels), which is common in a deferred renderer setup.

Also, what hardware are you rendering this on?

I don't know how to check if a shader is ALU or sample bound... Take a look at this image anyway:
[Image: semttulovt.png]

I'm running on an Intel Pentium Dual-Core E5300 2.60GHz (bad CPU, I know) and an NVIDIA GeForce 9800 GT.
Geometry clipmaps can be done on the CPU. You can implement this as static geometry. Why couldn't you? Precompute each grid as a separate mesh on the CPU, then simply draw them. Problem solved. There isn't a good reason to perform redundant calculations on the GPU when you can do them once on the CPU; this is the reason you are getting such poor performance.

Geometry clipmaps is a technique for piecing together geometry in a way that allows long-distance viewing and large terrains. There is no rule dictating that it must be done solely on the GPU. I realize many implement it this way, but it is not the best way.

If you implement it on the CPU, you will quickly see how much faster the technique is.
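
One piece of redundant work that could be moved out of the frame loop is the five-tap average both vertex shaders compute per vertex, every frame. Below is a minimal C++ sketch of doing that filtering once on the CPU. It is an illustration under several assumptions, not the poster's implementation: `smoothCross` is a made-up name, the heightmap is a row-major `float` array, texels are fetched point-sampled rather than bilinearly, and borders are clamped. Since the shader's offset scales with `mSpacing`, `step` would have to be chosen per clipmap level, giving one filtered map per level.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: average each texel with its four neighbours that
// lie `step` texels away, clamping coordinates at the map border.
std::vector<float> smoothCross(const std::vector<float>& src,
                               int width, int height, int step)
{
    // Clamped, point-sampled fetch from the source heightmap.
    auto at = [&](int x, int y) {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return src[static_cast<std::size_t>(y) * width + x];
    };

    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            // Same cross-shaped kernel as the vertex shader: centre + 4 taps.
            float sum = at(x, y)
                      + at(x, y + step) + at(x, y - step)
                      + at(x - step, y) + at(x + step, y);
            dst[static_cast<std::size_t>(y) * width + x] = sum / 5.0f;
        }
    }
    return dst;
}
```

With the filtered maps uploaded as textures (or baked into vertex buffers), the vertex shader would need a single fetch instead of five.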
I'll admit up front I don't have a lot of experience with PerfHUD; even so, I'll do a quick walkthrough on bottlenecks here :)

So, firstly, for this draw call, yes, you are certainly spending most of your time in VS instructions, as the 'instruction count ratios' graph shows. This is mildly worrying: while your vertex shader is pretty heavy, there are generally more pixels than vertices being processed in any given scene.

When it comes to finding bottlenecks, the graph next to the ICR graph is your friend; it shows the amount of time each functional unit on the GPU is taking for the frame, the state bucket and (importantly) the draw call.

Based on the size of the peach coloured bar, it would seem most of your time is being spent in input assembly and geometry setup; shader, texture and raster ops aren't remotely a problem, and even frame buffer traffic is a bit low.

The net result of this is that the problem ISN'T with your shaders; they are executing quickly and take hardly any time on the draw call. The problem seems to be input assembly and geometry setup, as they are taking much, much longer; in fact, shader, texture, ROP and frame buffer combined are taking less time than input assembly.

This would lead me to think that you are submitting too many vertices, which would explain the high VS:PS instruction count ratio for the draw call as well, so you might want to look at your LOD scheme and check how many verts you are submitting.

I'm submitting an average of 1 million vertices every time I draw the terrain...

Do you know what is done in the geometry setup?

Also, what's the difference between shader and texture in the graph on the picture I posted?
I think, and I need to refresh my memory but I'm 99% certain, that geometry setup includes dividing up work into pixel quads for the hardware to process at the pixel level.

Shader is the amount of time spent executing ALU ops; texture is the amount of time spent performing texture operations (aka sampling). Based on that, most of your time is spent doing ALU ops rather than TEX ops, which is a good thing, as most GPUs these days are biased in favour of ALU ops.

As for the vertex count, 1 million might be a little high, but it comes down to density as well, as lots of small triangles are not very hardware friendly due to how the hardware dispatches under the hood.
