SSAO performance problems

With SSAO, even when drawing nothing, I get 50 fps. I think this is because of depth buffer overdraw.

The author of the book sets the depth comparison to EQUAL to prevent overdraw:

void Direct3D::DrawScene()
{
    pDeviceContext->ClearDepthStencilView(m_DepthStencilView, D3D11_CLEAR_DEPTH|D3D11_CLEAR_STENCIL, 1.0f, 0);
    pDeviceContext->RSSetViewports(1, &m_ViewPort);

    // Now compute the ambient occlusion.
    pDeviceContext->ClearRenderTargetView(m_RenderTargetView, reinterpret_cast<const float*>(&Colors::Silver));
    //pDeviceContext->ClearDepthStencilView(m_DepthStencilView, D3D11_CLEAR_DEPTH|D3D11_CLEAR_STENCIL, 1.0f, 0);

    m_Land.Draw(pDeviceContext, m_Cam, mDirLights);

    // We already laid down scene depth to the depth buffer in the Normal/Depth map pass,
    // so we can set the depth comparison test to "EQUALS."  This prevents any overdraw
    // in this rendering pass, as only the nearest visible pixels will pass this depth
    // comparison test.
    pDeviceContext->OMSetDepthStencilState(RenderStates::EqualsDSS, 0);

    if (GetAsyncKeyState('1'))
        DrawModels(); //..........
}
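To see why the EQUAL comparison removes overdraw, here is a small CPU-side simulation of the depth test for a single pixel. It is a sketch, not D3D code: the fragment list and the function names (ShadedCountLess, ShadedCountEqualAfterPrepass) are hypothetical. It also shows the precondition the trick relies on: the depth buffer must still hold the depth laid down by the normal/depth pass, and the main pass must generally produce the identical depth values. If the depth buffer is cleared to 1.0f between the two passes, EQUAL passes almost nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Depths of the fragments that land on one pixel, in submission order.
// Smaller depth = closer to the camera (the default D3D convention).
using FragmentList = std::vector<float>;

// Without a prepass: a LESS test shades every fragment that is closer than
// whatever was written so far, so back-to-front submission shades them all.
std::size_t ShadedCountLess(const FragmentList& frags, float clearDepth = 1.0f)
{
    float stored = clearDepth;
    std::size_t shaded = 0;
    for (float z : frags)
    {
        if (z < stored)
        {
            stored = z;
            ++shaded;
        }
    }
    return shaded;
}

// With a depth-only prepass followed by an EQUAL test (D3D11_COMPARISON_EQUAL):
// the prepass leaves the nearest depth in the buffer, so only the fragment
// matching it is shaded in the main pass.
std::size_t ShadedCountEqualAfterPrepass(const FragmentList& frags, float clearDepth = 1.0f)
{
    float stored = clearDepth;
    for (float z : frags)          // prepass: depth writes only, no shading
        if (z < stored) stored = z;

    std::size_t shaded = 0;
    for (float z : frags)          // main pass: exact equality, no depth writes
        if (z == stored) ++shaded;
    return shaded;
}
```

With four fragments submitted back to front, the LESS path shades all four while the prepass + EQUAL path shades one.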

But this does not work well for me. This is the result I get when I set EqualsDSS:
But with this (the OMSetDepthStencilState call commented out):

void Direct3D::DrawScene()
{
    pDeviceContext->ClearDepthStencilView(m_DepthStencilView, D3D11_CLEAR_DEPTH|D3D11_CLEAR_STENCIL, 1.0f, 0);
    pDeviceContext->RSSetViewports(1, &m_ViewPort);

    // Now compute the ambient occlusion.
    pDeviceContext->ClearRenderTargetView(m_RenderTargetView, reinterpret_cast<const float*>(&Colors::Silver));
    //pDeviceContext->ClearDepthStencilView(m_DepthStencilView, D3D11_CLEAR_DEPTH|D3D11_CLEAR_STENCIL, 1.0f, 0);

    m_Land.Draw(pDeviceContext, m_Cam, mDirLights);

    // We already laid down scene depth to the depth buffer in the Normal/Depth map pass,
    // so we can set the depth comparison test to "EQUALS."  This prevents any overdraw
    // in this rendering pass, as only the nearest visible pixels will pass this depth
    // comparison test.
    //pDeviceContext->OMSetDepthStencilState(RenderStates::EqualsDSS, 0);

    if (GetAsyncKeyState('1'))
        DrawModels(); //..........
}
It works well. The only problem is performance.

Also, this is what the debug output says:

D3D11: ERROR: ID3D11DeviceContext::OMSetRenderTargets: The RenderTargetView at slot 0 is not compatable with the DepthStencilView. DepthStencilViews may only be used with RenderTargetViews if the effective dimensions of the Views are equal, as well as the Resource types, multisample count, and multisample quality. The RenderTargetView at slot 0 has (w:492,h:473,as:1), while the Resource is a Texture2D with (mc:1,mq:0). The DepthStencilView has (w:492,h:473,as:1), while the Resource is a Texture2D with (mc:4,mq:0). D3D11_RESOURCE_MISC_TEXTURECUBE factors into the Resource type, unless GetFeatureLevel() returns D3D_FEATURE_LEVEL_10_1 or greater. [ STATE_SETTING ERROR #388: OMSETRENDERTARGETS_INVALIDVIEW ]
EDIT: When I viewed the texture for debugging, it was empty, i.e. the models are not getting drawn to the SSAO map.

It is just as the debug output says; I can barely rephrase it: the render target view is not compatible with the depth stencil view. More specifically, your render target texture does not match the depth stencil buffer. They must be equal in size, resource type, multisample count, and multisample quality; otherwise nothing gets rendered.
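The rule from the debug message can be written down as a plain check. This is a sketch: ViewInfo and ViewsCompatible are hypothetical names, and in real code the values would come from the D3D11_TEXTURE2D_DESC of each resource. Resource type is not modeled. Plugging in the numbers from the error above shows that the mismatch is the sample count (mc:1 vs mc:4).

```cpp
#include <cassert>

// Hypothetical summary of the properties the D3D11 debug layer compares
// between a render target view and a depth stencil view.
struct ViewInfo
{
    unsigned width;         // "w" in the message
    unsigned height;        // "h"
    unsigned arraySize;     // "as"
    unsigned sampleCount;   // DXGI_SAMPLE_DESC::Count   ("mc")
    unsigned sampleQuality; // DXGI_SAMPLE_DESC::Quality ("mq")
};

// True when the two views agree on every property listed above.
bool ViewsCompatible(const ViewInfo& rtv, const ViewInfo& dsv)
{
    return rtv.width == dsv.width &&
           rtv.height == dsv.height &&
           rtv.arraySize == dsv.arraySize &&
           rtv.sampleCount == dsv.sampleCount &&
           rtv.sampleQuality == dsv.sampleQuality;
}
```

In practice this means the SSAO render target and the depth buffer bound with it need to be created with the same DXGI_SAMPLE_DESC (or a separate, non-multisampled depth buffer is used for that pass).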

EDIT: Also, concerning performance, we need to see your SSAO shader.

With SSAO, even when drawing nothing, I get 50 fps.
This means that either your total CPU time per frame or your total GPU time per frame is about 20 ms (1000 ms / 50 fps). It could well be that running your SSAO algorithm for every pixel is taking ~20 ms of GPU time...
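The ~20 ms figure is just 1000 ms divided by the frame rate. Working in milliseconds rather than fps keeps costs additive: going from 50 fps to 40 fps, for example, means each frame became 5 ms more expensive. A minimal helper (the function names are hypothetical):

```cpp
#include <cassert>

// Milliseconds spent per frame at a given frame rate.
double FrameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Frame rate that corresponds to a given frame time in milliseconds.
double Fps(double frameTimeMs)
{
    assert(frameTimeMs > 0.0);
    return 1000.0 / frameTimeMs;
}
```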

Thanks, I changed my multisample settings so that render target view 0 matches the depth buffer, and it worked.

There is now another problem.
Why is this happening?



EDIT: Also, here is the SSAO shader:

// Ssao.fx by Frank Luna (C) 2011 All Rights Reserved.
// Computes SSAO map.

cbuffer cbPerFrame
{
    float4x4 gViewToTexSpace; // Proj*Texture
    float4   gOffsetVectors[14];
    float4   gFrustumCorners[4];

    // Coordinates given in view space.
    float    gOcclusionRadius    = 0.5f;
    float    gOcclusionFadeStart = 0.2f;
    float    gOcclusionFadeEnd   = 2.0f;
    float    gSurfaceEpsilon     = 0.05f;
};

// Nonnumeric values cannot be added to a cbuffer.
Texture2D gNormalDepthMap;
Texture2D gRandomVecMap;

SamplerState samNormalDepth
{
    // Set a very far depth value if sampling outside of the NormalDepth map
    // so we do not get false occlusions.
    AddressU = BORDER;
    AddressV = BORDER;
    BorderColor = float4(0.0f, 0.0f, 0.0f, 1e5f);
};

SamplerState samRandomVec
{
    AddressU  = WRAP;
    AddressV  = WRAP;
};

struct VertexIn
{
    float3 PosL            : POSITION;
    float3 ToFarPlaneIndex : NORMAL;
    float2 Tex             : TEXCOORD;
};

struct VertexOut
{
    float4 PosH       : SV_POSITION;
    float3 ToFarPlane : TEXCOORD0;
    float2 Tex        : TEXCOORD1;
};

VertexOut VS(VertexIn vin)
{
    VertexOut vout;

    // Already in NDC space.
    vout.PosH = float4(vin.PosL, 1.0f);

    // We store the index to the frustum corner in the normal x-coord slot.
    vout.ToFarPlane = gFrustumCorners[vin.ToFarPlaneIndex.x].xyz;

    // Pass onto pixel shader.
    vout.Tex = vin.Tex;

    return vout;
}

// Determines how much the sample point q occludes the point p as a function
// of distZ.
float OcclusionFunction(float distZ)
{
    // If depth(q) is "behind" depth(p), then q cannot occlude p.  Moreover, if
    // depth(q) and depth(p) are sufficiently close, then we also assume q cannot
    // occlude p because q needs to be in front of p by Epsilon to occlude p.
    // We use the following function to determine the occlusion.
    //       1.0     -------------\
    //               |           |  \
    //               |           |    \
    //               |           |      \ 
    //               |           |        \
    //               |           |          \
    //               |           |            \
    //  ------|------|-----------|-------------|---------|--> zv
    //        0     Eps          z0            z1
    float occlusion = 0.0f;
    if(distZ > gSurfaceEpsilon)
    {
        float fadeLength = gOcclusionFadeEnd - gOcclusionFadeStart;

        // Linearly decrease occlusion from 1 to 0 as distZ goes
        // from gOcclusionFadeStart to gOcclusionFadeEnd.
        occlusion = saturate( (gOcclusionFadeEnd-distZ)/fadeLength );
    }

    return occlusion;
}

float4 PS(VertexOut pin, uniform int gSampleCount) : SV_Target
{
    // p -- the point we are computing the ambient occlusion for.
    // n -- normal vector at p.
    // q -- a random offset from p.
    // r -- a potential occluder that might occlude p.

    // Get viewspace normal and z-coord of this pixel.  The tex-coords for
    // the fullscreen quad we drew are already in uv-space.
    float4 normalDepth = gNormalDepthMap.SampleLevel(samNormalDepth, pin.Tex, 0.0f);

    float3 n = normalDepth.xyz;
    float pz = normalDepth.w;

    // Reconstruct full view space position (x,y,z).
    // Find t such that p = t*pin.ToFarPlane.
    // p.z = t*pin.ToFarPlane.z
    // t = p.z / pin.ToFarPlane.z
    float3 p = (pz/pin.ToFarPlane.z)*pin.ToFarPlane;

    // Extract random vector and map from [0,1] --> [-1, +1].
    float3 randVec = 2.0f*gRandomVecMap.SampleLevel(samRandomVec, 4.0f*pin.Tex, 0.0f).rgb - 1.0f;

    float occlusionSum = 0.0f;

    // Sample neighboring points about p in the hemisphere oriented by n.
    for(int i = 0; i < gSampleCount; ++i)
    {
        // Our offset vectors are fixed and uniformly distributed (so that our offset vectors
        // do not clump in the same direction).  If we reflect them about a random vector
        // then we get a random uniform distribution of offset vectors.
        float3 offset = reflect(gOffsetVectors[i].xyz, randVec);

        // Flip offset vector if it is behind the plane defined by (p, n).
        float flip = sign( dot(offset, n) );

        // Sample a point near p within the occlusion radius.
        float3 q = p + flip * gOcclusionRadius * offset;

        // Project q and generate projective tex-coords.
        float4 projQ = mul(float4(q, 1.0f), gViewToTexSpace);
        projQ /= projQ.w;

        // Find the nearest depth value along the ray from the eye to q (this is not
        // the depth of q, as q is just an arbitrary point near p and might
        // occupy empty space).  To find the nearest depth we look it up in the depthmap.
        float rz = gNormalDepthMap.SampleLevel(samNormalDepth, projQ.xy, 0.0f).a;

        // Reconstruct full view space position r = (rx,ry,rz).  We know r
        // lies on the ray of q, so there exists a t such that r = t*q.
        // r.z = t*q.z ==> t = r.z / q.z
        float3 r = (rz / q.z) * q;

        // Test whether r occludes p.
        //   * The product dot(n, normalize(r - p)) measures how much in front
        //     of the plane(p,n) the occluder point r is.  The more in front it is, the
        //     more occlusion weight we give it.  This also prevents self shadowing where
        //     a point r on an angled plane (p,n) could give a false occlusion since they
        //     have different depth values with respect to the eye.
        //   * The weight of the occlusion is scaled based on how far the occluder is from
        //     the point we are computing the occlusion of.  If the occluder r is far away
        //     from p, then it does not occlude it.
        float distZ = p.z - r.z;
        float dp = max(dot(n, normalize(r - p)), 0.0f);
        float occlusion = dp * OcclusionFunction(distZ);

        occlusionSum += occlusion;
    }

    occlusionSum /= gSampleCount;

    float access = 1.0f - occlusionSum;

    // Sharpen the contrast of the SSAO map to make the SSAO effect more dramatic.
    return saturate(pow(access, 4.0f));
}

technique11 Ssao
{
    pass P0
    {
        SetVertexShader( CompileShader( vs_4_0, VS() ) );
        SetGeometryShader( NULL );
        SetPixelShader( CompileShader( ps_4_0, PS(14) ) );
    }
}
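Two pieces of math from the shader above are easy to check on the CPU: the view-space reconstruction p = (pz / ToFarPlane.z) * ToFarPlane and the OcclusionFunction falloff curve. The sketch below is a straight C++ port using the default cbuffer values; Float3 and the constant names are hypothetical, and std::min/std::max stand in for HLSL saturate.

```cpp
#include <algorithm>
#include <cassert>

struct Float3 { float x, y, z; };

// Default values from the cbPerFrame declaration in Ssao.fx.
const float kOcclusionFadeStart = 0.2f;
const float kOcclusionFadeEnd   = 2.0f;
const float kSurfaceEpsilon     = 0.05f;

// Port of OcclusionFunction: 0 unless q is in front of p by more than the
// epsilon, then 1 up to FadeStart, fading linearly to 0 at FadeEnd.
float OcclusionFunction(float distZ)
{
    float occlusion = 0.0f;
    if (distZ > kSurfaceEpsilon)
    {
        float fadeLength = kOcclusionFadeEnd - kOcclusionFadeStart;
        float v = (kOcclusionFadeEnd - distZ) / fadeLength;
        occlusion = std::min(1.0f, std::max(0.0f, v)); // saturate
    }
    return occlusion;
}

// Port of the reconstruction in PS: p lies on the ray toward the far-plane
// point, so p = t * toFarPlane with t = pz / toFarPlane.z.
Float3 ReconstructViewPos(float pz, Float3 toFarPlane)
{
    float t = pz / toFarPlane.z;
    return { t * toFarPlane.x, t * toFarPlane.y, t * toFarPlane.z };
}
```

For example, OcclusionFunction(1.1f) is about 0.5, halfway down the linear fade between z0 = 0.2 and z1 = 2.0 in the diagram.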

Here is how I draw:

void Model::RenderNormalDepthMap(CXMMATRIX World, CXMMATRIX ViewProj)
{
    ID3DX11EffectTechnique* activeTech = Effects::SsaoNormalDepthFX->NormalDepthTech;

    XMMATRIX view = d3d->m_Cam.View();

    XMMATRIX world = World;
    XMMATRIX worldInvTranspose = MathHelper::InverseTranspose(world);
    XMMATRIX worldView   = world* view;
    XMMATRIX worldInvTransposeView = worldInvTranspose*view;
    XMMATRIX worldViewProj = world * ViewProj;

    float blendFactor[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for(UINT p = 0; p < techDesc.Passes; ++p)
    {
        for (UINT i = 0; i < mModel.mSubsetCount; i++)
        {
            activeTech->GetPassByIndex(p)->Apply(0, pDeviceContext);
        }
    }
}
void Model::Render(CXMMATRIX World, CXMMATRIX ViewProj)
{
    ID3DX11EffectTechnique* activeTech;

    XMMATRIX worldInvTranspose = MathHelper::InverseTranspose(W);
    XMMATRIX WorldViewProj = W * ViewProj;
    XMMATRIX TexTransform = XMMatrixIdentity();
    XMMATRIX ShadowTransform = W * XMLoadFloat4x4(&d3d->m_ShadowTransform);

    // Transform NDC space [-1,+1]^2 to texture space [0,1]^2
    XMMATRIX toTexSpace(
        0.5f,  0.0f, 0.0f, 0.0f,
        0.0f, -0.5f, 0.0f, 0.0f,
        0.0f,  0.0f, 1.0f, 0.0f,
        0.5f,  0.5f, 0.0f, 1.0f);

    if (mInfo.AlphaClip)
    {
        if (mInfo.NumLights == 1)
            activeTech = Effects::BasicFX->Light1TexAlphaClipTech;
        else if (mInfo.NumLights == 2)
            activeTech = Effects::BasicFX->Light2TexAlphaClipTech;
        else if (mInfo.NumLights == 3)
            activeTech = Effects::BasicFX->Light3TexAlphaClipTech;
        else
            activeTech = Effects::BasicFX->Light0TexAlphaClipTech;
    }
    else
    {
        if (mInfo.NumLights == 1)
            activeTech = Effects::BasicFX->Light1TexTech;
        else if (mInfo.NumLights == 2)
            activeTech = Effects::BasicFX->Light2TexTech;
        else if (mInfo.NumLights == 3)
            activeTech = Effects::BasicFX->Light3TexTech;
        else
            activeTech = Effects::BasicFX->Light0TexTech;
    }

    if (!mInfo.BackfaceCulling)
    {
        // ...
    }

    Effects::BasicFX->SetWorldViewProjTex(WorldViewProj * toTexSpace);

    float blendFactor[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for(UINT p = 0; p < techDesc.Passes; ++p)
    {
        for (UINT i = 0; i < mModel.mSubsetCount; i++)
        {
            if (mInfo.AlphaToCoverage)
                pDeviceContext->OMSetBlendState(RenderStates::AlphaToCoverageBS, blendFactor, 0xffffffff);

            activeTech->GetPassByIndex(p)->Apply(0, pDeviceContext);

            if (mInfo.AlphaToCoverage)
                pDeviceContext->OMSetBlendState(0, blendFactor, 0xffffffff);
        }
    }

    if (!mInfo.BackfaceCulling)
    {
        // ...
    }
}
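The toTexSpace matrix maps NDC [-1,+1]^2 to texture space [0,1]^2 and flips y, because texture v grows downward while NDC y grows upward. (The Ssao.fx comment describes gViewToTexSpace as "Proj*Texture", i.e. the projection followed by this same mapping.) A small sketch that applies the matrix with the row-vector convention DirectXMath uses (v * M); NdcToTex and Float2 are hypothetical names.

```cpp
#include <cassert>

struct Float2 { float x, y; };

// Applies the toTexSpace matrix from Model::Render to an NDC point.
// With w = 1 this reduces to u = 0.5*x + 0.5 and v = -0.5*y + 0.5.
Float2 NdcToTex(Float2 ndc)
{
    const float m[4][4] = {
        { 0.5f,  0.0f, 0.0f, 0.0f },
        { 0.0f, -0.5f, 0.0f, 0.0f },
        { 0.0f,  0.0f, 1.0f, 0.0f },
        { 0.5f,  0.5f, 0.0f, 1.0f },
    };
    const float v[4] = { ndc.x, ndc.y, 0.0f, 1.0f };
    float out[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

    // Row vector times matrix: out[col] = sum over rows of v[row] * m[row][col].
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            out[col] += v[row] * m[row][col];

    return { out[0] / out[3], out[1] / out[3] };
}
```

The top-left NDC corner (-1, +1) lands on texel coordinate (0, 0) and the bottom-right corner (+1, -1) on (1, 1).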

I fixed the problem, but SSAO is still the bottleneck (performance problems).


I reduced the number of samples from 14 to 5, blur the image 2 times instead of 4, and skip pixels that are too far away from the camera (if (p.z > 100) return 1.0f;), yet I only get 40 fps (at a 1024x768 screen resolution, with the SSAO map at half that size) when I'm very near the model that uses SSAO.
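A back-of-envelope count of the texture fetches in the SSAO pixel shader helps judge how much work the sample reduction actually removed. This sketch assumes a 512x384 SSAO map (half of 1024x768, as described above) and counts, per pixel, one normal/depth fetch for p, one random-vector fetch, and one normal/depth fetch per sample, as in PS(gSampleCount). It ignores the blur passes and early-outs; note that when the camera is very close to the model, few pixels have p.z > 100, so that branch likely saves little. SsaoFetchesPerFrame is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Rough count of texture fetches issued by the SSAO pixel shader per frame.
std::uint64_t SsaoFetchesPerFrame(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t sampleCount)
{
    assert(width > 0 && height > 0);
    const std::uint64_t pixels = static_cast<std::uint64_t>(width) * height;
    // 1 fetch for p's normal/depth + 1 random vector + 1 per offset sample.
    return pixels * (2u + sampleCount);
}
```

For 512x384 this gives about 3.1 million fetches with 14 samples and about 1.4 million with 5, which is a modest load for most GPUs; if the frame time barely changed after the cut, the time is probably going somewhere else.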

40fps with what hardware precisely? (I'm always surprised that no one seems to ask about the hardware when somebody thinks he has performance issues).


If you think that your SSAO is the bottleneck, then try rendering your frames with the occlusion buffer not being rendered into, and don't do the calculations for determining the occlusion value - then see what the performance is. This would be your baseline that you can start to reason about what is taking up the most time with your rendering scheme. Then add back in the rendering to the occlusion buffer, and finally you can add back in your occlusion calculations. See what each step is costing you before you try to optimize the code.
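The step-by-step approach above boils down to recording a frame time for each configuration and taking differences. Measure in milliseconds, not fps, so that the per-step costs add up. A minimal sketch; Measurement and IncrementalCosts are hypothetical names, and how each frame time is measured (averaged over many frames) is up to you.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One measurement per step: a label for the configuration and the average
// frame time it produced, in milliseconds.
using Measurement = std::pair<std::string, double>;

// Given measurements in the order features were re-enabled, returns how many
// milliseconds each step added on top of the previous one. The first entry
// is the baseline, so its cost is its own frame time.
std::vector<Measurement> IncrementalCosts(const std::vector<Measurement>& steps)
{
    std::vector<Measurement> costs;
    double previousMs = 0.0;
    for (const Measurement& step : steps)
    {
        assert(step.second > 0.0);
        costs.emplace_back(step.first, step.second - previousMs);
        previousMs = step.second;
    }
    return costs;
}
```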

Realtime rendering performance is a tricky thing to understand. You might be CPU bound due to lots of API calls. If that is the case, changing your SSAO calculations won't help you one bit. You could be texture bandwidth bound, in which case changing your calculations won't help either. Or you could be GPU computation bound, which means that changing your SSAO calculations will probably help some.
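A first cut at "CPU bound or GPU bound" is to compare CPU time per frame with GPU time per frame (the latter can be measured in D3D11 with D3D11_QUERY_TIMESTAMP and D3D11_QUERY_TIMESTAMP_DISJOINT queries). Assuming the CPU and GPU work overlap, as they usually do when the driver queues frames ahead, the slower side sets the frame rate. This is a rough sketch with a hypothetical function name; telling texture-bandwidth-bound from computation-bound needs further experiments (for example, varying the SSAO map resolution versus the sample count) or a GPU profiler.

```cpp
#include <cassert>
#include <string>

// cpuMs: CPU time spent building one frame (game logic + API calls).
// gpuMs: GPU time for the same frame, e.g. from timestamp queries.
// Returns which side is more likely limiting the frame rate.
std::string LikelyBottleneck(double cpuMs, double gpuMs)
{
    assert(cpuMs >= 0.0 && gpuMs >= 0.0);
    return cpuMs > gpuMs ? "CPU" : "GPU";
}
```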

The point is, there are lots of reasons that you might end up with performance problems - you can't fix them until you have identified what the problem is.

We need to know the system specs as well. :)


The system specs don't really matter. If he is trying to get his system running faster, the bottleneck is probably not going to be apparent from the type of GPU or CPU, or from how much memory he has. He needs to narrow down where the issue is, and then we can help him drill down and find solutions to the problems.

