The system specs don't really matter.

What if the OP has a GeForce 8400GT and wants 300 fps out of it? "Performance problems" are relative. On some hardware 40 fps is poor; on other hardware it is very fast. Some people need more than that, and others would be fine with just 30 fps.

Before calling this a performance problem, you have to define it (what the desired performance would be) and give it context (what hardware is being used). Only then can you say "Yep, this is running slow" or "Nope, numbers like those are expected with that configuration".


My GPU is average and it can render terrain with tessellation and patch culling at 500-800 fps.

When I add shadows it drops to 200-400 fps, but when I look at the sky it's 700 fps (because of culling).

Anyway, my CPU is:

Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz (4 CPUs), ~3.3GHz

GPU: (from dxdiag -> display)

Name: Intel(R) HD Graphics

Approx. Total memory: 1555 MB


The performance I want is 100 fps at 1024x768 with SSAO. As soon as I disable the SSAO computation, the fps increases to about 200. I need the headroom because the fps is going to drop further (maybe to 60-70 fps) once I add other things such as physics and collision.

Also, I'll try what Jason Z said.

I tried this SSAO shader, with the body commented out so that it just returns 1.0f:

float4 PS(VertexOut pin, uniform int gSampleCount) : SV_Target
{
    /*
    // p -- the point we are computing the ambient occlusion for.
    // n -- normal vector at p.
    // q -- a random offset from p.
    // r -- a potential occluder that might occlude p.

    // Get viewspace normal and z-coord of this pixel.  The tex-coords for
    // the fullscreen quad we drew are already in uv-space.
    float4 normalDepth = gNormalDepthMap.SampleLevel(samNormalDepth, pin.Tex, 0.0f);
    float3 n = normalDepth.xyz;
    float pz = normalDepth.w;

    // Reconstruct full view space position (x,y,z).
    // Find t such that p = t*pin.ToFarPlane.
    // p.z = t*pin.ToFarPlane.z
    // t = p.z / pin.ToFarPlane.z
    float3 p = (pz/pin.ToFarPlane.z)*pin.ToFarPlane;
    if (p.z > 100)
        return 1.0f;

    // Extract random vector and map from [0,1] --> [-1, +1].
    float3 randVec = 2.0f*gRandomVecMap.SampleLevel(samRandomVec, 4.0f*pin.Tex, 0.0f).rgb - 1.0f;
    float occlusionSum = 0.0f;

    // Sample neighboring points about p in the hemisphere oriented by n.
    for(int i = 0; i < gSampleCount; ++i)
    {
        // Our offset vectors are fixed and uniformly distributed (so that our offset vectors
        // do not clump in the same direction).  If we reflect them about a random vector
        // then we get a random uniform distribution of offset vectors.
        float3 offset = reflect(gOffsetVectors[i].xyz, randVec);

        // Flip offset vector if it is behind the plane defined by (p, n).
        float flip = sign( dot(offset, n) );

        // Sample a point near p within the occlusion radius.
        float3 q = p + flip * gOcclusionRadius * offset;

        // Project q and generate projective tex-coords.
        float4 projQ = mul(float4(q, 1.0f), gViewToTexSpace);
        projQ /= projQ.w;

        // Find the nearest depth value along the ray from the eye to q (this is not
        // the depth of q, as q is just an arbitrary point near p and might
        // occupy empty space).  To find the nearest depth we look it up in the depthmap.
        float rz = gNormalDepthMap.SampleLevel(samNormalDepth, projQ.xy, 0.0f).a;

        // Reconstruct full view space position r = (rx,ry,rz).  We know r
        // lies on the ray of q, so there exists a t such that r = t*q.
        // r.z = t*q.z ==> t = r.z / q.z
        float3 r = (rz / q.z) * q;

        // Test whether r occludes p.
        //   * The product dot(n, normalize(r - p)) measures how much in front
        //     of the plane(p,n) the occluder point r is.  The more in front it is, the
        //     more occlusion weight we give it.  This also prevents self shadowing where
        //     a point r on an angled plane (p,n) could give a false occlusion since they
        //     have different depth values with respect to the eye.
        //   * The weight of the occlusion is scaled based on how far the occluder is from
        //     the point we are computing the occlusion of.  If the occluder r is far away
        //     from p, then it does not occlude it.
        float distZ = p.z - r.z;
        float dp = max(dot(n, normalize(r - p)), 0.0f);
        float occlusion = dp * OcclusionFunction(distZ);
        occlusionSum += occlusion;
    }

    occlusionSum /= gSampleCount;
    float access = 1.0f - occlusionSum;

    // Sharpen the contrast of the SSAO map to make the SSAO effect more dramatic.
    return saturate(pow(access, 4.0f));
    */
    return 1.0f;
}
The performance is still the same, so the shader itself is not the problem.
When I'm looking the other way or I'm too far away, I skip computing the SSAO map with this code:

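bool ComputeSSAOThisFrame = false;
for (USHORT i = 0; i < ModelInstances.size(); ++i)
{
    // Only visible instances that ask for SSAO can enable it for this frame.
    if (ModelInstances[i].ComputeSSAO && ModelInstances[i].Visible)
    {
        XMVECTOR campos = m_Cam.GetPositionXM();
        // Use a named local; taking the address of a temporary XMFLOAT3 is non-standard C++.
        XMFLOAT3 translation(ModelInstances[i].World._41,
            ModelInstances[i].World._42, ModelInstances[i].World._43);
        XMVECTOR modelpos = XMLoadFloat3(&translation);
        XMVECTOR dist = modelpos - campos;
        float distf;
        XMStoreFloat(&distf, XMVector3LengthSq(dist));
        if (!(distf > 10000)) // squared distance, sqrt(10000) = 100
        {
            ComputeSSAOThisFrame = true;
            break; // one nearby visible model is enough
        }
    }
}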
Now the performance when the model is not visible:
When too far away:
Now when I'm looking at the sky, everything is culled using intersection tests (shadow casters are culled too, i.e. when rendering to the shadow map), and the terrain is culled in the constant hull shader.
Now the performance is:
This code is causing the performance issue: when I don't call it, the FPS increases from 40 to 136.

void Ssao::ComputeSsao(const Camera& camera)
{
    // Bind the ambient map as the render target.  Observe that this pass does not bind
    // a depth/stencil buffer--it does not need it, and without one, no depth test is
    // performed, which is what we want.
    ID3D11RenderTargetView* renderTargets[1] = {mAmbientRTV0};
    mDC->OMSetRenderTargets(1, renderTargets, 0);
    mDC->ClearRenderTargetView(mAmbientRTV0, reinterpret_cast<const float*>(&Colors::Black));
    mDC->RSSetViewports(1, &mAmbientMapViewport);

    // Transform NDC space [-1,+1]^2 to texture space [0,1]^2
    static const XMMATRIX T(
        0.5f, 0.0f, 0.0f, 0.0f,
        0.0f, -0.5f, 0.0f, 0.0f,
        0.0f, 0.0f, 1.0f, 0.0f,
        0.5f, 0.5f, 0.0f, 1.0f);

    XMMATRIX P  = camera.Proj();
    XMMATRIX PT = XMMatrixMultiply(P, T);

    UINT stride = sizeof(Vertex::Basic32);
    UINT offset = 0;
    mDC->IASetVertexBuffers(0, 1, &mScreenQuadVB, &stride, &offset);
    mDC->IASetIndexBuffer(mScreenQuadIB, DXGI_FORMAT_R16_UINT, 0);

    ID3DX11EffectTechnique* tech = Effects::SsaoFX->SsaoTech;
    D3DX11_TECHNIQUE_DESC techDesc;
    tech->GetDesc( &techDesc );
    for(UINT p = 0; p < techDesc.Passes; ++p)
    {
        tech->GetPassByIndex(p)->Apply(0, mDC);
        mDC->DrawIndexed(6, 0, 0);
    }
}

void Ssao::BlurAmbientMap(int blurCount)
{
    for(int i = 0; i < blurCount; ++i)
    {
        // Ping-pong the two ambient map textures as we apply
        // horizontal and vertical blur passes.
        BlurAmbientMap(mAmbientSRV0, mAmbientRTV1, true);
        BlurAmbientMap(mAmbientSRV1, mAmbientRTV0, false);
    }
}

void Ssao::BlurAmbientMap(ID3D11ShaderResourceView* inputSRV, ID3D11RenderTargetView* outputRTV, bool horzBlur)
{
    ID3D11RenderTargetView* renderTargets[1] = {outputRTV};
    mDC->OMSetRenderTargets(1, renderTargets, 0);
    mDC->ClearRenderTargetView(outputRTV, reinterpret_cast<const float*>(&Colors::Black));
    mDC->RSSetViewports(1, &mAmbientMapViewport);

    Effects::SsaoBlurFX->SetTexelWidth(1.0f / mAmbientMapViewport.Width );
    Effects::SsaoBlurFX->SetTexelHeight(1.0f / mAmbientMapViewport.Height );

    // Pick the technique that matches the blur direction.
    ID3DX11EffectTechnique* tech;
    if(horzBlur)
        tech = Effects::SsaoBlurFX->HorzBlurTech;
    else
        tech = Effects::SsaoBlurFX->VertBlurTech;

    UINT stride = sizeof(Vertex::Basic32);
    UINT offset = 0;
    mDC->IASetVertexBuffers(0, 1, &mScreenQuadVB, &stride, &offset);
    mDC->IASetIndexBuffer(mScreenQuadIB, DXGI_FORMAT_R16_UINT, 0);

    D3DX11_TECHNIQUE_DESC techDesc;
    tech->GetDesc( &techDesc );
    for(UINT p = 0; p < techDesc.Passes; ++p)
    {
        tech->GetPassByIndex(p)->Apply(0, mDC);
        mDC->DrawIndexed(6, 0, 0);

        // Unbind the input SRV as it is going to be an output in the next blur.
        tech->GetPassByIndex(p)->Apply(0, mDC);
    }
}

When I'm too far away or the model is not visible, I skip it like this and the performance suddenly increases.

if (ComputeSSAOThisFrame) //don't compute if all models are not visible or all models are far.
  // Now compute the ambient occlusion.

NOTE: The resolution is 1024x768. At 800x600 or 500x500 it reaches about 70-100 fps with SSAO, and 216 fps when not computing SSAO. My goal is to reach 100 fps with SSAO at 1024x768.

EDIT: here are SSAO blur and ssaoNormaldepth shaders.

SSAO normal depth:

SSAO Blur:

Maybe you can guess what the appropriate performance for a given algorithm on particular hardware is, but I doubt you can get closer than an order of magnitude to the true 'max' performance. At least in my opinion, the specs may be interesting to hear and compare with your own experiences, but there are far too many variables in play for them to be a useful input to a performance discussion. He indicated that he gets 40 FPS. Now that he has given his hardware specs, what is your estimate of what his performance should be? What if he said he gets 80 FPS, or 160 FPS? What should he be getting before you consider it a performance problem with his system? How would you suggest he improve his performance based on specs?

In reality, you need so much more information that only running tests will tell you.

@newtechnology: It is fairly common for SSAO performance to scale with the number of pixels in a scene, since the computation and texture lookups are linearly related to the number of pixels in the render target. When you run your tests, do so from a single camera view point - don't move around or change how many objects you are rendering. This will give a more stable estimate of your performance and let you optimize effectively.
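
To put a number on the linear-scaling point, here is a small C++ sketch. The helper name `scaledPassCostMs` is made up for illustration, and it assumes a full-screen pass costs strictly in proportion to the pixel count of its render target, which real hardware only approximates.

```cpp
#include <cassert>

// Hypothetical helper: estimate the cost of a full-screen pass at a new
// resolution, ASSUMING the cost is strictly proportional to the pixel count.
double scaledPassCostMs(double costMs, int fromW, int fromH, int toW, int toH)
{
    assert(fromW > 0 && fromH > 0 && toW > 0 && toH > 0);
    const double fromPixels = static_cast<double>(fromW) * fromH;
    const double toPixels   = static_cast<double>(toW) * toH;
    return costMs * (toPixels / fromPixels);
}
```

For example, 800x600 has 480,000 pixels against 786,432 at 1024x768, so `scaledPassCostMs(10.0, 1024, 768, 800, 600)` estimates roughly 6.1 ms for a pass that takes 10 ms at the larger size. Treat the result as a rough expectation to check against a measurement, not a prediction.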

So once you have picked a representative view of your scene, take a measurement of your average frame time. Then do the same measurement with your occlusion buffer calculations from above disabled. Now you can selectively re-enable parts of your algorithm and see what effect they are having on the performance.
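
The measurement procedure above could be sketched in C++ roughly like this. `FrameTimeAverager`, `fpsToFrameMs` and `featureCostMs` are hypothetical names, not part of the code in this thread, and the sketch assumes you already have a per-frame time in milliseconds from whatever timer your application uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert a frames-per-second reading to milliseconds per frame.
double fpsToFrameMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Hypothetical accumulator for per-frame times (in milliseconds) collected
// from a fixed camera view.
class FrameTimeAverager
{
public:
    void addFrame(double frameMs) { mSamples.push_back(frameMs); }
    std::size_t count() const { return mSamples.size(); }

    double averageMs() const
    {
        assert(!mSamples.empty());
        double sum = 0.0;
        for (double ms : mSamples)
            sum += ms;
        return sum / static_cast<double>(mSamples.size());
    }

private:
    std::vector<double> mSamples;
};

// Cost of a feature: average frame time with it enabled minus the average
// frame time with it disabled, measured from the same view.
double featureCostMs(const FrameTimeAverager& enabled, const FrameTimeAverager& disabled)
{
    return enabled.averageMs() - disabled.averageMs();
}
```

Working in milliseconds rather than fps makes the costs additive. With the numbers reported earlier in the thread, 40 fps is 25 ms per frame and 136 fps is about 7.35 ms, so the SSAO path costs roughly 17.6 ms, while a 100 fps target leaves a budget of 10 ms for the whole frame.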

Your blur algorithm takes 24 texture samples (12 samples * 2 textures) per pixel. That's 48 texture samples for horizontal + vertical blur passes (i.e. a huge amount!). If you remove the blur, what happens to your frame time?
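
To see how that adds up over a whole frame, here is a minimal C++ sketch. The function name `blurFetchesPerFrame` is hypothetical, and the example figures below assume the ambient map is rendered at the full 1024x768 resolution and blurred once, neither of which the thread states.

```cpp
#include <cassert>
#include <cstdint>

// Texture fetches issued by a separable blur over a width x height target.
// samplesPerPixelPerPass covers everything one pixel reads in one pass
// (24 in the post above: 12 taps from each of 2 textures).
std::uint64_t blurFetchesPerFrame(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t samplesPerPixelPerPass,
                                  std::uint32_t blurIterations)
{
    assert(width > 0 && height > 0);
    const std::uint64_t pixels = static_cast<std::uint64_t>(width) * height;
    const std::uint64_t passes = 2ull * blurIterations; // horizontal + vertical
    return pixels * samplesPerPixelPerPass * passes;
}
```

Under those assumptions, `blurFetchesPerFrame(1024, 768, 24, 1)` comes to 37,748,736 fetches per frame, and every extra blur iteration adds the same amount again.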

This is my fps with no blur. (92 - 165)



So now you know that your blur is a big part of your performance problem.

A couple of quick thoughts:

- Do you really need so many taps? Note that if you're using bilinear filtering, you can sample from between two pixels to get contributions from both without making two texture samples. From a quick look at your code, it doesn't look like you're doing that.

- Are you using all the channels of your occlusion texture (samInputImage)? You're really just interested in one channel, right? Then maybe you could squeeze your depth and normal into the remaining 3 channels (you'd need to squeeze your normal into 2, I guess), so you don't have to sample from 2 textures.

- Starting from the center texel, once you hit a discontinuity in the blur, you don't need to continue in that direction, right? So you might be able to use dynamic branching to bail out of some of the calculations. This could actually end up making performance worse, but there's a chance it could improve things and might be worth experimenting with.
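
The bilinear trick from the first point can be precomputed on the CPU. Below is a minimal C++ sketch, assuming a blur with fixed per-tap weights and a sampler with linear filtering; `Tap` and `mergeTapsForBilinear` are hypothetical names. Note that a blur which tests each sample for depth or normal discontinuities cannot merge taps exactly this way, so for an edge-aware SSAO blur this would only be an approximation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Tap
{
    float offset; // distance from the center texel, in texels
    float weight; // blur weight applied to the fetched value
};

// Merge neighbouring taps (1,2), (3,4), ... into single bilinear fetches.
// weights[i] is the weight of the texel at offset i + 1 on one side of the
// center; the center texel itself is handled separately by the caller.
std::vector<Tap> mergeTapsForBilinear(const std::vector<float>& weights)
{
    std::vector<Tap> merged;
    for (std::size_t i = 0; i < weights.size(); i += 2)
    {
        const float o1 = static_cast<float>(i + 1);
        const float w1 = weights[i];
        if (i + 1 == weights.size())
        {
            merged.push_back({o1, w1}); // odd tap count: the last tap stays as is
            break;
        }
        const float o2 = o1 + 1.0f;
        const float w2 = weights[i + 1];
        const float w  = w1 + w2;
        assert(w > 0.0f);
        // Sampling at the weighted average position makes the hardware's
        // bilinear filter blend the two texels in the ratio w1 : w2.
        merged.push_back({(o1 * w1 + o2 * w2) / w, w});
    }
    return merged;
}
```

Each merged tap replaces two fetches with one, so the number of samples per side drops to about half. The shader would then sample at `center + offset * texelSize` and multiply by `weight`, once for each side of the center.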

