SSAO horrendous performance and graphical issue(s)

Started by
29 comments, last by Jason Z 10 years, 11 months ago

Hello,

I've somewhat successfully implemented the SSAO approach featured in this article: http://www.gamedev.net/page/resources/_/technical/graphics-programming-and-theory/a-simple-and-practical-approach-to-ssao-r2753. Here is the shader:


float4 cSize : register(c0); // xy = screen, zw = 1 / random texture
float4 cParams : register(c1); // scale, bias, intensity, sample rad
// currently passed in:              { 0.1f, 0.045f, 1.0f, 0.5f };
float4x4 cView : register(c2);
sampler cNormalSampler : register(s0);
sampler cPositionSampler : register(s1);
sampler cNoiseSampler : register(s2);

struct PS_INPUT
{
	float4 vPos             : POSITION0;
	float2 vTex0            : TEXCOORD0;
};

float3 getPosition(in float2 uv)
{
	return mul( tex2D(cPositionSampler, uv).xyz, cView) ;
}

float3 getNormal(in float2 uv)
{
	return normalize( mul(tex2D(cNormalSampler, uv).xyz * 2.0f - 1.0f, cView) );
}

float2 getRandom(in float2 uv)
{
	const float2 random_size = 1/64.0f;
	return normalize(tex2D(cNoiseSampler, cSize.xy * uv * cSize.zw).xy * 2.0f - 1.0f);
}

float doAmbientOcclusion(in float2 tcoord,in float2 uv, in float3 p, in float3 cnorm)
{
	float3 diff = getPosition(tcoord + uv) - p;
	const float3 v = normalize(diff);
	const float d = length(diff)*cParams.x;
	return max(0.0, dot(cnorm, v)- cParams.y)* (1.0 / (1.0 + d) ) * cParams.z;
}

float4 mainPS(PS_INPUT i) : COLOR0
{
	const float2 vec[4] = {float2(1,0),float2(-1,0),
				float2(0,1),float2(0,-1)};

	float3 p = getPosition(i.vTex0);
	
	float3 n = getNormal(i.vTex0);
	
	float2 rand = getRandom(i.vTex0);

	float ao = 0.0f;
	float rad = cParams.w / p.z;

	//**SSAO Calculation**//
	int iterations = 4;
	for (int j = 0; j < iterations; ++j)
	{
	  float2 coord1 = reflect(vec[j], rand)*rad;
	  float2 coord2 = float2(coord1.x*0.707 - coord1.y*0.707,
				  coord1.x*0.707 + coord1.y*0.707);
	  
	  ao += doAmbientOcclusion(i.vTex0, coord1*0.25, p, n);
	  ao += doAmbientOcclusion(i.vTex0, coord2*0.5, p, n);
	  ao += doAmbientOcclusion(i.vTex0, coord1*0.75, p, n);
	  ao += doAmbientOcclusion(i.vTex0, coord2, p, n);
	}
	ao/=(float)iterations*4.0;
	//**END**//
	
	return 1.0-ao;
}

However, I'm having multiple issues:

- First of all, the performance hit is outright horrible. In an empty scene, the game drops from 600 to 200 FPS. That's about 4 ms, simply for SSAO'ing an empty scene. But OK, 4 ms for that effect seems somewhat reasonable. However, as soon as I load some bigger scene, the FPS drops down to no more than 20-30 FPS. Without SSAO, that scene would render at about 200 FPS. That's almost 45 ms just for the SSAO. I can add 20-30 deferred lights before I even come close to that value. See the attachment for what the scene looks like. If I scroll out, so that the whole screen is still covered, I get about 100 FPS. Why is that? What makes such a huge difference in performance between a near object and a far one in a deferred renderer, where the SSAO shader works on the same-sized texture anyway? oO And, more importantly, how can I solve this? I'm running on a GeForce GTX 560 Ti, so it should at least run at 60 FPS oO

- Note how in the screenshot there is a subtle black line in the middle of the scene. If I zoom out, this line becomes white, and it rotates as I rotate the scene. Where does this come from? Is there any way to resolve it?

- The article says that in order to get position and normal from world space to view space, I should multiply by the view matrix. However, if I pass in the view matrix as cView, the ambient occlusion doesn't rotate correctly as I rotate the scene; it seems to be fixed to a certain side of the model. It's hard to describe: imagine having a model where, if you look from one side, it is somehow SSAO'd (way too dark, though), and if you look from the other side, there is no occlusion at all. Passing in the view-projection matrix fixes this, but I suspect there is some issue in my code. Is the code right, or do you see something suspicious?

Hopefully someone has an idea, especially about the performance part...

Disclaimer first: I haven't done much deferred rendering, let alone AO, so take this with a grain of salt.

Seems you don't do any GBuffer packing; the most obvious candidate is the position. For that, search for the term e.g. "position reconstruction from depth". Also: If your GBuffer were already in view space, there would be no need to do that transformation over and over again (cView).
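To make the "position reconstruction from depth" suggestion concrete, here's a minimal CPU-side sketch of the math (the names and the linear-depth/FOV conventions are illustrative assumptions, not from this thread): scale a per-pixel view ray by the stored linear depth to recover the view-space position, so the GBuffer only needs a depth channel instead of a full position texture.

```cpp
// Hypothetical sketch: view-space position from linear depth.
// uv in [0,1], linearDepth in [0,1] (viewZ / farZ); tanHalfFov and
// aspect describe the projection. All parameter names are assumptions.
struct Float3 { float x, y, z; };

Float3 reconstructViewPos(float u, float v, float linearDepth,
                          float tanHalfFov, float aspect, float farZ)
{
    // Direction of the ray through this pixel, at unit view-space depth.
    float rx = (u * 2.0f - 1.0f) * tanHalfFov * aspect;
    float ry = (1.0f - v * 2.0f) * tanHalfFov;
    // Scale the ray by the actual view-space depth.
    float viewZ = linearDepth * farZ;
    return Float3{ rx * viewZ, ry * viewZ, viewZ };
}
```

In the pixel shader the same idea becomes one depth fetch plus a multiply by an interpolated ray, replacing the full-fat position texture read.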

Then: your resolution is huge. Can one do AO in low-res, similar to a bloom?

As for that line: at first I thought it's the classic D3D9 pixel-to-texel offset trouble, but that shouldn't account for the full-screen quad diagonal IMO (though you got that brown effect top and left too, so investigate anyway). How do you apply the AO finally? Alpha blending? Edit: MSAA enabled, perchance?

Seems you don't do any GBuffer packing; the most obvious candidate is the position. For that, search for the term e.g. "position reconstruction from depth". Also:

Isn't GBuffer packing more of a technique to save memory than to improve performance - talking about speed vs. memory usage?

Also: If your GBuffer were already in view space, there would be no need to do that transformation over and over again (cView).

Hm, if my gbuffer were in view space, I'd need to calculate it back into world space for lighting - or is there any way to do lighting calculations in view space, too?

Then: your resolution is huge. Can one do AO in low-res, similar to a bloom?

I'm not sure, especially talking about image quality. Bloom works well with low-resolution buffers because it covers "large" areas, and doesn't display details - SSAO on the other hand, from what I can see, puts a lot of weight on relatively smaller details - if I were to render this to a half or quarter size buffer and then stretch it, I don't think it would look that good really.

As for that line: at first I thought it's the classic D3D9 pixel-to-texel offset trouble, but that shouldn't account for the full-screen quad diagonal IMO (though you got that brown effect top and left too, so investigate anyway).

Oh, no, I didn't mean the brown line; that's just how Nvidia PerfHUD displays what has been rendered in that command. Let me show you again what I meant. This time the buffer is saved without the 1.0 - ao, plus I added a blur. However, this line is always there: the closer I get to the scene, the more subtle it gets, but it is always there, always orienting itself towards the camera as I rotate the scene, yet always lying on the geometry. Any idea where that can come from?

How do you apply the AO finally? Alpha blending?

Currently I just have a 1.0-brightness ambient light where I multiply the ambient term by 1.0 - ao. Not sure if this is the best approach though; since you haven't done AO yourself, I guess you don't know what's the best way to apply the ambient occlusion term, right? ;)

Edit: MSAA enabled, perchance?

Nah, it isn't on; if only it were :/

I don't know what might cause your slowdown.

I have the same gfx card as you (GTX 560 Ti), and recently in my renderer (view space, deferred) I used the SSAO pixel shader from the Crysis 1 package to test it out; you can find it somewhere in the game folders and unpack it with WinRAR, for example.

pic1.jpg

pic2.jpg

I have ~3000 static objects in the scene, 1 directional light (sun), 1 spotlight (flashlight), 7 cascade splits (4096x4096 per split), and SSAO with blur (screen size). No optimizations were done, since I'm still experimenting, and I get an average of ~80 FPS.

Hm, if my gbuffer were in view space, I'd need to calculate it back into world space for lighting - or is there any way to do lighting calculations in view space, too?

Why would you convert your view-space GBuffer to world space? Just transform your light data (the dir/pos you are sending to the shader) into view space and use the same math you used before with world space.
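A minimal sketch of what "transform your light data into view space" means on the host side (the row-major matrix layout and names here are illustrative assumptions): multiply the world-space light position by the view matrix once per frame on the CPU, then the lighting shader can work entirely in view space with unchanged math.

```cpp
// Hypothetical CPU-side sketch: bring a world-space light into view space.
#include <array>

using Mat4 = std::array<float, 16>; // row-major: m[row * 4 + col]
struct Vec4 { float x, y, z, w; };

Vec4 mulMatVec(const Mat4& m, const Vec4& v)
{
    return { m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
             m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
             m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
             m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w };
}

// Positions use w = 1 (translation applies); directions use w = 0
// (translation is ignored). Upload the result as the shader constant.
// Vec4 lightPosView = mulMatVec(viewMatrix, { lx, ly, lz, 1.0f });
```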


...
	ao/=(float)iterations*4.0;
	//**END**//

	return 1.0-ao;
}

I am not sure, but you might have an issue here if 'ao' is not in the 0 - 1 range. See if it makes any difference if you saturate it:


	ao/=(float)iterations*4.0;
	//**END**//

	return 1.0 - saturate(ao);

- You should get rid of all the matrix-vector multiplies. It is possible: work in view space and use an alternative method for reconstructing the position.

- You may do some level of detail with the SSAO: the further away the pixel is, the fewer loop iterations you execute.

- Use depth or stencil to mask out the sky, so you don't process pixels that don't need SSAO.

Cheers!

Isn't GBuffer packing more of a technique to save memory than to improve performance - talking about speed vs. memory usage?

One doesn't exclude the other; on the contrary: the more data you move around, the more time it takes. Bandwidth. You sample the position seventeen times per pixel (once for the center, plus 4x4 offsets). What is the format, by the way? The texture cache will hopefully alleviate the problem a bit, but nonetheless.

Personal experience of late: playing with a noise shader, I first had some vectors in a lookup texture, then I constructed them purely with some math (à la the Perlin reference implementation), so no texture access at all. That gave me a 10% boost even though the ALU (math) instruction count grew quite a bit. (Note: GPUs are sometimes very counter-intuitive, so there could be another culprit for that difference, I don't know. Neither do I have "real" profiling tools.)

About that line: I cannot make much out of it; it's probably more obvious interactively. But your initial description makes me suspect some mix-up with spaces.

The most likely reason that SSAO performance changes when objects are close to the view is texture cache thrashing. That would come from the sample offsets being effectively larger in texture space when the geometry is closer to the camera. However, it looks like you are using a texture-space sampling offset, so that probably isn't your problem.

Since you are using texture-space offsets, there should be no difference in the time needed for SSAO before and after the scene is filled. You are most likely running into a bottleneck with your other code running sequentially with the SSAO pass. Can you try to render your input data to the SSAO algorithm, then keep those buffers unmodified and see what the performance is like?

My guess would be that you are texture bandwidth bound, due to the gbuffer generation, sampling, and the SSAO sampling process. That's lots of reads and writes... especially at high resolutions, so if you can cut down on the GBuffer size or otherwise save some bandwidth, that will probably help out.

Also, there were a few presentations (from NVIDIA, I think) about rendering the occlusion buffer at 1/4 scale, then doing a bilateral up-sample pass to scale it up to the full-resolution size. So that is an option, but I'm pretty sure you can improve things without having to resort to something like that.
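For reference, the core of such a bilateral up-sample is just a depth-aware weighting of the low-res AO samples, so occlusion doesn't bleed across depth discontinuities. A hedged CPU sketch of the weight term (the names and the Gaussian falloff are illustrative assumptions, not from any particular presentation):

```cpp
// Hypothetical sketch: weight for one low-res AO neighbor of a full-res
// pixel. Combine the normal bilinear weight with a depth-similarity term.
#include <cmath>

float bilateralWeight(float bilinearW, float depthFull, float depthLow,
                      float depthSharpness /* tuning constant, e.g. 32 */)
{
    float dz = depthFull - depthLow;
    // Similar depths keep the full bilinear weight; a depth gap kills it.
    return bilinearW * std::exp(-depthSharpness * dz * dz);
}

// Usage: ao = sum(w_i * ao_i) / sum(w_i) over the four low-res neighbors.
```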

I have ~3000 static objects in the scene, 1 directional light (sun), 1 spotlight (flashlight), 7 cascade splits (4096x4096 per split), and SSAO with blur (screen size). No optimizations were done, since I'm still experimenting, and I get an average of ~80 FPS.

What resolution are you rendering at? Still, that's quite a bit better; I only have one ambient light and one static object (single VB + draw call), just that I'm rendering at 2560x1600...

Why would you convert your view-space GBuffer to world space? Just transform your light data (the dir/pos you are sending to the shader) into view space and use the same math you used before with world space.

Ah, neat, going to try this out ASAP.

I am not sure, but you might have an issue here if 'ao' is not in the 0 - 1 range. See if it makes any difference if you saturate it:

This solved itself, since I'm now storing "ao" in the buffer and doing the 1.0 - ao in the light shader, but it looks pretty much the same, so I don't think this was the issue.

You should get rid of all the matrix-vector multiplies. It is possible: work in view space and use an alternative method for reconstructing the position.

- You may do some level of detail with the SSAO: the further away the pixel is, the fewer loop iterations you execute.

- Use depth or stencil to mask out the sky, so you don't process pixels that don't need SSAO.

Unfortunately, aside from your first idea (I am indeed going to change my GBuffer layout and position calculation accordingly), none of this would (can?) make any difference. As you can see in the scene, there is no sky (behind the buildings is the side of a mountain), plus the pixels are very close, so LOD isn't much of an option here either. But thanks for the tips anyway!

Since you are using texture-space offsets, there should be no difference in the time needed for SSAO before and after the scene is filled. You are most likely running into a bottleneck with your other code running sequentially with the SSAO pass. Can you try to render your input data to the SSAO algorithm, then keep those buffers unmodified and see what the performance is like?

Hm, that would be kind of complicated, but I don't really think it would aid much in performance, since basically if I move the scene further away, I start getting ~100 FPS. See the other screenshot in one of my posts: that scene yields about 5 times the performance, while the same amount of pixels on the screen is covered by objects... That's why it is really weird; I can't think of a reason why it goes slow if I zoom in on objects, but as soon as objects are further away, everything works fine? oO

Also, there were a few presentations (from NVIDIA, I think) about rendering the occlusion buffer at 1/4 scale, then doing a bilateral up-sample pass to scale it up to the full-resolution size. So that is an option, but I'm pretty sure you can improve things without having to resort to something like that.

Yeah, this would probably help; at least if I render at the lowest possible resolution (640x480), I get ~600 FPS. Is the bilateral up-sample pass straightforward in the light shader, or is it more complicated?

As other people have said, you really need to reconstruct position from Z, which can get you the result in view space more directly, save a bunch of memory bandwidth on the texture fetch, and probably give you a significant speedup.

However that shader is also possible to optimize as-is. Here's a few things to try:

There's one simple trick that will save you a few shader instructions in this case (it only saves one in the D3D assembly, but GPU ShaderAnalyzer says it's a much bigger win of about 10% on actual hardware):

row_major float4x4 cView : register(c2);

You'll obviously need to transpose the matrix in code for that to work.

You should also change the max(0.0, ...) to saturate(...), because saturate is free on the hardware and max isn't, and I think the value won't be above 1.0 in this case. The gain from that isn't so big, though. I've not checked that it looks the same.
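A tiny CPU sketch of why that swap is safe only while the value stays at or below 1.0 (the function names are illustrative): max and saturate agree on everything up to 1.0 and only diverge above it.

```cpp
// Illustrative comparison of the two clamping behaviors.
#include <algorithm>

float withMax(float x)      { return std::max(0.0f, x); }          // clamp below only
float withSaturate(float x) { return std::min(std::max(x, 0.0f), 1.0f); } // clamp to [0,1]
```

So if the occlusion term ever exceeded 1.0 (e.g. with a large intensity constant), the two versions would start to differ visibly.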

There's also a small speedup for GetRandom(), although pre-combining the constant on the CPU would be even better:

float2 getRandom(in float2 uv)
{
	return normalize(tex2D(cNoiseSampler, uv * (cSize.xy * cSize.zw)).xy * 2.0f - 1.0f);
}

There's one simple trick that will save you a few shader instructions in this case (it only saves one in the D3D assembly, but GPU ShaderAnalyzer says it's a much bigger win of about 10% on actual hardware):

row_major float4x4 cView : register(c2);

You'll obviously need to transpose the matrix in code for that to work.

No problem with the transpose thing, but am I even going to need this matrix at all after I've changed to working in view space? I'd store normals in view space and reconstruct position from depth, so there'd be no need for the view matrix here... right?

You should also change the max(0.0, ...) to saturate(...), because saturate is free on the hardware and max isn't, and I think the value won't be above 1.0 in this case. The gain from that isn't so big, though. I've not checked that it looks the same.

Looks pretty much the same, but it didn't do much for performance. The bottleneck is obviously somewhere else, but it's still good to know; it might help once I've gotten rid of the current bottleneck.

There's also a small speedup for GetRandom(), although pre-combining the constant on the CPU would be even better:

Well, I tried that, but for some reason it won't work (it produces completely different, weird-looking results). See anything that might be wrong?


// precalculated:
const float screen[4] = { 2560.0f * 1.0f / 64.0f, 1600.0f * 1.0f / 64.0f, 0.0f, 0.0f };
pSSAO->SetPixelConstant(0, screen);

// before:
const float screen[4] = { 2560.0f, 1600.0f, 1.0f / 64.0f, 1.0f / 64.0f };
pSSAO->SetPixelConstant(0, screen);

// new shader :

float2 getRandom(in float2 uv)
{
	const float2 random_size = 1/64.0f;
	return normalize(tex2D(cNoiseSampler, cSize.xy * uv).xy * 2.0f - 1.0f);
}

This should do the same thing, shouldn't it? Well, it certainly doesn't...
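For what it's worth, the folding itself is mathematically sound. A quick CPU check (illustrative only, not the actual shader) shows the two forms agree to within float rounding, which suggests the visual difference comes from something else (e.g. the constant not actually being re-uploaded, or cSize.zw still being referenced somewhere):

```cpp
// Compare the original per-pixel math with the pre-folded constant.
#include <cmath>

// before: tex coord = screen * uv * (1/64)
float oldForm(float uv, float screen) { return screen * uv * (1.0f / 64.0f); }
// after: tex coord = uv * (screen/64), folded on the CPU
float newForm(float uv, float screen) { return uv * (screen * (1.0f / 64.0f)); }
```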

This topic is closed to new replies.
