*solved* Too slow for full resolution, want to improve this SSAO code

16 comments, last by agleed 9 years, 1 month ago

Hello.

TL;DR: I have an SSAO renderer that is too slow for full resolution.

EDIT: I have written a summary of my findings further down.

I have a home-made SSAO shader that I currently run at a reduced resolution and then upscale using a bilateral upsampling shader to maintain the sharpness of edges. The shader relies on normals calculated from the depth buffer to reduce the noise from normal mapping, which can cause flickering due to undersampling at lower SSAO resolutions.

We recently added a significant amount of vegetation to the game, and sadly the SSAO looks horrible on it. The depth buffer essentially becomes extremely noisy, giving worthless normals. During the bilateral upsampling pass the normals don't match, and there isn't enough resolution to give the vegetation a good SSAO effect: there are so many different depth values that a good match simply doesn't exist in the low-resolution SSAO texture. The result is a noisy, aliased mess that flickers badly in motion. Even the temporal supersampling I have can't do much to reduce the impact.

My SSAO renderer has 4 main passes:

1. In the first pass I pack the normal and the linear depth into a single low-resolution GL_RGBA16F texture, usually at half resolution, though this can be changed by a setting. The normal is reconstructed by sampling the depth of 5 pixels in a cross shape and taking a cross product; it is stored in RGB, with the linear depth packed into the alpha channel (see the sketch after this list). The cost of this pass is offset by the savings in the next two passes.

2. In the second pass, the SSAO value is calculated and stored in a GL_R8 texture. See the shader code below for details; it's pretty straight-forward.

3. In the third pass, the SSAO value is blurred using a depth- and normal-aware separable 9x9 blur. This is applied twice. This pass benefits a lot from the normal+depth packed texture from pass 1, completely offsetting the cost of generating it.

4. As part of a big shader that handles a lot of things (run at full resolution), the 4 closest SSAO values are read and a weighted sum of them is chosen based on the depth and normal of the full resolution pixel being processed.
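Roughly, pass 1 looks like this (a simplified sketch, not my exact code; the uniform names and the unproject() helper are placeholders):

#version 330 core

uniform sampler2D depthBuffer;  // hardware depth buffer
uniform mat4 inverseProj;       // clip space -> view space
uniform vec2 invResolution;     // 1.0 / source resolution

in vec2 texCoords;
out vec4 packedNormalDepth;     // GL_RGBA16F: normal in RGB, linear depth in A

vec3 unproject(vec2 uv, float depth) {
    vec4 v = inverseProj * vec4(vec3(uv, depth) * 2.0 - 1.0, 1.0);
    return v.xyz / v.w;
}

void main() {
    vec2 px = invResolution;

    // 5 taps in a cross shape.
    vec3 center = unproject(texCoords, texture(depthBuffer, texCoords).r);
    vec3 right  = unproject(texCoords + vec2( px.x, 0.0), texture(depthBuffer, texCoords + vec2( px.x, 0.0)).r);
    vec3 left   = unproject(texCoords + vec2(-px.x, 0.0), texture(depthBuffer, texCoords + vec2(-px.x, 0.0)).r);
    vec3 up     = unproject(texCoords + vec2(0.0,  px.y), texture(depthBuffer, texCoords + vec2(0.0,  px.y)).r);
    vec3 down   = unproject(texCoords + vec2(0.0, -px.y), texture(depthBuffer, texCoords + vec2(0.0, -px.y)).r);

    // Normal from the cross product of the horizontal and vertical deltas.
    vec3 normal = normalize(cross(right - left, up - down));

    // Pack the normal and the (positive) view-space linear depth together.
    packedNormalDepth = vec4(normal, -center.z);
}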

At 1920x1080, my GTX 770 gets the following performance numbers:

- At half resolution:

engine.post.SSAORenderer : 0.824ms
  Downsample/pack buffers : 0.174ms
  Clear buffers : 0.030ms
  Render SSAO : 0.319ms
  Blur : 0.297ms
+ ~0.178ms bilateral upsample
Total: 1.002ms

- At full resolution:

engine.post.SSAORenderer : 3.424ms
  Downsample/pack buffers : 0.323ms
  Clear buffers : 0.112ms
  Render SSAO : 1.783ms
  Blur : 1.201ms
+ ~0.267ms bilateral upsample
Total: 3.691ms

The additional cost of doing this at full resolution is simply way too high. My goal is to get this running at 1-2ms at full resolution. Reducing the blur passes from two to one would save around 0.6ms, while getting rid of the bilateral upsample would save ~0.15ms. At full resolution I also wouldn't have to calculate the normal from the depth buffer, which would save another fraction of a millisecond. That leaves me at around 2.9ms, still 1-2ms too high.

Here is my shader: http://pastebin.com/xYFmbEP3

The blur and pack shaders are essentially as fast as they can be.

I am looking for other SSAO algorithms that are more efficient/cache-friendly, better sampling patterns so I can reduce the noise, optimizations to my current code, and the like.



Unfortunately I haven't worked on SSAO algorithms yet, so I just want to quickly drop a reference: this thesis compares various SSAO algorithms for quality and performance, so it might be worth a look.

Try reducing the sample count of your full-resolution SSAO. This will increase high-frequency noise, but that is a lot easier to deal with than upscaling artifacts on high-frequency content. You can also try using a half-resolution depth buffer as input (with 16-bit depth) while still doing full-resolution SSAO (the current pixel uses the full-resolution depth).

Or you could even build a mipmap pyramid, like this: http://graphics.cs.williams.edu/papers/SAOHPG12/

In our current game Hardland I have ditched the whole spatial blur pass and replaced it with temporal smoothing. It gives better quality and it's cheaper, but it may cause a minor amount of ghosting. It's quite similar to this: http://bartwronski.com/2014/04/27/temporal-supersampling-pt-2-ssao-demonstration/
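The core of the idea looks roughly like this (just a sketch, not our actual code; all names are made up and the history rejection is simplified):

#version 330 core

uniform sampler2D currentAO;   // this frame's noisy AO
uniform sampler2D historyAO;   // accumulated AO from previous frames
uniform sampler2D depthBuffer; // this frame's hardware depth
uniform mat4 reprojection;     // current NDC -> previous frame's clip space

in vec2 texCoords;
out float smoothedAO;

void main() {
    float depth = texture(depthBuffer, texCoords).r;

    // Reproject this pixel into the previous frame.
    vec4 prev = reprojection * vec4(vec3(texCoords, depth) * 2.0 - 1.0, 1.0);
    vec2 prevUV = (prev.xy / prev.w) * 0.5 + 0.5;

    float ao      = texture(currentAO, texCoords).r;
    float history = texture(historyAO, prevUV).r;

    // Exponential moving average; fall back to the current value when the
    // reprojected UV is off screen. A depth-mismatch test belongs here too,
    // which is what keeps the ghosting minor.
    bool valid = all(greaterThanEqual(prevUV, vec2(0.0))) &&
                 all(lessThanEqual(prevUV, vec2(1.0)));
    smoothedAO = valid ? mix(ao, history, 0.9) : ao;
}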

Edit: AO-only screenshot (ignore the terrain's large-scale AO, focus on the foliage): https://www.dropbox.com/s/uizy0vwvasumvrv/AoOnly.png?dl=0

Try switching to a compute shader for the blur pass (and use local memory). My Gaussian pass ran 4x faster with proper shared memory usage and group size.
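The idea, roughly (a sketch with made-up names; edge clamping and the vertical pass are omitted for brevity):

#version 430

layout(local_size_x = 128, local_size_y = 1) in;

const int RADIUS = 4; // 9-tap kernel

// One row of input staged in shared memory, plus an apron on each side,
// so every texel is fetched from the texture only once per group.
shared float cache[128 + 2 * RADIUS];

layout(binding = 0) uniform sampler2D blurInput;
layout(r8, binding = 0) writeonly uniform image2D blurOutput;

uniform float weights[RADIUS + 1]; // Gaussian half-kernel, weights[0] = center

void main() {
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    int lid = int(gl_LocalInvocationID.x);

    // Each thread loads its own texel; the first threads also load the aprons.
    cache[lid + RADIUS] = texelFetch(blurInput, pixel, 0).r;
    if (lid < RADIUS) {
        cache[lid] = texelFetch(blurInput, pixel - ivec2(RADIUS, 0), 0).r;
        cache[lid + 128 + RADIUS] = texelFetch(blurInput, pixel + ivec2(128, 0), 0).r;
    }
    barrier();

    // Blur entirely out of shared memory.
    float result = cache[lid + RADIUS] * weights[0];
    for (int i = 1; i <= RADIUS; i++) {
        result += (cache[lid + RADIUS - i] + cache[lid + RADIUS + i]) * weights[i];
    }
    imageStore(blurOutput, pixel, vec4(result));
}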


Now I have no idea about SSAO, but I'm pretty sure that random rotation thing in your shader can be solved without resorting to trig functions, so I looked around a bit:

http://john-chapman-graphics.blogspot.com.ar/2013/01/ssao-tutorial.html

It gives you a way to get random rotations from a very tiny precomputed noise texture (2x2) instead of trig functions, which is apparently what the original Crysis used.
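The gist, as far as I can tell (a sketch of one way to do it; the names are made up): bake (cos a, sin a) pairs into the tiny texture, tile it across the screen with GL_REPEAT, and rotate the kernel with a complex multiply instead of calling sin/cos per pixel.

uniform sampler2D noiseTex; // tiny RG texture holding (cos a, sin a) pairs
uniform vec2 noiseScale;    // screenResolution / noiseTextureSize, so it tiles

// Rotate 'dir' by the angle encoded in cs = (cos a, sin a).
vec2 rotateByNoise(vec2 dir, vec2 cs) {
    return vec2(dir.x * cs.x - dir.y * cs.y,
                dir.x * cs.y + dir.y * cs.x);
}

// In the SSAO loop, something like:
// vec2 cs = texture(noiseTex, texCoords * noiseScale).xy * 2.0 - 1.0;
// vec2 offset = rotateByNoise(offsets[i], cs) * offsetScale;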

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

Have you considered using a smaller format for the packed texture?
This may be a little less relevant, but in my D3D9 case I had an A32B32G32R32F depth texture while only sampling one channel; compared to sampling an R32F, the latter is obviously faster. The bottleneck there is the number of samples taken on the depth value (14 in my case), and the bandwidth is higher for the larger format. This is just my case though. I believe your blur pass (3) uses a similar pattern, right? (Calculating from a series of samples based on a previous texture lookup.)

Also, my noise texture is just 2x2, so I replaced it with an array lookup. IIRC this is insignificant in terms of performance for such a small texture. Not sure about rand() though.

I also used a 16-bit FP buffer for the normal input; it is a tiny bit faster. I believe this is because the dependent calculation later on gets its normal vector more cheaply.

Some code optimizations, though they probably don't matter much for performance:


vec2 offset = (offsets[i] * rotation) * offsetScale; // you can bake offsetScale into the rotation outside of the loop

vec4 sample = inverseProj * vec4(vec3(sampleCoords, texture(depthBuffer, sampleCoords).r) * 2.0 - 1.0, 1.0); // you can bake the offset and scale (* 2.0 - 1.0) into that matrix; you could also avoid the matrix multiply altogether, check the SAO paper
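For the first suggestion, assuming 'rotation' is a mat2 in your shader, the hoisted version would look something like this:

mat2 scaledRotation = rotation * offsetScale; // computed once, outside the loop
for (int i = 0; i < SAMPLE_COUNT; i++) {
    vec2 offset = offsets[i] * scaledRotation; // one multiply saved per sample
    // ...
}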

I've rewritten many parts of the algorithm and tweaked many settings to make it faster.

1. All unprojection to view space is now done using a frustum corner vector and a GL_R32F linear depth value (see the sketch after this list). The depth-aware blur pass still uses the 16-bit float depth. This gave a very notable improvement when the shader was ALU-limited, which happened quite often when doing the SSAO at full resolution. With perfect texture cache coherency, the SSAO computation pass (the second pass) now only takes ~0.64ms.

2. I reduced the blur kernel size from 9x9 to 7x7. Doing one blur pass now takes 0.41ms; two take 0.79ms. When doing two blur passes, the first one uses a twice-as-wide kernel, i.e. it jumps over every other value (see the last page of this little PDF: http://www.cise.ufl.edu/~cchi/SSAO.pdf). The second pass then does a normal blur, which hides these artifacts.

3. I added a per-frame jitter value to the computation of the random rotation to better take advantage of my temporal anti-aliasing. In essence, this doubles the sample count at no additional cost. Dropping the sample count to 8 instead of 16 improves performance from 0.64ms to 0.35ms, but produces so much noise that even two blur passes can't hide it.

4. Some other minor optimizations that had a small impact. For example, I baked the powerScale value into the normal so that the dot product in the loop takes care of that multiplication.
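The frustum corner trick in (1), roughly (a simplified sketch, names made up; it assumes the quad's vertex order matches the corner array, and the exact math depends on how you encode linear depth):

// vertex shader (fullscreen quad)
in vec2 quadVertex;             // in [-1, 1]
uniform vec3 frustumCorners[4]; // view-space far-plane corner directions
out vec3 viewRay;
out vec2 texCoords;

void main() {
    viewRay = frustumCorners[gl_VertexID]; // interpolated across the screen
    texCoords = quadVertex * 0.5 + 0.5;
    gl_Position = vec4(quadVertex, 0.0, 1.0);
}

// fragment shader
uniform sampler2D linearDepthTex; // GL_R32F, view-space depth / far-plane distance
in vec3 viewRay;
in vec2 texCoords;

vec3 getViewPos() {
    // No matrix multiply, no divide: just scale the ray by linear depth.
    return viewRay * texture(linearDepthTex, texCoords).r;
}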

Further considerations:

1.

My blur pass is currently depth AND normal aware. Basically, each sample is weighted by something a tiny bit more complicated than this:


float weight = clamp(dot(centerNormal, sampleNormal) - abs(centerDepth - sampleDepth), 0.0, 1.0) * gaussianWeight;

Removing the normal awareness does not seem to have any impact at all on performance (the pass seems to be limited by the NUMBER of texture samples, not bandwidth), but there would be no need for the packed normal+depth texture, saving a solid 8MB of VRAM and getting rid of 0.23ms. A two-pass blur would go from 1.02ms to 0.82ms. The reason I added the normal awareness was that I was sometimes getting bleeding over 90-degree edges, which simply didn't look very good. Avoiding that was fast enough to be worth it at half resolution, but at full resolution it might be worth disregarding those artifacts for the sake of those 0.2ms.

2.

The main problem right now is the cache coherency of the samples. All the SSAO performance values above are for when cache coherency is optimal, i.e. the sample radius is small because the scene is far away from the camera. With two blur passes, the normal awareness removed and the SSAO optimizations, the SSAO takes a total of ~1.46ms, significantly better than the ~3.4ms I started with. However, when the scene is extremely close to the camera, texture sample coherency gets worse and the SSAO pass skyrockets from the best-case 0.64ms to a massive 4.6ms when every pixel is clamped to the max SSAO sample radius of 50 pixels.

Texture cache worst-case performance of the SSAO pass:

- 16 samples, 50 pixel sample radius, GL_R32F depth buffer: 4.6ms
- 8 samples, 50 pixel sample radius, GL_R32F depth buffer: 0.94ms
- 16 samples, 25 pixel sample radius, GL_R32F depth buffer: 1.25ms
- 16 samples, 50 pixel sample radius, GL_R16F depth buffer: 1.52ms

Disabling the random rotation per pixel both improves ALU performance slightly and improves cache performance a lot:

- 16 samples, 50 pixel sample radius, GL_R32F depth buffer, no random sample rotation: 0.58ms
- 16 samples, 150 pixel sample radius, GL_R16F depth buffer, no random sample rotation: 0.58ms

Sadly, that looks like complete shit, especially up close, but at least it proves that cache coherency is indeed the problem. So:

- Reducing the sample count produces results that are too noisy, so that's pretty much out of the question.
- Reducing the maximum sample radius is annoying since it essentially disables the SSAO when the camera gets too close to the scene, but at least it's a much more pleasant trade-off than reducing the sample count. I would actually prefer to increase it to ~100 for best quality...
- Going to a GL_R16F depth buffer is possible; the quality of distant SSAO would drop, but that's probably not noticeable. I still need the 32-bit linear depth buffer for other parts, so I'd need an extra 0.07-0.10ms to generate the 16-bit one.

Something I remember reading about is that it's possible to split the depth buffer up into many smaller textures: essentially, you take each 2x2 block of depth values and scatter it to 4 different quarter-resolution textures. I could then run the SSAO shader 4 times (with only 4 samples each), sampling a different texture each time. That would improve cache coherency and essentially halve the sample radius, since each texture is half as wide and half as tall. Hopefully that can improve my worst-case performance a bit. A sketch of the scatter pass is below.
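The scatter pass would be a single quarter-resolution pass with 4 render targets, something like this (a sketch, I haven't implemented it yet; names are made up):

#version 330 core

uniform sampler2D fullResDepth; // full-resolution linear depth

// Four GL_R32F attachments, one per position within the 2x2 block.
layout(location = 0) out float depth00; // (even x, even y)
layout(location = 1) out float depth10; // (odd x,  even y)
layout(location = 2) out float depth01; // (even x, odd y)
layout(location = 3) out float depth11; // (odd x,  odd y)

void main() {
    // Runs at half width / half height; each output pixel grabs one
    // full-resolution 2x2 block and fans it out to the 4 targets.
    ivec2 base = ivec2(gl_FragCoord.xy) * 2;
    depth00 = texelFetch(fullResDepth, base, 0).r;
    depth10 = texelFetch(fullResDepth, base + ivec2(1, 0), 0).r;
    depth01 = texelFetch(fullResDepth, base + ivec2(0, 1), 0).r;
    depth11 = texelFetch(fullResDepth, base + ivec2(1, 1), 0).r;
}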

Good results.

For the kernel size issues, read the SAO paper: http://graphics.cs.williams.edu/papers/SAOHPG12/

Basically, the main idea is to use a depth buffer with mipmaps that are generated using a rotated grid for subsampling. This makes the algorithm's performance almost totally independent of the kernel radius.
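The downsample in the paper is basically this (a sketch; names are made up): each mip texel takes one of its four parents, chosen with a rotated-grid pattern, instead of averaging them.

#version 330 core

uniform sampler2D prevMip; // previous (finer) depth mip level

out float mipDepth;

void main() {
    ivec2 dst = ivec2(gl_FragCoord.xy);
    // Rotated-grid pick: the chosen parent alternates per destination texel,
    // so features don't collapse onto a fixed lattice.
    mipDepth = texelFetch(prevMip, dst * 2 + ivec2(dst.y & 1, dst.x & 1), 0).r;
}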

You should also try more aggressive temporal smoothing for the SSAO. That way you can get rid of your most expensive component (blurring) and spend all the computation time on additional samples.

If you remove blurring, how many samples do you need to get stable results? 32? 64?


This paper is excellent, and as far as I know a lot of people have based their SSAO on it for exactly the same problems you're having with the sample radius.

Just a quick mention of something that came up earlier: trying to just brute-force a huge number of samples with random sample rotation off might be worth a shot. Depending on what you're bottlenecked by, straight doubling the sample count without random rotation might give you much better results and still be faster.

Last thing: for sampling and blurring patterns, the Call of Duty guys did something neat: http://www.iryoku.com/next-generation-post-processing-in-call-of-duty-advanced-warfare They created a predictable and stable random noise texture, making sampling and blurring stable and predictable as a result. Apparently they tried it with SSAO and liked it, but didn't end up shipping it.

