*solved* Too slow for full resolution, want to improve this SSAO code

Started by
16 comments, last by agleed 9 years, 1 month ago

Mipmapping the Z-buffer? Interesting. I've managed to find the GLSL source code for the SAO paper and identified the relevant parts. What I'd need to do is add mipmaps to the depth buffer, compute a few mipmap levels and finally modify my shader to pick a good mipmap level for each sample.

Looking at the source code of their example, it seems to be pretty inefficient. With both set to 16 samples, my SSAO is only 107 instructions while theirs is 351. Most of this seems to stem from the fact that they calculate sample positions from scratch using sin() and cos(). In addition, they query the resolution of each mipmap using textureSize(), which seems to have significant overhead. I will simply adapt their mipmap calculation code to my shader and keep the rest of my code intact. The 2x2 bilateral blur using dFdx/dFdy is an interesting concept as well, so I'll take a look at that too. Thanks for pointing me in this direction!

Concerning more temporal supersampling, I'm not prepared to go any further with that. Preventing ghosting and other artifacts is hard as it is. I am currently using my own anti-aliasing technique which I call Temporal SRAA, which works similarly to SRAA but also uses the previous frame. Since each color sample has a primitive ID, I can eliminate ghosting by only sampling the previous frame if the IDs match. Doing 3x or even 4x temporal supersampling with this would both have a high performance overhead and most likely introduce more apparent ghosting artifacts on shadows and other special effects that the ID matching cannot eliminate, so it's not something I have any plans for.

I haven't taken a look at how many samples I'd need to not have to rotate the samples and/or not have to blur the result, mostly because it has been so hard to find code for generating good sample locations. I'm thinking of porting the sample location code they use in the SAO shader to be calculated offline and passing it into the shader during compilation for maximum performance. This will probably be my first step.


The GLSL port of SAO is pretty inefficient, but it's memory bound, so ALU op counts don't mean much in most cases. Sin/cos aren't the problem (they can be replaced with a 2x2 rotation matrix if needed); the integer math and integer UV sampling are.
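To make the rotation-matrix point concrete, here is a minimal scalar C sketch: the sin/cos of the per-pixel random angle are evaluated once, then every tap is rotated with the same 2x2 matrix. Names and layout are illustrative, not taken from either shader.

```c
#include <math.h>

typedef struct { float x, y; } vec2;

/* Rotate a precomputed sample offset by a per-pixel random angle.
 * cosA/sinA are computed once per pixel, then reused for all taps,
 * so no per-sample sin()/cos() is needed. */
static vec2 rotate_offset(vec2 offset, float cosA, float sinA)
{
    vec2 r;
    r.x = offset.x * cosA - offset.y * sinA;
    r.y = offset.x * sinA + offset.y * cosA;
    return r;
}
```

The same idea maps directly to GLSL with a `mat2` built once at the top of the shader.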

There is a cleaned-up version of the Hardland SSAO here: http://pastebin.com/WZCsjkrj

I set the sample count to 16 to match your testing, but I've noticed that you get better quality with an odd number. Prime numbers are good; I use 17 for ultra and 13 for high.

Another SAO version can be found here. https://github.com/bartwronski/CSharpRenderer/blob/master/shaders/ssao.fx

PS: It's a lot faster to port from HLSL to GLSL than to optimize the GLSL version.

Great success!

I've written a small sample location generator that I can use to generate sample locations for any sample count. I have good sample distributions for 8, 16, 24 and 32 samples.

I added the simple one-line mipmap generation code. I don't generate that many mipmaps, but the generation clocks in at around 0.09 - 0.10ms. Pretty awesomely, the SSAO pass now runs in constant time regardless of sample radius. With the maximum sample radius set to 1000 pixels I get the following results when I stuff the camera into some grass:

32 samples, 1000 pixel sample radius, GL_R32F depth buffer, with random sample locations:

Old:

engine.post.SSAORenderer : 25.532ms
Render SSAO : 24.705ms
Blur : 0.825ms

New:

engine.post.SSAORenderer2 : 2.325ms
Generate depth buffer mipmaps : 0.094ms
Render SSAO : 1.401ms
Blur : 0.826ms

Best of all, the 1.4ms performance at 32 samples is constant regardless of sample radius. Even more amazing, the image quality is identical; I can't see any difference at all, and even when I can spot one by flipping between the old and the new algorithm, it doesn't look worse, just noisy in a different way. At 24 samples and 2 blur passes, I get pretty good quality and 2ms performance at 1920x1080. I will most likely limit the sample radius slightly, simply to avoid artifacts when samples end up outside the screen, since I don't have a "guard band" that provides information outside it. I've opted not to go with the 2x2 bilateral blur they do in the shader using dFdx/dFdy, as it caused block artifacts on my vegetation.

Simply brute-forcing it all does not seem to be very feasible. One blur pass costs around the same as 8 additional samples, and having at least 1 or preferably 2 blur passes improves quality a lot.

@kalle_h

The only code I used from the SAO paper was the depth buffer mipmap generation shader; for the SSAO code I simply plugged in the LOD level calculation and switched from texture() to textureLod() when sampling. Since I already use normalized texture coordinates, I didn't need to change anything else or mess with integer texture coordinates, so the ALU cost of the new version is barely affected.

I'm pretty satisfied with the current results, so I think I'll just go with this. I might tweak the radius and fall-off function or so, but it'll mostly be aesthetic tweaks from now on.

Now we just need some screenshots.

I'd like to thank everyone here for their great advice and for pointing me in the right direction!

@vlj

I'm interested in optimizing the blur pass a bit. I did some experiments with compute shader blurs before (http://www.gamedev.net/topic/664950-smart-compute-shader-box-blur-is-slower/) but I couldn't make it any faster than simply letting the texture cache take care of it. In the case of SSAO, it's a single channel 8-bit texture. I'd love to improve the blur performance though, but I need to figure out what I did wrong the last time.

Now we just need some screenshots.

Ah, of course. Here you go. =3

http://screenshotcomparison.com/comparison/117813/picture:0 (Don't mind the FPS counter on these two.)

http://screenshotcomparison.com/comparison/117813/picture:1 (More representative FPS.)

I've decided to write a small summary of the most important optimizations I added to get to this point.

1. I was reconstructing the eye space position of each SSAO sample using the hardware depth buffer and the inverse projection matrix. Switching to reconstructing the position using a linear depth buffer and a frustum corner vector saves an almost ridiculous number of instructions.

The matrix version does

- 3xMAD (convert coords from [0.0 1.0] to [-1.0 1.0])

- 12xMAD (matrix multiply)

- 1xRCP + 3xMUL (W divide)

= 15xMAD + 3xMUL + 1xRCP = 18 instructions + RCP which is even slower

The frustum corner version only takes 2xMAD + 5xMUL = 7 instructions and no RCP, saving a huge amount of ALU time. This is the simple change that brought my shader from 1.78ms to 0.84ms.
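A scalar C sketch of the frustum-corner version makes the count concrete. It assumes a linear depth buffer storing eye-space depth, and per-axis corner scales derived from the projection (cornerX = tan(fovY/2) * aspect, cornerY = tan(fovY/2), with -Z forward); all names are illustrative.

```c
typedef struct { float x, y, z; } vec3;

/* Reconstruct the eye-space position from normalized UV and linear depth.
 * 2 MADs build the ray toward the far-plane corner, a few MULs scale it
 * by the depth -- no matrix multiply, no W divide (RCP). */
static vec3 reconstruct_position(float u, float v, float linearDepth,
                                 float cornerX, float cornerY)
{
    /* 2 MADs: remap [0,1] UV to [-1,1] and scale to the frustum corner */
    float rx = (u * 2.0f - 1.0f) * cornerX;
    float ry = (v * 2.0f - 1.0f) * cornerY;
    /* scale the ray by linear depth; eye space looks down -Z */
    vec3 p = { rx * linearDepth, ry * linearDepth, -linearDepth };
    return p;
}
```

At the screen center (u = v = 0.5) this collapses to (0, 0, -depth), which is a quick sanity check for the corner scales.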

2. The second biggest bottleneck is cache coherency. This can be solved by mipmapping the linear depth buffer and picking a LOD level for each sample based on the sample offset distance. Basically, as the samples get more and more spread out, we counter the reduced cache coherency by moving to smaller mip levels, bringing cache coherency up again. Visually, the result is identical. I cannot see any difference whatsoever. Mipmapping the depth buffer is fast, generally taking under 0.1ms. When using extremely large sample radii, this technique brought my SSAO shader from over 20ms down to 0.84ms constant, regardless of sample radius, well worth the cost of generating the mipmaps.
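The per-sample LOD pick can be sketched in C in the spirit of the SAO heuristic: drop one mip level each time the screen-space offset doubles past a base radius. The constants here are illustrative, not the values used above.

```c
/* Offsets up to 2^3 = 8 pixels stay at mip 0 (illustrative constants). */
#define LOG_MAX_OFFSET 3
#define MAX_MIP_LEVEL  5

/* Map a screen-space sample offset (in pixels) to a depth-mip level. */
static int mip_level_for_offset(int offsetPixels)
{
    /* index of the highest set bit, i.e. floor(log2(offsetPixels)) */
    int msb = -1;
    while (offsetPixels > 0) { ++msb; offsetPixels >>= 1; }

    int level = msb - LOG_MAX_OFFSET;
    if (level < 0) level = 0;
    if (level > MAX_MIP_LEVEL) level = MAX_MIP_LEVEL;
    return level;
}
```

In GLSL the highest-set-bit step is a single findMSB() call, and the result feeds straight into textureLod().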

3. Keep the blur simple. My blur was both depth and normal-aware. I've removed the normal awareness as the performance cost was not worth the minor improvement it achieved. Secondly, make sure you're only doing one texture sample per blur sample. I had one GL_R8 texture for the SSAO value from the SSAO pass and a GL_R32F texture for depth, and the shader was completely bottlenecked by the number of texture samples. I changed the texture format of the SSAO result texture to GL_RG16F and packed the SSAO value in the red channel and the depth in the green channel. The blur shader then only had to do one texture sample to get both the SSAO value and the depth. At the end, it outputs both the blurred value and the unmodified center depth for the next blur pass. This almost doubled the blur performance, although writing the extra depth value has a small amount of overhead.
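The packed-fetch idea can be illustrated with a minimal 1-D depth-aware blur in C, treating each element as one GL_RG16F texel holding (AO, depth). The rejection rule and weights here are a simplified sketch, not the posted shader.

```c
#include <math.h>

typedef struct { float ao, depth; } AoDepth;  /* one packed RG16F texel */

/* One blur output: a single fetch per tap yields both AO and depth.
 * Taps across a depth discontinuity are rejected; the center depth is
 * passed through unmodified for the next blur pass. */
static AoDepth blur_taps(const AoDepth *row, int center, int radius,
                         float depthTolerance)
{
    float sum = 0.0f, wsum = 0.0f;
    float centerDepth = row[center].depth;
    for (int i = -radius; i <= radius; ++i) {
        AoDepth t = row[center + i];  /* the single "texture sample" */
        float w = fabsf(t.depth - centerDepth) < depthTolerance ? 1.0f : 0.0f;
        sum  += t.ao * w;
        wsum += w;
    }
    AoDepth out = { sum / wsum, centerDepth };
    return out;
}
```

With AO and depth in separate textures, the loop body above would need two fetches per tap instead of one, which is exactly the bottleneck described.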

Here are my benchmark results for 16 samples with 2 9x9 blur passes applied.

BEFORE (best case scenario):

engine.post.SSAORenderer : 3.424ms
Downsample/pack buffers : 0.323ms
Clear buffers : 0.112ms
Render SSAO : 1.783ms
Blur : 1.201ms
The "Render SSAO" pass would skyrocket to over 20ms when the sample radius got over ~75 pixels.

AFTER:

engine.post.SSAORenderer : 1.498ms
Generate depth buffer mipmaps : 0.09ms
Render SSAO : 0.826ms
Blur : 0.578ms

Improvement results:

- Precomputation: 0.435 --> 0.090 = 4.83x improvement

- SSAO pass: 1.783 --> 0.826 = 2.16x improvement (best case scenario for the old algorithm, in practice closer to 5x to 30x improvement)

- Blur: 1.201 --> 0.578 = 2.08x improvement

- Total: 3.424 --> 1.498 = 2.29x improvement

Quality-wise, the new algorithm looks identical, except the improved cache locality allows for much larger sample radii, which allows for higher quality without having to resort to clamping the sampling radius or other hacks.

The only thing left to investigate now is compute shaders, which isn't something I can prioritize since our engine must run on OGL3 hardware.

Again, thanks everyone! I hope that someone finds this useful.

The matrix version does

- 3xMAD (convert coords from [0.0 1.0] to [-1.0 1.0])

- 12xMAD (matrix multiply)

- 1xRCP + 3xMUL (W divide)

= 15xMAD + 3xMUL + 1xRCP = 18 instructions + RCP which is even slower

Wouldn't make much of a difference, but as a small note: You can bake the [0,1] <-> [-1,1] conversion into the matrix (you can do this with all linear transformations that need to be applied before or after).


transform = transform * [matrix that translates each coordinate by -1] * [matrix that scales everything by a factor of 2]  
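A tiny 1-D homogeneous sketch in C checks that this composition order is right: scaling by 2 applies first, then the translation by -1, so the baked matrix applied to u equals the original matrix applied to 2u - 1. The 2x2 matrices and names are illustrative.

```c
typedef struct { float m[2][2]; } mat2h;  /* 1-D homogeneous transform */

static mat2h mul(mat2h a, mat2h b)
{
    mat2h r;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            r.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j];
    return r;
}

/* Apply to the homogeneous point (u, 1). */
static float apply(mat2h a, float u)
{
    return a.m[0][0] * u + a.m[0][1];
}
```

Composing some transform M with the translate-by-(-1) and scale-by-2 matrices in that order reproduces M(2u - 1) exactly, so the remap costs nothing at runtime.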

