SSAO and skybox artifact


I'm running into an ugly artifact with my SSAO: geometry against the skybox is occluded by the skybox and vice versa. I was able to stop the skybox from being occluded simply by skipping the SSAO calculation as soon as I know the pixel belongs to the skybox. I still have the problem of the skybox occluding my geometry, which creates an ugly dotted line. The skybox is black in the g-buffer and at depth 1.0, but after trying unsuccessfully to skip sampling the skybox using a step function, I decided to ask here. Anyway, here is my GLSL shader, based heavily on code from here:



uniform sampler2D depth_texture;
uniform sampler2D color_texture;  // random-normal texture, tiled across the screen
uniform sampler2D normal_texture;
uniform float scr_w;
uniform float scr_h;

// random sampling kernel: offsets inside the unit sphere
uniform vec3 pSphere[10] = vec3[](vec3(-0.010735935, 0.01647018, 0.0062425877),
                                  vec3(-0.06533369, 0.3647007, -0.13746321),
                                  vec3(-0.6539235, -0.016726388, -0.53000957),
                                  vec3(0.40958285, 0.0052428036, -0.5591124),
                                  vec3(-0.1465366, 0.09899267, 0.15571679),
                                  vec3(-0.44122112, -0.5458797, 0.04912532),
                                  vec3(0.03755566, -0.10961345, -0.33040273),
                                  vec3(0.019100213, 0.29652783, 0.066237666),
                                  vec3(0.8765323, 0.011236004, 0.28265962),
                                  vec3(0.29264435, -0.40794238, 0.15964167));

varying vec2 vTexCoord;

#define STRENGTH 0.09
#define FALLOFF 0.0 // 0.00002
#define RAD 0.006
#define SAMPLES 10
#define INVSAMPLES (1.0 / SAMPLES)

// returns the decoded normal in .rgb and the depth in .a
vec4 height_normal(in vec2 texcoord)
{
    vec4 normaltexel;
    normaltexel.rgb = (texture2D(normal_texture, texcoord).xyz * 2.0) - vec3(1.0);
    normaltexel.a = texture2D(depth_texture, texcoord).x;

    return normaltexel;
}

void main(void)
{
    // get a random normal (tiles a 64x64 random texture across the screen)
    vec3 fres = normalize((texture2D(color_texture, vTexCoord * (scr_w / 64)).xyz * 2.0) - vec3(1.0));

    // grab the depth and normal of the current pixel
    vec4 currentPixelSample = height_normal(vTexCoord);

    vec3 samplepos = vec3(vTexCoord.xy, currentPixelSample.a);

    float blacklevel = 0.0;

    float depthDiff;
    vec4 occluderFragment;
    vec3 ray;

    // only run SSAO for non-sky pixels: the sky's black normal decodes
    // to (-1, -1, -1), whose length is greater than 1
    if (length(currentPixelSample.xyz) <= 1.0)
    {
        for (int i = 0; i < SAMPLES; ++i)
        {
            // reflect a random kernel offset around the random normal, scaled by depth
            ray = (RAD / samplepos.z) * reflect(pSphere[i], fres);

            // flip the ray into the normal's hemisphere and fetch the potential occluder
            occluderFragment = height_normal(samplepos.xy + (sign(dot(ray, currentPixelSample.xyz)) * ray.xy));

            depthDiff = samplepos.z - occluderFragment.a;

            // accumulate occlusion for occluders in front of the pixel,
            // weighted by normal difference and depth falloff
            blacklevel += step(FALLOFF, depthDiff) * (1.0 - dot(currentPixelSample.xyz, occluderFragment.xyz)) * (1.0 - smoothstep(FALLOFF, STRENGTH, depthDiff));
        }
    }

    // output the result
    gl_FragColor = vec4(vec3(1.0 - (blacklevel * INVSAMPLES)), 1.0);
}


Attached is a picture of my problem with the offending pixels circled. Anyone know how to fix this?

If the skybox is at exactly 1.0 (zfar), e.g. by writing gl_FragDepth = 1.0 in the atmosphere shader, you can avoid the artifact with a branch: if (depth < 0.99) { do stuff }.
It will reduce the performance of your SSAO shader a little, but I think it will work :)

In your example this could be:

if (currentPixelSample.a < 0.999) // the depth fetched by height_normal(); the sky sits at 1.0
{
    // ...do ssao...
}

Edited by Kaptein
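
A minimal sketch of the complementary fix, assuming the sky really is stored at exactly depth 1.0: the dotted line can come from occluder taps that land on sky texels, and those can be rejected inside the sample loop without a branch:

// Inside the sample loop: reject occluders that landed on the sky.
// step(occluderFragment.a, 0.999) is 1.0 only while the occluder depth
// is at most 0.999, so sky texels contribute nothing.
float notSky = step(occluderFragment.a, 0.999);

blacklevel += notSky
            * step(FALLOFF, depthDiff)
            * (1.0 - dot(currentPixelSample.xyz, occluderFragment.xyz))
            * (1.0 - smoothstep(FALLOFF, STRENGTH, depthDiff));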

When dealing with shaders, ALL code is executed, including ALL branches, all function calls, etc. The ONLY exception is if something is known at compile time that allows the compiler to remove a particular piece of code.

This is how all graphics cards work, AMD, NVIDIA, etc. So your additional cost is that of the if statement, and in your example you are adding one extra if instruction, which is essentially zero cost on GPUs. If you want to read up on it, check out vector processors and data hazards.

If you somehow split your shader up and added an if statement in the middle thinking that it would speed up your code, you would get NO speedup, because ALL paths will be executed.

Do a depth bounds test (it can be set up on the engine side, with no shader branches) with a max range of 0.99999f; this will ensure you're not computing SSAO on the sky (which should be at 1.0). I'm not familiar with OpenGL so I don't know the setup for a depth bounds test there (though I'm sure it can be done), but you can do this on the CPU side in D3D.
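
For OpenGL, the usual route is the EXT_depth_bounds_test extension; a minimal host-side sketch in C, assuming the driver exposes the extension and the sky is stored at depth 1.0:

// Fragments whose stored depth falls outside [0.0, 0.99999] are discarded
// before the SSAO shader even runs. Requires GL_EXT_depth_bounds_test.
glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
glDepthBoundsEXT(0.0, 0.99999);

// ... draw the SSAO full-screen pass ...

glDisable(GL_DEPTH_BOUNDS_TEST_EXT);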


[quote]
ALL code is executed, including ALL branches, all function calls, etc. (...) This is how all graphics cards work, AMD, NVIDIA, etc.
[/quote]


I'm not quite sure what you're basing your information on, but almost no graphics cards from the last 3 or 4 years work this way. Here's a quote from NVIDIA:

[quote]
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic as mentioned in SIMT Architecture of the CUDA C Programming Guide. A trivial example is when the controlling condition depends only on (threadIdx / WSIZE) where WSIZE is the warp size. In this case, no warp diverges because the controlling condition is perfectly aligned with the warps.
[/quote]

Execution paths only get serialized when the threads inside a warp actually diverge into different branches; if every thread in a warp takes the same path, there is no serialization at all. Edited by CryZe
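
To make that concrete in shader terms, here is a minimal standalone sketch (reusing the names from the shader above): a branch on a uniform is coherent across a warp, while a branch on per-fragment depth can diverge near the sky/geometry silhouette:

uniform bool ssao_enabled;        // identical for all fragments
uniform sampler2D depth_texture;
varying vec2 vTexCoord;

void main(void)
{
    float result = 1.0;

    // Coherent branch: every thread in a warp agrees, so nothing serializes.
    if (ssao_enabled)
    {
        float depth = texture2D(depth_texture, vTexCoord).x;

        // Potentially divergent branch: neighbouring fragments can disagree
        // where geometry meets sky, so a warp straddling that edge
        // executes both paths one after the other.
        if (depth < 0.999)
        {
            result = 0.5; // stand-in for the real SSAO loop
        }
    }

    gl_FragColor = vec4(vec3(result), 1.0);
}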

Yes, CryZe is mostly right, although you forgot one other potential source of slowdown from adding branches: register count. The compiler needs to statically determine the worst-case number of temporary registers needed for intermediate computation, accounting for every code path. If the alternate path introduced by a branch causes the number of registers needed to increase, then the total register count of the shader can be higher (even when that branch is never taken). When running the shader, each instance (thread or similar construct, depending on which HW vendor's or API's terminology you're using) needs that many registers. Basically, shaders that use more registers get fewer instances running in parallel.
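
A minimal sketch of that effect, assuming a compiler that allocates registers for the worst-case path: even a branch that never executes can raise the shader's register budget:

uniform bool rarely_true; // false every frame in this scenario
uniform sampler2D tex;
varying vec2 vTexCoord;

void main(void)
{
    vec4 result = texture2D(tex, vTexCoord);

    if (rarely_true)
    {
        // Four temporaries are live at once here: the shader's static
        // register budget must cover them even though this path is never
        // taken, which can reduce how many instances run in parallel.
        vec4 a = texture2D(tex, vTexCoord + vec2( 0.01, 0.0));
        vec4 b = texture2D(tex, vTexCoord + vec2(-0.01, 0.0));
        vec4 c = texture2D(tex, vTexCoord + vec2(0.0,  0.01));
        vec4 d = texture2D(tex, vTexCoord + vec2(0.0, -0.01));
        result = (a + b + c + d) * 0.25;
    }

    gl_FragColor = result;
}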


[quote]
When dealing with shaders, ALL code is executed, including ALL branches, all function calls, etc. The ONLY exception is if something is known at compile time that allows the compiler to remove a particular piece of code.

This is how all graphics cards work, AMD, NVIDIA, etc. So your additional cost is that of the if statement, and in your example you are adding one extra if instruction, which is essentially zero cost on GPUs. If you want to read up on it, check out vector processors and data hazards.

If you somehow split your shader up and added an if statement in the middle thinking that it would speed up your code, you would get NO speedup, because ALL paths will be executed.
[/quote]


This is completely wrong, even for relatively old GPUs (even first-gen DX9 GPUs supported branching on shader constants, although in certain cases it was implemented through driver-level shenanigans). I'm not sure how you could even come to such a conclusion, considering it's really easy to set up a test case that shows otherwise. Edited by MJP


[quote name='MJP' timestamp='1354934346' post='5008334']
[quote name='CryZe' timestamp='1354868138' post='5008035']
A warp consists of either 16 or 32 threads grouped together.
[/quote]
I think you mean "32 or 64" :P
[/quote]

I thought a wavefront on AMD's architecture consists of 16 execution units. Or am I wrong? (I just used warp as a general term, because I like it more :D) Edited by CryZe


[quote name='CryZe']
[quote name='MJP' timestamp='1354934346' post='5008334']
[quote name='CryZe' timestamp='1354868138' post='5008035']
A warp consists of either 16 or 32 threads grouped together.
[/quote]
I think you mean "32 or 64" :P
[/quote]
I thought a wavefront on AMD's architecture consists of 16 execution units. Or am I wrong? (I just used warp as a general term, because I like it more :D)
[/quote]

Nah, there are 64 threads in a wavefront. In their latest architecture (GCN) the SIMDs are 16-wide, but they execute each instruction four times to complete it for the entire wavefront (so a single-cycle instruction actually takes 4 cycles to execute). Edited by MJP
