
OpenGL Voxel Cone Tracing Experiment - Part 2 Progress





Looks pretty good. Where's your main bottleneck for performance? Is it really your new mip-mapping?

 

No, mip-mapping is still pretty cheap in the scale of things (but the cost may accumulate later when I try to implement cascades).

Right now the main performance bottlenecks are soft shadowing, SSR and SSAO.

I haven't bothered to set up any way of querying the actual cost of each feature, so I can't tell you accurately where the costs come from. All I can tell you is the framerate I remember from before I implemented soft shadowing, SSR and SSAO: it was running at 50fps with the same cone-tracing features (except that I now have the modified mip-mapping). I think soft shadowing (for the main point light and 3 emissive objects) pushed it down to ~35fps, SSR pushed it to ~25fps and SSAO pushed it to ~20fps.

 

I believe that there is a lot of cost to binding all these textures, so I think the next step for me in improving performance would be to delve into bindless graphics. I also want to try to get partially resident textures working on my Nvidia card with OpenGL 4.4, but I haven't found any resources to help me with this - has anyone been able to implement it?
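For reference, the host-side setup for both extensions is fairly small. A minimal sketch, assuming the ARB_sparse_texture and ARB_bindless_texture extensions are exposed by the driver (the format, sizes and committed region here are illustrative, and commitment offsets/sizes must be multiples of the page size the driver reports):

	GLuint tex;
	glGenTextures(1, &tex);
	glBindTexture(GL_TEXTURE_3D, tex);
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
	glTexParameteri(GL_TEXTURE_3D, GL_VIRTUAL_PAGE_SIZE_INDEX_ARB, 0);
	glTexStorage3D(GL_TEXTURE_3D, 1, GL_RGBA8, 512, 512, 512); // virtual allocation, no memory committed yet

	// Commit physical memory only for the region actually being voxelized;
	// pass GL_FALSE later to hand the pages back.
	glTexPageCommitmentARB(GL_TEXTURE_3D, 0,
	                       0, 0, 0,          // x/y/z offset
	                       128, 128, 128,    // width/height/depth
	                       GL_TRUE);

	// Bindless access: fetch a 64-bit handle once, make it resident, then pass
	// the handle to shaders through a UBO/SSBO instead of binding texture units.
	GLuint64 handle = glGetTextureHandleARB(tex);
	glMakeTextureHandleResidentARB(handle);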


It's well worth spending a day to implement a basic GPU profiling system (I guess using ARB_timer_query in GL? I've only done it in D3D so far.) so you can see how many milliseconds are being burnt by your different shaders/passes. You can then print them to the screen, or the console, or a file, etc, and get decent statistics for all of your different features in one go.
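In GL the minimal version is just a GL_TIME_ELAPSED query around each pass. A sketch (assuming a GL 3.3+ / ARB_timer_query context; in practice you'd double-buffer the queries and read results a frame late to avoid stalls):

	struct GpuTimer {
	    GLuint query = 0;
	    void init()  { glGenQueries(1, &query); }
	    void begin() { glBeginQuery(GL_TIME_ELAPSED, query); }
	    void end()   { glEndQuery(GL_TIME_ELAPSED); }
	    double milliseconds() {
	        GLuint64 ns = 0;
	        // Blocks if the GPU hasn't finished the pass yet.
	        glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);
	        return ns * 1e-6;
	    }
	};

	// Per pass:  shadowTimer.begin();  renderShadows();  shadowTimer.end();
	// Then print shadowTimer.milliseconds() to the screen, console or a file.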
 
Measuring performance with FPS, on the other hand, is quite annoying, since it requires taking before/after FPS measurements as you turn each feature on and off.
50fps = 20ms/frame for drawing the scene, voxelizing it, cone tracing and direct lighting?
35fps = 28.5ms/frame == 8.5ms increase for adding soft-shadowing
25fps = 40ms/frame == 11.5ms increase for adding SSR
20fps = 50ms/frame == 10ms increase for adding SSAO
 

I believe that there is a lot of cost to binding all these textures, so I think the next step for me in improving performance would be to delve into bindless graphics.

That will probably only be a CPU-side optimization, and by the sounds of it, your application is probably bottlenecked by the GPU workload.
 
Awesome results BTW :D

Edited by Hodgman


It's well worth spending a day to implement a basic GPU profiling system (I guess using ARB_timer_query in GL? I've only done it in D3D so far.) so you can see how many milliseconds are being burnt by your different shaders/passes. You can then print them to the screen, or the console, or a file, etc, and get decent statistics for all of your different features in one go.

 

 

This is an excellent idea, especially since there are plenty of optimizations to pare down SSAO/SSR; those are pretty well established and researched. Cone tracing is where you'd really be looking to try novel optimizations, though I can think of several that have already been done.

 

One is downsampling beforehand, or otherwise binning together blocks of pixels, for the diffuse trace. This works well when pixels are relatively close to each other, but can miss thin objects and edges in certain cases. Epic also does this with the specular trace, though they never explained how beyond a hand-wavy "then you upsample and scatter".

 

To reduce the number of samples for the specular trace, you can check the alpha of a lower-resolution mip level of the volume to see whether a region is empty and can be skipped, as in the sketch below. I'm also interested in, and may eventually get back to, figuring out realtime signed distance fields; these would give you a minimum step size you can safely skip ahead by while tracing, reducing the number of samples you need.
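A minimal sketch of that skip, assuming a premultiplied RGBA voxel volume (the sampler, cone parameters and thresholds are illustrative, not the actual shader):

	vec4 traceSpecularCone(sampler3D voxelTex, vec3 origin, vec3 dir,
	                       float coneRatio, float voxelSize, float maxDist)
	{
	    vec4 accum = vec4(0.0);
	    float dist = voxelSize;                      // start one voxel out to avoid self-hits
	    while (dist < maxDist && accum.a < 0.95)
	    {
	        float diameter = max(coneRatio*dist, voxelSize);
	        float mip = log2(diameter/voxelSize);
	        vec3 pos = origin + dir*dist;
	        // Peek one mip coarser: near-zero alpha means the whole block is empty
	        if (textureLod(voxelTex, pos, mip + 1.0).a < 0.01)
	        {
	            dist += 2.0*diameter;                // leap over the empty region
	            continue;
	        }
	        vec4 s = textureLod(voxelTex, pos, mip);
	        accum += (1.0 - accum.a)*s;              // front-to-back compositing
	        dist += 0.5*diameter;
	    }
	    return accum;
	}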

 

As for covering a huge area (if you want to go that far), volume LOD and a Directed Acyclic Graph (as I mentioned earlier) should reduce memory consumption a lot. Since you're using volume textures and not a sparse octree, though, and the paper does this for an octree, I'm not sure how it would play out for a uniform volume texture.

Edited by Frenetic Pony


It's well worth spending a day to implement a basic GPU profiling system

 

All timings in milliseconds; the first value is for a 32x32x32 volume texture, the second for 64x64x64:

Direct light shadow: 6.3; 6.3
Emissive light shadows (all three): 13.1; 13.1
First voxelization: 0.01; 0.01
Second-bounce voxelization (5 diffuse cones): 1.0; 1.6
Mip-mapping and filtering (3x3x3 filter): 0.6; 1.3
Final rendering (5 diffuse cones + 1 specular cone): 14.7; 16.8
Post (SSR + SSAO): 17.2; 17.2
Total: 52.91; 56.31

Edited by gboxentertainment


Your post-processing is fairly expensive. With just those passes you couldn't run a game at 60fps.


I'm curious as to why your SSAO + SSR are so expensive.

 

Did some debugging and found out that (because I'm using forward rendering) I had accidentally used the hi-res version of the Buddha model for my SSAO and SSR (over a million tris).

So instead of 1.0ms from the vertex shader with the low-poly model, I was getting 10ms.

My SSAO is about 8.5ms now.

However, when I previously reported my results I didn't actually have any SSR turned on; with SSR enabled for the entire scene (all surfaces), it is an additional 8.8ms.

I guess there's still a lot of room to optimize my SSR - when I implemented it, I was going for the best quality I could get rather than for performance.

I've managed to reduce my SSAO to 4.7ms without too much quality loss.

 

I'm trying to calculate whether deferred shading has an advantage over my current forward shading. With deferred shading, I have to render the Buddha at full res for position, normal and albedo textures so this will be a fixed vertex shader cost of 30ms. At the moment with forward shading, I render the model at full res once and at low-res 7 times, so that makes 17ms altogether for vertex shader costs.

Edited by gboxentertainment


Hi! Try SSR with an iterative step rather than a fixed step - you will find the reflected pixel in 3-8 steps. SSR should be faster than any other post-process effect.

SSAO - better to implement it at multiple resolutions with upsampling: faster, better quality, no noise and no need for a post-blur.
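For the upsampling step, a minimal depth-aware filter might look like this (a sketch; the half-resolution AO/depth textures and the weighting are illustrative):

	uniform sampler2D aoHalfTex;     // half-resolution SSAO result
	uniform sampler2D depthHalfTex;  // half-resolution linear depth

	float upsampleAO(vec2 uv, float fullResDepth)
	{
	    vec2 texel = 1.0/vec2(textureSize(aoHalfTex, 0));
	    float totalAO = 0.0;
	    float totalW = 1e-4;
	    for (int y = 0; y < 2; ++y)
	    for (int x = 0; x < 2; ++x)
	    {
	        vec2 sampleUV = uv + (vec2(x, y) - 0.5)*texel;
	        // Weight low-res samples by how well their depth matches this pixel
	        float w = 1.0/(abs(fullResDepth - texture(depthHalfTex, sampleUV).r) + 1e-3);
	        totalAO += texture(aoHalfTex, sampleUV).r*w;
	        totalW += w;
	    }
	    return totalAO/totalW;
	}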


I would ditch all the screen-space hacks if I already had a voxel data structure sitting there anyway...


Yeah, your SSAO and SSR implementations seem very bloated. There's definitely a lot of room for optimizations here.

 

As for deferred shading: why would your cost for geometry go up 3x? You just need to use MRT to output some g-buffer data.
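For example, one geometry pass can fill the whole g-buffer at once through MRT, so the geometry is transformed a single time for all outputs rather than once per target (a sketch; the attachment layout and varying names are illustrative):

	layout(location = 0) out vec4 gAlbedo;
	layout(location = 1) out vec4 gNormal;
	layout(location = 2) out vec4 gPosition;   // or reconstruct position from depth instead

	in vec3 fWorldPos;
	in vec3 fNorm;
	in vec2 fTexCoord;

	uniform sampler2D albedoTex;

	void main()
	{
	    gAlbedo   = texture(albedoTex, fTexCoord);
	    gNormal   = vec4(normalize(fNorm)*0.5 + 0.5, 0.0);
	    gPosition = vec4(fWorldPos, 1.0);
	}

On the host side the three colour attachments are enabled once with glDrawBuffers; the vertex work is not repeated per g-buffer target.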


I would ditch all the screen-space hacks if I already had a voxel data structure sitting there anyway...

 

The problem is that computation and memory cost go up quite quickly with increased voxel resolution, and he's only getting that performance with a small room. Ideally you'd have, say, a 16 square kilometer grid centered around the player, which is going to cost more than enough by itself without even getting near voxel resolution comparable to screen resolution.


 

I would ditch all the screen-space hacks if I already had a voxel data structure sitting there anyway...

 

The problem is that computation and memory cost go up quite quickly with increased voxel resolution, and he's only getting that performance with a small room. Ideally you'd have, say, a 16 square kilometer grid centered around the player, which is going to cost more than enough by itself without even getting near voxel resolution comparable to screen resolution.

 

 

Screen-space effects are still useful for voxel cone tracing because the voxels often don't have enough resolution to provide finer details. For instance, the ambient occlusion generated naturally by the cone tracing tends to look a bit washed out due to the lack of geometric detail, and thus benefits from SSAO to bring back fine detail.

The same goes for reflections: if you want sharp reflections you'd need very small voxels, which is impractical and expensive, so screen-space reflections can help a lot there. For blurred reflections, however, voxel cone tracing is very good.
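A tiny sketch of that combination, assuming the SSR pass writes a 0..1 hit/fade confidence into its alpha channel (names are illustrative):

	vec3 combineSpecular(vec4 ssrColor, vec3 coneTracedSpecular)
	{
	    // Use the sharp screen-space result where the ray found a hit,
	    // and fall back to the blurrier cone-traced specular elsewhere.
	    return mix(coneTracedSpecular, ssrColor.rgb, clamp(ssrColor.a, 0.0, 1.0));
	}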


Here's my SSR code for anyone who can help me optimize it whilst still keeping some plausible quality:

	vec4 bColor = vec4(0.0);

	vec4 N = normalize(fNorm);
	mat3 tbn = mat3(tanMat*N.xyz, bitanMat*N.xyz, N.xyz);
	vec4 bumpMap = texture(bumpTex, texRes*fTexCoord);
	vec3 texN = (bumpMap.xyz*2.0 - 1.0);
	vec3 bumpN = bumpOn == true ? normalize(tbn*texN) : N.xyz;

	vec3 camSpaceNorm = vec3(view*(vec4(bumpN,N.w)));
	vec3 camSpacePos = vec3(view*worldPos);

	vec3 camSpaceViewDir = normalize(camSpacePos);
	vec3 camSpaceVec = normalize(reflect(camSpaceViewDir,camSpaceNorm));

	vec4 clipSpace = proj*vec4(camSpacePos,1);
	vec3 NDCSpace = clipSpace.xyz/clipSpace.w;
	vec3 screenSpacePos = 0.5*NDCSpace+0.5;

	vec3 camSpaceVecPos = camSpacePos+camSpaceVec;
	clipSpace = proj*vec4(camSpaceVecPos,1);
	NDCSpace = clipSpace.xyz/clipSpace.w;
	vec3 screenSpaceVecPos = 0.5*NDCSpace+0.5;
	vec3 screenSpaceVec = 0.01*normalize(screenSpaceVecPos - screenSpacePos);

	vec3 oldPos = screenSpacePos + screenSpaceVec;
	vec3 currPos = oldPos + screenSpaceVec;
	int count = 0;
	int nRefine = 0;
	float fade = 1.0;
	float fadeScreen = 0.0;
	float farPlane = 2.0;
	float nearPlane = 0.1;

	// Fresnel-style weighting: reflections get stronger at grazing angles of incidence
	float cosAngInc = -dot(camSpaceViewDir,camSpaceNorm);
	cosAngInc = clamp(1-cosAngInc,0.3,1.0);
	
	if(specConeRatio <= 0.1 && ssrOn == true)
	{
	while(count < 50)
	{
		if(currPos.x < 0 || currPos.x > 1 || currPos.y < 0 || currPos.y > 1 || currPos.z < 0 || currPos.z > 1)
			break;

		vec2 ssPos = currPos.xy;

		// linearize the hardware depth values using the near/far planes above
		float currDepth = 2.0*nearPlane/(farPlane+nearPlane-currPos.z*(farPlane-nearPlane));
		float sampleDepth = 2.0*nearPlane/(farPlane+nearPlane-texture(depthTex, ssPos).x*(farPlane-nearPlane));
		float diff = currDepth - sampleDepth;
		float error = length(screenSpaceVec);
		if(diff >= 0 && diff < error)
		{
			screenSpaceVec *= 0.7;
			currPos = oldPos;
			nRefine++;
			if(nRefine >= 3)
			{
					fade = float(count);
					fade = clamp(fade*fade/100,1.0,40.0);
					fadeScreen = distance(ssPos,vec2(0.5,0.5))*2;
					bColor.xyz += texture(reflTex, ssPos).xyz/2/fade*cosAngInc*(1-clamp(fadeScreen,0.0,1.0));
				break;
			}
		} else if(diff > error){
			bColor.xyz = vec3(0);
			sampleDepth = 2.0*nearPlane/(farPlane+nearPlane-texture(depthBTex, ssPos).x*(farPlane-nearPlane));
			diff = currDepth - sampleDepth;
			if(diff >= 0 && diff < error)
			{
				screenSpaceVec *= 0.7;
				currPos = oldPos;
				nRefine++;
				if(nRefine >= 3)
				{
					fade = float(count);
					fade = clamp(fade*fade/100,2.0,20.0);
					bColor.xyz += texture(reflTex, ssPos).xyz/2/fade*cosAngInc;
					break;
				}	
			}
		}

		oldPos = currPos;
		currPos = oldPos + screenSpaceVec;
		count++;

	}
	}

Note that the second half of the code (after the else if(diff > error)) is where I cover the back faces of models (depthBTex is a depth texture rendered with front-face culling) so that the backs of models are reflected.

// Iterative refinement: guess a distance along the reflection ray, project that
// point to the screen, read back the surface position actually stored there,
// and use the new distance as the next guess.
float L = 0.1;                  // initial guess for the distance along the reflection ray
float4 T = 0;
float3 NewPos;
for (int i = 0; i < 10; i++)
{
    NewPos = RealPos + R*L;     // RealPos - current position, R - reflection vector
    T = mul(float4(NewPos, 1), mat_ViewProj);      // project the guessed position to screen
    T.xy = 0.5 + 0.5*float2(1, -1)*T.xy/T.w;
    // Look up the world position stored in the g-buffer at that screen location
    NewPos = GetWorldPos(GBufferPositions.Load(uint2(gbufferDim.xy*T.xy), 0), T.xy, mat_ViewProjI);
    L = length(RealPos - NewPos);                  // new distance for the next iteration
}

T.xy is the texture coordinate of the reflected pixel.


I've managed to increase the speed of my SSR to 5.3ms, at the cost of reduced quality, by using a variable step distance - so now I'm using 20 steps instead of 50.

 

[attachment=18456:giboxssr10.png]

 

Even if I get it down to 10 steps and remove the additional back-face cover, it will still be 3.1ms - is this fast enough, or can it be optimized further?


So I've managed to remove some of the artifacts from my soft shadows:

Previously, when I had used front-face culling I got the following issue:

[attachment=18552:givoxshadows8-0.jpg]

 

This was due to back faces not being captured by the shadow-caster camera at overlapping surfaces, leading to a gap of missing information in the depth test. There's also the issue of back-face self-shadowing artifacts.

 

Using back-face culling (only rendering the front faces) resolves this problem; however, it leads to the following one:

[attachment=18553:givoxshadows8-1.jpg]

These are front-face self-shadowing artifacts - no amount of bias resolves this, because it is caused by the jittering process during depth testing.

 

I came up with a solution that resolves all of these issues for direct-lighting shadows: I also store an individual object id for each object in the scene from the shadow caster's point of view. During depth testing I then compare the object id seen from the player camera with the one seen from the shadow caster, and make it so that an object never casts a shadow onto itself:

[attachment=18554:givoxshadows8-2.jpg]
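A sketch of that test, assuming the shadow pass writes a per-object id alongside depth (texture and uniform names are illustrative, not the actual shaders):

	uniform sampler2D shadowDepthTex;  // depth from the light's point of view
	uniform sampler2D shadowIdTex;     // per-object id from the light's point of view
	uniform float currentObjectId;     // id of the object currently being shaded

	float shadowFactor(vec3 shadowCoord, float bias)   // shadowCoord = uv + depth in light space
	{
	    float casterId = texture(shadowIdTex, shadowCoord.xy).r;
	    // Skip the depth test when the nearest caster at this texel is the
	    // object being shaded, so an object never shadows itself.
	    if (abs(casterId - currentObjectId) < 0.5)
	        return 1.0;
	    float casterDepth = texture(shadowDepthTex, shadowCoord.xy).r;
	    return (shadowCoord.z - bias > casterDepth) ? 0.0 : 1.0;
	}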

 

Now this is all good for direct lighting, because I set everything that is not directly lit, including shadows, to zero and then add the indirect light on top - so there's a smooth transition between the shadow and the unlit part of each object.

[attachment=18557:givoxshadleak2.jpg]

 

For indirectly lit scenes with no direct lighting at all (i.e. lit emissively by objects), things are a bit different. I don't separate the second bounce from the subsequent bounces; all bounces are tied together. Thus I cannot just treat the second bounce as the "direct lighting", set everything else including shadows to zero, and then add the subsequent bounces - that would require an additional voxel texture and I would need to double the number of cone traces.

I cheat by making the shadowed parts of the scene darker than the non-shadowed parts (where a more accurate algorithm would set shadowed areas to zero and add the subsequent bounces to them). This, together with the removal of any self-shadowing, leads to shadow leaking:

[attachment=18555:givoxshadleak1.jpg][attachment=18556:givoxshadleak0.jpg]

 

So I think I have two options:

  1. Add another voxel texture for the second bounce and double the number of cone traces (most expensive).
  2. Switch back to back-face rendering with front-face culling for the shadow mapping only for emissive lighting shadows (lots of ugly artifacts).

I wonder if anyone can come up with any other ideas.


I just tested this with my brand new EVGA GTX 780 and it runs at an average of 95fps at 1080p with all screen-space effects turned on (SSAO, SSR, all soft shadows). In fact, the screen-space effects seem to make barely a dent in the framerate.

 

I discovered something very unusual when testing different voxel volume resolutions. Here are my results:

32x32x32 -> 95fps (37MB memory)

64x64x64 -> 64fps (37MB memory)

128x128x128 -> 52fps (37MB memory)

256x256x256 -> 31fps (38MB memory)

512x512x512 -> 7fps (3.2GB memory)

 

How on earth did I jump from 38MB of memory to 3.2GB of memory used when going from a 256 to a 512 3D texture resolution?!


I just tested this with my brand new EVGA GTX 780 and it runs at an average of 95fps at 1080p with all screen-space effects turned on (SSAO, SSR, all soft shadows). In fact, the screen-space effects seem to make barely a dent in the framerate.

 

I discovered something very unusual when testing different voxel volume resolutions. Here are my results:

32x32x32 -> 95fps (37MB memory)

64x64x64 -> 64fps (37MB memory)

128x128x128 -> 52fps (37MB memory)

256x256x256 -> 31fps (38MB memory)

512x512x512 -> 7fps (3.2GB memory)

 

How on earth did I jump from 38MB of memory to 3.2GB of memory used when going from a 256 to a 512 3D texture resolution?!

 

Obviously your profiler is broken somehow, as I doubt your experiment manages to hold ever-increasing data in the exact same amount of RAM.

Edited by Frenetic Pony



Obviously your profiler is broken somehow, as I doubt your experiment manages to hold ever-increasing data in the exact same amount of RAM.

 

Actually, I'm using Task Manager to get the amount of RAM my application is using.


 


Obviously your profiler is broken somehow, as I doubt your experiment manages to hold ever-increasing data in the exact same amount of RAM.

 

Actually, I'm using Task Manager to get the amount of RAM my application is using.

 

Sounds like you hit your video card's memory limit and the drivers are now using system memory - which is also why your framerate tanks. Task Manager only shows system memory usage, not the memory internal to the video card.
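As a rough sanity check (assuming a single RGBA8 volume; the demo may use a different format and more than one volume): 256^3 x 4 bytes is already ~64 MB, which is more than the 38 MB Task Manager reported, so that number was never counting video memory. 512^3 x 4 bytes is ~512 MB per volume (~585 MB with a full mip chain), and with several such volumes around that can push past the GTX 780's 3 GB, at which point the driver starts paging into system RAM - which is exactly when Task Manager suddenly shows gigabytes.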


Just a general idea regarding the light-info accumulation concept that has been floating around in my head for some time now and that I finally want to get out:

Instead of cone tracing per screen pixel (which is how the technique works by default, IIRC), couldn't you separate your view frustum into cells (similar to what you do for clustered shading, but perhaps with cube-shaped cells), accumulate the light information in these cells as spherical harmonics using cone tracing, and finally use this SH "volume" to light your scene?

You would of course end up with low-frequency information only suitable for diffuse lighting (as with light propagation volumes, but still with less quantization, since you would not necessarily propagate the information iteratively - or at least with fewer steps if you choose to do so to keep the trace range shorter). On the other hand, you could probably reduce the number of required cone traces considerably (you would also only need to fill cells that intersect geometry, if you choose not to propagate iteratively) and, to some extent, decouple the number of traces from the output pixel count.
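For concreteness, evaluating such a cell volume in the final shading pass might look roughly like this (a sketch with 2-band SH, one texture per colour channel; the names are illustrative and the constants are the usual cosine-lobe convolution factors):

	uniform sampler3D shR;   // per-cell SH: xyz = linear (L1) terms, w = constant (L0) term, red channel
	uniform sampler3D shG;
	uniform sampler3D shB;

	float shIrradiance(vec4 sh, vec3 n)
	{
	    // 2-band irradiance reconstruction with the cosine lobe folded in
	    return max(0.0, 0.886*sh.w + 1.023*dot(sh.xyz, n));
	}

	vec3 diffuseFromCells(vec3 cellUVW, vec3 normal)
	{
	    return vec3(shIrradiance(texture(shR, cellUVW), normal),
	                shIrradiance(texture(shG, cellUVW), normal),
	                shIrradiance(texture(shB, cellUVW), normal));
	}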

Just an idea.

Edited by Bummel


That's similar to what others have already done, which is to downsample before tracing and then upsample the results (with some trickery for fine edges). The main problem with just doing cells is that an always-present (and temporally stable) specular term is part of what really sells GI to begin with. Still, it's an idea if you're really performance-bound.

 

I think I mentioned a similar idea, but just for particles, which are going to be diffuse-only for the most part anyway, and it would be really helpful with layers of transparency. Now that I think about it, it would also work well for very distant objects: while specular doesn't actually fall off, of course, anything but primary specular (say, from the sun) shouldn't be too noticeable really far away.

 

As for transparency, "inferred" or stippled transparency rendering would be really useful for cone tracing. I'm not sure you could also downsample the tracing simultaneously, but it would still prevent tracing from multiple layers of transparency.

 

As for using a directed acyclic graph: I've been thinking that you'd need to store albedo/position information separately, mipmap that, and then figure out a way to apply lighting to different portions dynamically and uniquely using the indirection table. In case you're missing what I'm talking about, a Directed Acyclic Graph would merge identical copies of voxel regions into just one copy, and then use a table - an "indirection table" - to direct the tracing to where each copied block sits in world space.


The main problem with just doing cells is that an always-present (and temporally stable) specular term is part of what really sells GI to begin with.

As I understand it, the diffuse part is actually the costly one, because of the large number of cones you need to trace per pixel in the default solution. So for rather sharp glossy highlights you could keep tracing per pixel, without the intermediate accumulation step into the SH volume. But that's of course just the theory.
