gboxentertainment

OpenGL Voxel Cone Tracing Experiment - Part 2 Progress



Looks pretty good - where's your main bottleneck for performance? Is it really your new mip-mapping?

 

No, mip-mapping is still pretty cheap in the scheme of things (though the cost may accumulate later when I try to implement cascades).

Right now the main performance bottlenecks are soft-shadowing, SSR and SSAO.

I haven't bothered to set up any sort of way of querying the actual cost of each feature, so I can't really tell you accurately where the costs come from. All I can tell you is the framerate I remember from before I implemented soft-shadowing, SSR and SSAO - it was running at 50fps with the same cone-tracing features (except that I now have the modified mip-mapping). I think soft-shadowing (for the main point light and 3 emissive objects) pushed it down to ~35fps, SSR pushed it down to ~25fps and SSAO pushed it to ~20fps.

 

I believe that there is a lot of cost to binding all these textures, so I think the next step for me in improving performance would be to delve into bindless graphics. I also want to try to get partially resident textures working on my NVIDIA card with OpenGL 4.4, but I haven't found any resources to help me with this - has anyone been able to implement it?

Hodgman    51237

It's well worth spending a day to implement a basic GPU profiling system (I guess using ARB_timer_query in GL? I've only done it in D3D so far.) so you can see how many milliseconds are being burnt by your different shaders/passes. You can then print them to the screen, or the console, or a file, etc, and get decent statistics for all of your different features in one go.
 
Measuring performance with FPS is quite annoying on the other hand, requiring you to take before/after FPS measurements when turning a feature on and off.
50fps = 20ms/frame for drawing the scene, voxelizing it, cone tracing and direct lighting?
35fps = 28.5ms/frame == 8.5ms increase for adding soft-shadowing
25fps = 40ms/frame == 11.5ms increase for adding SSR
20fps = 50ms/frame == 10ms increase for adding SSAO
 

I believe that there is a lot of cost to binding all these textures, so I think the next step for me in improving performance would be to delve into bindless graphics.

That will probably only be a CPU-side optimization, and by the sounds of it, your application is probably bottlenecked by the GPU workload.
 
Awesome results BTW :D

Edited by Hodgman

FreneticPonE    3294

It's well worth spending a day to implement a basic GPU profiling system (I guess using ARB_timer_query in GL? I've only done it in D3D so far.) so you can see how many milliseconds are being burnt by your different shaders/passes. You can then print them to the screen, or the console, or a file, etc, and get decent statistics for all of your different features in one go.

 

 

This is an excellent idea, especially since there are plenty of well-established, well-researched optimizations for paring down SSAO/SSR. It's really the cone tracing where you'd be trying novel optimizations, though I can think of several that have already been done.

 

One is downsampling, or otherwise binning together blocks of pixels, for the diffuse trace. This works well for pixels that are relatively close to each other, but can miss thin objects and edges in some cases. Epic also does this with the specular trace, though they never explained how beyond a hand-wavy "then you upsample and scatter".

 

To reduce the number of samples for the specular trace, you can check the alpha of a lower mip level of the volume to see if it's empty and can be skipped. I'm also interested in, and may eventually get back to, figuring out realtime signed distance fields. These should give you a minimum step size you can safely skip ahead by while tracing, reducing the number of samples you need.
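Roughly something like this for the empty-space skip (untested sketch - voxelTex, voxelDim and the constants are just illustrative, not from your code):

uniform sampler3D voxelTex;  // voxel volume with a full mip chain, alpha = occupancy
uniform float voxelDim;      // e.g. 64.0 for a 64^3 volume

// startPos and dir are in the volume's [0,1] texture space
vec4 traceSpecularCone(vec3 startPos, vec3 dir, float coneRatio)
{
	vec4 accum = vec4(0.0);
	float dist = 1.0/voxelDim;                    // start one voxel out
	while(dist < 1.0 && accum.a < 1.0)
	{
		float diameter = max(coneRatio*dist, 1.0/voxelDim);
		float mip = log2(diameter*voxelDim);
		vec3 samplePos = startPos + dir*dist;

		// probe a coarser mip first: if its alpha is ~0 the whole block is empty,
		// so jump ahead instead of sampling it
		if(textureLod(voxelTex, samplePos, mip + 2.0).a < 0.01)
		{
			dist += diameter*4.0;
			continue;
		}

		vec4 s = textureLod(voxelTex, samplePos, mip);
		accum += (1.0 - accum.a)*s;               // front-to-back blend
		dist += diameter;
	}
	return accum;
}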

 

As for a huge area (if you want to go that far), volume LOD and a Directed Acyclic Graph (as I mentioned earlier) should reduce memory consumption a lot, since you're using volume textures and not a sparse octree - though the paper is based on doing this for an octree, so I'm not sure how it would play out with a uniform volume texture.

Edited by Frenetic Pony


It's well worth spending a day to implement a basic GPU profiling system

 

First value is for the 32x32x32 texture, second for the 64x64x64 texture (times in ms):

Direct light shadow: 6.3; 6.3
Emissive light shadows (all three): 13.1; 13.1
First voxelization: 0.01; 0.01
Second-bounce voxelization (5 diffuse cones): 1.0; 1.6
Mip-mapping and filtering (3x3x3 filtering): 0.6; 1.3
Final rendering (5 diffuse cones + 1 specular cone): 14.7; 16.8
Post (SSR + SSAO): 17.2; 17.2
Total: 52.91; 56.31

Edited by gboxentertainment


I'm curious as to why your SSAO + SSR are so expensive.

 

Did some debugging and found out that (because I'm using forward rendering) I had accidentally used the hi-res version of the Buddha model (over a million tris) for my SSAO and SSR.

So instead of the 1.0ms vertex shader cost of the low-poly model, I was getting 10ms.

My SSAO is about 8.5ms now.

However, when I previously reported my results I didn't actually have any SSR turned on; with SSR enabled for the entire scene (all surfaces) it adds another 8.8ms.

I guess there's still a lot of room to optimize my SSR - when I implemented it I was aiming more for the best quality I could get than for performance.

I've managed to reduce my SSAO to 4.7ms without too much quality loss.

 

I'm trying to work out whether deferred shading has an advantage over my current forward shading. With deferred shading I'd have to render the Buddha at full res once each for the position, normal and albedo textures, so that would be a fixed vertex shader cost of 30ms. At the moment with forward shading I render the model at full res once and at low res 7 times, which makes 17ms altogether in vertex shader costs.

Edited by gboxentertainment


Hi! Try SSR with an iterative step rather than a fixed step - you will find the reflected pixel in 3-8 steps. SSR should be faster than any other post-process effect.

For SSAO, it's better to implement it at multiple resolutions with upsampling - faster, better quality, no noise, and no need for a post-blur.
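For example, the upsample can look something like this (rough sketch only - aoHalfTex, depthTex and depthHalfTex are example names):

#version 330
uniform sampler2D aoHalfTex;     // AO computed at half resolution
uniform sampler2D depthTex;      // full-resolution linear depth
uniform sampler2D depthHalfTex;  // half-resolution linear depth
in vec2 fTexCoord;
out vec4 fragColor;

void main()
{
	float d = texture(depthTex, fTexCoord).x;
	float ao = 0.0;
	float wSum = 0.0;
	// weight the four nearest half-res samples by depth similarity (depth-aware upsample)
	for(int i = 0; i < 4; i++)
	{
		vec2 off = vec2(float(i & 1) - 0.5, float(i >> 1) - 0.5) / vec2(textureSize(aoHalfTex, 0));
		float dl = texture(depthHalfTex, fTexCoord + off).x;
		float w = 1.0 / (abs(d - dl) + 1e-4);
		ao += w * texture(aoHalfTex, fTexCoord + off).x;
		wSum += w;
	}
	fragColor = vec4(vec3(ao/wSum), 1.0);
}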

Styves    1792

Yeah, your SSAO and SSR implementations seem very bloated. There's definitely a lot of room for optimizations here.

 

As for deferred shading: why would your cost for geometry go up 3x? You just need to use MRT to output some g-buffer data.
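Something like this in the fragment shader (untested sketch - the target layout and input names are just illustrative):

#version 330
uniform sampler2D albedoTex;
in vec2 fTexCoord;
in vec4 fNorm;
in vec4 worldPos;

layout(location = 0) out vec4 outAlbedo;
layout(location = 1) out vec4 outNormal;
layout(location = 2) out vec4 outPosition;

void main()
{
	// one geometry pass fills every G-buffer target at once,
	// so the hi-res mesh only goes through the vertex shader once
	outAlbedo   = texture(albedoTex, fTexCoord);
	outNormal   = vec4(normalize(fNorm.xyz)*0.5 + 0.5, 1.0);
	outPosition = vec4(worldPos.xyz, 1.0);
}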

FreneticPonE    3294

I would ditch all the screen-space hacks if I already had a voxel data structure sitting there anyway...

 

The problem is that computation and memory cost go up quite quickly with increased voxel resolution, and he's only getting those numbers with a small room. Ideally you'd have, say, a 16 square kilometer grid centered around the player, which is going to cost more than enough by itself without even approaching a voxel resolution comparable to screen resolution.

jcabeleira    723

 

I would ditch all the screen-space hacks if I already had a voxel data structure sitting there anyway...

 

The problem is that computation and memory cost go up quite quickly with increased voxel resolution, and he's only getting those numbers with a small room. Ideally you'd have, say, a 16 square kilometer grid centered around the player, which is going to cost more than enough by itself without even approaching a voxel resolution comparable to screen resolution.

 

 

Screen-space effects are still useful with voxel cone tracing because the voxels often don't have enough resolution to provide finer detail. For instance, the ambient occlusion generated naturally by the cone tracing tends to look a bit washed out due to the lack of geometric detail, so it can benefit from SSAO to bring back the fine detail.

The same goes for reflections: if you want sharp reflections you'd need very small voxels, which is impractical and expensive, and this is where screen-space reflections can help a lot. For blurred reflections, however, voxel cone tracing is very good.


Here's my SSR code for anyone who can help me optimize it whilst still keeping plausible quality:

	vec4 bColor = vec4(0.0);

	vec4 N = normalize(fNorm);
	mat3 tbn = mat3(tanMat*N.xyz, bitanMat*N.xyz, N.xyz);
	vec4 bumpMap = texture(bumpTex, texRes*fTexCoord);
	vec3 texN = (bumpMap.xyz*2.0 - 1.0);
	vec3 bumpN = bumpOn == true ? normalize(tbn*texN) : N.xyz;

	vec3 camSpaceNorm = vec3(view*(vec4(bumpN,N.w)));
	vec3 camSpacePos = vec3(view*worldPos);

	vec3 camSpaceViewDir = normalize(camSpacePos);
	vec3 camSpaceVec = normalize(reflect(camSpaceViewDir,camSpaceNorm));

	vec4 clipSpace = proj*vec4(camSpacePos,1);
	vec3 NDCSpace = clipSpace.xyz/clipSpace.w;
	vec3 screenSpacePos = 0.5*NDCSpace+0.5;

	vec3 camSpaceVecPos = camSpacePos+camSpaceVec;
	clipSpace = proj*vec4(camSpaceVecPos,1);
	NDCSpace = clipSpace.xyz/clipSpace.w;
	vec3 screenSpaceVecPos = 0.5*NDCSpace+0.5;
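	// march along the reflection direction in screen space with a fixed 0.01 step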
	vec3 screenSpaceVec = 0.01*normalize(screenSpaceVecPos - screenSpacePos);

	vec3 oldPos = screenSpacePos + screenSpaceVec;
	vec3 currPos = oldPos + screenSpaceVec;
	int count = 0;
	int nRefine = 0;
	float fade = 1.0;
	float fadeScreen = 0.0;
	float farPlane = 2.0;
	float nearPlane = 0.1;

	float cosAngInc = -dot(camSpaceViewDir,camSpaceNorm);
	cosAngInc = clamp(1-cosAngInc,0.3,1.0);
	
	if(specConeRatio <= 0.1 && ssrOn == true)
	{
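	// ray-march up to 50 steps; stop on leaving the screen or after a confirmed hit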
	while(count < 50)
	{
		if(currPos.x < 0 || currPos.x > 1 || currPos.y < 0 || currPos.y > 1 || currPos.z < 0 || currPos.z > 1)
			break;

		vec2 ssPos = currPos.xy;
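		// linearize both the ray depth and the depth-buffer sample before comparing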

		float currDepth = 2.0*nearPlane/(farPlane+nearPlane-currPos.z*(farPlane-nearPlane));
		float sampleDepth = 2.0*nearPlane/(farPlane+nearPlane-texture(depthTex, ssPos).x*(farPlane-nearPlane));
		float diff = currDepth - sampleDepth;
		float error = length(screenSpaceVec);
		if(diff >= 0 && diff < error)
		{
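			// potential hit: step back, shrink the step to 0.7x and refine; accept after 3 refinements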
			screenSpaceVec *= 0.7;
			currPos = oldPos;
			nRefine++;
			if(nRefine >= 3)
			{
					fade = float(count);
					fade = clamp(fade*fade/100,1.0,40.0);
					fadeScreen = distance(ssPos,vec2(0.5,0.5))*2;
					bColor.xyz += texture(reflTex, ssPos).xyz/2/fade*cosAngInc*(1-clamp(fadeScreen,0.0,1.0));
				break;
			}
		} else if(diff > error){
			bColor.xyz = vec3(0);
			sampleDepth = 2.0*nearPlane/(farPlane+nearPlane-texture(depthBTex, ssPos).x*(farPlane-nearPlane));
			diff = currDepth - sampleDepth;
			if(diff >= 0 && diff < error)
			{
				screenSpaceVec *= 0.7;
				currPos = oldPos;
				nRefine++;
				if(nRefine >= 3)
				{
					fade = float(count);
					fade = clamp(fade*fade/100,2.0,20.0);
					bColor.xyz += texture(reflTex, ssPos).xyz/2/fade*cosAngInc;
					break;
				}	
			}
		}

		oldPos = currPos;
		currPos = oldPos + screenSpaceVec;
		count++;

	}
	}

Note that the second half of the code (after the else if(diff > error)) is where I cover the back faces of models (depthBTex is a depth texture rendered with front-face culling) so that the backs of models are reflected as well.

float L = 0.1;
float4 T = 0;
float3 NewPos;
for(int i = 0; i < 10; i++)
{
	NewPos = RealPos + R*L;                    // RealPos - current position, R - reflection vector
	T = mul(float4(NewPos, 1), mat_ViewProj);  // project the new position to screen
	T.xy = 0.5 + 0.5*float2(1, -1)*T.xy/T.w;
	NewPos = GetWorldPos(GBufferPositions.Load(uint2(gbufferDim.xy*T.xy), 0), T.xy, mat_ViewProjI); // world position at that pixel
	L = length(RealPos - NewPos);              // new distance
}

T.xy - texture coordinate of the reflected pixel


I've managed to speed my SSR up to 5.3ms, at the cost of some quality, by using a variable step distance - so now I'm using 20 steps instead of 50.

 

[attachment=18456:giboxssr10.png]

 

Even if I get it down to 10 steps and remove the additional back-face coverage, it will still be 3.1ms - is this fast enough, or can it be optimized further?


So I've managed to remove some of the artifacts from my soft shadows:

Previously, when I had used front-face culling I got the following issue:

[attachment=18552:givoxshadows8-0.jpg]

 

This was due to back faces not being captured by the shadow-caster camera at overlapping surfaces, leaving a gap of missing information in the depth test. There's also the issue of back-face self-shadowing artifacts.

 

Using back-face culling (only rendering the front faces) resolves this problem, but leads to the following one:

[attachment=18553:givoxshadows8-1.jpg]

These are front-face self-shadowing artifacts - no amount of bias resolves them, because they are caused by the jittering process during depth testing.

 

I came up with a solution that resolves all these issues for direct lighting shadows, which is to also store an individual object id for each object in the scene from the shadow-caster's point of view. During depth testing, I then compare the object id from the player camera's point of view with that from the shadow-caster's point of view and make it so that each object does not cast its own shadow onto itself:

[attachment=18554:givoxshadows8-2.jpg]
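In shader terms the test is roughly this (simplified sketch, not my actual code - the texture and variable names are just illustrative):

uniform sampler2D shadowDepthTex;  // depth from the shadow caster's point of view
uniform sampler2D shadowIdTex;     // object ids from the shadow caster's point of view
uniform float objectId;            // id of the surface currently being shaded

float shadowTest(vec3 shadowCoord)
{
	float casterDepth = texture(shadowDepthTex, shadowCoord.xy).x;
	float casterId = texture(shadowIdTex, shadowCoord.xy).x;

	// an object never shadows itself: skip the depth comparison when the
	// nearest caster is the same object as the receiver
	if(abs(casterId - objectId) < 0.5)
		return 1.0;

	return shadowCoord.z > casterDepth ? 0.0 : 1.0;
}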

 

Now this is all fine for direct lighting, because I set everything that is not directly lit (including shadows) to zero and then add the indirect light on top of that zero - so there's a smooth transition between the shadow and the unlit part of each object.

[attachment=18557:givoxshadleak2.jpg]

 

For indirectly lit scenes with no direct lighting at all (i.e. lit emissively by objects), things are a bit different. I don't separate the second bounce from the subsequent bounces; all bounces are tied together - so I cannot just treat the second bounce as the "direct lighting", set everything else (including shadows) to zero, and then add the subsequent bounces on top. That would require an additional voxel texture and would double the number of cone traces.

Instead I cheat by making the shadowed parts of the scene darker than the non-shadowed parts (where a more accurate algorithm would be to make shadowed areas zero and add the subsequent bounces to those areas). This, together with the removal of any self-shadowing, leads to shadow leaking:

[attachment=18555:givoxshadleak1.jpg][attachment=18556:givoxshadleak0.jpg]

 

So I think I have two options:

  1. Add another voxel texture for the second bounce and double the number of cone traces (most expensive).
  2. Switch back to back-face rendering with front-face culling for the shadow mapping only for emissive lighting shadows (lots of ugly artifacts).

I wonder if anyone can come up with any other ideas.


I just tested this with my brand new EVGA GTX 780 and it runs at an average of 95fps at 1080p with all screen-space effects turned on (SSAO, SSR, all soft shadows). In fact, the screen-space effects seem to make little dent in the framerate.

 

I discovered something very unusual when testing different voxel volume resolutions. Here are my results:

32x32x32 -> 95fps (37MB memory)

64x64x64 -> 64fps (37MB memory)

128x128x128 -> 52fps (37MB memory)

256x256x256 -> 31fps (38MB memory)

512x512x512 -> 7fps (3.2GB memory)

 

How on earth did I jump from 38MB of memory to 3.2GB of memory used when going from a 256 to a 512 3D texture?!

FreneticPonE    3294

I just tested this with my brand new EVGA GTX 780 and it runs at an average of 95fps at 1080p with all screen-space effects turned on (SSAO, SSR, all soft shadows). In fact, the screen-space effects seem to make little dent in the framerate.

 

I discovered something very unusual when testing different voxel volume resolutions. Here are my results:

32x32x32 -> 95fps (37MB memory)

64x64x64 -> 64fps (37MB memory)

128x128x128 -> 52fps (37MB memory)

256x256x256 -> 31fps (38MB memory)

512x512x512 -> 7fps (3.2GB memory)

 

How on earth did I jump from 38MB of memory to 3.2GB of memory used when going from a 256 to a 512 3D texture?!

 

Obviously your profiler is broken somehow, as I doubt your experiment manages to hold ever-increasing data in the exact same amount of RAM.

Edited by Frenetic Pony



Obviously your profiler is broken somehow, as I doubt your experiment manages to hold ever-increasing data in the exact same amount of RAM.

 

Actually, I'm using Task Manager to get the amount of RAM that my application is using.

Digitalfragment    1504

 


Obviously your profiler is broken somehow, as I doubt your experiment manages to hold ever-increasing data in the exact same amount of RAM.

 

Actually, I'm using Task Manager to get the amount of RAM that my application is using.

 

Sounds like you hit your video card's memory limit and the driver is now spilling into system memory - which is also why your frame rate tanks. Task Manager only shows system memory usage, not the memory internal to the video card.
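As a rough back-of-the-envelope check (assuming an RGBA8 volume - adjust for whatever format you're actually using): 512x512x512 texels x 4 bytes is already 512MB for a single 3D texture, and roughly 585MB with a full mip chain. With several volumes like that, or a wider format such as RGBA16F, you blow straight past the 3GB on a GTX 780, at which point the driver starts paging and the overflow shows up in Task Manager as system memory.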

Bummel    1888

Just a general idea regarding the light-info accumulation concept, which has been floating around in my head for some time now and which I finally want to get out:

Instead of cone-tracing per screen pixel (which is how the technique works by default, IIRC), couldn't you separate your view frustum into cells (similar to what you do for clustered shading, but perhaps with cube-shaped cells), use cone tracing to accumulate the light information in those cells as spherical harmonics, and finally use this SH 'volume' to light your scene?

You would of course end up with low-frequency information only suitable for diffuse lighting (as with light propagation volumes, but with less quantization, since you would not necessarily propagate the information iteratively - or at least with fewer steps if you do, to keep the trace range shorter). On the other hand, you could probably reduce the number of required cone traces considerably (you would also only need to fill cells that intersect geometry, if you choose not to propagate iteratively) and, to some extent, decouple the number of traces from the output pixel count.

Just an idea.

Edited by Bummel

FreneticPonE    3294

That's similar to what others have already done, which is to downsample before tracing and then upsample the results (with some trickery for fine edges). The main problem with just doing cells is that an always present (and temporally stable) specular term is part of the thing that really sells GI to begin with. Still, it's an idea if you're really performance bound.

 

I think I mentioned a similar idea but just for particles, which are mostly diffuse-only anyway and where it would be really helpful with layers of transparency. Now that I think about it, it would also work well for very distant objects - while specular doesn't actually fall off, of course, anything but primary specular (say from the sun) shouldn't be too noticeable really far away.

 

As for transparency, "inferred" or stippled transparency rendering would be really useful for cone tracing. I'm not sure you could also downsample the tracing simultaneously, but it would still prevent tracing from multiple layers of transparency.

 

As for using a directed acyclic graph: I've been thinking that you'd need to store the albedo/position information separately, mipmap that, and then figure out a way to apply lighting to different portions dynamically and uniquely using the indirection table. In case you're missing what I'm talking about, a directed acyclic graph would merge identical copies of voxel regions into just one copy, and then use a table (an "indirection table") to direct the tracing to where each copied block sits in world space.

Bummel    1888

 The main problem with just doing cells is that an always present (and temporally stable) specular term is part of the thing that really sells GI to begin with.

As I understand it, the diffuse part is actually the costly one because of the large number of cones you need to trace per pixel in the default solution. So for rather sharp glossy highlights you could keep tracing them per pixel, without the intermediate accumulation step into the SH volume. But that's of course just the theory.


