3D smooth voxel terrain LOD on GPU

Started by
23 comments, last by MasterReDWinD 4 years, 6 months ago

Hi,

I currently have Naive Surface Nets working in a Compute Shader and I'm using a Dictionary to load chunks around the player.  Retrieving meshes from the Compute Shader back to the CPU and creating Unity collision meshes is causing a lot of stalling.  I've decided to tackle this by implementing a LOD system.

I've found the 0fps articles on simplifying isosurfaces to be a weath of information.  After studying those and looking at the papers I've identified what I think are the two main directions for approaching LOD: Mesh Simplification and Geometry Clipmaps.  I am aiming to get one or both of these working in a Compute Shader.

This brings me to my questions:

1)  I've seen a lot of feedback that Mesh Simplification is going to be too slow for realtime terrain editing, is there a consensus on this?  Most algorithms appear to use the half-edge format but I can't find any information on efficiently converting an indexed mesh into half-edge mesh format.  Can anyone provide an example?

2)  Can the Geometry Clipmap approach work with 3D terrain as opposed to heightmaps?  Would I need to store my chunks in Octrees to make it work?  Are there any GPU implementation examples that don't work on heightmaps?

Any advice to help me further focus my research would be much appreciated.  Thanks.

 

 

 

 

Advertisement
2 hours ago, MasterReDWinD said:

1)  I've seen a lot of feedback that Mesh Simplification is going to be too slow for realtime terrain editing, is there a consensus on this?  Most algorithms appear to use the half-edge format but I can't find any information on efficiently converting an indexed mesh into half-edge mesh format.  Can anyone provide an example?

That's quite involved.

Personally i came up with two options: A simpler one, i could post the code (just let me know), but it requires a map to track incomplete half edges. So probably bad for realtime.

Later i had to add full mesh editing functionality. And now i use just that instead the above. Here there is no need for a map, instead the incomplete mesh is traversed to make sure half edges find their proper pair. I'm not sure if that's faster but guess so. (But too much code to post. Also my entire work aims preprocessing not realtime.)

Actaully i do not remember about details. Personally i like half-edge because it is easier to get things done than with alternatives i had used before. But it is almost impossible to write code that is intuitive or easy to follow, and comments do not help much to imagine what goes on geometrically. Because of this it is hard to learn from code of others, reinventing the wheel might be faster and better here.

However, this guys blog was helpful for me: http://kaba.hilvi.org/homepage/blog/halfedge/halfedge.htm

 

 

What are the collision data used for? If it's only for graphical reasons there are other structures such as sdfs you can use and you won't need to send data back to cpu after processing. Otherwise why can't the collision data be baked and just loaded in at runtime, is it softbody?

15 hours ago, JoeJ said:

That's quite involved.

Personally i came up with two options: A simpler one, i could post the code (just let me know), but it requires a map to track incomplete half edges. So probably bad for realtime.

Later i had to add full mesh editing functionality. And now i use just that instead the above. Here there is no need for a map, instead the incomplete mesh is traversed to make sure half edges find their proper pair. I'm not sure if that's faster but guess so. (But too much code to post. Also my entire work aims preprocessing not realtime.)

Actaully i do not remember about details. Personally i like half-edge because it is easier to get things done than with alternatives i had used before. But it is almost impossible to write code that is intuitive or easy to follow, and comments do not help much to imagine what goes on geometrically. Because of this it is hard to learn from code of others, reinventing the wheel might be faster and better here.

However, this guys blog was helpful for me: http://kaba.hilvi.org/homepage/blog/halfedge/halfedge.htm

 

 

Thanks for the reply and for the link, it looks like it touches on some new things that I hadn't seen before. 

Were you generating the mesh from scratch or converting one that you already had?  You mentioned full mesh editing.  That is something I am also targetting.  It probably makes much more sense to get the mesh out of Surface Nets in half-edge format to begin with, then I had just make the edits following the half-edge rules.  I think that is what you are doing?

How did you find the performance in your case?

14 hours ago, 51mon said:

What are the collision data used for? If it's only for graphical reasons there are other structures such as sdfs you can use and you won't need to send data back to cpu after processing. Otherwise why can't the collision data be baked and just loaded in at runtime, is it softbody?

Thanks for your reply.  The collision data will be used for pathfinding and probably for some movable objects/projectiles.  What was your idea involving sdf's?

The meshes are generated at runtime and will need to be recreated if the user edits them.

3 minutes ago, MasterReDWinD said:

Were you generating the mesh from scratch or converting one that you already had?  You mentioned full mesh editing.  That is something I am also targetting.

I did it for loading indexed meshes to get started. For this reason i made the simple solution using std::map.

I use the mesh editing because i work on quadrangulation, and i made edit operations like create polys, edge collapse, edge rotation etc. Likely the same stuff you would need for dynamic LOD. While working on this i realized the map is not necessary.

12 minutes ago, MasterReDWinD said:

It probably makes much more sense to get the mesh out of Surface Nets in half-edge format to begin with

Maybe yes - makes sense, but not sure. Converting indexed mesh to half-edge could be just as good. (The example below is more the kind of half edge mesh from indexed)

 

15 minutes ago, MasterReDWinD said:

How did you find the performance in your case?

I do not target realtime, but here an example (of how slow it is :) ):

image.thumb.png.71e93c3b13a6be5d14154656fa3b3ee9.png

 

So that's a tessellated bunny model, with editing ops used for tessellation.

Input mesh (solid random color) has 2500 vertices, and the tessellation (blue wire) has 15500. Takes about 7 seconds on single core. 

(looks like delauney, but i use a trick to align hex grids to triangles so i do not need expensive delauney triangulation algorithm. It's just connecting grid points to triangles, followed by edge flips and smoothing to improve the result - very most of those 7 seconds are used for mesh creation and editing.)

I have no idea how fast such things can be for realtime use. I do only care about time complexity and try to reserve memory whenever possible.

Multithreading, cache efficiency, avoiding dynamic allocation... that's definitively a lot of work to make such stuff fast.

( @Gnollrunner should be able to give a much better answer)

 

Anyways, i would still think about alternatives first. For example, if you have a volume, would it work to downscale the volume, make surface net on GPU and transfer this back to CPU? 

Mesh simplification (which i assume you want to do on CPU from high res surface net) sounds like you could loose a lot from the advantage to generate your stuff on GPU so far.

I have a mesh simplification stage as well in my project, which uses edge collapses and a priority queue - pretty standart, and not so suited for realtime. Here some work of a guy who tired to avoid the need for a queue, but not aiming realtime either: http://voxels.blogspot.com/2014/05/quadric-mesh-simplification-with-source.html

For realtime usually people try to precalculate the simplification process, like progressive meshes do.

To me it seems much less work to downscale the volume and then use some isosurface extraction.

 

 

 

2 hours ago, JoeJ said:

( @Gnollrunner should be able to give a much better answer)

 

I'm not sure how much help I can be here. I think my approach is WAY different. For one I'm doing almost everything except the graphics on the CPU.  I do have chunks of voxels however my chunks change size along with the voxel size, so the number of voxels in a chunk stays roughly the same.  I say roughly because a chunk is actually an octree so everything depends on the data. In any case that's basically how the LOD is done. It just generates meshes on bigger voxels for chunks that are farther away from the camera. 

As for the mesh format, I use an edged format. Faces point to edges and edges point to vertexes. There are also back pointers from edges to faces which lets me loop around vertexes to generate normals.

My general impression is most LOD with voxel generated smooth terrain is done similarly, by changing voxel sizes. I'm not 100% sure on that, but it does seem to be the easiest way to me, given that voxels can generate some fairly complex shapes.  The only tricky part is that you have to do level transitions.  However I think the surface nets handles that OK so it shouldn't be too bad. I'm using marching prisms so I have to do some extra work for the transitions.

I'm not really targeting editing at this time however it should be just a matter of changing values at voxel corners.

As for threading, in my case it's closely tied to the way I generate stuff.  I don't think it's really relevant to GPU generated meshes but I'll give a brief description. I have something called a ghost tree. which is a light weight octree that represents part of a voxel trees. The ghost trees are generated in threads, are independent of each other and are used to extend leaves of chunk trees as things change.  Since I'm using fractal generated voxel data, I put all my fractal routines in these threads. 

I played around with threading a bit and the fastest way that I've found so far is to have a thread pool where threads are blocked until needed. This gets rid of the thread creation, deletion problem and also gives you and easy way to control the maximum number of threads allowed. I have one special thread that's a dispatcher that gets commands from the main thread and sends them the threads in the pool.

Once a ghost tree is finished, it's then converted to voxels in the main thread. I may try to thread this part too later but it's a little more tricky since that part would involve special threading constructs like mutexes and atomics.  Also I find that most the processing is done in determining how to build your voxels and not the actual voxel building itself so I'm not sure how much I would gain by that.

Finally for collision I have a separate tree that just generates mesh data right around the player that matches the graphics data. I call it  a JIT (Just in time) mesh. The reason I'm doing it this way is to avoid any race conditions as a player moves around. This way it's impossible for an avatar to beat collision mesh generation.

Anyway I don't know if that helps, but good luck.

 

 

11 hours ago, JoeJ said:

I did it for loading indexed meshes to get started. For this reason i made the simple solution using std::map.

I use the mesh editing because i work on quadrangulation, and i made edit operations like create polys, edge collapse, edge rotation etc. Likely the same stuff you would need for dynamic LOD. While working on this i realized the map is not necessary.

Maybe yes - makes sense, but not sure. Converting indexed mesh to half-edge could be just as good. (The example below is more the kind of half edge mesh from indexed)

 

I do not target realtime, but here an example (of how slow it is :) ):

image.thumb.png.71e93c3b13a6be5d14154656fa3b3ee9.png

 

So that's a tessellated bunny model, with editing ops used for tessellation.

Input mesh (solid random color) has 2500 vertices, and the tessellation (blue wire) has 15500. Takes about 7 seconds on single core. 

(looks like delauney, but i use a trick to align hex grids to triangles so i do not need expensive delauney triangulation algorithm. It's just connecting grid points to triangles, followed by edge flips and smoothing to improve the result - very most of those 7 seconds are used for mesh creation and editing.)

I have no idea how fast such things can be for realtime use. I do only care about time complexity and try to reserve memory whenever possible.

Multithreading, cache efficiency, avoiding dynamic allocation... that's definitively a lot of work to make such stuff fast.

( @Gnollrunner should be able to give a much better answer)

 

Anyways, i would still think about alternatives first. For example, if you have a volume, would it work to downscale the volume, make surface net on GPU and transfer this back to CPU? 

Mesh simplification (which i assume you want to do on CPU from high res surface net) sounds like you could loose a lot from the advantage to generate your stuff on GPU so far.

I have a mesh simplification stage as well in my project, which uses edge collapses and a priority queue - pretty standart, and not so suited for realtime. Here some work of a guy who tired to avoid the need for a queue, but not aiming realtime either: http://voxels.blogspot.com/2014/05/quadric-mesh-simplification-with-source.html

For realtime usually people try to precalculate the simplification process, like progressive meshes do.

To me it seems much less work to downscale the volume and then use some isosurface extraction.

 

 

 

Thanks for that clarification.  The illustration and the numbers really help, there isn't much up to date performance data around.

If I did opt for the simplification route I was hoping to run it on the GPU, in either the same or another Compute Shader, so as to have less data to read back to the CPU.  As you say I think reading back the mesh from the GPU and then running simplification on the CPU wouldn't be good for my performance.

I've come across that link but haven't had a good read of it yet, I do know it avoids sorting and use of a priority queue, both performance killers from the reading I've done. I'll investigate that one more.  I'm aware that quadric error metrics are essentially the gold standard in determining the suitability of a particular edge collapse.  I also found this algorithm that I really like the sound of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7091&rep=rep1&type=pdf.  That's what lead me to the question of how to get the initial required half-edge mesh.

When you say precalculate the simplification you're referring to storing different LOD versions of the meshes and loading them on demand?

Downscaling is what I had in mind as a starting point for the Geometry Clipmap route.  If I understand the downscaling idea correctly the simplest approach would be to sample my density grid at every 2nd point (on each axis) rather than every point, so in my case I would reduce the size of the mesh by a factor of 8 on the first past?  Then on the second pass I would sample every fourth density value.

8 hours ago, Gnollrunner said:

I'm not sure how much help I can be here. I think my approach is WAY different. For one I'm doing almost everything except the graphics on the CPU.  I do have chunks of voxels however my chunks change size along with the voxel size, so the number of voxels in a chunk stays roughly the same.  I say roughly because a chunk is actually an octree so everything depends on the data. In any case that's basically how the LOD is done. It just generates meshes on bigger voxels for chunks that are farther away from the camera. 

As for the mesh format, I use an edged format. Faces point to edges and edges point to vertexes. There are also back pointers from edges to faces which lets me loop around vertexes to generate normals.

My general impression is most LOD with voxel generated smooth terrain is done similarly, by changing voxel sizes. I'm not 100% sure on that, but it does seem to be the easiest way to me, given that voxels can generate some fairly complex shapes.  The only tricky part is that you have to do level transitions.  However I think the surface nets handles that OK so it shouldn't be too bad. I'm using marching prisms so I have to do some extra work for the transitions.

I'm not really targeting editing at this time however it should be just a matter of changing values at voxel corners.

As for threading, in my case it's closely tied to the way I generate stuff.  I don't think it's really relevant to GPU generated meshes but I'll give a brief description. I have something called a ghost tree. which is a light weight octree that represents part of a voxel trees. The ghost trees are generated in threads, are independent of each other and are used to extend leaves of chunk trees as things change.  Since I'm using fractal generated voxel data, I put all my fractal routines in these threads. 

I played around with threading a bit and the fastest way that I've found so far is to have a thread pool where threads are blocked until needed. This gets rid of the thread creation, deletion problem and also gives you and easy way to control the maximum number of threads allowed. I have one special thread that's a dispatcher that gets commands from the main thread and sends them the threads in the pool.

Once a ghost tree is finished, it's then converted to voxels in the main thread. I may try to thread this part too later but it's a little more tricky since that part would involve special threading constructs like mutexes and atomics.  Also I find that most the processing is done in determining how to build your voxels and not the actual voxel building itself so I'm not sure how much I would gain by that.

Finally for collision I have a separate tree that just generates mesh data right around the player that matches the graphics data. I call it  a JIT (Just in time) mesh. The reason I'm doing it this way is to avoid any race conditions as a player moves around. This way it's impossible for an avatar to beat collision mesh generation.

Anyway I don't know if that helps, but good luck.

 

 

Thanks for jumping in on this.

I saw a discussion about doubling the size of the chunk at the same time as halving the sampling resolution of the density grid (downsampling).  Is that what you're describing here?  The benefit of this is supposed to be that the total number of meshes rendered is reduced? My chunks are all the same size at the moment.  I have a chunk pool and then use a dictionary to store/load them based on distance from the player.  Are you using something like the clipmap approach with increasingly sized rings around the player?

Are you running any simplification on your meshes?

I had planned to have a thread pool for chunk generation but unfortunately in Unity only the main thread can run Unity rendering related code, which includes Compute shaders.  There is a way to run Compute shader dispatches asynchronously which I'm part way through implementing.  It's not ideal though.

I've read about having a small collision mesh only around the player (and other moving objects).  It sounds like a very promising idea, rather than my current setup of the collision mesh being the surface mesh.  So your player mesh is just a small subset of the surface mesh recreated from an octree as a separate step?  Are there any caveats with this that I should know about?

52 minutes ago, MasterReDWinD said:

 Are you using something like the clipmap approach with increasingly sized rings around the player?
 

I basically calculate the distance from a chunk to the player and base it's resolution on the square of that. I guess it's like spheres around the player. However it's a bit more involved because my entire voxel area is an octree (or 20 rather since I build around an icosahedron since I'm doing planets) so what I'm really doing is going down the octree and applying certain rules for splitting, merging, growing and shrinking chunks.  The chunks themselves are just pointers to points in the octree.

Quote

Are you running any simplification on your meshes?

I had some code for simplification but it's disabled right now because I want to keep the collision meshes which are only generated at max LOD the same as the view meshes and there were some issues at chunk borders. I'll probably try to enable it later when I get those issues ironed out but it's lower priority for me, since I'm working on other stuff.

3 hours ago, MasterReDWinD said:

If I did opt for the simplification route I was hoping to run it on the GPU, in either the same or another Compute Shader, so as to have less data to read back to the CPU.

You may be the first who does this on GPU - would be worth a blog post then... :)

The main problem will be the complexity of the edge collapse operation. Here my code for that so you can imagine:


bool HEMesh_Modeling::IsCollapseableEdge (const int edge, int *data)
{
	assert (edge!=-1);

	int loop[2];
	int prev[2];
	int next[2];

	loop[0] = edge;
	loop[1] = GetEdgePair(edge);

	int vI0 = GetEdgeVertex(loop[0]); 
	int vI1 = GetEdgeVertex(loop[1]); 

	// do not collapse edge if isolating a vertex 
	int edgeCount0 = GetVertexEdgeCount(vI0);
	int edgeCount1 = GetVertexEdgeCount(vI1);
	if (edgeCount0 <= 1) return false;
	if (edgeCount1 <= 1) return false;

	// do not collapse noncontradictable loop
	for (int s=0; s<2; s++)
	{
		int eI = loop[s];
		do {
			eI = GetEdgePair(eI);
			if (eI!=loop[0] && eI!=loop[1])
			{
				int v0 = GetEdgeVertex(eI); 
				int v1 = GetEdgeVertex(GetEdgePair(eI)); 
				if (v0==vI0 && v1==vI1) return false;
				if (v1==vI0 && v0==vI1) return false;
			}
			eI = GetEdgeNext(eI);
			if (eI!=loop[0] && eI!=loop[1])
			{
				int v0 = GetEdgeVertex(eI); 
				int v1 = GetEdgeVertex(GetEdgePair(eI)); 
				if (v0==vI0 && v1==vI1) return false;
				if (v1==vI0 && v0==vI1) return false;
			}
		} while (eI != loop[s]);
	}

	prev[0] = FindEdgePrev(loop[0]);
	prev[1] = FindEdgePrev(loop[1]);

	next[0] = GetEdgeNext(loop[0]);
	next[1] = GetEdgeNext(loop[1]);

	// do not collapse edge on isolated remaining triangle
	if (edgeCount0==2 && edgeCount1==2)
	{
		if (GetEdgeVertex(prev[0]) == 
			GetEdgeVertex(prev[1])) return false;
	}

	if (data)
	{
		data[0] = loop[0];
		data[1] = loop[1];
		data[2] = prev[0];
		data[3] = prev[1];
		data[4] = next[0];
		data[5] = next[1];
		data[6] = vI0;
		data[7] = vI1;
	}

	return true;
}

int HEMesh_Modeling::CollapseEdge (const int edge)
{
	assert (edge!=-1);

	int data[8];
	if (!IsCollapseableEdge (edge, data)) 
		return -1;

	int loop[2];
	int prev[2];
	int next[2];
	int vI0, vI1; 

	loop[0] = data[0];
	loop[1] = data[1];
	prev[0] = data[2];
	prev[1] = data[3];
	next[0] = data[4];
	next[1] = data[5];
	vI0     = data[6];
	vI1     = data[7];
	
	bool collapse[2] = {
		prev[0] == GetEdgeNext(next[0]),
		prev[1] == GetEdgeNext(next[1])};

	// move edges from v1 to v0
	int vEI = GetVertexEdge(vI1);
	do {
		SetEdgeVertex(vEI, vI0);
//ImGui::Text("  Set edge %i to vertex %i", vEI, vI0);
		vEI = GetEdgeNext(GetEdgePair(vEI));
	} while (vEI != GetVertexEdge(vI1));
	SetVertexEdge(vI1, -1);


	// update remaining vertex
	if (GetVertexEdge(vI0) == loop[0])
		SetVertexEdge(vI0, GetEdgePair(prev[0]));


	// update edges
	for (int s=0; s<2; s++)
	{
		if (collapse[s])
		{
			int vIp = GetEdgeVertex(prev[s]);
			if (GetVertexEdge(vIp)==prev[s]) 
				SetVertexEdge(vIp, GetEdgePair(next[s]));

			int vIn = GetEdgeVertex(next[s]);
			if (GetVertexEdge(vIn)==next[s]) 
				SetVertexEdge(vIn, GetEdgePair(prev[s]));

			int pair0 = GetEdgePair(prev[s]);
			int pair1 = GetEdgePair(next[s]);
			SetEdgePair(pair0, pair1);
			SetEdgePair(pair1, pair0);
			DisconnectEdge(prev[s]);
			DisconnectEdge(next[s]);
		}
		else SetEdgeNext(prev[s], next[s]);
	}


	// update polys
	for (int s=0; s<2; s++)
	{
		int pI = GetEdgePoly(loop[s]);
		if (pI==-1) continue;
		if (collapse[s])
			SetPolyEdge(pI,-1);
		else if (GetPolyEdge(pI) == loop[s])
			SetPolyEdge(pI, prev[s]);
	}

	DisconnectEdge(loop[0]);
	DisconnectEdge(loop[1]);

	//SetVertexPos(vI0, pos);
	return vI0;
}

(I had to split it into two functions for some reason, and i use the term 'Edge' everywhere, should have used 'Half' instead)

It's full of branches, indirection, random memory access - not the ideal case for a compute shader.

4 hours ago, MasterReDWinD said:

When you say precalculate the simplification you're referring to storing different LOD versions of the meshes and loading them on demand?

Either this (+ geomoprhing for continuous LOD if needed), or recording the changes happening from edge collpases. But because you support runtime changes of the terrain you can't do that, i guess.

4 hours ago, MasterReDWinD said:

Downscaling is what I had in mind as a starting point for the Geometry Clipmap route.  If I understand the downscaling idea correctly the simplest approach would be to sample my density grid at every 2nd point (on each axis) rather than every point, so in my case I would reduce the size of the mesh by a factor of 8 on the first past?  Then on the second pass I would sample every fourth density value.

Yeah, that's what i would do. It's easy and suits GPU implementation well (and i'm really the last who would say GPUs can do only simple brute force stuff.)

Do you already have some physics? Here a LOD approach might not work because abrupt changes on the mesh cause collision issues, and continuous changes would still add energy to the physics system. Having chunks of constant detail and downloading those nearby dynamic objects sounds more robust and practical. (I never heared about LOD for physics, although it would be an interesting topic.)

If so, all you would need the whole scene for would be path finding. And this alone would not justify the efford we discuss here IMO.

Pathfinding on GPU could be an option too. (Diffusion approach could work better here than the usual graph algorithms, depending on what you need.)

 

 

This topic is closed to new replies.

Advertisement