Speeding up voxelization from a high poly model

7 comments, last by dmatter 4 years, 7 months ago

Basically I want to make a set of voxels out of a high-poly model, however the process takes too long to finish.

Right now I just iterate over x, y, z and check whether any of the model's triangles are within the voxel at (x, y, z), and if so create one.

After 1.5 hours I got one y step done out of 100.

I had an idea to draw slices from bottom to top into a framebuffer, then read the pixels back and create voxels, but I'm wondering if anyone has other ideas for speeding the process up. (This GPU implementation could really fail at drawing the outline of a slice.)


On a GPU you don't need to render 1 slice at a time: you can compute the depth of each pixel and then write to the appropriate voxel cell (perhaps into a 3D texture that you can write to directly from the pixel shader). To avoid holes when rasterizing, you need to make sure that each triangle is rasterized from whichever of the 3 major axes results in the largest area after projection (in other words, rasterize it along the X, Y or Z direction the triangle is closest to facing). Alternatively, you can just rasterize everything 3 times to cover all axes.
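Picking the projection axis comes down to the largest component of the triangle's normal. A minimal sketch of that selection (names and the `Vec3` type are illustrative, not from any particular library):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 cross(const Vec3 &a, const Vec3 &b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Returns 0, 1 or 2 for the X, Y or Z axis the triangle most closely faces.
// Projecting along that axis yields the largest rasterized area, which is
// what avoids holes when voxelizing.
int dominantAxis(const Vec3 &v0, const Vec3 &v1, const Vec3 &v2) {
    const Vec3 e1 = { v1.x - v0.x, v1.y - v0.y, v1.z - v0.z };
    const Vec3 e2 = { v2.x - v0.x, v2.y - v0.y, v2.z - v0.z };
    const Vec3 n  = cross(e1, e2);
    const float ax = std::fabs(n.x), ay = std::fabs(n.y), az = std::fabs(n.z);
    if (ax >= ay && ax >= az) return 0;
    return (ay >= az) ? 1 : 2;
}
```

In a shader you would do the same comparison in a geometry (or mesh) shader and swizzle the output position accordingly before rasterization.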

What MJP described is essentially the standard GPU voxelization algorithm; chapter 22 of "OpenGL Insights", "Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer", covers all the fundamental theory and implementation. You may need conservative rasterization for a better voxelization result; whether you handcraft it in the geometry shader or use a hardware alternative depends on the specific API you use.

Since the classic GPU voxelization approach has been mentioned already I will throw in some other ideas which are also suitable for CPU.

As you mentioned a "high poly model" - one idea would be to treat your model like a point cloud and just flag the voxels at those points.

That only really works if your voxel density is low compared to your poly density, i.e. you want dense, tiny polygons compared to the size of your voxels, so you can be sure that you won't have any 'holes' in the centre of polygons from the lack of any true rasterization. One trick is to tessellate your polys to get denser points if necessary, or simply generate extra points over the face of each poly (e.g. emit a point at each vertex plus one at the centre of each triangle).
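The vertex-plus-centroid trick can be sketched in a few lines. This is a hypothetical helper, assuming a uniform grid with its origin at (0,0,0) and a cell size of `voxelSize`:

```cpp
#include <array>
#include <cmath>
#include <set>

struct Vec3 { float x, y, z; };
using VoxelCoord = std::array<int, 3>;

static VoxelCoord toVoxel(const Vec3 &p, float voxelSize) {
    return { (int)std::floor(p.x / voxelSize),
             (int)std::floor(p.y / voxelSize),
             (int)std::floor(p.z / voxelSize) };
}

// Point-cloud style seeding: flag the voxel under each vertex, plus one
// under the triangle's centroid to reduce holes in larger triangles.
void seedVoxels(const Vec3 &v0, const Vec3 &v1, const Vec3 &v2,
                float voxelSize, std::set<VoxelCoord> &occupied) {
    occupied.insert(toVoxel(v0, voxelSize));
    occupied.insert(toVoxel(v1, voxelSize));
    occupied.insert(toVoxel(v2, voxelSize));
    const Vec3 c = { (v0.x + v1.x + v2.x) / 3.0f,
                     (v0.y + v1.y + v2.y) / 3.0f,
                     (v0.z + v1.z + v2.z) / 3.0f };
    occupied.insert(toVoxel(c, voxelSize));
}
```

If triangles can still span several voxels you would recurse (subdivide the triangle and re-emit points) rather than rely on the single centroid.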

Properly rasterizing your polys into your voxel grid is the other way to go. At the moment it sounds like you are doing cube-vs-triangle intersection tests between *every* voxel and *every* polygon, which is probably why it's so slow. If instead you iterate over your polys and rasterize each one directly into the voxels it occupies, it is algorithmically much cheaper (roughly linear if all polys are about the same size as each other) and scales better for large numbers of polys and/or dense voxel grids.
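The key change is the loop order: per triangle, visit only the voxels under its bounding box. A minimal sketch, with the caveat that it flags every voxel overlapping the triangle's AABB (over-conservative near edges); a real implementation would run a triangle-box separating-axis test (e.g. Akenine-Moller's) on each candidate voxel:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <set>

struct Vec3 { float x, y, z; };
using VoxelCoord = std::array<int, 3>;

// Per-triangle rasterization: instead of testing the whole grid against
// every triangle, visit only voxels inside this triangle's bounding box.
void rasterizeTriangle(const Vec3 &v0, const Vec3 &v1, const Vec3 &v2,
                       float voxelSize, std::set<VoxelCoord> &occupied) {
    const int x0 = (int)std::floor(std::min({v0.x, v1.x, v2.x}) / voxelSize);
    const int y0 = (int)std::floor(std::min({v0.y, v1.y, v2.y}) / voxelSize);
    const int z0 = (int)std::floor(std::min({v0.z, v1.z, v2.z}) / voxelSize);
    const int x1 = (int)std::floor(std::max({v0.x, v1.x, v2.x}) / voxelSize);
    const int y1 = (int)std::floor(std::max({v0.y, v1.y, v2.y}) / voxelSize);
    const int z1 = (int)std::floor(std::max({v0.z, v1.z, v2.z}) / voxelSize);
    for (int z = z0; z <= z1; ++z)
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                occupied.insert({x, y, z});  // TODO: triangle-box test here
}
```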

Thirdly, even with your current approach of box-vs-triangle tests, you can likely speed things up a lot from where you are now. Hopefully you already have some early-out tests, e.g. if all the vertices of the triangle lie beyond one side of the box then it's a cheap fail. After that you could do a coarse-grained spatial partitioning of the polys first. The simple example is to split your model in half down the middle and voxelize each half into the corresponding half of the voxel grid: now you only have to test 50% of the polys per voxel, so you have doubled your performance. You don't have to stop there - split it into four and you quadruple your speed (minus some upfront cost to partition the model). If you're coding this as a CPU algorithm, then for another speed boost you could throw the voxelization of each partition onto a different thread and process them concurrently. Say you're on a quad-core machine with 2 hardware threads per core and at least 8 partitions; you might expect somewhere between a 4x and 8x speed increase from the concurrency alone.
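The split-and-thread idea can be sketched as below. This is a hypothetical two-partition version: triangles are bucketed by which side of `splitX` their centroid falls on, each partition is voxelized with a private result set (so no locking is needed), and the sets are merged at the end. The per-triangle work here just flags the voxel under the centroid, as a stand-in for a real box-vs-triangle rasterizer:

```cpp
#include <array>
#include <cmath>
#include <functional>
#include <set>
#include <thread>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v[3]; };
using VoxelCoord = std::array<int, 3>;

static VoxelCoord centroidVoxel(const Triangle &t, float voxelSize) {
    const float cx = (t.v[0].x + t.v[1].x + t.v[2].x) / 3.0f;
    const float cy = (t.v[0].y + t.v[1].y + t.v[2].y) / 3.0f;
    const float cz = (t.v[0].z + t.v[1].z + t.v[2].z) / 3.0f;
    return { (int)std::floor(cx / voxelSize),
             (int)std::floor(cy / voxelSize),
             (int)std::floor(cz / voxelSize) };
}

std::set<VoxelCoord> voxelizeConcurrent(const std::vector<Triangle> &tris,
                                        float splitX, float voxelSize) {
    // Partition triangles by centroid: left or right of the split plane.
    std::vector<Triangle> parts[2];
    for (const Triangle &t : tris) {
        const float cx = (t.v[0].x + t.v[1].x + t.v[2].x) / 3.0f;
        parts[cx < splitX ? 0 : 1].push_back(t);
    }
    std::set<VoxelCoord> results[2];
    auto work = [voxelSize](const std::vector<Triangle> &part,
                            std::set<VoxelCoord> &out) {
        for (const Triangle &t : part)
            out.insert(centroidVoxel(t, voxelSize));
    };
    std::thread worker(work, std::cref(parts[0]), std::ref(results[0]));
    work(parts[1], results[1]);  // voxelize the other half on this thread
    worker.join();
    results[0].insert(results[1].begin(), results[1].end());
    return results[0];
}
```

With more partitions you would spin up one task per partition and feed them through a thread pool instead of raw `std::thread`s.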

I'd like to say a few things, since I've recently written a voxelizer in compute and done a fair amount of research:

1. GPU rasterization to voxelize high-poly meshes is a bad idea. GPUs are already bad at rasterizing tiny triangles, and this gets further aggravated by the fact that this approach requires interlocked operations, and the high density of vertices means there is a lot of contention from multiple threads trying to write into the same voxel block.
Some papers mention that their implementations see a drastic time increase with poly count due to contention. Triangle density per voxel also plays a big role: it's not the same to have a mesh where each voxel touches one or two triangles as a mesh with a single voxel that has 600 triangles going through it.

Another problem, which all but a few papers fail to mention (probably out of ignorance), is that unless the voxelization process is very simple, you need to blend your results; and there is no "interlocked average" instruction.

Therefore implementations perform a mutex-like locking of a voxel. This is a problem because such approaches can result in an infinite loop: half a warp acquires the lock while another warp (or warps) acquires the other half, so they fight forever over the locks.

Implementations that fail to account for this will end in a TDR, which is not immediately obvious unless you're working with high-poly meshes, which is where contention happens and the infinite-loop cases appear.

Implementations that successfully account for this add a 'bail out' counter: if the mutex acquisition takes more than N spins, give up. This means the voxelization process may not be accurate, and worse, it may not even be deterministic. But at least a TDR won't happen.
You could append those failure cases into a list and process them at the end serially though.

The only way to implement this properly is with the Independent Thread Scheduling introduced with Volta, which is only supported by NVIDIA GPUs (at the time of writing).

This problem may not apply to you though, if you don't need any complex per voxel average/mutex. If a simple interlocked operation (like atomic addition) is enough, then ignore this drawback.

You can avoid the "atomic blend" problem if your 3D texture is in float format and you track the accumulated weights in another 3D texture, but this consumes a ton of memory. The "atomic blend" problem exists because of memory restrictions in the first place: we want to blend into an RGBA8 texture or similar precision.
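The float-accumulation scheme amounts to splatting weighted sums and normalizing once at the end, so no read-modify-write averaging is needed during splatting. A minimal CPU sketch of the idea (a GPU version would use plain atomic float adds; the type and names here are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Weighted accumulation into float storage plus a separate weight buffer.
// splat() only ever adds, so it maps to simple atomic adds on a GPU;
// resolve() divides once at the end to produce the blended average.
struct FloatVoxelGrid {
    std::vector<float> rgb;     // 3 floats per voxel, weighted sums
    std::vector<float> weight;  // 1 float per voxel, sum of weights
    explicit FloatVoxelGrid(std::size_t voxels)
        : rgb(voxels * 3, 0.0f), weight(voxels, 0.0f) {}

    void splat(std::size_t voxel, float r, float g, float b, float w) {
        rgb[voxel * 3 + 0] += r * w;
        rgb[voxel * 3 + 1] += g * w;
        rgb[voxel * 3 + 2] += b * w;
        weight[voxel]      += w;
    }

    void resolve() {  // single pass after all splats are done
        for (std::size_t i = 0; i < weight.size(); ++i)
            if (weight[i] > 0.0f)
                for (int c = 0; c < 3; ++c)
                    rgb[i * 3 + c] /= weight[i];
    }
};
```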

 

2. That leaves the opposite approach: have each thread perform a box-vs-triangle test against all primitives.
A brute-force approach like that is super slow even for a GPU, much worse than doing GPU rasterization.

However it can be greatly improved using hierarchical culling: partition the mesh into smaller submeshes, calculate each one's AABB, and then skip all of its triangles at once with an AABB-vs-AABB test.
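The rejection test itself is just three interval-overlap checks. A minimal sketch (types are illustrative):

```cpp
struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

// Coarse rejection for the hierarchy: if a submesh's AABB does not overlap
// the AABB of the voxel region being processed, every triangle in that
// submesh can be skipped in one go.
bool aabbOverlaps(const Aabb &a, const Aabb &b) {
    return a.min.x <= b.max.x && b.min.x <= a.max.x &&
           a.min.y <= b.max.y && b.min.y <= a.max.y &&
           a.min.z <= b.max.z && b.min.z <= a.max.z;
}
```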

The compute approach can be further improved by having each thread in a warp load a different triangle, and using anyInvocationARB to test whether any of the 64 triangles intersects the AABB that encloses all the voxels processed by the warp.
If you're lost about this, I explain this optimization in a Stack Overflow reply.
While the theoretical performance improvement is up to 64x, in practice this optimization has yielded us gains anywhere between 3x-32x depending on the scene involved (often between 3x-4x).

This is what I ended up implementing for Ogre 2.2; you're welcome to try our Test_Voxelizer.exe sample (build Ogre 2.2 using the Quick Start script). Find a way to load your mesh data as an Ogre mesh, modify the sample to load that mesh of yours, and time how long it takes. That way you can easily test whether this approach is worth pursuing or not.
If it's not, then go back to the thinktank for something else.

Note that you should test different values of indexCountSplit in 'mVoxelizer->addItem( *itor++, false, indexCountSplit );', as that value controls how big each partition is, and this can have a huge impact on voxelization performance. There is no 'right' global value, as the best value depends on how your mesh's vertex data is laid out in memory and how much space each partition ends up covering.

Good luck
Cheers

Maybe subdivide your triangles a few times, then see where all of the vertices lie (which voxel box they're in).

After you do that, you should have a shell. Filling in the shell shouldn't be too difficult or slow.
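Filling the shell is usually done by flood-filling the outside air and marking everything the fill never reaches as solid. A minimal sketch, assuming a cubic grid stored as a flat vector (index `x + y*dim + z*dim*dim`), a watertight shell, and a shell that does not touch the grid border:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Flood-fill "outside" air from a border voxel with a 6-connected BFS,
// then mark every voxel the fill never reached as solid. Interior voxels
// enclosed by the shell are unreachable from the border, so they get filled.
void fillShell(std::vector<std::uint8_t> &solid, int dim) {
    std::vector<std::uint8_t> outside(solid.size(), 0);
    std::queue<std::array<int, 3>> q;
    q.push({0, 0, 0});  // assumes the corner voxel is empty
    outside[0] = 1;
    const int dx[6] = {1,-1,0,0,0,0};
    const int dy[6] = {0,0,1,-1,0,0};
    const int dz[6] = {0,0,0,0,1,-1};
    while (!q.empty()) {
        const std::array<int, 3> p = q.front(); q.pop();
        for (int i = 0; i < 6; ++i) {
            const int nx = p[0]+dx[i], ny = p[1]+dy[i], nz = p[2]+dz[i];
            if (nx < 0 || ny < 0 || nz < 0 ||
                nx >= dim || ny >= dim || nz >= dim)
                continue;
            const std::size_t idx = nx + ny * (std::size_t)dim
                                       + nz * (std::size_t)dim * dim;
            if (solid[idx] || outside[idx]) continue;
            outside[idx] = 1;
            q.push({nx, ny, nz});
        }
    }
    for (std::size_t i = 0; i < solid.size(); ++i)
        if (!outside[i]) solid[i] = 1;  // shell stays solid, interior fills
}
```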

What is the technique that de-triangulates meshes called? Can't find anything relevant

6 hours ago, _WeirdCat_ said:

What is the technique that de-triangulates meshes called? Can't find anything relevant

I believe you are referring to mesh "simplification" (also called "decimation"), the process of reducing the level of detail in a mesh's topology.

This topic is closed to new replies.
