Jump to content
  • Advertisement
Sign in to follow this  

Parallel Octree Construction on GPU

This topic is 2204 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I am currently trying to construct a voxel octree on the gpu using directx11 compute shader. The only way I can think of how to do this in parallel is to take the entire data set, create 8 new nodes, check (in parallel) if every voxel in the set is less than/greater than each axis, and flag each new node if voxels exist there. So for instance:

struct Voxel
float3 pos;
float3 color;

struct OctreeNode

int voxBufIdx;
int parentIdx;
int childOffset;
int level;
int subdivided;
int terminal;

RWStructuredBuffer<OctreeNode> octIn : register(u0);
StructuredBuffer<Voxel> voxBuf : register(t0);

void createNodes(const uint voxIdx, float maxL, int level, uint baseOffset)

float L = maxL/pow(2, level);

//Flag subdivision requests

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+0].subdivided = 1;
if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+1].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+2].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+3].subdivided = 1;
if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+4].subdivided = 1;
if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+5].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+6].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+7].subdivided = 1;

[numthreads(1024, 1, 1)]
void CS_Octree(uint gtidx : SV_GroupIndex, uint3 blockIdx : SV_GroupID)

const uint threads = gtidx + blockIdx.x*1024 +blockIdx.y*1024*1024;
const uint nodes = threads;
const uint voxIdx = threads;

float maxL = 512;
int level = 0;
int nodeOff = 0;

OctreeNode root;
root.voxBufIdx = 0;
root.parentIdx = 0;
root.childOffset = 1;
root.level = 0;
root.subdivided = 1;
root.terminal = 0;
octIn[0] = root;
level += 1;

if(octIn[0].subdivided == 1)
nodeOff +=1;
createNodes(voxIdx, maxL, level, nodeOff);

But now I've ran into a roadblock with a couple of problems:

First Problem

The compute shader does not allow recursive functions - so how would I subdivide each of the 8 nodes in level 1 in parallel?

So far, my idea is to count the number of nodes created in the current level (which is 8 for level 1) and add that to the node offset so that in my node buffer, I can add the new nodes to the end of the buffer.
i.e. after creating the 8 new nodes from the root, and checking that the first four nodes require subdivision, my buffer looks like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,...}

where index 0 is the root and index 1-8 are the 8 new nodes.

If every first four nodes (at level 2) within each of the four flagged nodes (at level 1) also require subdivision, then I would want my buffer to look like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,..}

where index 0 is the root, index 1-8 are the nodes at the first level, and index 9-40 are the child nodes of the first level that have been subdivided (4 nodes subdivided x 8 children = 32 nodes).

The challenge is understanding how to run the createNodes function 4 times in parallel. I am thinking that I can assign a thread per node to check if it has been flagged for subdivision:


if(octIn[1+nodes].subdivided == 1)
..//run createNodes function

but will this run the createNodes function 4 times in parallel? Would there be a way to use something like InterlockedAdd to do this?

Second Problem

Once I do find a solution to the first problem, there is the issue of reducing the set of voxels in each node so that only the voxels within each node are checked for the subdivision requests.

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)

otherwise, if at level 2, four nodes are checked for subdivision in parallel, then there would need to be 4 times the threads (where 1 thread is assigned per voxel) and the number of threads required increases by 2^Level for each Level.

The only way I can think of doing this is by sorting the voxBuf data after each subdivision, so that it resides in a section of memory that corresponds with the node index at each level. For example:

take 500,000 voxels in the voxel buffer.
After the first subdivision, the data would be sorted so that:
voxBuf[0~50,000] contains the voxels that lie in the first node.
voxBuf[50,001~100,000] contains the voxels that lie in the second node.
voxBuf[450,000~500,000] contains the voxels that lie in the eighth node.

The entire set would not be evenly divided by eight, so I would have to atomically count the number of voxels in each section of the buffer to determine the offset positions. Then I could use these offset positions to reduce the set of voxels at each level so that within each node, only those voxels will be checked for subdivision, instead of the entire set.

This would require some sort of sorting algorithm that can be done in parallel. I have heard that something like Bitonic Sorting would be good, which is essentially also a 'divide-and-conquer' methodology, similar to quadtrees. If I were to use Bitonic Sorting, I would have to run three passes for each x,y,z axis. This would essentially mean that this entire Octree Construction method requires several 'divide-and-conquer' methods within another 'divide-and-conquer' method at multiple layers.

Surely there is a more efficient and simpler way of constructing an octree in parallel on the gpu?
Can anybody please give me some advice or insights on what I am doing? Edited by gboxentertainment

Share this post

Link to post
Share on other sites
Thanks for the sorting algorithm links.
I've run into another roadblock:

My octree construction requires that I check each parent node in parallel if it is flagged for subdivision. Then the createNodes function needs to be run in parallel based on these flags.
For instance, after my data is sorted, and the first four nodes are flagged for subdivision, I want createNodes to run for each one simultaneously so that:

Node 1 checks voxBuf[0~50,000] for which side of each axis it lies on.
Node 2 checks voxBuf[50,001~100,000] for which side of each axis it lies on.

In order to do this, I will have to allocate the first 50,000 threads to node 1 and the next 50,000 to node 2 and so on...
Is it possible to allocate threads in this way? Does anyone know how to do this? Edited by gboxentertainment

Share this post

Link to post
Share on other sites
Okay, so now I've decided to go with the Morton-Code sorting octree method.
I have interlaced all of my xyz coordinates into a single-dimensional list of morton codes. Then I have used the Compute Shader Bitonic Sort example to sort this list of morton codes in ascending order.
I can see that the way the Morton-Code is sorted from ascending order essentially represents an Octree. Does anyone know of any examples that implement the octree structure using Morton-Codes in parallel on the gpu?

Share this post

Link to post
Share on other sites
Sign in to follow this  

  • Advertisement

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!