View more

View more

View more

Image of the Day Submit

IOTD | Top Screenshots

The latest, straight to your Inbox.

Subscribe to GameDev.net Direct to receive the latest updates and exclusive content.

Parallel Octree Construction on GPU

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

3 replies to this topic

#1gboxentertainment  Members

Posted 29 September 2012 - 09:53 PM

I am currently trying to construct a voxel octree on the gpu using directx11 compute shader. The only way I can think of how to do this in parallel is to take the entire data set, create 8 new nodes, check (in parallel) if every voxel in the set is less than/greater than each axis, and flag each new node if voxels exist there. So for instance:

struct Voxel
{
float3 pos;
float3 color;
};

struct OctreeNode

{
int voxBufIdx;
int parentIdx;
int childOffset;
int level;
int subdivided;
int terminal;
};

RWStructuredBuffer<OctreeNode> octIn : register(u0);
StructuredBuffer<Voxel> voxBuf : register(t0);

void createNodes(const uint voxIdx, float maxL, int level, uint baseOffset)
{

float L = maxL/pow(2, level);

//Flag subdivision requests

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+0].subdivided = 1;
if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+1].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+2].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+3].subdivided = 1;
if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+4].subdivided = 1;
if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+5].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
octIn[baseOffset+6].subdivided = 1;
if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
octIn[baseOffset+7].subdivided = 1;
}

void CS_Octree(uint gtidx : SV_GroupIndex, uint3 blockIdx : SV_GroupID)
{

const uint threads = gtidx + blockIdx.x*1024 +blockIdx.y*1024*1024;
const uint nodes = threads;
const uint voxIdx = threads;

float maxL = 512;
int level = 0;
int nodeOff = 0;

OctreeNode root;
root.voxBufIdx = 0;
root.parentIdx = 0;
root.childOffset = 1;
root.level = 0;
root.subdivided = 1;
root.terminal = 0;
octIn[0] = root;
level += 1;[/indent]

if(octIn[0].subdivided == 1)
{
nodeOff +=1;
createNodes(voxIdx, maxL, level, nodeOff);
}
...
}

But now I've ran into a roadblock with a couple of problems:

First Problem

The compute shader does not allow recursive functions - so how would I subdivide each of the 8 nodes in level 1 in parallel?

So far, my idea is to count the number of nodes created in the current level (which is 8 for level 1) and add that to the node offset so that in my node buffer, I can add the new nodes to the end of the buffer.
i.e. after creating the 8 new nodes from the root, and checking that the first four nodes require subdivision, my buffer looks like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,...}

where index 0 is the root and index 1-8 are the 8 new nodes.

If every first four nodes (at level 2) within each of the four flagged nodes (at level 1) also require subdivision, then I would want my buffer to look like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,..}

where index 0 is the root, index 1-8 are the nodes at the first level, and index 9-40 are the child nodes of the first level that have been subdivided (4 nodes subdivided x 8 children = 32 nodes).

The challenge is understanding how to run the createNodes function 4 times in parallel. I am thinking that I can assign a thread per node to check if it has been flagged for subdivision:

if(nodes>8)
return;

if(octIn[1+nodes].subdivided == 1)
{
..//run createNodes function
}

but will this run the createNodes function 4 times in parallel? Would there be a way to use something like InterlockedAdd to do this?

Second Problem

Once I do find a solution to the first problem, there is the issue of reducing the set of voxels in each node so that only the voxels within each node are checked for the subdivision requests.

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)

otherwise, if at level 2, four nodes are checked for subdivision in parallel, then there would need to be 4 times the threads (where 1 thread is assigned per voxel) and the number of threads required increases by 2^Level for each Level.

The only way I can think of doing this is by sorting the voxBuf data after each subdivision, so that it resides in a section of memory that corresponds with the node index at each level. For example:

take 500,000 voxels in the voxel buffer.
After the first subdivision, the data would be sorted so that:
voxBuf[0~50,000] contains the voxels that lie in the first node.
voxBuf[50,001~100,000] contains the voxels that lie in the second node.
..
voxBuf[450,000~500,000] contains the voxels that lie in the eighth node.

The entire set would not be evenly divided by eight, so I would have to atomically count the number of voxels in each section of the buffer to determine the offset positions. Then I could use these offset positions to reduce the set of voxels at each level so that within each node, only those voxels will be checked for subdivision, instead of the entire set.

This would require some sort of sorting algorithm that can be done in parallel. I have heard that something like Bitonic Sorting would be good, which is essentially also a 'divide-and-conquer' methodology, similar to quadtrees. If I were to use Bitonic Sorting, I would have to run three passes for each x,y,z axis. This would essentially mean that this entire Octree Construction method requires several 'divide-and-conquer' methods within another 'divide-and-conquer' method at multiple layers.

Surely there is a more efficient and simpler way of constructing an octree in parallel on the gpu?
Can anybody please give me some advice or insights on what I am doing?

Edited by gboxentertainment, 29 September 2012 - 09:57 PM.

#2Theodoros  Members

Posted 30 September 2012 - 03:08 AM

You usually convert the BVH construction problem to a sorting problem. This can be done with the use of morton codes, see for example this paper http://mgarland.org/...pers/gpubvh.pdf Bitonic sort and Radix sorts can be implemented very efficiently on modern GPUs,
see here for example for an OpenCL implementation: http://www.bealto.co...ting_intro.html

#3gboxentertainment  Members

Posted 07 October 2012 - 01:51 AM

Thanks for the sorting algorithm links.
I've run into another roadblock:

My octree construction requires that I check each parent node in parallel if it is flagged for subdivision. Then the createNodes function needs to be run in parallel based on these flags.
For instance, after my data is sorted, and the first four nodes are flagged for subdivision, I want createNodes to run for each one simultaneously so that:

Node 1 checks voxBuf[0~50,000] for which side of each axis it lies on.
Node 2 checks voxBuf[50,001~100,000] for which side of each axis it lies on.
..etc.

In order to do this, I will have to allocate the first 50,000 threads to node 1 and the next 50,000 to node 2 and so on...
Is it possible to allocate threads in this way? Does anyone know how to do this?

Edited by gboxentertainment, 07 October 2012 - 01:51 AM.

#4gboxentertainment  Members

Posted 04 November 2012 - 06:39 AM

Okay, so now I've decided to go with the Morton-Code sorting octree method.
I have interlaced all of my xyz coordinates into a single-dimensional list of morton codes. Then I have used the Compute Shader Bitonic Sort example to sort this list of morton codes in ascending order.
I can see that the way the Morton-Code is sorted from ascending order essentially represents an Octree. Does anyone know of any examples that implement the octree structure using Morton-Codes in parallel on the gpu?

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.