Jump to content

  • Log In with Google      Sign In   
  • Create Account


#Actualgboxentertainment

Posted 29 September 2012 - 09:57 PM

I am currently trying to construct a voxel octree on the gpu using directx11 compute shader. The only way I can think of how to do this in parallel is to take the entire data set, create 8 new nodes, check (in parallel) if every voxel in the set is less than/greater than each axis, and flag each new node if voxels exist there. So for instance:

struct Voxel
{
float3 pos;
float3 color;
};

struct OctreeNode

{
int voxBufIdx;
int parentIdx;
int childOffset;
int level;
int subdivided;
int terminal;
};

RWStructuredBuffer<OctreeNode> octIn : register(u0);
StructuredBuffer<Voxel> voxBuf : register(t0);



void createNodes(const uint voxIdx, float maxL, int level, uint baseOffset)
{

float L = maxL/pow(2, level);


//Flag subdivision requests

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+0].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+1].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+2].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+3].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+4].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+5].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+6].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+7].subdivided = 1;
}


[numthreads(1024, 1, 1)]
void CS_Octree(uint gtidx : SV_GroupIndex, uint3 blockIdx : SV_GroupID)
{

const uint threads = gtidx + blockIdx.x*1024 +blockIdx.y*1024*1024;
const uint nodes = threads;
const uint voxIdx = threads;

float maxL = 512;
int level = 0;
int nodeOff = 0;

OctreeNode root;
  root.voxBufIdx = 0;
  root.parentIdx = 0;
  root.childOffset = 1;
  root.level = 0;
  root.subdivided = 1;
  root.terminal = 0;
octIn[0] = root;
level += 1;[/indent]


if(octIn[0].subdivided == 1)
{
  nodeOff +=1;
  createNodes(voxIdx, maxL, level, nodeOff);
}
...
}


But now I've ran into a roadblock with a couple of problems:

First Problem

The compute shader does not allow recursive functions - so how would I subdivide each of the 8 nodes in level 1 in parallel?

So far, my idea is to count the number of nodes created in the current level (which is 8 for level 1) and add that to the node offset so that in my node buffer, I can add the new nodes to the end of the buffer.
i.e. after creating the 8 new nodes from the root, and checking that the first four nodes require subdivision, my buffer looks like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,...}

where index 0 is the root and index 1-8 are the 8 new nodes.

If every first four nodes (at level 2) within each of the four flagged nodes (at level 1) also require subdivision, then I would want my buffer to look like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,..}

where index 0 is the root, index 1-8 are the nodes at the first level, and index 9-40 are the child nodes of the first level that have been subdivided (4 nodes subdivided x 8 children = 32 nodes).

The challenge is understanding how to run the createNodes function 4 times in parallel. I am thinking that I can assign a thread per node to check if it has been flagged for subdivision:

if(nodes>8)
return;


if(octIn[1+nodes].subdivided == 1)
{
..//run createNodes function
}

but will this run the createNodes function 4 times in parallel? Would there be a way to use something like InterlockedAdd to do this?

Second Problem

Once I do find a solution to the first problem, there is the issue of reducing the set of voxels in each node so that only the voxels within each node are checked for the subdivision requests.

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)

otherwise, if at level 2, four nodes are checked for subdivision in parallel, then there would need to be 4 times the threads (where 1 thread is assigned per voxel) and the number of threads required increases by 2^Level for each Level.

The only way I can think of doing this is by sorting the voxBuf data after each subdivision, so that it resides in a section of memory that corresponds with the node index at each level. For example:

take 500,000 voxels in the voxel buffer.
After the first subdivision, the data would be sorted so that:
voxBuf[0~50,000] contains the voxels that lie in the first node.
voxBuf[50,001~100,000] contains the voxels that lie in the second node.
..
voxBuf[450,000~500,000] contains the voxels that lie in the eighth node.

The entire set would not be evenly divided by eight, so I would have to atomically count the number of voxels in each section of the buffer to determine the offset positions. Then I could use these offset positions to reduce the set of voxels at each level so that within each node, only those voxels will be checked for subdivision, instead of the entire set.

This would require some sort of sorting algorithm that can be done in parallel. I have heard that something like Bitonic Sorting would be good, which is essentially also a 'divide-and-conquer' methodology, similar to quadtrees. If I were to use Bitonic Sorting, I would have to run three passes for each x,y,z axis. This would essentially mean that this entire Octree Construction method requires several 'divide-and-conquer' methods within another 'divide-and-conquer' method at multiple layers.

Surely there is a more efficient and simpler way of constructing an octree in parallel on the gpu?
Can anybody please give me some advice or insights on what I am doing?

#2gboxentertainment

Posted 29 September 2012 - 09:54 PM

I am currently trying to construct a voxel octree on the gpu using directx11 compute shader. The only way I can think of how to do this in parallel is to take the entire data set, create 8 new nodes, check (in parallel) if every voxel in the set is less than/greater than each axis, and flag each new node if voxels exist there. So for instance:

struct Voxel
{
float3 pos;
float3 color;
};

struct OctreeNode

{
int voxBufIdx;
int parentIdx;
int childOffset;
int level;
int subdivided;
int terminal;
};

RWStructuredBuffer<OctreeNode> octIn : register(u0);
StructuredBuffer<Voxel> voxBuf : register(t0);



void createNodes(const uint voxIdx, float maxL, int level, uint baseOffset)
{

float L = maxL/pow(2, level);


//Flag subdivision requests

 if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+0].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+1].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+2].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+3].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+4].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+5].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+6].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+7].subdivided = 1;
}


[numthreads(1024, 1, 1)]
void CS_Octree(uint gtidx : SV_GroupIndex, uint3 blockIdx : SV_GroupID)
{

const uint threads = gtidx + blockIdx.x*1024 +blockIdx.y*1024*1024;
const uint nodes = threads;
const uint voxIdx = threads;

float maxL = 512;
int level = 0;
int nodeOff = 0;

OctreeNode root;
  root.voxBufIdx = 0;
  root.parentIdx = 0;
  root.childOffset = 1;
  root.level = 0;
  root.subdivided = 1;
  root.terminal = 0;
octIn[0] = root;
level += 1;[/indent]


if(octIn[0].subdivided == 1)
{
  nodeOff +=1;
  createNodes(voxIdx, maxL, level, nodeOff);
}
...
}


But now I've ran into a roadblock with a couple of problems:

First Problem

The compute shader does not allow recursive functions - so how would I subdivide each of the 8 nodes in level 1 in parallel?

So far, my idea is to count the number of nodes created in the current level (which is 8 for level 1) and add that to the node offset so that in my node buffer, I can add the new nodes to the end of the buffer.
i.e. after creating the 8 new nodes from the root, and checking that the first four nodes require subdivision, my buffer looks like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,...}

where index 0 is the root and index 1-8 are the 8 new nodes.

If every first four nodes (at level 2) within each of the four flagged nodes (at level 1) also require subdivision, then I would want my buffer to look like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,..}

where index 0 is the root, index 1-8 are the nodes at the first level, and index 9-40 are the child nodes of the first level that have been subdivided (4 nodes subdivided x 8 children = 32 nodes).

The challenge is understanding how to run the createNodes function 4 times in parallel. I am thinking that I can assign a thread per node to check if it has been flagged for subdivision:

if(nodes>8)
return;


if(octIn[1+nodes].subdivided == 1)
{
..//run createNodes function
}

but will this run the createNodes function 4 times in parallel? Would there be a way to use something like InterlockedAdd to do this?

Second Problem

Once I do find a solution to the first problem, there is the issue of reducing the set of voxels in each node so that only the voxels within each node are checked for the subdivision requests.

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)

otherwise, if at level 2, four nodes are checked for subdivision in parallel, then there would need to be 4 times the threads (where 1 thread is assigned per voxel) and the number of threads required increases by 2^Level for each Level.

The only way I can think of doing this is by sorting the voxBuf data after each subdivision, so that it resides in a section of memory that corresponds with the node index at each level. For example:

take 500,000 voxels in the voxel buffer.
After the first subdivision, the data would be sorted so that:
voxBuf[0~50,000] contains the voxels that lie in the first node.
voxBuf[50,001~100,000] contains the voxels that lie in the second node.
..
voxBuf[450,000~500,000] contains the voxels that lie in the eighth node.

The entire set would not be evenly divided by eight, so I would have to atomically count the number of voxels in each section of the buffer to determine the offset positions. Then I could use these offset positions to reduce the set of voxels at each level so that within each node, only those voxels will be checked for subdivision, instead of the entire set.

This would require some sort of sorting algorithm that can be done in parallel. I have heard that something like Bitonic Sorting would be good, which is essentially also a 'divide-and-conquer' methodology, similar to quadtrees. If I were to use Bitonic Sorting, I would have to run three passes for each x,y,z axis. This would essentially mean that this entire Octree Construction method requires several 'divide-and-conquer' methods within another 'divide-and-conquer' method at multiple layers.

Surely there is a more efficient and simpler way of constructing an octree in parallel on the gpu?
Can anybody please give me some advice or insights on what I am doing?

#1gboxentertainment

Posted 29 September 2012 - 09:53 PM

I am currently trying to construct a voxel octree on the gpu using directx11 compute shader. The only way I can think of how to do this in parallel is to take the entire data set, create 8 new nodes, check (in parallel) if every voxel in the set is less than/greater than each axis, and flag each new node if voxels exist there. So for instance:

struct Voxel
{
float3 pos;
float3 color;
};

struct OctreeNode

{
int voxBufIdx;
int parentIdx;
int childOffset;
int level;
int subdivided;
int terminal;
};

RWStructuredBuffer<OctreeNode> octIn : register(u0);
StructuredBuffer<Voxel> voxBuf : register(t0);



void createNodes(const uint voxIdx, float maxL, int level, uint baseOffset)
{
[indent=1]
float L = maxL/pow(2, level);[/indent]


[indent=1]//Flag subdivision requests[/indent]

[indent=1]  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+0].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+1].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+2].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+3].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+4].subdivided = 1;
  if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+5].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && voxBuf[voxIdx].pos.z>L)
   octIn[baseOffset+6].subdivided = 1;
  if(0<L<voxBuf[voxIdx].pos.x && voxBuf[voxIdx].pos.y>L && 0<voxBuf[voxIdx].pos.z<L)
   octIn[baseOffset+7].subdivided = 1;
}[/indent]


[numthreads(1024, 1, 1)]
void CS_Octree(uint gtidx : SV_GroupIndex, uint3 blockIdx : SV_GroupID)
{
[indent=1]
const uint threads = gtidx + blockIdx.x*1024 +blockIdx.y*1024*1024;[/indent]
[indent=1]const uint nodes = threads;
const uint voxIdx = threads;[/indent]

[indent=1]float maxL = 512;
int level = 0;
int nodeOff = 0;

OctreeNode root;
  root.voxBufIdx = 0;
  root.parentIdx = 0;
  root.childOffset = 1;
  root.level = 0;
  root.subdivided = 1;
  root.terminal = 0;
octIn[0] = root;
level += 1;[/indent]


[indent=1]if(octIn[0].subdivided == 1)
{
  nodeOff +=1;
  createNodes(voxIdx, maxL, level, nodeOff);
}
...[/indent]
}


But now I've ran into a roadblock with a couple of problems:

First Problem

The compute shader does not allow recursive functions - so how would I subdivide each of the 8 nodes in level 1 in parallel?

So far, my idea is to count the number of nodes created in the current level (which is 8 for level 1) and add that to the node offset so that in my node buffer, I can add the new nodes to the end of the buffer.
i.e. after creating the 8 new nodes from the root, and checking that the first four nodes require subdivision, my buffer looks like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,...}

where index 0 is the root and index 1-8 are the 8 new nodes.

If every first four nodes (at level 2) within each of the four flagged nodes (at level 1) also require subdivision, then I would want my buffer to look like this:

octIn.subdivision = {1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,..}

where index 0 is the root, index 1-8 are the nodes at the first level, and index 9-40 are the child nodes of the first level that have been subdivided (4 nodes subdivided x 8 children = 32 nodes).

The challenge is understanding how to run the createNodes function 4 times in parallel. I am thinking that I can assign a thread per node to check if it has been flagged for subdivision:

if(nodes>8)
return;


if(octIn[1+nodes].subdivided == 1)
{
..//run createNodes function
}

but will this run the createNodes function 4 times in parallel? Would there be a way to use something like InterlockedAdd to do this?

Second Problem

Once I do find a solution to the first problem, there is the issue of reducing the set of voxels in each node so that only the voxels within each node are checked for the subdivision requests.

if(voxBuf[voxIdx].pos.x<L && voxBuf[voxIdx].pos.y<L && 0<voxBuf[voxIdx].pos.z<L)

otherwise, if at level 2, four nodes are checked for subdivision in parallel, then there would need to be 4 times the threads (where 1 thread is assigned per voxel) and the number of threads required increases by 2^Level for each Level.

The only way I can think of doing this is by sorting the voxBuf data after each subdivision, so that it resides in a section of memory that corresponds with the node index at each level. For example:

take 500,000 voxels in the voxel buffer.
After the first subdivision, the data would be sorted so that:
voxBuf[0~50,000] contains the voxels that lie in the first node.
voxBuf[50,001~100,000] contains the voxels that lie in the second node.
..
voxBuf[450,000~500,000] contains the voxels that lie in the eighth node.

The entire set would not be evenly divided by eight, so I would have to atomically count the number of voxels in each section of the buffer to determine the offset positions. Then I could use these offset positions to reduce the set of voxels at each level so that within each node, only those voxels will be checked for subdivision, instead of the entire set.

This would require some sort of sorting algorithm that can be done in parallel. I have heard that something like Bitonic Sorting would be good, which is essentially also a 'divide-and-conquer' methodology, similar to quadtrees. If I were to use Bitonic Sorting, I would have to run three passes for each x,y,z axis. This would essentially mean that this entire Octree Construction method requires several 'divide-and-conquer' methods within another 'divide-and-conquer' method at multiple layers.

Surely there is a more efficient and simpler way of constructing an octree in parallel on the gpu?
Can anybody please give me some advice or insights on what I am doing?

PARTNERS