Making an octree with OpenCL, thread-safety issues

12 comments, last by spek 11 years, 4 months ago
A bit like "GigaVoxels", I'm trying to generate an octree on the fly with an OpenCL compute kernel. The (incomplete) code below should do that, BUT... obviously it's not thread-safe yet. So I'm asking what I should do to make it safe, yet still reasonably fast.

What the code does is read 64x64x64 pixels from a 3D texture. If a pixel contains geometry data, it inserts a leaf in the octree. Since the octree isn't filled initially (other than that I reserved plenty of space in a VBO), the octree gets constructed along the way:
1- Fetch a pixel, test if it should be inserted
2- Dive into the octree, starting with the root node. Get the leaf node at level 8
3- If a node hasn't been divided into 8 subnodes yet, get an unused pointer from the VBO and reserve 8 nodes there. Let the parent node refer to the first child node.

The main problem is that multiple pixels may try to split octree nodes at the same time. I've seen barriers and such in the OpenCL reference, and I guess I should create semaphores around the "if ( node->child1Ptr == 0 ) { ... }" section in the code below, but wouldn't that make the code terribly slow? Besides, and this may be something that can't be fixed: creating a semaphore requires support for the atom_xchg operation afaik, and it seems my video card is too old for that.

__kernel void voxelOctree_Build( __global struct TOctreeNode* octree,
                                 __global int* octreeNodeCounter,
                                 __read_only image3d_t dataTex )
{
    // We loop through all the pixels of a (64^3) 3D texture and inject each
    // pixel into the octree, eventually creating new subnodes along the way
    int3 locID    = (int3)( get_local_id(0), get_local_id(1), get_local_id(2) );
    int4 texCoord = (int4)( locID.x, locID.y, locID.z, 0 );
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP;
    float4 data   = read_imagef( dataTex, sampler, texCoord );

    if ( data.w > 0.f ) // Only insert if there is some occlusion at this point
    {
        // Get root node
        __global struct TOctreeNode *node = &octree[0];
        // Dive into the octree, splitting nodes along the way if not done before
        int level = 0;
        while ( level++ < 8 )
        {
            // Create 8 sub nodes, if not done before
            // !! NOT THREAD-SAFE: two threads can both pass this test !!
            if ( node->child1Ptr == 0 )
            {
                // Get an unused index from the octree VBO
                int indx1 = *octreeNodeCounter;
                node->child1Ptr = (int)&octree[ indx1 ]; // Store pointer to 1st child node
                indx1 += 8;                 // Reserve 8 nodes
                *octreeNodeCounter = indx1; // Write back to the global counter
            }
            // Check in which of the 8 subnodes we need to dive
            int localOffset = 0;
            // ...generate a number between 0..7
            // Get node from deeper level, then repeat the loop
            node = ((__global struct TOctreeNode*)node->child1Ptr) + localOffset;
        } // while level

        // Now insert the actual data
        // ...
    } // occlusion test
} // voxelOctree_Build

Dunno how exactly they do it in GigaVoxels, but I could also pre-generate the octree once, entirely "unfolded". That means I wouldn't have to split nodes, making the code faster. However, that would take a lot more memory of course.

Greets,
Rick
You actually answered it yourself - you need to make the code thread-safe (the same way you'd do it on a CPU). You *should* use atomics for this. I'm not sure, but I think every GPU supporting OpenCL also supports cl_khr_global_int32_base_atomics (in 1.1, I think, where it became core). I don't know which GPU you have, but I'd try updating the drivers & SDK for it. As you stated, it will make the code slower (and there probably isn't any other solution that would be effective enough, either). The question is whether it will be faster than doing it on the CPU (with heavily optimized algorithms)?

Implementing e.g. a mutex (which might work for what you need) with atomics is then really straightforward and easy stuff... you could actually use a construction like this:

while ( atom_cmpxchg( mutex, 0, 1 ) != 0 ); // Acquire: spin until we flip the mutex from 0 (free) to 1 (locked)
/** Modify data in critical section **/
atom_xchg( mutex, 0 ); // Release mutex

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Ha Vilem, everything ok?

I just tried to implement a semaphore with "atom_cmpxchg" and... the code didn't crash. I couldn't verify it actually worked yet, but maybe it works after all. Yet I'm pretty sure I had problems with atomic operations half a year ago, though it might have been "atom_max" or "atom_add" I tried instead. The GPU is an nVidia 9800M from 2008 btw.

Yeah, it seems we won't utilize much of the multi-threading if the code is locked all the time... Although after a couple of insertions the octree has already been built partially, requiring fewer updates. Maybe it helps if I create a semaphore for each octree level, though I'd have to make multiple global counter variables in that case as well. Then the chance that one thread waits for another is slightly smaller.

Yet the octree has to be built in GPU memory anyway, so this might still be the fastest way... although OpenCL also allows the CPU to compute the octree and then transfer it to video memory. Not sure if the transfer would make things too slow. Either way, when reading about GigaVoxels and Voxel Cone Tracing, it seems they all do it 100% on the GPU. The major difference is (probably) that they only build the octree once, and update it locally when dynamic stuff moves.

But in my case I'm dealing with a roaming world. Not sure how UDK4 does it with large worlds, but in my case there are no clear boundaries for the octree. When you move into another room, the octree should move along with the camera; either constantly, or at least when I switch rooms. My octree resolution is likely lower than in those GigaVoxels examples, plus the voxels to inject are pre-calculated already, so that should save some time when rebuilding the octree. But I just wonder whether GigaVoxels isn't doing things entirely differently to make octree builds faster.

As always, thanks for helping!
Instead of making your code thread-safe (CPU view), you can make the code parallel (GPU view). Think of calculating the mipmaps of a texture: you render layer by layer, and each rendered texel only considers four texels of the previous layer.

In this spirit I would approach your problem like this:
1. leaf generation: check your texture and generate a new leaf (= write texel)
2. generate the parent nodes of layer 1
..
n. generate the root from the previous layer, which contains only 8 potential nodes.

No locking necessary. If we really want to utilize the GPU, we need to think in GPU programming terms and move away from CPU thinking.
Hmm, got to reset my brain for that. Putting leafs in a list wouldn't be difficult, but how to work upwards then? Got to mention the storage is "sparse" to make optimal use of the memory (otherwise the octrees could grow HUGE). At least, that's how they did it with GigaVoxels, and I assume they thought very well about it ;) So I can't put them in a 2D or 3D uniform grid like you could when filling a texture. An extra problem is that different pixels I insert could share the same leaf, in case they are close enough. In that case they need to be merged somehow. When traversing the octree from top to bottom, you automatically see whether a leaf was generated before or not.

When generating 1 level upwards (or a higher MipMap level), each parent needs to know which 8 children are beneath it, but there is no easy way to sort that out since the leafs are scattered "randomly" in the linear memory buffer. It would need to search the whole buffer == inefficient.

I still could unfold the entire octree like a mipmapped 3D texture and reserve the memory required for a worst-case scenario (all leafs filled). This gives each node a fixed place in memory, and we only have to construct an empty octree once to get the pointers (they use pointers instead of array indices to make read/write access faster): http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
However, a somewhat big and/or detailed octree quickly consumes hundreds of megabytes... But maybe they simply reserved all that space in the GigaVoxels approach anyway, don't know...
You can still use a sparse octree; the trick is to change the process, not the final product. For one, you can add a compression step after creating your octree to reduce the memory footprint. You could even do this per tile (subdivide your world and process it tile by tile, compressing each tile, then in a final step add the last nodes and roots).


[quote]
When generating 1 level upwards (or a higher MipMap level), each parent needs to know which 8 children are beneath it, but there is no easy way to sort that out since the leafs are scattered "randomly" in the linear memory buffer. It would need to search the whole buffer == inefficient.
[/quote]

While creating your tree you should use some kind of easy identification, i.e. a hash. When thinking of mipmaps, each group of four texels at each layer has a unique position you can easily determine. Best to separate the processing step (speed) from the compression step (memory).
Got to reset my brain for this... Problem is that I have to rebuild the octree each cycle, or at least when moving into a new room. The game world is way bigger than any octree could possibly cover, so instead the octree has to move along with the camera (requiring a rebuild as it moves). I don't know how UDK4 (which also implemented Voxel Cone Tracing) deals with large worlds, but when rebuilding all the time, I still need a large amount of memory available.

I must say I didn't test the performance yet, though. Maybe I'm concerned about nothing, since my octrees don't have that many levels. So maybe even with the semaphores the rebuild impact might be acceptable.

Otherwise, I could try something as in the picture. As you say, I reserve a fixed amount of memory, but the lowest (most detailed) levels don't grow anymore. Distant geometry leafs are inserted at a higher level, saving a lot of nodes / megabytes in the end. With that pre-defined structure I don't have to split nodes anymore, as we already know where the possible child nodes are stored (like you mentioned with mipmapping). To speed up octree traversal, I can still use pointers instead of array indexing.

I still wonder if the GigaVoxels dudes did it differently. Usually when I try to reinvent the wheel with my own "smart" tricks, subtle but important details get missed, leading to disasters hehe.

[quote]
I still wonder if the GigaVoxels dudes did it differently.
[/quote]

so go read http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf to see how they did it.

In short:
1. voxelize your object <- you have that already, they use a voxel-list instead of a 3d texture, but that should be independent of the tree construction.
2. build your octree top-down by:
a) for each voxel (!) traverse the octree (as far as you have built it) and mark the node you end up in as needing to be split.
b) for each node in the current octree level, check if it was marked for splitting and do the actual splitting if necessary.
c) repeat until all levels are built.
3. insert your voxel values into the leaf nodes and fill in the other levels if needed.
I tried to read the papers, although I haven't seen that one yet so thanks! But usually I get dizzy after too many lines of tech-talk and jargon :P I need to convert algorithms into monkey logic bit by bit.

2a .. 2b seems to be the major difference then. Rather than locking & halting threads, the thread just marks and stops. I wonder, wouldn't they already perform a couple of splits at the start? I mean, it would be a waste of time to run all x-thousand voxels just to see if the top levels need to be split. Anyway, it makes sense. As for step 3, I guess they still use atomic operations to create an average or max value for the leafs?


Since you understood the papers, I'd like to make some more use of your services hehe.
* Another thing I wonder is what they do with those "bricks". They insert 2x2x2 (or 3x3x3) pixels into a 3D texture, which includes the neighbor voxels' values/colors. That is useful for smoother results with tri-linear filtering later on. But where does the compute shader get those neighbor values from? In my case I could read all 27 pixels around a voxel (stored as 1 pixel each), but that sounds like quite an expensive job.

* Moreover, since the raycasting technique later on involves directions, I guess they either insert values into 6 3D textures (an "ambient cube"), or at least 3 in case they compress them with Spherical Harmonics.

* Which leads to my last question... When sampling from higher levels as the raycast forms a "cone", it seems they don't sample bricks from a 3D texture, but directly a color stored in the nodes themselves. AFAIK there is only 1 RGBA8 value in those node structs that could store a color, though. Probably I'm missing something or just wrong, but that would mean tri-linear filtering is gone, as well as information about which direction the light came from (globally). Or do the higher levels also just generate bricks in a 3D texture? If so, again I wonder where they get the neighbor data from at that point.

Thanks!

[quote]
2a .. 2b seems to be the major difference then. Rather than locking & halting threads, the thread just marks and stops. I wonder, wouldn't they already perform a couple of splits at the start? I mean, it would be a waste of time to run all x-thousand voxels just to see if the top levels need to be split. Anyway, it makes sense. As for step 3, I guess they still use atomic operations to create an average or max value for the leafs?
[/quote]

Yep.


[quote]
Since you understood the papers, I'd like to make some more use of your services hehe.
[/quote]
Heh, I'll try, but there's a lot I'm not sure about myself.


[quote]
* Another thing I wonder is what they do with those "bricks". They insert 2x2x2 (or 3x3x3) pixels into a 3D texture, which includes the neighbor voxels' values/colors. That is useful for smoother results with tri-linear filtering later on. But where does the compute shader get those neighbor values from? In my case I could read all 27 pixels around a voxel (stored as 1 pixel each), but that sounds like quite an expensive job.
[/quote]
Let's first consider the Octree without "bricks", where each node stores one child-pointer/index (the other children being fixed offsets from that) and 1 color value. If there is one voxel per leaf node and if voxels lie on node centers things are dead simple, but interpolation is difficult because we need access to neighbouring nodes. Now my understanding of the next part: Shift the nodes/voxels, so that voxel centers lie on node corners, so we get (up to) 8 voxels contributing to a node, which can be averaged or interpolated inside the node. Instead of storing a single color store an index to a 3x3x3 brick and store interpolated color values in each pixel of the brick, so that corners store the original voxel values, the center pixel is the average and the other border samples are simple averages of the corners. That way, during sampling we should get smooth trilinear interpolation, because values are repeated across borders. Does that make sense? I think there's multiple sampling schemes suggested between the various papers on the topic, which makes this part quite confusing.


[quote]
* Moreover, since the raycasting technique later on involves directions, I guess they either insert values into 6 3D textures (an "ambient cube"), or at least 3 in case they compress them with Spherical Harmonics.
[/quote]
I think there's an isotropic implementation, using only 1 3D texture and the full implementation, which indeed stores 6 3D textures for non-leaf nodes. I guess the problem gets worse if you want to store more than color information in your bricks.


[quote]
* Which leads to my last question... When sampling from higher levels as the raycast forms a "cone", it seems they don't sample bricks from a 3D texture, but directly a color stored in the nodes themselves. AFAIK there is only 1 RGBA8 value in those node structs that could store a color, though. Probably I'm missing something or just wrong, but that would mean tri-linear filtering is gone, as well as information about which direction the light came from (globally). Or do the higher levels also just generate bricks in a 3D texture? If so, again I wonder where they get the neighbor data from at that point.
[/quote]
The 1 RGBA value per node is only for interior/constant nodes iirc, where the color is actually the same for the whole voxel, other nodes store brick pointers at all levels.


[quote]
Thanks!
[/quote]
Glad to help, certainly won't hurt once I get around to implementing it myself.

