
# Making an octree with OpenCL, thread-safe issues


13 replies to this topic

### #1spek  Prime Members   -  Reputation: 996


Posted 14 November 2012 - 08:26 AM

A bit like "GigaVoxels", I'm trying to generate an octree on the fly using an (OpenCL) compute shader. The (incomplete) code below should do that, BUT... obviously it isn't thread-safe yet. So, I'm asking what I should do to make it safe, yet still somewhat fast.

What the code does is read 64x64x64 pixels from a 3D texture. If those pixels contain geometry data, it inserts a leaf into the octree. Since the octree isn't filled initially (other than that I reserved plenty of space in a VBO), the octree gets constructed along the way:
1- Fetch a pixel and test whether it should be inserted.
2- Dive into the octree, starting with the root node, to get the leaf node at level 8.
3- If a node hasn't been divided into 8 subnodes yet, get an unused pointer from the VBO and reserve 8 nodes there. Let the parent node refer to the first child node.
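The "dive into the octree" part of the steps above comes down to picking one of 8 children at each level, straight from the voxel coordinates. A minimal CPU-side sketch (hypothetical helper name, assuming a 2^depth grid where one coordinate bit per level selects the x/y/z half, packed as zyx):

```c
/* Which of the 8 children of the current node a voxel falls into at a
 * given level. Level 0 selects among the root's children. */
static int octant_index(int x, int y, int z, int level, int depth)
{
    int shift = depth - 1 - level;       /* most significant bit first */
    int bx = (x >> shift) & 1;
    int by = (y >> shift) & 1;
    int bz = (z >> shift) & 1;
    return (bz << 2) | (by << 1) | bx;   /* 0..7 */
}
```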

The main problem is that multiple pixels may try to split octree nodes at the same time. I've seen barriers and such in the OpenCL reference; I guess I should create a semaphore around the "if ( node->child1Ptr == 0 ) { ... }" section in the code below, but wouldn't that make the code terribly slow? Aside from that (and this is something that maybe can't be fixed), creating a semaphore requires support for the atom_xchg operation afaik. It seems my videocard is too old for that.
__kernel void voxelOctree_Build( __read_only image3d_t dataTex,
                                 __global struct TOctreeNode* octree,
                                 __global int* octreeNodeCounter )
{
    // We loop through all the pixels of a (64^3) 3D texture.
    // We inject each pixel into the octree, eventually creating new
    // subnodes along the way.
    int3 locID    = (int3)( get_local_id(0), get_local_id(1), get_local_id(2) );
    int4 texCoord = (int4)( locID.x, locID.y, locID.z, 0 );
    sampler_t sampler = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP;
    float4 data = read_imagef( dataTex, sampler, texCoord );

    if ( data.w > 0.f ) // Only insert if there is some occlusion at this point
    {
        // Get root node
        __global struct TOctreeNode *node = &octree[0];
        // Dive into the octree, splitting nodes if not done before
        int level = 0;
        while ( level++ < 8 )
        {
            // Create 8 sub nodes, if not done before
            // NOT THREAD-SAFE: multiple threads may enter this block at once
            if ( node->child1Ptr == 0 )
            {
                // Get an unused index from the octree VBO
                int indx1 = *octreeNodeCounter;
                node->child1Ptr = (int)(&octree[ indx1 ]); // Store pointer to 1st child node
                indx1 += 8;                 // Reserve 8 nodes
                *octreeNodeCounter = indx1; // Write back to the global counter
            }
            // Check in which of the 8 subnodes we need to dive
            int localOffset = 0;
            ...generate a number between 0..7
            // Get node from deeper level, then repeat the loop
            node = ((__global struct TOctreeNode*)node->child1Ptr) + localOffset;
        } // while level < 8

        // Now insert the actual data
        ...
    } // data.w > 0
} // voxelOctree_Build

Dunno how exactly they do it in GigaVoxels, but I could also pre-generate the octree once, entirely "unfolded". That means I don't have to split nodes, making the code faster. However, that will of course take a lot more memory.

Greets,
Rick

Edited by spek, 14 November 2012 - 08:36 AM.

### #2Vilem Otte  Crossbones+   -  Reputation: 1422


Posted 14 November 2012 - 04:30 PM

You actually answered yourself - you need to make the code thread-safe (the same way you'd do it on a CPU). You *should* use atomics for this (I'm not sure, though I think every GPU supporting OpenCL 1.1, where atomics became core, also supports cl_khr_global_int32_base_atomics ... I don't know which GPU you have, but I'd try updating its drivers & SDK), even though it will make the code, as you stated, slower (and there probably isn't any other solution that would be effective enough, either). The question is whether it will be faster than doing it on the CPU (with heavily optimized algorithms)?

Implementing e.g. a mutex (which might work for what you need) with atomics is then really straightforward and easy stuff ... you could actually use this construction:
while (atom_cmpxchg(mutex, 1, 0) == 0);  // Acquire mutex (1 = free, 0 = held)
/** Modify data in critical section **/
atom_cmpxchg(mutex, 0, 1);               // Release mutex
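For illustration, the same acquire/release pattern in C11 atomics (a CPU-side sketch with hypothetical function names, not the OpenCL code itself; the 1 = free / 0 = held convention matches the snippet above):

```c
#include <stdatomic.h>

static atomic_int mutex = 1;  /* 1 = free, 0 = held */

static void acquire(atomic_int *m)
{
    int expected;
    do {
        expected = 1;  /* only swap 1 -> 0, i.e. take the lock when free */
    } while (!atomic_compare_exchange_weak(m, &expected, 0));
}

static void release(atomic_int *m)
{
    atomic_store(m, 1);  /* mark free again */
}
```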



My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

### #3spek  Prime Members   -  Reputation: 996


Posted 15 November 2012 - 01:52 AM

Ha Vilem, everything ok?

I just tried to implement a semaphore with "atom_cmpxchg" and... the code didn't crash. I couldn't verify it actually worked yet, but maybe it works after all. Yet I'm pretty sure I had problems with atomic operations half a year ago, though it might have been "atom_max" or "atom_add" I tried instead. The GPU is an nVidia 9800M from 2008 btw.

Yeah, it seems we won't utilize much of the multi-threading if the code is locked all the time... Although after a couple of insertions the octree has already been built partially, thus requiring fewer updates. Maybe it helps if I create a semaphore for each octree level, though I'd have to make multiple global counter variables in that case as well. That way the chance that one thread waits for another is slightly smaller.

Yet the octree has to be built in GPU memory anyway, so this might still be the fastest way... although OpenCL also allows the CPU to compute the octree and then transfer it to videocard memory. Not sure if the transfer would make things too slow. Either way, when reading about GigaVoxels and Voxel Cone Tracing, it seems they all do it 100% on the GPU. The major difference is (probably) that they only build the octree once, and update it locally when dynamic stuff moves.

But in my case I'm dealing with a roaming world. Not sure how UDK4 does it with large worlds, but in my case there are no clear boundaries for the octree. When you move into another room, the octree should move along with the camera, either constantly or at least when I switch rooms. My octree resolution is likely lower than in those GigaVoxels examples, plus the voxels to inject are pre-calculated already, so that should save some time when rebuilding the octree. But I just wondered whether GigaVoxels isn't doing things entirely differently to make octree builds faster.

As always, thanks for helping!

### #4Ashaman73  Crossbones+   -  Reputation: 7480


Posted 15 November 2012 - 02:03 AM

Instead of making your code thread-safe (CPU view), you can make the code parallel (GPU view). Think of calculating the mipmaps of a texture: you would render layer by layer, and each rendered texel would only consider four texels of the previous layer.

In this spirit I would approach your problem like this:
1. leaf generation: check your texture and generate a new leaf (=write texel)
2. generate the parent nodes of layer 1
..
n. generate the root from the previous layer, which contains only 8 potential nodes.

No locking necessary. If we really want to utilize the GPU, we need to think in GPU-programming terms, moving away from CPU thinking.
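The layer-by-layer idea can be sketched CPU-side like this (a hypothetical dense layout where each level is an n^3 occupancy array; a sparse version would compact these levels afterwards). No locking is needed because each parent cell is written by exactly one "thread":

```c
#include <stdbool.h>

/* One bottom-up pass: build the (n/2)^3 parent level from the n^3 child
 * level. A parent cell is occupied if any of its eight children is. */
static void build_parent_level(const bool *child, bool *parent, int n)
{
    int half = n / 2;
    for (int z = 0; z < half; z++)
        for (int y = 0; y < half; y++)
            for (int x = 0; x < half; x++) {
                bool occ = false;
                for (int dz = 0; dz < 2; dz++)
                    for (int dy = 0; dy < 2; dy++)
                        for (int dx = 0; dx < 2; dx++) {
                            int cx = 2*x + dx, cy = 2*y + dy, cz = 2*z + dz;
                            occ = occ || child[(cz*n + cy)*n + cx];
                        }
                parent[(z*half + y)*half + x] = occ;
            }
}
```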

### #5spek  Prime Members   -  Reputation: 996


Posted 15 November 2012 - 04:06 AM

Hmm, got to reset my brain for that. Putting leafs in a list wouldn't be difficult, but how to work upwards then? Got to mention the storage is "sparse" to make optimal use of the memory (otherwise the octrees could grow HUGE). At least, that's how they did it with GigaVoxels, and I assume they thought very well about it ;) So I can't put them in a 2D or 3D uniform grid like you could when filling a texture. An extra problem is that different pixels I insert could share the same leaf if they are close enough; in that case they need to be merged somehow. When traversing the octree from top to bottom, you automatically see whether a leaf was generated before or not.

When generating 1 level upwards (or a higher mipmap level), each parent needs to know which 8 children are beneath it, but there is no easy way to sort that out since the leafs are scattered "randomly" in the linear memory buffer. It would need to search the whole buffer == inefficient.

I could still unfold the entire octree like a mipmapped 3D texture and reserve the memory required for a worst-case scenario (all leafs filled). This gives each node a fixed place in memory, and we'd have to construct an empty octree only once to get the pointers (they use pointers instead of array indices to make read/write access faster): http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
However, a somewhat big and/or detailed octree quickly consumes hundreds of megabytes... But maybe they simply reserved all that space in the GigaVoxels approach anyway, I don't know...

### #6Ashaman73  Crossbones+   -  Reputation: 7480


Posted 15 November 2012 - 04:28 AM

You can still use a sparse octree; the trick is to change the process, not the final product. For one, you can add a compression step after creating your octree to reduce the memory footprint. You could even do this per tile (subdivide your world and process tile by tile, compressing each tile, then in a final step add the last nodes and roots).

> When generating 1 level upwards (or a higher MipMap level), each parent needs to know which 8 children are beneath it, but there is no easy way to sort that out since the leafs are scattered "randomly" in the linear memory buffer. It would need to search the whole buffer == inefficient.

While creating your tree you should use some kind of easy identification, i.e. a hash. When thinking of mipmaps, each 4-texel group at each layer has a unique position you can easily determine. Best to separate the processing step (speed) from the compression step (memory).

### #7spek  Prime Members   -  Reputation: 996


Posted 15 November 2012 - 06:18 AM

Got to reset my brain for this... The problem is that I have to rebuild the octree each cycle, or at least when moving into a new room. The game world is way bigger than any octree could possibly cover, so instead the octree has to move along with the camera (requiring a rebuild as it moves). I don't know how UDK4 (which also implemented Voxel Cone Tracing) deals with large worlds, but when rebuilding all the time, I still need a large amount of memory available.

I must say I didn't test the performance yet, though. Maybe I'm worried about nothing, since my octrees don't have that many levels. So maybe even with the semaphores the rebuild impact might be acceptable.

Otherwise, I could try something as in the picture. As you say, I reserve a fixed amount of memory, but the lowest (detail) levels don't grow anymore. Distant geometry leafs are inserted at a higher level, saving a lot of nodes / megabytes in the end. With that pre-defined structure, I don't have to split nodes anymore, as we already know where the possible child nodes are stored (like you mentioned with mipmapping). To speed up octree traversal, I can still use pointers instead of array indexing.

I still wonder if the GigaVoxels dudes did it differently. Usually when I try to reinvent the wheel with my own "smart" tricks, subtle but important details get missed, leading to disasters hehe.

### #8Eternal  Members   -  Reputation: 617


Posted 15 November 2012 - 11:28 AM

> I still wonder if the GigaVoxels dudes did it differently.

So go read http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf to see how they did it.

In short:
1. voxelize your object <- you have that already; they use a voxel-list instead of a 3D texture, but that should be independent of the tree construction.
2. build your octree top-down by:
a) for each voxel (!) traverse the octree (as far as you have built it) and mark the node you end up in as needing to be split.
b) for each node in the current octree level, check if it was marked for splitting and do the actual splitting if necessary.
c) repeat until all levels are built.
3. insert your voxel values into the leaf nodes and fill in the other levels if needed.
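The split pass (step 2b) might be sketched like this, CPU-side, with a hypothetical node layout (on the GPU each loop iteration would be one thread, and the bump counter would be an atomic):

```c
/* Hypothetical node: firstChild == -1 means "not yet split". */
typedef struct { int firstChild; int flagged; } Node;

/* For every flagged node at the current level, allocate 8 consecutive
 * children from a bump counter and clear the flag. Returns how many
 * nodes were created. */
static int split_flagged(Node *nodes, int levelStart, int levelCount,
                         int *counter)
{
    int created = 0;
    for (int i = levelStart; i < levelStart + levelCount; i++) {
        if (nodes[i].flagged && nodes[i].firstChild == -1) {
            nodes[i].firstChild = *counter;  /* reserve 8 consecutive nodes */
            *counter += 8;
            created += 8;
            nodes[i].flagged = 0;
        }
    }
    return created;
}
```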

### #9spek  Prime Members   -  Reputation: 996


Posted 15 November 2012 - 03:54 PM

I tried to read the papers, though I hadn't seen that one yet, so thanks! But usually I get dizzy after too many lines of tech-talk and jargon; I need to convert algorithms into monkey logic bit by bit.

2a .. 2b seems to be the major difference then. Rather than locking & halting threads, the thread just marks and stops. I wonder, wouldn't they already perform a couple of splits at the start? I mean, it would be a waste of time to run all x-thousand voxels just to see if the top levels need to be split. Anyway, it makes sense. As for step 3, I guess they still use atomic operations to create an average or max value for the leafs?

* Another thing I wonder is what they do with those "bricks". They insert 2x2x2 (or 3x3x3) pixels into a 3D texture, which includes the neighbor voxels' values/colors. That is useful for smoother results with tri-linear filtering later on. But where does the compute shader get those neighbor values from? In my case I could read all 27 pixels around a voxel (each stored as 1 pixel), but that sounds like quite an expensive job.

* Moreover, since the raycasting technique later on involves directions, I guess they either insert values into 6 3D textures ("ambient cube"), or at least 3 in case they compress them with Spherical Harmonics.

* Which leads to my last question... When sampling from higher levels as the raycast forms a "cone", it seems they don't sample bricks from a 3D texture, but directly a color stored in the nodes themselves. AFAIK there is only 1 RGBA8 value in those node structs that could store a color, though. Probably I'm missing something or just wrong, but that would mean tri-linear filtering is gone, as well as the information about which direction the light came from (globally). Or do the higher levels also just generate bricks in a 3D texture? If so, again I wonder where they get the neighbor data from at that point.

Thanks!

### #10Eternal  Members   -  Reputation: 617


Posted 16 November 2012 - 12:47 PM

> 2a .. 2b seems to be the major difference then. Rather than locking & halting threads, the thread just marks and stops. I wonder, wouldn't they already perform a couple of splits at the start? I mean, it would be a waste of time to run all x-thousand voxels just to see if the top levels need to be split. Anyway, it makes sense. As for step 3, I guess they still use atomic operations to create an average or max value for the leafs?

Yep.

Heh, I'll try, but there's a lot I'm not sure about myself.

> * Another thing I wonder is what they do with those "bricks". They insert 2x2x2 (or 3x3x3) pixels into a 3D texture, which includes the neighbor voxels' values/colors. That is useful for smoother results with tri-linear filtering later on. But where does the compute shader get those neighbor values from? In my case I could read all 27 pixels around a voxel (each stored as 1 pixel), but that sounds like quite an expensive job.

Let's first consider the octree without "bricks", where each node stores one child pointer/index (the other children being fixed offsets from that) and 1 color value. If there is one voxel per leaf node and voxels lie on node centers, things are dead simple, but interpolation is difficult because we need access to neighbouring nodes. Now my understanding of the next part: shift the nodes/voxels so that voxel centers lie on node corners; then we get (up to) 8 voxels contributing to a node, which can be averaged or interpolated inside the node. Instead of storing a single color, store an index to a 3x3x3 brick and store interpolated color values in each pixel of the brick, so that the corners store the original voxel values, the center pixel is the average, and the other border samples are simple averages of the corners. That way, during sampling we should get smooth trilinear interpolation, because values are repeated across borders. Does that make sense? I think there are multiple sampling schemes suggested across the various papers on the topic, which makes this part quite confusing.
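The brick-filling scheme described here can be sketched as plain trilinear interpolation of the 8 corner voxels into a 3x3x3 brick (hypothetical layout assumed: corners indexed by zyx bits, brick indexed (z*3+y)*3+x). The corners then keep the original voxel values, the center becomes the average of all eight, and edge/face samples average their adjacent corners:

```c
/* Fill a 3x3x3 brick from 8 corner values by trilinear interpolation,
 * so values repeat across brick borders for seamless filtering. */
static void fill_brick(const float corners[8], float brick[27])
{
    for (int z = 0; z < 3; z++)
        for (int y = 0; y < 3; y++)
            for (int x = 0; x < 3; x++) {
                float fx = x * 0.5f, fy = y * 0.5f, fz = z * 0.5f;
                float v = 0.f;
                for (int c = 0; c < 8; c++) {
                    float wx = (c & 1) ? fx : 1.f - fx;
                    float wy = (c & 2) ? fy : 1.f - fy;
                    float wz = (c & 4) ? fz : 1.f - fz;
                    v += corners[c] * wx * wy * wz;
                }
                brick[(z*3 + y)*3 + x] = v;
            }
}
```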

> * Moreover, since the raycasting technique later on involves directions, I guess they either insert values into 6 3D textures ("ambient cube"), or at least 3 in case they compress them with Spherical Harmonics.

I think there's an isotropic implementation using only one 3D texture, and the full implementation, which indeed stores 6 3D textures for non-leaf nodes. I guess the problem gets worse if you want to store more than color information in your bricks.

> * Which leads to my last question... When sampling from higher levels as the raycast forms a "cone", it seems they don't sample bricks from a 3D texture, but directly a color stored in the nodes themselves. AFAIK there is only 1 RGBA8 value in those node structs that could store a color, though. Probably I'm missing something or just wrong, but that would mean tri-linear filtering is gone, as well as the information about which direction the light came from (globally). Or do the higher levels also just generate bricks in a 3D texture? If so, again I wonder where they get the neighbor data from at that point.

The 1 RGBA value per node is only for interior/constant nodes iirc, where the color is actually the same for the whole voxel; other nodes store brick pointers at all levels.

> Thanks!

Glad to help, certainly won't hurt once I get around to implementing it myself.

### #11Vilem Otte  Crossbones+   -  Reputation: 1422


Posted 16 November 2012 - 05:16 PM

I can just say thanks for the paper (I kind of missed it); you've earned at least +1.

To the topic - I haven't worked much with OpenGL compute shaders yet (because I'm currently staying in the wonderful land of OpenCL), but couldn't you just store everything as a single 3D buffer, where each node wouldn't be just a color but a whole ambient cube around the point? With some additional data I think you could fit it into some 128 bytes per node - 8 * 4 bytes (for child 'pointers') + 6 * 16 bytes for color = 32 + 96 = 128 ... yes, it could work ... and in the case of textures you could actually put this whole thing into an RGBA32F texture.

Edited by Vilem Otte, 16 November 2012 - 05:22 PM.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

### #12spek  Prime Members   -  Reputation: 996


Posted 17 November 2012 - 02:21 AM

Thanks a lot, the tiny stinky details are becoming clear now. I promise if, IF!, I manage to get it working, I'll try to write it all down in small "understandable" bits. The interpolation part makes sense; sampling 8 voxels instead of 27 sounds a lot better. So each node, also at the higher "mipmapped" levels, contains a brick... or multiple bricks actually, in case we want to store light fluxes from 6 directions or store them as SH.

@Vilem
That would certainly be possible, although you won't benefit from hardware tri-linear filtering when sampling colors during raymarching later on. I believe that's pretty much the main reason why they keep separate 3D texture(s) (I already asked this a while ago and got that for an answer). Nevertheless, I'll start with storing colors into the octree buffer itself first, just for easy testing.

Thanks, you guys, I'm finally getting a grip on this thing (I think).

### #13spek  Prime Members   -  Reputation: 996


Posted 17 November 2012 - 10:34 AM

* Note about semaphores & atomics (for those who struggle as well hehe)
I thought my videocard didn't support them because the card driver would crash/time-out each time. But that was a bug on my side: you can't run multiple "workers" with semaphores. AFAIK they all follow the same execution path, so if one waits in a loop until the semaphore gets unlocked, all workers in that same wavefront / warp will wait... forever.

If that's the case, I guess the splitting process described above can only use 1 worker per wavefront/warp then. At least, I'm still using a semaphore to allocate subnodes:
if ( parentNode.markedForSplitting )
{
    getSemaphor( &lock );
    int firstIndex = _globalVarNodeCounter[0]; // Use a global int to keep track of the used nodes
    _globalVarNodeCounter[0] += 8;
    releaseSemaphor( &lock );

    parentNode.childPtr1 = &octreeNodes[ firstIndex ];
    ...
}


However... OpenCL says that the "atomic_add" function returns the old value. So the code above could be replaced with:
if ( parentNode.markedForSplitting )
{
    int firstIndex = atomic_add( &_globalVarNodeCounter[0], 8 );
    parentNode.childPtr1 = &octreeNodes[ firstIndex ];
    ...
}

Correct me if I'm telling crap though, it's all a bit new to me too.
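The same fetch-and-add idea, illustrated in C11 (hypothetical names; atomic_fetch_add likewise returns the previous value, so each caller gets a private range of 8 slots with no mutex):

```c
#include <stdatomic.h>

static atomic_int nodeCounter = 1;       /* slot 0 is the root */

/* Reserve 8 consecutive node slots; returns the index of the first one. */
static int alloc_children(atomic_int *counter)
{
    return atomic_fetch_add(counter, 8);
}
```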

### #14spek  Prime Members   -  Reputation: 996


Posted 27 November 2012 - 04:08 PM

Sorry to blow this thread back to life, but I thought this question was related, so maybe it's better to keep it all together. And since I got served well here last time...

I managed to get VCT "working", but without smooth interpolation / bricks yet; I just stored the outgoing voxel radiance directly into the octree nodes. And the performance was horrible btw, but that probably also has to do with the age of my videocard. Anyway, about bricks and maybe OpenCL in general...

* How to draw brick pixels into an image?
I know how to use write_imagef, but the problem is that multiple voxels may write to the same brick, so I need some sort of blending rather than just overwriting. I think using a "max filter" would be best, though an average may do as well. Additive blending is not a good option in my case because some spots will get affected by more voxels than others. The real question is: is it possible to apply a blending method when writing pixels via OpenCL?

* If not, how did the VCT guys deal with multiple voxels on one node/brick, or did they pre-filter the voxels somehow to assure only 1 would be inserted per node?

* I can't write to volume textures to make bricks btw; the compiler also warns about not knowing the "cl_khr_3d_image_writes" extension. Am I doing something wrong, or is it very well possible my nVidia GeForce 9800M (~2009) just doesn't support it? A workaround would be to write to a 2D texture first, then let regular shaders finally construct a volume texture out of that.

* Another way to create bricks & work around the volume-texture issue
I could write colors into the octree struct first (like I already did), then let ordinary shaders somehow convert that buffer to a brick 3D texture. But... again, I still need to blend multiple voxels at a single node (corner). I tried the "atomic_max" function, but as far as I can see, that doesn't work on my structs from the __global address space. Aside from that, I fear some hang-ups with atomic operations.

Well, I just think I'm not shitting bricks in a proper way. Anyone managed to implement this part of the VCT technique?

EDIT
--------------
Now I see that section 22.4.2 of the paper Eternal posted above explains some of my "blending" questions... It seems they use atomic operations to first sum, then later divide to get an average. Got to read it further though.
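That sum-then-divide averaging could look roughly like this in C11 atomics (a sketch with hypothetical names, not the paper's actual code): many voxels accumulate atomically, and a later pass divides once per brick sample.

```c
#include <stdatomic.h>

typedef struct { atomic_int sum; atomic_int count; } Accum;

/* Accumulation pass: safe for any number of concurrent writers. */
static void accumulate(Accum *a, int value)
{
    atomic_fetch_add(&a->sum, value);
    atomic_fetch_add(&a->count, 1);
}

/* Resolve pass: divide once, after all accumulation is done. */
static int average(Accum *a)
{
    int c = atomic_load(&a->count);
    return c ? atomic_load(&a->sum) / c : 0;
}
```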

Cheers,
Rick

Edited by spek, 27 November 2012 - 04:20 PM.

