spek

Making an octree with OpenCL, thread-safety issues


A bit like "GigaVoxels", I'm trying to generate an octree on the fly by using an (OpenCL) compute shader. The (incomplete) code below should do that, BUT... obviously it's not thread-safe yet. So I'm asking what I should do to make it safe, yet still somewhat fast.

What the code does is read 64x64x64 pixels from a 3D texture. If a pixel contains geometry data, a leaf is inserted into the octree. Since the octree isn't filled initially (other than the plenty of space I reserved in a VBO), it gets constructed along the way:
1- Fetch a pixel and test whether it should be inserted.
2- Dive into the octree, starting with the root node, and get the leaf node at level 8.
3- If a node hasn't been divided into 8 subnodes yet, get an unused pointer from the VBO and reserve 8 nodes there. Let the parent node refer to the first child node.

The main problem is that multiple pixels may try to split octree nodes at the same time. I've seen barriers and such in the OpenCL reference; I guess I should create a semaphore around the "if ( node->child1Ptr == 0 ) { ... }" section in the code below, but wouldn't that make the code terribly slow? Aside from that (and this may be something that can't be fixed), creating a semaphore requires support for the atom_xchg operation AFAIK, and it seems my video card is too old for that.
[code]
__kernel void voxelOctree_Build( __global struct TOctreeNode* octree,
                                 __global int* octreeNodeCounter,
                                 __read_only image3d_t dataTex )
{
    // One work-item per pixel of the (64^3) 3D texture.
    // Each pixel gets injected into the octree, eventually creating new
    // subnodes along the way.
    int3 pixID = (int3)( get_global_id(0), get_global_id(1), get_global_id(2) );
    int4 texCoord = (int4)( pixID.x, pixID.y, pixID.z, 0 );
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    float4 data = read_imagef( dataTex, sampler, texCoord );

    if ( data.w > 0.f ) // Only insert if there is some occlusion at this point
    {
        // Get the root node
        __global struct TOctreeNode* node = &octree[0];
        // Dive into the octree, eventually splitting nodes if not done before
        int level = 0;
        while ( level++ < 8 )
        {
            // Create 8 subnodes, if not done before
            if ( node->child1Ptr == 0 )
            {
                // Get an unused index from the octree VBO.
                // NOT thread-safe: another work-item may read/write the counter
                // (and split this same node) at the same time.
                int indx1 = *octreeNodeCounter;
                node->child1Ptr = (int)&octree[ indx1 ]; // Store pointer to the 1st child node
                indx1 += 8;                              // Reserve 8 nodes
                *octreeNodeCounter = indx1;              // Write back to the global counter
            }
            // Check which of the 8 subnodes we need to dive into
            int localOffset = 0;
            ...generate a number between 0..7
            // Get the node from the deeper level, then repeat the loop
            node = (__global struct TOctreeNode*)node->child1Ptr + localOffset;
        } // while level < 8

        // Now insert the actual data
        ...
    } // data.w > 0
} // voxelOctree_Build
[/code]
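For what it's worth, the "generate a number between 0..7" part could be something like the snippet below, assuming the octree covers the 64^3 volume exactly (so 6 levels of subdivision); the bit tests and the depth variable d are my own guess, not how GigaVoxels does it:
[code]
// Octant selection: which of the 8 children of the current node (at depth d,
// root = 0) contains the voxel at texCoord? Bit (5 - d) of each coordinate
// tells whether we are in the lower or the upper half along that axis.
int shift = 5 - d;
int localOffset =  ((texCoord.x >> shift) & 1)
                | (((texCoord.y >> shift) & 1) << 1)
                | (((texCoord.z >> shift) & 1) << 2);
// localOffset is 0..7; +x is bit 0, +y is bit 1, +z is bit 2 here.
[/code]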
I don't know exactly how they do it in GigaVoxels, but I could also pre-generate the octree once, entirely "unfolded". That way I wouldn't have to split nodes, making the code faster. However, that would of course take a lot more memory.

Greets,
Rick

You actually answered yourself - you need to make the code thread-safe (the same way you'd do it on a CPU). You *should* use atomics for this. I'm not sure, but I think every GPU supporting OpenCL 1.1 (where atomics became core) also supports cl_khr_global_int32_base_atomics; I don't know which GPU you have, but I'd try updating its drivers & SDK. It will make the code slower, as you stated, and there probably isn't any other solution that would be effective enough either. The question is whether it will still be faster than doing it on the CPU with heavily optimized algorithms.

Implementing e.g. a mutex (which might work for what you need) with atomics is then really straightforward ... you could actually use a construction like this:
[CODE]
// 'mutex' is a __global int, initialised to 1 by the host (1 = unlocked, 0 = locked)
while (atom_cmpxchg(mutex, 1, 0) == 0); // Acquire: spin until we swap the 1 for a 0
/** Modify data in critical section **/
atom_cmpxchg(mutex, 0, 1);              // Release: put the 1 back

[/CODE]

Ha Vilem, everything ok?

I just tried to implement a semaphore with "atom_cmpxchg" and... the code didn't crash. I couldn't verify it actually worked yet, but maybe it works after all. Still, I'm pretty sure I had problems with atomic operations half a year ago, though it might have been "atom_max" or "atom_add" I tried instead. The GPU is an nVidia 9800M from 2008 btw.

Yeah, it seems we won't utilize much of the multi-threading if the code is locked all the time... Although after a couple of insertions the octree has already been built partially, thus requiring fewer splits. Maybe it helps if I create a semaphore for each octree level, though I'd have to make multiple global counter variables in that case as well. Then the chance that one thread waits for another is slightly smaller.


Yet the octree has to be built in GPU memory anyway, so this might still be the fastest way... although OpenCL also allows letting the CPU compute the octree and then transferring it to video memory. Not sure if the transfer would make things too slow. Either way, reading about GigaVoxels and Voxel Cone Tracing, it seems they all do it 100% on the GPU. The major difference is (probably) that they only build the octree once, and update it locally when dynamic stuff has moved.

But in my case I'm dealing with a roaming world. Not sure how UDK4 does it with large worlds, but in my case there are no clear boundaries for the octree. When you move into another room, the octree should move along with the camera, either constantly or at least when I switch rooms. My octree resolution is likely lower than in those GigaVoxels examples, plus the voxels to inject are pre-calculated already, so that should save some time when rebuilding the octree. But I just wondered if GigaVoxels isn't doing things entirely differently to make octree builds faster.

As always, thanks for helping!

Hmm, I've got to reset my brain for that. Putting leaves in a list wouldn't be difficult, but how would you work upwards then? I should mention the storage is "sparse" to make optimal use of the memory (otherwise the octrees could grow HUGE). At least, that's how they did it with GigaVoxels, and I assume they thought very well about it ;) So I can't put them in a 2D or 3D uniform grid like you could when filling a texture. An extra problem is that different pixels I insert could share the same leaf, in case they are close enough; in that case they need to be merged somehow. When traversing the octree from top to bottom, you automatically see whether a leaf was generated before or not.

When generating one level upwards (or a higher MipMap level), each parent needs to know which 8 children are beneath it, but there is no easy way to sort that out, since the leaves are scattered "randomly" in the linear memory buffer. It would need to search the whole buffer == inefficient.

I could still unfold the entire octree like a mipmapped 3D texture and reserve the memory required for a worst-case scenario (all leaves filled). This gives each node a fixed place in memory, and we'd only have to construct an empty octree once to get the pointers (they use pointers instead of array indices to make read/write access faster): http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
However, a somewhat big and/or detailed octree quickly consumes hundreds of megabytes... But maybe they simply reserved all that space in the GigaVoxels approach anyway, I don't know...
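Just to make the "unfolded" layout concrete for myself (my own formulas, not from the GigaVoxels paper): with a complete octree every node gets a fixed slot, so a node can be addressed directly from its level and cell coordinates, without any splitting or searching:
[code]
// A complete octree with levels 0..L stores (8^(L+1) - 1) / 7 nodes in total.
// All nodes of level 'lvl' start right after the nodes of the coarser levels:
int levelOffset( int lvl )
{
    return ((1 << (3 * lvl)) - 1) / 7;   // (8^lvl - 1) / 7 nodes come before this level
}

// Index of the node covering cell (x,y,z) on level 'lvl' (2^lvl cells per axis):
int nodeIndex( int lvl, int x, int y, int z )
{
    int res = 1 << lvl;                  // cells per axis on this level
    return levelOffset( lvl ) + x + res * ( y + res * z );
}
[/code]
For a 64^3 leaf level (L = 6) that is already roughly 300k nodes, and it grows 8x per extra level, which is exactly where the memory worry comes from.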

You can still use a sparse octree; the trick is to change the process, not the final product. For one, you can add a compression step after creating your octree to reduce the memory footprint. You could even do this per tile (subdivide your world and process it tile by tile, compressing each tile, then in a final step add the last nodes and the root).

[quote name='spek' timestamp='1352973987' post='5001177']
When generating one level upwards (or a higher MipMap level), each parent needs to know which 8 children are beneath it, but there is no easy way to sort that out, since the leaves are scattered "randomly" in the linear memory buffer. It would need to search the whole buffer == inefficient.
[/quote]
While creating your tree you should use some kind of easy identification, i.e. a hash. Think of mipmaps: each 2x2(x2) texel group at each layer has a unique position you can easily determine. Best to separate the processing step (speed) from the compression step (memory).
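A rough sketch of what I mean by an easy identification (my own packing, assuming a 64^3 volume): the key of a node, and therefore of its parent, can be computed directly from the voxel position and the level, so a bottom-up pass never has to search the buffer:
[code]
// Pack (level, cell position at that level) into a single 32-bit key.
// For a 64^3 volume the levels are 0..6 and each cell coordinate fits in 6 bits.
uint nodeKey( uint level, uint x, uint y, uint z )
{
    uint shift = 6u - level;              // voxel coords -> cell coords at 'level'
    return (level << 24)
         | ((z >> shift) << 16)
         | ((y >> shift) << 8)
         |  (x >> shift);
}

// The parent of that node is simply nodeKey( level - 1, x, y, z ).
[/code]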

I've got to reset my brain for this... The problem is that I have to rebuild the octree each cycle, or at least when moving into a new room. The game world is way bigger than any octree could possibly cover, so the octree has to move along with the camera instead (requiring a rebuild as it moves). I don't know how UDK4 (which also implemented Voxel Cone Tracing) deals with large worlds, but when rebuilding all the time, I still need a large amount of memory available.

I must say I didn't test the performance yet, though. Maybe I'm worrying about nothing, since my octrees don't have that many levels. So maybe even with the semaphores the rebuild impact might be acceptable.

Otherwise, I could try something as in the picture. As you say, I reserve a fixed amount of memory, but the lowest (detail) levels don't grow anymore. Distant geometry leaves are inserted at a higher level, saving a lot of nodes / megabytes in the end. With that pre-defined structure I don't have to split nodes anymore, as we already know where the possible child nodes are stored (like you mentioned with mipmapping). To speed up octree traversal, I can still use pointers instead of array indexing.
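To make the "distant geometry at a higher level" part concrete for myself (made-up constants and names, just to illustrate the idea):
[code]
// Pick the deepest level a voxel may reach, based on its distance to the camera:
// full detail nearby, one level coarser per doubling of the distance.
int maxInsertLevel( float distToCamera )
{
    const int   deepestLevel    = 6;     // e.g. the 64^3 leaf level
    const int   coarsestLevel   = 3;     // never insert above this level
    const float fullDetailRange = 8.0f;  // world units that keep full detail

    int level = deepestLevel;
    float range = fullDetailRange;
    while ( level > coarsestLevel && distToCamera > range )
    {
        level--;         // one level coarser...
        range *= 2.0f;   // ...per doubling of the distance
    }
    return level;
}
[/code]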

I still wonder if the GigaVoxels dudes did it differently. Usually when I try to reinvent the wheel with my own "smart" tricks, subtle but important details get missed, leading to disasters hehe.

I tried to read the papers, although I hadn't seen that one yet, so thanks! But usually I get dizzy after too many lines of tech-talk and jargon :P I need to convert algorithms into monkey logic bit by bit.

2a .. 2b seems to be the major difference then. Rather than locking & halting threads, the thread just marks the node and stops. I wonder, wouldn't they already perform a couple of splits at the start? I mean, it would be a waste of time to run all x-thousand voxels just to see if the top levels need to be split. Anyway, it makes sense. As for step 3, I guess they still use atomic operations to create an average or max value for the leaves?
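If I understood the mark-and-stop idea correctly, the split pass could look roughly like the sketch below (the needsSplit flag and the kernel are my own guess, not the paper's actual code): a first pass lets each voxel descend as far as the tree currently goes and simply writes a flag into the node where it got stuck, and a second pass runs one work-item per existing node to do the allocation, so there is only one atomic add per split and nobody spins on a lock:
[code]
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

// Second pass: one work-item per node that existed before this pass.
// Nodes flagged by the marking pass get their 8 children allocated here.
__kernel void voxelOctree_SplitFlagged( __global struct TOctreeNode* octree,
                                        __global int* octreeNodeCounter,
                                        int numExistingNodes )
{
    int i = get_global_id(0);
    if ( i >= numExistingNodes )
        return;

    __global struct TOctreeNode* node = &octree[i];
    if ( node->needsSplit )                                // set by the marking pass
    {
        int firstIndex = atom_add( octreeNodeCounter, 8 ); // one atomic per split, no spinning
        node->child1Ptr = (int)&octree[ firstIndex ];      // same pointer-in-int scheme as above
        node->needsSplit = 0;
    }
}
[/code]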


Since you understood the papers, I'd like to make some more use of your services hehe.
* Another thing I wonder is what they do with those "bricks". They insert 2x2x2 (or 3x3x3) pixels into a 3D texture, which includes the neighbor voxels' values/colors. That's useful for smoother results with tri-linear filtering later on. But where does the compute shader get those neighbor values from? In my case I could read all 27 pixels around a voxel (each stored as 1 pixel), but that sounds like quite an expensive job.

* Moreover, since the raycasting technique later on involves directions, I guess they either insert values into six 3D textures (an "ambient cube"), or at least three in case they compress them with Spherical Harmonics.

* Which leads to my last question... When sampling from higher levels as the raycast forms a "cone", it seems they don't sample bricks from a 3D texture, but directly a color stored in the nodes themselves. AFAIK there is only one RGBA8 value in those node structs that could store a color, though. Probably I'm missing something or just wrong, but that would mean tri-linear filtering is gone, as well as the information about which direction the light came from (globally). Or do the higher levels also just generate bricks in a 3D texture? If so, again I wonder where they get the neighbor data from at that point.

Thanks!

I can just say thanks for the paper (I kind of missed it), and you've earned at least a +1.

To the topic - I haven't worked much with OpenGL compute shaders yet (because I'm currently staying in the wonderful land of OpenCL), but couldn't you just store everything as a single 3D buffer, where each node isn't just a color but a whole ambient cube around the point? With some additional data I think you could fit it into some 128 bytes per node: 8 * 4 bytes (for the child 'pointers') + 6 * 16 bytes for the colors = 32 + 96 = 128... yes, it could work. And in the case of textures, you could actually put this whole thing into an RGBA32F texture.
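For reference, a possible layout of that 128-byte node (the struct and field names are just a guess, but the byte counts add up):
[CODE]
// 8 child 'pointers' (stored as indices) + an ambient cube of 6 colors = 128 bytes.
struct TOctreeNodeAC
{
    int    child[8];        // 8 * 4  = 32 bytes
    float4 ambientCube[6];  // 6 * 16 = 96 bytes, one RGBA color per cube face
};                          // 128 bytes total = 8 RGBA32F texels, if you go the texture route
[/CODE]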

Thanks a lot, the tiny stinky details are becoming clear now. I promise, if, IF!, I manage to get it working, I'll try to write it all down in small "understandable" bits. The interpolation part makes sense; sampling 8 voxels instead of 27 sounds a lot better. So each node, also from the higher "mipmapped" levels, contains a brick... or multiple actually, in case we want to store light fluxes from 6 directions or store them as SH.

@Vilem
That would certainly be possible, although you won't benefit from hardware tri-linear filtering when sampling colors during raymarching later on. I believe that's pretty much the main reason why they keep separate 3D texture(s) (I already asked this a while ago and got that for an answer :) ). Nevertheless, I'll start with storing colors in the octree buffer itself first, just for easy testing.

Thank you guys, I'm finally getting a grip on this thing (I think).
Enjoy your weekends!

[b]* A note about semaphores & atomics[/b] (for those who struggle with this as well hehe)
I thought my video card didn't support them, because the card driver would crash or time out each time. But that was a bug on my side: you can't have multiple "workers" spinning on a semaphore at the same time. AFAIK they all follow the same execution path, so if one waits in a loop until the semaphore gets unlocked, all workers in that same wavefront / warp will wait... forever.

If that's the case, I guess the splitting process described above can only use one worker per wavefront/warp. At least, I'm still using a semaphore to allocate subnodes:
[code]
if ( parentNode.markedForSplitting )
{
    getSemaphor( &lock );                          // lock the global counter
    int firstIndex = _globalVarNodeCounter[0];     // a global int keeps track of the used nodes
    _globalVarNodeCounter[0] += 8;                 // reserve 8 child nodes
    releaseSemaphor( &lock );                      // unlock

    parentNode.childPtr1 = &octreeNodes[ firstIndex ];
    ...
}
[/code]

However... OpenCL says that the "atomic_add" function returns the old value of the counter. So the code above could be replaced with:
[code]
if ( parentNode.markedForSplitting )
{
    int firstIndex = atomic_add( &_globalVarNodeCounter[0], 8 ); // returns the counter's old value
    parentNode.childPtr1 = &octreeNodes[ firstIndex ];
    ...
}
[/code]
Correct me if I'm telling crap though, it's all a bit new to me too.

Sorry to blow this thread back to life, but I thought this question was related, so maybe it's better to keep it all together. And since I got served well here the previous time :P

I managed to get VCT "working", but without smooth interpolation / bricks yet; I just stored the outgoing voxel radiance directly in the octree nodes. The performance was horrible btw, but that probably also has to do with the age of my video card. Anyway, [b]about bricks and maybe OpenCL in general[/b]...

* How to draw brick pixels into an image?
I know how to use write_imagef, but the problem is that multiple voxels may write to the same brick, so I need some sort of blending rather than just overwriting. I think a "max filter" would be best, though an average may do as well. Additive blending is not a good option in my case, because some spots will get affected by more voxels than others. So the real question is: is it possible to apply a blending method when writing pixels via OpenCL?

* If not, how did the VCT guys deal with multiple voxels on one node/brick? Or did they pre-filter the voxels somehow to ensure only one would be inserted per node?

* I can't write to volume textures to make bricks btw; the compiler also warns about not knowing the "cl_khr_3d_image_writes" extension. Am I doing something wrong, or is it very well possible that my nVidia GeForce 9800M (~2009) just doesn't support it? A workaround would be to write to a 2D texture first, and then let regular shaders construct a volume texture out of it.

* Another way to create bricks & work around the volume-texture issue:
I could write the colors into the octree struct first (like I already did), and then let ordinary shaders convert that buffer to a brick 3D texture somehow. But... again I still need to blend multiple voxels at a single node (corner). I tried the "atomic_max" function, but as far as I can see that doesn't work on my structs in the __global address space. Aside from that, I fear some hang-ups with atomic operations.


Well, I just think I'm not shitting bricks in a proper way. Anyone managed to implement this part of the VCT technique?


EDIT
--------------
Now I see that section 22.4.2 of the paper Eternal posted above explains some of my "blending" questions... It seems they use atomic operations to first sum, and later divide to get an average. I've got to read it further, though.
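If I read it right, the idea would be roughly the sketch below (struct and field names are mine, not the paper's): every voxel atomically adds its color (scaled to integers) to the node and bumps a counter, and a later pass divides to get the average:
[code]
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

struct TNodeAccum
{
    int    r, g, b, count;  // integer accumulators, filled by many voxels at once
    float4 color;           // final averaged color, written by the resolve pass
};

// Called from the insertion kernel; safe when many voxels hit the same node.
void accumulateVoxelColor( __global struct TNodeAccum* node, float4 c )
{
    atom_add( &node->r, (int)(c.x * 255.0f) );
    atom_add( &node->g, (int)(c.y * 255.0f) );
    atom_add( &node->b, (int)(c.z * 255.0f) );
    atom_add( &node->count, 1 );
}

// Second pass: one work-item per node turns the sums into an average.
__kernel void resolveAverages( __global struct TNodeAccum* nodes, int nodeCount )
{
    int i = get_global_id(0);
    if ( i >= nodeCount || nodes[i].count == 0 )
        return;

    float inv = 1.0f / ( 255.0f * (float)nodes[i].count );
    nodes[i].color = (float4)( nodes[i].r * inv,
                               nodes[i].g * inv,
                               nodes[i].b * inv,
                               1.0f );
}
[/code]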


Cheers,
Rick