Sign in to follow this  
Vilem Otte

Voxel Cone Tracing - Octrees

Recommended Posts

And it's again me - I'd like to first say thanks for all the hints on the 3D texture mip mapping, which allowed me to do some great experiments with voxel cone tracing.

Now, I already had some time to move further and implement octree construction on GPU which is quite fast, along with populating octree with the data. So far so good, and I'm even able to search for voxel on position in octree using a function now. So why am I writing here? It is simple, as I'm going to cast rays (and later cones) in the octree:

I'd like to know whether it has sense to implement full ray-octree traversal (e.g. stack-based traversal that would actually intersect the nodes), or whether I should just do the 'sampling method' (e.g. do N steps along ray and accumulate results)?

 

Share this post


Link to post
Share on other sites

i'd say do full/real/whateveryoucallit-traversal so you won't miss any nodes alone the ray. i played around with gpu/octree/tracing a while ago and found that the fixed-step-size approach ain't very good (imo). you either have a too large stepsize, which may miss corners or thin objects in the tree, or a too tight stepping, which is just slow due to oversampling where it is not needed. 

as said it has been a while but if i remember correctly i settled for something like:

- traverse tree to find node that contains ray's origin or ray's entry-point into the tree (if ray starts outside the bbox of tree)

- check found node for opacity and so on if fully opaque, finish traversal and draw pixel.

- find the side of the current node on which the ray would exit. 

- find the smallest non-empty node touching that side of the current node.

- repeat

but performance of that depends on the cache-friendliness of your tree towards that method, and among other things depends on if all your leafes are at the same/deepest level.

that kind of ends up being some sort of adaptive-step-size-thingy, where the step size is always the distance to the next good node, however deep that node might be.

no idea how well that turns into cone-tracing. and when doing cone-tracing fixed step-size based on the size of the cone might work just as well. imagine the cone being filled with ever-larger balls, each overlapping to a certain amount. that would give you a step-size dependend on distance to ray's origin, and it still shouldn't miss many corners/thins depending on the amount of overlap of the "query-spheres" you use within the cone.

edit: however for conetracing you might need a different structure for the octree to use it efficiently - so i wouldn't plan to just slap that on later. might be better of trying to do that right from the start and plan you tree accordingly.

Edited by ninnghazad

Share this post


Link to post
Share on other sites

Thanks for reply!

My octree actually point to a 'brick' per each node, incl. leaves (which is 3x3x3 voxels stored in 3D texture, to allow for hardware trilinear filtering). I implemented this based on Voxel Cone Tracing paper.

The problem is, that all papers that went into octree voxel cone tracing didn't pay too much attention into traversal, which brings me to either:

  • They use just sampling (which is extremely simple, yet as you said has major flaws)
  • Octree traversal isn't really hard one, so they don't even consider mentioning it (yet, once you work with cones - you basically have some max. level you want to traverse into, which changes based on your cone angle ... which brings in some complexity)
  • They don't want to publish it (which I don't really believe, anyone who works a bit with ray tracing can put it together in few days)

I'll put some effort into full traversal then, and share some results here.

Share this post


Link to post
Share on other sites

The octree structure, if I'm not mistaking is there to let you not do full traversal yet have no missed nodes, so you don't really have to go through all the n voxels in the ray path, but only lg(n) of them, which makes the process way faster. Otherwise I wouldn't see a need for actual octree structure.

Share this post


Link to post
Share on other sites

Thanks for the replies.

#IYP - you're correct, yes the point is to make the process faster than going literally 'through all cells'

I believe I've got it correct to some extent. Let me share a screenshot here:

[sharedmedia=gallery:images:8634]

As you can see, the data are not really filtered - I've followed the paper - https://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf

Before I go to my questions and notes - let me outline the algorithm here, so that you might have some ideas where I did something wrong or right. So, my octree building algorithm goes as following:

Graphics queue:
VoxelOctree::Generate()
{
    // Simply render the scene, storing fragments (position + color) in UAV
    // Use counter buffer to count number of voxels
    RenderScene();
}

Compute queue:
VoxelOctree::BuildSVO()
{
    levels = log2(dimensions)
    for (i = 0; i < levels; i++)
    {
        // Flag nodes that are going to be subdivided
        OctreeNodeFlag();

        // For each node on current level, that has been flagged, allocate 8 children
        OctreeNodeAlloc();

        // Initialize children that were generated
        OctreeNodeInit();
    }

    // At this point I'm absolutely sure that octree structure is working properly!
    // But how to fill the data???
    // Next to octree I have a data buffer which currently has N*rgba8 field (where N is 
    // the total number of nodes in octree). Octree node index basically matches the color
    // in this array.
    //
    // Following function just clear octree color and fill single color per node (using 
    // atomics and approximate average allows us to fill EACH node to which voxel maps)
    OctreeNodeClear();
    OctreeNodeFill();
}

My understanding of bricks (e.g. how the data should be stored originally) is to allocate large structured buffer which would have 3x3x3 voxel space per each node in the octree (that means 27 times the size of single color per node), which would allow me to the filtering.

So what I had? I simply ran another kernel for each leaf node, that would search what is in 26 voxel surrounding (searching the whole tree from top) the center and filling those values. Which tends to be quite slow. Plus I'm not entirely sure how should I filter it (I stored it in the structured buffer and attempted to do trilinear filtering, without too much success though).

So my real question is, what should I store in those 'bricks' and how to store & obtain data for them (without taking too long time)? Once stored, I still need to perform filtering (which can be done by using 3D texture and pointing to voxels in center of each 3x3x3 voxels in that ~ I can figure out indexing, but I still have problem with obtaining data for them and potentially also storing them in interior nodes of the tree). The paper doesn't really mention how to fill data into the 'bricks', and we don't really have any information about ALL neighbors (just 7 others, that are in the same parent of the current nodes, but what about the rest?).

Share this post


Link to post
Share on other sites

Looking at the original paper they do not need to filter across bricks because they store 2x2x2 data in 3x3x3 bricks, so they avoid the problem but use more memory.

https://research.nvidia.com/sites/default/files/publications/GIVoxels-pg2011-authors.pdf

 

I assume sparse virtual 3D texturing tiled resource features (not sure which name to pick) would do the filtering in hardware. Did you consider this? (AMD does not have it yet)

 

You could read up Remedys paper about lighting in Quantum Break. They talked about filtering across full/empty or cells with different LODs IIRC - may be related.

Edited by JoeJ

Share this post


Link to post
Share on other sites

Thanks for the response. I've read through Quantum Break paper but it doesn't really help much.

I'm not really against increasing memory footprint, I just don't have any idea how to store 2x2x2 data in 3x3x3 brick (e.g. basically a brick would hold all 8 voxels in a leaf). Based on my thinking it has to be 4x4x4 brick (you have 2x2x2 interior composed of those 8 values + the border), which doesn't sound that bad.

EDIT: They do have this paragraph:

 

To solve this problem, we propose to use 3×3×3 voxel bricks, but assume that the voxel centers are located at the node corners instead of the node centers (Fig. left). This ensures that interpolated values can always be computed inside a brick covering a set of 2×2×2 nodes. Redundancy is not eliminated, but we use less than half the amount of memory compared to adding a boundary, for the same sampling precision

 

This doesn't really make any sense to me. As in this sense I need to have a function like:

// This is pseudo-code! But will do as example.
// I'm looping through all voxels and storing them in tree, I can find out respective brick without any problem
// But I have no idea how to store the data!
//
// brick here points to one brick in which I'm storing current voxel that is in processing
// color is the value I want to store
// location is index into X, Y and Z in voxel array
void StoreColorInBrick(uint brick[3][3][3], uint color, uint location[3])
{
    // ... Now assuming my location is:
    // location[0] % 2 determines whether current voxel color is to the left or right (e.g. on x-axis)
    // location[1] % 2 determines whether current voxel color is to the top or bottom (e.g. on y-axis)
    // location[2] % 2 determines whether current voxel color is to the front or back (e.g. on z-axis)
    //
    // For simplicity let's assume I'm doing lower left corner of some node, where do I store color?
    // should it be brick[0][0][0]?
    // or added that belong to lower left (4 nodes), e.g. to brick[0..1][0..1][0..1] .. in which case 
    // how to handle the parts that are not in corner (there will be more values written most likely).
    // Average them? Sum them?
}

I believe I can make the interpolation work after (for each brick, I can find neighboring brick (if there isn't any, that means there are zeroes ~ that node is empty)). I assume the interpolation would then be only for boundary voxels standard sum/2 for 2 that are directly neighboring between 2 bricks.

Edited by Vilem Otte

Share this post


Link to post
Share on other sites

I did not try to follow your code, tried to make a 1D example instead:

  0   4   0   0   8   4   0   0  // rasterized input
    2   2   0   4   6   2   0  // interpolated

  // resulteing triples for seamless filtering using range [0.33,0.66]:
  024,242,420,200,000,004,048,486,864,642,420,200,000

This should work, but it really needs a lot of memory :(

Not sure if i miss something here, you would need a lot of empty space to make the octree a win with this.

Also i did not think how this extends to building a tree.

 

 

EDIT: Stupid way to solve this, better example below (!)

Edited by JoeJ

Share this post


Link to post
Share on other sites

No worries, I got the idea. It makes sense but you're right, it will have quite large memory impact (yet doing trilinear filtering in shader seems to have more impact).

The octree tends to be quite sparse (For Sponza, the occupancy is about 3-4% (and it tends to be groupped) so there tends to be quite a lot of empty nodes even higher in the hierarchy. Although this might not be general rule for any scene.

Anyways, time to do some coding. Thanks!

Share this post


Link to post
Share on other sites

Probably it makes sense to use octree only for the highest detail levels, and use reguler 3D texture at low detail.

Share this post


Link to post
Share on other sites

So, work in progress.

[sharedmedia=gallery:images:8640]

I believe I've got filtering and brick building correct as of now (I've even checked the 3D texture and it seems correct to me. Anyways I'm using the sampling traversal as of now (which is incorrect, as it doesn't give me any advantage of using octree), but I had to switch back to it (for testing the filtering) due to having something wrong in the ray-octree one (which I'll need to update to cone-octree).

Whole SVO is built on GPU as of now, and the steps are following:

// Build a hierarchy (similar to how it is done in papers)
for (i = 0; i < levels; i++)
{
    FlagNodes();
    AllocNodes();
    InitNodes();
}

// Fill base level with data (as per your description, just done in 3D)
FillLeaves();
// Filter bricks (as we need neighbor information!)
FilterLeaves();

// Build interior nodes (interpolate from lower levels)
for (i = 0; i < levels; i++)
{
    BuildInterior(levels - i - 1);
}

Works like a charm, it is also very fast (I will do actual measurement shortly), and I am over-allocating memory a bit now (as I don't need log2(levels), but log2(levels)-1 depth of tree - as leaves (bricks) actually store one whole level in the 3D texture. I'm going to give figures for performance and memory out once I solve the traversal part (which I'm digging in now).

And I haven't even started optimization yet!

Thanks for advises in this thread. If everything is successful, then my next post will go into blog here on GameDev, with the figures and maybe some demo/code to show.

Share this post


Link to post
Share on other sites

And no, I still don't have filtering right. I noticed that 2 items will cause problems, so the first one is filtering inside leaves:

Here is a pseudo-example:

[sharedmedia=gallery:images:8641]

This is our starting point - semi transparent nodes are visualization of part of the tree that is empty (therefore those nodes are not really there, and or course we can detect that there are empty nodes). Top left 4 nodes are full. Bottom left 4 nodes are also full (some with data that have alpha = 0).

I will demonstrate filtering along X-axis (as described in paper). So, the first step is:

[sharedmedia=gallery:images:8642]

Writing data to the right neighbor (of course as tree is sparse we can't really access the 'transparent' - non existing nodes). This will be the result:

[sharedmedia=gallery:images:8643]

Now, as the data in right-most voxels need to be the same as in the left-most voxels, we need to copy from right back to the left. Like:

[sharedmedia=gallery:images:8644]

And the problem is:

[sharedmedia=gallery:images:8646]

Obviously as we can't write into non-existent nodes (due to sparse nature of the tree), the values won't match (even when we assume that value was with alpha = 0 in the previous steps). All the data neighboring non-existent nodes will be invalid ending up in a border.

Of course the same problem is hit when computing interior nodes (mip mapped ones) - they will not match properly. Resulting in non-smooth ambient occlusion like this:

[sharedmedia=gallery:images:8648]
The paper sadly doesn't address this problem at all (it describes scenario when the tree is dense, in which case it is working properly, but as soon as tree is sparse, the problems arise).
Any idea how to properly solve this? Obviously the leaf nodes can be solved by detecting non-existent nodes around currently processed node and set values to match (in all 6 directions). But how to perform 'mip-mapping' after (e.g. how to compute interior node values in a way that would make sense)?
 
My apologize for double-post!
Edited by Vilem Otte

Share this post


Link to post
Share on other sites

I think you do need to interolate first, build the triples second, and only after this is done you can build the tree. This way it can not happen you would need non existing nodes.

Take a look at my 1D example above, it only results in empty cells if at least two adjecent cells from original data are empty.

But i'm still unsure if my example is the best way to solve filtering - it needs 6 times the original memory just for the 1D case.

I'll try another option:

0   4     0   0     8   4     0   0   // rasterized input (with an additional space to show the data we want to group in a node)

// interpolation not necessary, instead we add the right neighour to allow filtering.
// So we get those triples for seamless filtering using range [0.33,0.66]:

  04 0,  00 8,  84 0,  00 0

... ah, much better - only 1.5 times the data, stupid me :) Requires 3 adjacent empty cells to generate an empty voxel.

 

So if i'm right here, something like mixing colors like you show in your image should not be necessary.

But when you bulid the tree you need to look at neighbours as well to be sure a node will in deed be empty.

Edited by JoeJ

Share this post


Link to post
Share on other sites

The problem with interpolation is, I can't do that beforehand. I have a set of elements from which I'm building a tree (those are voxels, generated for the scene). Now as I know the position of each voxel, I have no idea about its sorroundings (they can be in random order inserted into the set), all I know is their position and data (color, normal, whatever I wish to store). Adding voxels with interpolations would mean that I need to find neighbors in this set (which is actually easily done when octree building is finished).

This will be quite hard, as the tree can't be easily modified. EDIT: Thinking of this, yes technically it is possible to add nodes, removing them might be more tricky - but yes, also possible ... as actually each node represents sparse octree itself. I believe this might be a decent way to handle inserting dynamic objects too ... or maybe different levels of quality (whenever it is necessary). Making this easily usable even for case of 'ball in stadium'.

Each of the elements is inserted into the tree (the tree I generate is sparse), and doing any form of refinement will be hard, if not impossible once tree is generated (it uses parallel algorithm for that):

  1. You start with root node (levels = 1)
  2. Loop (for i = 0 to levels):
  3. For each element - traverse into the 'levels', and flag which nodes are to be split
  4. For each node in current level - if node is flagged, generate 8 child nodes
  5. Allocate generated child nodes
  6. Repeat the loop

Now, I can't really change a tree once it is finished. The point is, I realized that bricks are something different than what I thought they are. The value in the centre is the most important one, and 26 values sorrounding the center are duplicated (yes, it is "memory overkill" in this sense)! The correct version of previous images is (and should be):

[sharedmedia=gallery:images:8649]

The mipmapping part seemed tricky at first (but my algorithm is wrong), what I should actually do is, for child nodes, use their center values (and the values between the centres - which will be 3x3x3 total values! Not 4x4x4!) and average this into the parent node's center. Then put the boundaries to fill boundaries in the parent node.

Now, this will need additional step to perform the interpolations like I did for leaves, to guarantee that the border values in bricks are correct.

[sharedmedia=gallery:images:8650]
 
Sounds quite complicated though (I didn't have time to measure time figures, as the code is still work in progress, so hopefully it won't be that big problem to compute).
 
EDIT2: Yes, I can confirm that now it works quite properly (even AO is smooth). The last part I'd like to investigate is traversal.
 
The sampling one is of course extremely fast if sample count is low, but as stated - unreliable. Increasing sample count can solve problems (basically oversampling everything), but tends to be quite heavy. But what's the point of using SVO when I'm just brute-forcing through right? Which has O(n) complexity.
 
I've tried ray-AABB intersections and stepping down the tree (using stack-based approach ... basically something similar to what one would use for BVH ray tracing). While reliable, it is extremely heavy if the nodes aren't pushed into stack in correct order (which doesn't seem as straight forward as I thought will be). This should theoretically have O(log(n)) complexity, assuming I put the children to stack in order to process them "along the ray".
 
I'd also like to try DDA-like approach (as each time I can determine on which level the empty intersected node is, it should be straight forward to step accordingly to next node), as stated previously - if implemented correctly, it should have O(log(n)) complexity of finding next node along "ray", and 
Edited by Vilem Otte

Share this post


Link to post
Share on other sites

The problem with interpolation is, I can't do that beforehand. I have a set of elements from which I'm building a tree (those are voxels, generated for the scene). Now as I know the position of each voxel, I have no idea about its sorroundings (they can be in random order inserted into the set), all I know is their position and data (color, normal, whatever I wish to store). Adding voxels with interpolations would mean that I need to find neighbors in this set (which is actually easily done when octree building is finished).

You could use a grid of indices and populate it while generating the voxels. Assuming this grid fits into memory (256^3 * 32bit = 64MB), neighbourfinding is just a lookup.

 

 

Not sure about your tree generation algorithm, personally i do it this way for a similar usecase:

for each level top down:

 for each leaf determinate the subdivided octant of its actual parent and increase atomic counters.

 Do prefix sum over the counters to determinate new child nodes and link leaf parent pointers to them.

Advantage: Each thread maps to one leaf and is busy even at the top levels, no inner loops. Disadvantage: Needs sync between levels, cumbersome to keep atomic counters in LDS

 

Another way: https://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/

There is a link to a paper which describes how to convert binary tree results to octree.

 

 


I've tried ray-AABB intersections and stepping down the tree (using stack-based approach ... basically something similar to what one would use for BVH ray tracing). While reliable, it is extremely heavy if the nodes aren't pushed into stack in correct order (which doesn't seem as straight forward as I thought will be). This should theoretically have O(log(n)) complexity, assuming I put the children to stack in order to process them "along the ray".

 

For ray traversal you could use a skip pointer per node instead of a stack. Should be a big win.

EDIT: see here: https://www.cs.utah.edu/~bes/papers/fastRT/paper-node12.html

Edited by JoeJ

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this