Sparse Voxel Octrees ray packets traversal

Started by
6 comments, last by MxADD 9 years, 3 months ago
Hi
I'm looking for any algorithm to do (coherent) ray packet vs. Sparse Voxel Octrees traversal on CPU - are there any ?
For now I use "An Efficient Parametric Algorithm for Octree Traversal". (http://wscg.zcu.cz/wscg2000/Papers_2000/X31.pdf)
but it is for single rays only, not packets, and cannot be easily converted to handle packets (?)
Advertisement

That paper seems to be about squeezing out the last bit of performance of ray cube intersection (and seems familiar to me from an old book). Is that your intent: Optimal performance? I do not know about a ray-packet. I like coherence, but ray-tracers do not support this usually. You could start with bounding box => Box-Box intersection.

That paper seems to be about squeezing out the last bit of performance of ray cube intersection (and seems familiar to me from an old book). Is that your intent: Optimal performance? I do not know about a ray-packet. I like coherence, but ray-tracers do not support this usually. You could start with bounding box => Box-Box intersection.

I don't know if you correctly understand me :)

So I explain in little more detail:

For KD-Trees there are KD-Tree ray packet traversal (and they are widelly used to get decent performance at least for CPU implementations, GPU implementations utilize GPU architecture for this kind of acceleration (coherency in warps, etc.))

Packet traversal utilize the fact that rays that differs only a little in direction and starts from that same origin, travel mostly that same nodes, so instead iterating with each ray thru the same set of nodes (only to split at last few levels) we get bunch of similar rays (4 or 16 is a common case) and traverse that packet thru the whole tree, visiting ALL nodes that can be reached by individual rays from packet - this way the CPU cache is utilized better and the performance is improved.

Now I ask if such an algorithm (to travel bunch of coherent rays AT ONCE thru octtree) exist ?

What I realy want to achieve is to test visibility of bunch (about 20 000) objects

(to see if they are occluded by octtree or not (ray T from octree is smaller than the ray T casted to object))

One way to do this is to raycast octree into array (say 512x512) and then generate mipmaps for this aray using the MIN filter,

then the test for object would be as easy as determinig screenspace size of their bboxes, from that size determine the mip level for lockup, and lockup exacly one pixel from this level, this way I could test massove number of objects relativelly fast. (all this needs to stay on CPU)

the whole visibility test is below 12 ms, (generate mipmaps from 512x512 array and testing) but the octree raycasting to get top level array (512x512) is currencly slooooow (40 ms if split into tiles and raycasted on all 8 cores, or 200 ms if raycasted on single core).

I have not found any decent algorithm that could be used to test visibility of bbox vs. svo - meybe this is the way to go ?

What kind of packet tracing algorithm are you looking for? SIMD-based packet tracing (for 2x2 ray packets through SSE, or 4x2 ray packets through AVX), or large packet tracing (16x16, or even more)?

Packet tracing techniques are very efficient for primary rays, for secondary... not so much (unless you try to re-arrange them into buckets that are approximately in the same direction).

And the answer is yes - there exist packet tracing algorithm for any acceleration structure (or even no acceleration structure). There is very efficient 4-ray vs. bbox test, 4-ray vs triangle test and 4-ray vs plane test (which says that there is also efficient 4-ray BVH, KD-tree, Octree, Grid, etc. traversal).

The correct question is - do you want to use primary rays only, or? Because the performance for packet tracing often drops for secondary rays (and standard 1-ray tracing wins there), and depending on that, improving performance for primary rays might not give you as much performance gain as expected.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Thanks for the info about 4 ray packets state of the art. I was thinking more in the range of 256 rays applications. Then I still do not understand why coherent? I though you try to simulate interference and diffraction.

What kind of packet tracing algorithm are you looking for? SIMD-based packet tracing (for 2x2 ray packets through SSE, or 4x2 ray packets through AVX), or large packet tracing (16x16, or even more)?

Packet tracing techniques are very efficient for primary rays, for secondary... not so much (unless you try to re-arrange them into buckets that are approximately in the same direction).

And the answer is yes - there exist packet tracing algorithm for any acceleration structure (or even no acceleration structure). There is very efficient 4-ray vs. bbox test, 4-ray vs triangle test and 4-ray vs plane test (which says that there is also efficient 4-ray BVH, KD-tree, Octree, Grid, etc. traversal).

The correct question is - do you want to use primary rays only, or? Because the performance for packet tracing often drops for secondary rays (and standard 1-ray tracing wins there), and depending on that, improving performance for primary rays might not give you as much performance gain as expected.

1] I want to trace PRIMARY rays only as stated above

2] I'v already have in my math library box vs. 4 rays SSE4.1 optimized (AVX is not an option currently couse engine needs to run on older core I processors)

(and the way visual studio handles avx (compiller flag :/) is not ideal either - so for now I stay with SSE)

I'v already tested 'bruteforce' approach - box vs. ray pocket and visit all nodes until leafs (leafs in my tree are just solid boxes, i have no triangle data asociated with tree as the tree was build from volumetric data (from scan)) that any of the ray can hit ...

The results are:

single ray, "An Efficient Parametric Algorithm for Octree Traversal" implementation (optimized for SSE) - 512x512 rays - 189ms (on single core) (average from 128 runs at random places in tree)

2x2 SSE packet (packet is (x,y),(x+1,y),(x,y+1),(x+1,y+1)) - 512x512 rays - 187ms (on single core) (average from 128 runs at random places in tree)

so the difference is negligible

(the first one is so fast couse it's smart about in what order check child nodes, but trying to mimic it for packet is a hell and leads to spagetti if/else construct with in turn is slower

than just checking all 8 nodes in a loop)

For primary rays I receive about 2.2x to 2.8x speedup on average (depending on the scene) when comparing SIMD 2x2 ray packet traversal vs. mono-ray traversal.

I've tested this for Bounding Volume Hierarchies (Surface Area Heuristics + Spatial Splits), Surface Area Heuristics KD-Tree, Split in the middle of longest axis KD-Tree (which is very close to what ochre is), Nested grid and Brute force.

That means that there is most likely something either wrong with your code, or compiler flags (or that you are bounded elsewhere than in computation power - like in memory access).

Alright, here is basic ray vs. aabb test:


bool IntrRayAABB(const Ray* r, const AABB* b, float* enter, float* exit)
{
    float4 tmin = (b->minimum - r->origin) * r->invdirection;
    float4 tmax = (b->maximum - r->origin) * r->invdirection;

    float4 tnear = f4min(tmin, tmax);
    float4 tfar = f4min(tmin, tmax);

    *enter = max(max(tnear.x, 0.0f), max(tnear.y, tnear.z));
    *exit = min(tfar.x, min(tfar.y, tfar.z));

    return (*exit > 0.0f && *enter < *exit);
}

Where:

  • Ray is a structure containing origin and invdirection (4D vectors stored in single 4x32f type).
  • AABB is a structure containing min and max points stored as 4D vectors
  • As a result we have 2 hits (entry and exit point) and a boolean

Moving this code to packet-based isn't as straight forward as one could think, first of all we would like to store rays a bit differently, as SoA (Struct of Arrays) instead of AoS (Arrays of Structs), which means:


struct Ray
{
    float4 origin;
    float4 direction;
    float4 invdirection;
};

struct Ray4
{
    float4 ox, oy, oz;
    float4 dx, dy, dz;
    float4 ix, iy, iz;
};

This allows us to do very effective computations! Compare:


float4 IntrRay4AABB(const Ray4* r, const AABB* b, float4* enter, float4* exit)
{
    float4 txmin = (float4(b->minimum.x) - r->ox) * r->ix;
    float4 tymin = (float4(b->minimum.y) - r->oy) * r->iy;
    float4 tzmin = (float4(b->minimum.z) - r->oz) * r->iz;
    float4 txmax = (float4(b->maximum.x) - r->ox) * r->ix;
    float4 tymax = (float4(b->maximum.y) - r->oy) * r->iy;
    float4 tzmax = (float4(b->maximum.z) - r->oz) * r->iz;
    float4 txnear = min(txmin, txmax);
    float4 txfar = max(txmin, txmax);
    float4 tynear = min(tymin, tymax);
    float4 tyfar = max(tymin, tymax);
    float4 tznear = min(tzmin, tzmax);
    float4 tzfar = max(tzmin, tzmax);
    *enter = max(max(txnear, float4(0.0f)), max(tynear, tznear));
    *exit = min(txfar, min(tyfar, tzfar));
    return ((*exit > float4(0.0f)) & (*enter < *exit));
}

Where:

  • Constructor float4(float) - uses _mm_set1_ps
  • Operator + and * - use _mm_sub_ps and _mm_mul_ps
  • Operators for minimum and maximum use _mm_min_ps and _mm_max_ps
  • Comparisons use _mm_cmplt_ps and _mm_cmpgt_ps
  • And at last and operation uses _mm_and_ps

That way we get good values in enter and exit, the returning float4 will have 0x00000000 value where there was no hit and 0xffffffff value where there was hit (this is easy to use for further masking in traversal; all you are going to need is the _mm_and_ps - note, you might need to mask enter/exit after this function!)

Also I don't guarantee it is 100% correct, I've tried to rewrite the code I'm using (which is in intrinsics only, they are hard to understand and hard to read - so copy/paste most likely won't work).

The whole idea behind this is, that you also need to use SoA instead of AoS to get some real speedup.

Now we have 2x2 ray packets going always from top of the tree into the scene and testing against AABBs. Can we do even better? Hell yes! I will just briefly describe the algorithm:

  • We work on 32x32 or 16x16 (or ..., I will use 32x32 as an example) ray packets, for there we build a frustum that is a bounding volume for all the rays in this packet
  • We start traversing this frustum through the scene (from root node), storing special stack (which is basically a todo for all rays in packet), there are 3 cases of test frustum vs. AABB
  1. There is no intersection between frustum and AABB - so we continue on with next node from stack
  2. The AABB intersects frustum
  3. The AABB is contained inside frustum
  • For 2. and 3. we store these on the special stack
  • Once our traversal stack is empty, we (most likely) have some nodes on special stack - for each 2x2 SIMD ray packet in the 32x32 packet we perform standard traversal, but my input stack will be a copy of that special stack (so we don't start from top of the whole scene tree, but we start with just bunch of small sub-trees)

Using this algorithm it is possible to achieve real time performance on modern CPUs (for primary ray tracing). The problem is, when rays in our packet are not coherent (the frustum becomes too large and we search large portions of scene tree ... mono-ray tracing results often in better speed by then).

Btw. My apologies for typos, I'm on Mac at the moment, and I'm not used to the keyboard yet.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

For primary rays I receive about 2.2x to 2.8x speedup on average (depending on the scene) when comparing SIMD 2x2 ray packet traversal vs. mono-ray traversal.

I've tested this for Bounding Volume Hierarchies (Surface Area Heuristics + Spatial Splits), Surface Area Heuristics KD-Tree, Split in the middle of longest axis KD-Tree (which is very close to what ochre is), Nested grid and Brute force.

That means that there is most likely something either wrong with your code, or compiler flags (or that you are bounded elsewhere than in computation power - like in memory access).

Alright, here is basic ray vs. aabb test:


bool IntrRayAABB(const Ray* r, const AABB* b, float* enter, float* exit)
{
    float4 tmin = (b->minimum - r->origin) * r->invdirection;
    float4 tmax = (b->maximum - r->origin) * r->invdirection;

    float4 tnear = f4min(tmin, tmax);
    float4 tfar = f4min(tmin, tmax);

    *enter = max(max(tnear.x, 0.0f), max(tnear.y, tnear.z));
    *exit = min(tfar.x, min(tfar.y, tfar.z));

    return (*exit > 0.0f && *enter < *exit);
}

Where:

  • Ray is a structure containing origin and invdirection (4D vectors stored in single 4x32f type).
  • AABB is a structure containing min and max points stored as 4D vectors
  • As a result we have 2 hits (entry and exit point) and a boolean

Moving this code to packet-based isn't as straight forward as one could think, first of all we would like to store rays a bit differently, as SoA (Struct of Arrays) instead of AoS (Arrays of Structs), which means:


struct Ray
{
    float4 origin;
    float4 direction;
    float4 invdirection;
};

struct Ray4
{
    float4 ox, oy, oz;
    float4 dx, dy, dz;
    float4 ix, iy, iz;
};

This allows us to do very effective computations! Compare:


float4 IntrRay4AABB(const Ray4* r, const AABB* b, float4* enter, float4* exit)
{
    float4 txmin = (float4(b->minimum.x) - r->ox) * r->ix;
    float4 tymin = (float4(b->minimum.y) - r->oy) * r->iy;
    float4 tzmin = (float4(b->minimum.z) - r->oz) * r->iz;
    float4 txmax = (float4(b->maximum.x) - r->ox) * r->ix;
    float4 tymax = (float4(b->maximum.y) - r->oy) * r->iy;
    float4 tzmax = (float4(b->maximum.z) - r->oz) * r->iz;
    float4 txnear = min(txmin, txmax);
    float4 txfar = max(txmin, txmax);
    float4 tynear = min(tymin, tymax);
    float4 tyfar = max(tymin, tymax);
    float4 tznear = min(tzmin, tzmax);
    float4 tzfar = max(tzmin, tzmax);
    *enter = max(max(txnear, float4(0.0f)), max(tynear, tznear));
    *exit = min(txfar, min(tyfar, tzfar));
    return ((*exit > float4(0.0f)) & (*enter < *exit));
}

Where:

  • Constructor float4(float) - uses _mm_set1_ps
  • Operator + and * - use _mm_sub_ps and _mm_mul_ps
  • Operators for minimum and maximum use _mm_min_ps and _mm_max_ps
  • Comparisons use _mm_cmplt_ps and _mm_cmpgt_ps
  • And at last and operation uses _mm_and_ps

That way we get good values in enter and exit, the returning float4 will have 0x00000000 value where there was no hit and 0xffffffff value where there was hit (this is easy to use for further masking in traversal; all you are going to need is the _mm_and_ps - note, you might need to mask enter/exit after this function!)

Also I don't guarantee it is 100% correct, I've tried to rewrite the code I'm using (which is in intrinsics only, they are hard to understand and hard to read - so copy/paste most likely won't work).

The whole idea behind this is, that you also need to use SoA instead of AoS to get some real speedup.

Now we have 2x2 ray packets going always from top of the tree into the scene and testing against AABBs. Can we do even better? Hell yes! I will just briefly describe the algorithm:

  • We work on 32x32 or 16x16 (or ..., I will use 32x32 as an example) ray packets, for there we build a frustum that is a bounding volume for all the rays in this packet
  • We start traversing this frustum through the scene (from root node), storing special stack (which is basically a todo for all rays in packet), there are 3 cases of test frustum vs. AABB
  1. There is no intersection between frustum and AABB - so we continue on with next node from stack
  2. The AABB intersects frustum
  3. The AABB is contained inside frustum
  • For 2. and 3. we store these on the special stack
  • Once our traversal stack is empty, we (most likely) have some nodes on special stack - for each 2x2 SIMD ray packet in the 32x32 packet we perform standard traversal, but my input stack will be a copy of that special stack (so we don't start from top of the whole scene tree, but we start with just bunch of small sub-trees)

Using this algorithm it is possible to achieve real time performance on modern CPUs (for primary ray tracing). The problem is, when rays in our packet are not coherent (the frustum becomes too large and we search large portions of scene tree ... mono-ray tracing results often in better speed by then).

Btw. My apologies for typos, I'm on Mac at the moment, and I'm not used to the keyboard yet.

all is fine, but you do not need to check AABB vs. ray in nodes, just in leafs (see (http://wscg.zcu.cz/wscg2000/Papers_2000/X31.pdf) in nodes you only compute TM and do switch over nodes).

All in all it seems that my real botleneck is memory :/

The whole tree is big (around 70MB of data - one block of memory, pointers free, just indices into that block) (6 bytes per node, 2 bits per leaf node).

I'v tested your idea to first 'clip' tree to camera frustum and make local copy of all nodes of interest and then do casting of them, in case when we are in tree and see relativelly

small piece of it - it gives speedup, since small amount of nodes is fetched (around 3 - 4 MB) so they then nicelly fit into cache, but if we are outside of tree, then hell the list will be of the entire tree size - so this is not a win.

My next idea is to 'reduce' the tree to not be so big, but for now I do not know how to do this. Tree is need for occlusion culling so the reduction needs to be conservative

to not provide false positives, and the main problem is that the original volume data contains a lot of thin (but large) walls - they are good occluders, but I see no way to 'reduce'

the deepth of the three and still have them.

This topic is closed to new replies.

Advertisement