# Performance of an experimental SVO raycaster


## Recommended Posts

Hi folks

I've made a GPU raycaster based on bcmpinc's algorithm: https://bcmpinc.wordpress.com/. In this PDF, he/she describes how it works: https://app.box.com/s/rxvvymcz4nfygvs6fz7a.

I used HLSL and DirectCompute for the computation. So here is what I've come up with:

#define I1 1.0f

//trace order indices:
//the algorithm begins with the voxel 000
//then it looks at the voxels 010, 001 or 100, 001, depending on the direction of the ray
//etc
const uint start[5] = { 0, 1, 3, 5, 6 };
const uint indices1[6] = { B_0000, B_0100, B_0001, B_0101, B_0110, B_0111 };
const uint indices2[6] = { B_0000, B_0010, B_0001, B_0011, B_0110, B_0111 };

...

//_n is a stack of pointers to the visited voxels
//_x is a float2 stack, which stores the ray plane intersections.
//the voxel planes are parallel to the current cubemap side
//_now is a stack of uint, that indicate which voxels are already traversed by the algorithm
//dir is a float2 that stores the direction of the ray
iter = 0;
while (depth != -1)
{
    ++iter;
    //safety first
    if (iter == 200)
    {
        return float4(0.0f, 0.0f, 1.0f, 1.0f);
    }
    if (isLeaf(_n[depth]))
    {
        return float4(iter / 64.0f, 0, iter > 100 ? 1.0f : 0.0f, 1);
    }
    if (_now[depth] == 4)
    {
        //pop stack
        --depth;
    }
    else
    {
        //go to all next voxels
        bool found = false;
        for (uint i = start[_now[depth]]; i < start[_now[depth] + 1]; ++i)
        {
            //get index of the next voxel
            uint trace = _x[depth].x * dir.y < _x[depth].y * dir.x ? indices1[i] : indices2[i];
            //get intersections with voxel
            _x[depth + 1] = ((trace & B_0001 ? _x[depth] : (_x[depth] - dir)) * 2.0f)
                + float2((trace & B_0010) ? -I1 : I1, (trace & B_0100) ? -I1 : I1);
            //if ray intersects voxel
            if (_x[depth + 1].x >= -I1 && _x[depth + 1].y >= -I1 && _x[depth + 1].x - 2.0f * dir.x < I1 && _x[depth + 1].y - 2.0f * dir.y < I1)
            {
                //get pointer to next voxel
                uint ne = getChild(_n[depth], trace, flip, rot);
                if (!isNothing(ne))
                {
                    //traverse to next set of voxels
                    ++(_now[depth]);
                    //push stack
                    ++depth;
                    _n[depth] = ne;
                    //start at first voxel
                    _now[depth] = 0;
                    found = true;
                    break;
                }
            }
        }
        if (!found)
        {
            //traverse to next set of voxels
            ++(_now[depth]);
        }
    }
}
return float4(iter / 64.0f, 0, iter > 100 ? 1.0f : 0.0f, 1);

This algorithm projects the octree onto a cubemap, which then gets rendered to the screen.

I also use the original algorithm to raycast at a lower resolution on the CPU and use the output as a starting point for the pixels on the GPU.

The problem is, even with an octree depth of 7, I get low framerates, especially if the camera is close to and facing the surface, because a lot of pixels are hitting voxels. Pretracing at a lower resolution doesn't really help either; I get the same framerates with and without this acceleration structure. Is there some major optimization I've forgotten? How do these guys get such good performance? In fact, this raycaster should perform better than ordinary ones, as the algorithm I use only approximates cubes. From afar one shouldn't notice that.

Here are some pictures of the raycaster:

Here is a picture without optimization:

The amount of red is proportional to the number of iterations per pixel. If a pixel exceeds 100 iterations, it turns purple.
Here is a picture with the acceleration structure:

##### Share on other sites

I'm not sure if it will make a difference, but bcmpinc changed his method partway through: he ditched the cube map and instead simply projects the screen onto the octree and does frustum checks using a quadtree hierarchical z-buffer. In his latest source code he doesn't have the cubemap as part of the main render loop, and his PDF describes the old technique he was using. He stated that just projecting the screen onto the octree was a lot faster than creating the cube map, so it might help to see what his current method is doing, unless you specifically want to use the cube map and trace rays.

Lastly, the main premise of his algorithm was to get rid of divisions in the code and to not use the concept of rays (in which he succeeded: no perspective division at all in the code), so I'm not entirely sure why you are raycasting when attempting to mimic his algorithm. He has a couple of posts on getting his algorithm onto the GPU that might be worth checking out if you're interested.

##### Share on other sites

In his latest source code he doesn't have the cubemap as part of the main render loop, and his PDF describes the old technique he was using.

I don't really use a cubemap either. I just identify, for each screen ray, which side of the cube it would correspond to. I've only drawn a link between my algorithm and bcmpinc's old one by saying that I project the octree onto a cubemap, which is effectively what I do.

in which he succeeded: no perspective division at all in the code

That's impressive, but I can't seem to find where bcmpinc is mentioning this.

I'm not entirely sure why you are raycasting when attempting to mimic his algorithm

I'm not trying to mimic his algorithm; I just took inspiration from it for this GPU raycaster. The benefit I see with it is that the ray-voxel intersection is much simpler and the front-to-back sorting of the octree comes for free. This should theoretically be an advantage, but in practice it isn't. The question is: why?

##### Share on other sites


I had the same questions about how his code works; you can view our exchange in the comment section of the following link (https://bcmpinc.wordpress.com/2013/11/03/a-sorting-based-rendering-method/). My user name was D.V.D and I asked him in more detail exactly how he does his rendering.

From what I understand, he basically projects the screen onto each corner and finds how far in each axis (l, r, t, b) it is from the screen bounds to the corner. He then uses this to interpolate, in world space, how these bounds change when he subdivides the octree into its children (this is all linear since it's done in world space). At this point, he checks his quadtree to see if the current octree node is inside the frustum bounds of the current quadtree node. If it is, he subdivides the quadtree into 4 children and creates a sub-frustum for each new quadtree node, then calls the rendering function recursively on each of the quadtree children with each of the 4 sub-frustums. If he ever finds the octree node to be larger than the current frustum, he recursively splits the octree into its 8 children and calls the rendering function recursively on the octree children. Once he hits a leaf node of the quadtree, if the current octree node is inside the leaf's frustum, he just sets the pixel's color to the octree node's color.

It basically becomes a method of splitting up the view frustum in world space until eventually each sub-frustum is a pixel in size. At that point, if a node is inside the frustum bounds, he just sets the pixel in the quadtree leaf. He doesn't have to do any division because all of this (the frustum checks and so on) is done in world space.

As to the benefits of his algorithm: in pure software, with SIMD and a single thread, he gets between 5-10 fps, so this isn't a really fast algorithm, unfortunately. On a GPU, I guess you can skip the quadtree entirely by just doing the frustum checks for each pixel in parallel, but then that just amounts to raycasting, and you retraverse the octree for each pixel. His algorithm does all of this hierarchically, so it traverses the octree once, but that makes it very linear. I haven't ever worked with GPU volume rendering, so I'm not sure why your approach is slow, but I did my own version of bcmpinc's renderer in software, so if you know how his algorithm works, it might help identify why your approach is slow.

##### Share on other sites


So it's basically like his original algorithm. The only difference is that the intersection planes of the octree aren't parallel to the view plane.

On a GPU, I guess you can skip the quadtree entirely by just doing the frustum checks for each pixel in parallel, but then that just amounts to raycasting, and you retraverse the octree for each pixel.

That's essentially what I'm doing right now, and other GPU raycasters also do it this way. So why is my code so slow in comparison to them? That's the question.

His algorithm does all of this hierarchically, so it traverses the octree once, but that makes it very linear.

This is the main issue when porting bcmpinc's algorithm to the GPU: he traverses an octree and builds up a quadtree at the same time. I don't see how one could program the GPU to do that efficiently. The only thing I know is that he wants to implement his algorithm with a breadth-first search instead of a depth-first search, but I can't make out the benefit of a breadth-first search on the GPU.

##### Share on other sites

Oh, and by the way, I'm female.

he/she

Somehow I knew that, but I wasn't sure.

These elements can all be processed in parallel and without the need of a stack, which means that it fits really well on a GPU.

So for every quadtree node, you send the pointer to the current octree node and start a BFS from there to find the next octree nodes. After that the quadtree splits and this process begins again. Am I right? If yes, then there is still the problem of building up the quadtree, either on the CPU or the GPU. The problem on the CPU would be that you have to make a BFS pass for every quadtree node, which I'd think would result in a big overhead. On the GPU you would have to deal with a lot of synchronization to build a quadtree (I'm not sure about this one; correct me if I'm wrong).

##### Share on other sites

No, I start the BFS at the root node of the quadtree, and the BFS is then executed in multiple shader passes. For each quadtree node I process a list of octree nodes, as there can be, and often are, multiple octree nodes per quadtree node. I traverse the octree nodes if necessary, sort them, and distribute them over the quadtree child nodes. This way you only need to synchronize between the layers of the quadtree, of which there are at most 12. There is still an issue with unbounded memory usage, but that can be solved by imposing a (quadtree-layer dependent) limit on the number of octree nodes per quadtree node. That introduces some rendering artifacts, though I expect those to be unnoticeable in practice. Picking the right limits also has the interesting effect that the algorithm's running time becomes O(pixels). See also this comment: https://bcmpinc.wordpress.com/2015/08/09/moving-towards-a-gpu-implementation/#comment-215.

##### Share on other sites

So for every quadtree node, you send the pointer to the current octree node and start a BFS from there to find the next octree nodes. After that the quadtree splits and this process begins again.

I start the BFS at the root node of the quadtree, and the BFS is then executed in multiple shader passes. For each quadtree node I process a list of octree nodes, as there can be, and often are, multiple octree nodes per quadtree node.

OK, this is roughly the same idea I had. Anyway, I have one question: does the traversal of an octree node inside the list of octree nodes happen in one thread? Also, you stated in one of your articles on your blog that you might use atomic counters. I'm not sure, but I think these are really performance-heavy (http://stackoverflow.com/questions/22367238/cuda-atomic-operation-performance-in-different-scenarios). You could use a prefix sum instead.

##### Share on other sites

Does the traversal of an octree node inside the list of octree nodes happen in one thread?

I would process one quadtree node and its entire list of octree nodes in one computation unit, as they're called on the GPU.

Also, you stated in one of your articles on your blog that you might use atomic counters

I figured out that I don't need them.

I should note, though, that I haven't implemented it for the GPU yet, so I don't know how well such an implementation would work, only that it is possible.

Edited by bcmpinc

##### Share on other sites
How do these guys get such good performance?

The first movie shows a heightmap raycaster, which is quite easy to implement efficiently. The second movie, I don't know. Though I do see it paging geometry in and out of video memory when the camera rotates (with several gigabytes of geometry data, I will need to use paging for my GPU implementation as well). And there is a low resolution in the dark background, which I expect to be a video compression artifact. There are some GPU raytracers out there, such as https://code.google.com/p/efficient-sparse-voxel-octrees/ (also check the external links on that page). However, it uses CUDA, so I didn't bother trying to get it running.

Edited by bcmpinc

##### Share on other sites

These were just some examples. I thought especially the second one would use the same octree depth-first traversal algorithm as I do.

I should note, though, that I haven't implemented it for the GPU yet, so I don't know how well such an implementation would work, only that it is possible.

I could try to implement it with D3D11 if you don't mind. If it works I would of course send you the code.

##### Share on other sites

I could try to implement it with D3D11 if you don't mind. If it works I would of course send you the code.

Yes, go ahead, I'd love to see a proof of concept implementation.

##### Share on other sites

Here's an update:

I've implemented the BFS on the GPU with your old algorithm, as I don't fully understand the new one, and it looks promising. Moreover, I can confirm O(pixels) as far as I can tell. Or, more precisely, O(buffer size), where the buffer size is the amount of BFS data stored on the GPU. Up to 100 MB of BFS data I get 60 fps. And here is the problem: if I increase the number of pixels or voxels and keep the buffer size fixed, the resulting frames get more and more errors.

One solution would be to run a BFS on the first few quadtree depths and then switch to DFS at high resolution.

I can post the code, if there is a need for that. Maybe my code has a major flaw.

Here you can see the errors at a resolution of 512x512:

[attachment=29434:werrors.PNG]

Here a frame of the same octree and same buffer size with a resolution of 256x256:

[attachment=29435:woerrors.PNG]

Both run at 60 fps.

##### Share on other sites
I think you're on the right track. It renders correctly, but with holes, which are caused by pruning too many octree nodes. So you have to find a way to fill those. They can be filled, for example, by rendering your 512x512 picture on top of the 256x256 picture. But there are two better ways to do it. One, you can stop traversing the octree to prevent the buffer from overflowing, which is what causes the pruning. Two, you can render the pruned octree nodes to a lower-resolution image and use that as a background for your final image. You can also combine those methods.

So basically you add a LOD mechanism for geometry that isn't far away, but is less visible because it's obscured by other geometry, and thus you can get away with rendering it at a lower detail level.

The rectangles are quite visible, so I guess you stop some octree traversals an iteration too early.

If you want some better geometry to test on, please take a look at: https://bcmpinc.wordpress.com/2013/12/10/benchmarks/. It is the Sibenik dataset converted to a point cloud. Once you've converted it into the octree format you're using, you can use it to test your renderer. It looks like this (the individual voxels are somewhat noticeable on the pillar on the left):

##### Share on other sites

I've implemented both solutions (by reducing the stack to 3 and drawing the whole quadtree node if an overflow occurs).

Here's a picture of the church without these at a resolution of 512x512:

[attachment=29506:20fps.PNG]

Here's one with both active:

[attachment=29507:16fps.PNG]

Both run at about 20 fps. One could get better performance by reducing the size of the buffer, but that would reduce the quality even more. The same is true for increasing the voxel density (it's now at 2^12 per octree side).

I also found out that a fixed-point implementation of bcmpinc's (old) algorithm runs on the GPU at the same speed as a floating-point one. I find that a bit strange, as I thought the GPU was optimized for floating-point operations.

I would also like to say that I'm not really a GPU wizard (like these people here: https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf); I'm just doing this as a hobby. So it's quite possible that there are a dozen optimizations one could make to the code. If someone wants to see and/or optimize my code, I could upload it here or somewhere else.

@bcmpinc There's something strange about the point data of the model. The coordinates aren't normalized; they're all between 30800 and 34740. So an octree depth of 13 or greater doesn't make sense, because of that precision loss.

##### Share on other sites

The performance is kind of disappointing. Getting 20 fps at 512x512 is worse than the 10 fps at 1024x768 I get with a single CPU core. I also expected the hole filling to be less noticeable. Though I'm sure there is still a lot of optimization you can do. I doubt I can be of much help here, due to my lack of knowledge of Direct3D, but please do share the code, so I and other people can have a look at it.

I didn't create the dataset, though I guess the coordinates are between 30800 and 34740 because otherwise it would be prohibitively large. What I do, though, is treat the leaf nodes of the octree as nodes whose child-node pointers all point to the node itself. That way you can always keep traversing your octree, even if you've hit the resolution limit.

Regarding the GPU's fixed-point performance: GPU programming has become so popular that I can imagine they have native support for integer operations. Or they're just really good at emulating integer operations using floating-point operations.

##### Share on other sites

edit: nothing here. my fault

Edited by _undex

##### Share on other sites

The pixel resolution has a minor impact on the performance. One could run this program at 60 fps at a resolution of 1024x768; the quality, however, would be really bad. This, for example, runs at 60 fps (without hole filling):

[attachment=29510:60fps.PNG]

I also forgot to mention that my graphics card is kind of old. I use an NVIDIA GeForce GTX 550 Ti, which has a theoretical performance of 690 GFLOPS (https://en.wikipedia.org/wiki/GeForce_500_series), 10x more than my CPU. It should be able to outperform my CPU on a tree traversal algorithm.

Anyway, here's the whole shader code:

#define B_0000  0
#define B_0001  1
#define B_0010  2
#define B_0011  3
#define B_0100  4
#define B_0101  5
#define B_0110  6
#define B_0111  7
#define LAST 0x80000000

#define MAXDEPTH 2
#define MAXDEPTH2 3
#define CALLS 256

#define MINPIX 2048

cbuffer matrixBuff : register( b0 )
{
uint size : packoffset(c0); //buffer size per thread
uint pxsize : packoffset(c0.y); //pixel size
uint2 puffer : packoffset(c0.z);
};

struct BFSDataINT
{
int4 ab; // quadtree projection points
int4 dadb; // change of the projection
uint pointer; //pointer to the octree node
};

StructuredBuffer<uint> octree : register(t0);
StructuredBuffer<BFSDataINT> intrace : register(t1);
StructuredBuffer<uint> incount : register(t2);

RWTexture2D<float4> BufferOut : register(u0);
RWStructuredBuffer<BFSDataINT> outtrace : register(u1);
RWStructuredBuffer<uint> outcount : register(u2);

/*
Decode morton code from
https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/
*/
uint Compact1By1(uint x)
{
x &= 0x55555555;                  // x = -f-e -d-c -b-a -9-8 -7-6 -5-4 -3-2 -1-0
x = (x ^ (x >>  1)) & 0x33333333; // x = --fe --dc --ba --98 --76 --54 --32 --10
x = (x ^ (x >>  2)) & 0x0f0f0f0f; // x = ---- fedc ---- ba98 ---- 7654 ---- 3210
x = (x ^ (x >>  4)) & 0x00ff00ff; // x = ---- ---- fedc ba98 ---- ---- 7654 3210
x = (x ^ (x >>  8)) & 0x0000ffff; // x = ---- ---- ---- ---- fedc ba98 7654 3210
return x;
}

uint DecodeMorton2X(uint code)
{
return Compact1By1(code >> 0);
}

uint DecodeMorton2Y(uint code)
{
return Compact1By1(code >> 1);
}

uint2 DecodeMorton(uint code)
{
return uint2(DecodeMorton2X(code),
DecodeMorton2Y(code));
}

bool hasChildren(int ptr)
{
return octree[ptr] & LAST;
}

uint getChild(int ptr, int now)
{
if (now != 0)
return ptr + octree[ptr + now];
else
return ptr + 8;
}

bool isFilled(int ptr)
{
return !(octree[ptr] & 0x2);
}

float4 getColor(int ptr)
{
uint val = octree[ptr] >> 8;
return float4((val & 0xFF) / (float)0xFF, ((val >> 8) & 0xFF) / (float)0xFF, ((val >> 16) & 0x7F) / (float)0x7F, 1.0f);
}

void CSPASS( uint3 GID : SV_GroupID, uint3 TID : SV_GroupThreadID )
{
const int I1 = 0x10000;
const int I2 = 0x20000;
const uint indices[8] = { B_0000, B_0001, B_0010, B_0100, B_0011, B_0101, B_0110, B_0111 };
uint resultcount = 0;
uint idx = GID.x * CALLS + TID.x;

for (uint i = 0; i < incount[idx / 4]; ++i)
{
//Prepare
int offset = i + size * (idx / 4);
int startptr = intrace[offset].pointer;
BFSDataINT _strace = intrace[offset];
BFSDataINT strace;

strace.ab =
int4( (idx & 1) == 0 ? _strace.ab.x : (_strace.ab.x + _strace.ab.z) / 2,
(idx & 2) == 0 ? _strace.ab.y : (_strace.ab.y + _strace.ab.w) / 2,
(idx & 1) == 0 ? (_strace.ab.x + _strace.ab.z) / 2 : _strace.ab.z,
(idx & 2) == 0 ? (_strace.ab.y + _strace.ab.w) / 2 : _strace.ab.w);

strace.pointer = startptr;

if (strace.ab.z - strace.ab.x > I2)
{
//must split
int resultoffset = resultcount + (size / 4) * idx;
outtrace[resultoffset] = strace;
++resultcount;
if (resultcount == size / 4)
{
outcount[idx] = resultcount;
if(pxsize < MINPIX)
{
uint2 code = DecodeMorton(idx) * pxsize;
for(uint _y = 0; _y < pxsize; ++_y)
for(uint _x = 0; _x < pxsize; ++_x)
{
BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
}
}
return;
}
}
else
{

//Start trace
int depth = 0;
uint _now[MAXDEPTH + 1]; //0...7
BFSDataINT trace[MAXDEPTH + 1];
bool haschildren[MAXDEPTH + 1];
for (int j = 0; j < MAXDEPTH + 1; ++j)
{
_now[j] = 0;
}
trace[depth] = strace;
haschildren[depth] = hasChildren(trace[depth].pointer);

while (depth >= 0)
{
if (depth < MAXDEPTH)
{
if (haschildren[depth])
{
uint itrace = indices[_now[depth]];
BFSDataINT nexttrace;
nexttrace.pointer = getChild(trace[depth].pointer, itrace);
haschildren[depth + 1] = hasChildren(nexttrace.pointer);
if(haschildren[depth + 1] || isFilled(nexttrace.pointer))
{
BFSDataINT nowtrace = trace[depth];
//compute next projection
nexttrace.ab = (itrace & 0x1 ? nowtrace.ab : nowtrace.ab - strace.dadb) * 2
+ int4((itrace & 0x2) ? -I1 : I1,
(itrace & 0x4) ? -I1 : I1,
(itrace & 0x2) ? -I1 : I1,
(itrace & 0x4) ? -I1 : I1);

//traverse
++_now[depth];
//if octree is inside projection
if (nexttrace.ab.z - nexttrace.ab.x > 0 && nexttrace.ab.z > -I1 && nexttrace.ab.x - strace.dadb.x * 2 <= I1
&& nexttrace.ab.w > -I1 && nexttrace.ab.y - strace.dadb.y * 2 <= I1)
{

if (nexttrace.ab.z - nexttrace.ab.x <= I2)
{
nexttrace.dadb = int4(0, 0, 0, 0);
//push stack
_now[depth + 1] = 0;
trace[depth + 1] = nexttrace;
++depth;
}
else
{
//must split => return result to next pass
int resultoffset = resultcount + (size / 4) * idx;
outtrace[resultoffset] = nexttrace;
++resultcount;
if (resultcount == size / 4)
{
outcount[idx] = resultcount;
if(pxsize < MINPIX)
{
uint2 code = DecodeMorton(idx) * pxsize;
for(uint _y = 0; _y < pxsize; ++_y)
for(uint _x = 0; _x < pxsize; ++_x)
{
BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
}
}
return;
}
}
}
}
else
++_now[depth];
}
else if (isFilled(trace[depth].pointer))
{
int resultoffset = resultcount + (size / 4) * idx;
outtrace[resultoffset] = trace[depth];
++resultcount;
if (resultcount == size / 4)
{
outcount[idx] = resultcount;
if(pxsize < MINPIX)
{
uint2 code = DecodeMorton(idx) * pxsize;
for(uint _y = 0; _y < pxsize; ++_y)
for(uint _x = 0; _x < pxsize; ++_x)
{
BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
}
}
return;
}
//pop stack
--depth;
}
else
{
//pop stack
--depth;
}
}
else
{
int resultoffset = resultcount + (size / 4) * idx;
outtrace[resultoffset] = trace[depth];
++resultcount;
if (resultcount == size / 4)
{
outcount[idx] = resultcount;
if(pxsize < MINPIX)
{
uint2 code = DecodeMorton(idx) * pxsize;
for(uint _y = 0; _y < pxsize; ++_y)
for(uint _x = 0; _x < pxsize; ++_x)
{
BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
}
}
return;
}
//pop stack
--depth;
}
while (depth >= 0 && _now[depth] == 8) //check depth first so _now[-1] is never read
{
//end of traversion
if (depth == 0)
{
depth = -1;
break;
}
//pop stack
--depth;
}
}

}
}
outcount[idx] = resultcount;
}

void CSLASTPASS( uint3 GID : SV_GroupID, uint3 TID : SV_GroupThreadID )
{
const int I1 = 0x10000;
const int I2 = 0x20000;
const uint indices[8] = { B_0000, B_0001, B_0010, B_0100, B_0011, B_0101, B_0110, B_0111 };
uint idx = GID.x * CALLS + TID.x;
uint gid = idx / (pxsize * pxsize);
uint pix = idx % (pxsize * pxsize);
uint2 local = uint2(pix % pxsize, pix / pxsize);
uint2 code = DecodeMorton(gid);
uint2 globa = code * pxsize;
float2 div = local / (float)pxsize;
float2 di2 = (local + uint2(1,1)) / (float)pxsize;
for (uint i = 0; i < incount[gid]; ++i)
{
//Prepare
int offset = i + size * gid;
int startptr = intrace[offset].pointer;
BFSDataINT _strace = intrace[offset];
BFSDataINT strace;

strace.ab =
int4(_strace.ab.x * (1.0f - div.x) + _strace.ab.z * div.x,
_strace.ab.y * (1.0f - div.y) + _strace.ab.w * div.y,
_strace.ab.x * (1.0f - di2.x) + _strace.ab.z * di2.x,
_strace.ab.y * (1.0f - di2.y) + _strace.ab.w * di2.y);

strace.pointer = startptr;
if (strace.ab.z - strace.ab.x > I2)
{
BufferOut[globa + local] = getColor(startptr);
return;
}
else
{
//Start trace
int depth = 0;
uint _now[MAXDEPTH2 + 1]; //0...7
BFSDataINT trace[MAXDEPTH2 + 1];
bool haschildren[MAXDEPTH2 + 1];
for (int j = 0; j < MAXDEPTH2 + 1; ++j) //the stacks in this pass have MAXDEPTH2 + 1 entries
{
_now[j] = 0;
}
trace[depth] = strace;
haschildren[depth] = hasChildren(trace[depth].pointer);
while (depth >= 0)
{
if(depth < MAXDEPTH2)
{
if (haschildren[depth])
{
uint itrace = indices[_now[depth]];
BFSDataINT nexttrace;
nexttrace.pointer = getChild(trace[depth].pointer, itrace);
haschildren[depth + 1] = hasChildren(nexttrace.pointer);
if(haschildren[depth + 1] || isFilled(nexttrace.pointer))
{
BFSDataINT nowtrace = trace[depth];
//compute next projection
nexttrace.ab = (itrace & 0x1 ? nowtrace.ab : nowtrace.ab - strace.dadb) * 2
+ int4((itrace & 0x2) ? -I1 : I1,
(itrace & 0x4) ? -I1 : I1,
(itrace & 0x2) ? -I1 : I1,
(itrace & 0x4) ? -I1 : I1);
//traverse
++_now[depth];
//if octree is inside projection
if (nexttrace.ab.z - nexttrace.ab.x > 0 && nexttrace.ab.z > -I1 && nexttrace.ab.x - strace.dadb.x * 2 <= I1
&& nexttrace.ab.w > -I1 && nexttrace.ab.y - strace.dadb.y * 2 <= I1)
{
if (nexttrace.ab.z - nexttrace.ab.x <= I2)
{
nexttrace.dadb = int4(0, 0, 0, 0);
//push stack
_now[depth + 1] = 0;
trace[depth + 1] = nexttrace;
++depth;
}
else
{
BufferOut[globa + local] = getColor(nexttrace.pointer);
return;
}
}
}
else
++_now[depth];
}
else if (isFilled(trace[depth].pointer))
{
BufferOut[globa + local] = getColor(trace[depth].pointer);
return;
}
else
{
//pop stack
--depth;
}
}
else
{
BufferOut[globa + local] = getColor(trace[depth].pointer);
return;
}
while (depth >= 0 && _now[depth] == 8) //check depth first so _now[-1] is never read
{
//end of traversion
if (depth == 0)
{
depth = -1;
break;
}
//pop stack
--depth;
}
}
}
}
}



This is the fixed-point version. Just replace I1 with 1.0f, I2 with 2.0f, and every int4 with float4 to get the floating-point one. The main problem with every tree traversal algorithm is that one has to rely heavily on branching, which is painfully slow on the GPU because of its SIMD architecture.

If there are questions regarding the code, just ask.

##### Share on other sites

That's still more than the 300 GFLOPS I have with my NVIDIA GT 555M. Though 300 GFLOPS / 60 fps / (1920x1080) is still 2411 floating-point operations per pixel, which is a lot more than the ~36 instructions per pixel that I get out of a single core of my 2.2 GHz CPU.

Though that my old algorithm, the recursive one, doesn't perform well on the GPU isn't really a surprise. Because of its recursive nature, I guess there is often only one unit per GPU workgroup that is actually performing any work. Still, it's quite impressive that you got it working at all, because GPUs don't support recursion. I barely recognize my algorithm in it. You did quite some work to reduce the number of registers the algorithm uses, which is good, as using an array to store the stack uses up so many registers that it kills the GPU's latency-hiding techniques.

I'm making some progress with the BFS-based version using OpenGL 4.5. Currently I'm considering whether I should keep track of the bounds of each octree node or instead track the position of the octree node and use that to recompute its bounds. It makes sense to pick the latter, because I have to sort on the depth of the node and thus need to keep track of that value anyway. Though that also means that I'll need to switch to floating-point computations. And I'm unsure whether floats will be accurate enough.

##### Share on other sites

Though that my old algorithm, the recursive one, ...

I thought the new one is also recursive. You have to search for the next octree nodes given the octree nodes from the last BFS pass, and this search is recursive. Or does that change if one projects the octree onto a viewplane instead of a cubemap?

I'm considering whether I should keep track of the bounds of each octree node or instead track the position of the octree node and use that to recompute its bounds. It makes sense to pick the latter, because I have to sort on the depth of the node and thus need to keep track of that value anyway. Though that also means that I'll need to switch to floating-point computations. And I'm unsure whether floats will be accurate enough.

I also guess that the second one is faster, since (global) memory access is very expensive on the GPU, especially if it's not coalesced, which is true for your shader code (I think). I can remember that one access takes 500 GPU cycles to complete, though I can't find where I read this. So recomputing stuff should be faster; after all, the GPU is optimized to execute float operations as fast as possible.

Floats should be accurate enough. I ran my code with both float and int and they produced the same output. However, in the end there is only one way to find out...

##### Share on other sites

I can remember that one access takes 500 GPU cycles to complete, though I can't find where I read this.

Can be as high as 1000 divided by occupancy. Memory-heavy shaders benefit greatly from optimizing for increased occupancy / reduced GPR counts.
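A toy model of that rule of thumb (the formula and numbers are assumptions for illustration, not vendor data):

```python
# Toy model: the latency a warp effectively sees grows as occupancy drops,
# because fewer resident warps are available to hide the stall.
def effective_latency(raw_latency_cycles, occupancy):
    """occupancy: fraction of the hardware's maximum resident warps, in (0, 1]."""
    return raw_latency_cycles / occupancy

print(effective_latency(500, 1.0))  # full occupancy: 500.0 cycles
print(effective_latency(500, 0.5))  # half occupancy: 1000.0 cycles
```

This is why cutting register (GPR) usage can pay off more than cutting ALU work in a memory-heavy shader: more resident warps means more latency hiding.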

##### Share on other sites

I thought that the new one is also recursive. You have to search the next octree nodes given the octree nodes from the last BFS pass. And this search is recursive. Or does it change if one projects the octree on a viewplane instead of a cubemap?

Ah, I forgot about the cubemap-based version. So there is the cubemap algorithm, which is recursive/uses DFS. There is the improved version that bypasses the cubemap and renders directly to the viewplane, but is still recursive/uses DFS. And there is the newest version, where I try to switch from DFS to BFS.

I've settled on storing the position as floats. So far they seem to be accurate enough, and they're also a lot easier to deal with. The results look promising, with 15 ms per frame at 1024x768, though that is without sorting and still with too much pruning. This is what it looks like so far (not depicted: the flickering that is visible in parts of the image):

##### Share on other sites

And there is the newest version, where I try to switch from DFS to BFS.

Ah, I see. So what I did is that I searched, for every given octree node, the next octree nodes for which the current quadtree node has to split, because those next octree nodes are too small. This search is depth-first. These octree nodes are the starting points for the next BFS pass. So only the quadtree building is breadth-first with respect to the octree; in between it's depth-first. So the number of BFS passes is the depth of the quadtree.
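The hybrid scheme described above could be sketched like this on the CPU (all names here, such as `Node` and `max_size_for_layer`, are illustrative assumptions, not the actual shader code):

```python
# Sketch of the hybrid: one BFS pass per quadtree layer; inside each pass,
# a depth-first search descends the octree until nodes are small enough
# for that layer; the survivors seed the next BFS pass.
class Node:
    def __init__(self, size, children=()):
        self.size = size
        self.children = list(children)

def hybrid_traversal(root, quadtree_depth, max_size_for_layer):
    frontier = [root]
    for layer in range(quadtree_depth):
        next_frontier = []
        for node in frontier:
            stack = [node]                      # the in-between search is depth-first
            while stack:
                n = stack.pop()
                if n.size > max_size_for_layer(layer) and n.children:
                    stack.extend(n.children)    # node still too big: descend the octree
                else:
                    next_frontier.append(n)     # starting point for the next BFS pass
        frontier = next_frontier
    return frontier
```

So the number of BFS passes equals the quadtree depth, with a DFS nested inside each pass, as described.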

You made both searches breadth-first, am I right? So how do you know when to split the quadtree (and with it the BFS buffer), and when to stop with the BFS passes?

So far the results look promising, with 15ms per frame at 1024x768. Though that is without sorting and still too much pruning.

I agree, it looks promising: it's already faster than the mixture of DFS and BFS at the same level of quality.

##### Share on other sites

Ah, I see. So what I did is that I searched, for every given octree node, the next octree nodes for which the current quadtree node has to split, because those next octree nodes are too small. This search is depth-first. These octree nodes are the starting points for the next BFS pass.

Yes, you still need to start with a DFS, but you only need to do this for the root node of the quadtree, and you can and should do this step on the CPU.

So only the quadtree building is breadth-first with respect to the octree; in between it's depth-first. So the number of BFS passes is the depth of the quadtree.

The BFS is indeed used for traversing or building the quadtree. So, yes, the number of BFS passes is the depth of the quadtree. However, you don't need DFS in between, because you need at most one octree traversal per quadtree layer.

If you compute the bounding box in the viewport-aligned plane that contains the point of the octree node nearest to the viewer, then, each time you traverse one level of the octree, the sides of the bounding box shrink by at least a factor of 2. When you go to the next layer of the quadtree, you want to halve the size of the bounding boxes of the octree nodes, which is thus accomplished by a single octree traversal. Though in some cases you don't even need the octree traversal.
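A small numeric sketch of that size argument (assumed setup: projected box sides measured in pixels):

```python
# Each octree traversal at least halves the projected bounding-box side,
# matching the factor-2 shrink of the quadtree cells per layer.
def side_after_descents(initial_side, octree_descents):
    return initial_side / (2 ** octree_descents)

# A node projecting to a 64-pixel box reaches 1 pixel after 6 descents,
# i.e. one octree traversal per quadtree layer:
print(side_after_descents(64, 6))  # 1.0
```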

I agree, it looks promising, as it's already faster at the same level of quality as the mixture of DFS and BFS.

Yes, though so far I don't see how to get it working properly while still being fast. The dark gray noise and flickering were caused by a buffer overflow. Then there are the gaps at the edges of the geometry, which are a lot more visible than I expected, though I have a few ideas on how to get rid of them. Getting the octree nodes sorted front to back seems to be the biggest problem. There is so much data to process that a separate sorting pass that just reads the data and writes it back, without doing any actual sorting, is already too slow. And it gets a lot worse when you actually try to sort.