Performance of an experimental SVO raycaster

Started by
25 comments, last by bcmpinc 8 years, 4 months ago
Hi folks
I've made a GPU-raycaster based of bcmpinc's algorithm: https://bcmpinc.wordpress.com/. In this pdf, he/she descibes how it works: https://app.box.com/s/rxvvymcz4nfygvs6fz7a.
I used HLSL and DirectCompute for the computation. So here is what I've come up with:


#define I1 1.0f

//trace order indices:
//the algorithm begins with the voxel 000
//than it looks at the voxels 010,001 or 100,001, depending on the direction of the ray
//etc
const uint start[5] = { 0, 1, 3, 5, 6 };
const uint indices1[6] = { B_0000, B_0100, B_0001, B_0101, B_0110, B_0111 };
const uint indices2[6] = { B_0000, B_0010, B_0001, B_0011, B_0110, B_0111 };

...

//_n is a stack of pointers to the visited voxels
//_x is a float2 stack, which stores the ray plane intersections. 
//the voxel planes are parallel to the current cubemap side
//_now is a stack of uint, that indicate which voxels are already traversed by the algorithm
//dir is a float2 that stores the direction of the ray
iter = 0;
while (depth != -1)
{
	++iter;
	//safety first
	if(iter == 200)
	{
		return float4(0.0f,0.0f,1.0f,1.0f);
	}
	if (isLeaf(_n[depth]))
	{
		return float4(iter/64.0f, 0, iter > 100 ? 1.0f : 0.0f, 1);
	}
        if (_now[depth] == 4)
	{
		//pop stack
		--depth;
	}
	else
	{
		//goto all next voxels
		bool found = false;
		for (uint i = start[_now[depth]]; i < start[_now[depth] + 1]; ++i)
		{
			//get index of the next voxel
			uint trace = _x[depth].x * dir.y < _x[depth].y * dir.x ? indices1[i] : indices2[i];						//get intersections with voxel
			_x[depth + 1] = ((trace & B_0001 ? _x[depth] : (_x[depth] - dir)) * 2.0f)
					+ float2((trace & B_0010) ? -I1 : I1, (trace & B_0100) ? -I1 : I1);
			//if ray intersects voxel
			if(_x[depth + 1].x >= -I1 && _x[depth + 1].y >= -I1 && _x[depth + 1].x - 2.0f * dir.x < I1 && _x[depth + 1].y - 2.0f * dir.y < I1)
			{
				//get pointer to next voxel
				uint ne = getChild(_n[depth], trace, flip, rot);
				if (!isNothing(ne))
				{
					//traverse to next set of voxels
					++(_now[depth]);
					//push stack
					++depth;
					_n[depth] = ne;
					//start at first voxel
					_now[depth] = 0;
					found = true;
					break;
				}
			}
		}
		if (!found)
		{
			//traverse to next set of voxels
			++(_now[depth]);
		}
	}
}
return float4(iter/64.0f, 0, iter > 100 ? 1.0f : 0.0f, 1);
This algorithm projects the octree on a cubemap. This cubemap will then get rendered to the screen.
I also use the original algorithm to raycast with lower resolution on the cpu and use the output as a starting point for the pixels on the GPU.
The problem is, even with a octree depth of 7, I get low framerates, especially if the camera is close to and facing the surface, because a lot of pixels are hitting voxels. Also pretracing with lower resolution doesn't really help. I get the same framerates with and without this acceleration structur. Is there some major optimization I forgotten? How do these guys
get such a good performance? In fact this raycaster should perform better as ordinary ones, as the algorithm I use only approximates cubes. From a far one shouldn't notice that.
Here are some pictures of the raycaster:
LHpEfoX.png
Here is a picture without optimization:
5QUSSqP.png
The amount of red is porportional to the number of iterations per pixel. If a pixel exceeds 100 iterations, it turn purple.
Here is a picture with the acceleration structur:
L4wrMWj.png
Advertisement

I'm not sure if it will make a difference but bcmpinc changed his method part way through to ditch using the cube map and instead, simply project the screen on to the octree and do frustum checks using a quadtree hierarchal z buffer. In his latest source code he doesn't have the cubemap as part of the main render loop and his pdf is the old technique which he was using. He stated that the new method of just projecting the screen on to the octree was a lot faster than creating the cube map so it might help to see what his current method is doing unless you specifically want to use the cube map and trace rays.

Lastly, the main premise of his algorithm was to get rid of divisions in the code and to not use the concept of rays (which he succeeded, no perspective division at all in the code) so I'm not entirely sure why you are raycasting when attempting to mimic his algorithm. He has a couple of posts on getting his algorithm to the GPU that might be worth checking out if your interested.

In his latest source code he doesn't have the cubemap as part of the main render loop and his pdf is the old technique which he was using.

I also don't really use a cubemap. I just identify for each screen-ray which side of the cube it would correspond. I've just drawn a link between my and bcmpinc's old algorithm in saying that I project the octree on a cubemap, what I effectively do.

which he succeeded, no perspective division at all in the code

That's impressive, but I can't seem to find where bcmpinc is mentioning this.

I'm not entirely sure why you are raycasting when attempting to mimic his algorithm

I don't try to mimic his algorithm. I just got inspiration of it for this gpu-raycaster. The benefit I see with it, is that the ray voxel intersection is much simpler and the front to back sorting of the octree is automatically given. This should theoretically be a benifit, but in practice it isn't. The question is: Why is that?

In his latest source code he doesn't have the cubemap as part of the main render loop and his pdf is the old technique which he was using.

I also don't really use a cubemap. I just identify for each screen-ray which side of the cube it would correspond. I've just drawn a link between my and bcmpinc's old algorithm in saying that I project the octree on a cubemap, what I effectively do.

which he succeeded, no perspective division at all in the code

That's impressive, but I can't seem to find where bcmpinc is mentioning this.

I'm not entirely sure why you are raycasting when attempting to mimic his algorithm

I don't try to mimic his algorithm. I just got inspiration of it for this gpu-raycaster. The benefit I see with it, is that the ray voxel intersection is much simpler and the front to back sorting of the octree is automatically given. This should theoretically be a benifit, but in practice it isn't. The question is: Why is that?

I had the same questions as to how his code works and you can view our exchange in the comment section in the following link (https://bcmpinc.wordpress.com/2013/11/03/a-sorting-based-rendering-method/). My user name was D.V.D and I ask him in more detail exactly how he does his rendering.

From what I understand, he basically projects the screen on to each corner and finds how far in each axis (l,r,t,b) it is from the screen bounds to the corner. He then uses this to interpolate in world space how these bounds change when he subdivides the octree to its children (this is all linear since its done in world space). At this point, he checks his quadtree if the current octree node is inside the frustum bounds of the current quadtree node. If it is, he subdivides the quadtree into 4 children and creates a sub frustum for each new quadtree node. He then calls the rendering function recursively on each of the quadtree children with each of the 4 sub frustum's and continues rendering. If he ever finds the octree node to be larger than the current frustum, he recursively splits the octree into its 8 children and calls the rendering function recursively on the octree children. Once he hits a leaf node for the quadtree, if the current octree node is inside the leafs frustum then he just sets the pixels color to the octree nodes color.

It basically becomes a method of splitting up the view frustum in world space until eventually each sub frustum is a pixel in size. At that point if a node is inside the frustum bounds, he just sets the pixel in the quadtree leaf. He doesn't have to do any division because all of this is done in worldspace (the frustum checks and what not).

As to the benefits of his algorithm, in complete software with simd and a single thread, he gets between 5-10 fps so this isn't a really fast algorithm unfortunately. On a GPU, I guess you can skip the quadtree entirely by just doing the frustum checks for each pixel in parallel but then that just amounts to raycasting and you retraverse the octree for each pixel. His algorithm does all of this hierarchally so it traverses the octree once but that makes it very linear. I haven't ever worked with GPU volume rendering so I'm not sure why your approach is slow but I did my own version of bcmpinc's renderer in software so if you know how his algorithm works, it might help identifying why your approach is slow.

From what I understand, he basically projects the screen on to each corner and finds how far in each axis (l,r,t,b) it is from the screen bounds to the corner. He then uses this to interpolate in world space how these bounds change when he subdivides the octree to its children (this is all linear since its done in world space). At this point, he checks his quadtree if the current octree node is inside the frustum bounds of the current quadtree node. If it is, he subdivides the quadtree into 4 children and creates a sub frustum for each new quadtree node. He then calls the rendering function recursively on each of the quadtree children with each of the 4 sub frustum's and continues rendering. If he ever finds the octree node to be larger than the current frustum, he recursively splits the octree into its 8 children and calls the rendering function recursively on the octree children. Once he hits a leaf node for the quadtree, if the current octree node is inside the leafs frustum then he just sets the pixels color to the octree nodes color.

So it's basically like his original algortihm. The only difference is that the intersection planes of the octree aren't parallel to the view plane.

On a GPU, I guess you can skip the quadtree entirely by just doing the frustum checks for each pixel in parallel but then that just amounts to raycasting and you retraverse the octree for each pixel.

That's essentially what I'm doing right now. And other GPU raycaster also do it this way. So why is my code so slow in comparison to them? That's the question.

His algorithm does all of this hierarchally so it traverses the octree once but that makes it very linear.

This is the main issue when porting bcmpinc's algorithm to the GPU. I mean he traverses an octree and builds up a quadtree at the same time. I don't see how one could program the GPU to do that efficiently. The only thing I know is that he wants to implement his algorithm with a breadth-first search instead of a depth-first search, but I can't seem to make out the benifit of a breadth-first search on the GPU.

Hi there, let me join this discussion.

He stated that the new method of just projecting the screen on to the octree was a lot faster than creating the cube map

Without the cubemap it was about equally fast, or even slower. However, it simplified the algorithm a bit and also improved the rendering quality, as the cubemap would always cause some amount of blurring. Oh, and by the way, I'm female.

The only thing I know is that he wants to implement his algorithm with a breadth-first search instead of a depth-first search, but I can't seem to make out the benifit of a breadth-first search on the GPU.

The problem with DFS is that, since it is recursive, it won't work well on a GPU. As recursive algorithms cannot really be parallelized, and furthermore, recursion requires a stack, which is something GPU's don't like either. With BFS you get a queue of elements to process. These elements can all be processed in parallel and without the need of a stack, which means that it fits really well on a GPU. So by switching from DFS to BFS, it is suddenly possible to port it to GPU and expect decent performance. I haven't found time to work on it though, as I'm also doing a PhD on a completely different topic.

These elements can all be processed in parallel and without the need of a stack, which means that it fits really well on a GPU.

So for every quadtree node, you send the pointer to the current octree node and start from there a BFS, to find the next octree. After that the quadtree splits and this process begins again. Am I right? If yes than there is still the problem of building up a quadtree, either on the CPU or the GPU. The problem on the CPU would be that you have to make a pass of BFS for every quadtree node, so I could think that this would result in a big overhead. On the GPU you would have to deal with a lot of synchronization to build a quadtree (I'm not sure with this one. Correct me if I'm wrong.)





No, I start the BFS in the rootnode of the quadtree and then the BFS is executed in multiple shader passes. For each quadtreenode I process a list of octree nodes, as there can be and often are multiple octree nodes per quadtree node. I traverse the octree nodes if necessary; sort them and distribute them over the quadtree-childnodes. This way you only need to synchronize between each layer of the quadtree, of which there are at most 12. Then there is still some issue with unbounded memory usage, but that can be solved by imposing a (quadtree-layer dependent) limit on the number of octree nodes per quadtree node. That means that some rendering artifacts are introduced, though I expect those to be unnoticable in practice. Picking the right limits also has the interesting effect that the algorithm's running time becomes O(pixels). See also this comment: https://bcmpinc.wordpress.com/2015/08/09/moving-towards-a-gpu-implementation/#comment-215.


So for every quadtree node, you send the pointer to the current octree node and start from there a BFS, to find the next octree. After that the quadtree splits and this process begins again.


I start the BFS in the rootnode of the quadtree and then the BFS is executed in multiple shader passes. For each quadtreenode I process a list of octree nodes, as there can be and often are multiple octree nodes per quadtree node.

Ok this is roughly the same idea I had. Anyway, I have one question: Does the traversion of an octreenode inside the list of octreenodes happen in one thread? Also you stated in one of your articles on your blog that you might use atomic counter. I'm not sure but I think that these are really performance heavy (http://stackoverflow.com/questions/22367238/cuda-atomic-operation-performance-in-different-scenarios). You could use a prefix sum instead.

Does the traversion of an octreenode inside the list of octreenodes happen in one thread?

I would process one quadtree node and its entire list of octreenodes in one computation unit, as they're called on the GPU.

Also you stated in one of your articles on your blog that you might use atomic counter

I figured out that I don't need them.

I should note though that I haven't implemented it for the GPU yet, so I don't know how well such implementation would work, only that it is possible.

This topic is closed to new replies.

Advertisement