Performance of an experimental SVO raycaster

25 comments, last by bcmpinc 8 years, 4 months ago
How do these guys get such a good performance?

The first movie shows a heightmap raycaster, which is quite easy to implement efficiently. The second movie, I don't know. Though I do see it paging geometry in and out of video memory when the camera rotates (with various gigabytes of geometry data, I will need to use paging for my GPU implementation as well). And the dark background is rendered at low resolution, which I expect to be a video compression artifact. There are some GPU raytracers out there, such as https://code.google.com/p/efficient-sparse-voxel-octrees/ (also check the external links on that page). However, it uses CUDA, so I didn't bother trying to get it running.


These were just some examples. I thought especially the second one would use the same octree depth-first traversal algorithm as I do.


I should note though that I haven't implemented it for the GPU yet, so I don't know how well such an implementation would work, only that it is possible.


I could try to implement it with D3D11 if you don't mind. If it works I would of course send you the code.

Yes, go ahead, I'd love to see a proof of concept implementation.

Here's an update:

I've implemented the BFS on the GPU with your old algorithm, as I don't fully understand the new one, and it looks promising. Moreover, I can confirm O(pixels) as far as I can tell. Or, more precisely, O(buffer size), where the buffer size is the amount of BFS data stored on the GPU. Up to 100 MB of BFS data I get 60 fps. And here is the problem: if I want to increase the number of pixels or voxels while keeping the buffer size fixed, the resulting frames get more and more errors.

One solution would be to run a BFS on the first few quadtree depths and then switch to DFS at high resolution.
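As a rough illustration of that hybrid idea (a sketch only, with hypothetical `children`/`is_leaf` callbacks, not the actual renderer): expand the first few levels breadth-first into a bounded frontier, then finish each frontier node depth-first.

```python
def hybrid_traverse(root, children, is_leaf, bfs_levels, visit):
    # BFS over the first few levels builds a bounded frontier,
    # analogous to the BFS buffer stored on the GPU.
    frontier = [root]
    for _ in range(bfs_levels):
        nxt = []
        for n in frontier:
            if is_leaf(n):
                nxt.append(n)
            else:
                nxt.extend(children(n))
        frontier = nxt
    # DFS from each frontier node down to the leaves,
    # so the buffer never has to hold the fine levels.
    for n in frontier:
        stack = [n]
        while stack:
            m = stack.pop()
            if is_leaf(m):
                visit(m)
            else:
                stack.extend(children(m))
```

The DFS part needs only a small per-thread stack, which is why it avoids the buffer-overflow errors that a pure BFS hits at high resolution.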

I can post the code, if there is a need for that. Maybe my code has a major flaw.

Here you can see the errors at a resolution of 512x512:

[attachment=29434:werrors.PNG]

Here a frame of the same octree and same buffer size with a resolution of 256x256:

[attachment=29435:woerrors.PNG]

Both run at 60 fps.

I think you're on the right track. It renders correctly, but with holes, which are caused by pruning too many octree nodes. So you have to find a way to fill those. They can be filled, for example, by rendering your 512x512 picture on top of the 256x256 picture. But there are two better ways to do it. One, you can stop traversing the octree to prevent the buffer from overflowing, which is what causes the pruning. Two, you can render the pruned octree nodes to a lower-resolution image and use that as a background for your final image. You can also combine those methods.

So basically you add a LOD mechanism for geometry that isn't far away, but is less visible because it's obscured by other geometry, and thus you can get away with rendering it at a lower detail level.
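A minimal sketch of the background-filling idea, assuming holes are marked as `None` and the low-res image is the same scene rendered at 1/scale resolution from the pruned nodes:

```python
def composite(hires, lores, scale):
    # Fill holes (None pixels) left by pruned octree nodes in the
    # high-res frame with pixels from the low-res background pass.
    h, w = len(hires), len(hires[0])
    return [
        [hires[y][x] if hires[y][x] is not None
         else lores[y // scale][x // scale]
         for x in range(w)]
        for y in range(h)
    ]
```

Each hole pixel simply samples the background at the corresponding coarse coordinate, which is what makes the filled regions appear blocky when the resolution gap is large.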

The rectangles are quite visible, so I guess you stop some octree traversals an iteration too early.

If you want some better geometry to test on, please take a look at: https://bcmpinc.wordpress.com/2013/12/10/benchmarks/. It is the Sibenik dataset converted to a point cloud. Once you've converted it into the octree format you're using, you can use it to test your renderer. It looks like this (the individual voxels are somewhat noticeable on the pillar on the left):

[image: 53vn.png]

I've implemented both solutions (by reducing the stack to 3 and drawing the whole quadtree node if an overflow occurs).

Here's a picture of the church without these at a resolution of 512x512:

[attachment=29506:20fps.PNG]

Here's one with both active:

[attachment=29507:16fps.PNG]

Both run at about 20 fps. One could get better performance by reducing the size of the buffer, but this would reduce the quality even more. The same is true for increasing the voxel density (it's now at 2^12 per octree side).

I also found out that a fixed-point implementation of bcmpinc's (old) algorithm runs on the GPU at the same speed as a floating-point one. I find that a bit strange, as I thought the GPU is optimized for floating-point operations.

I would also like to say that I'm not really a GPU wizard (like these people here: https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf). I'm just doing this as a hobby. So it's quite possible that there are a dozen optimizations one could make to the code. If someone wants to see and/or optimize my code, I could upload it here or somewhere else.

@bcmpinc There's something strange about the point data of the model. The coordinates aren't normalized; they're all between 30800 and 34740. So an octree depth of 13 or greater doesn't make sense, because of the precision loss.
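For illustration, normalizing such coordinates before building the octree could look like this (a sketch; the range endpoints are just the values observed above):

```python
def normalize(coords, lo=30800.0, hi=34740.0):
    # Map the dataset's raw coordinate range onto [0, 1] so the full
    # octree depth actually contributes precision.
    return [(c - lo) / (hi - lo) for c in coords]
```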

The performance is kind of disappointing. Getting 20 fps at 512x512 is worse than the 10 fps at 1024x768 I get with a single CPU core. I also expected the hole filling to be less noticeable. Though I'm sure there is still a lot of optimization you can do. I doubt I can be of much help here, due to my lack of knowledge of Direct3D, but please share the code, so that I and other people can have a look at it.

I didn't create the dataset, though I guess the coordinates are between 30800 and 34740 because otherwise it would be prohibitively large. What I do, though, is treat the leaf nodes of the octree as nodes whose child-node pointers all point back to the leaf itself. That way you can always keep traversing the octree, even if you've hit the resolution limit.

Regarding the GPU's fixed-point performance: GPU programming has become so popular that I can imagine they have native support for integer operations. Or they're just really good at emulating integer operations using floating-point operations.

edit: nothing here. my fault

The pixel resolution has a minor impact on the performance. One could run this program at 60 fps at a resolution of 1024x768; the quality, however, would be really bad. This, for example, runs at 60 fps (without hole filling):

[attachment=29510:60fps.PNG]

I also forgot to mention that my graphics card is kind of old. I use an NVIDIA GeForce GTX 550 Ti, which has a theoretical performance of 690 GFLOPS (https://en.wikipedia.org/wiki/GeForce_500_series), 10x more than my CPU. So it should be able to outperform my CPU on a tree traversal algorithm.

Anyway, here's the whole shader code:


#define B_0000  0
#define B_0001  1
#define B_0010  2
#define B_0011  3
#define B_0100  4
#define B_0101  5
#define B_0110  6
#define B_0111  7
#define LAST 0x80000000

#define MAXDEPTH 2
#define MAXDEPTH2 3
#define CALLS 256

#define MINPIX 2048

cbuffer matrixBuff : register( b0 )
{
	uint size : packoffset(c0); //buffer size per thread
	uint pxsize : packoffset(c0.y); //pixel size
	uint2 puffer : packoffset(c0.z);
};

struct BFSDataINT
{
	int4 ab; // quadtree projection points
	int4 dadb; // change of the projection
	uint pointer; //pointer to the octree node
};

StructuredBuffer<uint> octree : register(t0);
StructuredBuffer<BFSDataINT> intrace : register(t1);
StructuredBuffer<uint> incount : register(t2);

RWTexture2D<float4> BufferOut : register(u0);
RWStructuredBuffer<BFSDataINT> outtrace : register(u1);
RWStructuredBuffer<uint> outcount : register(u2);

/*
Decode morton code from
https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/
*/
uint Compact1By1(uint x)
{
  x &= 0x55555555;                  // x = -f-e -d-c -b-a -9-8 -7-6 -5-4 -3-2 -1-0
  x = (x ^ (x >>  1)) & 0x33333333; // x = --fe --dc --ba --98 --76 --54 --32 --10
  x = (x ^ (x >>  2)) & 0x0f0f0f0f; // x = ---- fedc ---- ba98 ---- 7654 ---- 3210
  x = (x ^ (x >>  4)) & 0x00ff00ff; // x = ---- ---- fedc ba98 ---- ---- 7654 3210
  x = (x ^ (x >>  8)) & 0x0000ffff; // x = ---- ---- ---- ---- fedc ba98 7654 3210
  return x;
}

uint DecodeMorton2X(uint code)
{
  return Compact1By1(code >> 0);
}

uint DecodeMorton2Y(uint code)
{
  return Compact1By1(code >> 1);
}

uint2 DecodeMorton(uint code)
{
	return uint2(DecodeMorton2X(code),
				DecodeMorton2Y(code));
}

bool hasChildren(int ptr)
{
	return octree[ptr] & LAST;
}

uint getChild(int ptr, int now)
{
	if (now != 0)
		return ptr + octree[ptr + now];
	else
		return ptr + 8;
}

bool isFilled(int ptr)
{
	return !(octree[ptr] & 0x2);
}

float4 getColor(int ptr)
{
	uint val = octree[ptr] >> 8;
	return float4((val & 0xFF) / (float)0xFF, ((val >> 8) & 0xFF) / (float)0xFF, ((val >> 16) & 0x7F) / (float)0x7F, 1.0f);
}

[numthreads(CALLS, 1, 1)]
void CSPASS( uint3 GID : SV_GroupID, uint3 TID : SV_GroupThreadID )
{
	const int I1 = 0x10000;
	const int I2 = 0x20000;
	const uint indices[8] = { B_0000, B_0001, B_0010, B_0100, B_0011, B_0101, B_0110, B_0111 };
	uint resultcount = 0;
	uint idx = GID.x * CALLS + TID.x;
	
	for (uint i = 0; i < incount[idx / 4]; ++i)
	{
		//Prepare
		int offset = i + size * (idx / 4);
		int startptr = intrace[offset].pointer;
		BFSDataINT _strace = intrace[offset];
		BFSDataINT strace;

		strace.ab = 
		int4( (idx & 1) == 0 ? _strace.ab.x : (_strace.ab.x + _strace.ab.z) / 2,
				(idx & 2) == 0 ? _strace.ab.y : (_strace.ab.y + _strace.ab.w) / 2,
				(idx & 1) == 0 ? (_strace.ab.x + _strace.ab.z) / 2 : _strace.ab.z,
				(idx & 2) == 0 ? (_strace.ab.y + _strace.ab.w) / 2 : _strace.ab.w);

		strace.dadb = 
		int4( (idx & 1) == 0 ? _strace.dadb.x : (_strace.dadb.x + _strace.dadb.z) / 2,
				(idx & 2) == 0 ? _strace.dadb.y : (_strace.dadb.y + _strace.dadb.w) / 2,
				(idx & 1) == 0 ? (_strace.dadb.x + _strace.dadb.z) / 2 : _strace.dadb.z,
				(idx & 2) == 0 ? (_strace.dadb.y + _strace.dadb.w) / 2 : _strace.dadb.w);

		strace.pointer = startptr;

		if (strace.ab.z - strace.ab.x > I2)
		{
			//must split
			int resultoffset = resultcount + (size / 4) * idx;
			outtrace[resultoffset] = strace;
			++resultcount;
			if (resultcount == size / 4)
			{
				outcount[idx] = resultcount;
				if(pxsize < MINPIX)
				{
					uint2 code = DecodeMorton(idx) * pxsize;
					for(uint _y = 0; _y < pxsize; ++_y)
					for(uint _x = 0; _x < pxsize; ++_x)
					{
						BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
					}
				}
				return;
			}
		}
		else
		{

		//Start trace
		int depth = 0;
		uint _now[MAXDEPTH + 1]; //0...7
		BFSDataINT trace[MAXDEPTH + 1];
		bool haschildren[MAXDEPTH + 1];
		for (int j = 0; j < MAXDEPTH + 1; ++j)
		{
			_now[j] = 0;
		}
		trace[depth] = strace;
		haschildren[depth] = hasChildren(trace[depth].pointer);

		while (depth >= 0)
		{
			if (depth < MAXDEPTH)
			{
				if (haschildren[depth])
				{
					uint itrace = indices[_now[depth]];
					BFSDataINT nexttrace;
					nexttrace.pointer = getChild(trace[depth].pointer, itrace);
					haschildren[depth + 1] = hasChildren(nexttrace.pointer);
					if(haschildren[depth + 1] || isFilled(nexttrace.pointer))
					{
						BFSDataINT nowtrace = trace[depth];
                                                //compute next projection
						nexttrace.ab = (itrace & 0x1 ? nowtrace.ab : nowtrace.ab - strace.dadb) * 2
									+ int4((itrace & 0x2) ? -I1 : I1,
											(itrace & 0x4) ? -I1 : I1,
											(itrace & 0x2) ? -I1 : I1,
											(itrace & 0x4) ? -I1 : I1);
							
						//traverse
						++_now[depth];
                                                //if octree is inside projection
						if (nexttrace.ab.z - nexttrace.ab.x > 0 && nexttrace.ab.z > -I1 && nexttrace.ab.x - strace.dadb.x * 2 <= I1
							&& nexttrace.ab.w > -I1 && nexttrace.ab.y - strace.dadb.y * 2 <= I1)
						{
								
							if (nexttrace.ab.z - nexttrace.ab.x <= I2)
							{
								nexttrace.dadb = int4(0, 0, 0, 0);
								//push stack
								_now[depth + 1] = 0;
								trace[depth + 1] = nexttrace;
								++depth;
							}
							else
							{
								nexttrace.dadb = strace.dadb;
								//must split => return result to next pass
								int resultoffset = resultcount + (size / 4) * idx;
								outtrace[resultoffset] = nexttrace;
								++resultcount;
								if (resultcount == size / 4)
								{
									outcount[idx] = resultcount;
									if(pxsize < MINPIX)
									{
										uint2 code = DecodeMorton(idx) * pxsize;
										for(uint _y = 0; _y < pxsize; ++_y)
										for(uint _x = 0; _x < pxsize; ++_x)
										{
											BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
										}
									}
									return;
								}
							}
						}
					}
					else
						++_now[depth];
				}
				else if (isFilled(trace[depth].pointer))
				{
					trace[depth].dadb = strace.dadb;
					//add to result
					int resultoffset = resultcount + (size / 4) * idx;
					outtrace[resultoffset] = trace[depth];
					++resultcount;
					if (resultcount == size / 4)
					{
						outcount[idx] = resultcount;
						if(pxsize < MINPIX)
						{
							uint2 code = DecodeMorton(idx) * pxsize;
							for(uint _y = 0; _y < pxsize; ++_y)
							for(uint _x = 0; _x < pxsize; ++_x)
							{
								BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
							}
						}
						return;
					}
					//pop stack
					--depth;
				}
				else
				{
					//pop stack
					--depth;
				}
			}
			else
			{
				trace[depth].dadb = strace.dadb;
				//add to result
				int resultoffset = resultcount + (size / 4) * idx;
				outtrace[resultoffset] = trace[depth];
				++resultcount;
				if (resultcount == size / 4)
				{
					outcount[idx] = resultcount;
					if(pxsize < MINPIX)
					{
						uint2 code = DecodeMorton(idx) * pxsize;
						for(uint _y = 0; _y < pxsize; ++_y)
						for(uint _x = 0; _x < pxsize; ++_x)
						{
							BufferOut[code + uint2(_x, _y)] = getColor(intrace[size * (idx / 4)].pointer);
						}
					}
					return;
				}
				//pop stack
				--depth;
			}
			while (depth >= 0 && _now[depth] == 8) // test depth first to avoid reading _now[-1]
			{
				//end of traversal
				if (depth == 0)
				{
					depth = -1;
					break;
				}
				//pop stack
				--depth;
			}
		}

		}
	}
	outcount[idx] = resultcount;
}


[numthreads(CALLS, 1, 1)]
void CSLASTPASS( uint3 GID : SV_GroupID, uint3 TID : SV_GroupThreadID )
{
	const int I1 = 0x10000;
	const int I2 = 0x20000;
	const uint indices[8] = { B_0000, B_0001, B_0010, B_0100, B_0011, B_0101, B_0110, B_0111 };
	uint idx = GID.x * CALLS + TID.x;
	uint gid = idx / (pxsize * pxsize);
	uint pix = idx % (pxsize * pxsize);
	uint2 local = uint2(pix % pxsize, pix / pxsize);
	uint2 code = DecodeMorton(gid);
	uint2 globa = code * pxsize;
	float2 div = local / (float)pxsize;
	float2 di2 = (local + uint2(1,1)) / (float)pxsize;
	for (uint i = 0; i < incount[gid]; ++i)
	{
		//Prepare
		int offset = i + size * gid;
		int startptr = intrace[offset].pointer;
		BFSDataINT _strace = intrace[offset];
		BFSDataINT strace;

		strace.ab = 
		int4(_strace.ab.x * (1.0f - div.x) + _strace.ab.z * div.x,
				_strace.ab.y * (1.0f - div.y) + _strace.ab.w * div.y,
				_strace.ab.x * (1.0f - di2.x) + _strace.ab.z * di2.x,
				_strace.ab.y * (1.0f - di2.y) + _strace.ab.w * di2.y);

		strace.dadb = 
		int4(_strace.dadb.x * (1.0f - div.x) + _strace.dadb.z * div.x,
				_strace.dadb.y * (1.0f - div.y) + _strace.dadb.w * div.y,
				_strace.dadb.x * (1.0f - di2.x) + _strace.dadb.z * di2.x,
				_strace.dadb.y * (1.0f - di2.y) + _strace.dadb.w * di2.y);

		strace.pointer = startptr;
		if (strace.ab.z - strace.ab.x > I2)
		{
			BufferOut[globa + local] = getColor(startptr);
			return;
		}
		else
		{
			//Start trace
			int depth = 0;
			uint _now[MAXDEPTH2 + 1]; //0...7
			BFSDataINT trace[MAXDEPTH2 + 1];
			bool haschildren[MAXDEPTH2 + 1];
			for (int j = 0; j < MAXDEPTH2 + 1; ++j) // arrays here are sized MAXDEPTH2 + 1
			{
				_now[j] = 0;
			}
			trace[depth] = strace;
			haschildren[depth] = hasChildren(trace[depth].pointer);
			while (depth >= 0)
			{
				if(depth < MAXDEPTH2)
				{
					if (haschildren[depth])
					{
						uint itrace = indices[_now[depth]];
						BFSDataINT nexttrace;
						nexttrace.pointer = getChild(trace[depth].pointer, itrace);
						haschildren[depth + 1] = hasChildren(nexttrace.pointer);
						if(haschildren[depth + 1] || isFilled(nexttrace.pointer))
						{
							BFSDataINT nowtrace = trace[depth];
                                                        //compute next projection
							nexttrace.ab = (itrace & 0x1 ? nowtrace.ab : nowtrace.ab - strace.dadb) * 2
									+ int4((itrace & 0x2) ? -I1 : I1,
											(itrace & 0x4) ? -I1 : I1,
											(itrace & 0x2) ? -I1 : I1,
											(itrace & 0x4) ? -I1 : I1);
							//traverse
							++_now[depth];
                                                        //if octree is inside projection
							if (nexttrace.ab.z - nexttrace.ab.x > 0 && nexttrace.ab.z > -I1 && nexttrace.ab.x - strace.dadb.x * 2 <= I1
								&& nexttrace.ab.w > -I1 && nexttrace.ab.y - strace.dadb.y * 2 <= I1)
							{
								if (nexttrace.ab.z - nexttrace.ab.x <= I2)
								{
									nexttrace.dadb = int4(0, 0, 0, 0);
									//push stack
									_now[depth + 1] = 0;
									trace[depth + 1] = nexttrace;
									++depth;
								}
								else
								{
									BufferOut[globa + local] = getColor(nexttrace.pointer);
									return;
								}
							}
						}
						else
							++_now[depth];
					}
					else if (isFilled(trace[depth].pointer))
					{
						BufferOut[globa + local] = getColor(trace[depth].pointer);
						return;
					}
					else
					{
						//pop stack
						--depth;
					}
				}
				else
				{
					BufferOut[globa + local] = getColor(trace[depth].pointer);
					return;
				}
				while (depth >= 0 && _now[depth] == 8) // test depth first to avoid reading _now[-1]
				{
					//end of traversal
					if (depth == 0)
					{
						depth = -1;
						break;
					}
					//pop stack
					--depth;
				}
			}
		}
	}
}

This is the fixed-point version. Just replace I1 with 1.0f and I2 with 2.0f, and every int4 with float4, to get the floating-point one. The main problem with every tree traversal algorithm is that one has to rely heavily on branching, which is painfully slow on the GPU because of its SIMD architecture.
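That substitution works because I1 = 0x10000 is 1.0 in 16.16 fixed point (16 integer bits, 16 fractional bits). A minimal sketch of the representation, for reference:

```python
ONE = 0x10000  # the shader's I1: 1.0 in 16.16 fixed point

def to_fixed(f):
    return int(round(f * ONE))

def to_float(i):
    return i / ONE

def fx_mul(a, b):
    # The raw product carries 32 fractional bits; shift back down to 16.
    return (a * b) >> 16
```

Additions and comparisons on 16.16 values are plain integer operations, which is why swapping int4 for float4 is the only change the shader needs.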

If there are questions regarding the code, just ask.

That's still more than the 300 GFLOPS that I have with my NVIDIA GT 555M. Though 300 GFLOPS / 60 fps / (1920x1080) is still about 2411 floating-point operations per pixel, which is a lot more than the ~36 instructions per pixel that I get out of a single core of my 2.2 GHz CPU.
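The arithmetic behind that per-pixel budget, as a quick check:

```python
flops = 300e9          # GT 555M theoretical throughput, per the post
fps = 60
pixels = 1920 * 1080   # 2,073,600 pixels per frame

budget = flops / fps / pixels  # floating-point operations per pixel
```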

That my old algorithm, the recursive one, doesn't perform well on the GPU isn't really a surprise, though. Because of its recursive nature, I guess there is often only one unit per GPU workgroup that is actually performing any work. Still, it's quite impressive that you got it working at all, because GPUs don't support recursion. I barely recognize my algorithm in it. You did quite some work to reduce the number of registers that the algorithm uses, which is good, as using an array to store the stack uses up so many registers that it kills the GPU's latency-hiding techniques.

I'm making some progress with the BFS-based version using OpenGL 4.5. Currently I'm considering whether I should keep track of the bounds of each octree node, or instead track the position of the octree node and use that to recompute its bounds. It makes sense to pick the latter, because I have to sort on the depth of the node and thus need to keep track of that value anyway. Though that also means I'll need to switch to floating-point computations, and I'm unsure whether floats will be accurate enough.

