Performance of an experimental SVO raycaster

Graphics and GPU Programming Programming

Started by Kirill Alpin September 21, 2015 11:24 PM

25 comments, last by bcmpinc 8 years, 4 months ago

Kirill Alpin

325

Author

November 19, 2015 10:02 PM

Though that my old algorithm, the recursive one, ...

I thought that the new one is also recursive. You have to search the next octree nodes given the octree nodes from the last BFS pass. And this search is recursive. Or does it change if one projects the octree on a viewplane instead of a cubemap?

I'm considering whether I should keep track of the bounds of each octree node or instead track the position of the octree node and use that to recompute its bounds. It makes sense to pick the later, because I have to sort on the depth of the node and thus need to keep track of that value anyway. Though that also means that I'll need to switch to floating point computations. And I'm unsure whether floats will be accurate enough.

I also guess that the second one is faster, since (global) memory access is very expensive on the gpu, especially if its not coalesced, which is true for your shader code (i think). I can remember that one access takes 500 gpu cycles to complete, though I can't find where I read this. So recomputing stuff should be faster, after all the gpu is optimized to calculate float operation as fast as possible.

Floats should be accurate enough. I ran my code with both float and int and they produced the same output. However in the end, there is only one way to find out...

Hodgman

52,717

November 19, 2015 10:34 PM

I can remember that one access takes 500 gpu cycles to complete, though I can't find where I read this.

Can be as high as 1000 divided by occupancy. Memory-heavy shaders benefit greatly from optimizing for increased occupancy / reduced GPR counts.

. 22 Racing Series .

bcmpinc

247

November 21, 2015 09:12 PM

I thought that the new one is also recursive. You have to search the next octree nodes given the octree nodes from the last BFS pass. And this search is recursive. Or does it change if one projects the octree on a viewplane instead of a cubemap?

Ah, forgot about the cubemap based version. So there is the cubemap algorithm, which is recursive/uses DFS. There is the improved version that bypasses the cubemap and renders directly to the viewplane, but still is recursive/uses DFS. And there is the newest version, where I try to switch from DFS to BFS.

I've settled for using the position stored as float. So far they seem to be accurate enough. They're also a lot easier to deal with. So far the results look promising, with 15ms per frame at 1024x768. Though that is without sorting and still too much pruning. This is what it looks like so far (not depicted: the flickering that is visible in parts of the image):

Kirill Alpin

325

Author

November 26, 2015 06:56 PM

And there is the newest version, where I try to switch from DFS to BFS.

Ah I see. So what I did is that I seached for every given octree node the next octree nodes for which the current quadtree node has to split, because these next octree nodes are to small. This search ist depth first. These octree nodes are the starting points for the next BFS pass. So only the quadtree building is breadth first in respect of the octree. Inbetween it's depth first. So the number of BFS passes is the depth of the quadtree.

You made both searches breadth first, am I right? So how do you know when to split the quadtree and with it the BFS buffer and when to stop with the BFS passes?

So far the results look promising, with 15ms per frame at 1024x768. Though that is without sorting and still too much pruning.

I agree, it looks promising, as it's already faster at the same level of quality as the mixture of DFS and BFS.

bcmpinc

247

November 27, 2015 11:55 AM

Ah I see. So what I did is that I seached for every given octree node the next octree nodes for which the current quadtree node has to split, because these next octree nodes are to small. This search ist depth first. These octree nodes are the starting points for the next BFS pass.

Yes, you still need to start with a DFS, but you only need to do this for the rootnode of the quadtree. And you can and should do this step on the CPU.

So only the quadtree building is breadth first in respect of the octree. Inbetween it's depth first. So the number of BFS passes is the depth of the quadtree.

The BFS is indeed used for traversing or building the quadtree. So, yes, the number of BFS passes is the depth of the quadtree. However, you don't need DFS in between, because you need at most one octree traversal per quadtree layer.

If you compute the bounding box in the viewport-aligned plane that contains a point of the octree that is nearest to the viewer, then, when you traverse the octree, the sides of the bounding box are reduced by at least a factor 2. When you go to the next layer of the quadtree, you want to reduce the size of the bounding boxes of the octree nodes by a factor 2, which is thus done by a single octree traversal. Though in some cases you don't even need the octree traversal.

I agree, it looks promising, as it's already faster at the same level of quality as the mixture of DFS and BFS.

Yes, though thus far I don't see how to get it working properly, while still being fast. The dark gray noise and flickering were caused by a buffer overflow. Then there are the gaps at the edges of the geometry, which are a lot more visible than I expected,though I have a few ideas on how to get rid of them. Getting the octree nodes sorted in front to back order seems to be the biggest problem. There is so much data that needs to be processed by the GPU that a separate sorting pass that just reads the data and writes it back, without any actual sorting, is already too slow. And it gets a lot worse when you do actually try to sort.

Kirill Alpin

325

Author

November 27, 2015 08:50 PM

When you go to the next layer of the quadtree, you want to reduce the size of the bounding boxes of the octree nodes by a factor 2, which is thus done by a single octree traversal.

I tried that by reducing the stack size to 1, but in that case I get some weird pruning effects. If this weren't so I would also think that one traversal is enough. Maybe there is something wrong with my code. I'll try and check this.

Getting the octree nodes sorted in front to back order seems to be the biggest problem.

In my case it isn't, because I traverse the octree and therefore write in the BFS buffer in front to back order, so the order is automatically conserved. This is trivial for rendering a quadrant of a cube side, but you could create lookup tables for the order of octree nodes. You could render a cube with indices to the correponding lookup tables to a texture, so every pixel has an order associated with it. If the quadtree node is entirly inside a region of one order, you can traverse the octree with this order. The only problem then is to handle the case when one quadtree node is inside multipe regions of order. If that is the case you would have to order the BFS data inside the next BFS pass, corresponding to the order-pixel of the corner, which the next quadtree belongs to, of the last quadtree node.

bcmpinc

247

December 12, 2015 12:18 AM

I tried that by reducing the stack size to 1, but in that case I get some weird pruning effects. If this weren't so I would also think that one traversal is enough. Maybe there is something wrong with my code. I'll try and check this.

I still want to write a blog post about how I do that. It is still quite similar to what I explain here under "culling". Though as O you have to pick the point that is nearest to the camera.

In my case it isn't, because I traverse the octree and therefore write in the BFS buffer in front to back order, so the order is automatically conserved.

My CPU implementation does that. Unfortunately for my GPU implementation it doesn't work. For example, when you traverse the children of an octree node in front to back order, and for each child traverse their children in front to back order, then globally those children's children are not traversed in front to back order. On the CPU this almost never gives any problems, however on the GPU it will probably cause the (rather small) octree node buffers to overflow. I'm thinking about using the GPU's local memory and sort the octree nodes in the same compute shader pass as the quadtree and octree traversal, prior to writing them back to the GPU's main memory.

Performance of an experimental SVO raycaster

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Performance of an experimental SVO raycaster

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines