bcmpinc

  1. I still want to write a blog post about how I do that. It is still quite similar to what I explain here under "culling", though as O you have to pick the point of the node that is nearest to the camera. My CPU implementation does that; unfortunately, it doesn't work for my GPU implementation. For example, when you traverse the children of an octree node in front-to-back order, and for each child traverse its children in front-to-back order, then globally those children's children are not traversed in front-to-back order. On the CPU this almost never causes problems, but on the GPU it will probably make the (rather small) octree-node buffers overflow. I'm thinking about using the GPU's local memory to sort the octree nodes in the same compute-shader pass as the quadtree and octree traversal, before writing them back to the GPU's main memory. A per-node ordering sketch follows below.
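
     A minimal sketch of that per-node ordering (my illustration, not the author's code; `near`, `node`, and `traverse` are hypothetical names): if `near` is the index (0-7) of the child octant containing the node's camera-nearest point, visiting the children as `near ^ m` for m = 0..7 gives a valid front-to-back order within this one node, since a child that can occlude another differs from `near` in a subset of the other child's axes, and a subset bitmask is always numerically smaller. As the post explains, concatenating these per-node orders does not yield a globally sorted order.

     ```cpp
     // near: index of the child octant (one bit per axis) that contains
     // the point of this node nearest to the camera.
     for (int m = 0; m < 8; ++m) {
         traverse(node.child[near ^ m]);  // enqueue/recurse, front to back
     }
     ```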
  2. Yes, you still need to start with a DFS, but you only need to do this for the root node of the quadtree, and you can and should do this step on the CPU.

     The BFS is indeed used for traversing and building the quadtree, so yes, the number of BFS passes is the depth of the quadtree. However, you don't need a DFS in between, because you need at most one octree traversal per quadtree layer. If you compute the bounding box in the viewport-aligned plane that contains the point of the octree node that is nearest to the viewer, then each octree traversal step reduces the sides of the bounding box by at least a factor of 2. When you go to the next layer of the quadtree, you want the bounding boxes of the octree nodes to shrink by a factor of 2 as well, which is therefore achieved by a single octree traversal; in some cases you don't even need it (see the sketch below).

     Yes, though thus far I don't see how to get it working properly while still being fast. The dark gray noise and flickering were caused by a buffer overflow. Then there are the gaps at the edges of the geometry, which are a lot more visible than I expected, though I have a few ideas on how to get rid of them. Getting the octree nodes sorted in front-to-back order seems to be the biggest problem. There is so much data to process that a separate sorting pass that just reads the data and writes it back, without doing any actual sorting, is already too slow; and it gets a lot worse when you actually try to sort.
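
     To make that factor-of-2 argument concrete, a rough sketch (my own reading, with hypothetical names): a conservative bound on a node's projected size, measured in the viewport-aligned plane through its nearest point, halves with each octree subdivision, which exactly keeps pace with the halving of the quadtree cells.

     ```cpp
     // Upper bound on the projected side of a node: world-space side
     // scaled by perspective division at the node's camera-nearest point.
     float screen_bound(float world_side, float nearest_dist, float focal) {
         return world_side * focal / nearest_dist;
     }
     // For a child: world_side halves while nearest_dist can only grow, so
     // screen_bound(child) <= screen_bound(parent) / 2 -- hence one octree
     // traversal per quadtree layer suffices.
     ```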
  3. Ah, forgot about the cubemap-based version. So there is the cubemap algorithm, which is recursive and uses DFS; there is the improved version, which bypasses the cubemap and renders directly to the view plane, but is still recursive and uses DFS; and there is the newest version, where I try to switch from DFS to BFS.

     I've settled on storing the positions as floats. So far they seem to be accurate enough, and they're also a lot easier to deal with. The results look promising so far, with 15 ms per frame at 1024x768, though that is without sorting and with still too much pruning. This is what it looks like so far (not depicted: the flickering that is visible in parts of the image):
  4. That's still more than the 300 GFLOPS that I have with my NVidia GT555M. Though 300 GFLOPS / 60 fps / (1920x1080 pixels) still amounts to 2411 floating-point operations per pixel, which is a lot more than the ~36 instructions per pixel that I get out of a single core of my 2.2 GHz CPU.

     That my old algorithm, the recursive one, doesn't perform well on the GPU isn't really a surprise, though. Because of its recursive nature, I guess there is often only one unit per GPU workgroup that is actually performing any work. Still, it's quite impressive that you got it working at all, because GPUs don't support recursion. I barely recognize my algorithm in it. You did quite some work to reduce the number of registers the algorithm uses, which is good, as using an array to store the stack eats up so many registers that it kills the GPU's latency-hiding techniques.

     I'm making some progress with the BFS-based version using OpenGL 4.5. Currently I'm considering whether I should keep track of the bounds of each octree node, or instead track the position of the octree node and use that to recompute its bounds. It makes sense to pick the latter, because I have to sort on the depth of the node and thus need to keep track of that value anyway. That also means I'll need to switch to floating-point computations, and I'm unsure whether floats will be accurate enough (see the sketch below).
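
     A sketch of the latter option as I read it (field names are mine, not from the actual code): keep a floating-point position plus the octree level, recompute the bounds from those on demand, and reuse the depth value for the front-to-back sort.

     ```cpp
     #include <cstdint>

     // Hypothetical queue entry: position instead of precomputed bounds.
     struct NodeRef {
         float    x, y, z;  // camera-space corner of the node's cube;
                            // z doubles as the front-to-back sort key
         int      level;    // octree level: side = root_side / (1 << level)
         uint32_t node;     // index of the octree node in GPU memory
     };

     // Bounds are recomputed on demand from position and level.
     float node_side(const NodeRef &n, float root_side) {
         return root_side / float(1u << n.level);
     }
     ```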
  5. The performance is kind of disappointing. Getting 20 fps at 512x512 is worse than the 10 fps at 1024x768 I get with a single CPU core. I also expected the hole filling to be less noticeable. I'm sure you can still do a lot of optimization, though. I doubt I can be of much help here, due to my lack of knowledge of Direct3D, but please do share the code, so I and other people can have a look at it.

     I didn't create the dataset, though I guess the coordinates are between 30800 and 34740, because otherwise it would be prohibitively large. What I do, though, is treat the leaf nodes of the octree as nodes whose child-node pointers all point back to the node itself. That way you can always keep traversing the octree, even once you've hit the resolution limit (see the sketch below).

     Regarding the GPU's fixed-point performance: GPU programming has become so popular that I can imagine GPUs have native support for integer operations by now, or that they're just really good at emulating integer operations using floating-point operations.
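
     A sketch of that self-pointing-leaf trick (field names are hypothetical): a leaf's eight child pointers all refer back to the leaf itself, so the traversal can keep descending past the resolution limit without a special case; it simply lands on the same node again.

     ```cpp
     #include <cstdint>
     #include <vector>

     struct OctreeNode {
         uint32_t child[8];  // indices into the node array
         uint32_t color;     // payload
     };

     void make_leaf(std::vector<OctreeNode> &nodes, uint32_t i) {
         for (int c = 0; c < 8; ++c)
             nodes[i].child[c] = i;  // every child is the node itself
     }
     ```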
  6. I think you're on the right track. It renders correctly, but with holes, which are caused by pruning too many octree nodes, so you have to find a way to fill those. They can be filled, for example, by rendering your 512x512 picture on top of the 256x256 picture. But there are two better ways to do it. One, you can stop traversing the octree early, to prevent the buffer overflow that causes the pruning. Two, you can render the pruned octree nodes to a lower-resolution image and use that as the background for your final image (see the sketch below). You can also combine those methods.

     So basically you add a LOD mechanism for geometry that isn't far away, but is less visible because it's obscured by other geometry, and thus you can get away with rendering it at a lower detail level.

     The rectangles are quite visible, so I guess you stop some octree traversals one iteration too early.

     If you want some better geometry to test on, please take a look at https://bcmpinc.wordpress.com/2013/12/10/benchmarks/. It is the Sibenik dataset converted to a point cloud. Once you've converted it into the octree format you're using, you can use it to test your renderer. It looks like this (the individual voxels are somewhat noticeable on the pillar on the left):
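
     A minimal sketch of that second option (the `Image` type and its members are hypothetical): pruned octree nodes are splatted into a quarter-resolution layer, which is then stretched underneath the full-resolution image, so pruning shows up as blur instead of holes.

     ```cpp
     // Fill holes left by pruning: wherever the full-resolution render has
     // no coverage, fall back to the upscaled low-resolution background.
     void compose(const Image &full, const Image &quarter, Image &out) {
         for (int y = 0; y < out.height; ++y)
             for (int x = 0; x < out.width; ++x)
                 out(x, y) = full.covered(x, y)
                           ? full(x, y)              // normally rendered pixel
                           : quarter(x / 4, y / 4);  // pruned geometry, upscaled
     }
     ```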
  7. Yes, go ahead, I'd love to see a proof of concept implementation.
  8. The first video shows a heightmap raycaster, which is quite easy to implement efficiently. The second video I'm not sure about, though I do see it paging geometry in and out of video memory while the camera rotates (with multiple gigabytes of geometry data, I will need to use paging for my GPU implementation as well). And there is a low-resolution region in the dark background, which I expect to be a video-compression artifact. There are some GPU raytracers out there, such as https://code.google.com/p/efficient-sparse-voxel-octrees/ (also check the external links on that page). However, it uses CUDA, so I didn't bother trying to get it running.
  9. I would process one quadtree node and its entire list of octree nodes in one compute unit, as they're called on the GPU.

     I figured out that I don't need them.

     I should note, though, that I haven't implemented this on the GPU yet, so I don't know how well such an implementation would work, only that it is possible.
  10. No, I start the BFS at the root node of the quadtree, and the BFS is then executed in multiple shader passes. For each quadtree node I process a list of octree nodes, as there can be, and often are, multiple octree nodes per quadtree node. I traverse the octree nodes if necessary, sort them, and distribute them over the quadtree child nodes. This way you only need to synchronize between the layers of the quadtree, of which there are at most 12 (see the sketch below). That still leaves an issue with unbounded memory usage, but it can be solved by imposing a (quadtree-layer-dependent) limit on the number of octree nodes per quadtree node. That introduces some rendering artifacts, though I expect them to be unnoticeable in practice. Picking the right limits also has the interesting effect that the algorithm's running time becomes O(pixels). See also this comment: https://bcmpinc.wordpress.com/2015/08/09/moving-towards-a-gpu-implementation/#comment-215.
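
      A host-side sketch of that pass structure (OpenGL 4.5; the buffer, uniform, and helper names such as `limit_for_layer` are hypothetical, not from the actual code): one compute dispatch per quadtree layer, ping-ponging between two node-list buffers, with the per-layer cap on octree nodes per quadtree node bounding memory use.

      ```cpp
      GLuint bufs[2];  // octree-node lists for the current and next layer
      // assumes the traversal compute program is already bound
      for (int layer = 0; layer < 12; ++layer) {
          glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, bufs[layer & 1]);       // read
          glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, bufs[(layer + 1) & 1]); // write
          glUniform1i(limit_loc, limit_for_layer(layer));  // cap per quadtree node
          glDispatchCompute(quadtree_nodes_at(layer), 1, 1);
          // synchronize between layers: the next pass reads what this one wrote
          glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
      }
      ```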
  11. Hi there, let me join this discussion.

      Without the cubemap it was about equally fast, or even slower. However, it simplified the algorithm a bit and also improved the rendering quality, as the cubemap would always cause some amount of blurring. Oh, and by the way, I'm female.

      The problem with DFS is that, since it is recursive, it won't work well on a GPU: recursive algorithms cannot really be parallelized, and recursion requires a stack, which is something GPUs don't like either. With BFS you get a queue of elements to process, which can all be processed in parallel and without the need for a stack, so it fits really well on a GPU (see the sketch below). By switching from DFS to BFS, it suddenly becomes possible to port the algorithm to the GPU and expect decent performance. I haven't found time to work on it though, as I'm also doing a PhD on a completely different topic.
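
      A toy illustration of that point (mine, not from the posts; `Node` and `process` are stand-ins): the BFS frontier is a flat array whose elements are independent, so on a GPU each element can go to its own thread, and the only synchronization needed is between layers; no per-thread stack is required.

      ```cpp
      #include <vector>

      struct Node { Node *child[8]; };  // stand-in octree node

      // Decides which of n's children go into the next frontier;
      // the body is application-specific and omitted here.
      void process(Node *n, std::vector<Node*> &next);

      void bfs(Node *root) {
          std::vector<Node*> frontier = {root};
          while (!frontier.empty()) {
              std::vector<Node*> next;
              for (Node *n : frontier)  // iterations are independent: on a
                  process(n, next);     // GPU, each becomes its own thread
              frontier = std::move(next);  // one pass (one sync) per layer
          }
      }
      ```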