Is there a trick to use signed floats with integer atomic min/max?
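The usual trick, as far as I know, is to remap the float's bits into a key whose unsigned integer ordering matches the float ordering, so a plain integer atomic min/max on the key gives the right answer. A C++ sketch (function names are mine):

```cpp
#include <cstdint>
#include <cstring>

// Map a float to a uint32_t whose *unsigned* ordering matches the float's
// ordering for all non-NaN values. Negative floats get all bits flipped
// (reversing their order); non-negative floats just get the sign bit set.
// The key then works with plain integer atomic min/max.
uint32_t floatToOrderedUint(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);  // type-pun without undefined behaviour
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Inverse mapping, to decode the result of the atomic min/max afterwards.
float orderedUintToFloat(uint32_t u) {
    u = (u & 0x80000000u) ? (u ^ 0x80000000u) : ~u;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

In a shader the `memcpy` becomes `asuint`/`asfloat` (HLSL) or `floatBitsToUint`/`uintBitsToFloat` (GLSL).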
I'm building a tree by splitting the bounding box at its center into child octants whenever there are too many nodes inside or the box is too large.
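The split itself is just a per-axis sign test against the box center; a minimal C++ sketch (names are mine):

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Classify a point into one of the 8 child octants of a box split at
// 'center': bit 0 = upper x half, bit 1 = upper y half, bit 2 = upper z half.
int octantIndex(const Vec3& point, const Vec3& center) {
    return (point.x >= center.x ? 1 : 0)
         | (point.y >= center.y ? 2 : 0)
         | (point.z >= center.z ? 4 : 0);
}
```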
So I have no geometry to render, just a few thousand points, or more likely hundreds.
But I can't believe rendering those points to two pixels (which is still a render target in main memory) can be faster than atomics to LDS (on-chip memory, which is more than 10 times faster).
I guess the graphics pipeline uses the same mechanism for things like z-buffering and alpha blending (and that's where GPU atomics come from).
But usually the renderer has to draw the primitives in the order we submit them to the API, and the only way to parallelize this is to distribute the primitives across small non-overlapping pixel rectangles.
If we are lucky and can tell the API to ignore primitive order, those two pixels still fall into a single rectangle, and we end up with sequential atomics to main memory under the hood.
Even if the GPU is clever enough to cache this in LDS, I would incur additional synchronization cost between graphics and compute.
So I'd say bounding boxes are a typical compute workload today, and using a pixel shader is a dated approach.
But let me know if I'm wrong about anything; my knowledge of the graphics pipeline is not that good.
I'm still confused about this. It isn't a large chunk of data, just "a few thousand points or hundreds", and that description is for managing a spatial tree for future processing tasks. Why are you using the graphics card to process the tree? That type of tree management is usually a CPU-side operation.
Assuming they're in a linear array and you've got proper SIMD instructions to test X, Y, and Z at once, hundreds to thousands of points will be faster on the CPU than a single round trip to the GPU. If you were talking about tens or hundreds of thousands and heavier CUDA-style processing, that would be a different story.
I could easily be confused, since I only have the small amount you wrote above, but I don't see why this is a GPU-side task if it isn't processing relatively complex geometry.
> But I can't believe rendering those points to two pixels (which is still a render target in main memory) can be faster than atomics to LDS (on-chip memory, which is more than 10 times faster). I guess the graphics pipeline uses the same mechanism for things like z-buffering and alpha blending (and that's where GPU atomics come from).

The graphics hardware has dedicated hardware (often called ROPs, the OM stage by Microsoft, or depth-block/color-block by AMD) which writes to colour/depth buffers, including any sorting/read-modify-write logic. Atomics on memory are handled by a completely different bit of hardware, and atomics in LDS by a different bit of hardware again.
Atomics in LDS are going to be fast, but they're also limited to communication within the current thread-group. LDS hardware is within a shader core, which can only host a few hundred threads at a time. If you're reducing a data set of tens of thousands of points, then your kernel will be running across multiple shader cores with multiple non-connected LDS banks. So at some point you need to do some atomics out to memory addresses... but yes, if you can structure your algorithm so that it does most of its communication in LDS and then only small amounts of communication back to memory then you'll probably be doing pretty well.
> I'm still confused about this. It isn't a large chunk of data, just "a few thousand points or hundreds", and that description is for managing a spatial tree for future processing tasks. Why are you using the graphics card to process the tree? That type of tree management is usually a CPU-side operation.

Agreed; I'll keep CPU generation (which I already have) as an option.
In fact the tree rebuild takes 1% of the complete algorithm's runtime on the CPU; on the GPU it will be maybe 5%, I guess.
It hurts to put small workloads on the GPU, but I save the need to upload data every frame.
Also, at the beginning of the frame, all the setup work coming after the tree rebuild consists of small workloads too, and I hope I can do shadow-map rendering or texture-space shading asynchronously during that time.
> The graphics hardware has dedicated hardware (often called ROPs, the OM stage by Microsoft, or depth-block/color-block by AMD) which writes to colour/depth buffers, including any sorting/read-modify-write logic.

Ha, ok, I did not know. ROPs, of course :) But I guess the limitation that all the work for a given pixel goes through just one ROP, blocking any others, still applies somehow.
So here is what I'm trying to do and how I plan to implement it (it's inefficient, let me know if you have an idea...):
I have all geometry covered with disc-shaped samples to calculate GI. They sit at the positions of lightmap texel centers, and there's a hierarchy similar to mip maps, so we can think of them as a quad tree.
So the tree for something like a character is precomputed and does not need to be rebuilt; I only need to rebuild the tree above all those object parents as they move around.
To make sure the direction of parent discs can be averaged from their children, I first bin all these nodes to the closest of the 8 directions formed by the vectors from a cube's corners to its center.
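Since the 8 candidate directions are (+-1, +-1, +-1)/sqrt(3), picking the closest one reduces to the sign of each component of the disc normal (the dot product is maximized by matching signs). A sketch, with my own names:

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Bin a direction to the closest of the 8 diagonal directions
// (+-1, +-1, +-1)/sqrt(3). Maximizing the dot product against these
// candidates amounts to matching the sign of each component.
int directionBin(const Vec3& dir) {
    return (dir.x >= 0.0f ? 1 : 0)
         | (dir.y >= 0.0f ? 2 : 0)
         | (dir.z >= 0.0f ? 4 : 0);
}
```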
From that I launch 8 workgroups (each 256-1024 threads wide, depending on how many dynamic objects a game has), and each workgroup builds its entire tree top-down.
At the end the 8 trees get a common parent, and we're done.
(The precomputed object trees also use the 8-directions approach, so each object contributes 8 root nodes to process.)
So I have work for only 8 x 1024 threads, but at least they will likely stay busy from start to finish.
Unfortunately this will eat up pretty much all of the LDS to store the boxes, which limits the chance of doing other things asynchronously very well.
Using a bounding sphere instead of a box could help a bit, but it requires two passes over the nodes: one to calculate the center and one to find the radius.
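The two passes would look roughly like this (CPU-side sketch with my own names; on the GPU each pass maps to a parallel reduction over the nodes):

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct Sphere { Vec3 center; float radius; };

// Two-pass bounding sphere: pass 1 averages the positions to get a center,
// pass 2 takes the distance to the farthest point as the radius. Not the
// minimal enclosing sphere (the centroid isn't the optimal center), but
// cheap and always conservative.
Sphere boundingSphere(const std::vector<Vec3>& pts) {
    Vec3 c{0.0f, 0.0f, 0.0f};
    for (const Vec3& p : pts) { c.x += p.x; c.y += p.y; c.z += p.z; }
    float n = float(pts.size());
    c.x /= n; c.y /= n; c.z /= n;

    float r2 = 0.0f;
    for (const Vec3& p : pts) {
        float dx = p.x - c.x, dy = p.y - c.y, dz = p.z - c.z;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 > r2) r2 = d2;
    }
    return {c, std::sqrt(r2)};
}
```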
My nodes can have 8 children at most,
a node can have a maximum bounding radius of (1 << treeLevel) * detail,
and the tree needs to be level-complete (I duplicate nodes from a level-5 child up to its level-8 parent in a later step to fill in the missing levels 6 and 7).
What I really dislike about my approach is that it mixes nodes of various levels under a common parent.
A bottom-up approach would be much nicer: create parents for all level-0 nodes first, add the results to the existing level-1 nodes, and proceed.
This would result in much better trees, but I guess the complexity is too high to be worth it.