I did think about using a higher branching tree like an octree for compression reasons, but i think it would possibly cause the shader to be too unbalanced, doing uneven work in different branches, and balanced work is required for high throughput... What do you think?
I use a tree with max 8 child nodes while 'in space', and 4 children at the surface.
In space means just the necessary nodes to have a single tree - you're on the surface most of the time and nodes with many children automatically fall to the higher tree levels, so the additional divergence becomes negligible.
Further, with a binary tree you have many more tree levels, which means more dispatches and less work per dispatch - the worst thing you can do actually, because:
Currently i use 16 levels and the constant cost of all dispatches (excluding the work done) and memory barriers is about 0.03ms on FuryX.
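To make the level count argument concrete, here is a small sketch of how the number of levels (and so dispatches) shrinks with the branching factor. It assumes one dispatch per tree level and a constant fixed cost per level derived from the 0.03 ms / 16 levels figure above; the leaf count of one million and the function names are made up for illustration, and real overhead will differ per GPU and driver.

```python
def levels_needed(leaf_count: int, branching: int) -> int:
    """Smallest depth d (levels below the root) with branching**d >= leaf_count.

    Uses integer math instead of a float logarithm to avoid rounding errors.
    """
    depth, capacity = 0, 1
    while capacity < leaf_count:
        capacity *= branching
        depth += 1
    return depth


# Assumption: fixed cost per level taken from the numbers in the post
# (0.03 ms for 16 levels), treated as constant per dispatch + barrier.
OVERHEAD_PER_LEVEL_MS = 0.03 / 16


def fixed_overhead_ms(leaf_count: int, branching: int) -> float:
    """Estimated constant dispatch/barrier cost, excluding the actual work."""
    return levels_needed(leaf_count, branching) * OVERHEAD_PER_LEVEL_MS


if __name__ == "__main__":
    leaves = 1_000_000  # hypothetical scene size
    for b in (2, 4, 8):
        print(b, levels_needed(leaves, b), fixed_overhead_ms(leaves, b))
```

With these assumptions a binary tree needs roughly three times as many levels as a tree with 8 children for the same leaf count, and each of those extra levels has less work to fill the GPU.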
I tried to hide those 0.03ms with async compute with GCN and Vulkan. No luck. Even though the GPU is idle for this time, i can't manage to process other compute workloads efficiently in the meantime. All i get is slowdowns.
Either i'm doing it wrong or async compute works properly only when pairing graphics and compute work for now, and fine grained compute is on their todo list.
EDIT:
My fault - Indexing bug in queue setup. Async Compute works great and can perfectly hide those small bubbles and also the small workload problem e.g. near the root of a tree. Awesome!
Yeah. In section 3.2 of the paper they describe the "Incremental Approach". With this they never restart from the root anyway. So my idea would rely on that, doing the refining in subsequent frames.
Ah yes, read again and remember correctly now. This was what i meant with 'trick' (has nothing to do with animation - sorry)
For me both Incremental and Basic approach had equal performance, because:
Using the larger branching factor and the Basic approach you get to the final cut much faster in terms of tree levels, so as a result it is equal work to do either this or moving up / down the tree for each of the many nodes from the previous cut.
But wait - i realize now: The incremental approach has a big advantage on GPU. I finally get what you mean with 'set number of iterations per frame' - you avoid the GPU workload problems i've mentioned above almost completely!
Not sure if this should affect my opinion on branching factor as well - probably not...
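A rough CPU-side sketch of the 'set number of iterations per frame' idea, to show why the per-frame cost becomes fixed. This is not the paper's algorithm, just a toy under loud assumptions: an implicit complete tree over a 1D domain, nodes named `(level, index)`, and a made-up error metric (node width divided by distance to the camera). `B`, `DEPTH`, `THRESHOLD` and all function names are hypothetical. On the GPU each `step` would presumably be one dispatch over the whole previous cut, so every dispatch has plenty of work, unlike the first levels when starting from the root.

```python
B = 4            # branching factor (hypothetical)
DEPTH = 6        # level of the leaves (hypothetical)
THRESHOLD = 0.5  # refine while error is above this (hypothetical)


def width(level):
    """Extent of a node at the given level; leaves have width 1."""
    return float(B ** (DEPTH - level))


def error(node, camera):
    """Toy metric: node width over distance from camera to the node's interval."""
    level, index = node
    w = width(level)
    lo, hi = index * w, (index + 1) * w
    dist = max(lo - camera, camera - hi, 0.0)
    return w / max(dist, 1.0)


def should_refine(node, camera):
    return node[0] < DEPTH and error(node, camera) > THRESHOLD


def children(node):
    level, index = node
    return [(level + 1, index * B + k) for k in range(B)]


def step(cut, camera):
    """One iteration: every cut node moves at most one level down or up."""
    new_cut = set()
    for node in cut:
        level, index = node
        if should_refine(node, camera):
            new_cut.update(children(node))
            continue
        parent = (level - 1, index // B)
        # Collapse only if the parent is good enough and all siblings are in
        # the cut, so the cut never contains both a node and its ancestor.
        if (level > 0 and not should_refine(parent, camera)
                and all(c in cut for c in children(parent))):
            new_cut.add(parent)
        else:
            new_cut.add(node)
    return new_cut


def update_cut(cut, camera, iterations_per_frame):
    """Incremental update: start from last frame's cut, do a fixed number of steps."""
    for _ in range(iterations_per_frame):
        cut = step(cut, camera)
    return cut


def cut_from_root(camera):
    """Basic approach: restart at the root and refine until nothing changes."""
    cut = {(0, 0)}
    while True:
        nxt = step(cut, camera)
        if nxt == cut:
            return cut
        cut = nxt


if __name__ == "__main__":
    cut = cut_from_root(camera=0.0)
    for frame in range(8):  # camera jumps once, cut catches up over frames
        cut = update_cut(cut, camera=3000.0, iterations_per_frame=1)
        print(frame, len(cut))
```

The trade-off in this toy version is that after a large camera move the cut lags behind for a few frames, while the number of dispatches per frame stays constant.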