ManyLODs algorithm


Hi,

I was having a go at implementing the ManyLODs algorithm from this paper http://perso.telecom-paristech.fr/~boubek/papers/ManyLoDs/ManyLoDs.pdf

And I was wondering if anyone knows why they chose to use geometry shader stream output instead of doing it all in a compute shader?

Thanks

Direct3D11 would not yet have been a production thing while the research was taking place ;)

EDIT: As has been pointed out below, this is incorrect; I had it in my head that D3D11 was launched late 2011, when in fact the API (and HW capable of running it, not just WARP) were available around 2009.

I'm not so sure about that. D3D11 came out in 2009 and the paper came out in 2011.

The paper says that they implemented their technique using OpenGL 4.1, which did not support compute shaders. Compute shaders didn't come to GL until 4.3, which was in 2012.

Ah yes, that makes sense. So which do you think would be more efficient, a compute or a geometry shader?

It seems pretty simple to implement as a geometry shader, and I think a compute shader would be slightly more complex.

One thing I am wondering about, though, is how to know when to stop iterating. The active set will eventually be empty, but do I need to read that back to the CPU to know? It seems like that would introduce a lot of latency...

I have had a conversation with Matthias Holländer about this. Yes, compute was not there at the time, so this was a workaround. They made a CUDA implementation later.

Compute is ideal for this; I'm pretty sure the geometry shader path is not worth trying.

> One thing I am wondering about, though, is how to know when to stop iterating. The active set will eventually be empty, but do I need to read that back to the CPU to know? It seems like that would introduce a lot of latency...

With Vulkan / DX12 there is no need to read back (though the driver might, in unlikely cases, still have to do it internally depending on hardware):

Prerecord a command buffer that performs one indirect dispatch of your shader per tree level, up to your maximum number of levels.

In the shader, build a worklist for the next level and also write the list count into the indirect arguments of the upcoming dispatch.

This works well, but it also means you probably end up invoking shaders with zero work.
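With Vulkan, a minimal host-side sketch of that recording could look like the following. Names like `traversalPipeline` and `indirectArgs` are placeholders of mine, and pipeline/descriptor creation is omitted; the level-N shader is assumed to append nodes to level N+1's worklist and to write the group count into the next VkDispatchIndirectCommand:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical handles; creation and descriptor setup omitted for brevity.
// indirectArgs holds one VkDispatchIndirectCommand per tree level,
// filled on the GPU by the previous level's shader.
void recordTraversal(VkCommandBuffer cmd,
                     VkPipeline traversalPipeline,
                     VkPipelineLayout layout,
                     VkBuffer indirectArgs,
                     uint32_t maxLevels)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, traversalPipeline);

    for (uint32_t level = 0; level < maxLevels; ++level)
    {
        // Tell the shader which level it is processing.
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT,
                           0, sizeof(level), &level);

        // The group count comes from the buffer the previous dispatch wrote,
        // so the CPU never reads anything back.
        vkCmdDispatchIndirect(cmd, indirectArgs,
                              level * sizeof(VkDispatchIndirectCommand));

        // Make this level's worklist and indirect args visible
        // to the next dispatch.
        VkMemoryBarrier barrier{};
        barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
        barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT |
                                VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT |
                             VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                             0, 1, &barrier, 0, nullptr, 0, nullptr);
    }
}
```

The barrier between dispatches is what makes the GPU-written indirect arguments visible to the next dispatch without any CPU involvement.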

Currently I use 16 levels, and the constant cost of all dispatches and memory barriers (excluding the actual work done) is about 0.03 ms on a Fury X.

I'm happy with that. With OpenCL 1.x, which does not have indirect dispatch, the read-back was quite a performance killer.

Nvidia has an experimental Vulkan extension with which you can build the command buffer on the GPU, so no more zero-work dispatches.

OpenGL also has indirect dispatch but no command buffers, so there is no need to read back, but you still issue 16 dispatches per frame. (Not sure about older DX.)
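For reference, a minimal sketch of the same loop in OpenGL 4.3+; the program and `indirectArgs` buffer are placeholders of mine, and the uniform location is an assumption:

```cpp
#include <GL/glew.h>  // or any GL 4.3+ function loader

// indirectArgs holds one dispatch-indirect command (3 uints) per level,
// updated on the GPU by the previous level's shader.
void dispatchTraversal(GLuint program, GLuint indirectArgs, GLuint maxLevels)
{
    glUseProgram(program);
    glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, indirectArgs);

    for (GLuint level = 0; level < maxLevels; ++level)
    {
        glUniform1ui(0, level);  // assumes layout(location = 0) uniform uint uLevel;
        glDispatchComputeIndirect(level * 3 * sizeof(GLuint));

        // Worklist writes and indirect args must be visible to the next dispatch.
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);
    }
}
```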

I was thinking one solution would be to have a set number of iterations per frame, and the shader would draw everything that is still active on the last iteration. That way it would at least draw something at a lower LOD on one frame and then refine it on the next.

> I was thinking one solution would be to have a set number of iterations per frame, and the shader would draw everything that is still active on the last iteration. That way it would at least draw something at a lower LOD on one frame and then refine it on the next.

So at the next frame, you continue traversal instead of restarting from the root?

Interesting idea. So the point where everything is done may happen at any time and is out of sync with rendering. You would need two copies of the environment maps - one that is actively being updated by the GI, and another that is finished and can be used by the renderer?

I'm doing something similar; I think I get the same advantage by updating only a constant number of environment maps per frame: each sample has a timer that is increased by a value mixed from randomness, probability, and its change of luminosity. If the timer goes above 1, the sample goes to the update queue and the timer is reduced to its fractional part. That may be an alternative - roughly like the sketch below.
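A minimal CPU-side sketch of that scheduling; the exact weighting of the terms is my own guess, since I only name the ingredients:

```cpp
#include <cmath>
#include <random>
#include <vector>

// Hypothetical per-sample record.
struct Sample
{
    float timer = 0.0f;
    float probability = 0.0f;      // importance of this sample
    float luminosityChange = 0.0f; // measured since its last update
};

// Each frame, bump every timer; samples whose timer overflows 1 go to the
// update queue, so only a roughly constant number of environment maps is
// re-rendered per frame.
void scheduleUpdates(std::vector<Sample>& samples, std::vector<int>& updateQueue)
{
    static std::mt19937 rng{1234};
    std::uniform_real_distribution<float> jitter(0.0f, 0.1f);

    updateQueue.clear();
    for (int i = 0; i < (int)samples.size(); ++i)
    {
        Sample& s = samples[i];
        s.timer += jitter(rng) + s.probability + s.luminosityChange;
        if (s.timer > 1.0f)
        {
            updateQueue.push_back(i);
            s.timer -= std::floor(s.timer); // keep only the fractional part
        }
    }
}
```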

My algorithm is quite different from ManyLODs, but I implemented the paper years ago (on CPU only) and remember some things worth adding:

I think a tree with a larger branching factor is much better: for surfaces, four children per node is ideal, similar to texture mipmaps. A binary tree with two children causes an irregular distribution over the surface area.

They talk about a trick to update the tree for animation to get a speedup of an order of magnitude: for me the speedup was exactly zero, probably because of my larger branching factor as well.

They use environment maps for the visibility calculation. Due to undersampling, this turned out to be too inaccurate if all lighting comes from emissive materials.

A small, bright light source may or may not end up in a single pixel, and that pixel carries no information about the correct area of the light. I've spent a lot of time trying to fix those issues, but with no luck.

You need to inject direct lighting from shadow-mapped point lights or similarly accurate techniques. (You probably plan this anyway.)

> So at the next frame, you continue traversal instead of restarting from the root?

Yeah. In section 3.2 of the paper they describe the "Incremental Approach". With this, they never restart from the root anyway. So my idea would rely on that, doing the refining in subsequent frames - roughly like the sketch below.
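A rough CPU-style sketch of that budgeted refinement; the split metric and `drawNode` are placeholders of mine, it handles refinement only (no coarsening), and the real version would run in a compute shader with the worklists discussed above:

```cpp
#include <vector>

struct Node { int firstChild = -1; int childCount = 0; float error = 0.0f; };

void drawNode(int node);  // hypothetical draw call

// `cut` is the current cut through the tree and persists across frames.
// Each frame refines it by at most `budget` iterations, then draws whatever
// is in the cut, so coarse LODs appear first and get refined next frame.
void refineAndDraw(const std::vector<Node>& tree,
                   std::vector<int>& cut,
                   int budget)
{
    for (int iter = 0; iter < budget; ++iter)
    {
        std::vector<int> next;
        bool changed = false;
        for (int n : cut)
        {
            const Node& node = tree[n];
            if (node.error > 0.01f && node.childCount > 0)  // placeholder split metric
            {
                for (int c = 0; c < node.childCount; ++c)
                    next.push_back(node.firstChild + c);
                changed = true;
            }
            else
            {
                next.push_back(n);  // node stays in the cut
            }
        }
        cut.swap(next);
        if (!changed) break;  // cut has converged, stop early
    }
    // Draw the current cut; anything not yet refined shows at a coarser
    // LOD this frame and is refined further on subsequent frames.
    for (int n : cut)
        drawNode(n);
}
```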

I did think about using a higher-branching tree like an octree for compression reasons, but I think it could make the shader too unbalanced, doing uneven work in different branches, whereas even work is required for high throughput... What do you think?

> I did think about using a higher-branching tree like an octree for compression reasons, but I think it could make the shader too unbalanced, doing uneven work in different branches, whereas even work is required for high throughput... What do you think?

I use a tree with a maximum of 8 child nodes while 'in space', and 4 children at the surface.

'In space' means just the nodes necessary to form a single tree - you're on the surface most of the time, and nodes with many children automatically fall to the higher tree levels, so the additional divergence becomes negligible.

Further, with a binary tree you have many more tree levels, which means more dispatches and less work per dispatch - actually the worst thing you can do, because:

> Currently I use 16 levels, and the constant cost of all dispatches and memory barriers (excluding the actual work done) is about 0.03 ms on a Fury X.

I tried to hide those 0.03 ms with async compute on GCN and Vulkan. No luck. Even though the GPU is idle for this time, I can't manage to process other compute workloads efficiently in the meantime. All I get is slowdowns.

Either I'm doing it wrong, or async compute for now works properly only when pairing graphics and compute work, and fine-grained compute is still on their todo list.

EDIT:

My fault - an indexing bug in the queue setup. Async compute works great and can perfectly hide those small bubbles, and also the small-workload problem, e.g. near the root of a tree. Awesome!

> Yeah. In section 3.2 of the paper they describe the "Incremental Approach". With this, they never restart from the root anyway. So my idea would rely on that, doing the refining in subsequent frames.

Ah yes, I read it again and remember correctly now. This is what I meant by 'trick' (it has nothing to do with animation - sorry).

For me, both the Incremental and the Basic approach had equal performance, because:

Using the larger branching factor and the Basic approach, you get to the final cut much faster in terms of tree levels; as a result, it is equal work to do either that or to move up / down the tree for each of the many nodes from the previous cut.

But wait - I realize now: the incremental approach has a big advantage on GPU. I finally get what you mean by a 'set number of iterations per frame' - you avoid the GPU workload problems I mentioned above almost completely!

Not sure if this should affect my opinion on the branching factor as well - probably not...

