ManyLODs algorithm

Posted by fries

Direct3D11 would not yet have been a production thing while the research was taking place ;)

EDIT: As has been pointed out below, this is incorrect; I had it in my head that D3D11 was launched late 2011 when in fact the API (and HW capable of running it, not just WARP) were available around ~2009. Edited by InvalidPointer


Ah yes, that makes sense. So which do you think would be more efficient? Compute or Geom shader?

 

It seems pretty simple to implement as a geom shader and I think a compute shader would be slightly more complex.

 

One thing I am wondering about, though, is how to know when to stop iterating. The active set will be empty, but do I need to read that back to the CPU to know? It seems like it would introduce a lot of latency...

Edited by fries


I have had a conversation with Matthias Holländer about this. Yes, compute was not there at the time, so this was a workaround. They made a CUDA implementation later.

Compute is ideal for this; I'm pretty sure a geometry shader is not worth trying.

 

> One thing I am wondering about, though, is how to know when to stop iterating. The active set will be empty, but do I need to read that back to the CPU to know? It seems like it would introduce a lot of latency...

 

With Vulkan / DX12 there is no need to read back (although the driver may, in unlikely cases, still need to do it depending on hardware):

Pre-record a command buffer that does an indirect dispatch of your shader once for each of your maximum number of tree levels.

In the shader, build a worklist for the next level and also write the list count as the group count for the upcoming indirect dispatch.
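
A minimal sketch of what that pre-recorded chain could look like with Vulkan in C++ (the function and buffer names such as recordTraversal and indirectArgs, and the push-constant usage, are my own assumptions, not from the posts above):

```cpp
// Minimal sketch, not production code: pre-record one indirect dispatch per tree
// level; the shader for level N writes level N+1's worklist and its
// VkDispatchIndirectCommand, so no CPU read-back is needed.
#include <vulkan/vulkan.h>

void recordTraversal(VkCommandBuffer cmd,
                     VkPipeline traversalPipeline,
                     VkPipelineLayout layout,
                     VkDescriptorSet set,
                     VkBuffer indirectArgs,     // one VkDispatchIndirectCommand per level
                     uint32_t maxTreeLevels)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, traversalPipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                            0, 1, &set, 0, nullptr);

    for (uint32_t level = 0; level < maxTreeLevels; ++level)
    {
        // Tell the shader which level's worklist to consume / produce.
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT,
                           0, sizeof(uint32_t), &level);

        // The group count was written by the previous level's shader; if the
        // active set is already empty this is simply a zero-work dispatch.
        vkCmdDispatchIndirect(cmd, indirectArgs,
                              level * sizeof(VkDispatchIndirectCommand));

        // Make the new worklist and the next indirect args visible.
        VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
        barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT |
                                VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT |
                             VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                             0, 1, &barrier, 0, nullptr, 0, nullptr);
    }
}
```

Each level's shader would write the next level's VkDispatchIndirectCommand along with the worklist, so if the active set runs empty the remaining pre-recorded dispatches simply launch zero workgroups.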

 

This works well, but it also means you probably end up invoking shaders with zero work.

Currently I use 16 levels, and the constant cost of all dispatches (excluding the work done) and memory barriers is about 0.03 ms on a Fury X.

 

I'm happy with that. With OpenCL 1.x, which does not have indirect dispatch, the read-back was quite a performance killer.

Nvidia has an experimental Vulkan extension where you can build the command buffer on the GPU, so no more zero-work dispatches.

OpenGL also has indirect dispatch but no command buffers - so there is no need to read back, but still 16 dispatches per frame. (Not sure about older DX.)


I was thinking one solution would be to have a set number of iterations per frame, and the shader would draw everything that is still active on the last iteration. This way it would at least draw something at a lower LOD on one frame and then refine it on the next.
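
A rough CPU-side sketch of that idea, just to illustrate the control flow (the node layout, kIterationsPerFrame and drawNode are placeholders I made up, not from the paper):

```cpp
// Illustrative sketch: the current LOD cut persists across frames, each frame it
// is refined a fixed number of times, and whatever the cut contains after the
// last iteration is drawn - even nodes that are still too coarse.
#include <vector>

struct Node { int firstChild = -1; int childCount = 0; float error = 0.0f; };

constexpr int kIterationsPerFrame = 4; // assumed per-frame budget

void refineAndDraw(const std::vector<Node>& tree,
                   std::vector<int>& cut,           // persists between frames
                   float errorThreshold,
                   void (*drawNode)(int))
{
    for (int iter = 0; iter < kIterationsPerFrame; ++iter)
    {
        std::vector<int> next;
        for (int n : cut)
        {
            const Node& node = tree[n];
            if (node.error <= errorThreshold || node.childCount == 0)
                next.push_back(n);                            // fine enough, keep as-is
            else
                for (int c = 0; c < node.childCount; ++c)     // split: replace by children
                    next.push_back(node.firstChild + c);
        }
        cut.swap(next);
    }

    // Out of budget: draw whatever the cut currently holds; nodes that are still
    // too coarse get refined further on the next frame.
    for (int n : cut)
        drawNode(n);
}
```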


> I was thinking one solution would be to have a set number of iterations per frame, and the shader would draw everything that is still active on the last iteration. This way it would at least draw something at a lower LOD on one frame and then refine it on the next.

 

So at the next frame, you continue traversal instead of restarting from the root?

Interesting idea. So the point when anything is done may happen at any time and is out of sync with rendering. You would need two copies of the environment maps - one that is actively being updated by the GI, and another that is finished and can be used by the renderer?

 

I'm doing something similar; I think I get the same advantage by updating only a constant number of environment maps per frame: each sample has a timer increased by a value mixed from randomness, probability and its change of luminosity. If the timer goes above 1, the sample goes to the update queue and the timer is reduced to its fractional part. It may be an alternative.
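
For what it's worth, a small sketch of that kind of scheduling (the field names, base rate and weights are placeholders I chose, not values from the post):

```cpp
// Each sample accumulates a per-frame "urgency" value; when it crosses 1 the
// sample is queued for an environment-map update and the timer keeps only its
// fractional part.
#include <cmath>
#include <cstdlib>
#include <vector>

struct Sample
{
    float timer = 0.0f;            // accumulated urgency
    float luminosityChange = 0.0f; // how much this sample's lighting changed recently
};

void scheduleUpdates(std::vector<Sample>& samples, std::vector<int>& updateQueue)
{
    updateQueue.clear();
    for (size_t i = 0; i < samples.size(); ++i)
    {
        Sample& s = samples[i];
        float randomness = float(rand()) / float(RAND_MAX); // breaks up visible update patterns
        float baseRate   = 0.1f;                            // assumed base update probability
        s.timer += baseRate + 0.05f * randomness + s.luminosityChange;

        if (s.timer >= 1.0f)
        {
            updateQueue.push_back(int(i));   // this sample gets its map refreshed
            s.timer -= std::floor(s.timer);  // keep only the fractional part
        }
    }
}
```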

 

My algorithm is quite different from ManyLODs, but I implemented the paper years ago (on CPU only) and remember some things worth adding:

 

I think a tree with a larger branching factor is much better: for surfaces, four children per node is ideal, similar to texture mip maps. A binary tree with two children causes an irregular distribution over the surface area.

 

They talk about a trick to update the tree for animation to get a speedup of an order of magnitude: for me the speedup was exactly zero, probably because of my larger branching factor as well.

 

They use environment maps for the visibility calculation. Due to undersampling this turned out to be too inaccurate if all lighting comes from emissive materials.

A small bright light source may end up in a single pixel or not at all, and that pixel has no information about the correct area of the light. I've spent a lot of time trying to fix those issues, but with no luck.

You need to inject direct lighting from shadow-mapped point lights or similarly accurate techniques. (You probably plan this anyway.)


> So at the next frame, you continue traversal instead of restarting from the root?

Yeah. In section 3.2 of the paper they describe the "Incremental Approach". With this they never restart from the root anyway, so my idea would rely on that, doing the refining in subsequent frames.

 

I did think about using a higher-branching tree like an octree for compression reasons, but I think it could cause the shader to be too unbalanced, doing uneven work in different branches, whereas even work is required for high throughput... What do you think?

Edited by fries

> I did think about using a higher-branching tree like an octree for compression reasons, but I think it could cause the shader to be too unbalanced, doing uneven work in different branches, whereas even work is required for high throughput... What do you think?

 

I use a tree with a maximum of 8 child nodes while 'in space', and 4 children at the surface.

In space means just the nodes necessary to have a single tree - you're on the surface most of the time, and nodes with many children automatically fall to the higher tree levels, so the additional divergence becomes negligible.

Further, with a binary tree you have many more tree levels, which means more dispatches and less work per dispatch - the worst thing you can do actually, because:


> Currently I use 16 levels, and the constant cost of all dispatches (excluding the work done) and memory barriers is about 0.03 ms on a Fury X.

 

I tried to hide those 0.03 ms with async compute on GCN and Vulkan. No luck. Even though the GPU is idle for this time, I can't manage to process other compute workloads efficiently in the meantime. All I get is slowdowns.

Either I'm doing it wrong, or async compute works properly only when pairing graphics and compute work for now, and fine-grained compute is on their to-do list.


EDIT:

 

My fault - an indexing bug in the queue setup. Async compute works great and can perfectly hide those small bubbles, and also the small-workload problem, e.g. near the root of a tree. Awesome!


> Yeah. In section 3.2 of the paper they describe the "Incremental Approach". With this they never restart from the root anyway, so my idea would rely on that, doing the refining in subsequent frames.

 

Ah yes, I read it again and remember correctly now. This was what I meant by 'trick' (it has nothing to do with animation - sorry).

For me both the incremental and the basic approach had equal performance, because:

Using the larger branching factor and the basic approach, you get to the final cut much faster in terms of tree levels; as a result it is roughly equal work to do either this or to move up / down the tree for each of the many nodes from the previous cut.

 

But wait - I realize now: the incremental approach has a big advantage on GPU. I finally get what you mean by 'set number of iterations per frame' - you avoid the GPU workload problems I've mentioned above almost completely!

Not sure if this should affect my opinion on the branching factor as well - probably not...

Edited by JoeJ


IIRC I used only one iteration and got equal performance only with that. With 2 iterations AND a large branching factor, I assume the basic approach wins on GPU as well.

But one iteration can be enough, depending on detail, kind of game etc. (one 4-branch iteration == two binary tree iterations).

The downside of the incremental approach is that you need to keep LOD cuts in memory for all nodes. The basic approach needs memory only for the nodes you select for an update (important if you want a reasonably detailed large world, where updating all nodes each frame would be too slow).

 

It remains a difficult decision; probably you'll end up implementing multiple variants... as always... ;)

