Mesh pre-processing

With D3D12 out, it seems feasible for me to try some screen-space tiling in my renderer, which leaves me with the problem of figuring out how to divide my meshes into smaller clusters. So I was wondering if anyone had resources on mesh feature detection and other mesh pre-processing functionality? While I have a specific goal, it's about time I looked at mesh pre-processing in general, so all resources are welcome.

 

Thanks in advance.


OpenMesh is a great library for mesh processing in general. For example, I used it for a custom decimation operation, but it can do lots more.
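
To give an idea of the API, the setup for that kind of decimation is only a few lines. A minimal sketch (the file names and the vertex target are placeholders):

```cpp
#include <OpenMesh/Core/IO/MeshIO.hh>
#include <OpenMesh/Core/Mesh/TriMesh_ArrayKernelT.hh>
#include <OpenMesh/Tools/Decimater/DecimaterT.hh>
#include <OpenMesh/Tools/Decimater/ModQuadricT.hh>

typedef OpenMesh::TriMesh_ArrayKernelT<> Mesh;

int main()
{
    Mesh mesh;
    if (!OpenMesh::IO::read_mesh(mesh, "input.obj"))   // placeholder path
        return 1;

    // The decimater drives the edge-collapse loop; the quadric module
    // ranks candidate collapses by their quadric error.
    OpenMesh::Decimater::DecimaterT<Mesh>          decimater(mesh);
    OpenMesh::Decimater::ModQuadricT<Mesh>::Handle hModQuadric;
    decimater.add(hModQuadric);
    decimater.initialize();

    decimater.decimate_to(5000);  // collapse until ~5000 vertices remain
    mesh.garbage_collection();    // physically remove deleted elements

    return OpenMesh::IO::write_mesh(mesh, "output.obj") ? 0 : 1;
}
```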

 

But I don't get the rest of your question: what does this have to do with D3D12, and how do you intend to split meshes for rendering "screen-space tiles"? Don't expect OpenMesh to be fast enough to re-process all your models every frame, if that is what you're after. (I doubt there is any general mesh pre-processing framework that can do such things.)



With D3D12 out, it seems feasible for me to try some screen-space tiling in my renderer, which leaves me with the problem of figuring out how to divide my meshes into smaller clusters. So I was wondering if anyone had resources on mesh feature detection and other mesh pre-processing functionality?
What? No. In the renderer? Absolutely not.

Ok, I'm mildly interested. What are you really thinking about?

And again, what's the point of D3D12 in this context?



OpenMesh is a great library for mesh processing in general. For example, I used it for a custom decimation operation, but it can do lots more.

Thanks I will take a look.

 


But I don't get the rest of your question: what does this have to do with D3D12, and how do you intend to split meshes for rendering "screen-space tiles"? Don't expect OpenMesh to be fast enough to re-process all your models every frame, if that is what you're after. (I doubt there is any general mesh pre-processing framework that can do such things.)


What? No. In the renderer? Absolutely not.
Ok, I'm mildly interested. What are you really thinking about?
And again, what's the point of D3D12 in this context?

Basically, what I'm going to do is render my scene sort of like a PowerVR GPU does, except manually, and using triangle clusters instead of binning individual triangles. The basic gist is this:

 

0. Pre-process meshes into clusters of triangles. Create an OBB per cluster.

1. Divide the screen into screen-space tiles sized to fit in the GPU's on-chip caches, with a list of clusters to render per tile.

2. When rendering, after frustum culling, use the OBBs to find which clusters affect which tiles, and add them to those tile lists.

3. Render the clusters associated with each tile, with a scissor rect to enforce the tile bounds.

 

Since all frame-buffer accesses will hit the cache, rendering should be faster, at the cost of rendering triangles multiple times, since a cluster has to be re-rendered for every tile it touches. DX12 is relevant because there will be a lot more draw calls (one per cluster, multiplied by the tiles each cluster overlaps), and the tiling breaks batching, which, if I understand things correctly, matters a great deal for DX11. I know DX11 supports indirect draws, but that will come later, after I get a feel for the technique from profiling.
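
Roughly what I have in mind for steps 1-3, as a sketch: ProjectObbToScreen() is a placeholder that would project the OBB's eight corners and take conservative min/max pixel bounds, and the per-cluster index ranges come from step 0.

```cpp
#include <cstdint>
#include <vector>
#include <d3d12.h>

// One entry per pre-processed triangle cluster (step 0).
struct Cluster {
    uint32_t firstIndex;  // range in the mesh's shared index buffer
    uint32_t indexCount;
    // the world-space OBB lives alongside; ProjectObbToScreen() reads it
};

// Placeholder: conservative pixel bounds of a cluster's OBB, i.e. project
// the eight OBB corners and take the min/max (right/bottom exclusive).
D3D12_RECT ProjectObbToScreen(const Cluster& c);

static LONG Clamp(LONG v, LONG lo, LONG hi) { return v < lo ? lo : v > hi ? hi : v; }

void DrawTiled(ID3D12GraphicsCommandList* cmd,
               const std::vector<Cluster>& clusters,  // already frustum-culled
               uint32_t screenW, uint32_t screenH,
               uint32_t tileSize)                     // e.g. 128
{
    const uint32_t tilesX = (screenW + tileSize - 1) / tileSize;
    const uint32_t tilesY = (screenH + tileSize - 1) / tileSize;
    std::vector<std::vector<uint32_t>> tileList(tilesX * tilesY);

    // Step 2: bin each cluster into every tile its screen bounds overlap.
    for (uint32_t i = 0; i < clusters.size(); ++i) {
        const D3D12_RECT r = ProjectObbToScreen(clusters[i]);
        if (r.right <= 0 || r.bottom <= 0 ||
            r.left >= (LONG)screenW || r.top >= (LONG)screenH)
            continue;  // entirely off-screen
        const uint32_t x0 = (uint32_t)Clamp(r.left,       0, (LONG)screenW - 1) / tileSize;
        const uint32_t y0 = (uint32_t)Clamp(r.top,        0, (LONG)screenH - 1) / tileSize;
        const uint32_t x1 = (uint32_t)Clamp(r.right  - 1, 0, (LONG)screenW - 1) / tileSize;
        const uint32_t y1 = (uint32_t)Clamp(r.bottom - 1, 0, (LONG)screenH - 1) / tileSize;
        for (uint32_t ty = y0; ty <= y1; ++ty)
            for (uint32_t tx = x0; tx <= x1; ++tx)
                tileList[ty * tilesX + tx].push_back(i);
    }

    // Step 3: one scissor rect per tile, then draw that tile's clusters.
    for (uint32_t t = 0; t < tilesX * tilesY; ++t) {
        if (tileList[t].empty()) continue;
        const LONG x = (LONG)(t % tilesX) * (LONG)tileSize;
        const LONG y = (LONG)(t / tilesX) * (LONG)tileSize;
        const D3D12_RECT scissor = { x, y, x + (LONG)tileSize, y + (LONG)tileSize };
        cmd->RSSetScissorRects(1, &scissor);
        for (uint32_t i : tileList[t])
            cmd->DrawIndexedInstanced(clusters[i].indexCount, 1,
                                      clusters[i].firstIndex, 0, 0);
    }
}
```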


I guess the original poster is describing standard tiled rendering, usually implemented as an optimization in deferred renderers or Forward+.

He mistakenly believes that meshes need to be subdivided for this approach to work, or that manually breaking up a mesh this way would be faster.

 

It is neither necessary nor faster to break apart your mesh, and it is infeasible for all but the most basic cases, where the meshes never move relative to the screen.

Research tiled deferred rendering optimizations or Forward+. None of this is specific to Direct3D 12.

 

 

L. Spiro


I think Infinisearch is talking about tiled rendering in the mobile sense, where screen-space tiles are used for geometry processing rather than light culling, to fit into a cache. Though now that I look at it again, it does seem to be something about lighting, because a screen buffer is mentioned.


Since all frame-buffer accesses will hit the cache, rendering should be faster, at the cost of

Assuming that frame buffer latency is actually a bottleneck in the first place?

If you're doing any sort of modern/fancy shading, then your shading time per pixel will likely be higher than your frame-buffer write time, so pipelining will make the buffer writes 'free'...

Deferred/tiled (PowerVR-style) triangle binning works well on PowerVR because, instead of performing frame-buffer writes to RAM, the hardware can perform them to a tiny-but-super-fast local storage area (ESRAM etc.) and then later bulk-flush that local storage to RAM.
Implementing the algorithm without the hardware to suit may not be the best idea...

The Xbox 360 has local/fast EDRAM, and the Xbox One and Intel GPUs have local/fast ESRAM, so if you're targeting them you may be able to find some benefit. These local storage areas are likely measured in the tens of megabytes, though, so you could get away with using very large tiles.


One other benefit of PowerVR-style tiling is that the hardware sorts polygons to eliminate overdraw and allow OIT. By sorting your mesh chunks (front to back for opaque and back to front for translucent) you'd gain the same benefits, but with slightly coarser accuracy.
You'd want to perform this sorting on the GPU though, so I'd make use of indirect draws rather than relying on thousands of individual CPU-driven draws.
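
e.g. a trivial CPU-side version of that sort, assuming each cluster stores a precomputed world-space center (a GPU version would build sort keys in a compute shader and feed them to indirect draws instead):

```cpp
#include <algorithm>
#include <vector>

struct ClusterInstance {
    float center[3];   // world-space center of the cluster's bounds
    float viewDepth;   // recomputed each frame
    // ... draw arguments ...
};

// Opaque clusters go front to back (early-z rejects occluded pixels);
// translucent clusters go back to front (correct blending order).
void SortClusters(std::vector<ClusterInstance>& clusters,
                  const float camPos[3], const float camFwd[3],
                  bool translucent)
{
    for (auto& c : clusters) {
        const float d[3] = { c.center[0] - camPos[0],
                             c.center[1] - camPos[1],
                             c.center[2] - camPos[2] };
        c.viewDepth = d[0]*camFwd[0] + d[1]*camFwd[1] + d[2]*camFwd[2];
    }
    std::sort(clusters.begin(), clusters.end(),
              [&](const ClusterInstance& a, const ClusterInstance& b) {
                  return translucent ? a.viewDepth > b.viewDepth
                                     : a.viewDepth < b.viewDepth;
              });
}
```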



I guess the original poster is describing standard tiled rendering, usually implemented as an optimization in deferred renderers or Forward+.
He mistakenly believes that meshes need to be subdivided for this approach to work, or that manually breaking up a mesh this way would be faster.

Nope, I'm not talking directly about tiled deferred or Forward+.

 


It is neither necessary nor faster to break apart your mesh, and it is infeasible for all but the most basic cases, where the meshes never move relative to the screen.

Ubisoft and RedLynx seem to be breaking apart their meshes into clusters for culling purposes, with success; see the GPU-driven rendering pipelines talk here: http://advances.realtimerendering.com/s2015/index.html


Assuming that frame buffer latency is actually a bottleneck in the first place?

If you're doing any sort of modern/fancy shading, then your shading time per pixel will likely be higher than your frame-buffer write time, so pipelining will make the buffer writes 'free'...

Sort of. First off, I just wanted to play around and see if there were gains to be had. Second, I was going to pair it with a two-pass renderer, where the first pass is a visibility pass with no ALU work at all... two variations: z-only, and z + ID.
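
For the z-only variation, the PSO would look something like this (a sketch; root signature, input layout, VS, and the rest of the desc are assumed to be filled in as usual, and D32_FLOAT is just an assumed depth format):

```cpp
#include <d3d12.h>

// Z-only visibility pass: leave the pixel shader null and bind zero
// render targets, so the only per-pixel work is the depth test/write.
D3D12_GRAPHICS_PIPELINE_STATE_DESC MakeZOnlyPsoDesc()
{
    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
    // ... pRootSignature, InputLayout, VS, RasterizerState, SampleMask,
    //     PrimitiveTopologyType, SampleDesc as usual ...
    desc.NumRenderTargets = 0;                      // desc.PS stays null
    desc.DSVFormat        = DXGI_FORMAT_D32_FLOAT;  // assumed depth format
    desc.DepthStencilState.DepthEnable    = TRUE;
    desc.DepthStencilState.DepthWriteMask = D3D12_DEPTH_WRITE_MASK_ALL;
    desc.DepthStencilState.DepthFunc      = D3D12_COMPARISON_FUNC_LESS;
    // The z + id variation would instead bind one small-format target
    // (e.g. DXGI_FORMAT_R32_UINT) plus a trivial PS that writes an ID.
    return desc;
}
```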

 

 

 


Deferred/tiled (PowerVR-style) triangle binning works well on PowerVR because, instead of performing frame-buffer writes to RAM, the hardware can perform them to a tiny-but-super-fast local storage area (ESRAM etc.) and then later bulk-flush that local storage to RAM.
Implementing the algorithm without the hardware to suit may not be the best idea...

I brought up GPU caches in a thread on another forum in relation to this type of rendering, and was informed that on GCN there are dedicated caches called the CB and DB caches; I believe these serve the color and depth buffers. In that thread I was also referred to another thread where somebody tried this and, IIRC, came up with a maximum tile size of 128x128 on GCN before there was a performance drop indicating you were going off-chip.

edit - for a 7970 (128 KB ROP cache)
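
To sanity-check that number, assuming a 32-bit color target: 128 KB holds 128 * 1024 / 4 = 32768 pixels. The largest power-of-two square tile that fits is 128x128 = 16384 pixels, while 256x256 = 65536 would already spill off-chip, so 128x128 lining up with where the performance dropped makes sense.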

 

 

 


One other benefit of PowerVR-style tiling is that the hardware sorts polygons to eliminate overdraw and allow OIT. By sorting your mesh chunks (front to back for opaque and back to front for translucent) you'd gain the same benefits, but with slightly coarser accuracy.
You'd want to perform this sorting on the GPU though, so I'd make use of indirect draws rather than relying on thousands of individual CPU-driven draws.

Yes, I was going to sort front to back for opaques, but I was also going to try back to front to get a feel for the performance deltas. You're right about using indirect draws and doing the tiling and sorting on the GPU, but I wanted to keep it simple at first... I have no experience with indirect draws yet, but like I said, I'll get to them in the future.


Again, this sounds a lot like a bounding volume hierarchy in a way: https://en.wikipedia.org/wiki/Bounding_volume_hierarchy

 

Still not sure what exactly the end purpose of all this is. OIT is nice, but overdraw isn't as much of a concern anymore, except for translucency.

 

What makes you think overdraw isn't much of a concern anymore? I will either do the same thing for alpha blending or use compute-based particles, but I want to see if I can accelerate opaque rendering as well. There are caches in the GPU that provide much faster access than going to memory; by manually trying to render only into these caches, I figure I might get a speedup for non-ALU-bound rendering. In fact, I wonder if the reduction in off-chip accesses will speed up ALU-bound shaders that rely heavily on texture accesses. There is also the concern of non-high-end GPUs and their performance under limited bandwidth. Like I said, it's more of an experiment for now.

 

edit - of course it all depends on the size of the on-chip caches, but for now I just want to see the performance metrics.


So far as I've seen, most rendering is ALU-bound today, so I'm not sure how much benefit will be gained. But heck, what do I know; obviously I've never tried this experiment. Interested to see how it turns out.


So far as I've seen, most rendering is ALU-bound today, so I'm not sure how much benefit will be gained. But heck, what do I know; obviously I've never tried this experiment. Interested to see how it turns out.

 

How much ALU work is there in a z-only pass for Forward+? With deferred, how much ALU does filling the G-buffer take? That is when you are determining visibility. The main problem is that the cache effectively becomes smaller as the deferred G-buffer gets fatter, and small tiles will reduce efficiency.
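
To put a rough number on that, using the 128 KB figure from earlier and assuming depth is cached separately: a 16-byte-per-pixel G-buffer cuts the cacheable footprint to 128 * 1024 / 16 = 8192 pixels, i.e. tiles of at most about 90x90, or 64x64 if you stick to powers of two.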

 

edit - BTW, thank you for mentioning BVHs; I already know about them. The problem is that there are more constraints than just spatial locality: normal "locality", since I want to do clustered backface culling, and cluster size with regard to wave/warp efficiency...
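
To give an idea, this is the kind of greedy grower I have in mind, with all three constraints as knobs. A rough sketch: Dot3() and Neighbors() are placeholder helpers (adjacency would come from a prebuilt shared-edge table), and comparing against the seed's normal is a crude stand-in for tracking a proper normal cone.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct Tri { float centroid[3]; float normal[3]; };  // unit normals

// Placeholder helpers: dot product over float[3], and the triangles
// sharing an edge with triangle t (from a prebuilt adjacency table).
float Dot3(const float a[3], const float b[3]);
const std::vector<uint32_t>& Neighbors(uint32_t t);

// Greedy clustering under the three constraints:
//  - cluster size capped at maxTris (e.g. 64, one GCN wavefront),
//  - spatial locality: centroid within maxDist of the seed's centroid,
//  - normal "locality": dot with the seed normal >= minNormalDot, so the
//    cluster's normal spread stays tight enough for clustered backface culling.
std::vector<std::vector<uint32_t>> BuildClusters(
    const std::vector<Tri>& tris,
    uint32_t maxTris, float maxDist, float minNormalDot)
{
    std::vector<std::vector<uint32_t>> clusters;
    std::vector<bool> used(tris.size(), false);

    for (uint32_t seed = 0; seed < tris.size(); ++seed) {
        if (used[seed]) continue;
        std::vector<uint32_t> cluster(1, seed);
        used[seed] = true;

        // Flood-fill outward across shared edges while constraints hold.
        for (uint32_t i = 0; i < cluster.size() && cluster.size() < maxTris; ++i) {
            for (uint32_t n : Neighbors(cluster[i])) {
                if (used[n] || cluster.size() >= maxTris) continue;
                const float dx = tris[n].centroid[0] - tris[seed].centroid[0];
                const float dy = tris[n].centroid[1] - tris[seed].centroid[1];
                const float dz = tris[n].centroid[2] - tris[seed].centroid[2];
                const bool isClose   = std::sqrt(dx*dx + dy*dy + dz*dz) <= maxDist;
                const bool isAligned = Dot3(tris[n].normal, tris[seed].normal) >= minNormalDot;
                if (isClose && isAligned) { used[n] = true; cluster.push_back(n); }
            }
        }
        clusters.push_back(std::move(cluster));
    }
    return clusters;
}
```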

