dpadam450

Per Triangle Culling (GDC Frostbite)


I came across this presentation, and they talk about using compute for per-triangle culling (standard backface culling, standard Hi-Z). I'm not sure exactly what they mean. Is this meant to completely replace that part of the pipeline? And then, when it comes time to draw, do you just disable backface culling and all the other built-in pipeline stages? I don't get why you would write a compute shader to determine which triangles are visible when the pipeline already does that. Even if you turn that stuff off, is this really that much better? Can you even tell the GPU to turn off Hi-Z culling?

 

Slides 41-44

http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf
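
For context, as I understand those slides, the compute pass reads the index and position buffers, runs cheap per-triangle tests (backface, small-primitive, Hi-Z), and writes the surviving indices into a compacted index buffer that the real draw then consumes. Below is a rough CPU-side sketch of just the backface part, with my own names and a simple NDC-space formulation; it's only an illustration of the idea, not code from the deck.

```cpp
#include <array>
#include <cstdio>

struct Vec2 { float x, y; };

// Returns true if the triangle, given in normalized device coordinates with
// counter-clockwise front faces, is back-facing and can be culled.
bool IsBackFacing(const Vec2& a, const Vec2& b, const Vec2& c)
{
    // Twice the signed area of the projected triangle: a non-positive value
    // means the winding is clockwise on screen, i.e. the triangle faces away.
    float signedArea = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    return signedArea <= 0.0f;
}

int main()
{
    // Tiny usage example: one front-facing and one back-facing triangle.
    std::array<Vec2, 3> front = {{ {0.0f, 0.0f}, {1.0f, 0.0f}, {0.0f, 1.0f} }};
    std::array<Vec2, 3> back  = {{ {0.0f, 0.0f}, {0.0f, 1.0f}, {1.0f, 0.0f} }};
    std::printf("front culled: %d\n", IsBackFacing(front[0], front[1], front[2]));
    std::printf("back  culled: %d\n", IsBackFacing(back[0],  back[1],  back[2]));
}
```

In the real pass this test runs per triangle in a compute shader, and the surviving indices are appended to a new index buffer that an indirect draw then renders with the fixed-function stages left as they are.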

 


Note that on AMD's GCN, the compute shader could be run asynchronously while rendering the shadow maps (which barely occupy the compute units), thus making this pass essentially "free".

 

Given that Nvidia doesn't typically allow async compute, does that mean it wouldn't be useful on Nvidia?

 

It's easy to understand why rendering small triangles is expensive, but this culling process won't be free if it can't overlap other parts of the pipeline, right? I suppose I could see an overall positive benefit if the compute shader only needs position information and can ignore the other attributes, which don't contribute to culling.
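
To illustrate that last point (this layout is just my own sketch, not something from the presentation): if positions are kept in their own stream, the culling dispatch only ever touches a fraction of the vertex data, and the fat attributes are read only for triangles that survive.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical split vertex layout for a compute culling pass.
struct PositionOnly            // read by the culling dispatch (and the draw)
{
    float x, y, z;
};

struct NonPositionAttributes   // read only when shading surviving triangles
{
    float normal[3];
    float tangent[4];
    float uv[2];
};

struct MeshStreams
{
    std::vector<PositionOnly>          positions;
    std::vector<NonPositionAttributes> attributes;
    std::vector<std::uint32_t>         indices;     // input to the culling pass
};
```

The culling pass would read only positions and indices and write a compacted index buffer; nothing in the attribute stream gets pulled through the caches during culling.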

Edited by Dingleberry

Whether it's a net gain or a net loss depends on the scene. Async compute just increases the likelihood of being a net gain.
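
For anyone wondering what "async" means in API terms here: the culling dispatches go on a separate compute queue, so the GPU is at least allowed to overlap them with graphics work. A minimal D3D12-flavoured sketch (assuming a valid device, error handling omitted; whether the overlap actually happens is up to the hardware and driver, as discussed above):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Create a compute-only queue alongside the usual direct (graphics) queue.
// Culling dispatches submitted here may run concurrently with graphics work
// such as shadow-map rendering, on hardware that actually supports it.
ComPtr<ID3D12CommandQueue> CreateAsyncComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}
```

You would still synchronize the two queues with fences so the draw doesn't consume the compacted index buffer before the culling dispatch has finished.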

It increases that likelihood by a lot, unfortunately, and Nvidia's release this year doesn't seem likely to change async support. Still, it's generally not going to be a loss, so it's not like you'd even have to disable it in an Nvidia-specific package.

Edited by Frenetic Pony

Currently, nVidia's hardware straight up cannot support async compute, at least in the sense most people mean by the term. Cool guide here; the tl;dr is that the nVidia compute queue implementation doesn't support resource barriers and as such cannot implement the current DX12/Vulkan spec.

Edited by InvalidPointer


I don't think per-triangle culling is worth it... per-cluster culling, on the other hand, is something I think most engines should implement. It's too bad the benchmarks in the slides don't include any cluster-culling results.

 

Oh, and as to why you'd do it at all, somebody on Beyond3D recently posted this link:

http://www.hardware.fr/articles/928-4/performances-theoriques-geometrie.html

 

It seems AMD doesn't see a performance improvement (well, at least compared to Nvidia) from backface culling in the standard graphics pipeline, so per-triangle culling might be useful on AMD hardware. But per-triangle culling requires you to eat up a lot of bandwidth (comparable to a whole extra pass) to accomplish it, so I still wonder if it's worth it.
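
For what it's worth, the kind of per-cluster test I mean looks roughly like this (my own sketch with made-up names, not code from any particular engine): each cluster of a few dozen to a few hundred triangles stores a bounding sphere and a normal cone built offline, and a whole cluster can be rejected with one cheap test instead of testing every triangle.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  Sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float Length(const Vec3& v)             { return std::sqrt(Dot(v, v)); }

// Per-cluster data built offline when the mesh is split into clusters.
struct Cluster
{
    Vec3  boundsCenter;   // bounding sphere around the cluster's triangles
    float boundsRadius;
    Vec3  coneAxis;       // unit-length average facing direction
    float coneCutoff;     // precomputed from the normal cone's half-angle
};

// Conservative test: returns true only if every triangle in the cluster is
// guaranteed to face away from the camera, so the whole cluster can be
// skipped before any per-triangle work (or drawing) happens.
bool ClusterIsBackFacing(const Cluster& c, const Vec3& cameraPos)
{
    Vec3  toCluster = Sub(c.boundsCenter, cameraPos);
    float dist      = Length(toCluster);
    return Dot(toCluster, c.coneAxis) >= c.coneCutoff * dist + c.boundsRadius;
}
```

The same bounding sphere can also be reused for frustum and coarse Hi-Z tests, which is part of why cluster culling tends to be cheap relative to what it saves.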

Eh?

There are slides in that very deck (83-85) which show that per-triangle culling is certainly worth it - yes, it has a very GCN focus, but that's what happens when the consoles all use the same GPU architecture.

As to backface culling: of course NV shows a greater speedup vs AMD - their hardware is set up to process more triangles per clock than AMD's, so they can also cull more per clock. (AMD, on the other hand, has focused more on compute and async functionality, which in the longer term could be the smarter move.)

So you are probably right: if we could get this working with async compute on NV hardware, you might not see the same improvement (or maybe you would; fewer triangles to set up is fewer, after all), but given the lack of async compute support on NV hardware, that isn't likely to happen for a while... (And from what I've been hearing, the next chip isn't going to fix that problem either; keep an eye on NV PR - if they go full anti-async spin, more than they already have of course, then we'll know...)


If you look at the performance slides (pages 84-85), it says no cluster culling and no tessellation were used. I'm saying I think cluster-only culling would be better than per-triangle, but that's just my guess.

 

edit - the page 85 results have tessellation enabled.

Edited by Infinisearch
