Per Triangle Culling (GDC Frostbite)


I came across this presentation and they are talking about using compute for per-triangle culling (standard backface culling, standard Hi-Z). I'm not sure what exactly they are talking about. Is this meant to rewrite that part of the pipeline completely? And then when it comes time to draw, just disable backface culling and all the other built-in pipeline stages? I'm not getting why you would write a compute shader to determine which triangles are visible when the pipeline does that already. Even if you turn that stuff off, is this really that much better? Can you even tell the GPU to turn off Hi-Z culling?

Slides 41-44

http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


The point is that the ALU processing capabilities far exceed those of the fixed-function triangle setup and rasterizer. Using compute you can prune the set of triangles to get rid of the ones that don't lead to any shaded pixels and would therefore be discarded anyway. It's purely there to get a bit more performance.
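To make that concrete, here's a rough C++ sketch of the two cheapest per-triangle tests such a pass can run: a screen-space backface test and a "covers no pixel center" small-triangle test. On the GPU this would be a compute shader with one thread per triangle, appending the indices of surviving triangles to a compacted index buffer. The names and conventions below are made up for illustration; the presentation's actual tests operate on homogeneous clip-space coordinates rather than post-divide screen positions.

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

// Twice the signed area of the projected triangle. A value <= 0 means the
// triangle is back-facing (counter-clockwise front-face convention) or degenerate.
static float SignedAreaTimes2(Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Returns true if the triangle can be dropped before it ever reaches the
// fixed-function rasterizer. Inputs are vertex positions already transformed
// to pixel coordinates, with pixel centers at half-integer positions.
bool CullTriangle(Vec2 a, Vec2 b, Vec2 c)
{
    // 1) Backface / zero-area cull.
    if (SignedAreaTimes2(a, b, c) <= 0.0f)
        return true;

    // 2) Small-triangle cull: if the screen-space bounding box encloses no
    //    pixel center on either axis, the rasterizer would emit no pixels.
    const float minX = std::min({a.x, b.x, c.x});
    const float maxX = std::max({a.x, b.x, c.x});
    const float minY = std::min({a.y, b.y, c.y});
    const float maxY = std::max({a.y, b.y, c.y});
    return std::round(minX) == std::round(maxX) ||
           std::round(minY) == std::round(maxY);
}
```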

Pixel shaders are run in 2x2 blocks.

When a triangle covers only one pixel, the pixel shader wastes up to 75% of its work on helper ("dummy") pixels. If that triangle also happens to be occluded by a bigger triangle in front of it, even that work was unnecessary.
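To put a number on that: the hardware shades pixels in 2x2 quads, so every covered pixel of a tiny triangle can drag idle helper lanes along with it. A trivial bit of arithmetic (mine, not from the slides):

```cpp
// A triangle covering 'coveredPixels' pixels spread over 'quadsTouched'
// 2x2 quads launches 4 * quadsTouched pixel-shader lanes, of which only
// 'coveredPixels' produce a visible result. A one-pixel triangle
// (1 pixel, 1 quad) therefore wastes 3 out of 4 lanes, i.e. 75%.
double WastedLaneFraction(int coveredPixels, int quadsTouched)
{
    const int lanesLaunched = 4 * quadsTouched;
    return 1.0 - static_cast<double>(coveredPixels) / lanesLaunched;
}
```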

Single-pixel triangles are also a PITA for concurrency, since there might not be enough pixels gathered to occupy a full wavefront; and even when there are, divergence can be higher (since there may be little spatial coherence between those pixels).

Also, triangles that occupy no pixels at all (because they're too tiny) can, if there are too many of them, drown the rasterizer and starve the pixel shader.

Due to this, GPUs love triangles covering large areas and hate super small triangles. Note that on AMD's GCN, the compute shader could be run async while rendering the shadow maps (which barely occupy the compute units), thus making this pass essentially "free".

It's like a Z prepass but on steroids: instead of only populating the Z-buffer for the second pass, it also removes useless triangles (lowering vertex shader usage and alleviating the rasterizer).

So yeah... it's a performance optimization.


Note that on AMD's GCN, the compute shader could be run async while rendering the shadow maps (which barely occupy the compute units), thus making this pass essentially "free".

Given that nvidia doesn't typically allow async compute, does that mean it wouldn't be useful on nvidia?

It's easy to understand why rendering small triangles is expensive, but this culling process won't be free if it can't overlap other parts of the pipeline, right? I suppose I could see an overall positive benefit if the compute shader only needs position information and can ignore the other attributes, which won't contribute to culling?
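That's roughly the idea: if positions live in their own stream, the culling pass never touches the heavier attributes. A hypothetical split layout (not Frostbite's actual one) could look like this:

```cpp
#include <cstdint>

// The compute culling pass reads only the tightly packed position stream
// (12 bytes per vertex); the full attribute stream is fetched only for the
// triangles that survive and get drawn for real.
struct PositionStreamVertex {
    float position[3];
};

struct AttributeStreamVertex {
    float    normal[3];
    float    uv[2];
    uint32_t color;
};
```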



Whether it's a net gain or a net loss depends on the scene. Async compute just increases the likelihood of being a net gain.



By a lot, unfortunately, and Nvidia's release this year doesn't seem likely to change async support. Still, it's generally not going to be a loss, so it's not like you'd even have to disable it in an Nvidia-specific package.



Currently nVidia's hardware straight up cannot support async compute, at least in the sense most people think of the term. Cool guide here, and the tl;dr is that the nVidia compute queue implementation doesn't support resource barriers and as such cannot implement the current DX12/Vulkan spec.
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.

I don't think per-triangle culling is worth it; per-cluster culling, on the other hand, is something I think most engines should implement. It's too bad the benchmarks in the slides don't include any cluster culling results.
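For reference, cluster culling works on groups of triangles (say 64-256) with precomputed bounds, so one cheap test can reject many triangles at once. Below is a minimal sketch of the simplest such test, a bounding sphere against the view frustum; real implementations, including the one in the slides, also add a normal-cone backface test and Hi-Z occlusion. The types and names here are made up:

```cpp
struct Plane  { float nx, ny, nz, d; };       // normalized, normal points into the frustum
struct Sphere { float cx, cy, cz, radius; };  // precomputed per cluster

// Returns true if the whole cluster can be skipped because its bounding
// sphere lies entirely on the outside of at least one frustum plane.
bool FrustumCullCluster(const Sphere& s, const Plane (&planes)[6])
{
    for (const Plane& p : planes)
    {
        const float dist = p.nx * s.cx + p.ny * s.cy + p.nz * s.cz + p.d;
        if (dist < -s.radius)
            return true;   // completely outside this plane: cull the cluster
    }
    return false;          // potentially visible: leave it to the finer tests
}
```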

Oh, and as to why you'd do it, somebody on Beyond3D recently posted this link:

http://www.hardware.fr/articles/928-4/performances-theoriques-geometrie.html

It seems AMD doesn't see a performance improvement (well, at least compared to nvidia) from backface culling in the standard graphics pipeline, so per-triangle culling might be useful on AMD hardware. But per-triangle culling will require you to eat up a lot of bandwidth (comparable to a whole extra pass) to accomplish it, so I still wonder if it's worth it.
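For a feel for the bandwidth being talked about, here's a back-of-the-envelope estimate with made-up numbers (one million triangles, 32-bit indices, float3 positions, worst case where every triangle survives):

```cpp
#include <cstdio>

int main()
{
    const double triangles       = 1.0e6;
    const double indexReadBytes  = 3 * 4;   // three 32-bit indices read per triangle
    const double posReadBytes    = 3 * 12;  // three float3 positions read per triangle
    const double indexWriteBytes = 3 * 4;   // compacted indices written (worst case: all pass)

    const double totalBytes = triangles * (indexReadBytes + posReadBytes + indexWriteBytes);
    std::printf("~%.1f MB of extra traffic per culling pass\n",
                totalBytes / (1024.0 * 1024.0));
    // Vertex reuse and 16-bit indices bring the real number down, but it is
    // still roughly "read the geometry one extra time", which is the concern above.
    return 0;
}
```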

-potential energy is easily made kinetic-

Eh?

There are slides which show that per-triangle culling is certainly worth it in that very deck (83-85). Yes, it has a very GCN focus, but that's what happens when the consoles all use the same GPU arch.

As to backface culling: of course NV shows a greater speed-up vs AMD; their hardware is set up to process more triangles per clock than AMD's, so they can also cull more per clock. (AMD, on the other hand, has focused more on compute and async functionality, which in the longer term could be the smarter move.)

So you are probably right: if we could get this working with async compute on NV hardware you might not see the same improvement (or maybe you would; fewer triangles to set up is less work, after all), but given the lack of async compute support on NV hardware that isn't likely to happen for a while... (And from what I've been hearing, the next chip isn't going to fix that problem either; keep an eye on NV PR: if they go full anti-async spin, more than they have already of course, then we'll know...)

If you look at the performance slides (pages 84-85), it says no cluster culling and no tessellation was used. I'm saying I think cluster-only culling would be better than per-triangle culling, but that's just my guess.

edit - the page 85 results have tessellation enabled.

-potential energy is easily made kinetic-

