Overhead of Using Degenerate Triangles

Started by Volzotan · 6 comments, last by Volzotan 11 years ago

Hi everyone,

I have been wondering how big the overhead of using degenerate triangles with indexed triangle lists is.

I've been asking around on the NVIDIA DevZone forums, but haven't gotten any reply:

https://devtalk.nvidia.com/default/topic/534530/general-graphics-programming/overhead-for-degenerate-triangles/?offset=2#3761674

I also saw that old GD thread, which was interesting, but did not give me a definite answer:

http://www.gamedev.net/topic/221281-render-cost-of-degenerate-triangles/

The situation is the following:

- Render a list of indexed triangles

- Do some fancy stuff in a vertex shader

- After the vertex shader, some vertices will be located at the same world position, meaning there will be some degenerate tris

My Question:

Assume I knew beforehand which triangles will become degenerate, and I could exclude them from rendering.

How big do you think the speedup would be in that case?

Please note that the indices of the coinciding vertices might still be different, so the GPU should not be able to discard those triangles before the vertex processing stage; it still has to transform each vertex before it can find out which triangles are degenerate. By the way, does the GPU even detect this in that case? Does anyone have a reference where a GPU manufacturer explains how the filtering of degenerate triangles works and when it is applied?

Any help is appreciated. Thanks a lot in advance!

Best,

Volzotan


Degenerate triangles have zero area, so they're not rasterized; this is explained in some detail here: http://fgiesen.wordpress.com/2011/07/05/a-trip-through-the-graphics-pipeline-2011-part-5/

This is actually part of both the OpenGL and D3D specifications: the rasterization rule is that pixels whose center is inside a triangle get shaded, and by definition nothing can be inside a zero-area triangle.
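To illustrate (a minimal, API-free C++ sketch with made-up coordinates): the coverage test a rasterizer performs is built on the same signed-area cross product, and it can never pass for a zero-area triangle.

#include <cstdio>

// Twice the signed area of triangle (a, b, c) in 2D screen space; this is the
// cross product the rasterizer's edge/coverage tests are built from.
static float twiceSignedArea(float ax, float ay, float bx, float by,
                             float cx, float cy)
{
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
}

int main()
{
    // Regular triangle: non-zero area, so pixel centers can fall inside it.
    printf("regular:    %f\n", twiceSignedArea(0, 0, 4, 0, 0, 3));  // 12.0

    // Degenerate triangle (two coincident vertices): zero area, so the
    // "pixel center inside" test can never pass and nothing gets shaded.
    printf("degenerate: %f\n", twiceSignedArea(0, 0, 4, 0, 4, 0));  // 0.0
    return 0;
}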

Removing such triangles before the vertex shader runs is one of those ideas that seems nice in principle, but when you think about it a little, it actually makes more sense to just let the GPU plough on. You've touched on one of the reasons in your question: how do you know which triangles? Assuming, based on your question, that you're talking about triangles that are non-degenerate before the VS/etc. stages run but become degenerate afterwards (due to transforming, clipping, etc.), the only way to know for sure is to run the entire pipeline up to that stage, emulated in software and in a manner that's invariant with the GPU, for every triangle. So you've just doubled your workload and wiped out the benefits of hardware T&L, all for a theoretical gain that you may not even get.

Now, about that theoretical gain: one huge disadvantage that arises is that it's no longer possible for you to keep any model data in static vertex buffers. Instead you're going to need to re-send all of your scene geometry to the GPU every frame. Of course you could implement some caching scheme to avoid a resend, but that's more work, and it will result in uneven performance between the frames where you do send and the frames where you don't.
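To make the static-versus-streamed difference concrete, a minimal OpenGL sketch (the function names, buffer handles and sizes are just placeholders):

// Assumes a current OpenGL context and the usual GL headers;
// the buffer handles come from glGenBuffers elsewhere.
void uploadStatic(GLuint staticVbo, const void* vertices, GLsizeiptr vertexBytes)
{
    // Uploaded once at load time; the driver can keep this in GPU memory for good.
    glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
    glBufferData(GL_ARRAY_BUFFER, vertexBytes, vertices, GL_STATIC_DRAW);
}

void uploadEveryFrame(GLuint dynamicVbo, const void* rebuiltVertices, GLsizeiptr vertexBytes)
{
    // What CPU-side filtering of degenerates would force on you: rebuilding the
    // vertex data and re-uploading it every single frame.
    glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
    glBufferData(GL_ARRAY_BUFFER, vertexBytes, rebuiltVertices, GL_DYNAMIC_DRAW);
}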

Slightly more insidious is that the driver and/or hardware may be implementing its own caching scheme to accomplish a similar result, so you'd also disrupt that. This would, of course, be vendor/hardware/driver dependent.

So the summary is - don't. Your GPU is good at this kind of operation and will be substantially faster if you just let it plough on than if you try to do anything fancy that runs counter to the way it's been designed to work.


Rendering is almost always fragment shader or ROP bound (or geometry shader bound, if there's enough output), and only very rarely vertex shader bound. And fragment shader work, as mhagain pointed out, doesn't apply here, since degenerate triangles are never rasterized.

Therefore, you can pretty much consider the cost "zero".

This is all the more true because degenerate triangles reuse the vertex indices of adjacent non-degenerate triangles. That means those vertices will be transformed, go into the post-transform cache, and be reused. You transform them once, and you would have needed to transform them once anyway.

So the only real cost is 3 extra indices in the index buffer, which is negligible both memory- and bandwidth-wise.
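A tiny indexed-triangle-list sketch of that point (the indices are made up):

// Two triangles sharing an edge, plus one degenerate triangle that only
// repeats indices the neighbouring triangles already reference.
const unsigned int indices[] = {
    0, 1, 2,    // triangle A
    2, 1, 3,    // triangle B
    2, 3, 3,    // degenerate: repeated index -> zero area, never rasterized
};

// Vertices 2 and 3 were already transformed for triangles A and B, so the
// degenerate triangle is served from the post-transform cache; the only real
// cost is the three extra indices sitting in the buffer.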

Please note that the indices of the coinciding vertices might still be different, so the GPU should not be able to discard those triangles before the vertex processing stage; it still has to transform each vertex before it can find out which triangles are degenerate.

If you have an unusual situation where this is particularly common, then you might find you're better off using a triangle list instead of a triangle strip. Triangle lists can be optimized to make better use of the post-transform cache and can end up faster than strips even on fairly strip-friendly geometry, so if you have some special case that makes your geometry strip-unfriendly, I would imagine you'll get better performance with a triangle list. I'd echo the other posters, though: in the grand scheme of things it's unlikely to make much difference.
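In OpenGL terms the switch is just the primitive type of the draw call; a rough sketch, assuming the buffers are already bound (the count parameters are placeholders):

// Assumes a current OpenGL context with vertex and index buffers already bound.
void drawAsStrip(GLsizei stripIndexCount)
{
    // Strip version: index order is dictated by connectivity, and degenerate
    // triangles are needed to stitch separate strips together.
    glDrawElements(GL_TRIANGLE_STRIP, stripIndexCount, GL_UNSIGNED_INT, 0);
}

void drawAsList(GLsizei listIndexCount)
{
    // List version: three indices per triangle, but the ordering is entirely
    // yours, so a vertex-cache optimizer (e.g. Tom Forsyth's) can reorder it
    // for better post-transform reuse.
    glDrawElements(GL_TRIANGLES, listIndexCount, GL_UNSIGNED_INT, 0);
}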

Another point of view is accounting for wasted vertex processing. If every vertex, degenerate or important, gets the same moderate amount of processing (transforming, a few interpolations and texture lookups, etc.), adding x% useless vertices to the real geometry is an x% load increase, which up to a certain point is free.

Anything you do to avoid processing degenerate geometry needs to cost less than x% of clean geometry processing to have a chance of being useful; testing every triangle for degeneracy, even if the cost of rebuilding buffers could be avoided, appears quite out of the question.
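A back-of-the-envelope version of that break-even argument, with purely illustrative numbers:

#include <cstdio>

int main()
{
    // Fraction of extra, useless vertex work caused by the degenerates.
    const float x = 0.23f;              // e.g. 23% of the triangles are degenerate

    // Relative per-frame cost of detecting/filtering them on the CPU instead
    // (pipeline emulation, buffer rebuilds, re-uploads); a made-up figure.
    const float c = 0.50f;

    const float leaveThemIn = 1.0f + x; // GPU transforms everything, discards degenerates for free
    const float filterOnCpu = 1.0f + c; // filtering only wins if c < x

    printf("leave in: %.2f   filter on CPU: %.2f\n", leaveThemIn, filterOnCpu);
    return 0;
}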


The main thing, though, is that it's a completely pointless exercise. The GPU is already going to do this anyway, so all the proposal involves is repeating calculations that will already be done. There may be a minor saving in bandwidth and vertex processing (although the degenerate verts are quite likely to already be in the cache anyway, so the latter saving is nowhere near as big a deal as one might think), but at the expense of having to rebuild the vertex/index buffers.


Hey,

thanks for your quick and detailed replies!

Now, about that theoretical gain: one huge disadvantage that arises is that it's no longer possible for you to keep any model data in static vertex buffers. Instead you're going to need to re-send all of your scene geometry to the GPU every frame. Of course you could implement some caching scheme to avoid a resend, but that's more work, and it will result in uneven performance between the frames where you do send and the frames where you don't.

I'm not going to do such fancy things. Please excuse me for not being able to go into detail here, but let me clarify this a little bit:

The situation is simply that we know beforehand that, for our particular case, after a certain transformation in the vertex shader we have, let's say, 50% of the triangles becoming degenerate, and at the same time they are located at the back of our vertex and index buffers. So we could just ignore them in our draw call, with effectively zero overhead.
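Roughly like this, assuming OpenGL and an index buffer pre-sorted so the soon-to-be-degenerate triangles sit at the end (the function and count names are just for illustration):

// Assumes a current OpenGL context with the VBO/IBO already bound; the index
// buffer is pre-sorted so all soon-to-be-degenerate triangles occupy the tail.
void drawWithoutDegenerates(GLsizei nonDegenerateIndexCount)
{
    // Skipping the degenerates is nothing more than a shorter draw call.
    glDrawElements(GL_TRIANGLES, nonDegenerateIndexCount, GL_UNSIGNED_INT, 0);
}

// Rendering everything, degenerates included, would simply use the full index count.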

Of course, having the data re-organized this way (only once, during preprocessing!) instead of ordering it with a cache optimizer implies a certain overhead itself, as it potentially limits cache performance.

Please let us also assume that our application is vertex bound, e.g. because we have a large laser-scanned model that is tessellated very regularly with many small triangles, and we use a moderately sized viewport, instead of having a high-resolution viewport and optimized low-poly game models.

So, if I get you right, I can still expect a performance gain (-> vertex bound, 50% less vertex processing) by limiting my draw call to non-degenerate triangles, but in order to evaluate whether it's worth the effort, I have to compare my method with its re-organized data layout against a cache-optimized variant that renders all triangles and uses the GPU to discard degenerate ones, right? :-)

OK, that makes more sense and sounds about right. If it's purely a preprocessing step and everything is arranged accordingly, then yeah, you're going to get better performance under the constraints you mentioned.

Another option you might consider is to make further use of indexing: keep the cache-optimized vertex buffer, but keep two index buffers around, one of them simply omitting the indices for the degenerates. There would be some extra memory overhead from having two index buffers, but it could be a worthwhile performance gain.
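Something along these lines, keeping a single cache-optimized vertex buffer and switching between two index buffers (all names here are placeholders):

// Assumes a current OpenGL context; one vertex buffer, cache-optimized once
// during preprocessing, and two index buffers referencing it.
void draw(bool skipDegenerates,
          GLuint fullIbo, GLsizei fullIndexCount,
          GLuint reducedIbo, GLsizei reducedIndexCount)
{
    if (skipDegenerates)
    {
        // Same vertices, but the indices of the degenerate triangles are omitted.
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, reducedIbo);
        glDrawElements(GL_TRIANGLES, reducedIndexCount, GL_UNSIGNED_INT, 0);
    }
    else
    {
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, fullIbo);
        glDrawElements(GL_TRIANGLES, fullIndexCount, GL_UNSIGNED_INT, 0);
    }
}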


Okay, here's a first result. Rendering with 77% of the triangles (by ignoring the last 23%, which are degenerate) gave me 110 fps instead of 90 fps (which I get if I render 100% of the triangles). So yes, there's indeed a speedup.

However, I still have to compare that particular example against the cache-optimized alternative mhagain also mentioned.

