Thanks for your quick and detailed replies!
Now, about that theoretical gain, one huge disadvantage of this that now arises is that it's no longer possible for you to keep any model data in static vertex buffers. Instead you're going to need to re-send all of your scene geometry to the GPU every frame. Of course you could implement some caching schemes to avoid a resend, but that's more work and will result in uneven performance between the frames where you do send and the frames where you don't.
I'm not going to do such fancy things. Please excuse me for not being able to go into detail here, but let me clarify this a little bit:
The situation is simply that we know beforehand that, for our particular case, after a certain transformation in the vertex shader, we have - let's say - 50% of the triangles being degenerate, and at the same time being located at the back of our vertex and index buffers. So we could just leave them out of our draw call, with essentially zero overhead.
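To make the idea a bit more concrete, here is a minimal preprocessing sketch in Python. It is not our actual code: the function name `partition_degenerate_last` and the predicate `is_degenerate` are hypothetical, and the toy predicate below (a repeated index) only stands in for "known to become degenerate after the vertex shader transformation".

```python
def partition_degenerate_last(indices, is_degenerate):
    """Stable-partition a triangle list so degenerate triangles come last.

    indices:       flat list of vertex indices, three per triangle.
    is_degenerate: hypothetical predicate on an index triple; in practice this
                   would encode what is known at preprocessing time.
    Returns (reordered_indices, draw_index_count).
    """
    triangles = [tuple(indices[i:i + 3]) for i in range(0, len(indices), 3)]
    kept = [t for t in triangles if not is_degenerate(t)]
    dropped = [t for t in triangles if is_degenerate(t)]
    reordered = [i for t in kept + dropped for i in t]
    # Only the first 3 * len(kept) indices need to be submitted for drawing.
    return reordered, 3 * len(kept)


# Toy data: a repeated index marks a triangle as degenerate (an assumption
# made purely for this illustration).
indices = [0, 1, 2,  2, 2, 3,  1, 3, 2,  4, 4, 4]
reordered, draw_count = partition_degenerate_last(
    indices, lambda t: len(set(t)) < 3
)
print(reordered)   # [0, 1, 2, 1, 3, 2, 2, 2, 3, 4, 4, 4]
print(draw_count)  # 6
```

The returned `draw_count` would then be passed as the index count of the indexed draw call (for example the `count` argument of `glDrawElements`, if OpenGL is used), so the trailing degenerate triangles are never submitted.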
Of course, having the data re-organized this way (only once, during preprocessing!) instead of ordering it with a cache optimizer has a cost of its own, as it potentially reduces vertex cache efficiency.
Please let us also assume that our application is vertex bound, e.g. because we have a large laser-scanned model, which is tessellated very regularly with many small triangles, and we use a moderately-sized viewport, instead of having a high-resolution viewport and optimized low-poly game models.
So, if I get you right, I can still expect a performance gain (-> vertex bound, 50% less vertex processing) by limiting my draw call to non-degenerate triangles, but in order to evaluate whether it's worth the effort, I have to compare my method with its re-organized data layout against a cache-optimized variant that renders all triangles and uses the GPU to discard degenerate ones, right? :-)
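For a rough first estimate of that comparison before measuring on real hardware, one could count vertex shader invocations under a simple cache model. The sketch below assumes a FIFO post-transform cache of a fixed size. This is a simplification that actual GPUs may not follow, and the function name `vertex_shader_runs`, the cache size and the index lists are made up for illustration.

```python
from collections import deque


def vertex_shader_runs(indices, cache_size=16):
    """Count misses of a FIFO post-transform vertex cache model.

    Each miss stands for one vertex shader invocation. The FIFO policy and
    cache_size are assumptions; real hardware may behave differently.
    """
    cache = deque(maxlen=cache_size)  # oldest entry is evicted when full
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
    return misses


# Variant A: cache-friendly order, all four triangles submitted
# (the degenerate ones would be discarded later by the GPU).
all_triangles = [0, 1, 2,  2, 1, 3,  2, 3, 4,  4, 3, 5]

# Variant B: re-organized order, only the two non-degenerate triangles
# (here, arbitrarily, the first and third) are submitted.
kept_only = [0, 1, 2,  2, 3, 4]

print(vertex_shader_runs(all_triangles, cache_size=4))  # 6
print(vertex_shader_runs(kept_only, cache_size=4))      # 5
```

In this toy case, halving the triangle count saves only one of six vertex shader runs, because the skipped triangles share most of their vertices with the kept ones. How large the saving is for a real model depends on the data, which is exactly why the two variants need to be compared.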