How to "discard" vertices

15 comments, last by Brisco 16 years, 10 months ago
Hello! I'm trying to implement some sort of particle system on the GPU and have run into a problem: the particle data is read from a texture in the vertex shader, and some vertices should be "removed" because they are inactive. I thought it would be enough to write positions outside the clipping planes, but the framerate stays constant, even when not a single vertex is inside the frustum. Is there a better way to remove billboards that should not be processed, without modifying the vertex buffer itself?
Well, the problem is that you're discarding the vertices too late in the pipeline. Moving a polygon outside the clipping planes will save you some fillrate, but it won't save you the processing cost of the polygon itself.

Let's say you have 5000 particles. The vertex shader processes all 5000 of them and moves every one out of the view frustum. No pixels are modified, but the vertices have already gone through the vertex stage, only to be thrown away at clipping.

I think one solution to this problem would be to use geometry shaders: they'll let you discard particles completely, so that they never reach the rest of the pipeline. I'm not sure, though; I haven't had any direct experience with Direct3D 10.

NextWar: The Quest for Earth available now for Windows Phone 7.
Quote:Original post by Brisco
[snip]I thought it would be enough to write position values which are out of the clipping planes, but the framerate remains still constant, even if not a single vertex is within the frustum.[/snip]


Are you sure that the GPU is the bottleneck for your application? If the CPU is the bottleneck, then the frame rate will remain roughly constant regardless of what you do on the GPU. If the bottleneck is on the GPU, have you profiled which part of it is? For example, could the bandwidth needed to read your vertex texture be the bottleneck?


You can't add or remove vertices in a vertex shader as such. What you can do, though, is produce degenerate triangles (triangles with an area of 0) and move them into the guardband or completely beyond it (experiment to see which works better). That should make most of the cost from triangle setup onwards very cheap.
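A rough CPU-side model of the degenerate-triangle idea, written in C++ so it can be checked on its own (HLSL cannot be run standalone): every corner of an inactive billboard is snapped to the same clip-space point, so its triangles have zero area and lie outside the clip volume. The threshold of 0.5 matches the value Brisco mentions later in the thread; the names and the off-screen point (2, 2, 2, 1) are assumptions, and whether that point should sit inside or beyond the guardband is exactly what the post suggests experimenting with.

```cpp
#include <cassert>

// Hypothetical CPU-side model of per-vertex logic a vertex shader could use
// to "discard" an inactive billboard (not the poster's actual shader).
struct Float4 { float x, y, z, w; };

// Assumed shared clip-space position for every corner of an inactive
// billboard. x/w = 2 lies outside the clip volume, and because all corners
// coincide, every triangle built from them has zero area (degenerate).
constexpr Float4 kCulledPosition{2.0f, 2.0f, 2.0f, 1.0f};

// Returns the clip-space position to output for one billboard corner.
Float4 outputPosition(const Float4& clipPos, float density, float threshold) {
    assert(clipPos.w != 0.0f);
    if (density < threshold) {
        return kCulledPosition;  // inactive: collapse to the shared point
    }
    return clipPos;              // active: keep the transformed position
}

// Twice the signed area of a triangle after the perspective divide (x/w, y/w).
float doubleArea(const Float4& a, const Float4& b, const Float4& c) {
    const float ax = a.x / a.w, ay = a.y / a.w;
    const float bx = b.x / b.w, by = b.y / b.w;
    const float cx = c.x / c.w, cy = c.y / c.w;
    return (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
}
```

Note that the vertex shader still runs for every corner; only the work after it becomes cheap, which is why the next paragraph matters.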

BUT not drawing a few triangles won't help you at all if you're CPU-bound, and won't help significantly if you're bound by vertex processing (including the texture fetch).

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

Thank you for your answers!

Quote:Moving a polygon out of the clipping plane will save you some fillrate, but won't save you the processing cost for the polygon itself


Are you sure about that? The texture fetch is the first thing that happens in the vertex shader, and if the texture value is, e.g., less than 0.5f, the vertex is not even transformed. But I guess fillrate is the most critical factor in my application, because the billboards are used for cloud rendering, there are a few hundred thousand of them, and all of them need to be rendered with blending and without a depth test. It's just an experiment for my diploma thesis and it may turn out to be too slow, but volume raycasting, even with all possible optimizations, seems to be slower still.

Quote:Are you sure that the GPU is the bottleneck for your application?


No, absolutely not. For test purposes I'm not doing any VFD test at the moment, so all 16 layers, each with 256x256 particles, are rendered. The cloud volume is generated entirely in the pixel shader and written to a 2D texture that stores all the slices. For rendering I use a vertex buffer with 32x32 billboards laid out in the xz-plane. Each voxel of the volume is read in the vertex shader: if there is a cloud particle at that position, the billboard is transformed correctly; if there isn't, the billboard is moved out of the frustum so that it isn't rendered. This way I can always use the same vertex buffer of 32x32 billboards, and only the texture fetch decides whether a billboard is rendered or not.

Because I'm not doing any VFD test or empty-space skipping yet, the vertex buffer is rendered 8x8x16 times, which means 1024 draw calls. (I know that's too many, but as I said, it's just for test purposes, and I've already thought about some ways to reduce the number of calls.)
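The call count follows directly from the numbers in the two paragraphs above: 16 layers of 256x256 particles give 1,048,576 billboards, and a 32x32 tile covers 1024 of them. A small C++ sketch of that arithmetic (the constants come from the post; the function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Numbers from the post: 16 layers of 256x256 particles, one billboard each.
constexpr std::uint32_t kLayers = 16;
constexpr std::uint32_t kLayerSide = 256;

constexpr std::uint32_t totalBillboards() {
    return kLayers * kLayerSide * kLayerSide;  // 16 * 65,536 = 1,048,576
}

// Draw calls needed when one vertex buffer holds tile x tile billboards and
// is drawn once per tile of every layer (no culling, no empty-space skipping).
std::uint32_t drawCalls(std::uint32_t tile) {
    assert(tile > 0 && kLayerSide % tile == 0);  // tiles must cover a layer exactly
    const std::uint32_t tilesPerSide = kLayerSide / tile;  // 8 when tile == 32
    return tilesPerSide * tilesPerSide * kLayers;          // 8 * 8 * 16 = 1024
}
```

With 64x64 and 128x128 tiles the same function gives 256 and 64 calls, which are the configurations tried later in the thread.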

Quote:For example could the bandwidth to transfer your vertex texture be the bottleneck?


Well, I already replaced the tex2Dlod instruction with a constant value and gained only about 5 frames per second, so I don't think that is the bottleneck.

The reason I posted the question is that I find it really hard to believe the framerate is nearly the same whether I draw about one million alpha-blended billboards with the depth test disabled, or just issue the draw calls while the vertex shader transforms every vertex out of the frustum. In my opinion the application should be fillrate-limited, but that doesn't seem to be the case.
Quote:Original post by Brisco
Are you sure about that? The texture fetch is the first thing that happens in the vertex shader, and if the texture value is, e.g., less than 0.5f, the vertex is not even transformed. But I guess fillrate is the most critical factor in my application, because the billboards are used for cloud rendering, there are a few hundred thousand of them, and all of them need to be rendered with blending and without a depth test. It's just an experiment for my diploma thesis and it may turn out to be too slow, but volume raycasting, even with all possible optimizations, seems to be slower still.

Vertex texture fetch is not particularly quick on most cards. You'll probably find that the cost of the fetch is the primary cost in the shader. Skipping the transform will save next to nothing.

Quote:Because I'm not doing any VFD test or empty-space skipping yet, the vertex buffer is rendered 8x8x16 times, which means 1024 draw calls. (I know that's too many, but as I said, it's just for test purposes, and I've already thought about some ways to reduce the number of calls.)

How many FPS are you getting? You could well be batch limited with that many submissions per frame. If that's the case then you can tweak your shaders as much as you like with only marginal changes in framerate.
Quote:How many FPS are you getting? You could well be batch limited with that many submissions per frame. If that's the case then you can tweak your shaders as much as you like with only marginal changes in framerate.


I changed the number of billboards in the vertex buffer to reduce the number of draw calls. The framerate stays the same regardless of whether the buffer holds 32x32 billboards with 1024 calls, 64x64 with 256 calls, or even 128x128 with only 64 DrawPrimitive calls. I get about 12 fps in every case, even when not a single triangle is inside the frustum. I also removed everything from the vertex shader that could be a bottleneck, in particular the texture fetch and the condition that decided whether a billboard should be moved out of the frustum.

The full vertex shader now looks like this:

  // Billboard world matrix: the upper 3x3 is the transpose of the view
  // rotation, and the last row places the quad at the billboard center.
  float4x4 mtWorld = { mtView._m00, mtView._m10, mtView._m20, 0.0,
                       mtView._m01, mtView._m11, mtView._m21, 0.0,
                       mtView._m02, mtView._m12, mtView._m22, 0.0,
                       vPosition.x + IN.BBCenter.x, vPosition.y + 0.0,
                       vPosition.z + IN.BBCenter.y, mtView._m33 };
  // Two full 4x4 matrix products are computed for every vertex.
  float4x4 mtWorldView = mul(mtWorld, mtView);
  float4x4 mtWorldViewProj = mul(mtWorldView, mtProj);
  OUT.Position = mul(float4(IN.Position, 1), mtWorldViewProj);
  OUT.TexCoord = IN.TexCoord;
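Why this shader yields a camera-facing quad: the upper 3x3 of mtWorld is the transpose of the view rotation, so for an orthonormal view matrix mul(mtWorld, mtView) collapses to a pure translation to the billboard's view-space center. The C++ sketch below checks that numerically, using the same row-vector convention as HLSL's mul(vector, matrix); the example view matrix (a rotation about y plus a translation) and all function names are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Row-vector convention (v' = v * M), as with HLSL mul(vector, matrix).
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Same construction as mtWorld in the shader: transposed view rotation plus
// the billboard's world-space center (cx, cy, cz) in the last row.
Mat4 billboardWorld(const Mat4& view, float cx, float cy, float cz) {
    return {{{view[0][0], view[1][0], view[2][0], 0.0f},
             {view[0][1], view[1][1], view[2][1], 0.0f},
             {view[0][2], view[1][2], view[2][2], 0.0f},
             {cx, cy, cz, view[3][3]}}};
}

// Hypothetical view matrix: rotation about the y axis by `angle` radians,
// followed by a translation (tx, ty, tz).
Mat4 exampleView(float angle, float tx, float ty, float tz) {
    const float c = std::cos(angle), s = std::sin(angle);
    return {{{c, 0.0f, -s, 0.0f},
             {0.0f, 1.0f, 0.0f, 0.0f},
             {s, 0.0f, c, 0.0f},
             {tx, ty, tz, 1.0f}}};
}
```

Because the product only rebuilds a translation, the two per-vertex matrix multiplies in the shader arguably do more arithmetic than the result requires; that is an observation from the math, not something measured in this thread.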


So I don't think the application can be batch-limited, because 64 calls should really be acceptable, and it can't be fillrate-limited either, because the framerate stays the same when no polygon is inside the frustum. So the question is: could one million triangles (in 64 draw calls) be too much for a GF6800 Ultra?
With a trivial shader you should be able to get close to 600 million triangles per second through a GF6800 Ultra. You're getting about 1/50 of that, which suggests the bottleneck is elsewhere. Are the VBs static or dynamic? A lot of dynamic VB fills could mean you're saturating the bus uploading them to the card.
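Back-of-envelope arithmetic behind the 1/50 figure. The sketch assumes 16 x 256 x 256 billboards per frame at 12 fps (from the earlier posts), and that each billboard is a quad drawn as two triangles and six vertices in a non-indexed triangle list; the 600 million figure is the peak quoted above, not a measurement. Function names are hypothetical.

```cpp
#include <cassert>

// Per-second rates derived from per-frame counts.
struct Throughput {
    double billboardsPerSec;
    double trianglesPerSec;
    double verticesPerSec;  // vertex-shader invocations (no index reuse)
};

Throughput throughput(double billboardsPerFrame, double fps,
                      double trisPerBillboard, double vertsPerBillboard) {
    assert(fps > 0.0);
    const double billboards = billboardsPerFrame * fps;
    return {billboards, billboards * trisPerBillboard, billboards * vertsPerBillboard};
}

// Fraction of a quoted peak rate that is actually achieved.
double fractionOfPeak(double achievedPerSec, double quotedPeakPerSec) {
    assert(quotedPeakPerSec > 0.0);
    return achievedPerSec / quotedPeakPerSec;
}
```

Under those assumptions the card processes about 12.6 million billboards, 25.2 million triangles and 75.5 million vertices per second. Counted per billboard that is roughly 1/48 of 600 million; counted per triangle it is roughly 1/24.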
Quote:Are the VBs static or dynamic? A lot of dynamic VB fills could mean you're saturating your bus trying to upload them to the card.


Thank you, I thought one million triangles should be no problem for a 6800.

It's just a single VB that is used for 64 draw calls. It is created with D3DUSAGE_WRITEONLY in the managed pool. I also tried the default pool, but the framerate stays the same. The buffer is locked just once during initialization; after that, it is only used for rendering, without any modifications. The vertices are quite simple: three floats for the position, two floats for the texcoords, and another two floats for a second texcoord set, which stores the billboard center position so the billboard can be rotated correctly towards the viewer. The vertices are rendered as a triangle list.
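For a sense of scale, here is a hypothetical C++ mirror of that vertex layout (seven floats, 28 bytes) together with the size of a non-indexed triangle-list buffer using six vertices per billboard. The struct and function names are illustrative only; the layout itself comes from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical C++ mirror of the described vertex: float3 position,
// float2 texcoord, float2 billboard center (second texcoord set).
struct BillboardVertex {
    float position[3];  // corner position
    float texCoord[2];  // sprite texture coordinates
    float center[2];    // billboard center (x and z), used for facing the viewer
};
static_assert(sizeof(BillboardVertex) == 7 * sizeof(float), "expected no padding");

// Bytes for side x side billboards in a non-indexed triangle list
// (two triangles, six vertices per billboard).
std::size_t vertexBufferBytes(std::size_t side) {
    assert(side > 0);
    return side * side * 6 * sizeof(BillboardVertex);
}
```

Even the largest tile, 128x128 billboards, comes to 2,752,512 bytes (about 2.6 MiB), and since it is filled once at initialization it does not need to cross the bus every frame.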
Quote:Original post by Jerax
With a trivial shader you should be able to get close to 600 million triangles per second through a GF6800 Ultra.
As a word of warning, be extremely careful about using marketing material as a benchmark. In short, they lie. [smile]

Those sorts of values are often derived straight from the engineering: multiplying the theoretical clock rates of individual components, assuming 100% cache and bandwidth utilization, and so on.

A while back I was playing with cache-optimized rendering, and my vertex throughput on a vanilla 6800 was around 35 million triangles per second. I am confident that was at the limit of what the card was capable of in the real world. Even then, I was writing a tech demo that didn't have the overhead of other 'normal' game or multimedia work to consider.


hth
Jack

Jack Hoxley [ Forum FAQ | Revised FAQ | MVP Profile | Developer Journal ]

Quote:Original post by Sc4Freak
I think a solution to this problem would be to use geometry shaders - it'll allow you to completely discard particles, so that the vertices never reach the pipeline. I'm not sure, though, I've not had any direct experience with Direct3D10.

You can do this with a geometry shader, although the technically correct description is: "just write a geometry shader that doesn't generate these." In theory, you can build an entire particle system with the GS, including particle creation and destruction.
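A CPU-side C++ model of that idea, under stated assumptions: one point per particle goes in, a four-vertex triangle strip comes out only for live particles, and dead particles emit nothing at all. In D3D10 HLSL the equivalent would be a geometry shader declared with [maxvertexcount(4)] that calls Append on a TriangleStream; the C++ names, the density threshold and the axis-aligned corners are assumptions made to keep the sketch small.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU-side model of a particle geometry shader (not D3D10 code).
struct Particle { float x, y, z, density; };
struct Corner { float x, y, z; };

// Appends a 4-vertex triangle strip for a live particle and emits nothing for
// a dead one, so no primitive for it ever reaches rasterization.
void expandParticle(const Particle& p, float halfSize, float threshold,
                    std::vector<Corner>& stream) {
    assert(halfSize > 0.0f);
    if (p.density < threshold) {
        return;  // dead particle: generate no vertices at all
    }
    // Axis-aligned corners for brevity; a real billboard GS would offset
    // along the camera's right and up vectors instead.
    stream.push_back({p.x - halfSize, p.y - halfSize, p.z});
    stream.push_back({p.x - halfSize, p.y + halfSize, p.z});
    stream.push_back({p.x + halfSize, p.y - halfSize, p.z});
    stream.push_back({p.x + halfSize, p.y + halfSize, p.z});
}
```

Unlike the degenerate-triangle approach, the output size here shrinks with the number of dead particles, which is what "don't generate these" buys over moving vertices off-screen.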

This topic is closed to new replies.
