How many triangles per frame?
Hi all,
I was wondering how many triangles should be sent each frame, and how many texture changes should be performed per frame. I mean, what the practical limit is.
I'm pretty disappointed with my deferred engine, which can now load .MAP files (Quake 1/2/3): switching textures and issuing multiple 'DrawIndexedPrimitive' calls turned out to be too slow even for a medium/small map of 20k triangles. I tested with the E1M1 map from Quake 1.
In particular, it turns out that issuing twenty 'DrawIndexedPrimitive' calls is far slower than issuing a single, much bigger one.
But if one wants to use materials, many 'DrawIndexedPrimitive' calls have to be made.
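To make the problem concrete, here is a simplified CPU-side sketch of what I'm doing (the `Tri` struct and `batchByMaterial` are invented names): group the map's triangles by material, so each material costs exactly one 'DrawIndexedPrimitive' call instead of one call per surface:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// One triangle's three vertex indices, tagged with the material it uses.
struct Tri { int material; uint16_t i0, i1, i2; };

// Bucket triangles by material so each material needs exactly one
// DrawIndexedPrimitive call instead of one call per surface.
std::map<int, std::vector<uint16_t>> batchByMaterial(const std::vector<Tri>& tris)
{
    std::map<int, std::vector<uint16_t>> batches;
    for (const Tri& t : tris) {
        std::vector<uint16_t>& ib = batches[t.material];
        ib.push_back(t.i0);
        ib.push_back(t.i1);
        ib.push_back(t.i2);
    }
    return batches; // batches.size() == number of draw calls per pass
}
```

Even with this, a 20-material map still means 20 calls per geometry pass, which is exactly my problem.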
So how the hell does one solve this problem?
How many triangles do you send per frame, on average?
How many draw calls do you make?
---
My example: a 20k-triangle map has 20 materials, so I average 1k triangles per call. I have three passes that involve geometry (depth, normal, and final pass), so I end up with 60k triangles sent in 60 calls. DAMNED SLOW.
I don't even dare to guess what happens once I add shadow maps, since you render the scene again for each light.
---
I call 'SetTexture' every time the diffuse texture changes. Am I doing it wrong? Should I pack many different textures into one big texture?
THANKS TO ALL
What hardware are you running this on?
Also, why are you doing 3 passes for your g-buffer? Why not bind 3 render targets at once and write out all three in one pass?
Basically your findings are correct - the number of DrawIndexedPrimitive calls matters far more than the number of triangles. Have a read of the following NVidia presentation:
http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
Quote:Original post by adt7
What hardware are you running this on?
Also, why are you doing 3 passes for your g-buffer? Why not bind 3 render targets at once and write out all three in one pass?
It's a light pre-pass renderer, so you have at least three passes: one to build the G-Buffer, one to compute lighting (but this one doesn't involve geometry at all!), and a final one to render the geometry, fetching light values from the light buffer.
In my case there is an extra step: I first fill the depth buffer, and only then the normal buffer (which also contains other data), because it seems that this way I can take advantage of the z-buffer by discarding computations for invisible pixels.
---
Anyway, I have a GeForce 6800 Ultra. It's not new, but I can't accept that it can't quickly render a map from a 10-year-old FPS.
Quote:Original post by PolyVox
Basically your findings are correct - the number of DrawIndexedPrimitive calls is far more important than the number of triangles. Have a read of the following NVidia presentation:
http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
The article was extremely interesting, but what does it teach?
As far as I can tell, it suggests decreasing the number of calls... but how can one do that while using multiple materials? The only way is to pack many textures onto the same surface and update the texture coordinates... which doesn't seem easy at all.
Plus, if I want to use a screen-space occlusion culling system, I can't send all the triangles in the frustum at once; it would nullify the occlusion system... instead, I'd have to send small bunches of triangles => SLOW
I mean... WHAT THE HELL do you do?!
Thanks !! :-D
Quote:Original post by NiGoea
In my case there is an extra step: I first make the depth buffer, and only then I make the normal buffer (which contains other data also), because it seems to me that in this way I can take advantage of z-buffer by discarding computations for invisible pixels.
Doing a depth-only pass only helps if you're using a heavy pixel shader, and your g-buffer pass for a light pre-pass renderer should be very light (you're not doing any actual shading, after all). You're probably better off just doing depth+normals in one pass.
Quote:Original post by MJP
Quote:Original post by NiGoea
In my case there is an extra step: I first make the depth buffer, and only then I make the normal buffer (which contains other data also), because it seems to me that in this way I can take advantage of z-buffer by discarding computations for invisible pixels.
Doing a depth-only pass only helps if you're using a heavy pixel shader, and your g-buffer pass for a light pre-pass renderer should be very light (you're not doing any actual shading, after all). You're probably better off just doing depth+normals in one pass.
Well, you're right. Actually, my normal pass only does two texture samples, one cross product, and one normalize per pixel...
You might save on some SetTexture calls by sorting your models by the material they use. That way you can render all objects sharing a texture without changing materials in between.
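A rough sketch of what I mean (the struct and function names are made up): sort the render queue by texture, and a pass then only issues a texture change when the id actually differs from the previous item:

```cpp
#include <algorithm>
#include <vector>

// One queued draw: which texture it needs, and which mesh to draw.
struct DrawItem { int textureId; int meshId; };

// Sort the render queue so items sharing a texture are adjacent,
// then count how many SetTexture calls a pass would actually issue.
int sortAndCountTextureChanges(std::vector<DrawItem>& queue)
{
    std::sort(queue.begin(), queue.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.textureId < b.textureId;
              });
    int changes = 0, current = -1;
    for (const DrawItem& d : queue) {
        if (d.textureId != current) { ++changes; current = d.textureId; }
    }
    return changes;
}
```

In the worst case (every item a different texture) this changes nothing, but typical scenes share textures heavily.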
Are you using any culling methods?
Quote:Original post by NiGoea
The article was extremely interesting, but what does it teach ?
As far as I'm concerned, it suggests to decrease the number of calls... but how can one do it if he is using multiple materials. The only way is to pack many texture on the same surface and to update texture coordinates... doesn't seem so easy.
...
WHAT THE HELL do you do ?!
You are right. It is not easy.
Depending on the shader complexity, there are various possibilities. The texture atlas approach you're describing is effective but quite involved to get right, since some texcoord remapping is needed, and texcoords, in today's shader-driven world, may be accessed in arbitrary ways.
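For what it's worth, the remapping itself is just an affine transform of the original UVs into the tile's rectangle; the hard part is everything around it (texture wrapping, mip bleeding, arbitrary shader-side addressing). A tiny sketch, with invented names:

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Remap a [0,1] texture coordinate into one tile of a texture atlas.
// 'origin' and 'size' are the tile's position/extent in normalized
// atlas coordinates (hypothetical layout). Note this does NOT handle
// wrapping, which is exactly why atlasing Quake-style maps with tiled
// UVs is hard.
Vec2 remapToAtlas(Vec2 uv, Vec2 origin, Vec2 size)
{
    return { origin.x + uv.x * size.x,
             origin.y + uv.y * size.y };
}
```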
A somewhat more robust way is to use spare sampler registers (don't tell me you're already using all 16) and discard each sampler's contribution depending on a vertex attribute value. Whether this is done through branching or by mathematically zeroing out terms is nontrivial (also recall that dynamically indexing samplers is not allowed on D3D9 hardware). It is essentially an "ubershader" approach.
I'm very unlucky in that I don't like ubershaders at all... so I ended up writing a shader disassembler that walks the compiled code and modifies everything. I've lost count of the number of times I've shot myself in the foot with this beast, not to mention that I need D3DX to make it work, which I find rather ugly.
I strongly urge you to resist trying shader re-mangling, unless you don't care about your mental health, which I clearly didn't from the start!
If you can live with ubershaders, just modify the source assets to include the 'switching' per-vertex attribute and you'll be right at home, with none of the above-mentioned issues. Much better.
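To illustrate the math zeroing-out variant on the CPU (names invented; in the real shader the samples come from the spare sampler registers and the index from the per-vertex attribute):

```cpp
#include <cassert>
#include <cmath>

struct Color { float r, g, b; };

// CPU emulation of the "math zeroing-out" ubershader trick: each vertex
// carries a material index as an attribute, every candidate texture is
// sampled, and every contribution except the selected one is multiplied
// by zero. No dynamic sampler indexing is needed, so it stays D3D9-legal.
Color selectMaterial(const Color samples[], int sampleCount, float materialAttrib)
{
    Color out = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < sampleCount; ++i) {
        // weight is 1 when i matches the attribute, 0 otherwise
        float w = (std::fabs(materialAttrib - (float)i) < 0.5f) ? 1.0f : 0.0f;
        out.r += samples[i].r * w;
        out.g += samples[i].g * w;
        out.b += samples[i].b * w;
    }
    return out;
}
```

The cost, of course, is that you always pay for all the samples, which is why the spare-register budget matters.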
Quote:Original post by NiGoea
Plus, if I wanna use a screen-space occlusion culling system, I cant send all the triangles contained in the frustum at one time, it would nullify the occlusion system... rather, I should send bunches of triangles => SLOW
If you're sorting front-to-back for more z-reject, no, it is not. Sending large batches will outweigh by far the deficit of a worse z-buffer rejection ratio; culling can still be performed on a per-batch basis. Yes, it will trash more fillrate, but I've had rather good experience with it so far.
Anyway, 60 calls shouldn't be a problem: I think the render-target switches are what's really killing your GPU. Also, mixing a lay-z-only pass with deferred shading makes little sense to me, since you're essentially pretending the per-pixel attributes are costly to compute... which they actually are, if you're doing parallax occlusion mapping or complex shading. It is my personal opinion that in those cases the benefits of deferred shading are nullified... is the problem looping on itself?
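To expand on the per-batch culling point: it can be as cheap as testing each batch's bounding sphere against the frustum planes and submitting surviving batches whole, with the batch granularity chosen to keep the draw-call count low. A sketch with invented names:

```cpp
#include <cassert>
#include <vector>

struct Plane  { float nx, ny, nz, d; };  // n.p + d >= 0 means "inside"
struct Sphere { float x, y, z, radius; };

// Coarse per-batch culling: instead of walking a deep octree (one draw
// call per small node), test each batch's bounding sphere against the
// frustum planes. A batch is rejected only if it lies completely
// outside at least one plane.
bool sphereInFrustum(const Sphere& s, const std::vector<Plane>& frustum)
{
    for (const Plane& p : frustum) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius)
            return false; // completely outside this plane
    }
    return true; // inside or intersecting all planes
}
```

You trade some wasted fillrate on partially visible batches for far fewer, far bigger draw calls.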
Quote:Original post by Darg
You might save on some texture calls by sorting each model by the material they use. That way you can render all objects with that texture without having to change materials in between.
I'm already doing this. But in an indoor map it's normal to have 10, 20, 30 different materials.
Quote:
Are you using any culling methods?
Yes, an octree. But with DX it's a pain in the ass anyway, because the more nodes the octree has, the more calls you have to make => SLOW.
On the other hand, an octree with few nodes doesn't make much sense.