NiGoea

How many triangles per frame ?


Hi all,

I was thinking about how many triangles should be sent each frame, and how many texture changes we can afford per frame. I mean, the practical limit.

I'm pretty disappointed with my deferred engine, which can now load .MAP files (Quake 1/2/3): switching textures and doing multiple 'DrawIndexedPrimitive' calls turned out to be too slow even for a medium/small map of 20k triangles (I used the E1M1 map from Quake 1). In particular, it turns out that doing twenty 'DrawIndexedPrimitive' calls is far slower than doing a single, much bigger one. But if you want to use materials, many 'DrawIndexedPrimitive' calls have to be made.

So how does one solve this problem? How many triangles do you send per frame, on average, and in how many calls?

My example: a 20k-triangle map has 20 materials, so I average 1k triangles per call. I have three passes that involve geometry (depth, normal and final pass), so I end up with 60k triangles sent in 60 calls. DAMNED SLOW. I don't even want to guess what happens once I add shadow maps, since the scene gets rendered again for each light.

I call 'SetTexture' every time the diffuse texture changes. Am I doing it wrong? Do I have to pack many different textures into one big texture?

THANKS TO ALL

What hardware are you running this on?

Also, why are you doing three passes for your G-buffer? Why not bind three render targets at once and write all of them out in one pass?

Quote:
Original post by adt7
What hardware are you running this on?

Also, why are you doing three passes for your G-buffer? Why not bind three render targets at once and write all of them out in one pass?


It's a light pre-pass renderer, so you have at least three passes: one to build the G-buffer, one to compute light values (which doesn't involve geometry at all!), and a last one that renders the geometry again, reading light values from the light buffer.

In my case there is an extra step: I first build the depth buffer, and only then the normal buffer (which also contains other data), because it seems to me that this way I can take advantage of the z-buffer by discarding computations for invisible pixels.

---

Anyway, I have a GeForce 6800 Ultra. It's not new, but I can't accept that it can't quickly render a map from a ten-year-old FPS.

Quote:
Original post by PolyVox
Basically your findings are correct: the number of DrawIndexedPrimitive calls matters far more than the number of triangles. Have a read of the following NVidia presentation:

http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf


The article was extremely interesting, but what does it teach?
As far as I can tell, it suggests reducing the number of calls... but how can one do that when using multiple materials? The only way is to pack many textures into the same surface and update the texture coordinates... which doesn't seem easy.
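For what it's worth, the coordinate-remapping part of the atlas approach is mechanically simple; a minimal sketch follows (the `AtlasRect` struct and function names are made up for illustration, not part of any API). The genuinely hard part, which this sketch deliberately ignores, is tiled/wrapping coordinates, since anything outside [0,1] would bleed into the neighboring atlas entry:

```cpp
// Sub-rectangle of the atlas occupied by one source texture,
// in normalized [0,1] atlas coordinates. Illustrative names only.
struct AtlasRect { float u0, v0, uScale, vScale; };

// Remap a texture coordinate from the original texture's [0,1]
// space into the atlas. This only works for coordinates that stay
// inside [0,1]: wrapping coordinates would sample the neighboring
// atlas entry, which is the hard case the atlas approach must solve.
inline void remapToAtlas(const AtlasRect& r, float u, float v,
                         float& outU, float& outV)
{
    outU = r.u0 + u * r.uScale;
    outV = r.v0 + v * r.vScale;
}
```

This would typically run once in a preprocessing step over the map's vertex data, not per frame.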

Plus, if I want to use a screen-space occlusion culling system, I can't send all the triangles inside the frustum at once; that would nullify the occlusion system... instead, I'd have to send small bunches of triangles => SLOW

I mean... WHAT THE HELL do you do ?!

Thanks !! :-D

Quote:
Original post by NiGoea
In my case there is an extra step: I first build the depth buffer, and only then the normal buffer (which also contains other data), because it seems to me that this way I can take advantage of the z-buffer by discarding computations for invisible pixels.


Doing a depth-only pass only helps if you're using a heavy pixel shader, and your G-buffer pass in a light pre-pass renderer should be very light (you're not doing any actual shading, after all). You're probably better off just doing depth+normals in one pass.

Quote:
Original post by MJP
Quote:
Original post by NiGoea
In my case there is an extra step: I first build the depth buffer, and only then the normal buffer (which also contains other data), because it seems to me that this way I can take advantage of the z-buffer by discarding computations for invisible pixels.


Doing a depth-only pass only helps if you're using a heavy pixel shader, and your G-buffer pass in a light pre-pass renderer should be very light (you're not doing any actual shading, after all). You're probably better off just doing depth+normals in one pass.


Well, you're right. Actually, my normal pass involves two samples, one cross product and one normalize per pixel...

You might save on some texture calls by sorting your models by the material they use. That way you can render all the objects that share a texture without changing materials in between.

Are you using any culling methods?
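The sorting idea above can be sketched in a few lines; this is an illustrative render-queue skeleton (the struct and function names are invented for the example, and the `SetTexture` / `DrawIndexedPrimitive` call sites are only marked as comments):

```cpp
#include <algorithm>
#include <vector>

// One pending draw: which material (texture) it needs and which
// mesh chunk to submit. Illustrative names, not a D3D API.
struct DrawItem {
    int materialId;   // key we sort by
    int chunkId;      // geometry fed to DrawIndexedPrimitive
};

// Sort the frame's draw list by material so identical textures end
// up adjacent, then change texture only when the key changes.
// Returns how many texture switches the sorted order costs.
int countTextureSwitches(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.materialId < b.materialId;
              });
    int switches = 0, current = -1;
    for (const DrawItem& it : items) {
        if (it.materialId != current) {
            ++switches;          // here you would call SetTexture(...)
            current = it.materialId;
        }
        // here you would call DrawIndexedPrimitive for it.chunkId
    }
    return switches;
}
```

With this ordering, the number of SetTexture calls drops from one per object to one per distinct material, and adjacent chunks sharing a material become candidates for merging into a single bigger draw call.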

Quote:
Original post by NiGoea
The article was extremely interesting, but what does it teach?
As far as I can tell, it suggests reducing the number of calls... but how can one do that when using multiple materials? The only way is to pack many textures into the same surface and update the texture coordinates... which doesn't seem easy.

...

WHAT THE HELL do you do ?!
You are right. It is not easy.
Depending on the shader complexity, there are various possibilities. The texture atlas approach you're describing is effective but quite involved to get right, since some texture-coordinate remapping is needed, and texture coordinates, in today's shader-driven world, may be accessed in arbitrary ways.
A somewhat more robust way is to use spare sampler registers (don't tell me you're already using all 16) and discard each texture's contribution depending on a per-vertex attribute value. Whether this is done through branching or by zeroing out with math, it is nontrivial (also recall that dynamically indexing samplers is not allowed on D3D9 hardware). It is essentially an "ubershader" approach.
I am very unlucky in that I don't like ubershaders at all... so I ended up writing a shader disassembler that walks the compiled code and modifies everything. I've lost count of the number of times I've shot myself in the foot with this beast, not to mention that I need D3DX to make it work, which I find rather ugly.
I strongly urge you to resist trying shader re-mangling, unless you don't care about your mental health, which I clearly didn't from the start!

If you can live with ubershaders, just modify the source assets to include the 'switching' per-vertex attribute and you'll be right at home, with none of the above-mentioned issues. Much better.
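To make the "math zeroing-out" trick concrete, here is a CPU-side illustration of the selection arithmetic (the function name, the 4-texture count, and the weight formula are all illustrative; an actual D3D9 shader would express the same weights with instructions like `saturate`, since samplers cannot be indexed dynamically there):

```cpp
#include <cmath>

// Sample all N textures, then keep only the one whose index matches
// the per-vertex material attribute, using arithmetic rather than
// branching or dynamic sampler indexing. Purely illustrative.
float selectByAttrib(const float samples[4], float materialAttrib)
{
    float result = 0.0f;
    for (int i = 0; i < 4; ++i) {
        // weight is 1 when materialAttrib == i and 0 for the others;
        // a shader would write this as saturate(1 - abs(attrib - i)).
        float w = 1.0f - std::fabs(materialAttrib - (float)i);
        if (w < 0.0f) w = 0.0f;   // saturate to [0,1]
        result += samples[i] * w;
    }
    return result;
}
```

The cost, of course, is that every texture is sampled for every pixel even though only one contributes, which is exactly the trade-off of this ubershader-style approach.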

Quote:
Original post by NiGoea
Plus, if I wanna use a screen-space occlusion culling system, I cant send all the triangles contained in the frustum at one time, it would nullify the occlusion system... rather, I should send bunches of triangles => SLOW
If you're sorting front-to-back for more z-reject, no, it is not. Sending large batches will far outweigh the deficit of a worse z-buffer rejection ratio, and culling can still be performed on a per-batch basis. Yes, it will waste more fillrate, but I've had rather good experience with it so far.

Anyway, 60 calls shouldn't be a problem: I think the render-target switches are what's really killing your GPU. Also, mixing a lay-z-only pass with deferred shading makes little sense to me, as you're essentially assuming the per-pixel attributes are costly to compute... which they are, if you're doing parallax occlusion mapping or complex shading. My personal opinion is that in those cases the benefits of deferred shading are nullified... does the problem loop back on itself?

Quote:
Original post by Darg
You might save on some texture calls by sorting each model by the material they use. That way you can render all objects with that texture without having to change materials in between.


I'm already doing this. But in an indoor map it's normal to have 10, 20, even 30 different materials.

Quote:

Are you using any culling methods?


Yes, an octree. But with DX it's a pain in the ass anyway, because the more nodes the octree has, the more calls you have to make => SLOW.
Conversely, an octree with few nodes doesn't make much sense.

Quote:
Original post by NiGoea
My example: a 20k-triangle map has 20 materials, so I average 1k triangles per call. I have three passes that involve geometry (depth, normal and final pass), so I end up with 60k triangles sent in 60 calls. DAMNED SLOW.

How slow? There's no way 60k triangles or 60 DIP calls should hurt performance that much. Besides, too many DIP calls would burden your CPU, so what CPU is it you have?
Lastly, how do you know the 60 DIP calls are really the bottleneck?

And since you have NVidia hardware, you should learn to use NVPerfHUD if you haven't already.

Good luck

If you render everything with the same texture (say, a pure white texture), does the FPS increase? Remove all the SetTexture() calls in favor of one initial SetTexture() call. If you get a dramatic improvement, I would suggest looking at creating UV atlases in a pre-processing step.

Another good question raised was: what kind of occlusion culling are you using? If you render your scene in wireframe, is the horizon a smear of pure black? Can you see rooms through walls? I read that you are using an octree, but that isn't very good at culling geometry when there is a large distance between z-near and z-far; it just gives you fast access to the nodes in the frustum, and there can still be a lot of occluded nodes inside the frustum.

Those Quake maps were meant for indoor rendering, from before the days of batched hardware-accelerated rendering. What I mean is that they inserted each and every triangle into a BSP and rendered only the visible triangles, always in front-to-back order from the camera. They likely did some batching after discovering the visible set, and also likely merged all their textures onto as few pages of video memory as possible.

To get equivalent results you would have to either implement that type of per-triangle BSP algorithm in software (likely skipping DirectX and writing to the video buffer yourself) or work out some form of portal culling system (those usually require a bit of input from the artists: extra portal geometry and so on).

Other than more careful scene management, I can't think of any glaring error in what you are doing.
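For readers unfamiliar with the front-to-back BSP walk mentioned above, here is a deliberately tiny sketch of the traversal order (a toy BSP over a single axis with made-up struct names; real engines used arbitrary 3D splitting planes plus a precomputed PVS on top of this):

```cpp
#include <vector>

// Toy 1D BSP node, just to show the front-to-back walk: at each
// node, recurse first into the side containing the camera, then
// emit this node's geometry, then recurse into the far side.
struct BspNode {
    float split;                 // splitting plane position
    int triangleId;              // "geometry" stored at this node
    BspNode *front, *back;       // front covers > split, back covers < split
};

void walkFrontToBack(const BspNode* n, float camera,
                     std::vector<int>& order)
{
    if (!n) return;
    const BspNode* nearSide = (camera < n->split) ? n->back : n->front;
    const BspNode* farSide  = (camera < n->split) ? n->front : n->back;
    walkFrontToBack(nearSide, camera, order);   // nearer geometry first
    order.push_back(n->triangleId);
    walkFrontToBack(farSide, camera, order);
}
```

The resulting order is what lets a software renderer (or an early-z hardware pass) reject most occluded pixels cheaply, at the price of the per-triangle granularity that batched hardware rendering dislikes.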

I'll try to answer all of you.


** My situation in a very simple form **

Light pre pass renderer made of three passes:
1- render GBuffer
2- render lights (light buffer)
3- render final scene

Steps 1 and 3 require rendering the entire scene by calling 'RenderScene'. Every call to 'RenderScene' implies:

- obtaining the visible octree nodes and doing other culling work (the details don't matter here)
- for each node, rendering the triangles it contains, sorted by material. Every material implies a single 'DrawIndexedPrimitive' call.


** The problem **

Since I want to take advantage of the occlusion algorithm called Hierarchical Z-Buffer Visibility, I have to use many not-too-big octree nodes.

But for simplicity, let's suppose I have only ONE big node that represents the entire map.

So I have two passes, each of which renders the octree once. Every time the octree is rendered, there will be as many 'DrawIndexedPrimitive' calls as there are different materials.

The problem is: having 20 materials implies 20 draw calls per pass => 40 'DrawIndexedPrimitive' calls.
20k triangles are visible on average, so every draw call averages 1k triangles.
=> 1k * 20 materials * 2 passes = 40k triangles in total (not that many)

RESULT: 15 fps on a 6800 Ultra.


** Without using any material **

The number of triangles is the same, but if I use only one material, I make one big call per pass instead of 20

=> 20k * 2 passes = 40k triangles as before... but at 100 fps.



** Conclusion **

It seems that doing many 'DrawIndexedPrimitive' calls slashes performance.

I tried doing a single SetTexture call, but nothing changed.

Moreover, I don't f****** know why, but even with a single big draw call of 20k triangles things are slow... WHAT THE HELL ???


** Questions **

1- Is it possible to use texture atlases? I don't know... there are very many texture combinations. One moment you need one texture, the next another, maybe one that lives in a different atlas...
Wasn't that an old practice?

2- What happens if you have to handle hundreds of different materials?

3- How can one implement a good occlusion culling scheme if it seems better to send all the triangles in a single batch??

4- Does it make sense to build the vertex and index buffers on the fly for the visible geometry??


I HATE 3D GRAPHICS WORLD :D!


-----

@ Krohm
I finished a complete software-rendered engine last winter. I nearly went crazy many times, because no libraries were used in that project of mine. So I'm not scared of hard things... I'm only scared of wasting time on hard things that aren't worth it
:°(

@ Steve_Segreto
No, no increase was noticeable from using a single SetTexture.
An octree is perfect for occlusion culling if you use Hierarchical Z-Buffer Visibility (Greene, 1993). That's what I used in the software engine. I'm discovering that maybe it's NOT a good approach with hardware acceleration.



THANK YOU ALL

Quote:
It seems that doing many 'DrawIndexedPrimitive' slashes the performance.

Yes, but 60 is NOT many, even for a pretty old (couple of years) CPU.
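A rough way to see why 60 calls is not many: the "Batch, Batch, Batch!" presentation linked earlier quotes an order-of-magnitude figure of roughly 25k batches per second per GHz of CPU if the CPU did nothing but submit them. The sketch below (function name and the exact constant are approximations for illustration, not a vendor API) turns that into a frame-rate ceiling:

```cpp
// Back-of-envelope batch budget from the "Batch, Batch, Batch!"
// rule of thumb (~25k batches/sec per GHz of CPU spent purely on
// submission). The constant is approximate; the point is that
// 60 DrawIndexedPrimitive calls per frame is nowhere near the
// limit, so the bottleneck must be elsewhere.
int maxFpsFromBatches(float cpuGHz, int batchesPerFrame,
                      float cpuFractionForSubmission)
{
    const float batchesPerSecPerGHz = 25000.0f;  // rough figure
    float budget = batchesPerSecPerGHz * cpuGHz * cpuFractionForSubmission;
    return (int)(budget / (float)batchesPerFrame);
}
```

Even granting submission only a quarter of a 2 GHz CPU, 60 batches per frame leaves a ceiling in the hundreds of frames per second, far above the 15 fps reported.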
Quote:
is it possible to use texture atlases?...

That's going to be a pain in the...
Quote:
What happens if you have to handle hundreds of different materials ?

You make hundreds of DIP calls, which again is an acceptable number (a couple hundred, not tens of hundreds).
Quote:
How can one carry out a good occlusion culling scheme if it seems that it's better to send all the triangles in a single bunch ??

You don't have to (and can't) send all of them in a single batch. See above.
Quote:
Does it have sense to make on the fly the vertex buffer and the index buffer for the visible geometry ??

NO. Static geometry, with a good balance of geometric size (for occlusion) vs. triangle count (for batching), is usually the most efficient approach.

Bottom line: use the available tools (NVPerfHUD, profilers...) to figure out what is wrong with your application.

