
Lots of small meshes or one big dynamic buffer?


Norman Barrows

What's faster: lots of small meshes, or copying them all into one big dynamic buffer and drawing that?

 

Let's say 150 to 500 meshes of about 10 tris each, all with the same texture.

 

Another way to phrase it:

 

Would it be faster to take every triangle in the scene that uses the same texture and copy them all to a dynamic vertex buffer, do that for every texture in the scene, then just draw the dynamic vertex buffers?

 

You'd basically be sorting all the triangles in the scene on texture, into separate dynamic vertex buffers.

 

But I'm thinking that if you had, say, 5 textures and 500 meshes of 10 tris each, then 5 batch calls of 1000 tris each might be faster than 500 batch calls of 10 tris each, despite the overhead of copying the triangles into the dynamic vertex buffers each frame.
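Something like this is what I have in mind - a rough sketch with made-up names (DrawTextureBatch, etc.), assuming one pre-created dynamic VB per texture, vertices already in world space, and error handling omitted:

```cpp
#include <d3d9.h>
#include <cstring>

// Rough sketch: gather all triangles that share a texture into one dynamic VB,
// then draw the whole batch with a single DrawPrimitive call.
// Assumes the VB was created with D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY in
// D3DPOOL_DEFAULT, big enough for the worst case, holding an unindexed tri list.

struct Vertex { float x, y, z, nx, ny, nz, u, v; };
const DWORD VERTEX_FVF = D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_TEX1;

void DrawTextureBatch(IDirect3DDevice9* dev,
                      IDirect3DVertexBuffer9* vb,
                      IDirect3DTexture9* tex,
                      const Vertex* tris,   // triCount * 3 verts, pre-transformed to world space
                      UINT triCount)
{
    void* p = 0;
    // DISCARD gives the driver a fresh region so the GPU isn't stalled
    // waiting on the previous frame's copy of the buffer.
    vb->Lock(0, triCount * 3 * sizeof(Vertex), &p, D3DLOCK_DISCARD);
    memcpy(p, tris, triCount * 3 * sizeof(Vertex));
    vb->Unlock();

    dev->SetTexture(0, tex);                        // one texture change per batch...
    dev->SetStreamSource(0, vb, 0, sizeof(Vertex));
    dev->SetFVF(VERTEX_FVF);
    dev->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triCount);  // ...and one draw call
}
```

So 5 textures would mean 5 calls to something like this per frame, instead of 500 small DIPs.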

 

Has anyone ever tried this?
Jason Z

The general advice is that if you are CPU limited at all, then you should reduce the number of draw calls if possible. With such small triangle counts per mesh, I would suggest using some form of instancing if possible, which could give you the best of both worlds.
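For illustration, here is roughly what D3D9 hardware instancing looks like - a sketch only, with made-up names; it requires a vertex shader and SM3-class hardware, and the vertex declaration, shader, and buffer creation are omitted:

```cpp
#include <d3d9.h>

// Stream 0 holds the shared mesh (e.g. one 10-tri rock); stream 1 holds one
// small element of per-instance data (e.g. a world position) per instance.
// A vertex shader must combine the two streams - fixed function cannot.
void DrawInstanced(IDirect3DDevice9* dev,
                   IDirect3DVertexBuffer9* meshVB, UINT meshStride, UINT vertexCount,
                   IDirect3DVertexBuffer9* instVB, UINT instStride, UINT instanceCount,
                   IDirect3DIndexBuffer9* ib, UINT triCount)
{
    // Replay the indexed mesh once per instance...
    dev->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);
    // ...stepping stream 1 forward by one element per instance.
    dev->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

    dev->SetStreamSource(0, meshVB, 0, meshStride);
    dev->SetStreamSource(1, instVB, 0, instStride);
    dev->SetIndices(ib);

    // One draw call submits all instances.
    dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, vertexCount, 0, triCount);

    // Reset frequencies so later non-instanced draws behave normally.
    dev->SetStreamSourceFreq(0, 1);
    dev->SetStreamSourceFreq(1, 1);
}
```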

 

To be perfectly honest though, nobody can tell you which one will be faster, or by how much. It completely depends on the scene and the rendering techniques that you are using. You need to attempt each method and see which one is faster in the given situation - there is no hard rule to go by... The best case is if you can configure your engine to do either method as appropriate; that would let you customize your rendering approach for each scene that you render.

Hodgman
I use a method like you describe, but at data-compilation time, and using static buffers at run-time.

On the last console game I made, we did extensive profiling and settled on a rough rule of thumb that every draw call should cover at least 400 pixels, in order to avoid stalls inside the GPU pipeline, and maybe 1 triangle per 16 pixels. These guidelines vary *hugely* depending on your shaders and the actual GPU though...

Depending on the API that you're using, there can be large amounts of CPU overhead in calling any graphics API function, so you often want to reduce API calls to a minimum - that one is easy to profile yourself though, by measuring the time taken by your D3D/GL calls.
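Something as crude as this is enough for a first pass - a sketch only; it measures CPU submission cost, not GPU execution time, since the GPU runs asynchronously:

```cpp
#include <windows.h>
#include <cstdio>

// Crude CPU-side timing of the draw-submission block inside the render loop.
// This measures the cost of the API calls themselves, not what the GPU does.
void TimedSubmit(/* device, scene, ... */)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    // ... SetTexture / SetStreamSource / DrawIndexedPrimitive calls here ...
    QueryPerformanceCounter(&t1);

    double ms = 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    printf("draw submission: %.3f ms\n", ms);
}
```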

Do you want to optimize CPU time, GPU time, or both?

Norman Barrows

With such small triangle counts per mesh, I would suggest using some form of instancing if possible, which could give you the best of both worlds.

 

That was the very first thing I checked, but it requires shaders, and I'm trying to stick with fixed function for maximum compatibility.

 

I've been working on a design for some game library modules, and came to the conclusion that the whole problem with games is the graphics: it takes up too much time. Computers are fast enough to model almost anything we want for game purposes, but most (if not almost all) of the computer's time is spent drawing.

 

We can draw scenes at the complexity we want, or at the detail we want, but not really both yet.

 

I don't think there's a graphics programmer out there who wouldn't draw more if they had twice the processing power. I don't think anyone would say "naw, that's OK, I've got enough stuff in my scene".

 

Since apparently graphics cards like to draw lots of triangles at once using the same texture, I was thinking one big buffer with all the triangles for a texture might be faster.

 

To be perfectly honest though, nobody can tell you which one will be faster, or by how much. It completely depends on the scene and the rendering techniques that you are using. You need to attempt each method and see which one is faster in the given situation - there is no hard rule to go by.

 

I know what you mean. You would think it wouldn't be that way and that some methods would tend to rise to the top, but like everything else in games there are 6+ ways to do it and it all depends.

 

Looks like I might be spending some quality time with vertex and index buffers.

 

Am I correct in the assumption that I want to copy the vertices one after the other, and add the "vertex base index" of each mesh to its index values?
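I.e., something like this, if I understand it right (a sketch with made-up types):

```cpp
#include <vector>

// Sketch: append each mesh's vertices to one big array, and rebase its
// indices by the number of vertices already copied (the "vertex base index").
struct Vertex { float x, y, z, nx, ny, nz, u, v; };

struct Mesh {
    std::vector<Vertex>         verts;
    std::vector<unsigned short> indices;  // 16-bit: merged buffer must stay under 65,536 verts
};

void AppendMesh(const Mesh& m,
                std::vector<Vertex>& outVerts,
                std::vector<unsigned short>& outIndices)
{
    unsigned short base = (unsigned short)outVerts.size();
    outVerts.insert(outVerts.end(), m.verts.begin(), m.verts.end());
    for (size_t i = 0; i < m.indices.size(); ++i)
        outIndices.push_back((unsigned short)(m.indices[i] + base));
}
```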

 

Since I've come to the conclusion that graphics is the problem, I'm going to see what I can do to get some better performance - perhaps even go to shaders and sacrifice some backward compatibility.

 

It's either that, or I don't draw rich environments, or I only draw them out to 50 feet, or I do it all with 2D billboards. :P

Norman Barrows

Do you want to optimize CPU time, GPU time, or both?

 

Not sure. The goal is to be able to draw rich environments and still have CPU power left for semi-serious simulation.

 

I'm not necessarily thinking in terms of a specific title - more like general approaches that can be used in multiple titles.

 

I use a method like you describe, but at data-compilation time, and using static buffers at run-time.

 

Is it fast enough that I might do it at the start of a new game when I generate the world? Or are we talking 30 days of runtime on a MIPS Alpha?

 

Depending on the API that you're using, there can be large amounts of CPU overhead in calling any graphics API function, so you often want to reduce API calls to a minimum - that one is easy to profile yourself though, by measuring the time taken by your D3D/GL calls.

 

I'm using DX9.0c fixed function. Looks like I might finally have a reason to fire up the profiler. But as I said, I'm thinking more in terms of general approaches rather than a specific title, so I guess, technically, I still don't have anything to profile. I guess I'll need to try it both ways and see what happens. God! So much time in game development is spent on experimentation and R&D!

Juliean

Just wanted to add something that stuck out for me:

 

That was the very first thing I checked, but it requires shaders, and I'm trying to stick with fixed function for maximum compatibility.

 

I would strongly advise you against supporting the fixed function pipeline any more, especially for the sake of "compatibility". What do you want to be compatible with? 15-year-old graphics hardware? Outdated fixed function samples, for which probably twice as many shader-equivalent tutorials exist? I don't see any point in carrying on with the fixed function pipeline for any reason. Recent GPUs don't even have a fixed function pipeline in that sense; they probably just emulate it, so there likely isn't even any performance gain from it. As for compatibility, almost all relevant graphics chips support shaders.

 

Of course it is your choice, but I see fixed function as a waste of time - something that should only be used by beginners to learn the very basics before going on to shaders. Especially if it keeps you from using techniques like instancing, this should be a warning sign!

Jason Z

Especially if it keeps you from using techniques like instancing, this should be a warning sign!

This is good advice - you should probably stay away from fixed function stuff unless you have a very specific reason to use it!

Norman Barrows

Especially if it keeps you from using techniques like instancing, this should be a warning sign!

This is good advice - you should probably stay away from fixed function stuff unless you have a very specific reason to use it!

 

 

Is there boilerplate shader code available that implements the basic fixed function capabilities (anisotropic mipmap filtering, Gouraud and Phong shading)?

 

I could use that to quickly convert to programmable and then implement instancing. I could really use it to draw all these bushes and rocks and plants and such for Caveman.

 

When MS bought rend386, I was forced to write my own perspective-correct texture-mapped poly engine.

 

I've also written assembly blitters for sprite engines that did mirror, zoom, and rotate simultaneously in real time.

 

But I don't relish the thought of having to twiddle XYZs, UVs, and RGBs. All I want is 1000 rocks on the screen! <g>

 

Then again, it would allow me to write a shader that did mipmapped sprite textures without blending the background into the edges. I can't believe MS released DirectX with that basic incompatibility between their color key transparency / alpha test system and their mip filtering system. Then you could actually do "if alpha == 0" instead of "if alpha < threshold", and it would work correctly.

 

By now I would have thought that the most common shader implementations would be widely available. While I haven't ever gone searching for any, I also haven't seen any posted anywhere.


VladR

Has anyone ever tried this?

I did, about 10 years ago on a GeForce 2 GTS, for a top-down 3D scene consisting of walls/props/floor of a quad-grid based level.

1. Brute force: SetTexture per quad + VB/IB per quad

2. Single VB/IB for the whole level, DIP (DrawIndexedPrimitive) per quad

3. Dynamic VB: recreating a single VB (sorting/copying the objects in the frustum) whenever the camera changes; DIP per texture

 

Slowest - Option 1

Faster - Option 2

Fastest - Option 3, since the VB does not really get recreated every frame, just every time new quads from the grid pop into the frustum

 

 

Honestly, it took me a single afternoon to code the 2 additional render methods so that I could switch between them at runtime on a keypress. So I propose you spend a little bit of effort and do the same - it's really drop-dead easy and straightforward (just watch the pool/flags for the VB/IB create/update; check the nVidia papers for that).
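From memory, the recommended pool/flag combination looks roughly like this (a sketch with hypothetical helpers; do check the papers for the exact guidance):

```cpp
#include <d3d9.h>

// Create with D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY in D3DPOOL_DEFAULT
// (dynamic buffers cannot live in D3DPOOL_MANAGED), then lock with
// DISCARD when refilling from scratch and NOOVERWRITE when appending.
IDirect3DVertexBuffer9* CreateDynamicVB(IDirect3DDevice9* dev, UINT bytes, DWORD fvf)
{
    IDirect3DVertexBuffer9* vb = 0;
    dev->CreateVertexBuffer(bytes, D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY,
                            fvf, D3DPOOL_DEFAULT, &vb, 0);
    return vb;
}

void* LockRange(IDirect3DVertexBuffer9* vb, UINT offset, UINT bytes)
{
    void* p = 0;
    // NOOVERWRITE promises not to touch data the GPU may still be reading,
    // so the driver never has to stall; DISCARD hands back a fresh buffer.
    vb->Lock(offset, bytes, &p, offset == 0 ? D3DLOCK_DISCARD : D3DLOCK_NOOVERWRITE);
    return p;
}
```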

 

 

When benchmarking, make sure to switch off everything else in the engine. It is pointless to make these optimizations and then run them at full load at 12 fps and wonder why you can't see any difference - e.g. go for lowest resolution, no Vsync / AA / AF, no AI / Physics...

Norman Barrows

I did, about 10 years ago on a GeForce 2 GTS, for a top-down 3D scene consisting of walls/props/floor of a quad-grid based level.
1. Brute force: SetTexture per quad + VB/IB per quad
2. Single VB/IB for the whole level, DIP per quad
3. Dynamic VB: recreating a single VB (sorting/copying the objects in the frustum) whenever the camera changes; DIP per texture

Slowest - Option 1
Faster - Option 2
Fastest - Option 3, since the VB does not really get recreated every frame, just every time new quads from the grid pop into the frustum

 

Interesting.

 

I do most of my drawing by creating (outdoor) scenes from many small meshes (rocks, plants, trees, etc.), which is analogous to option 1.

 

 

Guess it's time to write some test code.

 

So we're talking the GPU's slower memory access for the dynamic VB and IB, vs. the additional quads of one big VB, vs. the API overhead of drawing individual quads.

 

And dynamic was still fastest, eh?

 

 

Sounds like "clip to frustum and place in a dynamic buffer" may be the trick. Thanks!

VladR

Well, now that you mention outdoor scenes: you will actually get better performance by grouping the small objects (clutter/props/rocks) into a few chunks, where you can render each group using a single DIP call.

 

The threshold value obviously depends on the gfx/CPU combo you use, but it is clearly faster to render a group of 10 objects totalling, say, 3000 tris in one DIP compared to:

- frustum culling 10 objects on the CPU

- 10 DIP calls for a measly ~300 tris on average per object

 

Think of it the same way you partition the terrain. I assume you use some kind of quadtree-like scheme for cutting the terrain into chunks and doing the frustum culling.

 

Now, while a single terrain chunk, say 128x128, will be considered a leaf (in quadtree terms), you may have lots and lots of props/rocks/clutter, and it may very well be prohibitive to render ALL props from a single VB - especially in a scenario where there are 4 terrain chunks in the frustum and only a small part of each is actually visible, yet you'd be sending 4 huge prop VBs through the gfx card's pipeline.

 

 

As for the dynamic VB, I forgot to mention that the framerate really dropped for a short moment whenever the VB was being recreated - you might very well be allergic to such behaviour. But if the amount of RAM is an issue, this is a great option, especially if you can spread the task of creating the dynamic VB across multiple frames.

Which, admittedly, becomes harder to manage, since during those few frames you might actually change the camera position, and thus have to recreate a VB that wasn't even fully created in the first place...

Norman Barrows

The threshold value obviously depends on the gfx/CPU combo you use, but it is clearly faster to render a group of 10 objects totalling, say, 3000 tris in one DIP compared to:
- frustum culling 10 objects on the CPU
- 10 DIP calls for a measly ~300 tris on average per object

 

It's worse than that! It's more like frustum culling 2000+ objects of 10-50 triangles each, and still having 500 DIP calls of 10-50 triangles each when you're through.

 

Think of it the same way you partition the terrain. I assume you use some kind of quadtree-like scheme for cutting the terrain into chunks and doing the frustum culling.

 

The ground is drawn as individual 10x10 quads out to clip range (50-300 units). A heightmap function is used to heightmap a dynamic quad, and a "pattern map" determines the texture ID to use for the quad. superclip4() is called on each quad; superclip is the "clip to frustum" routine, but it does a bit more, like trivially rejecting things behind the camera, etc.

 

 

This is for an FPS/RPG title.

 

So I guess you could say the ground is in 10x10 chunks. The size is small so I can have seamless ground texture tile sets that are only 10x10 units in size (10 feet x 10 feet at the scale I'm using of 1 D3D unit = 1 foot).

 

Since it appears (according to various docs at least) that changing textures is the worst thing you can do to a GPU, I've been following the mantra of "one mesh, one texture", and sorting everything into optimal order before sending it off to the pipeline.

 

Ground quads are the only thing that's not sorted on texture before drawing. To do that, I'd need to do a pass for each ground quad texture tile used, and heightmap and draw just those quads on each pass.

 

I'm approaching the point where it's time for final graphics; I do final graphics last, so nothing has been optimized within an inch of its life yet. All I've done so far is make sure the frame rate stays up and that I can achieve the desired visual results. Most of the optimization in my future will be geared towards pushing the cutoff range between high and low LOD out farther from the camera. In thick woods and jungle, the cutoff is 50 feet right now - then again, you're hard pressed to see 50 feet in that kind of bush anyway.

Norman Barrows

Now, while a single terrain chunk, say 128x128, will be considered a leaf (in quadtree terms), you may have lots and lots of props/rocks/clutter, and it may very well be prohibitive to render ALL props from a single VB - especially in a scenario where there are 4 terrain chunks in the frustum and only a small part of each is actually visible, yet you'd be sending 4 huge prop VBs through the gfx card's pipeline.

 

Yes, I've recently started considering how I'd do a shooter-type title, and came to the same question: number of batches (size of "level" chunks) vs. number of triangles in a chunk that are entirely outside the viewing frustum - i.e., DIP overhead vs. DirectX clipping overhead.

 

It's possible that the best way (app-dependent, of course) would be one pass per texture: for each texture, clip all objects to the frustum. Things that are inside, add to a VB; things that are partially inside, clip and add one triangle at a time. Then draw that VB with its texture, and move on to the next texture. Each texture gets touched exactly once, each VB only has triangles that are partially or entirely in the viewing frustum (or darn close), and the scene is "composited" in layers, one texture at a time.

Norman Barrows

As for the dynamic VB, I forgot to mention that the framerate really dropped for a short moment whenever the VB was being recreated - you might very well be allergic to such behaviour. But if the amount of RAM is an issue, this is a great option, especially if you can spread the task of creating the dynamic VB across multiple frames.
Which, admittedly, becomes harder to manage, since during those few frames you might actually change the camera position, and thus have to recreate a VB that wasn't even fully created in the first place...


Created or filled?

It looks like the way to go is "create once, lock many".

I'm thinking about filling a buffer each frame before drawing - perhaps one buffer for each texture, or just a few for the textures used on lots of small meshes.

mhagain

I would strongly advise you against supporting the fixed function pipeline any more, especially for the sake of "compatibility". What do you want to be compatible with? 15-year-old graphics hardware? Outdated fixed function samples, for which probably twice as many shader-equivalent tutorials exist? I don't see any point in carrying on with the fixed function pipeline for any reason. Recent GPUs don't even have a fixed function pipeline in that sense; they probably just emulate it, so there likely isn't even any performance gain from it. As for compatibility, almost all relevant graphics chips support shaders.

 

Of course it is your choice, but I see fixed function as a waste of time - something that should only be used by beginners to learn the very basics before going on to shaders. Especially if it keeps you from using techniques like instancing, this should be a warning sign!

 

I need to second this - nowadays the maximum compatibility path is shaders.  Especially since SM3 hardware became ubiquitous, all graphics hardware will actually emulate the fixed pipeline by using driver-provided shaders; what that generally means is tortuous code-paths with dynamic branching and/or lots of runtime shader recompilation and/or lots of shader changes, not to mention exercising code paths that driver writers no longer put much effort into.  Maybe 5 years ago you could just about get away with not wanting to use shaders for compatibility reasons, but nowadays there really is no longer any excuse.

 

The sole exception would be if you're targeting a very specialized community that you know uses retro hardware, but otherwise using shaders just makes sense.

 

I did, about 10 years ago on a GeForce 2 GTS, for a top-down 3D scene consisting of walls/props/floor of a quad-grid based level.

1. Brute force: SetTexture per quad + VB/IB per quad

2. Single VB/IB for the whole level, DIP per quad

3. Dynamic VB: recreating a single VB (sorting/copying the objects in the frustum) whenever the camera changes; DIP per texture

 

I generally prefer a variant on your option 3 - a static VB (sorted by texture/material at build time) with a dynamic IB - but it's a tradeoff: you avoid the overhead of rebuilding the VB, but you accept the overhead of draw calls jumping randomly about in the VB (hoping to come out on the right side of the tradeoff). The old advice about constraining your DIP to a specific range of vertices isn't relevant with hardware T&L (and it's worth noting that D3D10+ no longer specifies a vertex range, the reasoning being that many D3D9 drivers actually ignored it), so that's nothing to be concerned about any more.
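In rough outline, the variant looks like this (a sketch with made-up names; the build-time sort guarantees each texture's triangles can be written contiguously into the IB):

```cpp
#include <d3d9.h>
#include <cstring>
#include <vector>

// Static-VB / dynamic-IB sketch: vertices never move after build time; each
// frame the dynamic IB is refilled with the indices of the visible triangles,
// grouped by texture, and one DrawIndexedPrimitive is issued per texture run.
struct TextureRun { IDirect3DTexture9* tex; UINT startIndex; UINT triCount; };  // startIndex: offset into IB, in indices

void DrawVisible(IDirect3DDevice9* dev,
                 IDirect3DVertexBuffer9* staticVB, UINT stride, UINT vbVertexCount,
                 IDirect3DIndexBuffer9* dynamicIB,
                 const std::vector<unsigned short>& visible,  // culled + sorted by texture
                 const std::vector<TextureRun>& runs)
{
    if (visible.empty()) return;

    void* p = 0;
    dynamicIB->Lock(0, (UINT)(visible.size() * sizeof(unsigned short)),
                    &p, D3DLOCK_DISCARD);
    memcpy(p, &visible[0], visible.size() * sizeof(unsigned short));
    dynamicIB->Unlock();

    dev->SetStreamSource(0, staticVB, 0, stride);
    dev->SetIndices(dynamicIB);
    for (size_t i = 0; i < runs.size(); ++i)
    {
        dev->SetTexture(0, runs[i].tex);
        dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, vbVertexCount,
                                  runs[i].startIndex, runs[i].triCount);
    }
}
```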

_the_phantom_

and I'm trying to stick with fixed function for maximum compatibility.

Compatibility with what?
2002 - ATI releases the R300 GPU. No fixed function hardware.
2004 - NV releases the NV40. No fixed function hardware.

Heck, every worry you've written about smacks of problems from nearly 10 years ago...

