Questions about batching static geometry

6 comments, last by Lightness1024 11 years ago

Hello all,

I have come to a roadblock with my current project, whose goal is to test my rendering engine. The problem occurs when attempting to draw a multitude of small models (fewer than 50 triangles each). Since each of those models has its own VB and IB, they are each drawn individually. Of course this raises the draw call count, and each call outputs very few vertices, leaving me bound by CPU speed.

The solution is to batch these suckers up, but I have some doubts about how to proceed.

Note that all the models are static.

My concerns:

- Since each model has its own transformation matrix, do I have to pre-transform each vertex before adding it to my batch vertex buffer?

- How do I still take advantage of per-model frustum culling?

- How efficient is it to re-create the batch every frame, after the frustum-cull checks?

The way I am currently thinking about doing this is like so:

- On startup, allocate enough memory for a static geometry batch VB & IB.

- Do the frustum cull checks on each model...

- Static models that can be batched (share the same textures, materials, etc.) have each of their vertices transformed and added to the batch.

- Draw

- Flush batch, rebuild it for other models that share common textures, materials...

- Draw

and so on....

How efficient would it be to transform vertices and recreate the batches (multiple times) per frame? Does anyone have any insights?
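To make it concrete, here is roughly the per-frame rebuild I have in mind. This is just an untested sketch: the types and names are made up, it assumes position-only vertices, 16-bit indices, and a row-major matrix with translation in the last row.

```cpp
#include <cstdint>
#include <vector>

struct Vertex  { float x, y, z; };
struct Matrix4 { float m[4][4]; };   // row-major, translation in m[3][*]

struct Model {
    std::vector<Vertex>   vertices;
    std::vector<uint16_t> indices;
    Matrix4               world;      // per-model transform
    int                   materialId; // models sharing this can be batched together
};

// Transform a vertex by the model's world matrix on the CPU (row-vector convention).
static Vertex TransformVertex(const Vertex& v, const Matrix4& m)
{
    Vertex out;
    out.x = v.x * m.m[0][0] + v.y * m.m[1][0] + v.z * m.m[2][0] + m.m[3][0];
    out.y = v.x * m.m[0][1] + v.y * m.m[1][1] + v.z * m.m[2][1] + m.m[3][1];
    out.z = v.x * m.m[0][2] + v.y * m.m[1][2] + v.z * m.m[2][2] + m.m[3][2];
    return out;
}

// Append every visible model that shares materialId into one CPU-side batch;
// the batch VB/IB would then be uploaded and drawn with a single call.
void BuildBatch(const std::vector<Model*>& visibleModels, int materialId,
                std::vector<Vertex>& batchVB, std::vector<uint16_t>& batchIB)
{
    batchVB.clear();
    batchIB.clear();
    for (const Model* model : visibleModels) {
        if (model->materialId != materialId)
            continue;
        const uint16_t base = static_cast<uint16_t>(batchVB.size());
        for (const Vertex& v : model->vertices)
            batchVB.push_back(TransformVertex(v, model->world));
        for (uint16_t i : model->indices)
            batchIB.push_back(base + i);   // re-offset indices into the batch
    }
}
```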

Thank you.


To batch like this you either need to pre-transform the vertices, or in the vertex shader you need to be able to look up which mesh any given vertex belongs to so that you can retrieve the correct world matrix. However, you're talking about static meshes here, so you could just pre-transform once when building the level or at load time. That said, your approach may not scale well for high-polygon scenes, since you will have to move and process a whole lot of data on the CPU. There can also be GPU performance implications from accessing memory that's CPU-writable.

The more modern approach would be to use instancing, which is where you draw the same mesh many times in a single draw call. Of course, this relies on you having meshes that are duplicated many times.

With instancing, you can draw multiple objects with a single draw call. The optimal scenario is of course one draw call per object type, so you minimize everything from draw calls to material and buffer switches.

There are multiple ways of doing it of course, but you can start simple: what you'll need is that each mesh (which knows about its buffers, shaders, etc.) keeps a list of transforms for its instances. Instead of drawing a mesh directly, you just add a transform matrix to that list. I'm currently using std::vector for the dynamic storage and I haven't had any performance issues. After you have submitted all the instances to each mesh, you can transfer the transform data to a cbuffer, a generic buffer or a vertex buffer and then draw multiple instances. I prefer a generic Buffer<float4> object since its size limit is 128 megabytes and it can be used for skinned objects too.
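Something along these lines, assuming D3D11 purely for illustration; MeshBatch, QueueInstance and DrawBatch are made-up names, and the creation and binding of the instance buffer are left out since they depend on your setup:

```cpp
#include <d3d11.h>
#include <DirectXMath.h>
#include <cstring>
#include <vector>

// One entry per mesh type: the GPU buffers plus the transforms queued this frame.
struct MeshBatch {
    ID3D11Buffer* vertexBuffer   = nullptr;
    ID3D11Buffer* indexBuffer    = nullptr;
    ID3D11Buffer* instanceBuffer = nullptr;  // dynamic buffer, one matrix per instance
    UINT          indexCount     = 0;
    UINT          vertexStride   = 0;
    std::vector<DirectX::XMFLOAT4X4> instanceTransforms; // filled each frame
};

// Instead of drawing the mesh immediately, just remember its world matrix.
void QueueInstance(MeshBatch& batch, const DirectX::XMFLOAT4X4& world)
{
    batch.instanceTransforms.push_back(world);
}

// After everything is queued: copy the transforms to the GPU, issue one instanced draw.
void DrawBatch(ID3D11DeviceContext* ctx, MeshBatch& batch)
{
    if (batch.instanceTransforms.empty())
        return;

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(batch.instanceBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped))) {
        std::memcpy(mapped.pData, batch.instanceTransforms.data(),
                    batch.instanceTransforms.size() * sizeof(DirectX::XMFLOAT4X4));
        ctx->Unmap(batch.instanceBuffer, 0);
    }

    // The instance buffer could be bound as a second vertex stream, a cbuffer or a
    // shader-visible buffer; that binding is omitted here.
    UINT offset = 0;
    ctx->IASetVertexBuffers(0, 1, &batch.vertexBuffer, &batch.vertexStride, &offset);
    ctx->IASetIndexBuffer(batch.indexBuffer, DXGI_FORMAT_R16_UINT, 0);
    ctx->DrawIndexedInstanced(batch.indexCount,
                              static_cast<UINT>(batch.instanceTransforms.size()), 0, 0, 0);

    batch.instanceTransforms.clear();  // ready for the next frame
}
```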

So, this way you can minimize all draw calls, shader program bindings, material changes etc.

Of course, culling each object and adding objects to the lists individually will eat lots of CPU power. So the next logical step would be to divide your world with a quadtree / octree and have each node keep a list of mesh lists (i.e. a list for each different kind of mesh inside the node), as sketched below. When traversing your spatial tree, instead of checking and adding individual objects, you can add whole lists of objects to the drawing queue.
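For example, something like this layout; it is only illustrative and the names are made up:

```cpp
#include <DirectXMath.h>
#include <vector>

// Each spatial node keeps, per mesh type, the transforms of the instances inside it.
// If the node's bounds pass the frustum test, each per-mesh list is appended to that
// mesh's instancing queue in one go, instead of culling and queuing every object.
struct InstanceList {
    int meshId;                                   // which mesh type these belong to
    std::vector<DirectX::XMFLOAT4X4> transforms;  // one world matrix per instance
};

struct OctreeNode {
    DirectX::XMFLOAT3 boundsMin, boundsMax;       // AABB of everything in this node
    std::vector<InstanceList> meshLists;          // a list for each kind of mesh in the node
    OctreeNode* children[8] = {};                 // null for leaf nodes
};
```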

Of course, this will result in some out-of-frustum objects being drawn, but it is a trade-off: lots of CPU saved, a bit more GPU used.

The advantage of the instancing system is that it is a bit more scalable than transforming each mesh manually. Also, I like the instancing approach since you don't need to touch the vertices or indices. For very few meshes, though, performance would probably be better if they were transformed manually.

Best regards!

Thanks a lot for the insight.

Instancing seems like the go-to technique... though it won't help where the application uses many different geometries.

In the end, it seems like batching is more client-specific. I don't really see a convenient way to implement it at the "engine" level in a generic way. I think I will take it to a higher level and create some sort of flexible "scene management" layer that can be tailored more specifically to what an application actually needs when it comes to batching geometry together.

Cheers.

Of course, this will result in some out-of-frustum objects being drawn, but it is a trade-off: lots of CPU saved, a bit more GPU used.

In cases where instancing is heavily used, vertex and polygon count quickly becomes a bottleneck issue for me, so I still require some form of frustum culling.

It is totally possible to do frustum culling instance by instance, and it does not take much CPU time. Just keep a transformed bounding sphere for every instance and cull each sphere as usual. For every instance that is in view, store its index in a separate array to keep track of the visible ones, and use that array to update your instancing data in a dynamic vertex buffer for the instances.
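Roughly like this (sketch only; SphereInFrustum and CullInstances are made-up names, and the frustum is assumed to be stored as six inward-facing planes):

```cpp
#include <cstdint>
#include <vector>

struct Sphere  { float x, y, z, radius; };  // already transformed to world space
struct Plane   { float a, b, c, d; };       // normal (a,b,c) points into the frustum
struct Frustum { Plane planes[6]; };

static bool SphereInFrustum(const Sphere& s, const Frustum& f)
{
    for (const Plane& p : f.planes) {
        float dist = p.a * s.x + p.b * s.y + p.c * s.z + p.d;
        if (dist < -s.radius)
            return false;   // completely outside one plane -> culled
    }
    return true;            // inside or intersecting
}

// Collect the indices of visible instances; those indices are then used to fill the
// dynamic instance buffer with only the visible transforms.
void CullInstances(const std::vector<Sphere>& worldSpheres, const Frustum& frustum,
                   std::vector<uint32_t>& visibleIndices)
{
    visibleIndices.clear();
    for (uint32_t i = 0; i < static_cast<uint32_t>(worldSpheres.size()); ++i)
        if (SphereInFrustum(worldSpheres[i], frustum))
            visibleIndices.push_back(i);
}
```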

New game in progress: Project SeedWorld

My development blog: Electronic Meteor

I'm told that Unity does batching of static geometry during its resource-baking phase. As for frustum culling, the primitive assembler will drop your invisible triangles at the speed of light, and recent graphics cards push vertices very fast as long as your shader is not too fancy, so you may not even need to bother with it. Doing that preparation work on the CPU would kill performance compared to rendering a single static VB of up to ~1 million polygons on a high-end card.


When you're talking about batching geometry, do you mean combining similar meshes into one huge vertex buffer, or just batching the state changes? I've gotten performance gains from frustum culling when dealing with thousands of instanced models of a few hundred triangles each, which is still acceptable with DirectX. Maybe it's different for OpenGL, though. Unity prefers using DirectX on Windows platforms.

For my case, though, I am not using a high-end card but a mid-range card from the DirectX 10 days (an HD 4670, from the generation just before DX11 cards came out). My target specs are DX10-class cards, but using shader model 3.0 as the minimum.

New game in progress: Project SeedWorld

My development blog: Electronic Meteor

I really don't know; what I understood is that it is all batched so as to reduce draw calls to a minimum. It could be one vertex buffer per material or one huge VB, I don't know. Surely reading through the docs would reveal that :)

Now, your card is indeed mid-range, so I would not expect 1M polys to be easily digestible. If the frustum culling costs next to nothing (just a decision whether to render or not), it will certainly win some frame rate. If it involves rebuilding a whole buffer, though, I doubt it will be faster than just rendering the whole thing, because invisible triangles are handled quite quickly by the graphics pipeline, and because even when visible, vertex processing tends to be very fast. Rasterization speed depends on distance: far objects project to very small screen coverage, so their rasterization is fast. Problems really only arise when there is lots of overdraw, and with objects that project widely on screen.

One last thing: shader model 3 is a DX9-era target, and obviously the API is different. Do you intend to use DX10 directly yourself? I suppose you don't want to spend the time writing an alternative DX9 layer as well? If so, I suggest you compile your shaders with shader model 4 (the DX10 generation), because various constructs will get compiled into forms that are more natural for cards of that generation, notably no loop unrolling and truly dynamic "if"s. Compile times will probably also go down thanks to less need for static analysis.
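For instance, with the d3dcompiler API it's just a matter of passing the SM4 profile string instead of the SM3 one; this sketch assumes a D3D11-era toolchain, and the file name and entry point are placeholders:

```cpp
#include <d3dcompiler.h>
#pragma comment(lib, "d3dcompiler.lib")

// Compile a vertex shader against the SM4 profile ("vs_4_0") instead of "vs_3_0".
ID3DBlob* CompileVS(const wchar_t* file)
{
    ID3DBlob* bytecode = nullptr;
    ID3DBlob* errors   = nullptr;
    HRESULT hr = D3DCompileFromFile(file, nullptr, nullptr,
                                    "VSMain",   // entry point (placeholder)
                                    "vs_4_0",   // SM4 target profile
                                    0, 0, &bytecode, &errors);
    if (errors)
        errors->Release();          // would contain the compiler log on failure
    return FAILED(hr) ? nullptr : bytecode;
}
```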

