Archived

This topic is now archived and is closed to further replies.

kirk

Primitive batching


I'm developing a 3D engine, and I have a problem with the management of the vertex buffer and the mesh. The problem has a name: batching of primitives. Let me illustrate it. Suppose we have a 3D world containing 10000 trees of the same species (for example, a pine). Naturally the engine loads only one mesh, representing exactly one tree, and creates 10000 instances of it. This way the mesh data (vertices, normals, UVs, etc.) are shared, with a consequent saving of memory.

Following all the advice from Microsoft, NVIDIA and ATI, and reading the myriad of tutorials on the internet, I understand that to achieve good performance I must batch as many polygons as possible, so that a single DrawPrimitive call submits a high number of primitives (Microsoft says around 2000, bah). Note that the advice from Microsoft, NVIDIA and ATI, and all the tutorials, never refer to a realistic situation, because their examples are not heavy in terms of computation or the quantity of data to process.

OK, let's go. My tree is composed of 2 submeshes: the bark (with material 1) and the foliage (with material 2). This is my rendering cycle:

Begin scene rendering. The scenegraph culls 9700 trees because they are outside the view frustum, so after traversing the scenegraph I have 300 trees to render. Now I have three choices:

1st choice) Fill the VB with all the barks (material 1). For each bark: set the WORLD transformation matrix, then DrawPrimitive one bark at a time. Continue the same way with the foliage (material 2).

2nd choice) Fill the VB with only one bark (material 1). For each bark passed by the scenegraph: set the WORLD transformation matrix, then DrawPrimitive one bark at a time. Continue with the foliage (material 2).

3rd choice) Compute all the transformations of all the barks on the host CPU, copy the transformed data into the VB, and call DrawPrimitive only once. Continue with the foliage (material 2).

End render cycle.
OK, the only way to really batch is the 3rd choice, but the performance depends on the CPU, because for each frame I need to compute many transformations. And now the final question: is batching useful? And if so, which is the correct way (my 3rd choice, or another way)? Thank you for your answers, and sorry for my poor English.
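The 3rd choice above can be sketched in a few lines of CPU-side code. This is a minimal illustration only, assuming a hypothetical Vertex/WorldMatrix layout (not from any real engine or the D3D API): each visible instance's local-space vertices are pre-transformed on the CPU and appended to one big array, which would then be copied into the VB and drawn with a single DrawPrimitive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical position-only vertex; a real FVF would carry normals, UVs, etc.
struct Vertex { float x, y, z; };

// A simple affine world transform: 3x3 rotation/scale plus a translation.
struct WorldMatrix {
    float m[3][3];    // rotation/scale part
    float tx, ty, tz; // translation part
};

Vertex transform(const WorldMatrix& w, const Vertex& v) {
    Vertex out;
    out.x = w.m[0][0]*v.x + w.m[0][1]*v.y + w.m[0][2]*v.z + w.tx;
    out.y = w.m[1][0]*v.x + w.m[1][1]*v.y + w.m[1][2]*v.z + w.ty;
    out.z = w.m[2][0]*v.x + w.m[2][1]*v.y + w.m[2][2]*v.z + w.tz;
    return out;
}

// Batch all visible instances of one submesh into a single vertex array.
std::vector<Vertex> batchSubmesh(const std::vector<Vertex>& localMesh,
                                 const std::vector<WorldMatrix>& instances) {
    std::vector<Vertex> batched;
    batched.reserve(localMesh.size() * instances.size());
    for (const WorldMatrix& w : instances)
        for (const Vertex& v : localMesh)
            batched.push_back(transform(w, v));
    return batched; // copy this into the locked VB, then draw once
}
```

The cost the poster is worried about is exactly the `transform` call executed (vertices × visible instances) times per frame, every frame.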

I really fail to see where you have a problem.

quote:
OK, the only way to really batch is the 3rd choice, but the performance depends on the CPU, because for each frame I need to compute many transformations.


Why? Really - why? Every tree has a world transformation that is constant (unless the tree moves or rotates etc.), so it is never recomputed.

The cycle basically looks like this:
(a) upload meshes into vertex buffer (static, btw).

In the loop:
(a) loop through all the meshes (submesh 1)
(a.1) load the tree's world matrix
(a.2) draw.
(b) loop through all the meshes (submesh 2)
b.1, b.2 identical to a.1, a.2

You could, with a good shader, draw a number of tree meshes at the same time - not that it makes a huge difference, possibly. Anyhow, I don't see any recomputation of the world matrix happening on every frame. If the camera moves, that is the camera matrix changing. I never compute the final combined matrix on the CPU - so no, I don't see thousands of transformations happening every render pass. Well, yes, on the GPU - but it can deal with that.
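The point above can be made concrete with a small sketch (hypothetical `Matrix4`/`TreeInstance` types, with a counter standing in for the cost being discussed): each tree's world matrix is built once when the tree is placed, and the render loop only *reads* it, so a static scene incurs zero matrix recomputations per frame no matter how many frames are rendered.

```cpp
#include <cstddef>
#include <vector>

// Illustrative 4x4 matrix; only the layout matters for this sketch.
struct Matrix4 { float m[16]; };

struct TreeInstance {
    Matrix4 world; // computed once when the tree is placed in the scene
};

static int g_matrixRebuilds = 0; // counts how often we pay the build cost

Matrix4 buildWorldMatrix(float x, float y, float z) {
    ++g_matrixRebuilds;
    Matrix4 w = {{1,0,0,0, 0,1,0,0, 0,0,1,0, x,y,z,1}};
    return w;
}

// Per frame we only hand the cached matrix to the API (SetTransform in
// D3D terms) and draw - nothing is recomputed on the CPU.
int renderFrame(const std::vector<TreeInstance>& visible) {
    int drawCalls = 0;
    for (const TreeInstance& t : visible) {
        (void)t.world; // stands in for SetTransform(D3DTS_WORLD, &t.world)
        ++drawCalls;   // stands in for DrawPrimitive of one submesh
    }
    return drawCalls;
}
```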

Regards

Thomas Tomiczek
THONA Consulting Ltd.
(Microsoft MVP C#/.NET)

quote:

Why? Really - why? Every tree has a world transformation that is constant (unless the tree moves or rotates etc.), so it is never recomputed.



OK, I agree with you, but I still need the transformation, because I store only one tree in RAM, in LOCAL COORDINATES, and then put it into the scene in WORLD COORDINATES.
I cannot allocate one big static VB for 10000 trees (1 tree = 500 polys × 24-byte FVF × 10000 trees ≈ 115 MB), so I need to rebuild the VB every frame to upload only the necessary trees.
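As a quick sanity check of that figure, here is the arithmetic the post is doing (multiplying the three quoted numbers directly, i.e. treating the 24-byte FVF cost as per-polygon):

```cpp
#include <cstddef>

// 500 polys per tree, a 24-byte FVF element, 10000 trees.
constexpr std::size_t kPolysPerTree   = 500;
constexpr std::size_t kBytesPerVertex = 24;
constexpr std::size_t kTreeCount      = 10000;

constexpr std::size_t totalBytes =
    kPolysPerTree * kBytesPerVertex * kTreeCount;          // 120,000,000 bytes
constexpr double totalMegabytes =
    totalBytes / (1024.0 * 1024.0);                        // ~114.4 MB
```

That lands at roughly 115 MB, far beyond what an early-2000s card could reasonably dedicate to one static buffer.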


[edited by - kirk on May 28, 2003 6:56:09 AM]

My opinion:
It sounds like, as the other gentleman said, #2 is your answer. I would agree with both of you. I don't think creating 150 megs of vertex buffers with world-transformed data will work. Cards don't have that kind of memory for cache!

The second choice just seems right. You're only setting the VB twice for 300 trees, once for each submesh. NVIDIA suggests as few VB changes as possible. You're looping through all the trees and rendering the same submesh. Sounds great to me! The only downside is applying a world matrix a total of 600 times, and I don't know how slow that is.

Good Luck.

#1 is just a bad version of #3, because you're storing all your tree data in a (dynamic) vertex buffer after all. Since you're doing that, it'd be better to transform the vertices as you store them anyway. So #1 is not really a choice, as far as I can see.

I think the performance of #2 depends on how many triangles you have per tree. If you're using billboards, then you're killing performance by calling DP hundreds of times per frame for just 2 triangles each.

I've always thought that dynamic vertex buffers were the way to go with visibility structures, i.e. cases where you don't know exactly what you're going to render (it depends on the frustum).

I'd do something like the following:
Use a dynamic vertex buffer, fill it with the transformed data of, say, 30-40 trees, unlock it and render. Then lock again, fill it with the next 30-40 trees, and so on.
i.e. I'd use a DYNAMIC buffer with a DISCARD/NOOVERWRITE scheme.
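The chunking logic of that scheme can be sketched without touching the API (the lock flags appear only in comments; `Chunk`/`planChunks` are hypothetical names). The point is how many lock/draw cycles a frame costs: 300 visible trees in chunks of 40 means 8 DrawPrimitive calls instead of 300.

```cpp
#include <cstddef>
#include <vector>

struct Chunk { std::size_t firstTree, treeCount; };

// Split the visible set into fixed-size chunks, one lock/fill/draw each.
std::vector<Chunk> planChunks(std::size_t visibleTrees,
                              std::size_t treesPerChunk) {
    std::vector<Chunk> chunks;
    for (std::size_t first = 0; first < visibleTrees; first += treesPerChunk) {
        std::size_t count = visibleTrees - first < treesPerChunk
                          ? visibleTrees - first : treesPerChunk;
        // For each chunk: lock with D3DLOCK_DISCARD on the first chunk of
        // the frame (or when the buffer wraps), D3DLOCK_NOOVERWRITE
        // otherwise; fill `count` trees of pre-transformed vertices;
        // unlock; draw once.
        chunks.push_back({first, count});
    }
    return chunks;
}
```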

By the way, testing is your friend. So make sure you test all available choices, time them and see. (And don''t forget to post your results here so that others can make use of them )

quote:
Original post by Coder
I'd do something like the following:
Use a dynamic vertex buffer, fill it with the transformed data of, say, 30-40 trees, unlock it and render. Then lock again, fill it with the next 30-40 trees, and so on.



This sounds like my 3rd choice, because I need to transform the data with the CPU before filling the VB.
Is this the right way?


I guess the 3rd way is the best.
You need to cut the trees into 2-3 chunks and render them.
But are you crazy? Why would you need 300 trees? I think the best way is to override your far plane when creating the frustum. Then you would only need to render 150-200, or at least fewer than 300.

Another way is to compute the positions of the trees when you load the map, since the trees are not moving. Then you COULD create a VB for those trees and switch to the trees' VB once a frame.

Of course I'm only learning VB optimizations right now, so I'm probably soooooooo wrong.

.lick

The options discussed so far seem to be:

1) store one tree in VB, transform 300 times each frame, or
2) store thousands of trees in VB, no transform each frame

How about something in between: store a representative "stand" of trees in the VB, say a dozen or even a hundred. Then when you render a stand, you do just one world transform for the whole stand. This keeps the VB size manageable, and also cuts the number of transforms each frame to 1/x, where x is the number of trees in the VB.
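The trade-off above reduces to a ceiling division (illustrative function name, not from any API): with x trees baked into each stand, a frame with n visible trees needs only ceil(n/x) world-matrix sets.

```cpp
#include <cstddef>

// World-matrix sets per frame when `visible` trees are grouped into
// stands of `treesPerStand` trees each (ceiling division).
constexpr std::size_t transformsPerFrame(std::size_t visible,
                                         std::size_t treesPerStand) {
    return (visible + treesPerStand - 1) / treesPerStand;
}
```

So 300 visible trees in stands of 25 would cost 12 transforms per frame instead of 300.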

The stand idea is good: render a group of 50 to 100 trees (120 polys each?) of all the same species (same materials/textures) in one VB, with one world matrix setting. If you have a level editor, you could even let the level designer build the stand (all 50 trees) in 3ds Max/Milkshape etc. and then just drop it onto the required spot on the heightmap.

I require all my trees to sway individually, so this causes problems. Each tree is rendered as a full 3D model out to a distance that is customizable per tree/object species, and as a sprite past that distance; each tree species has a mesh of its own and the world matrix is set per tree. My sprites aren't working yet, but even without them the speed is not bad with up to 2000 trees in view - as long as you sort by texture, everything is fine. I've found fillrate to be the problem, not setting matrices.
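The per-species distance switch described above is a one-liner in code. A minimal sketch, with hypothetical `Species`/`TreeLod` names: each species carries its own cutoff, and a tree past that distance is drawn as a sprite instead of a full mesh.

```cpp
enum class TreeLod { FullMesh, Sprite };

// Per-species cutoff: beyond lodDistance the tree becomes a sprite.
struct Species { float lodDistance; };

TreeLod selectLod(const Species& s, float distanceToCamera) {
    return distanceToCamera <= s.lodDistance ? TreeLod::FullMesh
                                             : TreeLod::Sprite;
}
```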

If you're using vertex shaders then you have other options:

Store a set of about 20-25 trees in a vertex buffer. For each vertex, set a matrix index (like skinning).

Set up your renderstates for Material1
for (each tree group)
{
    LoadMtxPalette()  // all 20-25 unique matrices
    SetVB/IB
    Render()
}
This way you can have about 20-25 trees in a group, cutting your number of batches from 300 to about 15, while still allowing each tree to have its own unique orientation.
So you're treating this as a 1-bone skinning problem.
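What the vertex shader would do per vertex can be sketched on the CPU (hypothetical `PalettedVertex`/`Translation` types; only the translation part of each matrix is modelled to keep it short): every vertex carries the index of its tree's matrix in the palette, and is transformed by that one matrix, exactly like 1-bone skinning.

```cpp
#include <cstddef>
#include <vector>

// Vertex carrying its tree's index into the matrix palette.
struct PalettedVertex { float x, y, z; std::size_t matrixIndex; };

// Stand-in for a full world matrix; a real palette would hold 4x3 matrices.
struct Translation { float tx, ty, tz; };

// What the shader does per vertex: look up the one matrix, transform.
PalettedVertex skinVertex(const PalettedVertex& v,
                          const std::vector<Translation>& palette) {
    const Translation& t = palette[v.matrixIndex];
    return { v.x + t.tx, v.y + t.ty, v.z + t.tz, v.matrixIndex };
}
```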

OK, thanks for the answers. Now I'll do some tests and post the results here so you can make use of them.
If there are other ideas, just post them here.

Hmmm. My problem is slightly more complex. I am using the same routines for rendering all scenery objects (trees, rocks, enemies, buildings...). The problem is that some of these are static, some are mobile, some have rendering restrictions (such as transparent textures that need to be depth-sorted) and some don't, and some have multiple materials (like the tree's bark and foliage) while others have only one.

I've experimented with material batching, transform matrix batching, and VB caching, and I got the following:

On my dev machine (2ghz, GeForce4 Ti), the fastest method was to just do several transforms, ignoring the materials, and depth-sorting only those items that contained transparent textures (this required a little more memory, because I had to tag the textures as opaque/not opaque).

On my test machine (750mhz, GeForce2 MX), VB caching with texture batching was the fastest.

My guess is that if you've got a slow card, you need to do more work on the CPU. If you've got a fast card, it's better to leave the CPU idle. So... nothing too revelatory there.

quote:
Original post by SoaringTortoise
My guess is that if you've got a slow card, you need to do more work on the CPU. If you've got a fast card, it's better to leave the CPU idle. So... nothing too revelatory there.

The main problem is finding the right balance.
In my engine I use a VB pool for static and dynamic data; it seems to be the right choice, but this solution does not permit batching.
