Primitive batching

Started by
12 comments, last by kirk 20 years, 10 months ago
I'm developing a 3D engine, and I have a problem with the management of vertex buffers and meshes. The problem has a name: primitive batching. Let me illustrate it.

Suppose we have a 3D world containing 10000 trees of the same species (a pine, for example). Logically, the engine loads only one mesh representing a single tree and creates 10000 instances of it. This way the mesh data (vertices, normals, UVs, etc.) is shared, which saves memory.

Following all the advice from Microsoft, Nvidia and ATI, and having read the myriad tutorials on the internet, I understand that to get good performance I must batch polygons together so that a single DrawPrimitive call submits a large number of primitives (Microsoft says around 2000, bah). I should point out that none of this advice and none of the tutorials deal with a real situation, because the examples are never heavy in terms of calculations or the amount of data to process.

OK, let's go. My tree is composed of 2 submeshes: the bark (material 1) and the foliage (material 2). This is my rendering cycle.

Start: begin scene rendering. The scenegraph culls 9700 trees because they are outside the view frustum. At the end of the scenegraph traversal I have 300 trees to render. Now I have three choices:

1st choice) Fill the VB with all the barks (material 1). For each bark: set the WORLD transformation matrix, then call DrawPrimitive for that one bark. Continue with the foliage (material 2).

2nd choice) Fill the VB with only one bark (material 1). For each bark passed from the scenegraph: set the WORLD transformation matrix, then call DrawPrimitive for that one bark. Continue with the foliage (material 2).

3rd choice) Compute the transformations of all barks on the host CPU. Copy the data into the VB. Call DrawPrimitive only once. Continue with the foliage (material 2).

End of render cycle.

OK, the only way to batch is the 3rd choice, but then performance depends on the CPU, because every frame I need to compute many transformations. And now the final question: is batching useful? And if yes, which is the correct way (my 3rd choice or something else)? Thank you for your answers, and sorry for my poor English.
I really fail to see where you have a problem.

quote:Ok, the unique way for batching is the 3rd choice, but the performances depends from the CPU, because for each frame I need compute many transformations.


Why? Really - why? Every tree has a world transformation that is constant (unless the tree moves or rotates etc.), so it is never recomputed.

The cycle basically looks like this:
(a) upload meshes into vertex buffer (static, btw).

In the loop:
(a) loop through all the meshes (submesh 1)
(a.1) load the tree's world matrix
(a.2) draw.
(b) loop through all the meshes (submesh 2)
b.1, b.2 identical to a.1, a.2

You could, with a good shader, draw a number of tree meshes at the same time, though it possibly would not make a huge difference. Anyhow, I don't see any recomputation of the world matrix happening every frame. If the camera moves, it is the camera matrix that changes. I never compute the final combined matrix on the CPU, so no, I don't see thousands of transformations happening every render pass. Well, yes, on the GPU, but it can deal with this.
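
To make the loop above concrete, here is a minimal sketch of the draw order it describes. It does not call Direct3D; `CallLog`, `TreeInstance` and `drawTrees` are hypothetical names that only record what a renderer would be asked to do, assuming each tree's world matrix was computed once at load time and is referenced by an id.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a renderer: it only records the calls
// that would be issued, in order.
struct CallLog {
    std::vector<std::string> calls;
};

struct TreeInstance {
    int worldMatrixId;  // index of a world matrix computed once at load time
};

// Draw every visible tree one submesh at a time, so each material is set
// once per frame instead of once per tree.
inline void drawTrees(const std::vector<TreeInstance>& visible,
                      int submeshCount, CallLog& log) {
    assert(submeshCount >= 0);
    for (int submesh = 0; submesh < submeshCount; ++submesh) {
        log.calls.push_back("SetMaterial " + std::to_string(submesh));
        for (const TreeInstance& tree : visible) {
            log.calls.push_back("SetWorld " + std::to_string(tree.worldMatrixId));
            log.calls.push_back("Draw submesh " + std::to_string(submesh));
        }
    }
}
```

With 300 visible trees and 2 submeshes this records 2 material changes, 600 world-matrix changes and 600 draws, and no per-vertex work on the CPU.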

Regards

Thomas Tomiczek
THONA Consulting Ltd.
(Microsoft MVP C#/.NET)
quote:
Why? Really - why? Every tree has a world transformation that is constant (unless the tree moves or rotates etc.), so it is never recomputed.


OK, I agree with you, but I need the transformation because I store only one tree in RAM, in LOCAL COORDINATES, and then place it in the scene in WORLD COORDINATES.
I cannot allocate one big static VB for 10000 trees (1 tree = 500 polys * FVF of 24 bytes * 10000 = about 115 MB); I need to rebuild the VB every frame and upload only the necessary trees.
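
The size estimate above can be checked with a few lines of arithmetic. This sketch reproduces the post's own formula (polys * 24 bytes * trees); the constant and function names are made up for illustration. If the 24 bytes are per vertex and the triangles are not indexed, the real figure would be roughly three times larger.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the post; the names are illustrative only.
constexpr std::uint64_t kPolysPerTree = 500;
constexpr std::uint64_t kBytesPerElement = 24;  // FVF size from the post

// Bytes needed to store `trees` pre-transformed copies of the tree mesh,
// following the post's formula.
constexpr std::uint64_t preTransformedBytes(std::uint64_t trees) {
    return kPolysPerTree * kBytesPerElement * trees;
}

constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

static_assert(preTransformedBytes(10000) == 120000000ULL,
              "10000 trees need 120,000,000 bytes");
```

By the same formula, `preTransformedBytes(10000)` is about 114.4 MiB, while the 300 visible trees need only 3,600,000 bytes (about 3.4 MiB).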


[edited by - kirk on May 28, 2003 6:56:09 AM]
Can someone help me?
Please!
My Opinion:
Sounds like #2 is your answer, as the other gentleman said. I would agree with both of you. I don't think creating 150 megs of vertex buffers with world-transformed data will work. Cards don't have that kind of memory for cache!

The second choice just seems right. You're only setting the VB twice for 300 trees, once for each submesh. NVIDIA suggests as few VB changes as possible. You're looping through all the trees and rendering the same submesh. Sounds great to me! The only downside is applying a world matrix a total of 600 times, and I don't know how slow that is.

Good Luck.

#1 is just a bad version of #3, because you're storing all your tree data in a (dynamic) vertex buffer after all. Since you're doing that, it'd be better to transform them as you store them anyway. So, #1 is not a choice, as far as I see.

I think the performance of #2 depends on how many triangles you have per-tree. If you're using billboards, then you're killing performance by calling DP hundreds of times per second for just 2 triangles each.

I've always thought that dynamic vertex buffers were the way to go with visibility structures, i.e. cases where you don't know exactly what you're going to render (depends on the frustum).

I'd do something like the following:
Use a dynamic vertex buffer, fill it with the transformed data of, say, 30/40 trees, unlock it and render. Then I'd lock, fill it in with the next 30/40 trees, ...etc
i.e. I'd use a DYNAMIC DISCARD/NOOVERWRITE scheme.
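
As a rough sketch of the bookkeeping behind a DISCARD/NOOVERWRITE scheme, the class below decides, for each batch, where to write and which lock mode to ask for. It never touches Direct3D: `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical names, and the enum values only mirror the idea of the D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD flags.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-ins for the two lock modes; not the real D3DLOCK_* constants.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    LockMode mode;
    std::size_t offsetBytes;  // where the batch should be written
};

// Tracks the write position inside one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), next_(0) {}

    // Reserve room for one batch. While the batch still fits behind the data
    // already written, append with NoOverwrite; otherwise ask for a fresh
    // buffer with Discard and restart at offset 0.
    LockRequest reserve(std::size_t batchBytes) {
        assert(batchBytes <= capacity_);
        if (next_ + batchBytes > capacity_) {
            next_ = batchBytes;
            return LockRequest{LockMode::Discard, 0};
        }
        const std::size_t offset = next_;
        next_ += batchBytes;
        return LockRequest{LockMode::NoOverwrite, offset};
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

Each returned request would then drive one lock, fill, unlock and draw of a 30/40-tree batch; the point of the scheme is that the GPU can keep reading earlier batches while the CPU writes the next one.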

By the way, testing is your friend. So make sure you test all available choices, time them and see. (And don't forget to post your results here so that others can make use of them.)

quote:Original post by Coder
I'd do something like the following:
Use a dynamic vertex buffer, fill it with the transformed data of, say, 30/40 trees, unlock it and render. Then I'd lock, fill it in with the next 30/40 trees, ...etc


This sounds like my 3rd choice, because I need to transform the data on the CPU before filling the VB.
Is this the right way?
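
For reference, the CPU work in question looks roughly like this. It is a sketch only: `Vec3`, `Mat4`, `transformPoint`, `translation` and `appendTransformed` are made-up names, the matrix uses the row-vector convention (v' = v * M) with the translation in the last row, and normals and UVs are left out.

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 4x4 matrix, row-vector convention: translation is in row 3.
struct Mat4 { float m[4][4]; };

// Transform a position (w = 1) from local to world space.
inline Vec3 transformPoint(const Vec3& v, const Mat4& w) {
    return Vec3{
        v.x * w.m[0][0] + v.y * w.m[1][0] + v.z * w.m[2][0] + w.m[3][0],
        v.x * w.m[0][1] + v.y * w.m[1][1] + v.z * w.m[2][1] + w.m[3][1],
        v.x * w.m[0][2] + v.y * w.m[1][2] + v.z * w.m[2][2] + w.m[3][2]};
}

// Build a pure translation matrix, enough to place a tree in the world.
inline Mat4 translation(float tx, float ty, float tz) {
    return Mat4{{{1.0f, 0.0f, 0.0f, 0.0f},
                 {0.0f, 1.0f, 0.0f, 0.0f},
                 {0.0f, 0.0f, 1.0f, 0.0f},
                 {tx, ty, tz, 1.0f}}};
}

// Append one tree's vertices, moved into world space, to the batch that
// will later be copied into the dynamic vertex buffer.
inline void appendTransformed(const std::vector<Vec3>& localVerts,
                              const Mat4& world, std::vector<Vec3>& batch) {
    batch.reserve(batch.size() + localVerts.size());
    for (const Vec3& v : localVerts) {
        batch.push_back(transformPoint(v, world));
    }
}
```

The cost is one matrix multiply per vertex per visible tree per frame, which is exactly the CPU load the question is worried about, so timing it against the per-tree draw loop is the only way to know which wins.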


I guess the 3rd way is the best.
You need to split the trees into 2-3 chunks and render them.
But are you crazy? Why would you need 300 trees? I guess the best way is to override your far plane when creating the frustum.
Then you would only need to render 150-200, or at least fewer than 300.

Another way is to compute the positions of the trees when you load the map, since the trees are not moving. Then you COULD create a VB for those trees and switch to the trees VB once a frame.

Of course I'm only learning the VB optimizations right now, so I am probably soooooooo wrong.

.lick
The options discussed so far seem to be:

1) store one tree in VB, transform 300 times each frame, or
2) store thousands of trees in VB, no transform each frame

How about something in between: store a representative "stand" of trees in the VB, say a dozen or even a hundred. Then when you render a stand, you are doing just one world transform for the whole stand. This would keep the VB size manageable, and also cut the number of transforms each frame to 1/x, where x is the number of trees in the VB.
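
A quick way to see the trade-off is to count draws and buffer size for different stand sizes. This is only a back-of-the-envelope sketch with made-up function names; it assumes visible trees fall into whole stands, which real culling will not guarantee, and it reuses the 500 polys and 24 bytes quoted earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>

// World-matrix changes (and draw calls) per frame when trees are drawn
// in stands of `treesPerStand`, one draw per stand per submesh.
inline std::size_t drawsPerFrame(std::size_t visibleTrees,
                                 std::size_t treesPerStand,
                                 std::size_t submeshes) {
    assert(treesPerStand > 0);
    const std::size_t stands =
        (visibleTrees + treesPerStand - 1) / treesPerStand;  // round up
    return stands * submeshes;
}

// Vertex buffer size for one stand, using the thread's per-tree formula.
inline std::size_t standBufferBytes(std::size_t treesPerStand,
                                    std::size_t polysPerTree,
                                    std::size_t bytesPerElement) {
    return treesPerStand * polysPerTree * bytesPerElement;
}
```

For 300 visible trees with 2 submeshes, stands of 1, 12 and 100 trees give 600, 50 and 6 draws, while the stand buffer grows from 12,000 bytes to 144,000 and 1,200,000 bytes.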
The stand idea is good: render a group of roughly 50 to 100 trees (120 polys each?), all of the same species (same materials/textures), in one VB with a single world matrix setting. If you have a level editor, maybe allow the level designer to build the stand in 3d max/milkshape etc. (all 50 trees) and then just drop it onto the required spot on the heightmap.

I require all my trees to sway individually, so this causes problems. Each tree is rendered as a full 3D model out to a distance that is customizable per tree/object species, and as a sprite past this distance. Each tree species has a mesh of its own, and the world matrix is set per tree. My sprites aren't working yet, but even without them the speed is not bad with up to 2000 trees in view. As long as you sort by texture everything is fine; I've found fillrate to be the problem, not setting matrices.
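
The two ideas in this post, a per-species cutoff between mesh and sprite and sorting by texture, can be sketched as follows. All names here (`Species`, `DrawItem`, `chooseRepresentation`, `sortByTexture`, `countTextureChanges`) are hypothetical, and the sketch assumes one texture id per draw item.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Representation { Mesh, Sprite };

struct Species {
    int textureId;
    float meshCutoffDistance;  // beyond this distance, draw a sprite
};

struct DrawItem {
    int textureId;
    Representation rep;
};

// Full 3D model up to the species' cutoff distance, sprite past it.
inline Representation chooseRepresentation(const Species& s, float distance) {
    return distance <= s.meshCutoffDistance ? Representation::Mesh
                                            : Representation::Sprite;
}

// Group items that share a texture so each texture is bound once.
inline void sortByTexture(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.textureId < b.textureId;
                     });
}

// Number of texture binds needed to draw the items in their current order.
inline std::size_t countTextureChanges(const std::vector<DrawItem>& items) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].textureId != items[i - 1].textureId) {
            ++changes;
        }
    }
    return changes;
}
```

For five items whose textures alternate 1, 2, 1, 2, 1, sorting cuts the texture binds from 5 to 2; the world matrix is still set once per tree, which the post reports was not the bottleneck.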
Uhhh...

This topic is closed to new replies.
