Rendering hundreds of objects with same geometry but at different locations

10 comments, last by matches81 18 years, 10 months ago
Hi there again! I've got a little terrain up and running, and now I want to add some stuff to it, especially trees. But I have a major problem: I know it is quite slow to have many draw calls per frame, and I know that locking a vertex buffer isn't fast either. So my problem is: how should I set up the data for, say, 500 trees, each at a different location and rotation?

A) The current idea I'm working on would do the following:
1. transform the trees manually by their world matrices, to get their vertices in world space (distinct for every single tree)
2. copy the result into one large array
3. lock the vertex buffer
4. put the array into the buffer
5. unlock the buffer
6. render everything in ONE draw call

Steps 1 to 5 would only occur when the trees change (one is destroyed or generated). Step 6 would be done every frame.

B) The only other approach I could think of would be the following:
1. set the object's world matrix as the transform
2. render the model

Approach B seems more efficient at first glance, but it would result in 500 draw calls for the trees alone, which wouldn't be efficient, I guess. But I would need that many draw calls in order to set the world matrix for every single tree.

Approach A gives me a headache on two points. Firstly, I don't use the GPU for transforming the vertices to world space (this would probably be done only once, when generating the terrain, so not THAT bad). Secondly, I need a buffer lock to make it work. On the other hand, I would only have 1 draw call for all the trees.

So I basically want to know: what is faster? 500 draw calls, or one buffer lock, copy, unlock 'cycle'?
A single lock-copy-unlock once every few frames would usually be faster than 500 individual draw calls, so method A is usually faster, even with the CPU cost of transforming the vertices of the trees into world space.

To transform the trees, D3DXVec3TransformCoordArray() and similar functions will provide SSE optimised code to transform the vertices very quickly.
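As a rough sketch of steps 1 and 2 of method A, the following plain C++ builds one world-space vertex array from a single tree model and a list of world matrices. The `Vec3` and `Mat4` types and the `translation` helper are hypothetical stand-ins so the example compiles without the SDK; in a real D3D9 program the inner loop is where D3DXVec3TransformCoordArray() would be called instead, and the result would be copied into the locked vertex buffer. It assumes affine world matrices, so the divide by w that a full coordinate transform performs is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 4x4 matrix in the D3D row-vector convention (v' = v * M),
// with the translation stored in the fourth row.
struct Mat4 { float m[4][4]; };

// Transform a point (w = 1) by an affine world matrix.
Vec3 transformPoint(const Vec3& v, const Mat4& w) {
    Vec3 r;
    r.x = v.x * w.m[0][0] + v.y * w.m[1][0] + v.z * w.m[2][0] + w.m[3][0];
    r.y = v.x * w.m[0][1] + v.y * w.m[1][1] + v.z * w.m[2][1] + w.m[3][1];
    r.z = v.x * w.m[0][2] + v.y * w.m[1][2] + v.z * w.m[2][2] + w.m[3][2];
    return r;
}

// Hypothetical helper: a world matrix that only translates.
Mat4 translation(float tx, float ty, float tz) {
    Mat4 r = {{ {1.0f, 0.0f, 0.0f, 0.0f},
                {0.0f, 1.0f, 0.0f, 0.0f},
                {0.0f, 0.0f, 1.0f, 0.0f},
                {tx,   ty,   tz,   1.0f} }};
    return r;
}

// Build one world-space vertex array holding a transformed copy of the
// model for every tree. This only needs to run when the set of trees changes.
std::vector<Vec3> buildWorldSpaceBatch(const std::vector<Vec3>& model,
                                       const std::vector<Mat4>& worlds) {
    std::vector<Vec3> batch;
    batch.reserve(model.size() * worlds.size());
    for (const Mat4& w : worlds)
        for (const Vec3& v : model)
            batch.push_back(transformPoint(v, w));
    assert(batch.size() == model.size() * worlds.size());
    return batch;
}
```

The index buffer would need the same treatment, with each copy's indices offset by the number of vertices placed before it.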



A couple of alternatives to consider:

C) DirectX 9 introduced support for hardware "instancing" (instancing is the common term for what you're trying to do). It only works on the latest (SM 3.0) graphics hardware/drivers, but it does solve the problem quite nicely. Take a look at the SDK docs at:

DirectX Graphics ->
Programming Guide ->
Advanced Topics ->
Efficiently Drawing Multiple Instances of Geometry


D) If the same set of tree models is used quite a lot, just in different places, you could use a technique similar to matrix palette skinning (and similar to what hardware instancing supports more elegantly):

- you have a VB that contains multiple tree models, say 20, all defined in object space (either all different, or multiple copies of exactly the same tree)

- all trees in a VB share the same textures

- if you're using indexed primitives (you should, they're good [smile]), then like with the VB, the IB should have the indices for all the trees that are contained in the VB.

- in the vertex format for the tree VB, each vertex contains a "tree number". This could be stored in the alpha channel of a vertex colour or in its own element.

- all the vertices for the first tree model in the VB would have a tree number of 0, all the vertices for the second tree model in the VB would have a tree number of 1, all the vertices for the third would have 2 and so on.

- in your vertex shader constants, you have an array of 20 (or so) transformation matrices.

- in your vertex shader, you read the tree number from the vertex and use this to index into the matrix array in the constant registers. (Very similar to skinning).

- That way you can draw 20-30 trees with a single draw call (more if the graphics card supports more constant registers) - a big improvement.

- The above can even be done with fixed function matrix palette skinning if you're supporting really old hardware.
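To make the buffer layout for method D concrete, here is a minimal CPU-side sketch of how the batched VB and IB could be filled. The `TreeVertex` and `TreeBatch` types and the `buildPaletteBatch` function are made-up names for illustration, and 16-bit indices with an 8-bit tree number are assumptions. The vertex shader side is not shown; it would read `treeNumber` and use it to pick one matrix from the constant array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Object-space vertex plus the "tree number" the shader uses to pick a matrix.
struct TreeVertex {
    Vec3 position;
    std::uint8_t treeNumber;
};

struct TreeBatch {
    std::vector<TreeVertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Append `copies` instances of one model to a single vertex/index buffer pair.
TreeBatch buildPaletteBatch(const std::vector<Vec3>& modelVerts,
                            const std::vector<std::uint16_t>& modelIndices,
                            std::size_t copies) {
    // An 8-bit tree number and 16-bit indices bound the batch size.
    assert(copies <= 256);
    assert(modelVerts.size() * copies <= 65536);

    TreeBatch batch;
    for (std::size_t tree = 0; tree < copies; ++tree) {
        const std::size_t base = batch.vertices.size();
        // Every vertex of this copy carries the same tree number.
        for (const Vec3& v : modelVerts)
            batch.vertices.push_back({v, static_cast<std::uint8_t>(tree)});
        // Rebase the indices so each copy references its own vertices.
        for (std::uint16_t i : modelIndices)
            batch.indices.push_back(static_cast<std::uint16_t>(base + i));
    }
    return batch;
}
```

These buffers are filled once at load time. Per frame, only the matrix array in the shader constants changes, so no vertex buffer lock is needed to move a tree.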



Final note: there's no reason why you can't mix techniques - for example for far away, low LOD trees transform them all to world space as with method A, and for nearby high LOD trees, use method C or D.

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing

hi,

just to say thx for the reply, even if I didn't ask the question. Never thought about the method D, but I like it pretty much ^^

would it be efficient for blades of grass ? probably not as much as for trees I guess ... If you have an E method for grass, it'd be perfect ^^
Thx for the answer, S1CA. And thx for the tip about D3DXVec3TransformCoordArray().
I am already rendering with indexed primitives. I started using them just for the memory benefit, but recently read somewhere that they're also faster, because graphics cards nowadays more or less expect you to use them?
Yes, I had thought about using geometry instancing, until I read that little SM3.0 remark. It would only work on my PC, at least among the friends I have ;)
But perhaps I will implement that feature later on to render more trees in full detail.
I have also already thought of using some kind of billboarding for distant trees. Perhaps I will put my trees into the D3DXMesh stuff to use its 'LOD' solution, but that would require more buffer lock-copy-unlock cycles with my approach A, since the trees would obviously go through more transitions.
For your approach D, which sounds really nice for things like buildings in an RTS: how large could the array of matrices be?

Oh, and I forgot a point:
As I understood the documentation on hardware instancing, I use 2 streams for rendering: the first contains the mesh I want to have several instances of, and the second contains one entry for every world matrix I would use. So stream 1 contains the traditional vertices and stream 2 contains 'vertices' that only contain a matrix, right? Could I do something similar, like S1CA's approach D mixed with that method?
I'm thinking about the following: let's say I want 100 trees rendered, so I would copy the vertices of a tree 100 times into my vertex buffer 1. After that I would create a second vertex buffer containing 100 world matrices for the trees.
Could I do the transform in my shader in that case? Would that be a good idea?

But it seems I can't get around that buffer lock, hm?
Quote:Original post by matches81
...Let's say I want 100 trees rendered, so I would copy the vertices of a tree 100 times into my vertex buffer 1. After that I would create a second vertex buffer containing 100 world matrices for the trees.
Could I do the transform in my shader in that case? Would that be a good idea?


No, because without SM3 you can't control the stream frequency, therefore the second stream would need the same number of vertices as the first stream. You could do this if you duplicated each of your 'matrix vertices' for every vertex in the model, but that could be a bit tiresome to update each time.
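For illustration, the duplication described above could look roughly like this. It is only a sketch: `MatrixVertex` and `expandMatricesPerVertex` are made-up names, and storing each world matrix as three float4 rows (4x3) is an assumption. Every tree's matrix is repeated once per model vertex so that the second stream has the same vertex count as the first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One element of the second stream: a 4x3 world matrix as three float4 rows.
struct MatrixVertex { float rows[3][4]; };

// Repeat each per-tree matrix once for every vertex of the model, so that
// stream 2 lines up one-to-one with the duplicated model vertices in stream 1.
std::vector<MatrixVertex> expandMatricesPerVertex(
        const std::vector<MatrixVertex>& perTree, std::size_t vertsPerTree) {
    assert(vertsPerTree > 0);
    std::vector<MatrixVertex> stream;
    stream.reserve(perTree.size() * vertsPerTree);
    for (const MatrixVertex& m : perTree)
        stream.insert(stream.end(), vertsPerTree, m);
    return stream;
}
```

With this layout each vertex carries an extra 48 bytes, and moving a single tree means rewriting `vertsPerTree` entries, which is the tiresome update mentioned above.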


The array of matrices is limited by the number of shader constants available. In SM2 you have 256 constants, so at 3 per matrix you could get up to 85 in there. Maybe more if you are cunning.
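The arithmetic behind that limit can be written down as a small sketch. The function names are hypothetical, and the idea of reserving some registers (for example 4 for a view-projection matrix) is an assumption, not something from the posts above.

```cpp
#include <cassert>

// How many palette matrices fit into the vertex shader constant registers,
// after setting aside registers needed for other constants.
int maxMatricesPerBatch(int constantRegisters, int reservedRegisters,
                        int registersPerMatrix) {
    assert(registersPerMatrix > 0);
    assert(reservedRegisters >= 0 && reservedRegisters <= constantRegisters);
    return (constantRegisters - reservedRegisters) / registersPerMatrix;
}

// How many draw calls are needed to render all trees (rounded up).
int drawCallsNeeded(int treeCount, int matricesPerBatch) {
    assert(treeCount >= 0 && matricesPerBatch > 0);
    return (treeCount + matricesPerBatch - 1) / matricesPerBatch;
}
```

With 256 registers and 3 per matrix, `maxMatricesPerBatch(256, 0, 3)` gives the 85 quoted above. Reserving 4 registers leaves 84, and either way 500 trees fit into 6 draw calls.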
This might be obvious, but it seems like a waste to store 20 identical trees in a vertex buffer. You should take full advantage of the situation and make all of them unique [smile]
Jiia:
Since I'm going to generate the trees randomly, I guess that could be possible, but generating around 500 to 1000 trees would take some time, I guess... I will try that out once I have implemented tree generation.
Quote:Original post by matches81
Jiia:
Since I'm going to generate the trees randomly, I guess that could be possible, but generating around 500 to 1000 trees would take some time, I guess... I will try that out once I have implemented tree generation.

I was assuming you were going to try the vertex matrix lookup suggested by S1CA. If so, you will never need to lock buffers. You would just upload a set of 20 or so matrices to the shader and render.

I'm wondering how you plan to cull these objects if they are to be rendered in one call? I suppose you could make sure they all exist in a grouped area.
This page gives you a list of ways to do instancing with various levels of hardware support.
outRider:
thx for the link, could have looked in the samples beforehand...
interesting sample ;) but what bothers me:
if I run the sample and use hardware instancing, it isn't really faster than shader instancing. And shader instancing could be used down to SM 1.1, though not too efficiently. Seems I could use that too.

Jiia:
Hmm... I hadn't thought about culling yet. But since I already use a quadtree to cull my terrain, I could just use that one to cull my trees as well.
The approach with one draw call for the trees would require that I lock / unlock the buffer every time visibility changes and/or a tree is destroyed, which could be quite often: not every frame, but I assume in more than 1/10 of them.


General:
Both methods I consider usable, including shader instancing as in the DX SDK sample posted by outRider, require that I have one copy of the vertices and indices for every 'instance' of the mesh. So there is no memory gain from either method. Furthermore, both would require me to remove instances I don't want to render from the buffers and add new ones. So I still need the buffer lock/unlock stuff.
The only difference (and advantage) I see for shader instancing: I use the GPU for transforming those vertices, instead of putting load on the CPU for graphics stuff. The disadvantage of shader instancing is that I have to adapt it to SM1.1, 2.0 and 3.0, because of the different numbers of shader constants. This shouldn't be too big of a problem, but it also implies that I need more than one draw call for it.
So, now I've got another question:
GPU transformation of instances into world space with multiple draw calls (around 50 or more) versus CPU transformation of instances into world space with FEWER draw calls (one to 16 or something like that).
Any clues which should be faster? 50 draw calls should put a significant load on the CPU...
