DrawIndexedPrimitive's Performance

I've built an octree by splitting a mesh scene into an octree data structure, where each node stores the faces that fall inside it. The octree construction itself is fine: I've managed to divide and conquer all the triangles into the tree correctly. Everything works except the performance, which is painfully slow. I'm using DrawIndexedPrimitive to render each octree node. Even when I render just 8 nodes (those inside the viewing frustum) with only 500 triangles per node (8 * 500 = 4000 faces), I get only 10 fps. What's wrong? My guess is that I'm making too many DrawIndexedPrimitive calls (about 40 per node, because I'm using the single index buffer from the mesh). Is that right? What's wrong with my method? PS: hardware is not the issue here.

1) 500 triangles per node is OK, but 40 DrawIndexedPrimitive() calls for 500 triangles is very inefficient: ~12.5 triangles per call. Anything below 20 for software vertex processing and 200 for hardware vertex processing is going to negatively affect your performance!

2) It should be possible to draw each node with a single DrawIndexedPrimitive call if you order your data nicely. With modern hardware there is usually very little to be gained (and often a lot to be lost) from culling down to a per-polygon level. With T&L hardware you can stop your culling at the bounding box level where the box contains ~200 polygons.

3) Do make sure to check the D3D debug output with the debug runtime installed. It'll give you warnings for obvious bad things.

4) You should make sure you set the range parameters correctly for the DrawIndexedPrimitive() calls so that they only include the range of vertices which matters to you. All of the vertices in the range you specify will be transformed, regardless of whether the indices refer to them or not! (It'd be much more inefficient for the vertex processing to dereference each index at transform time.)

5) Make sure the vertex and index buffers are created with flags which reflect what you're doing. For example, if you have a hardware vertex processing device, make sure that none of the buffers has the software processing flag set. Also set the write-only and dynamic hints as necessary. The same goes for the way you lock vertex and index buffers: if you get it wrong you'll get bad perf.
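
To put numbers on point 1, here is a minimal sketch of the arithmetic in plain C++. The function names are made up for illustration, and the two thresholds are only the rough figures quoted above, not hard limits.

```cpp
#include <cassert>

// Rough thresholds quoted in the post above: about 20 triangles per call for
// software vertex processing and about 200 for hardware vertex processing.
constexpr double kMinTrisPerCallSoftware = 20.0;
constexpr double kMinTrisPerCallHardware = 200.0;

// Average number of triangles submitted per draw call.
inline double averageTrianglesPerCall(unsigned triangles, unsigned drawCalls)
{
    assert(drawCalls > 0);
    return static_cast<double>(triangles) / drawCalls;
}

// True if the average batch is at least as large as the quoted threshold.
inline bool batchSizeLooksReasonable(unsigned triangles, unsigned drawCalls,
                                     bool hardwareVertexProcessing)
{
    const double threshold = hardwareVertexProcessing ? kMinTrisPerCallHardware
                                                      : kMinTrisPerCallSoftware;
    return averageTrianglesPerCall(triangles, drawCalls) >= threshold;
}
```

For the node described in the question, `averageTrianglesPerCall(500, 40)` gives the 12.5 mentioned in point 1, which is below both thresholds.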

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Thank you for the information. I hope you won't mind reading my questions
(this is a reply based on your post):
1. > anything below 200 for hardware vertex processing is going to negatively affect your performance!
Is that true? I can't help it, though, because my scene/mesh uses many textures.
Each attribute has its own set of triangles, but the octree splitting process then splits each set further.
That's why I have to call DrawIndexedPrimitive many times to render one node: I only have
one index buffer (in which the triangles are grouped by attribute).
(Am I making any sense here?)

2. > It should be possible to draw each node with a single DrawIndexedPrimitive call if you ordered your data nicely
Yes, but I use just one index buffer. Do you think making an index buffer for each node would help?
I thought that would be expensive in memory.
I don't know how Direct3D programmers usually handle this. Can you give me an idea?

3. > Do make sure to check the D3D debug output
I got one exception, but that is solved now (it was my mistake). The performance is still slow, though.

4. > You should make sure you set the range parameters correctly for the DrawIndexedPrimitive()
calls to only include the range of vertices which matters to you.

How? I mean, how do I set the range of vertices that we're going to render?
Using SetStreamSource, right?
HRESULT DrawIndexedPrimitive(
    D3DPRIMITIVETYPE Type,
    UINT MinIndex,        // I always pass 0 here. Is that right? This is the start index for the indices, right?
    UINT NumVertices,     // The number of vertices in the whole mesh? Is this what causes the perf degradation?
    UINT StartIndex,
    UINT PrimitiveCount
);


Thanks a bunch.
One last question, if you don't mind: how do you guys learn all of this?
I can't find a good book that explains Direct3D thoroughly.



1. The general advice is to batch as much as possible whenever possible. If you have many different textures and different sets of render states, this can be difficult (as you've discovered). It can help to pack many smaller textures together to form one larger texture if that allows fewer draw calls.


2. One vertex buffer and one index buffer per model should still be possible in most situations as long as the vertices are arranged in some sort of node order and you use the vertex and index ranges to only transform what's visible.


4. Aha, the MinIndex parameter is very badly named IMO. It really means FirstVertex. I'll try and illustrate how the parameter works with a bit of imaginary pseudo-C++:

    MyVertex vertices[10000];
    ...
    // Every vertex in the range is transformed, whether or not an index refers to it
    for (i = MinIndex; i < MinIndex + NumVertices; ++i)
    {
        transform_vertex( vertices[i] );
    }


D3D then uses the BaseVertexIndex parameter passed to SetIndices(), and the StartIndex and PrimitiveCount parameters passed to DrawIndexedPrimitive as follows (for simplicity I present only a triangle list):

    WORD indexbuffer[10000];
    ...
    int indexbufpos = StartIndex;
    for (i = 0; i < PrimitiveCount; ++i)
    {
        // Read three consecutive indices and offset each by BaseVertexIndex
        int vapos = indexbuffer[indexbufpos] + BaseVertexIndex;
        ++indexbufpos;
        int vbpos = indexbuffer[indexbufpos] + BaseVertexIndex;
        ++indexbufpos;
        int vcpos = indexbuffer[indexbufpos] + BaseVertexIndex;
        ++indexbufpos;

        // Fetch the (already transformed) vertices and draw the triangle
        MyPostTransformVertex va, vb, vc;
        va = vertices[vapos];
        vb = vertices[vbpos];
        vc = vertices[vcpos];

        RenderATriangle( va, vb, vc );
    }
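
The two loops above can be turned into a small CPU-only model that counts how much work one draw call asks for. This is only a sketch of the pseudo-code, not of what the runtime or driver really does: `simulateIndexedDraw` and `DrawStats` are hypothetical names, BaseVertexIndex is assumed to be 0, and no real transform takes place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Work requested by one simulated draw call.
struct DrawStats {
    std::size_t verticesTransformed = 0;
    std::size_t trianglesAssembled = 0;
};

// Toy model of an indexed triangle-list draw, following the pseudo-code above.
// Assumption: every vertex in [minIndex, minIndex + numVertices) is transformed,
// whether or not an index refers to it, and BaseVertexIndex is 0.
DrawStats simulateIndexedDraw(const std::vector<std::uint16_t>& indexBuffer,
                              std::size_t vertexBufferSize,
                              unsigned minIndex, unsigned numVertices,
                              unsigned startIndex, unsigned primitiveCount)
{
    assert(minIndex + numVertices <= vertexBufferSize);
    assert(startIndex + 3u * primitiveCount <= indexBuffer.size());

    DrawStats stats;
    stats.verticesTransformed = numVertices;  // the whole range, used or not

    for (unsigned tri = 0; tri < primitiveCount; ++tri) {
        for (unsigned corner = 0; corner < 3; ++corner) {
            const unsigned v = indexBuffer[startIndex + 3u * tri + corner];
            // Every referenced vertex must lie inside the transformed range.
            assert(v >= minIndex && v < minIndex + numVertices);
        }
        ++stats.trianglesAssembled;
    }
    return stats;
}
```

Under this model, two triangles that only touch vertices 4 to 7 of a 1000-vertex buffer report 1000 transformed vertices when called with MinIndex = 0 and NumVertices = 1000, and 4 when called with MinIndex = 4 and NumVertices = 4.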



As for how to learn this stuff: I'm still learning, it never ends! Read *everything*: the docs, Microsoft conference slides, hardware manufacturers' conference slides, mailing lists etc. Since I'm doing this stuff professionally I also get to attend conferences like Meltdown in person, and get registered developer status with the IHVs.

[EDIT] my use of the i variable in code made the board think I wanted italic text [/EDIT]

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com


Edited by - S1CA on February 19, 2002 10:34:37 AM

Well, it seems you've corrected your code. Thanks.
So, judging by your explanation of how DrawIndexedPrimitive works, what I was doing wrong is that I always set MinIndex to 0 and NumVertices to the number of vertices in the mesh, so every time I called DrawIndexedPrimitive I transformed all the vertices of the mesh. Gee, no wonder.
I've now changed my code so that the range is always correct, but the number of DrawIndexedPrimitive calls is still the same. I do gain some performance, but the result is still not satisfying: it's slower than if I draw the whole mesh (pretty weird, huh?).

There could be other reasons for the problem too, but they mostly relate to the number of draw calls you make, the number of vertices passed in each one, how and when you create, lock and fill vertex and index buffers, and what device and render state settings you use.

Can you post the compilable source code on a website anywhere?

It'd be easier to spot problems with a clear overview of all of the code.

Hey, S1CA, thanks for explaining that. I agree, it is poorly named, and very poorly explained in the documentation.

So is it correct to say that the DrawIndexedPrimitive() function could be more neatly broken down (for us, not necessarily for them) into:
TransformVertices(int MinIndex, int NumVertices);
...and...
Render(int Type, int StartIndex, int PrimitiveCount); ?

For me this general area of DirectX is kind of a grey area, as I'm not sure exactly what happens after the data is passed off to DirectX. It would be kind of nice to know though, as sometimes knowing that kind of info will let you make smarter decisions earlier on. Does anyone have a source of information on this subject?

-------------

To Yanuart: you said you have one vertex/index buffer per node, causing you to do something like:

    for (i = 0; i < NumNodes; ++i)
    {
        for (j = 0; j < NumTexturesInThisNode; ++j)
        {
            SetTexture();
            DrawIndexedPrimitive();
        }
    }


You might want to go another way and have one vertex/index buffer per texture (or maybe even one per scene if you're only working with around 4000 triangles max, depending). Then you can do:

    for (i = 0; i < NumTextures; ++i)
    {
        SetTexture();
        DrawIndexedPrimitive();
    }


Building the index buffers will be more of a pain, but you'll get better performance than by using multiple DrawIndexedPrimitive and SetTexture calls.

You can't really do better than one DrawIndexedPrimitive call per texture, except by merging and/or eliminating some textures.
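
As a rough sketch of the "one index list per texture" idea, the CPU-side grouping could look something like the code below. `Face`, `IndexList` and `buildPerTextureIndexLists` are hypothetical names, the visible nodes are assumed to be plain lists of faces, and copying the result into real index buffers is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical CPU-side face record: three vertex indices plus a texture id.
struct Face {
    std::uint16_t v[3];
    unsigned textureId;
};

using IndexList = std::vector<std::uint16_t>;

// Gather the faces of all visible nodes into one index list per texture,
// so that each texture needs a single SetTexture and a single draw call.
std::map<unsigned, IndexList>
buildPerTextureIndexLists(const std::vector<std::vector<Face>>& visibleNodes)
{
    std::map<unsigned, IndexList> lists;
    for (const auto& node : visibleNodes) {
        for (const Face& f : node) {
            IndexList& out = lists[f.textureId];
            out.insert(out.end(), f.v, f.v + 3);  // append the three indices
        }
    }
    return lists;
}
```

The number of entries in the returned map is then the number of draw calls for the frame, independent of how many nodes are visible.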



-ns-

OK, I'll post my octree data structure and the code that renders it.
Please comment if my approach is wrong.
PS: "face group" is my own term. It's basically a grouping of the faces in a mesh by
their attribute (texture), made after I've called the Optimize() method.
A node in the octree:
struct SOctreeNode
{
    D3DXVECTOR3  vectBB[8];      // Bounding box corners
    SFacesGroup* m_pListGroups;  // Array of face groups, one per attribute in the node
    DWORD        cGroups;        // Number of elements in the array
    SOctreeNode* m_pChild[8];    // Child pointers
};

A face group for each attribute:
struct SFacesGroup
{
    DWORD  AttribId;            // The attribute ID
    DWORD  cFacesGroups;        // Number of face runs in this group
    DWORD* m_pListFacesStart;   // Array of face starts for the DrawIndexedPrimitive calls
    DWORD* m_pListFacesCount;   // Array of face counts for the DrawIndexedPrimitive calls
};

I'm not good at explaining this data structure, so here is an example of how
it works.
Let's say a mesh consists of 200 faces that have been grouped by attribute,
so the attribute table has 2 entries:
Attribute 1: face indices 0-99
Attribute 2: face indices 100-199
and we have one index buffer for it: pIdxBuff

Now suppose a node contains faces from both attributes, so the node has 2 SFacesGroup
entries, one per attribute. Note that the grouping is not finished at this point, because each face
in a face group still has to be checked to see whether it is inside the node or not.
Let's skip the checking process...
The result is, for example:
Attribute 1: face indices 0-99
faces in the node: 0-10, 30-60, 90-99
This is why each face group holds a list of face starts and face counts: it stores the face index runs
(0-10, 30-60, 90-99).
Each time I render a node I just walk the arrays and pass each face start and face count
to DrawIndexedPrimitive. The same goes for the next attribute.
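
One way to avoid the scattered runs (0-10, 30-60, 90-99) is to reorder the faces once, at build time, so that all faces of one node and one attribute sit next to each other in the index buffer. The sketch below shows the idea in plain C++. It is not the code from this thread: `TaggedFace`, `FaceRun` and `reorderByNodeAndAttribute` are hypothetical names, and it assumes each face ends up in exactly one leaf node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical face record, tagged with the leaf that owns it and its attribute.
struct TaggedFace {
    std::uint16_t v[3];
    unsigned nodeId;
    unsigned attribId;
};

// One contiguous run of faces in the reordered index buffer.
struct FaceRun {
    unsigned nodeId;
    unsigned attribId;
    unsigned faceStart;
    unsigned faceCount;
};

// Sort faces by (node, attribute), rebuild the index buffer in that order and
// return exactly one run per (node, attribute) pair.
std::vector<FaceRun> reorderByNodeAndAttribute(std::vector<TaggedFace>& faces,
                                               std::vector<std::uint16_t>& indexBufferOut)
{
    std::stable_sort(faces.begin(), faces.end(),
        [](const TaggedFace& a, const TaggedFace& b) {
            return std::tie(a.nodeId, a.attribId) < std::tie(b.nodeId, b.attribId);
        });

    std::vector<FaceRun> runs;
    indexBufferOut.clear();
    for (unsigned i = 0; i < faces.size(); ++i) {
        const TaggedFace& f = faces[i];
        indexBufferOut.insert(indexBufferOut.end(), f.v, f.v + 3);
        // Start a new run whenever the node or the attribute changes.
        if (runs.empty() || runs.back().nodeId != f.nodeId ||
            runs.back().attribId != f.attribId) {
            runs.push_back(FaceRun{f.nodeId, f.attribId, i, 0u});
        }
        ++runs.back().faceCount;
    }
    return runs;
}
```

With the buffer in this order, each attribute in a node needs one face start and one face count instead of an array of them.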

Now here is the code that renders a node. There's nothing fancy about it; I believe you'll
understand it given my data structure.

void RenderOctreeNode(SOctreeNode *pNode)
{
    m_pd3dDevice->SetStreamSource(0, ppVB, vertsize);
    m_pd3dDevice->SetVertexShader(curFVF);
    m_pd3dDevice->SetIndices(ppIB, 0);

    for (DWORD idx = 0; idx < pNode->cGroups; idx++)
    {
        DWORD i = pNode->m_pListGroups[idx].AttribId;
        // Set the material and texture for this subset
        m_pd3dDevice->SetMaterial( &m_pMeshMaterials[i] );
        m_pd3dDevice->SetTexture( 0, m_pMeshTextures[i] );

        // Draw the mesh subset
        for (DWORD k = 0; k < pNode->m_pListGroups[idx].cFacesGroups; k++)
        {
            UINT maxIndex = 0;
            UINT numVertices = cVertices;
            UINT startIndex = pNode->m_pListGroups[idx].m_pListFacesStart[k] * 3;
            UINT primCount = pNode->m_pListGroups[idx].m_pListFacesCount[k];
            UINT minIndex = m_pIndices[startIndex];
            // Find the lowest and highest vertex index used by this run of faces
            for (WORD x = 0; x < primCount; x++)
                for (WORD y = 0; y < 3; y++)
                {
                    if (minIndex > m_pIndices[startIndex+(x*3)+y]) minIndex = m_pIndices[startIndex+(x*3)+y];
                    if (maxIndex < m_pIndices[startIndex+(x*3)+y]) maxIndex = m_pIndices[startIndex+(x*3)+y];
                }
            numVertices = (maxIndex - minIndex) + 1;
            if (primCount > 0)
                m_pd3dDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, minIndex, numVertices, startIndex, primCount);
        }
    }
}
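
The min/max scan in the inner loops can be pulled out into a small helper, sketched below in plain C++ with no D3D types. `VertexRange` and `computeVertexRange` are hypothetical names, and the indices are assumed to be 16-bit and held in a CPU-side copy of the index buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The MinIndex / NumVertices pair for one run of faces.
struct VertexRange {
    unsigned minIndex;
    unsigned numVertices;
};

// Scan the indices of primCount triangles starting at startIndex and return the
// smallest vertex index used plus the size of the range that covers all of them.
VertexRange computeVertexRange(const std::vector<std::uint16_t>& indices,
                               std::size_t startIndex, std::size_t primCount)
{
    assert(primCount > 0);
    assert(startIndex + 3 * primCount <= indices.size());

    const auto first = indices.begin() + startIndex;
    const auto last = first + 3 * primCount;
    const auto mm = std::minmax_element(first, last);
    return VertexRange{ *mm.first,
                        static_cast<unsigned>(*mm.second - *mm.first) + 1u };
}
```

Since the index data of a static mesh does not change, a range like this could probably be computed once when the octree is built and stored next to each face start and face count, rather than being rescanned on every frame.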

I hope you understand it and can comment.
My biggest mistake is that I built this without ever knowing how D3D works, so I'm afraid I've been doing it wrong
from the beginning.


Sorry, I don't know much about formatting my posts here.
It's all kind of messed up, but I hope you get it.
Hmm..
Is it true that a single DrawIndexedPrimitive call with more triangles performs better than multiple calls with fewer triangles?

"Is it true that the performance of a single DrawIndexedPrimitive with more triangles is better than multiple calls but less triangles ?"

Absolutely. Well, as long as you don't mean a single call with 64K triangles compared with 4 calls with a total of 100 triangles.. :o)

T

So where should I draw the line? For example, suppose I have these options:
I can call DrawIndexedPrimitive just once, with "Count1" triangles,
or
I can optimize my code and call DrawIndexedPrimitive "numCalls" times, with fewer triangles in total ("Count2" triangles).
How can I decide between them based on those parameters?
PS: Count1 > Count2 and numCalls > 1
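
One simple way to reason about this trade-off is a linear cost model: a fixed overhead per draw call plus a cost per submitted triangle. The sketch below is only that, a model. `DrawCosts`, `estimateCost` and `splittingPaysOff` are hypothetical names, and the two cost values are not known constants; they would have to be measured on the target machine.

```cpp
#include <cassert>

// Hypothetical, machine-specific costs in arbitrary units. Both values are
// placeholders that must be measured; they are not known constants.
struct DrawCosts {
    double perCall;      // fixed overhead of one DrawIndexedPrimitive call
    double perTriangle;  // incremental cost of one submitted triangle
};

// Linear estimate: fixed overhead per call plus a cost per triangle.
inline double estimateCost(const DrawCosts& c, unsigned calls, unsigned triangles)
{
    return c.perCall * calls + c.perTriangle * triangles;
}

// True if numCalls calls drawing count2 triangles in total are estimated to be
// cheaper than a single call drawing count1 triangles.
inline bool splittingPaysOff(const DrawCosts& c, unsigned count1,
                             unsigned numCalls, unsigned count2)
{
    assert(numCalls > 1 && count1 > count2);
    return estimateCost(c, numCalls, count2) < estimateCost(c, 1, count1);
}
```

Under this model, splitting wins only when (numCalls - 1) * perCall < (Count1 - Count2) * perTriangle, that is, when the triangles saved outweigh the extra call overhead.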

What I'm doing is:

- Set up all vertex buffers before rendering starts
- Create a streaming index buffer (created with D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY) before rendering starts
- During rendering, collect the indices of all triangles of all visible leaves
- Update the streaming index buffer (locked with D3DLOCK_DISCARD) with the vertex indices referenced by the triangles collected in the previous step
- Render as many triangles as fit into the index buffer in one call (up to 3000 in my case)

Hope this helps.
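
The steps above can be sketched on the CPU with a small batching class. This is only an illustration of the fill-and-flush pattern, not the code I actually use: `IndexBatcher` is a hypothetical name, a `std::vector` stands in for the streaming index buffer, and the real lock, copy and draw call are reduced to a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU stand-in for a fixed-size streaming index buffer.
class IndexBatcher {
public:
    // Capacity is rounded down to a whole number of triangles.
    explicit IndexBatcher(std::size_t capacityInIndices)
        : capacity_(capacityInIndices - capacityInIndices % 3)
    {
        assert(capacity_ >= 3);
        pending_.reserve(capacity_);
    }

    // Queue one visible triangle; flush first if the buffer is already full.
    void addTriangle(std::uint16_t a, std::uint16_t b, std::uint16_t c)
    {
        if (pending_.size() + 3 > capacity_)
            flush();
        pending_.push_back(a);
        pending_.push_back(b);
        pending_.push_back(c);
    }

    // Real code would lock the index buffer with the discard flag, copy
    // pending_ into it, unlock it and issue one draw call here.
    void flush()
    {
        if (pending_.empty())
            return;
        ++drawCalls_;
        trianglesDrawn_ += pending_.size() / 3;
        pending_.clear();
    }

    std::size_t drawCalls() const { return drawCalls_; }
    std::size_t trianglesDrawn() const { return trianglesDrawn_; }

private:
    std::size_t capacity_;
    std::vector<std::uint16_t> pending_;
    std::size_t drawCalls_ = 0;
    std::size_t trianglesDrawn_ = 0;
};
```

A frame would call `addTriangle` for every triangle of every visible leaf and then `flush()` once at the end, so the number of draw calls depends on the buffer size rather than on the number of nodes.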





eRazor:
Does that mean I have to create and release a new index buffer for the visible faces each time I render, since the size of the index buffer will vary from frame to frame?
Do you think those steps are expensive in time?
I've read about that once but I haven't verified it myself.
Or do you think it's wiser to create just one index buffer with a fixed size?

Nah, just create an index buffer of reasonable size (let's say 1000 verts) upfront. Then during rendering fill the buffer up to the specified size and call DrawIndexedPrimitive. Loop.

Don't forget to lock the buffer with D3DLOCK_DISCARD at the start of each loop iteration, as this allows the driver to return a fresh memory pointer while it streams the contents of the previous lock operation to the card at AGP speed.






Edited by - eRAZOR on February 20, 2002 3:08:42 PM
