90 calls drawprimitive@3 verts each, == 33 fps on GF3! 49 on GF4ti 4200

Here's what I'm doing:

>Fill a static VB with 3 verts
>loop through
. >setstreamsource for that VB
. >settexture
. >set fixed function pipeline( fvf )
. >loop through 90 times
. . >call drawprimitive on that ONE static VB
>repeat

The stats I got are above. It doesn't seem right to me. The texture is 64x64 DXT3; changing it to 512x512 doesn't matter either. How is the drawprimitive overhead THAT expensive? You're supposed to be able to call it fewer than 90 times on huge VBs and get about the frame rate I'm getting here with a 3-vertex VB. Does it have to do with me calling drawprimitive on the same portion of the VB 90 times? The pseudo-code above is precisely what I'm doing in the code. Any thoughts?

thanks
-Marv

[edited by - Wicked Ewok on February 22, 2004 10:41:15 PM]
[edited by - Wicked Ewok on February 22, 2004 10:42:27 PM]

Profile your code. It's very likely not the DrawPrimitive() call that's slowing things down. Find out what is.

EDIT: Also make sure you're not using the reference device.

[edited by - glassJAw on February 22, 2004 10:47:03 PM]

But I'm sure it is...It runs at 450 fps if I render one VB. Here's a snippet...I still don't know how to paste snippets correctly, such shame:

for( k = 0; k < k_size; k++ )
{
    TextureBatch * texbatch = CurShaderBatch->_tBatch.GetPrev();
    //set the base texture:
    _pD3DDevice->SetTexture( 0, _TextureManager->GetTexture( texbatch->_tex ) );
    //loop through stack:
    while( texbatch->_sVB.empty() == false )
    {
        CasperVertexBuffer * vb = texbatch->_sVB.top();
        vb->SetOtherTextures( _pD3DDevice, _TextureManager );
        vb->ApplyRenderState();
        _pD3DDevice->DrawPrimitive( vb->GetPrimitiveType(), vb->GetOffset(),
                                    vb->GetNumTriangles() );
        texbatch->_sVB.pop();
    }
}

Now all I do is change while( stack.empty() == false ) to while( stack.empty() == false && somenumber++ < 90 ) and it chugs. I also looped everything else in that while loop (the render state setting and the texture setting) 90 times without looping the drawprimitive again, and it comes out the same. Only the 90 iterations of the drawprimitive affect it so drastically.

[edited by - Wicked Ewok on February 22, 2004 10:54:02 PM]
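
The two frame rates quoted so far (33 fps with 90 calls on the GF3, 450 fps with one VB) can be turned into a rough per-call cost. The sketch below assumes both figures come from the same scene and that the number of DrawPrimitive calls is the only difference between them; the function names are made up for illustration and nothing here comes from the poster's code.

```cpp
#include <cassert>

// Frame time in milliseconds for a given frame rate.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Average extra time per additional draw call, in milliseconds.
// Assumption: the call count is the only thing that changed between
// the two measurements.
double extraCostPerCallMs(double fpsOneCall, double fpsManyCalls, int manyCalls) {
    assert(manyCalls > 1);
    double extraMs = frameTimeMs(fpsManyCalls) - frameTimeMs(fpsOneCall);
    return extraMs / (manyCalls - 1);
}
```

With the thread's numbers, extraCostPerCallMs(450.0, 33.0, 90) works out to roughly 0.3 ms per call. That is a lot of time for a 3-vertex call, which is one reason to suspect that something other than pure call overhead is involved.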

Does that repeat include filling the buffer...? Are you using software or hardware transformations? Are you looking at the debug output to see if there's anything weird going on?

Just profiled it..it's definitely a display driver overhead. I'm getting around 85% processing in the display driver:

3202 samples of a 'salc' instruction (what does that mean?)
and 1661 samples of some pointer moving:
movsb es:[edi],ds:[esi]

does that mean anything to anyone?

here's a synopsis of what I've discovered about vb rendering:

- don't render the same batch of information using the same VB twice, don't call drawprimitive twice on the same set of data! even if it's only 3 vertices, there's a lot of overhead from the renderer using that data and then setting it up again..damn odd reason.
- draw each vertex in a vb only once

So here's what I did before:
created a static VB of 126 vertices size, set to TriangleList
and repeatedly called DrawPrimitive on it 90 times, results:

~1 FPS

here's what I did after:
created 90 static VBs of 126 vertices in size
called all of them once

~30 FPS

to humor myself, I used the same code in the latter experiment and instead of calling 90 drawprimitives, each associated with its own VB, I called the 90 using just the first VB[0]:
~1 FPS

I would like to know exactly what goes on here. Is it really that the renderer has a hard time re-reading a part of a VB?

Sort of solved the issue already if you read the last post. But to make it a little more clear, I was rendering the same VB 90 times, that was what was wrong with it. The solution was to make 90 VBs and call each of them once.

In a sense, your tests are so far outside the norm that it's difficult to see whether they prove anything at all, other than that when used badly, the driver will behave badly.

I don't have any great answer for why re-rendering a VB would be costlier than rendering several different VBs, but I wouldn't be too surprised if it has something to do with the fact that the driver is making some assumptions that you'll never do what your test does. For instance, you would usually render a much larger buffer and then a different much larger buffer. Or, you might render the same buffer twice, but with different transformations. For instance, some of my 2D stuff that re-renders the same 4 vertices with different matrices is about an order of magnitude faster than your numbers on a GF2MX. That particular app is more fill limited than anything else.

Basically, if I'm reading it correctly, you are creating a test for a situation that will never happen. One could argue that a driver tuned for your test would do poorly on a real task.

nVidia has presentations about efficient buffer usage. The answer might be there and/or ask nVidia directly and post your findings here.

Author, "Real Time Rendering Tricks and Techniques in DirectX", "Focus on Curves and Surfaces", A third book on advanced lighting and materials

[edited by - crazedgenius on February 23, 2004 2:29:43 AM]

If your triangles are "big" on screen, then it's very likely to be fillrate. You don't even need fullscreen triangles to reach the fillrate limits of your card. In addition, if your Z test function was poorly chosen, rendering the same triangle over and over could have resulted in rewriting the Z-buffer values every time, too.

Y.
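
The fill-rate argument above can be checked with back-of-the-envelope arithmetic. This is only a sketch: the resolution and the screen coverage of the triangle are not given anywhere in the thread, so they are parameters here, and the function name is invented for illustration. It also assumes the Z test rejects nothing, so every covered pixel is written on every draw.

```cpp
#include <cassert>

// Pixels written per second when `layers` draws each cover a fraction
// `coverage` (0..1) of a width x height render target at `fps` frames
// per second. Assumes no pixel is rejected by the depth test.
double pixelsPerSecond(int width, int height, double coverage,
                       int layers, double fps) {
    assert(width > 0 && height > 0);
    assert(coverage >= 0.0 && coverage <= 1.0);
    double perFrame = static_cast<double>(width) * height * coverage * layers;
    return perFrame * fps;
}
```

As a hypothetical example, 90 draws of a triangle covering half of a 1024x768 target at 33 fps gives pixelsPerSecond(1024, 768, 0.5, 90, 33.0), about 1.17 billion pixels per second. Comparing a figure like that against the card's rated fill rate shows quickly whether overdraw alone could explain the frame rate.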

How many times does your set texture get called... it's a slow call. You have a settexture in your outer loop, and a call to something that implies it will set even more textures in the inner loop. You're also setting render states 90 times... which is bad.

Using one VB 90 times isn't a problem. It's certainly better than using 90 VBs once each.

Outside the code you gave, are you locking and filling the VB each frame? ie: Are you doing lock, draw * 90, lock, draw * 90, etc?

Do you have any stats on how it performs on an ATI card? It might be something odd due to nVidia's driver.
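
One common way to act on the point about repeated SetTexture and render-state calls is to remember what was last set and skip calls that would change nothing. The sketch below shows the idea with a mock device that only counts calls; MockDevice, TextureStateCache and the integer texture IDs are all hypothetical stand-ins and are not part of Direct3D or of the renderer posted in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real device; it only counts texture binds.
struct MockDevice {
    int setTextureCalls = 0;
    void SetTexture(int /*stage*/, int /*textureId*/) { ++setTextureCalls; }
};

// Remembers the last texture bound on each stage and drops redundant binds.
// Texture IDs are assumed to be non-negative; -1 means "nothing bound yet".
class TextureStateCache {
public:
    explicit TextureStateCache(MockDevice& dev, int stages = 8)
        : dev_(dev), bound_(stages, -1) {}

    void SetTexture(int stage, int textureId) {
        assert(stage >= 0 && stage < static_cast<int>(bound_.size()));
        if (bound_[stage] == textureId) return;  // already bound, skip the call
        bound_[stage] = textureId;
        dev_.SetTexture(stage, textureId);
    }

private:
    MockDevice& dev_;
    std::vector<int> bound_;
};

// Binds the same texture `draws` times through the cache and returns how
// many binds actually reached the device.
int bindsForRepeatedDraws(int draws) {
    MockDevice dev;
    TextureStateCache cache(dev);
    for (int i = 0; i < draws; ++i) cache.SetTexture(0, 42);
    return dev.setTextureCalls;
}
```

In this mock, 90 draws with the same texture reach the device only once. The same pattern can be applied to render states; whether it helps in a real renderer depends on how much of the frame time the redundant calls account for.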

I don't think this has been mentioned, but I think your interpretation of your profiling results is wrong. Just because it spends a lot of time in the display driver doesn't mean anything; a game spends a lot of time rendering, and the display driver could be taking so much time because it's blocking when you call Present(), waiting for the card to finish rendering.

One thing I am doing wrong, but just for the sake of testing, is leaving the Z comparison function at its default, LESSEQUAL. I use the same set of vertices over and over, even if they are in a list of 90 VBs (actually, I have a VB size limit, and depending on how many vertices there are, it goes into about 20 actual VBs). Profiling with the latest nvidia beta developer drivers (just installed them now), I get most of my time consumption in the HAL driver, and just a little less in the nv4 display driver.

I am going to release my code for my rendering system. Been working on it for a few weeks now, but I just can't get the triangle rates that my GF3 should be able to pump up, especially with just one texture. If anyone wants to help, it's here:

http://www.mojo9.com/~wickedewok/DXRenderer_source.zip

Use what you want from it too, it's supposed to batch certain VB elements during rendering( though not entirely finished ). I think it's well commented. The main driver function is the only really messy section I think.

I have a UML diagram of it somewhere but I can't find it right now. If anyone would be willing to play around with this for me, I would be most thankful. Thanks.

-Marv


[edited by - Wicked Ewok on February 23, 2004 3:33:20 PM]

I'm also curious how much of the screen the primitive was taking up. Do you have a percentage? Also, are you culling the backside or is it drawing both sides? Also, what are you setting your width and height to when you create the device? A high resolution could certainly be a problem.

Chris

Hmm thanks supernat, I think you got what part of the problem was. The backbuffer's width and height were bigger than what I set the device to. I rescaled it to 600x800 and set the Z compare func to LESS instead of LESSEQUAL, and it works a lot better now. I've gotten a throughput of around 900,000 triangles/second (triLIST) with every triangle actually on the screen and the triangles covering at least 3/4 of the whole screen in total. This rate of course drops when the zfunc is left at the default LESSEQUAL, since I'm drawing fixed-depth triangles. Tristrips give me a lot more throughput. This is all with about 2 static VBs with 32k vertices each, and 80 drawprimitive calls to many '801 vertex' sections of each VB. Seems to have fixed it..cool, thanks.

-Marv
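
The final numbers can be converted back into a frame rate as a sanity check. This sketch assumes each of the 80 calls draws exactly one 801-vertex section as a triangle list, which is one reading of the post above; the function names are invented for illustration.

```cpp
#include <cassert>

// Triangles in a triangle list of `vertexCount` vertices (3 per triangle).
int trianglesInList(int vertexCount) {
    assert(vertexCount >= 0);
    return vertexCount / 3;
}

// Frame rate implied by a triangle throughput and a per-frame workload of
// `drawCalls` calls, each drawing `verticesPerCall` vertices as a list.
double impliedFps(double trianglesPerSecond, int drawCalls, int verticesPerCall) {
    int trianglesPerFrame = drawCalls * trianglesInList(verticesPerCall);
    assert(trianglesPerFrame > 0);
    return trianglesPerSecond / trianglesPerFrame;
}
```

Under that assumption, 801 vertices make 267 triangles per call and 21,360 triangles per frame, so impliedFps(900000.0, 80, 801) comes to about 42 fps.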
