Very strange bottleneck...

I'm using DX8.1 for my terrain engine, which is based on the ROAM algorithm. The engine works like magic, except for one bottleneck: it renders 5,000 tris at 5 fps on my GeForce3. I found out that the bottleneck lies in the Lock()/Unlock() calls I make on my vertex buffer every time I reach an end leaf. I can't find a way to render my landscape without rendering one triangle at a time. Any ideas?

Hi there,

I don't know much about 3D, but isn't there a way you could programmatically precalculate the terrain into some sort of array or list and then, once all the calculation is done, render the terrain knowing exactly which tris need to be rendered? That way you can lock them all at once and unlock them all afterwards.

I believe it falls into the same kind of idea as locking and unlocking DirectDraw surfaces, in that you want to do as many things as possible per lock, so some precalculation may be your answer.

Hope that helps. Like I said, I can't give any specifics because I really don't know what you're doing.



Raymond Jacobs,
Developer,
www.EtherealDarkness.com

Maybe you're locking your vertices the wrong way. Could you describe the properties of the vertex buffer you created? Also, what flags are you passing to the Lock function?
Are you using index buffers? If not, are you drawing as a TRIANGLELIST or as strips?
Is the vertex information in the buffer arranged in any particular order?
What's the size of your FVF?
How many primitives are you batching per draw call? Are you grouping the primitives by render state (i.e. drawing all the vertices that use texture X, then all the vertices that use the next texture, and so on)?

Locking and unlocking won't cause CPU bottlenecks as long as the GPU side is programmed correctly. In fact, with today's computers and compilers, copying memory can be lightning fast if it isn't abused.

I would recommend downloading some of the performance papers from nVidia (developer.nvidia.com). They cover very interesting topics on the correct use of the GPU.

Yep. Lock/Unlock calls should be kept to a minimum. Build up your array, then copy it all in one lock/unlock. Check the DX docs; some lock modes perform better than others. The default mode isn't too bad, but it is still something you should _not_ do for every poly.

In the DirectX 8 docs there's a good section on Lock/Unlock combos and their relative performance.
Section - "Using Dynamic Vertex and Index Buffers"

I assume you're not creating a new vertex buffer on every call (that's always a good way to stuff things up :-).

Tips for making polys go quick:
- Reduce texture swapping: try to render all polys that share a texture together. Swapping textures can be a killer.
- Strip your polys, which reduces the number of vertices you need to send to the card.
- Index buffers are good, and can help with the two performance issues above.
- Precalculate or prebuild as much vertex and state information as you can, then blast it across when it's needed. PCs have huge storage capacity, so doing this is well worth it.

Hope this helps.

The thing is that I can't know the number of triangles that will be rendered until render time.

I thought about pushing the triangles into a linked list, then creating one vertex buffer and spilling all the triangles into it. The thing is, I'm afraid to use linked lists since they are SLOW with large amounts of data.

Don't even think about using a linked list for this; that's way too slow. Figure out the maximum number of polys that will ever be used, and allocate enough memory for them when the program initializes. Then just build your index list in this memory, and copy the list to the index buffer before rendering.

T

--
MFC is sorta like the Swedish police... It's full of crap, and nothing can communicate with anything else.
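
The "allocate once at init, fill during traversal, copy once" idea from the post above can be sketched in plain C++ as follows. This is only an illustration: `Vertex` and `TriangleBatch` are hypothetical names, not part of Direct3D or of the original poster's engine, and the actual copy into a locked vertex or index buffer is left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, minimal vertex; a real terrain vertex would carry more fields.
struct Vertex { float x, y, z; };

// Fixed-capacity triangle store: memory is reserved once at construction,
// so adding triangles during the terrain traversal never allocates.
class TriangleBatch {
public:
    explicit TriangleBatch(std::size_t maxTriangles)
        : maxVertices_(maxTriangles * 3) {
        vertices_.reserve(maxVertices_);
    }

    // Returns false when the batch is full; the caller should then
    // copy the batch to the GPU buffer, draw it, and call clear().
    bool addTriangle(const Vertex& a, const Vertex& b, const Vertex& c) {
        if (vertices_.size() + 3 > maxVertices_) return false;
        vertices_.push_back(a);
        vertices_.push_back(b);
        vertices_.push_back(c);
        return true;
    }

    std::size_t triangleCount() const { return vertices_.size() / 3; }
    const Vertex* data() const { return vertices_.data(); }
    std::size_t sizeInBytes() const { return vertices_.size() * sizeof(Vertex); }

    // Empties the batch but keeps the reserved memory for the next frame.
    void clear() { vertices_.clear(); }

private:
    std::size_t maxVertices_;
    std::vector<Vertex> vertices_;
};
```

With something like this, `data()` and `sizeInBytes()` describe one contiguous block that can be copied in a single lock/unlock pair per batch, instead of one pair per triangle.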

Getting the number of primitives to render is easy: just divide the number of indices by 3. I don't see a problem there. Again, I think you're misusing the GPU. Why don't you answer the questions I asked?

Well MatuX, you've asked a lot of questions... I'll try to answer them all.

For a start, I create my vertex buffer like this:

g_pD3DDevice->CreateVertexBuffer(dwNum * sizeof(MYVERTEX), 0, D3DFVF_MYVERTEX, D3DPOOL_DEFAULT, &g_pBuffer);

Nothing special.

D3DFVF_MYVERTEX is defined as:
#define D3DFVF_MYVERTEX (D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_DIFFUSE | D3DFVF_TEX1)

I am not using an index buffer, since I am rendering my terrain triangle by triangle. Instead, I use D3DPT_TRIANGLELIST, although that's not necessary either since I am only batching one triangle per draw.

As I've said, the vertices are not arranged in the buffer, since the buffer only holds one poly (3 vertices) at a time.

The size of my FVF you can figure out from the definition above (36 bytes, I think).

I know that batching one triangle per draw is VERY UNOPTIMIZED, but I can't do it any other way, since I don't want to create a pool that will hold the maximum possible number of triangles.

Currently I am not implementing textures or materials, so grouping is not needed for now.

Hope that gives you a bit more information about my vertex buffer.

Thanks for your help.

So you're drawing one poly at a time (which is slow), and locking the vertex buffer for each one (also very slow)?

You need to batch things up so you're rendering as many polys as you can in one go, using one vertex buffer.

This isn't too hard: you can build up a list of triangles while traversing your data, then whack it all into one vertex buffer, then draw the lot with one call to DrawPrimitive or whatever.


Helpful links:
How To Ask Questions The Smart Way | Google can help with your question | Search MSDN for help with standard C or Windows functions

Batching only a single triangle per call? ARE YOU MAD?!?!

You MUST, I repeat MUST, draw at least 100 or so triangles per call to get any speed at all. For your terrain there should be a single Lock()/Unlock() pair per frame, since you are only rendering a maximum of 15,000 vertices. You may wish to break the vertices up into batches and use a vertex buffer that is only 3,000 vertices in size (see the particle SDK sample for how to render large amounts of dynamic vertices in a somewhat efficient manner).

You should be able to batch more than one triangle per call. If not, then I HIGHLY suggest you learn some memory management and more about dynamic memory (especially things like circular buffers). You should never have to allocate any memory while rendering the terrain (unless something special in the ROAM algorithm requires it, but I don't recall anything that does).

basically you want to:
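(pseudocode)
doingROAM = TRUE;
while (doingROAM)
{
    // 999 rather than 1000: a multiple of 3, so no triangle is split across batches
    actualVertexCount = FillCircularBufferUsingROAM(999);
    LockVertexBuffer(actualVertexCount);
    CopyCircularBufferToVertexBuffer(actualVertexCount);
    UnlockVertexBuffer();
    DrawPrimitive(actualVertexCount / 3);   // primitive count, not vertex count
    // a short batch means the ROAM traversal has finished
    if (actualVertexCount < 999)
        doingROAM = FALSE;
}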

Drawing only a single triangle per call is so unoptimized that no matter how well you code the rest of the game, it would still run that slowly. z9u2K, drawing a single triangle at a time is not the only way you know how to do it. You're just being lazy and not generalizing concepts you should already know about programming and problem solving. You should look at things like the particle samples in the SDK, since they solve the EXACT same rendering problem you are having now, namely: how do you render a group of dynamic vertices when you don't know how many will be created at runtime, and can't allocate for the maximum because they won't all be on screen at once or the maximum would be too high? Answer: see the SDK (which basically does what I've shown above, though you may need to see actual code, which means you should practice a bit more and learn more about the basic/intermediate aspects of coding before going on to large projects like creating a terrain engine in DX8.1).
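
The batching loop in the pseudocode above can be simulated in plain C++ to see how the batch size drives the number of lock/draw pairs. This is a sketch under stated assumptions, not the SDK sample: `Vertex`, `FrameStats` and `drawInBatches` are hypothetical names, the "vertex buffer" is an ordinary array, and the lock, unlock and draw steps are only counted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical minimal vertex for illustration.
struct Vertex { float x, y, z; };

struct FrameStats {
    std::size_t lockCount = 0;  // how many Lock/Unlock pairs a real renderer would issue
    std::size_t drawCalls = 0;  // how many DrawPrimitive-style calls
    std::size_t triangles = 0;  // total triangles submitted
};

// 'produced' stands in for the vertices the terrain traversal emits this frame
// (three per triangle). They are pushed through a fixed-size buffer in batches.
FrameStats drawInBatches(const std::vector<Vertex>& produced, std::size_t batchVertices) {
    FrameStats stats;
    if (batchVertices < 3) return stats;
    batchVertices -= batchVertices % 3;               // never split a triangle across batches
    std::vector<Vertex> vertexBuffer(batchVertices);  // stand-in for the GPU vertex buffer

    std::size_t done = 0;
    while (done < produced.size()) {
        const std::size_t count = std::min(batchVertices, produced.size() - done);
        // In a real renderer: Lock, copy, Unlock.
        std::memcpy(vertexBuffer.data(), produced.data() + done, count * sizeof(Vertex));
        ++stats.lockCount;
        // In a real renderer: one draw call covering count / 3 triangles.
        ++stats.drawCalls;
        stats.triangles += count / 3;
        done += count;
    }
    return stats;
}
```

For example, 7,500 vertices pushed through a 3,000-vertex buffer take 3 lock/draw pairs, while a 3-vertex buffer (one triangle per call, as in the original code) takes 2,500.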

Listen to these wise guys; I think you've found the source of your bottleneck.
When you batch 1 triangle at a time you get a terribly high overhead from the DrawPrimitive() calls, which is bad, very bad... TERRIBLY bad.

As these guys said, and I repeat, batch more than 200 polys (not 100; 200 is The Number, see the nVidia papers!).
Also, you should lock all the primitives that use texture X and draw them, then lock all the prims that use texture X2 and draw them, etc.

Also, you're creating your vertex buffer the wrong way. Use these flags: D3DUSAGE_WRITEONLY | D3DUSAGE_DYNAMIC;
With WRITEONLY you promise that you won't read the buffer (that would be reading video memory, which is the slowest thing in the world), and with DYNAMIC you're specifying that you'll be rewriting the buffer each frame (which isn't bad; on the contrary, it's perfectly fine and fast!).

The FVF you're using is slow, too. If you use D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_DIFFUSE | D3DFVF_TEX1, it means you're sending
struct { float x, y, z, nx, ny, nz; ulong color; float u, v; }
And, exactly as you said, it adds up to 36 bytes. On 3D video cards, anything above 32 bytes is B A D (again, read the nVidia papers!). My suggestion: get that DIFFUSE outta there.
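
The 36-byte and 32-byte figures can be checked at compile time. The sketch below assumes a typical platform with 4-byte `float`, a 4-byte colour and no struct padding; the struct names and `bufferBytes` are hypothetical and only mirror the field layout quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// Layout matching D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_DIFFUSE | D3DFVF_TEX1.
struct VertexWithDiffuse {
    float x, y, z;        // position: 12 bytes
    float nx, ny, nz;     // normal:   12 bytes
    std::uint32_t color;  // diffuse:   4 bytes
    float u, v;           // texcoord:  8 bytes
};

// The same layout with the diffuse colour removed.
struct VertexNoDiffuse {
    float x, y, z;
    float nx, ny, nz;
    float u, v;
};

static_assert(sizeof(VertexWithDiffuse) == 36, "expected 36 bytes per vertex");
static_assert(sizeof(VertexNoDiffuse) == 32, "expected 32 bytes per vertex");

// Size in bytes of a buffer holding vertexCount vertices of the given stride.
constexpr std::size_t bufferBytes(std::size_t vertexCount, std::size_t stride) {
    return vertexCount * stride;
}
```

For a 30,000-vertex buffer that is 1,080,000 bytes with the diffuse colour and 960,000 bytes without it.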

Getting back to batching: don't be afraid of creating a 50,000-vertex buffer. It's fine, and the GPU will handle that much better than a 2-vertex buffer. And USE INDEX BUFFERS! Make a list of vertices for the vertex buffer and link them as a TRIANGLELIST in the index buffer; that will help no matter what (people taught me that here... and it's true).

With these tips you should get a 9000000000% boost in your game.

Edit:
I forgot: make sure you use the D3DLOCK_DISCARD flag. It tells the device to discard the buffer's previous contents, so it can hand you fresh memory instead of waiting for the GPU to finish with the old data. You must use this flag if you're using dynamic buffers (VBs or IBs).
Also, if you feel your buffer Lock is consuming time, you can use the D3DLOCK_NOSYSLOCK flag. It stops the system-wide lock from being held for the duration of the Lock, so Windows can keep doing the things it otherwise couldn't (read the DX8 SDK).

And go to developer.nvidia.com and download the DX8 performance papers!

[edited by - MatuX on June 24, 2002 8:24:40 AM]

Thanks guys!
I'll think it over and rebuild my VertexBuffer class.

I knew that batching one poly per draw call is BAD...
It was just the easiest way to do it.

I guess being lazy didn't work...

Uhm. Being lazy will NEVER get you a framerate over 20fps.
Unless you're lazy enough not to code in the first place =)


--
MFC is sorta like the Swedish police... It's full of crap, and nothing can communicate with anything else.

Hey... It's me again...

I still have this render bottleneck in my code.

I have created a vertex buffer with dwDesiredTris * 3 vertices (dwDesiredTris is the number of tris I would like in my scene; I set it to 10,000, which means 30,000 vertices in my buffer) and with the WRITEONLY and DYNAMIC flags. I also lock it with the DISCARD flag.

I buffer the vertices until the vertex buffer is full, then I render it and start refilling it.

By the way, the vertex buffer never fills up, because the landscape never reaches 10,000 tris. It's about 3,000 to 8,000 tris per render.

After all of this, I get ~5,000 tris at ~50 FPS. Better than last time, but still slow.

In case you're wondering, I am rendering the buffer as a TRIANGLELIST, since every three vertices in the buffer form a single triangle. I am not using an index buffer, since writing an index buffer with the values 0, 1, 2, 3, 4, 5, 6, ..., dwDesiredTris * 3 - 1 is just a waste of memory and won't speed up my code... (?)

I don't rebuild and release my VertexBuffer object every frame.

I create it during landscape initialization and destroy it along with the landscape.

Per frame I do:
lock, fill vertices, unlock, render, relock for more vertices...

Because the terrain never renders more than 10,000 tris, the vertex buffer ends up being locked and unlocked once per frame.

It's best to have an array of vertex buffers and cycle through them. This reduces the number of stalls on the GPU, since otherwise you may be trying to lock a buffer that the GPU is still using, which causes a stall.

ie
#define MAXVB 6
LPDIRECT3DVERTEXBUFFER8 VertexBuffers[MAXVB];
int VBCount = 0;

void BuildNDraw()
{
    FillVB( VertexBuffers[VBCount] );
    DrawVB( VertexBuffers[VBCount] );
    VBCount++;
    if( VBCount >= MAXVB ) VBCount = 0;
}


Note: if you define MAXVB as a power of 2, you can do this instead:

#define MAXVB (1<<n)

VBCount = (VBCount+1) & (MAXVB-1);
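
The two wrap-around schemes above can be compared in a few lines of standalone C++. This is only an illustration of the index arithmetic; `kBufferCount`, `nextBuffer` and `nextBufferGeneric` are hypothetical names, and no actual vertex buffers are involved.

```cpp
#include <cstddef>

// Number of buffers in the ring; the mask trick only works for powers of two.
constexpr std::size_t kBufferCount = 4;
static_assert((kBufferCount & (kBufferCount - 1)) == 0,
              "kBufferCount must be a power of two for the mask trick");

// Advance to the next buffer, wrapping to 0 after the last one, using a bit mask.
constexpr std::size_t nextBuffer(std::size_t current) {
    return (current + 1) & (kBufferCount - 1);
}

// General version that works for any non-zero count (e.g. 6), at the cost of a modulo.
constexpr std::size_t nextBufferGeneric(std::size_t current, std::size_t count) {
    return (current + 1) % count;
}
```

With four buffers the sequence is 0, 1, 2, 3, 0, ...; with a count such as 6 the mask version would skip buffers, which is why the generic form (or the explicit compare-and-reset in the post) is needed there.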




[edited by - mark duffill on June 26, 2002 8:32:02 AM]

quote:
Original post by z9u2K
In case you're wondering, I am rendering the buffer as a TRIANGLELIST, since every three vertices in the buffer form a single triangle. I am not using an index buffer, since writing an index buffer with the values 0, 1, 2, 3, 4, 5, 6, ..., dwDesiredTris * 3 - 1 is just a waste of memory and won't speed up my code... (?)


Learn and use index buffers! A non-indexed TRIANGLELIST is *very* slow.
This is the best combo (suggested by the DX dev guys and the nVidia papers): a list of vertices in the vertex buffer, which you then connect in the index buffer as a TRIANGLELIST (it's fast with an IB).

I don't think the bottleneck is in Lock and Unlock anymore. You won't believe how FAST a computer copies thousands of DWORDs to video memory; most of the time you won't lose a single fps because of it, even on a Pentium 2. Having multiple VBs (like Mark said) isn't a bad idea as long as you don't have fewer than 5,000 or 10,000 vertices per VB (*cough* did I already mention the nVidia papers?).

Are you doing frustum culling? ROAM? PVS?

You should be really happy: it's 50 times faster now and you've learned a lot.

[edited by - MatuX on June 26, 2002 10:14:52 AM]

I haven't implemented frustum culling yet.
I am planning to do a frustum-to-AABB intersection test for each patch to decide whether I should render it or not. Currently I am rendering all the patches.
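
A per-patch frustum-to-AABB test of the kind described above might look like the following. This is a conservative sketch, not the poster's code: `Plane`, `Aabb`, `boxIsBehindPlane` and `patchMayBeVisible` are hypothetical names, the planes are assumed to point inwards (points with a*x + b*y + c*z + d >= 0 are inside), and extracting the six planes from the view and projection matrices is not shown.

```cpp
#include <vector>

// Plane a*x + b*y + c*z + d = 0; the inside half-space is where the value is >= 0.
struct Plane { float a, b, c, d; };

// Axis-aligned bounding box of one terrain patch.
struct Aabb { float minX, minY, minZ, maxX, maxY, maxZ; };

// True if the whole box lies on the outside of the plane.
// Only the corner farthest along the plane normal needs testing:
// if even that corner is outside, every other corner is too.
bool boxIsBehindPlane(const Plane& p, const Aabb& box) {
    const float x = (p.a >= 0.0f) ? box.maxX : box.minX;
    const float y = (p.b >= 0.0f) ? box.maxY : box.minY;
    const float z = (p.c >= 0.0f) ? box.maxZ : box.minZ;
    return p.a * x + p.b * y + p.c * z + p.d < 0.0f;
}

// Conservative visibility: returns false only when the box is fully outside
// at least one plane. Some boxes near frustum corners may pass even though
// they are not actually visible, which is harmless for culling.
bool patchMayBeVisible(const std::vector<Plane>& frustum, const Aabb& box) {
    for (const Plane& p : frustum) {
        if (boxIsBehindPlane(p, box)) return false;
    }
    return true;
}
```

Patches for which `patchMayBeVisible` returns false can be skipped before any of their triangles are written to the vertex buffer.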

MatuX, are you suggesting that an IB with the values 0, 1, 2, 3, 4, 5, 6, etc. will be faster than just sending the vertices without an IB?

By the way, I've implemented the buffer switching thing with 4 buffers.
FPS is the same, and every couple of seconds I get a half-second freeze of the app (which was there even before I implemented the buffer switching)... :|

If the lock/unlock is not my bottleneck any more, what could it be?

[edited by - z9u2K on June 26, 2002 10:30:28 AM]

Yep, use index buffers for everything with more than 1 triangle!

You don't actually have a bottleneck now; I really think it's just the way you're using the GPU. Try index buffers, you won't regret it.
And be sure to order the vertex buffer so as to maximize cache use.

[edited by - MatuX on June 26, 2002 10:55:46 AM]

Guest Anonymous Poster
If you want to use index buffers, you must arrange the vertex buffer(s) so that a vertex shared between several triangles occurs in the vertex buffer only once, but has several entries in the index buffer (one per triangle it's a part of). This way you can make use of the GPU's post-transform vertex cache, and you should see a speedup. Your index buffer then won't just contain [0,1,2,3,4,5,...]. You are right that an index buffer holding only a list of consecutive vertices will most likely be slower than just sending the vertices, since you have no vertex sharing. It shouldn't be too difficult to figure out the vertex sharing in a ROAM algorithm though (within a patch at least).
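
One simple way to get the vertex sharing described above is to deduplicate an existing triangle list by position. The sketch below is an assumption-laden illustration, not how a ROAM implementation would necessarily do it (ROAM can derive sharing from its own tree structure): `Position`, `IndexedMesh` and `buildIndexedMesh` are hypothetical names, positions are compared for exact equality, and 16-bit indices limit a mesh to 65,536 unique vertices.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct Position { float x, y, z; };

struct IndexedMesh {
    std::vector<Position> vertices;      // each unique position stored once
    std::vector<std::uint16_t> indices;  // three entries per triangle
};

// Converts a flat triangle list (three positions per triangle) into an indexed
// mesh. A position seen before is reused through its existing index instead of
// being stored again.
IndexedMesh buildIndexedMesh(const std::vector<Position>& triangleList) {
    IndexedMesh mesh;
    std::map<std::tuple<float, float, float>, std::uint16_t> seen;
    for (const Position& p : triangleList) {
        const auto key = std::make_tuple(p.x, p.y, p.z);
        auto it = seen.find(key);
        if (it == seen.end()) {
            const auto index = static_cast<std::uint16_t>(mesh.vertices.size());
            it = seen.emplace(key, index).first;
            mesh.vertices.push_back(p);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

For a quad made of two triangles, the six input vertices collapse to four unique ones, and the index list becomes 0, 1, 2, 1, 3, 2 rather than 0, 1, 2, 3, 4, 5.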

arrrrrrg!!!

It's me again...

The DrawPrimitive call is not the reason for my bottleneck.

When I lock the buffer I get a pointer (pVertex) to a MYVERTEX array, right?

What causes the FPS drop is the 4 lines I execute for each vertex, which are:

pVertex[dwCurrentVertex].p = D3DXVECTOR3(x, y, z);
pVertex[dwCurrentVertex].n = D3DXVECTOR3(0, 1, 0);
pVertex[dwCurrentVertex].u = x / dwMapSize * 8;
pVertex[dwCurrentVertex].v = z / dwMapSize * 8;

These four lines cause my bottleneck!!!
The writing to the vertex buffer!!

Why is this happening..?

pVertex[dwCurrentVertex].p = D3DXVECTOR3(x, y, z);

This causes a temporary vector to be created and then copied to your vertex. I'd copy the x, y and z values manually.

pVertex[dwCurrentVertex].u = x / dwMapSize * 8;

You can easily precalculate 8 / dwMapSize once and then multiply x and z by it, instead of doing the division for every vertex in the loop.
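
The two suggestions above (write the components directly, and hoist the 8 / dwMapSize division out of the loop) can be sketched like this. `TerrainVertex`, `makeUvScale` and `writeTerrainVertex` are hypothetical names; the layout assumes the diffuse colour has been dropped from the vertex, and whether this removes the slowdown in the original program is not something the thread establishes.

```cpp
// Hypothetical terrain vertex: position, normal, one texture coordinate set.
struct TerrainVertex {
    float px, py, pz;
    float nx, ny, nz;
    float u, v;
};

// Computed once per terrain (or per frame), not once per vertex.
inline float makeUvScale(unsigned mapSize) {
    return 8.0f / static_cast<float>(mapSize);
}

// Writes every field exactly once, in memory order, without constructing
// temporary vector objects and without a per-vertex division.
inline void writeTerrainVertex(TerrainVertex& out, float x, float y, float z, float uvScale) {
    out.px = x;
    out.py = y;
    out.pz = z;
    out.nx = 0.0f;
    out.ny = 1.0f;
    out.nz = 0.0f;
    out.u = x * uvScale;
    out.v = z * uvScale;
}
```

In the fill loop this would be called as `writeTerrainVertex(pVertex[dwCurrentVertex], x, y, z, uvScale)` with `uvScale` computed before the loop, assuming `pVertex` points at vertices with this layout.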


Helpful links:
How To Ask Questions The Smart Way | Google can help with your question | Search MSDN for help with standard C or Windows functions

I lose about 5 FPS when the only thing I do per vertex is:
pVertex[dwCurrentVertex].u = 1.0f;

A 5 FPS loss without even rendering anything or putting anything else into the buffer. I use multiple vertex buffers: dynamic, write-only, discarded on every lock.

I lose 5 FPS when I write 4 bytes per vertex into the buffer... this is so strange!!

EDIT: Can it be a driver problem and not an application problem?

[edited by - z9u2K on June 28, 2002 3:23:10 PM]

Are you locking the vertex buffer, retrieving a pointer to a vertex structure and manually copying all your values? If so, take a byte pointer instead and use memcpy to copy your vertices. This is faster, and the shorter the time you hold the lock, the faster it'll go.

PS. This is how it's done in the SDK; check the Vertices tutorial.
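
The memcpy approach can be sketched without Direct3D by treating the pointer returned from Lock as a plain block of memory. `Vertex` and `copyToLockedBuffer` are hypothetical names; the vertices are assumed to have been built beforehand in an ordinary system-memory array, so the time spent between lock and unlock is a single bulk copy.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical 32-byte vertex: position, normal, one texture coordinate set.
struct Vertex {
    float x, y, z;
    float nx, ny, nz;
    float u, v;
};

// Copies all staged vertices into the destination memory in one go.
// 'lockedMemory' stands in for the pointer a vertex buffer lock would return,
// and 'capacityBytes' for the size of the locked range.
// Returns the number of bytes copied, or 0 if there is nothing to copy
// or the data does not fit.
std::size_t copyToLockedBuffer(void* lockedMemory, std::size_t capacityBytes,
                               const std::vector<Vertex>& staging) {
    const std::size_t bytes = staging.size() * sizeof(Vertex);
    if (bytes == 0 || bytes > capacityBytes) return 0;
    std::memcpy(lockedMemory, staging.data(), bytes);
    return bytes;
}
```

The per-vertex work (positions, normals, texture coordinates) then happens on the staging array in normal memory, and only the one `memcpy` touches the locked buffer.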

// Shadows
