Urgent: Why does hardware vertex processing slow me down?

3 comments, last by malachii 22 years, 8 months ago
Using D3D8, on a P2 750, GeForce2 MX. I'm putting the finishing touches on my demo (which I need to have done before the end of the weekend), and just noticed something. When I turn on hardware vertex processing, things slow down (about 10%). They used to speed way up! Something must have changed, but I don't know what it is. I have turned off lighting since I'm using light maps. Perhaps that is it? I'm still using all of D3D8's transformations though, so I would expect a speed increase anyhow. My vertex buffer size is set to 256 (that used to give me optimal performance, and hasn't changed). I'm sending indexed triangle lists. They are quite dynamic, so I'm not using managed memory (as the documentation suggests). If you can't tell because I haven't given you much background on my app (it's quite a complex landscape rendering application), then please share some thoughts as to what may be causing the problem. Here is my init code for the vertex buffers:

if FAILED(m_pd3dDevice->CreateVertexBuffer( m_maxVerts*sizeof(TRILISTVERTEX), D3DUSAGE_WRITEONLY|D3DUSAGE_DYNAMIC , D3DFVF_TRILISTVERTEX, D3DPOOL_DEFAULT, &m_vb)) return 0;

if FAILED(m_pd3dDevice->CreateIndexBuffer( m_maxVerts*3*sizeof(TRILISTVERTEX), D3DUSAGE_WRITEONLY|D3DUSAGE_DYNAMIC , D3DFMT_INDEX16, D3DPOOL_DEFAULT, &m_vbi)) return 0;
 
Begging your assistance, Malachii.
Several things. First off, try to make sure your vertex format is aligned on a dword boundary. There are penalties for unaligned access. (So 16 over 12.) Make sure that you're not reading from this buffer at any point in time, even if this means you're keeping a copy of it for yourself. (Yeah, while it says write-only, you can still read from it.) The odds are the change has to do with the VB now residing in AGP memory.
1. You're specifying DYNAMIC buffers, so you should be updating the buffers regularly. One key to dynamic buffer performance is correct use of the locking flags when you lock the buffer. Could you post the code (or the relevant parts) you use to lock and fill the buffer? Locking with just DISCARD or just NOOVERWRITE is bad. If you don't use the flags properly you lose parallelism (i.e. the GPU and CPU serialise when they could run in parallel).


2. The sweet spot size for a dynamic buffer, assuming your vertex format is around 32 bytes per vertex, tends to be between 2000 and 3000 vertices across a range of T&L cards.


3. The number of polygons passed in a single DrawIndexedPrimitive call affects performance. There is an overhead in both API and driver terms for issuing the call. With software vertex processing you need to pass at least 20 polygons per call, and with hardware vertex processing this jumps to 200; if you submit fewer per call it costs performance.


4. Lightmaps shouldn't take away from performance since multitexturing happens in a single pass. There are however some things to check with your texturing in general:

a. Use mip-maps. 3D hardware, and GeForce based cards in particular, suffer badly from texture minification - mip-maps are the solution. The artists might not like them but your frame rate will!!

b. Check you're not overcommitting textures - we let the artists loose on a prototype engine and they made a bumpy lit demo - for some reason it ran at 10fps on 32Mb GeForces, slower than all previous engines... it was only when we had a peek into what the engine was doing that we discovered what was killing the performance - they'd done scenes which required 64Mb of textures in a single frame - moral of the story: make sure everything between BeginScene and EndScene fits into video memory!

c. Are you locking or doing any SetRenderTarget things with any textures in the frame? Rendering with a recently updated texture stalls the card - if you've updated a texture, use it as far away from the update as possible - multi-buffer if necessary.

d. To check if texturing is a big bottleneck, see what the framerate is with no textures - if you've wrapped D3D in some way it should be easy - just comment out all the calls to SetTexture.

5. If you're doing random access on the vertex buffer, make your vertex format a multiple of 32 bytes (think CPU caches; the GPU and AGP have the same), otherwise for sequential access use as lightweight a format as possible!


6. Make sure ALL the memory and processing types are set to use hardware and video memory - one SYSTEMMEMORY or SOFTWARE_VERTEX_PROCESSING could screw the whole hardware vertex pipe up.


7. If you're using any vertex shaders, there are things to be aware of, particularly on mixed devices.


8. If you're using a MIXED device, DO NOT switch between software and hardware unless you really need to - that should be the MAIN thing to sort on in a mixed engine - doing the switch essentially resets all of D3D for a different device - VERY slow.


9. Make sure that the lighting renderstate is definitely off, as well as states related to it such as NORMALIZENORMALS. Be careful with the clipping renderstate - if you're using it intelligently it's a win, if not it'll bite you (therefore if you haven't thought through things like guardbands don't bother toggling it!)

1x.
- Vertex cache friendly indices make a difference.
- Clear the whole of the render target, not just a subrect.
- For multiple viewports you don't need multiple BeginScene/EndScene.
- Only use one BeginScene/EndScene per target.
- Clear the frame and Z buffer explicitly; drawing big triangles at the back of the view isn't the best way anymore.
- If you have a stencil buffer, clear it at the same time as the frame and Z buffer - don't clear them separately - if a buffer which is active is left out of the Clear you incur an additional Read, Modify, Write since the card has to preserve part of the destination buffers.
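
Point 4b above is easy to sanity-check with a little arithmetic before blaming the vertex pipe. The sketch below adds up the bytes a texture needs once it has a full mip chain, assuming uncompressed texels and levels that halve down to 1x1. `TextureBytesWithMips` is a hypothetical helper, not a D3DX function, and real drivers may pad or align surfaces, so treat the result as a rough lower bound.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of D3D/D3DX): bytes used by one uncompressed
// texture plus its full mip chain, halving each dimension down to 1x1.
std::size_t TextureBytesWithMips(unsigned width, unsigned height,
                                 unsigned bytesPerTexel)
{
    assert(width > 0 && height > 0 && bytesPerTexel > 0);
    std::size_t total = 0;
    for (;;)
    {
        // add the current mip level
        total += static_cast<std::size_t>(width) * height * bytesPerTexel;
        if (width == 1 && height == 1) break;   // smallest level reached
        if (width > 1)  width /= 2;             // next level is half size,
        if (height > 1) height /= 2;            // but never below 1 texel
    }
    return total;
}
```

For example, a 256x256 texture at 4 bytes per texel comes to 349,524 bytes with its mip chain, about a third more than the 262,144 bytes of the top level alone. Summing this over every texture used between BeginScene and EndScene gives a figure to compare against the card's memory.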

Plenty more things can bite you if you're not careful. More info about what your app is doing might bring more relevant descriptions of pitfalls. From what you've posted, pretty much everyone creates buffers like that. As Johnny 5 said many years ago: "Need Input".

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

Thanks for replying. S1CA - I'm obviously still new to this compared to yourself, so a few items were a bit confusing for me. Nevertheless, I'll provide you with what information I can. Also, most of the tips seemed to be general in nature. Which of them are specific to hardware processing, versus software processing? This is the part that is stumping me (the performance difference is backwards!).

1. I'm updating the buffers a LOT. Here's what I'm doing with them. First I create them (good thing too!) (currently 256 bytes), then I stuff vertices into the triangle list in "fan" order. (Originally I was using triangle fans, but wanted larger batches, so moved to indexed triangle lists, but I'm keeping the fan-style since my landscape algo uses it.) I lock at the start of the fill process, and prior to sending them all. I always send them in a single drawprimitive call. I tried using larger buffers, but didn't get a big difference (256 was faster in software mode though, so I kept that size). My vertex buffer format (FVF) is tiny, because I stripped out "all fancy code" to get this problem down to basics prior to posting the first message in this thread. The operation of the class is as follows. "StartDrawingChain" initializes things. Then the user rams vertices directly into the vertex buffer memory. When the memory is full (256 bytes), the buffer is sent to the card (unlocked then relocked), without the class user knowing it. When the user calls "EndDrawingChain", it spits out the few remaining vertices, unlocks, and cleans up. Basically, about 10,000 triangles per frame are being spit through here, all in one round as I was describing, only locking and unlocking every 256 bytes (buffer size), and only 1 drawprimitive call per buffer.

In case you want to look at the whole class code, I've included it below.

class CFanList
{
    LPDIRECT3DVERTEXBUFFER8 m_vb;
    LPDIRECT3DDEVICE8       m_pd3dDevice;
    LPDIRECT3DINDEXBUFFER8  m_vbi;
    WORD*                   m_pIndexList;
    TRILISTVERTEX*          m_pVertexList;

    // info about our current buffer
    unsigned short          m_curVerts;         // nodes for the current fan including center
    unsigned short          m_maxVerts;         // total memory size in vertices
    unsigned short          m_totInds;          // total number of indexes into the vertex buffer
    unsigned short          m_totVerts;         // the total number of vertices currently in the buffer

    // info about our current drawing chain (for debugging purposes)
    long                    m_totChainFans;     // tot number of fans for our chain
    long                    m_totChainVerts;    // tot number of verts for our chain
    long                    m_totChainTris;     // tot number of triangles for our chain

private:
    inline void DrawBuffer()
    {
        m_vb->Unlock();
        m_vbi->Unlock();
        m_pd3dDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0, m_totVerts, 0, m_totInds/3);
        m_vb->Lock( 0, m_maxVerts*sizeof(TRILISTVERTEX), (BYTE**)&m_pVertexList, 0);
        m_vbi->Lock( 0, m_maxVerts*3*sizeof(WORD), (BYTE**)&m_pIndexList, 0);

        // advance our debugging info
        m_totChainVerts+= m_totVerts;
        m_totChainTris+= m_totInds/3;

        // reset our fan info now since we will really be starting from the beginning of the buffer
        m_totVerts= 0;
        m_totInds= 0;
        m_curVerts= 0;
        m_totInds= 0;
    }

public:
    CFanList()
    {
        m_totVerts= 0;
        m_totInds= 0;
        m_totInds= 0;
        m_curVerts= 0;
    }

    ~CFanList()
    {
        m_vb->Release();
        m_vbi->Release();
    }

    int Init(long BufferSize, LPDIRECT3DDEVICE8 d3ddevice)
    {
        m_pd3dDevice= d3ddevice;
        m_maxVerts= (unsigned short)BufferSize;
        if FAILED(m_pd3dDevice->CreateVertexBuffer( m_maxVerts*sizeof(TRILISTVERTEX), D3DUSAGE_WRITEONLY|D3DUSAGE_DYNAMIC , D3DFVF_TRILISTVERTEX, D3DPOOL_DEFAULT, &m_vb)) return 0;
        if FAILED(m_pd3dDevice->CreateIndexBuffer( m_maxVerts*3*sizeof(TRILISTVERTEX), D3DUSAGE_WRITEONLY|D3DUSAGE_DYNAMIC , D3DFMT_INDEX16, D3DPOOL_DEFAULT, &m_vbi)) return 0;
        m_pd3dDevice->SetIndices(m_vbi, 0);
        return 1;
    }

    inline void StartDrawingChain()
    {
        m_pd3dDevice->SetStreamSource( 0, m_vb, sizeof(TRILISTVERTEX) );
        m_pd3dDevice->SetVertexShader( D3DFVF_TRILISTVERTEX );
        m_vb->Lock( 0, m_maxVerts*sizeof(TRILISTVERTEX), (BYTE**)&m_pVertexList, 0);
        m_vbi->Lock( 0, m_maxVerts*3*sizeof(WORD), (BYTE**)&m_pIndexList, 0);
        m_totChainFans= 0;
        m_totChainTris= 0;
        m_totChainVerts= 0;
    }

    inline long EndDrawingChain()
    {
        static long output=0;

        // there might be a partial fan left to be drawn, and/or data in the buffer to be drawn
        if (m_curVerts >= 3) SendFan();
        if (m_totVerts >= 3) DrawBuffer();

        // clean up
        m_vb->Unlock();
        m_vbi->Unlock();

        // output debugging info
        if (((++output)%100)==0)
        {
            TRACE("%d verts, %d triangles, %d fans (%1.1f tris/fan)\n", m_totChainVerts, m_totChainTris, m_totChainFans, (float)m_totChainTris/(float)m_totChainFans);
        }
        return 0;
    }

    // after calling this you should add 3 nodes to the memory starting with the center, going clockwise
    inline TRILISTVERTEX* StartFan()
    {
        // before adding a new fan, make sure we have enough memory. if not draw & purge the buffer
        if (m_totVerts >= (m_maxVerts-16)) DrawBuffer();

        // if the previous fan has not yet been put in the buffer, do so now.
        if (m_curVerts >= 3) SendFan();

        // directly update the indexing for a whole new triangle
        m_pIndexList[m_totInds++]= m_totVerts;
        m_pIndexList[m_totInds++]= m_totVerts+1;
        m_pIndexList[m_totInds++]= m_totVerts+2;

        // increment our counters
        m_curVerts= 3;
        m_totVerts+= 3;
        m_totChainFans++;
        return &m_pVertexList[m_totVerts-3];
    }

    // after calling this you should add 1 node to the memory, going clockwise
    inline TRILISTVERTEX* AddNode()
    {
        m_curVerts++;
        m_totVerts++;

        // directly update the indexing for the new node
        m_pIndexList[m_totInds++]= m_totVerts-m_curVerts;   // the hub
        m_pIndexList[m_totInds++]= m_totVerts-2;            // the one prior to the most recent
        m_pIndexList[m_totInds++]= m_totVerts-1;            // the most recent

        // return the memory
        return &m_pVertexList[m_totVerts-1];
    }

    // calling this function will "output the fan". in reality we simply reset the current number of vertices
    inline unsigned short SendFan()
    {
        m_curVerts= 0;
        return m_curVerts;
    }

    // return the number of nodes in the current fan (including the center)
    inline unsigned short Count()
    {
        return m_curVerts;
    }
};


2. I've played with that size, and for me 256 seems good given my tiny FVF (I run in software or hardware mode optionally). Still, I can play with that easily. I'm sure that's not my big problem now (I just verified it). Going to 1024 made no real difference between software and hardware.

3. I'm sending all my polygons in one call of drawprimitive. Either 256 vertices, or 1024 (when I changed the size above, to no avail).

4. I was more worried about how it would impact hardware processing versus software.

4a. I'm not using mipmaps because I couldn't get D3DXCreateTextureFromFile to create them for me. At least, the structure it fills told me there was 1 miplevel. I assume that means AFTER it creates them (at which point, none were created). If you have input on this subject, please let me know. But this shouldn't make a big difference between SOFTWARE and HARDWARE rendering, right?

4b, c, d: I'm not using textures at all (I took them out for my testing). But again, hardware versus software rendering is my concern, and I don't believe these would make a difference between the two modes, right?

5. My vbuffer access is totally sequential as you can see from the code I included above. And you can't get a lighter FVF.

6. Where would I find these? You can see my vertex buffer creation code. My d3d device creation code creates with hardware or software optionally. I'm not using any textures. Is there anywhere else I should look for this?

7. No vertex shaders.

8. No mixing at the moment.

9. I've never turned on normalizenormals, so that shouldn't be a problem, right? (It defaults to the faster value?) Here's my d3d device init code:

if( FAILED( g_pD3D->CreateDevice( D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd, D3DCREATE_HARDWARE_VERTEXPROCESSING, &d3dpp, &m_pd3dDevice ) ) )
{
    return E_FAIL;
}

// set some rendering flags (z buffer)
m_pd3dDevice->SetRenderState( D3DRS_ZENABLE,  FALSE );
m_pd3dDevice->SetRenderState( D3DRS_FILLMODE, D3DFILL_WIREFRAME);
m_pd3dDevice->SetRenderState( D3DRS_LIGHTING, FALSE);

// set the texturing
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_MAGFILTER, D3DTEXF_LINEAR );
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_MINFILTER, D3DTEXF_LINEAR );


10+ (now I will be quoting you since you didn't number after 10).

"Vertex cache friendly indices make a difference."

What does this mean?

"Clear the whole of the render target, not just a subrect"

Yep

"For multiple viewports you don't need multiple BeginScene/EndScene."

Not an issue since I only have 1 viewport.

"Only use one BeginScene/EndScene per target."

I only have 1.

"Clear the frame and Z buffer explicitly, drawing big triangles at the back of the view isn't the best way anymore."

Done.

"If you have a stencil buffer, clear it at the same time as the frame and Z buffer - don't clear them separately - if a buffer which is active is left out of the Clear you incur an additional Read, Modify, Write since the card has to preserve part of the destination buffers."

Nope.



And to give you a little more info, here's my main rendering loop (it follows at the end of this post). I turn Zbuffering on and off for kicks, so don't mind that I'm clearing it here, but not setting it earlier.

Thanks so much for the wonderful reply.

Malachii.


m_pd3dDevice->Clear( 0, NULL, D3DCLEAR_TARGET|D3DCLEAR_ZBUFFER, D3DCOLOR_XRGB(0,0,0), 1.0f, 0 );

// move the camera around
//  AdjustCameraPosition(camx, camy);
float h= m_land.GetHeightAt(camx, camy);

// create view transformation
vecCamera= D3DXVECTOR3(camx, h+0.1f, camy);     // rotating
vecLookAt= D3DXVECTOR3(camx, h-0.3f, camy+1.0f);
vecUp= D3DXVECTOR3(0.0f, 1.0f, 0.0f);
D3DXMatrixLookAtLH(&matView, &vecCamera, &vecLookAt, &vecUp);

// assign the view transform
m_pd3dDevice->SetTransform(D3DTS_VIEW, &matView);

// fov transformation
D3DXMatrixPerspectiveFovLH(&matCamera, D3DX_PI/4, 4.0f/3.0f, 0.01f, 20.0f);
m_pd3dDevice->SetTransform(D3DTS_PROJECTION, &matCamera);

// Begin the scene
m_pd3dDevice->BeginScene();

// draw the polygons
D3DXVECTOR3 cams[2];
cams[0]= vecCamera;
cams[1]= vecLookAt;
m_land.Draw(cams);

// End the scene
m_pd3dDevice->EndScene();

// Present the backbuffer contents to the display
m_pd3dDevice->Present( NULL, NULL, NULL, NULL );

I'm going to pull the key points out rather than answer all of them...


A) You have dynamic buffers, but you don't lock them in the best way for hardware! Also, the "Unlock, Draw, Lock" pattern could be keeping the buffer locked for longer than is necessary (depending on how much work you do in between calls to it).

The GPU on T&L hardware is a processor totally independent of the main CPU in the machine, and the rendering part is also totally independent. That means while you're doing something with the CPU (such as filling buffers), you want the GPU to be doing T&L and render work in parallel (think multiprocessing).

When you lock a buffer which is in memory used by the graphics card, you're preventing the GPU from rendering from that buffer until you've unlocked it. Likewise, when you call Lock on a buffer which is currently being used by the GPU to draw from, the CPU has to spin doing nothing (inside the Lock call). As you can probably guess, this totally kills any parallelism of the GPU and CPU since one is always waiting for the other...

...The obvious solution to this is to have more than one buffer and cycle through them (think double or triple buffering) so that the buffer you're writing to with the CPU is different to the buffer being used by the GPU...

...Luckily, with T&L drivers, if you ask for a DYNAMIC buffer things are set up for multiple buffering for you. The hardware people term this "Buffer Renaming" (after the register renaming technique used inside CPUs to reduce stalls).
*** On GeForce drivers, when you create a buffer with the DYNAMIC flag, it actually allocates *8* buffers all of the same size (so in your case it really allocates 8*256) ***.
The reason it uses 8 rather than just 2 is that the card likes to buffer up to one and a half frames of draw requests, so the stage the card is at with drawing might be quite a way behind what is being submitted.

Now comes the part you aren't currently doing. Unfortunately, the cycling between buffers when you lock isn't as automatic as it could be; you *must* use the D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE flags correctly to maximise performance and cycle properly through the buffers.

The DISCARD flag forcibly cycles to another of the buffers. If you haven't filled in as many vertices as possible, this can be a waste (the driver will stall if it gets to the 8th buffer and there's still rendering happening from the 1st).

The NOOVERWRITE flag specifies that you won't be accessing any of the vertices you touched in a previous lock call. If the card is hungry for new data the driver can start to send the part of the current buffer you've already filled in.

With the correct use of those flags, a buffer size of 2000-plus vertices can be a win (but bigger than around 6000 starts to be a loss again), YMMV.

A quote from the D3D documentation may help (you may not have exactly the same docs, I'm working with an unreleased version at the moment):

Using Dynamic Vertex and Index Buffers

Dynamic vertex and index buffers have a difference in performance based the size and usage. The usage styles below help to determine whether to use D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE for the Flags parameter of the Lock method.

Usage Style 1:

for loop()
{
    pBuffer->Lock(...D3DLOCK_DISCARD...); //Ensures that hardware
                                          //doesn't stall by returning
                                          //a new pointer.
    Fill data (optimally 1000s of vertices/indices, no fewer) in pBuffer.
    pBuffer->Unlock()
    Change state(s).
    DrawPrimitive() or DrawIndexedPrimitive()
}

Usage Style 2:

for loop()
{
    pVB->Lock(...D3DLOCK_DISCARD...); //Ensures that hardware doesn't
                                      //stall by returning a new
                                      //pointer.
    Fill data (optimally 1000s of vertices/indices, no fewer) in pBuffer.
    pBuffer->Unlock
    for loop( 100s of times )
    {
        Change State
        DrawPrimitive() or DrawIndexPrimitives() //Tens of primitives
    }
}

Usage Style 3:

for loop()
{
    If there is space in the Buffer
    {
        //Append vertices/indices.
        pBuffer->Lock(...D3DLOCK_NOOVERWRITE...);
    }
    Else
    {
        //Reset to beginning.
        pBuffer->Lock(...D3DLOCK_DISCARD...);
    }
    Fill few 10s of vertices/indices in pBuffer
    pBuffer->Unlock
    Change State
    DrawPrimitive() or DrawIndexedPrimitive() //A few primitives
}

Style 1 is faster than either style 2 or 3, but is generally not very practical. Style 2 is usually faster than style 3, provided that the application fills at least a couple thousand vertices/indices for every Lock, on average. If the application fills fewer than that on average, then style 3 is faster. There is no guaranteed answer as to which lock method is faster and the best way to find out is to experiment.


The full DISCARD, NOOVERWRITE method is also reiterated in the DirectX FAQ at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndxgen/html/directx8faq.asp
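
The append-then-discard pattern (Usage Style 3 in the quoted docs) mostly comes down to offset bookkeeping. The sketch below models only that bookkeeping in plain C++ so it runs without Direct3D. `DynamicBufferCursor`, `LockRequest` and the two `kLock...` values are hypothetical names: the enum stands in for the real D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD flags, and the offset and size are what you would pass on to Lock. It is one possible arrangement, not the only correct one.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD;
// a real application would use the constants from the D3D headers.
enum LockFlag { kLockNoOverwrite, kLockDiscard };

// What the caller would hand to Lock(): byte offset, byte size and the flag.
struct LockRequest
{
    std::size_t offsetBytes;
    std::size_t sizeBytes;
    LockFlag    flag;
};

// Tracks the write position inside one dynamic vertex buffer.
class DynamicBufferCursor
{
public:
    // m_next starts at the end so the very first Reserve() wraps and discards.
    DynamicBufferCursor(std::size_t capacityVerts, std::size_t strideBytes)
        : m_capacity(capacityVerts), m_stride(strideBytes), m_next(capacityVerts) {}

    // Reserve room for `count` vertices and report how to lock for them.
    LockRequest Reserve(std::size_t count)
    {
        assert(count > 0 && count <= m_capacity);
        LockFlag flag = kLockNoOverwrite;       // append after earlier data
        if (m_next + count > m_capacity)        // would run off the end
        {
            m_next = 0;                         // restart at the beginning
            flag = kLockDiscard;                // and ask for a fresh buffer
        }
        LockRequest req = { m_next * m_stride, count * m_stride, flag };
        m_next += count;                        // later appends go after this
        return req;
    }

private:
    std::size_t m_capacity;   // buffer size in vertices
    std::size_t m_stride;     // bytes per vertex
    std::size_t m_next;       // first unused vertex slot
};
```

With a 1000-vertex buffer and 16-byte vertices, three 400-vertex batches would lock as DISCARD at offset 0, NOOVERWRITE at offset 6400, then DISCARD at offset 0 again. The draw call for each batch then has to start from the matching vertex offset rather than from 0.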




B) Vertex caching is a thing graphics hardware does to avoid re-fetching vertices. With an indexed primitive, the indices are used to look into the vertex stream to fetch the vertices to assemble a triangle; this fetch is a fetch out to video memory (AFAIK), which has a cost associated with it. So what most cards do is have a small FIFO cache (~8-16 vertices long on nVidia cards) which contains the last few vertices which were fetched to assemble triangles. If an index points to a vertex which is already in the cache, a fetch to video memory is avoided (think of CPU caches and cache misses - same deal).
If the indices you pass to the draw call jump all over the vertex stream at high frequency, you'll get more cache misses and lower performance (although that's the same in both hardware and software vertex processing). Keeping good locality of reference with the indices helps the cache; generally a winding or spiral pattern across the vertices gets the best performance.
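
To get a feel for how index order affects such a cache, it can be modelled offline. The sketch below counts misses for an index list against a simple FIFO of recently used vertex indices, as described above. `CountCacheMisses` and the cache sizes used with it are assumptions for illustration; real hardware may behave differently in detail.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper: counts how many indices miss a FIFO vertex cache
// holding at most `cacheSize` entries (cacheSize should be at least 1).
std::size_t CountCacheMisses(const std::vector<unsigned short>& indices,
                             std::size_t cacheSize)
{
    std::deque<unsigned short> fifo;    // oldest entry at the front
    std::size_t misses = 0;
    for (std::size_t i = 0; i < indices.size(); ++i)
    {
        const unsigned short idx = indices[i];
        if (std::find(fifo.begin(), fifo.end(), idx) != fifo.end())
            continue;                   // hit: a FIFO does not reorder on reuse
        ++misses;                       // miss: vertex must be fetched
        fifo.push_back(idx);
        if (fifo.size() > cacheSize)
            fifo.pop_front();           // evict the oldest entry
    }
    return misses;
}
```

Feeding it the fan-style list {0,1,2, 0,2,3, 0,3,4} with an 8-entry cache gives 5 misses for 9 indices, because the hub and the shared edge vertex are still cached when they are reused. A list that hops between distant vertices, such as {0,10,20, 1,11,21, 0,10,20} with a 4-entry cache, misses on all 9.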



--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

