D3D #14: Speeding up locking of resources.

Published May 18, 2006
If you've got time, any comments - good or bad - would be appreciated:

Locking resources comes up fairly often in Direct3D programming. Any code that makes use of the LockRect() or Lock() calls (typically associated with textures, surfaces and vertex/index buffers) is locking a resource. You will use this technique whenever you want to read or write the raw data stored in the buffer - for any number of possible algorithms.

The key problem is that locking is often very slow. The fact that, apparently, OpenGL can do it quicker is a moot point - in Direct3D it is slow! One of the key reasons is the "behind the scenes" work that must be done by the API, driver and hardware. Where possible GPUs will run in parallel with the CPU, so modifying a resource can run into some typical concurrent programming problems.

If the resource you are trying to modify is a dependency (input or output) of an operation in the command queue then you can incur pipeline stalls and flushes. A stall occurs when the GPU cannot make any progress until you finish manipulating the resource by calling Unlock(). A flush requires that some (or all) pending operations be completed before you can access the resource.

Locking is a blocking operation - if you call Lock() on a resource that is not immediately available it will stall the CPU until it is. This effectively synchronizes the two processing units and reduces overall performance.

The data must be transferred to locally addressable memory - your CPU cannot directly access the memory on your video card; instead the driver must stream the requested data back to CPU-addressable RAM. This step can be slow if you are requesting a large amount of data, and it must be completed before the API returns control to your application. As a more subtle consequence, while your application is blocked and the AGP/PCI-E bus is busy it cannot do any further work, which can severely reduce overall performance.

As described above, locking is slow - mostly due to latency rather than bandwidth. Avoiding locks is good practice, but for "load-time" or initialization work they are usually fine. Acquiring locks in the main application/game loop (mixed in with other GPU/graphics functions) is where you'll get punished the most.

If you really have to manipulate resources in your main loop there are a few tricks you can use to hide the latency. There is no single way of "solving" this problem; it is a case of using clever programming to try and reduce the impact that it has.

Firstly, make sure you get the creation flags correct (see the D3DUSAGE enumeration) - these are often optional, but they can only be specified when the resource is created. When you acquire the lock make sure you get the locking flags correct (see the D3DLOCK enumeration) - it's good practice to help the driver/GPU where possible; by giving it additional information via these parameters it might be able to perform a better/faster operation. If you get these combinations wrong then the debug runtimes will often scream and shout at you - make sure you check!

As previously mentioned, the duration of the lock (how much time is spent between Lock() and Unlock() for example) can affect how badly you stall your application and/or GPU. Performing all of your manipulation whilst the lock is held might seem a more obvious way of programming, but it is not good for performance. Only consider doing this if it's a quick operation or you need to read and write data.

If you are only reading the data you can use a quick memcpy_s() operation to copy the locked data to a normal system memory array, unlock the resource, and then do your processing/reading. A bonus is that you could also farm out the work to a "worker thread" and gain some time via concurrent programming. Similarly, if you need to only write data then you can also copy a big chunk of system-RAM data into the resource using a memcpy_s() call. If you need to read data, process it, then write it back again you could explore the possibilities of two locks (one for the read, one for the write) being faster than a lengthy single lock.

// Compute the number of elements in this vertex buffer...
D3DVERTEXBUFFER_DESC pDesc;
m_pVertexBuffer->GetDesc( &pDesc );
size_t ElementCount = pDesc.Size / sizeof( TerrainVertex );
size_t CopySize = ElementCount * sizeof( TerrainVertex );

// Declare the variables
void *pRawData = NULL;
TerrainVertex *pVertex = new TerrainVertex[ ElementCount ];

// Attempt to gain the lock
if( SUCCEEDED( m_pVertexBuffer->Lock( 0, 0, &pRawData, D3DLOCK_READONLY ) ) )
{
	// Copy the data; the destination array holds exactly CopySize bytes
	errno_t err = memcpy_s( pVertex, CopySize, pRawData, CopySize );

	// Unlock the resource as soon as the copy is done
	HRESULT hr = m_pVertexBuffer->Unlock( );

	// Only use the copy if both the copy and the unlock succeeded
	if( 0 == err && SUCCEEDED( hr ) )
	{
		// Work with the data...
	}
	else
	{
		// Handle the error appropriately...
	}
}

// Clean-up the allocated memory on every path
SAFE_DELETE_ARRAY( pVertex );


Consider a bounded-buffer (aka "ring buffer") approach. Create multiple copies of each resource (for example 3 render targets or vertex buffers) and alternate between them. The intended goal is that you will be locking/manipulating one resource whilst the pipeline can render to or from the other - the CPU and GPU are no longer reliant on the same resource. The down-side is that the results you'll get back can be "stale" and it doesn't work if the individual steps aren't separable.

// Declarations
DWORD dwBoundedBufferSize = 4;
DWORD dwCurrentBuffer = 0;
LPDIRECT3DSURFACE9 *pBoundedBuffer = new LPDIRECT3DSURFACE9[ dwBoundedBufferSize ];

// Create the resources
for( DWORD i = 0; i < dwBoundedBufferSize; i++ )
{
	// Start from NULL so the release loop is safe if creation fails
	pBoundedBuffer[i] = NULL;
	if( FAILED( pd3dDevice->CreateRenderTarget( ..., &pBoundedBuffer[i], ... ) ) )
	{
		// Handle error condition here..
	}
}

// On this frame we should render to 'dwCurrentBuffer'
DWORD dwIndexToRender = dwCurrentBuffer;

// We should lock 'dwCurrentBuffer + 1' - which will be the
// oldest of the available buffers, thus hopefully not in the command queue.
DWORD dwIndexToLock = (dwCurrentBuffer + 1) % dwBoundedBufferSize;

// At the end of each frame we make sure to move the index forwards:
dwCurrentBuffer = (dwCurrentBuffer + 1) % dwBoundedBufferSize;

// Release the resources
for( DWORD i = 0; i < dwBoundedBufferSize; i++ )
	SAFE_RELEASE( pBoundedBuffer[i] );
SAFE_DELETE_ARRAY( pBoundedBuffer );


If you need to read/write a large amount of data consider a staggered upload/download. Over the course of 10 frames, upload 10% of the data each frame - appending to the previous sections. The idea is to maintain short locks and to allow other graphics operations to be performed between locks. This method is not always an improvement, but it is at least worth considering.
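As a rough sketch of the bookkeeping behind a staggered upload, the helper below splits a resource into one byte range per frame. It is plain C++ with no Direct3D calls; StaggeredUpload and UploadChunk are hypothetical names made up for illustration, not part of the API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// One slice of a larger upload: the byte range to lock and fill this frame.
struct UploadChunk
{
	std::size_t offset; // byte offset into the resource
	std::size_t size;   // bytes to copy this frame (0 once the upload is finished)
};

// Hypothetical helper (not part of Direct3D): splits 'totalBytes' into
// at most 'frameCount' consecutive chunks, handing out one per frame.
class StaggeredUpload
{
public:
	StaggeredUpload( std::size_t totalBytes, std::size_t frameCount )
		: m_total( totalBytes ), m_next( 0 )
	{
		assert( frameCount > 0 );
		// Round up so the whole resource is covered within 'frameCount' frames
		m_chunk = ( totalBytes + frameCount - 1 ) / frameCount;
	}

	// True once every byte has been handed out
	bool Done() const { return m_next >= m_total; }

	// Call once per frame; returns the range to lock, copy into and unlock
	UploadChunk Next()
	{
		UploadChunk c = { m_next, std::min( m_chunk, m_total - m_next ) };
		m_next += c.size;
		return c;
	}

private:
	std::size_t m_total; // total size of the upload in bytes
	std::size_t m_chunk; // size of a full chunk in bytes
	std::size_t m_next;  // offset of the next chunk to hand out
};
```

Each frame you would call Next() and, if the returned size is non-zero, pass the offset and size as the first two parameters of Lock() so that only that portion of the buffer is locked.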

As originally stated, a lock can affect the concurrency of the CPU/GPU, thus you want as few locks as possible. If many resources need to be updated, consider spreading it out over a number of subsequent frames. This way you will get a less noticeable performance drop. A possible implementation is to maintain a simple queue of resources/operations that need to be performed and then allow only 1 (or 2, or 3...) per frame regardless of how many are waiting.
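One possible shape for such a queue is sketched below. It is plain C++ (it needs a C++11 compiler for std::function and std::move) and makes no Direct3D calls; UpdateQueue is a hypothetical name, and it assumes each queued job locks, modifies and unlocks exactly one resource.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical helper (not part of Direct3D): holds pending resource
// updates and runs only a fixed number of them per frame.
class UpdateQueue
{
public:
	explicit UpdateQueue( std::size_t maxPerFrame ) : m_maxPerFrame( maxPerFrame ) {}

	// Each job is expected to lock, modify and unlock exactly one resource
	void Push( std::function< void() > job ) { m_jobs.push( std::move( job ) ); }

	// Number of updates still waiting
	std::size_t Pending() const { return m_jobs.size(); }

	// Call once per frame; returns how many jobs were actually run
	std::size_t ProcessFrame()
	{
		std::size_t ran = 0;
		while( ran < m_maxPerFrame && !m_jobs.empty() )
		{
			std::function< void() > job = std::move( m_jobs.front() );
			m_jobs.pop();
			job();
			++ran;
		}
		return ran;
	}

private:
	std::size_t m_maxPerFrame;                    // per-frame budget
	std::queue< std::function< void() > > m_jobs; // oldest job first
};
```

Jobs run in the order they were queued, so with a budget of 2 and 5 waiting updates the work is spread over three frames (2, 2, then 1). Whether a fixed count is the right budget is a judgment call; a time-based budget would be another option.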

See Also: 'Using Dynamic Vertex and Index Buffers' in the DirectX SDK documentation.

Comments

Washu
I should also note that another fashion of locking is that when given a dynamic vertex buffer allocated with WriteOnly, then you lock only the portion you need, using NoOverwrite so that you can be guaranteed to be using a segment of the vertex buffer not currently in use. You use offsets and lengths so that you only lock the portions you need. You then draw those portions. When the buffer runs out of space, your next lock gets issued with a Discard flag, which will return (in theory) a new block of memory fresh for use while the card is busy rendering the previously written data.
May 18, 2006 11:10 AM
Muhammad Haggag
Excellent stuff up there. I also believe that a portion should be added explaining:
1) Dynamic buffers, and how vertex buffer renaming makes discard + write-only locks OK things (up to a limit).

2) Index buffers are stored in system memory on a lot of cards so reading them back is not as bad as other resources.
May 18, 2006 11:32 AM
jollyjeffers
Excellent! thanks for the comments - I'll look to roll those into what I've already got [smile]

Cheers,
Jack
May 18, 2006 01:52 PM
_the_phantom_
Quote:
2) Index buffers are stored in system memory on a lot of cards so reading them back is not as bad as other resources


define 'a lot of cards'?
I know that older cards certainly can't deal with having the index data in VRAM, but newer ones (DX9 and up iirc) can and in OpenGL the VBO used for it 'can' end up in VRAM (trying to use a VBO for index data on the older cards tends to lead to large slow down as well).

Or is this something you can request and, like OGL, it might end up in VRAM or it might stay in system ram?
May 18, 2006 04:00 PM
jollyjeffers
First time I've heard about system memory IB's; sort of makes sense if the input assembler is driver/CPU based, but that again wouldn't make much sense for the latest-n-greatest all hardware cards [oh]

Quote: Or is this something you can request and, like OGL, it might end up in VRAM or it might stay in system ram?
I think most of the "pool" parameters have to be honoured, but the various "usage" flags are just hints - it's free to ignore them if it wants.

Jack
May 19, 2006 05:26 AM
ET3D
Don't know exactly where a card puts what, though I remember reading a document about NVIDIA's placement strategy on their site a couple of years ago. I remember that the Rage Pro put textures in AGP memory from a certain driver onward, once it was discovered to be faster than putting textures in card RAM. (There were no on-card VBs and IBs at that time, of course.)
May 19, 2006 06:42 AM
Muhammad Haggag
IIRC, the 'lot of cards' that do this are all the ATi DX8.1- parts I encountered (8500, 9200 and below). The DX8.1 debug runtime used to issue a diagnostic message about whether your index buffers were allocated in VRAM or not. Anything 9500+ (i.e. DX9+) on ATi's side of things can store index buffers in VRAM, AFAIK.

No idea about nVidia parts, but I seem to recall pre-DX8.1 ones also did that.
May 19, 2006 10:43 AM