D3D #14: Speeding up locking of resources.
This process comes up fairly often in Direct3D programming. Any code that makes use of the LockRect() or Lock() calls (typically associated with textures, surfaces and vertex/index buffers) is locking a resource. You will use these calls whenever you want to read or write the raw data stored in a resource - for any number of possible algorithms.
The key problem is that it is often very slow. The fact that, apparently, OpenGL can do it quicker is a moot point - in Direct3D it is slow! One of the key reasons is the "behind the scenes" work that must be done by the API, driver and hardware. Where possible, GPUs will run in parallel with the CPU, so resource modification can run into some typical concurrent-programming problems.
If the resource you are trying to modify is used as a dependency (input or output) of an operation in the command queue then you can incur pipeline stalls and flushes. A stall will occur when the GPU cannot make any progress until you finish manipulating the resource by calling Unlock(). A flush will require that some (or all) pending operations will have to be completed before you can access the resource.
Locking is a blocking operation - if you call Lock() on a resource that is not immediately available, it will stall the CPU until it is. This effectively synchronizes the two processing units and reduces overall performance.
The data must be transferred to locally addressable memory - your CPU cannot directly access the memory stored on your video card; instead the driver must stream the requested data back to CPU-addressable RAM. This step can be slow if you are requesting a large amount of data, and must be completed before the API will unblock your application. As a more subtle consequence, the blocking of your application and the usage of the AGP/PCI-E bus effectively stop your application from doing any further work, which can severely reduce overall performance.
As described above, locking is slow - mostly due to the latency rather than the bandwidth. Avoiding locks is good practice, but for "load-time" or initialization work they are usually fine; acquiring locks in the main application/game loop (mixed in with other GPU/graphics functions) is where you'll get punished the most.
If you really have to manipulate resources in your main loop there are a few tricks you can use to hide the latency. There is no single way of "solving" this problem, it is a case of using clever programming to try and reduce the impact that it has.
Firstly, make sure you get the creation flags correct (see the D3DUSAGE enumeration) - these are often optional and must be specified when the resource is created. When you acquire the lock, make sure you get the locking flags correct (see the D3DLOCK enumeration) - it's good practice to help the driver/GPU where possible; by giving it additional information via these parameters it might be able to perform a better/faster operation. If you get these combinations wrong, the debug runtimes will often scream and shout at you - make sure you check!
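As a sketch of matching the two sets of flags (the device pointer and buffer size are placeholders, not from the original), a frequently rewritten vertex buffer might be created as dynamic/write-only and then locked with the discard flag:

```cpp
// Sketch only: 'pd3dDevice' (IDirect3DDevice9*) and 'dwBufferSize' are assumed to exist.
// D3DUSAGE_DYNAMIC + D3DUSAGE_WRITEONLY tells the driver up front how the
// buffer will be used; dynamic resources live in the default pool.
LPDIRECT3DVERTEXBUFFER9 pDynamicVB = NULL;
if( SUCCEEDED( pd3dDevice->CreateVertexBuffer(
        dwBufferSize,
        D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY,   // creation flags (D3DUSAGE)
        0,                                       // non-FVF buffer
        D3DPOOL_DEFAULT,
        &pDynamicVB, NULL ) ) )
{
    void *pData = NULL;

    // D3DLOCK_DISCARD pairs with D3DUSAGE_DYNAMIC: the driver may hand back a
    // fresh block of memory rather than stalling until the old data is free.
    if( SUCCEEDED( pDynamicVB->Lock( 0, 0, &pData, D3DLOCK_DISCARD ) ) )
    {
        // ... write the new vertex data into pData ...
        pDynamicVB->Unlock( );
    }
}
```

Mismatched combinations (e.g. D3DLOCK_DISCARD on a non-dynamic buffer) are exactly the cases the debug runtimes will warn about.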
As previously mentioned, the duration of the lock (the time spent between Lock() and Unlock(), for example) affects how badly you stall your application and/or the GPU. Performing all of your manipulation whilst the lock is held might seem the more obvious way of programming, but it is not good for performance. Only consider doing this if it's a quick operation or you need to both read and write the data.
If you are only reading the data you can use a quick memcpy_s() operation to copy the locked data to a normal system memory array, unlock the resource, and then do your processing/reading. A bonus is that you could also farm out the work to a "worker thread" and gain some time via concurrent programming. Similarly, if you need to only write data then you can also copy a big chunk of system-RAM data into the resource using a memcpy_s() call. If you need to read data, process it, then write it back again you could explore the possibilities of two locks (one for the read, one for the write) being faster than a lengthy single lock.
// Compute the number of elements in this vertex buffer...
D3DVERTEXBUFFER_DESC pDesc;
m_pVertexBuffer->GetDesc( &pDesc );
size_t ElementCount = pDesc.Size / sizeof( TerrainVertex );
size_t DataSize = ElementCount * sizeof( TerrainVertex );

// Declare the variables
void *pRawData = NULL;
TerrainVertex *pVertex = new TerrainVertex[ ElementCount ];

// Attempt to gain the lock
if( SUCCEEDED( m_pVertexBuffer->Lock( 0, 0, &pRawData, D3DLOCK_READONLY ) ) )
{
    // Copy the data to system memory
    errno_t err = memcpy_s( reinterpret_cast< void* >( pVertex ), DataSize, pRawData, DataSize );

    // Unlock the resource as soon as the copy is done
    HRESULT hrUnlock = m_pVertexBuffer->Unlock( );
    if( FAILED( hrUnlock ) )
    {
        // Handle the error appropriately...
    }

    // Only use the copy if both the unlock and the copy succeeded
    if( SUCCEEDED( hrUnlock ) && ( 0 == err ) )
    {
        // Work with the data...
    }
}

// Clean-up the allocated memory
SAFE_DELETE_ARRAY( pVertex );
Consider a bounded-buffer (aka "ring buffer") approach. Create multiple copies of each resource (for example, 3 render targets or vertex buffers) and alternate between them. The goal is that you can lock/manipulate one resource whilst the pipeline renders to or from another - the CPU and GPU are no longer reliant on the same resource. The downside is that the results you get back can be "stale", and it doesn't work if the individual steps aren't separable.
// Declarations
DWORD dwBoundedBufferSize = 4;
DWORD dwCurrentBuffer = 0;
LPDIRECT3DSURFACE9 *pBoundedBuffer = new LPDIRECT3DSURFACE9[ dwBoundedBufferSize ];

// Create the resources
for( DWORD i = 0; i < dwBoundedBufferSize; i++ )
{
    if( FAILED( pd3dDevice->CreateRenderTarget( ..., &pBoundedBuffer[ i ], ... ) ) )
    {
        // Handle error condition here..
    }
}

// On this frame we should render to 'dwCurrentBuffer'
DWORD dwIndexToRender = dwCurrentBuffer;

// We should lock 'dwCurrentBuffer + 1' - which will be the
// oldest of the available buffers, thus hopefully not in the command queue.
DWORD dwIndexToLock = ( dwCurrentBuffer + 1 ) % dwBoundedBufferSize;

// At the end of each frame we make sure to move the index forwards:
dwCurrentBuffer = ( dwCurrentBuffer + 1 ) % dwBoundedBufferSize;

// Release the resources
for( DWORD i = 0; i < dwBoundedBufferSize; i++ )
    SAFE_RELEASE( pBoundedBuffer[ i ] );
SAFE_DELETE_ARRAY( pBoundedBuffer );
If you need to read/write a large amount of data consider a staggered upload/download. Over the course of 10 frames, upload 10% of the data each frame - appending to the previous sections. The idea is to maintain short locks and to allow other graphics operations to be performed between locks. However, this method is not always an improvement - but it is at least something worth considering.
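The bookkeeping for such a staggered transfer can be sketched as follows (the function name and chunk layout are illustrative, not from the original); each returned (offset, size) pair corresponds to one short Lock( offset, size, ... )/Unlock() pass on a subsequent frame:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split a transfer of 'totalBytes' into 'frames' roughly equal chunks,
// distributing any remainder across the first few frames. Each frame you
// would lock only its own (offset, size) slice, keeping every lock short
// and leaving room for other graphics work in between.
std::vector< std::pair<size_t, size_t> > PlanStaggeredTransfer( size_t totalBytes, size_t frames )
{
    std::vector< std::pair<size_t, size_t> > chunks;
    size_t offset = 0;
    for( size_t i = 0; i < frames; ++i )
    {
        size_t size = totalBytes / frames + ( i < totalBytes % frames ? 1 : 0 );
        if( size > 0 )
            chunks.push_back( std::make_pair( offset, size ) );
        offset += size;
    }
    return chunks;
}
```

For example, PlanStaggeredTransfer( 1000, 10 ) yields ten 100-byte slices, i.e. the "10% per frame over 10 frames" scheme described above.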
As originally stated, a lock can affect the concurrency of the CPU/GPU, thus you want as few locks as possible. If many resources need to be updated, consider spreading it out over a number of subsequent frames. This way you will get a less noticeable performance drop. A possible implementation is to maintain a simple queue of resources/operations that need to be performed and then allow only 1 (or 2, or 3...) per frame regardless of how many are waiting.
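A minimal sketch of that queue (class and method names are hypothetical, and the real per-item work would be the Lock()/copy/Unlock() pass) might look like this:

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Pending-update queue: each entry names a resource that still needs a
// Lock()/Unlock() pass. ProcessFrame() drains at most 'maxPerFrame'
// entries, spreading the locking cost over several frames regardless of
// how many updates are waiting.
class UpdateQueue
{
public:
    void Enqueue( const std::string& resourceName ) { m_pending.push( resourceName ); }

    // Returns the resources updated this frame.
    std::vector<std::string> ProcessFrame( size_t maxPerFrame )
    {
        std::vector<std::string> done;
        while( !m_pending.empty() && done.size() < maxPerFrame )
        {
            // In real code: Lock() the resource, copy the data, Unlock().
            done.push_back( m_pending.front() );
            m_pending.pop();
        }
        return done;
    }

    size_t Pending() const { return m_pending.size(); }

private:
    std::queue<std::string> m_pending;
};
```

With maxPerFrame set to 2, five queued updates would complete over three frames instead of causing five locks in a single frame.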
See Also: 'Using Dynamic Vertex and Index Buffers' in the DirectX SDK documentation.