DX12: is m_commandAllocator->Reset() necessary every frame?


Having played around with the MSFT DX12 samples for some time and looked through the MSDN documentation, there is still a lot that confuses me. One thing is how ID3D12CommandAllocator and ID3D12GraphicsCommandList work.

If (as mentioned on MSDN) ID3D12CommandAllocator is the memory allocator backing the GPU command buffer, and ID3D12GraphicsCommandList is the container object holding a sequence of commands, why do we need to reset the allocator and rebuild the sequence of commands from scratch every frame (at least the MSFT examples do)? Especially when there are only [backbuffer count] different command lists.

I modified the D3D12HelloConstBuffers sample so that I create 3 command list objects from one allocator (the backbuffer count is 3). The only difference between these three command lists is that

"m_commandList[frameidx]->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);"

differs, since rtvHandle is different for each of the 3 backbuffers.

Then in OnRender() I got rid of the allocator reset and the per-frame PopulateCommandList call. Basically I just execute the right command list, created earlier, based on the current backbuffer index.

Here are the relevant code snippets:


// Fill the command list with all the render commands and dependent state.
void D3D12HelloConstBuffers::PopulateCommandList(int frameidx)
{
	// Command list allocators can only be reset when the associated 
	// command lists have finished execution on the GPU; apps should use 
	// fences to determine GPU execution progress.
	//ThrowIfFailed(m_commandAllocator->Reset());

	// However, when ExecuteCommandList() is called on a particular command 
	// list, that command list can then be reset at any time and must be before 
	// re-recording.
	ThrowIfFailed(m_commandList[frameidx]->Reset(m_commandAllocator.Get(), m_pipelineState.Get()));

	// Set necessary state.
	m_commandList[frameidx]->SetGraphicsRootSignature(m_rootSignature.Get());

	ID3D12DescriptorHeap* ppHeaps[] = { m_cbvHeap.Get() };
	m_commandList[frameidx]->SetDescriptorHeaps(_countof(ppHeaps), ppHeaps);

	m_commandList[frameidx]->SetGraphicsRootDescriptorTable(0, m_cbvHeap->GetGPUDescriptorHandleForHeapStart());
	m_commandList[frameidx]->RSSetViewports(1, &m_viewport);
	m_commandList[frameidx]->RSSetScissorRects(1, &m_scissorRect);

	// Indicate that the back buffer will be used as a render target.
	m_commandList[frameidx]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_renderTargets[frameidx].Get(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET));

	CD3DX12_CPU_DESCRIPTOR_HANDLE rtvHandle(m_rtvHeap->GetCPUDescriptorHandleForHeapStart(), frameidx, m_rtvDescriptorSize);
	m_commandList[frameidx]->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);

	// Record commands.
	const float clearColor[] = { 0.0f, 0.2f, 0.4f, 1.0f };
	m_commandList[frameidx]->ClearRenderTargetView(rtvHandle, clearColor, 0, nullptr);
	m_commandList[frameidx]->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	m_commandList[frameidx]->IASetVertexBuffers(0, 1, &m_vertexBufferView);
	m_commandList[frameidx]->DrawInstanced(3, 1, 0, 0);

	// Indicate that the back buffer will now be used to present.
	m_commandList[frameidx]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_renderTargets[frameidx].Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT));

	ThrowIfFailed(m_commandList[frameidx]->Close());
}

// Render the scene.
void D3D12HelloConstBuffers::OnRender()
{
	m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();
	// Execute the command list.
	ID3D12CommandList* ppCommandLists[] = { m_commandList[m_frameIndex].Get() };
	m_commandQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

	// Present the frame.
	ThrowIfFailed(m_swapChain->Present(1, 0));

	WaitForPreviousFrame();
}

And it turned out it works the same as the original code.

So my question is: what's the point of spending all those CPU cycles recreating the command list every frame, or even resetting the allocator every frame? Why not create a bunch of command lists ahead of time and use the proper one in the render function?


If you don't reset the allocator and keep adding more commands, memory consumption will grow until you exhaust memory and crash.

However, if you don't add more commands and intend to reuse them, that's perfectly fine.

So my question is: what's the point of spending all those CPU cycles recreating the command list every frame, or even resetting the allocator every frame? Why not create a bunch of command lists ahead of time and use the proper one in the render function?

Because most of the time you can't reuse most of these commands. The simplest example is frustum culling + a user-controlled camera.

If the user looks to the right, one set of commands needs to be generated to render everything the user can see. If they look to the left, another set of commands is needed. Even if you look just a little more to the left, the draw order may change significantly (since you should still be sorting your draws, either for performance by sorting by PSO, or for correctness, i.e. transparency).

However, this doesn't mean you can't get clever and reuse as much as you can (e.g. real-time cubemapping, baked scenery, etc.).

Edit: If you write a set of commands and never modify them again, you don't need 3 sets (one for each backbuffer). When generating them dynamically, you need 3 because while you're writing to one from the CPU, the GPU may be reading from the other two (it would be a race condition/hazard otherwise). But this isn't a problem if you never write to them again.
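To make that concrete, here is a minimal sketch of the usual pattern when you do regenerate commands every frame: one allocator and one fence value per buffered frame, and an allocator is only reset once its fence shows the GPU has finished that frame. Member names are illustrative, not from the sample; it assumes ThrowIfFailed and ComPtr as used in the samples.

// Minimal sketch of per-frame allocators with fence-gated reset (illustrative names).
static const UINT FrameCount = 3;

ComPtr<ID3D12CommandAllocator>    m_frameAllocators[FrameCount];
ComPtr<ID3D12GraphicsCommandList> m_frameCommandLists[FrameCount];
ComPtr<ID3D12Fence>               m_fence;
UINT64                            m_fenceValues[FrameCount] = {};
HANDLE                            m_fenceEvent;

void RecordFrame(UINT frameIdx)
{
	// Safe only because we already waited on m_fenceValues[frameIdx] (see MoveToNextFrame).
	ThrowIfFailed(m_frameAllocators[frameIdx]->Reset());
	ThrowIfFailed(m_frameCommandLists[frameIdx]->Reset(m_frameAllocators[frameIdx].Get(), m_pipelineState.Get()));

	// ... record this frame's (potentially different) commands ...

	ThrowIfFailed(m_frameCommandLists[frameIdx]->Close());
}

void MoveToNextFrame()
{
	// Signal after submitting the current frame's work.
	const UINT64 currentFenceValue = m_fenceValues[m_frameIndex];
	ThrowIfFailed(m_commandQueue->Signal(m_fence.Get(), currentFenceValue));

	// Advance to the next backbuffer and wait until its previous work is done,
	// so its allocator can safely be reset again.
	m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();
	if (m_fence->GetCompletedValue() < m_fenceValues[m_frameIndex])
	{
		ThrowIfFailed(m_fence->SetEventOnCompletion(m_fenceValues[m_frameIndex], m_fenceEvent));
		WaitForSingleObjectEx(m_fenceEvent, INFINITE, FALSE);
	}
	m_fenceValues[m_frameIndex] = currentFenceValue + 1;
}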


Why not create a bunch of command lists ahead of time and use the proper one in the render function?

Also remember that it is suggested you have (IIRC) 100 draws per command list (12 for bundles). Visibility can change on a per-frame basis, so reusing command lists isn't really optimal.

-potential energy is easily made kinetic-

I'd recommend resetting the command allocator as soon as you're able to do it (but not sooner!).

As mentioned, command allocators hold on to resources, so if your game has high resource usage, that puts more pressure on memory.

So that typically means an allocator used at frame N will be reset, at the earliest, at frame N + M (M being how many frames ahead your CPU is compared to what your GPU has done).

On the other hand, an allocator used for a bundle will not be reset until that bundle has stopped being used. If the same bundle is used for the whole lifetime of your application, that means the allocator will not be reset before the application ends (so you also have to be careful not to mix bundles and command lists with very different lifecycles on the same allocator).

Also, about the cost of doing a reset: doing a reset per frame is NOT costly. Command allocators were designed to have low CPU overhead when recycling resources, which means you can reset one and recycle allocations more quickly than without an allocator (which is what DX11 was doing). Allocator reuse is an integral part of reducing allocation cost over time.

Silly example: you have 8 threads building command lists. You could have, for example, one allocator per thread per in-flight frame. If there are two frames between a command being added to a command list and that command resulting in something on the screen (using a fence to guarantee that), then you start with 16 command allocators. On frame N you use the first 8 allocators, on frame N+1 you use the next 8 allocators, then before frame N+2 starts you wait for the fence that signals the end of frame N, reset the first 8 allocators, and so on.
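One way to lay that scheme out in code (purely illustrative names, assuming a per-frame fence as above) is a flat array of allocators indexed by frame slot and thread, where a whole batch of allocators is reset together once that frame slot's fence has passed:

// Sketch: 8 recording threads x 2 frames in flight = 16 allocators (illustrative names).
static const UINT NumThreads     = 8;
static const UINT FramesInFlight = 2;

ComPtr<ID3D12CommandAllocator> allocators[FramesInFlight * NumThreads];
ComPtr<ID3D12Fence>            frameFence;
UINT64                         frameFenceValues[FramesInFlight] = {};
HANDLE                         fenceEvent;

ID3D12CommandAllocator* AllocatorFor(UINT frameSlot, UINT threadIndex)
{
	return allocators[frameSlot * NumThreads + threadIndex].Get();
}

void BeginFrame(UINT frameSlot)
{
	// Frame N + 2 reuses the slot frame N used, so wait for frame N's fence first.
	if (frameFence->GetCompletedValue() < frameFenceValues[frameSlot])
	{
		ThrowIfFailed(frameFence->SetEventOnCompletion(frameFenceValues[frameSlot], fenceEvent));
		WaitForSingleObjectEx(fenceEvent, INFINITE, FALSE);
	}

	// Now this slot's 8 allocators can be reset and handed to the worker threads.
	for (UINT t = 0; t < NumThreads; ++t)
		ThrowIfFailed(AllocatorFor(frameSlot, t)->Reset());
}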

Then on top of that you have bundles that you built for reuse. If you have bundles that will last for the whole level (for example!), you have a dedicated allocator for them (if you built them in parallel, then you will have one dedicated allocator per thread). You keep those allocators around until the level ends and those bundles are no longer used by any command list that is still in flight.
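A rough sketch of what such a long-lived bundle with its own dedicated allocator could look like (names are illustrative; the bundle allocator is simply never reset while the bundle can still be replayed by in-flight lists):

// Sketch: record a bundle once (e.g. at level load) and replay it every frame.
ComPtr<ID3D12CommandAllocator>    bundleAllocator;
ComPtr<ID3D12GraphicsCommandList> bundle;

ThrowIfFailed(device->CreateCommandAllocator(
	D3D12_COMMAND_LIST_TYPE_BUNDLE, IID_PPV_ARGS(&bundleAllocator)));
ThrowIfFailed(device->CreateCommandList(
	0, D3D12_COMMAND_LIST_TYPE_BUNDLE, bundleAllocator.Get(), pipelineState.Get(),
	IID_PPV_ARGS(&bundle)));

// Record once.
bundle->SetGraphicsRootSignature(rootSignature.Get());
bundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
bundle->IASetVertexBuffers(0, 1, &vertexBufferView);
bundle->DrawInstanced(3, 1, 0, 0);
ThrowIfFailed(bundle->Close());

// Replay every frame from a direct command list; only the direct list's allocator
// cycles per frame, bundleAllocator stays untouched until the level ends.
directCommandList->ExecuteBundle(bundle.Get());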

(We're using the term "frame" loosely here; it's any unit of GPU work whose completion you're keeping track of.)


Why not create a bunch of command lists ahead of time and use the proper one in the render function?

Also remember that it is suggested you have (IIRC) 100 draws per command list (12 for bundles). Visibility can change on a per-frame basis, so reusing command lists isn't really optimal.

100 draw calls (and 12 for bundles) is the recommended minimum; the idea is to avoid small command lists with 1 or 2 commands. That being said, in the game I work on we have some very small command lists (like post-process) that have maybe 20 commands (if all the steps are on), and that is OK.

Just be sure to dispatch all your command lists in as few calls to the queue as possible (I think we have 2 per frame: one early with a lot of scene work, and another later with the last scene steps, post-process, GUI, etc.). Dispatching to the queue is the expensive operation, but if you batch your CLs it's not that bad.
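As an illustration (all list names are hypothetical), batching could look like this: record each command list on its own job, then hand them to the queue in just a couple of ExecuteCommandLists calls per frame.

// Sketch: batch command lists into few ExecuteCommandLists calls (illustrative names).
// First batch: the bulk of the scene work, recorded earlier in parallel jobs.
ID3D12CommandList* sceneBatch[] =
{
	gbufferList.Get(),
	shadowList.Get(),
	lightingList.Get(),
	postProcessList.Get()
};
commandQueue->ExecuteCommandLists(_countof(sceneBatch), sceneBatch);

// Second batch later in the frame: GUI plus the final RENDER_TARGET -> PRESENT
// transition, followed by Present on the swap chain.
ID3D12CommandList* finalBatch[] = { guiList.Get(), presentTransitionList.Get() };
commandQueue->ExecuteCommandLists(_countof(finalBatch), finalBatch);
ThrowIfFailed(swapChain->Present(1, 0));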

Building command lists in DX12 is cheap, very cheap. So don't worry about it, but if you have 20,000 elements to draw you may want to split that into a bunch of command lists (to get more parallelism), like 10 CLs of 2,000 elements each, but don't go with 2,000 CLs of 10 elements each ;)

Also, the command allocator never shrinks on reset. So no, it doesn't free anything, but you can reuse the memory. Think of it as a vector that can grow, but every time you call clear it just sets the size to 0 instead of releasing memory.

So my recommendation is to stick to a 1:1 mapping between CLs and command allocators, and try not to build CLs that are too big (500 draw calls is fine, 20,000 is not) nor too small, unless you need to (think again of post-process and the like).


Just be sure to dispatch all your command lists in as few calls to the queue as possible (I think we have 2 per frame: one early with a lot of scene work, and another later with the last scene steps, post-process, GUI, etc.). Dispatching to the queue is the expensive operation, but if you batch your CLs it's not that bad.

Building command lists in DX12 is cheap, very cheap. So don't worry about it, but if you have 20,000 elements to draw you may want to split that into a bunch of command lists (to get more parallelism), like 10 CLs of 2,000 elements each, but don't go with 2,000 CLs of 10 elements each ;)

I read the opposite... that building command lists is expensive while submitting them is relatively cheap. In addition, if you buffer your entire frame's command lists before submitting them, it will take more memory and potentially (most likely) lead to an idle GPU waiting for work to do.

edit - although making lists with ~100 draws each does reduce the number of submits.

-potential energy is easily made kinetic-


Just be sure to dispatch all your command lists in as few calls to the queue as possible (I think we have 2 per frame: one early with a lot of scene work, and another later with the last scene steps, post-process, GUI, etc.). Dispatching to the queue is the expensive operation, but if you batch your CLs it's not that bad.

Building command lists in DX12 is cheap, very cheap. So don't worry about it, but if you have 20,000 elements to draw you may want to split that into a bunch of command lists (to get more parallelism), like 10 CLs of 2,000 elements each, but don't go with 2,000 CLs of 10 elements each ;)

I read the opposite... that building command lists is expensive while submitting them is relatively cheap. In addition, if you buffer your entire frame's command lists before submitting them, it will take more memory and potentially (most likely) lead to an idle GPU waiting for work to do.

edit - although making lists with ~100 draws each does reduce the number of submits.

Submitting to the command queue is not cheap! It's a kernel call after all. On the other side, calls on a command list are simple user-mode calls, and those are very cheap.

Having a whole frame buffered adds latency and yes, it takes memory, but it is the best way to keep the GPU fully busy.

In any case you should profile! You can always use GPUView to see how the CPU and GPU are working (among other tools), but my recommendation is to have a full frame of latency so the GPU can work at its full potential (unless perhaps you are working on VR, where low latency is much more important).

Edit: There is one issue with Intel drivers where resetting command lists/allocators is very expensive (and by very expensive I mean that it can take several ms each; I have cases of almost 5 ms to reset where it takes less than 1 ms to build the same CL). This doesn't affect all Intel GPUs (only Haswell/Broadwell) and it doesn't affect NVIDIA/AMD that much, but I do reset all my CLs in jobs running in parallel with other work.

The DirectX 12 samples were made to be easy to follow and sometimes do weird things, like not reusing identically recorded command lists. See: https://github.com/Microsoft/DirectX-Graphics-Samples/issues/58

I had a brief discussion about this with a few of my colleagues here at Microsoft before we launched the samples, and we decided that, since most real games are going to have different command list contents throughout their lifetimes, we would not write the samples in a way that was tailored to the specific static scenarios they target (i.e. caching the command lists).


Submitting to the command queue is not cheap! It's a kernel call after all. On the other side, calls on a command list are simple user-mode calls, and those are very cheap.

Have you profiled? While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.


Having a whole frame buffered adds latency and yes, it takes memory, but it is the best way to keep the GPU fully busy.

As long as you keep feeding the GPU command lists in a steady stream you should be fine, since every draw takes a while to complete (asynchronously) anyway.

-potential energy is easily made kinetic-


Submitting to the command queue is not cheap! It's a kernel call after all. On the other side, calls on a command list are simple user-mode calls, and those are very cheap.

Have you profiled? While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.

Yes, I did. And batching command lists (into three groups per frame: one for scene rendering pre-Scaleform, one for the GUI with Scaleform, and one after Scaleform for more GUI and patching of resource states) gave the best results on all the platforms I tested (GeForce 760, GeForce 980M, Radeon 290X, Intel Haswell HD 4600, Broadwell HD 6200 and Skylake HD 520).

I said two before because I forgot about Scaleform.

The first batch contains all the G-Buffer generation lists, shadow lists, the lighting pass (it's a light pre-pass renderer that we use on all of our platforms, including GL ES 2 for iPad), SSAO/HBAO+, the material pass (because it's LPP), deferred decals, transparent meshes/FX and post-process (FXAA/CMAA, DoF, etc.). Each CL is generated in a separate job running in parallel.

Then we render the GUI using Scaleform (which sends its own command lists).

And the third call renders the rest of the stuff and updates some resource states (like transitioning the back buffer from render target to present).

We have a two-frame latency with the GPU (I keep everything alive for two frames), and I can tell you that, looking at GPUView, I keep the GPU busy without bubbles, but this is after the TH2 update. Pre-TH2 there were issues with that (but it was related to DXGI and the presentation system, not DX12 itself).

