Mr_Fox

DX12: Is m_commandAllocator->Reset() necessary every frame?


I have been playing around with the MSFT DX12 samples for some time and reading through the MSDN documentation, and a lot of it still confuses me. One example is ID3D12CommandAllocator and ID3D12GraphicsCommandList.

If (as MSDN says) ID3D12CommandAllocator is the memory allocator for the GPU command buffer, and ID3D12GraphicsCommandList is the container object that holds a sequence of commands, why do we need to reset the allocator and rebuild the sequence of commands from scratch every frame (at least, that is what the MSFT samples do)? Especially when there are only [backbuffer count] distinct command lists.

I modified the D3D12HelloConstBuffers sample to create 3 command list objects from one allocator (the backbuffer count is 3). The only difference between these three command lists is the call

"m_commandList[frameidx]->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);"

which differs because rtvHandle is different for each of the 3 backbuffers.

Then in OnRender() I removed the allocator reset and the PopulateCommandList call. I just execute the command list I created earlier that matches the current backbuffer index.

Here is the relevant code:

// Fill the command list with all the render commands and dependent state.
void D3D12HelloConstBuffers::PopulateCommandList(int frameidx)
{
	// Command list allocators can only be reset when the associated 
	// command lists have finished execution on the GPU; apps should use 
	// fences to determine GPU execution progress.
	//ThrowIfFailed(m_commandAllocator->Reset());

	// However, when ExecuteCommandList() is called on a particular command 
	// list, that command list can then be reset at any time and must be before 
	// re-recording.
	ThrowIfFailed(m_commandList[frameidx]->Reset(m_commandAllocator.Get(), m_pipelineState.Get()));

	// Set necessary state.
	m_commandList[frameidx]->SetGraphicsRootSignature(m_rootSignature.Get());

	ID3D12DescriptorHeap* ppHeaps[] = { m_cbvHeap.Get() };
	m_commandList[frameidx]->SetDescriptorHeaps(_countof(ppHeaps), ppHeaps);

	m_commandList[frameidx]->SetGraphicsRootDescriptorTable(0, m_cbvHeap->GetGPUDescriptorHandleForHeapStart());
	m_commandList[frameidx]->RSSetViewports(1, &m_viewport);
	m_commandList[frameidx]->RSSetScissorRects(1, &m_scissorRect);

	// Indicate that the back buffer will be used as a render target.
	// (Named local: taking the address of a temporary is non-standard C++.)
	const auto toRenderTarget = CD3DX12_RESOURCE_BARRIER::Transition(m_renderTargets[frameidx].Get(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET);
	m_commandList[frameidx]->ResourceBarrier(1, &toRenderTarget);

	CD3DX12_CPU_DESCRIPTOR_HANDLE rtvHandle(m_rtvHeap->GetCPUDescriptorHandleForHeapStart(), frameidx, m_rtvDescriptorSize);
	m_commandList[frameidx]->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);

	// Record commands.
	const float clearColor[] = { 0.0f, 0.2f, 0.4f, 1.0f };
	m_commandList[frameidx]->ClearRenderTargetView(rtvHandle, clearColor, 0, nullptr);
	m_commandList[frameidx]->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	m_commandList[frameidx]->IASetVertexBuffers(0, 1, &m_vertexBufferView);
	m_commandList[frameidx]->DrawInstanced(3, 1, 0, 0);

	// Indicate that the back buffer will now be used to present.
	// (Named local: taking the address of a temporary is non-standard C++.)
	const auto toPresent = CD3DX12_RESOURCE_BARRIER::Transition(m_renderTargets[frameidx].Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT);
	m_commandList[frameidx]->ResourceBarrier(1, &toPresent);

	ThrowIfFailed(m_commandList[frameidx]->Close());
}

// Render the scene.
void D3D12HelloConstBuffers::OnRender()
{
	m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();
	// Execute the command list.
	ID3D12CommandList* ppCommandLists[] = { m_commandList[m_frameIndex].Get() };
	m_commandQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

	// Present the frame.
	ThrowIfFailed(m_swapChain->Present(1, 0));

	WaitForPreviousFrame();
}

And it turns out this works just like the original code.

So my question is: what is the point of spending CPU cycles re-recording the command list every frame, or even resetting the allocator every frame? Why not create a bunch of command lists ahead of time and execute the proper one in the render function?

Edited by Mr_Fox


And why not to create bunch of commandlists ahead of time, and use the proper one on render function?

Also remember that it is suggested that you have (IIRC) 100 draws per command list (12 for bundles). Visibility can change on a per-frame basis, so reusing command lists isn't really optimal.


I'd recommend resetting the command allocator as soon as you're able to do it (but not sooner!).

As mentioned, command allocators hold resources, so if your game has high resource usage, allocators that have not been reset put even more pressure on it.

So that typically means an allocator used at frame N will be reset by frame N + M at the latest (M being how many frames the CPU is ahead of what the GPU has completed).
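To make the "frame N is reset by frame N + M" rule concrete, here is a rough sketch in plain C++ with no D3D12 calls. `AllocatorRing`, `AllocatorSlot`, `canReset` and `beginFrame` are hypothetical names made up for this illustration; the fence is modeled as a plain integer that stands in for what `ID3D12Fence::GetCompletedValue()` would report in a real application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one command allocator: it only remembers the
// fence value that will be signaled once the GPU finishes the work that
// was recorded through it.
struct AllocatorSlot {
    std::uint64_t lastFenceValue = 0; // 0 means "never submitted"
    int resetCount = 0;
};

// Ring of M slots, where M is the number of frames the CPU may run ahead.
class AllocatorRing {
public:
    explicit AllocatorRing(std::size_t framesInFlight) : slots_(framesInFlight) {
        assert(framesInFlight > 0);
    }

    // True if the slot that `frame` maps to is no longer in use by the GPU,
    // i.e. the GPU has reached the fence value recorded at its last use.
    bool canReset(std::uint64_t frame, std::uint64_t completedFenceValue) const {
        return slotFor(frame).lastFenceValue <= completedFenceValue;
    }

    // Models "reset the allocator, record, submit, signal the fence".
    // A real application would wait on the fence instead of asserting.
    void beginFrame(std::uint64_t frame, std::uint64_t completedFenceValue,
                    std::uint64_t fenceValueForFrame) {
        assert(canReset(frame, completedFenceValue));
        AllocatorSlot& slot = slots_[frame % slots_.size()];
        ++slot.resetCount;
        slot.lastFenceValue = fenceValueForFrame;
    }

    int resetCount(std::uint64_t frame) const { return slotFor(frame).resetCount; }

private:
    const AllocatorSlot& slotFor(std::uint64_t frame) const {
        return slots_[frame % slots_.size()];
    }

    std::vector<AllocatorSlot> slots_;
};
```

With a ring of 2, frame 2 maps back to the slot used by frame 0, so it can only start once the fence value signaled after frame 0 has been reached.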

 

On the other hand, an allocator used for a bundle will not be reset until that bundle is no longer reused. If the same bundle is used for the whole lifetime of your application, the allocator will not be reset before the application ends (so you also have to be careful not to mix bundles and command lists that have very different lifecycles in the same allocator).

 

Also, about the cost of a reset: doing a reset per frame is NOT costly. Command allocators were designed to have low CPU overhead when recycling resources, which means that after a reset you can recycle allocations more quickly than you could without an allocator (which is what DX11 was doing). Allocator reuse is an integral part of reducing allocation cost over time.

 

Silly example: you have 8 threads building command lists. You could have, for example, one allocator per thread per in-flight frame. If there are two frames between a command being added to a command list and that command resulting in something on the screen (using a fence to guarantee that), then you will start with 16 command allocators. On frame N you use the first 8 allocators, then on frame N+1 you use the next 8. Before frame N+2 starts, you wait for the fence that signals the end of frame N, then you reset the first 8 allocators, and so on.
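The indexing in that example could look something like the sketch below. It is only an illustration in plain C++: `AllocatorPoolLayout`, `count` and `indexFor` are hypothetical names, and the flat "frame-major" layout is an assumption, not something the D3D12 API prescribes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout helper: one allocator per (in-flight frame, worker
// thread) pair, stored in a flat array, frame-major.
struct AllocatorPoolLayout {
    std::size_t threads;        // worker threads recording command lists
    std::size_t framesInFlight; // frames the CPU may run ahead of the GPU

    // Total number of allocators to create up front.
    std::size_t count() const { return threads * framesInFlight; }

    // Index of the allocator that `thread` should use while recording `frame`.
    std::size_t indexFor(std::size_t frame, std::size_t thread) const {
        assert(thread < threads);
        assert(framesInFlight > 0);
        return (frame % framesInFlight) * threads + thread;
    }
};
```

With 8 threads and 2 frames in flight this gives 16 allocators: frame N uses indices 0 to 7, frame N+1 uses 8 to 15, and frame N+2 wraps back to 0 to 7, which is why the fence for frame N has to be waited on first.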

Then on top of that you have bundles that you built for reuse. If you have bundles that will last for the whole level (for example!), you have a dedicated allocator for them (if you built them in parallel, you will have one dedicated allocator per thread). You then keep those allocators around until the level ends and those bundles are no longer used by any command list that is still in flight.

 

(we're using the term "frame" loosely here, it's any unit of GPU work whose completion you're keeping track of).


And why not to create bunch of commandlists ahead of time, and use the proper one on render function?

Also remember that it is suggested that you have (IIRC) 100 draws per command list. (12 for bundles)  Visibility can change on a per frame basis so reusing command lists isn't really optimal.

 

100 draw calls (and 12 for bundles) is the recommended minimum; the idea is to avoid small command lists with 1 or 2 commands. That said, in the game I work on we have some very small command lists (like post-process) with maybe 20 commands (if all the steps are on), and that is OK.

Just be sure to submit all your command lists in a single call to the queue (I think we have 2 per frame: one early with a lot of scene stuff and another later with the last scene steps, post-process, GUI, etc.). Submitting to the queue is the expensive operation, but if you batch your CLs it is not that bad.

Building command lists in DX12 is cheap, very cheap, so don't worry about it. But if you have 20000 elements to draw, you may want to split them into several lists (for more parallelism), like 10 CLs of 2000 elements each; just don't go to 2000 CLs with 10 elements each ;)
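As a rough illustration of that split, the helper below divides a number of draws evenly across a fixed number of command lists, so each recording job gets one contiguous range. `DrawRange` and `splitDraws` are hypothetical names invented for this sketch; it is plain C++ and does not touch D3D12.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the slice of draws one command list records.
struct DrawRange {
    std::size_t first; // index of the first draw in the slice
    std::size_t count; // number of draws in the slice
};

// Split `totalDraws` as evenly as possible across `listCount` command lists.
// If there are fewer draws than lists, fewer ranges are returned so that no
// empty command list gets recorded.
std::vector<DrawRange> splitDraws(std::size_t totalDraws, std::size_t listCount) {
    assert(listCount > 0);
    std::vector<DrawRange> ranges;
    const std::size_t base = totalDraws / listCount;
    const std::size_t extra = totalDraws % listCount; // first `extra` lists get one more
    std::size_t first = 0;
    for (std::size_t i = 0; i < listCount; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        if (count == 0) {
            break;
        }
        ranges.push_back(DrawRange{first, count});
        first += count;
    }
    return ranges;
}
```

For the numbers above, `splitDraws(20000, 10)` yields 10 ranges of 2000 draws each, and each range could then be handed to its own job and command list.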

Also, the command allocator never shrinks on reset. So no, it doesn't free anything, but you can reuse the memory. Think of it as a vector that can grow, but every time you call clear it just sets the size to 0 instead of releasing memory.
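The vector analogy can be shown directly with `std::vector`, whose `clear()` leaves `capacity()` unchanged. The `CommandMemory` class below is a hypothetical toy, not how any driver actually implements a command allocator; it only demonstrates the "size goes to 0, memory stays reserved" behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model of an allocator's backing memory.
class CommandMemory {
public:
    // Pretend to record `bytes` worth of commands.
    void record(std::size_t bytes) { storage_.resize(storage_.size() + bytes); }

    // Like Reset(): forget the contents but keep the memory reserved.
    void reset() { storage_.clear(); }

    std::size_t used() const { return storage_.size(); }
    std::size_t reserved() const { return storage_.capacity(); }

private:
    std::vector<unsigned char> storage_;
};
```

After recording 4096 bytes and calling `reset()`, `used()` is 0 while `reserved()` is the same as before, and a smaller recording afterwards reuses that memory without growing it.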

So my recommendation is to stick to a 1:1 mapping between CLs and command allocators, and try not to build CLs that are too big (500 draw calls is fine, 20000 is not) or too small, unless you need to (think again of post-process and the like).


Just be sure to dispatch all your command list in a single call to the queue (I think we have 2 per frame, one early with a lot of scene stuff and another later with the latest scene steps, post process, GUI, etc), dispatching to the queue is the expensive operation but if you batch your CL is not that bad.

Building command list in DX12 is cheap, very cheap. So don't worry about it, but if you have 20000 elements to draw, you may want to split that in a bunch of it (to have more parallelism) like in 10 CL of 2000 elements, but don't go 2000 CLs with 10 elements each ;)

I read the opposite... that building command lists is expensive while submitting them is relatively cheap. In addition, if you buffer your entire frame's command lists before submitting them, it will take more memory and potentially (most likely) leave the GPU idle, waiting for work to do.

edit - although making lists with 100 draws does reduce the number of submits.

Edited by Infinisearch



Just be sure to dispatch all your command list in a single call to the queue (I think we have 2 per frame, one early with a lot of scene stuff and another later with the latest scene steps, post process, GUI, etc), dispatching to the queue is the expensive operation but if you batch your CL is not that bad.

Building command list in DX12 is cheap, very cheap. So don't worry about it, but if you have 20000 elements to draw, you may want to split that in a bunch of it (to have more parallelism) like in 10 CL of 2000 elements, but don't go 2000 CLs with 10 elements each ;)

I read the opposite... that building commandlists is expensive while submitting them are relatively cheap.  In addition if you buffer your entire frames commandlists before submitting them it will take more memory and potentially(most likely) lead to an idle GPU waiting for work to do.

 

edit - although making lists with 100 draws does reduce the amount of submits.

 

Submitting to the command queue is not cheap! It is a kernel call, after all. On the other hand, calls on a command list are simple user-mode calls, and those are very cheap.

Having a whole frame buffered adds latency, and yes, it takes memory, but it is the best way to keep the GPU fully busy.

In any case, you should profile! You can always use GPUView to see how the CPU and GPU are working (among other tools), but my recommendation is to have a full frame of latency so the GPU can work at its full potential (unless perhaps you are working on VR, where low latency is much more important).

Edit: There is one issue with Intel drivers where resetting command lists/allocators is very expensive (by very expensive I mean that it can take several ms each; I have seen cases of almost 5 ms to reset where it takes less than 1 ms to build the same CL). This doesn't affect all Intel GPUs (only Haswell/Broadwell), and it doesn't affect NVIDIA/AMD that much, but I do reset all my CLs using jobs that run in parallel with other work.

Edited by Sergio J. de los Santos


The DirectX 12 samples were made to be easy to follow and sometimes do odd things like not reusing identically created command lists. See: https://github.com/Microsoft/DirectX-Graphics-Samples/issues/58

I had a brief discussion about this with a few of my colleagues here at Microsoft before we launched the samples, and we decided that, since most real games are going to have different command list contents throughout their lifetimes, we would not write the samples in a way that was tailored to the specific static scenarios that they target (i.e. caching the command lists).



Submit to the Command Queue, is not cheap! This is a kernel call after all. On the other side any call to Command list are simple user calls and those are very cheap. 

Have you profiled?  While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.

 


Having a whole frame buffered adds latency and yes it takes memory, but is the best way to have the GPU fully busy. 

As long as you keep feeding the GPU command lists in a steady stream, you should be fine, since every draw takes a while to complete anyway (asynchronously).



Submit to the Command Queue, is not cheap! This is a kernel call after all. On the other side any call to Command list are simple user calls and those are very cheap. 

Have you profiled?  While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.

Yes, I did. Batching command lists (in three groups per frame: one for scene rendering before Scaleform, the GUI with Scaleform, and one after Scaleform for more GUI and patching of resource states) gave the best results on all the platforms that I tested (GeForce 760, GeForce 980M, Radeon 290X, Intel Haswell HD 4600, Broadwell HD 6200 and Skylake HD 520).

I said two before because I forgot about Scaleform :P

The first batch contains all the G-buffer generation lists, shadow lists, the lighting pass (it is a light pre-pass renderer that we use on all of our platforms, including GL ES 2 on iPad), SSAO/HBAO+, the material pass (because it is LPP), deferred decals, transparent meshes/FX and post-process (FXAA/CMAA, DoF, etc.). Each CL is generated in a separate job running in parallel.

Then we render the GUI using Scaleform (which sends its own command lists).

The third call renders the rest of the stuff and updates some resource states (like transitioning the back buffer from render target to present).

We have a two-frame latency with the GPU (I keep everything alive for two frames), and I can tell you that, according to GPUView, I keep the GPU busy without bubbles, but this is after the TH2 update. Before TH2 there were issues with that (but they were related to DXGI and the presentation system, not DX12 itself).

Edited by Sergio J. de los Santos
