
DX12: Is m_commandAllocator->Reset() necessary every frame?


Recommended Posts

Having played around with the MSFT DX12 samples for some time and looked through the MSDN documentation, there is still a lot that confuses me. One such thing is ID3D12CommandAllocator and ID3D12GraphicsCommandList.

 

If (as MSDN says) ID3D12CommandAllocator is the memory allocator backing the GPU command buffer, and ID3D12GraphicsCommandList is the container object holding a sequence of commands, why do we need to reset the allocator and rebuild the sequence of commands from scratch every frame (at least the MSFT samples do)? Especially since there are only [backbuffer count] different command lists.

I modified the D3D12HelloConstBuffers sample to create 3 command list objects from one allocator (the backbuffer count is 3). The only difference between these three command lists is that

"m_commandList[frameidx]->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);"

is different, since rtvHandle differs for the 3 backbuffers.

Then in OnRender() I got rid of the allocator reset and the PopulateCommandList call; I just execute the right command list, created earlier, based on the current backbuffer index.

Here is the relevant code snippet:

// Fill the command list with all the render commands and dependent state.
void D3D12HelloConstBuffers::PopulateCommandList(int frameidx)
{
	// Command list allocators can only be reset when the associated 
	// command lists have finished execution on the GPU; apps should use 
	// fences to determine GPU execution progress.
	//ThrowIfFailed(m_commandAllocator->Reset());

	// However, when ExecuteCommandList() is called on a particular command 
	// list, that command list can then be reset at any time and must be before 
	// re-recording.
	ThrowIfFailed(m_commandList[frameidx]->Reset(m_commandAllocator.Get(), m_pipelineState.Get()));

	// Set necessary state.
	m_commandList[frameidx]->SetGraphicsRootSignature(m_rootSignature.Get());

	ID3D12DescriptorHeap* ppHeaps[] = { m_cbvHeap.Get() };
	m_commandList[frameidx]->SetDescriptorHeaps(_countof(ppHeaps), ppHeaps);

	m_commandList[frameidx]->SetGraphicsRootDescriptorTable(0, m_cbvHeap->GetGPUDescriptorHandleForHeapStart());
	m_commandList[frameidx]->RSSetViewports(1, &m_viewport);
	m_commandList[frameidx]->RSSetScissorRects(1, &m_scissorRect);

	// Indicate that the back buffer will be used as a render target.
	m_commandList[frameidx]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_renderTargets[frameidx].Get(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET));

	CD3DX12_CPU_DESCRIPTOR_HANDLE rtvHandle(m_rtvHeap->GetCPUDescriptorHandleForHeapStart(), frameidx, m_rtvDescriptorSize);
	m_commandList[frameidx]->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);

	// Record commands.
	const float clearColor[] = { 0.0f, 0.2f, 0.4f, 1.0f };
	m_commandList[frameidx]->ClearRenderTargetView(rtvHandle, clearColor, 0, nullptr);
	m_commandList[frameidx]->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	m_commandList[frameidx]->IASetVertexBuffers(0, 1, &m_vertexBufferView);
	m_commandList[frameidx]->DrawInstanced(3, 1, 0, 0);

	// Indicate that the back buffer will now be used to present.
	m_commandList[frameidx]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_renderTargets[frameidx].Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT));

	ThrowIfFailed(m_commandList[frameidx]->Close());
}

// Render the scene.
void D3D12HelloConstBuffers::OnRender()
{
	m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();
	// Execute the command list.
	ID3D12CommandList* ppCommandLists[] = { m_commandList[m_frameIndex].Get() };
	m_commandQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

	// Present the frame.
	ThrowIfFailed(m_swapChain->Present(1, 0));

	WaitForPreviousFrame();
}

And it turns out this works just like the original code.

 

So my question is: what is the point of spending all those CPU cycles recreating the command lists every frame, or even resetting the allocator every frame? Why not create a bunch of command lists ahead of time and use the proper one in the render function?

Edited by Mr_Fox



Why not create a bunch of command lists ahead of time and use the proper one in the render function?

Also remember that it is suggested that you have (IIRC) 100 draws per command list (12 for bundles). Visibility can change on a per-frame basis, so reusing command lists isn't really optimal.


I'd recommend resetting the command allocator as soon as you're able to (but not sooner!).

 

As mentioned, command allocators hold on to resources, so if your game has high resource usage that puts more pressure on them.

 

So that typically means an allocator used at frame N will be reset at frame N + M at the earliest (M being how many frames your CPU is ahead of what your GPU has completed).

 

On the other hand, an allocator used for a bundle will not be reset until that bundle stops being reused. If the same bundle is used for the whole lifetime of your application, that means its allocator will not be reset before the application ends (so you also have to be careful not to mix bundles and command lists that have very different lifecycles on one allocator).

 

Also, about the cost of doing a reset: a reset per frame is NOT costly. Command allocators were designed to have low CPU overhead when recycling resources, which means you can reset one and recycle allocations more quickly than without an allocator (which is what DX11 was doing). Allocator reuse is an integral part of reducing allocation cost over time.

 

Silly example: you have 8 threads building command lists. You could have, for example, one allocator per thread per in-flight frame. If there are two frames between a command being added to a command list and that command resulting in something on screen (using a fence to guarantee that), then you start with 16 command allocators. On frame N you use the first 8 allocators, on frame N+1 you use the next 8, then before frame N+2 starts you wait for the fence that signals the end of frame N, reset the first 8 allocators, and so on. A minimal sketch of this ring is shown below.
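For concreteness, here is a minimal sketch of that ring, assuming a single recording thread and two frames in flight (kFramesInFlight, m_frameFenceValues and m_nextFenceValue are illustrative names, not from the samples):

static const UINT kFramesInFlight = 2;

ComPtr<ID3D12CommandAllocator>    m_allocators[kFramesInFlight];
ComPtr<ID3D12GraphicsCommandList> m_commandList;   // one list, re-recorded each frame
ComPtr<ID3D12Fence>               m_fence;
UINT64 m_frameFenceValues[kFramesInFlight] = {};
UINT64 m_nextFenceValue = 1;
HANDLE m_fenceEvent;

void RecordFrame(UINT slot) // slot = frameNumber % kFramesInFlight
{
	// Block until the GPU has finished the frame that last used this allocator.
	if (m_fence->GetCompletedValue() < m_frameFenceValues[slot])
	{
		ThrowIfFailed(m_fence->SetEventOnCompletion(m_frameFenceValues[slot], m_fenceEvent));
		WaitForSingleObject(m_fenceEvent, INFINITE);
	}

	// Now it is safe to recycle the allocator's memory.
	ThrowIfFailed(m_allocators[slot]->Reset());
	ThrowIfFailed(m_commandList->Reset(m_allocators[slot].Get(), m_pipelineState.Get()));

	// ... record commands, Close(), ExecuteCommandLists() ...

	// Signal after submission so we know when this allocator can be recycled.
	ThrowIfFailed(m_commandQueue->Signal(m_fence.Get(), m_nextFenceValue));
	m_frameFenceValues[slot] = m_nextFenceValue++;
}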

Then on top of that you have bundles built for reuse. If you have bundles that will last for the whole level (for example!), you have a dedicated allocator for them (if you build them in parallel, you have one dedicated allocator per thread). You keep those allocators around until the level ends and the bundles are no longer used by any command list still in flight. A sketch of such a dedicated bundle allocator follows.
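A sketch of the dedicated bundle allocator, with the same caveat that the names are illustrative (a bundle is recorded once and replayed from a direct list):

ComPtr<ID3D12CommandAllocator>    bundleAllocator;
ComPtr<ID3D12GraphicsCommandList> bundle;

ThrowIfFailed(device->CreateCommandAllocator(
	D3D12_COMMAND_LIST_TYPE_BUNDLE, IID_PPV_ARGS(&bundleAllocator)));
ThrowIfFailed(device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_BUNDLE,
	bundleAllocator.Get(), m_pipelineState.Get(), IID_PPV_ARGS(&bundle)));

// Record once, replay every frame.
bundle->SetGraphicsRootSignature(m_rootSignature.Get());
bundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
bundle->IASetVertexBuffers(0, 1, &m_vertexBufferView);
bundle->DrawInstanced(3, 1, 0, 0);
ThrowIfFailed(bundle->Close());

// Per frame, on the direct command list:
m_commandList->ExecuteBundle(bundle.Get());

// bundleAllocator->Reset() only happens once the bundle is retired and no
// in-flight command list references it anymore.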

 

(We're using the term "frame" loosely here; it's any unit of GPU work whose completion you're keeping track of.)


 


Why not create a bunch of command lists ahead of time and use the proper one in the render function?

Also remember that it is suggested that you have (IIRC) 100 draws per command list (12 for bundles). Visibility can change on a per-frame basis, so reusing command lists isn't really optimal.

 

100 draw calls (and 12 for bundles) is the minimum recommended; the idea is to avoid small command lists with 1 or 2 commands. That being said, in the game I work on we have some very small command lists (like post-process) that have maybe 20 commands (if all the steps are on), and that is OK.

Just be sure to dispatch all your command lists in as few calls to the queue as possible (I think we have 2 per frame: one early with a lot of scene stuff and another later with the last scene steps, post-process, GUI, etc.). Dispatching to the queue is the expensive operation, but if you batch your CLs it's not that bad (see the sketch below).

Building command lists in DX12 is cheap, very cheap, so don't worry about it. But if you have 20,000 elements to draw, you may want to split that into a bunch of lists (to get more parallelism), like 10 CLs of 2,000 elements each; just don't go to 2,000 CLs with 10 elements each ;)
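What such a batched submission might look like (the list names are made up for illustration):

ID3D12CommandList* sceneLists[] =
{
	gbufferList.Get(), shadowList.Get(), lightingList.Get(), postProcessList.Get()
};
// One call into the kernel for four lists instead of four separate submits.
m_commandQueue->ExecuteCommandLists(_countof(sceneLists), sceneLists);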

Also, a command allocator never shrinks on reset. So no, it doesn't free anything, but you can reuse the memory. Think of it as a vector that can grow, but where clear() just sets the size to 0 instead of releasing memory.
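In plain C++ terms, the analogy looks like this:

#include <vector>

int main()
{
	std::vector<int> v;
	v.reserve(1024);   // allocates once
	v.clear();         // size() == 0, but the capacity is kept for reuse,
	                   // just like an allocator's memory across Reset()
	return 0;
}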

So my recommendation is to stick to a 1:1 pairing of CLs and command allocators, and try not to build CLs that are too big (500 draw calls is fine, 20,000 is not) nor too small, unless you need it (think again of post-process and the like).


Just be sure to dispatch all your command lists in as few calls to the queue as possible (I think we have 2 per frame: one early with a lot of scene stuff and another later with the last scene steps, post-process, GUI, etc.). Dispatching to the queue is the expensive operation, but if you batch your CLs it's not that bad.

Building command lists in DX12 is cheap, very cheap, so don't worry about it. But if you have 20,000 elements to draw, you may want to split that into a bunch of lists (to get more parallelism), like 10 CLs of 2,000 elements each; just don't go to 2,000 CLs with 10 elements each ;)

I read the opposite: that building command lists is expensive while submitting them is relatively cheap. In addition, if you buffer your entire frame's command lists before submitting them, it will take more memory and potentially (most likely) lead to an idle GPU waiting for work to do.

 

edit - although making lists with 100 draws does reduce the number of submits.

Edited by Infinisearch


 


I read the opposite: that building command lists is expensive while submitting them is relatively cheap. In addition, if you buffer your entire frame's command lists before submitting them, it will take more memory and potentially (most likely) lead to an idle GPU waiting for work to do.

edit - although making lists with 100 draws does reduce the number of submits.

 

Submitting to the command queue is not cheap! It is a kernel call, after all. On the other side, calls on a command list are simple user-mode calls, and those are very cheap.

 

Having a whole frame buffered adds latency, and yes it takes memory, but it is the best way to keep the GPU fully busy.

 

In any case, you should profile! You can always use GPUView to see how the CPU and GPU are working (among other tools), but my recommendation is to have a full frame of latency so the GPU can work at its full potential (unless perhaps you are working on VR, where low latency is much more important).

 

Edit: there is one issue with Intel drivers where resetting command lists/allocators is very expensive (and by very expensive I mean it can take several ms each; I have cases of almost 5 ms to reset where it takes less than 1 ms to build the same CL). This doesn't affect all Intel GPUs (only Haswell/Broadwell), and it doesn't affect NVIDIA/AMD that much, but I do reset all my CLs in jobs running in parallel to other work (see the sketch below).
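One way to push those resets off the critical path, sketched here with std::async standing in for a real job system (assuming the 1:1 list/allocator pairing recommended above, with each pair touched by exactly one job):

#include <future>
#include <vector>

// Inside the renderer, before recording the next frame:
std::vector<std::future<void>> resetJobs;
for (size_t i = 0; i < m_commandLists.size(); ++i)
{
	resetJobs.push_back(std::async(std::launch::async, [this, i]
	{
		ThrowIfFailed(m_commandAllocators[i]->Reset());
		ThrowIfFailed(m_commandLists[i]->Reset(m_commandAllocators[i].Get(), nullptr));
	}));
}
for (auto& job : resetJobs)
	job.get(); // join before any recording begins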

Edited by Sergio J. de los Santos


The DirectX 12 samples were made to be easy to follow, and they sometimes do weird things like not reusing identically created command lists. See: https://github.com/Microsoft/DirectX-Graphics-Samples/issues/58

 

I had a brief discussion about this with a few of my colleagues here at Microsoft before we launched the samples, and we decided that since most real games are going to have different command list contents throughout their lifetimes, we would not write the samples in a way that was tailored to the specific static scenarios they target (i.e. caching the command lists).



Submitting to the command queue is not cheap! It is a kernel call, after all. On the other side, calls on a command list are simple user-mode calls, and those are very cheap.

Have you profiled? While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.

 


Having a whole frame buffered adds latency, and yes it takes memory, but it is the best way to keep the GPU fully busy.

As long as you keep feeding the GPU command lists in a steady stream, you should be fine, since every draw takes a while to complete anyway (asynchronously).


 


Have you profiled? While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.

Yes, I did. And batching command lists (in three groups per frame: one for scene rendering pre-Scaleform, GUI with Scaleform, and one after Scaleform for more GUI and patching of resource states) gave the best results on all the platforms I tested (GeForce 760, GeForce 980M, Radeon 290X, Intel Haswell HD 4600, Broadwell HD 6200 and Skylake HD 520).

I said two before because I forgot about Scaleform.

 

The first batch contains all the GBuffer generation lists, shadow lists, the lighting pass (it's a light pre-pass renderer that we use on all of our platforms, including GL ES 2 for iPad), SSAO/HBAO+, the material pass (because it's LPP), deferred decals, transparent meshes/FX and post-process (FXAA/CMAA, DoF, etc.). Each CL is generated in a separate job running in parallel.

 

Then we render the GUI using Scaleform (which sends its own command lists).

 

And the third call renders the rest of the stuff and updates some resource states (like transitioning the back buffer from render target to present).

 

We have a two-frame latency with the GPU (I keep everything alive for two frames), and I can tell you from GPUView that I keep the GPU busy without bubbles, but this is after the TH2 update. Pre-TH2 there were issues with that (though they were related to DXGI and the presentation system, not DX12 itself).

Edited by Sergio J. de los Santos
