### Similar Content

• While working on a project using D3D12 I was getting an exception being thrown while trying to get a D3D12_CPU_DESCRIPTOR_HANDLE. The project is using plain C so it uses the COBJMACROS. The following application replicates the problem happening in the project.
#define COBJMACROS
#pragma warning(push, 3)
#include <Windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#pragma warning(pop)

IDXGIFactory4 *factory;
ID3D12Device *device;
ID3D12DescriptorHeap *rtv_heap;

int WINAPI wWinMain(HINSTANCE hinst, HINSTANCE pinst, PWSTR cline, int cshow)
{
    (hinst), (pinst), (cline), (cshow);
    HRESULT hr = CreateDXGIFactory1(&IID_IDXGIFactory4, (void **)&factory);
    hr = D3D12CreateDevice(0, D3D_FEATURE_LEVEL_11_0, &IID_ID3D12Device, &device);
    D3D12_DESCRIPTOR_HEAP_DESC desc;
    desc.NumDescriptors = 1;
    desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_RTV;
    desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_NONE;
    desc.NodeMask = 0;
    hr = ID3D12Device_CreateDescriptorHeap(device, &desc, &IID_ID3D12DescriptorHeap, (void **)&rtv_heap);
    D3D12_CPU_DESCRIPTOR_HANDLE rtv = ID3D12DescriptorHeap_GetCPUDescriptorHandleForHeapStart(rtv_heap);
    (rtv);
}

The call to ID3D12DescriptorHeap_GetCPUDescriptorHandleForHeapStart throws an exception. Stepping into the disassembly for ID3D12DescriptorHeap_GetCPUDescriptorHandleForHeapStart shows that the error occurs on the instruction
mov  qword ptr [rdx],rax
which seems odd since rdx doesn't appear to be used. Any help would be greatly appreciated. Thank you.

• By lubbe75
As far as I understand there is no real random or noise function in HLSL.
I have a big water polygon, and I'd like to fake water wave normals in my pixel shader. I know it's not efficient and the standard way is really to use a pre-calculated noise texture, but anyway...
Does anyone have any quick and dirty HLSL shader code that fakes water normals, and that doesn't look too repetitious?

• Hi,
I finally managed to get the Vulkan device that emulates DX11 working, but everything is flipped vertically now because Vulkan has a different clip space. What are the best practices out there to keep these implementations consistent? I tried using a vertically flipped viewport, and while it works on an Nvidia 1050, the Vulkan debug layer is throwing error messages that this is not supported in the spec, so it might not work on others. There is also the possibility to flip the clip space position Y coordinate before writing it out in the vertex shader, but that requires changing and recompiling every shader. I could also bake it into the camera projection matrices, though I want to avoid that because then I need to track down where I upload matrices across the whole engine... Any chance of an easy extension or something? If not, I will probably go with changing the vertex shaders.
• By NikiTo
Some people say "discard" has no positive effect on optimization. Other people say it will at least spare the texture fetches.

if (color.A < 0.1f)
{
    //discard;
    clip(-1);
}
// tons of reads of textures following here
// and loops too

Some people say that "discard" will only mask out the output of the pixel shader, while still evaluating all the statements after the "discard" instruction.

MSN>
discard: Do not output the result of the current pixel.
<MSN

As usual it is unclear, but it suggests that "clip" could discard the whole pixel (maybe stopping execution too).

I think that, at least for thermal and energy consumption reasons, the GPU should not evaluate the statements after "discard", but some people on the internet say that the GPU computes the statements anyway. What I am more worried about are the texture fetches after discard/clip.

(what if after discard, I have an expensive branch decision that makes the approved cheap branch neighbor pixels stall for nothing? this is crazy)
• By NikiTo
I have a problem. My shaders are huge, in the sense that they have a lot of code inside. Many of my pixels should be completely discarded. I could put a comparison and a discard at the very beginning of the shader, but as far as I understand, the discard statement does not save any workload at all, as the pixel has to stall until the long, huge neighboring shader invocations complete.
Initially I wanted to use the stencil to discard pixels before the execution flow enters the shader, even before the GPU distributes/allocates resources for this shader, avoiding stalls in the pixel shader execution flow. I assumed that Depth/Stencil discards pixels before the pixel shader, but I see now that it happens in the very last stage, the Output Merger. It seems extremely inefficient to render a little mirror in a scene with a big viewport that way. Why did they put the stencil test in the Output Merger anyway? Handling of the stencil is so limited compared to other resources. Do people use stencil functionality at all for games, or do they prefer discard/clip?

Will the GPU stall the pixel if I issue a discard at the very beginning of the pixel shader, or will the GPU already start using the freed-up resources to render another pixel?

# DX12 [D3D12] Is ExecuteCommandLists asynchronous, does Present stall, how does signalling work?


## Recommended Posts

Hello,

I keep reading about DX12 but still have some problems understanding the basic concepts. I'm an active reader of this section of the forum but I can still miss something. Please redirect me to the correct place if my questions are duplicates. So, here's an MSDN Hello World example.

1. The method OnRender(). We populated a command list and executed it with the ExecuteCommandLists() method. The documentation says: "Submits an array of command lists for execution." Not a lot. I bet the commands aren't sent immediately, right? What happens while the command list travels from the CPU to the GPU, a stall?
2. Next, the method Present(). But what is happening here? A CPU stall again? What if it is called when the commands are not finished and there's nothing to present?
3. Next we're synchronizing by first calling ID3D12CommandQueue::Signal(). Is it possible that this command executes before ExecuteCommandLists() (if ExecuteCommandLists() is asynchronous), so that the signal happens before or in the middle of the supplied commands?
4. Next we're waiting:
if (m_fence->GetCompletedValue() < fence)
{
    ThrowIfFailed(m_fence->SetEventOnCompletion(fence, m_fenceEvent));
    WaitForSingleObject(m_fenceEvent, INFINITE);
}


What if the fence is updated immediately after GetCompletedValue() but before SetEventOnCompletion()? Isn't it a data race?

Edited by nikitablack

##### Share on other sites
• The method OnRender(). We populated a command list and executed it with the ExecuteCommandLists() method. The documentation says: "Submits an array of command lists for execution." Not a lot. I bet the commands aren't sent immediately, right? What happens while the command list travels from the CPU to the GPU, a stall?
It is asynchronous: the call is submitted on the CPU timeline and returns without waiting for the GPU timeline to complete the task. The commands do get sent right away, though (the call doesn't wait for anything). Your program lives and executes in user land, and the Windows DDI does not (currently) allow commands to be submitted directly from user land.
Because of that, submitting commands to the GPU triggers a user/kernel transition, which is a bit too expensive to do at every draw call.

Once upon a time, because of this user land limitation, commands would be batched by the runtime and driver and submitted all at once at unpredictable times (not immediately). Some commands would force this submission to happen, but that was not how commands were typically submitted.

Now with DX12 YOU control the rate of submission, and execute calls are submitted immediately. So you are the one making the judgement call to build a batch of commands (through command lists) big enough not to trigger the user/kernel transition too often.
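
To make the batching trade-off concrete, here is a minimal CPU-only sketch. It does not call D3D12 at all: `FakeCommandList`, `CountingQueue`, and the helper functions are hypothetical names, and the model assumes that every submit call costs exactly one user/kernel transition, which is a simplification of what the real driver stack does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a recorded command list; only its size matters here.
struct FakeCommandList {
    int drawCalls;
};

// Hypothetical queue that counts submissions. Assumption: each Execute call
// costs exactly one user/kernel transition, however many lists it carries.
class CountingQueue {
public:
    void Execute(const std::vector<FakeCommandList>& lists) {
        ++transitions_;
        for (const FakeCommandList& list : lists) {
            drawCalls_ += list.drawCalls;
        }
    }
    int transitions() const { return transitions_; }
    int drawCalls() const { return drawCalls_; }

private:
    int transitions_ = 0;
    int drawCalls_ = 0;
};

// Builds `count` lists of 100 draw calls each (arbitrary illustrative size).
std::vector<FakeCommandList> MakeLists(int count) {
    return std::vector<FakeCommandList>(count, FakeCommandList{100});
}

// One submit call per list: the transition count grows with the list count.
int TransitionsWhenSubmittedOneByOne(const std::vector<FakeCommandList>& lists) {
    CountingQueue queue;
    for (const FakeCommandList& list : lists) {
        queue.Execute(std::vector<FakeCommandList>{list});
    }
    return queue.transitions();
}

// All lists in a single submit call: one transition for the whole batch.
int TransitionsWhenBatched(const std::vector<FakeCommandList>& lists) {
    CountingQueue queue;
    queue.Execute(lists);
    return queue.transitions();
}
```

Under this model, four lists submitted separately cost four transitions, while the same four lists handed over in one call cost one. The same amount of GPU work is queued either way.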

But will the GPU see the command immediately after the submission? Well, it depends. If there's nothing in the pipe being rendered, then that command list could be seen immediately by the GPU. If there is still work to be executed, then it will be put into a queue for execution.

There's actually a higher priority queue that will take up any new work that is posted there before looking at the other work posted by normal apps.
(as an application writer you should not worry about that detail).

• Next, the method Present(). But what is happening here? A CPU stall again? What if it is called when the commands are not finished and there's nothing to present?

Present will stall, but only if you hit the render limit you set (or the one set by the API). Typically, by default, three frames of GPU work can be submitted before a Present() call will stall. This prevents the CPU from running further ahead than is practical. You can control that limit, and sometimes it is encouraged to do so to reduce latency (the time it takes for an input to be taken into consideration and have a visible effect for the end user on their monitor).

That stall does not need to consume CPU power (the thread can be paused and then resumed at the next vblank), but your app will be stuck in that thread during that time (which may or may not be okay, it's up to you).
Because it doesn't consume CPU power, your OS/CPU can either run another thread that still has work to do, or go into an idle mode that does not consume as much electricity.
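
The frame limit described above can be sketched with a small CPU-only model. This is not the DXGI swap chain API: `FakeSwapChain`, `TryPresent`, `RetireFrame`, and the helper functions are hypothetical names, and the limit of three simply mirrors the default mentioned in this reply.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of a swap chain with a cap on queued frames.
class FakeSwapChain {
public:
    explicit FakeSwapChain(int maxQueuedFrames) : maxQueued_(maxQueuedFrames) {}

    // Non-blocking probe: queues a frame and returns true if there is room,
    // returns false in the situation where a blocking Present would wait.
    bool TryPresent() {
        std::lock_guard<std::mutex> lock(m_);
        if (queued_ >= maxQueued_) return false;
        ++queued_;
        return true;
    }

    // Blocking variant: the thread sleeps on a condition variable instead of
    // spinning, so the wait itself does not burn CPU time.
    void Present() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return queued_ < maxQueued_; });
        ++queued_;
    }

    // Called when the simulated GPU finishes displaying a frame.
    void RetireFrame() {
        {
            std::lock_guard<std::mutex> lock(m_);
            if (queued_ > 0) --queued_;
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int maxQueued_;
    int queued_ = 0;
};

// Counts how many frames can be queued before Present would have to stall.
int PresentsBeforeStall(int maxQueuedFrames) {
    FakeSwapChain swapChain(maxQueuedFrames);
    int count = 0;
    while (swapChain.TryPresent()) ++count;
    return count;
}

// Fills the only slot, then shows that a blocked Present resumes once the
// simulated GPU retires a frame.
bool PresentResumesAfterRetire() {
    FakeSwapChain swapChain(1);
    swapChain.Present();
    std::thread gpu([&swapChain] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        swapChain.RetireFrame();
    });
    swapChain.Present();  // waits here until RetireFrame runs
    gpu.join();
    return !swapChain.TryPresent();  // the slot is occupied again
}
```

With a limit of three, `PresentsBeforeStall(3)` queues three frames and the fourth attempt is the one that would wait. Lowering the limit shortens the queue, which is the latency lever this reply refers to.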

• Next we're synchronizing by first calling ID3D12CommandQueue::Signal(). Is it possible that this command executes before ExecuteCommandLists() (if ExecuteCommandLists() is asynchronous), so that the signal happens before or in the middle of the supplied commands?

There's the notion of API order, and there are multiple timelines. Within a given timeline, things are ordered in the order they are submitted to that timeline.
If in your timeline you submitted the Signal() AFTER the Execute(), then you are guaranteed that the Execute() is all done when you receive the message that the Signal() has completed.

This is really important, as you're using fences before you recycle, reset, or destroy resources, and you can't have them still in use by the GPU when you do.
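
The ordering guarantee can be illustrated with a CPU-only simulation in which a single worker thread plays the role of the GPU timeline. None of this is the D3D12 API: `OrderedQueue`, its methods, and `RunOrderingDemo` are hypothetical names, and a real queue is of course far more involved.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a command queue: one worker thread drains
// submitted items strictly in submission order.
class OrderedQueue {
public:
    OrderedQueue() : worker_([this] { Run(); }) {}
    ~OrderedQueue() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    // Analogous to ExecuteCommandLists: returns at once, work runs later.
    void Execute(std::function<void()> work) { Push(std::move(work)); }
    // Analogous to Signal: queued behind earlier work, then updates the fence.
    void Signal(std::uint64_t value) {
        Push([this, value] {
            { std::lock_guard<std::mutex> lock(m_); completed_ = value; }
            cv_.notify_all();
        });
    }
    // CPU-side wait until the simulated fence reaches `value`.
    void WaitFor(std::uint64_t value) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this, value] { return completed_ >= value; });
    }

private:
    void Push(std::function<void()> item) {
        { std::lock_guard<std::mutex> lock(m_); items_.push_back(std::move(item)); }
        cv_.notify_all();
    }
    void Run() {
        for (;;) {
            std::function<void()> item;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !items_.empty(); });
                if (items_.empty()) return;  // stop requested, nothing left
                item = std::move(items_.front());
                items_.pop_front();
            }
            item();  // run outside the lock, in submission order
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> items_;
    std::uint64_t completed_ = 0;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

// Submits slow work, then a signal, then waits for the signal on the CPU.
int RunOrderingDemo() {
    int result = 0;
    OrderedQueue queue;
    queue.Execute([&result] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        result = 42;
    });
    queue.Signal(1);
    queue.WaitFor(1);
    return result;  // the work queued before the signal has finished
}
```

Because the signal is queued behind the slow work item, `WaitFor(1)` cannot return before that item has finished, so `RunOrderingDemo()` observes the value written by the earlier work.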

• Next we're waiting:
if (m_fence->GetCompletedValue() < fence)
{
    ThrowIfFailed(m_fence->SetEventOnCompletion(fence, m_fenceEvent));
    WaitForSingleObject(m_fenceEvent, INFINITE);
}


• What if the fence is updated immediately after GetCompletedValue() but before SetEventOnCompletion()? Isn't it a data race?

It's not a race condition: if the fence reaches the value after the if() check but before SetEventOnCompletion() is called, the event is signaled right away and WaitForSingleObject() simply returns immediately.
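
As a rough illustration of why the late update is harmless, here is a CPU-only sketch. `FakeEvent`, `FakeFence`, and `WaitSurvivesLateSignal` are hypothetical names, not D3D12 or Win32 types, and the sketch assumes what this reply describes: registering an event for a value the fence has already reached sets the event at once.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Win32 auto-reset event.
class FakeEvent {
public:
    void Set() {
        { std::lock_guard<std::mutex> lock(m_); set_ = true; }
        cv_.notify_all();
    }
    // Returns once the event has been set, then resets it.
    void Wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return set_; });
        set_ = false;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool set_ = false;
};

// Hypothetical fence. Assumption: asking for an already reached value sets
// the event immediately instead of registering a waiter.
class FakeFence {
public:
    std::uint64_t GetCompletedValue() {
        std::lock_guard<std::mutex> lock(m_);
        return completed_;
    }
    void SetEventOnCompletion(std::uint64_t value, FakeEvent& event) {
        {
            std::lock_guard<std::mutex> lock(m_);
            if (completed_ < value) {
                waiters_.emplace_back(value, &event);
                return;
            }
        }
        event.Set();  // value already reached: signal right away
    }
    void Signal(std::uint64_t value) {
        std::vector<FakeEvent*> ready;
        {
            std::lock_guard<std::mutex> lock(m_);
            completed_ = value;
            for (auto it = waiters_.begin(); it != waiters_.end();) {
                if (it->first <= value) {
                    ready.push_back(it->second);
                    it = waiters_.erase(it);
                } else {
                    ++it;
                }
            }
        }
        for (FakeEvent* event : ready) event->Set();
    }

private:
    std::mutex m_;
    std::uint64_t completed_ = 0;
    std::vector<std::pair<std::uint64_t, FakeEvent*>> waiters_;
};

// Forces the worst case: the fence completes after the check but before the
// event is registered. The wait still returns instead of hanging.
bool WaitSurvivesLateSignal() {
    FakeFence fence;
    FakeEvent event;
    const std::uint64_t target = 1;
    if (fence.GetCompletedValue() < target) {
        fence.Signal(target);  // the "GPU" finishes exactly here
        fence.SetEventOnCompletion(target, event);
        event.Wait();
    }
    return fence.GetCompletedValue() == target;
}
```

If registering for an already reached value did not set the event, `event.Wait()` in this sketch would block forever. The check-then-register pattern relies on that behavior.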

This code is functionally equivalent to the one you posted:
// Signal and increment the fence value.
const UINT64 fenceToWaitFor = m_fenceValue;
ThrowIfFailed(m_commandQueue->Signal(m_fence.Get(), fenceToWaitFor));
m_fenceValue++;

// Wait until the fence is completed.
ThrowIfFailed(m_fence->SetEventOnCompletion(fenceToWaitFor, m_fenceEvent));
WaitForSingleObject(m_fenceEvent, INFINITE);
But it doesn't do a quick early check for fenceToWaitFor; as a consequence, it will set the event every time (you can see the early check as a mini-optimization that doesn't change the meaning of the code).

Edited by LeGreg

##### Share on other sites

Thank you guys. Now it's clear.

Recently I read Intel's article and it helped me a lot to understand swap chains. Highly recommend.