### Similar Content

• While working on a project using D3D12, I was getting an exception while trying to get a D3D12_CPU_DESCRIPTOR_HANDLE. The project is in plain C, so it uses COBJMACROS. The following application reproduces the problem.

    #define COBJMACROS
    #pragma warning(push, 3)
    #include <Windows.h>
    #include <d3d12.h>
    #include <dxgi1_4.h>
    #pragma warning(pop)

    IDXGIFactory4 *factory;
    ID3D12Device *device;
    ID3D12DescriptorHeap *rtv_heap;

    int WINAPI wWinMain(HINSTANCE hinst, HINSTANCE pinst, PWSTR cline, int cshow)
    {
        (hinst), (pinst), (cline), (cshow);

        HRESULT hr = CreateDXGIFactory1(&IID_IDXGIFactory4, (void **)&factory);
        hr = D3D12CreateDevice(0, D3D_FEATURE_LEVEL_11_0, &IID_ID3D12Device, &device);

        D3D12_DESCRIPTOR_HEAP_DESC desc;
        desc.NumDescriptors = 1;
        desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_RTV;
        desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_NONE;
        desc.NodeMask = 0;
        hr = ID3D12Device_CreateDescriptorHeap(device, &desc, &IID_ID3D12DescriptorHeap, (void **)&rtv_heap);

        D3D12_CPU_DESCRIPTOR_HANDLE rtv = ID3D12DescriptorHeap_GetCPUDescriptorHandleForHeapStart(rtv_heap);
        (rtv);
    }

The call to ID3D12DescriptorHeap_GetCPUDescriptorHandleForHeapStart throws an exception. Stepping into the disassembly for ID3D12DescriptorHeap_GetCPUDescriptorHandleForHeapStart shows that the error occurs on the instruction

    mov  qword ptr [rdx],rax

which seems odd, since rdx doesn't appear to be used. Any help would be greatly appreciated. Thank you.

• By lubbe75
As far as I understand there is no real random or noise function in HLSL.
I have a big water polygon, and I'd like to fake water wave normals in my pixel shader. I know it's not efficient and the standard way is really to use a pre-calculated noise texture, but anyway...
Does anyone have any quick and dirty HLSL shader code that fakes water normals, and that doesn't look too repetitious?

• Hi,
I finally managed to get the DX11-emulating Vulkan device working, but everything is flipped vertically now because Vulkan has a different clip space. What are the best practices for keeping these implementations consistent? I tried using a vertically flipped viewport, and while it works on an Nvidia 1050, the Vulkan debug layer reports errors saying this is not supported by the spec, so it might not work on other hardware. There is also the possibility of flipping the clip space position Y coordinate before writing it out in the vertex shader, but that requires changing and recompiling every shader. I could also bake it into the camera projection matrices, though I want to avoid that because then I would need to track down every place in the engine where I upload matrices... Any chance of an easy extension or something? If not, I will probably go with changing the vertex shaders.
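
The projection-matrix option mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: `Mat4`, `transform` and `flipClipY` are hypothetical names, matrices are stored as `m[row][col]`, and vertices are treated as column vectors (clip = P * v), so clip-space Y comes from row 1 of the projection matrix. With a row-vector convention the second column would be negated instead.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;   // m[row][col], column-vector convention (assumption)

// Plain matrix * vector product, used here only to show the effect of the flip.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

// Negate the row that produces clip-space Y, so every vertex transformed by the
// returned matrix comes out with its Y coordinate mirrored.
Mat4 flipClipY(Mat4 projection) {
    for (float& e : projection[1]) e = -e;
    return projection;
}
```

Mirroring Y also reverses the apparent triangle winding, so the front-face setting would have to be swapped on the backend that uses the flipped matrix. As for an extension, VK_KHR_maintenance1 (core since Vulkan 1.1) does permit a negative viewport height, which may be what the debug layer was complaining about the absence of.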
• By NikiTo
Some people say "discard" has no positive effect on optimization. Other people say it will at least spare the texture fetches.

    if (color.A < 0.1f)
    {
        //discard;
        clip(-1);
    }
    // tons of reads of textures following here
    // and loops too

Some people say that "discard" will only mask out the output of the pixel shader, while still evaluating all the statements after the "discard" instruction.

MSN>
discard: Do not output the result of the current pixel.
<MSN

As usual it is unclear, but it suggests that "clip" could discard the whole pixel (maybe stopping execution too).

I think that, at least for thermal and power-consumption reasons, the GPU should not evaluate the statements after "discard", but some people on the internet say that the GPU computes the statements anyway. What I am more worried about are the texture fetches after discard/clip.

(What if, after discard, I have an expensive branch decision that makes the neighboring pixels that took the cheap branch stall for nothing? This is crazy.)
• By NikiTo
I have a problem. My shaders are huge, in the sense that they contain a lot of code. Many of my pixels should be completely discarded. I could put a comparison and a discard at the very beginning of the shader, but as far as I understand, the discard statement does not save any work at all, as the pixel has to stall until the long neighboring shader invocations complete.
Initially I wanted to use stencil to discard pixels before the execution flow enters the shader, even before the GPU allocates resources for it, to avoid stalling pixel shader execution. I assumed that depth/stencil discards pixels before the pixel shader, but I see now that it happens in the very last stage, the Output Merger. It seems extremely inefficient to render a little mirror that way in a scene with a big viewport. Why did they put the stencil test in the Output Merger anyway? Handling of stencil is so limited compared to other resources. Do people use stencil functionality at all for games, or do they prefer discard/clip?

Will the GPU stall the pixel if I issue a discard at the very beginning of the pixel shader, or will the GPU already start using the freed-up resources to render another pixel?

# D3D12 / Vulkan Synchronization Primitives



Vulkan and DX12 have very similar APIs, but the way they handle their synchronization primitives seems to differ in its fundamental design.

Now, both Vulkan and DirectX 12 have resource barriers, so I'm going to ignore those.

DirectX 12 uses fences with explicit values that are expected to monotonically increase. In the simplest case, you have the swap chain present barrier. I can see two ways to implement fencing in this case:

1) You create SwapBufferCount fences. At the end of frame N you signal fence N % SwapBufferCount and then wait on fence (N + 1) % SwapBufferCount.

2) You create 1 fence. At the end of each frame you increment the fence and save off the value. You then wait for the fence to reach the value for frame (N + 1) % SwapBufferCount.
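
Option 2 can be sketched without a GPU. In the illustration below, `MockFence` and `FrameSync` are hypothetical stand-ins, not part of either API; in a real D3D12 backend the value returned by `endFrame` would be passed to `ID3D12CommandQueue::Signal`, and the check in `canStartNextFrame` would read `ID3D12Fence::GetCompletedValue` (or block via `SetEventOnCompletion`).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a GPU fence: a counter that the "GPU" advances.
struct MockFence {
    std::uint64_t completed = 0;   // highest value the GPU has reached so far
    void gpuReaches(std::uint64_t v) { if (v > completed) completed = v; }
};

// One fence for all frames, plus one saved fence value per swap-chain buffer.
template <std::size_t BufferCount>
class FrameSync {
public:
    // End of frame: take the next fence value, remember it for the current
    // buffer slot, and advance to the slot the next frame will use.
    std::uint64_t endFrame() {
        const std::uint64_t value = ++m_lastIssued;
        m_slotValue[m_slot] = value;
        m_slot = (m_slot + 1) % BufferCount;
        return value;
    }

    // Fence value that must be reached before the upcoming slot can be reused.
    // Slots that were never used hold 0, which is always considered reached.
    std::uint64_t valueToWaitFor() const { return m_slotValue[m_slot]; }

    bool canStartNextFrame(const MockFence& fence) const {
        return fence.completed >= valueToWaitFor();
    }

private:
    std::array<std::uint64_t, BufferCount> m_slotValue{};
    std::uint64_t m_lastIssued = 0;
    std::size_t m_slot = 0;
};
```

With two buffers, the first two frames start immediately; the third can start only once the fence has reached the value saved for frame one.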

In general, it seems like the "timestamp" approach to fencing is powerful. For instance, I can have a page allocator that retires pages with a fence value and then wait for the fence to reach that point before recycling the page. It seems like creating one fence per command list submission would be expensive (maybe not? how lightweight are fences?).
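
The page allocator idea might look roughly like the following. `PageRecycler` is a hypothetical name, pages are reduced to plain indices, and the caller is assumed to supply the fence's completed value (for example from `ID3D12Fence::GetCompletedValue`) and to retire pages with non-decreasing fence values.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical recycler: remembers which fence value each retired page is
// waiting on and hands pages back once the fence has reached that value.
class PageRecycler {
public:
    // Mark `page` as still in use by GPU work that completes at `fenceValue`.
    // Assumes values are passed in non-decreasing order across calls.
    void retire(int page, std::uint64_t fenceValue) {
        m_retired.push_back({page, fenceValue});
    }

    // If the oldest retired page is no longer in flight, pop it into `outPage`
    // and return true; otherwise leave `outPage` untouched and return false.
    bool tryRecycle(std::uint64_t completedValue, int& outPage) {
        if (m_retired.empty() || m_retired.front().fenceValue > completedValue)
            return false;
        outPage = m_retired.front().page;
        m_retired.pop_front();
        return true;
    }

private:
    struct Retired {
        int page;
        std::uint64_t fenceValue;
    };
    std::deque<Retired> m_retired;
};
```

Because the values only grow, checking the front of the queue is enough; no per-page fence object is needed.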

Now compare this with Vulkan.

Vulkan has the notion of fences, semaphores, and events. They are explained in detail here. All these primitives are binary: each is signaled once and stays signaled until you reset it. I'm less familiar with how to use these kinds of primitives, because you can't do the timestamp approach like you can with DX12 fences.

For instance, to do the page allocator in Vulkan, the fence is the correct primitive to use because it involves synchronizing the state of the hardware queue with the host (i.e. to know when a retired page can be recycled).

In order to do this, I now have to create 1 fence for each vkQueueSubmit call, and the page allocator receives a fence handle instead of a timestamp.

It seems to me like the DirectX-style fence is more flexible, as I would imagine that internally the Vulkan fence is using the same underlying primitive as the DirectX fence to track signaling. In short, it seems like the DirectX timestamp-based fencing allows you to use fewer fence objects overall.

My primary concern is thinking about a common backend between Vulkan and DX12. It seems like the wiser course of action is to support the Vulkan style binary fences because they can be implemented with DX12 fences. My concern is whether I will lose performance due to creating 1 fence per ExecuteCommandLists call vs 1 overall in DirectX.

For those who understand the underlying hardware and APIs more deeply than I do, I would appreciate some insight into these design decisions.

Thanks!

Edited by ZBethel

##### Share on other sites

In order to do this, I now have to create 1 fence for each vkQueueSubmit call, and the page allocator receives a fence handle instead of a timestamp.
...
My concern is whether I will lose performance due to creating 1 fence per ExecuteCommandLists call vs 1 overall in DirectX.

Don't you have the option of using one fence per frame on both APIs? (i.e. just fence the final vkQueueSubmit for a frame)
We already do this on D3D9/11/GL/GNM/GCM/etc... so that the user can query whether a frame is retired yet or not, so that they can implement cross-platform ring-buffers, etc (which manage memory on a per-frame basis).

DirectX 12 uses fences with explicit values that are expected to monotonically increase

That's one use-case, not a strict expectation. You can use D3D12's fences to implement equivalents of Vulkan's fences, events, and semaphores (though Vulkan's events would require a D3D backend to finish its current command list, submit it with a fence, reset it and continue recording commands).
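
As a rough illustration of the fence case only, a Vulkan-style binary fence can be layered on a shared monotonic counter. `Timeline` and `BinaryFence` are hypothetical names; the sketch assumes a D3D12 backend would pass the value returned by `submit` to `ID3D12CommandQueue::Signal` and would fill `completed` from `ID3D12Fence::GetCompletedValue`.

```cpp
#include <cstdint>

// Hypothetical stand-in for one value-based fence shared by many binary fences.
struct Timeline {
    std::uint64_t lastIssued = 0;  // highest value handed out to a submission
    std::uint64_t completed = 0;   // highest value the GPU has reached
};

// Binary fence emulated on top of the shared counter: it stores only the value
// it is waiting for, so creating one per submission costs a single integer.
class BinaryFence {
public:
    explicit BinaryFence(Timeline& timeline) : m_timeline(timeline) {}

    // Attach this fence to a submission; returns the value the queue should signal.
    std::uint64_t submit() {
        m_target = ++m_timeline.lastIssued;
        return m_target;
    }

    // Signaled once the counter has reached the value of the attached submission.
    bool isSignaled() const {
        return m_target != 0 && m_timeline.completed >= m_target;
    }

    // Back to the unsignaled state until the next submit().
    void reset() { m_target = 0; }

private:
    Timeline& m_timeline;
    std::uint64_t m_target = 0;    // 0 means "not attached to any submission"
};
```

Under these assumptions, exposing binary fences in a common backend does not force one native fence object per ExecuteCommandLists call on the D3D12 side.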

Edited by Hodgman

##### Share on other sites

Don't you have the option of using one fence per frame on both APIs? (i.e. just fence the final vkQueueSubmit for a frame)

I believe NVIDIA explicitly calls out that you should avoid batch-submitting your entire frame at the very end. I believe the idea is to keep the hardware busy with ~5 vkQueueSubmit / ExecuteCommandLists calls per frame.

That said, I guess there are several ways you can pipeline your frame.

##### Share on other sites

I believe NVIDIA explicitly calls out that you should avoid batch-submitting your entire frame at the very end. I believe the idea is to keep the hardware busy with ~5 vkQueueSubmit / ExecuteCommandLists calls per frame.
That said, I guess there are several ways you can pipeline your frame.

Even still, you know which is the last vkQueueSubmit call for the frame, so you can just fence that one only.

##### Share on other sites

I fail to see how:

waitOnFence( fence[i] );

is any different from:

waitOnFence( fence, i );

Yes, the first one might require more "malloc" (I'm not speaking in the C malloc sense, but rather in "we'll need more memory somewhere") assuming the second version doesn't have hidden overhead.

However, since you shouldn't have much more than ~10 fences (3 for triple buffer + 6 for overall synchronization across those 3 frames + 1 for streaming), memory usage becomes irrelevant. If you are calling "waitOnFence(...)" (which has a high overhead) more than 1-3 times per frame, you're probably doing something wrong and it will likely begin to show up in GPUView (unless you have carefully calculated why you are fencing more than the norm and it makes sense for what you're doing).

Btw, you can emulate DX12's style in Vulkan with (assuming you have a max limit on what the waiting value will be):

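    class MyFence
    {
    #if VULKAN
        VkFence m_fence[N];     // one binary fence per value; N is the maximum wait value
    #else
        ID3D12Fence *m_fence;   // a single fence that carries an explicit value
    #endif

    public:
        MyFence( uint32_t maxN );

        void wait( uint32_t value );
    };
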
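
The Vulkan branch of that idea can be mocked up as follows. `EmulatedValueFence` is a hypothetical name and the binary fences are simulated with flags; a real backend would presumably hold `VkFence` handles and use `vkGetFenceStatus`, `vkWaitForFences` and `vkResetFences` where the comments say so.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates "wait for value v" with one binary fence per possible value.
class EmulatedValueFence {
public:
    explicit EmulatedValueFence(std::size_t maxValues)
        : m_signaled(maxValues, false) {}

    // Stand-in for the GPU signalling the binary fence tied to `value`.
    void signal(std::size_t value) {
        assert(value < m_signaled.size());
        m_signaled[value] = true;
    }

    // Non-blocking query; a real backend would check (or wait on) fence[value].
    bool isReached(std::size_t value) const {
        assert(value < m_signaled.size());
        return m_signaled[value];
    }

    // Binary fences stay signaled until reset, so a slot must be reset before
    // its value can be reused.
    void reset(std::size_t value) {
        assert(value < m_signaled.size());
        m_signaled[value] = false;
    }

private:
    std::vector<bool> m_signaled;
};
```

One difference worth keeping in mind: a D3D12 wait on value v is satisfied by any completed value of v or higher, whereas here each value has its own flag, so the two behave the same only if submissions complete in order.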
due to creating 1 fence per ExecuteCommandLists

Ewww. Why would you do that?

Fence once per frame like Hodgman said. Only exceptions are sync'ing with compute & copy queues (but keep the waits() to a minimum).

Edited by Matias Goldberg