Sign in to follow this  
ZachBethel

DX12 D3D12 / Vulkan Synchronization Primitives

Recommended Posts

Vulkan and DX12 have very similar API's, but the way they handle their synchronization primitives seem to differ in the fundamental design.

 

Now, both Vulkan and DirectX 12 have resource barriers, so I'm going to ignore those.

 

DirectX 12 uses fences with explicit values that are expected to monotonically increase. In the simplest case, you have the swap chain present barrier. I can see two ways to implement fencing in this case:

 

1) You create N fences. At the end of frame N you signal fence N and then wait on fence (N + 1) % SwapBufferCount.

2) You create 1 fence. At the end of each frame you increment the fence and save off the value. You then wait for the fence to reach the value for frame (N + 1) % SwapBufferCount.

 

In general, it seems like the "timestamp" approach to fencing is powerful. For instance, I can have a page allocator that retires pages with a fence value and then wait for the fence to reach that point before recycling the page. It seems like creating one fence per command list submission would be expensive (maybe not? how lightweight are fences?).

 

Now compare this with Vulkan.

 

Vulkan has the notion of fences, semaphores, and events. They are explained in detail here. All these primitives are binary, it is signaled once and stay signaled until you reset it. I'm less familiar with how to use these kinds of primitives, because you can't do the timestamp approach like you can with DX12 fences.

 

For instance, to do the page allocator in Vulkan, the fence is the correct primitive to use because it involves synchronizing the state of the hardware queue with the host (i.e. to know when a retired page can be recycled).

 

In order to do this, I now have to create 1 fence for each vkSubmit call, and the page allocator receives a fence handle instead of a timestamp.

 

It seems to me like the DirectX-style fence is more flexible, as I would imagine that internally the Vulkan fence is using the same underlying primitive as the DirectX fence to track signaling. In short, it seems like the DirectX timestamp-based fencing allows you to use less fence objects overall.

 

My primary concern is thinking about a common backend between Vulkan and DX12. It seems like the wiser course of action is to support the Vulkan style binary fences because they can be implemented with DX12 fences. My concern is whether I will lose performance due to creating 1 fence per ExecuteCommandLists call vs 1 overall in DirectX.

 

For those who understand the underlying hardware and API's deeper than me, I would appreciate some insight into these design decisions.

 

Thanks!

Edited by ZBethel

Share this post


Link to post
Share on other sites

In order to do this, I now have to create 1 fence for each vkSubmit call, and the page allocator receives a fence handle instead of a timestamp.
...
My concern is whether I will lose performance due to creating 1 fence per ExecuteCommandLists call vs 1 overall in DirectX.

Don't you have the option of using one fence per frame on both API's? (i.e. just fence the final vkSumbit for a frame)
We already do this on D3D9/11/GL/GNM/GCM/etc... so that the user can query whether a frame is retired yet or not, so that they can implement cross-platform ring-buffers, etc (which manage memory on a per-frame basis).

DirectX 12 uses fences with explicit values that are expected to monotonically increase

That's one use-case, not a strict expectation. You can use D3D12's fences to implement equivalents of Vulkan's fences, events, and semaphores (though Vulkan's events would require a D3D backend to finish it's current command list, submit it with a fence, reset it and continue recording commands).

Edited by Hodgman

Share this post


Link to post
Share on other sites

Don't you have the option of using one fence per frame on both API's? (i.e. just fence the final vkSumbit for a frame)

 

I believe nVidia explicitly calls out that you should avoid batch submitting your entire frame at the very end. I believe the idea to keep the hardware busy with ~5 vkSubmit / ExecuteCommandList calls per frame?

 

That said, I guess there are several ways you can pipeline your frame. 

Share this post


Link to post
Share on other sites

I believe nVidia explicitly calls out that you should avoid batch submitting your entire frame at the very end. I believe the idea to keep the hardware busy with ~5 vkSubmit / ExecuteCommandList calls per frame?
That said, I guess there are several ways you can pipeline your frame.

Even still, you know which is the last vkSubmit call for the frame, so you can just fence that one only.

Share this post


Link to post
Share on other sites

I fail to see how:

waitOnFence( fence[i] );

is any different from:

waitOnFence( fence, i );

Yes, the first one might require more "malloc" (I'm not speaking in the C malloc sense, but rather in "we'll need more memory somewhere") assuming the second version doesn't have hidden overhead.

 

However since you shouldn't have much more than ~10 fences (3 for triple buffer + 6 for overall synchronization across those 3 frames + 1 for streaming) memory usage becomes irrelevant. If you are calling "waitOnFence(...)" (which has a high overhead) more than 1-3 times per frame you're probably doing something wrong and it will likely begin to show up in GPUView (unless you have carefully calculated why you are fencing more than the norm and makes sense on what you're doing).

 

Btw you can emulate DX12's style in vulkan with (assuming you have a max limit of what the waiting value will be):

class MyFence
{
#if VULKAN
       vkFence m_fence[N];
#else
       D3D12Fence m_fence;
#endif
       MyFence( uint maxN );

       void wait( uint value );
}
due to creating 1 fence per ExecuteCommandLists

 

Ewww. Why would you do that?

Fence once per frame like Hodgman said. Only exceptions are sync'ing with compute & copy queues (but keep the waits() to a minimum).

Edited by Matias Goldberg

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this  

  • Announcements

  • Forum Statistics

    • Total Topics
      628330
    • Total Posts
      2982112
  • Similar Content

    • By DejayHextrix
      Hi, New here. 
      I need some help. My fiance and I like to play this mobile game online that goes by real time. Her and I are always working but when we have free time we like to play this game. We don't always got time throughout the day to Queue Buildings, troops, Upgrades....etc.... 
      I was told to look into DLL Injection and OpenGL/DirectX Hooking. Is this true? Is this what I need to learn? 
      How do I read the Android files, or modify the files, or get the in-game tags/variables for the game I want? 
      Any assistance on this would be most appreciated. I been everywhere and seems no one knows or is to lazy to help me out. It would be nice to have assistance for once. I don't know what I need to learn. 
      So links of topics I need to learn within the comment section would be SOOOOO.....Helpful. Anything to just get me started. 
      Thanks, 
      Dejay Hextrix 
    • By HD86
      As far as I know, the size of XMMATRIX must be 64 bytes, which is way too big to be returned by a function. However, DirectXMath functions do return this struct. I suppose this has something to do with the SIMD optimization. Should I return this huge struct from my own functions or should I pass it by a reference or pointer?
      This question will look silly to you if you know how SIMD works, but I don't.
    • By lubbe75
      I am looking for some example projects and tutorials using sharpDX, in particular DX12 examples using sharpDX. I have only found a few. Among them the porting of Microsoft's D3D12 Hello World examples (https://github.com/RobyDX/SharpDX_D3D12HelloWorld), and Johan Falk's tutorials (http://www.johanfalk.eu/).
      For instance, I would like to see an example how to use multisampling, and debugging using sharpDX DX12.
      Let me know if you have any useful examples.
      Thanks!
    • By lubbe75
      I'm writing a 3D engine using SharpDX and DX12. It takes a handle to a System.Windows.Forms.Control for drawing onto. This handle is used when creating the swapchain (it's set as the OutputHandle in the SwapChainDescription). 
      After rendering I want to give up this control to another renderer (for instance a GDI renderer), so I dispose various objects, among them the swapchain. However, no other renderer seem to be able to draw on this control after my DX12 renderer has used it. I see no exceptions or strange behaviour when debugging the other renderers trying to draw, except that nothing gets drawn to the area. If I then switch back to my DX12 renderer it can still draw to the control, but no other renderers seem to be able to. If I don't use my DX12 renderer, then I am able to switch between other renderers with no problem. My DX12 renderer is clearly messing up something in the control somehow, but what could I be doing wrong with just SharpDX calls? I read a tip about not disposing when in fullscreen mode, but I don't use fullscreen so it can't be that.
      Anyway, my question is, how do I properly release this handle to my control so that others can draw to it later? Disposing things doesn't seem to be enough.
    • By Tubby94
      I'm currently learning how to store multiple objects in a single vertex buffer for efficiency reasons. So far I have a cube and pyramid rendered using ID3D12GraphicsCommandList::DrawIndexedInstanced; but when the screen is drawn, I can't see the pyramid because it is drawn inside the cube. I'm told to "Use the world transformation matrix so that the box and pyramid are disjoint in world space".
       
      Can anyone give insight on how this is accomplished? 
       
           First I init the verts in Local Space
      std::array<VPosData, 13> vertices =     {         //Cube         VPosData({ XMFLOAT3(-1.0f, -1.0f, -1.0f) }),         VPosData({ XMFLOAT3(-1.0f, +1.0f, -1.0f) }),         VPosData({ XMFLOAT3(+1.0f, +1.0f, -1.0f) }),         VPosData({ XMFLOAT3(+1.0f, -1.0f, -1.0f) }),         VPosData({ XMFLOAT3(-1.0f, -1.0f, +1.0f) }),         VPosData({ XMFLOAT3(-1.0f, +1.0f, +1.0f) }),         VPosData({ XMFLOAT3(+1.0f, +1.0f, +1.0f) }),         VPosData({ XMFLOAT3(+1.0f, -1.0f, +1.0f) }),         //Pyramid         VPosData({ XMFLOAT3(-1.0f, -1.0f, -1.0f) }),         VPosData({ XMFLOAT3(-1.0f, -1.0f, +1.0f) }),         VPosData({ XMFLOAT3(+1.0f, -1.0f, -1.0f) }),         VPosData({ XMFLOAT3(+1.0f, -1.0f, +1.0f) }),         VPosData({ XMFLOAT3(0.0f,  +1.0f, 0.0f) }) } Then  data is stored into a container so sub meshes can be drawn individually
      SubmeshGeometry submesh; submesh.IndexCount = (UINT)indices.size(); submesh.StartIndexLocation = 0; submesh.BaseVertexLocation = 0; SubmeshGeometry pyramid; pyramid.IndexCount = (UINT)indices.size(); pyramid.StartIndexLocation = 36; pyramid.BaseVertexLocation = 8; mBoxGeo->DrawArgs["box"] = submesh; mBoxGeo->DrawArgs["pyramid"] = pyramid;  
      Objects are drawn
      mCommandList->DrawIndexedInstanced( mBoxGeo->DrawArgs["box"].IndexCount, 1, 0, 0, 0); mCommandList->DrawIndexedInstanced( mBoxGeo->DrawArgs["pyramid"].IndexCount, 1, 36, 8, 0);  
      Vertex Shader
       
      cbuffer cbPerObject : register(b0) { float4x4 gWorldViewProj; }; struct VertexIn { float3 PosL : POSITION; float4 Color : COLOR; }; struct VertexOut { float4 PosH : SV_POSITION; float4 Color : COLOR; }; VertexOut VS(VertexIn vin) { VertexOut vout; // Transform to homogeneous clip space. vout.PosH = mul(float4(vin.PosL, 1.0f), gWorldViewProj); // Just pass vertex color into the pixel shader. vout.Color = vin.Color; return vout; } float4 PS(VertexOut pin) : SV_Target { return pin.Color; }  

  • Popular Now