• Advertisement
Sign in to follow this  

DX12 uav barrier and atomic operation

This topic is 440 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hey Guys,

 

Think about the following case:

You have multiple dispatch calls with compute shader which launches tons of threads, and each thread will do an atomic add on the same memory location (same address for all threads from all dispatch calls). Since DX12 allow dispatch to be overlapped, all related document will suggest to put a uav barrier between each dispatch since they all operate on same buffer, and has data dependency.

 

I can understand the reasoning for normal read/write, since you read/write will be cached in multiple level, and to get correct result, you need to flush your cache, and do bunch of related cache invalidation. So that's why we need uav barrier between dispatches.(is my understanding correct?)

 

But what if all my write is atomic as described in the above case? In my experience, atomic write will be immediate visible to all other threads from the same dispatch even if threads are in different execution unit (I am not super sure about that, since no doc specifically state that....), which, to me, means all cached data is synced across entire GPU. And if that is the case, I think it will be safe to remove uav barrier between dispatches since data cache has already been taken care, there is no out-of-date data.

 

And from my experiment, it seems to work correctly on my GTX680m without the barrier. But there is no official document out there can confirm it, so I have no idea whether it is officially safe to do it, or it will cause undefined behavior, or corrupted data, and I am just accidentally get the correct result.

 

Please correct me if my assumption is wrong, and it will be great if someone could explain how atomic operation works in GPU (especially for those support reading back the original data, since it seems atomic read/write bypass all caches)

 

Thanks in advance

Share this post


Link to post
Share on other sites
Advertisement

I don't think UAV barriers flush any caches, they just ensure an execution order (I could be wrong, but IIRC UAV's are implemented on the standard cache hierarchy which is coherent).  Which leads to point two which is if the order of execution does not matter then don't use UAV barriers.

 

edit - oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units.

Edited by Infinisearch

Share this post


Link to post
Share on other sites

oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units.
 

Thanks Infinisearch, so does that mean all atomic read/write(on uav) bypass L1 cache (which is only shared within EU?), and which gives all read/write behavior kinda equivalent to the std::memory_order_seq_cst

Share this post


Link to post
Share on other sites
edit - oh and IIRC atomics are implemented in the L2 cache which is shared by different execution units.

BTW I only know this about AMD's GCN architecture

 

 

 

Thanks Infinisearch, so does that mean all atomic read/write(on uav) bypass L1 cache (which is only shared within EU?), and which gives all read/write behavior kinda equivalent to the std::memory_order_seq_cst

 

I don't know if it bypass's it but it acts as a cache miss (since the L1 and L2 are coherent and inclusive).  I'm not really familiar with std::memory_order_seq_cst so I can't really comment on it but I will say how do you guarantee execution order within a wave/warp?  I don't think you can so if what you're doing has an order to it, it won't work.  Only coarse grained order using barriers is possible.

 

edit - nevermind that last part.

Edited by Infinisearch

Share this post


Link to post
Share on other sites

This might clear things up for you.

From this page: https://msdn.microsoft.com/en-us/library/windows/desktop/dn903898(v=vs.85).aspx

 

  • D3D12_RESOURCE_UAV_BARRIER - Unordered access view barriers indicate all UAV accesses (read or writes) to a particular resource must complete before any future UAV accesses (read or write) can begin. The specified resource may be NULL. It is not necessary to insert a UAV barrier between two draw or dispatch calls which only read a UAV. Additionally, it is not necessary to insert a UAV barrier between two draw or dispatch calls which write to the same UAV if the application knows that it is safe to execute the UAV accesses in any order. The resource can be NULL (indicating that any UAV access could require the barrier).

Share this post


Link to post
Share on other sites

Thanks Infinisearch, I should have read the doc more carefully, but it's good to know that in my case it is safe to remove the uav barrier B-)

Share this post


Link to post
Share on other sites

You're welcome... but did you read into that quote?  If I'm not mistaken it implies there's no flushing of the cache necessary for writes to become visible to subsequent dispatches.

Share this post


Link to post
Share on other sites

 If I'm not mistaken it implies there's no flushing of the cache necessary for writes to become visible to subsequent dispatches.

 

yup, I have that kept in mind. My read will be in a much later pass, and I transit it into a srv for just reading, but before that, all dispatch just do atomic access(though these including atomic read like interlockedadd, but it should be safe according to some answers in my other post), so should be fine.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
  • Advertisement
  • Popular Tags

  • Advertisement
  • Popular Now

  • Similar Content

    • By turanszkij
      Hi, right now building my engine in visual studio involves a shader compiling step to build hlsl 5.0 shaders. I have a separate project which only includes shader sources and the compiler is the visual studio integrated fxc compiler. I like this method because on any PC that has visual studio installed, I can just download the solution from GitHub and everything just builds without additional dependencies and using the latest version of the compiler. I also like it because the shaders are included in the solution explorer and easy to browse, and double-click to open (opening files can be really a pain in the ass in visual studio run in admin mode). Also it's nice that VS displays the build output/errors in the output window.
      But now I have the HLSL 6 compiler and want to build hlsl 6 shaders as well (and as I understand I can also compile vulkan compatible shaders with it later). Any idea how to do this nicely? I want only a single project containing shader sources, like it is now, but build them for different targets. I guess adding different building projects would be the way to go that reference the shader source project? But how would they differentiate from shader type of the sources (eg. pixel shader, compute shader,etc.)? Now the shader building project contains for each shader the shader type, how can other building projects reference that?
      Anyone with some experience in this?
    • By DiligentDev
      Hello!
      I would like to introduce Diligent Engine, a project that I've been recently working on. Diligent Engine is a light-weight cross-platform abstraction layer between the application and the platform-specific graphics API. Its main goal is to take advantages of the next-generation APIs such as Direct3D12 and Vulkan, but at the same time provide support for older platforms via Direct3D11, OpenGL and OpenGLES. Diligent Engine exposes common front-end for all supported platforms and provides interoperability with underlying native API. Shader source code converter allows shaders authored in HLSL to be translated to GLSL and used on all platforms. Diligent Engine supports integration with Unity and is designed to be used as a graphics subsystem in a standalone game engine, Unity native plugin or any other 3D application. It is distributed under Apache 2.0 license and is free to use. Full source code is available for download on GitHub.
      Features:
      True cross-platform Exact same client code for all supported platforms and rendering backends No #if defined(_WIN32) ... #elif defined(LINUX) ... #elif defined(ANDROID) ... No #if defined(D3D11) ... #elif defined(D3D12) ... #elif defined(OPENGL) ... Exact same HLSL shaders run on all platforms and all backends Modular design Components are clearly separated logically and physically and can be used as needed Only take what you need for your project (do not want to keep samples and tutorials in your codebase? Simply remove Samples submodule. Only need core functionality? Use only Core submodule) No 15000 lines-of-code files Clear object-based interface No global states Key graphics features: Automatic shader resource binding designed to leverage the next-generation rendering APIs Multithreaded command buffer generation 50,000 draw calls at 300 fps with D3D12 backend Descriptor, memory and resource state management Modern c++ features to make code fast and reliable The following platforms and low-level APIs are currently supported:
      Windows Desktop: Direct3D11, Direct3D12, OpenGL Universal Windows: Direct3D11, Direct3D12 Linux: OpenGL Android: OpenGLES MacOS: OpenGL iOS: OpenGLES API Basics
      Initialization
      The engine can perform initialization of the API or attach to already existing D3D11/D3D12 device or OpenGL/GLES context. For instance, the following code shows how the engine can be initialized in D3D12 mode:
      #include "RenderDeviceFactoryD3D12.h" using namespace Diligent; // ...  GetEngineFactoryD3D12Type GetEngineFactoryD3D12 = nullptr; // Load the dll and import GetEngineFactoryD3D12() function LoadGraphicsEngineD3D12(GetEngineFactoryD3D12); auto *pFactoryD3D11 = GetEngineFactoryD3D12(); EngineD3D12Attribs EngD3D12Attribs; EngD3D12Attribs.CPUDescriptorHeapAllocationSize[0] = 1024; EngD3D12Attribs.CPUDescriptorHeapAllocationSize[1] = 32; EngD3D12Attribs.CPUDescriptorHeapAllocationSize[2] = 16; EngD3D12Attribs.CPUDescriptorHeapAllocationSize[3] = 16; EngD3D12Attribs.NumCommandsToFlushCmdList = 64; RefCntAutoPtr<IRenderDevice> pRenderDevice; RefCntAutoPtr<IDeviceContext> pImmediateContext; SwapChainDesc SwapChainDesc; RefCntAutoPtr<ISwapChain> pSwapChain; pFactoryD3D11->CreateDeviceAndContextsD3D12( EngD3D12Attribs, &pRenderDevice, &pImmediateContext, 0 ); pFactoryD3D11->CreateSwapChainD3D12( pRenderDevice, pImmediateContext, SwapChainDesc, hWnd, &pSwapChain ); Creating Resources
      Device resources are created by the render device. The two main resource types are buffers, which represent linear memory, and textures, which use memory layouts optimized for fast filtering. To create a buffer, you need to populate BufferDesc structure and call IRenderDevice::CreateBuffer(). The following code creates a uniform (constant) buffer:
      BufferDesc BuffDesc; BufferDesc.Name = "Uniform buffer"; BuffDesc.BindFlags = BIND_UNIFORM_BUFFER; BuffDesc.Usage = USAGE_DYNAMIC; BuffDesc.uiSizeInBytes = sizeof(ShaderConstants); BuffDesc.CPUAccessFlags = CPU_ACCESS_WRITE; m_pDevice->CreateBuffer( BuffDesc, BufferData(), &m_pConstantBuffer ); Similar, to create a texture, populate TextureDesc structure and call IRenderDevice::CreateTexture() as in the following example:
      TextureDesc TexDesc; TexDesc.Name = "My texture 2D"; TexDesc.Type = TEXTURE_TYPE_2D; TexDesc.Width = 1024; TexDesc.Height = 1024; TexDesc.Format = TEX_FORMAT_RGBA8_UNORM; TexDesc.Usage = USAGE_DEFAULT; TexDesc.BindFlags = BIND_SHADER_RESOURCE | BIND_RENDER_TARGET | BIND_UNORDERED_ACCESS; TexDesc.Name = "Sample 2D Texture"; m_pRenderDevice->CreateTexture( TexDesc, TextureData(), &m_pTestTex ); Initializing Pipeline State
      Diligent Engine follows Direct3D12 style to configure the graphics/compute pipeline. One big Pipelines State Object (PSO) encompasses all required states (all shader stages, input layout description, depth stencil, rasterizer and blend state descriptions etc.)
      Creating Shaders
      To create a shader, populate ShaderCreationAttribs structure. An important member is ShaderCreationAttribs::SourceLanguage. The following are valid values for this member:
      SHADER_SOURCE_LANGUAGE_DEFAULT  - The shader source format matches the underlying graphics API: HLSL for D3D11 or D3D12 mode, and GLSL for OpenGL and OpenGLES modes. SHADER_SOURCE_LANGUAGE_HLSL  - The shader source is in HLSL. For OpenGL and OpenGLES modes, the source code will be converted to GLSL. See shader converter for details. SHADER_SOURCE_LANGUAGE_GLSL  - The shader source is in GLSL. There is currently no GLSL to HLSL converter. To allow grouping of resources based on the frequency of expected change, Diligent Engine introduces classification of shader variables:
      Static variables (SHADER_VARIABLE_TYPE_STATIC) are variables that are expected to be set only once. They may not be changed once a resource is bound to the variable. Such variables are intended to hold global constants such as camera attributes or global light attributes constant buffers. Mutable variables (SHADER_VARIABLE_TYPE_MUTABLE) define resources that are expected to change on a per-material frequency. Examples may include diffuse textures, normal maps etc. Dynamic variables (SHADER_VARIABLE_TYPE_DYNAMIC) are expected to change frequently and randomly. This post describes the resource binding model in Diligent Engine.
      The following is an example of shader initialization:
      ShaderCreationAttribs Attrs; Attrs.Desc.Name = "MyPixelShader"; Attrs.FilePath = "MyShaderFile.fx"; Attrs.SearchDirectories = "shaders;shaders\\inc;"; Attrs.EntryPoint = "MyPixelShader"; Attrs.Desc.ShaderType = SHADER_TYPE_PIXEL; Attrs.SourceLanguage = SHADER_SOURCE_LANGUAGE_HLSL; BasicShaderSourceStreamFactory BasicSSSFactory(Attrs.SearchDirectories); Attrs.pShaderSourceStreamFactory = &BasicSSSFactory; ShaderVariableDesc ShaderVars[] =  {     {"g_StaticTexture", SHADER_VARIABLE_TYPE_STATIC},     {"g_MutableTexture", SHADER_VARIABLE_TYPE_MUTABLE},     {"g_DynamicTexture", SHADER_VARIABLE_TYPE_DYNAMIC} }; Attrs.Desc.VariableDesc = ShaderVars; Attrs.Desc.NumVariables = _countof(ShaderVars); Attrs.Desc.DefaultVariableType = SHADER_VARIABLE_TYPE_STATIC; StaticSamplerDesc StaticSampler; StaticSampler.Desc.MinFilter = FILTER_TYPE_LINEAR; StaticSampler.Desc.MagFilter = FILTER_TYPE_LINEAR; StaticSampler.Desc.MipFilter = FILTER_TYPE_LINEAR; StaticSampler.TextureName = "g_MutableTexture"; Attrs.Desc.NumStaticSamplers = 1; Attrs.Desc.StaticSamplers = &StaticSampler; ShaderMacroHelper Macros; Macros.AddShaderMacro("USE_SHADOWS", 1); Macros.AddShaderMacro("NUM_SHADOW_SAMPLES", 4); Macros.Finalize(); Attrs.Macros = Macros; RefCntAutoPtr<IShader> pShader; m_pDevice->CreateShader( Attrs, &pShader ); Creating the Pipeline State Object
      To create a pipeline state object, define instance of PipelineStateDesc structure. The structure defines the pipeline specifics such as if the pipeline is a compute pipeline, number and format of render targets as well as depth-stencil format:
      // This is a graphics pipeline PSODesc.IsComputePipeline = false; PSODesc.GraphicsPipeline.NumRenderTargets = 1; PSODesc.GraphicsPipeline.RTVFormats[0] = TEX_FORMAT_RGBA8_UNORM_SRGB; PSODesc.GraphicsPipeline.DSVFormat = TEX_FORMAT_D32_FLOAT; The structure also defines depth-stencil, rasterizer, blend state, input layout and other parameters. For instance, rasterizer state can be defined as in the code snippet below:
      // Init rasterizer state RasterizerStateDesc &RasterizerDesc = PSODesc.GraphicsPipeline.RasterizerDesc; RasterizerDesc.FillMode = FILL_MODE_SOLID; RasterizerDesc.CullMode = CULL_MODE_NONE; RasterizerDesc.FrontCounterClockwise = True; RasterizerDesc.ScissorEnable = True; //RSDesc.MultisampleEnable = false; // do not allow msaa (fonts would be degraded) RasterizerDesc.AntialiasedLineEnable = False; When all fields are populated, call IRenderDevice::CreatePipelineState() to create the PSO:
      m_pDev->CreatePipelineState(PSODesc, &m_pPSO); Binding Shader Resources
      Shader resource binding in Diligent Engine is based on grouping variables in 3 different groups (static, mutable and dynamic). Static variables are variables that are expected to be set only once. They may not be changed once a resource is bound to the variable. Such variables are intended to hold global constants such as camera attributes or global light attributes constant buffers. They are bound directly to the shader object:
       
      PixelShader->GetShaderVariable( "g_tex2DShadowMap" )->Set( pShadowMapSRV ); Mutable and dynamic variables are bound via a new object called Shader Resource Binding (SRB), which is created by the pipeline state:
      m_pPSO->CreateShaderResourceBinding(&m_pSRB); Dynamic and mutable resources are then bound through SRB object:
      m_pSRB->GetVariable(SHADER_TYPE_VERTEX, "tex2DDiffuse")->Set(pDiffuseTexSRV); m_pSRB->GetVariable(SHADER_TYPE_VERTEX, "cbRandomAttribs")->Set(pRandomAttrsCB); The difference between mutable and dynamic resources is that mutable ones can only be set once for every instance of a shader resource binding. Dynamic resources can be set multiple times. It is important to properly set the variable type as this may affect performance. Static variables are generally most efficient, followed by mutable. Dynamic variables are most expensive from performance point of view. This post explains shader resource binding in more details.
      Setting the Pipeline State and Invoking Draw Command
      Before any draw command can be invoked, all required vertex and index buffers as well as the pipeline state should be bound to the device context:
      // Clear render target const float zero[4] = {0, 0, 0, 0}; m_pContext->ClearRenderTarget(nullptr, zero); // Set vertex and index buffers IBuffer *buffer[] = {m_pVertexBuffer}; Uint32 offsets[] = {0}; Uint32 strides[] = {sizeof(MyVertex)}; m_pContext->SetVertexBuffers(0, 1, buffer, strides, offsets, SET_VERTEX_BUFFERS_FLAG_RESET); m_pContext->SetIndexBuffer(m_pIndexBuffer, 0); m_pContext->SetPipelineState(m_pPSO); Also, all shader resources must be committed to the device context:
      m_pContext->CommitShaderResources(m_pSRB, COMMIT_SHADER_RESOURCES_FLAG_TRANSITION_RESOURCES); When all required states and resources are bound, IDeviceContext::Draw() can be used to execute draw command or IDeviceContext::DispatchCompute() can be used to execute compute command. Note that for a draw command, graphics pipeline must be bound, and for dispatch command, compute pipeline must be bound. Draw() takes DrawAttribs structure as an argument. The structure members define all attributes required to perform the command (primitive topology, number of vertices or indices, if draw call is indexed or not, if draw call is instanced or not, if draw call is indirect or not, etc.). For example:
      DrawAttribs attrs; attrs.IsIndexed = true; attrs.IndexType = VT_UINT16; attrs.NumIndices = 36; attrs.Topology = PRIMITIVE_TOPOLOGY_TRIANGLE_LIST; pContext->Draw(attrs); Tutorials and Samples
      The GitHub repository contains a number of tutorials and sample applications that demonstrate the API usage.
      Tutorial 01 - Hello Triangle This tutorial shows how to render a simple triangle using Diligent Engine API.   Tutorial 02 - Cube This tutorial demonstrates how to render an actual 3D object, a cube. It shows how to load shaders from files, create and use vertex, index and uniform buffers.   Tutorial 03 - Texturing This tutorial demonstrates how to apply a texture to a 3D object. It shows how to load a texture from file, create shader resource binding object and how to sample a texture in the shader.   Tutorial 04 - Instancing This tutorial demonstrates how to use instancing to render multiple copies of one object using unique transformation matrix for every copy.   Tutorial 05 - Texture Array This tutorial demonstrates how to combine instancing with texture arrays to use unique texture for every instance.   Tutorial 06 - Multithreading This tutorial shows how to generate command lists in parallel from multiple threads.   Tutorial 07 - Geometry Shader This tutorial shows how to use geometry shader to render smooth wireframe.   Tutorial 08 - Tessellation This tutorial shows how to use hardware tessellation to implement simple adaptive terrain rendering algorithm.  
      AntTweakBar sample demonstrates how to use AntTweakBar library to create simple user interface.

       
      Atmospheric scattering sample is a more advanced example. It demonstrates how Diligent Engine can be used to implement various rendering tasks: loading textures from files, using complex shaders, rendering to textures, using compute shaders and unordered access views, etc. 

       
      The repository includes Asteroids performance benchmark based on this demo developed by Intel. It renders 50,000 unique textured asteroids and lets compare performance of D3D11 and D3D12 implementations. Every asteroid is a combination of one of 1000 unique meshes and one of 10 unique textures. 

      Integration with Unity
      Diligent Engine supports integration with Unity through Unity low-level native plugin interface. The engine relies on Native API Interoperability to attach to the graphics API initialized by Unity. After Diligent Engine device and context are created, they can be used us usual to create resources and issue rendering commands. GhostCubePlugin shows an example how Diligent Engine can be used to render a ghost cube only visible as a reflection in a mirror.

       
    • By ChuckNovice
      Hello,
      I am fairly experienced with DX11 and recently gave myself the challenge to keep myself up to date with graphic APIs so I am learning DX12 in parallel.
       
      After compiling one of my old DX11 shader that was using interface / classes as a hack to fake function pointers I got a nice message in the ouput saying that DX12 doesn't support interfaces in shaders.
      Am I right to assume that all this class linkage system was pure candy from the DX11 drivers and that it was using multiple root signatures behind the scene to achieve this? I want to be sure before I switch everything around and start handling all that stuff manually.
       
      This is the interfaces / classes that I am talking about : https://msdn.microsoft.com/en-us/library/windows/desktop/ff471421(v=vs.85).aspx
       
      Thanks for your time
    • By zmic
      Hi programmers,
      I have this problem in DirectX 12.
      I have one thread that runs a compute shader in a loop. This compute shader generates a 2d image in a buffer (D3D12_HEAP_TYPE_DEFAULT, D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS). The buffer gets updated, say, a hundred times a second.
      In the main thread, I copy this buffer to the back buffer of the swap chain when WM_PAINT is called. WM_PAINT keeps getting called because I never do the BeginPaint/Endpaint. For this copy operation I use a graphical CommandQueue/CommandList. Here's the pseudo-code for this paint operation:
      ... reset CommandQueue/CommandList swapchain->GetBuffer(back_buffer) commandlist->CopyTextureRegion(back_buffer, 0, 0, 0, computed_buffer, nullptr); commandlist->ResourceBarrier( back_buffer, D3D12_RESOURCE_STATE_COPY_DEST, D3D12_RESOURCE_STATE_PRESENT)); ... execute CommandList and wait for finish using fence ... swapchain->Present(..) I use a good old CriticalSection to make sure the compute CommandList and the graphical CommandList don't run at the same time.
      When I start the program this runs fine in a normal window. I see the procedurally generated buffer animated in real time. However, when I switch to full screen (and resize the swapchain), nothing gets rendered. The screen stays black. When I leave fullscreen (again resize swapchain), same problem. The screen just stays black. Other than that the application runs stable. No directX warnings in the debug output, no nothing. I checked that the WM_PAINT messages keep coming and I checked that the compute thread keeps computing.
      Note that I don't do anything else with the graphical commandlist. I set no pipeline, or root signature because I have no 3d rendering to do. Can this be a problem?
      I suppose I could retrieve the computed buffer with a readback buffer and paint it with an ordinary GDI function, but that seems silly with the data already being on the GPU.
      EDIT: I ran the code on another PC and on there the window stays black right from the start. So the resizing doesn't seem to be the problem.
      Any ideas appreciated!
    • By CGEngine
      Hi,
      I'm reading https://software.intel.com/en-us/articles/sample-application-for-direct3d-12-flip-model-swap-chains to figure out the best way to setup my swapchain but I dont understand the following:
      1 - In "Classic Mode" on the link above, what causes the GPU to wait for the next Vsync before starting working on the next frame?

      (eg: orange bar on the second column doesn't start executing right after the previous blue bar)
      Its clear it has to wait because the new frame will render to the render target currently on screen, but is there an explicit wait one must do in code? Or does the driver force the wait? If so, how does the driver know? Does it check which RT is bound, so if I was rendering to GBuffer no wait would happen?
      2 - When Vsync is off, what does it mean that a frame is dropped and what causes it?
      Thanks!!
  • Advertisement