
DX12 How could we benefit from using split barriers


Recommended Posts

From the MS DX12 docs I found these barrier flags: D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY and D3D12_RESOURCE_BARRIER_FLAG_END_ONLY. They are described as:

 
D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY

This starts a barrier transition in a new state, putting a resource in a temporary no-access condition.

D3D12_RESOURCE_BARRIER_FLAG_END_ONLY

This barrier completes a transition, setting a new state and restoring active access to a resource.

Curious about the scenarios in which we could use them, I found this example:

   

Example of split barriers

The following example shows how to use a split barrier to reduce pipeline stalls. The code that follows does not use split barriers:

 
 
D3D12_RESOURCE_BARRIER BarrierDesc = {};
BarrierDesc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
BarrierDesc.Transition.pResource = pResource;
BarrierDesc.Transition.Subresource = 0;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Write(pResource); // ... render to pResource
OtherStuff(); // .. other gpu work

// Transition pResource to PIXEL_SHADER_RESOURCE
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Read(pResource); // ... read from pResource
 

The following code uses split barriers:

 
 
D3D12_RESOURCE_BARRIER BarrierDesc = {};
BarrierDesc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
BarrierDesc.Transition.pResource = pResource;
BarrierDesc.Transition.Subresource = 0;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Write(pResource); // ... render to pResource

// Done writing to pResource. Start barrier to PIXEL_SHADER_RESOURCE and
// then do other work
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier( 1, &BarrierDesc );

OtherStuff(); // .. other gpu work

// Need to read from pResource so end barrier
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;

pCommandList->ResourceBarrier( 1, &BarrierDesc );
Read(pResource); // ... read from pResource
 

But I still don't quite get how split barriers could reduce pipeline stalls...

 

Any comments will be greatly appreciated

 


During this time (between the begin and end halves of the split barrier) you could execute compute work on the 3D/graphics command queue that has a high ALU cost while also having low bandwidth requirements. By parallelising one high-bandwidth operation with another operation that needs very little bandwidth, the two can run at the same time and perhaps even hide the cost of one of those operations completely.

That situation of high-bandwidth plus high-ALU in parallel is obviously ideal, but these are beneficial even in more regular situations.
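A minimal sketch of that pattern (the pRenderTarget resource, pComputePSO, pComputeRS and the DrawFullscreenPassReading helper are hypothetical names, not from the docs above): ALU-heavy compute is recorded between the two halves of the split barrier, so the transition work can overlap with it.

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource = pRenderTarget;                        // hypothetical: just written as a render target
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

// Begin the transition; nothing has to wait for it yet.
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
pCommandList->ResourceBarrier( 1, &barrier );

// ALU-heavy, low-bandwidth work that does not touch pRenderTarget.
pCommandList->SetPipelineState( pComputePSO );                       // hypothetical compute PSO
pCommandList->SetComputeRootSignature( pComputeRS );                 // hypothetical root signature
pCommandList->Dispatch( 256, 256, 1 );

// End the transition right before the resource is actually read.
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
pCommandList->ResourceBarrier( 1, &barrier );

DrawFullscreenPassReading( pCommandList, pRenderTarget );            // hypothetical: consume as SRV

The driver is free to do the bulk of the transition work while the Dispatch is in flight, since nothing between the two halves depends on pRenderTarget.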

 

e.g. a GPU might be able to execute asynchronous cache control operations outside of the normal command flow. That would allow a programmer to request a cache flush/invalidation early, and then later, block on the async task right before it's actually required to be complete:

1) Render shadow #1 to Depth#1
2) Begin transitioning Depth#1 from depth-write to shader-read state
3) Render shadow #2 to Depth#2
4) Begin transitioning Depth#2 from depth-write to shader-read state
5) End transition Depth #1
6) Draw light #1 using Depth#1 as SRV
7) End transition Depth #2
8) Draw light #2 using Depth#2 as SRV

The cache operations might be something like:
2) Begin flushing the memory range from Depth[1].pointer to Depth[1].pointer + Depth[1].size out of the depth-write cache and into RAM.
4) Begin flushing the memory range from Depth[2].pointer to Depth[2].pointer + Depth[2].size out of the depth-write cache and into RAM.
5) Invalidate the address range from Step(2) within the general L2 cache. Block the GPU queue until the flush operation from Step(2) has completed (hopefully won't actually block as it was requested to do this some time ago now).
7) Invalidate the address range from Step(4) within the general L2 cache. Block the GPU queue until the flush operation from Step(4) has completed (hopefully won't actually block as it was requested to do this some time ago now).
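Expressed as D3D12 split barriers, steps 1-8 might be recorded roughly like this (pDepth1/pDepth2 and the RenderShadow/DrawLight helpers are hypothetical stand-ins, just to show where the barrier halves go):

D3D12_RESOURCE_BARRIER depth1 = {};
depth1.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
depth1.Transition.pResource = pDepth1;                               // hypothetical depth target #1
depth1.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
depth1.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
depth1.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
D3D12_RESOURCE_BARRIER depth2 = depth1;
depth2.Transition.pResource = pDepth2;                               // hypothetical depth target #2

RenderShadow( pCommandList, 1 );                                     // 1) render shadow #1 to Depth#1
depth1.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
pCommandList->ResourceBarrier( 1, &depth1 );                         // 2) begin transitioning Depth#1

RenderShadow( pCommandList, 2 );                                     // 3) render shadow #2 to Depth#2
depth2.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
pCommandList->ResourceBarrier( 1, &depth2 );                         // 4) begin transitioning Depth#2

depth1.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
pCommandList->ResourceBarrier( 1, &depth1 );                         // 5) end transition of Depth#1
DrawLight( pCommandList, 1 );                                        // 6) draw light #1 using Depth#1 as SRV

depth2.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
pCommandList->ResourceBarrier( 1, &depth2 );                         // 7) end transition of Depth#2
DrawLight( pCommandList, 2 );                                        // 8) draw light #2 using Depth#2 as SRV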

 

This allows the GPU to be told when data needs to be evicted from caches and moved into RAM, but also allows it to do this work in the background and not cause any stalling while you wait for those writes to complete.

With a normal (non-split) barrier, steps 2/5 and 4/7 would each collapse into a single point in time, which means the GPU would likely have to block, as the flush operation definitely won't have completed yet.


Except that decompressing a depth buffer is often a more complicated operation than simply issuing a cache flush. GCN stores depth data not as individual values but as plane equations for entire tiles of pixels. In this form the data can't be read like a texture by the texture sampling hardware, so the depth hardware is responsible for reading and writing a significant amount of data in order to turn the data from plane equations into individual samples that can be filtered. All the time the depth block is doing this work, it's not going to be possible to do depth testing on another surface as you suggested.


Except that decompressing a depth buffer...

I didn't mention decompressing depth buffers...
I was giving a different example, where this API allows GPU vendors to exploit internal concurrency in other ways - a hypothetical async cache flush request and wait-for-completion feature.

 

On that topic though, in console land where the GPU arch is known and the API doesn't get in the way, the programmer could always perform the coarse-tiled plane-equation to per-pixel depth decompression in an async compute job, instead of using the built-in decompressor, to rearrange their critical path ;)


But the act of transitioning a depth buffer from Depth-Write to Shader-Read requires a decompression, which is what you said was happening in 2) and 4). Details are pretty scarce on what green team and blue team do in terms of compressing depth buffers during rendering, but it wouldn't surprise me if they employ a similar solution to GCN.

Edited by Adam Miles


Thanks for your replies. So can I say that a split barrier is only worth trying when we have an expensive resource transition, like DEPTH_WRITE to PIXEL_SHADER_RESOURCE as you mentioned, while for most resource state transitions that don't involve compression or decompression, like COPY_DEST to GENERIC_READ, it's better not to use a split barrier?


But the act of transitioning a depth buffer from Depth-Write to Shader-Read requires a decompression, which is what you said was happening in 2) and 4).

Why are you argumentatively focusing on the details of a specific GPU and not the idea behind the API in general?

I didn't say anything about decompression - you're talking about decompression. I was talking about the ability to begin flushing address ranges, and waiting for completion of the flush.

Pretend it's an uncompressed colour buffer if that helps you... or an uncompressed depth buffer (those exist on GCN too, if you can't imagine working with architectures other than the Xbone)...
 

So can I say that a split barrier is only worth trying when we have an expensive resource transition, like DEPTH_WRITE to PIXEL_SHADER_RESOURCE as you mentioned, while for most resource state transitions that don't involve compression or decompression, like COPY_DEST to GENERIC_READ, it's better not to use a split barrier?

If you know that for your particular target GPU a barrier is not "expensive" and doesn't involve some workload that can be done asynchronously, then sure, just use a normal barrier in those cases. But the point of PC GPU APIs is that they work on a wide variety of GPUs :)
 
Split barriers are good in general when you have the knowledge that a transition is required early, but you know that the resource is not actually going to be used in the new state for some time. You can begin the transition immediately after the data is produced, and end the transition immediately before it is consumed - giving the driver a bigger window where it can choose to place the work involved in transitioning the resource. They give you an opportunity to pass this knowledge onto the driver, so that maybe the driver can arrange the work more optimally, if the GPU is capable of it.

A lot of the time your data dependencies are not spaced this far apart in time, so it's not possible to effectively make use of a split barrier -- e.g.
1) Render shadow map

1.5) Transition from depth-write to shader-read
2) Use shadow map to draw a light 
^^ In that case, the transition has to occur immediately between #1 and #2. Split barriers are only useful when the producer pass and the consumer pass are spaced far apart in time / when there's other work being done in between.
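For contrast, a minimal sketch of that adjacent case (pShadowMap and the RenderShadowMap/DrawLight helpers are hypothetical names): a plain, non-split barrier is recorded right between producer and consumer, because there's no other work to overlap with anyway.

RenderShadowMap( pCommandList, pShadowMap );                 // 1) render shadow map

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;            // plain barrier: begin and end in one call
barrier.Transition.pResource = pShadowMap;                   // hypothetical shadow-map resource
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier( 1, &barrier );                // 1.5) transition depth-write -> shader-read

DrawLight( pCommandList, pShadowMap );                       // 2) use shadow map to draw a light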

Edited by Hodgman


I'm not interested in getting into an argument with you. My original explanation and example were chosen with no mention of any particular GPU, and the example is one with a high chance of having two mutually complementary workloads run in parallel. You happened to pick an example that could quite reasonably be asking one hardware block on the GPU, irrespective of who makes it, to perform two different functions at once, so I felt it worth pointing out.
