Mr_Fox

DX12: How could we benefit from using split barriers?


Recommended Posts

From the MS DX12 docs, I found these barrier flags: D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY and D3D12_RESOURCE_BARRIER_FLAG_END_ONLY. They are described as:

 
D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY

This starts a barrier transition in a new state, putting a resource in a temporary no-access condition.

D3D12_RESOURCE_BARRIER_FLAG_END_ONLY

This barrier completes a transition, setting a new state and restoring active access to a resource.

Curious about the scenarios in which these flags could be used, I found this example in the docs:
Example of split barriers

The following example shows how to use a split barrier to reduce pipeline stalls. The code that follows does not use split barriers:

 
 
D3D12_RESOURCE_BARRIER BarrierDesc = {};
BarrierDesc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
BarrierDesc.Transition.pResource = pResource;
BarrierDesc.Transition.Subresource = 0;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Write(pResource); // ... render to pResource
OtherStuff(); // .. other gpu work

// Transition pResource to PIXEL_SHADER_RESOURCE
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Read(pResource); // ... read from pResource
 

The following code uses split barriers:

 
 
D3D12_RESOURCE_BARRIER BarrierDesc = {};
BarrierDesc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
BarrierDesc.Transition.pResource = pResource;
BarrierDesc.Transition.Subresource = 0;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Write(pResource); // ... render to pResource

// Done writing to pResource. Start barrier to PIXEL_SHADER_RESOURCE and
// then do other work
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier( 1, &BarrierDesc );

OtherStuff(); // .. other gpu work

// Need to read from pResource so end barrier
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;

pCommandList->ResourceBarrier( 1, &BarrierDesc );
Read(pResource); // ... read from pResource
 

But I still don't quite get how split barriers could reduce pipeline stalls.

 

Any comments will be greatly appreciated

 


During this time (between the begin and end halves of the split barrier) you could execute compute work on the 3D/graphics command queue that has a high ALU cost but low bandwidth requirements. By parallelising one high-bandwidth operation with another that requires very little bandwidth, the two can run at the same time, perhaps even hiding the cost of one of those operations completely.

That situation of high-bandwidth plus high-ALU in parallel is obviously ideal, but split barriers are beneficial even in more regular situations.
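
For concreteness, here's a rough sketch of that idea in D3D12 terms: ALU-heavy compute work recorded between the two halves of a split barrier on the same graphics command list. The compute PSO, root signature and dispatch size are placeholders (root arguments/descriptors for the compute pass are omitted), so treat it as an illustration rather than working code:

D3D12_RESOURCE_BARRIER Split = {};
Split.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
Split.Transition.pResource = pResource;
Split.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
Split.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
Split.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

// Begin the transition as soon as the last write to pResource has been recorded.
Split.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
pCommandList->ResourceBarrier(1, &Split);

// ALU-heavy, low-bandwidth compute work that does not touch pResource,
// recorded on the same 3D/graphics command list (placeholder PSO/root signature).
pCommandList->SetPipelineState(pComputePSO);
pCommandList->SetComputeRootSignature(pComputeRootSig);
pCommandList->Dispatch(threadGroupsX, threadGroupsY, 1);

// End the transition right before the first read of pResource in its new state.
Split.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
pCommandList->ResourceBarrier(1, &Split);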

 

e.g. a GPU might be able to execute asynchronous cache-control operations outside of the normal command flow. That would allow a programmer to request a cache flush/invalidation early and then, later, block on the async task right before it's actually required to be complete (a rough D3D12-style sketch of this flow follows the two lists below):

1) Render shadow #1 to Depth#1
2) Begin transitioning Depth#1 from depth-write to shader-read state
3) Render shadow #2 to Depth#2
4) Begin transitioning Depth#2 from depth-write to shader-read state
5) End transition Depth #1
6) Draw light #1 using Depth#1 as SRV
7) End transition Depth #2
8) Draw light #2 using Depth#2 as SRV

The cache operations might be something like:
2) Begin flushing the memory range from Depth[1].pointer to Depth[1].pointer + Depth[1].size out of the depth-write cache and into RAM.
4) Begin flushing the memory range from Depth[2].pointer to Depth[2].pointer + Depth[2].size out of the depth-write cache and into RAM.
5) Invalidate the address range from Step (2) within the general L2 cache. Block the GPU queue until the flush operation from Step (2) has completed (hopefully this won't actually block, as the flush was requested some time ago).
7) Invalidate the address range from Step (4) within the general L2 cache. Block the GPU queue until the flush operation from Step (4) has completed (hopefully this won't actually block, as the flush was requested some time ago).
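
Here's a rough D3D12-style sketch of steps 1-8 above, assuming both depth buffers start in the DEPTH_WRITE state; RenderShadow(), DrawLight(), pDepth1 and pDepth2 are made-up placeholders for the actual shadow and lighting passes:

// Each depth buffer's DEPTH_WRITE -> PIXEL_SHADER_RESOURCE transition is split:
// the "begin" half is issued right after its shadow pass, the "end" half right
// before the light pass that samples it.
auto MakeTransition = [](ID3D12Resource* pRes, D3D12_RESOURCE_BARRIER_FLAGS flags)
{
    D3D12_RESOURCE_BARRIER b = {};
    b.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    b.Flags = flags;
    b.Transition.pResource = pRes;
    b.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    b.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
    b.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    return b;
};

RenderShadow(pCommandList, pDepth1);        // 1) render shadow #1
D3D12_RESOURCE_BARRIER begin1 = MakeTransition(pDepth1, D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY);
pCommandList->ResourceBarrier(1, &begin1);  // 2) begin transition of Depth#1

RenderShadow(pCommandList, pDepth2);        // 3) render shadow #2
D3D12_RESOURCE_BARRIER begin2 = MakeTransition(pDepth2, D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY);
pCommandList->ResourceBarrier(1, &begin2);  // 4) begin transition of Depth#2

D3D12_RESOURCE_BARRIER end1 = MakeTransition(pDepth1, D3D12_RESOURCE_BARRIER_FLAG_END_ONLY);
pCommandList->ResourceBarrier(1, &end1);    // 5) end transition of Depth#1
DrawLight(pCommandList, pDepth1);           // 6) draw light #1 using Depth#1 as SRV

D3D12_RESOURCE_BARRIER end2 = MakeTransition(pDepth2, D3D12_RESOURCE_BARRIER_FLAG_END_ONLY);
pCommandList->ResourceBarrier(1, &end2);    // 7) end transition of Depth#2
DrawLight(pCommandList, pDepth2);           // 8) draw light #2 using Depth#2 as SRV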

 

This allows the GPU to be told when data needs to be evicted from caches and moved into RAM, but also allows it to do this work in the background and not cause any stalling while you wait for those writes to complete.

With a normal (non-split) barrier, steps 2 and 5 would occur back to back at a single point in time (and likewise 4 and 7), which means the GPU would likely have to block, as the flush operation definitely won't have completed yet.


Except that decompressing a depth buffer is often a more complicated operation than simply issuing a cache flush. GCN stores depth data not as individual values but as plane equations for entire tiles of pixels. In this form the data can't be read like a texture by the texture-sampling hardware, so the depth hardware is responsible for reading and writing a significant amount of data in order to turn the plane equations into individual samples that can be filtered. All the time the depth block is doing this work, it's not going to be possible to do depth testing on another surface as you suggested.


Except that decompressing depth buffer...

I didn't mention decompressing depth buffers...
I was giving a different example, where this API allows GPU vendors to exploit internal concurrency in other ways - a hypothetical async cache flush request and wait-for-completion feature.

 

On that topic though, in console land, where the GPU arch is known and the API doesn't get in the way, the programmer could always perform the coarse-tiled plane-equation to fine-pixel depth decompression in an async compute job instead of using the built-in decompressor, to rearrange their critical path.


But the act of transitioning a depth buffer from Depth-Write to Shader-Read requires a decompression, which is what you said was happening in 2) and 4). Details are pretty scarce on what green team and blue team do in terms of compressing depth buffers during rendering, but it wouldn't surprise me if they employ a similar solution to GCN.

Edited by Adam Miles


Thanks for your replies. So can I say that a split barrier is only worth trying when we have an expensive resource transition, like DEPTH_WRITE to PIXEL_SHADER_RESOURCE, as you mentioned? And that for most resource state transitions which don't involve compression or decompression, like COPY_DEST to GENERIC_READ, it's better not to use a split barrier?


But the act of transitioning a depth buffer from Depth-Write to Shader-Read requires a decompression, which is what you said was happening in 2) and 4).

Why are you argumentatively focusing on the details of a specific GPU and not on the idea behind the API in general?

I didn't say anything about decompression - you're talking about decompression. I was talking about the ability to begin flushing address ranges, and waiting for completion of the flush.

Pretend it's an uncompressed colour buffer if that helps you... or an uncompressed depth buffer (those exist on GCN too, if you can't imagine working with architectures other than the Xbone)...
 

So can I say that a split barrier is only worth trying when we have an expensive resource transition, like DEPTH_WRITE to PIXEL_SHADER_RESOURCE, as you mentioned? And that for most resource state transitions which don't involve compression or decompression, like COPY_DEST to GENERIC_READ, it's better not to use a split barrier?

If you know that, for your particular target GPU, a barrier is not "expensive" and doesn't involve some workload that can be done asynchronously, then sure, just use a normal barrier in those cases. But the point of PC GPU APIs is that they work on a wide variety of GPUs.
 
Split barriers are good in general when you know early that a transition is required, but you also know that the resource isn't actually going to be used in its new state for some time. You can begin the transition immediately after the data is produced and end the transition immediately before it is consumed, giving the driver a bigger window in which it can choose to place the work involved in transitioning the resource. They give you an opportunity to pass this knowledge on to the driver, so that maybe the driver can arrange the work more optimally, if the GPU is capable of it.
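
In practice that pattern often ends up wrapped in a pair of small helpers around ResourceBarrier; something like the sketch below (BeginTransition/EndTransition are made-up names, not part of the D3D12 API):

// Hypothetical helpers: call BeginTransition() right after the last write to a
// resource and EndTransition() right before its first read in the new state,
// with any unrelated GPU work recorded in between. Both halves must describe
// the same before/after states.
void BeginTransition(ID3D12GraphicsCommandList* pList, ID3D12Resource* pRes,
                     D3D12_RESOURCE_STATES before, D3D12_RESOURCE_STATES after)
{
    D3D12_RESOURCE_BARRIER b = {};
    b.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    b.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
    b.Transition.pResource = pRes;
    b.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    b.Transition.StateBefore = before;
    b.Transition.StateAfter = after;
    pList->ResourceBarrier(1, &b);
}

void EndTransition(ID3D12GraphicsCommandList* pList, ID3D12Resource* pRes,
                   D3D12_RESOURCE_STATES before, D3D12_RESOURCE_STATES after)
{
    D3D12_RESOURCE_BARRIER b = {};
    b.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    b.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
    b.Transition.pResource = pRes;
    b.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    b.Transition.StateBefore = before;
    b.Transition.StateAfter = after;
    pList->ResourceBarrier(1, &b);
}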

A lot of the time your data dependencies are not spaced this far apart in time, so it's not possible to make effective use of a split barrier -- e.g.
1) Render shadow map
1.5) Transition from depth-write to shader-read
2) Use shadow map to draw a light
^^ In that case, the transition has to occur immediately between #1 and #2. Split barriers are only useful when the producer pass and the consumer pass are spaced far apart in time / when there's other work being done in between.

Edited by Hodgman


I'm not interested in getting into an argument with you. My original explanation and example were chosen with no mention of any particular GPU, and it's one with a high chance of having the two mutually complementary workloads run in parallel. You happened to pick an example that could quite reasonably be asking one hardware block on the GPU, irrespective of who makes it, to perform two different functions at once, so I felt it worth pointing out.
