How could we benefit from using split barriers

7 comments, last by Adam Miles 8 years, 1 month ago

From the MS DX12 docs I found these barrier flags, D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY and D3D12_RESOURCE_BARRIER_FLAG_END_ONLY, which are described as:

D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY

This starts a barrier transition in a new state, putting a resource in a temporary no-access condition.

D3D12_RESOURCE_BARRIER_FLAG_END_ONLY

This barrier completes a transition, setting a new state and restoring active access to a resource.

Curious about the scenarios in which we could use them, I found this:

Example of split barriers

The following example shows how to use a split barrier to reduce pipeline stalls. The code that follows does not use split barriers:

D3D12_RESOURCE_BARRIER BarrierDesc = {};
BarrierDesc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
BarrierDesc.Transition.pResource = pResource;
BarrierDesc.Transition.Subresource = 0;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Write(pResource); // ... render to pResource
OtherStuff(); // .. other gpu work

// Transition pResource to PIXEL_SHADER_RESOURCE
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Read(pResource); // ... read from pResource

The following code uses split barriers:

D3D12_RESOURCE_BARRIER BarrierDesc = {};
BarrierDesc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
BarrierDesc.Transition.pResource = pResource;
BarrierDesc.Transition.Subresource = 0;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;

pCommandList->ResourceBarrier( 1, &BarrierDesc );

Write(pResource); // ... render to pResource

// Done writing to pResource. Start barrier to PIXEL_SHADER_RESOURCE and
// then do other work
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
BarrierDesc.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
BarrierDesc.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier( 1, &BarrierDesc );

OtherStuff(); // .. other gpu work

// Need to read from pResource so end barrier
BarrierDesc.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;

pCommandList->ResourceBarrier( 1, &BarrierDesc );
Read(pResource); // ... read from pResource

But I still don't quite get how split barriers can reduce pipeline stalls...

Any comments will be greatly appreciated


A link to the documentation would have sufficed.

The idea behind split barriers is that other GPU work can be done in parallel with the work required to execute the barrier. Rather than wait for the barrier to do its work and then proceed with later work, you could do both simultaneously and the time could potentially be less than if they'd been done serially.

For example, the act of transitioning a depth buffer from DEPTH_WRITE to PIXEL_SHADER_RESOURCE may require the depth block on the GPU to perform a decompression. This is likely a high bandwidth operation that ties up the depth block for the duration of the decompression operation. During this time you could execute Compute work on the 3D/Graphics command queue that has a high ALU cost while also having low bandwidth requirements. By parallelising one high bandwidth operation with another operation that requires very little the two can run at the same time and perhaps even hide the cost of one of those operations completely.

By issuing the transition as a split barrier you are basically telling the GPU that it can start the operation at that point in the timeline but that it isn't required to be finished until much later, giving it the chance to issue other work alongside it.
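Roughly, that pattern looks like the sketch below. It's a minimal sketch only: pDepthBuffer, pComputePSO and the dispatch dimensions are placeholders for whatever your frame actually does, not names from the API.

D3D12_RESOURCE_BARRIER Barrier = {};
Barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
Barrier.Transition.pResource = pDepthBuffer;
Barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
Barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
Barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

// Begin the transition as soon as the last depth write has been recorded.
Barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
pCommandList->ResourceBarrier(1, &Barrier);

// High-ALU, low-bandwidth work that doesn't touch pDepthBuffer can overlap
// with whatever the driver schedules for the transition (e.g. a decompression).
pCommandList->SetPipelineState(pComputePSO);
pCommandList->Dispatch(groupsX, groupsY, 1);

// End the transition just before the depth buffer is sampled.
Barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
pCommandList->ResourceBarrier(1, &Barrier);

// ... now record the pass that samples pDepthBuffer as an SRV.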

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

During this time you could execute Compute work on the 3D/Graphics command queue that has a high ALU cost while also having low bandwidth requirements. By parallelising one high bandwidth operation with another operation that requires very little the two can run at the same time and perhaps even hide the cost of one of those operations completely.

That situation of high-bandwidth plus high-ALU in parallel is obviously ideal, but these are beneficial even in more regular situations.

e.g. a GPU might be able to execute asynchronous cache control operations outside of the normal command flow. That would allow a programmer to request a cache flush/invalidation early, and then later, block on the async task right before it's actually required to be complete:

1) Render shadow #1 to Depth#1
2) Begin transitioning Depth#1 from depth-write to shader-read state
3) Render shadow #2 to Depth#2
4) Begin transitioning Depth#2 from depth-write to shader-read state
5) End transition Depth #1
6) Draw light #1 using Depth#1 as SRV
7) End transition Depth #2
8) Draw light #2 using Depth#2 as SRV

The cache operations might be something like:
2) Begin flushing the memory range from Depth[1].pointer to Depth[1].pointer + Depth[1].size out of the depth-write cache and into RAM.
4) Begin flushing the memory range from Depth[2].pointer to Depth[2].pointer + Depth[2].size out of the depth-write cache and into RAM.
5) Invalidate the address range from Step(2) within the general L2 cache. Block the GPU queue until the flush operation from Step(2) has completed (hopefully won't actually block as it was requested to do this some time ago now).
7) Invalidate the address range from Step(4) within the general L2 cache. Block the GPU queue until the flush operation from Step(4) has completed (hopefully won't actually block as it was requested to do this some time ago now).

This allows the GPU to be told when data needs to be evicted from caches and moved into RAM, but also allows it to do this work in the background and not cause any stalling while you wait for those writes to complete.

In a normal situation, steps 2/5 and 4/7 would occur simultaneously, which means the GPU likely would have to block, as the flush operation definitely won't be completed.
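In D3D12 terms that timeline is roughly the sketch below. RenderShadow/DrawLight and the pDepth1/pDepth2 resources are made-up stand-ins for your own passes; only the barrier calls are real API.

auto SplitTransition = [&](ID3D12Resource* pDepth, D3D12_RESOURCE_BARRIER_FLAGS flag)
{
    // The begin and end halves must describe the same resource, subresource and
    // state pair; only the flag differs between the two calls.
    D3D12_RESOURCE_BARRIER b = {};
    b.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    b.Flags = flag;
    b.Transition.pResource = pDepth;
    b.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    b.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
    b.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    pCommandList->ResourceBarrier(1, &b);
};

RenderShadow(pDepth1);                                             // 1
SplitTransition(pDepth1, D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY);  // 2
RenderShadow(pDepth2);                                             // 3
SplitTransition(pDepth2, D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY);  // 4
SplitTransition(pDepth1, D3D12_RESOURCE_BARRIER_FLAG_END_ONLY);    // 5
DrawLight(pDepth1);                                                // 6
SplitTransition(pDepth2, D3D12_RESOURCE_BARRIER_FLAG_END_ONLY);    // 7
DrawLight(pDepth2);                                                // 8

Whether the driver actually turns that into an early cache flush like the one described above is up to the hardware and driver, but the split form gives it the window to do so.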

Except that decompressing a depth buffer is often a more complicated operation than simply issuing a cache flush. GCN stores depth data not as individual values but as plane equations for entire tiles of pixels. In this form the data can't be read like a texture by the texture sampling hardware, so the depth hardware is responsible for reading and writing a significant amount of data in order to turn the plane equations into individual samples that can be filtered. All the time the depth block is doing this work, it's not going to be possible to do depth testing on another surface as you suggested.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Except that decompressing depth buffer...

I didn't mention decompressing depth buffers...
I was giving a different example, where this API allows GPU vendors to exploit internal concurrency in other ways - a hypothetical async cache flush request and wait-for-completion feature.

On that topic though, in console land where the GPU arch is known and the API doesn't get in the way, the programmer could always perform the coarse-tiled plane-equation to fine-pixel depth decompression in an async compute job instead of using the built-in decompressor, to rearrange their critical path ;)

But the act of transitioning a depth buffer from Depth-Write to Shader-Read requires a decompression, which is what you said was happening in 2) and 4). Details are pretty scarce on what green team and blue team do in terms of compressing depth buffers during rendering, but it wouldn't surprise me if they employ a similar solution to GCN.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Thanks for your replies. So can I say that a split barrier is only worth trying when we have an expensive resource transition like DEPTH_WRITE to PIXEL_SHADER_RESOURCE, as you mentioned? While for most resource state transitions that don't involve compression or decompression, like COPY_DEST to GENERIC_READ, it's better not to use a split barrier?

But the act of transitioning a depth buffer from Depth-Write to Shader-Read requires a decompression, which is what you said was happening in 2) and 4).

Why are you argumentatively focusing on the details of a specific GPU and not the idea behind the API in general?

I didn't say anything about decompression - you're talking about decompression. I was talking about the ability to begin flushing address ranges, and waiting for completion of the flush.

Pretend it's an uncompressed colour buffer if that helps you... or an uncompressed depth buffer (those exist on GCN too, if you can't imagine working with architectures other than the Xbone)...

So can I say that a split barrier is only worth trying when we have an expensive resource transition like DEPTH_WRITE to PIXEL_SHADER_RESOURCE, as you mentioned? While for most resource state transitions that don't involve compression or decompression, like COPY_DEST to GENERIC_READ, it's better not to use a split barrier?

If you know that, for your particular target GPU, a barrier is not "expensive" and doesn't involve some workload that can be done asynchronously, then sure, just use a normal barrier in those cases. But the point of PC GPU APIs is that they work on a wide variety of GPUs :)

Split barriers are good in general when you know early on that a transition is required, but the resource isn't actually going to be used in the new state for some time. You can begin the transition immediately after the data is produced, and end the transition immediately before it is consumed - giving the driver a bigger window in which it can choose to place the work involved in transitioning the resource. They give you an opportunity to pass this knowledge on to the driver, so that maybe the driver can arrange the work more optimally, if the GPU is capable of it.
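e.g. even a plain copy-to-read transition can be expressed this way when the copy happens early in the frame and the read happens late. A rough sketch, with CopyUpload, RenderRestOfFrame and DrawUI as hypothetical stand-ins for your own passes:

D3D12_RESOURCE_BARRIER b = {};
b.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
b.Transition.pResource = pTexture;
b.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
b.Transition.StateBefore = D3D12_RESOURCE_STATE_COPY_DEST;
b.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

CopyUpload(pTexture);                              // producer: fill the texture

b.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;  // begin right after the data is produced
pCommandList->ResourceBarrier(1, &b);

RenderRestOfFrame();                               // unrelated work in between

b.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;    // end right before the data is consumed
pCommandList->ResourceBarrier(1, &b);

DrawUI(pTexture);                                  // consumer: sample the texture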

A lot of the time your data dependencies are not spaced this far apart in time, so it's not possible to effectively make use of a split barrier -- e.g.
1) Render shadow map
1.5) Transition from depth-write to shader-read
2) Use shadow map to draw a light
^^ In that case, the transition has to occur immediately between #1 and #2. Split barriers are only useful when the producer pass and the consumer pass are spaced far apart in time / when there's other work being done in between.

I'm not interested in getting into an argument with you. My original explanation and example were chosen with no mention of any particular GPU, and it's one with a high chance of having the two mutually complementary workloads run in parallel. You happened to pick an example that could quite reasonably be asking one hardware block on the GPU, irrespective of who makes it, to perform two different functions at once, so I felt it worth pointing out.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

This topic is closed to new replies.
