NikiTo

DX12 Multiple AppendStructuredBuffers


In my shader code, I need to determine to which AppendStructuredBuffer the data should be appended, and there are more than 30 AppendStructuredBuffers.
Is declaring 30+ AppendStructuredBuffers going to overload the shader? The buffer descriptors should consume SGPRs.

Is there some other way to distribute the output over multiple AppendStructuredBuffers?

Is emulating the push/pop functionality with a single byte address buffer worth it? Wouldn't it be much slower than using AppendStructuredBuffer?


Append buffers have no special hardware behind them. It's a typical Microsoft invention to over-specify simple things. Other APIs don't have them, nor 'structured' buffers or other confusing distinctions - as a Khronos guy, it is all just plain memory to me.

So what I do is this: allocate one large buffer with space for all 15 lists I generate in one shader, with the first 15 numbers in the buffer being the counters; use some defines to ease the indexing; and finally increment a list's counter with an atomic add and use the result to write the data into the buffer. Of course you can use the buffer for other, unrelated int data as well.

So you don't need to worry and can do the same (notice that the shader only needs one binding to address all lists and counters, so there might even be an advantage).
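A minimal HLSL sketch of that single-buffer layout (the list count, per-list capacity, and names here are illustrative assumptions, not taken from the post above):

#define NUM_LISTS     15          // number of lists packed into the one buffer
#define COUNTER_BASE  0           // the first NUM_LISTS uints are the counters
#define LIST_CAPACITY 4096        // assumed per-list capacity, in elements
#define DATA_BASE     NUM_LISTS   // element data starts right after the counters

// one big UAV holding all counters and all list data
RWStructuredBuffer<uint> gLists : register(u0);

// offset (in uints) of element 'index' of list 'listId'
uint ElementOffset(uint listId, uint index)
{
    return DATA_BASE + listId * LIST_CAPACITY + index;
}

// append one uint to the chosen list
void AppendToList(uint listId, uint value)
{
    uint index;
    InterlockedAdd(gLists[COUNTER_BASE + listId], 1, index); // bump this list's counter
    gLists[ElementOffset(listId, index)] = value;            // write at the reserved slot
}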

 

Related: It is often a big win if you write your lists in short sequences first to LDS (using atomics to LDS as well), and when the sequence is full, write it to global memory in one batch. This minimizes slow atomics and avoids scattered writes to global memory. I can make an example if you don't know what I mean...
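A hedged compute-shader sketch of that LDS batching idea, assuming a 64-thread group, a single output list, and at most one item per thread (all names and the 'wantsOutput' test are placeholders):

#define GROUP_SIZE 64

RWStructuredBuffer<uint> gOutput  : register(u0); // the list data
RWStructuredBuffer<uint> gCounter : register(u1); // one global counter in element 0

groupshared uint sCount;              // items collected by this group
groupshared uint sItems[GROUP_SIZE];  // the group's staging area in LDS
groupshared uint sBase;               // where this group's batch starts in gOutput

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gtid : SV_GroupIndex)
{
    if (gtid == 0)
        sCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // each thread decides whether it has something to output (placeholder test)
    uint value = dtid.x;
    bool wantsOutput = (value % 3) == 0;

    if (wantsOutput)
    {
        uint slot;
        InterlockedAdd(sCount, 1, slot); // cheap LDS atomic
        sItems[slot] = value;
    }
    GroupMemoryBarrierWithGroupSync();

    // one global atomic per group reserves space for the whole batch
    if (gtid == 0)
    {
        uint base;
        InterlockedAdd(gCounter[0], sCount, base);
        sBase = base;
    }
    GroupMemoryBarrierWithGroupSync();

    // write the batch to global memory as one contiguous run
    if (gtid < sCount)
        gOutput[sBase + gtid] = sItems[gtid];
}

For brevity this flushes once per group instead of every time the LDS sequence fills, but the idea is the same: many cheap LDS atomics, one global atomic, and a contiguous write.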

We should also talk about using structs as in C, and why you mostly avoid this on GPU and prefer SoA, if you don't know...
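For context, a tiny illustration of the AoS-vs-SoA distinction being referred to (hypothetical names, nothing from the thread itself):

// AoS: one struct per element, fields interleaved in memory
struct Particle { float3 position; float radius; };
RWStructuredBuffer<Particle> gParticlesAoS : register(u0);

// SoA: one buffer per field, so a pass that only touches positions
// reads contiguous data and wastes no bandwidth on the radii
RWStructuredBuffer<float3> gPositions : register(u1);
RWStructuredBuffer<float>  gRadii     : register(u2);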

42 minutes ago, NikiTo said:

Is emulating the push/pop functionality with a single byte address buffer worth it? Wouldn't it be much slower than using AppendStructuredBuffer?

Probably it would be much slower, because you would need to read the data before you write, and even create some lock to guarantee preserving the other bytes. But you can do this stuff in LDS, as hinted above.


I was thinking of doing the very same with byte address buffers in case the AppendStructuredBuffer option fails.

It looks very slow, but the next step should compensate for it. If I don't use lists, the next shader will have to read through the whole buffer, testing whether each offset has valid data.

Plus, making my own push/pop logic gives me more freedom. I could create a separate list just for data shared between the lists. It is just that I expected AppendStructuredBuffer to offer me hardware acceleration over my own push/pop code.
 

40 minutes ago, JoeJ said:

Related: It is often a big win if you write your lists in short sequences first to LDS (using atomics to LDS as well),

Good to know!! I will use this approach for my next shader pass. In this one the LSD is full.
 

23 minutes ago, JoeJ said:

I can make an example if you don't know what I mean...

It is not needed, I understand it.
 

26 minutes ago, JoeJ said:

We should also talk about using structs as in C, and why you mostly avoid this on GPU and prefer SoA, if you don't know...

I googled it. My whole struct is 16 bytes - it is 4 floats, and I need to read them all together, so I think it will not be a problem.

Thanks!

16 minutes ago, NikiTo said:

In this one the LSD is full.

I still have connections to Albert Hofmann in case you run out of it, just PM me... ;P

 

 

18 minutes ago, NikiTo said:

I was thinking of doing the very same with byte address buffers in case the AppendStructuredBuffer option fails.

 

19 minutes ago, NikiTo said:

I expected AppendStructuredBuffer to offer me hardware acceleration over my own push/pop code.

GPUs can't address individual bytes, so byte-granular writes are very expensive under the hood, and there is no hardware backing for append buffers.

(I repeat this so anybody can correct me if I'm wrong.)


In D3D11 it was possible for append buffers to have their counter stored in special memory (e.g. GDS) if the hardware had it. Performing atomic operations on GDS is much faster than having the count stored in main memory and atomically incremented/decremented there.

In D3D12 where it's all "just memory" and the counter is a separate resource in main memory this optimisation no longer applies.

If AppendBuffers weren't codified as being in some way 'special', I can imagine it would have been more difficult to allow the IHVs to handle the counters in whatever magic ways they might have wanted.

1 hour ago, ajmiles said:

In D3D11 it was possible for append buffers to have their counter stored in special memory (e.g. GDS) if the hardware had it. Performing atomic operations on GDS is much faster than having the count stored in main memory and atomically incremented/decremented there.

In D3D12 where it's all "just memory" and the counter is a separate resource in main memory this optimisation no longer applies.

If AppendBuffers weren't codified as being in some way 'special', I can imagine it would have been more difficult to allow the IHVs to handle the counters in whatever magic ways they might have wanted.

On AMD, I've measured a *massive* performance difference in D3D11 between manually implementing append buffers (putting the counter in a normal UAV and incrementing it with atomic shader instructions), and using D3D's magic counter abstraction. On NVidia I didn't notice a difference.

To me, this hints that (on the two specific devices that I compared):

* the NV one was very capable of dealing with highly contended atomic increments to memory, while the AMD one struggled here. 

* the counter abstraction allows AMD to utilise their GDS to completely make up for this deficiency. 
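For clarity, the 'manual' variant in that comparison is essentially the following pattern (a hedged sketch with assumed names, not the actual code that was measured):

// native abstraction: the hidden counter is managed by the runtime/driver
AppendStructuredBuffer<uint> gNativeList : register(u0);
// usage: gNativeList.Append(value);

// manual emulation: the counter is just element 0 of an ordinary UAV
RWStructuredBuffer<uint> gCounter : register(u1);
RWStructuredBuffer<uint> gData    : register(u2);

void ManualAppend(uint value)
{
    uint index;
    InterlockedAdd(gCounter[0], 1, index); // highly contended global atomic
    gData[index] = value;
}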


On the other hand, D3D12 brings with it Shader Model 6 and the new 'Wave Intrinsics' which allow for wave-level reductions in the number of atomic operations (32x less on NVIDIA, 64x less on AMD). By balloting the wave on the number of threads that wish to increment a value you can have a single thread perform a single InterlockedAdd on behalf of all threads. In my experience so far (on AMD hardware) this 64-fold reduction in the number of atomic operations to main memory more than makes up for the fact that counters can no longer live in GDS, so we're in a better place than we were on D3D11!
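A hedged SM6 sketch of that ballot-then-single-add pattern (the buffer names are assumptions):

RWStructuredBuffer<uint> gCounter : register(u0);
RWStructuredBuffer<uint> gData    : register(u1);

void WaveAppend(bool wantsToAppend, uint value)
{
    // how many lanes in this wave want to append, and where this lane falls among them
    uint waveTotal = WaveActiveCountBits(wantsToAppend);
    uint laneSlot  = WavePrefixCountBits(wantsToAppend);

    // a single lane performs one InterlockedAdd on behalf of the whole wave
    uint waveBase = 0;
    if (WaveIsFirstLane())
        InterlockedAdd(gCounter[0], waveTotal, waveBase);
    waveBase = WaveReadLaneFirst(waveBase);

    if (wantsToAppend)
        gData[waveBase + laneSlot] = value;
}

On a 64-wide AMD wave that is one global atomic instead of up to 64, which matches the 64-fold reduction described above.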

10 hours ago, ajmiles said:

In D3D12 where it's all "just memory" and the counter is a separate resource in main memory this optimisation no longer applies.

.... On the other hand, D3D12 brings with it Shader Model 6 and the new 'Wave Intrinsics' 

I'm not doing any fancy SM6 stuff yet, but this got me interested because the difference was so pronounced in D3D11, so I finally got around to testing it in D3D12 for the first time :D 
From the looks of it, AMD are still doing something tricky, like copying that bit of memory to GDS before the draw/dispatch and copying it back afterwards.

I tested my pixel-linked-list OIT shader, which has a highly contended atomic counter to allocate fragment storage in the pixel shader. I recorded these timings on an AMD R9 Nano, but they should be taken with a large grain of salt, as my demo scene has some slight animation in it and it's timed using D3D GPU timestamps rather than a precise profiler:
D3D12 native append counter: 866µs
D3D12 manual atomic counter in HLSL: 3848µs
D3D12 manual atomic hierarchical counter in HLSL: 1127µs (algorithm to reduce memory contention)
D3D11 native append counter: 1041µs
D3D11 manual atomic counter in HLSL: 4119µs
D3D11 manual atomic hierarchical counter in HLSL: 1349µs (algorithm to reduce memory contention)


Oops again - sorry for spreading wrong things :)

2 hours ago, Hodgman said:

I tested my pixel-linked-list OIT shader

Could you share some more details about this?

I assume you use 2 passes: first incrementing a single counter to allocate storage, second writing fragments using per-pixel counters?

I wonder how you can use the intrinsics in a pixel shader here, but assume you can do it only in the first pass. So I assume your timings are just the first pass?

 

I would be interested in doing this for all geometry, to improve SS reflections / shadows, SSAO, or SSGI, etc. as well. Assuming the Nano is close to next-gen consoles, this might start to make sense looking at your numbers.

 

 


I suspected Microsoft had added things to DX that other APIs lack.

 

14 hours ago, NikiTo said:

Is declaring 30+ AppendStructuredBuffers going to overload the shader?

I assume the answer to this is obvious, then.

