[D3D12] Ping Pong Rendering

Graphics and GPU Programming

Started by Funkymunky May 16, 2016 04:19 PM

9 comments, last by SoldierOfLight 7 years, 11 months ago

Funkymunky

Author

May 16, 2016 04:19 PM

I am trying to render an effect that requires ping-ponging back and forth between two textures. First I tried to implement it as a loop in a compute shader, with the two textures represented as RWStructuredBuffers and the output directed to a RWTexture2D. This resulted in a lot of "feedback" in the final texture, even though I called DeviceMemoryBarrierWithGroupSync after each iteration of the loop.

Then I tried setting up a loop in the C++ code to render via a pixel shader back and forth between render targets, with resource barrier transitions from Render Target to Pixel Shader Resource and back. This seems to produce only the output of the first ping-pong.

How should I be doing it, in theory? I liked the idea of the compute shader approach, but I'm sampling data outside of each execution's threadgroup quadrant of the texture, and that seems to be ruining things for me. I'm hesitant to Submit each render and then fence for completion, because that seems wasteful...

I guess basically, what's the best way to update a texture and ensure the update takes place before using it as a source?

SoldierOfLight

May 16, 2016 04:24 PM

The PS approach should work properly. Are you updating your SRV input descriptor as well as your RT output descriptor? Are you making sure that you don't overwrite it before the GPU has had a chance to read it?
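To make the "only the first ping-pong" symptom concrete, here is a minimal CPU-side sketch in C++. It is not D3D12 code: `Texture`, `runPass` and `pingPong` are hypothetical stand-ins, with each texture modeled as a float array and each pass simply adding 1. If the input (SRV) and output (RTV) bindings are not swapped between passes, every pass reads the same untouched input, so the result only ever reflects a single pass.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a texture: one float per texel.
using Texture = std::vector<float>;

// Hypothetical "pixel shader" pass: reads src and writes src + 1 into dst.
// A real effect would sample neighbouring texels instead.
void runPass(const Texture& src, Texture& dst) {
    assert(src.size() == dst.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = src[i] + 1.0f;
    }
}

// Runs passCount passes and returns the index of the texture written last.
// swapBindings == true: input and output roles swap after every pass
// (correct ping-ponging).
// swapBindings == false: both bindings stay fixed, so every pass reads the
// original, untouched input.
int pingPong(std::array<Texture, 2>& tex, int passCount, bool swapBindings) {
    int src = 0;  // index the SRV descriptor would point at
    int dst = 1;  // index the RTV descriptor would point at
    int lastWritten = -1;
    for (int pass = 0; pass < passCount; ++pass) {
        // In D3D12, the transition barriers and descriptor updates for
        // tex[src] (shader resource) and tex[dst] (render target) go here.
        runPass(tex[src], tex[dst]);
        lastWritten = dst;
        if (swapBindings) {
            std::swap(src, dst);
        }
    }
    return lastWritten;
}
```

Starting from all zeros, three passes with swapping leave 3.0f in tex[1]; with fixed bindings tex[1] only ever holds 1.0f, the output of a single pass.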

Funkymunky

Author

May 16, 2016 05:04 PM

I put together a pretty basic setup (three passes: one that writes red, one that writes green, and one that writes blue) and verified that ping-ponging back and forth with the PS method works properly. Something else in my original approach must be messing me up.

I'm still wondering why the CS method doesn't work though. If I have a RWStructuredBuffer that contains all my texture data, and my various threadgroups are sampling from that buffer, shouldn't a call to DeviceMemoryBarrierWithGroupSync sync all the thread groups after writing to the "other" buffer, such that the next iteration of the loop will have a completely filled out buffer as its source data?

SoldierOfLight

May 16, 2016 05:32 PM

I'm not an expert at writing/debugging shaders unfortunately, my expertise is more in API usage patterns. It sounds like it should work, but I'm sure there's more to it than that.

red75prime

May 16, 2016 06:55 PM

Quote (Funkymunky): I'm still wondering why the CS method doesn't work though. If I have a RWStructuredBuffer that contains all my texture data, and my various threadgroups are sampling from that buffer, shouldn't a call to DeviceMemoryBarrierWithGroupSync sync all the thread groups after writing to the "other" buffer, such that the next iteration of the loop will have a completely filled out buffer as its source data?

I had a similar problem. Try putting a UAV resource barrier after the Dispatch: https://msdn.microsoft.com/en-us/library/windows/desktop/dn986740(v=vs.85).aspx

Dingleberry

May 16, 2016 07:23 PM

Quote (Funkymunky): I'm sampling data outside of each execution's threadgroup quadrant of the texture, and that seems to be ruining things for me.

You can't do a global synchronization within a compute shader, so if you're doing something like, say, sampling some neighbor pixels, combining them, and writing them back out to the same buffer, you can get a race condition. All that needs to happen is for one thread group to write out data before another thread group reads from the location it wrote to. You can't synchronize thread groups with each other, which is why ping-ponging is often a good way to implement these kinds of algorithms.
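The race can be reproduced without a GPU. The C++ sketch below is a CPU simulation under stated assumptions, not real compute shader code: `dilateAt`, `dispatch`, `runInPlace` and `runPingPong` are hypothetical names, thread groups run one after another in a chosen order (hardware may run them in any order or concurrently), and each group reads all of its inputs before writing, as if it had synced with a group barrier. The in-place result depends on group order; the ping-pong result does not.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D "dilation" kernel: max of an element and its two
// neighbours (clamped at the edges). It reads outside its own position,
// like sampling neighbour pixels.
int dilateAt(const std::vector<int>& src, std::size_t i) {
    int m = src[i];
    if (i > 0) m = std::max(m, src[i - 1]);
    if (i + 1 < src.size()) m = std::max(m, src[i + 1]);
    return m;
}

// Simulated dispatch: groups run one at a time in groupOrder. Each group
// reads all its inputs before writing, as if synced by a group barrier.
// Passing the same vector as in and out models an in-place update.
void dispatch(const std::vector<int>& in, std::vector<int>& out,
              std::size_t groupSize, const std::vector<std::size_t>& groupOrder) {
    assert(groupSize > 0 && in.size() == out.size());
    for (std::size_t g : groupOrder) {
        std::size_t begin = g * groupSize;
        std::size_t end = std::min(begin + groupSize, in.size());
        std::vector<int> results;
        for (std::size_t i = begin; i < end; ++i) results.push_back(dilateAt(in, i));
        for (std::size_t i = begin; i < end; ++i) out[i] = results[i - begin];
    }
}

// Racy: reads and writes the same buffer, so later groups may see values
// that earlier groups already overwrote.
std::vector<int> runInPlace(std::vector<int> data, std::size_t groupSize,
                            const std::vector<std::size_t>& order) {
    dispatch(data, data, groupSize, order);
    return data;
}

// Ping-pong: reads one buffer and writes the other, so group order is irrelevant.
std::vector<int> runPingPong(const std::vector<int>& data, std::size_t groupSize,
                             const std::vector<std::size_t>& order) {
    std::vector<int> out(data.size());
    dispatch(data, out, groupSize, order);
    return out;
}
```

With input {1, 0, 0, 0, 0, 0, 0, 0} and groups of two, the in-place update spreads the 1 to index 2 in forward group order but only to index 1 in reverse order, while the ping-pong version gives {1, 1, 0, 0, 0, 0, 0, 0} either way.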

Funkymunky

Author

May 16, 2016 08:57 PM

Huh. So then what is the difference between a GroupMemoryBarrier and a DeviceMemoryBarrier? The latter talks about blocking for "device memory accesses", which I took to mean things like RWStructuredBuffers, RWTexture2Ds, etc.

Adam Miles

May 16, 2016 09:09 PM

Yes, but only for the threads within the current thread group, not the entire Dispatch.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Matias Goldberg

May 17, 2016 02:17 AM

Adam Miles' answer is correct. I'll just expand on it:

Quote (Funkymunky): Huh. So then what is the difference between a GroupMemoryBarrier and a DeviceMemoryBarrier? The latter talks about blocking for "device memory accesses", which I took to mean things like RWStructuredBuffers, RWTexture2Ds, etc.

An 8x8 thread group works on a block of 8x8 pixels. To process a 1024x1024 texture you'll need 16384 thread groups (128 x 128).

A DeviceMemoryBarrier will sync all transfers to global memory (such as RWStructuredBuffers, RWTexture2Ds) within the threadgroup (within that 8x8 block).

A GroupMemoryBarrier will sync all transfers to shared memory (everything declared as groupshared, which usually lives in fast on-chip memory; on GCN this is called the LDS, Local Data Share), again only within the thread group.

The difference between these two barriers is which kind of memory they sync. Neither of them can sync across the whole dispatch; there is no intrinsic function for that.
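The group count is a ceiling division per axis. A small C++ sketch of that arithmetic, assuming a `[numthreads(8, 8, 1)]` shader with one thread per pixel; `groupsAlong` and `totalGroups` are hypothetical helper names. The per-axis counts are what you would pass as the X and Y arguments of `ID3D12GraphicsCommandList::Dispatch`.

```cpp
#include <cassert>
#include <cstdint>

// Thread groups needed along one axis: ceil(size / groupDim), so a partial
// block at the edge still gets its own group.
std::uint32_t groupsAlong(std::uint32_t size, std::uint32_t groupDim) {
    assert(groupDim > 0);
    return (size + groupDim - 1) / groupDim;
}

// Total groups for a width x height texture with square groupDim x groupDim
// groups, i.e. the product of the X and Y counts passed to Dispatch(x, y, 1).
// A barrier inside the shader only synchronizes the threads of ONE of these
// groups; nothing in the shader waits for the others.
std::uint64_t totalGroups(std::uint32_t width, std::uint32_t height,
                          std::uint32_t groupDim) {
    return std::uint64_t{groupsAlong(width, groupDim)} *
           groupsAlong(height, groupDim);
}
```

For a 1024x1024 texture and 8x8 groups this gives 128 x 128 = 16384 groups; a 1001-pixel-wide texture would need 126 groups along X, the last one only partially covered.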

Twitter: @matiasgoldberg

Distant Souls | Alliance AirWar | My Free Royalty-Free Music Library

eppo

May 18, 2016 01:43 PM

Quote (Funkymunky): I guess basically, what's the best way to update a texture and ensure the update takes place before using it as a source?

Binding one texture as a shader resource and the other as a render target, rendering, and then swapping the two seems simpler to me than a compute approach. As long as you issue the resource barrier transitions between passes, the API will guarantee proper synchronization.
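The swap loop can be sketched with explicit resource-state tracking. In D3D12 the synchronization is driven by the transition barriers you record, so this C++ sketch models only that bookkeeping on the CPU: `State`, `Texture`, `draw` and `renderPingPong` are hypothetical, with `State` mirroring `D3D12_RESOURCE_STATE_RENDER_TARGET` and `D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE`. Each pass transitions the previous output to a shader resource and the previous input to a render target, draws, and swaps the roles.

```cpp
#include <cassert>
#include <utility>

// Hypothetical mirror of the two D3D12 resource states involved.
enum class State { RenderTarget, PixelShaderResource };

// Hypothetical texture: current state plus a stand-in for its contents.
struct Texture {
    State state;
    int value;
};

// Hypothetical draw: the input must be readable, the output writable.
void draw(const Texture& src, Texture& dst) {
    assert(src.state == State::PixelShaderResource);
    assert(dst.state == State::RenderTarget);
    dst.value = src.value + 1;
}

// Runs passCount ping-pong passes between a and b (a is read first) and
// returns the number of transition barriers that were needed.
int renderPingPong(Texture& a, Texture& b, int passCount) {
    Texture* src = &a;
    Texture* dst = &b;
    int barriers = 0;
    // Records a transition only when the state actually changes.
    auto transition = [&barriers](Texture& t, State to) {
        if (t.state != to) {
            t.state = to;
            ++barriers;
        }
    };
    for (int pass = 0; pass < passCount; ++pass) {
        transition(*src, State::PixelShaderResource);  // last output becomes input
        transition(*dst, State::RenderTarget);         // last input becomes output
        draw(*src, *dst);
        std::swap(src, dst);
    }
    return barriers;
}
```

Starting with both textures in the render-target state, three passes need five transitions, since the first pass only has to move one texture. In real code the two transitions for a pass would typically be batched into a single `ResourceBarrier` call.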

This topic is closed to new replies.