DX11 - Pixel Shader 5 vs. Group Shared Memory and Atomic operations

3 comments, last by pcmaster 12 years, 8 months ago
Greetings community,

we all know that SM5 brought the possibility to scatter stuff in pixel shaders, too (not only in compute shaders). MSDN is rather brief on this topic. I can only see that I can use the Interlocked*() instructions in both PS and CS, I suppose on UAVs. DeviceMemoryBarrier() seems to work in both PS and CS, and it seems to be the only barrier instruction usable in PS. My question now is whether it's impossible in principle to take advantage of the group shared memory in PS, too. I don't see an API for that, and maybe that makes sense. In GL 4.2 I noticed they released the GL_ARB_shader_image_load_store extension, which apparently supports the same stuff, but there is still nothing for manipulating the scarce but fast shared memory :( I have implemented various parallel algorithms in OpenCL, so although I might seem a little confused now, I'm very much aware of which memory is which and what it's good for in GPGPU via CUDA/OpenCL/DX11 CS.

Also, I see virtually nobody discussing the use of atomic instructions outside compute shaders, and I wonder why. I see some OIT and bokeh samples around which use Append Buffers. But I have a scenario where I need to rasterise normal geometry with a lot of textures, and where I might benefit from being able to reduce a lot of info from pixel shaders using atomic operations on global (device) buffers, instead of writing out shitloads of texture data and reducing it in parallel afterwards. I'm not going to elaborate on my scenario further; I'll just state that I need to analyse what has been rasterised. I don't know how much the performance will suffer if all units (fragments) try to write to the same memory location using InterlockedMax() or similar :(

Any thoughts on pixel shaders (not compute shaders!) and shared and atomic stuff in DX11?
There's no way to access shared memory at all in pixel shaders. I would assume that the GPU is already using shared memory for coordinating pixel shader executions, but even if that's not the case the API has no means of using it. So you're out of luck on that one.

I really haven't played around too much with using UAVs in pixel shaders, aside from using an append buffer for bokeh (I wrote that sample you're talking about). I'd imagine it's pretty slow using device-wide interlocked operations due to the kind of synchronization required for that sort of operation. Even interlocked adds on shared memory are pretty slow... if you look at any fast parallel reductions for compute shaders or CUDA you'll find that they all avoid atomics. But it would definitely be better to profile than to assume, so if you do try any experiments I'd love to know how they turn out.
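
For anyone who wants to run that experiment, here is a minimal sketch of the kind of pixel shader being discussed: every fragment does an InterlockedMax() and an InterlockedAdd() on a device buffer bound as a UAV. All names (gStats, gAlbedo, gSampler, PSInput, PSMain), the buffer layout and the choice of luminance as the reduced value are assumptions made up for illustration, not something from this thread. It also assumes one render target is bound, so the UAV sits in slot u1, and that the buffer has been cleared to zero before the draw.

```hlsl
// Sketch only, intended for ps_5_0. Names and layout are hypothetical.
// Assumed layout of gStats: bytes 0..3 = max luminance (16-bit fixed point),
//                           bytes 4..7 = number of fragments shaded.
RWByteAddressBuffer gStats : register(u1); // u1: slot 0 is taken by the render target

Texture2D    gAlbedo  : register(t0);
SamplerState gSampler : register(s0);

struct PSInput
{
    float4 pos : SV_Position;
    float2 uv  : TEXCOORD0;
};

float4 PSMain(PSInput input) : SV_Target
{
    float4 color = gAlbedo.Sample(gSampler, input.uv);

    // Interlocked operations work on int/uint, so convert the value
    // to fixed point before reducing it.
    float luma      = dot(color.rgb, float3(0.299f, 0.587f, 0.114f));
    uint  lumaFixed = (uint)(saturate(luma) * 65535.0f);

    // Worst case for contention: every fragment targets the same address.
    gStats.InterlockedMax(0, lumaFixed);
    gStats.InterlockedAdd(4, 1);

    return color;
}
```

After the draw, the buffer can be copied to a staging resource and read back on the CPU. Whether this beats writing the data out and reducing it in a separate pass is exactly the open question here, so it needs profiling on the target hardware.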
That is my understanding too: there is magic going on behind the scenes when you compile a pixel shader that converts the pixel shader code into low-level GPU instructions that use shared memory and the like (basically everything you have to do yourself when you write CS or CUDA code).

I have used DeviceMemoryBarrier() in a pixel shader; the documentation is VERY sketchy. As I understand it, this is basically a hint to tell the compiler that all the GPU threads in the current block should finish accessing global memory before continuing. Used correctly, this should reduce the memory access overhead associated with different threads accessing global memory. But without a coherent description of exactly what this means in the context of a pixel shader, it's difficult to know if I'm using it correctly. Does anyone know of a good description of what this function means in the context of a pixel shader?
Building on what the others have said, there is no access to the group shared memory in pixel shaders. If you consider for a moment how it is used in compute shaders, I think it will be clear why. In the compute shader, you specify how large the thread groups are that you will be working with, and how many of them will be executed in a particular dispatch. Alongside the thread group size, you also declare how much shared memory the group will be using. This gives very fine control over how many threads will need to access the memory, and you can design your algorithm very precisely to coordinate access to it.
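
To make the compute shader side concrete, here is a minimal sketch of a per-group max reduction in which both the group size and the amount of shared memory are fixed in the shader source. The names (GROUP_SIZE, gInput, gOutput, sMax, CSMain) are hypothetical, and the sketch assumes the input element count is a multiple of GROUP_SIZE. It uses the usual tree reduction with barriers rather than atomics.

```hlsl
// Sketch only, intended for cs_5_0. Names are hypothetical.
#define GROUP_SIZE 256

StructuredBuffer<uint>   gInput  : register(t0);
RWStructuredBuffer<uint> gOutput : register(u0); // one result per thread group

// Shared memory is sized at compile time, next to the group size.
groupshared uint sMax[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint  gi   : SV_GroupIndex,
            uint3 gid  : SV_GroupID)
{
    // Each thread loads one element into shared memory.
    sMax[gi] = gInput[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: halve the number of active threads each step.
    [unroll]
    for (uint stride = GROUP_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            sMax[gi] = max(sMax[gi], sMax[gi + stride]);
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes the group's result to device memory.
    if (gi == 0)
        gOutput[gid.x] = sMax[0];
}
```

A dispatch of N groups leaves N partial maxima in gOutput, which a second pass (or a small CPU loop) can combine. The point is that this pattern only works because the shader author knows exactly which threads share sMax.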

In a pixel shader on the other hand, there is currently no method or concept of a thread group. Instead, it is up to the vendors to determine the optimal split size to be used when rasterizing a primitive, and then it is done more or less behind the scenes. This makes it impossible for a developer to write a shader that will have a coherent access strategy to the shared memory.

Who knows what will be coming in the next versions of D3D, but this seems like a logical extension of the possibilities. People have been talking about programmable rasterization for a while too, so perhaps sometime down the road there could be selectable group sizes for rasterization... That is just pure speculation though - I would be happy with a programmable rasterizer, but I don't know if one would ever come around and/or be useful...
Thanks guys, I see it's exactly as I thought. I'll try to experiment with the interlocked atomics, though.

This topic is closed to new replies.
