OpenGL parallelism and shaders


Hello,

I need to write a shader where, for a given pixel, each incoming fragment is inserted into an array of structs stored in memory, kept sorted by depth.

For the sake of argument, let's assume I have an array of 10 entries per pixel, each entry being a struct containing {float depth; vec4 color}. As a new fragment comes in for a pixel I use an insertion sort so I only keep the 10 front-most fragments and discard the rest.

If OpenGL were sequential, this would be a rather trivial operation: just program a proper insertion sort. However, my worry is that two fragments for the same pixel may be processed simultaneously, so my insertion sort would fail because multiple threads would be accessing the same array.
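To make this concrete, here is roughly what I mean by the trivial sequential version (just a sketch - K, the buffer layout and the fragColor input are placeholders of mine, and it ignores the concurrency question entirely):

#version 430
// Keep the K front-most fragments per pixel in an SSBO.
// NOT safe if two fragments of the same pixel are shaded at the same time.
const int K = 10;

struct Frag { float depth; vec4 color; };

layout(std430, binding = 0) buffer FragLists  { Frag frags[];  };  // width * height * K entries
layout(std430, binding = 1) buffer FragCounts { uint counts[]; };  // used entries per pixel

uniform int screenWidth;
in vec4 fragColor;            // whatever color the shader computes

void main()
{
    int  pixel = int(gl_FragCoord.y) * screenWidth + int(gl_FragCoord.x);
    int  base  = pixel * K;
    uint n     = counts[pixel];

    // Find the insertion position (sorted front to back by depth).
    uint pos = n;
    for (uint i = 0u; i < n; ++i) {
        if (gl_FragCoord.z < frags[base + int(i)].depth) { pos = i; break; }
    }
    if (pos >= uint(K)) return;               // farther than the K kept fragments

    // Shift the tail down one slot, dropping the last entry if the list is full.
    for (uint i = min(n, uint(K - 1)); i > pos; --i) {
        frags[base + int(i)] = frags[base + int(i) - 1];
    }
    frags[base + int(pos)] = Frag(gl_FragCoord.z, fragColor);
    counts[pixel] = min(n + 1u, uint(K));
}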

This leads me to 2 questions I'm hoping somebody has the insight to help out:

1) Can I be assured that, despite OpenGL's parallelism, no two fragments for the same pixel will ever be processed simultaneously? In other words, can I be sure that while I'm working on a pixel, OpenGL is not working on the same pixel in another thread?

2) If the answer to (1) is no - that is, there are no guarantees - is there a way to do this safely?

Thanks.


1. No, but not 100% sure.

2. You can make a lock using atomics, some approaches discussed here: http://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf

Joe,

Order-independent transparency is what I'm after (as you may have guessed by the link you pointed me to).

I have coded the linked-list approach. But it has the downside that I may end up with some pixels with many fragments and others with few - or none - because I may have run out of memory.
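For reference, the build pass of my linked-list version looks roughly like this (simplified sketch; binding points, the overflow flag and the fragColor input are just illustrative):

// Per-pixel linked list build pass.
layout(binding = 0, r32ui) uniform coherent uimage2D headPointers;  // index of the newest node per pixel
layout(binding = 0) uniform atomic_uint nodeCounter;                // global node allocator

struct Node { vec4 color; float depth; uint next; };
layout(std430, binding = 1) buffer NodeBuffer { Node nodes[]; };
layout(std430, binding = 2) buffer Status     { uint overflowed; };

uniform uint maxNodes;
in vec4 fragColor;

void main()
{
    uint idx = atomicCounterIncrement(nodeCounter);
    if (idx >= maxNodes) {                  // out of node memory: drop the fragment and flag it
        atomicOr(overflowed, 1u);
        return;
    }
    // Atomically push this node onto the pixel's list.
    uint prev = imageAtomicExchange(headPointers, ivec2(gl_FragCoord.xy), idx);
    nodes[idx].color = fragColor;
    nodes[idx].depth = gl_FragCoord.z;
    nodes[idx].next  = prev;
}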

So the idea is to use K entries per pixel - as described in the paper. What is not clear to me in this approach is how to use the atomics. For the linked list, the atomics are the tail value for each pixel and the global counter of my linked list.

So let's assume I have an x*y buffer of atomics that serves as my per-pixel fragment counter (if that's the way to do it). As another fragment comes in I check the counter: if less than K => insertion sort; if equal to K, insertion sort and drop the last entry. But if two threads are working on the same pixel, would I need to define my entire storage as atomic? That would be a performance killer (I imagine).

Could you point me to some (pseudo) shader code that would allow me to store up to K entries per pixel in a safe way?

Thanks

I should have mentioned: I'm looking for a single-pass algorithm, if at all possible. I can accomplish the single pass with the linked list approach, but would like to have an optional K-storage-per-pixel approach, also in a single pass. Just haven't been able to figure out how I can do this.

1 hour ago, amtri said:

would I need to define my entire storage as atomic?

I did not reply further after seeing that the NV presentation mentions robustness issues; I lack pixel shader experience, and other people should know this better than me...

However, I have good experience with compute shaders, and I do not see the technical reason for the mentioned issues. What I have in mind is to use an atomic lock. So you need a single (I guess 32-bit) variable per pixel to take ownership, and as long as it is set all other threads aiming for the same pixel have to wait until you are done with the sort. The sorting itself (or whatever else) would then not need atomic access, and K entries per pixel would work. But now I'm no longer so sure this works as expected from pixel shaders?

Performance might still be suboptimal if there is a lot of transparency.
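Roughly what I have in mind, as an untested sketch (and again, I'm not sure the pixel shader scheduling makes this spin safe on every GPU - that's exactly the robustness question):

// One 32-bit lock per pixel in an r32ui image: 0 = free, 1 = taken.
layout(binding = 0, r32ui) uniform coherent uimage2D lockImage;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    // Spin until this invocation takes ownership of the pixel.
    bool done = false;
    while (!done) {
        if (imageAtomicCompSwap(lockImage, p, 0u, 1u) == 0u) {
            // We own the pixel now, so the K entries can be read, sorted
            // and written with plain (non-atomic) access.
            // ... insertion sort into the pixel's K entries here ...
            memoryBarrier();
            imageAtomicExchange(lockImage, p, 0u);   // release the lock
            done = true;
        }
    }
}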

The most elegant solution to me seems to be a 2-pass method:

1. 1st pass: Increment an atomic counter per pixel, but nothing else. (Should be fast - e.g. the game Dreams does this with 64-bit atomics to splat points with a lot of overdraw and is still fast on PS4.)

2. In a compute shader, calculate a prefix sum over the entire screen. This gives a huge but compact list of all fragments that appear, and the first address for each pixel. (I don't know how long a prefix sum over 2M values takes - maybe 1 ms?)

3. 2nd pass: Draw again, this time using an atomic variable per pixel to get the local offset. Write each fragment to its pixel's base address + local offset.

4. Sort the fragment lists with a compute shader. Blend and generate the final image.

But that's just a proposal (roughly sketched below) - no idea whether it's impractical for some reason, or how it compares to other methods. (The NV presentation might cover this already, but I'm not sure after a quick look.)
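Rough sketch of passes 1 and 3 in GLSL (untested; buffer names and bindings are just placeholders, and the prefix sum in step 2 is a standard compute-shader scan that I leave out):

// Pass 1: count fragments per pixel, write nothing else.
layout(std430, binding = 0) buffer Counts { uint count[]; };   // one entry per pixel
uniform int screenWidth;

void main()
{
    int pixel = int(gl_FragCoord.y) * screenWidth + int(gl_FragCoord.x);
    atomicAdd(count[pixel], 1u);
}

After the prefix sum over count[] you have a base offset per pixel, and the second geometry pass writes each fragment directly into its slot:

// Pass 3: second geometry pass, store each fragment at base + local offset.
struct Frag { float depth; vec4 color; };
layout(std430, binding = 0) buffer Offsets  { uint baseOffset[]; };  // result of the prefix sum
layout(std430, binding = 1) buffer Cursors  { uint cursor[]; };      // zero-initialized before the pass
layout(std430, binding = 2) buffer FragData { Frag frags[]; };
uniform int screenWidth;
in vec4 fragColor;                                                   // whatever the shader computes

void main()
{
    int  pixel = int(gl_FragCoord.y) * screenWidth + int(gl_FragCoord.x);
    uint local = atomicAdd(cursor[pixel], 1u);            // local offset within this pixel's range
    frags[baseOffset[pixel] + local] = Frag(gl_FragCoord.z, fragColor);
}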


Joe,

You say

20 minutes ago, JoeJ said:

So you need a single (I guess 32-bit) variable per pixel to take ownership, and as long as it is set all other threads aiming for the same pixel have to wait until you are done with the sort.

If I understand you correctly, what you are describing is the equivalent of a "mutex" per pixel.

What you are suggesting is the equivalent of "lock", sort everything for the pixel, then "unlock".

The methods I've seen with atomic variables are such that, while the call is being made, the atomic variable is replaced with a new value. But the moment the call returns, any other thread can start modifying the variable again - in other words, the variable is only locked for the duration of the call itself.

So by the time I start my sort the variable is no longer locked.

Is there a shader call I can make to a variable to lock it and another one to unlock it?

Also, on a side note: the linked-list approach suffices for me provided I have enough memory. Rather than doing 2 passes I do just one and, if the memory fills up, I drop everything else and set a flag that memory was exhausted. Users then have the option to increase the memory and redraw the scene. Most of the time they only have to do this once; after that they can draw, rotate, etc. without worrying. Yes, when the scene changes - say, with a rotation - everything changes, but I've noticed that if one allocates, say, 20-30% more than the initially required amount, one rarely runs out of memory later.

My issue here is when I cannot allocate the needed memory. In that case, I want to guarantee a fixed number of fragments per pixel so that every pixel has a "decent" image. But that requires an insertion sort - hence the problem I'm running into.

If there's a way to lock a per-pixel variable, do the insertion sort, then unlock the per-pixel variable, then I'm all set. If you know how to do this please let me know.

Thanks.

You're looking for the fragment shader interlock extension, I think.

44 minutes ago, amtri said:

What you are suggesting is the equivalent of "lock", sort everything for the pixel, then "unlock".

Yes exactly. 

28 minutes ago, SoldierOfLight said:

You're looking for the fragment shader interlock extension, I think.

I assume this extension is faster than the regular atomic operations I had in mind (https://www.khronos.org/opengl/wiki/Shader_Storage_Buffer_Object).


In general you can *not* build a per-pixel lock out of atomics that will work across all GPU architectures. The way that GPUs handle flow control and divergence is not spec'd in such a way that would let you robustly spin a thread on a lock without potentially running into deadlocks or other unwanted behavior. See this blog post for some more details. Also see the latest changes that Nvidia has been adding for Volta and Turing in terms of thread scheduling.

You'll want to use that mentioned fragment shader interlock extension instead, since that's guaranteed to actually work the way you want. You may also want to look into some of the Intel articles and samples for Adaptive Order Independent Transparency, since that sounds similar to what you're describing.
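As a rough illustration of how the interlock extension is used for this (K, the buffer layout and names are placeholders, and the insertion sort body is the same idea as the sequential sketch earlier in the thread):

#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Serialize the critical section between fragments that map to the same pixel.
layout(pixel_interlock_ordered) in;   // or pixel_interlock_unordered if submission order doesn't matter

const int K = 10;
struct Frag { float depth; vec4 color; };
layout(std430, binding = 0) buffer FragLists  { Frag frags[];  };
layout(std430, binding = 1) buffer FragCounts { uint counts[]; };

uniform int screenWidth;
in vec4 fragColor;

void main()
{
    int pixel = int(gl_FragCoord.y) * screenWidth + int(gl_FragCoord.x);

    beginInvocationInterlockARB();
    // Between begin and end, no other invocation covering this pixel can be
    // inside the critical section, so plain (non-atomic) reads and writes of
    // the pixel's K entries are safe.
    //   ... insertion-sort (gl_FragCoord.z, fragColor) into
    //       frags[pixel * K .. pixel * K + K - 1] and update counts[pixel] ...
    endInvocationInterlockARB();
}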

First of all, my apologies for "disappearing". I got sucked into other projects.

But I finally had the time to code the interlock mechanism. Unfortunately, this is what I got:

0(3) : warning C7547: extension GL_ARB_fragment_shader_interlock not supported in profile gp5fp

(0) : fatal error C9999: Unkown builtin 'beginInvocationInterlockARB' in CreateDagForBuiltin_HAL

I assume this is a hardware issue. Can anybody confirm that this is the case and that my current graphics card does not support this? Or is there perhaps a workaround for it?

Thanks.

