
[Dx11] InterlockedAdd on floats in Pixel Shader - Workaround?


10 replies to this topic

#1 Tsus   Members   -  Reputation: 1048


Posted 25 October 2011 - 03:19 AM

Hi!

I'd like to use an InterlockedAdd operation on floats. Unfortunately, the documentation says it’s allowed on ints and uints only. So, I’m looking for some sort of workaround.

Here is my scenario: I have pixel shaders that need to write multiple times into a resource (at different positions). The problem is, the number of writes differs from pixel to pixel and is somewhere between zero (not that unlikely, actually) and about five. The write operation is a simple addition of a float (so simple scattering via rasterization would do the trick if I didn’t have to write more than once...)

Btw, what is the globallycoherent modifier doing in pixel shaders? Will this ensure that different primitives will see the writes of each other?

Any thoughts or ideas on that would be greatly appreciated!

Thanks in advance!
Cheers,
Tsus


#2 pcmaster   Members   -  Reputation: 678


Posted 25 October 2011 - 06:20 AM

Maybe "scale" the floats to a certain range and store/add as normal ints? For example you know your floats will be from 0.0f to 100.0f, therefore you store them as uints from 0x00000000 to 0xFFFFFFFF with precision (step) (100.0f-0.0f)/(2^-32), which is over 9 decimal digits, if I count correctly :-) Then you can decimate them to these uints, run the atomic operations on them as uints and just convert them back to floats after reading back (if needed).

This, of course, will not work, if you require a huge range.
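
A minimal, untested sketch of that idea in HLSL (the buffer name, register and RANGE_MAX are illustrative; RANGE_MAX would have to cover the largest sum you ever expect at one address):

RWByteAddressBuffer Accum : register( u0 );      // holds 32-bit fixed-point values

static const float RANGE_MAX = 100.0f;                    // largest expected sum per address
static const float SCALE     = 4294967295.0f / RANGE_MAX; // float -> 32-bit fixed point

void interlockedAddFixedPoint(uint addr, float value)
{
  uint fixedVal = (uint)(value * SCALE + 0.5f); // quantise the contribution
  uint orig;
  Accum.InterlockedAdd(addr, fixedVal, orig);   // the atomic add works on uints
}

// Reading back (CPU or shader): float sum = Accum.Load(addr) / SCALE;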

#3 Tsus   Members   -  Reputation: 1048


Posted 25 October 2011 - 06:44 AM

Thanks for the quick response!

Maybe "scale" the floats to a certain range, almost a fixed-point, round them and store/add as normal ints? For example you know your floats will be from 0.0f to 100.0f, therefore you store them as uints from 0x00000000 to 0xFFFFFFFF with precision (100.0f-0.0f)/(2^-32), which is over 9 decimal digits, if I count correctly :-) Then you can decimate them to these uints, run the atomic operations on them as uints and just convert them back to floats after reading back (if needed).

This, of course, will not work, if you require a huge range.

I hesitate to go in this direction, since the computation should be unbiased. The range I’m expecting increases over time, and unfortunately I’ll need the most precision at the end.
It’s nice to have a fallback solution, but accuracy is crucial in my case…

#4 Adam_42   Crossbones+   -  Reputation: 2562


Posted 25 October 2011 - 10:31 AM

Could you use an integer as an array index to store each of the floats in a different location? With a maximum of 5 floats, that shouldn't require too much extra storage.
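
A rough, untested sketch of that idea (buffer names, registers and MAX_SLOTS are illustrative): reserve a slot per contribution with an atomic counter, write the float there without atomics, and sum the slots in a later pass.

#define MAX_SLOTS 5

RWByteAddressBuffer       SlotCount : register( u1 ); // one uint counter per target cell
RWStructuredBuffer<float> Slots     : register( u2 ); // MAX_SLOTS floats per target cell

void storeContribution(uint cellIndex, float value)
{
  uint slot;
  SlotCount.InterlockedAdd(cellIndex * 4, 1, slot);  // reserve a unique slot index
  if (slot < MAX_SLOTS)
    Slots[cellIndex * MAX_SLOTS + slot] = value;     // plain write, no atomics needed
}

// A later pass sums Slots[cellIndex * MAX_SLOTS + 0 .. count-1] for each cell.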

#5 Tsus   Members   -  Reputation: 1048


Posted 25 October 2011 - 12:07 PM

Adam_42, on 25 October 2011 - 10:31 AM, said:
Could you use an integer as an array index to store each of the floats in a different location? With a maximum of 5 floats, that shouldn't require too much extra storage.

Hm, in general I can't make any assumptions about how many write operations the pixel shader will have to do at most. It could be 5 (the example I gave before, just to give you a rough feeling), but it could just as well be 100. It always depends on the scene and the view. (Sorry, I should have made that clearer.)
Even if I knew that there would be at most 5 write operations, they would still happen concurrently. Different pixel shader invocations will probably want to add a value at the same position, so storing a fixed number of floats at dedicated pre-allocated positions wouldn't help me sum them up. I'd rather avoid a second splatting pass...

#6 pcmaster   Members   -  Reputation: 678


Posted 26 October 2011 - 08:57 AM

Maybe you could get by with some kind of manual locking and busy waiting (boo boo boo :D). A kind of manual mutex. So, you have a texture representing the mutexes, one for each fragment, initialised to 0. Now a thread (fragment) wants to operate on some memory location [x,y]:

[allow_uav_condition] do // critical section enter (alias mutex::lock())
{
  uint orig;
  // try to flip the mutex from 0 to 1 atomically
  InterlockedCompareExchange(mutex[uint2(x, y)], 0, 1, orig);
  if (orig == 0) // the exchange succeeded: you own the "mutex"
    break;       // mutex[x,y] now equals 1
} while (true);
Then tamper with the float4 texture at [x,y]: read it, modify the value, write it back. Nobody else will touch it in the meantime. After you're done, call
InterlockedCompareExchange(mutex[uint2(x, y)], 1, 0, dummy); // critical section leave (alias mutex::unlock())
Since we made sure that mutex[x,y] == 1, this will exchange its value back to 0. That is the signal to the other threads waiting in the loop for this location that the mutex is "free" and one of them can enter the critical section. I claim this is actually the same serialisation that the GPU thread scheduler would do anyway -- if many threads want to access the same critical location, they have to queue up.

I have not done this before, I mean not with DX11 (I did something similar with OpenCL). I have mixed experience with such "complex" shaders and DX11 (fxc.exe), so I have no idea whether this will actually work, but to me it seems legit :-) I'm NOOOOOOT sure whether this will work in a pixel shader, but in a compute shader (or OpenCL or CUDA) this really should work. The main problem might be the eternal loop, which is something the optimiser doesn't seem to like at all :D
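
A consolidated, untested sketch of the pattern described above (resource names and registers are illustrative; the target is a single-channel RWTexture2D<float>, since D3D11 only allows typed UAV reads on single-component formats):

// Untested sketch: a per-texel spin lock guarding a read-modify-write on a float UAV.
RWTexture2D<uint>  Mutex : register( u1 ); // cleared to 0 before the draw
RWTexture2D<float> Data  : register( u2 ); // accumulation target (R32_FLOAT)

void addFloatLocked(uint2 pos, float value)
{
  uint orig;
  [allow_uav_condition] do
  {
    InterlockedCompareExchange(Mutex[pos], 0, 1, orig);   // try to take the lock
    if (orig == 0)
    {
      Data[pos] += value;                                 // critical section
      InterlockedCompareExchange(Mutex[pos], 1, 0, orig); // release the lock
      break;
    }
  } while (true);
}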

#7 Tsus   Members   -  Reputation: 1048


Posted 26 October 2011 - 09:32 AM

Oh, well! The fact that we need to pull out such guns tells me we’re running out of options, doesn’t it? :)
But seriously, I’ll give that a try and come back to share the results with you. It looks odd and promising at the same time. :)
Unfortunately my colleague is having a hard time with a deadline, so I’m going to help out there for a while. Thus, this project will have to wait a little… Nevertheless, I’ll try that.
Thank you very much pcmaster, you gave me two things to try out. Two thumbs up! :)

In the meantime I hope someone will come up with another brilliant idea.
I’m having a little hope that the globallycoherent modifier could help me out, but I still don’t know what it’s doing. Isn’t this the kind of thing I would put on a resource in compute shaders to make operations like float-additions visible to all threads? If so, what is it doing in pixel shaders?

#8 pcmaster   Members   -  Reputation: 678


Posted 27 October 2011 - 02:38 AM

Let us know what you came up with in the end :-)

I don't know much about globallycoherent, and like you I tried looking for some info but only ran into a few bits. That's the case with all this new GPGPU HLSL stuff; discussions and info are extremely scarce, and not so many people seem to use it, although it's been out there for over a year now :-(
The MSDN documentation suggests that globallycoherent should synchronise between all groups, device-wide, somehow. However, I don't even see any barrier() HLSL instruction that would block a whole dispatch in compute shaders or all pixels in a PS. Perhaps AllMemoryBarrier*() or DeviceMemoryBarrier() changes behaviour depending on the globallycoherent modifier? Who can tell?

There's only one way to figure this HLSL synchronisation thing out - just try it :D Too bad I don't have time for this right now :-(

#9 Tsus   Members   -  Reputation: 1048


Posted 27 October 2011 - 03:39 AM

Okay, I just looked over the docs once more and here is what I found out (or rather, my interpretation).

DeviceMemoryBarrier is the only barrier available in pixel shaders. (Makes sense, since we don't have group shared memory available.)
It synchronizes all device memory accesses of the pixel shader threads inside a group. (Internally the pixel shader threads are divided into groups as well; the rasterizer decides on their dimensions.) I guess globallycoherent will sync the threads from different groups as well - at least the groups that are currently running. Not all pixel shader threads are necessarily executed at the same time, since the number of shader units is limited, and I find it rather hard to believe that Direct3D would push the state of all running groups onto a stack and then start the next groups up to the barrier. That would consume too much memory. In my experience GPGPU languages don't have that feature, for exactly that reason. (At least I'm not aware of any.)

Am I wrong with my interpretation of the globallycoherent modifier?

However, synchronizing every thread won't help if two threads plan to write at the same position in a buffer...
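
For reference, this is roughly where the modifier and the barrier would appear in HLSL (an illustrative declaration only; whether it gives device-wide visibility from a pixel shader is exactly the open question here):

// Illustrative only: a globally coherent UAV and the one barrier usable in pixel shaders.
globallycoherent RWByteAddressBuffer Accum : register( u0 );

float4 PS(float4 svPos : SV_Position) : SV_Target
{
  // ... writes to Accum ...
  DeviceMemoryBarrier(); // blocks until all outstanding device memory accesses complete
  // ... subsequent reads (visibility across groups is the question discussed above) ...
  return 0;
}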

#10 Tsus   Members   -  Reputation: 1048


Posted 24 February 2012 - 05:09 PM

Alright… it has been a while.

After scribbling on a sheet of paper for quite some time, I eventually found another approach that avoids the InterlockedAddFloat. I had adapted a workaround I found in the Nvidia forum months ago, but since I hadn’t given it fair testing and a comparison to some ground truth, I decided to wait before putting my solution out here. It was good that I waited, because it turned out to be buggy.

The funny thing is… it seems to be a compiler bug.
When I use a while-loop it works, and when I use a do-while-loop it doesn’t. Very strange, but perhaps it will be of help to someone in the future.

Here is the code that worked for me:
RWByteAddressBuffer Accum : register( u0 );

void interlockedAddFloat(uint addr, float value)  // Works perfectly!
{
  uint i_val = asuint(value);
  uint tmp0 = 0;
  uint tmp1;
  [allow_uav_condition] while (true)
  {
    // Try to swap in our sum; tmp1 receives the value that was actually stored at addr.
    Accum.InterlockedCompareExchange(addr, tmp0, i_val, tmp1);
    if (tmp1 == tmp0)                        // the exchange succeeded, our addition is in place
      break;
    tmp0 = tmp1;                             // retry against the value we just saw
    i_val = asuint(value + asfloat(tmp1));   // recompute the sum with that value
  }
}

And here (just for the curious reader) is the one that didn’t work (the only difference is the loop). By “not working” I mean that values were added too often (the image got too bright).
void interlockedAddFloat_Test(uint addr, float value)  // Does not work and is slower.
{
  uint i_val = asuint(value);
  uint tmp0 = 0;
  uint tmp1;
  [allow_uav_condition] do
  {
    Accum.InterlockedCompareExchange(addr, tmp0, i_val, tmp1);
    if (tmp1 == tmp0)
      break;
    tmp0 = tmp1;
    i_val = asuint(value + asfloat(tmp1));
  } while (true);
}

I have to admit that the code above is nested in another loop with an unpredictable end, involving three conditional continues.
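
For completeness, a call site could look roughly like this (the cell computation and WIDTH are illustrative assumptions, not my actual shader):

// Illustrative call site: a pixel shader accumulating one float contribution
// into Accum at a byte address derived from a target cell (WIDTH is assumed).
#define WIDTH 1024

float4 PS(float4 svPos : SV_Position) : SV_Target
{
  uint2 cell = (uint2)svPos.xy;               // illustrative target position
  uint  addr = (cell.y * WIDTH + cell.x) * 4; // 4 bytes per float in Accum
  interlockedAddFloat(addr, 0.25f);           // any float contribution
  return 0;
}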

Cheers!

#11 Poppyspy   Members   -  Reputation: 105


Posted 16 February 2014 - 10:28 PM

Hi! Thanks a lot! This, along with some NVidia slides, led me to what I believe you were attempting as well.

Modifying your solution, this is the most elegant version for what I believe most people will stumble upon this post for.

 

RWByteAddressBuffer Accum : register( u0 );

void interlockedAddFloat(uint addr, float value)
{
  // Read the current value once so the first compare usually succeeds.
  uint comp, orig = Accum.Load(addr);
  [allow_uav_condition] do
  {
    // Attempt to replace the value we last saw (comp) with the new sum;
    // orig receives whatever was actually stored at addr.
    Accum.InterlockedCompareExchange(addr, comp = orig, asuint(asfloat(orig) + value), orig);
  } while (orig != comp); // retry if someone else wrote in between
}





