• FEATURED
• FEATURED
• FEATURED
• FEATURED
• FEATURED

View more

View more

View more

Image of the Day Submit

IOTD | Top Screenshots

The latest, straight to your Inbox.

Subscribe to GameDev.net Direct to receive the latest updates and exclusive content.

Is there a trick to use signed floats with integer atomic min/max?

13 replies to this topic

#1JoeJ  Members

Posted 19 May 2017 - 01:41 AM

I want to calculate a bounding box form points given by float numbers on GPU, but it has atomic min/max operations only for integers.
Usually i multiply the float value by a large number and round to integer,
but if all values are positive i could just use the float bits as they are and integer min/max works properly without a need to convert from float to int.
Of course my values are positive and negative.

Is there any trick i miss here?

#2Álvaro  Members

Posted 19 May 2017 - 03:13 AM

[EDIT: Oh, I deleted the first paragraph because it's probably incorrect. I have to think about it a bit more.]

However, you can also implement reduce operations without atomics. The Thrust library, for instance, gives you reductions: http://docs.nvidia.com/cuda/thrust/#reductions

Edited by Álvaro, 19 May 2017 - 03:17 AM.

#3Hodgman  Moderators

Posted 19 May 2017 - 03:29 AM

You can treat the bit pattern as a "sign and magnitude" format integer. Convert that into a two's compliment integer, do the atomic stuff, then convert back to sign and magnitude and bitwise cast back to float.

This problem also applies to counting sort algorithm, so a search for how to radix sort floats will find example solutions.

#4Álvaro  Members

Posted 19 May 2017 - 03:42 AM

Here's one way to do the conversion Hodgman talks about (which I botched badly in my first attempt):
unsigned flip_negatives(unsigned x) {
return x ^ ((-(x >> 31)) & 0x7fffffffu);
}


#5JoeJ  Members

Posted 19 May 2017 - 04:21 AM

That's what i've expected, thanks guys!

Here the code again also with the back conversion:
static inline uint32_t F2U(uint32_t f)
{
uint32_t mask = -int32_t(f >> 31) | 0x80000000;
}

static inline uint32_t U2F(uint32_t f)
{
uint32_t mask = ((f >> 31) - 1) | 0x80000000;
}


I'll try the scan too but on recent GPUs atomics in LDS seem always faster than a scan for min/max.

Edited by JoeJ, 19 May 2017 - 04:25 AM.

#6JoeJ  Members

Posted 19 May 2017 - 05:08 AM

Here's one way to do the conversion Hodgman talks about (which I botched badly in my first attempt):

unsigned flip_negatives(unsigned x) {
return x ^ ((-(x >> 31)) & 0x7fffffffu);
}


Spotted a difference and your code is probably still buggy.
It has a single order error when going from negative to positive, with this test code:
(ignoring conversiation errors because i don't know your bock conversion)

I'm too lazy to figure out how it works exactly rigth now, so i'm just 99% sure

	struct stuff
{
static inline uint32_t F2U(uint32_t f)
{
uint32_t mask = -int32_t(f >> 31) | 0x80000000;
//return f ^ ((-(f >> 31)) & 0x7fffffffu); // <- does not work
}

static inline uint32_t U2F(uint32_t f)
{
uint32_t mask = ((f >> 31) - 1) | 0x80000000;
}
};

static int k=1;
if(k)
{
k = 0;
SystemTools::Log("FLOAT MAX TEST\n");

uint32_t old = 0;
for (float f=-50; f<50; f+=0.1f)
{
uint32_t i = (uint32_t&) f;
i = stuff::F2U(i);
if (old > i) SystemTools::Log("ORDER ERROR i %f\n", f);
old = i;

uint32_t back = stuff::U2F(i);
if ( ((float&)back) != f) SystemTools::Log("CONVERSION ERROR i %f\n", f);
}
SystemTools::Log("\n");
}



#7Álvaro  Members

Posted 19 May 2017 - 06:58 AM

Here's one way to do the conversion Hodgman talks about (which I botched badly in my first attempt):

unsigned flip_negatives(unsigned x) {
return x ^ ((-(x >> 31)) & 0x7fffffffu);
}


Spotted a difference and your code is probably still buggy.
It has a single order error when going from negative to positive, with this test code:
(ignoring conversiation errors because i don't know your bock conversion)

No, my code does something slightly different than the one you are using, but it's correct. My code requires interpreting the result as a signed integer before you compare. You can think of the one you found as adding 2^31 to every number so the comparisons can be done as unsigned integers. There is very little difference between both options.

#8JoeJ  Members

Posted 19 May 2017 - 08:39 AM

Ah sorry, but i like your version a lot more.
We can rewrite this even simpler with signed numbers:

int32_t flip_negatives(int32_t f)
{
return f ^ ((f >> 31) & 0x7fffffff); // if negative flip all bits except the sign to flip the direction numbers increase
}

The back conversation is just the same function and does not need a addition, so your version is clearly better.

Also, after looking closely it boils down to the exact thing i've had in mind before starting the thread - but i did not dare to try due to floating point paranoia

Edited by JoeJ, 19 May 2017 - 08:51 AM.

#9Álvaro  Members

Posted 19 May 2017 - 01:05 PM

It does look better with signed integers. Neat.

#10frob  Moderators

Posted 19 May 2017 - 01:35 PM

All those conversions are neat and all, but are they actually the right solution to the REAL underlying problem?

I want to calculate a bounding box form points given by float numbers on GPU, but it has atomic min/max operations only for integers.

Yes, converting to integer work is faster for that part of building a bounding box. And doing it as an integer allows using the atomic operation that works in place. However, non-atomic min/max is supported just fine if you are modifying a buffer instead of a single value, and doesn't suffer the performance penalty of the atomic ordering. Yes you will need a slightly different algorithm to gather the results, but that is quite a collection of bitwise operations and they are not free either.  Even with those integer-based bitwise operations you've still got the waiting for atomic operations as you slowly build your bounding box.

Secondly, if you're looking at bounding boxes of objects that can be rendered, why are you doing it that way?  If they were small and trivial you wouldn't need to be asking, so we can assume their bigger data sets.  In that case, there are alternative methods to calculate bounding boxes on the GPU, like this one

Check out my book, Game Development with Unity, aimed at beginners who want to build fun games fast.

Also check out my personal website at bryanwagstaff.com, where I occasionally write about assorted stuff.

#11JoeJ  Members

Posted 19 May 2017 - 03:29 PM

Thanks, but using a pixel shader is not really an option for me.
I'm building a tree by splitting the bounding box at the center into child octants if there are too many nodes inside or if the box is too large.
So i have no geometry to render, just a few thousand points or more likely hundreds.

But i can't believe rendering those points to two pixels (which is still a render target in main memory) can be faster than atomics to LDS (on core memory which is more than 10 times faster).
I guess the graphics pipeline uses the same mechanism to do things like z-buffer and alpha blending (and that's where GPU atomics come from).
But usually the renderer has to draw the primitives in the order we give it to the API, and the only way to parallelize this is to distribute the primitives to small nonoverlapping pixel rectangles.
If we are lucky and can tell the API to ignore primitive order, those two pixels still fall into a single rectangle and we end up using sequential atomics to main memory under the hood.
Even if the GPU is clever enough to cache this to LDS, i would get additional synchronization cost between graphics and compute.

So i say bounding boxes are a typical compute workload today and using pixel shader is a dated approach.
But let me know if i'm wrong with anything - my knowledge about the graphics pipeline is not that good.

Edited by JoeJ, 19 May 2017 - 03:31 PM.

#12frob  Moderators

Posted 19 May 2017 - 04:51 PM

I'm still confused about it.  It isn't a large chunk of data, just "a few thousand points or hundreds". And that description is for managing a spatial tree for future processing tasks. Why are you using the graphics card for the processing of the tree? That type of tree management is usually a CPU-side operation.

Assuming they're in a linear array and you've got proper simd instructions to test XYZ at once, hundreds to thousands is going to be faster on the CPU than a single round-trip to the GPU.  If you were talking about tens or hundreds of thousands and bigger CUDA processing, that would be a different story.

I could easily be confused since I only have the small amount you wrote above, but I don't see why it is a GPU-side task if it isn't for processing relatively complex geometry.

Check out my book, Game Development with Unity, aimed at beginners who want to build fun games fast.

Also check out my personal website at bryanwagstaff.com, where I occasionally write about assorted stuff.

#13Hodgman  Moderators

Posted 19 May 2017 - 11:24 PM

But i can't believe rendering those points to two pixels (which is still a render target in main memory) can be faster than atomics to LDS (on core memory which is more than 10 times faster). I guess the graphics pipeline uses the same mechanism to do things like z-buffer and alpha blending (and that's where GPU atomics come from).
The graphics hardware has dedicated hardware (often called ROP's, or the OM stage by Microsoft, or depth-block/color-block by AMD) which writes to colour/depth buffers, including any sorting/read-modify-write logic. Atomics on memory are a completely different bit of hardware, and atomics in LDS are a different bit of hardware again.

Atomics in LDS are going to be fast, but they're also limited to communication within the current thread-group. LDS hardware is within a shader core, which can only host a few hundred threads at a time. If you're reducing a data set of tens of thousands of points, then your kernel will be running across multiple shader cores with multiple non-connected LDS banks. So at some point you need to do some atomics out to memory addresses... but yes, if you can structure your algorithm so that it does most of its communication in LDS and then only small amounts of communication back to memory then you'll probably be doing pretty well.

#14JoeJ  Members

Posted 20 May 2017 - 12:44 AM

I'm still confused about it. It isn't a large chunk of data, just "a few thousand points or hundreds". And that description is for managing a spatial tree for future processing tasks. Why are you using the graphics card for the processing of the tree? That type of tree management is usually a CPU-side operation.

Agree with this, i'll keep CPU generation (which i already have) as an option.
In fact the tree rebuild takes 1% of runtime from the complete algorithm on CPU, on GPU it will be maybe 5% i guess.
It hurts to put small workloads on GPU, but i save the need to upload data per frame.
Also at the beginning of the frame all the setup work coming after the tree rebuild are small workloads too and i hope i can do shadowmap rendering or texture space shading async at that time.

The graphics hardware has dedicated hardware (often called ROP's, or the OM stage by Microsoft, or depth-block/color-block by AMD) which writes to colour/depth buffers, including any sorting/read-modify-write logic.

Ha ok, i did not know. ROPs, of course But i guess the limitation that all work goes to just one ROP blocking any others still applies somehow.

So here is what i'm trying to do and how i plan to implement: (It's inefficient, let me know if hou have an idea...)
I have all geometry covered with disc shaped samples to calculate GI. They are at the positions of lightmap texel centers and there's a hierarchy similar to mip maps, so we can think of them as a quad tree.
So the tree for something like a character is precomputed and does not need to be rebuild, i only need to rebuild the tree above all those object parents as they move around.

To make sure the direction of parent discs can be averaged from children, i first bin all these nodes to the closest of the 8 directions formed by a cube corner to it's center.
From that i launch 8 workgroups (each 256-1024 threads wide, depending on how much dynamic objects a game has), and each workgroup builds the entire tree bottom down.
At the end the 8 trees get a common parent and done.
(The precomputed object trees also use the 8 directions approach, so each object generates 8 root nodes to process)

So i have work only for 8 x 1024 threads but at least they likely keep busy from start to finish.
Unfortunately this will eat up pretty all LDS to store the boxes which limits the chance to do other things async very well.
Using a bounding sphere instead box could help a bit but requires 2 runs over the nodes: One to calculate the center and one to find the radius.

My nodes can have 8 children at most,
a node can have a maximum bounding radius of (1<<treeLevel) * detail,
and the tree needs to be level complete (i duplicate nodes from a level 5 child up to level 8 parent in a later step to get missing level 6+7).

What i really dislike about my approach is that it mixes nodes of various levels under a common parent.
A bottom up approach wold be much nicer: Create parents for all level 0 nodes first, add results to existing level 1 nodes and proceed.
This would result in much better trees, but i guess the compleixity is too high to be worth it.