JoeJ

Is there a trick to use signed floats with integer atomic min/max?

13 posts in this topic

I want to calculate a bounding box from points given as floats on the GPU, but it only has atomic min/max operations for integers.
Usually I multiply the float value by a large number and round to an integer,
but if all values were positive I could just use the float bits as they are, and integer min/max would work properly without any need to convert from float to int.
Of course, my values are both positive and negative.

Is there any trick I'm missing here?

[EDIT: Oh, I deleted the first paragraph because it's probably incorrect. I have to think about it a bit more.]

However, you can also implement reduce operations without atomics. The Thrust library, for instance, gives you reductions: http://docs.nvidia.com/cuda/thrust/#reductions Edited by Álvaro

Here's one way to do the conversion Hodgman talks about (which I botched badly in my first attempt):
unsigned flip_negatives(unsigned x) {
  return x ^ ((-(x >> 31)) & 0x7fffffffu);  // if the sign bit is set, flip the low 31 bits
}

Yeah, I found the same thing in the meantime with the radix keyword: http://stereopsis.com/radix.html
That's what I expected, thanks guys!

Here's the code again, along with the back conversion:
static inline uint32_t F2U(uint32_t f)
{
    uint32_t mask = -int32_t(f >> 31) | 0x80000000;
    return f ^ mask;
}

static inline uint32_t U2F(uint32_t f)
{
    uint32_t mask = ((f >> 31) - 1) | 0x80000000;
    return f ^ mask;
}

I'll try the scan too, but on recent GPUs, atomics in LDS always seem faster than a scan for min/max. Edited by JoeJ

Here's one way to do the conversion Hodgman talks about (which I botched badly in my first attempt):

unsigned flip_negatives(unsigned x) {
  return x ^ ((-(x >> 31)) & 0x7fffffffu);
}


Spotted a difference, and your code is probably still buggy.
It has a single ordering error when going from negative to positive, with this test code
(ignoring conversion errors, because I don't know your back conversion):

I'm too lazy to figure out exactly how it works right now, so I'm just 99% sure :D

	struct stuff
	{
		static inline uint32_t F2U(uint32_t f)
		{
			uint32_t mask = -int32_t(f >> 31) | 0x80000000;
			return f ^ mask;
			//return f ^ ((-(f >> 31)) & 0x7fffffffu); // <- does not work
		}

		static inline uint32_t U2F(uint32_t f)
		{
			uint32_t mask = ((f >> 31) - 1) | 0x80000000;
			return f ^ mask;
		}
	};

	static int k=1;
	if(k)
	{
		k = 0;
		SystemTools::Log("FLOAT MAX TEST\n");
		
		uint32_t old = 0;
		for (float f=-50; f<50; f+=0.1f)
		{
			uint32_t i = (uint32_t&) f;
			i = stuff::F2U(i);
			if (old > i) SystemTools::Log("ORDER ERROR i %f\n", f);
			old = i;

			uint32_t back = stuff::U2F(i);
			if ( ((float&)back) != f) SystemTools::Log("CONVERSION ERROR i %f\n", f); 
		}
		SystemTools::Log("\n");
	}


Here's one way to do the conversion Hodgman talks about (which I botched badly in my first attempt):

unsigned flip_negatives(unsigned x) {
  return x ^ ((-(x >> 31)) & 0x7fffffffu);
}


Spotted a difference, and your code is probably still buggy.
It has a single ordering error when going from negative to positive, with this test code
(ignoring conversion errors, because I don't know your back conversion):


No, my code does something slightly different from the one you are using, but it's correct. My code requires interpreting the result as a signed integer before you compare. You can think of the one you found as adding 2^31 to every number so the comparisons can be done as unsigned integers. There is very little difference between the two options.

Ah sorry, but I like your version a lot more.
We can rewrite it even more simply with signed numbers:
 
int32_t flip_negatives(int32_t f)
{
	return f ^ ((f >> 31) & 0x7fffffff); // if negative flip all bits except the sign to flip the direction numbers increase
}
The back conversion is just the same function and does not need an addition, so your version is clearly better.

Also, after looking closely, it boils down to the exact thing I had in mind before starting the thread - but I did not dare to try it due to floating point paranoia :D Edited by JoeJ

All those conversions are neat and all, but are they actually the right solution to the REAL underlying problem?

I want to calculate a bounding box from points given by float numbers on GPU, but it has atomic min/max operations only for integers.

Yes, converting to integers is faster for that part of building a bounding box, and doing it as an integer allows using the atomic operation that works in place. However, non-atomic min/max is supported just fine if you are modifying a buffer instead of a single value, and it doesn't suffer the performance penalty of atomic ordering. You will need a slightly different algorithm to gather the results, but that is quite a collection of bitwise operations, and they are not free either. Even with those integer-based bitwise operations, you've still got the waiting on atomic operations as you slowly build your bounding box.

Secondly, if you're looking at bounding boxes of objects that can be rendered, why are you doing it that way? If they were small and trivial you wouldn't need to be asking, so we can assume they're bigger data sets. In that case, there are alternative methods to calculate bounding boxes on the GPU, like this one


Thanks, but using a pixel shader is not really an option for me.
I'm building a tree by splitting the bounding box at the center into child octants if there are too many nodes inside or if the box is too large.
So I have no geometry to render, just a few thousand points, or more likely hundreds.


But I can't believe rendering those points to two pixels (which is still a render target in main memory) can be faster than atomics in LDS (on-chip memory, which is more than 10 times faster).
I guess the graphics pipeline uses the same mechanism to do things like z-buffering and alpha blending (and that's where GPU atomics come from).
But usually the renderer has to draw the primitives in the order we give them to the API, and the only way to parallelize this is to distribute the primitives to small non-overlapping pixel rectangles.
If we are lucky and can tell the API to ignore primitive order, those two pixels still fall into a single rectangle, and we end up using sequential atomics to main memory under the hood.
Even if the GPU is clever enough to cache this in LDS, I would get additional synchronization cost between graphics and compute.

So I'd say bounding boxes are a typical compute workload today, and using a pixel shader is a dated approach.
But let me know if I'm wrong about anything - my knowledge of the graphics pipeline is not that good. Edited by JoeJ

I'm still confused about it. It isn't a large chunk of data, just "a few thousand points or hundreds". And that description is for managing a spatial tree for future processing tasks. Why are you using the graphics card for processing the tree? That type of tree management is usually a CPU-side operation.

Assuming they're in a linear array and you've got proper SIMD instructions to test XYZ at once, hundreds to thousands is going to be faster on the CPU than a single round trip to the GPU. If you were talking about tens or hundreds of thousands and bigger CUDA-style processing, that would be a different story.

I could easily be confused, since I only have the small amount you wrote above, but I don't see why it is a GPU-side task if it isn't processing relatively complex geometry.


But i can't believe rendering those points to two pixels (which is still a render target in main memory) can be faster than atomics to LDS (on core memory which is more than 10 times faster). I guess the graphics pipeline uses the same mechanism to do things like z-buffer and alpha blending (and that's where GPU atomics come from).
The graphics hardware has dedicated hardware (often called ROPs, or the OM stage by Microsoft, or depth-block/color-block by AMD) which writes to colour/depth buffers, including any sorting/read-modify-write logic. Atomics on memory are a completely different bit of hardware, and atomics in LDS are a different bit of hardware again.

Atomics in LDS are going to be fast, but they're also limited to communication within the current thread-group. LDS hardware is within a shader core, which can only host a few hundred threads at a time. If you're reducing a data set of tens of thousands of points, then your kernel will be running across multiple shader cores with multiple non-connected LDS banks. So at some point you need to do some atomics out to memory addresses... but yes, if you can structure your algorithm so that it does most of its communication in LDS and then only small amounts of communication back to memory then you'll probably be doing pretty well.


I'm still confused about it. It isn't a large chunk of data, just "a few thousand points or hundreds". And that description is for managing a spatial tree for future processing tasks. Why are you using the graphics card for the processing of the tree? That type of tree management is usually a CPU-side operation.

Agreed. I'll keep CPU generation (which I already have) as an option.
In fact, the tree rebuild takes 1% of the runtime of the complete algorithm on the CPU; on the GPU it will maybe be 5%, I guess.
It hurts to put small workloads on the GPU, but I save the need to upload data every frame.
Also, at the beginning of the frame, all the setup work coming after the tree rebuild consists of small workloads too, and I hope I can do shadow map rendering or texture-space shading async at that time.

The graphics hardware has dedicated hardware (often called ROP's, or the OM stage by Microsoft, or depth-block/color-block by AMD) which writes to colour/depth buffers, including any sorting/read-modify-write logic.

Ha, ok, I did not know. ROPs, of course :) But I guess the limitation that all the work goes to just one ROP, blocking any others, still applies somehow.



So here is what I'm trying to do and how I plan to implement it (it's inefficient - let me know if you have an idea...):
I have all geometry covered with disc-shaped samples to calculate GI. They are at the positions of lightmap texel centers, and there's a hierarchy similar to mip maps, so we can think of them as a quad tree.
So the tree for something like a character is precomputed and does not need to be rebuilt; I only need to rebuild the tree above all those object parents as they move around.

To make sure the direction of parent discs can be averaged from children, I first bin all these nodes to the closest of the 8 directions formed by a cube corner to its center.
From that I launch 8 workgroups (each 256-1024 threads wide, depending on how many dynamic objects a game has), and each workgroup builds the entire tree top down.
At the end, the 8 trees get a common parent, and done.
(The precomputed object trees also use the 8-directions approach, so each object generates 8 root nodes to process.)

So I have work for only 8 x 1024 threads, but at least they will likely keep busy from start to finish.
Unfortunately, this will eat up pretty much all of LDS to store the boxes, which limits the chance to do other things async very well.
Using a bounding sphere instead of a box could help a bit, but it requires 2 passes over the nodes: one to calculate the center and one to find the radius.

My nodes can have 8 children at most,
a node can have a maximum bounding radius of (1<<treeLevel) * detail,
and the tree needs to be level complete (I duplicate nodes from a level 5 child up to a level 8 parent in a later step to get the missing levels 6+7).

What I really dislike about my approach is that it mixes nodes of various levels under a common parent.
A bottom-up approach would be much nicer: create parents for all level 0 nodes first, add the results to existing level 1 nodes, and proceed.
This would result in much better trees, but I guess the complexity is too high to be worth it.
