Ultra-Fast 2D AABB Overlap Test

I am seeking what is in the title.

This is going into some extremely high-performance code, so I am hoping to find a way to vectorize this.
Target architectures are x86, x64, and armv7 for now.

To avoid branches, I am using this:
/**
 * Determines if the given 2D AABBs overlap in any way.
 *
 * \param _a2Left The left 2D axis-aligned bounding box.
 * \param _a2Right The right 2D axis-aligned bounding box.
 * \return Returns true if the boxes overlap in any way.
 */
LSE_INLINE LSBOOL LSE_CALL CTest::Aabb2dAabb2d( const CAabb2d &_a2Left, const CAabb2d &_a2Right ) {
    return (_a2Left.m_vMin.x <= _a2Right.m_vMax.x) && (_a2Left.m_vMin.y <= _a2Right.m_vMax.y) &&
        (_a2Left.m_vMax.x >= _a2Right.m_vMin.x) && (_a2Left.m_vMax.y >= _a2Right.m_vMin.y);
}



I am open to suggestions on how to make this faster in any way for any or all of the target architectures.


Thank you,
L. Spiro


What type are x and y?
32-bit floats.


L. Spiro


Er, I can't really think of anything much faster than that... Does && still cause branching issues in modern code for cases like this, where short-circuiting doesn't matter?

Anyways, here's some quick code I came up with; I haven't even checked that it works, since I'm writing it from memory. It should get rid of two comparisons:

fabs(bx2 + bx1 - ax2 - ax1) <= (bx2 - bx1 + ax2 - ax1) &&
fabs(by2 + by1 - ay2 - ay1) <= (by2 - by1 + ay2 - ay1)


What it does is check whether the distance between the centers of the two rectangles is at most half the sum of their dimensions (both sides are doubled above, so the factor of 0.5 cancels out). In other words:

fabs(bcenterx - acenterx) <= (bwidth + awidth) * 0.5 &&
fabs(bcentery - acentery) <= (bheight + aheight) * 0.5
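As a minimal self-contained sketch of that same test, assuming a simple Box struct with min/max corners (the struct and names here are mine, not from the code above):

#include <cmath>

struct Box { float x1, y1, x2, y2; }; // (x1, y1) = min corner, (x2, y2) = max corner

// Overlap iff, on each axis, the distance between the centers is at most
// half the summed extents. Both sides are doubled so the 0.5 cancels out.
inline bool OverlapCenters( const Box &a, const Box &b ) {
    return std::fabs( (b.x2 + b.x1) - (a.x2 + a.x1) ) <= (b.x2 - b.x1) + (a.x2 - a.x1) &&
           std::fabs( (b.y2 + b.y1) - (a.y2 + a.y1) ) <= (b.y2 - b.y1) + (a.y2 - a.y1);
}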


I have used this method in a game before, because the object coordinates happened to be centered in their bounding boxes and the dimensions were fixed, and it worked fine there (I used it because in that specific case it made the code simpler; that was assembly, yuck).

I'm not sure if this is faster; it rests on the assumption that the extra comparisons make the original slower to process, which may not be true in practice. You may want to profile first to make sure this is actually a bottleneck.
If the two bounding boxes are already stored in one SSE register each (4 floats), and you rearrange the values a bit so that only <= is needed, you might be able to use http://msdn.microsoft.com/en-us/library/f330yhc8%28VS.80%29.aspx to compare, and combine the results with http://msdn.microsoft.com/en-us/library/4490ys29(VS.80).aspx.
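A sketch of how I read that suggestion, using _mm_cmple_ps and _mm_movemask_ps; the packing convention and names are mine, assuming each box is kept in a register as [min.x, min.y, -max.x, -max.y] (lane 0 first) so every condition becomes a <=:

#include <xmmintrin.h>

// Assumes each box is packed as [min.x, min.y, -max.x, -max.y].
inline bool OverlapPacked( __m128 a, __m128 b ) {
    const __m128 vSigns = _mm_set1_ps( -0.0f ); // sign bits only
    // Turn b into [max.x, max.y, -min.x, -min.y]: swap its halves and negate.
    __m128 bFlip = _mm_xor_ps( _mm_shuffle_ps( b, b, _MM_SHUFFLE( 1, 0, 3, 2 ) ), vSigns );
    // a.min <= b.max and b.min <= a.max on both axes <=> all four lanes pass.
    return _mm_movemask_ps( _mm_cmple_ps( a, bFlip ) ) == 0xF;
}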
How is the function being used? I suspect you'll get much better results from improving the algorithm at a higher level than tweaking that function.
Perhaps this is what you already meant, but it would be much easier to vectorize e.g. a 4-vs-1 overlap test than a 1-vs-1 test, as you have now.

This would, however, in turn imply restructuring of some of the surrounding code to make vectorization of this one test worthwhile.
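To make that concrete, here is a rough sketch of a 4-vs-1 test with the four boxes stored structure-of-arrays; the group layout, names, and alignment assumption are all mine, reusing the OP's CAabb2d for the single box:

#include <xmmintrin.h>

// Four boxes in structure-of-arrays form; arrays assumed 16-byte aligned
// (use _mm_loadu_ps below if they are not).
struct Aabb2dGroup4 {
    float minX[4], minY[4], maxX[4], maxY[4];
};

// Tests one box against four at once. Returns a 4-bit mask with bit i set
// if box i of the group overlaps box b.
inline int Overlap4vs1( const Aabb2dGroup4 &g, const CAabb2d &b ) {
    __m128 c0 = _mm_cmple_ps( _mm_load_ps( g.minX ), _mm_set1_ps( b.m_vMax.x ) ); // g.min.x <= b.max.x
    __m128 c1 = _mm_cmple_ps( _mm_load_ps( g.minY ), _mm_set1_ps( b.m_vMax.y ) ); // g.min.y <= b.max.y
    __m128 c2 = _mm_cmple_ps( _mm_set1_ps( b.m_vMin.x ), _mm_load_ps( g.maxX ) ); // b.min.x <= g.max.x
    __m128 c3 = _mm_cmple_ps( _mm_set1_ps( b.m_vMin.y ), _mm_load_ps( g.maxY ) ); // b.min.y <= g.max.y
    return _mm_movemask_ps( _mm_and_ps( _mm_and_ps( c0, c1 ), _mm_and_ps( c2, c3 ) ) );
}

The same layout should map to NEON (vcleq_f32 / vandq_u32) for the armv7 target, though the mask-extraction step differs since NEON has no movemask equivalent.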
Well, I didn't have much hope of beating it after looking at the assembler output, but I tried this anyway in the hope that less branching would be faster:

bool overlapSSE(const Box &a, const Box &b)
{
    // Lane layout: a.xmin<=b.xmax, b.xmin<=a.xmax, a.ymin<=b.ymax, b.ymin<=a.ymax.
    __m128 min = _mm_set_ps(a.xmin, b.xmin, a.ymin, b.ymin);
    __m128 max = _mm_set_ps(b.xmax, a.xmax, b.ymax, a.ymax);
    __m128 cmp = _mm_cmple_ps(min, max);
    // AND-reduce the four lanes so every lane holds the combined result.
    cmp = _mm_and_ps(cmp, _mm_shuffle_ps(cmp, cmp, _MM_SHUFFLE(1,0,3,2)));
    cmp = _mm_and_ps(cmp, _mm_shuffle_ps(cmp, cmp, _MM_SHUFFLE(2,3,0,1)));
    // Overlap iff the lanes are all-ones; the mask is then non-zero.
    return _mm_movemask_ps(cmp) != 0;
}

but it's like 30% slower than what GCC produces with high enough optimization settings. (My first version returned _mm_comilt_ss(cmp, cmp), relying on the all-ones lanes reading as NaN, and that turned out to be broken in ICC anyway.)
Avoiding the branches:


LSE_INLINE LSBOOL LSE_CALL CTest::Aabb2dAabb2d( const CAabb2d &_a2Left, const CAabb2d &_a2Right ) {
    // Strict overlap iff all four differences are negative, i.e. all sign bits are set.
    return (reinterpret_cast<const unsigned int &>(_a2Left.m_vMin.x - _a2Right.m_vMax.x) &
        reinterpret_cast<const unsigned int &>(_a2Left.m_vMin.y - _a2Right.m_vMax.y) &
        reinterpret_cast<const unsigned int &>(_a2Right.m_vMin.x - _a2Left.m_vMax.x) &
        reinterpret_cast<const unsigned int &>(_a2Right.m_vMin.y - _a2Left.m_vMax.y)) & 0x80000000;
}

Pitfalls with this:
1) float and unsigned int must both be 32 bits.
2) I don't even know if you can reinterpret_cast an expression like that to a reference, and I don't have time to check the standard (so this very well may be breaking the rules of C++).
3) It doesn't work when equality matters (i.e. it discards all cases where Min.x/y and Max.x/y are equal).
4) I'm not really sure how the computer converts an unsigned int to a bool behind the scenes, so there may be penalties there.
5) I haven't tested this.
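For what it's worth, pitfall 2 can be sidestepped by copying the float's bits with memcpy, which is well-defined and which compilers reduce to a plain register move. A sketch (the helper and its name are mine, and pitfall 1's 32-bit assumption still applies):

#include <cstring>

// Copies a float's bit pattern into an unsigned int without aliasing issues.
inline unsigned int FloatBits( float fVal ) {
    unsigned int uiBits;
    std::memcpy( &uiBits, &fVal, sizeof( uiBits ) ); // assumes both types are 32 bits
    return uiBits;
}

LSE_INLINE LSBOOL LSE_CALL CTest::Aabb2dAabb2d( const CAabb2d &_a2Left, const CAabb2d &_a2Right ) {
    // Strict overlap iff all four differences are negative (all sign bits set).
    return (FloatBits( _a2Left.m_vMin.x - _a2Right.m_vMax.x ) &
            FloatBits( _a2Left.m_vMin.y - _a2Right.m_vMax.y ) &
            FloatBits( _a2Right.m_vMin.x - _a2Left.m_vMax.x ) &
            FloatBits( _a2Right.m_vMin.y - _a2Left.m_vMax.y )) & 0x80000000U;
}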

Personally, I doubt these hacks are worth it at this point. Compilers are bloody good at optimizing.

(quoting the overlapSSE attempt above)


Those shuffles will kill you every time with the instruction latencies. If you can rely on SSE3, the result of the compare can be checked in two instructions (the above is eight, I believe): hsubpd( cmp, cmp ) to subtract the values horizontally, then _mm_store_ss to a float, and return a comparison against non-zero. You might have to swap _mm_cmple_ps for _mm_cmpgt_ps; I haven't looked at it in detail. Anyway, that would be about as fast as you are likely to get. The math works out: any failure in the comparisons will make the bottom 32 bits after the horizontal double/double operation non-zero.

Of course, I also remember that casting __m128 to __m128d and the result back to __m128 without introducing extra instructions was painful, because playing games at that bit-twiddly level is frowned upon in C++ even when you know exactly what you are doing.

Add a declspec for naked, do the conversion to __m128 outside, and you are basically as fast as possible for the core functionality.

