• Create Account

Like
9Likes
Dislike

# Useless Snippet #2: AABB/Frustum test

By Jim Drygiannakis | Published Sep 04 2013 11:47 PM in General Programming
Peer Reviewed by (Dave Hunt, Josh Vega, ivan.spasov)

aabb frustum culling sse snippet

Welcome back. This time we'll take a look at one of the most common operations in a 3D graphics engine, frustum culling. How fast can such code become by using SIMD?

## The problem

Classify whether a batch of AABBs are completely inside, completely outside or intersecting a frustum (6 planes).

Restrictions
• AABBs are defined as (Center, Extent) pairs.
• All vectors are Vector3f's.

## Structs and initialization

#define OUTSIDE 0
#define INSIDE 1
#define INTERSECT 2 // or 3 depending on the algorithm used (see the discussion on the 2nd SSE version)

struct Vector3f
{
float x, y, z;
};

struct AABB
{
Vector3f m_Center;
Vector3f m_Extent;
};

struct Plane
{
float nx, ny, nz, d;
};


All the functions presented in the snippets below have the same signature:

void CullAABBList(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState);


As you can see, I've taken the state of the AABB (with regard to the specified frustum) outside of the struct itself. This has a couple advantages.
• You can cull the same AABB list with multiple different frustums in parallel without worrying about conflicts. Just pass a different aabbState array to each function call and you are done.
• All relevant AABB data are close to each other, which helps reducing cache misses (since the array is expected to be read sequentially).
Finally, all arrays are assumed to be 16-byte aligned. This is required for the SSE version.

Let's take a look at some code.

## C++ (reference implementation)

// Performance (cycles/AABB): Average = 102.5 (stdev = 12.0)
void CullAABBList_C(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
{
for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB)
{
const Vector3f& aabbCenter = aabbList[iAABB].m_Center;
const Vector3f& aabbSize = aabbList[iAABB].m_Extent;

unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
const Plane& frustumPlane = frustumPlanes[iPlane];

float d = aabbCenter.x * frustumPlane.nx +
aabbCenter.y * frustumPlane.ny +
aabbCenter.z * frustumPlane.nz;

float r = aabbSize.x * fast_fabsf(frustumPlane.nx) +
aabbSize.y * fast_fabsf(frustumPlane.ny) +
aabbSize.z * fast_fabsf(frustumPlane.nz);

float d_p_r = d + r;
float d_m_r = d - r;

if(d_p_r < -frustumPlane.d)
{
result = OUTSIDE;
break;
}
else if(d_m_r < -frustumPlane.d)
result = INTERSECT;
}

aabbState[iAABB] = result;
}
}


The code is based on method 4c from an excellent post by Fabian Giesen on AABB culling: View frustum culling. If you haven't read it yet, check it out now. He presents several different ways of performing the same test, each of which has its own advantages and disadvantages.

There's nothing special about the above code, except from the fact that it breaks out of the inner loop as soon as the box is classified as completely outside of one of the planes. No need to check the others.
But we can do better. First thing, precalculate the absolute of the plane normals once, outside of the AABB loop. No need to recalculate the same thing over and over. We can also replace the floating point comparisons with bitwise ANDs. Lets take a look at the code.

## C++ Optimized

// Performance (cycles/AABB): Average = 84.9 (stdev = 12.3)
void CullAABBList_C_Opt(AABB* __restrict aabbList, unsigned int numAABBs, Plane* __restrict frustumPlanes, unsigned int* __restrict aabbState)
{
Plane absFrustumPlanes[6];
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
absFrustumPlanes[iPlane].nx = fast_fabsf(frustumPlanes[iPlane].nx);
absFrustumPlanes[iPlane].ny = fast_fabsf(frustumPlanes[iPlane].ny);
absFrustumPlanes[iPlane].nz = fast_fabsf(frustumPlanes[iPlane].nz);
}

for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB)
{
const Vector3f& aabbCenter = aabbList[iAABB].m_Center;
const Vector3f& aabbSize = aabbList[iAABB].m_Extent;

unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
const Plane& frustumPlane = frustumPlanes[iPlane];
const Plane& absFrustumPlane = absFrustumPlanes[iPlane];

float d = aabbCenter.x * frustumPlane.nx +
aabbCenter.y * frustumPlane.ny +
aabbCenter.z * frustumPlane.nz;

float r = aabbSize.x * absFrustumPlane.nx +
aabbSize.y * absFrustumPlane.ny +
aabbSize.z * absFrustumPlane.nz;

float d_p_r = d + r + frustumPlane.d;
if(IsNegativeFloat(d_p_r))
{
result = OUTSIDE;
break;
}

float d_m_r = d - r + frustumPlane.d;
if(IsNegativeFloat(d_m_r))
result = INTERSECT;
}

aabbState[iAABB] = result;
}
}


fast_fabsf() and IsNegativeFloat() are simple functions which treat the float as int and remove/check the MSB.

I don't have any other C-level optimizations in mind. So let's see what we can do using SSE.

## SSE (1 AABB at a time)

// 2013-09-10: Moved outside of the function body. Check comments by @Matias Goldberg for details.
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

// Performance (cycles/AABB): Average = 63.9 (stdev = 10.8)
void CullAABBList_SSE_1(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
{
__declspec(align(16)) Plane absFrustumPlanes[6];

for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}

for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB)
{

unsigned int result = INSIDE; // Assume that the aabb will be inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));

// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
{
result = OUTSIDE;
break;
}
result = INTERSECT;
}

aabbState[iAABB] = result;
}
}


Again, since we are processing one AABB at a time, we can break out of the inner loop as soon as the AABB is found to be completely outside one of the 6 planes.

The SSE code should be straightforward. All arithmetic operations are scalar and we use _mm_movemask_ps in order to extract the signs of (d + r) and (d - r). Depending on the signs, we classify the AABB as completely outside or intersecting the frustum. Even though we are testing one AABB at a time, the code is faster than the optimized C++ implementation.

The "problem" with the above snippet is that all SSE operations are scalar. We can do better by swizzling the data from 4 AABBs in such a way that it will be possible to calculate 4 (d + r) and 4 (d - r) simultaneously. Let's try that.

## SSE (4 AABBs at a time)

// 2013-09-10: Moved outside of the function body. Check comments by @Matias Goldberg for details.
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

// Performance (cycles/AABB): Average = 24.1 (stdev = 4.2)
void CullAABBList_SSE_4(AABB* aabbList, unsigned int numAABBs, Plane* frustumPlanes, unsigned int* aabbState)
{
__declspec(align(16)) Plane absFrustumPlanes[6];
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}

// Process 4 AABBs in each iteration...
unsigned int numIterations = numAABBs >> 2;
for(unsigned int iIter = 0;iIter < numIterations;++iIter)
{
// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);

// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
// Calculate d...
__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);

// Calculate r...
__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);

// Calculate d + r + frustumPlane.d

// Check which boxes are outside this plane (if any)...
// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding
// box is outside the frustum.
in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
if(!in_out_flag)
break;

// Calculate d - r + frustumPlane.d
__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);

// Check which boxes intersect the frustum...
intersect_flag |= intersect_flag_curPlane;
}

// Calculate the state of the AABB from the 2 flags.
// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of the value of intersect_flag
// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
aabbState[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
aabbState[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
aabbState[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
aabbState[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
}

// Process the rest of the AABBs one by one...
for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
{
// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.
}
}


Compared to the original (unoptimized) C++ version, it's 4x faster. But that's unfair. Compared to the scalar SSE version it's about 2.5x faster. Not bad, don't you think? Can it get any better? Probably yes. The problem with the 2 SSE versions is the lack of enough XMM registers to keep all the AABB data in them and avoid using the stack for storing intermediate results. Unfortunately, we need 6 registers for the 4 Center/Extent pairs, and the rest (2) aren't enough for the inner loop. Compiling under 64-bits should give better results because of that.

Another way to optimize this algorithm further is to have the data already laid out the way the code expects them (in SoA form). This will get rid of the shuffles (the memory loads will still be 6).

Finally, the code used to calculate the state of each AABB from the two bitfields can change. One other way of doing it is the following:

intersect_flag &= in_out_flag;
intersect_flag <<= 1;
aabbState[i] = ((in_out_flag & (1 << i)) | (intersect_flag & (1 << (i + 1)))) >> i


This way, we can get rid of the variable shifts (all shifts and ANDs are compile-time constants). The difference from the previous version is that the intersection state is 3 instead of 2. 2 isn't a valid state, that's why we AND the intersection_flag with the in_out_flag. 2 means that the box is outside and intersecting the frustum, which is an invalid case.

Unfortunately, in practice this doesn't make a big difference in performance. So, it's a matter of taste.

## Results

Below is a table with more performance data from all the above snippets. Two cases have been tested. For both of them, the frustum is the [0, 1]^3 box. The first one is a random case, where all the boxes are randomly placed in the [-1.0, 2.0] box, and have random sizes in [0.1, 0.2] range. The other case is the worst case, where all boxes are inside the frustum.

 Method Random (32) Worst (32) Random (1024) * Worst (1024) C++ (ref) 105.2 (14.0) 181.5 (21.0) 102.5 (12.0) 159.3 (20.3) C++ (opt) 96.1 (10.3) 138.0 (17.2) 84.9 (12.3) 119.8 (16.3) SSE_1 73.7 (9.8) 93.2 (13.8) 63.9 (10.8) 72.8 (9.8) SSE_4 26.5 (4.0) 27.6 (4.3) 24.1 (4.2) 22.9 (4.0)

Table: Performance comparison between the 4 snippets. 2 batch sizes, 32 and 1024 AABBs. The column marked with * contains the data shown in the post.

That's all folks. Thanks for reading. Any corrections/suggestions are welcome.

## Changes

2013-09-06: Changed C to C++ because the code uses references. Thanks to @zdlr for pointing that out.
2013-09-10: Removed underscore from structure names (comment by @NightCreature83). Also moved definition of absFrustumPlaneMask outside of the two SSE functions (see comments by @Matias Goldberg for details).

I'm a freelance programmer, currently working on Android, web and desktop apps. I find low level programming and optimizations in general, very interesting subjects, so I try to spent most of my free time on them.

Why is it that in your benchmark results, 1024 AABB tests are consistently faster than 32 AABB tests?

That's rather counterintuitive for me!

@Servant: I think the figures are per AABB (see the source code comments)

If you're using references, is it still a 'C implementation'?

Could you please choose better variable names than 'd' and 'r'? You then exacerbate the lack of clarity by aliasing d + r to 'd_p_r' and d - r to 'd_m_r' - not forgetting the 'd' field on the _Plane struct.

Would distanceFromCenterToPlaneNormal and distanceFromSizeToAbsolutePlaneNormal not be more explicative? Example code like this benefits hugely from such verbosity.

Thank you all for the comments.

@Servant: Bacterius is right. The numbers are "cycles per AABB". Culling a batch of 1024 AABBs in a single loop ends up being faster than 32 probably because the function overhead (e.g. calculating the abs of the plane normals) is minimal compared to the main loop.

Example: Assume that the initial loop which calculates the abs of the plane normals requires 200 cycles. Also, assume that each AABB requires 100 cycles. For a batch of 10 AABBs, the function would require 1200 cycles to complete. Or in other words, 120 cycles per AABB. If the batch had 1000 AABBs, the function would require 100200 cycles, or 100.2 cycles per AABB. Hope that makes sense.

@zdlr: Variable names were kept like that in order to match the examples in Fabian Giesen's article which I linked above. Also, the term 'reference implementation' doesn't mean that the code uses references. It means that, that specific snippet is used as a baseline for performance comparisons (if this was what you meant).

@hellraizer, thanks for replying. Re references, mean this:

const _Vector3f& aabbCenter = aabbList[iAABB].m_Center;

const _Vector3f& aabbSize = aabbList[iAABB].m_Extent;

Which creates constant references to the current AABB's center and extent vectors. References are a C++ construct, not C.

Your reference implementation can be optimized further.
You perform two dots per plane while it can be reduced to one.

Fabien Giesen covers this in excellent detail

On your 4 at a time SSE2 code, you're using movemask too often and switching to GPR registers to perform the bitwise arithmetic.
I suggest you keep everything in XMM registers (actually, avoid the switch of registers) and use cmpeqps familiy of instructions together with andps & orps instructions; then use movemask to get the final mask outside the loop

Another optimization is to analyze where to put the restrict/__restrict keyword to allow the compiler to optimize further.

@zdlr: Of course you are right. I forgot about those two. I'll edit the article to read "C++" instead of "C".

@Matias: Thank you very much for the tips. I'll try to find some time to do the changes you suggest and edit the article. I'll also try to make a little VS project to attach to the article at the same time.

About the two dot products. Do you mean method 5 on that article? In this case, the resulting code wouldn't be able to distinguish between fully inside and intersecting states which is a requirement for this article.

I know, it might sound bad trying to optimize something and add arbitrary restrictions which might affect performance, but I think having the ability to distinguish between those two cases might help in the case of an hierarchy. E.g. parent is completely inside, so there's no need to check any of its children.

Please correct me if I'm wrong.

Btw. a macro is often used to add the restrict keyword since every compiler has it's own way of declaring it (i.e. gcc vs msvc vs clang)

As for the two dot products, yes I'm talking about method 5. I wrongly saw in your code your routine was returning outside and inside and didn't see the "intersect". My apologies.

I wonder though if real life scenarios would just perform better using method 5 (brute force over smartness).

Also, it may be worth noting in the article that your routine is 100% re-entrant, meaning it's very easy to execute in parallel; except for your "static const unsigned int absPlaneMask[4]" part. That variable should be declared outside to make it re-entrant.

Being very strict, the snippet "_mm_load_ps((float*)&absPlaneMask[0])" is not standard compliant because it breaks the strict aliasing rule (casting an unsigned int* to a float*). It's a PITA coming from the intrinsics, since movaps can load integers just fine.

The "correct" (standard compliant) thing to do would be either to declare absPlaneMask as unsigned char* and then cast to float*

Another option is to rely on SSE2, and declare the variable as __m128 absPlaneMask( _mm_castsi128_ps( _mm_set1_epi32( 0x7FFFFFFF ) ) );

Absurd, but well, standard compliant (provided the compiler supports the SSE intrinsic extensions, which by definition aren't standard..., but you never know in which compiler your code will be used, and if it won't be ported to a different set of intrinsics)

Btw you shouldnt use _<Captial char> identifiers for C/C++ code as these are reserved as global names, this is one of the reasons why STL code looks so horrible. I am assuming here that this code should also compile for a C++ compiler.

17.4.3.2.1 Global names [lib.global.names]

Certain sets of names and function signatures are always reserved to the implementation:

• Each name that contains a double underscore (_ _) or begins with an underscore followed by an uppercase letter (2.11) is reserved to the implementation for any use.
• Each name that begins with an underscore is reserved to the implementation for use as a name in the global namespace.165

165) Such names are also reserved in namespace ::std (17.4.3.1).

Also defining a few macros for allignment and typedefing the intrinsics makes the code easier to read in the end.

@Matias Goldberg.

About the static const unsigned int absPlaneMask[4]. Could you explain how that stops it being reentrant? It is read-only and never modified.

Also, re-entrancy is not sufficient to make something thread-safe. It is possible to be re-entrant and yet not thread safe, and vice-versa.

@zdlr

The problem with static const inside a function scope is that initialization will happen the first time it is being used.

So, if two or more threads try to enter a function where there is a static variable (declared at that scope) for the first time, there's a race condition which can potentially leave the variable in an inconsistent state.

Once the function has been called once, this danger is no longer present and can be called from multiple threads safely.

In other words, if we look at the generated assembly, it's the variable is not really read-only. To ensure this doesn't happen we need to declare it somewhere else.

As for the re-entrancy, I didn't say it means it's thread safe, but rather makes it very easy to parallelize.

In this case, caller just needs to be sure it only gives the function access to a portion of the data, different to every thread, which is very straightforward to implement.

Perhaps the biggest thing to worry about is that the amount of data to be sent to each thread needs to be multiple of 4 (since the last snippet SSE is operating with 4 objects at a time)

Everyone, thanks for the constructive comments. I've updated the article based on some of your suggestions.

@Matias Goldberg: Unfortunately, changing _mm_load_ps((float*)&absPlaneMask[0]) to be standard compliant as you suggested, as well as adding the __restrict keyword, requires a rerun of the benchmarks, because otherwise the numbers won't be accurate. I'll keep a note and hope to be able to do it sometime soon.

One small note about your last comment. The 4 AABBs at a time SSE version has a loop at the end which should handle the case where the number of AABBs isn't a multiple of 4. The code isn't shown for clarity (it should actually be the same as the 1 AABB at a time version).

Thanks again for the great input.

Note: Please offer only positive, constructive comments - we are looking to promote a positive atmosphere where collaboration is valued above all else.

PARTNERS