# Is my frustum culling slow ?

This topic is 1750 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

## Recommended Posts

the operator new do not receive an alignment value

This is true, but new is also required to return a block of memory that is suitably aligned to represent any object of the requested size. This is pretty ridiculous, because if you're allocating an object of size 32, the operator new function has absolutely no way of knowing if it should be aligned to 32/16/8/4/2/1 byte boundary, or some other value!

I've found that in practice, most implementations of operator new simply use min(16, size) as the alignment value, and assume that no object would ever require a larger alignment value... This means that objects that have been manually set to use an alignment value of 16 do typically still work ok with new.

However, when you manually set the alignment of a class to some larger value, say, 42 bytes, and then create it with new, these implementations are actually failing to obey the spec... but as you point out, there's no real way for them to be able to comply with the spec!

#### Share this post

##### Share on other sites

As i said in a previous post, in practice, new return a memory aligned on 8 ( because the standard require a alignment enough for long long, the largest standard type ) with 32 bits builds and align on 16 with 64 bits build ( not because of the C++ standard, but because of the x64 ABI ).

Of course, because we (programmers) do not like a random behavior, we override the things by ourself, so the allocator will be consistant on PC, X360 or PS3 ( + any other target ).

#### Share this post

##### Share on other sites

What if I were to create a class with nothing but an _AABB inside which is aligned to 16 and store this class inside the std::vector, and then access the _AABB inside for the frustum cull, would that work ?

edit: probably not... so how would I have to work around this?

How about storing it inside a non aligned container, and then create a new _AABB on the stack inside the frustum cull from the non aligned container ? Or is that bad/too slow ? edit: also doesn't work :p

Edited by lipsryme

#### Share this post

##### Share on other sites

std::vector simply has problems with aligned types, especially on MSVC.

Often the solution is to not use it, e.g. the Bullet physics middleware wrote a replacement called btAlignedObjectArray.

#### Share this post

##### Share on other sites

@Hodgman

What about making custom allocator for vector with _aligned_malloc/_aligned_free ?

I still don't understand this alignment thing, how does it work. Know any good tutorials for idiots?

Edited by belfegor

#### Share this post

##### Share on other sites

@Hellraizer

On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly

Now i am confused as in 4box version says:

// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.


where that part of code is that you mentioned.

What I meant is, that the calculations of (d+r) and (d-r) in the 4-box-at-a-time loop are correct.

When you substitute the comment you mentioned, with the loop from CullAABBList_SSE_1(), you have to fix the typo I mentioned, in order for it to be correct.

Hope that makes sense.

#### Share this post

##### Share on other sites

Yea I'd like to know that too, how to make it aligned to 16 bytes...

I've found something like this: http://code.google.com/p/mastermind-strategy/source/browse/trunk/src/util/aligned_allocator.hpp?r=167

but I'm still getting a compiler error:

c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector(870): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned

see reference to class template instantiation 'std::vector<_Ty,_Ax>' being compiled
1>          with
1>          [
1>              _Ty=_AABB,
1>              _Ax=aligned_allocator<_AABB,16>
1>          ]
Edited by lipsryme

#### Share this post

##### Share on other sites

@lipsryme

I get no error using vector (with default allocator), but i am not sure will it work all the time so i need to understand this alignment thing so i can fix possible issues.

My test code (DX9)

#define NUM_TEAPOTS 1024

struct Object
{
_AABB*              bbox;
unsigned int*       state;
D3DXVECTOR3         pos;
};

struct _AABB
{
D3DXVECTOR3 m_Center;
D3DXVECTOR3 m_Extent;
};

...
std::vector<Object>       vObjects;
std::vector<_AABB>        vAABBs;
std::vector<unsigned int> vStates;
...
vObjects.resize(NUM_TEAPOTS);
vAABBs.resize(NUM_TEAPOTS);
vStates.resize(NUM_TEAPOTS);
vOccData.resize(NUM_TEAPOTS);
...//fill data

update()
{
auto frustum = ExtractFrustumPlanes(&cam->getView(), &cam->getProjection());
CullAABBList_SSE_4(&vAABBs[0], vAABBs.size(), &frustum[0], &vStates[0]);

....
for(int i = 0; i < NUM_TEAPOTS; ++i)
{
if(*vObjects[i].state != 0)
draw();
}
}


#### Share this post

##### Share on other sites

Well so I've found an interesting post here: http://stackoverflow.com/questions/9409591/self-contained-stl-compatible-implementation-of-stdvector/9414618#9414618

Which basically says that something like this should work:

#include <vector>

template <typename T>
struct wrapper : public T
{
wrapper() {}
wrapper(const T& rhs) : T(rhs) {}
};

struct __declspec(align(64)) S
{
float x, y, z, w;
};

int main()
{
std::vector< wrapper<S> > v;  // OK, no C2719 error
return 0;
}


And it does compile...but maybe my problem is elsewhere but my culling doesn't work (only outputs 0).

Plus I'm getting an access violation read as soon as I put in up to four AABBs.

here's my code:

void RendererContext::FrustumCull_AABB_SSE(std::vector<aligned_16_wrapper<_AABB>>* aabbList,
unsigned int numAABBs,
_Plane* frustumPlanes,
unsigned int* culledSceneIDs)
{
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

__declspec(align(16)) _Plane absFrustumPlanes[6];
__m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
__m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}

// Process 4 AABBs in each iteration...
unsigned int numIterations = numAABBs >> 2;
for(unsigned int iIter = 0;iIter < numIterations;++iIter)
{
_Vector3f centerWS[4];
centerWS[0].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.x + aabbList->at((iIter << 2) + 0).center.x;
centerWS[0].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.y + aabbList->at((iIter << 2) + 0).center.y;
centerWS[0].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.z + aabbList->at((iIter << 2) + 0).center.z;
centerWS[1].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.x + aabbList->at((iIter << 2) + 1).center.x;
centerWS[1].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.y + aabbList->at((iIter << 2) + 1).center.y;
centerWS[1].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.z + aabbList->at((iIter << 2) + 1).center.z;
centerWS[2].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.x + aabbList->at((iIter << 2) + 2).center.x;
centerWS[2].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.y + aabbList->at((iIter << 2) + 2).center.y;
centerWS[2].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.z + aabbList->at((iIter << 2) + 2).center.z;
centerWS[3].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.x + aabbList->at((iIter << 2) + 3).center.x;
centerWS[3].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.y + aabbList->at((iIter << 2) + 3).center.y;
centerWS[3].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.z + aabbList->at((iIter << 2) + 3).center.z;

_Vector3f extentWS[4];
extentWS[0].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.x + aabbList->at((iIter << 2) + 0).extent.x;
extentWS[0].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.y + aabbList->at((iIter << 2) + 0).extent.y;
extentWS[0].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.z + aabbList->at((iIter << 2) + 0).extent.z;
extentWS[1].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.x + aabbList->at((iIter << 2) + 1).extent.x;
extentWS[1].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.y + aabbList->at((iIter << 2) + 1).extent.y;
extentWS[1].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.z + aabbList->at((iIter << 2) + 1).extent.z;
extentWS[2].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.x + aabbList->at((iIter << 2) + 2).extent.x;
extentWS[2].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.y + aabbList->at((iIter << 2) + 2).extent.y;
extentWS[2].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.z + aabbList->at((iIter << 2) + 2).extent.z;
extentWS[3].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.x + aabbList->at((iIter << 2) + 3).extent.x;
extentWS[3].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.y + aabbList->at((iIter << 2) + 3).extent.y;
extentWS[3].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.z + aabbList->at((iIter << 2) + 3).extent.z;

// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&centerWS[0].x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&extentWS[0].y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&centerWS[1].z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&centerWS[2].x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&extentWS[2].y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&centerWS[3].z);

// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
// Calculate d...
__m128 xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nx);
__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].ny);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);
xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nz);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);
xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

// Calculate r...
xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nx);
__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].ny);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);
xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nz);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);
xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

// Calculate d + r + frustumPlane.d
__m128 xmm_d_p_r = _mm_add_ps(xmm_d, xmm_r);
xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].d);
xmm_d_p_r = _mm_add_ps(xmm_d_p_r, xmm_frustumPlane_Component);

// Check which boxes are outside this plane (if any)...
// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding
// box is outside the frustum.
unsigned int in_out_flag_curPlane = _mm_movemask_ps(xmm_d_p_r);
in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
if(!in_out_flag)
break;

// Calculate d - r + frustumPlane.d
__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);
xmm_d_m_r = _mm_add_ps(xmm_d_m_r, xmm_frustumPlane_Component);

// Check which boxes intersect the frustum...
unsigned int intersect_flag_curPlane = _mm_movemask_ps(xmm_d_m_r);
intersect_flag |= intersect_flag_curPlane;
}

// Calculate the state of the AABB from the 2 flags.
// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag
// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
culledSceneIDs[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
culledSceneIDs[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
culledSceneIDs[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
culledSceneIDs[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
}

// Process the rest of the AABBs one by one...
for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
{
_Vector3f centerWS;
centerWS.x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.x + aabbList->at(iAABB).center.x;
centerWS.y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.y + aabbList->at(iAABB).center.y;
centerWS.z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.z + aabbList->at(iAABB).center.z;

_Vector3f extentWS;
extentWS.x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.x + aabbList->at(iAABB).extent.x;
extentWS.y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.y + aabbList->at(iAABB).extent.y;
extentWS.z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.z + aabbList->at(iAABB).extent.z;

__m128 xmm_aabbCenter_x = _mm_load_ss(&centerWS.x);
__m128 xmm_aabbCenter_y = _mm_load_ss(&centerWS.y);
__m128 xmm_aabbCenter_z = _mm_load_ss(&centerWS.z);
__m128 xmm_aabbExtent_x = _mm_load_ss(&extentWS.x);
__m128 xmm_aabbExtent_y = _mm_load_ss(&extentWS.y);
__m128 xmm_aabbExtent_z = _mm_load_ss(&extentWS.z);

unsigned int result = 1; // Assume that the aabb will be Inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));

xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));

__m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));

xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));

__m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);

// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
if(negativeMask & 0x01)
{
result = 0; // Outside
break;
}
else if(negativeMask & 0x04)
result = 2; // Intersect
}

culledSceneIDs[iAABB] = result;
}
}


#### Share this post

##### Share on other sites

I believe the changes you made in the code is the problem. To be more exact:

The original code read the centers and extends of the 4 AABBs from an array with the following layout:

c0.x, c0.y, c0.z, e0.x, e0.y, e0.z, c1.x, c1.y, c1.z, e1.x, e1.y, e1.z, c2.x, c2.y, c2.z, e2.x, e2.y, e2.z, c3.x, c3.y, c3.z, e3.x, e3.y, e3.z, ...


When the following instructions are executed, the XMM registers hold the values mentioned in their name:

// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);


If we assume that the initial aabbList array is 16-byte aligned, all loads are 16-byte aligned and the instructions are executed correctly. This is the reason we are loading the XMM regs with those specific array offsets.

On the other hand, your code doesn't do the same thing. It just stores the AABBs on the stack and the layout isn't the one expected by the code. The best case scenario is that your layout is:

c0.x, c0.y, c0.z, c1.x, c1.y, c1.z, c2.x, c2.y, c2.z, c3.x, c3.y, c3.z, e0.x, e0.y, e0.z, e1.x, e1.y, e1.z, e2.x, e2.y, e2.z, e3.x, e3.y, e3.z


but:

1) I think you can't be 100% sure about that (e.g. that the compiler will place the centers before the extends)

2) It's not guaranteed to be 16-byte aligned.

3) Most importantly, it's not what the code expects.

If you have to read the AABBs the way you did (one element at a time) I would suggest something like this:

__declspec(align(16)) _Vector3f aabbData[8];

aabbData[0].x = ... // center0.x
aabbData[0].y = ... // center0.y
aabbData[0].z = ... // center0.z
aabbData[1].x = ... // extend0.x
...

And then use this array to load the XMM regs as in the original code snippet.

PS. If you try to understand what the code does with those SSE instructions, you might be able to "optimize" it and get rid of the loads and shuffles completely. This is in case you continue to read the AABB data the way you do it.

Edited by hellraizer

#### Share this post

##### Share on other sites

But how is it that my array is not aligned the same way ?

_AABB is defined as

struct _Vector3f
{
float x, y, z;
};

struct _AABB
{
_Vector3f center;
_Vector3f extent;
};

struct __declspec(align(16)) _Plane
{
float nx, ny, nz, d;
};


wouldn't that have the same alignment as it's supposed to ?

The following code does work (the culling works) but as soon as I put in a fourth element it crashes with an access violation...

crashing at this line:

__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].center.x);


Here's the full code:

void RendererContext::FrustumCull_AABB_SSE(_AABB* aabbList,
unsigned int numAABBs,
_Plane* frustumPlanes,
unsigned int* culledSceneIDs)
{
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

__declspec(align(16)) _Plane absFrustumPlanes[6];
__m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
__m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}

// Process 4 AABBs in each iteration...
unsigned int numIterations = numAABBs >> 2;
for(unsigned int iIter = 0;iIter < numIterations;++iIter)
{
// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].center.z);

// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
// Calculate d...
__m128 xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nx);
__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].ny);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);
xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nz);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);
xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

// Calculate r...
xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nx);
__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].ny);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);
xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nz);
xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);
xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

// Calculate d + r + frustumPlane.d
__m128 xmm_d_p_r = _mm_add_ps(xmm_d, xmm_r);
xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].d);
xmm_d_p_r = _mm_add_ps(xmm_d_p_r, xmm_frustumPlane_Component);

// Check which boxes are outside this plane (if any)...
// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding
// box is outside the frustum.
unsigned int in_out_flag_curPlane = _mm_movemask_ps(xmm_d_p_r);
in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
if(!in_out_flag)
break;

// Calculate d - r + frustumPlane.d
__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);
xmm_d_m_r = _mm_add_ps(xmm_d_m_r, xmm_frustumPlane_Component);

// Check which boxes intersect the frustum...
unsigned int intersect_flag_curPlane = _mm_movemask_ps(xmm_d_m_r);
intersect_flag |= intersect_flag_curPlane;
}

// Calculate the state of the AABB from the 2 flags.
// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag
// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
culledSceneIDs[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
culledSceneIDs[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
culledSceneIDs[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
culledSceneIDs[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
}

// Process the rest of the AABBs one by one...
for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
{
__m128 xmm_aabbCenter_x = _mm_load_ss(&aabbList[iAABB].center.x);
__m128 xmm_aabbCenter_y = _mm_load_ss(&aabbList[iAABB].center.y);
__m128 xmm_aabbCenter_z = _mm_load_ss(&aabbList[iAABB].center.z);
__m128 xmm_aabbExtent_x = _mm_load_ss(&aabbList[iAABB].extent.x);
__m128 xmm_aabbExtent_y = _mm_load_ss(&aabbList[iAABB].extent.y);
__m128 xmm_aabbExtent_z = _mm_load_ss(&aabbList[iAABB].extent.z);

unsigned int result = 1; // Assume that the aabb will be Inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));

xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));

__m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));

xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));

__m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
__m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);

// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
if(negativeMask & 0x01)
{
result = 0; // Outside
break;
}
else if(negativeMask & 0x04)
result = 2; // Intersect
}

culledSceneIDs[iAABB] = result;
}
}


called like this:

	// Get View Frustum planes
_Plane frustumPlanes[6];
this->perspectiveCamera->GetFrustumPlanes(frustumPlanes);

std::vector< wrapper<_AABB> > aabbs = this->contentManager->GetAABBData();
unsigned int aabbSize = aabbs.size();
std::vector<unsigned int> culledSceneEntityIDs;
culledSceneEntityIDs.reserve(aabbSize);
for(size_t i = 0; i < aabbSize; ++i)
{
culledSceneEntityIDs.push_back(999999);
}
if(aabbSize > 0)
{
this->FrustumCull_AABB_SSE(&aabbs[0], aabbSize, frustumPlanes, &culledSceneEntityIDs[0]);
}


wrapper being:

template <typename T>
struct wrapper : public T
{
wrapper() {}
wrapper(const T& rhs) : T(rhs) {}
};


Now I tried checking its size with:

unsigned int TEST = sizeof(aabbs[0]);
unsigned int TEST2 = sizeof(_Vector3f);


and it returns 24, so doesn't that mean it is correctly aligned. So the first vector center would be 4 bytes each with extent.x being the fourth in the first 16 bytes, right ?

Same result checking size inside the loop before the line that makes it crash.

unsigned int alignmentTest = sizeof(aabbList[(iIter << 2) + 0]); // RESULT IS 24


ok so this post is getting ridiculously long already xD but why does this fail ?

	_AABB test[2];
test[0].center.x = 2.0f;
test[0].center.y = 1.0f;
test[0].center.z = 0.0f;
test[0].extent.x = 0.1f;
test[0].extent.y = 0.5f;
test[0].extent.z = 1.0f;
test[1].center.x = 2.0f;
test[1].center.y = 1.0f;
test[1].center.z = 0.0f;
test[1].extent.x = 0.1f;
test[1].extent.y = 0.5f;
test[1].extent.z = 1.0f;
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&test[0].center.x);


same error as inside the frustum cull. I feel the error is somewhere else here...

Edited by lipsryme

#### Share this post

##### Share on other sites

but I'm still getting a compiler error:
c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector(870): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned

This is a bug in visual studio's implementation of the standard containers -- the resize method takes a 'T' argument by value, and MSVC does not support pass-by-value with declspec-align'ed types... std::vector simply doesn't compile under MSVC when it's used with one of these types

Hence the alternative versions, like the Bullet version I posted earlier (which you can steal the source for, and use it by itself without Bullet) or other projects, like RDESTL.

#### Share this post

##### Share on other sites

@lipsryme

If you get an access violation as soon as you add a 4th aabb in the list it means that your aabbList array isn't 16-byte aligned. Two choices here:

1) Explicitly allocate the aabbList array to be 16-byte aligned (e.g. using _aligned_malloc()) or

2) Change the 6 _mm_load_ps() calls with _mm_loadu_ps() which allow reading from unaligned addresses.

Hope that helps.

PS. To test if an array address is 16-byte aligned you can use one of the functions found here: http://stackoverflow.com/a/1898487/2136504

E.g. Check &aabbList[(iIter << 2) + 0].center.x and if it returns true but the _mm_load_ps() fails then something else is wrong with your array.

Edited by hellraizer

#### Share this post

##### Share on other sites

You're right it does return false. I just tried it using an unaligned call and it does seem to work.

What do you think I should do about translating the center vector to my world space?

Should I do that separately inside the update loop for every AABB in my scene ?

And I'm going to try to steal some of the source code from that physics sdk (bullet).

Edited by lipsryme

#### Share this post

##### Share on other sites

If the AABBs correspond to static geometry, translating them to world space every frame is an overkill. You should do it once at start up.

If it's about dynamic geometry, then it shouldn't be that simple when rotation is involved. If your objects rotate, you should calculate the AABB from the OBB defined by the original AABB and the object's transformation, in case you want to use the same code for all your objects. Otherwise you can find/write another function which culls OBBs against the frustum.

In case you go about the OBB route, it might be faster to just check the bounding sphere (which isn't affected by rotation) against the frustum, at the expense of rendering a few more objects (bounding spheres tend to be larger that AABBs depending on the object they enclose).

Edited by hellraizer

#### Share this post

##### Share on other sites

Oh my god I've finally done it ....using the btAlignedArray from the bullet physics sdk.

And it is blazingly fast. With just the culling itself it takes about 0.03ms for 10k AABBs.

50k goes to around 0.53ms. 100k AABBs in 1.25ms.

Big thanks to everyone in here !

Edited by lipsryme

#### Share this post

##### Share on other sites

Remember, even the fastest frustum culling loop will be no match with hierarchical culing.

Most of a world is made of static geometry, so it is only a preprocess to clusterize it and merge the clusters AABB in some kind of hierarchical structure. An octree can be a good choice, each node split into 8 sub node ( neat, it is 2 times a 4 box simd culling ). You send have a usefull information for traversal, fully visible, invisible and partially culled. You can then push the primitives of the fully visible node without even culling for example.

And you seems to still miss understand the difference between type size and memory alignement, so study them again and again, also add bunch of assert each time to try to do an aligned load or write to catch bugs as soon as possible.

#### Share this post

##### Share on other sites

Remember, even the fastest frustum culling loop will be no match with hierarchical culing.
Most of a world is made of static geometry, so it is only a preprocess to clusterize it and merge the clusters AABB in some kind of hierarchical structure. An octree can be a good choice, each node split into 8 sub node ( neat, it is 2 times a 4 box simd culling ). You send have a usefull information for traversal, fully visible, invisible and partially culled. You can then push the primitives of the fully visible node without even culling for example.

And you seems to still miss understand the difference between type size and memory alignement, so study them again and again, also add bunch of assert each time to try to do an aligned load or write to catch bugs as soon as possible.

Hierarchical can be overkill for many cases and the added overhead and complexity can be then net loss.
In Battlefield 3 culling paper their paraller brute force algorithm was 3 times faster than hierarchical, code size was 80% smaller and because of this simplicity further optimizations was easier.
http://dice.se/publications/culling-the-battlefield-data-oriented-design-in-practice/

#### Share this post

##### Share on other sites

I perfectly know about that talks but the 360/PS3 way is not anymore. Beeing an in order RISC processor was a pain in the ass for code size and branching. Things as needed as possible in frustum culling like the float compare instruction was really costly too. Add the need for simple data layout to not saturate the DMA communication with the SPUs and the battefield choice may be legitime in their context.

But today and tomorow's engine will again scale the number of primitives to manage and draw by a good amount. Hierarchical layout will strike back easely with the more modern hardwares working more efficiently with branching.

And being hierarchical do not means we have to go down to single primitive leaves, everything is in the balance between the overhead of the hierarchy and the raw test. May be instead of storing hundreds of primitive in leafs, we will store thousands and dispatch each leaf on separate thread. In the end, only profile session and testing will give the answer, an answer dependant of the context of each game.

And strangely, on a previous game i work on, with a lot of instances to manage, one optimisation i did was to strip the hierarchical split of the culling after it reach a too small size because it was more efficient. So no, i am not a pro hierarchical, i am just a pro performance :)

Edited by galop1n

#### Share this post

##### Share on other sites

@OP - Could you share what hardware are you running this on?Those are some awesome results.

#### Share this post

##### Share on other sites
Core i5 2.8ghz quad core (I think it's a nehalem)
Mind you the frustum cull was basically the only thing inside my render loop and the results might not have been 100% perfectly measured

#### Share this post

##### Share on other sites

This topic is 1750 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

## Create an account or sign in to comment

You need to be a member in order to leave a comment

## Create an account

Sign up for a new account in our community. It's easy!

Register a new account

## Sign in

Already have an account? Sign in here.

Sign In Now