• Create Account

Is my frustum culling slow ?

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

45 replies to this topic

#21belfegor  Members

Posted 31 March 2013 - 07:21 AM

class and type sizes are unrelated to instance adresses.

Excuse my stupidity i still don't understand.

You mean instances of a _Plane in an std::array or vector might not be contiguous in memory (with their default allocator)?

Edited by belfegor, 31 March 2013 - 07:25 AM.

#22Hodgman  Moderators

Posted 31 March 2013 - 07:32 AM

But i don't understand why isn't it already 16byte aligned as data is 4 floats ?

It's made up of primitives, each of which only requires 4-byte alignment to work correctly, so the struct will work as long as it's 4-byte aligned. It doesn't need to be 8 or 16 byte aligned in order to function correctly, only 4-byte aligned (and this is assuming that floats actually need to be 4 byte aligned).

class and type sizes are unrelated to instance adresses.

Excuse my stupidity i still don't understand.
You mean instances of a _Plane in an std::array or vector might not be contiguous in memory (with their default allocator)?

As far as I know:
• std::allocator<T>::allocate (which is used by the std containers) or "new T" will return a block of memory that is correctly aligned for the type "T" (i.e. the address is a multiple of alignof(T))
• malloc( sizeof(T) ) wont.

Edited by Hodgman, 31 March 2013 - 07:35 AM.

#23galop1n  Members

Posted 31 March 2013 - 07:34 AM

if a class is declared as align on a 16 byte boundary, the class size will be a multiple of 16 too, because we can write things like MyClass array[N]; and if array[i] is align on 16, then array[i+1] will too.

But if you write MyClass* array = new MyClass[N]; you have no guarante on the memory address return by new, because the operator new do not receive an alignment value. You will need to overload the operator new of MyClass to do some internal memalign, or globaly override the memory allocator to always return address multiple of 16.

And std::array is a C array, so the memory is contigus but do not rely on a memory allocator, so the compiler is able to align the members with the declspec value (as long as the objet storing an std::array<_Plane,6> is not allocate with a bare new )

#24galop1n  Members

Posted 31 March 2013 - 07:38 AM

As far as I know:

std::allocator::allocate (which is used by the std containers) or "new T" will return a block of memory that is correctly aligned for the type "T".
malloc( sizeof(T) ) wont.

Sadly, the standard do not consider simd types, the largest native alignment use by the default std::alocator is the one for a long long, that is only 8 bytes.

#25belfegor  Members

Posted 31 March 2013 - 08:11 AM

Am i getting this right:

__declspec(align(16))
struct Foo
{
float x;
float y;
float z;
};


storage space for each element would be 16 bytes?

#26Hodgman  Moderators

Posted 31 March 2013 - 08:12 AM

the operator new do not receive an alignment value

This is true, but new is also required to return a block of memory that is suitably aligned to represent any object of the requested size. This is pretty ridiculous, because if you're allocating an object of size 32, the operator new function has absolutely no way of knowing if it should be aligned to 32/16/8/4/2/1 byte boundary, or some other value!

I've found that in practice, most implementations of operator new simply use min(16, size) as the alignment value, and assume that no object would ever require a larger alignment value... This means that objects that have been manually set to use an alignment value of 16 do typically still work ok with new.

However, when you manually set the alignment of a class to some larger value, say, 42 bytes, and then create it with new, these implementations are actually failing to obey the spec... but as you point out, there's no real way for them to be able to comply with the spec!

#27galop1n  Members

Posted 31 March 2013 - 08:23 AM

As i said in a previous post, in practice, new return a memory aligned on 8 ( because the standard require a alignment enough for long long, the largest standard type ) with 32 bits builds and align on 16 with 64 bits build ( not because of the C++ standard, but because of the x64 ABI ).

Of course, because we (programmers) do not like a random behavior, we override the things by ourself, so the allocator will be consistant on PC, X360 or PS3 ( + any other target ).

#28lipsryme  Members

Posted 31 March 2013 - 08:32 AM

What if I were to create a class with nothing but an _AABB inside which is aligned to 16 and store this class inside the std::vector, and then access the _AABB inside for the frustum cull, would that work ?

edit: probably not... so how would I have to work around this?

How about storing it inside a non aligned container, and then create a new _AABB on the stack inside the frustum cull from the non aligned container ? Or is that bad/too slow ? edit: also doesn't work

Edited by lipsryme, 31 March 2013 - 09:20 AM.

#29Hodgman  Moderators

Posted 31 March 2013 - 08:47 AM

std::vector simply has problems with aligned types, especially on MSVC.

Often the solution is to not use it, e.g. the Bullet physics middleware wrote a replacement called btAlignedObjectArray.

#30belfegor  Members

Posted 31 March 2013 - 08:55 AM

@Hodgman

What about making custom allocator for vector with _aligned_malloc/_aligned_free ?

I still don't understand this alignment thing, how does it work. Know any good tutorials for idiots?

Edited by belfegor, 31 March 2013 - 08:55 AM.

#31hellraizer  Members

Posted 31 March 2013 - 09:15 AM

@Hellraizer

On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly

Now i am confused as in 4box version says:

// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.


where that part of code is that you mentioned.

What I meant is, that the calculations of (d+r) and (d-r) in the 4-box-at-a-time loop are correct.

When you substitute the comment you mentioned, with the loop from CullAABBList_SSE_1(), you have to fix the typo I mentioned, in order for it to be correct.

Hope that makes sense.

HellRaiZer

#32lipsryme  Members

Posted 31 March 2013 - 09:15 AM

Yea I'd like to know that too, how to make it aligned to 16 bytes...

I've found something like this: http://code.google.com/p/mastermind-strategy/source/browse/trunk/src/util/aligned_allocator.hpp?r=167

but I'm still getting a compiler error:

c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector(870): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned

see reference to class template instantiation 'std::vector<_Ty,_Ax>' being compiled
1>          with
1>          [
1>              _Ty=_AABB,
1>              _Ax=aligned_allocator<_AABB,16>
1>          ]

Edited by lipsryme, 31 March 2013 - 09:20 AM.

#33belfegor  Members

Posted 31 March 2013 - 09:41 AM

@lipsryme

I get no error using vector (with default allocator), but i am not sure will it work all the time so i need to understand this alignment thing so i can fix possible issues.

My test code (DX9)

#define NUM_TEAPOTS 1024

struct Object
{
_AABB*              bbox;
unsigned int*       state;
D3DXVECTOR3         pos;
};

struct _AABB
{
D3DXVECTOR3 m_Center;
D3DXVECTOR3 m_Extent;
};

...
std::vector<Object>       vObjects;
std::vector<_AABB>        vAABBs;
std::vector<unsigned int> vStates;
...
vObjects.resize(NUM_TEAPOTS);
vAABBs.resize(NUM_TEAPOTS);
vStates.resize(NUM_TEAPOTS);
vOccData.resize(NUM_TEAPOTS);
...//fill data

update()
{
auto frustum = ExtractFrustumPlanes(&cam->getView(), &cam->getProjection());
CullAABBList_SSE_4(&vAABBs[0], vAABBs.size(), &frustum[0], &vStates[0]);

....
for(int i = 0; i < NUM_TEAPOTS; ++i)
{
if(*vObjects[i].state != 0)
draw();
}
}


#34lipsryme  Members

Posted 31 March 2013 - 10:32 AM

Well so I've found an interesting post here: http://stackoverflow.com/questions/9409591/self-contained-stl-compatible-implementation-of-stdvector/9414618#9414618

Which basically says that something like this should work:

#include <vector>

template <typename T>
struct wrapper : public T
{
wrapper() {}
wrapper(const T& rhs) : T(rhs) {}
};

struct __declspec(align(64)) S
{
float x, y, z, w;
};

int main()
{
std::vector< wrapper<S> > v;  // OK, no C2719 error
return 0;
}


And it does compile...but maybe my problem is elsewhere but my culling doesn't work (only outputs 0).

Plus I'm getting an access violation read as soon as I put in up to four AABBs.

here's my code:

void RendererContext::FrustumCull_AABB_SSE(std::vector<aligned_16_wrapper<_AABB>>* aabbList,
unsigned int numAABBs,
_Plane* frustumPlanes,
unsigned int* culledSceneIDs)
{
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

__declspec(align(16)) _Plane absFrustumPlanes[6];
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}

// Process 4 AABBs in each iteration...
unsigned int numIterations = numAABBs >> 2;
for(unsigned int iIter = 0;iIter < numIterations;++iIter)
{
_Vector3f centerWS[4];
centerWS[0].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.x + aabbList->at((iIter << 2) + 0).center.x;
centerWS[0].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.y + aabbList->at((iIter << 2) + 0).center.y;
centerWS[0].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.z + aabbList->at((iIter << 2) + 0).center.z;
centerWS[1].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.x + aabbList->at((iIter << 2) + 1).center.x;
centerWS[1].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.y + aabbList->at((iIter << 2) + 1).center.y;
centerWS[1].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.z + aabbList->at((iIter << 2) + 1).center.z;
centerWS[2].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.x + aabbList->at((iIter << 2) + 2).center.x;
centerWS[2].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.y + aabbList->at((iIter << 2) + 2).center.y;
centerWS[2].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.z + aabbList->at((iIter << 2) + 2).center.z;
centerWS[3].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.x + aabbList->at((iIter << 2) + 3).center.x;
centerWS[3].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.y + aabbList->at((iIter << 2) + 3).center.y;
centerWS[3].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.z + aabbList->at((iIter << 2) + 3).center.z;

_Vector3f extentWS[4];
extentWS[0].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.x + aabbList->at((iIter << 2) + 0).extent.x;
extentWS[0].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.y + aabbList->at((iIter << 2) + 0).extent.y;
extentWS[0].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.z + aabbList->at((iIter << 2) + 0).extent.z;
extentWS[1].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.x + aabbList->at((iIter << 2) + 1).extent.x;
extentWS[1].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.y + aabbList->at((iIter << 2) + 1).extent.y;
extentWS[1].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.z + aabbList->at((iIter << 2) + 1).extent.z;
extentWS[2].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.x + aabbList->at((iIter << 2) + 2).extent.x;
extentWS[2].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.y + aabbList->at((iIter << 2) + 2).extent.y;
extentWS[2].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.z + aabbList->at((iIter << 2) + 2).extent.z;
extentWS[3].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.x + aabbList->at((iIter << 2) + 3).extent.x;
extentWS[3].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.y + aabbList->at((iIter << 2) + 3).extent.y;
extentWS[3].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.z + aabbList->at((iIter << 2) + 3).extent.z;

// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.

// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
// Calculate d...
__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);

// Calculate r...
__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);

// Calculate d + r + frustumPlane.d

// Check which boxes are outside this plane (if any)...
// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding
// box is outside the frustum.
in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
if(!in_out_flag)
break;

// Calculate d - r + frustumPlane.d
__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);

// Check which boxes intersect the frustum...
intersect_flag |= intersect_flag_curPlane;
}

// Calculate the state of the AABB from the 2 flags.
// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag
// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
culledSceneIDs[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
culledSceneIDs[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
culledSceneIDs[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
culledSceneIDs[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
}

// Process the rest of the AABBs one by one...
for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
{
_Vector3f centerWS;
centerWS.x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.x + aabbList->at(iAABB).center.x;
centerWS.y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.y + aabbList->at(iAABB).center.y;
centerWS.z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.z + aabbList->at(iAABB).center.z;

_Vector3f extentWS;
extentWS.x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.x + aabbList->at(iAABB).extent.x;
extentWS.y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.y + aabbList->at(iAABB).extent.y;
extentWS.z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.z + aabbList->at(iAABB).extent.z;

unsigned int result = 1; // Assume that the aabb will be Inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));

// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
{
result = 0; // Outside
break;
}
result = 2; // Intersect
}

culledSceneIDs[iAABB] = result;
}
}


#35hellraizer  Members

Posted 31 March 2013 - 11:53 AM

I believe the changes you made in the code is the problem. To be more exact:

The original code read the centers and extends of the 4 AABBs from an array with the following layout:

c0.x, c0.y, c0.z, e0.x, e0.y, e0.z, c1.x, c1.y, c1.z, e1.x, e1.y, e1.z, c2.x, c2.y, c2.z, e2.x, e2.y, e2.z, c3.x, c3.y, c3.z, e3.x, e3.y, e3.z, ...


When the following instructions are executed, the XMM registers hold the values mentioned in their name:

// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z);


If we assume that the initial aabbList array is 16-byte aligned, all loads are 16-byte aligned and the instructions are executed correctly. This is the reason we are loading the XMM regs with those specific array offsets.

On the other hand, your code doesn't do the same thing. It just stores the AABBs on the stack and the layout isn't the one expected by the code. The best case scenario is that your layout is:

c0.x, c0.y, c0.z, c1.x, c1.y, c1.z, c2.x, c2.y, c2.z, c3.x, c3.y, c3.z, e0.x, e0.y, e0.z, e1.x, e1.y, e1.z, e2.x, e2.y, e2.z, e3.x, e3.y, e3.z


but:

1) I think you can't be 100% sure about that (e.g. that the compiler will place the centers before the extends)

2) It's not guaranteed to be 16-byte aligned.

3) Most importantly, it's not what the code expects.

If you have to read the AABBs the way you did (one element at a time) I would suggest something like this:

__declspec(align(16)) _Vector3f aabbData[8];

aabbData[0].x = ... // center0.x
aabbData[0].y = ... // center0.y
aabbData[0].z = ... // center0.z
aabbData[1].x = ... // extend0.x
...

And then use this array to load the XMM regs as in the original code snippet.

PS. If you try to understand what the code does with those SSE instructions, you might be able to "optimize" it and get rid of the loads and shuffles completely. This is in case you continue to read the AABB data the way you do it.

Edited by hellraizer, 31 March 2013 - 11:56 AM.

HellRaiZer

#36lipsryme  Members

Posted 31 March 2013 - 01:20 PM

But how is it that my array is not aligned the same way ?

_AABB is defined as

struct _Vector3f
{
float x, y, z;
};

struct _AABB
{
_Vector3f center;
_Vector3f extent;
};

struct __declspec(align(16)) _Plane
{
float nx, ny, nz, d;
};


wouldn't that have the same alignment as it's supposed to ?

The following code does work (the culling works) but as soon as I put in a fourth element it crashes with an access violation...

crashing at this line:

__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].center.x);


Here's the full code:

void RendererContext::FrustumCull_AABB_SSE(_AABB* aabbList,
unsigned int numAABBs,
_Plane* frustumPlanes,
unsigned int* culledSceneIDs)
{
__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

__declspec(align(16)) _Plane absFrustumPlanes[6];
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
}

// Process 4 AABBs in each iteration...
unsigned int numIterations = numAABBs >> 2;
for(unsigned int iIter = 0;iIter < numIterations;++iIter)
{
// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].center.z);

// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
// Calculate d...
__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);

// Calculate r...
__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);

xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);

// Calculate d + r + frustumPlane.d

// Check which boxes are outside this plane (if any)...
// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding
// box is outside the frustum.
in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
if(!in_out_flag)
break;

// Calculate d - r + frustumPlane.d
__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);

// Check which boxes intersect the frustum...
intersect_flag |= intersect_flag_curPlane;
}

// Calculate the state of the AABB from the 2 flags.
// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag
// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
culledSceneIDs[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
culledSceneIDs[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
culledSceneIDs[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
culledSceneIDs[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
}

// Process the rest of the AABBs one by one...
for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
{

unsigned int result = 1; // Assume that the aabb will be Inside the frustum
for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
{
__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));

// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
{
result = 0; // Outside
break;
}
result = 2; // Intersect
}

culledSceneIDs[iAABB] = result;
}
}


called like this:

	// Get View Frustum planes
_Plane frustumPlanes[6];
this->perspectiveCamera->GetFrustumPlanes(frustumPlanes);

std::vector< wrapper<_AABB> > aabbs = this->contentManager->GetAABBData();
unsigned int aabbSize = aabbs.size();
std::vector<unsigned int> culledSceneEntityIDs;
culledSceneEntityIDs.reserve(aabbSize);
for(size_t i = 0; i < aabbSize; ++i)
{
culledSceneEntityIDs.push_back(999999);
}
if(aabbSize > 0)
{
this->FrustumCull_AABB_SSE(&aabbs[0], aabbSize, frustumPlanes, &culledSceneEntityIDs[0]);
}


wrapper being:

template <typename T>
struct wrapper : public T
{
wrapper() {}
wrapper(const T& rhs) : T(rhs) {}
};


Now I tried checking its size with:

unsigned int TEST = sizeof(aabbs[0]);
unsigned int TEST2 = sizeof(_Vector3f);


and it returns 24, so doesn't that mean it is correctly aligned. So the first vector center would be 4 bytes each with extent.x being the fourth in the first 16 bytes, right ?

Same result checking size inside the loop before the line that makes it crash.

unsigned int alignmentTest = sizeof(aabbList[(iIter << 2) + 0]); // RESULT IS 24


ok so this post is getting ridiculously long already xD but why does this fail ?

	_AABB test[2];
test[0].center.x = 2.0f;
test[0].center.y = 1.0f;
test[0].center.z = 0.0f;
test[0].extent.x = 0.1f;
test[0].extent.y = 0.5f;
test[0].extent.z = 1.0f;
test[1].center.x = 2.0f;
test[1].center.y = 1.0f;
test[1].center.z = 0.0f;
test[1].extent.x = 0.1f;
test[1].extent.y = 0.5f;
test[1].extent.z = 1.0f;


same error as inside the frustum cull. I feel the error is somewhere else here...

Edited by lipsryme, 31 March 2013 - 03:03 PM.

#37Hodgman  Moderators

Posted 31 March 2013 - 09:13 PM

but I'm still getting a compiler error:
c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector(870): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned

This is a bug in visual studio's implementation of the standard containers -- the resize method takes a 'T' argument by value, and MSVC does not support pass-by-value with declspec-align'ed types... std::vector simply doesn't compile under MSVC when it's used with one of these types

Hence the alternative versions, like the Bullet version I posted earlier (which you can steal the source for, and use it by itself without Bullet) or other projects, like RDESTL.

#38hellraizer  Members

Posted 01 April 2013 - 02:27 AM

@lipsryme

If you get an access violation as soon as you add a 4th aabb in the list it means that your aabbList array isn't 16-byte aligned. Two choices here:

1) Explicitly allocate the aabbList array to be 16-byte aligned (e.g. using _aligned_malloc()) or

Hope that helps.

PS. To test if an array address is 16-byte aligned you can use one of the functions found here: http://stackoverflow.com/a/1898487/2136504

E.g. Check &aabbList[(iIter << 2) + 0].center.x and if it returns true but the _mm_load_ps() fails then something else is wrong with your array.

Edited by hellraizer, 01 April 2013 - 02:36 AM.

HellRaiZer

#39lipsryme  Members

Posted 01 April 2013 - 03:39 AM

You're right it does return false. I just tried it using an unaligned call and it does seem to work.

What do you think I should do about translating the center vector to my world space?

Should I do that separately inside the update loop for every AABB in my scene ?

And I'm going to try to steal some of the source code from that physics sdk (bullet).

Edited by lipsryme, 01 April 2013 - 03:39 AM.

#40hellraizer  Members

Posted 01 April 2013 - 04:34 AM

If the AABBs correspond to static geometry, translating them to world space every frame is an overkill. You should do it once at start up.

If it's about dynamic geometry, then it shouldn't be that simple when rotation is involved. If your objects rotate, you should calculate the AABB from the OBB defined by the original AABB and the object's transformation, in case you want to use the same code for all your objects. Otherwise you can find/write another function which culls OBBs against the frustum.

In case you go about the OBB route, it might be faster to just check the bounding sphere (which isn't affected by rotation) against the frustum, at the expense of rendering a few more objects (bounding spheres tend to be larger that AABBs depending on the object they enclose).

Edited by hellraizer, 01 April 2013 - 04:35 AM.

HellRaiZer

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.