• Announcements

    • khawk

      Download the Game Design and Indie Game Marketing Freebook   07/19/17

      GameDev.net and CRC Press have teamed up to bring a free ebook of content curated from top titles published by CRC Press. The freebook, Practices of Game Design & Indie Game Marketing, includes chapters from The Art of Game Design: A Book of Lenses, A Practical Guide to Indie Game Marketing, and An Architectural Approach to Level Design. The GameDev.net FreeBook is relevant to game designers, developers, and those interested in learning more about the challenges in game development. We know game development can be a tough discipline and business, so we picked several chapters from CRC Press titles that we thought would be of interest to you, the GameDev.net audience, in your journey to design, develop, and market your next game. The free ebook is available through CRC Press by clicking here. The Curated Books The Art of Game Design: A Book of Lenses, Second Edition, by Jesse Schell Presents 100+ sets of questions, or different lenses, for viewing a game’s design, encompassing diverse fields such as psychology, architecture, music, film, software engineering, theme park design, mathematics, anthropology, and more. Written by one of the world's top game designers, this book describes the deepest and most fundamental principles of game design, demonstrating how tactics used in board, card, and athletic games also work in video games. It provides practical instruction on creating world-class games that will be played again and again. View it here. A Practical Guide to Indie Game Marketing, by Joel Dreskin Marketing is an essential but too frequently overlooked or minimized component of the release plan for indie games. A Practical Guide to Indie Game Marketing provides you with the tools needed to build visibility and sell your indie games. With special focus on those developers with small budgets and limited staff and resources, this book is packed with tangible recommendations and techniques that you can put to use immediately. As a seasoned professional of the indie game arena, author Joel Dreskin gives you insight into practical, real-world experiences of marketing numerous successful games and also provides stories of the failures. View it here. An Architectural Approach to Level Design This is one of the first books to integrate architectural and spatial design theory with the field of level design. The book presents architectural techniques and theories for level designers to use in their own work. It connects architecture and level design in different ways that address the practical elements of how designers construct space and the experiential elements of how and why humans interact with this space. Throughout the text, readers learn skills for spatial layout, evoking emotion through gamespaces, and creating better levels through architectural theory. View it here. Learn more and download the ebook by clicking here. Did you know? GameDev.net and CRC Press also recently teamed up to bring GDNet+ Members up to a 20% discount on all CRC Press books. Learn more about this and other benefits here.
Sign in to follow this  
Followers 0
lipsryme

Is my frustum culling slow ?

45 posts in this topic

the operator new do not receive an alignment value

This is true, but new is also required to return a block of memory that is suitably aligned to represent any object of the requested size. This is pretty ridiculous, because if you're allocating an object of size 32, the operator new function has absolutely no way of knowing if it should be aligned to 32/16/8/4/2/1 byte boundary, or some other value!

 

I've found that in practice, most implementations of operator new simply use min(16, size) as the alignment value, and assume that no object would ever require a larger alignment value... This means that objects that have been manually set to use an alignment value of 16 do typically still work ok with new.

However, when you manually set the alignment of a class to some larger value, say, 42 bytes, and then create it with new, these implementations are actually failing to obey the spec... but as you point out, there's no real way for them to be able to comply with the spec!

0

Share this post


Link to post
Share on other sites

As i said in a previous post, in practice, new return a memory aligned on 8 ( because the standard require a alignment enough for long long, the largest standard type ) with 32 bits builds and align on 16 with 64 bits build ( not because of the C++ standard, but because of the x64 ABI ).

 

Of course, because we (programmers) do not like a random behavior, we override the things by ourself, so the allocator will be consistant on PC, X360 or PS3 ( + any other target ).

0

Share this post


Link to post
Share on other sites

What if I were to create a class with nothing but an _AABB inside which is aligned to 16 and store this class inside the std::vector, and then access the _AABB inside for the frustum cull, would that work ?

 

edit: probably not... so how would I have to work around this?

How about storing it inside a non aligned container, and then create a new _AABB on the stack inside the frustum cull from the non aligned container ? Or is that bad/too slow ? edit: also doesn't work :p

Edited by lipsryme
0

Share this post


Link to post
Share on other sites

@Hodgman

What about making custom allocator for vector with _aligned_malloc/_aligned_free ?
 
I still don't understand this alignment thing, how does it work. Know any good tutorials for idiots?

Edited by belfegor
0

Share this post


Link to post
Share on other sites

@Hellraizer

On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly

 

Now i am confused as in 4box version says:

 

// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.

 

where that part of code is that you mentioned.

 

What I meant is, that the calculations of (d+r) and (d-r) in the 4-box-at-a-time loop are correct.

 

When you substitute the comment you mentioned, with the loop from CullAABBList_SSE_1(), you have to fix the typo I mentioned, in order for it to be correct.

 

Hope that makes sense.

1

Share this post


Link to post
Share on other sites

Yea I'd like to know that too, how to make it aligned to 16 bytes...

I've found something like this: http://code.google.com/p/mastermind-strategy/source/browse/trunk/src/util/aligned_allocator.hpp?r=167

 

but I'm still getting a compiler error:

c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector(870): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned

 

 

see reference to class template instantiation 'std::vector<_Ty,_Ax>' being compiled
1>          with
1>          [
1>              _Ty=_AABB,
1>              _Ax=aligned_allocator<_AABB,16>
1>          ]
Edited by lipsryme
1

Share this post


Link to post
Share on other sites

@lipsryme

I get no error using vector (with default allocator), but i am not sure will it work all the time so i need to understand this alignment thing so i can fix possible issues.

 

My test code (DX9)

#define NUM_TEAPOTS 1024
 
struct Object
{
    _AABB*              bbox;
    unsigned int*       state;
    D3DXVECTOR3         pos;
};
 
struct _AABB
{
    D3DXVECTOR3 m_Center;
    D3DXVECTOR3 m_Extent;
};
 
...
std::vector<Object>       vObjects;
std::vector<_AABB>        vAABBs;
std::vector<unsigned int> vStates;
...
vObjects.resize(NUM_TEAPOTS);
    vAABBs.resize(NUM_TEAPOTS);
    vStates.resize(NUM_TEAPOTS);
    vOccData.resize(NUM_TEAPOTS);
...//fill data
 
update()
{
    auto frustum = ExtractFrustumPlanes(&cam->getView(), &cam->getProjection());
    CullAABBList_SSE_4(&vAABBs[0], vAABBs.size(), &frustum[0], &vStates[0]);
 
    ....
    for(int i = 0; i < NUM_TEAPOTS; ++i)
    {
        if(*vObjects[i].state != 0)
           draw();
    }
}
0

Share this post


Link to post
Share on other sites

Well so I've found an interesting post here: http://stackoverflow.com/questions/9409591/self-contained-stl-compatible-implementation-of-stdvector/9414618#9414618

Which basically says that something like this should work:

#include <vector>

template <typename T>
struct wrapper : public T
{
    wrapper() {}
    wrapper(const T& rhs) : T(rhs) {}
};

struct __declspec(align(64)) S
{
    float x, y, z, w;
};

int main()
{
    std::vector< wrapper<S> > v;  // OK, no C2719 error
    return 0;
}

 

And it does compile...but maybe my problem is elsewhere but my culling doesn't work (only outputs 0).

Plus I'm getting an access violation read as soon as I put in up to four AABBs.

 

here's my code:

void RendererContext::FrustumCull_AABB_SSE(std::vector<aligned_16_wrapper<_AABB>>* aabbList,
										   unsigned int numAABBs,
										   _Plane* frustumPlanes,
										   unsigned int* culledSceneIDs)
{
	__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

	__declspec(align(16)) _Plane absFrustumPlanes[6];
	__m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
	for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
	{
		__m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
		__m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
		_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
	}

	// Process 4 AABBs in each iteration...
	unsigned int numIterations = numAABBs >> 2;
	for(unsigned int iIter = 0;iIter < numIterations;++iIter)
	{
		_Vector3f centerWS[4];
		centerWS[0].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.x + aabbList->at((iIter << 2) + 0).center.x;
		centerWS[0].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.y + aabbList->at((iIter << 2) + 0).center.y;
		centerWS[0].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.z + aabbList->at((iIter << 2) + 0).center.z;
		centerWS[1].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.x + aabbList->at((iIter << 2) + 1).center.x;
		centerWS[1].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.y + aabbList->at((iIter << 2) + 1).center.y;
		centerWS[1].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.z + aabbList->at((iIter << 2) + 1).center.z;
		centerWS[2].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.x + aabbList->at((iIter << 2) + 2).center.x;
		centerWS[2].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.y + aabbList->at((iIter << 2) + 2).center.y;
		centerWS[2].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.z + aabbList->at((iIter << 2) + 2).center.z;
		centerWS[3].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.x + aabbList->at((iIter << 2) + 3).center.x;
		centerWS[3].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.y + aabbList->at((iIter << 2) + 3).center.y;
		centerWS[3].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.z + aabbList->at((iIter << 2) + 3).center.z;

		_Vector3f extentWS[4];
		extentWS[0].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.x + aabbList->at((iIter << 2) + 0).extent.x;
		extentWS[0].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.y + aabbList->at((iIter << 2) + 0).extent.y;
		extentWS[0].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 0].worldPosition.z + aabbList->at((iIter << 2) + 0).extent.z;
		extentWS[1].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.x + aabbList->at((iIter << 2) + 1).extent.x;
		extentWS[1].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.y + aabbList->at((iIter << 2) + 1).extent.y;
		extentWS[1].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 1].worldPosition.z + aabbList->at((iIter << 2) + 1).extent.z;
		extentWS[2].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.x + aabbList->at((iIter << 2) + 2).extent.x;
		extentWS[2].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.y + aabbList->at((iIter << 2) + 2).extent.y;
		extentWS[2].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 2].worldPosition.z + aabbList->at((iIter << 2) + 2).extent.z;
		extentWS[3].x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.x + aabbList->at((iIter << 2) + 3).extent.x;
		extentWS[3].y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.y + aabbList->at((iIter << 2) + 3).extent.y;
		extentWS[3].z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[(iIter << 2) + 3].worldPosition.z + aabbList->at((iIter << 2) + 3).extent.z;

		// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
		// Load the 4 Center/Extents pairs for the 4 AABBs.
		__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&centerWS[0].x);
		__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&extentWS[0].y);
		__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&centerWS[1].z);
		__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&centerWS[2].x);
		__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&extentWS[2].y);
		__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&centerWS[3].z);

		// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
		__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
		__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

		__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
		__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
		__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
		__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

		__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
		__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

		unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
		unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
		for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
		{
			// Calculate d...
			__m128 xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nx);
			__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

			xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].ny);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);
			xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nz);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);
			xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

			// Calculate r...
			xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nx);
			__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].ny);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);
			xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nz);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);
			xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

			// Calculate d + r + frustumPlane.d
			__m128 xmm_d_p_r = _mm_add_ps(xmm_d, xmm_r);
			xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].d);
			xmm_d_p_r = _mm_add_ps(xmm_d_p_r, xmm_frustumPlane_Component);

			// Check which boxes are outside this plane (if any)...
			// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding 
			// box is outside the frustum. 
			unsigned int in_out_flag_curPlane = _mm_movemask_ps(xmm_d_p_r);
			in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

			// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
			if(!in_out_flag)
				break;

			// Calculate d - r + frustumPlane.d
			__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);
			xmm_d_m_r = _mm_add_ps(xmm_d_m_r, xmm_frustumPlane_Component);

			// Check which boxes intersect the frustum...
			unsigned int intersect_flag_curPlane = _mm_movemask_ps(xmm_d_m_r);
			intersect_flag |= intersect_flag_curPlane;
		}

		// Calculate the state of the AABB from the 2 flags.
		// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag
		// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
		culledSceneIDs[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
		culledSceneIDs[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
		culledSceneIDs[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
		culledSceneIDs[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
	}

	// Process the rest of the AABBs one by one...
	for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
	{
		_Vector3f centerWS;
		centerWS.x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.x + aabbList->at(iAABB).center.x;
		centerWS.y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.y + aabbList->at(iAABB).center.y;
		centerWS.z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.z + aabbList->at(iAABB).center.z;

		_Vector3f extentWS;
		extentWS.x = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.x + aabbList->at(iAABB).extent.x;
		extentWS.y = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.y + aabbList->at(iAABB).extent.y;
		extentWS.z = this->sceneContext->GetCurrentScene()->GetDesc()->primitives[iAABB].worldPosition.z + aabbList->at(iAABB).extent.z;

		__m128 xmm_aabbCenter_x = _mm_load_ss(&centerWS.x);
		__m128 xmm_aabbCenter_y = _mm_load_ss(&centerWS.y);
		__m128 xmm_aabbCenter_z = _mm_load_ss(&centerWS.z);
		__m128 xmm_aabbExtent_x = _mm_load_ss(&extentWS.x);
		__m128 xmm_aabbExtent_y = _mm_load_ss(&extentWS.y);
		__m128 xmm_aabbExtent_z = _mm_load_ss(&extentWS.z);

		unsigned int result = 1; // Assume that the aabb will be Inside the frustum
		for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
		{
			__m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
			__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
			xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));

			xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
			xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));

			__m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
			__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

			xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
			xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));

			xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
			xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));

			__m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
			__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
			__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

			// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
			__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
			int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);

			// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
			if(negativeMask & 0x01)
			{
				result = 0; // Outside
				break;
			}
			else if(negativeMask & 0x04)
				result = 2; // Intersect
		}

		culledSceneIDs[iAABB] = result;
	}
}
1

Share this post


Link to post
Share on other sites

I believe the changes you made in the code is the problem. To be more exact:

 

The original code read the centers and extends of the 4 AABBs from an array with the following layout:

c0.x, c0.y, c0.z, e0.x, e0.y, e0.z, c1.x, c1.y, c1.z, e1.x, e1.y, e1.z, c2.x, c2.y, c2.z, e2.x, e2.y, e2.z, c3.x, c3.y, c3.z, e3.x, e3.y, e3.z, ...

 

When the following instructions are executed, the XMM registers hold the values mentioned in their name:

 

// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
// Load the 4 Center/Extents pairs for the 4 AABBs.
__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x);
__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y);
__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z);
__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x);
__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y);
__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z); 

 

If we assume that the initial aabbList array is 16-byte aligned, all loads are 16-byte aligned and the instructions are executed correctly. This is the reason we are loading the XMM regs with those specific array offsets.

 

On the other hand, your code doesn't do the same thing. It just stores the AABBs on the stack and the layout isn't the one expected by the code. The best case scenario is that your layout is: 

 

 

c0.x, c0.y, c0.z, c1.x, c1.y, c1.z, c2.x, c2.y, c2.z, c3.x, c3.y, c3.z, e0.x, e0.y, e0.z, e1.x, e1.y, e1.z, e2.x, e2.y, e2.z, e3.x, e3.y, e3.z

 

but:

1) I think you can't be 100% sure about that (e.g. that the compiler will place the centers before the extends)

2) It's not guaranteed to be 16-byte aligned.

3) Most importantly, it's not what the code expects.

 

If you have to read the AABBs the way you did (one element at a time) I would suggest something like this:

 

__declspec(align(16)) _Vector3f aabbData[8];

aabbData[0].x = ... // center0.x
aabbData[0].y = ... // center0.y
aabbData[0].z = ... // center0.z
aabbData[1].x = ... // extend0.x
...

And then use this array to load the XMM regs as in the original code snippet.

 

PS. If you try to understand what the code does with those SSE instructions, you might be able to "optimize" it and get rid of the loads and shuffles completely. This is in case you continue to read the AABB data the way you do it.

Edited by hellraizer
1

Share this post


Link to post
Share on other sites

But how is it that my array is not aligned the same way ?

_AABB is defined as 

struct _Vector3f
{
float x, y, z;
};

struct _AABB
{
_Vector3f center;
_Vector3f extent;
};

struct __declspec(align(16)) _Plane
{
float nx, ny, nz, d;
};

 

 

wouldn't that have the same alignment as it's supposed to ?

The following code does work (the culling works) but as soon as I put in a fourth element it crashes with an access violation...

crashing at this line:

__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].center.x);

 

 

Here's the full code:

void RendererContext::FrustumCull_AABB_SSE(_AABB* aabbList,
										   unsigned int numAABBs,
										   _Plane* frustumPlanes,
										   unsigned int* culledSceneIDs)
{
	__declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF};

	__declspec(align(16)) _Plane absFrustumPlanes[6];
	__m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]);
	for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
	{
		__m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx);
		__m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask);
		_mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane);
	}

	// Process 4 AABBs in each iteration...
	unsigned int numIterations = numAABBs >> 2;
	for(unsigned int iIter = 0;iIter < numIterations;++iIter)
	{
		// NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves.
		// Load the 4 Center/Extents pairs for the 4 AABBs.
		__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].center.x);
		__m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].extent.y);
		__m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].center.z);
		__m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].center.x);
		__m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].extent.y);
		__m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].center.z);

		// Shuffle the data in order to get all Xs, Ys, etc. in the same register.
		__m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0));
		__m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1));

		__m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2));
		__m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2));
		__m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0));
		__m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1));

		__m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0));
		__m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0));
		__m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1));

		unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum.
		unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum.
		for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
		{
			// Calculate d...
			__m128 xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nx);
			__m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x);

			xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].ny);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y);
			xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nz);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z);
			xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component);

			// Calculate r...
			xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nx);
			__m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].ny);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y);
			xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nz);
			xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z);
			xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component);

			// Calculate d + r + frustumPlane.d
			__m128 xmm_d_p_r = _mm_add_ps(xmm_d, xmm_r);
			xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].d);
			xmm_d_p_r = _mm_add_ps(xmm_d_p_r, xmm_frustumPlane_Component);

			// Check which boxes are outside this plane (if any)...
			// NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding 
			// box is outside the frustum. 
			unsigned int in_out_flag_curPlane = _mm_movemask_ps(xmm_d_p_r);
			in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite.

			// If all boxes have been marked as outside the frustum, stop checking the rest of the planes.
			if(!in_out_flag)
				break;

			// Calculate d - r + frustumPlane.d
			__m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r);
			xmm_d_m_r = _mm_add_ps(xmm_d_m_r, xmm_frustumPlane_Component);

			// Check which boxes intersect the frustum...
			unsigned int intersect_flag_curPlane = _mm_movemask_ps(xmm_d_m_r);
			intersect_flag |= intersect_flag_curPlane;
		}

		// Calculate the state of the AABB from the 2 flags.
		// If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag
		// If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag.
		culledSceneIDs[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0);
		culledSceneIDs[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1);
		culledSceneIDs[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2);
		culledSceneIDs[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3);
	}

	// Process the rest of the AABBs one by one...
	for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB)
	{
		__m128 xmm_aabbCenter_x = _mm_load_ss(&aabbList[iAABB].center.x);
		__m128 xmm_aabbCenter_y = _mm_load_ss(&aabbList[iAABB].center.y);
		__m128 xmm_aabbCenter_z = _mm_load_ss(&aabbList[iAABB].center.z);
		__m128 xmm_aabbExtent_x = _mm_load_ss(&aabbList[iAABB].extent.x);
		__m128 xmm_aabbExtent_y = _mm_load_ss(&aabbList[iAABB].extent.y);
		__m128 xmm_aabbExtent_z = _mm_load_ss(&aabbList[iAABB].extent.z);

		unsigned int result = 1; // Assume that the aabb will be Inside the frustum
		for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
		{
			__m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
			__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

			xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
			xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));

			xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
			xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));

			__m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
			__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

			xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
			xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));

			xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
			xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));

			__m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
			__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
			__m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

			// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
			__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
			int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);

			// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
			if(negativeMask & 0x01)
			{
				result = 0; // Outside
				break;
			}
			else if(negativeMask & 0x04)
				result = 2; // Intersect
		}

		culledSceneIDs[iAABB] = result;
	}
}

 

called like this:

 


	// Get View Frustum planes
	_Plane frustumPlanes[6];
	this->perspectiveCamera->GetFrustumPlanes(frustumPlanes);

	std::vector< wrapper<_AABB> > aabbs = this->contentManager->GetAABBData();
	unsigned int aabbSize = aabbs.size();
	std::vector<unsigned int> culledSceneEntityIDs;
	culledSceneEntityIDs.reserve(aabbSize);
	for(size_t i = 0; i < aabbSize; ++i)
	{
		culledSceneEntityIDs.push_back(999999);
	}
	if(aabbSize > 0)
	{
		this->FrustumCull_AABB_SSE(&aabbs[0], aabbSize, frustumPlanes, &culledSceneEntityIDs[0]);
	}

wrapper being:

template <typename T>
struct wrapper : public T
{
	wrapper() {}
	wrapper(const T& rhs) : T(rhs) {}
};

 

Now I tried checking its size with:

unsigned int TEST = sizeof(aabbs[0]);
unsigned int TEST2 = sizeof(_Vector3f);

 

and it returns 24, so doesn't that mean it is correctly aligned. So the first vector center would be 4 bytes each with extent.x being the fourth in the first 16 bytes, right ?

Same result checking size inside the loop before the line that makes it crash.

unsigned int alignmentTest = sizeof(aabbList[(iIter << 2) + 0]); // RESULT IS 24

ok so this post is getting ridiculously long already xD but why does this fail ?

	_AABB test[2];
	test[0].center.x = 2.0f;
	test[0].center.y = 1.0f;
	test[0].center.z = 0.0f;
	test[0].extent.x = 0.1f;
	test[0].extent.y = 0.5f;
	test[0].extent.z = 1.0f;
	test[1].center.x = 2.0f;
	test[1].center.y = 1.0f;
	test[1].center.z = 0.0f;
	test[1].extent.x = 0.1f;
	test[1].extent.y = 0.5f;
	test[1].extent.z = 1.0f;
	__m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&test[0].center.x);

 

same error as inside the frustum cull. I feel the error is somewhere else here...

Edited by lipsryme
0

Share this post


Link to post
Share on other sites

You're right it does return false. I just tried it using an unaligned call and it does seem to work.

What do you think I should do about translating the center vector to my world space?

Should I do that separately inside the update loop for every AABB in my scene ?

 

And I'm going to try to steal some of the source code from that physics sdk (bullet).

Edited by lipsryme
0

Share this post


Link to post
Share on other sites

If the AABBs correspond to static geometry, translating them to world space every frame is an overkill. You should do it once at start up.

 

If it's about dynamic geometry, then it shouldn't be that simple when rotation is involved. If your objects rotate, you should calculate the AABB from the OBB defined by the original AABB and the object's transformation, in case you want to use the same code for all your objects. Otherwise you can find/write another function which culls OBBs against the frustum. 

 

In case you go about the OBB route, it might be faster to just check the bounding sphere (which isn't affected by rotation) against the frustum, at the expense of rendering a few more objects (bounding spheres tend to be larger that AABBs depending on the object they enclose). 

Edited by hellraizer
1

Share this post


Link to post
Share on other sites

Oh my god I've finally done it smile.png....using the btAlignedArray from the bullet physics sdk.

And it is blazingly fast. With just the culling itself it takes about 0.03ms for 10k AABBs.

50k goes to around 0.53ms. 100k AABBs in 1.25ms.

 

Big thanks to everyone in here !

Edited by lipsryme
1

Share this post


Link to post
Share on other sites

Remember, even the fastest frustum culling loop will be no match with hierarchical culing.

Most of a world is made of static geometry, so it is only a preprocess to clusterize it and merge the clusters AABB in some kind of hierarchical structure. An octree can be a good choice, each node split into 8 sub node ( neat, it is 2 times a 4 box simd culling ). You send have a usefull information for traversal, fully visible, invisible and partially culled. You can then push the primitives of the fully visible node without even culling for example.

 

And you seems to still miss understand the difference between type size and memory alignement, so study them again and again, also add bunch of assert each time to try to do an aligned load or write to catch bugs as soon as possible.

0

Share this post


Link to post
Share on other sites

Remember, even the fastest frustum culling loop will be no match with hierarchical culing.
Most of a world is made of static geometry, so it is only a preprocess to clusterize it and merge the clusters AABB in some kind of hierarchical structure. An octree can be a good choice, each node split into 8 sub node ( neat, it is 2 times a 4 box simd culling ). You send have a usefull information for traversal, fully visible, invisible and partially culled. You can then push the primitives of the fully visible node without even culling for example.
 
And you seems to still miss understand the difference between type size and memory alignement, so study them again and again, also add bunch of assert each time to try to do an aligned load or write to catch bugs as soon as possible.

Hierarchical can be overkill for many cases and the added overhead and complexity can be then net loss.
In Battlefield 3 culling paper their paraller brute force algorithm was 3 times faster than hierarchical, code size was 80% smaller and because of this simplicity further optimizations was easier.
http://dice.se/publications/culling-the-battlefield-data-oriented-design-in-practice/
0

Share this post


Link to post
Share on other sites

I perfectly know about that talks but the 360/PS3 way is not anymore. Beeing an in order RISC processor was a pain in the ass for code size and branching. Things as needed as possible in frustum culling like the float compare instruction was really costly too. Add the need for simple data layout to not saturate the DMA communication with the SPUs and the battefield choice may be legitime in their context.

 

But today and tomorow's engine will again scale the number of primitives to manage and draw by a good amount. Hierarchical layout will strike back easely with the more modern hardwares working more efficiently with branching.

 

And being hierarchical do not means we have to go down to single primitive leaves, everything is in the balance between the overhead of the hierarchy and the raw test. May be instead of storing hundreds of primitive in leafs, we will store thousands and dispatch each leaf on separate thread. In the end, only profile session and testing will give the answer, an answer dependant of the context of each game.

 

And strangely, on a previous game i work on, with a lot of instances to manage, one optimisation i did was to strip the hierarchical split of the culling after it reach a too small size because it was more efficient. So no, i am not a pro hierarchical, i am just a pro performance :)

Edited by galop1n
0

Share this post


Link to post
Share on other sites

@OP - Could you share what hardware are you running this on?Those are some awesome results.

0

Share this post


Link to post
Share on other sites
Core i5 2.8ghz quad core (I think it's a nehalem)
Mind you the frustum cull was basically the only thing inside my render loop and the results might not have been 100% perfectly measured
0

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now
Sign in to follow this  
Followers 0