Sign in to follow this  
lipsryme

Is my frustum culling slow ?

Recommended Posts

lipsryme    1522

Doing frustum culling on 10.000 objects it takes about 0.85ms...is that slow/fast ?

 

I'm using AABBs to frustum cull my objects.

See code here:

 

bool RendererD3D11::FrustumCull(SceneEntityDescription* entity,
								XMFLOAT4* frustumPlanes)
{
	// If there even was a rejection last time
	if(entity->lastRejectedFrustumPlane != 999999)
	{
		// Check last rejected planeId first
		XMVECTOR planeNormal = XMVectorSet(frustumPlanes[entity->lastRejectedFrustumPlane].x, frustumPlanes[entity->lastRejectedFrustumPlane].y,
			frustumPlanes[entity->lastRejectedFrustumPlane].z, 0.0f);

		float planeConstant = frustumPlanes[entity->lastRejectedFrustumPlane].w;

		// Check each axis (x, y, z) to get the AABB vertex furthest away from the direction
		// the plane is facing (plane normal)
		XMFLOAT3 axisVert;
		XMFLOAT3 aabb_min = this->contentManager->GetPrimitiveFromPool(entity->ID)->GetBoundingVolume()->Min();
		XMFLOAT3 aabb_max = this->contentManager->GetPrimitiveFromPool(entity->ID)->GetBoundingVolume()->Max();
		XMFLOAT3 objPos = entity->worldPosition;

		// x-axis
		if(frustumPlanes[entity->lastRejectedFrustumPlane].x < 0.0f)
		{
			axisVert.x = aabb_min.x + objPos.x; // min x + obj position's x
		}
		else
		{
			axisVert.x = aabb_max.x + objPos.x; // max x + obj position's x
		}

		// y-axis
		if(frustumPlanes[entity->lastRejectedFrustumPlane].y < 0.0f)
		{
			axisVert.y = aabb_min.y + objPos.y; // min y + obj position's y
		}
		else
		{
			axisVert.y = aabb_max.y + objPos.y; // max y + obj position's y
		}

		// z-axis
		if(frustumPlanes[entity->lastRejectedFrustumPlane].z < 0.0f)
		{
			axisVert.z = aabb_min.z + objPos.z; // min z + obj position's z
		}
		else
		{
			axisVert.z = aabb_max.z + objPos.z; // min z + obj position's z
		}

		// Now we get the signed distance from the AABB's vertex that's furthest down the frustum planes normal,
		// and if the signed distance is negative, then the entire bounding box is behind the frustum plane, which means
		// that it should be culled
		if(XMVectorGetX(XMVector3Dot(planeNormal, XMLoadFloat3(&axisVert))) + planeConstant < 0.0f)
		{
			return true;
		}
	}
	

	// Loop through each frustum plane
	for(int planeID = 0; planeID < 6; ++planeID)
	{
		if(entity->lastRejectedFrustumPlane != 999999)
		{
			if(planeID == entity->lastRejectedFrustumPlane)
			{
				// skip last rejected frustum plane since we've checked it before this loop
				continue;
			}
		}


		XMVECTOR planeNormal = XMVectorSet(frustumPlanes[planeID].x, frustumPlanes[planeID].y,
										   frustumPlanes[planeID].z, 0.0f);

		float planeConstant = frustumPlanes[planeID].w;

		// Check each axis (x, y, z) to get the AABB vertex furthest away from the direction
		// the plane is facing (plane normal)
		XMFLOAT3 axisVert;
		XMFLOAT3 aabb_min = this->contentManager->GetPrimitiveFromPool(entity->ID)->GetBoundingVolume()->Min();
		XMFLOAT3 aabb_max = this->contentManager->GetPrimitiveFromPool(entity->ID)->GetBoundingVolume()->Max();
		XMFLOAT3 objPos = entity->worldPosition;

		// x-axis
		if(frustumPlanes[planeID].x < 0.0f)
		{
			axisVert.x = aabb_min.x + objPos.x; // min x + obj position's x
		}
		else
		{
			axisVert.x = aabb_max.x + objPos.x; // max x + obj position's x
		}

		// y-axis
		if(frustumPlanes[planeID].y < 0.0f)
		{
			axisVert.y = aabb_min.y + objPos.y; // min y + obj position's y
		}
		else
		{
			axisVert.y = aabb_max.y + objPos.y; // max y + obj position's y
		}

		// z-axis
		if(frustumPlanes[planeID].z < 0.0f)
		{
			axisVert.z = aabb_min.z + objPos.z; // min z + obj position's z
		}
		else
		{
			axisVert.z = aabb_max.z + objPos.z; // min z + obj position's z
		}

		// Now we get the signed distance from the AABB's vertex that's furthest down the frustum planes normal,
		// and if the signed distance is negative, then the entire bounding box is behind the frustum plane, which means
		// that it should be culled
		if(XMVectorGetX(XMVector3Dot(planeNormal, XMLoadFloat3(&axisVert))) + planeConstant < 0.0f)
		{
			entity->lastRejectedFrustumPlane = planeID; // store rejected frustum plane ID for the next frames
			return true;
		}
	}

	return false;
}

 

 

Keep in mind these 10.000 objects are all inside the view frustum so they'll all go through all 6 planes and pass (worst case scenario I know).

I've commented out the draw call itself so the rendering itself has no influence on these numbers.

Edited by lipsryme

Share this post


Link to post
Share on other sites
Juliean    7067

Sounds resonable, at least my frustum culling implementation I implemented today takes approximately 0,53 ms to cull 10000 objects that are all visible. I didn't implement the last rejection optimization though. Thats is my code:

 

bool Plane::Inside(const BoundingBox& box) const
{
	D3DXVECTOR3 vMin(box.m_vCenter.x - box.m_size, box.m_vCenter.y - box.m_size, box.m_vCenter.z - box.m_size), vMax(box.m_vCenter.x + box.m_size, box.m_vCenter.y + box.m_size, box.m_vCenter.z + box.m_size);
	D3DXVECTOR4 vPos;
	if(m_vNormal.x >= 0.0f)  
	{ 
		vPos.x=vMax.x; 
	}
    else 
	{ 
		vPos.x=vMin.x; 
	}
    if(m_vNormal.y >= 0.0f)  
	{ 
		vPos.y=vMax.y; 
	}
    else 
	{ 
		vPos.y=vMin.y; 
	}
    if(m_vNormal.z >= 0.0f)  
	{ 
		vPos.z=vMax.z; 
	}
    else 
	{ 
		vPos.z=vMin.z; 
	}

	vPos.w = 1.0f;

	float posDot = D3DXPlaneDot(&m_plane, &vPos);

	if(posDot < 0.0f)
		return false;
	else
		return true;
}

bool Frustum::TestBox(const BoundingBox& box) const
{
	for ( int i = 0; i < 6; i++ )
    {
        if ( !m_frustums[i].Inside(box) )
			return false;
    }
    return true;
}

Share this post


Link to post
Share on other sites
galop1n    936

The question miss details for a good answer. 10K box culled by a frustum in 0.5ms on PS3 is probably good, on a high end CPU is probably not :) If your final scene is below 10K primitives, it is good, if your final scene will contains millions of primitives, it will not be. If you can run the culling on a thread and doing something long enough before needing the result, even if it would last for 3ms, it is not a problem.  You may even run the frustum culling on GPU and do some draw indexed indirect without cpu sync.

 

 The only thing i can tell is that your code is far from optimal by itself when considering batching a same process for a lot of entry ! Everything will depend on the data layout. Two paradigm exist, Array of structure (each box in the array is made of two points each made of x, y and z coordinate) and Structure of Array ( a container of N box split in 6 arrays of float ). By choosing the second one, you can do some SIMD programing, and divide the frustum test count by 4 for almost free. You can also try to remove branching with some value masking, ...

 

The C++ way with a Plane API to test a single point distance looks nice, but most of the times, if you want to test one point, you wants to test N points and there is certainly part of the computation mergeable.

 

Good luck :)

Share this post


Link to post
Share on other sites
lipsryme    1522
Well it's just for the pc, so x86 architecture. And I don't think I'll ever have even close to that many entities in my scene. I guess particles would be culled differently?

Share this post


Link to post
Share on other sites
kalle_h    2464

Well it's just for the pc, so x86 architecture. And I don't think I'll ever have even close to that many entities in my scene. I guess particles would be culled differently?

If doing cpu simulated particles it is quite fast to cull them using some sort of sphere frustum test. The the test is just couple of dots and sometimes you can skip near and far planes.
I implemented this for all my particles and this give me substantial performance gain with ios. And code is not yet even using any SIMD's.

Share this post


Link to post
Share on other sites
lipsryme    1522

I've got it implemented now but somehow my "negativeMask" always results to 0.

Do I have to transform center and extent to world space first ? I tried doing that using += objPos but still 0. Or do I need to do a multiply by the world matrix ?

 

UPDATE: Ah got it working! Of course I should not have added the world position to the extent rolleyes.gif

Well its down to ~0.4ms cool.png

bool RendererD3D11::FrustumCull(SceneEntityDescription* entity,
								_Plane* frustumPlanes,
								_Plane* absFrustumPlanes)
{
	_Vector3f center = static_cast<AABB*>(this->contentManager->GetPrimitiveFromPool(entity->ID)->GetBoundingVolume())->Volume().center;
	_Vector3f extent = static_cast<AABB*>(this->contentManager->GetPrimitiveFromPool(entity->ID)->GetBoundingVolume())->Volume().extent;
	_Vector3f objPos;
	objPos.x = entity->worldPosition.x;
	objPos.y = entity->worldPosition.y;
	objPos.z = entity->worldPosition.z;

	center.x += objPos.x;
	center.y += objPos.y;
	center.z += objPos.z;

        // this was wrong...
        //**************************
	//* extent.x += objPos.x;
	//* extent.y += objPos.y;
	//* extent.z += objPos.z;
        //**************************

	__m128 xmm_aabbCenter_x = _mm_load_ss(&center.x);
	__m128 xmm_aabbCenter_y = _mm_load_ss(&center.y);
	__m128 xmm_aabbCenter_z = _mm_load_ss(&center.z);
	__m128 xmm_aabbExtent_x = _mm_load_ss(&extent.x);
	__m128 xmm_aabbExtent_y = _mm_load_ss(&extent.y);
	__m128 xmm_aabbExtent_z = _mm_load_ss(&extent.z);
	
	for(unsigned int iPlane = 0;iPlane < 6;++iPlane)
	{
		__m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx);
		__m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component);

		xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny);
		xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component));

		xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz);
		xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component));

		__m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx);
		__m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component);

		xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny);
		xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component));

		xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz);
		xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component));

		__m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d);
		__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
		__m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

		// Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps
		__m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0));
		int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r);

		// Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r
		if(negativeMask & 0x01)
		{
			return true; // Outside
		}
		else if(negativeMask & 0x04)
		{
			return false; // Intersect
		}
	}


	return false;
}
Edited by lipsryme

Share this post


Link to post
Share on other sites
galop1n    936

As said before, store all the boxes in a continuous array, and output an array of visibility. Try avoiding going from scalar registers and simd ones too and process everything in a single api call. You will for example be able to move the frustum planes load outside the loop ( very important when you loop over thousands of boxes ). your happy 0.4ms will look sad after that :)

Share this post


Link to post
Share on other sites
Sock5    162

You can further accelerate it by integrating the frustum culling in the collision detection, so a large amount of stuff will be spared tests versus the frustum, due to the broad phase, altho things might get weird.

Share this post


Link to post
Share on other sites
lipsryme    1522

@galop1n you mean going through every object first and culling 4 boxes at once ?
But if I were to do that I'd need 2 loops through every object since I need to go through it again right after to actually draw them...
What I do right now is go through every object, check if it should be culled and then either draw it or continue to the next.
Also I was trying to avoid making separate containers since doing 10k push_backs doesn't work that great.

I could use a boolean maybe...would still have to go through all entities again but I wouldn't have the overhead of push_back.

Edited by lipsryme

Share this post


Link to post
Share on other sites
Hodgman    51220

But if I were to do that I'd need 2 loops through every object since I need to go through it again right after to actually draw them...
What I do right now is go through every object, check if it should be culled and then either draw it or continue to the next.

If you have 1 loop, I'm guessing your 'object' contains the AABB data, and the data for drawing, etc... The downside of this is that at the hardware level, when you read the AABB into the CPU cache, you're also reading in the objects other (unrelated) members. If the AABB test fails, you've spent time loading data into the cache for no reason. Also, when you then switch from calling the AABB testing code to calling the drawing code, you're loading all sorts of other drawing related data into the cache, which might evict other data, and will likely prevent your AABB testing loop from effective precaching.
If you instead have an array of struct{ AABB a; Drawable* d }, an output array of Drawable* culled results, and an array of actual Drawable instances, then the two loops you mention may actually be more efficient than your existing single loop.

Also I was trying to avoid making separate containers since doing 10k push_backs doesn't work that great.

If you're going to be calling push_back a lot, make sure you call reserve first with a sensible guess of the memory requirement, or you can use a raw array (with your own debug error checking, of course ;-) )

Share this post


Link to post
Share on other sites
HellRaiZer    1001

It's been a long time since my last reply here on gamedev.net.

 

@lipsryme: Happy to know that my blog post actually helped someone smile.png Unfortunately, there seems to be an error in the code. The culling should be incorrect. Since you haven't seen it yet I'd assume that you are just rendering more objects than needed.

 

The error is in:

__m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);
__m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

 

Can you spot it?

xmm_d_m_r should subtract r from d, not add it! it should be:

 

__m128 xmm_d_m_r = _mm_add_ss(_mm_sub_ss(xmm_d, xmm_r), xmm_frustumPlane_d);

 

I don't have the project anymore so I'd assume it's just a blog post typo and it didn't affect the timings.

 

On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly smile.png

 

Hope this doesn't ruin your benchmarks.

Share this post


Link to post
Share on other sites
belfegor    2834

@Hellraizer

On the plus side, the last piece of code in the post (4 boxes at a time) does it correctly

 

Now i am confused as in 4box version says:

// NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small.

 

where that part of code is that you mentioned.

 

Also, for some reason it fails on Debug mode (VS 2012) with "Unhandled exception...reading location 0x0"

 

debugging.jpg

 

while working fine in Release?

Share this post


Link to post
Share on other sites
belfegor    2834

@galop1n

I don't understand why is it not already aligned? I have array of planes

struct _Plane
{
    float nx;
    float ny;
    float nz;
    float d;
};
 
std::array<_Plane, 6> vFrustum;

 

Does it need to force alignment or something?

Share this post


Link to post
Share on other sites
galop1n    936

There is nothing in the std and the C++ standard that enforce alignment other than the natural alignment of types ( 4 for float ). new and malloc are also anaware of that. use a __declspec(align(16)) on the _Plane declaration and then just instanciate a _Plane vFrustum[6] without std::array. But remember that if you put the array in dynamically allocated memory, you may also fail on the alignment constraint, so you will need to manualy align the root objet memory.

Share this post


Link to post
Share on other sites
belfegor    2834

Ive added this

 

__declspec(align(16))
struct _Plane
{
    float nx;
    float ny;
    float nz;
    float d;
};

 

and now it works in Debug. But i don't understand why isn't it already 16byte aligned as data is 4 floats ?

 

sizeof(float) * 4 == sizeof(_Plane)
Edited by belfegor

Share this post


Link to post
Share on other sites
galop1n    936

class and type sizes are unrelated to instance adresses. If you instanciate a class or type on the stack or in the globals ( data, rodata and bss ), the compiler and linker are aware of the native alignment and can do the work for you if you tag thinkgs correctly, but with memory allocated on the heap you depends on the vendor implementation of the allocator. x64 ABI force a 16 byte aligned address for dynamic allocation, but x86 ABI only require 4 bytes aligned address. Of course, you are free to override the allocator with a custom one.

Share this post


Link to post
Share on other sites
belfegor    2834

class and type sizes are unrelated to instance adresses.

 

Excuse my stupidity i still don't understand.

You mean instances of a _Plane in an std::array or vector might not be contiguous in memory (with their default allocator)?

Edited by belfegor

Share this post


Link to post
Share on other sites
Hodgman    51220

But i don't understand why isn't it already 16byte aligned as data is 4 floats ?

It's made up of primitives, each of which only requires 4-byte alignment to work correctly, so the struct will work as long as it's 4-byte aligned. It doesn't need to be 8 or 16 byte aligned in order to function correctly, only 4-byte aligned (and this is assuming that floats actually need to be 4 byte aligned).
 
 


class and type sizes are unrelated to instance adresses.

Excuse my stupidity i still don't understand.
You mean instances of a _Plane in an std::array or vector might not be contiguous in memory (with their default allocator)?


As far as I know:
  • std::allocator<T>::allocate (which is used by the std containers) or "new T" will return a block of memory that is correctly aligned for the type "T" (i.e. the address is a multiple of [tt]alignof(T)[/tt])
  • malloc( sizeof(T) ) wont.
Edited by Hodgman

Share this post


Link to post
Share on other sites
galop1n    936

if a class is declared as align on a 16 byte boundary, the class size will be a multiple of 16 too, because we can write things like MyClass array[N]; and if array[i] is align on 16, then array[i+1] will too.

 

But if you write MyClass* array = new MyClass[N]; you have no guarante on the memory address return by new, because the operator new do not receive an alignment value. You will need to overload the operator new of MyClass to do some internal memalign, or globaly override the memory allocator to always return address multiple of 16.

 

And std::array is a C array, so the memory is contigus but do not rely on a memory allocator, so the compiler is able to align the members with the declspec value (as long as the objet storing an std::array<_Plane,6> is not allocate with a bare new )

Share this post


Link to post
Share on other sites
galop1n    936

As far as I know:

std::allocator::allocate (which is used by the std containers) or "new T" will return a block of memory that is correctly aligned for the type "T".
malloc( sizeof(T) ) wont.

 

Sadly, the standard do not consider simd types, the largest native alignment use by the default std::alocator is the one for a long long, that is only 8 bytes.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this