
Frustum Culling Question


Hi all,

 

I have a quick question about my current frustum-culling implementation: is it necessary to factor in each game object/entity's transformation matrix (its model->world transform) when constructing the clip matrix? I haven't seen any reference sources do this, and it sounds very inefficient to have to reconstruct the clip matrix for each object. I'm currently constructing the clip matrix as projection * view * model->world, and while it works, I can't help but think it's wrong in some way...

 

I can include code if required but I think the question is relatively self-contained.

 

Any help would be appreciated :)

 


If you don't factor in the model's transformation, how would you know whether it's in your frustum? Nothing you wrote sounds out of the ordinary; Model-View-Projection matrices have the acronym MVP for a reason!


On a basic level? Yes. But there are methods to speed this up.

This is where the spatial partitioning algorithms often used in games come in. These algorithms speed up the frustum check by acting as a "set" of data within a given volume.

 

Remember that a set is a relational container of data that can also overlap with other sets; a Venn diagram, if you will. For games, however, it's better if we don't check the same object more than once (even if doing so barely impacts performance), so it will usually be a Venn diagram with no overlapping circles.

 

When we frustum check, we first test whether these volumes are within the frustum before we start checking whether each individual object is in the frustum.

 

See quadtrees, octrees, BSPs, and k-d trees.

Basically...

Quadtrees and octrees work by recursion. You start off with very large volumes and work your way downwards, checking smaller and smaller volumes, until you finally get to the actual meshes to check. For small worlds these tend to be more costly; for larger worlds they are efficient.
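To make the recursion concrete, here is a minimal sketch of a quadtree cull. The AABB, Frustum, and Renderable types and the TestAABB() helper are illustrative assumptions, not code from anyone in this thread:

---

#include <vector>

enum class Containment { Outside, Intersects, Inside };

// Assumed to exist elsewhere (hypothetical):
//   struct AABB; struct Frustum;
//   struct Renderable { AABB bounds; /* ... */ };
//   Containment TestAABB( const Frustum& frustum, const AABB& box );

struct QuadNode {
	AABB bounds;                       // encloses this node and all of its children
	std::vector<Renderable*> objects;  // objects stored at this node
	QuadNode* children[4] = {};        // nullptr for leaf nodes
};

// Gather everything below a node without further testing.
void CollectAll( const QuadNode* node, std::vector<Renderable*>& out ) {
	out.insert( out.end(), node->objects.begin(), node->objects.end() );
	for( const QuadNode* child : node->children )
		if( child )
			CollectAll( child, out );
}

void CullQuadTree( const Frustum& frustum, const QuadNode* node,
                   std::vector<Renderable*>& visible ) {
	switch( TestAABB( frustum, node->bounds ) ) {
	case Containment::Outside:
		return;                        // reject the whole subtree with one test
	case Containment::Inside:
		CollectAll( node, visible );   // everything below is visible, stop testing
		return;
	case Containment::Intersects:      // partially visible: test contents and recurse
		for( Renderable* obj : node->objects )
			if( TestAABB( frustum, obj->bounds ) != Containment::Outside )
				visible.push_back( obj );
		for( const QuadNode* child : node->children )
			if( child )
				CullQuadTree( frustum, child, visible );
		return;
	}
}

---

An octree is the same idea with eight children; the win comes from the Outside case, where a single box test throws away an entire subtree.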

BSPs are more like graphs, to be honest. They are designed to fit well with hallway-like games. I won't talk about them much, because I am not that familiar with them, but they use the concepts of portals and rooms. If you can see a portal, chances are you can see the room behind it. If you can see the room, then you -can- do a frustum check on its contents, but you can also choose not to waste time on that; it might just be faster to render everything in the room than to spend time working out whether each item can be seen or not.

 

This is why Skyrim and Fallout flip-flop between BSPs and quadtrees like a schizophrenic squirrel.

Edited by Tangletail


For static objects, the easiest option is to build one of the trees Tangletail mentioned as the scene is being loaded; but if your scene is dynamic, these trees will have to be rebuilt every frame. With dynamic scenes your best bet is to use the GPU, and NVIDIA has a great paper on the subject. The examples they give are in CUDA, because well... that's their thing, but the algorithms could be implemented in OpenCL or compute shaders all the same.

 

http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/

 

BVH Trees are also a great tool.

Edited by GameGeezer


 


See quadtrees, octrees, BSPs, and k-d trees.

Basically...

Quadtrees and octrees work by recursion. You start off with very large volumes and work your way downwards, checking smaller and smaller volumes, until you finally get to the actual meshes to check. For small worlds these tend to be more costly; for larger worlds they are efficient.

 

Are hierarchical structures even necessary for frustum culling? My CPU can cull ~450,000,000 AABBs per second (on 1 thread); 1 ms would be 450,000 AABBs. I'm sure you'll run into a dozen other problems before brute-force frustum culling becomes a bottleneck.
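For reference, the per-AABB work in such a brute-force loop is typically just six plane tests against the box's "positive vertex" (the corner farthest along each plane normal). A rough sketch; the struct layouts here are illustrative assumptions, not anyone's engine code:

---

struct Plane { float a, b, c, d; };   // a*x + b*y + c*z + d, normal pointing into the frustum
struct AABB  { float minX, minY, minZ, maxX, maxY, maxZ; };

bool AABBInFrustum( const Plane planes[6], const AABB& box ) {
	for( int i = 0; i < 6; ++i ) {
		const Plane& p = planes[i];
		// Pick the corner farthest along the plane normal ("positive vertex").
		float px = ( p.a >= 0.0f ) ? box.maxX : box.minX;
		float py = ( p.b >= 0.0f ) ? box.maxY : box.minY;
		float pz = ( p.c >= 0.0f ) ? box.maxZ : box.minZ;
		if( p.a * px + p.b * py + p.c * pz + p.d < 0.0f )
			return false;              // entirely behind this plane -> outside
	}
	return true;                       // inside or intersecting (conservative)
}

---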

 

 

 

That's up to you, really. Modern CPUs can go higher than that, to be honest, but there is a reason for the madness. If you are building a game engine that must calculate physics, logic, and a whole mess of other stuff, do you honestly want your engine to waste time checking against everything in a five-mile radius?

That's ~450,000,000 AABB checks... vs. only 4,600 AABB checks plus the small amount of time it would take to create some of the hierarchies on demand.

Keep in mind that ~16 ms per frame is 60 fps and ~33 ms is 30 fps. And that's per update cycle.

Edited by Tangletail


Hey guys, thank you all for taking the time to get back to me on this, much appreciated.

 

@GameGeezer: It was after looking at rastertek's tutorial on this topic that I decided to post the question. In the tutorial he uses only the view and projection matrices (from the looks of it anyway), which is why I thought I was missing a trick or something.

 

@Tangletail: Thanks for mentioning spatial partitioning; I am indeed aware of it and how it can help you discard large swathes of objects/entities at a time. This is what I'll be looking into in more detail next.

 

Thanks again everyone :)



It was after looking at rastertek's tutorial on this topic that I decided to post the question

 

My cull code is based in part on that tutorial. Depending on the tutorial you used, you might find a bug in the rastertek code. I combined it with code from another tutorial to get correct code; something related to the calculation of the near plane, as I recall. The tutorial made assumptions that didn't hold in the general case, something like that.

 

In my current project, for large outdoor scenes I calculate which terrain chunks are visible, then cull the meshes in the visible chunks. Note that frustum culling for objects that span the frustum is a special case, different from your typical Bsphere-type checks.

You don't need clip space matrices for culling. You can extract the 6 planes from your view-proj matrix, and then use them to test your world-space bounding volumes.
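As a hedged illustration of that approach, the commonly cited Gribb/Hartmann extraction of (for example) the left plane from a GLM view-projection matrix looks roughly like this. Note that GLM's mat4 is column-major, so m[col][row], whereas row-vector (D3D-style) references index the combined matrix the other way around:

---

#include <glm/glm.hpp>

struct Plane { float a, b, c, d; };

// viewProj = projection * view (no model matrix). GLM is column-major: m[col][row],
// so row 4 of the matrix is ( m[0][3], m[1][3], m[2][3], m[3][3] ) and left = row4 + row1.
Plane ExtractLeftPlane( const glm::mat4& m ) {
	Plane p;
	p.a = m[0][3] + m[0][0];
	p.b = m[1][3] + m[1][0];
	p.c = m[2][3] + m[2][0];
	p.d = m[3][3] + m[3][0];
	return p;    // normalize by the length of (a, b, c) before doing distance tests
}

---

The other planes follow the same pattern (right = row4 - row1, bottom = row4 + row2, top = row4 - row2, far = row4 - row3), while the near plane depends on the clip-space depth convention: row4 + row3 for a -1..1 depth range, or row 3 alone for a 0..1 range, which may be why different tutorials disagree on it.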

Are hierarchical structures even necessary for frustum culling? My CPU can cull ~450,000,000 AABBs per second (on 1 thread); 1 ms would be 450,000 AABBs. I'm sure you'll run into a dozen other problems before brute-force frustum culling becomes a bottleneck.

Yep. DICE went from spatial partitioning on Battlefield 2 to a dumb flat array of all objects on their next game... and the stupid flat array (which should be far worse in big-O terms) was way faster due to the realities of how modern CPUs/RAM work.
You probably only get a win from partitioning when your data set is too big to fit into cache, which is probably well over 10k objects. Under that size, the dumb methods are likely faster. You just have to arrange your data properly -- e.g. a contiguous array of bounding spheres can fit 16k objects into a 256KiB blob (with good SSE-compatible alignment to boot).



You probably only get a win from partitioning when your data set is too big to fit into cache, which is probably well over 10k objects. Under that size, the dumb methods are likely faster. You just have to arrange your data properly -- e.g. a contiguous array of bounding spheres can fit 16k objects into a 256KiB blob (with good SSE-compatible alignment to boot).

Where does cache size fit into this? Isn't the only relevant thing that the cache line is prefetching data for linear access? I mean, if you're only frustum culling you'll only use the data once, in one temporal locality... you won't visit the data again later, so I am not seeing why cache size plays into this.


He's referring to iterating over all objects to determine if they're in the frustum versus only testing those within the spatial grids containing the frustum.

 

The improvements in cache sizes since about 2008 finally have enough mainstream distribution that they can be relied on. Around the time of Intel's big switch to Core 2, both AMD and Intel bumped the standard cache sizes. From 2002-2007 it was roughly 1-2 MB; starting in 2008, none of the major chips shipped with anything less than 3 MB of cache.

 

 

 

Most software these days is limited by memory speed and keeping the CPU fed with data rather than by instruction throughput. When you can take advantage of cache effects you can often run several iterations of a loop for effectively the cost of a single cache miss: you pay for one and get the others for "free". The bigger your caches, the more "free" processing you get. When you can stream through a large number of values, the costs for both processing and memory drop considerably because the waits for each are interleaved with each other. At the same time, improvements in speed mean the costs paid are reduced; not only is your data present right when instructions need it and instructions ready when the data gets there, the time for doing each is getting faster and faster.

 

 

 

So to compare the two:

 

Let's assume a quadtree spatial grid. Drilling down means hitting the topmost node (a full cost), hitting probably 3-4 child nodes (a full cost each), then hitting the leaf nodes, which can be read sequentially. Trying to query all the items in the view frustum might mean jumping around 64, 96, 128, 512, or some other number of times through your spatial tree. Each of those jumps makes it hard for the prefetcher to work. This means stall after stall after stall, then finally a small chunk of work at low cost.

 

With a direct linear search, the large amount of cache means you rarely need to pay the cost of fetching data. You pay a single full cost for the first object, but if your system was designed in a cache-friendly way it just means pulling in an object's transform (a single cache line), doing the boundary check, and storing a yes/no result. Then you repeat over and over in a way that keeps your CPU well fed with data. Done well, this flow is the CPU cache's ideal case.
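As a small sketch of that kind of loop, assuming the contiguous bounding-sphere layout Hodgman mentioned (names and layout here are illustrative, not anyone's actual engine code):

---

#include <cstdint>
#include <vector>

struct Plane  { float a, b, c, d; };       // a*x + b*y + c*z + d, normals pointing inward
struct Sphere { float x, y, z, radius; };  // 16 bytes: four spheres per 64-byte cache line

// One tight pass over a packed array; the hardware prefetcher sees a pure linear stream.
void CullSpheres( const Plane planes[6],
                  const std::vector<Sphere>& spheres,
                  std::vector<uint8_t>& visible ) {   // 1 = potentially visible
	visible.resize( spheres.size() );
	for( size_t i = 0; i < spheres.size(); ++i ) {
		const Sphere& s = spheres[i];
		uint8_t in = 1;
		for( int p = 0; p < 6; ++p ) {
			float d = planes[p].a * s.x + planes[p].b * s.y +
			          planes[p].c * s.z + planes[p].d;
			if( d < -s.radius ) { in = 0; break; }   // fully behind one plane
		}
		visible[i] = in;
	}
}

---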

 

For game programmers that means longer linear searches have become a better option. Rather than paying the penalty for jumping all over memory a few times, it takes less total CPU time to loop through the long list directly, as long as you can ensure it stays cache-friendly.

 

 

 

However, if you don't implement it in a cache-friendly manner (your stride is too big or irregular, you do too much work between memory reads, you touch the objects from another thread, or any of the other things that can mess up cache-friendly programming), then you don't get that benefit.

 

 

This is one of many examples where it pays to revisit your assumptions for optimizations.  Over time various optimizations become more relevant or less relevant as the underlying hardware changes.  Bigger caches mean different potential optimization choices.

Edited by frob



You don't need clip space matrices for culling. You can extract the 6 planes from your view-proj matrix, and then use them to test your world-space bounding volumes.

 

This is the method used in the rastertek tutorials (at least the first one), but their near-plane calculation was non-generic and took advantage of a simplification in their example, as I recall. Combining it with the near-plane calculation code from the tutorial that was the next most promising-looking Google search result after rastertek yielded correct code. But neither example covers the spanning case, as I recall.

 

I can post code if desired.


With a direct linear search, the large amount of cache means you rarely need to pay the cost of fetching data. You pay a single full cost for the first object, but if your system was designed in a cache-friendly way it just means pulling in an object's transform (a single cache line), doing the boundary check, and storing a yes/no result. Then you repeat over and over in a way that keeps your CPU well fed with data. Done well, this flow is the CPU cache's ideal case.

This is the case I was referring to, but like I said, for frustum culling you're only hitting the data once. Given that out-of-order execution windows are only so large, only a few cache lines will be relevant at any given time, never to be looked at again. So how exactly does cache size factor into this use case?

 

edit - * instruction window

Edited by Infinisearch


For cache-friendly culling, for the most part you'll probably be doing a bunch of Bsphere checks vs. the six planes. It's quick and easy, and there are even some special instruction tricks you can use to make the number crunching a bit more parallel. Since it's Bspheres vs. 6 constant planes, you'll most likely be dealing with 4 floats for each check (x, y, z, and radius). So the most cache-friendly / data-oriented way to do that would be a contiguous chunk of memory with just that data, i.e. the position and Bsphere radius of all possibly visible objects in the scene (everything that needs frustum culling). This might be achieved by populating the list as you determine the possibly visible objects for the frame, or perhaps a CES approach, although position and Bsphere radius don't really get used together much anywhere else, except maybe collision detection. Position alone gets used (perhaps quite a bit) in update for movement.

Maybe a CES approach with position as one component and Bsphere radius as a second. Then you would pull in positions on the first cache line, Bsphere radii on the second, and so forth. You'd get roughly 4 position cache-line misses for each Bsphere-radius cache-line miss. Not as nice as a cache miss only when you hit the end of a line, but better than a cache miss every 50 bytes of data, relevant or otherwise, which is what a non-data-oriented design would typically yield. But going CES just for that would most likely be premature optimization.

I draw my culling info from an array of structs that contain all drawing info: single mesh, multi-mesh rigid body model, 2d billboard, or 3d billboard; mesh or model ID; texture or animation ID; material ID; cull, clip, alpha, clamp, wrap, and similar flags; texture scale, etc.; as well as position, Bsphere frustum clip radius, far clip range, and so on. And I can brute force through thousands per frame with no problem, no data-oriented design, no cache-friendly design, and no shaders! Just fixed function. Always build stuff the most straightforward way possible, then optimize if necessary, unless recent experience says otherwise.


Are hierarchical structures even necessary for frustum culling? My CPU can cull ~450,000,000 AABBs per second (on 1 thread); 1 ms would be 450,000 AABBs. I'm sure you'll run into a dozen other problems before brute-force frustum culling becomes a bottleneck.

 

First they aren't mutually exclusive.

AABB vs. sphere testing is a separate question from whether you use an object hierarchy.

Assuming you are just processing a flat list of AABB nodes, that is an O(n) algorithm, whereas hierarchical algorithms are O(log(n)).

 

Assuming a normal distribution of objects, checking 450M nodes in an octree means you can cull 54 levels... which would be 7^54 nodes culled (4E45).
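As a rough way to see the scaling (general formulas, not a restatement of the figures above): for a complete octree of depth $d$ built over $n$ leaf cells,

$$
\text{total nodes} \;=\; \sum_{k=0}^{d} 8^{k} \;=\; \frac{8^{\,d+1}-1}{7}, \qquad d \;=\; \log_{8} n,
$$

and rejecting a single node at depth $k$ culls $8^{\,d-k}$ leaf cells with one test, so rejections near the root throw away most of the scene for a handful of tests.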

Edited by Shannon Barber


Hello again everyone. 

 

Thanks again for taking the time to get back to me previously. If it's alright, I have another question. I've continued experimenting with frustum culling and have found that certain camera-position/object-position/camera-forward combinations do not cull correctly. As stupid and wrong as it may be, I'm assuming there's some relationship between these three things (object position, camera position, and camera forward)?

 

Thanks :)


Sorry for my delayed response; thanks for getting back to me, Khatharr. Here's some code:

 

Frustum plane extraction:

---


void FirstPersonCamera::ConstructFrustum( const glm::mat4x4& objTransform ) {
	glm::mat4x4 clipMatrix( GetProjectionMatrix() * GetViewMatrix() * objTransform );

	mFrustum.planes[LEFT].a = clipMatrix[3][0] + clipMatrix[0][0];
	mFrustum.planes[LEFT].b = clipMatrix[3][1] + clipMatrix[0][1];
	mFrustum.planes[LEFT].c = clipMatrix[3][2] + clipMatrix[0][2];
	mFrustum.planes[LEFT].d = clipMatrix[3][3] + clipMatrix[0][3];

	mFrustum.planes[RIGHT].a = clipMatrix[3][0] - clipMatrix[0][0];
	mFrustum.planes[RIGHT].b = clipMatrix[3][1] - clipMatrix[0][1];
	mFrustum.planes[RIGHT].c = clipMatrix[3][2] - clipMatrix[0][2];
	mFrustum.planes[RIGHT].d = clipMatrix[3][3] - clipMatrix[0][3];

	mFrustum.planes[TOP].a = clipMatrix[3][0] - clipMatrix[1][0];
	mFrustum.planes[TOP].b = clipMatrix[3][1] - clipMatrix[1][1];
	mFrustum.planes[TOP].c = clipMatrix[3][2] - clipMatrix[1][2];
	mFrustum.planes[TOP].d = clipMatrix[3][3] - clipMatrix[1][3];

	mFrustum.planes[BOTTOM].a = clipMatrix[3][0] + clipMatrix[1][0];
	mFrustum.planes[BOTTOM].b = clipMatrix[3][1] + clipMatrix[1][1];
	mFrustum.planes[BOTTOM].c = clipMatrix[3][2] + clipMatrix[1][2];
	mFrustum.planes[BOTTOM].d = clipMatrix[3][3] + clipMatrix[1][3];

	mFrustum.planes[NEAR].a = clipMatrix[3][0] + clipMatrix[2][0];
	mFrustum.planes[NEAR].b = clipMatrix[3][1] + clipMatrix[2][1];
	mFrustum.planes[NEAR].c = clipMatrix[3][2] + clipMatrix[2][2];
	mFrustum.planes[NEAR].d = clipMatrix[3][3] + clipMatrix[2][3];

	mFrustum.planes[FAR].a = clipMatrix[3][0] - clipMatrix[2][0];
	mFrustum.planes[FAR].b = clipMatrix[3][1] - clipMatrix[2][1];
	mFrustum.planes[FAR].c = clipMatrix[3][2] - clipMatrix[2][2];
	mFrustum.planes[FAR].d = clipMatrix[3][3] - clipMatrix[2][3];

	NormalizeFrustumPlanes( mFrustum );
}

---

 

Normalizing the planes:

---


void NormalizeFrustumPlanes( Frustum& frustum )  {
	for( U32 i=0; i<6; ++i ) {
		float mag = std::sqrtf( 
			frustum.planes[i].a * frustum.planes[i].a + 
			frustum.planes[i].b * frustum.planes[i].b + 
			frustum.planes[i].c * frustum.planes[i].c );

		frustum.planes[i].a /= mag;
		frustum.planes[i].b /= mag;
		frustum.planes[i].c /= mag;
		frustum.planes[i].d /= mag;
	}
}

---

 

Testing the sphere against the frustum:

---

bool FirstPersonCamera::TEMP_SphereFrustumIntersection( const Sphere& sphere ) {
	for( U32 i=0; i<6; ++i ) {
		float d = ( mFrustum.planes[i].a * sphere.center.x +
			mFrustum.planes[i].b * sphere.center.y + 
			mFrustum.planes[i].c * sphere.center.z ) + 
			mFrustum.planes[i].d;

		if( d < -sphere.radius ) {
			return false;
		}
	}

	return true;
}

---

 

An example of this problem: placing the camera at <0,0,0> with forward being either <0,0,1> or <0,0,-1>, and an object placed in the scene at positions such as <20,0,0> or <0,0,-5>, causes the test to fail. Placing the object at <20,0,0>, for example, causes the culling to bounce back and forth between working correctly and incorrectly as you rotate around the object on the xy plane.
