Jump to content

Like
6Likes
Dislike
Frustum Culling

Peer Reviewed by BHXSpecter, Josh Petrie, Khawk


Frustum culling AABB OBB SSE Multithreaded Program GPU culling
Frustum culling is process of discarding objects not visible on the screen.
In this paper I will cover next themes:
- culling of: Bounding Spheres, AABB, OBB
- culling of huge amount of objects
- using SSE
- multithreaded culling
- GPU culling
- comparison of approaches efficiency

4: Adsense

Introduction


Frustum culling is process of discarding objects not visible on the screen. As we don't see them – we don't need to spend resources on computer to prepare it for rendering and rendering itself.


In this paper I will cover next themes:
  • culling of: Bounding Spheres, Axis-Aligned Bounding Boxes (AABB), Oriented Bounding Boxes (OBB)
  • culling of huge amount of objects
  • using SSE
  • multithreaded culling
  • GPU culling
  • comparison of approaches efficiency, working speed

What I will not cover:
  • using hierarchical structures, trees. We can unite objects in groups according to world positions and first check visibility of whole group.
  • optimizations of one object, like using last 'successful' culling plane
  • visibility test taking into account scene depth buffer. Object can be inside frustum, but can be completely blocked by another, closer to viewer object. Hence we also might discard this object from rendering
  • software culling. We may perform blocking of one objects by another on CPU side.
  • Culling for shadows

Simple culling

We have area of visibility, which is set by frustum of viewing pyramid. Objects that are not inside this area should be discarded from rendering process. Frustum one usually set with 6 planes.

frustum.png

We have objects in the scene. Each object might be aproxinated with simple geometry such as sphere or box. All object's geometry lies inside this primitive.

bounding_primitives.png

Visibility test of such simple geometry performs very fast. Our aim is to understand if this object visible in frustum.
Consider the definition of visibility: spheres and boxes. There are different kinds of boxes: world axis aligned (AABB) and aligned according local object's axis (OBB). One could clearly see that OBB beter aproximates object, but it's visibility test performs harder than AABB.



Sphere-frustum

Algorithm: for object's center we find distance to each frustum plane. If point is behind any plane more than sphere radius, then sphere not in frustum. And found one is splitting plane.

__forceinline bool SphereInFrustum(vec3 &pos, float &radius, vec4 *frustum_planes)
{
	bool res = true;
	//test all 6 frustum planes
	for (int i = 0; i < 6; i++)
	{
		//calculate distance from sphere center to plane.
		//if distance larger then sphere radius - sphere is outside frustum
		if (frustum_planes[i].x * pos.x + frustum_planes[i].y * pos.y + frustum_planes[i].z * pos.z + frustum_planes[i].w <= -radius)
			res = false;
			//return false; //with flag works faster
	}
	return res;
}


AABB-frustum

Bounding sphere sometimes not great choise to approximate object. For more precise test one often use boxes. World Axis-Aligned or Oriented Bounding Boxes.
Basic idea to test the box visibility in frustum: if all 8 box points lie behind one of frustum planes then box is not in frustum. In next example implemented AABB-fustum test. But if we place OBB world-space points in equations – we get OBB-frustum test.

__forceinline bool RightParallelepipedInFrustum2(vec4 &Min, vec4 &Max, vec4 *frustum_planes)
{
//this is just example of basic idea - how BOX culling works, both AABB and OBB
//Min & Max are 2 world space box points. For AABB-drustum culling
//We may use transformed (by object matrix) to world space 8 box points. Replace Min & Max in equations and we get OBB-frustum.
	//test all 6 frustum planes
	for (int i = 0; i<6; i++)
	{
	//try to find such plane for which all 8 box points behind it
	//test all 8 box points against frustum plane
	//calculate distance from point to plane
	//if point infront of the plane (dist > 0) - this is not separating plane
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		return false;
    }
    return true;
}

AABB-frustum test might be implemented more optimal.
Algorithm: from 8 point find closest one to the plane ant test if it is behind the plane. If yes – the object is not in frustum.

__forceinline bool RightParallelepipedInFrustum(vec4 &Min, vec4 &Max, vec4 *frustum_planes)
{
	bool inside = true;
	//test all 6 frustum planes
	for (int i = 0; i<6; i++)
	{
	//pick closest point to plane and check if it behind the plane
	//if yes - object outside frustum
		float d = max(Min.x * frustum_planes[i].x, Max.x * frustum_planes[i].x) +
				  max(Min.y * frustum_planes[i].y, Max.y * frustum_planes[i].y) +
				  max(Min.z * frustum_planes[i].z, Max.z * frustum_planes[i].z) + 
				  frustum_planes[i].w;
		inside &= d > 0;
		//return false; //with flag works faster
	}
	return inside;
}


OBB-frustum

Algorithm: transform 8 box points from local space to clip-space. It is easy to test points against plane in such space as frustum becomes the unit cube [-1..1] (but we should take into account that for DirectX for Z axis we have another size [0..1]). If for one of axis all 8 vertexes <-1 or >1, then box outside the frustum.

__forceinline bool OBBInFrustum(const vec3 &Min, const vec3 &Max, mat4 &obj_transform_mat, mat4 &cam_modelview_proj_mat)
{
	//transform all 8 box points to clip space
	//clip space because we easily can test points outside required unit cube
	//NOTE: for DirectX we should test z coordinate from 0 to w (-w..w - for OpenGL), look for      transformations / clipping box differences
	//matrix to transfrom points to clip space
	mat4 to_clip_space_mat = cam_modelview_proj_mat * obj_transform_mat;

	//transform all 8 box points to clip space
	vec4 obb_points[8];
	obb_points[0] = to_clip_space_mat * vec4(Min[0], Max[1], Min[2], 1.f);
	obb_points[1] = to_clip_space_mat * vec4(Min[0], Max[1], Max[2], 1.f);
	obb_points[2] = to_clip_space_mat * vec4(Max[0], Max[1], Max[2], 1.f);
	obb_points[3] = to_clip_space_mat * vec4(Max[0], Max[1], Min[2], 1.f);
	obb_points[4] = to_clip_space_mat * vec4(Max[0], Min[1], Min[2], 1.f);
	obb_points[5] = to_clip_space_mat * vec4(Max[0], Min[1], Max[2], 1.f);
	obb_points[6] = to_clip_space_mat * vec4(Min[0], Min[1], Max[2], 1.f);
	obb_points[7] = to_clip_space_mat * vec4(Min[0], Min[1], Min[2], 1.f);

	bool outside = false, outside_positive_plane, outside_negative_plane;

	//we have 6 frustum planes, which in clip space is unit cube (for GL) with -1..1 range
	for (int i = 0; i < 3; i++) //3 because we test positive & negative plane at once
	{
	//if all 8 points outside one of the plane
	//actually it is vertex normalization xyz / w, then compare if all 8points coordinates <      -1 or > 1
		outside_positive_plane =
			obb_points[0][i] > obb_points[0].w &&
			obb_points[1][i] > obb_points[1].w &&
			obb_points[2][i] > obb_points[2].w &&
			obb_points[3][i] > obb_points[3].w &&
			obb_points[4][i] > obb_points[4].w &&
			obb_points[5][i] > obb_points[5].w &&
			obb_points[6][i] > obb_points[6].w &&
			obb_points[7][i] > obb_points[7].w;

		outside_negative_plane =
			obb_points[0][i] < -obb_points[0].w &&
			obb_points[1][i] < -obb_points[1].w &&
			obb_points[2][i] < -obb_points[2].w &&
			obb_points[3][i] < -obb_points[3].w &&
			obb_points[4][i] < -obb_points[4].w &&
			obb_points[5][i] < -obb_points[5].w &&
			obb_points[6][i] < -obb_points[6].w &&
			obb_points[7][i] < -obb_points[7].w;

		outside = outside || outside_positive_plane || outside_negative_plane;
		//if (outside_positive_plane || outside_negative_plane)
			//return false;
	}
	return !outside;
	//return true;
}


Table 1: culing results of 100к objects. Intel Core I5. Single Thread.



Simple Culling SphereAABBOBB
Just cullung0,921,429,14
Whole frame1,942,510,3


The results are obvious. The harder calculations the slower it works. OBB test much slower than Spheres or AABB tests. But we get more precise culling with OBB.
May be, optimal solution is spiting objects into groups, For each group depending on distance to camera we use appropriate primitive. For closest groups use OBB, for middle one groups use ABB and Spheres for the rest.
Also should be notices than whole frame time is larger than just culling. 1 ms. in average. Because of transferring data about visible objects to gpu has cost, couple of dips and API commands. But it is necessary actions.



SSE


SSE (Streaming SIMD Extensions) – with one instructions we perform calculations on group of operands. SSE includes in it's architecture eight 128 bit registers and set of instructions to perform any operations on them.

Theoretically we might speedup code execution 4 times as we make operations with 4 operands simultaneously. Offcourse on practice perfomance win will be less because of SSE drawbacks.

  • Not all algorithms could be easily rewrited in SSE
  • data should be packed according to SSE requirements in registers to perform calculations
  • SSE has some restrictions with vertical operations like dot products
  • there are no conditions. One use so called static branching, when we execute bot 2 parts of condition an take just one interesting to us result.
  • Loading data in registers and storing results back into memory
  • don't forget about sse data striding

Algorithm of SSE Spheres-frustum and SSE AABB-frustum culling almost identical to simple implementation. In exception of that we perform calculations on 4 objects simultaneously.

void sse_culling_spheres(BSphere *sphere_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
	float *sphere_data_ptr = reinterpret_cast<float*>(&sphere_data[0]);
	int *culling_res_sse = &culling_res[0];

	//to optimize calculations we gather xyzw elements in separate vectors
	__m128 zero_v = _mm_setzero_ps();
	__m128 frustum_planes_x[6];
	__m128 frustum_planes_y[6];
	__m128 frustum_planes_z[6];
	__m128 frustum_planes_d[6];

	int i, j;
	for (i = 0; i < 6; i++)
	{
		frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
		frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
		frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
		frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
	}

	//we process 4 objects per step
	for (i = 0; i < num_objects; i += 4)
	{
	//load bounding sphere data
		__m128 spheres_pos_x = _mm_load_ps(sphere_data_ptr);
		__m128 spheres_pos_y = _mm_load_ps(sphere_data_ptr + 4);
		__m128 spheres_pos_z = _mm_load_ps(sphere_data_ptr + 8);
		__m128 spheres_radius = _mm_load_ps(sphere_data_ptr + 12);
		sphere_data_ptr += 16;

	//but for our calculations we need transpose data, to collect x, y, z and w coordinates in separate vectors
		_MM_TRANSPOSE4_PS(spheres_pos_x, spheres_pos_y, spheres_pos_z, spheres_radius);
		__m128 spheres_neg_radius = _mm_sub_ps(zero_v, spheres_radius); // negate all elements

		__m128 intersection_res = _mm_setzero_ps();
		for (j = 0; j < 6; j++) //plane index
		{
		//1. calc distance to plane dot(sphere_pos.xyz, plane.xyz) + plane.w
		//2. if distance < sphere radius, then sphere outside frustum
			__m128 dot_x = _mm_mul_ps(spheres_pos_x, frustum_planes_x[j]);
			__m128 dot_y = _mm_mul_ps(spheres_pos_y, frustum_planes_y[j]);
			__m128 dot_z = _mm_mul_ps(spheres_pos_z, frustum_planes_z[j]);

			__m128 sum_xy = _mm_add_ps(dot_x, dot_y);
			__m128 sum_zw = _mm_add_ps(dot_z, frustum_planes_d[j]);
			__m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);

			__m128 plane_res = _mm_cmple_ps(distance_to_plane, spheres_neg_radius); //dist < -sphere_r ?
			intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - sphere behind the plane & outside frustum
		}

		//store result
		__m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
		_mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
	}
}


void sse_culling_aabb(AABB *aabb_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
	float *aabb_data_ptr = reinterpret_cast<float*>(&aabb_data[0]);
	int *culling_res_sse = &culling_res[0];

	//to optimize calculations we gather xyzw elements in separate vectors
	__m128 zero_v = _mm_setzero_ps();
	__m128 frustum_planes_x[6];
	__m128 frustum_planes_y[6];
	__m128 frustum_planes_z[6];
	__m128 frustum_planes_d[6];

	int i, j;
	for (i = 0; i < 6; i++)
	{
		frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
		frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
		frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
		frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
	}

	__m128 zero = _mm_setzero_ps();
	//we process 4 objects per step
	for (i = 0; i < num_objects; i += 4)
	{
	//load objects data
	//load aabb min
		__m128 aabb_min_x = _mm_load_ps(aabb_data_ptr);
		__m128 aabb_min_y = _mm_load_ps(aabb_data_ptr + 8);
		__m128 aabb_min_z = _mm_load_ps(aabb_data_ptr + 16);
		__m128 aabb_min_w = _mm_load_ps(aabb_data_ptr + 24);

	//load aabb max
		__m128 aabb_max_x = _mm_load_ps(aabb_data_ptr + 4);
		__m128 aabb_max_y = _mm_load_ps(aabb_data_ptr + 12);
		__m128 aabb_max_z = _mm_load_ps(aabb_data_ptr + 20);
		__m128 aabb_max_w = _mm_load_ps(aabb_data_ptr + 28);
		aabb_data_ptr += 32;

	//for now we have points in vectors aabb_min_x..w, but for calculations we need to xxxx yyyy zzzz vectors representation - just transpose data
		_MM_TRANSPOSE4_PS(aabb_min_x, aabb_min_y, aabb_min_z, aabb_min_w);
		_MM_TRANSPOSE4_PS(aabb_max_x, aabb_max_y, aabb_max_z, aabb_max_w);

		__m128 intersection_res = _mm_setzero_ps();
		for (j = 0; j < 6; j++) //plane index
		{
		//this code is similar to what we make in simple culling
		//pick closest point to plane and check if it begind the plane. if yes - object outside frustum
		//dot product, separate for each coordinate, for min & max aabb points
			__m128 aabbMin_frustumPlane_x = _mm_mul_ps(aabb_min_x, frustum_planes_x[j]);
			__m128 aabbMin_frustumPlane_y = _mm_mul_ps(aabb_min_y, frustum_planes_y[j]);
			__m128 aabbMin_frustumPlane_z = _mm_mul_ps(aabb_min_z, frustum_planes_z[j]);

			__m128 aabbMax_frustumPlane_x = _mm_mul_ps(aabb_max_x, frustum_planes_x[j]);
			__m128 aabbMax_frustumPlane_y = _mm_mul_ps(aabb_max_y, frustum_planes_y[j]);
			__m128 aabbMax_frustumPlane_z = _mm_mul_ps(aabb_max_z, frustum_planes_z[j]);

		//we have 8 box points, but we need pick closest point to plane. Just take max
			__m128 res_x = _mm_max_ps(aabbMin_frustumPlane_x, aabbMax_frustumPlane_x);
			__m128 res_y = _mm_max_ps(aabbMin_frustumPlane_y, aabbMax_frustumPlane_y);
			__m128 res_z = _mm_max_ps(aabbMin_frustumPlane_z, aabbMax_frustumPlane_z);

		//dist to plane = dot(aabb_point.xyz, plane.xyz) + plane.w
			__m128 sum_xy = _mm_add_ps(res_x, res_y);
			__m128 sum_zw = _mm_add_ps(res_z, frustum_planes_d[j]);
			__m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);

			__m128 plane_res = _mm_cmple_ps(distance_to_plane, zero); //dist from closest point to plane < 0 ?
			intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - aabb behind the plane & outside frustum
		}

		//store result
		__m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
		_mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
	}
}


OBB culling is a bit harder. We perform calculations on one object at once. But make calculations for three xyz axes simultaneously. It is not optimal but it reflects basic idea of algorithm. Besides, vector math (matrix multiplications and point transformations) with SSE perform faster.


void sse_culling_obb(int firs_processing_object, int num_objects, int *culling_res, mat4 &cam_modelview_proj_mat)
{
	mat4_sse sse_camera_mat(cam_modelview_proj_mat);
	mat4_sse sse_clip_space_mat;

	//box points in local space
	__m128 obb_points_sse[8];
	obb_points_sse[0] = _mm_set_ps(1.f, box_min[2], box_max[1], box_min[0]);
	obb_points_sse[1] = _mm_set_ps(1.f, box_max[2], box_max[1], box_min[0]);
	obb_points_sse[2] = _mm_set_ps(1.f, box_max[2], box_max[1], box_max[0]);
	obb_points_sse[3] = _mm_set_ps(1.f, box_min[2], box_max[1], box_max[0]);
	obb_points_sse[4] = _mm_set_ps(1.f, box_min[2], box_min[1], box_max[0]);
	obb_points_sse[5] = _mm_set_ps(1.f, box_max[2], box_min[1], box_max[0]);
	obb_points_sse[6] = _mm_set_ps(1.f, box_max[2], box_min[1], box_min[0]);
	obb_points_sse[7] = _mm_set_ps(1.f, box_min[2], box_min[1], box_min[0]);

	ALIGN_SSE int obj_culling_res[4];
	__m128 zero_v = _mm_setzero_ps();

	int i, j;
	//process one object per step
	for (i = firs_processing_object; i < firs_processing_object+num_objects; i++)
	{
	//clip space matrix = camera_view_proj * obj_mat
		sse_mat4_mul(sse_clip_space_mat, sse_camera_mat, sse_obj_mat[i]);
		__m128 outside_positive_plane = _mm_set1_ps(0xffffffff);
		__m128 outside_negative_plane = _mm_set1_ps(0xffffffff);

	//for all 8 box points
		for (j = 0; j < 8; j++)
		{
		//transform point to clip space
			__m128 obb_transformed_point = sse_mat4_mul_vec4(sse_clip_space_mat, obb_points_sse[j]);

		//gather w & -w
			__m128 wwww = _mm_shuffle_ps(obb_transformed_point, obb_transformed_point, _MM_SHUFFLE(3, 3, 3, 3)); //get w
			__m128 wwww_neg = _mm_sub_ps(zero_v, wwww); // negate all elements

		//box_point.xyz > box_point.w || box_point.xyz < -box_point.w ?
		//similar to point normalization: point.xyz /= point.w; And compare: point.xyz > 1 && point.xyz < -1
			__m128 outside_pos_plane = _mm_cmpge_ps(obb_transformed_point, wwww);
			__m128 outside_neg_plane = _mm_cmple_ps(obb_transformed_point, wwww_neg);

		//if at least 1 of 8 points in front of the plane - we get 0 in outside_* flag
			outside_positive_plane = _mm_and_ps(outside_positive_plane, outside_pos_plane);
			outside_negative_plane = _mm_and_ps(outside_negative_plane, outside_neg_plane);
		}

		//all 8 points xyz < -1 or > 1 ?
		__m128 outside = _mm_or_ps(outside_positive_plane, outside_negative_plane);

		//store result
		__m128i outside_res_i = _mm_cvtps_epi32(outside);
		_mm_store_si128((__m128i *)&obj_culling_res[0], outside_res_i);

		//for now we have separate result separately for each axis
		//combine results. If outside any plane, then objects is outside frustum
		culling_res[i] = (obj_culling_res[0] != 0 || obj_culling_res[1] != 0 ||  obj_culling_res[2] != 0) ? 1 : 0;
	}
}

Table 2. SSE culling result of 100k objects. Intel Core I5. Single Thread. SSE.



SSE Culling SphereAABBOBB
Just culling0,260,463,48
Whole frame1,21,434,6


SSE implementation in average 3 times faster than simple one in C++.



Multithreading

Nowadays processors has several cores. Calculations might be performed simultaneously on all cores.

Architecture on new games should be planned taking into account multithreading, i.e. split work on independent parts/tasks and solve them simultaneously, loading evenly all the processor cores. The design should be flexible. Too large amount of small tasks leads to overhead of synchronizing work and switching between tasks. Too small abound of big tasks leads to uneavenly loading of cores. Need a balance. In current games there might be from several hundreds to thousand tasks per frame.

In our case of frustum culling each object is independent from the rest. Thats why we easily could split work into equal groups and cull them simultaneously with different cores of processor. After running jobs execution we need to wait threads to do their job and gather results.
Off course we should not ask results right after execution start.


Worker::Worker() : first_processing_oject(0), num_processing_ojects(0)
{
	//create 2 events: 1. to signal that we have a job 2.signal that we finished job
	has_jobs_event = CreateEvent(NULL, false, false, NULL);
	jobs_finished_event = CreateEvent(NULL, false, true, NULL);
}

void Worker::doJob()
{
	//make our part of work
	cull_objects(first_processing_oject, num_processing_ojects);
}

unsigned __stdcall thread_func(void* arguments)
{
	printf("In thread...\n");
	Worker *worker = static_cast<worker*>(arguments);

	//each worker has endless loop untill we signal to quit (stop_work flag)
	while (true)
	{
	//wait for starting jobs
	//if we have no job - just wait (has_jobs_event event). We do not wasting cpu work. Events designed for this.
		WaitForSingleObject(worker->has_jobs_event, INFINITE);

	//if we have signal to break - exit endless loop
		if (worker->stop_work)
			break;

	//do job
		worker->doJob();

	//signal that we finished the job
		SetEvent(worker->jobs_finished_event);
	}
	_endthreadex(0);
	return 0;
}

void create_threads()
{
	//create the threads
	//split the work into parts between threads
	int worker_num_processing_ojects = MAX_SCENE_OBJECTS / num_workers;
	int first_processing_oject = 0;

	int i;
	for (i = 0; i < num_workers; i++)
	{
	//create threads
		workers[i].thread_handle = (HANDLE)_beginthreadex(NULL, 0, &thread_func, &workers[i], CREATE_SUSPENDED, &workers[i].thread_id);
		thread_handles[i] = workers[i].thread_handle;

	//set threads parameters
		workers[i].first_processing_oject = first_processing_oject;
		workers[i].num_processing_ojects = worker_num_processing_ojects;
		first_processing_oject += worker_num_processing_ojects;
	}
	//run workers to do their jobs
	for (int i = 0; i < num_workers; i++)
		ResumeThread(workers[i].thread_handle);
}

void process_multithreading_culling()
{
	//signal workers that they have the job
	for (int i = 0; i < num_workers; i++)
		SetEvent(workers[i].has_jobs_event);
}

void wate_multithreading_culling_done()
{
	//wait threads to do their jobs
	HANDLE wait_events[num_workers];
	for (int i = 0; i < num_workers; i++)
		wait_events[i] = workers[i].jobs_finished_event;
	WaitForMultipleObjects(num_workers, &wait_events[0], true, INFINITE);
}


Table 3. Culling results of 100k objects. Intel Core I5 (4 cores). In brackets – speedup relatively to simple c++ implementation.



Method SphereAABBOBB
Simple c++0,92 (1)1,42 (1)9,14 (1)
SSE0,26 (3,54)0,46 (3,08)3,48 (2,62)
Simple c++, Mulithreaded0,25 (3,68)0,4 (3,55)2,5 (3,65)
SSE, Multithreaded0,1 (9,2)0,18 (7,89)1 (9,14)



Multithreaded version faster than single threaded in 3,6 times in average.
Using SSE gieves us 3 times speedup, relatively to simple c++ implementation.
Both using SSE and Multithreading gives us 8,7 times speedup!
I.e. we optimize our calculations by almost 9 times, depending on used culling primitive type.



GPU culling

GPU designed to perform the same operation on huge amount of data. GPU has a lot more parallel threads (thousands) than in CPU (2-8 in most desktop cases). But culling on gpu not allwas comfortably:

  • this assumes special graphics engine architecture
  • there is unpleasant moment that we evaluate dip execution on cpu side. For this we need to know the amout of generated primitives by GPU (visible in frustum objects in our case). Thats why we need to ask feedback from GPU. There are special commands for this purpose.
  • The problem is if we want get result in the same frame with culling and rendering we get GPU-stall, because we need to wait the result. This is bad for perfomance. If read result from previous frame – we get bugs. Full solution to this problem is using DrawIndirect commands and preparing information about dip on GPU side. This is available since DirectX11 and Opengl 4.

Implementation on gpu culling consist from next steps:
  1. Pack all instances data in vertex buffer. Assume that one vertex is one object for culling. Amount of atributes for vertex equal to amount of data per one object.
  2. Enable transform feedback. Send prepared vertex buffer on render. All results redirect to another vertex buffer with visible instances data.
  3. In vertex shader check visibility on the object
  4. In geometry shader discard object / kill the vertex if instance is not visible in frustum.
  5. Thus, we formed buffer with just visible instances data.
  6. But now we need to get information amout amount of visible objects from GPU to make the dip on CPU side. In this case we do this with transform feedback from previous frame (just for code simplicity).

void do_gpu_culling()
{
	culling_shader.bind();

	int cur_frame = frame_index % 2;
	int prev_frame = (frame_index + 1) % 2;

	//enable transform feedback & query
	glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, dips_texture_buffer);
	glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN,                     
	num_visible_instances_query[cur_frame]);
	glBeginTransformFeedback(GL_POINTS);

	//render cloud of points which we interprent as objects data
	glBindVertexArray(all_instances_data_vao);
	glDrawArrays(GL_POINTS, 0, MAX_SCENE_OBJECTS);
	glBindVertexArray(0);

	//disable all
	glEndTransformFeedback();
	glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
	glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, 0);
	glDisable(GL_RASTERIZER_DISCARD);

	//get feedback from prev frame
	num_visible_instances = 0;
	glGetQueryObjectiv(num_visible_instances_query[prev_frame], GL_QUERY_RESULT,      &num_visible_instances);

	//next frame
	frame_index++;
}

Vertex shader from 3rd step:
#version 330 core
in vec4 s_attribute_0;
in vec4 s_attribute_1;

out vec4 instance_data1;
out vec4 instance_data2;
out int visible;

uniform mat4 ModelViewProjectionMatrix;
uniform vec4 frustum_planes[6];

int InstanceCloudReduction()
{
	//sphere - frustum test
	bool inside = true;
	for (int i = 0; i < 6; i++)
	{
		if (dot(frustum_planes[i].xyz, s_attribute_0.xyz) + frustum_planes[i].w <= -s_attribute_0.w)
			inside = false;
	}
	return inside ? 1 : 0;
}

void main()
{
//read instance data
	instance_data1 = s_attribute_0;
	instance_data2 = s_attribute_1;

//visibility
	visible = InstanceCloudReduction();
	gl_Position = ModelViewProjectionMatrix * vec4(s_attribute_0.xyz,1);
}

Geometry shader from 4th step:
#version 400 core
layout (points) in;
layout (points, max_vertices = 1) out;

in vec4 instance_data1[];
in vec4 instance_data2[];
in int visible[];

out vec4 output_instance_data1;
out vec4 output_instance_data2;

void main( )
{
	//if primitive is not visible - discard it !
	//visible comes from vertex shader
	if (visible[0] == 1)
	{
	//just transfer data
		output_instance_data1 = instance_data1[0];
		output_instance_data2 = instance_data2[0];
		gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
		EmitVertex();
		EndPrimitive();
	}
}

Whole frame time with GPU culling is about 1,19 ms. Allmost the same as with the fastest CPU variant, SSE multithreaded sphere-frustum culling.



Perfomance comparision of all methods

We compared culling speed CPU methods, but mention only culling time. For now we are going to measure whole frame time, so measurements will differs a bit as we need take into account data transferring and some other work.


Table 4: Perfomance comparition of all culling methods. Whole Frame time in ms.



Method SphereAABBOBB
Simple c++1,942,4610,3
SSE1,21,454,63
Simple c++, multithreading1,231,423,6
SSE, multithreading1,181,22,02
GPU1,19--


There might be some inaccuracy in measurements (in 2nd sigh after coma). Even averaged time always differs from frame to frame.



Conclusion

Using both SSE and multithreading gives us 9 times perfomance win in comparition with simple c++ implementation. If one can use Opengl 4 (map buffers with GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags), then data transfering to gpu becomes really fast. Plus we able to optimise culling using hierarchial structures and different trics like remembering last frustum plane which culled the object.

GPU culling works as fast as the most optimized CPU method. And has couple advantages:
  • no need to transfer visible instances data to GPU. They are already there. If using DrawIndirect (DX11+, OpenGL 4+, new API: DX12, Vulkan, Metal) and form information about dip on CPU side, then we can get even better performance.
  • Relatively easily able to make additional culling taking into account self blocking/hiding/inner visibility (Hierarchical-Z map based occlusion culling)

So – what to use CPU or GPU culling?
Depends on several things:
  • project and amount of changes to make GPU culling
  • Do we able to use DrawIndirect
  • what is bottle neck: CPU or GPU?
  • Do we want make additional culling according to depth complexity of the scene. On GPU side it will be much easier and faster.

If there is such opportunity – one should use GPU culling.

Source code of all examples.

Links:
  1. Collision Detection FAQ
  2. View frustum culling
  3. View frustum culling optimization
  4. Efficient View Frustum Culling
  5. Optimized View Frustum Culling Algorithms for Bounding Boxes
  6. Hierarchical-Z map based occlusion culling
  7. Beyond Porting. How Modern OpenGL can Radically Reduce Driver Overhead
  8. Parallelizing the Naughty Dog engine using fibers
  9. Practical Occlusion Culling in Killzone 3
  10. Software Occlusion Culling
  11. Streaming SIMD Extensions (SSE)
Gerlits Anatoliy.
February, 2017 year.

License

GDOL (Gamedev.net Open License)

6 Comments

Feb 16 2017 01:00 AM
This is a great post in many aspects. First, it's rare to find real discussions on the implementation of VF algorithms that use both SSE and multi-threading. Second, it's even harder to find code samples of those things. Although I didn't check the code, I am sure that it might be pretty usefull for people who never tried such techniques. It is a very generous offer by the author and if code is good it will be a gold mine for beginners. Third, I always appreciate when authors take the time to actually compare algorithms. It is of little no value to come here and advocate for an algorithm without performance analyses. This article does not commit such sin.
 
On the nitpicking side, I think the article misses details about how the performance was measured. Every article that makes performance comparisons should give details about that part since they can completely alter everything. Additionally, I would say that the text is often too rushed. Many parts will leave interested readers at a loss. In particular, rushing over details about SSE, threading and GPU is dangerous. Also, the text really needs to be reviewed. I am not talking about the written English, but about things like: it does not make sense to include "Not all algorithms could be easily rewrited in SSE" in the list of SSE drawbacks that make SSE less performant. Those kind of minor things are all over the article.
 
However, don't get me wrong. The article is very welcome and pretty useful. Again, I am very happy to see someone actually share actual implementation of actual multi-threaded+SSE VF code. Not to mention a GPU implementation. I would only stress to whoever read this, that the author chose (understandably, given the purposes of this article) to not deal with spatial partitioning. But in my experience, in many cases even a basic implementation of an Octree or KD-tree outperform even SSE+multithreaded non-spatialized solutions (although I never compared against GPU non-spatialized solutions, but if the performance metrics here are correct, these would be outperformed as well). Of course, it dependes heavily on the chracteristics of each game or engine. But anyone needing high performance should also try at least an Octree and then decide in case by case basis.
 
Anyways, thanks for the article!
Feb 16 2017 06:29 AM

Quite detailed article.

Have you considered trying AVX? On my i7-4770K 4.0 GHz I get roughly 500,000,000 AABB culls / second on a single thread, so it's roughly 0.2 ms per 100к.

Feb 16 2017 12:29 PM

I would also like to add a good reference: http://www.frostbite.com/wp-content/uploads/2013/05/CullingTheBattlefield.pdf

Feb 16 2017 02:49 PM
This is a very nice article.
 
The bit at the at the end of the OBB function
__m128i outside_res_i = _mm_cvtps_epi32(outside);
ALIGN_SSE int obj_culling_res[4];
_mm_store_si128((__m128i *)&obj_culling_res[0], outside_res_i);
culling_res[i] = (obj_culling_res[0] != 0 || obj_culling_res[1] != 0 ||  obj_culling_res[2] != 0) ? 1 : 0;
Is unnecessary memory traffic.  The store, loads, all of those compares and the branch can be replaced by a single instruction that keeps everything in registers.
culling_res[i] = _mm_movemask_ps(outside);
 
With SSE, RAM is the enemy.  Registers are your only friend.
Feb 17 2017 06:13 PM

MAnd

>This is a great post in many aspects

thanks

 

>I am not talking about the written English

well, I just learning it )

 

>rushing over details

any aspect of described optimizations is the whole universe. Just thought - it would be too boring for everyone to read basics again. For those who not familiar with concepts of SSE & multithreading & geometry shaders - it will be anyway too hard to understand the code.

The article is about optimizations, speedup of naive implementation, comparison and how good is gpu.

 

I was a bit surprised, but CPU is really fast with all optimizations - as fast as GPU.

Honestly, GPU is still a dark horse. Need some additional tests: geometry shaders vs compute, atomics overhead, indirect with lots of 'empty' dips, OBB culling, BVH/trees on gpu.

CPU culling also requires some additional research: need to understand how queries are suitable for occlusion culling. Tests need to be conservative due to queries nature. Such approach was used in Doom 2016. It might be very interesting.

So, even getting such 'not bad' results for CPU - I am still not sure how it is good in comparison with GPU. Additional research required.

 

>in many cases even a basic implementation of an Octree or KD-tree outperform even SSE+multithreaded

I work on WarThunder. We tested hierarchical structures for culling. Surprisingly, but simple brute force culling of linear array of objects outperforms all hierarchies.

It was not me, who tested this, but I believe in results:

- SSE 'doesn't like' branching :)

- branching is pretty slow

- next calculations depends on prev ones

- probably even important - processor's caches utilization (your data are not locally and you jump from one to another all the time, unpredictable, fetch different data each time)

Feb 17 2017 06:22 PM

Zaoshi Kaba

>Have you considered trying AVX? On my i7-4770K 4.0 GHz I get roughly 500,000,000 AABB culls / second on a single thread, so it's roughly 0.2 ms per 100к.

nice results

no - not tested AVX yet

 

corysama

looks good, will try it - thanks

Note: GameDev.net moderates article comments.