GPU geometry shader particles frustum culling?

Yes, GPUs can be very counter-intuitive at times.

At work we have an API test demo I wrote which draws 1,000,000 cubes via instancing (so one draw call). The vertex shader loads, via an indexed cbuffer, a matrix4x4 per instance and uses that to transform each vertex. The matrix buffer is static and never touched post-creation. It ran at just over 30fps on an NV GTX 470, all GPU load.
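
Roughly, that boils down to a per-instance matrix fetch in the vertex shader, something like the sketch below. The names are made up for illustration, and a StructuredBuffer is used here simply because a single D3D11 cbuffer tops out at 4096 float4s, so a million matrices would otherwise need batching:

StructuredBuffer<float4x4> InstanceTransforms;	// one matrix per cube

cbuffer Camera
{
	float4x4 ViewProjection;
};

float4 InstancedCubeVS(float3 position : POSITION,
	uint instance : SV_InstanceID) : SV_POSITION
{
	// fetch this instance's matrix for every vertex of the cube
	float4 world = mul(float4(position, 1), InstanceTransforms[instance]);
	return mul(world, ViewProjection);
}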

Last week I wrote another test based on that demo; however, this time the cbuffer held 4x float4 per instance, one of which was an axis/angle whose angle component was updated by a compute shader. When drawing, those values were used to construct a quaternion for the rotation part of the transform, and the rest were used to translate the cubes in various ways (a few sin/cos calculations with per-instance phase and amplitude amounts). After that the workload for the shader was the same as before.
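
The axis/angle-to-quaternion part is only a few instructions in the vertex shader; a rough sketch (the struct layout and names are a guess at what's described above, not the real code):

struct CubeInstance
{
	float4 axisAngle;	// xyz = axis, w = angle (animated by the CS)
	float4 translate;	// base position
	float4 animParams;	// e.g. x = phase, y = amplitude
	float4 misc;
};

// rotate v by the quaternion built from a (normalised) axis and an angle
float3 RotateAxisAngle(float3 v, float3 axis, float angle)
{
	float s, c;
	sincos(angle * 0.5, s, c);
	float4 q = float4(normalize(axis) * s, c);
	return v + 2 * cross(q.xyz, cross(q.xyz, v) + q.w * v);
}

// in the VS, per vertex:
//   worldPos = RotateAxisAngle(localPos, inst.axisAngle.xyz, inst.axisAngle.w)
//            + inst.translate.xyz + (sin/cos offset from inst.animParams);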

This new version, despite transferring more data and doing more per-vertex work, ran at ~40+fps...

(I might, if I have time, modify the demo this week to do some point expansion for the cubes in the GS to get a feel for the slowdown... then maybe throw in some HS/DS work too...)

Wait, so on both tests the cbuffer stream size was the same? (you said a matrix4x4 compared to 4x float4s) That does sound interesting...

Yep; if anything it hit twice as much data, as the compute shader read in the whole per-instance entry (4x float4), modified one component in one float4 and wrote the whole thing back out (this is a test, performance wasn't something I cared about ;)). At best the shader compiler could have reduced that to one float4 read and one write, but that's still a million float4s, plus 1 million x 4 x float4 on the vertex shader side.
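
So the update pass amounts to something like this (names invented, no bounds checking, just the read/modify/write pattern described above):

struct CubeInstance
{
	float4 axisAngle;	// only .w actually changes
	float4 translate;
	float4 animParams;
	float4 misc;
};

RWStructuredBuffer<CubeInstance> Instances;

cbuffer Frame
{
	float DeltaTime;
};

[numthreads(64, 1, 1)]
void AnimateCubesCS(uint3 id : SV_DispatchThreadID)
{
	// read the whole per-instance entry, bump one component, write it all back
	CubeInstance inst = Instances[id.x];
	inst.axisAngle.w += DeltaTime;
	Instances[id.x] = inst;
}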

My theory, and I have no Nsight installed to back this up, is that because it was a series of float4s and not a matrix4x4, the hardware was better able to mask the latency by switching threads more often and generally able to schedule workloads better. But I have no data to back this up due to the lack of tools, so I wouldn't even attempt to optimise along those lines.

Btw, both of the examples above also work via vertex stream instancing instead of cbuffer indexing; same data set size (in fact the compute version of the update writes more data, as it has to write an extra float4 to update the instanced vertex buffer (separate buffer, same data)), but in both cases the vertex stream version runs faster (~44fps vs 30fps for the static cubes, and about a 1ms per frame difference for the animated case). You can switch seamlessly between the modes thanks to shared data buffers and the scene continues to look the same.
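
The vertex stream variant moves the same per-instance data into a second vertex buffer whose elements are tagged D3D11_INPUT_PER_INSTANCE_DATA in the input layout; on the HLSL side only the input signature really changes. A minimal sketch, with made-up semantic names and the rotation omitted for brevity:

cbuffer Camera
{
	float4x4 ViewProjection;
};

struct VSIn
{
	float3 position  : POSITION;			// stream 0, per-vertex cube geometry
	float4 axisAngle : INSTANCE_AXISANGLE;	// stream 1, per-instance data
	float4 translate : INSTANCE_TRANSLATE;
};

float4 StreamInstancedVS(VSIn input) : SV_POSITION
{
	// per-instance data arrives through the input assembler instead of a buffer fetch
	float3 world = input.position + input.translate.xyz;
	return mul(float4(world, 1), ViewProjection);
}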

I wish I had real tools at times.

For the CS particle system, you update the various point quantities in the CS. Then do you just instance a quad by drawing numParticles instances and sample the structured buffer the CS updated in the vertex shader?
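
In other words, something along these lines (all names made up here), drawing a 4-vertex triangle strip with DrawInstanced(4, numParticles, 0, 0) and indexing the CS-updated buffer by SV_InstanceID?

struct Particle
{
	float3 position;
	float  size;
	float4 color;
};

StructuredBuffer<Particle> Particles;

cbuffer Camera
{
	float4x4 ViewProjection;
	float3   CameraRight;
	float    pad0;
	float3   CameraUp;
	float    pad1;
};

struct VSOut
{
	float4 position : SV_POSITION;
	float2 tex      : TEXCOORD;
	float4 color    : COLOR;
};

VSOut ParticleQuadVS(uint vid : SV_VertexID, uint iid : SV_InstanceID)
{
	Particle p = Particles[iid];

	// vertex id 0..3 -> quad corners of a triangle strip: (0,0) (1,0) (0,1) (1,1)
	float2 corner = float2(vid & 1, vid >> 1);
	float2 offset = (corner * 2.0 - 1.0) * p.size;

	float3 world = p.position + offset.x * CameraRight + offset.y * CameraUp;

	VSOut o;
	o.position = mul(float4(world, 1), ViewProjection);
	o.tex      = corner;
	o.color    = p.color;
	return o;
}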


While I can't find it right now, there was a graph plotted which compared the various methods of doing particles, and point expansion via a GS was much slower than simple instancing.

I've written and profiled both versions and instancing beats a GS any day, consistently on NV, AMD and Intel too. In fact, just sending all 4 verts in the traditional old-fashioned way and not doing any expansion at all (i.e. no instancing, no GS) was also faster than the GS path.

It's easy to see how one might think a GS should be faster: it's fewer verts and less data, and those are things that can be directly measured in one's own code, but that's an old, old trap to fall into.
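
For reference, the GS path being compared here is the usual one-point-in, four-verts-out expansion, roughly like this (constant and struct names are assumed, not taken from any actual code in this thread):

cbuffer Camera
{
	float4x4 ViewProjection;
	float3   CameraRight;
	float    pad0;
	float3   CameraUp;
	float    pad1;
};

struct GSIn
{
	float3 position : POSITION;
	float  size     : PSIZE;
};

struct GSOut
{
	float4 position : SV_POSITION;
	float2 tex      : TEXCOORD;
};

[maxvertexcount(4)]
void ExpandPointGS(point GSIn input[1], inout TriangleStream<GSOut> stream)
{
	// emit the four corners of a camera-facing quad as a triangle strip
	[unroll]
	for (uint i = 0; i < 4; ++i)
	{
		float2 corner = float2(i & 1, i >> 1) * 2.0 - 1.0;
		float3 world  = input[0].position
		              + corner.x * input[0].size * CameraRight
		              + corner.y * input[0].size * CameraUp;

		GSOut o;
		o.position = mul(float4(world, 1), ViewProjection);
		o.tex      = float2(i & 1, i >> 1);
		stream.Append(o);
	}
}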


I've had real-world use cases where GS quad expansion actually performed better than instancing when tested on various hardware. I don't think it's safe to make any broad assumptions here; there are a whole lot of variables in play (especially when you consider what the driver may or may not be doing behind the scenes).

As an alternative (somewhere between instancing and geometry shader) you can generate the quad vertices in the vertex shader from the vertex id (you will need to tag your particle vertices as D3D11_INPUT_PER_INSTANCE_DATA!).

And since the OP is asking about culling, there's also the SV_CullDistance semantic which can be emitted by both vertex and geometry shader.

// constant buffer with the camera data used below (declaration assumed; it wasn't part of the original post)
cbuffer PerFrame
{
	float4x4 World;
	float4x4 View;
	float4x4 Projection;
	float3   Eye;
};

static const uint QuadIndices[6] = {0,1,2,1,3,2};

void BillboardedPointsTestVS(
	uint vid              : SV_VertexID,
	uint iid              : SV_InstanceID,
	out float4 sv_position: SV_POSITION,
	inout float3 position : POSITION,
	out float2 tex		  : TEXCOORD,
	inout float4 color    : COLOR,
	out float cull        : SV_CullDistance)
{
	// index to vertex id, call with DrawInstanced(6, number_of_particles, 0, 0)
	uint id = QuadIndices[vid];  
	tex = float2(id & 1, (id >> 1) & 1);
	float4 quadPos = float4(tex * float2(2,-2) + float2(-1,1), 1, 1);	

	position = mul(float4(position, 1), World).xyz;

	// distance to camera
	float distance = length(position.xyz - Eye);

	// example culling, here in world space at some arbitrary plane at the origin
	cull = dot(float3(1,0,1), position);

	// extrude quad in view space ...
	float4 view = mul(float4(position, 1), View);	
	float size = 1 / distance;
	view.xyz += size * quadPos.xyz;
	
	// ... and project
	sv_position = mul(view, Projection);
}
