Terrible light shader performance & optimizations

15 comments, last by cozzie 10 years, 10 months ago

Hi all,

I've managed to make a lighting shader, taking 1 to 3 directional lights (VS) and up to 32 point lights (PS).

Per frame I check which point lights are within the frustum (a basic sphere check using the light radius). If, for example, 20 point lights are within the frustum, I set range = 0 for lights 21 to 32 in the array and process them anyway. Using an extra int variable (activelights) as the loop bound in the shader doesn't improve performance compared to using a const max lights and looping through all of them, with range = 0 for the invisible lights.
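
A minimal C++ sketch of the per-frame sphere-versus-frustum check described above. The `Vec3`, `Plane` and `PointLight` types and both function names are hypothetical, and the six planes are assumed to have unit normals pointing into the frustum; a real engine would use its own math types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };            // points p with dot(n, p) + d >= 0 are inside
struct PointLight { Vec3 pos; float range; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Conservative test: the sphere is rejected only when it lies fully behind one plane.
bool sphereInFrustum(const std::array<Plane, 6>& frustum, const Vec3& center, float radius)
{
    assert(radius >= 0.0f);
    for (const Plane& p : frustum)
        if (dot(p.n, center) + p.d < -radius)
            return false;
    return true;
}

// Returns the indices of the lights whose range sphere touches the frustum.
std::vector<std::size_t> visibleLights(const std::array<Plane, 6>& frustum,
                                       const std::vector<PointLight>& lights)
{
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < lights.size(); ++i)
        if (sphereInFrustum(frustum, lights[i].pos, lights[i].range))
            visible.push_back(i);
    return visible;
}
```

Only the returned indices would then be copied into the shader arrays, so the remaining slots never need to be filled with range = 0 lights.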

The issue:

- it's terribly slow on fairly up-to-date hardware

I've copied the shader below, so you have an idea.

Before I dive into restructuring my scenegraph etc., I'd like to hear your opinion on the possible optimizations:

1 - currently my render queue sorts objects by material and then renders them all using all (point) lights, even if a light doesn't affect the object (I don't check this now). Would it be worthwhile to check and set the lights that affect an object right before I render it? So basically light position + radius versus object position + radius (a quick rough check).

2 - would it be worthwhile to introduce a loop in the VS program that checks which lights affect the vertex being processed, and only go through these lights in the PS program? (This would save me from redoing my render queue.)

3 - are there any quick-win improvements I could apply to the effect/VS/PS programs?

4 - I could check the distance of point lights from the camera and, for example, render point lights > 50 units from the camera using a vertex shader instead of a pixel shader

5 - other possibilities? (I think multiple passes without changing anything else won't help that much.)

Any help is really appreciated.

When I have a scene with for example 4 point lights (and a const MAX LIGHT of 4), the performance is visibly better.

All this I want to do without using a deferred renderer (for learning/ practice and see if I can improve it).

Personally I think I'm GPU bound now, and the most logical step would be to have the shader take 4 or 8 max lights, and set the lights that affect each object, right before I render the object (keep renderqueue the same, sorted by material). This would mean setting say 4 times 3 shader parameters for each object (no commitchanges needed I think).


/*************************************************************/
/**	CR GENERIC SHADER		1 DIRECTIONAL, 4 POINT LIGHTS	**/
/**					PER PIXEL LIGHTING		**/
/**					SINGLE TEXTURE, ANISOTROPY	**/
/**	TECHNIQUES:			OPAQUE AND BLENDED		**/
/**	SHADER MODEL:		3.0,NO BWARDS COMPATIBILITY	**/
/*************************************************************/
	                            
/*******************************************************/
/**	UNIFORM INPUT, CONTROLLED BY ENGINE			**/
/**  TRANSFORMATIONS & MATERIALS				**/
/*******************************************************/

float4x4		World			: WORLD;
float4x4		WorldInvTransp	: WORLDINVTRANSP;
shared float4x4	ViewProj		: VIEWPROJECTION;

float4		AmbientColor	: AMB_COLOR; 
float			AmbientIntensity	: AMB_INTENSITY;
            
float4		MatAmb 		: MATERIAL_AMBIENT;
float4		MatDiff		: MATERIAL_DIFFUSE;
float4		MatSpec		: MATERIAL_SPECULAR;
float4		MatEmi		: MATERIAL_EMISSIVE;
float			MatPower		: MATERIAL_POWER;

texture		Tex0			: TEXTURE0 < string name = "roadblock texture.tga"; >;

// modelfile, just for effectedit
string XFile 	= "roadblock.x";

/*******************************************************/
/**	UNIFORM INPUT, CONTROLLED BY ENGINE			**/
/**  LIGHT SOURCES AND PROPERTIES				**/
/*******************************************************/
 
#define MaxDirectionalLights 1
       
float3 		DirLightDir[MaxDirectionalLights];
float4		DirLightCol[MaxDirectionalLights];
float			DirLightInt[MaxDirectionalLights];

#define MaxPointLights 32

float3		PointLightPos[MaxPointLights];
float			PointLightRange[MaxPointLights];
float			PointLightFPRange[MaxPointLights];
float4		PointLightCol[MaxPointLights];
float			PointLightInt[MaxPointLights];
   
/*******************************************************/
/**	SAMPLER STATES FOR TEXTURING				**/
/*******************************************************/

sampler2D textureSampler = sampler_state
{
	Texture		= (Tex0);
	MinFilter		= ANISOTROPIC;
	MagFilter		= LINEAR;
	MipFilter		= LINEAR;
	MaxAnisotropy	= 4;
};

/*******************************************************/
/**	VERTEX SHADER INPUT <= VERTEX DECLARATION		**/
/*******************************************************/

struct VS_INPUT
{
	float4 Pos		: POSITION0;
	float4 Normal	: NORMAL0;
	float2 TexCoord	: TEXCOORD0;
};

/*******************************************************/
/**	VERTEX SHADER OUTPUT - PIXEL SHADER INPUT       **/
/*******************************************************/

struct VS_OUTPUT
{
	float4 Pos		: POSITION0;
	float4 Color	: COLOR0;
	float3 Normal	: TEXCOORD1;
	float2 TexCoord	: TEXCOORD2;
	float3 wPos		: TEXCOORD3;
};

/*******************************************************/
/**	VERTEX SHADER PROGRAM					**/
/*******************************************************/
                    
VS_OUTPUT VS_function(VS_INPUT input)
{
	VS_OUTPUT Out = (VS_OUTPUT)0;

	float4 worldPosition = mul(input.Pos, World);
	Out.Pos = mul(worldPosition, ViewProj);

	float4 normal = mul(input.Normal, WorldInvTransp);

	Out.Normal = normal;
	Out.TexCoord = input.TexCoord;
	Out.wPos = worldPosition;

//	DIRECTIONAL LIGHT
	float dirIntensity[MaxDirectionalLights];
	float dirTotal = 0.0f;

	for(int i=0;i<MaxDirectionalLights;i++)
	{	
		dirIntensity[i] 	= dot(normal, DirLightDir[i]);
		dirTotal 		+= saturate(DirLightCol[i] * DirLightInt[i] * dirIntensity[i]);
	}

	Out.Color = dirTotal;
	return Out;
}

/*******************************************************/
/**	PIXEL SHADER PROGRAM					**/
/*******************************************************/

float4 PS_function(VS_OUTPUT input): COLOR0
{
	float4 textureColor = tex2D(textureSampler, input.TexCoord);
	float4 amb = AmbientColor * AmbientIntensity * MatAmb;
	float4 diff = input.Color * MatDiff;

	float distt[MaxPointLights];
	float att[MaxPointLights];

	float4 att_total = 0.0f;
	float4 attcolored;
	float4 perpixel;

	for(int i=0;i<MaxPointLights;i++)
	{
		distt[i] 	= distance(PointLightPos[i], input.wPos);
		att[i]	= 1 - saturate((distt[i] - PointLightFPRange[i]) / PointLightRange[i]);
		att[i]	= (pow(att[i], 2)) * PointLightInt[i];

		attcolored	= att[i] * PointLightCol[i];
		perpixel	= saturate(dot(normalize(PointLightPos[i] - input.wPos), normalize(input.Normal)));
		att_total	+= (attcolored * perpixel);
	}

	return saturate((diff + amb + att_total + MatEmi) * textureColor);
}

/*******************************************************/
/**	TECHNIQUES & PASSES					**/
/*******************************************************/

technique OpaqueTechnique
{
	pass P0
	{
		AlphaBlendEnable	= FALSE;

		VertexShader = compile vs_3_0 VS_function();
		PixelShader = compile ps_3_0 PS_function();
	}
}

/*******************************************************/
/**	BLENDED							**/
/*******************************************************/

technique BlendedTechnique
{
	pass P0
	{
		AlphaBlendEnable	= TRUE;
		SrcBlend		= SRCALPHA;
		DestBlend		= INVSRCALPHA;
		AlphaOp[0]		= SelectArg1;
		AlphaArg1[0]	= Texture;

		VertexShader = compile vs_3_0 VS_function();
		PixelShader = compile ps_3_0 PS_function();
	}
}

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me


The simplest optimization is probably best done on the CPU side:

- Firstly test all point lights against the current view frustum, and ignore all lights that don't intersect it. This just makes the next step more efficient.

- For each object you render test its bounding sphere against the reduced light list, and only pass those lights to the shader.

- Pass a constant to the shader to tell it the actual light count and use a [loop] to only process the correct number of lights.


float NumPointLights;
 
// ...
 
    [loop] for(int i=0;i<NumPointLights;i++)

If you actually need to render a huge number of lights look into deferred shading.
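
The CPU-side steps above could be sketched in C++ roughly as follows. This is only an illustration under assumptions: the `Sphere` type and the function names are hypothetical, and both the object and each light are approximated by a bounding sphere (for a light, its position and range).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Vec3 { float x, y, z; };
struct Sphere { Vec3 center; float radius; };

// Two spheres overlap when the squared centre distance is at most the squared radius sum.
bool spheresOverlap(const Sphere& a, const Sphere& b)
{
    const float dx = a.center.x - b.center.x;
    const float dy = a.center.y - b.center.y;
    const float dz = a.center.z - b.center.z;
    const float r  = a.radius + b.radius;
    return dx * dx + dy * dy + dz * dz <= r * r;
}

// Collects the indices of up to maxLights lights that reach the object.
std::vector<std::size_t> lightsForObject(const Sphere& object,
                                         const std::vector<Sphere>& lights,
                                         std::size_t maxLights)
{
    assert(maxLights > 0);
    std::vector<std::size_t> picked;
    for (std::size_t i = 0; i < lights.size() && picked.size() < maxLights; ++i)
        if (spheresOverlap(object, lights[i]))
            picked.push_back(i);
    return picked;
}
```

The size of the returned list is the value that would be uploaded as NumPointLights before drawing the object, and `lights` would ideally be the list already reduced by the frustum test.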

Since this is forward rendering, drop your light count to something sensible like 8 lights and see what the performance is, then slowly increase the number of lights. Forward rendering has the problem that lighting is computed for every rasterized pixel of the scene, regardless of whether that pixel ends up occluded by something else. So you are most likely running the 32-light loop on many more pixels than there actually are in the render target, and that is what makes this really slow.

You clearly don't understand how the pixel and vertex shaders work, judging by your remark in point 2: the pixel shader has no information about a vertex; it only sees a pixel on screen and the data passed along through constants and interpolators.

It's for these reasons that people switched to deferred shading, where the complexity of the light computation depends on the screen resolution instead of the scene. The next steps from there are clustered and tile-based approaches to deferred and forward shading.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

@Adam; thanks, I'll go for that. Looking at the number of lights actually affecting an object, 4 to 8 max should be more than enough.
Can you explain what the [loop] attribute changes in the shader/effect? (I know that using a defined constant would be quicker/quickest since it can be precomputed instead of dynamic at runtime)

@Nightcreature: thanks, this confirms that the first step/biggest gain would be to skip lights for objects which are not affected. I can even precompute the combinations just once at startup of the scene for static objects with static lights. Then I just have to 'feed' them at runtime (and update the index for dynamic meshes and/or lights each frame).
Regarding option 2, I thought it would be possible to have a boolean array for the lights which the VS could set to false if the light doesn't reach the vertex, and then skip the false lights in the pixel shader. But from your remark it sounds like this is not an option.

And yes, I don't know enough yet; that's exactly why I'm not blindly downloading and adopting a shader using deferred rendering, but instead asking "stupid questions" here on the forum :)

I'll post the results as soon as I'm there.

You've got the bulk of your work (and the largest loop) in your pixel shader, which is exactly what you don't want. If you render every object in your scene using this VS and PS you will have massive slowdown simply because the pipeline will loop over every rasterized pixel 32 times doing those few lighting calculations on all of them.

If you have overdraw and don't use your depth buffer correctly then your slowdown gets even more dramatic. Try sorting your objects (even better your triangles) by distance to viewer in the C++ program before you draw them. Draw the closest to the camera first, this should prevent your super-expensive pixel shader getting called on pixels blocked by the close-up objects.
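
A rough C++ sketch of the front-to-back sort suggested above, at object granularity. The `DrawItem` record and the function names are hypothetical, and the squared distance from the camera to the object's centre is used as the sort key (an assumption; a view-space depth would work as well).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical render-queue entry; depthKey is filled in by sortFrontToBack.
struct DrawItem { int id; Vec3 center; float depthKey; };

float distanceSq(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Orders opaque items nearest-first so the depth test can reject hidden pixels early.
void sortFrontToBack(std::vector<DrawItem>& opaqueItems, const Vec3& cameraPos)
{
    for (DrawItem& item : opaqueItems)
        item.depthKey = distanceSq(item.center, cameraPos);

    std::sort(opaqueItems.begin(), opaqueItems.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.depthKey < b.depthKey; });

    // Debug sanity check: the nearest item must now come first.
    assert(opaqueItems.empty() ||
           opaqueItems.front().depthKey <= opaqueItems.back().depthKey);
}
```

This applies to opaque geometry only; alpha-blended objects are normally drawn afterwards, back to front.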

Btw - If you want to send a "material index" out of the vertex shader and into the pixel shader, you can do it, but make sure that the interpolation doesn't destroy the value. For instance you could grow your vertex structure to include an extra DWORD that describes which of the 32 lights should affect that vertex. In the C++ program you determine this and then using a dynamic VB you change the value of the DWORD, storing powers of 2 to identify which lights to use. Then the VS just passes that DWORD out of itself. HOWEVER this just adds *MORE* work to your per-pixel shader ..... because it has to unpack that DWORD and those powers of 2 and figure out which lights to apply, this could even add branching and destroy available parallelism. So don't do this.

[loop] tells the compiler to actually make the loop dynamic instead of unrolling it. It's not strictly necessary - the compiler will automatically decide if it's going to unroll it or use flow control if you don't add the attribute, but it's safest to be explicit as you'll get better error messages that way.

Creating a separate shader for each light count and [unroll]ing the loop will speed things up, but you don't always want to create extra copies of the shader each time you add a feature because you can quickly end up with thousands of them that way.

Thanks, I actually tried that and ended up with a bunch of them, with/without anisotropy, with 1, 2 and 3 directional lights, with 4, 8, 12 and 16 point lights etc. :)

Oh, and also different versions for SM2 and SM3.

First thing I'm gonna do now is send only the lights that affect an object to the shader (this might also cost some performance on the CPU side, but definitely not as much as the GPU-killing approach of always processing 32 lights). If I have a lighting shader with 4 or 8 max lights I should be fine. Maybe when there are more than 4, I'll just take the 4 that are closest to the object.

This way I'm getting rid of all shader variants having a different number of max point lights.
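
One possible way to pick the closest lights when more than the maximum affect an object, sketched in C++. The `PointLight` type and function names are hypothetical, and "closest" is taken to mean the distance from the light position to the object's centre, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Vec3 { float x, y, z; };
struct PointLight { Vec3 pos; float range; };

float distanceSq(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Keeps only the maxLights candidates nearest to the object centre, nearest first.
std::vector<PointLight> nearestLights(std::vector<PointLight> candidates,
                                      const Vec3& objectCenter,
                                      std::size_t maxLights)
{
    assert(maxLights > 0);
    const std::size_t keep = std::min(maxLights, candidates.size());
    // partial_sort only orders the first 'keep' elements, which is all that is needed.
    std::partial_sort(candidates.begin(),
                      candidates.begin() + static_cast<std::ptrdiff_t>(keep),
                      candidates.end(),
                      [&objectCenter](const PointLight& a, const PointLight& b) {
                          return distanceSq(a.pos, objectCenter) < distanceSq(b.pos, objectCenter);
                      });
    candidates.resize(keep);
    return candidates;
}
```

A light that is slightly further away but much brighter may matter more visually, so a weighting by intensity could be a sensible refinement.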

For my own fun I started optimizing from the low level; others have already pointed out the clear high-level optimizations. Your pixel shader was, hmm, quite sloppy. Lots of things were left over from the old fixed-function pipeline, and some things, like arrays as temp variables, were just plain wrong.

As an example I optimized and cleaned up some of the things, but this is in no way final.

Quick notes: Lights have three color components, not four: wasted cycles. Light color and intensity can be precalculated into one parameter instead of two that are multiplied at pixel level in the inner loop: lots of wasted cycles. Normalizing the normal should be done outside the loop; even if the compiler can do this, leaving it inside just makes the inner loop harder to read. distance() and normalize() share a lot of calculations, so those can be pulled out of the functions and shared (the compiler can do this too, but just for the sake of it).

Ambient is just diffuse light like the others, so it does not need its own material. Color and diffuseMat can be precalculated on the CPU.

A lot of redundant work was done 32 times per pixel; no wonder it ran slow.


float4 PS_function(VS_OUTPUT input): COLOR0{
    float4 textureColor = tex2D(textureSampler, input.TexCoord);
    float4 diff = input.ColorMatDif  *  textureColor;
    
    float3 normal = normalize(input.Normal);
    float3 att_total = 0.0;

    for(int i=0;i<MaxPointLights;i++)
    {
        // vector from the pixel to the light, so nDotL is positive on lit surfaces
        float3 toLight  = PointLightPos[i] - input.wPos;
        float  distSq   = dot(toLight, toLight);
        float  invLen   = rsqrt(distSq);
        float3 dir      = toLight * invLen;
        float  nDotL    = saturate(dot(dir, normal));
        float  dist     = distSq * invLen; // note: x * rsqrt(x) = sqrt(x)
        float  att      = saturate((PointLightRange[i] - dist) / PointLightRange[i]);
        att_total      += (nDotL * att) * PointLightColorAndIntesity[i];
    }

    return float4(saturate((AmbientColorAndIntensity + att_total + MatEmi) * diff.rgb), diff.a);
}
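
The CPU-side precalculation mentioned in the notes could look roughly like this in C++. The `Color3` and `PointLightDesc` types and the function names are hypothetical; the result is what would be uploaded as the combined color-and-intensity array that the shader above reads.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for illustration only.
struct Color3 { float r, g, b; };
struct PointLightDesc { Color3 color; float intensity; };

// Folds the intensity into the color once, instead of once per pixel per light.
Color3 premultiply(const Color3& color, float intensity)
{
    assert(intensity >= 0.0f);
    return { color.r * intensity, color.g * intensity, color.b * intensity };
}

// Builds the three-component array that would be sent to the shader.
std::vector<Color3> packLightColors(const std::vector<PointLightDesc>& lights)
{
    std::vector<Color3> packed;
    packed.reserve(lights.size());
    for (const PointLightDesc& l : lights)
        packed.push_back(premultiply(l.color, l.intensity));
    return packed;
}
```

The same helper would cover the ambient term, turning the ambient color and ambient intensity into a single constant.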
 

@Kalle; wow, thanks!! That sounds like optimization. I'll give it a try and try your suggestions on what I can precompute (outside the shader).

I got the basics running now, selecting only the lights that affect a mesh instance, per mesh instance.
Quite some improvement gained already; the only thing is that with a big mesh (say terrain), performance still drops (if lots of point lights affect it).

Tomorrow next steps and more improvements :)

@Adam; the [loop] thing works but honestly I don't see much difference. Probably good to be sure, though. I'm not sure whether I should use the defined max const instead (now going for 8 max, or 16 max point lights, per mesh instance). I had to use an int instead of the float in your post.

To handle terrain (and any other large mesh) there are a couple of options:

1. Split the mesh up into smaller pieces. That way each chunk of mesh is affected by fewer lights. This can also help performance as you won't need to render all the off-screen vertices either. You just have to be a little careful with the joins or you may get small gaps appearing along them.

2. Pre-calculate all the static lighting into either a second texture or the vertex colour. This is called light mapping. Dynamic lights can still be combined with the static ones, but hopefully there are fewer of those.
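
As an illustration of option 2, here is a C++ sketch that bakes static point lights into a per-vertex colour at load time. It reuses the linear falloff from the optimized pixel shader earlier in the thread; the `Vertex` and `StaticLight` types are hypothetical, normals are assumed to be unit length, and the light colour is assumed to be already multiplied by its intensity. Shadows are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical types for illustration only.
struct Vec3 { float x, y, z; };
struct Vertex { Vec3 pos; Vec3 normal; Vec3 baked; };       // normal assumed unit length
struct StaticLight { Vec3 pos; Vec3 color; float range; };  // color already includes intensity

// Accumulates diffuse lighting from all static lights into each vertex.
void bakeVertexLighting(std::vector<Vertex>& verts, const std::vector<StaticLight>& lights)
{
    for (Vertex& v : verts) {
        v.baked = {0.0f, 0.0f, 0.0f};
        for (const StaticLight& l : lights) {
            assert(l.range > 0.0f);
            const Vec3 toLight = { l.pos.x - v.pos.x, l.pos.y - v.pos.y, l.pos.z - v.pos.z };
            const float dist = std::sqrt(toLight.x * toLight.x +
                                         toLight.y * toLight.y +
                                         toLight.z * toLight.z);
            if (dist <= 0.0f || dist >= l.range)
                continue;  // light sits on the vertex or does not reach it
            const float nDotL = std::max(0.0f, (v.normal.x * toLight.x +
                                                v.normal.y * toLight.y +
                                                v.normal.z * toLight.z) / dist);
            const float att = (l.range - dist) / l.range;  // linear falloff to zero at range
            v.baked.x += nDotL * att * l.color.x;
            v.baked.y += nDotL * att * l.color.y;
            v.baked.z += nDotL * att * l.color.z;
        }
    }
}
```

The baked value would be stored in the vertex colour and added to the dynamic lighting in the shader, so only the moving lights remain in the per-pixel loop.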

This topic is closed to new replies.
