Poor Performance with Tiled Shading, Problem with Bounding Test

5 comments, last by BlackBrain 9 years, 7 months ago

Hello.

I am trying to add Tiled Deferred Shading to my engine. I am only at the beginning of it right now; however, I already have some serious problems.

The first problem is poor performance. With my old Light Pre-Pass renderer I get about 270-300 fps, but with this approach and only one point light I get about 105 fps, even though it has no specular reflection yet. In my tests I found that the bottleneck is the dispatch of the composite shader. This is the final shader, which uses the G-buffer and the per-tile light information to shade individual pixels. This is the shader I am talking about:


cbuffer Globals
{
	int GroupCountX;
	float FarPlane;
	float4 FarPlaneCorners[4];
	int Width;
	int Height;
	int LightCount;
};


struct LightShadingInfo
{
	float3 ViewSpacePos;

	/// Light's direction in view space
	float3 Direction;


	/// used for SpotLights Only
	float CosTheta;

	/// 0=>Directional ,  1=> PointLight ,  2=> SpotLight
	int LightType;

	float Range;
};

struct LightIndicesStruct
{
	int indices[MAXIMUM_PER_TILE];
};

Texture2D NormalSmoothness;
Texture2D DepthBuffer;//view-space, unnormalized depth in the [NearPlane, FarPlane] range
Texture2D Albedo;
Texture2D SpecularColor;

StructuredBuffer<LightShadingInfo> LightShadingBuffer;
StructuredBuffer<LightIndicesStruct> LightIndicesBuffer;

RWTexture2D<float4> Output;//HDR output Accumulation Buffer



float3 ShadeDirectionalLight(LightShadingInfo LightInfo, float3 Normal, float3 DiffuseColor, float3 ReflectiveColor)
{
	float NdotL = saturate(dot(Normal, LightInfo.Direction));

	return NdotL*DiffuseColor;
}

float3 ShadePointLight(LightShadingInfo LightInfo, float3 ViewSpacePos, float3 Normal, float3 DiffuseColor, float3 ReflectiveColor)
{
	float3 LightVector = LightInfo.ViewSpacePos - ViewSpacePos;

	float atten = 1 - saturate(length(LightVector) / LightInfo.Range);
	atten *= atten;

	LightVector = normalize(LightVector);

	float NdotL = saturate(dot(Normal, LightVector));

	return NdotL*DiffuseColor*atten;
}

float3 GetViewSpacePos(int2 Position, float Depth)
{
	/* Layout Of FarPlane Corners

	[0] ---- [1]
	----------
	----------
	[2] ---- [3]

	*/

	//Use Bilinear Filtering to Get Correct FarPlane Pos
	float XLerp = ((float)(Position.x)) / Width;
	float YLerp = ((float)(Position.y)) / Height;

	float3 Upper = lerp(FarPlaneCorners[0].xyz, FarPlaneCorners[1].xyz, XLerp);
	float3 Lower = lerp(FarPlaneCorners[2].xyz, FarPlaneCorners[3].xyz, XLerp);

	float3 ToFarPlane = lerp(Upper, Lower, YLerp);

	return (Depth / FarPlane)*ToFarPlane;

}

groupshared LightIndicesStruct indicesBuffer;

[numthreads(32, 32, 1)]
void main(int3 dispathThreadId:SV_DispatchThreadID, int3 groupId : SV_GroupID,int3 groupThreadId:SV_GroupThreadID)
{
	if (groupThreadId.x == 0 && groupThreadId.y==0)
		indicesBuffer = LightIndicesBuffer[groupId.y * GroupCountX + groupId.x];

	GroupMemoryBarrierWithGroupSync();//wait for all threads

	float depth = DepthBuffer[dispathThreadId.xy].r;
	float4 normalSmoothness = NormalSmoothness[dispathThreadId.xy];
	float3 albedo = Albedo[dispathThreadId.xy].rgb;
	float3 specColor = SpecularColor[dispathThreadId.xy].rgb;

	float3 ViewSpacePos = GetViewSpacePos(dispathThreadId.xy, depth);

	// we have the initial variables we needed let's shade !
	int index = 0;
	//const LightIndicesStruct indicesBuffer = LightIndicesBuffer[groupId.y * GroupCountX + groupId.x];
	int LightIndex = indicesBuffer.indices[index];

	float3 color = float3(0,0,0);

	while (LightIndex != -1 && index<LightCount)
	{
		const LightShadingInfo LightInfo = LightShadingBuffer[LightIndex];
		[branch]
		switch (LightInfo.LightType)
		{
		case 0:
			color += ShadeDirectionalLight(LightInfo, normalSmoothness.xyz, albedo, specColor);
			break;
		case 1:
			color +=  ShadePointLight(LightInfo, ViewSpacePos, normalSmoothness.xyz, albedo, specColor);
			break;

		}

		index++;
		LightIndex = indicesBuffer.indices[index];
	}

	Output[dispathThreadId.xy] = float4(color, 1.0f);
}

technique11 Tech0
{
	pass P0
	{
		SetVertexShader(NULL);
		SetPixelShader(NULL);
		SetComputeShader(CompileShader(cs_5_0, main()));
	}
} 

Can you please tell me how I can further optimize this shader?

Another problem I am facing is checking whether a light affects a tile or not. Currently I have point lights that have bounds.

I create an AABB for the light in world space and transform it to view space and clip space to do my tests. It works about 80% of the time, but when the camera is near the light, or perhaps within the light's range, I sometimes get wrong results.

This is how I am doing it:


public void GetBoundingInfo(Camera cam,out LightBoundInfo BoundInfo)
        {
            Vector3 Center = Owner.Position;
            Vector3 min = new Vector3(Center.X - Range , Center.Y - Range , Center.Z - Range );
            Vector3 max = new Vector3(Center.X + Range , Center.Y + Range , Center.Z + Range );

            BoundingBox BoxInWorld = new BoundingBox(min,max);

            BoundingBox BoxInViewSpace = Utility.MathUtility.TransformBox(BoxInWorld, cam.View);

            BoundingBox BoxInClipSpace = Utility.MathUtility.TransformBox(BoxInWorld, cam.ViewProjection);

            Vector2 MinMaxZ = new Vector2(BoxInViewSpace.Minimum.Z, BoxInViewSpace.Maximum.Z);

            Vector2 MinClipSpace = new Vector2(Math.Max(BoxInClipSpace.Minimum.X, -1.0f), Math.Max(BoxInClipSpace.Minimum.Y,-1.0f));

            Vector2 MaxClipSpace = new Vector2(Math.Min(BoxInClipSpace.Maximum.X, 1.0f), Math.Min(BoxInClipSpace.Maximum.Y,1.0f));

            MinClipSpace.X = ((MinClipSpace.X / 2.0f) + 0.5f) * cam.TargetBuffer.width;
            MaxClipSpace.X = ((MaxClipSpace.X / 2.0f) + 0.5f) * cam.TargetBuffer.width;

            MinClipSpace.Y = (1.0f-((MinClipSpace.Y / 2.0f) + 0.5f)) * cam.TargetBuffer.height;
            MaxClipSpace.Y = (1.0f-((MaxClipSpace.Y / 2.0f) + 0.5f)) * cam.TargetBuffer.height;

            float temp = MinClipSpace.Y;
            MinClipSpace.Y = MaxClipSpace.Y;
            MaxClipSpace.Y = temp;

            int width = (int)Math.Ceiling(MaxClipSpace.X - MinClipSpace.X);
            int height = (int)Math.Ceiling(MaxClipSpace.Y - MinClipSpace.Y);

            BoundInfo = new LightBoundInfo((int)MinClipSpace.X, width, (int)MinClipSpace.Y, height, MinMaxZ);
        }

///

public static BoundingBox TransformBox(BoundingBox box,Matrix matrix)
        {
            Vector3[] CornerPoints = box.GetCorners();
            Vector3 min = new Vector3(float.MaxValue);
            Vector3 max = new Vector3(float.MinValue);

            for (int i = 0; i < CornerPoints.Length; i++)
            {
                Vector4 transformed = Vector3.Transform(CornerPoints[i], matrix);
                Vector3 Vec3 = new Vector3(transformed.X / transformed.W, transformed.Y / transformed.W, transformed.Z / transformed.W);
                min = Vector3.Min(Vec3,min);
                max = Vector3.Max(Vec3, max);
            }

            return new BoundingBox(min, max);
        } 

And the shader that does the tests:


cbuffer Globals
{
	int LightCount;//number of lights that we currently have for this frame
	int Width;
	int Height;
	int GroupCountX;
};

struct LightBoundInfo
{
	float2 MinMaxZ;//Minimum and Maximum View Space Z
	float4 BoundingRect; //(x,y,width,height)

	bool AlwaysVisible;
};

struct LightIndicesStruct
{
	int indices[MAXIMUM_PER_TILE]; // MAXIMUM_PER_TILE is a define
};

StructuredBuffer<LightBoundInfo> LightsBoundInfos;
Texture2D DepthBufferMinMaxZ;
RWStructuredBuffer<LightIndicesStruct> Output;

bool RectCheck(float4 rect1,float4 rect2)
{
	return (rect1.x<=min(Width, rect2.x + rect2.z)) && (min(rect1.x + rect1.z, Width)>=rect2.x) && (rect1.y<=min(Height, rect2.y + rect2.w)) && (min(rect1.y + rect1.w, Height)>=rect2.y);
}


groupshared uint LastIndex = 0;
groupshared float2 GroupMinMaxZ;

void AddToIndices(int LightIndex,int2 groupId)
{
	int index;
	InterlockedAdd(LastIndex, 1, index);
	Output[groupId.y*GroupCountX + groupId.x].indices[index] = LightIndex;
}

void EndIndices(int2 groupId)
{
	int index;
	InterlockedAdd(LastIndex, 1, index);
	Output[groupId.y*GroupCountX + groupId.x].indices[index] = -1;
}

void ProcessLight(int LightIndex, float4 RectTile,int2 groupId)
{
	if (LightIndex < LightCount)
	{// we have to process this light , let's see if we need it in this tile

		LightBoundInfo info = LightsBoundInfos[LightIndex];

		if (info.AlwaysVisible)//it's directional light we need this
			AddToIndices(LightIndex, groupId);
		else
		{//it's not directional

			bool ZReject = (info.MinMaxZ.x>GroupMinMaxZ.y) || (info.MinMaxZ.y<GroupMinMaxZ.x);

			
			if (!ZReject && RectCheck(RectTile, info.BoundingRect))
			{// this light has an effect on this tile
				AddToIndices(LightIndex, groupId);
			}

		}

	}
}

[numthreads(1024,1,1)]
void main(int3 dispathThreadId:SV_DispatchThreadID,int3 groupID:SV_GroupID , int3 groupThreadId:SV_GroupThreadId) //supports 2048 lights at maximum now , executed per tile
{
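	//first thread reads MinMaxZ for the group, so the other threads do not have to read the texture again
	if (groupThreadId.x == 0)
	{
		GroupMinMaxZ = DepthBufferMinMaxZ[groupID.xy].xy;
		LastIndex = 0;//initializers on groupshared variables are not reliably applied, so reset the counter here
	}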

	GroupMemoryBarrierWithGroupSync();//wait for all threads

	ProcessLight(groupThreadId.x, float4(groupID.xy * 32, 32, 32), groupID.xy);
	ProcessLight(groupThreadId.x + 1024, float4(groupID.xy * 32, 32, 32), groupID.xy);

	GroupMemoryBarrierWithGroupSync();//wait for all threads
	
	if (groupThreadId.x == 0)//set -1 to LightIndices , it's a sign for end of lights
		EndIndices(groupID.xy);
}

technique11 Tech0
{
	pass P0
	{
		SetVertexShader(NULL);
		SetPixelShader(NULL);
		SetComputeShader(CompileShader(cs_5_0, main()));
	}
} 

ZReject is working properly as far as I can see.

Thanks for your help in advance.


For the second problem I did this: if the camera is inside the light's bounding sphere, the light covers the whole screen, so the code changes to:


 public void GetBoundingInfo(Camera cam, out LightBoundInfo BoundInfo)
        {
            Vector3 Center = Owner.Position;

            BoundingSphere sphere = GetBoundingShape();

            Vector3 min = new Vector3(Center.X - Range, Center.Y - Range, Center.Z - Range);
            Vector3 max = new Vector3(Center.X + Range, Center.Y + Range, Center.Z + Range);

            BoundingBox BoxInWorld = new BoundingBox(min, max);

            BoundingBox BoxInViewSpace = Utility.MathUtility.TransformBox(BoxInWorld, cam.View);

            Vector2 MinMaxZ = new Vector2(BoxInViewSpace.Minimum.Z, BoxInViewSpace.Maximum.Z);

            if (sphere.Contains(ref cam.Owner.Position)==ContainmentType.Contains)
            {//it occupies the whole screen
                BoundInfo = new LightBoundInfo(0, cam.TargetBuffer.width, 0, cam.TargetBuffer.height, MinMaxZ);
            }
            else
            {

                BoundingBox BoxInClipSpace = Utility.MathUtility.TransformBox(BoxInWorld, cam.ViewProjection);

                Vector2 MinClipSpace = new Vector2(Math.Max(BoxInClipSpace.Minimum.X, -1.0f), Math.Max(BoxInClipSpace.Minimum.Y, -1.0f));

                Vector2 MaxClipSpace = new Vector2(Math.Min(BoxInClipSpace.Maximum.X, 1.0f), Math.Min(BoxInClipSpace.Maximum.Y, 1.0f));

                MinClipSpace.X = ((MinClipSpace.X / 2.0f) + 0.5f) * cam.TargetBuffer.width;
                MaxClipSpace.X = ((MaxClipSpace.X / 2.0f) + 0.5f) * cam.TargetBuffer.width;

                MinClipSpace.Y = (1.0f - ((MinClipSpace.Y / 2.0f) + 0.5f)) * cam.TargetBuffer.height;
                MaxClipSpace.Y = (1.0f - ((MaxClipSpace.Y / 2.0f) + 0.5f)) * cam.TargetBuffer.height;

                float temp = MinClipSpace.Y;
                MinClipSpace.Y = MaxClipSpace.Y;
                MaxClipSpace.Y = temp;

                int width = (int)Math.Ceiling(MaxClipSpace.X - MinClipSpace.X);
                int height = (int)Math.Ceiling(MaxClipSpace.Y - MinClipSpace.Y);

                BoundInfo = new LightBoundInfo((int)MinClipSpace.X, width, (int)MinClipSpace.Y, height, MinMaxZ);
            }
        }

This seems to work as it should.
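
The camera-inside-sphere check above covers the common case. A more general guard, offered here as an assumption about the cause rather than something confirmed in this thread, is to look at the clip-space W of each box corner before dividing. A corner at or behind the camera plane has W <= 0, and dividing by it mirrors the point, which can shrink or flip the rectangle even when the camera is outside the sphere. A minimal CPU-side sketch in C++ (the types and the name `ConservativeNdcRect` are hypothetical and not part of the engine above):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <limits>

struct Vec4 { float x, y, z, w; };

// Bounds in normalized device coordinates, each axis in [-1, 1].
struct NdcRect { float minX, minY, maxX, maxY; };

// Hypothetical helper. Input: the 8 box corners multiplied by ViewProjection,
// before any divide by W. Output: a conservative rectangle in NDC.
NdcRect ConservativeNdcRect(const std::array<Vec4, 8>& clipCorners)
{
    const float kMinW = 1e-5f; // assumption: W at or below this counts as "behind the camera"
    const float big = std::numeric_limits<float>::max();
    NdcRect rect = { big, big, -big, -big };

    for (const Vec4& c : clipCorners)
    {
        if (c.w <= kMinW)
        {
            // The perspective divide is meaningless for this corner,
            // so fall back to covering the whole screen.
            return NdcRect{ -1.0f, -1.0f, 1.0f, 1.0f };
        }
        rect.minX = std::min(rect.minX, c.x / c.w);
        rect.minY = std::min(rect.minY, c.y / c.w);
        rect.maxX = std::max(rect.maxX, c.x / c.w);
        rect.maxY = std::max(rect.maxY, c.y / c.w);
    }

    // Clamp to the visible range, as the original code does.
    rect.minX = std::max(rect.minX, -1.0f);
    rect.minY = std::max(rect.minY, -1.0f);
    rect.maxX = std::min(rect.maxX, 1.0f);
    rect.maxY = std::min(rect.maxY, 1.0f);
    return rect;
}
```

Falling back to the full screen is wasteful for lights that are entirely behind the camera, but the depth range test still rejects those per tile.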

Can nobody help me? Please, I need guidance, mainly on optimizing this.

I am sorry, but I only skimmed your shader code.

The first thing I noticed is that synchronizing threads is usually very costly. In your case, one thread in each tile copies the tile's light index struct while all the others sit idle. Instead, let every thread copy a portion of the memory.
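
To illustrate the "every thread copies a portion" idea, here is a CPU-side sketch in C++ that simulates one thread group serially. The function name and the use of `std::vector` are purely illustrative. In the shader, each thread would run the inner loop once with its own flattened thread index and then reach a single group barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for one thread group: "thread" t copies elements
// t, t + groupSize, t + 2 * groupSize, ... so the work is spread evenly.
void CooperativeCopy(const std::vector<int>& tileIndices,
                     std::vector<int>& sharedCopy,
                     std::size_t groupSize)
{
    assert(groupSize > 0);
    sharedCopy.resize(tileIndices.size());

    // On the GPU all threads run this body at the same time; here they run one after another.
    for (std::size_t threadIndex = 0; threadIndex < groupSize; ++threadIndex)
    {
        for (std::size_t i = threadIndex; i < tileIndices.size(); i += groupSize)
        {
            sharedCopy[i] = tileIndices[i];
        }
    }
    // A single group barrier would follow here before any thread reads sharedCopy.
}
```

With a 32x32 group (1024 threads) and a list shorter than 1024 entries, each thread copies at most one element instead of one thread copying all of them.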

One point light is very close to the worst case for tile-based shading. Tile-based rendering is all about culling light sources, and it therefore introduces overhead for computing the light indices of each tile. Again, when you sync your threads, the whole work group waits for a single thread to copy the indices of one point light, which is not what you want.

Furthermore, get rid of the indirection in your while loop. Use shared memory to hold the LightInfo itself rather than the indices into the light info; that is much more cache efficient.

Also, the way you iterate over the tile's indices is rather inefficient with regard to branch divergence. It may help to separate the point lights and spot lights so that you do not branch inside the loop. Terminating the loop by checking against the tile's light count, rather than by looking at the next light index, might help as well.
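
A rough CPU-side sketch of that layout in C++: the lights are split by type once, and each type then gets its own loop that is bounded by a count and contains no switch. All names and the placeholder shading terms are hypothetical, and directional lights are simply left out of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum LightType { Directional = 0, Point = 1, Spot = 2 };

struct Light { LightType type; float intensity; };

// One list per light type, so each shading loop handles a single type.
struct TileLights
{
    std::vector<Light> pointLights;
    std::vector<Light> spotLights;
};

TileLights SplitByType(const std::vector<Light>& lights)
{
    TileLights tile;
    for (const Light& light : lights)
    {
        if (light.type == Point) tile.pointLights.push_back(light);
        else if (light.type == Spot) tile.spotLights.push_back(light);
        // Directional lights are ignored in this sketch.
    }
    return tile;
}

// Placeholder shading terms, standing in for the real lighting math.
float ShadePoint(const Light& light) { return light.intensity; }
float ShadeSpot(const Light& light) { return 0.5f * light.intensity; }

float ShadeTile(const TileLights& tile)
{
    float color = 0.0f;
    // Each loop ends on a known count and has no per-light type branch.
    for (std::size_t i = 0; i < tile.pointLights.size(); ++i)
        color += ShadePoint(tile.pointLights[i]);
    for (std::size_t i = 0; i < tile.spotLights.size(); ++i)
        color += ShadeSpot(tile.spotLights[i]);
    return color;
}
```

In a shader the two lists could also live in one array with the point lights first and the spot lights after them, plus two counts.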

All of this is rather vague, I guess, but without knowing the whole pipeline it is all I can offer.

How is the performance when you add a few thousand more light sources? Did you profile the computation time of each step in the algorithm? My implementation (clustered deferred) handles 40-50 thousand point lights at a smooth 60 fps (on a GTX 770). It is in OpenGL, but if you want I can supply you with some code (I am also using an LBVH to do the clip test).

Hopefully this is at least some useful advice.

Thanks for your reply, it was really helpful. After I posted this on the forum I kept working on it. First of all, I now do everything in one shader: determining the minimum and maximum depth, light culling, and so on all happen in one shader and thus in one dispatch call.

I now use the tile frustum to cull lights. And, as you suggested, I keep the LightInfo itself in group shared memory (not the indices).
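
For reference, the sphere-versus-tile-frustum test can be sketched on the CPU in C++ as below. The convention here is an assumption and may differ from the shader's sign convention: every side plane passes through the view-space origin and its unit normal points into the frustum. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Side plane of a tile frustum. It passes through the eye (the view-space origin),
// and the unit-length normal points into the frustum.
struct SidePlane { Vec3 normal; };

// Returns false only when the sphere is completely outside at least one side plane.
// This is conservative: spheres near a frustum edge can pass without really touching it.
bool SphereTouchesTileFrustum(const Vec3& center, float radius, const SidePlane planes[4])
{
    assert(radius >= 0.0f);
    for (int i = 0; i < 4; ++i)
    {
        // Signed distance from the sphere center to the plane; negative means outside.
        const float distance = Dot(planes[i].normal, center);
        if (distance < -radius)
            return false;
    }
    return true;
}
```

The depth range test against the tile's minimum and maximum Z then takes the place of the near and far planes.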

Performance is much better than before. In the Sponza scene I get 45-60 fps with 512 point lights; however, I think it must be possible to get better performance with tiled shading. In case knowing how I do things on the CPU helps, here it is:

When the tiled shading renderer is called, it draws the geometry to the G-buffer on the main thread while, in the meantime, another thread frustum-culls the lights. After culling, a LightShadingInfo[] array is filled with the visible lights (also on the second thread). When drawing has finished, the main thread waits for the second thread's job to complete. It then updates LightShadingBuffer (a CPU-writable dynamic StructuredBuffer) and dispatches the final shader:


cbuffer Globals
{
	int GroupCountX;
	float FarPlane;
	float4 FarPlaneCorners[4];
	int Width;
	int Height;
	int LightCount;
	bool ShowTileLightCount;
};

#define ThreadSize 32

struct LightShadingInfo
{
	float3 ViewSpacePos;

	/// Light's direction in view space
	float3 Direction;


	/// used for SpotLights Only
	float CosTheta;

	/// 0=>Directional ,  1=> PointLight ,  2=> SpotLight
	int LightType;

	float Range;

	float3 Color;
};

Texture2D NormalSmoothness;
Texture2D DepthBuffer;//view-space, unnormalized depth in the [NearPlane, FarPlane] range
Texture2D Albedo;
Texture2D SpecularColor;

StructuredBuffer<LightShadingInfo> LightShadingBuffer;


RWTexture2D<float4> Output;//HDR output Accumulation Buffer

// fills output with 4 side tile planes,
void ConstructTileFrustumPlanes(uint4 TileRect,out float4 Planes[4])
{
	//Use Bilinear Filtering to Get Correct FarPlane Pos
	float XLerp = ((float)(TileRect.x)) / Width;
	float YLerp = ((float)(TileRect.y)) / Height;

	float3 Upper = lerp(FarPlaneCorners[0].xyz, FarPlaneCorners[1].xyz, XLerp);
	float3 Lower = lerp(FarPlaneCorners[2].xyz, FarPlaneCorners[3].xyz, XLerp);

	float3 p00 = lerp(Upper, Lower, YLerp);


	//point p10
	XLerp = ((float)(TileRect.x+TileRect.z)) / Width;
	YLerp = ((float)(TileRect.y)) / Height;

	Upper = lerp(FarPlaneCorners[0].xyz, FarPlaneCorners[1].xyz, XLerp);
	Lower = lerp(FarPlaneCorners[2].xyz, FarPlaneCorners[3].xyz, XLerp);

	float3 p10 = lerp(Upper, Lower, YLerp);

	//point p01
	XLerp = ((float)(TileRect.x)) / Width;
	YLerp = ((float)(TileRect.y + TileRect.w)) / Height;

	Upper = lerp(FarPlaneCorners[0].xyz, FarPlaneCorners[1].xyz, XLerp);
	Lower = lerp(FarPlaneCorners[2].xyz, FarPlaneCorners[3].xyz, XLerp);

	float3 p01 = lerp(Upper, Lower, YLerp);

	//point p11
	XLerp = ((float)(TileRect.x + TileRect.z)) / Width;
	YLerp = ((float)(TileRect.y + TileRect.w)) / Height;

	Upper = lerp(FarPlaneCorners[0].xyz, FarPlaneCorners[1].xyz, XLerp);
	Lower = lerp(FarPlaneCorners[2].xyz, FarPlaneCorners[3].xyz, XLerp);

	float3 p11 = lerp(Upper, Lower, YLerp);

	float3 n0 = cross(p01, p00);
	n0 = normalize(n0);

	float3 n1 = cross(p00, p10);
	n1 = normalize(n1);

	float3 n2 = cross(p10, p11);
	n2 = normalize(n2);

	float3 n3 = cross(p11, p01);
	n3 = normalize(n3);

	Planes[0] = float4(n0,dot(n0,p00));
	Planes[1] = float4(n1, dot(n1, p00));
	Planes[2] = float4(n2, dot(n2, p10));
	Planes[3] = float4(n3, dot(n3, p11));
}


bool ProcessLight(int LightIndex, float2 GroupMinMaxZ,float4 Planes[4])
{
	if (LightIndex >= LightCount)
		return false;
	
	LightShadingInfo info = LightShadingBuffer[LightIndex];
	
	// true when the light's depth range overlaps the tile's [min, max] depth range
	bool ZOverlap = (info.ViewSpacePos.z + info.Range >= GroupMinMaxZ.x) & (info.ViewSpacePos.z - info.Range <= GroupMinMaxZ.y);

	bool Condition = ZOverlap;

	for (int i = 0; i < 4; i++)
	{
		Condition = Condition & ((dot(Planes[i].xyz, info.ViewSpacePos.xyz) - (Planes[i].w + info.Range))<=0);
	}

	return Condition;
}

float3 ShadeDirectionalLight(LightShadingInfo LightInfo, float3 Normal, float3 DiffuseColor, float3 ReflectiveColor)
{
	float NdotL = saturate(dot(Normal, LightInfo.Direction));

	return NdotL*DiffuseColor*LightInfo.Color;
}

float3 ShadePointLight(LightShadingInfo LightInfo, float3 ViewSpacePos, float3 Normal, float3 DiffuseColor, float3 ReflectiveColor)
{
	float3 LightVector = LightInfo.ViewSpacePos - ViewSpacePos;

	float atten = 1 - saturate(length(LightVector) / LightInfo.Range);
	atten *= atten;

	[branch]
	if (atten == 0)
		return float3(0,0,0);

	LightVector = normalize(LightVector);

	float NdotL = saturate(dot(Normal, LightVector));

	return NdotL*DiffuseColor*atten*LightInfo.Color;
}

float3 GetViewSpacePos(int2 Position, float Depth)
{
	/* Layout Of FarPlane Corners

	[0] ---- [1]
	----------
	----------
	[2] ---- [3]

	*/

	//Use Bilinear Filtering to Get Correct FarPlane Pos
	float XLerp = ((float)(Position.x)) / Width;
	float YLerp = ((float)(Position.y)) / Height;

	float3 Upper = lerp(FarPlaneCorners[0].xyz, FarPlaneCorners[1].xyz, XLerp);
	float3 Lower = lerp(FarPlaneCorners[2].xyz, FarPlaneCorners[3].xyz, XLerp);

	float3 ToFarPlane = lerp(Upper, Lower, YLerp);

	return (Depth / FarPlane)*ToFarPlane;

}


groupshared LightShadingInfo ShadeInfoCache[256];

groupshared uint MinIntZ = 0xffffffff;
groupshared uint MaxIntZ = 0;

//groupshared uint LightIndices[512];

groupshared uint LastIndex = 0;

groupshared float4 Planes[4];

void AddToIndices(int LightIndex)
{
	uint index;
	InterlockedAdd(LastIndex, 1, index);
	// ShadeInfoCache holds 256 entries; drop the light if more than that pass the culling test
	if (index < 256)
		ShadeInfoCache[index] = LightShadingBuffer[LightIndex];
}

[numthreads(ThreadSize, ThreadSize, 1)]
void main(int3 dispathThreadId:SV_DispatchThreadID, int3 groupId : SV_GroupID, int3 groupThreadId : SV_GroupThreadID)
{
	int LightIndex = groupThreadId.y*ThreadSize + groupThreadId.x;

	// Initializers on groupshared variables are not reliably applied, so the first thread resets the shared state explicitly
	if (LightIndex == 0)
	{
		MinIntZ = 0xffffffff;
		MaxIntZ = 0;
		LastIndex = 0;
	}
	GroupMemoryBarrierWithGroupSync();

	/* Reading From G-Buffer */
	float depth = DepthBuffer[dispathThreadId.xy].r;
	float3 albedo = Albedo[dispathThreadId.xy].rgb;
	float4 normalSmoothness = NormalSmoothness[dispathThreadId.xy];
	float3 specColor = SpecularColor[dispathThreadId.xy].rgb;

	float3 ViewSpacePos = GetViewSpacePos(dispathThreadId.xy, depth);

	/* Let's Determine MinMaxZ For EachTile */
	InterlockedMin(MinIntZ, asuint(depth));
	InterlockedMax(MaxIntZ, asuint(depth));

	// Wait for all threads to do their job
	GroupMemoryBarrierWithGroupSync();

	float2 GroupMinMaxZ = float2(asfloat(MinIntZ), asfloat(MaxIntZ));

	/* Let's see what lights affect this tile */

	if (LightIndex == 0)//if it's first thread
	{
		ConstructTileFrustumPlanes(float4(groupId.xy*ThreadSize, ThreadSize, ThreadSize), Planes);
	}

	// Wait for all threads to do their job
	GroupMemoryBarrierWithGroupSync();

	while (LightIndex < LightCount)
	{
		[branch]
		if (ProcessLight(LightIndex, GroupMinMaxZ,Planes))
		{//this light has an effect on this tile, so we add it to the shared light cache
			AddToIndices(LightIndex);
		}
		LightIndex += ThreadSize*ThreadSize;
	}

	// Wait for all threads to do their job
	GroupMemoryBarrierWithGroupSync();

	// Clamp to the cache capacity so the shading loop never reads past ShadeInfoCache
	int TileLightCount = min(LastIndex, 256u);

	// we have the initial variables we needed let's shade !
	int index = 0;

	float3 color = float3(0, 0, 0);


	[loop]
	while (index < TileLightCount)
	{
		//LightIndex = LightIndices[index++];
		LightShadingInfo LightInfo = ShadeInfoCache[index++];
		//	LightShadingInfo LightInfo = ShadingCache[index++];
		/*
		[branch]
		switch (LightInfo.LightType)
		{
		case 0:

			color += ShadeDirectionalLight(LightInfo, normalSmoothness.xyz, albedo, specColor);

			break;
		case 1:
		*/
			color += ShadePointLight(LightInfo, ViewSpacePos, normalSmoothness.xyz, albedo, specColor);

			//break;

		//}

	}

	if (ShowTileLightCount)
	{
		Output[dispathThreadId.xy] = float4(TileLightCount / 256.0f, TileLightCount / 256.0f, TileLightCount / 256.0f, 1.0f);
	}
	else
		Output[dispathThreadId.xy] = float4(color, 1.0f);
}

technique11 Tech0
{
	pass P0
	{
		SetVertexShader(NULL);
		SetPixelShader(NULL);
		SetComputeShader(CompileShader(cs_5_0, main()));
	}
}
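
A side note on the InterlockedMin/InterlockedMax calls on asuint(depth) in the shader above: they rely on the fact that, for non-negative IEEE 754 floats, the unsigned bit patterns sort in the same order as the float values. That holds for this view-space depth, which lies in [NearPlane, FarPlane]. A small CPU-side check in C++ (the function names are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as an unsigned integer (what asuint does in HLSL).
std::uint32_t FloatBits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Reinterpret an unsigned integer as a float (what asfloat does in HLSL).
float BitsToFloat(std::uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

struct DepthRange { float minZ, maxZ; };

// Serial stand-in for the per-tile atomic min/max over all pixel depths.
DepthRange TileDepthRange(const float* depths, int count)
{
    assert(count > 0);
    std::uint32_t minBits = 0xffffffffu;
    std::uint32_t maxBits = 0u;
    for (int i = 0; i < count; ++i)
    {
        assert(depths[i] >= 0.0f); // the ordering trick only holds for non-negative floats
        const std::uint32_t bits = FloatBits(depths[i]);
        minBits = std::min(minBits, bits);
        maxBits = std::max(maxBits, bits);
    }
    return DepthRange{ BitsToFloat(minBits), BitsToFloat(maxBits) };
}
```

If the depth could ever be negative, the comparison would need a different encoding, since negative floats sort in reverse order as unsigned integers.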

I also removed the branching in the final loop. What do you suggest for supporting spot lights now? Should I create separate buffers in this shader and have two final loops, or should I create another shader for spot lights and thus have two dispatches? I would like to go with the first option, but because I now store LightShadingInfo rather than indices, I am worried that it might exceed the limit on group shared memory per tile.
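
As a rough budget check for the group shared memory question: cs_5_0 allows 32 KB of group shared memory per thread group. Assuming the LightShadingInfo struct above is stored as 12 tightly packed 32-bit scalars (an assumption about the layout), the arithmetic can be written out in C++ as follows. The constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Scalars in LightShadingInfo: ViewSpacePos(3) + Direction(3) + CosTheta(1)
// + LightType(1) + Range(1) + Color(3).
constexpr std::size_t kLightInfoScalars = 3 + 3 + 1 + 1 + 1 + 3;
constexpr std::size_t kBytesPerScalar = 4;
constexpr std::size_t kLightInfoBytes = kLightInfoScalars * kBytesPerScalar;

// Group shared memory limit for cs_5_0, per thread group.
constexpr std::size_t kGroupSharedLimitBytes = 32 * 1024;

// Bytes used by cacheCount caches of lightsPerCache entries each.
constexpr std::size_t CacheBytes(std::size_t lightsPerCache, std::size_t cacheCount)
{
    return lightsPerCache * cacheCount * kLightInfoBytes;
}

// One 256-entry cache, as in the shader above.
static_assert(CacheBytes(256, 1) == 12288, "one cache is 12 KB");

// Two 256-entry caches (point and spot) still fit under the limit,
// with room left for the planes and the counters.
static_assert(CacheBytes(256, 2) <= kGroupSharedLimitBytes, "two caches fit in 32 KB");
```

Under these assumptions two 256-entry caches take 24 KB, so the first option fits. Heavier use of group shared memory can reduce how many groups run concurrently, so it is worth measuring both variants.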

Without dispatching the final shader (only drawing to the G-buffer and filling LightShadingInfo[]), I get about 200 fps.

My GPU is a GeForce GT 636M.

Again, thanks. If you could share your code, that would be awesome and helpful.

I don't know if I understood you correctly, but it sounds like you are performing the light culling on the CPU?

If so, don't do that anymore, since you can use a compute shader. Doing the culling on the GPU is a lot faster, especially for high light counts.

In a compute shader you could compute an AABB for each light and then project it into screen space, resulting in one culling rectangle per light source.

After that you have to build the tile indices for each light. There are several ways to do this; the easiest would be a compute shader with one thread per tile that iterates over all culling rectangles and performs the test. If the test passes, save the light's index in a texture buffer.
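
The one-thread-per-tile variant can be sketched on the CPU in C++ like this, with the two outer loops standing in for the dispatch and each iteration standing in for one thread. The `Rect` type, the function names, and the use of `std::vector` for the per-tile lists are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Screen-space rectangle in pixels.
struct Rect { int x, y, width, height; };

// Two rectangles overlap when they overlap on both axes.
bool Overlaps(const Rect& a, const Rect& b)
{
    return a.x < b.x + b.width  && b.x < a.x + a.width &&
           a.y < b.y + b.height && b.y < a.y + a.height;
}

// Returns, for each tile (row-major), the indices of the lights whose culling rectangle touches it.
std::vector<std::vector<int>> BuildTileLightLists(const std::vector<Rect>& lightRects,
                                                  int screenWidth, int screenHeight, int tileSize)
{
    assert(tileSize > 0);
    const int tilesX = (screenWidth + tileSize - 1) / tileSize;
    const int tilesY = (screenHeight + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> tileLists(static_cast<std::size_t>(tilesX) * tilesY);

    for (int ty = 0; ty < tilesY; ++ty)
    {
        for (int tx = 0; tx < tilesX; ++tx)
        {
            // One GPU thread would handle exactly this tile.
            const Rect tile = { tx * tileSize, ty * tileSize, tileSize, tileSize };
            for (std::size_t i = 0; i < lightRects.size(); ++i)
            {
                if (Overlaps(tile, lightRects[i]))
                    tileLists[static_cast<std::size_t>(ty) * tilesX + tx].push_back(static_cast<int>(i));
            }
        }
    }
    return tileLists;
}
```

This is linear in the number of lights per tile, which is why the tree-based approach scales better for very high light counts.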

As I already said, I am using an LBVH to cull the light sources. That is a huge overhead, but it can still be done in 4-5 milliseconds (including BVH construction and traversal). I will briefly outline what I am doing every frame (all on the GPU):

1) Render to the GBuffer

2) Calculate the min/max for each tile

3) Calculate the AABBs in view space for each light (like above)

4) Construct an LBVH from the lights' AABBs (this requires assigning Morton codes to each AABB, sorting the AABBs accordingly, and then applying a fully parallel tree construction algorithm)

5) Traverse the tree to calculate the number of lights present in each tile

6) Resize the tile index buffer and calculate the offsets where the light indices of each tile start

7) Traverse the tree again and save the light indices to the tile index texture buffer

8) Render the lights using the GBuffer and the tile index texture (pretty much like you are doing)
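
Steps 5 to 7 amount to a count pass, an exclusive prefix sum, and a fill pass. A minimal serial sketch in C++, taking the per-tile lists as given instead of traversing a tree (the struct and function names are hypothetical, and a GPU version would use a parallel scan):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TileIndexBuffer
{
    std::vector<int> offsets; // offsets[t] = first slot of tile t, offsets.back() = total count
    std::vector<int> indices; // light indices of all tiles, stored back to back
};

TileIndexBuffer FlattenTileLists(const std::vector<std::vector<int>>& perTileLights)
{
    TileIndexBuffer result;
    result.offsets.resize(perTileLights.size() + 1);

    // Steps 5 and 6: count the lights per tile and turn the counts into start offsets
    // with an exclusive prefix sum.
    int running = 0;
    for (std::size_t t = 0; t < perTileLights.size(); ++t)
    {
        result.offsets[t] = running;
        running += static_cast<int>(perTileLights[t].size());
    }
    result.offsets.back() = running;

    // Step 6: size the index buffer. Step 7: write each tile's indices at its offset.
    result.indices.resize(static_cast<std::size_t>(running));
    for (std::size_t t = 0; t < perTileLights.size(); ++t)
    {
        int write = result.offsets[t];
        for (int lightIndex : perTileLights[t])
            result.indices[static_cast<std::size_t>(write++)] = lightIndex;
        assert(write == result.offsets[t + 1]);
    }
    return result;
}
```

During shading, tile t reads the entries from offsets[t] up to, but not including, offsets[t + 1].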

This is a whole other level compared to the linear probing you are currently doing, but this way you can apply lighting for light counts in the tens of thousands (don't be discouraged).

Actually, I wrote my bachelor's thesis on this topic not so long ago; you are welcome to read it (but it is written in German, so there might be a lot of Google Translate involved).

Some resources for LBVH construction and traversal:

http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/ (you will need a parallel sorting algorithm too; I implemented bitonic sort). The article is about real-time collision detection, but the technique applies to this lighting algorithm as well.
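
For orientation, a 30-bit 3D Morton code can be computed with the widely used bit-spreading trick below, along the lines of what the linked article describes. This is a C++ sketch rather than code taken from my source, and it assumes the AABB centers have already been normalized to the [0, 1] range on each axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Spreads the lower 10 bits of v so that bit i moves to bit 3 * i.
std::uint32_t ExpandBits(std::uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point with coordinates in [0, 1].
// Each axis is quantized to 10 bits and the bits are interleaved as x, y, z.
std::uint32_t Morton3D(float x, float y, float z)
{
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    const std::uint32_t xx = ExpandBits(static_cast<std::uint32_t>(x));
    const std::uint32_t yy = ExpandBits(static_cast<std::uint32_t>(y));
    const std::uint32_t zz = ExpandBits(static_cast<std::uint32_t>(z));
    return xx * 4 + yy * 2 + zz;
}
```

Sorting the lights by this code places lights that are close in space close together in the array, which is what the parallel tree construction relies on.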

http://jcgt.org/published/0002/01/03/paper.pdf (the stackless traversal algorithms are described quite nicely here)

I have uploaded my source code and thesis here:

(Sorry, the source is a mess because I maintained it very reluctantly; no guarantee whatsoever.)

Source: https://www.dropbox.com/sh/osccc1ynbgzqa09/AABreixi0dG8NrNJ8daORgfUa?dl=0

Thesis: https://www.dropbox.com/s/xuqsc678fm4ihyq/clustereddeferred.pdf?dl=0

Hopefully this helps.

I do the culling against the whole camera frustum on the CPU so that I can reject lights that will have no effect on the final result. The lights that are inside the camera frustum are then sent to the GPU, where the tile frustum culling is done (in the ProcessLight function of the last shader I posted).

Thanks for all this great help. I think I need to devote quite a lot of time to reading and understanding these.
