
Compute Shaders & Lighting - Performance



Hi guys.


So, I recently tried performing my lighting calculations in a compute shader. The visual results are exactly what they should be, but the performance of the whole pass is horrible, much worse than before, when I was using a simple full-screen quad. I was hoping you guys could spot what might be going so badly wrong.


The shader:

  • I tried stripping the shader down to just loading t_normals and writing its encoded value to the UAV, but the performance is still bad. When I load no texture at all and only encode and store float3(1,1,1), the performance is great, so the texture loads seem to be what causes the problem.
// GBUFFER and Shadow Map and Material Information
Texture2D t_normals : register(t0);
Texture2D t_position : register(t1);
Texture2D t_diffuse : register(t2);
Texture2D t_material : register(t3);
Texture2D t_smap : register(t4);

#include "BRDF.hlsli"

cbuffer FrameBuffer : register(b0)
{
	matrix view;
	matrix projection;
	matrix texture_transform;

	float3 cameraPosition;
	float pad0;
};

cbuffer ObjectBuffer : register(b1)
{
	float4 lColor;

	float3 lPosition;

	float lCutoff;
	float lRadius;
	float lIntensity;

	float2 pad1;
};

RWTexture2D<uint> tOutputRW : register(u0);
SamplerState ss;

// Encode a float3 color (range 0.0f-1.0f) into a packed uint mask.
uint EncodeColor(in float3 color)
{
	int3 iColor = int3(color*255.0f);
	uint colorMask = (iColor.r<<16) | (iColor.g<<8) | iColor.b;
	return colorMask;
}

// Decode specified mask into a float3 color (range 0.0f-1.0f).
float3 DecodeColor(in uint colorMask)
{
	float3 color;
	color.r = (colorMask>>16) & 0x000000ff;
	color.g = (colorMask>>8) & 0x000000ff;
	color.b = colorMask & 0x000000ff;
	color /= 255.0f;
	return color;
}

[numthreads(1, 1, 1)]
void CShader( uint3 DTid : SV_DispatchThreadID )
{
	// Get Normals and Prelit Factor
	float4 normal = t_normals[DTid.xy];
	float prelit = normal.a;

	// Get Position
	float4 position = t_position[DTid.xy];

	// View Space -> World Space
	position = mul(float4(position.xyz, 1), texture_transform);
	normal = mul(float4(normal.xyz, 0), texture_transform);

	// Get Diffuse
	float4 diffuse = t_diffuse[DTid.xy];

	// Get Material
	float4 material = t_material[DTid.xy];
	float mlit = 1.0f-prelit;
	float3 col = BRDFPointLight(normal.xyz, lColor, diffuse.xyz, material.xyz, material.a, position.xyz, lPosition.xyz, lRadius, cameraPosition)*lIntensity;

	// Special Post Processing Flag
	col = lerp(col, diffuse, mlit);

	// Buffer
	tOutputRW[DTid.xy] = EncodeColor(DecodeColor(tOutputRW[DTid.xy]) + col.xyz);
}

All of these point lights are accumulated additively into a buffer, which I later decode into a simple Texture2D with DecodeColor(...). The performance of that decode pass is good (according to Intel GPA), so it shouldn't be the problem.


As I said, the result is fine, but the performance is NOT good: it drops from 60fps to 25fps. (I can't get timings in micro/milliseconds because Intel GPA won't capture the time taken by my compute shader; I can find the dispatch in the capture, but the time reported is 0, which can't be right.)




Any ideas what could go wrong?


And you've reached the bottom, thanks for your time!


Edited by Migi0027

"[numthreads(1, 1, 1)]" is likely to be your problem: you are telling the GPU to dispatch thread groups with only 1 active thread in each, which means that on most GPUs you are idling 31 (NV) to 63 (AMD) threads per group, i.e. most of the ALU power.

The number of threads per group should be a multiple of 32 or 64, depending on the target hardware, and the number of thread groups you dispatch then needs to be reduced to match.
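To make the dispatch adjustment concrete, here is a rough host-side sketch in C++. It assumes the shader is changed to something like [numthreads(8, 8, 1)] (64 threads per group) and that the original code dispatched one group per pixel; neither is confirmed by the post. The names kTileSize, GroupCount, DispatchSize and LightingDispatchSize are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tile size. It must match the shader attribute,
// e.g. [numthreads(8, 8, 1)], which gives 8 * 8 = 64 threads per group.
const std::uint32_t kTileSize = 8;

// Number of thread groups needed to cover `pixels` along one axis.
// Rounds up so a partial tile at the edge is still covered.
inline std::uint32_t GroupCount(std::uint32_t pixels, std::uint32_t groupSize)
{
    assert(groupSize > 0);
    return (pixels + groupSize - 1) / groupSize;
}

// Arguments for a Dispatch(x, y, z) call.
struct DispatchSize
{
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// Group counts for a full-screen lighting pass over a width x height target.
inline DispatchSize LightingDispatchSize(std::uint32_t width, std::uint32_t height)
{
    DispatchSize size = { GroupCount(width, kTileSize),
                          GroupCount(height, kTileSize),
                          1 };
    return size;
}
```

The result would be passed to ID3D11DeviceContext::Dispatch(size.x, size.y, size.z). For a 1280x720 target that is 160 x 90 groups of 64 threads rather than 1280 x 720 groups of one thread. When the resolution is not a multiple of the tile size, the last row and column of groups overhang the target, so it is worth having the shader return early for any DTid.xy outside the texture.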
