satanir

HLSL compiler weird performance behavior


I have this animation demo, which takes quite a few seconds to load. I initially thought that it was related to the model, but apparently the culprit is the skinning VS, and specifically the bone matrices.

The VS looks like this:

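// gVPMat is used in VS() below but was not declared in the posted snippet.
// The compiled listing shows a 4-register cb0, so it presumably lives in a
// buffer like this one (the buffer name is assumed).
cbuffer cbPerFrame : register(b0)
{
	matrix gVPMat;
}

cbuffer cbPerMesh : register(b1)
{
	matrix gBones[256];
}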

struct VS_IN
{
	float4 PosL : POSITION;
	float3 NormalL : NORMAL;
	float2 TexC : TEXCOORD;
	float4 BonesWeights[2] : BONE_WEIGHTS;
	uint4  BonesIDs[2]    : BONE_IDS;
};

struct VS_OUT
{
	float4 svPos : SV_POSITION;
	float2 TexC : TEXCOORD;
	float3 NormalW : NORMAL;
};

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4  BonesIDs[2], float4x4 Bones[256])
{
	float4x4 WorldMat = { float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0) };

	for(int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}

VS_OUT VS(VS_IN vIn)
{
	VS_OUT vOut;
	float4x4 World = CalculateWorldMatrixFromBones(vIn.BonesWeights, vIn.BonesIDs, gBones);
	vOut.svPos = mul(mul(vIn.PosL, World), gVPMat);
	vOut.TexC = vIn.TexC;
	vOut.NormalW = mul(float4(vIn.NormalL, 0), World).xyz;
	return vOut;
}

As you can see, this shader supports up to 256 bones per model. Compiling this shader takes around 5 seconds on my Core-i7 CPU.

If I reduce the number of supported bones to 16, it compiles almost immediately.

Funny thing is that the generated assembly is exactly the same (except for the CB declaration).

 

I find it weird - the code doesn't rely in any way on the matrices count.

Does anyone have any idea why the compile time degrades like this?



the code doesn't rely in any way on the matrices count.

 

Don't know if it's the problem, but different shader models support different limits for constant registers. If you're compiling with 4.0 or above (IIRC), you should be okay. What shader model do you compile with?



What shader model do you compile with?

 

5.0.

Still, that doesn't explain the performance hit. All the compiler has to do is:

- Verify that sizeof(matrix)*256 < MAX_ALLOWED_CB_SIZE

- Create the SM 5 instruction 'dcl_constantbuffer cb1[1025], dynamicIndexed'

 

And that's it. No need to optimize anything.
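
The size check in the first bullet can be worked through as a small sketch. The constant names below are made up for illustration. The 4096-register figure is the Direct3D 11 limit on float4 elements per constant buffer (64 KB).

```hlsl
// Hypothetical names; only the numbers matter here.
static const uint kBoneCount      = 256;   // array size from the posted shader
static const uint kRegsPerMatrix  = 4;     // one float4x4 occupies four float4 registers
static const uint kBoneRegisters  = kBoneCount * kRegsPerMatrix;  // 1024 registers
static const uint kBoneBytes      = kBoneRegisters * 16;          // 16384 bytes (16 KB)
static const uint kMaxCBufferRegs = 4096;  // D3D11 per-cbuffer limit, i.e. 64 KB

// 1024 <= 4096, so the array fits with plenty of room to spare.
```

The compiled listing posted later in the thread declares cb1[1025], one register more than this, which suggests the real buffer holds one extra 16-byte slot that the posted source does not show.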

Weird. I'll never understand what compilers think.


Just a hunch, but what happens if instead of passing float4x4 Bones[256] as a parameter to CalculateWorldMatrixFromBones, you just index your gBones instead?
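
A minimal sketch of what that change might look like, assuming the cbuffer declaration from the original post. The function drops the array parameter and reads the global gBones directly.

```hlsl
cbuffer cbPerMesh : register(b1)
{
	matrix gBones[256];
}

// Same weighted blend as the posted function, but without the
// float4x4 Bones[256] parameter: the global array is indexed directly.
float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	float4x4 WorldMat = (float4x4)0;

	for (int i = 0; i < 2; i++)
	{
		WorldMat += gBones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}

// The call site in VS() would then become:
//   float4x4 World = CalculateWorldMatrixFromBones(vIn.BonesWeights, vIn.BonesIDs);
```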



Just a hunch, but what happens if instead of passing float4x4 Bones[256] as a parameter to CalculateWorldMatrixFromBones, you just index your gBones instead?

Nothing changes, still takes a couple of seconds.


How do you update the cbuffer? What is the average bone count? Could this be a bandwidth problem?

I'm talking about the compilation time only.

I even tried FXC (the standalone HLSL compiler) - still takes a couple of seconds.

Edited by N.I.B.


Can you take a look at the compiled code? Perhaps the compiler is flattening or optimizing something. I'm a bit skeptical about the function argument float4x4 Bones[256].

 

EDIT: And post the compiled shader code.

EDIT2: I've found that sometimes, due to some weird flattening, it helps to put the [loop] attribute before the loop. Also try adding the D3DCOMPILE_PREFER_FLOW_CONTROL flag.
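
As a rough sketch of that suggestion, the posted function with the attribute applied would look something like the following. Only the [loop] line differs from the original code.

```hlsl
float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2], float4x4 Bones[256])
{
	float4x4 WorldMat = (float4x4)0;

	// [loop] asks the compiler to emit real flow control for this loop
	// instead of unrolling it.
	[loop]
	for (int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}
```

D3DCOMPILE_PREFER_FLOW_CONTROL goes in the Flags1 argument of D3DCompile. With the standalone fxc tool, the matching switch is /Gfp.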

Edited by Migi0027



And post the compiled shader code
vs_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[4], immediateIndexed
dcl_constantbuffer cb1[1025], dynamicIndexed
dcl_input v0.xyzw
dcl_input v1.xyz
dcl_input v2.xy
dcl_input v3.xyzw
dcl_input v4.xyzw
dcl_input v5.xyzw
dcl_input v6.xyzw
dcl_output_siv o0.xyzw, position
dcl_output o1.xy
dcl_output o2.xyz
dcl_temps 4
ishl r0.xyzw, v5.xyzw, l(2, 2, 2, 2)
mul r1.xyzw, v3.yyyy, cb1[r0.y + 4].xyzw
mad r1.xyzw, cb1[r0.x + 4].xyzw, v3.xxxx, r1.xyzw
mad r1.xyzw, cb1[r0.z + 4].xyzw, v3.zzzz, r1.xyzw
mad r1.xyzw, cb1[r0.w + 4].xyzw, v3.wwww, r1.xyzw
ishl r2.xyzw, v6.xyzw, l(2, 2, 2, 2)
mad r1.xyzw, cb1[r2.x + 4].xyzw, v4.xxxx, r1.xyzw
mad r1.xyzw, cb1[r2.y + 4].xyzw, v4.yyyy, r1.xyzw
mad r1.xyzw, cb1[r2.z + 4].xyzw, v4.zzzz, r1.xyzw
mad r1.xyzw, cb1[r2.w + 4].xyzw, v4.wwww, r1.xyzw
dp4 r1.w, v0.xyzw, r1.xyzw
mul r3.xyzw, v3.yyyy, cb1[r0.y + 1].xyzw
mad r3.xyzw, cb1[r0.x + 1].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 1].xyzw, v3.zzzz, r3.xyzw
mad r3.xyzw, cb1[r0.w + 1].xyzw, v3.wwww, r3.xyzw
mad r3.xyzw, cb1[r2.x + 1].xyzw, v4.xxxx, r3.xyzw
mad r3.xyzw, cb1[r2.y + 1].xyzw, v4.yyyy, r3.xyzw
mad r3.xyzw, cb1[r2.z + 1].xyzw, v4.zzzz, r3.xyzw
mad r3.xyzw, cb1[r2.w + 1].xyzw, v4.wwww, r3.xyzw
dp4 r1.x, v0.xyzw, r3.xyzw
dp3 o2.x, v1.xyzx, r3.xyzx
mul r3.xyzw, v3.yyyy, cb1[r0.y + 2].xyzw
mad r3.xyzw, cb1[r0.x + 2].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 2].xyzw, v3.zzzz, r3.xyzw
mad r3.xyzw, cb1[r0.w + 2].xyzw, v3.wwww, r3.xyzw
mad r3.xyzw, cb1[r2.x + 2].xyzw, v4.xxxx, r3.xyzw
mad r3.xyzw, cb1[r2.y + 2].xyzw, v4.yyyy, r3.xyzw
mad r3.xyzw, cb1[r2.z + 2].xyzw, v4.zzzz, r3.xyzw
mad r3.xyzw, cb1[r2.w + 2].xyzw, v4.wwww, r3.xyzw
dp4 r1.y, v0.xyzw, r3.xyzw
dp3 o2.y, v1.xyzx, r3.xyzx
mul r3.xyzw, v3.yyyy, cb1[r0.y + 3].xyzw
mad r3.xyzw, cb1[r0.x + 3].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 3].xyzw, v3.zzzz, r3.xyzw
mad r0.xyzw, cb1[r0.w + 3].xyzw, v3.wwww, r3.xyzw
mad r0.xyzw, cb1[r2.x + 3].xyzw, v4.xxxx, r0.xyzw
mad r0.xyzw, cb1[r2.y + 3].xyzw, v4.yyyy, r0.xyzw
mad r0.xyzw, cb1[r2.z + 3].xyzw, v4.zzzz, r0.xyzw
mad r0.xyzw, cb1[r2.w + 3].xyzw, v4.wwww, r0.xyzw
dp4 r1.z, v0.xyzw, r0.xyzw
dp3 o2.z, v1.xyzx, r0.xyzx
dp4 o0.x, r1.xyzw, cb0[0].xyzw
dp4 o0.y, r1.xyzw, cb0[1].xyzw
dp4 o0.z, r1.xyzw, cb0[2].xyzw
dp4 o0.w, r1.xyzw, cb0[3].xyzw
mov o1.xy, v2.xyxx
ret 

With 16 bones it looks exactly the same, except for "dcl_constantbuffer cb1[65], dynamicIndexed" in line 4.

 


I've found that sometimes, due to some weird flattening, it helps to put the [loop] attribute before the loop. Also try adding the D3DCOMPILE_PREFER_FLOW_CONTROL flag.

Adding [loop] cuts compilation time in half, from 4s to 2s. D3DCOMPILE_PREFER_FLOW_CONTROL doesn't improve on top of that. Nice speedup, but 2s for such a simple shader is way too much.

The [loop] improvement just adds to the mystery. I still don't understand why the number of bones has any effect on compilation time.


I'm not sure why it takes so long to compile. You'd really need to get someone from the DirectX team to help you out. Historically the loop simulator used for unrolling loops has always been rather slow in the HLSL compiler, but it doesn't make sense that it would be so slow for your particular case. It must be doing some sort of bounds-checking that is slowing it down.

I tried changing your shader to use a StructuredBuffer instead of a constant buffer for storing the array of bone matrices, and it compiles almost instantly. So you can do that as a workaround. A StructuredBuffer shouldn't be any slower (in fact it probably takes the same path on most recent hardware), and will give you the same functionality. 
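
A sketch of what the StructuredBuffer version might look like. The exact shader used for that test was not posted, so the register slot (t0) and the decision to index the global buffer directly are assumptions.

```hlsl
// Bone matrices now come from a shader resource view rather than a cbuffer.
// Slot t0 is an assumed choice.
StructuredBuffer<float4x4> gBones : register(t0);

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	float4x4 WorldMat = (float4x4)0;

	for (int i = 0; i < 2; i++)
	{
		WorldMat += gBones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}
```

On the application side, the buffer would need to be created with D3D11_RESOURCE_MISC_BUFFER_STRUCTURED and a 64-byte StructureByteStride, then bound as a shader resource view with VSSetShaderResources instead of VSSetConstantBuffers. After switching, it is worth checking that the matrix layout still matches what the CPU uploads.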

As for unrolling, it's almost always something you want to do for loops with a fixed number of iterations. It generally results in better performance, because it allows the compiler to better optimize the resulting code and also prevents the hardware from having to execute looping/branching instructions every iteration. So you'll probably want to be explicit and put an [unroll] attribute on your loop.
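
For illustration, the explicit attribute on the posted function would look roughly like this. It is a sketch of the advice above and has not been timed.

```hlsl
float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2], float4x4 Bones[256])
{
	float4x4 WorldMat = (float4x4)0;

	// [unroll] asks the compiler to expand both iterations inline,
	// so no loop or branch instructions are emitted for this loop.
	[unroll]
	for (int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}
```

The compiled listing earlier in the thread contains no loop instructions, so the compiler was already unrolling this loop by default. The attribute just makes that choice explicit.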

Edited by MJP
