satanir

HLSL compiler weird performance behavior

I have this animation demo which takes quite a few seconds to load. I initially thought it was related to the model, but apparently the culprit is the skinning VS - specifically, the bone-matrices array.

The VS looks like this:

// Assumed declaration: gVPMat (the view-projection matrix) isn't shown in the
// original post, but it is used below and shows up as cb0 in the disassembly.
cbuffer cbPerFrame : register(b0)
{
	matrix gVPMat;
}

cbuffer cbPerMesh : register(b1)
{
	matrix gBones[256];
}

struct VS_IN
{
	float4 PosL : POSITION;
	float3 NormalL : NORMAL;
	float2 TexC : TEXCOORD;
	float4 BonesWeights[2] : BONE_WEIGHTS;
	uint4  BonesIDs[2]    : BONE_IDS;
};

struct VS_OUT
{
	float4 svPos : SV_POSITION;
	float2 TexC : TEXCOORD;
	float3 NormalW : NORMAL;
};

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2], float4x4 Bones[256])
{
	float4x4 WorldMat = { float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0) };

	for(int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}

VS_OUT VS(VS_IN vIn)
{
	VS_OUT vOut;
	float4x4 World = CalculateWorldMatrixFromBones(vIn.BonesWeights, vIn.BonesIDs, gBones);
	vOut.svPos = mul(mul(vIn.PosL, World), gVPMat);
	vOut.TexC = vIn.TexC;
	vOut.NormalW = mul(float4(vIn.NormalL, 0), World).xyz;
	return vOut;
}

As you can see, this shader supports up to 256 bones per model. Compiling this shader takes around 5 seconds on my Core-i7 CPU.

If I reduce the number of supported bones to 16, it compiles almost immediately.

Funny thing is that the generated assembly is exactly the same (except for the CB declaration).

 

I find it weird - the code doesn't depend on the matrix count in any way.

Does anyone have an idea what's causing the performance degradation?

the code doesn't depend on the matrix count in any way.

 

I don't know if it's the problem, but different shader models have different limits on constant registers. If you're compiling with SM 4.0 or above (IIRC), you should be okay. What shader model do you compile with?

What shader model do you compile with?

 

5.0.

Still, that doesn't explain the performance hit. All the compiler has to do is:

- Verify that sizeof(matrix)*256 < MAX_ALLOWED_CB_SIZE (256 float4x4 matrices are 16 KB, comfortably under the 64 KB constant buffer limit)

- Create the SM 5 declaration 'dcl_constantbuffer cb1[1025], dynamicIndexed'

 

And that's it. No need to optimize anything.

Weird. I'll never understand what compilers think.

Just a hunch, but what happens if instead of passing float4x4 Bones[256] as a parameter to CalculateWorldMatrixFromBones, you just index your gBones instead?

Nothing changes, still takes a couple of seconds.
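
For reference, this is roughly the variant I tried - just a sketch, with the global gBones array indexed directly instead of being passed as a parameter:

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	// Same accumulation as before, but reading the global cbuffer array directly.
	float4x4 WorldMat = (float4x4)0;

	for(int i = 0; i < 2; i++)
	{
		WorldMat += gBones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}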

How do you update the cbuffer? What is the average bone count? Could this be a bandwidth problem?

I'm talking about the compilation time only.

I even tried FXC (the standalone HLSL compiler) - still takes a couple of seconds.

Edited by N.I.B.

Can you take a look at the compiled code? Perhaps the compiler is flattening or optimizing something - I'm a bit skeptical about the function argument float4x4 Bones[256].

 

EDIT: And post the compiled shader code.

EDIT2: I've sometimes found that, due to some weird flattening, it helps to use the [loop] attribute before the loop. Also try adding the D3DCOMPILE_PREFER_FLOW_CONTROL flag.
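
Something like this (just a sketch of where the attribute goes - D3DCOMPILE_PREFER_FLOW_CONTROL itself is a flag you pass to D3DCompile on the application side):

	[loop]   // ask the compiler to keep this as real flow control instead of unrolling it
	for(int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}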

Edited by Migi0027

And post the compiled shader code

Here it is:
vs_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[4], immediateIndexed
dcl_constantbuffer cb1[1025], dynamicIndexed
dcl_input v0.xyzw
dcl_input v1.xyz
dcl_input v2.xy
dcl_input v3.xyzw
dcl_input v4.xyzw
dcl_input v5.xyzw
dcl_input v6.xyzw
dcl_output_siv o0.xyzw, position
dcl_output o1.xy
dcl_output o2.xyz
dcl_temps 4
ishl r0.xyzw, v5.xyzw, l(2, 2, 2, 2)
mul r1.xyzw, v3.yyyy, cb1[r0.y + 4].xyzw
mad r1.xyzw, cb1[r0.x + 4].xyzw, v3.xxxx, r1.xyzw
mad r1.xyzw, cb1[r0.z + 4].xyzw, v3.zzzz, r1.xyzw
mad r1.xyzw, cb1[r0.w + 4].xyzw, v3.wwww, r1.xyzw
ishl r2.xyzw, v6.xyzw, l(2, 2, 2, 2)
mad r1.xyzw, cb1[r2.x + 4].xyzw, v4.xxxx, r1.xyzw
mad r1.xyzw, cb1[r2.y + 4].xyzw, v4.yyyy, r1.xyzw
mad r1.xyzw, cb1[r2.z + 4].xyzw, v4.zzzz, r1.xyzw
mad r1.xyzw, cb1[r2.w + 4].xyzw, v4.wwww, r1.xyzw
dp4 r1.w, v0.xyzw, r1.xyzw
mul r3.xyzw, v3.yyyy, cb1[r0.y + 1].xyzw
mad r3.xyzw, cb1[r0.x + 1].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 1].xyzw, v3.zzzz, r3.xyzw
mad r3.xyzw, cb1[r0.w + 1].xyzw, v3.wwww, r3.xyzw
mad r3.xyzw, cb1[r2.x + 1].xyzw, v4.xxxx, r3.xyzw
mad r3.xyzw, cb1[r2.y + 1].xyzw, v4.yyyy, r3.xyzw
mad r3.xyzw, cb1[r2.z + 1].xyzw, v4.zzzz, r3.xyzw
mad r3.xyzw, cb1[r2.w + 1].xyzw, v4.wwww, r3.xyzw
dp4 r1.x, v0.xyzw, r3.xyzw
dp3 o2.x, v1.xyzx, r3.xyzx
mul r3.xyzw, v3.yyyy, cb1[r0.y + 2].xyzw
mad r3.xyzw, cb1[r0.x + 2].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 2].xyzw, v3.zzzz, r3.xyzw
mad r3.xyzw, cb1[r0.w + 2].xyzw, v3.wwww, r3.xyzw
mad r3.xyzw, cb1[r2.x + 2].xyzw, v4.xxxx, r3.xyzw
mad r3.xyzw, cb1[r2.y + 2].xyzw, v4.yyyy, r3.xyzw
mad r3.xyzw, cb1[r2.z + 2].xyzw, v4.zzzz, r3.xyzw
mad r3.xyzw, cb1[r2.w + 2].xyzw, v4.wwww, r3.xyzw
dp4 r1.y, v0.xyzw, r3.xyzw
dp3 o2.y, v1.xyzx, r3.xyzx
mul r3.xyzw, v3.yyyy, cb1[r0.y + 3].xyzw
mad r3.xyzw, cb1[r0.x + 3].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 3].xyzw, v3.zzzz, r3.xyzw
mad r0.xyzw, cb1[r0.w + 3].xyzw, v3.wwww, r3.xyzw
mad r0.xyzw, cb1[r2.x + 3].xyzw, v4.xxxx, r0.xyzw
mad r0.xyzw, cb1[r2.y + 3].xyzw, v4.yyyy, r0.xyzw
mad r0.xyzw, cb1[r2.z + 3].xyzw, v4.zzzz, r0.xyzw
mad r0.xyzw, cb1[r2.w + 3].xyzw, v4.wwww, r0.xyzw
dp4 r1.z, v0.xyzw, r0.xyzw
dp3 o2.z, v1.xyzx, r0.xyzx
dp4 o0.x, r1.xyzw, cb0[0].xyzw
dp4 o0.y, r1.xyzw, cb0[1].xyzw
dp4 o0.z, r1.xyzw, cb0[2].xyzw
dp4 o0.w, r1.xyzw, cb0[3].xyzw
mov o1.xy, v2.xyxx
ret 

With 16 bones it looks exactly the same, except for "dcl_constantbuffer cb1[65], dynamicIndexed" in line 4.

 


I've sometimes found that, due to some weird flattening, it helps to use the [loop] attribute before the loop. Also try adding the D3DCOMPILE_PREFER_FLOW_CONTROL flag.

Adding the [loop] attribute cuts the compilation time in half - from 4s to 2s. D3DCOMPILE_PREFER_FLOW_CONTROL doesn't improve on top of that. It's a nice speedup, but 2s for such a simple shader is way too much.

The [loop] improvement just adds to the mystery. I still don't understand why the number of bones has any effect on compilation time.

I'm not sure why it takes so long to compile. You'd really need to get someone from the DirectX team to help you out. Historically, the loop simulator used for unrolling loops has always been rather slow in the HLSL compiler, but it doesn't make sense that it would be so slow for your particular case. It must be doing some sort of bounds-checking that is slowing it down.

I tried changing your shader to use a StructuredBuffer instead of a constant buffer for storing the array of bone matrices, and it compiles almost instantly. So you can do that as a workaround. A StructuredBuffer shouldn't be any slower (in fact it probably takes the same path on most recent hardware), and will give you the same functionality. 

As for unrolling, it's almost always something you want to do for loops with a fixed number of iterations. It generally results in better performance, because it allows the compiler to better optimize the resulting code and prevents the hardware from having to execute looping/branching instructions every iteration. So you'll probably want to be explicit and put an [unroll] attribute on your loop.
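
A rough sketch of what that might look like (the buffer name and the t0 register are placeholders; on the API side it becomes an SRV bound with VSSetShaderResources rather than a constant buffer):

// Bone matrices in a StructuredBuffer (SRV) instead of a cbuffer array.
StructuredBuffer<float4x4> gBonesBuffer : register(t0);

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	float4x4 WorldMat = (float4x4)0;

	[unroll]   // fixed iteration count, so be explicit about unrolling
	for(int i = 0; i < 2; i++)
	{
		WorldMat += gBonesBuffer[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBonesBuffer[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBonesBuffer[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBonesBuffer[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}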

Edited by MJP

If you don't specify loop/flatten attributes, the compiler will try both options before picking one. Even if you do specify [loop], it still seems to partially unroll loops to see if maybe it's a better choice.

That's one reason you should always compile your shaders ahead of time, instead of on your loading screen!

I tried changing your shader to use a StructuredBuffer instead of a constant buffer for storing the array of bone matrices, and it compiles almost instantly. So you can do that as a workaround. A StructuredBuffer shouldn't be any slower (in fact it probably takes the same path on most recent hardware), and will give you the same functionality.

Cool, I'll do that.

 


That's one reason you should always compile your shaders ahead of time, instead of on your loading screen!

Usually, I agree. But in this case it's a framework I use for shader prototyping, and I like the ability to re-compile shaders on the fly without having to stop the app. It makes the development process somewhat more fluid.

 

I've been using this shader at work for over a year now, for a facial-animation study. We use a highly detailed face model with three 4K x 4K textures, and I always assumed the long loading time was due to loading the model. Yesterday I implemented a model viewer that just creates state objects without loading anything, and the loading time was still there... all because of the bones array. After reducing the size of the bone array in my facial-animation code, half of the loading time is gone!

There's probably a moral there somewhere, something about not assuming things and such...

 

Well, I'll just tag this compiler issue as a small DirectX wonder.

Usually, I agree. But in this case it's a framework I use for shader prototyping, and I like the ability to re-compile shaders on the fly without having to stop the app. It makes the development process somewhat more fluid.
There's no reason you can't still do that with pre-compiled shaders. Instead of reading text from disk, compiling it to binary, and recreating your resources, you just load the binary from disk and recreate your resources.

I had a quick play with the shader, and found compilation went much faster if I:

 

1. Manually inlined the function call (this had the biggest impact).

2. Used the [fastopt] attribute on the loop.

3. Instead of #2 disabled optimization completely in the shader compiler for a bigger impact.

 

Note that [fastopt] can make the compiler generate worse code, so I wouldn't recommend it outside of prototyping. The same goes for disabling optimization on the shader compiler. Having said that, the driver optimizes the shader too, so the runtime performance hit from either of those isn't usually very big.

 

As a side note, you can generally get away with 4x3 matrices for your bones, which cuts down on the size of the constant buffer and saves a few instructions in the shader.
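
For example, something along these lines (a sketch; it assumes the bones are affine transforms stored as float4x3, keeping the row-vector mul convention from the original shader):

// 4x3 bone matrices: 3 constant registers per bone instead of 4
// (with the default column_major packing).
cbuffer cbPerMesh : register(b1)
{
	float4x3 gBones[256];
}

float4x3 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	float4x3 WorldMat = (float4x3)0;

	for(int i = 0; i < 2; i++)
	{
		WorldMat += gBones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}

// In the VS, mul(float4, float4x3) yields a float3:
//   float3 PosW  = mul(vIn.PosL, World);
//   vOut.svPos   = mul(float4(PosW, 1.0f), gVPMat);
//   vOut.NormalW = mul(float4(vIn.NormalL, 0), World);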

Manually inlined the function call (this had the biggest impact).

Tried that - it still takes 2s, same as with unroll.

 


2. Used the [fastopt] attribute on the loop.
3. Instead of #2 disabled optimization completely in the shader compiler for a bigger impact.

I tried those as well. Even with optimizations disabled, it still takes 1s - still a lot for such a simple shader.

 


As a side note, you can generally get away with 4x3 matrices for your bones, which cuts down on the size of the constant buffer and saves a few instructions in the shader.

Nice tip, thanks.

 


There's no reason you can't still do that with pre-compiled shaders. Instead of reading text from disk, compiling it to binary, and recreating your resources, you just load the binary from disk and recreate your resources.

Sure, but that means I have to re-compile the shader outside my app every time I change it. By letting the app re-compile, I just change the hlsl file, press a button and let the app do the magic for me.

Sure, but that means I have to re-compile the shader outside my app every time I change it. By letting the app re-compile, I just change the hlsl file, press a button and let the app do the magic for me.
Sorry, I'm taking you way off topic.

Yeah, workflow often trumps theoretical performance, but I'd still recommend supporting both text and binary shader files if you're going to go that way, so you can iterate quickly and also load quickly in shipping builds.

The engine I'm currently using (and a different proprietary one I used in '09) has a system-tray tool that subscribes to OS notifications about changed files in the game's content directory, automatically passes those files to the appropriate data-compiler plugins, and then notifies the game that the compiled data files have changed. That way, staff just have to press Ctrl+S on the text files, the game engine itself stays simple with a single code path for loading binary data, and end users get fast load times.

Yeah, workflow often trumps theoretical performance, but I'd still recommend supporting both text and binary shader files if you're going to go that way, so you can iterate quickly and also load quickly in shipping builds.

I guess it's a matter of requirements. The framework I implemented is used for algorithmic development, where the top requirement is fast shader prototyping (think DXUT, but way way better and simpler to use). I don't see a lot of use for pre-compiled shaders in our case.

 

If I was working on games - then yeah, compiling shaders at load time would make people go german-kid-crazy.

What's the graphics card in your PC?

ATI 6850.

But the compilation performance is unrelated to the GPU - it happens when I compile with FXC too, so it's a Microsoft compiler issue.
