satanir

HLSL compiler weird performance behavior

I have this animation demo which takes quite a few seconds to load. I initially thought it was related to the model, but apparently the culprit is the skinning VS - specifically, the bone-matrices array.

The VS looks like this:

// Assumed declaration: gVPMat (the view-projection matrix) isn't shown in the
// original post, but it is used below and shows up as cb0 in the disassembly.
cbuffer cbPerFrame : register(b0)
{
	matrix gVPMat;
}

cbuffer cbPerMesh : register(b1)
{
	matrix gBones[256];
}

struct VS_IN
{
	float4 PosL : POSITION;
	float3 NormalL : NORMAL;
	float2 TexC : TEXCOORD;
	float4 BonesWeights[2] : BONE_WEIGHTS;
	uint4  BonesIDs[2]    : BONE_IDS;
};

struct VS_OUT
{
	float4 svPos : SV_POSITION;
	float2 TexC : TEXCOORD;
	float3 NormalW : NORMAL;
};

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2], float4x4 Bones[256])
{
	float4x4 WorldMat = { float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0) };

	for(int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}

VS_OUT VS(VS_IN vIn)
{
	VS_OUT vOut;
	float4x4 World = CalculateWorldMatrixFromBones(vIn.BonesWeights, vIn.BonesIDs, gBones);
	vOut.svPos = mul(mul(vIn.PosL, World), gVPMat);
	vOut.TexC = vIn.TexC;
	vOut.NormalW = mul(float4(vIn.NormalL, 0), World).xyz;
	return vOut;
}

As you can see, this shader supports up to 256 bones per model. Compiling this shader takes around 5 seconds on my Core-i7 CPU.

If I reduce the number of supported bones to 16, it compiles almost immediately.

Funny thing is that the generated assembly is exactly the same (except for the CB declaration).

 

I find it weird - the code doesn't depend on the matrix count in any way.

Does anyone have an idea what's causing the performance degradation?

the code doesn't depend on the matrix count in any way.

 

I don't know if it's the problem, but different shader models have different limits on constant registers. If you're compiling with SM 4.0 or above (IIRC), you should be okay. What shader model do you compile with?

What shader model do you compile with?

 

5.0.

Still, that doesn't explain the performance hit. All the compiler has to do is:

- Verify that sizeof(matrix)*256 < MAX_ALLOWED_CB_SIZE (256 float4x4 matrices are 16 KB, comfortably under the 64 KB constant buffer limit)

- Create the SM 5 declaration 'dcl_constantbuffer cb1[1025], dynamicIndexed'

 

And that's it. No need to optimize anything.

Weird. I'll never understand what compilers think.

Just a hunch, but what happens if instead of passing float4x4 Bones[256] as a parameter to CalculateWorldMatrixFromBones, you just index your gBones instead?

Nothing changes, still takes a couple of seconds.
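
For reference, this is roughly the variant I tried - just a sketch, with the global gBones array indexed directly instead of being passed as a parameter:

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	// Same accumulation as before, but reading the global cbuffer array directly.
	float4x4 WorldMat = (float4x4)0;

	for(int i = 0; i < 2; i++)
	{
		WorldMat += gBones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}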

How do you update the cbuffer? What is the average bone count? Could this be a bandwidth problem?

I'm talking about the compilation time only.

I even tried FXC (the standalone HLSL compiler) - still takes a couple of seconds.

Edited by N.I.B.

Can you take a look at the compiled code? Perhaps the compiler is flattening or optimizing something - I'm a bit skeptical about the function argument float4x4 Bones[256].

 

EDIT: And post the compiled shader code.

EDIT2: I've sometimes found that, due to some weird flattening, it helps to use the [loop] attribute before the loop. Also try adding the D3DCOMPILE_PREFER_FLOW_CONTROL flag.
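
Something like this (just a sketch of where the attribute goes - D3DCOMPILE_PREFER_FLOW_CONTROL itself is a flag you pass to D3DCompile on the application side):

	[loop]   // ask the compiler to keep this as real flow control instead of unrolling it
	for(int i = 0; i < 2; i++)
	{
		WorldMat += Bones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += Bones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += Bones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += Bones[BonesIDs[i].w] * BonesWeights[i].w;
	}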

Edited by Migi0027

And post the compiled shader code

Here it is:
vs_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[4], immediateIndexed
dcl_constantbuffer cb1[1025], dynamicIndexed
dcl_input v0.xyzw
dcl_input v1.xyz
dcl_input v2.xy
dcl_input v3.xyzw
dcl_input v4.xyzw
dcl_input v5.xyzw
dcl_input v6.xyzw
dcl_output_siv o0.xyzw, position
dcl_output o1.xy
dcl_output o2.xyz
dcl_temps 4
ishl r0.xyzw, v5.xyzw, l(2, 2, 2, 2)
mul r1.xyzw, v3.yyyy, cb1[r0.y + 4].xyzw
mad r1.xyzw, cb1[r0.x + 4].xyzw, v3.xxxx, r1.xyzw
mad r1.xyzw, cb1[r0.z + 4].xyzw, v3.zzzz, r1.xyzw
mad r1.xyzw, cb1[r0.w + 4].xyzw, v3.wwww, r1.xyzw
ishl r2.xyzw, v6.xyzw, l(2, 2, 2, 2)
mad r1.xyzw, cb1[r2.x + 4].xyzw, v4.xxxx, r1.xyzw
mad r1.xyzw, cb1[r2.y + 4].xyzw, v4.yyyy, r1.xyzw
mad r1.xyzw, cb1[r2.z + 4].xyzw, v4.zzzz, r1.xyzw
mad r1.xyzw, cb1[r2.w + 4].xyzw, v4.wwww, r1.xyzw
dp4 r1.w, v0.xyzw, r1.xyzw
mul r3.xyzw, v3.yyyy, cb1[r0.y + 1].xyzw
mad r3.xyzw, cb1[r0.x + 1].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 1].xyzw, v3.zzzz, r3.xyzw
mad r3.xyzw, cb1[r0.w + 1].xyzw, v3.wwww, r3.xyzw
mad r3.xyzw, cb1[r2.x + 1].xyzw, v4.xxxx, r3.xyzw
mad r3.xyzw, cb1[r2.y + 1].xyzw, v4.yyyy, r3.xyzw
mad r3.xyzw, cb1[r2.z + 1].xyzw, v4.zzzz, r3.xyzw
mad r3.xyzw, cb1[r2.w + 1].xyzw, v4.wwww, r3.xyzw
dp4 r1.x, v0.xyzw, r3.xyzw
dp3 o2.x, v1.xyzx, r3.xyzx
mul r3.xyzw, v3.yyyy, cb1[r0.y + 2].xyzw
mad r3.xyzw, cb1[r0.x + 2].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 2].xyzw, v3.zzzz, r3.xyzw
mad r3.xyzw, cb1[r0.w + 2].xyzw, v3.wwww, r3.xyzw
mad r3.xyzw, cb1[r2.x + 2].xyzw, v4.xxxx, r3.xyzw
mad r3.xyzw, cb1[r2.y + 2].xyzw, v4.yyyy, r3.xyzw
mad r3.xyzw, cb1[r2.z + 2].xyzw, v4.zzzz, r3.xyzw
mad r3.xyzw, cb1[r2.w + 2].xyzw, v4.wwww, r3.xyzw
dp4 r1.y, v0.xyzw, r3.xyzw
dp3 o2.y, v1.xyzx, r3.xyzx
mul r3.xyzw, v3.yyyy, cb1[r0.y + 3].xyzw
mad r3.xyzw, cb1[r0.x + 3].xyzw, v3.xxxx, r3.xyzw
mad r3.xyzw, cb1[r0.z + 3].xyzw, v3.zzzz, r3.xyzw
mad r0.xyzw, cb1[r0.w + 3].xyzw, v3.wwww, r3.xyzw
mad r0.xyzw, cb1[r2.x + 3].xyzw, v4.xxxx, r0.xyzw
mad r0.xyzw, cb1[r2.y + 3].xyzw, v4.yyyy, r0.xyzw
mad r0.xyzw, cb1[r2.z + 3].xyzw, v4.zzzz, r0.xyzw
mad r0.xyzw, cb1[r2.w + 3].xyzw, v4.wwww, r0.xyzw
dp4 r1.z, v0.xyzw, r0.xyzw
dp3 o2.z, v1.xyzx, r0.xyzx
dp4 o0.x, r1.xyzw, cb0[0].xyzw
dp4 o0.y, r1.xyzw, cb0[1].xyzw
dp4 o0.z, r1.xyzw, cb0[2].xyzw
dp4 o0.w, r1.xyzw, cb0[3].xyzw
mov o1.xy, v2.xyxx
ret 

With 16 bones it looks exactly the same, except for "dcl_constantbuffer cb1[65], dynamicIndexed" in line 4.

 


I've sometimes found that, due to some weird flattening, it helps to use the [loop] attribute before the loop. Also try adding the D3DCOMPILE_PREFER_FLOW_CONTROL flag.

Adding the [loop] attribute cuts the compilation time in half - from 4s to 2s. D3DCOMPILE_PREFER_FLOW_CONTROL doesn't improve on top of that. It's a nice speedup, but 2s for such a simple shader is way too much.

The [loop] improvement just adds to the mystery. I still don't understand why the number of bones has any effect on compilation time.

I'm not sure why it takes so long to compile. You'd really need to get someone from the DirectX team to help you out. Historically, the loop simulator used for unrolling loops has always been rather slow in the HLSL compiler, but it doesn't make sense that it would be so slow for your particular case. It must be doing some sort of bounds-checking that is slowing it down.

I tried changing your shader to use a StructuredBuffer instead of a constant buffer for storing the array of bone matrices, and it compiles almost instantly. So you can do that as a workaround. A StructuredBuffer shouldn't be any slower (in fact it probably takes the same path on most recent hardware), and will give you the same functionality. 

As for unrolling, it's almost always something you want to do for loops with a fixed number of iterations. It generally results in better performance, because it allows the compiler to better optimize the resulting code and prevents the hardware from having to execute looping/branching instructions every iteration. So you'll probably want to be explicit and put an [unroll] attribute on your loop.
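
A rough sketch of what that might look like (the buffer name and the t0 register are placeholders; on the API side it becomes an SRV bound with VSSetShaderResources rather than a constant buffer):

// Bone matrices in a StructuredBuffer (SRV) instead of a cbuffer array.
StructuredBuffer<float4x4> gBonesBuffer : register(t0);

float4x4 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	float4x4 WorldMat = (float4x4)0;

	[unroll]   // fixed iteration count, so be explicit about unrolling
	for(int i = 0; i < 2; i++)
	{
		WorldMat += gBonesBuffer[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBonesBuffer[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBonesBuffer[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBonesBuffer[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}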

Edited by MJP

If you don't specify loop/flatten attributes, the compiler will try both options before picking one. Even if you do specify [loop], it still seems to partially unroll loops to see if maybe it's a better choice.

That's one reason you should always compile your shaders ahead of time, instead of on your loading screen!

I tried changing your shader to use a StructuredBuffer instead of a constant buffer for storing the array of bone matrices, and it compiles almost instantly. So you can do that as a workaround. A StructuredBuffer shouldn't be any slower (in fact it probably takes the same path on most recent hardware), and will give you the same functionality.

Cool, I'll do that.

 


That's one reason you should always compile your shaders ahead of time, instead of on your loading screen!

Usually, I agree. But in this case it's a framework I use for shader prototyping, and I like the ability to re-compile shaders on the fly without having to stop the app. It makes the development process somewhat more fluid.

 

I've been using this shader at work for over a year now, for a facial-animation study. We use a highly detailed face model with three 4K x 4K textures, and I always assumed the long loading time was due to loading the model. Yesterday I implemented a model viewer that just creates state objects without loading anything, and the loading time was still there... all because of the bones array. After reducing the size of the bone array in my facial-animation code, half of the loading time is gone!

There's probably a moral there somewhere, something about not assuming things and such...

 

Well, I'll just tag this compiler issue as a small DirectX wonder.

Usually, I agree. But in this case it's a framework I use for shader prototyping, and I like the ability to re-compile shaders on the fly without having to stop the app. It makes the development process somewhat more fluid.
There's no reason you can't still do that with pre-compiled shaders. Instead of reading text from disk, compiling it to binary, and recreating your resources, you just load the binary from disk and recreate your resources.

I had a quick play with the shader, and found compilation went much faster if I:

 

1. Manually inlined the function call (this had the biggest impact).

2. Used the [fastopt] attribute on the loop.

3. Instead of #2 disabled optimization completely in the shader compiler for a bigger impact.

 

Note that [fastopt] can make the compiler generate worse code, so I wouldn't recommend it outside of prototyping. The same goes for disabling optimization on the shader compiler. Having said that, the driver optimizes the shader too, so the runtime performance hit from either of those isn't usually very big.

 

As a side note, you can generally get away with 4x3 matrices for your bones, which cuts down on the size of the constant buffer and saves a few instructions in the shader.
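
For example, something along these lines (a sketch; it assumes the bones are affine transforms stored as float4x3, keeping the row-vector mul convention from the original shader):

// 4x3 bone matrices: 3 constant registers per bone instead of 4
// (with the default column_major packing).
cbuffer cbPerMesh : register(b1)
{
	float4x3 gBones[256];
}

float4x3 CalculateWorldMatrixFromBones(float4 BonesWeights[2], uint4 BonesIDs[2])
{
	float4x3 WorldMat = (float4x3)0;

	for(int i = 0; i < 2; i++)
	{
		WorldMat += gBones[BonesIDs[i].x] * BonesWeights[i].x;
		WorldMat += gBones[BonesIDs[i].y] * BonesWeights[i].y;
		WorldMat += gBones[BonesIDs[i].z] * BonesWeights[i].z;
		WorldMat += gBones[BonesIDs[i].w] * BonesWeights[i].w;
	}

	return WorldMat;
}

// In the VS, mul(float4, float4x3) yields a float3:
//   float3 PosW  = mul(vIn.PosL, World);
//   vOut.svPos   = mul(float4(PosW, 1.0f), gVPMat);
//   vOut.NormalW = mul(float4(vIn.NormalL, 0), World);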

Manually inlined the function call (this had the biggest impact).

Tried that - it still takes 2s, same as with unroll.

 


2. Used the [fastopt] attribute on the loop.
3. Instead of #2 disabled optimization completely in the shader compiler for a bigger impact.

I tried those as well. Even with optimizations disabled, it still takes 1s - still a lot for such a simple shader.

 


As a side note, you can generally get away with 4x3 matrices for your bones, which cuts down on the size of the constant buffer and saves a few instructions in the shader.

Nice tip, thanks.

 


There's no reason you can't still do that with pre-compiled shaders. Instead of reading text from disk, compiling it to binary, and recreating your resources, you just load the binary from disk and recreate your resources.

Sure, but that means I have to re-compile the shader outside my app every time I change it. By letting the app re-compile, I just change the hlsl file, press a button and let the app do the magic for me.

Sure, but that means I have to re-compile the shader outside my app every time I change it. By letting the app re-compile, I just change the hlsl file, press a button and let the app do the magic for me.
Sorry, I'm taking you way off topic.

Yeah, workflow often trumps theoretical performance, but I'd still recommend supporting both text and binary shader files if you're going to go that way, so you can iterate quickly and also load quickly in shipping builds.

The engine I'm currently using (and a different proprietary one I used in '09) has a system-tray tool that subscribes to OS notifications about changed files in the game's content directory, automatically passes those files to the appropriate data-compiler plugins, and then notifies the game that the compiled data files have changed. That way, staff just have to press Ctrl+S on the text files, the game engine itself stays simple with a single code path for loading binary data, and end users get fast load times.

Yeah, workflow often trumps theoretical performance, but I'd still recommend supporting both text and binary shader files if you're going to go that way, so you can iterate quickly and also load quickly in shipping builds.

I guess it's a matter of requirements. The framework I implemented is used for algorithmic development, where the top requirement is fast shader prototyping (think DXUT, but way way better and simpler to use). I don't see a lot of use for pre-compiled shaders in our case.

 

If I was working on games - then yeah, compiling shaders at load time would make people go german-kid-crazy.

What's the graphics card in your PC?

ATI 6850.

But the compilation performance is unrelated to the GPU - it happens when I compile with FXC too, so it's a Microsoft compiler issue.
