D3DCompile() way too slow

15 comments, last by Aardappel 13 years, 3 months ago
So I am doing some implicit function evaluation with Perlin noise in DirectCompute (on DX11 hardware, cs_5_0 profile), and the whole thing takes about 2 minutes to complete. I was starting to get a bit disappointed with the speed of my GPU, until I found out that 99% of that time wasn't spent in the shader, but in... D3DCompile()

My code is only about 5 KB of source, and through experimentation I have found that loops have by far the biggest influence on compile time: it is as if compile time is exponential in the size of a version of the code with all loops unrolled and all function calls inlined. By reducing the loop counts where I could, I got the compile time down to 1 minute.

Tagging all the loops with [loop] had no effect, and neither did the D3D10_SHADER_SKIP_OPTIMIZATION or D3D10_SHADER_OPTIMIZATION_LEVEL0 flags. The only thing that had an impact was [fastopt], which cut compilation time about 4x, down to 15 seconds.

Thing is, I can't precompile this code to a blob, because the whole point of my program is to interactively change the code and see the result instantly (or at least fast, failing that). It is a bit ridiculous if the user has to wait 16 seconds to see the new result, of which only 1 second was actual computation.

How do I speed this up? I don't mind a lower optimisation level as clearly that will only ever be a win for me.

I have written optimizing compilers myself, and I know of no compiler that would take longer than a second on such a small amount of code (on modern hardware). Certainly optimizing HLSL is trivial compared to C/C++, because there are no aliasing problems etc. What the hell can it possibly be spending all that time on?

Oh, and I have searched Google and this forum for solutions, but so far haven't found any (e.g. http://www.gamedev.n..._1#entry4725295 and http://www.gamedev.n..._1#entry4574352)
I'd consider posting your shader source if you're not trying to keep it secret, maybe there is some way it can be made simpler or cleaner to help the compiler figure it out faster.
Try the [fastopt] attribute on your loops. This should short circuit the loop simulator in the compiler which would otherwise be exponential for the number of embedded loops.

Docs for this are here

I'd consider posting your shader source if you're not trying to keep it secret, maybe there is some way it can be made simpler or cleaner to help the compiler figure it out faster.

I don't mind posting it, but like I said, I already experimented with commenting out code to see what effect it has on compilation speed, and it is purely due to for-loops (which I can't remove) and the amount of code in general. I can probably optimize the code more, but I want compilation to be a lot faster, not just a bit.

I guess I am hoping for some secret compiler flags that tone down whatever crazy stuff it is doing?

cbuffer consts
{
    float4 sb;
    float4 grad[16];
};

float4 Col(float3 c, float inside) { return float4(c, inside); }
bool Solid(float4 c) { return c.w > 0.999f; }

float rand(float2 co)
{
    return frac(sin(dot(co.xy, float2(12.9898, 78.233))) * 43758.5453);
}

int randint(float2 co) { return int(rand(co) * 256); }

int Hash( float3 P )
{
    return randint(P.xy) ^ randint(P.yz) ^ randint(P.zx);
}

float Snoise3D( float3 P )
{
    const float F3 = 0.333333333333;
    const float G3 = 0.166666666667;

    // Skew the input point onto the simplex grid.
    float s = dot( P, F3 );
    float3 Pi = floor( P + s );
    float t = dot( Pi, G3 );

    float3 P0 = Pi - t;
    float3 Pf0 = P - P0;

    // Offsets of the four corners of the enclosing simplex.
    float3 simplex[4];
    float3 T = Pf0.xzy >= Pf0.yxz;
    simplex[0] = 0;
    simplex[1] = T.xzy > T.yxz;
    simplex[2] = T.yxz <= T.xzy;
    simplex[3] = 1;

    float n = 0;

    // Sum the contribution of each corner.
    for (int i = 0; i<4; i++)
    {
        float3 Pf = Pf0 - simplex[i] + G3 * i;
        int h = Hash( Pi + simplex[i] );
        float d = saturate( 0.6f - dot( Pf, Pf ) );
        d *= d;
        n += d * d * dot((float3)grad[ h & 15 ], Pf);
    }

    return 32.0f * n;
}

float Turbulence3D( float3 p )
{
    float res = 0;
    float fact = 1;
    float scale = 1;
    float weight = 0;
    // Four octaves of noise, each 2.5x finer and 2.5x weaker than the last.
    for (int i = 0; i<4; i++)
    {
        res += fact * Snoise3D( p * scale );
        weight += fact;
        fact /= 2.5;
        scale *= 2.5;
    }
    return res / weight;
}

float4 Fun(float3 p)
{
    p = (p - sb.y) / sb.x;

    return Col(0.5, Turbulence3D(p) < -0.3);
}

struct BufferStruct2
{
    float4 a;
    float4 b;
};

RWStructuredBuffer<BufferStruct2> g_OutBuff2 : register( u1 );

[numthreads(64, 1, 1)]
void main2( uint3 threadIDInGroup : SV_GroupThreadID,
            uint3 groupID : SV_GroupID,
            uint groupIndex : SV_GroupIndex,
            uint3 dispatchThreadID : SV_DispatchThreadID )
{
    BufferStruct2 e = g_OutBuff2[dispatchThreadID.x];
    float3 a = (float3)e.a;
    float3 b = (float3)e.b;

    // Bisect the segment a..b to find where the function changes from solid to empty.
    int subdivs = 10;
    for (int i = 0; i < subdivs; i++)
    {
        float3 mid = (a + b) / 2;
        float4 col = Fun(mid);
        if (Solid(col)) a = mid;
        else b = mid;
    }

    float3 pos = (a + b) / 2;

    // Average the colour over a 3x3x3 neighbourhood around pos.
    float3 col = float3(0, 0, 0);
    int n = 0;
    const float off = 0.6f;
    for (float x = -off; x < off + 0.1f; x += off)
    {
        for (float y = -off; y < off + 0.1f; y += off)
        {
            for (float z = -off; z < off + 0.1f; z += off)
            {
                float4 acol = Fun(float3(x, y, z) + pos);
                col += (float3)(acol * acol.w);
                n += acol.w;
            }
        }
    }
    if (n > 1) col /= n;

    e.a = float4(pos, 0);
    e.b = float4(col, 0);
    g_OutBuff2[dispatchThreadID.x] = e;
}
I don't know about any "secret compiler flags", but if [fastopt] doesn't work for you then you can always just specify a lower optimization level. I *think* the default is D3D10_SHADER_OPTIMIZATION_LEVEL1, and you can specify 0, 2, or 3. You can also just disable optimizations altogether.

Try the [fastopt] attribute on your loops. This should short circuit the loop simulator in the compiler which would otherwise be exponential for the number of embedded loops.

Docs for this are here

Thanks, but if you see my original post, I am already using that. It helps some, but not enough.

I don't know about any "secret compiler flags", but if [fastopt] doesn't work for you then you can always just specify a lower optimization level. I *think* the default is D3D10_SHADER_OPTIMIZATION_LEVEL1, and you can specify 0, 2, or 3. You can also just disable optimizations altogether.

Thanks, but as I said in my original post, I tried both D3D10_SHADER_OPTIMIZATION_LEVEL0 and D3D10_SHADER_SKIP_OPTIMIZATION, and neither appears to have any effect on compilation time.
On my PC (Core i7 920) pasting that code into ATI's GPU Shader Analyzer and compiling as cs_5_0 (and changing 'B' to 'b' to make it compile) gives me a compile time of about 5 seconds, regardless of settings. If it's taking you 15 seconds a CPU upgrade might help, or you're compiling it more than once...

This drops down to almost instant if I take out the code in the middle of those triple nested loops at the bottom, so that's clearly the slow bit to compile. Unfortunately playing with that code I was unable to noticeably speed up compilation time, without removing code.
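
A rough way to see why that triple loop dominates is to count how much code the shader turns into if the compiler really does unroll every loop and inline every call. That is only the original poster's hypothesis, and nothing in this thread confirms how the compiler actually works. The trip counts below are read off the posted shader; the constant and function names are made up for this illustration.

```cpp
// Trip counts read off the shader source posted earlier in this thread.
constexpr int kBisectionSteps  = 10; // "subdivs" loop in main2
constexpr int kSamplesPerAxis  = 3;  // x, y and z each step through -0.6, 0.0, 0.6
constexpr int kOctaves         = 4;  // loop in Turbulence3D
constexpr int kSimplexCorners  = 4;  // loop in Snoise3D
constexpr int kRandIntsPerHash = 3;  // Hash() calls randint() three times

// Calls to Fun() made from inside the triple nested loop.
constexpr int NestedLoopFunCalls() {
    return kSamplesPerAxis * kSamplesPerAxis * kSamplesPerAxis;
}

// All calls to Fun() in main2: bisection loop plus nested loops.
constexpr int TotalFunCalls() {
    return kBisectionSteps + NestedLoopFunCalls();
}

// Copies of the Snoise3D loop body produced by one Fun() call.
constexpr int CornerBodiesPerFun() {
    return kOctaves * kSimplexCorners;
}

// Copies of the Snoise3D loop body in a fully unrolled, fully inlined main2.
constexpr int TotalCornerBodies() {
    return TotalFunCalls() * CornerBodiesPerFun();
}

// Copies of rand() (a sin, a dot and a frac each) in the same flattened code.
constexpr int TotalRandCalls() {
    return TotalCornerBodies() * kRandIntsPerHash;
}

static_assert(NestedLoopFunCalls() == 27, "3 x 3 x 3 samples");
static_assert(TotalFunCalls() == 37, "10 bisection steps + 27 samples");
static_assert(TotalCornerBodies() == 592, "37 calls x 16 corner bodies");
static_assert(TotalRandCalls() == 1776, "592 corner bodies x 3 randint calls");
```

Under that assumption, the triple loop accounts for 27 of the 37 copies of Fun(), which fits with compilation becoming almost instant once its body is removed.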

You might want to consider some sort of cache of compilation results, but that's not easy if you want to ignore changes to the source that won't affect the compiled result like adding whitespace.
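
As a minimal sketch of such a cache, the following keys compiled results on the source text after whitespace has been normalised, so that re-indenting or adding blank lines does not trigger a recompile. Everything here is an assumption for illustration: `ShaderCache`, `NormalizeWhitespace`, `Bytecode` and `CompileFn` are hypothetical names, and `CompileFn` stands in for whatever wrapper the application has around D3DCompile(). The normalisation is deliberately naive: it keeps line breaks (so `//` comments and preprocessor lines keep their meaning) but does not strip comments or understand string literals.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Bytecode = std::vector<unsigned char>;
// Hypothetical stand-in for the application's own wrapper around D3DCompile().
using CompileFn = std::function<Bytecode(const std::string&)>;

// Collapses runs of spaces/tabs to one space, trims each line, drops blank lines.
std::string NormalizeWhitespace(const std::string& source) {
    std::string out;
    std::string line;
    auto flush = [&]() {
        if (!line.empty() && line.back() == ' ') line.pop_back(); // trailing space
        if (!line.empty()) { out += line; out += '\n'; }
        line.clear();
    };
    for (char ch : source) {
        if (ch == '\n') {
            flush();
        } else if (std::isspace(static_cast<unsigned char>(ch))) {
            // Skip leading whitespace and repeated whitespace.
            if (!line.empty() && line.back() != ' ') line += ' ';
        } else {
            line += ch;
        }
    }
    flush();
    return out;
}

class ShaderCache {
public:
    explicit ShaderCache(CompileFn compile) : compile_(std::move(compile)) {
        assert(compile_ && "a compile function is required");
    }

    // Returns cached bytecode, compiling only when the normalised source is new.
    const Bytecode& Get(const std::string& source) {
        std::string key = NormalizeWhitespace(source);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(std::move(key), compile_(source)).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    CompileFn compile_;
    // The full normalised text is the key, so hash collisions cannot mix up shaders.
    std::unordered_map<std::string, Bytecode> cache_;
};
```

A cache like this only helps when the user returns to a shader that has been compiled before, or edits nothing but layout; any real change to the code still pays the full compile cost.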

Another option is using extra threads to compile the code in the background after each change, to minimize the delay that the user sees.
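
One way the background-compile idea could look, sketched with standard C++ threads only. The class and its members are hypothetical, and `CompileFn` is again a stand-in for the application's wrapper around D3DCompile() (which is assumed to be safe to call from a worker thread here). The UI thread calls `Submit` on every edit and polls `TryTakeResult` once per frame; if a newer edit arrives while a compile is running, the stale result is thrown away.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Bytecode = std::vector<unsigned char>;
// Hypothetical stand-in for the application's own wrapper around D3DCompile().
using CompileFn = std::function<Bytecode(const std::string&)>;

class BackgroundCompiler {
public:
    explicit BackgroundCompiler(CompileFn compile)
        : compile_(std::move(compile)), worker_([this] { Run(); }) {
        assert(compile_ && "a compile function is required");
    }

    ~BackgroundCompiler() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_one();
        worker_.join(); // waits for any compile that is still in flight
    }

    // Called on every edit; replaces a request the worker has not started yet.
    void Submit(std::string source) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_ = std::move(source);
            has_pending_ = true;
        }
        wake_.notify_one();
    }

    // Non-blocking: returns true and fills `out` when a fresh result is ready.
    bool TryTakeResult(Bytecode& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!has_result_) return false;
        out = std::move(result_);
        has_result_ = false;
        return true;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stop_ || has_pending_; });
            if (stop_) return;
            std::string source = std::move(pending_);
            has_pending_ = false;
            lock.unlock();
            Bytecode code = compile_(source); // the slow part, done without the lock
            lock.lock();
            if (!has_pending_) { // otherwise a newer edit superseded this one
                result_ = std::move(code);
                has_result_ = true;
            }
        }
    }

    CompileFn compile_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::string pending_;
    Bytecode result_;
    bool has_pending_ = false;
    bool has_result_ = false;
    bool stop_ = false;
    std::thread worker_; // declared last so it starts after the other members exist
};
```

This does not make compilation any faster; it only keeps the interface responsive and lets the previous result stay on screen until the new one is ready.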

Thanks for trying that out. Yes, the version I sent was already optimized somewhat from the 15 second version, and takes 5.8 seconds to compile on mine.

Yeah, a cache won't work, as any part of the code may change from run to run, so it will never hit the cache. If DirectCompute had some form of "linking", I could compile parts of the code that never change separately, but I don't think that's possible either.

Seeing as how none of the compiler flags affect compilation speed, it is clearly ignoring my request not to try to optimize those loops.
Some of the loop analysis can't be turned off regardless of the flags you specify. Anyway, compiler performance is being worked on for a future release, as it has major issues, especially for compute shaders. At the moment you won't be able to eke out better performance through anything other than changing the HLSL code for your algorithm, which may or may not be possible.

