
D3DCompile() way too slow


16 replies to this topic

#1 Aardappel   Members   -  Reputation: 100


Posted 20 January 2011 - 10:56 AM

So I am doing some implicit function evaluation with Perlin noise in DirectCompute (on DX11 hw, cs_5_0 profile), and the whole thing takes about 2 minutes to complete. I was starting to get a bit disappointed with the speed of my GPU, until I found out that 99% of that time wasn't spent in the shader, but in... D3DCompile().

My code is only about 5 KB of source, and through experimentation I have found that loops have the biggest influence on compile time: it is as if compile time grows exponentially with the size of a version of the code with all loops unrolled and all function calls inlined. By reducing the loop counts where I could, I got the compile time down to 1 minute.

Tagging all the loops with [loop] had no effect, and neither did the D3D10_SHADER_SKIP_OPTIMIZATION or D3D10_SHADER_OPTIMIZATION_LEVEL0 flags. The only thing that had an impact was [fastopt], which cut compilation time about 4x, down to 15 seconds.

Thing is, I can't precompile this code to a blob, because the whole point of my program is to interactively change the code and see the result instantly (or at least fast, failing that). It is a bit ridiculous if the user has to wait 16 seconds to see the new result, of which only 1 second was actual computation.

How do I speed this up? I don't mind a lower optimisation level as clearly that will only ever be a win for me.

I have written optimizing compilers myself, and I know of no compiler that would take longer than a second on such a small amount of code (on modern hw). Certainly optimizing HLSL is trivial compared to C/C++, because there are no aliasing problems etc. What the hell can it possibly be spending all that time on?

Oh, and I have searched Google and this forum for solutions, but so far haven't found any (e.g. http://www.gamedev.n..._1#entry4725295 and http://www.gamedev.n..._1#entry4574352)


#2 karwosts   Members   -  Reputation: 840


Posted 20 January 2011 - 11:14 AM

I'd consider posting your shader source if you're not trying to keep it secret, maybe there is some way it can be made simpler or cleaner to help the compiler figure it out faster.

#3 DieterVW   Members   -  Reputation: 700


Posted 20 January 2011 - 12:30 PM

Try the [fastopt] attribute on your loops. This should short circuit the loop simulator in the compiler which would otherwise be exponential for the number of embedded loops.

Docs for this are here

#4 Aardappel   Members   -  Reputation: 100


Posted 20 January 2011 - 03:38 PM

karwosts wrote: "I'd consider posting your shader source if you're not trying to keep it secret, maybe there is some way it can be made simpler or cleaner to help the compiler figure it out faster."


I don't mind posting it, but like I said, I already experimented with commenting out code to see what effect it has on compilation speed, and it is purely due to for-loops (which I can't remove) and the amount of code in general. I can probably optimize the code more, but I want compilation to be a lot faster, not just a bit.

I guess I am hoping for some secret compiler flags that tone down whatever crazy stuff it is doing?


cbuffer consts
{
	float4 sb;
	float4 grad[16];
};

float4 Col(float3 c, float inside) { return float4(c, inside); }
bool Solid(float4 c) { return c.w > 0.999f; }

float rand(float2 co)
{
	return frac(sin(dot(co.xy, float2(12.9898, 78.233))) * 43758.5453);
}

int randint(float2 co) { return int(rand(co) * 256); }

int Hash( float3 P )
{
	return randint(P.xy) ^ randint(P.yz) ^ randint(P.zx);
}

float Snoise3D( float3 P )
{
	const float F3 = 0.333333333333;
	const float G3 = 0.166666666667;

	float s = dot( P, F3 );
	float3 Pi = floor( P + s );
	float t = dot( Pi, G3 );

	float3 P0 = Pi - t;        	
	float3 Pf0 = P - P0;    	

	float3 simplex[4];
	float3 T = Pf0.xzy >= Pf0.yxz;
	simplex[0] = 0;
	simplex[1] = T.xzy > T.yxz;
	simplex[2] = T.yxz <= T.xzy;
	simplex[3] = 1;

	float n = 0;

	[loop][fastopt]
	for (int i = 0; i<4; i++)
	{
    	float3 Pf = Pf0 - simplex[i] + G3 * i;
    	int h = Hash( Pi + simplex[i] );
    	float d = saturate( 0.6f - dot( Pf, Pf ) );
    	d *= d; 
    	n += d * d * dot((float3)grad[ h & 15 ], Pf);
	}

	return 32.0f * n;
}

float Turbulence3D( float3 p )
{
	float res = 0;
	float fact = 1;
	float scale = 1;
	float weight = 0;
	[loop][fastopt]
	for (int i = 0; i<4; i++)
	{
    	res += fact * Snoise3D( p * scale );
    	weight += fact;
    	fact /= 2.5;
    	scale *= 2.5;
	}
	return res / weight;
}

float4 Fun(float3 p)
{
	p = (p - sb.y) / sb.x;

	return Col(0.5, Turbulence3D(p) < -0.3);
}

struct BufferStruct2
{
	float4 a;
	float4 b;
};

RWStructuredBuffer<BufferStruct2> g_OutBuff2 : register( u1 );

[numthreads(64, 1, 1)]
void main2( uint3 threadIDInGroup : SV_GroupThreadID,
   		uint3 groupID : SV_GroupID, 
   		uint groupIndex : SV_GroupIndex, 
   		uint3 dispatchThreadID : SV_DispatchThreadID )
{
	BufferStruct2 e = g_OutBuff2[dispatchThreadID.x];
	float3 a = (float3)e.a;
	float3 b = (float3)e.b;

	int subdivs = 10;
	[loop][fastopt]
	for (int i = 0; i < subdivs; i++)
	{
		float3 mid = (a + b) / 2;
    	float4 col = Fun(mid);
    	if (Solid(col)) a = mid;
    	else        	b = mid;
	}

	float3 pos = (a + b) / 2;

	
	float3 col = float3(0, 0, 0);
	int n = 0;
	const float off = 0.6f;
	[loop][fastopt]
	for (float x = -off; x < off + 0.1f; x += off)
	[loop][fastopt]
	for (float y = -off; y < off + 0.1f; y += off)
	[loop][fastopt]
	for (float z = -off; z < off + 0.1f; z += off)
	{
    	float4 acol = Fun(float3(x, y, z) + pos);
    	col += (float3)(acol * acol.w);
    	n += acol.w;
	}
	if (n > 1) col /= n;

	e.a = float4(pos, 0);
	e.b = float4(col, 0);
	g_OutBuff2[dispatchThreadID.x] = e;
}
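
As a rough illustration of the "all loops unrolled and all function calls inlined" guess from post #1, the C++ sketch below counts how many copies of the Snoise3D loop body the shader above would contain in that case. The trip counts are read off the posted source (each of the three sampling loops runs for -off, 0 and +off). Whether the compiler really expands the code this way is an assumption, and the struct and function names are made up for the example.

```cpp
#include <cassert>

// Trip counts read off the posted shader source.
struct LoopCounts {
    int bisectSteps;     // 'subdivs' loop in main2
    int samplesPerAxis;  // each of the x/y/z colour-sampling loops
    int octaves;         // loop in Turbulence3D
    int simplexCorners;  // loop in Snoise3D
};

// Number of Fun() calls a single thread makes: one per bisection step
// plus one per colour sample in the triple nested loop.
int funCalls(const LoopCounts& c) {
    assert(c.bisectSteps >= 0 && c.samplesPerAxis >= 0);
    return c.bisectSteps + c.samplesPerAxis * c.samplesPerAxis * c.samplesPerAxis;
}

// Copies of the Snoise3D loop body if every loop were unrolled and every
// call inlined (an assumption about the compiler, not a documented fact).
int noiseBodyCopies(const LoopCounts& c) {
    assert(c.octaves >= 0 && c.simplexCorners >= 0);
    return funCalls(c) * c.octaves * c.simplexCorners;
}

// Values for the shader as posted: 10 bisection steps, 3 samples per axis,
// 4 octaves, 4 simplex corners.
const LoopCounts kPostedShader{10, 3, 4, 4};
```

For the posted shader this gives 37 Fun() calls and 592 copies of the innermost loop body per thread, each of which also contains three inlined rand() evaluations via Hash().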


#5 MJP   Moderators   -  Reputation: 11589


Posted 20 January 2011 - 04:10 PM

I don't know about any "secret compiler flags", but if [fastopt] doesn't work for you then you can always just specify a lower optimization level. I *think* the default is D3D10_SHADER_OPTIMIZATION_LEVEL1, and you can specify 0, 2, or 3. You can also just disable optimizations altogether.

#6 Aardappel   Members   -  Reputation: 100


Posted 20 January 2011 - 04:16 PM

DieterVW wrote: "Try the [fastopt] attribute on your loops. This should short circuit the loop simulator in the compiler which would otherwise be exponential for the number of embedded loops. Docs for this are here"

Thanks, but if you see my original post, I am already using that. It helps some, but not enough.

#7 Aardappel   Members   -  Reputation: 100


Posted 20 January 2011 - 04:19 PM

MJP wrote: "I don't know about any "secret compiler flags", but if [fastopt] doesn't work for you then you can always just specify a lower optimization level. I *think* the default is D3D10_SHADER_OPTIMIZATION_LEVEL1, and you can specify 0, 2, or 3. You can also just disable optimizations altogether."


Thanks, but as I said in my original post, I tried both D3D10_SHADER_OPTIMIZATION_LEVEL0 and D3D10_SHADER_SKIP_OPTIMIZATION, and neither appears to have any effect on compilation time.

#8 Adam_42   Crossbones+   -  Reputation: 2568


Posted 21 January 2011 - 05:40 AM

On my PC (Core i7 920) pasting that code into ATI's GPU Shader Analyzer and compiling as cs_5_0 (and changing 'B' to 'b' to make it compile) gives me a compile time of about 5 seconds, regardless of settings. If it's taking you 15 seconds a CPU upgrade might help, or you're compiling it more than once...

This drops down to almost instant if I take out the code in the middle of those triple nested loops at the bottom, so that's clearly the slow bit to compile. Unfortunately playing with that code I was unable to noticeably speed up compilation time, without removing code.

You might want to consider some sort of cache of compilation results, but that's not easy if you want to ignore changes to the source that won't affect the compiled result like adding whitespace.
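
A minimal C++ sketch of such a cache, assuming the cosmetic edits to ignore are limited to whitespace and // comments. The names ShaderCache and normalizeSource, and the string used as a stand-in for the compiled blob, are hypothetical; a real version would store whatever D3DCompile() returns. The normalizer is deliberately conservative: it keeps line breaks so preprocessor directives are not altered, and it does not touch /* */ comments.

```cpp
#include <cassert>
#include <string>
#include <sstream>
#include <unordered_map>

// Collapse runs of spaces/tabs, trim each line, drop // comments and blank
// lines. Two sources that differ only in those respects get the same key.
std::string normalizeSource(const std::string& src) {
    std::istringstream in(src);
    std::string line, out;
    while (std::getline(in, line)) {
        const std::string::size_type comment = line.find("//");
        if (comment != std::string::npos) line.erase(comment);
        std::string packed;
        bool pendingSpace = false;
        for (char ch : line) {
            if (ch == ' ' || ch == '\t' || ch == '\r') {
                pendingSpace = !packed.empty();  // ignore leading whitespace
            } else {
                if (pendingSpace) packed += ' ';
                pendingSpace = false;
                packed += ch;
            }
        }
        if (!packed.empty()) { out += packed; out += '\n'; }
    }
    return out;
}

using Bytecode = std::string;  // stand-in for the compiled shader blob

class ShaderCache {
public:
    // Returns the cached bytecode, calling 'compile' only on a cache miss.
    template <typename CompileFn>
    const Bytecode& get(const std::string& source, CompileFn compile) {
        const std::string key = normalizeSource(source);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(key, compile(source)).first;
        }
        return it->second;
    }
    int misses() const { return misses_; }

private:
    std::unordered_map<std::string, Bytecode> cache_;
    int misses_ = 0;
};
```

Any edit that changes the actual tokens still misses the cache, so this only saves time on cosmetic edits or when the user reverts to an earlier version.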

Another option is using extra threads to compile the code in the background after each change, to minimize the delay that the user sees.

#9 Aardappel   Members   -  Reputation: 100


Posted 21 January 2011 - 04:19 PM


Thanks for trying that out. Yes, the version I sent was already optimized somewhat from the 15 second version, and takes 5.8 seconds to compile on mine.

Yeah, a cache won't work, as any part of the code may change from run to run, so it will never hit the cache. If DirectCompute had some form of "linking", I could compile parts of the code that never change separately, but I don't think that's possible either.

Seeing as how none of the compiler flags affect compilation speed, it is clearly ignoring my request not to try to optimize those loops.

#10 DieterVW   Members   -  Reputation: 700


Posted 21 January 2011 - 05:14 PM

Some of the loop analysis can't be turned off regardless of the flags you specify. Anyway, compiler performance is being worked on for a future release, as it has major issues, especially for compute shaders. At the moment you won't be able to eke out better performance through anything other than HLSL code changes to your algorithm, which may or may not be possible.

#11 Aardappel   Members   -  Reputation: 100


Posted 21 January 2011 - 05:38 PM

That is good to hear. Making the compiler obey flags of cheap optimization (just basic constant folding & inlining) would be fantastic.

Rather than reducing the amount of code, my project was planned to involve extending this code significantly. I guess I will have to implement it some other way, as this clearly will never get me fast turnarounds.

Does anyone have experience with OpenCL having a fast minimal optimization mode? I'd prefer to avoid OpenCL if I can, it appears pretty messy compared to DirectCompute. I can't use CUDA-C, since I'm on an ATI chip. Either that, or it is back to CPU code for me just for turn-around's sake :(

#12 Jason Z   Crossbones+   -  Reputation: 5163


Posted 22 January 2011 - 12:52 PM


I've never tried it, but could your program benefit from the dynamic linkage portion of D3D11? If your program can be chopped up into sections, then you could utilize the dynamic linkage to simply replace portions of the program when they are edited. This would require defining interfaces in your program for certain sections of it, but it could eliminate the expensive recompilation portion that is eating up so much time...

#13 Aardappel   Members   -  Reputation: 100


Posted 22 January 2011 - 04:39 PM

The DynamicShaderLinkage11 sample uses #include to compile the interfaces/classes and the code that uses them in a single D3DX11CompileFromFile call, which probably wouldn't speed things up much. So the question is, can I link a class instance defined in one separately compiled blob to my main shader in another blob?

I can set an ID3D11ClassInstance through CSSetShader. I obtain such a pointer using GetClassInstance() on an ID3D11ClassLinkage, which can then find any instances in code passed through CreateComputeShader with that ID3D11ClassLinkage as an argument.

However, I don't think I can call CreateComputeShader / D3DX11CompileFromFile on a file that only contains interfaces and classes, because they assume an entry point? I could add a dummy entry point, but something tells me ID3D11ClassLinkage was not intended to work that way (referring to a shader that's not bound to the pipeline at all)... or am I missing something?

And even if that works, the sample documentation says:

"The Direct3D 11 runtime efficiently links each of the selected methods at source level, inlining and optimizing the shader code as much as possible to provide an optimal shader for the GPU to execute."

Which makes sense, because for PS use no-one would use interfaces if they were slower than the equivalent inlined code. So does that mean that potentially part of all that slowness would return at CSSetShader() time?

#14 DieterVW   Members   -  Reputation: 700


Posted 22 January 2011 - 09:54 PM

The linking will only help you if the code can be written and compiled once. If you're letting the user write new code directly then this won't work. Otherwise the functionality could be split across a large number of classes, which can be bound in any arbitrary order from the runtime to meet the currently requested functionality. The shader will have to have all the code available when it compiles, but you will have the ability to change how the shader operates from the runtime side. Perhaps it would help to know more about your tool/application.

#15 Adam_42   Crossbones+   -  Reputation: 2568


Posted 23 January 2011 - 04:25 AM

There is one way I've spotted that you could effectively split some of that shader into multiple pieces that you can 'link' together: convert some of the functions to texture lookups. For example, you could try creating a 2D texture which you sample to get the result of randint(x, y).
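
A minimal C++ sketch of how the contents of such a texture could be precomputed on the CPU. It ports the shader's rand()/randint() and fills a square table indexed by integer coordinates. The table size, the integer-grid assumption and the function names are hypothetical, and GPU float precision will differ from the CPU's, so the values are not guaranteed to match the shader bit for bit. Uploading the table as a texture and wrapping coordinates in the shader is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU port of the shader's rand():
//   frac(sin(dot(co, float2(12.9898, 78.233))) * 43758.5453)
float shaderRand(float x, float y) {
    const float v = std::sin(x * 12.9898f + y * 78.233f) * 43758.5453f;
    return v - std::floor(v);  // same definition as HLSL frac()
}

// Builds a size x size table with table[y * size + x] = randint(x, y),
// i.e. int(rand * 256) clamped to the range of one byte.
std::vector<std::uint8_t> buildRandintTable(int size) {
    assert(size > 0);
    const std::size_t n = static_cast<std::size_t>(size);
    std::vector<std::uint8_t> table(n * n);
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            const int r = static_cast<int>(
                shaderRand(static_cast<float>(x), static_cast<float>(y)) * 256.0f);
            // Clamp in case rounding pushes the fractional part up to 1.0.
            table[static_cast<std::size_t>(y) * n + static_cast<std::size_t>(x)] =
                static_cast<std::uint8_t>(std::min(std::max(r, 0), 255));
        }
    }
    return table;
}
```

In the posted shader, Hash() is only called with Pi + simplex[i], which are whole numbers, so an integer-indexed table is a plausible fit. The coordinates are unbounded though, so the shader would have to wrap them into the table's range.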

#16 Aardappel   Members   -  Reputation: 100


Posted 23 January 2011 - 12:02 PM

DieterVW wrote: "The linking will only help you if the code can be written and compiled once. If you're letting the user write new code directly then this won't work. Otherwise the functionality could be split across a large number of classes, which can be bound in any arbitrary order from the runtime to meet the currently requested functionality. The shader will have to have all the code available when it compiles, but you will have the ability to change how the shader operates from the runtime side. Perhaps it would help to know more about your tool/application."

The tool is an "implicit function modeller". The idea is to compositionally build complex models out of code rather than by traditional modelling. The user writes code (at the moment in HLSL, but it could be a more friendly special-purpose language in the future that gets translated to HLSL on the fly), and the app shows what that looks like in 3D at the press of a key. To do so, it evaluates the function at however many locations in 3D space are needed for a good-looking marching cubes mesh to result (the loops in the above code, for example, sample the function many times to get an accurate location of the isosurface on an edge, and anti-aliased color sampling).
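
The edge refinement described here is a plain bisection, and a CPU-side C++ sketch of it may make the shader's first loop easier to follow. It mirrors the 'subdivs' loop in main2: 'a' starts inside the solid, 'b' outside, and each step halves the interval. Vec3, bisectEdge and the unit-sphere test function are hypothetical stand-ins; the real tool evaluates the user's HLSL function instead.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

inline Vec3 midpoint(const Vec3& a, const Vec3& b) {
    return {(a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f};
}

// Locates the isosurface crossing on the edge a-b by bisection.
// Precondition: 'a' is inside the solid and 'b' is outside.
// After 'subdivs' steps the interval is 2^-subdivs of its original length.
Vec3 bisectEdge(Vec3 a, Vec3 b,
                const std::function<bool(const Vec3&)>& solid,
                int subdivs = 10) {
    assert(solid(a) && !solid(b));
    for (int i = 0; i < subdivs; ++i) {
        const Vec3 mid = midpoint(a, b);
        if (solid(mid)) a = mid;  // crossing lies between mid and b
        else            b = mid;  // crossing lies between a and mid
    }
    return midpoint(a, b);
}

// Hypothetical implicit function for illustration: a unit sphere at the origin.
inline bool insideUnitSphere(const Vec3& p) {
    return p.x * p.x + p.y * p.y + p.z * p.z < 1.0f;
}
```

Each extra subdivision adds one more evaluation of the full noise function per edge, which is why lowering 'subdivs' trades accuracy for less code to evaluate.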

One thing I could do for a preview is a version that has fewer loops and thus looks uglier.

So the HLSL code changes constantly, and moreover, it can become quite big if someone tries to model something complex (say a building, using many "union" operators). So compile time is critical, and would only get worse from this small example.

One way I could imagine hiding compile time latency is to continuously compile the code the user is editing, and keep the last one that compiled without errors, but that is still pointless if the compile takes several minutes.
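
A minimal C++ sketch of that idea using std::async: edits are submitted as they happen, only the newest source is kept while a compile is running, and the last result that compiled without errors stays available for rendering. BackgroundCompiler, CompileResult and the callback type are hypothetical, and the sketch assumes the real compile call can safely run on a worker thread. It hides latency but does nothing to shorten the compile itself.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <string>
#include <utility>

struct CompileResult {
    bool ok;               // false if the compiler reported errors
    std::string bytecode;  // stand-in for the compiled shader blob
};

class BackgroundCompiler {
public:
    using CompileFn = std::function<CompileResult(const std::string&)>;

    explicit BackgroundCompiler(CompileFn compile) : compile_(std::move(compile)) {
        assert(compile_);
    }

    // Call on every edit. Only the most recent source is remembered.
    void submit(std::string source) {
        pending_ = std::move(source);
        hasPending_ = true;
    }

    // Call regularly from the UI thread; it never blocks.
    void poll() {
        if (running_.valid() &&
            running_.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            CompileResult r = running_.get();  // running_ becomes invalid here
            if (r.ok) lastGood_ = std::move(r.bytecode);  // failed compiles are dropped
        }
        if (!running_.valid() && hasPending_) {
            hasPending_ = false;
            running_ = std::async(std::launch::async, compile_, std::move(pending_));
        }
    }

    bool busy() const { return running_.valid() || hasPending_; }
    const std::string& lastGood() const { return lastGood_; }

private:
    CompileFn compile_;
    std::future<CompileResult> running_;
    std::string pending_;
    bool hasPending_ = false;
    std::string lastGood_;
};
```

The app would call submit() from the editor's change handler and poll() once per frame, drawing with lastGood() until a newer successful compile arrives.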

#17 Aardappel   Members   -  Reputation: 100


Posted 23 January 2011 - 12:17 PM

Adam_42 wrote: "There is one way I've spotted that you could effectively split some of that shader into multiple pieces that you can 'link' together: convert some of the functions to texture lookups. For example, you could try creating a 2D texture which you sample to get the result of randint(x, y)."


Yeah, there are specific code size optimizations I could do, but that doesn't help me in the general case (see previous post). Besides, replacing randint by the number 42 does not change compile time much.



