Aardappel

DX11
D3DCompile() way too slow

16 posts in this topic

So I am doing some implicit function evaluation with Perlin noise with DirectCompute (on DX11 hw, cs_5_0 profile), and the whole thing takes about 2 minutes to complete. I was starting to get a bit disappointed with the speed of my GPU, until I found out that 99% of that time wasn't spent in the shader, but in... D3DCompile()

My code is only about 5 KB of source, and thru experimentation I have found out that loops have the biggest influence on compile time: it is as if compile time is exponential in the size of a version of the code with all loops unrolled and function calls inlined. By reducing the loop counts where I could, I got the compile time down to 1 minute.

Tagging all the loops with [loop] had no effect, and neither did the D3D10_SHADER_SKIP_OPTIMIZATION or D3D10_SHADER_OPTIMIZATION_LEVEL0 flags. The only thing that had an impact was [fastopt], which reduced compilation time about 4x, down to 15 seconds.

Thing is, I can't precompile this code to a blob, because the whole point of my program is to interactively change the code and see the result instantly (or at least fast, failing that). It is a bit ridiculous if the user has to wait 16 seconds to see the new result, of which only 1 second was actual computation.

How do I speed this up? I don't mind a lower optimisation level as clearly that will only ever be a win for me.

I have written optimizing compilers myself, and I know of no compiler that would take longer than a second on such a small amount of code (on modern hw). Certainly optimizing HLSL is trivial compared to C/C++, because there are no aliasing problems etc. What the hell can it possibly be spending all that time on?

Oh, and I have searched Google and this forum for solutions, but so far haven't found any (e.g. [url]http://www.gamedev.net/topic/586104-hlsl-loops-cause-slow-startup/page__p__4725295__hl__d3dcompile__fromsearch__1#entry4725295[/url] and [url]http://www.gamedev.net/topic/556523-hlsl-never-ever-unroll/page__p__4574352__hl__d3dcompile__fromsearch__1#entry4574352[/url])

I'd consider posting your shader source if you're not trying to keep it secret, maybe there is some way it can be made simpler or cleaner to help the compiler figure it out faster.

Try the [fastopt] attribute on your loops. This should short-circuit the loop simulator in the compiler, which would otherwise take time exponential in the number of nested loops.

Docs for this are [url="http://msdn.microsoft.com/en-us/library/bb509602%28v=vs.85%29.aspx"]here[/url]

[quote name='karwosts' timestamp='1295543648' post='4761929']
I'd consider posting your shader source if you're not trying to keep it secret, maybe there is some way it can be made simpler or cleaner to help the compiler figure it out faster.
[/quote]

I don't mind posting it, but like I said, I already experimented with commenting out code to see what effect it has on compilation speed, and it is purely due to for-loops (which I can't remove) and the amount of code in general. I can probably optimize the code more, but I want compilation to be a lot faster, not just a bit.

I guess I am hoping for some secret compiler flags that tone down whatever crazy stuff it is doing?

[code]

cbuffer consts
{
    float4 sb;
    float4 grad[16];
};

float4 Col(float3 c, float inside) { return float4(c, inside); }
bool Solid(float4 c) { return c.w > 0.999f; }

float rand(float2 co)
{
    return frac(sin(dot(co.xy, float2(12.9898, 78.233))) * 43758.5453);
}

int randint(float2 co) { return int(rand(co) * 256); }

int Hash( float3 P )
{
    return randint(P.xy) ^ randint(P.yz) ^ randint(P.zx);
}

float Snoise3D( float3 P )
{
    const float F3 = 0.333333333333;
    const float G3 = 0.166666666667;

    float s = dot( P, F3 );
    float3 Pi = floor( P + s );
    float t = dot( Pi, G3 );

    float3 P0 = Pi - t;
    float3 Pf0 = P - P0;

    float3 simplex[4];
    float3 T = Pf0.xzy >= Pf0.yxz;
    simplex[0] = 0;
    simplex[1] = T.xzy > T.yxz;
    simplex[2] = T.yxz <= T.xzy;
    simplex[3] = 1;

    float n = 0;

    [loop][fastopt]
    for (int i = 0; i<4; i++)
    {
        float3 Pf = Pf0 - simplex[i] + G3 * i;
        int h = Hash( Pi + simplex[i] );
        float d = saturate( 0.6f - dot( Pf, Pf ) );
        d *= d;
        n += d * d * dot((float3)grad[ h & 15 ], Pf);
    }

    return 32.0f * n;
}

float Turbulence3D( float3 p )
{
    float res = 0;
    float fact = 1;
    float scale = 1;
    float weight = 0;
    [loop][fastopt]
    for (int i = 0; i<4; i++)
    {
        res += fact * Snoise3D( p * scale );
        weight += fact;
        fact /= 2.5;
        scale *= 2.5;
    }
    return res / weight;
}

float4 Fun(float3 p)
{
    p = (p - sb.y) / sb.x;

    return Col(0.5, Turbulence3D(p) < -0.3);
}

struct BufferStruct2
{
    float4 a;
    float4 b;
};

RWStructuredBuffer<BufferStruct2> g_OutBuff2 : register( u1 );

[numthreads(64, 1, 1)]
void main2( uint3 threadIDInGroup : SV_GroupThreadID,
            uint3 groupID : SV_GroupID,
            uint groupIndex : SV_GroupIndex,
            uint3 dispatchThreadID : SV_DispatchThreadID )
{
    BufferStruct2 e = g_OutBuff2[dispatchThreadID.x];
    float3 a = (float3)e.a;
    float3 b = (float3)e.b;

    int subdivs = 10;
    [loop][fastopt]
    for (int i = 0; i < subdivs; i++)
    {
        float3 mid = (a + B) / 2;
        float4 col = Fun(mid);
        if (Solid(col)) a = mid;
        else b = mid;
    }

    float3 pos = (a + B) / 2;


    float3 col = float3(0, 0, 0);
    int n = 0;
    const float off = 0.6f;
    [loop][fastopt]
    for (float x = -off; x < off + 0.1f; x += off)
        [loop][fastopt]
        for (float y = -off; y < off + 0.1f; y += off)
            [loop][fastopt]
            for (float z = -off; z < off + 0.1f; z += off)
            {
                float4 acol = Fun(float3(x, y, z) + pos);
                col += (float3)(acol * acol.w);
                n += acol.w;
            }
    if (n > 1) col /= n;

    e.a = float4(pos, 0);
    e.b = float4(col, 0);
    g_OutBuff2[dispatchThreadID.x] = e;
}
[/code]
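
To put a rough number on the "all loops unrolled and function calls inlined" worst case, here is a small host-side C++ sketch (not part of the original program) that mirrors the loop structure of the shader above and counts how many copies of each body such a compiler would have to process. The struct and function names are made up for illustration, and whether the HLSL compiler really expands the code this way is only the speculation from the first post.

```cpp
// Counts how many times each body would be duplicated if every loop in the
// posted shader were fully unrolled and every call inlined. Loop bounds are
// copied from the HLSL above; nothing here calls into Direct3D.
struct UnrollCounts {
    int funCalls = 0;      // inlined copies of Fun()
    int snoiseCalls = 0;   // inlined copies of Snoise3D()
    int snoiseBodies = 0;  // copies of the inner Snoise3D loop body
    int sinCalls = 0;      // rand() evaluations (Hash calls randint 3 times)
};

inline void CountFun(UnrollCounts& c) {
    ++c.funCalls;
    for (int octave = 0; octave < 4; ++octave) {      // Turbulence3D loop
        ++c.snoiseCalls;
        for (int corner = 0; corner < 4; ++corner) {  // Snoise3D loop
            ++c.snoiseBodies;
            c.sinCalls += 3;
        }
    }
}

inline UnrollCounts CountUnrolledShader() {
    UnrollCounts c;

    // Bisection loop in main2: one Fun() call per subdivision.
    const int subdivs = 10;
    for (int i = 0; i < subdivs; ++i) CountFun(c);

    // Triple nested colour-sampling loop, same float bounds as the shader.
    const float off = 0.6f;
    for (float x = -off; x < off + 0.1f; x += off)
        for (float y = -off; y < off + 0.1f; y += off)
            for (float z = -off; z < off + 0.1f; z += off)
                CountFun(c);

    return c;
}
```

With the posted bounds, CountUnrolledShader() reports 37 inlined copies of Fun (10 from the bisection loop, 27 from the triple loop), 592 copies of the Snoise3D loop body and 1776 sin evaluations, which is a lot of straight-line code for 5 KB of source.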

I don't know about any "secret compiler flags", but if [fastopt] doesn't work for you then you can always just specify a lower optimization level. I *think* the default is D3D10_SHADER_OPTIMIZATION_LEVEL1, and you can specify 0, 2, or 3. You can also just disable optimizations altogether.

[quote name='DieterVW' timestamp='1295548250' post='4761956']
Try the [fastopt] attribute on your loops. This should short-circuit the loop simulator in the compiler, which would otherwise take time exponential in the number of nested loops.

Docs for this are [url="http://msdn.microsoft.com/en-us/library/bb509602%28v=vs.85%29.aspx"]here[/url]
[/quote]
Thanks, but if you see my original post, I am already using that. It helps some, but not enough.

[quote name='MJP' timestamp='1295561408' post='4762061']
I don't know about any "secret compiler flags", but if [fastopt] doesn't work for you then you can always just specify a lower optimization level. I *think* the default is D3D10_SHADER_OPTIMIZATION_LEVEL1, and you can specify 0, 2, or 3. You can also just disable optimizations altogether.
[/quote]

Thanks, but as I said in my original post, I tried both D3D10_SHADER_OPTIMIZATION_LEVEL0 and D3D10_SHADER_SKIP_OPTIMIZATION, and neither appears to have any effect on compilation time.

On my PC (Core i7 920) pasting that code into ATI's GPU Shader Analyzer and compiling as cs_5_0 (and changing 'B' to 'b' to make it compile) gives me a compile time of about 5 seconds, regardless of settings. If it's taking you 15 seconds a CPU upgrade might help, or you're compiling it more than once...

This drops down to almost instant if I take out the code in the middle of those triple nested loops at the bottom, so that's clearly the slow bit to compile. Unfortunately playing with that code I was unable to noticeably speed up compilation time, without removing code.

You might want to consider some sort of cache of compilation results, but that's not easy if you want to ignore changes to the source that won't affect the compiled result like adding whitespace.
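
As a rough illustration of such a cache, the C++ sketch below keys compiled blobs on a whitespace-normalized copy of the source. Everything in it is an assumption for illustration: `ShaderCache`, `NormalizeSource` and the `CompileFn` callback are hypothetical names, and the callback stands in for whatever actually calls D3DCompile(). The normalization only collapses spaces and tabs within a line and drops blank lines (newlines are kept because preprocessor directives are line-sensitive), so "a+b" and "a + b" still count as different sources.

```cpp
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Blob = std::vector<std::uint8_t>;

// Collapse runs of spaces/tabs to one space, trim each line, drop blank lines.
inline std::string NormalizeSource(const std::string& src) {
    std::istringstream in(src);
    std::string line, out;
    while (std::getline(in, line)) {
        std::string packed;
        bool pendingSpace = false;
        for (char ch : line) {
            if (ch == ' ' || ch == '\t' || ch == '\r') {
                pendingSpace = !packed.empty();  // ignore leading whitespace
            } else {
                if (pendingSpace) packed += ' ';
                pendingSpace = false;
                packed += ch;
            }
        }
        if (packed.empty()) continue;  // blank line
        if (!out.empty()) out += '\n';
        out += packed;
    }
    return out;
}

// Hypothetical cache: compiles only when the normalized source is new.
class ShaderCache {
public:
    using CompileFn = std::function<Blob(const std::string&)>;

    explicit ShaderCache(CompileFn compile) : compile_(std::move(compile)) {}

    const Blob& Get(const std::string& source) {
        const std::string key = NormalizeSource(source);
        auto it = blobs_.find(key);
        if (it == blobs_.end()) {
            it = blobs_.emplace(key, compile_(source)).first;
        }
        return it->second;
    }

private:
    CompileFn compile_;
    std::unordered_map<std::string, Blob> blobs_;
};
```

A cache like this only pays off when the user returns to a previously compiled revision (undo, toggling a parameter back); it does nothing for a genuinely new edit.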

Another option is using extra threads to compile the code in the background after each change, to minimize the delay that the user sees.

[quote name='Adam_42' timestamp='1295610045' post='4762339']
On my PC (Core i7 920) pasting that code into ATI's GPU Shader Analyzer and compiling as cs_5_0 (and changing 'B' to 'b' to make it compile) gives me a compile time of about 5 seconds, regardless of settings. If it's taking you 15 seconds a CPU upgrade might help, or you're compiling it more than once...

This drops down to almost instant if I take out the code in the middle of those triple nested loops at the bottom, so that's clearly the slow bit to compile. Unfortunately playing with that code I was unable to noticeably speed up compilation time, without removing code.

You might want to consider some sort of cache of compilation results, but that's not easy if you want to ignore changes to the source that won't affect the compiled result like adding whitespace.

Another option is using extra threads to compile the code in the background after each change, to minimize the delay that the user sees.
[/quote]

Thanks for trying that out. Yes, the version I sent was already optimized somewhat from the 15 second version, and takes 5.8 seconds to compile on mine.

Yeah, a cache won't work, as any part of the code may change from run to run, so it will never hit the cache. If DirectCompute had some form of "linking", I could compile parts of the code that never change separately, but I don't think that's possible either.

Seeing as how none of the compiler flags affect compilation speed, it is clearly ignoring my request not to try to optimize those loops.

Some of the loop analysis can't be turned off regardless of the flags you specify. Anyway, compiler performance is being worked on for a future release, as it has major issues, especially for compute shaders. At the moment you won't be able to eke out better performance through anything other than HLSL code changes to your algorithm, which may or may not be possible.

[quote name='DieterVW' timestamp='1295651691' post='4762706']
Some of the loop analysis can't be turned off regardless of the flags you specify. Anyway, compiler performance is being worked on for a future release, as it has major issues, especially for compute shaders. At the moment you won't be able to eke out better performance through anything other than HLSL code changes to your algorithm, which may or may not be possible.
[/quote]
That is good to hear. Making the compiler obey flags of cheap optimization (just basic constant folding & inlining) would be fantastic.

Rather than reducing the amount of code, my project was planned to involve extending this code significantly. I guess I will have to implement it some other way, as this clearly will never get me fast turnarounds.

Does anyone have experience with OpenCL having a fast minimal optimization mode? I'd prefer to avoid OpenCL if I can, it appears pretty messy compared to DirectCompute. I can't use CUDA-C, since I'm on an ATI chip. Either that, or it is back to CPU code for me just for turn-around's sake :(

[quote name='Aardappel' timestamp='1295653086' post='4762721']
[quote name='DieterVW' timestamp='1295651691' post='4762706']
Some of the loop analysis can't be turned off regardless of the flags you specify. Anyway, compiler performance is being worked on for a future release, as it has major issues, especially for compute shaders. At the moment you won't be able to eke out better performance through anything other than HLSL code changes to your algorithm, which may or may not be possible.
[/quote]
That is good to hear. Making the compiler obey flags of cheap optimization (just basic constant folding & inlining) would be fantastic.

Rather than reducing the amount of code, my project was planned to involve extending this code significantly. I guess I will have to implement it some other way, as this clearly will never get me fast turnarounds.

Does anyone have experience with OpenCL having a fast minimal optimization mode? I'd prefer to avoid OpenCL if I can, it appears pretty messy compared to DirectCompute. I can't use CUDA-C, since I'm on an ATI chip. Either that, or it is back to CPU code for me just for turn-around's sake :(
[/quote]
I've never tried it, but could your program benefit from the dynamic linkage portion of D3D11? If your program can be chopped up into sections, then you could utilize the dynamic linkage to simply replace portions of the program when they are edited. This would require defining interfaces in your program for certain sections of it, but it could eliminate the expensive recompilation portion that is eating up so much time...

[quote name='Jason Z' timestamp='1295722354' post='4763082']
I've never tried it, but could your program benefit from the dynamic linkage portion of D3D11? If your program can be chopped up into sections, then you could utilize the dynamic linkage to simply replace portions of the program when they are edited. This would require defining interfaces in your program for certain sections of it, but it could eliminate the expensive recompilation portion that is eating up so much time...
[/quote]

The DynamicShaderLinkage11 sample uses #include to compile the interfaces/classes and the code that uses them in a single D3DX11CompileFromFile call, which probably wouldn't speed things up much. So the question is, can I link a class instance defined in one separately compiled blob to my main shader in another blob?

I can set an ID3D11ClassInstance thru CSSetShader. I obtain such a pointer using GetClassInstance() on an ID3D11ClassLinkage, which can then find any instances in code passed thru CreateComputeShader with that ID3D11ClassLinkage as an argument.

However, I don't think I can call CreateComputeShader / D3DX11CompileFromFile on a file that only contains interfaces and classes, because they assume an entry point? I could add a dummy entry point, but something tells me ID3D11ClassLinkage was not intended to work that way (referring to a shader that's not bound to the pipeline at all)... or am I missing something?

And even if that works, the sample documentation says:

"The Direct3D 11 runtime efficiently links each of the selected methods at source level, inlining and optimizing the shader code as much as possible to provide an optimal shader for the GPU to execute."

Which makes sense, because for PS use no-one would use interfaces if they were slower than the equivalent inlined code. So does that mean that potentially part of all that slowness would return at CSSetShader() time?

The linking will only help you if the code can be written and compiled once. If you're letting the user write new code directly then this won't work. Otherwise the functionality could be split across a large number of classes, which can be bound in any arbitrary order from the runtime to meet the currently requested functionality. The shader will have to have all the code available when it compiles, but you will have the ability to change how the shader operates from the runtime side. Perhaps it would help to know more about your tool/application.

There is one way I've spotted that you could effectively split some of that shader into multiple pieces that you can 'link' together: convert some of the functions to texture lookups. For example, you could try creating a 2D texture which you sample to calculate the results of randint(x, y).
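
A minimal host-side sketch of what baking randint() into a table could look like is below. It is only an illustration under several assumptions: the helper names are made up, the table covers integer lattice points 0..size-1 (so the shader would have to wrap its coordinates), and CPU sin() will not match the GPU's sin() bit for bit, so the values differ from what the original shader computes. Uploading the bytes as a texture is left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU port of the shader's rand(): frac(sin(dot(co, k)) * 43758.5453).
inline float Rand2(float x, float y) {
    const float v = std::sin(x * 12.9898f + y * 78.233f) * 43758.5453f;
    return v - std::floor(v);  // frac()
}

// Hypothetical helper: evaluates randint() on a size x size integer lattice
// and returns the results as a row-major byte table (one byte per texel).
inline std::vector<std::uint8_t> BuildRandintTable(int size = 256) {
    std::vector<std::uint8_t> table(static_cast<std::size_t>(size) * size);
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            const int v = static_cast<int>(
                Rand2(static_cast<float>(x), static_cast<float>(y)) * 256.0f);
            // Clamp in case float rounding pushes frac() up to exactly 1.0.
            table[static_cast<std::size_t>(y) * size + x] =
                static_cast<std::uint8_t>(std::min(std::max(v, 0), 255));
        }
    }
    return table;
}
```

The shader side would then replace the sin-based randint() with a single load from this table, trading ALU code the compiler has to chew through for a memory fetch.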

[quote name='DieterVW' timestamp='1295754885' post='4763282']
The linking will only help you if the code can be written and compiled once. If you're letting the user write new code directly then this won't work. Otherwise the functionality could be split across a large number of classes, which can be bound in any arbitrary order from the runtime to meet the currently requested functionality. The shader will have to have all the code available when it compiles, but you will have the ability to change how the shader operates from the runtime side. Perhaps it would help to know more about your tool/application.
[/quote]
The tool is an "implicit function modeller". The idea is to compositionally build complex models out of code rather than traditional modelling. The user writes code (at the moment in HLSL, but it could be a more friendly special-purpose language in the future that gets translated to HLSL on the fly), and the app shows what that looks like in 3D at the press of a key. To do so, it evaluates the function for however many locations in 3D space such that a good-looking marching cubes mesh results (the loops in the above code, for example, sample the function many times to get an accurate location of the isosurface on an edge, and anti-aliased color sampling).

One thing I could do for a preview is a version that has fewer loops and thus looks uglier.

So the HLSL code changes constantly, and moreover, can become quite big if someone tries to model something complex (say a building, using many "union" operators). So compile time is critical, and would only get worse from this small example.

One way I could imagine hiding compile time latency is to continuously compile the code the user is editing, and keep the last one that compiled without errors, but that is still pointless if the compile takes several minutes.
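
The "keep the last one that compiled without errors" idea could look roughly like the C++17 sketch below. It is an assumption-laden illustration, not the tool's code: `LiveCompiler` and its methods are hypothetical names, and the `CompileFn` callback stands in for a wrapper around D3DCompile() that returns an empty optional on a compile error.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <future>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Blob = std::vector<std::uint8_t>;

// Compiles revisions on a worker thread and remembers the last good result.
class LiveCompiler {
public:
    // Returns the compiled blob, or std::nullopt if compilation failed.
    using CompileFn = std::function<std::optional<Blob>(std::string)>;

    explicit LiveCompiler(CompileFn compile) : compile_(std::move(compile)) {}

    // Starts compiling 'source' in the background. Returns false if a
    // compile is still in flight; the caller can resubmit the newest text
    // once Busy() is false again.
    bool Submit(std::string source) {
        if (pending_.valid()) return false;
        pending_ = std::async(std::launch::async, compile_, std::move(source));
        return true;
    }

    bool Busy() const { return pending_.valid(); }

    // Non-blocking; call once per frame. Returns true when a new blob that
    // compiled without errors has replaced the previous one.
    bool Poll() {
        if (!pending_.valid()) return false;
        if (pending_.wait_for(std::chrono::seconds(0)) !=
            std::future_status::ready) {
            return false;
        }
        std::optional<Blob> result = pending_.get();
        if (!result) return false;  // compile error: keep the old blob
        lastGood_ = std::move(*result);
        return true;
    }

    const std::optional<Blob>& LastGood() const { return lastGood_; }

private:
    CompileFn compile_;
    std::future<std::optional<Blob>> pending_;
    std::optional<Blob> lastGood_;
};
```

This hides latency only while a compile takes about as long as the pause between edits; it does not make a multi-second compile any shorter, and destroying the object waits for any compile still in flight.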

[quote name='Adam_42' timestamp='1295778348' post='4763343']
There is one way I've spotted that you could effectively split some of that shader into multiple pieces that you can 'link' together: convert some of the functions to texture lookups. For example, you could try creating a 2D texture which you sample to calculate the results of randint(x, y).
[/quote]

Yeah, there are specific code size optimizations I could do, but that doesn't help me in the general case (see previous post). Besides, replacing randint by the number 42 does not change compile time much.