DirectX11 performance problems


9 replies to this topic

#1 David_pb   Members   -  Reputation: 668


Posted 05 July 2013 - 01:20 PM

Hi,

 

I'm currently working on a DirectX11 port of an old DX9 renderer and I'm facing the problem that DX11 seems really slow in comparison to the old code. I've already checked all the 'best practice' slides which are available around the internet (such as avoiding creating new resources at runtime, the update frequency for constant buffers, etc.), but nothing seems to be a real problem. Other engines I checked are much more careless in most of these cases but don't seem to have similar problems.

 

Profiling shows that the code is highly CPU bound, since the GPU seems to be starving. GPUView confirms this: the CPU queue is empty most of the time and only occasionally gets a packet pushed onto it. The weird thing is that the main thread isn't stalling but is active nearly the whole time. VTune shows that most of the samples are taken in DirectX API calls, which are taking far too much time (the main bottlenecks seem to be DrawIndexed/DrawIndexedInstanced, Map and IASetVertexBuffers).

 

The next thing I thought about was sync points, but the only source I can imagine is the update of the constant buffers, of which there are quite a few per frame. What I'm essentially doing is caching the shader constants in a CPU-side buffer and pushing the whole memory chunk into my constant buffers. The buffers are all dynamic and are mapped with 'discard'. I also tried creating 'default' buffers and updating them with UpdateSubresource, as well as a mix of both ('per frame' buffers dynamic and the rest default), but this resulted in equal performance.
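The caching scheme described above can be modelled in portable C++ roughly as follows. This is only a sketch: `ConstantCache` is a made-up name, and the actual ID3D11DeviceContext::Map(DISCARD)/Unmap calls are elided, represented here by a plain destination pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of the pattern described above: shader constants are
// written into a CPU-side shadow buffer, and the whole chunk is pushed into
// the GPU constant buffer in one go (via Map with DISCARD in the real
// renderer, elided here).
class ConstantCache {
public:
    explicit ConstantCache(size_t sizeInBytes)
        : shadow_(sizeInBytes, 0), dirty_(false) {}

    // Write a value into the shadow copy; only marks the buffer dirty,
    // no GPU work happens here.
    template <typename T>
    void set(size_t byteOffset, const T& value) {
        std::memcpy(shadow_.data() + byteOffset, &value, sizeof(T));
        dirty_ = true;
    }

    // In the real code this would be: Map(DISCARD), memcpy the whole chunk,
    // Unmap. Returning false lets the caller skip the Map entirely when
    // nothing changed since the last flush.
    bool flush(void* mappedPtr) {
        if (!dirty_) return false;
        std::memcpy(mappedPtr, shadow_.data(), shadow_.size());
        dirty_ = false;
        return true;
    }

    size_t size() const { return shadow_.size(); }

private:
    std::vector<uint8_t> shadow_;
    bool dirty_;
};
```

The dirty flag is the relevant detail for the Map bottleneck mentioned above: buffers whose contents did not change between draws never need to be mapped at all.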

 

The weird thing is that the old DX9 renderer produces much better results with the same render code. Maybe somebody has experienced similar behaviour and can give me a hint.

 

Cheers

David


@D13_Dreinig


#2 Alessio1989   Members   -  Reputation: 2133


Posted 05 July 2013 - 02:59 PM

Make sure to set the proper usage flags on each type of resource (i.e. always use the immutable or default usage when possible, and use the dynamic usage instead of the staging usage for resources that must be written by the CPU).
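As a rough sketch of that advice, the two common cases might be described like this (buffer descriptions only; the sizes are placeholders):

```cpp
// Geometry that never changes after creation: IMMUTABLE, initialized once
// with the creation-time data, never written by the CPU again.
D3D11_BUFFER_DESC vbDesc = {};
vbDesc.ByteWidth = vertexDataSize;          // placeholder size
vbDesc.Usage     = D3D11_USAGE_IMMUTABLE;
vbDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

// A constant buffer rewritten by the CPU every frame: DYNAMIC with CPU
// write access, not STAGING (staging resources cannot be bound to the
// pipeline at all).
D3D11_BUFFER_DESC cbDesc = {};
cbDesc.ByteWidth      = 16 * sizeof(float); // placeholder size, must be a multiple of 16
cbDesc.Usage          = D3D11_USAGE_DYNAMIC;
cbDesc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
cbDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
```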


"Software does not run in a magical fairy aether powered by the fevered dreams of CS PhDs"


#3 David_pb   Members   -  Reputation: 668


Posted 05 July 2013 - 03:14 PM

Make sure to set the proper usage flags on each type of resource (i.e. always use the immutable or default usage when possible, and use the dynamic usage instead of the staging usage for resources that must be written by the CPU).

 

Yeah, I've already ensured that; I'm quite certain the problem comes from somewhere else.


@D13_Dreinig

#4 Adam_42   Crossbones+   -  Reputation: 2619


Posted 06 July 2013 - 06:19 AM

For each of those expensive functions, how many times per frame do you call them?

 

Have you considered using multi threaded draw calls?

 

You might also want to take a look at http://fgiesen.wordpress.com/2013/03/05/mopping-up/. The whole series is interesting, but that post in particular has a significant CPU optimization trick for constant buffers.


Edited by Adam_42, 06 July 2013 - 06:19 AM.


#5 David_pb   Members   -  Reputation: 668


Posted 06 July 2013 - 01:03 PM

Thanks for the answer, Adam. As for the first question, that's hard to say; in the worst case the functions are called around 2000-4000 times per frame. I actually thought about multithreading, but dropped the idea since I'm heavily bound to the available interface and there is currently no time for bigger changes there. Maybe in the future this could be an option. Thanks for the interesting link though, I'll check it.


@D13_Dreinig

#6 Adam_42   Crossbones+   -  Reputation: 2619


Posted 06 July 2013 - 07:43 PM

The reason I asked for the number of calls was more for the ratios between the different call types than for the absolute values. You can easily get those numbers by adding some global variables which you increment on each call and reset once per frame. Examine the values just before the reset to get your results.
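A minimal sketch of those per-frame counters might look like this (all names here are made up, and the real API calls are elided):

```cpp
#include <cassert>
#include <cstdint>

// Per-frame API call counters, incremented in thin wrappers around each
// expensive call and reset once per frame.
struct FrameStats {
    uint32_t drawIndexed = 0;
    uint32_t map = 0;
    uint32_t setVertexBuffers = 0;

    // Read the values just before calling this to get the per-frame numbers.
    void reset() { drawIndexed = map = setVertexBuffers = 0; }
};

FrameStats gStats;

// Example wrapper: bump the counter, then forward to the real API call.
void countedDrawIndexed(/* ... real arguments elided ... */) {
    ++gStats.drawIndexed;
    // context->DrawIndexed(...); // real call would go here
}
```

Logging the ratio of Map and IASetVertexBuffers calls to draw calls per frame makes it obvious whether state changes or buffer updates dominate.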

 

I tend to find that when optimizing the best case to look at first is the slowest one. Usually if you make that faster everything else will improve too, and it'll be more obvious where algorithms don't scale to larger numbers of inputs. You also want something that's easily repeatable so you can measure performance before and after optimization and see how well it actually worked.

 

It can also sometimes be useful to look at a simple case to check that you aren't doing lots of unnecessary work (e.g. running through a big list of items, checking to see if each one is active, when none of them are).

 

Also don't forget to compare optimized builds, with the debug runtimes disabled. Profiling debug code is generally not much use, because it's not representative of what users will see.



#7 mhagain   Crossbones+   -  Reputation: 8284


Posted 07 July 2013 - 04:22 AM

Your CPU profiling results may be misleading you here; one cause of D3D11 being mysteriously slow can actually come from parameters passed to your CreateDeviceAndSwapChain call.

 

Double-check your swap chain description that you pass to this, making sure that the width and height parameters you use are actually the size of the client rectangle for the window you're using.  If these are different, then DXGI will need to do a stretch operation during present (rather than just exchanging buffers) which can slow you down considerably.  This can also happen if you're changing display modes at any time - if the new dimensions you specify in your ResizeTarget call are different to those in your ResizeBuffers call, you'll get the same behaviour. 

 

While you're doing this, it may not be a bad idea to also double-check the refresh rate values you use.  It's common enough to see people just using 60/1 here (on the understandable assumption that they've got a 60Hz monitor so therefore these must be correct) whereas the actual correct values may be something like 59994/1002 (or similar).  The DXGI documentation explicitly warns that you must use a properly enumerated mode when setting up your swap chain description, so it's important to get these right.

 

See http://msdn.microsoft.com/en-us/library/windows/desktop/bb205075(v=vs.85).aspx#Care_and_Feeding_of_the_Swap_Chain for more information on all of this.
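Those two checks might look something like this (a sketch only; `hWnd` is assumed to be the target window, and `enumeratedMode` is assumed to be a DXGI_MODE_DESC obtained from mode enumeration rather than hard-coded):

```cpp
// Make the back buffer exactly the client-rectangle size, so Present can
// exchange buffers instead of doing a stretch.
RECT rc;
GetClientRect(hWnd, &rc);

DXGI_SWAP_CHAIN_DESC swapDesc = {};
swapDesc.BufferDesc.Width  = rc.right - rc.left;
swapDesc.BufferDesc.Height = rc.bottom - rc.top;
// Use a refresh rate from a properly enumerated display mode
// (IDXGIOutput::GetDisplayModeList / FindClosestMatchingMode),
// not an assumed 60/1.
swapDesc.BufferDesc.RefreshRate = enumeratedMode.RefreshRate;
```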


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#8 David_pb   Members   -  Reputation: 668


Posted 08 July 2013 - 01:31 PM

@Adam_42, mhagain

 

Thank you, this is useful information. I've checked the initialization code and all seems to be fine.

What I found, though, is a bug in the code that sorts the render operations, so my batching was far from optimal. With the correct rop order and many checks to avoid unnecessary API calls, the performance is now quite decent, though not really optimal. Interestingly, DirectX9 doesn't seem to have much trouble with this...


@D13_Dreinig

#9 imoogiBG   Members   -  Reputation: 1247


Posted 11 July 2013 - 12:45 PM

How often is the input layout updated? How often is a new input layout created?

 

What flags are used when you create a new shader?


Edited by imoogiBG, 11 July 2013 - 12:49 PM.


#10 David_pb   Members   -  Reputation: 668


Posted 12 July 2013 - 01:44 AM

For release builds no flags are set when shaders are compiled, except for the matrix order. For debug builds I use DEBUG, PREFER_FLOW_CONTROL and SKIP_OPTIMIZATION. As for the input layouts: I use a simple caching system to share input layouts whenever possible. Whenever a shader is associated with some renderable entity, an IL is requested from the cache. A hash is created over the vertex declaration and the shader input signature; if an equal hash is found, the existing layout is returned and shared, otherwise a new layout is created. This all happens only once per shader/renderable-entity combination at load time, so I make sure not to create this stuff 'on the fly' at runtime.
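That caching scheme can be sketched in portable C++ along these lines. The D3D11 layout object is replaced by a placeholder type, and the "hash" is simply the combined key of the two descriptions; the real code would call ID3D11Device::CreateInputLayout on a cache miss.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Placeholder standing in for ID3D11InputLayout in this sketch.
struct InputLayout {
    std::string debugName;
};

// Cache keyed by (vertex declaration, shader input signature): if an entry
// already exists the layout is shared, otherwise a new one is created.
class InputLayoutCache {
public:
    std::shared_ptr<InputLayout> acquire(const std::string& vertexDecl,
                                         const std::string& inputSignature) {
        const std::string key = vertexDecl + "|" + inputSignature;
        auto it = cache_.find(key);
        if (it != cache_.end())
            return it->second;                         // share the existing layout
        auto layout = std::make_shared<InputLayout>(); // CreateInputLayout would go here
        layout->debugName = key;
        cache_.emplace(key, layout);
        return layout;
    }

    size_t size() const { return cache_.size(); }

private:
    std::unordered_map<std::string, std::shared_ptr<InputLayout>> cache_;
};
```

Because `acquire` is only called once per shader/entity combination at load time, no layout objects are created during the frame loop, matching the description above.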


@D13_Dreinig



