DirectX11 performance problems

Started by David_pb
8 comments, last by David_pb 10 years, 9 months ago

Hi,

I'm currently working on a DirectX11 port of an old DX9 renderer and facing the problem that DX11 seems really slow in comparison to the old code. I've already checked all the 'best practice' slides that are available around the internet (not creating new resources at runtime, update frequency for constant buffers, etc...). But nothing seems to be a real problem. Other engines I checked are much more careless in most of these cases but don't seem to have similar problems.

Profiling shows that the code is highly CPU bound, since the GPU seems to be starving. GPUView confirms this: the CPU queue is empty most of the time and only occasionally has a packet pushed onto it. The weird thing is that the main thread isn't stalling but is active nearly the whole time. VTune shows that most of the samples are taken in DirectX API calls, which are taking far too much time (the main bottlenecks seem to be DrawIndexed/DrawIndexedInstanced, Map and IASetVertexBuffers).

The next thing I thought about was sync points. But the only source I can imagine is the update of the constant buffers, of which there are quite a few per frame. What I'm essentially doing is caching the shader constants in a CPU-side buffer and pushing the whole memory chunk into my constant buffers. The buffers are all dynamic and are mapped with 'discard'. I also tried creating 'default' buffers and updating them with UpdateSubresource, and a mix of both ('per frame' buffers dynamic and the rest default), but this resulted in roughly equal performance.
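For reference, the update path described above looks roughly like this (a minimal sketch, assuming a buffer created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE; the cache names are illustrative, not the actual code):

```cpp
#include <d3d11.h>
#include <cstring>

// Minimal sketch: push a CPU-side constant cache into a dynamic buffer.
// 'cpuSideCache' and 'cacheSizeInBytes' are illustrative names.
void UpdateConstants(ID3D11DeviceContext* context, ID3D11Buffer* constantBuffer,
                     const void* cpuSideCache, size_t cacheSizeInBytes)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(context->Map(constantBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        // DISCARD hands back a fresh memory region, so in principle this
        // should not introduce a CPU/GPU sync point.
        std::memcpy(mapped.pData, cpuSideCache, cacheSizeInBytes);
        context->Unmap(constantBuffer, 0);
    }
}
```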

The weird thing is that the old DX9 renderer produces much better results with the same render code. Maybe somebody has experienced similar behaviour and can give me a hint.

Cheers

David


Make sure to set the proper usage flags on each type of resource (i.e. always use the immutable or default usage flag when possible, and use the dynamic flag instead of the staging flag for resources that must be written by the CPU).
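Something along these lines (an illustrative sketch, not anyone's actual code):

```cpp
#include <d3d11.h>

void DescribeBuffers(UINT vertexDataSize, UINT constantDataSize)
{
    // Static geometry: written once at creation time, never touched by the
    // CPU again -> immutable.
    D3D11_BUFFER_DESC vbDesc = {};
    vbDesc.ByteWidth = vertexDataSize;
    vbDesc.Usage     = D3D11_USAGE_IMMUTABLE;
    vbDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    // Data the CPU rewrites every frame -> dynamic with CPU write access,
    // not staging.
    D3D11_BUFFER_DESC cbDesc = {};
    cbDesc.ByteWidth      = constantDataSize;
    cbDesc.Usage          = D3D11_USAGE_DYNAMIC;
    cbDesc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    cbDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    // ...pass the descs to ID3D11Device::CreateBuffer.
}
```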

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

Make sure to set the proper usage flags on each type of resource (i.e. always use the immutable or default usage flag when possible, and use the dynamic flag instead of the staging flag for resources that must be written by the CPU).

Yeah, I've already ensured that; I'm quite certain the problem comes from somewhere else.

For each of those expensive functions, how many times per frame do you call them?

Have you considered using multi-threaded draw calls?

You might also want to take a look at http://fgiesen.wordpress.com/2013/03/05/mopping-up/ - the whole series is interesting, but that post has a significant CPU optimization trick for constant buffers.

Thanks for the answer, Adam. As for the first question, it's hard to say; in the worst case the functions are called around 2000-4000 times per frame. I actually thought about multithreading, but dropped the idea since I'm heavily bound to the available interface and there is currently no time for bigger changes there. But maybe in the future this could be an option. Thanks for the interesting link though, I'll check it.

The reason I asked for the number of calls was more for the ratios between the different call types than for absolute values. You can easily get those numbers by adding some global variables which you increment on each call, and reset once per frame. Examine the values just before the reset to get your results.
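For example (hypothetical wrapper and counter names, just to illustrate the idea):

```cpp
#include <d3d11.h>
#include <cstdio>

// Global per-frame counters; increment one in each wrapper around the
// expensive calls, then inspect them just before the reset.
static unsigned g_drawCalls = 0;
static unsigned g_mapCalls  = 0;
static unsigned g_vbSets    = 0;

void DrawIndexedCounted(ID3D11DeviceContext* ctx,
                        UINT indexCount, UINT startIndex, INT baseVertex)
{
    ++g_drawCalls;
    ctx->DrawIndexed(indexCount, startIndex, baseVertex);
}

void EndFrameStats()
{
    std::printf("DrawIndexed: %u  Map: %u  IASetVertexBuffers: %u\n",
                g_drawCalls, g_mapCalls, g_vbSets);
    g_drawCalls = g_mapCalls = g_vbSets = 0; // reset once per frame
}
```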

I tend to find that, when optimizing, the best case to look at first is the slowest one. Usually if you make that faster everything else will improve too, and it'll be more obvious where algorithms don't scale to larger numbers of inputs. You also want something that's easily repeatable, so you can measure performance before and after optimization and see how well it actually worked.

It can also sometimes be useful to look at a simple case to check that you aren't doing lots of unnecessary work (e.g. running through a big list of items, checking to see if each one is active, when none of them are).

Also don't forget to compare optimized builds, with the debug runtimes disabled. Profiling debug code is generally not much use, because it's not representative of what users will see.
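For instance, something like this keeps the debug layer out of the builds you profile (a sketch; the flag variable name is illustrative):

```cpp
// Enable the D3D11 debug layer only in debug builds, so profiling runs
// aren't skewed by the debug runtime.
UINT deviceFlags = 0;
#if defined(_DEBUG)
deviceFlags |= D3D11_CREATE_DEVICE_DEBUG;
#endif
// ...pass deviceFlags to D3D11CreateDevice / D3D11CreateDeviceAndSwapChain.
```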

Your CPU profiling results may be misleading you here; one cause of D3D11 being mysteriously slow can actually be the parameters passed to your D3D11CreateDeviceAndSwapChain call.

Double-check your swap chain description that you pass to this, making sure that the width and height parameters you use are actually the size of the client rectangle for the window you're using. If these are different, then DXGI will need to do a stretch operation during present (rather than just exchanging buffers) which can slow you down considerably. This can also happen if you're changing display modes at any time - if the new dimensions you specify in your ResizeTarget call are different to those in your ResizeBuffers call, you'll get the same behaviour.

While you're doing this, it may not be a bad idea to also double-check the refresh rate values you use. It's common enough to see people just using 60/1 here (on the understandable assumption that they've got a 60Hz monitor so therefore these must be correct) whereas the actual correct values may be something like 59994/1002 (or similar). The DXGI documentation explicitly warns that you must use a properly enumerated mode when setting up your swap chain description, so it's important to get these right.

See http://msdn.microsoft.com/en-us/library/windows/desktop/bb205075(v=vs.85).aspx#Care_and_Feeding_of_the_Swap_Chain for more information on all of this.
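Put together, the checks above look roughly like this (a sketch; 'enumeratedMode' is assumed to be a DXGI_MODE_DESC obtained via IDXGIOutput::GetDisplayModeList):

```cpp
#include <d3d11.h>

DXGI_SWAP_CHAIN_DESC BuildSwapChainDesc(HWND hWnd, const DXGI_MODE_DESC& enumeratedMode)
{
    RECT rc;
    GetClientRect(hWnd, &rc);

    DXGI_SWAP_CHAIN_DESC scd = {};
    scd.BufferDesc.Width       = static_cast<UINT>(rc.right - rc.left); // match the client rect
    scd.BufferDesc.Height      = static_cast<UINT>(rc.bottom - rc.top);
    scd.BufferDesc.Format      = DXGI_FORMAT_R8G8B8A8_UNORM;
    scd.BufferDesc.RefreshRate = enumeratedMode.RefreshRate; // enumerated, not a hardcoded 60/1
    scd.SampleDesc.Count       = 1;
    scd.BufferUsage            = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    scd.BufferCount            = 1;
    scd.OutputWindow           = hWnd;
    scd.Windowed               = TRUE;
    return scd;
}
```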

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

@Adam_42, mhagain

Thank you, this is useful information. I've checked the initialization code and all seems to be fine.

What I found, though, is a bug in the code that sorts the render operations, so my batching was far from optimal. With the correct render-op order and many checks to avoid unnecessary API calls, the performance is now quite decent, though not really optimal. Interestingly, DirectX9 doesn't seem to have much trouble with this...
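For anyone hitting the same thing, a common way to keep render ops in a batch-friendly order is to pack the expensive-to-change state into a single sort key (a sketch, not the poster's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RenderOp
{
    uint64_t sortKey; // shader id in the high bits, material id below
    // ...draw data...
};

// Draws sharing a shader (and then a material) end up adjacent after sorting,
// which minimizes state changes between draw calls.
uint64_t MakeSortKey(uint32_t shaderId, uint32_t materialId)
{
    return (uint64_t(shaderId) << 32) | materialId;
}

void SortOps(std::vector<RenderOp>& ops)
{
    std::sort(ops.begin(), ops.end(),
              [](const RenderOp& a, const RenderOp& b) { return a.sortKey < b.sortKey; });
}
```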

How often is the InputLayout updated? How often is a new input layout created?

What flags do you use when you create a new shader?

For release builds no flags are set when shaders are compiled, except for the matrix order. For debug builds I use DEBUG, PREFER_FLOW_CONTROL and SKIP_OPTIMIZATION. As for the input layouts: I use a simple caching system to share input layouts whenever possible. Whenever a shader is associated with some renderable entity, an IL is requested from the cache. A hash is created over the vertex declaration and the shader input signature, and if a matching hash is found the layout is returned and shared. Otherwise a new layout is created. This all happens only once per shader/renderable-entity combination at loading time, so I make sure not to create this stuff 'on the fly' at runtime.
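The described cache could look roughly like this (a sketch; the hash is assumed to be computed over the vertex declaration and the shader input signature, as described above):

```cpp
#include <d3d11.h>
#include <cstdint>
#include <unordered_map>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// hash -> shared input layout
static std::unordered_map<uint64_t, ComPtr<ID3D11InputLayout>> g_layoutCache;

ID3D11InputLayout* GetOrCreateLayout(ID3D11Device* device,
                                     const D3D11_INPUT_ELEMENT_DESC* elems, UINT numElems,
                                     const void* shaderBytecode, SIZE_T bytecodeSize,
                                     uint64_t hash)
{
    auto it = g_layoutCache.find(hash);
    if (it != g_layoutCache.end())
        return it->second.Get(); // share the existing layout

    ComPtr<ID3D11InputLayout> layout;
    if (FAILED(device->CreateInputLayout(elems, numElems, shaderBytecode,
                                         bytecodeSize, &layout)))
        return nullptr;

    g_layoutCache[hash] = layout;
    return layout.Get();
}
```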

