DX11 Using GPU Profiling to find bottlenecks

In some situations, my game starts to "lag" on older computers. I wanted to find the bottlenecks and optimize my game by looking for flaws in my shaders and in the layer between CPU and GPU. My first step was to measure the time my render function needs for its tasks. Every second, I wrote the accumulated time of each task to my console window. Each second, it takes around

• 170ms to call render functions for all models (including setting shader resources, updating constant buffers, drawing all indexed and non-indexed vertices, etc.)
• 40ms to render the UI
• 790ms to call SwapChain.Present
• <1ms to do the rest (updating structures, etc.)

In my swap chain description, I set a frame rate of 60 Hz if it is supported by the computer. It made sense to me that the Present function waits some time before it starts the next frame. However, I wanted to check whether this might be a problem for me. After a web search, I found articles like this one, which states:

Quote

There are a few reasons why you might find Present() taking up that much time in your frame. The Present call is the primary method of synchronization between the CPU and the GPU; if the CPU is generating frames much faster than the GPU can process them, they'll get queued up. Once the buffer gets full, Present() turns into a glorified Sleep() call while it waits for it to empty out. [...] Also take a look at your app using one of the available GPU profilers; PIX is good as a base point, while NVIDIA and AMD each have their own more specific offerings for their own products. Finally, make sure your drivers are updated. If there's a bug in the driver, any chance you have at reasoning about the issue goes out the window.

My drivers are up-to-date, so that's no issue. I installed Microsoft's PIX, but I was unable to use it. I could configure my game for x64, but PIX was not able to process DirectX 11. After getting only error messages, I installed NVIDIA's Nsight. After adjusting my game and installing all components, I couldn't get a proper result, because my game freezes after a few frames. I haven't figured out why: there is no exception or error message, and other debugging mechanisms like log messages and breakpoints tell me the game freezes at the end of the render function after a few frames. So, I looked for another profiling tool and found Jeremy's GPUProfiler. However, the information returned by this tool is too basic to get in-depth knowledge about my performance issues.

Can anyone recommend a GPU profiler or any other tool that might help me find bottlenecks in my game and/or that can indicate performance problems in my shaders? My custom graphics engine handles features like multi-texturing, instancing, soft shadowing, animation, etc. However, I am pretty sure there are things I can optimize!

I am using SharpDX to develop a game (engine) based on DirectX 11 with .NET Framework 4.5. My graphics card is from NVIDIA, and my processor is made by Intel.

Edited by GalacticCrew
There were typos in my post.


DX11 is the worst: most command building is deferred, and Present is a black box that does all the work. Deferred contexts are even worse, pushing even more work into Present…

You can use the no-wait flag at Present to dissociate the vsync wait from the real CPU work.

But that black box also means you can't know exactly which part of your frame is causing the real cost in your CPU time. It can be memory defragmentation, first-time shader compiles, etc. NVIDIA does a better job at multithreading the driver for you than AMD does, but again, at the price of opacity.

Rule of thumb: optimize for AMD (a lost cause without AMD insider support), and assume it will work better on NVIDIA, for your own sanity.

On the GPU, heavy use of timestamp queries helps to inspect frame durations.


So your GPU is running at 1fps, but your CPU is also only running at 4fps!?

How does your CPU drawing take 170ms? How many draw calls do you make? It could be that you're doing something wrong on the CPU that is causing both the CPU and the GPU to run slowly.

To profile your GPU workload, it's pretty easy to do it yourself. You can use D3D's timestamp queries/events to read the GPU's clock at different points in a frame. You can read the clock before and after a group of commands, take the difference of the two readings, divide by the clock frequency, and you know how long that group of commands took.
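
The conversion described above is just tick arithmetic. Here is a minimal sketch in plain C++, independent of any graphics API (the function name is made up for illustration):

#include <cassert>
#include <cstdint>
#include <cmath>

// Convert two raw GPU timestamps (in ticks) to a duration in milliseconds,
// given the tick frequency reported by the timestamp-disjoint query.
double TicksToMilliseconds(uint64_t startTicks, uint64_t endTicks, uint64_t frequency)
{
    return static_cast<double>(endTicks - startTicks) * 1000.0 /
           static_cast<double>(frequency);
}

int main()
{
    // Example: a 1 GHz GPU clock; 5,000,000 elapsed ticks -> 5 ms.
    double ms = TicksToMilliseconds(0, 5000000, 1000000000);
    assert(std::fabs(ms - 5.0) < 1e-9);
    return 0;
}

The only GPU-specific parts are obtaining the two tick values and the frequency, which is what the timestamp and timestamp-disjoint queries provide.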


GPU profiling generally needs to be intrusive on your app. For CPU profiling there is usually metadata that can be leveraged (in PDB files for native code, or from the assembly itself in managed code), but there isn't anything like that for GPU profiling aside from the D3D calls that you issue. Nsight can give you timing breakdowns per draw (if you can get it working), but that's usually not super helpful for getting a broader view of your game's performance. For that you need to add annotations to your code that bracket your draws and dispatches into logical groups. Different profiling tools have different ways of doing this, some of them involving custom APIs specific to that tool. For instance, the new PIX for Windows has the WinPixEventRuntime library. Many tools that support D3D11 (such as RenderDoc and Nsight) still recognize the old D3DPERF_BeginEvent/D3DPERF_EndEvent functions from D3D9.

You may want to consider adding your own basic profiling to your game, which you can do using timestamp queries. Unfortunately, I don't have any C#/SharpDX code that you can look at, but if you're okay with looking at some C++, you can use my basic D3D11 Profiler class that I have here and here as a reference. I would also consider adding some debug UI to your game so that you can output this data right to the screen in development builds.

6 minutes ago, Hodgman said:

So your GPU is running at 1fps, but your CPU is also only running at 4fps!?

I was thinking the same thing when I first read the OP, but then I realized that it says that those timings were accumulated over a single second. So basically 79% of the frame time is spent in Present, which sounds normal for a heavily GPU-bound game.

15 minutes ago, Hodgman said:

How does your CPU drawing take 170ms?

I guessed that "ms" to him meant microseconds, not milliseconds.

edit - BTW is there a way to use the symbol for micro in the forum?

edit2 - Just reread the OP because I read MJP's post... guess I was wrong.

Edited by Infinisearch


Thank you for your replies. I will try to implement my own profiling mechanism. I also edited my post a little bit; there were many typos in it. I wrote it after finishing my working day... :-/

When writing "ms", I mean milliseconds. I have 60 frames per second, i.e. each frame takes around 16 ms. An accumulated render time for my scene of 170 ms means that each frame I spend around 170 / 60 ≈ 2.83 ms rendering the scene.
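
As a quick sanity check of that amortization (a trivial sketch in C++; the helper name is made up):

#include <cassert>
#include <cmath>

// Amortize a per-second accumulated time over the frames rendered in that second.
double PerFrameMs(double accumulatedMsPerSecond, int framesPerSecond)
{
    return accumulatedMsPerSecond / framesPerSecond;
}

int main()
{
    // 170 ms accumulated over 60 frames -> roughly 2.83 ms per frame.
    assert(std::fabs(PerFrameMs(170.0, 60) - 2.8333) < 0.001);
    // Likewise, the 790 ms in Present amortizes to about 13.17 ms per frame.
    assert(std::fabs(PerFrameMs(790.0, 60) - 13.1667) < 0.001);
    return 0;
}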

46 minutes ago, GalacticCrew said:

When writing "ms", I mean milliseconds. I have 60 frames per second, i.e. each frame takes around 16 ms. An accumulated render time for my scene of 170 ms means that each frame I spend around 170 / 60 ≈ 2.83 ms rendering the scene.

Yeah, sorry about that. I didn't really read your post; I was more interested in the responses. So I just assumed... again, my mistake, sorry.


No problem, Infinisearch. I used the information from the posts above to write my own GPU profiler, using the links provided by MJP. I wrote two classes. The class GPUInterval represents an interval you are interested in, e.g. the time used to render the scene. The class GPUProfiler is a container for a set of GPUIntervals and does all the calculations.

namespace Engine.Game.Profiler
{
    public class GPUInterval
    {
        private SharpDX.Direct3D11.Device _device;
        private SharpDX.Direct3D11.DeviceContext _deviceContext;

        private SharpDX.Direct3D11.Query _startQuery;
        private SharpDX.Direct3D11.Query _endQuery;

        public string Name { get; private set; }
        public double Duration { get; private set; }

        public GPUInterval(SharpDX.Direct3D11.Device device, SharpDX.Direct3D11.DeviceContext deviceContext, string name)
        {
            _device = device;
            _deviceContext = deviceContext;

            Name = name;

            // Create the timestamp queries for the start and end of the interval.
            _startQuery = new SharpDX.Direct3D11.Query(_device, new SharpDX.Direct3D11.QueryDescription()
            {
                Type = SharpDX.Direct3D11.QueryType.Timestamp,
                Flags = SharpDX.Direct3D11.QueryFlags.None
            });

            _endQuery = new SharpDX.Direct3D11.Query(_device, new SharpDX.Direct3D11.QueryDescription()
            {
                Type = SharpDX.Direct3D11.QueryType.Timestamp,
                Flags = SharpDX.Direct3D11.QueryFlags.None
            });
        }

        public void Start()
        {
            _deviceContext.End(_startQuery);
        }

        public void Stop()
        {
            _deviceContext.End(_endQuery);
        }

        public void Calculate(long frequency)
        {
            // Spin until both timestamps are available.
            long startTime;
            while (!_deviceContext.GetData(_startQuery, out startTime)) { }

            long endTime;
            while (!_deviceContext.GetData(_endQuery, out endTime)) { }

            // Ticks divided by ticks-per-second gives seconds; scale to milliseconds.
            Duration = ((endTime - startTime) * 1000.0) / frequency;
        }
    }
}
namespace Engine.Game.Profiler
{
    public class GPUProfiler
    {
        private SharpDX.Direct3D11.Device _device;
        private SharpDX.Direct3D11.DeviceContext _deviceContext;

        private SharpDX.Direct3D11.Query _disjointQuery;

        public System.Collections.Generic.List<GPUInterval> Intervals { get; private set; }

        public GPUProfiler(SharpDX.Direct3D11.Device device, SharpDX.Direct3D11.DeviceContext deviceContext)
        {
            _device = device;
            _deviceContext = deviceContext;

            // Create the disjoint query, which provides the clock frequency and
            // tells us whether the timestamps of this frame are reliable.
            _disjointQuery = new SharpDX.Direct3D11.Query(_device, new SharpDX.Direct3D11.QueryDescription()
            {
                Type = SharpDX.Direct3D11.QueryType.TimestampDisjoint,
                Flags = SharpDX.Direct3D11.QueryFlags.None
            });

            // Create intervals list.
            Intervals = new System.Collections.Generic.List<GPUInterval>();
        }

        public void StartFrame()
        {
            _deviceContext.Begin(_disjointQuery);
        }

        public void EndFrame()
        {
            _deviceContext.End(_disjointQuery);

            // Spin until the disjoint data (frequency and disjoint flag) is available.
            SharpDX.Direct3D11.QueryDataTimestampDisjoint queryDataTimestampDisjoint;
            while (!_deviceContext.GetData(_disjointQuery, out queryDataTimestampDisjoint)) { }

            // Only calculate the interval durations if the clock was not disjoint
            // (e.g. due to a power-state change) during the frame.
            if (!queryDataTimestampDisjoint.Disjoint)
            {
                foreach (var interval in Intervals)
                    interval.Calculate(queryDataTimestampDisjoint.Frequency);
            }
        }
    }
}

I created four GPUIntervals to check the same regions I mentioned in my initial post:

1. The entire render function
2. Rendering the scene (drawing models, setting constant shaders, ...)
3. Rendering UI
4. Calling SwapChain.Present

Here are the numbers for a random frame while the game is idling:

1. Entire render function = 16.35 ms
2. Render scene = 15.00 ms
3. Render UI = 0.26 ms
4. SwapChain.Present = 1.08 ms

These numbers are no big surprise, because rendering the scene does all the work. However, it is interesting that rendering the UI (which has A LOT of elements) and presenting the swap chain are so cheap.

Tomorrow, I will investigate the different parts of my scene rendering. I will keep you updated!


One other suggestion I have is to make sure that you use a sync interval of 0 when performing GPU profiling. When VSYNC is enabled, the driver will always wait until the next sync interval (every 16.6ms for a 60Hz monitor) before presenting the next buffer in the swap chain. Because of this, the GPU will typically have to stall so that it can wait until a swap chain buffer is finished being used for presenting, and can therefore be re-used as a render target. This still can show up in your timestamp queries, which will give you misleading results. You may be able to avoid capturing that stall in your timings by excluding the first draw or copy operation that touches the back buffer, but it's easier to just disable VSYNC.
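
The size of that stall follows from the refresh rate. Here is a rough model as a C++ sketch, assuming the GPU finishes its work early and then idles until the next sync interval (the helper names are made up):

#include <cassert>
#include <cmath>

// With vsync on, presentation happens only on the monitor's refresh boundary.
double VsyncPeriodMs(double refreshHz)
{
    return 1000.0 / refreshHz;
}

// Rough model: a GPU frame that finishes early is padded out to the next
// multiple of the refresh period before the back buffer can be reused.
double PaddedFrameMs(double gpuWorkMs, double refreshHz)
{
    double period = VsyncPeriodMs(refreshHz);
    return std::ceil(gpuWorkMs / period) * period;
}

int main()
{
    // A 60 Hz monitor syncs every ~16.67 ms.
    assert(std::fabs(VsyncPeriodMs(60.0) - 16.6667) < 0.001);
    // 5 ms of real GPU work still occupies a full ~16.67 ms interval,
    // so the stall can dominate the measured time.
    assert(std::fabs(PaddedFrameMs(5.0, 60.0) - 16.6667) < 0.001);
    return 0;
}

Presenting with a sync interval of 0 removes that padding, which is why the timestamp readings become much more representative of the actual GPU work.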


I have extended my GPU profiling classes. Now, I measure all intervals hierarchically, so I can see the duration of each operation in each function. I located my bottleneck, and I will further investigate how to solve it. If I can't solve it, I will open a new thread. My question here has been successfully answered. Thank you very much!
