Using GPU Profiling to find bottlenecks

Started by GalacticCrew · 9 comments, last by GalacticCrew 6 years, 5 months ago

In some situations, my game starts to "lag" on older computers. I wanted to find the bottlenecks and optimize my game by looking for flaws in the shaders and in the layer between CPU and GPU. My first step was to measure the time my render function needs for its tasks. Every second, I wrote the accumulated time of each task to my console window (a sketch of this kind of CPU-side timing follows the list below). Each second, it takes around

  • 170ms to call render functions for all models (including setting shader resources, updating constant buffers, drawing all indexed and non-indexed vertices, etc.)
  • 40ms to render the UI
  • 790ms to call SwapChain.Present
  • <1ms to do the rest (updating structures, etc.)
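
For reference, this is roughly how that per-task CPU timing can be done; a minimal sketch using System.Diagnostics.Stopwatch (the class and the three Action parameters are placeholders, not my actual engine code):

// Sketch: accumulate the CPU time of each render task with a Stopwatch and print
// the totals once per second. The three Action parameters stand in for the real
// render functions.
public class RenderTimings
{
    private readonly System.Diagnostics.Stopwatch _taskTimer = new System.Diagnostics.Stopwatch();
    private readonly System.Diagnostics.Stopwatch _reportTimer = System.Diagnostics.Stopwatch.StartNew();
    private double _modelsMs, _uiMs, _presentMs;

    public void Frame(System.Action renderModels, System.Action renderUI, System.Action present)
    {
        _modelsMs += Measure(renderModels);
        _uiMs += Measure(renderUI);
        _presentMs += Measure(present);

        if (_reportTimer.Elapsed.TotalSeconds >= 1.0)
        {
            System.Console.WriteLine(string.Format(
                "Models: {0:F1} ms, UI: {1:F1} ms, Present: {2:F1} ms",
                _modelsMs, _uiMs, _presentMs));
            _modelsMs = _uiMs = _presentMs = 0.0;
            _reportTimer.Restart();
        }
    }

    private double Measure(System.Action task)
    {
        _taskTimer.Restart();
        task();
        _taskTimer.Stop();
        return _taskTimer.Elapsed.TotalMilliseconds;
    }
}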

In my swap chain description, I set a refresh rate of 60 Hz if it's supported by the computer. It made sense to me that the Present function waits some time before it starts the next frame. However, I wanted to check whether this might be a problem for me. After a web search I found articles like this one, which states:

Quote

There are a few reasons why you might find Present() taking up that much time in your frame. The Present call is the primary method of synchronization between the CPU and the GPU; if the CPU is generating frames much faster than the GPU can process them, they'll get queued up. Once the buffer gets full, Present() turns into a glorified Sleep() call while it waits for it to empty out. [...] Also take a look at your app using one of the available GPU profilers; PIX is good as a base point, while NVIDIA and AMD each have their own more specific offerings for their own products. Finally, make sure your drivers are updated. If there's a bug in the driver, any chance you have at reasoning about the issue goes out the window.

My drivers are up to date, so that's not the issue. I installed Microsoft's PIX, but I was unable to use it: I could configure my game for x64, but PIX is not able to process DirectX 11. After getting only error messages, I installed NVIDIA's Nsight. After adjusting my game and installing all components, I couldn't get a proper result because my game freezes after a few frames, and I haven't figured out why. There is no exception or error message, and other debug mechanisms like log messages and breakpoints tell me the game freezes at the end of the render function after a few frames. So I looked for another profiling tool and found Jeremy's GPUProfiler. However, the information returned by this tool is too basic to get in-depth knowledge about my performance issues.

Can anyone recommend a GPU profiler or any other tool that might help me find bottlenecks in my game and/or indicate performance problems in my shaders? My custom graphics engine handles things like multi-texturing, instancing, soft shadowing, animation, etc. However, I am pretty sure there are things I can optimize!

I am using SharpDX to develop a game (engine) based on DirectX 11 with .NET Framework 4.5. My graphics card is from NVIDIA and my processor is made by Intel.


D3D11 is the worst: most command building is deferred, and Present is a black box that does all the work. Deferred contexts are even worse, pushing even more of the work into Present…

 

You can use the do-not-wait flag with Present to dissociate the vsync wait from the real CPU work.
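
In SharpDX terms that would look roughly like this (an assumption on my part: the DXGI do-not-wait flag should be exposed as PresentFlags.DoNotWait; check your SharpDX version):

// Present without blocking: if the frame queue is still full, the call fails with
// DXGI_ERROR_WAS_STILL_DRAWING instead of sleeping, so it has to be retried later.
swapChain.Present(0, SharpDX.DXGI.PresentFlags.DoNotWait);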

 

But that black box also means you can't know exactly which part of your frame is causing the real fuss in your CPU time. It could be memory defragmentation, first-time shader compiles, etc. NVIDIA does a better job of multithreading the driver for you than AMD does, but again, at the price of opacity :(

Rule of thumb: optimize for AMD (a lost cause without AMD insider support) and, for your own sanity, assume it will run better on NVIDIA.

 

On the GPU, heavy use of timestamp queries helps to inspect frame durations.

So your GPU is running at 1fps, but your CPU is also only running at 4fps!?

How does your CPU drawing take 170ms? How many draw calls do you make? It could be that you're doing something wrong on the CPU which is causing both CPU and GPU to run slowly. 

To profile your GPU workload, it's pretty easy to do it yourself. You can use D3D's timestamp queries/events to read the GPU's clock at different points in a frame. You then just read the clock before and after a group of commands, take the difference, divide by the clock frequency, and you know how long that group of commands took.

GPU profiling generally needs to be intrusive on your app. For CPU profiling there is usually metadata that can be leveraged (in PDB files for native code, or from the assembly itself in managed code), but there isn't anything equivalent for GPU profiling aside from the D3D calls that you issue. Nsight can give you timing breakdowns per draw (if you can get it working), but that's usually not super helpful for getting a broader view of your game's performance. For that you need to add some annotations to your code that bracket your draws and dispatches into logical groups. Different profiling tools have different ways of doing this, some of them involving custom APIs specific to that tool. For instance, the new PIX for Windows has the WinPixEventRuntime lib. Many tools (such as RenderDoc and Nsight) that support D3D11 still recognize the old D3DPERF_BeginEvent/D3DPERF_EndEvent functions from D3D9.
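
For a rough idea of what such annotations look like from C#, here is a sketch; it assumes your SharpDX version exposes the D3D11.1 ID3DUserDefinedAnnotation interface as SharpDX.Direct3D11.UserDefinedAnnotation (I haven't verified the exact wrapper name, so treat it as a sketch rather than working code):

// Sketch: bracket groups of draw calls so GPU tools (PIX, Nsight, RenderDoc) can show
// them as named regions. The pass names are just examples.
using (var annotation = deviceContext.QueryInterface<SharpDX.Direct3D11.UserDefinedAnnotation>())
{
    annotation.BeginEvent("Shadow maps");
    // ... draw calls for the shadow pass ...
    annotation.EndEvent();

    annotation.BeginEvent("Main scene");
    // ... draw calls for the scene ...
    annotation.EndEvent();
}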

You may want to consider adding your own basic profiling to your game, which you can do by using timestamp queries. Unfortunately I don't have any C#/SharpDX code that you can look at, but if you're okay with looking at some C++ you can look at my basic D3D11 Profiler class that I have here and here as a reference. I would also consider adding some debug UI to your game so that you can output this data right to the screen in development builds.

6 minutes ago, Hodgman said:

So your GPU is running at 1fps, but your CPU is also only running at 4fps!?

I was thinking the same thing when I first read the OP, but then I realized that it says that those timings were accumulated over a single second. So basically 79% of the frame time is spent in Present, which sounds normal for a heavily GPU-bound game.

15 minutes ago, Hodgman said:

How does your CPU drawing take 170ms?

I guessed that "ms" to him meant microseconds, not milliseconds.

edit - BTW is there a way to use the symbol for micro in the forum?

edit2 - Just reread the OP because I read MJP's post... guess I was wrong.

-potential energy is easily made kinetic-

Thank you for your replies. I will try to implement my own profiling mechanism. I also edited my post a little bit; there were many typos in it. I wrote it after finishing my working day... :-/

When writing "ms" I mean milli-seconds. I have 60 frames per second, i.e. each frame takes around 16 ms. An accumulated render time for my scene of 170 ms means that each frame I spend around 170 / 60 = 2.83 ms for rendering the scene.

46 minutes ago, GalacticCrew said:

When writing "ms" I mean milli-seconds. I have 60 frames per second, i.e. each frame takes around 16 ms. An accumulated render time for my scene of 170 ms means that each frame I spend around 170 / 60 = 2.83 ms for rendering the scene.

Yeah, sorry about that. I didn't really read your post; I was more interested in the responses. So I just assumed... again, my mistake, sorry.

-potential energy is easily made kinetic-

No problem, Infinisearch. I used the information from the posts above to write my own GPU profiler using the links provided by MJP. I wrote two classes. The class GPUInterval represents one interval you are interested in, e.g. the time used to render the scene. The class GPUProfiler is a container for a set of GPUIntervals and does all the calculations.


namespace Engine.Game.Profiler
{
    public class GPUInterval
    {
        private readonly SharpDX.Direct3D11.Device _device;
        private readonly SharpDX.Direct3D11.DeviceContext _deviceContext;

        private SharpDX.Direct3D11.Query _startQuery;
        private SharpDX.Direct3D11.Query _endQuery;

        public string Name { get; private set; }
        public double Duration { get; private set; }

        public GPUInterval(SharpDX.Direct3D11.Device device, SharpDX.Direct3D11.DeviceContext deviceContext, string name)
        {
            _device = device;
            _deviceContext = deviceContext;

            Name = name;

            // Create timestamp query.
            _startQuery = new SharpDX.Direct3D11.Query(_device, new SharpDX.Direct3D11.QueryDescription()
            {
                Type = SharpDX.Direct3D11.QueryType.Timestamp,
                Flags = SharpDX.Direct3D11.QueryFlags.None
            });

            _endQuery = new SharpDX.Direct3D11.Query(_device, new SharpDX.Direct3D11.QueryDescription()
            {
                Type = SharpDX.Direct3D11.QueryType.Timestamp,
                Flags = SharpDX.Direct3D11.QueryFlags.None
            });
        }

        public void Start()
        {
            // Write the start timestamp into the command stream.
            _deviceContext.End(_startQuery);
        }

        public void Stop()
        {
            // Write the end timestamp into the command stream.
            _deviceContext.End(_endQuery);
        }

        public void Calculate(long frequency)
        {
            // Spin until the GPU has written both timestamps.
            // Note: this blocks the CPU until the results are available.
            long startTime;
            while (!_deviceContext.GetData(_startQuery, out startTime))
                System.Threading.Thread.Sleep(1);

            long endTime;
            while (!_deviceContext.GetData(_endQuery, out endTime))
                System.Threading.Thread.Sleep(1);

            // Convert GPU ticks to milliseconds using the frequency from the disjoint query.
            Duration = ((endTime - startTime) * 1000.0) / frequency;
        }
    }
}

namespace Engine.Game.Profiler
{
    public class GPUProfiler
    {
        private readonly SharpDX.Direct3D11.Device _device;
        private readonly SharpDX.Direct3D11.DeviceContext _deviceContext;

        private SharpDX.Direct3D11.Query _disjointQuery;

        public List<GPUInterval> Intervals { get; private set; }

        public GPUProfiler(SharpDX.Direct3D11.Device device, SharpDX.Direct3D11.DeviceContext deviceContext)
        {
            _device = device;
            _deviceContext = deviceContext;

            // Create disjoint query.
            _disjointQuery = new SharpDX.Direct3D11.Query(_device, new SharpDX.Direct3D11.QueryDescription()
            {
                Type = SharpDX.Direct3D11.QueryType.TimestampDisjoint,
                Flags = SharpDX.Direct3D11.QueryFlags.None
            });

            // Create intervals list
            Intervals = new List<GPUInterval>();
        }

        public void StartFrame()
        {
            // The disjoint query brackets the frame and provides the GPU clock frequency.
            _deviceContext.Begin(_disjointQuery);
        }

        public void EndFrame()
        {
            _deviceContext.End(_disjointQuery);

            // Retrieve the frequency (and the disjoint flag) once the GPU has finished the frame.
            SharpDX.Direct3D11.QueryDataTimestampDisjoint queryDataTimestampDisjoint;
            while (!_deviceContext.GetData(_disjointQuery, out queryDataTimestampDisjoint))
                System.Threading.Thread.Sleep(1);

            // Calculate the duration of all intervals. If the counter was disjoint
            // (e.g. a clock change mid-frame), the timestamps are unreliable and are skipped.
            if (!queryDataTimestampDisjoint.Disjoint)
            {
                foreach (var interval in Intervals)
                    interval.Calculate(queryDataTimestampDisjoint.Frequency);
            }
        }
    }
}
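
For completeness, this is roughly how I use the two classes in my render loop (a simplified sketch; RenderScene, RenderUI and the swap chain call are placeholders for my actual functions):

// Setup (once): one profiler plus one GPUInterval per region of interest.
var profiler = new GPUProfiler(device, deviceContext);
var sceneInterval = new GPUInterval(device, deviceContext, "Render scene");
var uiInterval = new GPUInterval(device, deviceContext, "Render UI");
profiler.Intervals.Add(sceneInterval);
profiler.Intervals.Add(uiInterval);

// Per frame: bracket each region with Start/Stop, then let EndFrame resolve the queries.
profiler.StartFrame();

sceneInterval.Start();
RenderScene();   // placeholder
sceneInterval.Stop();

uiInterval.Start();
RenderUI();      // placeholder
uiInterval.Stop();

swapChain.Present(1, SharpDX.DXGI.PresentFlags.None);

profiler.EndFrame();   // blocks until the GPU results are available and fills Duration (ms)

foreach (var interval in profiler.Intervals)
    System.Console.WriteLine(interval.Name + ": " + interval.Duration.ToString("F2") + " ms");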

I created four GPUIntervals to check the same regions I mentioned in my initial post:

  1. The entire render function
  2. Rendering the scene (drawing models, setting constant shaders, ...)
  3. Rendering UI
  4. Calling SwapChain.Present

Here are the numbers for a random frame while the game is idling:

  1. Entire render function = 16.35 ms
  2. Render scene = 15.00 ms
  3. Render UI = 0.26 ms
  4. SwapChain.Present = 1.08 ms

These numbers are no big surprise, because rendering the scene does all the work. However, it is interesting that rendering the UI (which has A LOT of elements) and presenting the swap chain are both fine.

Tomorrow, I will investigate the different parts of my scene rendering. I will keep you updated!

One other suggestion I have is to make sure that you use a sync interval of 0 when performing GPU profiling. When VSYNC is enabled, the driver will always wait until the next sync interval (every 16.6ms for a 60Hz monitor) before presenting the next buffer in the swap chain. Because of this, the GPU will typically have to stall so that it can wait until a swap chain buffer is finished being used for presenting, and can therefore be re-used as a render target. This still can show up in your timestamp queries, which will give you misleading results. You may be able to avoid capturing that stall in your timings by excluding the first draw or copy operation that touches the back buffer, but it's easier to just disable VSYNC.
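
In SharpDX terms, that just means passing a sync interval of 0 to Present while profiling, for example:

// Normal play: wait for the next vertical blank (VSYNC on).
swapChain.Present(1, SharpDX.DXGI.PresentFlags.None);

// While GPU profiling: present immediately (VSYNC off) so the timestamp queries
// don't include the stall from waiting on a swap chain buffer.
swapChain.Present(0, SharpDX.DXGI.PresentFlags.None);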

I have extended my GPU profiling classes. Now I measure all intervals hierarchically, so I can see the duration of each operation in each function (see the sketch below). I located my bottleneck and will investigate further how to solve it. If I can't solve it, I will open a new thread. My question here has been successfully answered. Thank you very much!
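
Roughly, the idea is that each measured interval owns a list of child intervals, so the results can be printed as a tree; a simplified sketch of the structure (not my exact code):

// Sketch: a node wraps a GPUInterval and its children, e.g. "Render scene" containing
// "Draw models", "Draw shadows", etc. Print() outputs the durations as an indented tree.
public class HierarchicalInterval
{
    public GPUInterval Interval { get; private set; }
    public System.Collections.Generic.List<HierarchicalInterval> Children { get; private set; }

    public HierarchicalInterval(GPUInterval interval)
    {
        Interval = interval;
        Children = new System.Collections.Generic.List<HierarchicalInterval>();
    }

    public void Print(int depth = 0)
    {
        System.Console.WriteLine(new string(' ', depth * 2) + Interval.Name + ": " +
                                 Interval.Duration.ToString("F2") + " ms");
        foreach (var child in Children)
            child.Print(depth + 1);
    }
}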

