GalacticCrew

DX11 DrawIndexed and DrawIndexedInstanced take very long

Recommended Posts

Hello,

I want to improve the performance of my game (engine), and some of you helped me to make a GPU profiler. After creating the GPU profiler, I started to measure the time my GPU needs per frame and refined those measurements to find my bottleneck.
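
For context, a GPU profiler like that is typically built on D3D11 timestamp queries. The sketch below shows roughly how this looks with SharpDX; the GpuFrameTimer class and its layout are only an illustration, not the exact code from the other thread:

```csharp
// Rough sketch of a per-frame GPU timer based on D3D11 timestamp queries (SharpDX).
// GpuFrameTimer is illustrative, not the actual profiler code.
using SharpDX.Direct3D11;

class GpuFrameTimer
{
    private readonly Query _disjoint, _start, _end;

    public GpuFrameTimer(Device device)
    {
        _disjoint = new Query(device, new QueryDescription { Type = QueryType.TimestampDisjoint });
        _start    = new Query(device, new QueryDescription { Type = QueryType.Timestamp });
        _end      = new Query(device, new QueryDescription { Type = QueryType.Timestamp });
    }

    public void BeginFrame(DeviceContext context)
    {
        context.Begin(_disjoint);   // brackets the frame so the GPU clock frequency can be read
        context.End(_start);        // timestamp queries only use End()
    }

    public void EndFrame(DeviceContext context)
    {
        context.End(_end);
        context.End(_disjoint);
    }

    // Call this a frame or two later so the GPU has finished the work (avoids stalling).
    public double? ReadMilliseconds(DeviceContext context)
    {
        QueryDataTimestampDisjoint disjointData;
        long t0 = 0, t1 = 0;
        if (!context.GetData(_disjoint, out disjointData) ||
            !context.GetData(_start, out t0) ||
            !context.GetData(_end, out t1) ||
            disjointData.Disjoint)
            return null;                                    // data not ready or timings unreliable

        return (t1 - t0) * 1000.0 / disjointData.Frequency; // ticks -> milliseconds
    }
}
```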

Searching the bottleneck

Rendering a small scene in an idle state takes around 15.38 ms per frame. 13.54 ms (88.04%) are spent rendering the scene, 1.57 ms (10.22%) are spent in the SwapChain.Present call (no VSync!) and the rest is spent on other tasks like rendering the UI. I further investigated the scene rendering, since it takes over 88% of my GPU frame time.

When rendering my scene, most of the time (80.97%) is spent rendering my models. The rest is spent rendering the background/skybox, updating animation data, updating the pixel shader constant buffer, etc. It wasn't really surprising that most of the time goes into the models, so I further refined my measurements to find the actual bottleneck.

In my example scene, I have five animated NPCs. When rendering these NPCs, most operations are almost free: setting the proper shaders and the input layout (0.11%), updating vertex shader constant buffers (0.32%), setting textures (0.24%) and setting vertex and index buffers (0.28%). However, the rest of the GPU time (99.05%!) is spent in two function calls: DrawIndexed and DrawIndexedInstanced.

I searched this forum and the web for other articles and threads about these functions, but I haven't found much useful information. I use SharpDX and .NET Framework 4.5 to develop my game (engine). The developer of SharpDX said that "The method DrawIndexed in SharpDX is a direct call to DirectX" (Source). Since DirectX 11 is widely used and SharpDX is "only" a wrapper around the DirectX functions, I assume the problem is in my code.

How I render my scene

When rendering my scene, I render one model after another. Each model has one or more parts and one or more positions. For example, a human model has parts like head, hands, legs, torso, etc., and may be placed in different locations (on the couch, on a street, ...). For static elements like furniture, houses, etc., I use instancing, because their positions never change at run-time. Dynamic models like humans and monsters don't use instancing, because their positions change over time.

When rendering a model, I use this workflow (a simplified SharpDX sketch follows the list):

  1. Set vertex and pixel shaders, if they need to be updated (e.g. PBR shaders, simple shader, depth info shaders, ...)
  2. Set the animation data as a constant buffer in the vertex shader, if the model is animated
  3. Set generic vertex shader constant buffer (world matrix, etc.)
  4. Render all parts of the model. For each part:
    1. Set the diffuse, normal, specular and emissive texture shader resource views
    2. Set vertex buffer
    3. Set index buffer
    4. Call DrawIndexedInstanced for instanced models and DrawIndexed for non-instanced models
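
To make the list more concrete, here is a simplified SharpDX sketch of what steps 1-4 boil down to for a single model part. The method and parameter names are illustrative, not my actual engine classes:

```csharp
// Simplified SharpDX sketch of steps 1-4 for a single model part.
// Method and parameter names are illustrative, not my actual engine classes.
using SharpDX;
using SharpDX.Direct3D11;
using SharpDX.DXGI;
using Buffer = SharpDX.Direct3D11.Buffer;

static class ModelRendererSketch
{
    public static void RenderPart(
        DeviceContext context,
        VertexShader vs, PixelShader ps,              // step 1
        Buffer perObjectBuffer, Matrix worldMatrix,   // step 3 (animation data in step 2 works the same way)
        ShaderResourceView[] textures,                // step 4.1: diffuse, normal, specular, emissive
        Buffer vertexBuffer, int vertexStride,        // step 4.2
        Buffer indexBuffer, int indexCount,           // step 4.3
        int instanceCount)                            // step 4.4: 1 = not instanced
    {
        // 1. Shaders (the engine only re-sets these when they actually change)
        context.VertexShader.Set(vs);
        context.PixelShader.Set(ps);

        // 2./3. Vertex shader constant buffer (world matrix, ...)
        context.UpdateSubresource(ref worldMatrix, perObjectBuffer);
        context.VertexShader.SetConstantBuffer(0, perObjectBuffer);

        // 4.1 Textures
        context.PixelShader.SetShaderResources(0, textures);

        // 4.2 / 4.3 Vertex and index buffers
        context.InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(vertexBuffer, vertexStride, 0));
        context.InputAssembler.SetIndexBuffer(indexBuffer, Format.R32_UInt, 0);

        // 4.4 Draw call - this is where >99% of the measured GPU time shows up
        if (instanceCount > 1)
            context.DrawIndexedInstanced(indexCount, instanceCount, 0, 0, 0);
        else
            context.DrawIndexed(indexCount, 0, 0);
    }
}
```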

What's the problem

After my GPU profiling, I know that over 99% of the rendering time for a single model is spent in the DrawIndexedInstanced and DrawIndexed function calls. But why do they take so long? Do I have to optimize my vertex or pixel shaders? I do not use any other types of shaders at the moment. "Le Comte du Merde-fou" suggested in this post to merge regions of vertices into larger vertex buffers to reduce the number of draw calls. While this makes sense to me, it does not explain why rendering my five (!) animated models takes that much GPU time. To make sure I'm not analyzing something wrong, I verified that I do not use the D3D11_CREATE_DEVICE_DEBUG flag and that I run a Release build in Visual Studio, as suggested by Hodgman in this forum thread.

My engine does its job. Multi-texturing, animation, soft shadowing, instancing, etc. are all implemented, but I need to reduce the GPU load for performance reasons. Each frame takes less than 3 ms of CPU time, by the way, so I believe the problem is on the GPU side.


How many triangles and vertices are in the models? Also, what hardware (GPU) are you running on? Finally, since you don't mention post-processing, I'm assuming you're not doing any. Combine that with the fact that you don't mention a G-buffer, so you're likely forward rendering (lighting and potentially shadowing done in the forward pass), and I would expect most of your time to be in the draw calls since you're not doing much else.


One of the NPC models has 3,339 vertices. The other NPC models have a similar number of vertices.

I am running my tests on a laptop with a GeForce GTX 1050 Ti and Intel(R) HD Graphics 630.

At the moment, I do not use any kind of post-processing. My engine supports features like soft shadowing, but they are all disabled at the moment (because of performance problems on older computers).

4 hours ago, GalacticCrew said:

After my GPU profiling, I know that over 99% of the rendering time for a single model is spent in the DrawIndexedInstanced and DrawIndexed function calls

The Draw/Dispatch/Copy/Present functions are the only ones that actually queue up GPU work. The other functions just configure what the next Draw/Dispatch will do. 

If you're GPU-bound (a CPU-waiting-on-GPU situation), you should be spending 100% of your GPU time in those functions. The values that you've got for setting textures etc. are just measurement error.


Something bothers me about how you're using the term bottleneck in relation to graphics.  Take a look at this PDF:

http://developer.download.nvidia.com/assets/gamedev/docs/Graphics_Performance_Optimization.pdf

It's old, but you should find it useful. It should help you narrow down your problem.

6 hours ago, GalacticCrew said:

I am running my tests on a Laptop with a GeForce GTX 1050 Ti and Intel(R) HD Graphics 630.

One, use the NVIDIA control panel to ensure you're using the 1050 Ti instead of the integrated Intel. In your other thread you mention having problems on some older PCs; what were their specs?

How many models total are you drawing? And, as Styves asks, how many draw calls?

6 hours ago, GalacticCrew said:

One of the NPC models has 3,339 vertices. The other NPC models will have a similar amount of vertices.

That's really nothing... confirm the number of vertices and triangles of the other models.

File in case link dies:

Graphics_Performance_Optimization.pdf

edit - also I am not convinced that VSync is disabled; your results are too close to a 16.66 ms frame time... make sure the driver isn't forcing VSync.
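
For reference, from the application side vsync comes down to the sync-interval argument of Present in SharpDX; a minimal sketch (the swapChain parameter stands for whatever swap chain the engine created):

```csharp
// On the application side, vsync is just the sync interval passed to Present:
// 0 = present immediately, 1 = wait for vblank (~16.66 ms at 60 Hz).
// The driver control panel can still force vsync on and override this.
static void PresentFrame(SharpDX.DXGI.SwapChain swapChain)
{
    swapChain.Present(0, SharpDX.DXGI.PresentFlags.None);
}
```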

Edited by Infinisearch


Oh you might want to take a look at the older version of the above presentation here:

http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch28.html

It explains what 'FB' is.

Also, I took a look at the presentation and it does contain some outdated info; some of it you should be able to figure out. On the other hand, there are things you'll need guidance on, so ask... for example, it suggests using tri-strips or fans in certain circumstances. Nowadays indexed triangle lists are what most people use and they are your best bet for performance. Of course, you should optimize your mesh for both the post-transform vertex cache and the pre-transform cache to get the best performance from indexed triangle lists. You can find a library to do so here: https://gpuopen.com/gaming-product/tootle/

 


I am sorry for my late response, but I was out of office. I have not had time yet to read the presentations you linked, but I will read them this weekend. I was able to solve my performance issue. In my game engine, I check all installed graphics adapters and use the one with the most memory and the highest possible feature level. As it turned out, my testing laptop gave the same result for all graphics adapters, although the Intel HD 630 chip is clearly weaker than the GeForce GTX 1050 Ti. I adjusted the system settings and now everything runs smoothly.
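
For anyone curious, the adapter selection itself is straightforward. Below is a rough sketch of the idea (illustrative code, not my exact engine implementation); the actual fix was changing the graphics settings so my process really gets the GeForce:

```csharp
// Simplified sketch: enumerate the DXGI adapters and create the D3D11 device
// on the one with the most dedicated video memory.
using System.Linq;
using SharpDX.Direct3D11;
using SharpDX.DXGI;
using Device = SharpDX.Direct3D11.Device;

static class AdapterPicker
{
    public static Device CreateDeviceOnBestAdapter()
    {
        using (var factory = new Factory1())
        {
            // On an Optimus laptop both the Intel and the NVIDIA adapter show up here;
            // which one the process actually gets to use can still depend on the
            // Windows/NVIDIA graphics settings. (Adapter disposal omitted for brevity.)
            Adapter1 best = factory.Adapters1
                .OrderByDescending(a => (long)a.Description1.DedicatedVideoMemory)
                .First();

            return new Device(best, DeviceCreationFlags.None);
        }
    }
}
```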

VSync was turned off. In my main menu, I had around 2 ms per frame, because no 3D models were rendered. When I used a more complex scene in my game, the GPU needed 80 ms per frame, so I had lag. Using the correct graphics chip, the GPU time was reduced by a factor of 10: around 1 ms per frame in my default case and around 10 ms in very complex scenes.

I will update my FAQ, so all players know about this issue.


