DX11 ID3D11Query reporting weird results


I implemented DX queries after this blog post:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Queries work perfectly fine... as long as I don't use VSync or any other form of Sleep. Why would that happen? I record a query right before my Compute/Dispatch code, record another right after, and then read the results (spinning on GetData while it returns S_FALSE).
Without VSync my code takes a consistent 0.39-0.4 ms. After turning VSync on it starts at something like 0.46 ms, bumps up to 0.61 ms after a second, and a few seconds later I get something like 1.2 ms.

I also used this source:
http://reedbeta.com/blog/gpu-profiling-101/
The difference here is that the author uses the disjoint query for the whole Render() function instead of one per particular measurement. When I implemented it this way the timings were inconsistent (like the 0.46, 0.61, 1.2 above) regardless of VSync.
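
For reference, this is roughly the pattern I use for a single measurement (a simplified sketch; 'device'/'context' are my existing D3D11 objects and the Dispatch with 'groupsX'/'groupsY' stands in for the actual compute work being profiled):

// Timestamp-query measurement around a compute dispatch (simplified sketch).
D3D11_QUERY_DESC desc = {};
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query* disjointQuery = nullptr;
device->CreateQuery(&desc, &disjointQuery);

desc.Query = D3D11_QUERY_TIMESTAMP;
ID3D11Query* startQuery = nullptr;
ID3D11Query* endQuery = nullptr;
device->CreateQuery(&desc, &startQuery);
device->CreateQuery(&desc, &endQuery);

context->Begin(disjointQuery);
context->End(startQuery);                 // timestamp right before the work

context->Dispatch(groupsX, groupsY, 1);   // the compute work being profiled

context->End(endQuery);                   // timestamp right after the work
context->End(disjointQuery);

// Spin on GetData until the results are ready (it returns S_FALSE while pending).
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
while (context->GetData(disjointQuery, &disjointData, sizeof(disjointData), 0) == S_FALSE) {}

UINT64 startTime = 0, endTime = 0;
while (context->GetData(startQuery, &startTime, sizeof(startTime), 0) == S_FALSE) {}
while (context->GetData(endQuery, &endTime, sizeof(endTime), 0) == S_FALSE) {}

if (!disjointData.Disjoint)
{
    double ms = double(endTime - startTime) / double(disjointData.Frequency) * 1000.0;
    // 'ms' is what I report as the GPU time of the dispatch.
}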


This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames' worth of work per second, which it can easily deliver at reduced clock speeds.

12 hours ago, Hodgman said:

Are you spinning on the query results immediately, or do you wait a frame before trying to get the results? 

I tested both. No difference.

10 hours ago, ajmiles said:

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames' worth of work per second, which it can easily deliver at reduced clock speeds.

I thought about something along those lines but quickly came to the conclusion that it shouldn't be happening. I assumed everything should take as much time as in the no-VSync case, because it's Present where the waiting happens; why would any redundant work show up in my actual computation time?
I just checked how much time Present takes with VSync and indeed it's around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile takes more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation earlier. It behaves exactly the same as Begin/End of the disjoint query placed right before and after the block we're profiling.

Edited by maxest

39 minutes ago, maxest said:

I tested both. No difference.

I thought about something along those lines but quickly came to the conclusion that it shouldn't be happening. I assumed everything should take as much time as in the no-VSync case, because it's Present where the waiting happens; why would any redundant work show up in my actual computation time?
I just checked how much time Present takes with VSync and indeed it's around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile takes more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation earlier. It behaves exactly the same as Begin/End of the disjoint query placed right before and after the block we're profiling.

Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time.

If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead.

What's happening is something like this:

1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
2) The GPU realises it could run a bit slower and still be done in plenty of time so it underclocks itself just a little bit to save power.
3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
4) Still plenty of time spare, so it underclocks itself even further.
5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
6) *Still* plenty of time spare, so it dramatically underclocks itself.
7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any more than 1.2ms.

In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem.

This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz.

Note that this API does not work on end users' machines as it requires the Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end user's machine, because we've artificially limited the clock speed below the peak to ensure it stays stable and consistent. With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't clock up and down as in your situation.

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.
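
Something along these lines should be all that's needed (an untested sketch on my part; error handling is omitted and it just grabs the first adapter in the system):

// Dummy D3D12 app that pins the GPU clock via SetStablePowerState and then sleeps.
// Untested sketch; error handling omitted.
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")
#pragma comment(lib, "dxgi.lib")

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    ComPtr<IDXGIAdapter1> adapter;
    factory->EnumAdapters1(0, &adapter);       // first adapter in the system

    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // Fix the GPU clock to its stable (non-boost) frequency. Requires the debug
    // bits / developer mode, otherwise the call fails or removes the device.
    device->SetStablePowerState(TRUE);

    // Keep the device alive; the clock stays fixed while this process is running.
    Sleep(INFINITE);
    return 0;
}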

2 hours ago, ajmiles said:

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Edited by SoldierOfLight


@ajmiles: Thank you so, so much for this detailed explanation. I hadn't thought about the GPU clock changing its speed. That makes more sense than some redundant work being performed :).

I checked what you proposed. I took a simple DX12 sample, called SetStablePowerState with TRUE (I needed to turn Developer Mode on in my Windows 10; I wasn't aware of its existence), and then put it into an infinite Sleep. Then I ran my application. Now, regardless of whether I use VSync or call Sleep in my app, I get a consistent 0.46 ms. That's more than the 0.4 ms I got without VSync and without SetStablePowerState, but at least it's stable. So as I understand it, the GPU is running at a lower clock speed than it could (without Boost), but that speed is fixed.

I have one more case whose results I don't entirely understand. I have code of this form:

-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler


The GPU profiler reports 5 ms whereas the CPU one reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU one? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer, I get around 1 ms.
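
In actual D3D11 calls the block above is roughly this (a simplified sketch; 'stagingBuffer'/'gpuBuffer' and the two timestamp query objects are placeholders for my real resources):

// Simplified sketch of the measured block. 'stagingBuffer' is a
// D3D11_USAGE_STAGING / CPU_ACCESS_READ copy target, 'gpuBuffer' the
// default-usage source holding the ~1 MB of results.
LARGE_INTEGER freq, cpuStart, cpuEnd;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&cpuStart);                           // -- Begin CPU Profiler

context->End(gpuStartQuery);                                  // -- Begin GPU Profiler

context->CopyResource(stagingBuffer, gpuBuffer);              // download from GPU to CPU

D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);   // blocks until the copy is done

context->End(gpuEndQuery);                                    // -- End GPU Profiler

// ... do something with mapped.pData ...

context->Unmap(stagingBuffer, 0);

QueryPerformanceCounter(&cpuEnd);                             // -- End CPU Profiler
double cpuMs = double(cpuEnd.QuadPart - cpuStart.QuadPart) / double(freq.QuadPart) * 1000.0;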

I would just like to ask one more relevant thing. In my search for reliable counters I stumbled upon this (https://msdn.microsoft.com/en-us/library/windows/desktop/ff476364(v=vs.85).aspx) but could find no simple example of its usage. Does it work at all?

Edited by maxest

39 minutes ago, SoldierOfLight said:

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Interesting, it might be that we haven't pushed anything out yet with that change in it. It still exists in the Creators Update SDK, and on whatever release of Windows 10 'maxest' is running it still seems to work.

I'll follow up with you offline about why we decided the API wasn't useful. It feels like it still has value in scenarios where you want a consistent time from run to run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development / optimisation.

38 minutes ago, maxest said:

I have one more case whose results I don't entirely understand. I have code of this form:


-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler

The GPU profiler reports 5 ms whereas the CPU one reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU one? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer, I get around 1 ms.

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make being a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than 1 by 1. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command after a batch of commands is sent to the GPU and therefore it isn't immediately sent to the GPU after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU starts executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?
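
Concretely, something like this (a sketch with placeholder names; the point is only where the End of the timestamp query sits relative to Map):

// Move the End of the timestamp query so it lands in the same command
// segment as the copy, rather than sitting in the buffer until the next flush.
context->End(gpuStartQuery);

context->CopyResource(stagingBuffer, gpuBuffer);

context->End(gpuEndQuery);   // moved: recorded right after the copy, before Map
context->Flush();            // optional: force this segment to the GPU now

D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);
// ...
context->Unmap(stagingBuffer, 0);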

I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.

Edited by ajmiles


As far as counters go, they're all IHV-specific counters. In D3D10 there were API-defined counters, but they were deprecated in D3D11.

The current model for performance counters is the plugin model exposed by PIX.

Also, I just checked, and apparently I was wrong about SetStablePowerState: we did keep it around, we just moved it from requiring the D3D12 debug layers to requiring developer mode. My bad.

Edited by SoldierOfLight

1 hour ago, ajmiles said:

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make being a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than 1 by 1. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command after a batch of commands is sent to the GPU and therefore it isn't immediately sent to the GPU after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU starts executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?

I actually did try placing the End query right after CopyResource and before Map, and that reported (as far as I remember, I can't check now) something around 0.1 ms. Now I'm not really sure how I should measure the time it takes to download data from the GPU to the CPU. My CPU timer, when used to enclose CopyResource and Map, reported that downloading 11.5 GB took 1 second, which agrees with a CUDA-based test application for measuring PCI-E throughput that I used. When lowered to 8 MB the download took 1.5 ms, and when lowered to 1 MB it took 1 ms. I'm not sure whether PCI-E downloads should scale linearly with data size, but my tests show that they don't; at least that's what my CPU timer says. The 0.1 ms reported by the GPU timer when measuring CopyResource, however, would indicate linear scaling. So now I'm not sure whether I should trust the CPU timer reporting 1 ms (CopyResource + Map) or the GPU timer reporting 0.1 ms (just CopyResource).


0.1ms sounds about right for copying 1MB over a bus that's roughly 16GB/s, so I'd be inclined to believe that number. It should scale approximately linearly.

You have to bear in mind that the CPU timer isn't just timing how long it takes the CPU to do useful work, but how long it takes the GPU to catch up and do all its outstanding work. By calling Map you've required the GPU to catch up and execute all the work in its queue, do the copy and signal to the CPU that it's done. The more work the GPU has to run prior to the call to "CopyResource", the longer the CPU has to sit there and wait for it to complete. For that reason, I wouldn't expect the CPU timer to ever record a very low value in the region of 0.1ms no matter how small the copy is.
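
As a rough sanity check on the numbers (assuming ~16 GB/s of effective bus bandwidth; real transfers will come in somewhat under the theoretical peak):

// Back-of-the-envelope transfer times at an assumed ~16 GB/s.
const double bytesPerSecond = 16.0e9;
const double MiB = 1024.0 * 1024.0;

double ms1MiB = (1.0 * MiB) / bytesPerSecond * 1000.0;   // ~0.066 ms
double ms8MiB = (8.0 * MiB) / bytesPerSecond * 1000.0;   // ~0.52 ms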


I suspected it should have been around 0.1 ms: after refactoring the whole "system" I'm working on so that I only need to download 1 MB instead of 8 MB, the total processing time went down by around 1.5 ms.

Thank you again so much, ajmiles.



