maxest

DX11 ID3D11Query reporting weird results

Recommended Posts

maxest    623

I implemented DX queries after this blog post:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Queries work perfectly fine... as long as I don't use VSync or any other form of Sleep. Why would that happen? I record queries right before my Compute/Dispatch code, record them right after, and then read the results (spinning on GetData if it returns S_FALSE).
When I don't use VSync my code takes a consistent 0.39-0.4 ms. After turning VSync on it starts at something like 0.46 ms, after a second it bumps up to 0.61 ms, and a few seconds later I get something like 1.2 ms.
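For reference, my measurement pattern is roughly the following (a minimal sketch of the disjoint/timestamp approach from the blog post; error checking omitted, and device/context and the other names are just placeholders):

ID3D11Query *disjointQuery, *beginQuery, *endQuery;
D3D11_QUERY_DESC disjointDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
D3D11_QUERY_DESC timestampDesc = { D3D11_QUERY_TIMESTAMP, 0 };
device->CreateQuery(&disjointDesc, &disjointQuery);
device->CreateQuery(&timestampDesc, &beginQuery);
device->CreateQuery(&timestampDesc, &endQuery);

// per frame, around the work being measured
context->Begin(disjointQuery);
context->End(beginQuery);               // timestamp right before the Dispatch
context->Dispatch(x, y, z);
context->End(endQuery);                 // timestamp right after the Dispatch
context->End(disjointQuery);

// read back, spinning on GetData until the results are ready
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
UINT64 begin, end;
while (context->GetData(disjointQuery, &disjointData, sizeof(disjointData), 0) == S_FALSE) {}
while (context->GetData(beginQuery, &begin, sizeof(begin), 0) == S_FALSE) {}
while (context->GetData(endQuery, &end, sizeof(end), 0) == S_FALSE) {}
if (!disjointData.Disjoint)
{
    double milliseconds = 1000.0 * double(end - begin) / double(disjointData.Frequency);
}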

I also used this source:
http://reedbeta.com/blog/gpu-profiling-101/
The difference here is that the author uses the disjoint query for the whole Render() function instead of one per particular measurement. When I implemented it this way the timings were inconsistent (like the 0.46, 0.61, 1.2 above) regardless of VSync.

ajmiles    3320

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

maxest    623
12 hours ago, Hodgman said:

Are you spinning on the query results immediately, or do you wait a frame before trying to get the results? 

I tested both. No difference.

10 hours ago, ajmiles said:

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

I thought about something along those lines but quickly concluded that it shouldn't happen. I thought everything should run and take as much time as in the no-VSync case, because it's Present where the waiting happens; why would any redundant work happen during my actual computation time?
I just checked how much time Present takes with VSync and indeed it's around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile takes more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation before. It behaves exactly the same as Begin/End of the disjoint query right before and after the block we're profiling.

Edited by maxest

ajmiles    3320
39 minutes ago, maxest said:

I tested both. No difference.

I thought about something along those lines but quickly concluded that it shouldn't happen. I thought everything should run and take as much time as in the no-VSync case, because it's Present where the waiting happens; why would any redundant work happen during my actual computation time?
I just checked how much time Present takes with VSync and indeed it's around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile takes more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation before. It behaves exactly the same as Begin/End of the disjoint query right before and after the block we're profiling.

Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time.

If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead.

What's happening is something like this:

1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
2) The GPU realises it could run a bit slower and still be done in plenty of time so it underclocks itself just a little bit to save power.
3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
4) Still plenty of time spare, so it underclocks itself even further.
5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
6) *Still* plenty of time spare, so it dramatically underclocks itself.
7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any more than 1.2ms.
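Put another way, the measured time scales roughly inversely with the clock: 0.4 ms * (1500 MHz / 500 MHz) = 1.2 ms, which is exactly the kind of drift you're describing.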

In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem.

This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz.

Note that this API does not work on end users' machines, as it requires the Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end user's machine, because we've artificially limited the clock speed below the peak to ensure a stable and consistent clock.

With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't clock up and down as in your situation.

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.
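Something along these lines ought to do it (a rough, untested sketch; real code should check the HRESULTs properly):

#include <windows.h>
#include <d3d12.h>
#pragma comment(lib, "d3d12.lib")

int main()
{
    // Create a D3D12 device on the default adapter.
    ID3D12Device* device = nullptr;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device))))
        return 1;

    // Fix the GPU clock to its stable frequency for as long as this device stays alive.
    if (FAILED(device->SetStablePowerState(TRUE)))
        return 1;

    // Sleep forever so the device (and the stable clock) persists while you profile the other app.
    Sleep(INFINITE);
    return 0;
}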

SoldierOfLight    2163
2 hours ago, ajmiles said:

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Edited by SoldierOfLight

maxest    623

@ajmiles: Thank you so so much for this detailed explanation. I hadn't thought about the GPU clock changing its speed. This makes more sense than performing some redundant work :).

I have checked what you proposed. I took a simple DX12 sample, called SetStablePowerState with true (I needed to turn on Developer Mode on my Windows 10; I wasn't aware of its existence) and called an infinite Sleep. Then I ran my application. Now, regardless of whether I use VSync or not, or call Sleep in my app or not, I get a consistent 0.46 ms. It's more than the 0.4 ms I got without VSync and without SetStablePowerState, but at least it's stable. So as I understand it, the GPU is working at a lower clock speed than it could (without Boost), but that speed is fixed.

I have one more case whose results I don't entirely understand. I have code of this form:

-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler


The GPU profiler reports 5 ms whereas the CPU profiler reports 2-3 ms. If anything, shouldn't the CPU timer report a larger time than the GPU timer? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer I get around 1 ms.
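Concretely, the code looks roughly like this (a sketch; gpuProfiler is my wrapper around the timestamp queries, and stagingBuffer, gpuBuffer and ProcessData are placeholders):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);                     // begin CPU profiler

gpuProfiler.Begin(context);                       // begin GPU profiler
context->CopyResource(stagingBuffer, gpuBuffer);  // download from GPU to the staging resource
D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);
gpuProfiler.End(context);                         // end GPU profiler

ProcessData(mapped.pData);                        // do something with the mapped data
context->Unmap(stagingBuffer, 0);

QueryPerformanceCounter(&t1);                     // end CPU profiler
double cpuMs = 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);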

I would just like to ask one more relevant thing. In my search for reliable counters I stumbled upon this (https://msdn.microsoft.com/en-us/library/windows/desktop/ff476364(v=vs.85).aspx) but could find no simple example of its usage. Does it work at all?

Edited by maxest

ajmiles    3320
39 minutes ago, SoldierOfLight said:

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Interesting, it might be that we haven't pushed anything out yet with that change in it. It still exists in the Creators Update SDK, and on whatever release of Windows 10 'maxest' is running it still seems to work.

I'll follow up with you offline about why we decided the API wasn't useful. It feels like it still has value in scenarios where you want a consistent time from run to run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development/optimisation.

38 minutes ago, maxest said:

I have one more case whose results I don't entirely understand. I have code of this form:


-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler

The GPU profiler reports 5 ms whereas the CPU profiler reports 2-3 ms. If anything, shouldn't the CPU timer report a larger time than the GPU timer? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer I get around 1 ms.

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make being a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than 1 by 1. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command after a batch of commands is sent to the GPU and therefore it isn't immediately sent to the GPU after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU starts executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?
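In terms of your pseudocode, that would be something like:

-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
-- End GPU Profiler
Map
do something with mapped data
Unmap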

I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.

Edited by ajmiles

SoldierOfLight    2163

As far as counters go, they're all IHV-specific counters now. In D3D10 there were API-defined counters, but they were deprecated in D3D11.

The current model for performance counters is the plugin model exposed by PIX.

Also, I just checked, and apparently I was wrong about SetStablePowerState: we did keep it around, we just moved it from requiring the D3D12 debug layers to requiring Developer Mode. My bad.

Edited by SoldierOfLight

maxest    623
1 hour ago, ajmiles said:

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make being a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than 1 by 1. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command after a batch of commands is sent to the GPU and therefore it isn't immediately sent to the GPU after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU starts executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?

I actually did try placing the End query right after CopyResource and before Map, and that reported (as far as I remember, I can't check now) something around 0.1 ms. Now I'm not really sure how I should measure the time it takes to download data from GPU to CPU. My CPU timer, when used to enclose CopyResource and Map, reported that downloading 11.5 GB took 1 second, which agrees with a CUDA-based test application for measuring PCI-E throughput that I used. When I lowered it to 8 MB the download took 1.5 ms, and when I lowered it to 1 MB it took 1 ms. I'm not sure whether PCI-E downloads should scale linearly as a function of data size, but my tests show that they don't; at least that's what my CPU timer says. The 0.1 ms reported by the GPU timer when measuring CopyResource would indicate linear scaling, though. So now I'm not sure whether I should trust the CPU timer reporting 1 ms (CopyResource + Map) or the GPU timer reporting 0.1 ms (just CopyResource).

ajmiles    3320

0.1ms sounds about right for copying 1MB over a bus that's roughly 16GB/s, so I'd be inclined to believe that number. It should scale approximately linearly.
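(1 MB at ~16 GB/s works out to roughly 0.06 ms, so 0.1 ms including some fixed overhead is plausible.)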

You have to bear in mind that the CPU timer isn't just timing how long it takes the CPU to do useful work, but how long it takes the GPU to catch up and do all its outstanding work. By calling Map you've required the GPU to catch up and execute all the work in its queue, do the copy and signal to the CPU that it's done. The more work the GPU has to run prior to the call to "CopyResource", the longer the CPU has to sit there and wait for it to complete. For that reason, I wouldn't expect the CPU timer to ever record a very low value in the region of 0.1ms no matter how small the copy is.

maxest    623

I thought it should have been 0.1 ms, because after refactoring the whole "system" I'm working on so that I only need to download 1 MB instead of 8 MB, the total processing time went down by around 1.5 ms.

Thank you again so much ajmiles.


