ID3D11Query reporting weird results


I implemented DX queries after this blog post:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Queries work perfectly fine... for as long as I don't use VSync or any other form of Sleep. Why would that happen? I issue the queries right before my Compute/Dispatch code, issue them again right after, and then read the results (spinning on GetData for as long as it returns S_FALSE).
When I don't use VSync, my code takes a consistent 0.39-0.4 ms. After turning VSync on it starts at something like 0.46 ms, after a second it bumps up to 0.61 ms, and a few seconds later I get something like 1.2 ms.
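
For reference, the pattern I'm using is roughly this (simplified sketch; names are placeholders and error handling is omitted):

ID3D11Query* disjointQuery = nullptr;
ID3D11Query* beginQuery = nullptr;
ID3D11Query* endQuery = nullptr;

D3D11_QUERY_DESC queryDesc = {};
queryDesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
device->CreateQuery(&queryDesc, &disjointQuery);
queryDesc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&queryDesc, &beginQuery);
device->CreateQuery(&queryDesc, &endQuery);

// Around the work being measured (timestamp queries only use End()):
context->Begin(disjointQuery);
context->End(beginQuery);
context->Dispatch(groupCountX, groupCountY, 1); // the compute work being profiled
context->End(endQuery);
context->End(disjointQuery);

// Readback, spinning while GetData returns S_FALSE:
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
UINT64 beginTime, endTime;
while (context->GetData(disjointQuery, &disjointData, sizeof(disjointData), 0) == S_FALSE) {}
while (context->GetData(beginQuery, &beginTime, sizeof(beginTime), 0) == S_FALSE) {}
while (context->GetData(endQuery, &endTime, sizeof(endTime), 0) == S_FALSE) {}

if (!disjointData.Disjoint)
{
    double milliseconds = 1000.0 * (endTime - beginTime) / disjointData.Frequency;
}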

I also used this source:
http://reedbeta.com/blog/gpu-profiling-101/
The difference here is that the author uses the disjoint query for the whole Render() function instead of one per particular measurement. When I implemented it this way the timings were inconsistent (like above: 0.46, 0.61, 1.2 ms) regardless of VSync.


Are you spinning on the query results immediately, or do you wait a frame before trying to get the results? 

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

12 hours ago, Hodgman said:

Are you spinning on the query results immediately, or do you wait a frame before trying to get the results? 

I tested both. No difference.

10 hours ago, ajmiles said:

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

I thought about something along those lines but quickly came to the conclusion that it shouldn't happen. I assumed everything should run and take as much time as in the no-VSync case, because Present is where the waiting happens; why would any redundant work show up in my actual computation time?
I just checked how much time Present takes with VSync and it's indeed around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile would take more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Wrapping the whole Render function in one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation earlier. It behaves exactly the same as Begin/End of the disjoint query placed right before and after the block being profiled.

39 minutes ago, maxest said:

I tested both. No difference.

I thought about something along those lines but quickly came to the conclusion that it shouldn't happen. I assumed everything should run and take as much time as in the no-VSync case, because Present is where the waiting happens; why would any redundant work show up in my actual computation time?
I just checked how much time Present takes with VSync and it's indeed around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile would take more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Wrapping the whole Render function in one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation earlier. It behaves exactly the same as Begin/End of the disjoint query placed right before and after the block being profiled.

Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time.

If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead.

What's happening is something like this:

1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
2) The GPU realises it could run a bit slower and still be done in plenty of time so it underclocks itself just a little bit to save power.
3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
4) Still plenty of time spare, so it underclocks itself even further.
5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
6) *Still* plenty of time spare, so it dramatically underclocks itself.
7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any more than 1.2ms.

In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem.

This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz. Note that this API does not work on end-users' machines as it requires Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end-user's machine, because we've artificially limited the clock speed below the peak to ensure a stable and consistent clock speed. With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't ramp up and down as in your situation.

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.
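
Something along these lines ought to do it (untested sketch, error handling omitted):

#include <windows.h>
#include <d3d12.h>
#pragma comment(lib, "d3d12.lib")

int main()
{
    // Create a device on the default adapter.
    ID3D12Device* device = nullptr;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device))))
        return 1;

    // Fix the clock speed; as noted above, this requires the Debug bits to be installed.
    if (FAILED(device->SetStablePowerState(TRUE)))
        return 1;

    // Keep the device alive; the clock stays fixed for as long as this process is running.
    Sleep(INFINITE);
    return 0;
}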

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

2 hours ago, ajmiles said:

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

@ajmiles: Thank you so much for this detailed explanation. I hadn't thought about the GPU changing its clock speed. This makes more sense than some redundant work being performed :).

I have checked what you proposed. I took some simple DX12 sample, called SetStablePowerState with true (I needed to turn on Developer Mode on my Windows 10; I wasn't aware of its existence) and called a permanent Sleep. Then I ran my application. Now, regardless of whether I use VSync or not, or call Sleep in my app or not, I get a consistent 0.46 ms. It's more than the 0.4 ms I got without VSync and without SetStablePowerState, but at least it's stable. So as I understand it, the GPU is working at a lower clock speed than it could (no Boost), but this speed is fixed.

I have one more case whose results I don't entirely understand. I have code of this form:


-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler


The GPU profiler reports 5 ms whereas the CPU profiler reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU one? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer, I get around 1 ms.

I would just like to ask about one more, related thing. In my search for reliable counters I stumbled upon D3D11 counters (https://msdn.microsoft.com/en-us/library/windows/desktop/ff476364(v=vs.85).aspx) but could find no simple example of their usage. Do they work at all?

39 minutes ago, SoldierOfLight said:

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Interesting, it might be that we haven't pushed anything out with that change in it yet. It still exists in the Creators Update SDK, and on whatever release of Windows 10 'maxest' is running it still seems to work.

I'll follow up with you offline about why we decided the API wasn't useful. It feels like it still has value in scenarios where you want consistent timings from run to run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development / optimisation.

38 minutes ago, maxest said:

I have one more case whose results I don't entirely understand. I have code of this form:



-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profiler
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler

The GPU profiler reports 5 ms whereas the CPU profiler reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU one? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer, I get around 1 ms.

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make as a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc.) are broken up into segments and sent to the GPU as a batch rather than one by one. The Begin/End query packets are no different. It may be that the timestamp query you've inserted after the Map is the first command after a segment break, and therefore it isn't sent to the GPU immediately after the CopyResource/Map commands have executed. My theory, then, is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU to start executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?
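
In other words, something like this (just a sketch of the ordering; stagingBuffer/gpuBuffer and the query objects are placeholders):

context->Begin(disjointQuery);
context->End(beginQuery);
context->CopyResource(stagingBuffer, gpuBuffer); // GPU -> staging copy
context->End(endQuery);                          // timestamp recorded before the Map
context->End(disjointQuery);
context->Flush();                                // optional: force this segment to be submitted now

D3D11_MAPPED_SUBRESOURCE mapped = {};
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);
// ... read mapped.pData ...
context->Unmap(stagingBuffer, 0);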

I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

As far as counters go, they're all IHV-specific these days. In D3D10 there were API-defined counters, but they were deprecated in D3D11.

The current model for performance counters is the plugin model exposed by PIX.

Also, I just checked, and apparently I was wrong about SetStablePowerState: we did keep it around, we just moved it from requiring the D3D12 debug layers to requiring Developer Mode. My bad.

1 hour ago, ajmiles said:

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make as a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc.) are broken up into segments and sent to the GPU as a batch rather than one by one. The Begin/End query packets are no different. It may be that the timestamp query you've inserted after the Map is the first command after a segment break, and therefore it isn't sent to the GPU immediately after the CopyResource/Map commands have executed. My theory, then, is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU to start executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?

I actually did try placing the End query right after CopyResource and before Map, and that reported (as far as I remember; I can't check now) something around 0.1 ms. Now I'm not really sure how I should measure the time it takes to download data from the GPU to the CPU.

My CPU timer, when used to enclose CopyResource and Map, reported that downloading 11.5 GB took 1 second, which agrees with a CUDA-based test application for measuring PCI-E throughput that I used. When I lowered the size to 8 MB the download took 1.5 ms, and at 1 MB it took 1 ms. I'm not sure whether PCI-E downloads should scale linearly with data size, but my tests show that they don't, at least according to my CPU timer. The 0.1 ms reported by the GPU timer when measuring just CopyResource would indicate linear scaling, though. So now I'm not sure whether I should trust the CPU timer reporting 1 ms (CopyResource + Map) or the GPU timer reporting 0.1 ms (just CopyResource).
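
Just to lay out the arithmetic behind those numbers, the effective transfer rates my measurements imply are roughly:

11.5 GB / 1 s   = 11.5 GB/s
8 MB / 1.5 ms   ≈ 5.3 GB/s
1 MB / 1 ms     = 1 GB/s   (CPU timer, CopyResource + Map)
1 MB / 0.1 ms   = 10 GB/s  (GPU timer, CopyResource only)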

