About ajmiles
  1. If it's unused/unavailable and the API gives you no way to write to it or read from it then the IHVs are free to not allocate it. The 'X24' part isn't necessarily allocated (but it might be). There are no guarantees whether it is or not. On DX12 you can use the GetResourceAllocationInfo API to query how much memory a particular driver/GPU needs to allocate for a surface and you'll be able to see whether DXGI_FORMAT_D32_FLOAT_S8X24_UINT is 64-bit or not.
  2. AMD hardware (GCN at least) allocates two separate planes for Depth and Stencil. D24/D32 reside in a 32-bit plane and S8 resides in a separate 8-bit plane. There's no Depth-Stencil format that has the memory footprint of 64-bits per sample that I'm aware of.
  3. It's possible that the version I'm on (16251) has newer GPU Validation bits than what you're running. What version of Windows 10 are you running? Run 'winver' at a command prompt and there should be an OS Build number in parentheses.
  4. Turns out the hang wasn't 100%. It 'Succeeded' and rendered the cube after the test for the first few runs, but did hang on a later run. The GPU-Based Validation error is still there though.
  5. I get no GPU hang here on a 980 Ti but I do get a GPU Based Validation error that you seem to have introduced: D3D12 ERROR: GPU-BASED VALIDATION: Dispatch, Descriptor heap index out of bounds: Heap Index To DescriptorTableStart: [0], Heap Index From HeapStart: [0], Heap Type: D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV, Num Descriptor Entries: 0, Index of Descriptor Range: 0, Shader Stage: COMPUTE, Root Parameter Index: [1], Dispatch Index: [0], Shader Code: TestCompute.hlsl(13,3-40), Asm Instruction Range: [0xbc-0xdf], Asm Operand Index: [0], Command List: 0x000001F3C5E38C20:'m_testComputeList', SRV/UAV/CBV Descriptor Heap: 0x000001F3C5C824B0:'m_testComputeCBVHeap', Sampler Descriptor Heap: <not set>, Pipeline State: 0x000001F3C5973380:'m_testComputePipeline', [ EXECUTION ERROR #936: GPU_BASED_VALIDATION_DESCRIPTOR_HEAP_INDEX_OUT_OF_BOUNDS]
  6. In the latest Xbox compiler the alleged race condition is just a warning rather than an error, as it is in the 15063 Windows SDK. I've filed a bug to make sure it ends up being neither.
  7. If you'd like to get rid of the iffy-looking calculation for "MipMapCount", just call Tex->GetDesc() and the MipLevels field will be populated with the actual mip count (rather than the 0 you filled it in with to create it). Glad it's working!
  8. You seem to be updating subresource 'i' here: D3D11DeviceContext->UpdateSubresource(Tex.Get(), i, 0, initData[i].pSysMem, initData[i].SysMemPitch, size); Whereas you should be updating subresource index D3D11CalcSubresource(mip, slice, numMips). The six subresources you want to update (mip 0 of each slice) are not subresources 0, 1, 2, 3, 4, 5, but rather (0 * numMips), (1 * numMips), (2 * numMips)... etc.
  9. I would try running this on WARP, just to be sure. The stretch marks at the edge of the copy I have no explanation for, and so it may well be a bug in a particular hardware vendor's driver / implementation.
  10. 0.1ms sounds about right for copying 1MB over a bus that's roughly 16GB/s, so I'd be inclined to believe that number. It should scale approximately linearly. You have to bear in mind that the CPU timer isn't just timing how long it takes the CPU to do useful work, but how long it takes the GPU to catch up and do all its outstanding work. By calling Map you've required the GPU to catch up and execute all the work in its queue, do the copy and signal to the CPU that it's done. The more work the GPU has to run prior to the call to "CopyResource", the longer the CPU has to sit there and wait for it to complete. For that reason, I wouldn't expect the CPU timer to ever record a very low value in the region of 0.1ms no matter how small the copy is.
  11. Interesting, it might be that we haven't pushed anything out yet with that change in. It still exists in the Creators Update SDK, and on whatever release of Windows 10 'maxest' is running it still seems to work. I'll follow up with you offline about why we decided the API wasn't useful. It feels like it still has value in scenarios where you want a consistent time from run to run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development/optimisation.
  I don't have a definitive answer as to why this might be, but I do have one theory. You can think of (almost) every API call you make as a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc.) are broken up into segments and sent to the GPU as a batch rather than one by one. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command of the next batch sent to the GPU, and therefore it isn't executed immediately after the CopyResource/Map events. My theory, then, is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU to start executing useful work again. You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the Map may be sufficient to get it executed in the segment before the break. Try that, perhaps? I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.
  12. Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time. If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead. What's happening is something like this:
  1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
  2) The GPU realises it could run a bit slower and still be done in plenty of time, so it underclocks itself just a little to save power.
  3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
  4) Still plenty of time spare, so it underclocks itself even further.
  5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
  6) *Still* plenty of time spare, so it dramatically underclocks itself.
  7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
  8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any longer than 1.2ms.
  In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem. This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz. Note that this API does not work on end users' machines, as it requires Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end user's machine, because we've artificially limited the clock speed below the peak to ensure a stable and consistent clock speed.
With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't clock up and down as in your situation. Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.
  13. This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency. By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.
  14. The example does use textures of the same resolution, but indeed there's no reason they need to have the same Width, Height, Format or Mip Count. So long as they are an array of 2D textures, that's fine. Depending on how many textures you want bound at once, be aware that you may be excluding Resource Binding Tier 1 hardware: https://msdn.microsoft.com/en-gb/library/windows/desktop/dn899127(v=vs.85).aspx
  Note that in order to get truly non-uniform resource indexing you need to tell HLSL and the compiler that the index is non-uniform using the "NonUniformResourceIndex" intrinsic. Failing to do this will likely result in the index from the first thread of the wave deciding which texture every thread samples from. https://msdn.microsoft.com/en-us/library/windows/desktop/dn899207(v=vs.85).aspx
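A minimal HLSL sketch of the intrinsic in use; the array size, register assignments and function name here are illustrative, not from the original thread:

```hlsl
// Illustrative only: 8 textures, with the index arriving per-pixel.
Texture2D    g_Textures[8] : register(t0);
SamplerState g_Sampler     : register(s0);

float4 SampleDynamic(uint index, float2 uv)
{
    // Without NonUniformResourceIndex the compiler may assume 'index' is
    // uniform across the wave, so some threads can end up sampling from
    // whichever texture the first thread selected.
    return g_Textures[NonUniformResourceIndex(index)].Sample(g_Sampler, uv);
}
```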