ajmiles

About ajmiles

  1. I would try running this on WARP, just to be sure. I have no explanation for the stretch marks at the edge of the copy, so it may well be a bug in a particular hardware vendor's driver / implementation.
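
     In case it's useful, creating a WARP device is a one-line change to device creation; a minimal sketch (error handling omitted):

         #include <d3d11.h>
         #pragma comment(lib, "d3d11.lib")

         // Create the device on WARP (the software rasterizer) instead of the
         // hardware adapter, to rule out a vendor-specific driver bug.
         ID3D11Device* device = nullptr;
         ID3D11DeviceContext* context = nullptr;
         HRESULT hr = D3D11CreateDevice(
             nullptr,                  // default adapter
             D3D_DRIVER_TYPE_WARP,     // software rasterizer
             nullptr, 0,               // no software module, no creation flags
             nullptr, 0,               // default feature levels
             D3D11_SDK_VERSION,
             &device, nullptr, &context);
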
  2. 0.1ms sounds about right for copying 1MB over a bus that's roughly 16GB/s, so I'd be inclined to believe that number. It should scale approximately linearly. You have to bear in mind that the CPU timer isn't just timing how long it takes the CPU to do useful work, but how long it takes the GPU to catch up and do all its outstanding work. By calling Map you've required the GPU to catch up and execute all the work in its queue, do the copy and signal to the CPU that it's done. The more work the GPU has to run prior to the call to "CopyResource", the longer the CPU has to sit there and wait for it to complete. For that reason, I wouldn't expect the CPU timer to ever record a very low value in the region of 0.1ms no matter how small the copy is.
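
     To make that concrete, here's a rough sketch (mine, not from the original code) of what such a CPU-side timer is actually measuring; Map blocks until the GPU has drained everything queued ahead of the copy:

         #include <windows.h>
         #include <d3d11.h>

         // Times CopyResource + Map on the CPU. Map(READ) can't return until the
         // GPU has executed all prior queued work plus the copy, so this measures
         // "GPU catch-up + copy", not the copy alone.
         double TimeCopyAndMapMs(ID3D11DeviceContext* ctx,
                                 ID3D11Resource* staging, ID3D11Resource* source)
         {
             LARGE_INTEGER freq, t0, t1;
             QueryPerformanceFrequency(&freq);
             QueryPerformanceCounter(&t0);

             ctx->CopyResource(staging, source);

             D3D11_MAPPED_SUBRESOURCE mapped = {};
             ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped); // blocks here
             ctx->Unmap(staging, 0);

             QueryPerformanceCounter(&t1);
             return 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
         }
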
  3. Interesting, it might be that we haven't pushed anything out yet with that change in. It still exists in the Creators Update SDK, and on whatever release of Windows 10 'maxest' is running it still seems to work. I'll follow up with you offline about why we decided the API wasn't useful. It feels like it still has value in scenarios where you want a consistent time from run to run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development / optimisation.

     I don't have a definitive answer to why this might be, but I do have one theory. You can think of (almost) every API call you make as a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than one by one. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command after a batch of commands is sent to the GPU, and therefore it isn't immediately sent to the GPU after the CopyResource/Map events have executed. My theory, then, is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU to start executing useful work again.

     You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the Map may be sufficient to get it executed in the segment before the break. Try that perhaps? I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.
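
     For reference, a sketch of how the queries could be arranged so the closing timestamp is issued before the Map (identifiers are mine; error handling omitted):

         #include <d3d11.h>

         // Timestamps bracket the CopyResource, and the closing timestamp is
         // issued *before* Map so it travels in the same segment as the copy
         // rather than in some later flush.
         void TimeCopyOnGpu(ID3D11Device* dev, ID3D11DeviceContext* ctx,
                            ID3D11Resource* staging, ID3D11Resource* source)
         {
             D3D11_QUERY_DESC ts = { D3D11_QUERY_TIMESTAMP, 0 };
             D3D11_QUERY_DESC dj = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
             ID3D11Query *tsBegin, *tsEnd, *disjoint;
             dev->CreateQuery(&ts, &tsBegin);
             dev->CreateQuery(&ts, &tsEnd);
             dev->CreateQuery(&dj, &disjoint);

             ctx->Begin(disjoint);
             ctx->End(tsBegin);                   // timestamp before the copy
             ctx->CopyResource(staging, source);
             ctx->End(tsEnd);                     // timestamp *before* the Map
             ctx->End(disjoint);

             D3D11_MAPPED_SUBRESOURCE mapped = {};
             ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);
             ctx->Unmap(staging, 0);

             // Spin for results; fine for a profiling experiment. Map has
             // already flushed and drained the GPU, so this won't stall long.
             UINT64 t0, t1;
             D3D11_QUERY_DATA_TIMESTAMP_DISJOINT djData;
             while (ctx->GetData(tsBegin,  &t0,     sizeof(t0),     0) != S_OK) {}
             while (ctx->GetData(tsEnd,    &t1,     sizeof(t1),     0) != S_OK) {}
             while (ctx->GetData(disjoint, &djData, sizeof(djData), 0) != S_OK) {}

             if (!djData.Disjoint)
             {
                 double ms = 1000.0 * double(t1 - t0) / double(djData.Frequency);
                 // 'ms' is the GPU cost of the copy alone.
                 (void)ms;
             }

             tsBegin->Release(); tsEnd->Release(); disjoint->Release();
         }
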
  4. Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time. If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead. What's happening is something like this:

     1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
     2) The GPU realises it could run a bit slower and still be done in plenty of time, so it underclocks itself just a little bit to save power.
     3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
     4) Still plenty of time spare, so it underclocks itself even further.
     5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
     6) *Still* plenty of time spare, so it dramatically underclocks itself.
     7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
     8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any more than 1.2ms.

     In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem. This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz. Note that this API does not work on end users' machines, as it requires the Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end user's machine, because we've artificially limited the clock speed below the peak to ensure a stable and consistent clock speed. With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't clock up and down as in your situation.

     Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.
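
     The dummy application really is that small; a minimal sketch, assuming the Debug bits are installed on the machine:

         #include <d3d12.h>
         #include <windows.h>
         #pragma comment(lib, "d3d12.lib")

         // Background process that pins the GPU at its stable clock for as long
         // as it lives. Requires the Debug bits; not usable in retail titles.
         int main()
         {
             ID3D12Device* device = nullptr;
             if (SUCCEEDED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                             IID_PPV_ARGS(&device))))
             {
                 device->SetStablePowerState(TRUE);
                 Sleep(INFINITE);   // keep the device, and the fixed clock, alive
             }
             return 0;
         }
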
  5. This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency. By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.
  6. The example does use textures of the same resolution, but indeed there is no reason they need to have the same Width, Height, Format or Mip Count. So long as they are an array of 2D Textures, that's fine. Depending on how many textures you want bound at once, be aware that you may be excluding Resource Binding Tier 1 hardware: https://msdn.microsoft.com/en-gb/library/windows/desktop/dn899127(v=vs.85).aspx

     Note that in order to get truly non-uniform resource indexing you need to tell HLSL + the compiler that the index is non-uniform using the "NonUniformResourceIndex" intrinsic. Failing to do this will likely result in the index from the first thread of the wave deciding which texture to sample from: https://msdn.microsoft.com/en-us/library/windows/desktop/dn899207(v=vs.85).aspx
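
     To illustrate the intrinsic: a made-up Shader Model 5.1 pixel shader, embedded here as a C++ string of the kind you'd hand to D3DCompile (all names are illustrative):

         // Without NonUniformResourceIndex, the compiler is free to assume the
         // index is uniform across the wave and may sample the wrong texture
         // for every thread but the first.
         const char* g_psSource = R"(
             Texture2D    g_Textures[16] : register(t0);
             SamplerState g_Sampler      : register(s0);

             float4 main(float2 uv : TEXCOORD0,
                         nointerpolation uint materialId : MATERIAL_ID) : SV_Target
             {
                 return g_Textures[NonUniformResourceIndex(materialId)]
                            .Sample(g_Sampler, uv);
             }
         )";
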
  7. The only small quirk left is that you map it for READ_WRITE rather than just READ, but that shouldn't be a problem. You can remove CPU_ACCESS_WRITE from the temp texture creation as well if you never intend to write to it. Do you know that you have actually rendered something to the source texture?
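
     For anyone following along, the staging texture setup being discussed boils down to something like this (a sketch; the desc mirrors the source texture):

         #include <d3d11.h>

         // Create a staging copy of 'source' that the CPU can read back.
         // No CPU_ACCESS_WRITE and no READ_WRITE map needed if we only read.
         ID3D11Texture2D* CreateStagingCopy(ID3D11Device* dev,
                                            ID3D11DeviceContext* ctx,
                                            ID3D11Texture2D* source)
         {
             D3D11_TEXTURE2D_DESC desc = {};
             source->GetDesc(&desc);
             desc.Usage          = D3D11_USAGE_STAGING;
             desc.BindFlags      = 0;
             desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;  // READ is enough
             desc.MiscFlags      = 0;

             ID3D11Texture2D* staging = nullptr;
             if (SUCCEEDED(dev->CreateTexture2D(&desc, nullptr, &staging)))
                 ctx->CopyResource(staging, source);
             return staging;
             // Later: ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);
         }
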
  8. There are a few obvious errors in the code you've written which are worth fixing:
     • The pitch of the source (read pointer) should be incremented by mapped.RowPitch, not Desc.Width. Desc.Width is the number of pixels the texture is wide, so it's not only measured in the wrong units (pixels, instead of bytes), but the pitch is likely something other than "Desc.Width * 4".
     • The pitch of the destination (dest) should be incremented by Desc.Width * 4 (bytes), since 'dest' is an unsigned char*.
     • Your for loop attempts to copy the data row by row, but should be iterating "for(int i = 0; i < Desc.Height; ..." rather than Desc.Width.
     • The amount memcpy'ed out per row should be Desc.Width * 4, not Desc.Height * 4.
     I'm not sure what the relevance of '1200' is, but you're printing out just the first 1200 colour channels (RGBA 300 times). So if the first 300 pixels are transparent black, then you'll get 0 printed out all the time. Try iterating over every pixel just to be sure. A corrected version of the loop is sketched below.
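
     Putting those fixes together, the corrected loop looks something like this (a sketch, assuming a 4-bytes-per-pixel format such as DXGI_FORMAT_R8G8B8A8_UNORM):

         #include <d3d11.h>
         #include <cstring>

         // Copy a mapped texture row by row into a tightly packed buffer.
         // mapped.RowPitch is in bytes and is often larger than Width * 4
         // because of alignment padding at the end of each row.
         void CopyPixels(const D3D11_MAPPED_SUBRESOURCE& mapped,
                         const D3D11_TEXTURE2D_DESC& Desc, unsigned char* dest)
         {
             const unsigned char* source =
                 static_cast<const unsigned char*>(mapped.pData);

             for (UINT i = 0; i < Desc.Height; i++)    // one iteration per *row*
             {
                 memcpy(dest, source, Desc.Width * 4); // one row = Width * 4 bytes
                 source += mapped.RowPitch;            // advance by pitch, in bytes
                 dest   += Desc.Width * 4;             // dest rows are tightly packed
             }
         }
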
  9. Are you using object-space normal maps rather than tangent-space normal maps? The ability to say "this has no normal map, so I'll replace it with a 1x1 texture" works for tangent-space normal maps, but not object-space ones. I would generally try to avoid branching on the presence / absence of textures and instead ensure I'd bound a cut-down shader without normal map support for objects that don't want to provide the texture.
  10. You should probably be making use of D3D11_APPEND_ALIGNED_ELEMENT instead of calculating each attribute offset manually. If you ever want to go back and compress one of the attributes (you really shouldn't be using 32 bit signed indices!) you'll have to recalculate the offsets for every attribute that appears after the one you're compressing.
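
     For example, a skinned vertex layout built with D3D11_APPEND_ALIGNED_ELEMENT might look like this (a sketch; the semantics and formats are illustrative, with bone indices compressed to 16 bits):

         #include <d3d11.h>

         // Offsets are derived from the previous element, so compressing one
         // attribute later doesn't force you to re-do every offset after it.
         const D3D11_INPUT_ELEMENT_DESC layout[] =
         {
             { "POSITION",    0, DXGI_FORMAT_R32G32B32_FLOAT,    0, 0,
               D3D11_INPUT_PER_VERTEX_DATA, 0 },
             { "NORMAL",      0, DXGI_FORMAT_R32G32B32_FLOAT,    0, D3D11_APPEND_ALIGNED_ELEMENT,
               D3D11_INPUT_PER_VERTEX_DATA, 0 },
             { "BONEINDICES", 0, DXGI_FORMAT_R16G16B16A16_UINT,  0, D3D11_APPEND_ALIGNED_ELEMENT,
               D3D11_INPUT_PER_VERTEX_DATA, 0 },
             { "BONEWEIGHTS", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, D3D11_APPEND_ALIGNED_ELEMENT,
               D3D11_INPUT_PER_VERTEX_DATA, 0 },
         };
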
  11. Why do the indices and weights overlap? Indices are 16 bytes starting at offset 24 bytes, so the next attribute should start at offset 40, but Weights starts at offset 36, just 12 bytes later. You have an overlap between BoneIndices.w and Weights.x.
  12. Your Additive blend state looks fine (SrcBlend = SrcAlpha, DestBlend = 1.0). Alpha-blended objects can only blend onto something if that "something" has already been rendered, so yes, the objects onto which you want to blend must already be rendered. This becomes difficult when you want to start blending alpha-blended objects onto other alpha-blended objects. As best you can, you will want to render alpha-blended objects from back to front so that each one draws on top of the ones behind it. Order-independent transparency is still an active area of research in computer graphics and is difficult to solve.
  13. If you're going to use alpha blending, then you need to draw transparent objects after all the opaque objects. Alpha-testing is different to alpha-blending. Alpha testing involves "killing" a pixel in the shader (in D3D11) by using the 'clip' or 'discard' intrinsics. However, it sounds like you probably aren't using that. Your blend state doesn't look correct for normal alpha blending either. You want: result = srcAlpha * Src + (1-srcAlpha) * Dest. You need to change your SrcBlend to D3D11_BLEND_SRC_ALPHA after you've made the change to draw transparent objects last.
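
     For standard alpha blending, the blend state would be set up like this (a sketch):

         #include <d3d11.h>

         // result = srcAlpha * Src + (1 - srcAlpha) * Dest
         D3D11_BLEND_DESC desc = {};
         desc.RenderTarget[0].BlendEnable           = TRUE;
         desc.RenderTarget[0].SrcBlend              = D3D11_BLEND_SRC_ALPHA;
         desc.RenderTarget[0].DestBlend             = D3D11_BLEND_INV_SRC_ALPHA;
         desc.RenderTarget[0].BlendOp               = D3D11_BLEND_OP_ADD;
         desc.RenderTarget[0].SrcBlendAlpha         = D3D11_BLEND_ONE;
         desc.RenderTarget[0].DestBlendAlpha        = D3D11_BLEND_INV_SRC_ALPHA;
         desc.RenderTarget[0].BlendOpAlpha          = D3D11_BLEND_OP_ADD;
         desc.RenderTarget[0].RenderTargetWriteMask = D3D11_COLOR_WRITE_ENABLE_ALL;

         // ID3D11Device::CreateBlendState(&desc, &blendState);
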
  14. What are you drawing first, the pillar or the flowers? Are you using alpha-testing to avoid writing depth for transparent pixels? It sounds to me like you're clearing, alpha-blending the leaves and writing depth for their transparent/translucent pixels, and then when you come to draw the pillar behind them it's failing the depth test against the transparent pixels.
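
     If depth writes on transparent pixels turn out to be the culprit, the usual arrangement for a transparent pass is to keep the depth test on but disable depth writes, something like this sketch:

         #include <d3d11.h>

         // Transparent pass: the depth *test* stays on so opaque geometry still
         // occludes the flowers, but depth *writes* are off so geometry drawn
         // afterwards isn't rejected by depth left behind by transparent pixels.
         D3D11_DEPTH_STENCIL_DESC desc = {};
         desc.DepthEnable    = TRUE;
         desc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;  // no depth writes
         desc.DepthFunc      = D3D11_COMPARISON_LESS_EQUAL;

         // ID3D11Device::CreateDepthStencilState(&desc, &depthState);
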