• Content count

  • Joined

  • Last visited

Community Reputation

622 Excellent

About maxest

  • Rank
    Advanced Member
  1. I had no idea where to start this thread so here it is. I have barely 1 message in my inbox and yet when I want to compose a new message I get "Your inbox is full. You must delete some messages before you can send any more". A bug?
  2. I thought it should have been 0.1 ms as after refactoring the whole "system" I'm working on so that I need to only download 1 MB instead of 8 MB the total processing time went down by around 1.5 ms. Thank you again so much ajmiles.
  3. I actually did try placing End query right after CopyResource and before Map and that reported (as far as I remember, can't check now) something around 0.1 ms. Now I'm not really sure how should I measure the time it takes to download data from GPU to CPU. My CPU timer, when used to enclose CopyResource and Map, reported that downloading 11.5 GB took 1 second, what agrees with some CUDA-based test application for measuring PCI-E throughput that I used. When lowered down to 8 MB the download took 1.5 ms and when lowered to 1 MB the download took 1 ms. I'm not sure if PCI-E downloads should scale linearly as a function of data size but my tests show that they don't. At least that's what my CPU timer says. But the 0.1 ms reported by GPU timer when measuring CopyResource would indicate linear scale. Now I'm not sure if I should trust the CPU time reporting 1 ms (CopyResource + Map) or the GPU timer reporting 0.1 ms (just CopyResource).
  4. @ajmiles: Thank you so so much for this detailed explanation. I hadn't thought about GPU clock changing its speed. This makes more sense that performing some redundant work :). I have checked what you proposed. Got some simple DX12 sample, called SetStablePowerState and set it to true (needed to turn on Developer Mode on on my Windows 10; wasn't aware of its existence) and called permanent Sleep. Then I ran my application. Now regardless of whether I use VSync or not, call Sleep in my app or not, I get consistent 0.46 ms. It's more than without-VSync-and-SetStablePowerState 0.4 ms but at least it's stable. So as I understand the GPU is working at lower clock speed than it could (without Boost) but this speed is fixed. I have one more case whose results I don't entirely understand. I have code of this form: -- Begin CPU Profiler (with QueryPerformanceCounter etc.) -- Begin GPU Profile CopyResource (download from GPU to CPU) Map -- End GPU Profiler do something with mapped data Unmap -- End CPU Profiler The GPU profiler reports 5 ms whereas CPU reports 2-3 ms. If anything, should the CPU timer not report time bigger than GPU? I download around 1 MB of data. When I measure with CPU timer only CopyResource and Map I get around 1 ms. I would just like to ask one more, relevant thing. In my quest for search of reliable counters I stumbled upon this ( but could find no simple example of usage. Is it working at all?
  5. I tested both. No difference. I thought about something along those lines but quickly came to a conclusion that it should not take place. I thought that everything should go and take as much time as in no-VSync case because it's the Present where the waiting happens; why would any redundant work happen in my actual computation time? I just checked how much time Present takes with VSync and indeed it's something around 15 ms, with some variance of course. So still it's a mystery to me why the computation code I profile would take more time in VSync mode. Wonder if that would also be the case under D3D12. EDIT: Encompassing the whole Render function with one disjoint ( ) actually works when VSync is off. I made wrong observation. It behvaes exactly the same as Begin/End of disjoint right before and after block we're profiling.
  6. I implemented DX queries after this blog post: Queries work perfectly fine... for as long as I don't use VSync or any other form of Sleep. Why would that happe? I record queries right before my Compute/Dispatch code, record right after and then read the results (spinning on GetData if returns S_FALSE). When I don't VSync then my code takes consistent 0.39-0.4 ms. After turning VSync on it starts with something like 0.46 ms, after a second bumps up to 0.61 ms and a few seconds after I get something like 1.2 ms. I also used this source: The difference here is that the author uses the disjoint query for the whole Render() function instead of using one per particular measurement. When I implemented it this way the timings were incosistent (like above 0.46, 0.61, 1.2) regardless of VSync.
  7. Yeah, I'm perfectly aware of that workaround and I do it this way. But because I can't pass a shared memory array to function I can't make the function more general. Instead I need to copy it to a few files I use it in.
  8. I have code like this: groupshared uint tempData[ElementsCount]; [numthreads(ElementsCount/2, 1, 1)] void CSMain(uint3 gID: SV_GroupID, uint3 gtID: SV_GroupThreadID) {     tempData[gtID.x] = 0; } And it works fine. Now I change it to this: void MyFunc(inout uint3 gtID: SV_GroupThreadID, inout uint inputData[ElementsCount]) {     inputData[gtID.x] = 0; } groupshared uint tempData[ElementsCount]; [numthreads(ElementsCount/2, 1, 1)] void CSMain(uint3 gID: SV_GroupID, uint3 gtID: SV_GroupThreadID) {     MyFunc(gtID, tempData); } and I get "error X3695: race condition writing to shared memory detected, consider making this write conditional.". Any way to go around this?