maxest

Members
  • Content count: 598
Community Reputation

625 Good

About maxest

  • Rank
    Advanced Member

Personal Information

  • Interests
    Programming
  1. Bloom

    I think Styves is right. In Call of Duty they take 4% of the HDR scene's colors and use that as the bloom input, so no thresholding is needed. But applying a threshold isn't a problem either.
  2. Bloom

    I would recommend looking here: http://www.iryoku.com/next-generation-post-processing-in-call-of-duty-advanced-warfare The bloom proposed there is very simple to implement and works great; I implemented and verified it. Keep in mind though that there is a mistake in the slides, which I pointed out in the comments. Basically, to avoid getting bloom done badly you can't undersample, or you will end up with nasty aliasing/ringing. So you take the original image and downsample it once (from 1920x1080 to 960x540) to get the second layer, then again, and again, up to n layers. After you have generated, say, 6 layers, you combine them by upscaling the n'th layer to the size of the (n-1)'th layer and summing them. Then you do the same with the new (n-1)'th layer and the (n-2)'th layer, and so on up to full resolution. This is quite fast, as the downsample and upsample filters need only very small kernels, but since you go down to a very small layer you eventually get a very broad and stable bloom.
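A minimal CPU-side sketch of that mip-chain bloom in NumPy, assuming a plain 2x2 box filter for downsampling and nearest-neighbour upsampling (the slides use wider tent/Gaussian-like kernels; all function names here are mine):

```python
import numpy as np

def downsample(img):
    # 2x2 box average; assumes even dimensions at every level
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def upsample(img, shape):
    # nearest-neighbour upscale (a real implementation would use a small
    # tent/Gaussian kernel to avoid blockiness)
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]

def bloom(hdr, n_layers=6):
    # build the downsample chain: full res, half res, quarter res, ...
    chain = [hdr]
    for _ in range(n_layers):
        chain.append(downsample(chain[-1]))
    # collapse it back up: upscale the smallest layer onto the next one,
    # sum, and repeat until full resolution
    acc = chain[-1]
    for layer in reversed(chain[:-1]):
        acc = layer + upsample(acc, layer.shape)
    return acc
```

On a real GPU each downsample/upsample step is one fullscreen pass into a half- or double-resolution render target, which is why the small kernels stay cheap.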
  3. I wasn't aware that NVIDIA no longer recommends warp-synchronous programming. Good to know. I checked my GPU's warp size simply with a CUDA sample that prints device info. It is 32 for my GeForce GTX 1080, which does not surprise me, as NV's GPUs have long been characterized by that number (I think AMD's is 64). I have two more listings for you. I actually had to change my code to operate on 16x16 = 256-pixel blocks instead of 8x8 = 64, which forced me to call barriers. My first attempt:

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
     }

     And the second attempt:

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
     }

     I dropped a few barriers because from some point on I'm working with <= 32 threads. Both listings produce exactly the same result; if I skipped one more barrier, the race condition would appear. Performance differs between them: the second one is around 15% faster.
  4. Implementation using one array:

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 32];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 16];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 8];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 4];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 2];
         GroupMemoryBarrierWithGroupSync();
         if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 1];
     }

     It works, but you might be surprised that it runs slower than when I used this:

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 32)
         {
             errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
             errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
             errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
             errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
             errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
             errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         }
     }

     This one is a modification of my first snippet (from the first post) that ping-pongs between two arrays. And here again, it is 15-20% faster than the one-array version, so my guess is that it's the barriers that cost time. Please note that I run CalculateErrs 121 times in my shader, which runs for every pixel, so that is a lot. I would be perfectly fine with *not* relying on warp size to avoid barriers, since DirectCompute may not allow this "trick" anyway, as it's not NV-only. But what bites my neck is that when I run the bank-conflicted second snippet from this post, or the first snippet from the first post, it works like a charm. And I save performance by not having to use barriers.
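As a sanity check of the logic (not the performance), both reductions above can be modeled on the CPU under the warp-synchronous assumption, i.e. every lane finishes its reads of a statement before any lane writes. A Python sketch (function names mine):

```python
import random

def reduce_one_array(vals):
    # in-place version: lanes 0..31 do e[t] += e[t + stride] at each level;
    # lockstep is modeled by gathering all reads before performing any write
    e = list(vals)
    for stride in (32, 16, 8, 4, 2, 1):
        sums = [e[t] + e[t + stride] for t in range(32)]  # all lanes read
        for t in range(32):                               # then all lanes write
            e[t] = sums[t]
    return e[0]

def reduce_ping_pong(vals):
    # ping-pong version: each level produces a fresh, half-sized array of
    # pair sums (modeling the errs1/errs2 alternation)
    src = list(vals)
    n = 32
    while n >= 1:
        src = [src[2*t] + src[2*t + 1] for t in range(n)]
        n //= 2
    return src[0]

vals = [random.random() for _ in range(64)]
assert abs(reduce_one_array(vals) - sum(vals)) < 1e-9
assert abs(reduce_one_array(vals) - reduce_ping_pong(vals)) < 1e-9
```

Both collapse 64 inputs to the same total; the GPU-side difference is only which cross-lane orderings must hold (the in-place version reads locations other lanes write within the same statement).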
  5. Oh, I completely forgot that I can't have divergent branches if I want to make use of that "assumption". But I've tried this code before as well:

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 32)
         {
             errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
             errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
             errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
             errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
             errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
             errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
         }
     }

     And it also causes race conditions, even though there are no divergent branches within the warp. "Do you really get a performance drop when adding barriers in your first snippet? (You didn't make this clear, but I'd be very disappointed.)" The *second* snippet, yes. When I add barriers to the second snippet, the code is slower than the one from the first snippet.
  6. In countless CUDA-related sources I've found that, when operating within a warp, one can skip syncthreads because all instructions execute in lockstep within a single warp. I followed that advice and applied it in DirectCompute (I use an NV GPU). I wrote this code that does nothing but a good old parallel sum reduction of 64 elements (64 is the size of my block):

     groupshared float errs1_shared[64];
     groupshared float errs2_shared[64];
     groupshared float errs4_shared[64];
     groupshared float errs8_shared[64];
     groupshared float errs16_shared[64];
     groupshared float errs32_shared[64];
     groupshared float errs64_shared[64];

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
         if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
         if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
         if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
         if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
         if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
     }

     This works flawlessly. I noticed that I have bank conflicts here, so I changed the code to this:

     void CalculateErrs(uint threadIdx)
     {
         if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
         if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
         if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
         if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
         if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
         if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
     }

     And to my surprise this one causes race conditions. Is it because I should not rely on that functionality (auto-sync within a warp) when working with DirectCompute instead of CUDA? Because that hurts my performance by a measurable margin: the first version, despite the bank conflicts, is still around 15-20% faster than the second version, which is conflict-free but requires GroupMemoryBarrierWithGroupSync between each assignment.
  7. PCI Express Throughput

    Just wanted to let you know that I ran a test with CUDA to measure memory transfer rate and it peaked at around ~12 GB/s. Also, measuring CopyResource time with D3D11 queries yields very similar throughput.
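For context, the theoretical peak of a PCIe 3.0 x16 link (which a GTX 1080 uses) can be computed from the link parameters, and ~12 GB/s measured is in the usual range once packet/protocol overhead is accounted for. A quick back-of-the-envelope check:

```python
# PCIe 3.0: 8 GT/s per lane, 16 lanes, 128b/130b encoding, 1 bit per
# transfer per lane
transfers_per_s = 8e9
lanes = 16
encoding_efficiency = 128 / 130
peak_gb_s = transfers_per_s * lanes * encoding_efficiency / 8 / 1e9  # ~15.75
```

Real transfers never reach this figure because of TLP headers, flow control, and driver overhead, so ~75-80% of peak is a typical sustained result.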
  8. Hey. Not really sure if this is the right forum, but hey - codecs display graphics, so... :). I've been wondering how motion vectors are calculated. When encoding a video offline, I can imagine one could spend enough time searching a large neighbourhood to find a motion vector that maps the current frame's block to the previous frame's block with minimal differences (and hence achieve better compression). But what about live broadcasting, where time is at a premium? How would a codec estimate a motion vector? Search a small neighbourhood?
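The idea above can be sketched as a brute-force block-matching search over a small window; real-time encoders typically instead seed the search from neighbouring/previous motion vectors and refine with a diamond or hexagon pattern over a handful of candidates. A minimal SAD-based sketch (all names and parameters are mine, for illustration):

```python
import numpy as np

def sad(a, b):
    # sum of absolute differences between two equally-sized blocks
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def estimate_mv(prev, cur, bx, by, bs=8, search=4):
    # exhaustive search in a +/-search window around the block's position
    block = cur[by:by + bs, bx:bx + bs]
    best, best_cost = (0, 0), sad(block, prev[by:by + bs, bx:bx + bs])
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + bs <= prev.shape[0] and x + bs <= prev.shape[1]:
                cost = sad(block, prev[y:y + bs, x:x + bs])
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
    return best

# synthetic usage: roll a random frame by (dy=3, dx=2); the block at (8, 8)
# in the new frame then matches prev at offset (dx, dy) = (-2, -3)
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (32, 32))
cur = np.roll(prev, shift=(3, 2), axis=(0, 1))
```

The exhaustive version costs O(window^2) SADs per block, which is why live encoders prune the candidate set so aggressively.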
  9. I found a better workaround. So simple I can't imagine how I could not have come up with it before: I just used a macro. Still, it would be nice if this bug were fixed. In the meantime I will be using macros for functions taking shared buffers as input.
  10. I would like to run some computation using compute shaders. A lot of computation. Since GPUs have a separate memory engine, I thought I could make use of it, just like with CUDA streams, and overlap computation with GPU -> CPU data download. So I would do something like this:

      Dispatch 1 (first half of data)
      CopyResource 1
      Dispatch 2 (second half of data)
      CopyResource 2

      Now the question is: will CopyResource 1 and Dispatch 2 overlap in time? I heard from someone that Discard causes a flush - it waits until all previous commands have completed before it executes - but I can't find that in MSDN. Can anyone confirm?
  11. I had no idea where to start this thread so here it is. I have barely 1 message in my inbox and yet when I want to compose a new message I get "Your inbox is full. You must delete some messages before you can send any more". A bug?
  12. I thought it should have been 0.1 ms because, after refactoring the whole "system" I'm working on so that I only need to download 1 MB instead of 8 MB, the total processing time went down by around 1.5 ms. Thank you again so much, ajmiles.
  13. I actually did try placing the End query right after CopyResource and before Map, and that reported (as far as I remember; I can't check now) something around 0.1 ms. Now I'm not really sure how I should measure the time it takes to download data from GPU to CPU. My CPU timer, when used to enclose CopyResource and Map, reported that downloading 11.5 GB took 1 second, which agrees with a CUDA-based test application for measuring PCI-E throughput that I used. When lowered to 8 MB the download took 1.5 ms, and when lowered to 1 MB it took 1 ms. I'm not sure whether PCI-E downloads should scale linearly with data size, but my tests show that they don't - at least that's what my CPU timer says. The 0.1 ms reported by the GPU timer when measuring CopyResource alone would, however, indicate linear scaling. So I'm not sure whether I should trust the CPU timer reporting 1 ms (CopyResource + Map) or the GPU timer reporting 0.1 ms (just CopyResource).
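One way to reconcile the two timers is a fixed-overhead-plus-bandwidth model: mapping a staging resource carries a roughly constant synchronization/driver cost on top of the actual PCI-E transfer, so small copies are dominated by overhead and don't scale linearly. A toy model (the 0.9 ms overhead and 12 GB/s bandwidth below are illustrative guesses, not measurements) reproduces the observed pattern:

```python
def transfer_ms(size_mb, overhead_ms=0.9, bandwidth_gb_s=12.0):
    # total time = constant sync/driver overhead + payload / link bandwidth
    return overhead_ms + size_mb / 1024.0 / bandwidth_gb_s * 1000.0

# 1 MB -> ~0.98 ms and 8 MB -> ~1.55 ms, close to the measured 1 ms and
# 1.5 ms; meanwhile the pure transfer part alone (~0.08 ms for 1 MB) is in
# the ballpark of the GPU timer's ~0.1 ms for CopyResource by itself
```

Under this reading both timers are plausible: the GPU timer sees only the copy, while the CPU timer also pays the synchronization cost of Map.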
  14. @ajmiles: Thank you so much for this detailed explanation. I hadn't thought about the GPU clock changing its speed; that makes more sense than the GPU performing some redundant work :). I checked what you proposed: I took some simple DX12 sample, called SetStablePowerState with true (this required turning on Developer Mode on my Windows 10; I wasn't aware of its existence) and put the sample into a permanent Sleep. Then I ran my application. Now, regardless of whether I use VSync or not, or call Sleep in my app or not, I get a consistent 0.46 ms. It's more than the 0.4 ms I got without VSync and SetStablePowerState, but at least it's stable. So as I understand it, the GPU is working at a lower clock speed than it could (without Boost), but that speed is fixed. I have one more case whose results I don't entirely understand. I have code of this form:

      -- Begin CPU Profiler (with QueryPerformanceCounter etc.)
      -- Begin GPU Profiler
      CopyResource (download from GPU to CPU)
      Map
      -- End GPU Profiler
      do something with mapped data
      Unmap
      -- End CPU Profiler

      The GPU profiler reports 5 ms whereas the CPU profiler reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU timer? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer, I get around 1 ms. I would just like to ask one more, relevant thing. In my quest for reliable counters I stumbled upon this (https://msdn.microsoft.com/en-us/library/windows/desktop/ff476364(v=vs.85).aspx) but could find no simple example of usage. Is it working at all?