maxest

Member
  • Content count

    607
  • Joined

  • Last visited

Community Reputation

638 Good

About maxest

  • Rank
    Advanced Member

Personal Information

  • Role
    Programmer
  • Interests
    Programming

  1. I'm glad you like it :). There is a link in the article to my framework, which contains the shadows demo. It's here: https://github.com/maxest/MaxestFramework Here are the files directly related to the shadows demo: https://github.com/maxest/MaxestFramework/tree/master/samples/shadows Now I think I should separate the shadows demo from the rest of the framework and the other samples, but in the meantime you can inspect the above. Should you have any problems running the demo, let me know. Cheers
  2. Hey guys, here (http://maxest.gct-game.net/content/chss.pdf) is an article I've just published about making fast contact-hardening soft shadows based on PCSS. It shows how to implement PCSS at the cost of regular, non-contact-hardening soft shadows. And it also shows how to make nice soft shadows in general :). Hope you will find it useful. Any comments are welcome. Cheers, Wojtek
  3. Enjoy http://wojtsterna.blogspot.com/2018/03/solving-for-quadratic-equation-with.html
  4. But that's the problem: with high mips the minZ/maxZ will be *very* crude. Sure, but the *real* complexity is spitting out draw calls for objects that are occluded yet have passed the occlusion test due to the very crude high-mip minZ/maxZ. Could you elaborate a little more on this?
  5. All the presentations on software occlusion culling that I have stumbled upon mention mipmapping the software-rasterized depth buffer to speed up queries. Given that a bounding box covers, say, 500 pixels on the screen, it's better (faster) to test a single pixel from a higher (low-res) mip level than to test many pixels in a lower (high-res) mip level. But these mipmaps are usually generated with a conservative min or max filter, which greatly reduces occlusion culling efficiency. So an idea popped into my head: why not, in some cases, instead of storing the min or max of the depth values, store the plane equation of the triangle that covers the area? That would help a lot in cases where we have long corridors at angles oblique to the camera. In those cases the min filter acts very poorly, whereas a stored plane equation is exact. A variant of this idea was used in a GPU Pro article about speeding up shadow map testing: depending on the situation, either the min/max of Z is stored in a helper shadow map, or the plane equation of the underlying geometry. I am wondering if anyone here has used or tried this approach and what the outcomes were.
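     To illustrate what I mean, here is a minimal C++ sketch of the kind of test I have in mind. The tile layout and names are hypothetical (not taken from any of the presentations mentioned), and it assumes a depth convention where larger z means farther from the camera:

     #include <algorithm>

     // One tile of the coarse (high-mip) depth buffer. If a single triangle
     // covers the whole tile we can store its plane; otherwise we fall back
     // to a conservative farthest depth.
     struct CoarseTile
     {
         bool  hasPlane;
         float planeA, planeB, planeC; // occluder depth = A*x + B*y + C (screen space)
         float maxZ;                   // conservative fallback: farthest depth in the tile
     };

     // Is a box whose screen-space footprint is [xMin, xMax] x [yMin, yMax]
     // and whose nearest depth is boxMinZ fully occluded by this tile?
     bool IsOccludedByTile(const CoarseTile& tile,
                           float xMin, float yMin, float xMax, float yMax,
                           float boxMinZ)
     {
         float occluderZ;
         if (tile.hasPlane)
         {
             // The plane is linear, so its nearest (smallest) depth over the
             // footprint is attained at one of the corners.
             float z00 = tile.planeA * xMin + tile.planeB * yMin + tile.planeC;
             float z10 = tile.planeA * xMax + tile.planeB * yMin + tile.planeC;
             float z01 = tile.planeA * xMin + tile.planeB * yMax + tile.planeC;
             float z11 = tile.planeA * xMax + tile.planeB * yMax + tile.planeC;
             occluderZ = std::min(std::min(z00, z10), std::min(z01, z11));
         }
         else
         {
             occluderZ = tile.maxZ; // crude, which is exactly the problem with oblique surfaces
         }
         return boxMinZ > occluderZ; // the box's nearest point lies behind the occluder everywhere
     }

     For an oblique corridor wall the plane gives the exact occluder depth under the footprint, while the conservative filtered value is dominated by the far end of the corridor and rejects almost nothing.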
  6. Huh, my joy was premature. I made a mistake in my code and the change we're talking about here was not applied at all. After I had ensured it was applied, artifacts showed up. I did make use of the trick described above, though; I used it to store the differences between DC coefficients more efficiently. I think it is not possible to apply a non-linear operator (like xor or modulo) *before* running DCT/quantization on the data and expect correct results on the decoding side. I might be wrong here, but that is my intuition so far.
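     A toy example (not from the codec itself) of why I think so: xor is not a smooth operator, so two almost identical pixel values can produce a huge "difference", which defeats the DCT's energy compaction:

     #include <cstdio>
     #include <cstdint>

     int main()
     {
         uint8_t a = 127, b = 128;   // the same pixel in two consecutive frames, off by 1

         int diff = (int)b - (int)a; // arithmetic difference: 1 (small and smooth)
         int x    = a ^ b;           // xor "difference": 255 (full-range spike)

         printf("diff = %d, xor = %d\n", diff, x);
         // A 1-level change in the input becomes a full-range spike after xor,
         // so blocks of xor-ed pixels are full of high frequencies that the
         // DCT/quantizer either spends many bits on or destroys.
         return 0;
     }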
  7. At first I did not understand how this was supposed to work. My thought was: hey, I still have the [-255, 255] range to cover, so how can I skip the extra bit? But your suggestion enlightened me: I *don't* have that range. I do have negative values to cover, but the total number of values I need to represent is 256, not 511. I'll give an example to elaborate a bit more. Let's say we have a pixel which in frame1 has a value of 20 (in the [0, 255] range) and in frame2 the same pixel has a value of 7. The difference to encode here is 7 - 20 = -13. Now, since that 7 can only be in the range [0, 255], our limits are 0 - 20 = -20 and 255 - 20 = 235, so for that case the only viable differences to encode are in the range [-20, 235]. As you see, we still have only 256 values, so 8 bits is enough. Back to our example: we got 7 - 20 = -13. We take -13 mod 256 = 243 and store that. On the decoding side we have the previous, frame1's value of 20. We decode by summing 20 + 243 = 263 and taking 263 mod 256 = 7. This solution not only improved my compression so that the whole spectrum of values is encoded but, to my surprise, the compression ratio also increased by a measurable few %. Not sure why that happened, but I won't complain :). Thank you rnlf_in_space immensely for making me realize what I have just described :).
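     In code the wrap-around encode/decode is just unsigned 8-bit arithmetic; a minimal sketch (the function names are mine, not the actual codec code):

     #include <cstdio>
     #include <cstdint>

     // Encode: store the difference modulo 256; uint8_t wraps around automatically.
     uint8_t EncodeDelta(uint8_t prev, uint8_t curr)
     {
         return (uint8_t)(curr - prev);  // e.g. 7 - 20 = -13, and -13 mod 256 = 243
     }

     // Decode: add the stored delta to the previous frame's value, again modulo 256.
     uint8_t DecodeDelta(uint8_t prev, uint8_t delta)
     {
         return (uint8_t)(prev + delta); // e.g. 20 + 243 = 263, and 263 mod 256 = 7
     }

     int main()
     {
         uint8_t prev = 20, curr = 7;
         uint8_t delta = EncodeDelta(prev, curr);
         printf("delta = %d, decoded = %d\n", delta, DecodeDelta(prev, delta));
         return 0;
     }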
  8. maxest

    Bloom

    Yeah, blooming the whole scene (as in Call of Duty) makes sense, but only if you apply the bloom to an HDR target, where bright areas have much higher values than dark ones.
  9. I'm implementing a video codec. One way of improving efficiency is to compress the delta between two frames. The problem with computing a difference, though, is that it increases the range of values: if my input image has RGB values in the [0, 255] range, the diff can be in the range [-255, 255]. My solution to this is to calculate (in floats) clamp(diff, -0.5, 0.5) + 0.5. This gives me the [0, 255] range but cuts off larger differences, which actually is not a problem; at least I don't see much difference. I was advised, without much further explanation, to use xor instead of the "raw" difference between the input frames' pixels. I seriously doubt that I should xor the input RGB values, because applying the conversion to luma-chroma, DCT and quantization to the xor-ed result does not yield good results (I see severe artifacts). Anyway, I've tried different approaches and here are my various findings. As a test case I took two similar pictures.
     1. Compression of a single picture, no delta, JPEG-like (conversion of RGB to luma-chroma, DCT, quantization and finally Huffman-based RLE), gives a compression ratio of x26.
     2. Computing the xor of the input frames and then compressing that (JPEG-like) gives a compression ratio of x39 and the afore-mentioned artifacts.
     3. Computing clamp(difference, -0.5, 0.5) + 0.5, followed by JPEG-like compression, results in a compression ratio of x76.
     4. Since xor itself seemed to make sense, but applied to the *DCTs* of the input frames rather than the RGBs, I tried that too. Storing the xor of the DCTs and running RLE on that gave me a compression ratio of x72.
     So as you see, I did achieve some nice compression using xor, but only with 4. Option 3 gives the best compression ratio and has some other advantages: since differences are more "natural" than xor, nothing stands in the way of blurring the difference from 3 and thus achieving an even better compression ratio at the cost of a decrease in quality (noticeable, but not by much). Do you have any thoughts on improving delta compression of images? I'm asking because in extreme cases delta compression produces blocks, and eventually whole frames, that have a *worse* compression ratio than when compressed without delta.
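     For clarity, variant 3 per pixel boils down to roughly this (a sketch with made-up function names, working on a single 8-bit channel; the exact rounding in my codec may differ):

     #include <algorithm>
     #include <cstdint>

     // Encode the per-pixel delta between two frames as an 8-bit value.
     // Differences outside [-0.5, 0.5] (in normalized units) are clamped,
     // so very large jumps lose precision, but the result fits into [0, 255].
     uint8_t EncodeClampedDelta(uint8_t prevByte, uint8_t currByte)
     {
         float prev = prevByte / 255.0f;
         float curr = currByte / 255.0f;
         float biased = std::clamp(curr - prev, -0.5f, 0.5f) + 0.5f; // now in [0, 1]
         return (uint8_t)(biased * 255.0f + 0.5f);
     }

     // Decode: undo the bias and add the delta back onto the previous frame's value.
     // Note: the round trip can be off by a level or so; the scheme is lossy by
     // design, and the later DCT/quantization stage loses far more anyway.
     uint8_t DecodeClampedDelta(uint8_t prevByte, uint8_t deltaByte)
     {
         float prev = prevByte / 255.0f;
         float diff = deltaByte / 255.0f - 0.5f;
         float curr = std::clamp(prev + diff, 0.0f, 1.0f);
         return (uint8_t)(curr * 255.0f + 0.5f);
     }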
  10. maxest

    Bloom

    I think Styves is right. In Call of Duty they take 4% of the HDR scene's colors and use that as the bloom input. There is no need for thresholding, but it's not a problem to apply it either.
  11. maxest

    Bloom

    I would recommend looking here: http://www.iryoku.com/next-generation-post-processing-in-call-of-duty-advanced-warfare The bloom proposed there is very simple to implement and works great; I implemented it and checked. Keep in mind, though, that there is a mistake in the slides, which I pointed out in the comments. Basically, to avoid getting the bloom done badly you can't undersample, or you will end up with nasty aliasing/ringing. So you take the original image and downsample it once (from 1920x1080 to 960x540) to get the second layer, then again, and again, up to n layers. After you have generated, say, 6 layers, you combine them by upscaling the n'th layer to the size of the (n-1)'th layer and summing them. Then you do the same with the new (n-1)'th layer and the (n-2)'th layer, and so on up to full resolution. This is quite fast, as the downsample and upsample filters need very small kernels, but since you go down to a very small layer you eventually get a very broad and stable bloom.
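    In pseudocode the whole chain looks roughly like this (a sketch only; Downsample2x and UpsampleAdd are placeholders for the actual filtered GPU passes, which the presentation describes in detail):

    #include <vector>

    struct Texture { int width, height; /* GPU handle etc. */ };

    // Hypothetical helpers standing in for the real passes: a filtered 2x
    // downsample, and a filtered upsample that adds the source onto the destination.
    Texture Downsample2x(const Texture& src)              { return { src.width / 2, src.height / 2 }; }
    void    UpsampleAdd(const Texture& src, Texture& dst) { /* filtered upsample of src, added onto dst */ }

    void Bloom(const Texture& scene, int layerCount /* e.g. 6 */)
    {
        // Build the chain: scene -> 1/2 res -> 1/4 res -> ... (layerCount layers).
        std::vector<Texture> layers;
        layers.push_back(Downsample2x(scene));
        for (int i = 1; i < layerCount; i++)
            layers.push_back(Downsample2x(layers[i - 1]));

        // Collapse the chain from the smallest layer back up: upsample layer n
        // onto layer n-1, then that result onto layer n-2, and so on.
        for (int i = layerCount - 1; i > 0; i--)
            UpsampleAdd(layers[i], layers[i - 1]);

        // layers[0] (half resolution) now holds the combined bloom, ready to be
        // upsampled once more and added to the full-resolution scene.
    }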
  12. I wasn't aware NVIDIA doesn't recommend warp-synchronous programming anymore. Good to know. I checked my GPU's warp size simply with a CUDA sample that prints debug info. That size is 32 for my GeForce GTX 1080, which does not surprise me, as NV's GPUs have long been characterized by this number (I think AMD's is 64). I have two more listings for you. Actually I had to change my code to operate on 16x16 = 256 pixel blocks instead of 8x8 = 64 pixels, which forced me to use barriers. My first attempt (a barrier after every reduction step):

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
      }

      And the second attempt (barriers only while more than one warp is involved):

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 128) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 64) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          if (threadIdx < 16) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          if (threadIdx < 8) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          if (threadIdx < 4) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          if (threadIdx < 2) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          if (threadIdx < 1) errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
      }

      I dropped a few barriers because from some point on I'm working with <= 32 threads. Both of these listings produce exactly the same outcome; if I skipped one more barrier, the race condition would appear. Performance differs between the two: the second one is around 15% faster.
  13. Implementation using one array:

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 32];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 16];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 8];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 4];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 2];
          GroupMemoryBarrierWithGroupSync();
          if (threadIdx < 32) errs1_shared[threadIdx] += errs1_shared[threadIdx + 1];
      }

      It works, but you might be surprised that it runs slower than when I used this:

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 32)
          {
              errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
              errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
              errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
              errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
              errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
              errs1_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          }
      }

      This one is a modification of my first snippet (from the first post) that ping-pongs between two arrays. And here again, this one is faster by 15-20% than the one-array version, so my guess is that it's the barriers that cost time. Please note that I run CalculateErrs 121 times in my shader, which runs for every pixel, so that is a lot. I would be perfectly fine with *not* relying on the warp size to avoid barriers, because maybe DirectCompute does not allow this "trick", as it's not NV-only. But what bites my neck is that when I run the bank-conflicted second snippet from this post, or the first snippet from the first post, it works like a charm. And I save performance by not having to use barriers.
  14. Oh, I completely forgot that I can't have divergent branches if I want to make use of that "assumption". But I had tried this code before as well:

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 32)
          {
              errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
              errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
              errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
              errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
              errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
              errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
          }
      }

      And it also causes race conditions, even though there are no divergent branches within the warp. "Do you really get a performance drop when adding barriers in your first snippet? (You didn't make this clear, but I'd be very disappointed.)" The *second* snippet, yes. When I add barriers to the second snippet, the code is slower than the one from the first snippet.
  15. In countless CUDA-related sources I've found that, when operating within a warp, one might skip syncthreads because all instructions execute in lockstep within a single warp. I followed that advice and applied it in DirectCompute (I use an NV GPU). I wrote this code, which does nothing else but a good old prefix-sum of 64 elements (64 is the size of my block):

      groupshared float errs1_shared[64];
      groupshared float errs2_shared[64];
      groupshared float errs4_shared[64];
      groupshared float errs8_shared[64];
      groupshared float errs16_shared[64];
      groupshared float errs32_shared[64];
      groupshared float errs64_shared[64];

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
          if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
          if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
          if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
          if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
          if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
      }

      This works flawlessly. I noticed that I have bank conflicts here, so I changed the code to this:

      void CalculateErrs(uint threadIdx)
      {
          if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
          if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
          if (threadIdx < 8) errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
          if (threadIdx < 4) errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
          if (threadIdx < 2) errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
          if (threadIdx < 1) errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
      }

      And to my surprise this one causes race conditions. Is it because I should not rely on that functionality (auto-sync within a warp) when working with DirectCompute instead of CUDA? Because that hurts my performance by a measurable margin. With bank conflicts (first version) I am still faster by around 15-20% than with the second version, which is conflict-free but requires me to add GroupMemoryBarrierWithGroupSync between each assignment.