You might reduce the issue by dividing by abs(w), since most of the problem arises from the flip of x and y when you divide by a negative value. It's still not correct, but it's far less noticeable and you get away with per-vertex cost.
I tried that after I posted my first post, but it didn't help. I also tried with -abs(w). The problem seems to be that the motion vector becomes extremely large when a vertex lies very close to (linear) Z=0. Direction isn't really the main problem; the length of the vector is. I clamp my motion vectors so my motion blur doesn't explode, but the motion vectors are several orders of magnitude too large even for unnoticeable movement.
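To illustrate the blowup with made-up numbers (the function names and the clamp threshold are hypothetical, not from my actual code):

```python
def ndc(clip):
    """Perspective divide using abs(w): keeps the direction from flipping
    when w < 0, but does nothing about the magnitude near w = 0."""
    x, y, z, w = clip
    aw = abs(w)
    return (x / aw, y / aw)

def motion_vector(curr_clip, prev_clip, max_len=0.05):
    """Screen-space motion vector with length clamping (hypothetical helper)."""
    cx, cy = ndc(curr_clip)
    px, py = ndc(prev_clip)
    dx, dy = cx - px, cy - py
    length = (dx * dx + dy * dy) ** 0.5
    if length > max_len:  # clamp so the motion blur doesn't explode
        scale = max_len / length
        dx, dy = dx * scale, dy * scale
    return (dx, dy)

# A vertex whose w slides across ~0: tiny actual movement, huge screen-space vector.
curr = (1.0, 0.5, 0.0, 0.001)   # w very close to zero
prev = (1.0, 0.5, 0.0, 0.011)
mv = motion_vector(curr, prev, max_len=1e9)  # effectively unclamped
print(mv)  # hundreds of NDC units for a near-invisible movement
```

The clamp keeps the blur from exploding, but as noted above the clamped vector is still wrong in both cases; the direction fix alone doesn't address the magnitude.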
Posted by theagentd
on 15 November 2013 - 09:03 AM
I finally had time to finish my implementation!
Old culling: 73 FPS
New culling: 89 FPS
Old culling: 13 FPS (36 with multithreading)
New culling: 20 FPS (45 with multithreading)
(The multithreading is not optimized for the new culling method yet.)
Old culling:
1. Update bounds of all objects.
2. For each frustum, test all objects against the frustum.
New culling:
1. Update bounds of all objects.
2a. Update the bounds of the octree to perfectly fit all objects based on their bounds.
2b. Clear all entries in the octree. (2a and 2b are done in the same traversal of the tree.)
3. For each frustum, query the octree for intersecting objects.
There are many parameters to tweak, but so far an octree with 4 or 5 levels seems optimal; such a tree can be updated in around 0.6 milliseconds. A 5-level tree was overall around 10-20% slower in both updating and culling than a 4-level tree, but with threading enabled 5 levels was actually slightly faster, which I believe is due to hyperthreading helping hide the cache misses when traversing the deeper tree. Performance was actually a bit worse for very simple scenes running at >150 FPS, but in those cases performance was not a problem in the first place. Octrees seem to, like deferred shading, sacrifice performance in simple scenes to improve scaling to more complex scenes, an excellent trade-off in my opinion.
I will profile my octree updating a bit to see whether the automatic bounds computation is really worth it. Once I optimize my threading code to run other tasks in parallel with the octree update, I expect threaded performance to improve quite a bit there too, and the octree update overhead to have a smaller impact on overall performance.
Posted by theagentd
on 21 November 2012 - 09:09 PM
Hello. I have a small tip for anyone using EVSMs. I don't know if it's completely obvious, but here it is anyway.
The papers I've found recommend rendering the shadow map to an MSAA 32-bit float RGBA render target that stores the warped moments (m1, m1*m1, m2, m2*m2). You also need a depth buffer, since the RGBA buffer can't be used as a depth buffer. This is incredibly expensive. For 4xMSAA that's 16+4 bytes per MSAA sample, or 80MB just for a 4-sample 1024^2 variance map! We also need a resolved variance map, so add another 16MB there. In total: 96MB just for a 1024^2 shadow map. Don't you dare try a 2048^2 resolution...
However, I found a way to reduce this memory footprint a lot: we don't have to calculate the moments until we resolve the MSAA texture! Instead of a 32-bit float RGBA texture plus a depth texture, we can get by with only a 32-bit float depth texture. By outputting view-space depth and remapping the depth range, we can write it straight to gl_FragDepth (in the case of OpenGL) in the shadow rendering shader. When resolving, we simply read the depth samples, calculate the moments and average them together! The result is that we only need 4 bytes per MSAA sample, period. That's 16MB for a 4xMSAA 1024^2 variance map. The resolved variance texture is identical to before, so that's another 16MB. In total: 32MB, which is a LOT better than 96MB.
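The resolve step and the memory arithmetic can be sketched like this (a Python model of one texel; the warp constants and the exact moment layout are illustrative, not taken from my shader):

```python
import math

def evsm_moments(depth_samples, c_pos=40.0, c_neg=20.0):
    """Resolve one texel: warp each raw depth sample into the EVSM moments
    (m1, m1*m1, m2, m2*m2) and average them, instead of storing all four
    floats per MSAA sample during shadow rendering."""
    m1 = m1_sq = m2 = m2_sq = 0.0
    n = len(depth_samples)
    for d in depth_samples:            # d is depth in [0, 1]
        e1 = math.exp(c_pos * d)       # positive exponential warp
        e2 = -math.exp(-c_neg * d)     # negative exponential warp
        m1 += e1; m1_sq += e1 * e1
        m2 += e2; m2_sq += e2 * e2
    return (m1 / n, m1_sq / n, m2 / n, m2_sq / n)

# The memory figures from the text, recomputed:
def msaa_bytes(width, height, samples, bytes_per_sample):
    return width * height * samples * bytes_per_sample

MB = 1024 * 1024
# Standard: RGBA32F (16 B) + 32-bit depth (4 B) per sample, plus the resolve target.
standard   = msaa_bytes(1024, 1024, 4, 16 + 4) + 1024 * 1024 * 16
# Depth-only variant: 4 B per sample, plus the same 16 B/texel resolve target.
depth_only = msaa_bytes(1024, 1024, 4, 4)      + 1024 * 1024 * 16
print(standard // MB, depth_only // MB)  # 96 32
```

Everything downstream of the resolve (filtering, lookup, Chebyshev bound) is unchanged, which is why the quality is identical.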
This not only reduces VRAM usage a lot, it also massively reduces the bandwidth needed for the shadow map rendering and resolve passes. On my GTX 295 (one GPU only, roughly equal to a GTX 260/275), performance is much better with my optimization: I'm getting 240 FPS with the standard technique (1024^2 + 4xMSAA) and 440 FPS with my new one, which is almost twice as fast. Quality is identical to the normal technique, since everything after resolving the MSAA texture is standard EVSM.
I hope someone finds this useful, and if something's unclear feel free to ask!
EDIT: Fun fact: it's not possible to create an 8xMSAA 32-bit float RGBA texture, but it is possible to create an 8xMSAA 32-bit float depth buffer, so my technique works with 8xMSAA while the standard technique does not (at least on my hardware). Even funnier: mine with 8xMSAA is 50% faster than the original with 4xMSAA and uses half as much memory.
I think the rendering is working fine, but you're not clearing the FBO, so the depth test fails after the first frame. Make sure that all FBOs are cleared properly, both the shadow map and the shadow post-processing buffer. I see, for example, that you call glViewport(), then glClear() and THEN glBindFramebuffer(). You didn't want to clear a shadow-map-sized part of the screen, right? =S
You don't have a depth buffer in your FBO, so the geometry just appears in the order it's drawn. You need a depth buffer in step 2 too, since you're drawing all the geometry again and you obviously need the closest depth there.
I get the feeling you're trying to do a screen-space blur of the shadow map to get smooth shadows, correct? This won't look very good and will be very inaccurate, especially at steep angles, since it blurs uniformly. You'll also possibly get light bleeding where you're not supposed to unless you take the depth of neighboring pixels into account. Anyway, it's far from optimal. The easiest way to do soft shadows is PCF, which basically samples lots of nearby depth values from the shadow map and checks how many of them pass. PCF causes self-shadowing artifacts though, so you need a high depth bias, which in turn pretty much removes legitimate self-shadowing. If you're feeling really adventurous, look up variance shadow maps.
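PCF can be sketched like this (a Python model over a plain 2D depth array; the kernel radius and `bias` value are illustrative):

```python
def pcf_shadow(shadow_map, x, y, receiver_depth, radius=1, bias=0.005):
    """Percentage-closer filtering: test the receiver's depth against each
    nearby shadow-map texel and return the fraction of taps that pass."""
    h, w = len(shadow_map), len(shadow_map[0])
    passed = total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sx = min(max(x + dx, 0), w - 1)   # clamp taps to the map edges
            sy = min(max(y + dy, 0), h - 1)
            total += 1
            if receiver_depth - bias <= shadow_map[sy][sx]:
                passed += 1                    # this tap is lit
    return passed / total                      # 0 = fully shadowed, 1 = fully lit

# A receiver sitting on a shadow edge gets a fractional (soft) value:
sm = [[0.3, 0.3, 0.9],
      [0.3, 0.3, 0.9],
      [0.3, 0.3, 0.9]]
print(pcf_shadow(sm, 1, 1, 0.5))  # 3 of 9 taps are lit, so about 0.33
```

Note the `bias` term: without it, a surface shadowing itself fails its own depth test (acne), and making it large enough to hide that is exactly what kills legitimate self-shadowing.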
This is completely possible and can look great. I implemented this a few years ago for a tile-based 2D game, and it looked great, but I never finished that game. Today the source code is way too hacked up and some resources are missing, so I sadly can't provide you with a screenshot. I just implemented hard shadows from this article using framebuffers in OpenGL: http://www.gamedev.n...t-shadows-r2032 It's easy to render the shadows of blocking tiles, as the technique in the article works fine for them, but it's also possible to make sprites cast shadows by rendering them in a black color, stretched away from the light. The real problem is giving sprites correct lighting. One solution is to sample a line of the "lighting map" at the bottom of each sprite and apply that over the whole sprite, but this doesn't work too well for tall objects.
Jeez, now I want to implement this myself!!! T___T
EDIT: Managed to dig something out and get it running after some oiling up...
It doesn't show any tile shadows, but it should give you an idea of how it looks. Well, maybe a little at least... Anyway, this demo runs at over 200 FPS on my laptop with over 300 lights. Granted, there's only a single shadow caster and many of the lights have been culled, but it also handles 1000 lights, barely managing 60 FPS...
You could keep a static triangle buffer and only generate a sorted index buffer each frame. This would reduce the CPU load, though it might hurt GPU performance a little since the triangle data will no longer be read in linear order. Also, the order of non-overlapping triangles doesn't matter, of course; you might be able to take advantage of that.
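The idea can be sketched like this (a Python model; the vertex data and the per-triangle depth key are made-up examples):

```python
# Static vertex buffer: uploaded to the GPU once, never rewritten.
vertices = [(0.0, 0.0, 5.0), (1.0, 0.0, 2.0), (0.0, 1.0, 8.0),
            (1.0, 1.0, 1.0), (2.0, 0.0, 9.0), (2.0, 1.0, 4.0)]
triangles = [(0, 1, 2), (1, 3, 5), (2, 4, 5)]  # fixed triangle list

def triangle_depth(tri):
    """Sort key: average view-space z of the triangle's vertices (assumed)."""
    return sum(vertices[i][2] for i in tri) / 3.0

def build_index_buffer(tris):
    """Regenerate only the index buffer each frame, back to front."""
    order = sorted(tris, key=triangle_depth, reverse=True)
    # Flattened index list, as you'd upload for an indexed draw call.
    return [i for tri in order for i in tri]
```

Only the flattened index list changes per frame; the vertex data stays put, which is where the CPU saving comes from.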