
Member Since 15 Nov 2010

#5159233 Temporal Subpixel Reconstruction Antialiasing

Posted by theagentd on 09 June 2014 - 05:25 AM



I came up with a way of extending Subpixel Reconstruction Antialiasing with a temporal component to improve subpixel accuracy and achieve antialiasing of shading as well, while also completely preventing ghosting.


Here's a comparison of no antialiasing, SRAA and TSRAA to help catch your interest. =P




Here's a link to the article I wrote about it.

#5155317 Using the ARB_multi_draw_indirect command

Posted by theagentd on 22 May 2014 - 08:18 PM

Are you sure that you're not simply prematurely optimizing? How exactly is your situation looking? Have you identified the bottleneck?


Slightly off-topic: I was inspired by your post and decided to try out glMultiDrawElementsIndirect(), since I had identified a part of my engine where I simply called glDrawElementsInstancedBaseVertex() in a loop. This was for shadow rendering, so no texture switches were required. Depending on how many types of tiles were visible, around 20 draw calls were issued in a row, which I replaced with a single glMultiDrawElementsIndirect() call. That left my code with 3 different modes, depending on OpenGL support.



OGL3: Although all the instance data for all draw calls is packed into the same VBO, the vertex attribute pointer needs to be updated before each draw call so that it reads the correct subset of instances from that buffer.

			glVertexAttribPointer(instancePositionLocation, 3, GL_FLOAT, false, 0, baseInstance * 12); // byte offset: 3 floats * 4 bytes per instance
			glDrawElementsInstancedBaseVertex(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex * 2, numInstances, baseVertex); // byte offset: 2 bytes per GL_UNSIGNED_SHORT index

ARB_base_instance: If ARB_base_instance is supported, I can simply pass in a base instance instead of modifying the instance data pointer, removing the last state change from the mesh rendering loop:

glDrawElementsInstancedBaseVertexBaseInstance(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex, baseInstance);

ARB_multi_draw_indirect: If ARB_multi_draw_indirect is supported, I can pack together the above data into an array (an IntBuffer in my case since I'm using Java, hence the weird code), and draw them all with a single draw call:

//In the mesh "rendering" loop

//After the loop:
ARBMultiDrawIndirect.glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, multiDrawBuffer, multiDrawCount, 0);
multiDrawCount = 0;
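For illustration, the CPU-side packing inside that loop could look something like this (a sketch only; MultiDrawPacker and its method names are my invention, and a plain int[] stands in for the IntBuffer that gets uploaded for glMultiDrawElementsIndirect()). Each DrawElementsIndirectCommand is five unsigned ints: count, instanceCount, firstIndex, baseVertex, baseInstance:

```java
// Hypothetical sketch; MultiDrawPacker is not from the original post.
// A plain int[] stands in for the IntBuffer passed to
// glMultiDrawElementsIndirect(). Each command is 5 unsigned ints.
public class MultiDrawPacker {
    private final int[] commands;
    private int drawCount = 0;

    public MultiDrawPacker(int maxDraws) {
        commands = new int[maxDraws * 5];
    }

    // Called once per mesh inside the "rendering" loop, replacing the
    // per-mesh glDrawElementsInstancedBaseVertexBaseInstance() call.
    public void addDraw(int numIndices, int numInstances,
                        int baseIndex, int baseVertex, int baseInstance) {
        int o = drawCount * 5;
        commands[o]     = numIndices;   // count
        commands[o + 1] = numInstances; // instanceCount
        commands[o + 2] = baseIndex;    // firstIndex (in indices, not bytes)
        commands[o + 3] = baseVertex;   // baseVertex
        commands[o + 4] = baseInstance; // baseInstance
        drawCount++;
    }

    public int getDrawCount() { return drawCount; }
    public int[] getCommands() { return commands; }
}
```

After the loop, the filled buffer is uploaded and drawn with the single glMultiDrawElementsIndirect() call shown above.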



OGL3: 56 FPS

ARB_base_instance: 56 FPS (seems like the overhead of glVertexAttribPointer() is extremely low)

ARB_multi_draw_indirect: 62 FPS


The scene used was a purposely CPU-intensive scene with 1944 shadow maps being rendered (extremely low resolution, and most simply had no shadow casters that passed frustum culling). The resolution was intentionally kept very low and GPU load was at around 69-71%. My Java code was NOT the bottleneck; my OpenGL commands take approximately 8.5 ms to execute, and then an additional ~8 ms is spent blocking on the buffer swap (= waiting for the driver to complete the queued commands). My conclusion is that glMultiDrawElementsIndirect() significantly reduced the load on the driver thread, even when batching together just 10-20 draw calls into each glMultiDrawElementsIndirect() command.

#5155294 Using the ARB_multi_draw_indirect command

Posted by theagentd on 22 May 2014 - 04:59 PM

You won't gain anything by simply replacing each glDrawElements() call with a glMultiDrawElementsIndirect(). The whole point of glMultiDrawElementsIndirect() is to allow you to upload everything you need for all your draw calls to the GPU (using uniform buffers, texture buffers, bindless textures, sparse textures, etc) and then replace ALL your glDrawElements() calls with a single glMultiDrawElementsIndirect() call. As far as I know, glMultiDrawElementsIndirect() is not faster than glDrawElements() when simply used as a replacement for the latter.


I strongly recommend taking a look at this presentation, http://www.slideshare.net/CassEveritt/beyond-porting, which explains both the problems and how to solve them really well.

#5146973 where to start Physical based shading ?

Posted by theagentd on 14 April 2014 - 02:11 PM

Holy shit! Enabling the Toksvig map in this test application makes it look extremely good during minification! Something that's always bothered me was how smooth surfaces far away looked, but with a Toksvig map it looks almost exactly as I'd expect! The shader there seems to be tuned for Blinn-Phong, which uses a specular exponent instead of roughness. How do I do this for Cook-Torrance and roughness?


EDIT: I NEED THIS. I've been staring at this for so long now.
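For anyone curious, here's roughly what the Toksvig math looks like, as I understand it (a sketch under my own assumptions; the class and the Blinn-Phong <-> Beckmann roughness mapping s = 2/a^2 - 2 are mine, not from the linked test application):

```java
// Hypothetical sketch of Toksvig gloss adjustment; class and method
// names are mine. normalLength is |Na|, the length of the averaged
// (mipmapped) normal, which shrinks where normals vary a lot.
public class Toksvig {
    // Toksvig factor for a Blinn-Phong specular exponent s:
    // ft = |Na| / (|Na| + s * (1 - |Na|)). The effective exponent is ft * s.
    public static float toksvigFactor(float normalLength, float s) {
        return normalLength / (normalLength + s * (1.0f - normalLength));
    }

    // One possible bridge to a roughness-based model (an assumption on my
    // part, using the common mapping s = 2/a^2 - 2 between a Blinn-Phong
    // exponent s and a Beckmann-style roughness a).
    public static float adjustedRoughness(float roughness, float normalLength) {
        float s = 2.0f / (roughness * roughness) - 2.0f; // roughness -> exponent
        float sAdj = s * toksvigFactor(normalLength, s);  // shorter normal -> lower exponent
        return (float) Math.sqrt(2.0f / (sAdj + 2.0f));   // exponent -> roughness
    }
}
```

With a full-length normal the roughness is unchanged; the shorter the averaged normal gets, the rougher the surface becomes, which is exactly the far-away smoothing the post is about.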

#5145540 where to start Physical based shading ?

Posted by theagentd on 08 April 2014 - 09:04 PM

Let me see if I have understood this correctly. To implement Cook-Torrance I need to:


1. Modify my G-buffer to store specular intensity (AKA ref_at_norm_incidence in the article) and a roughness value.

2. Normalize the function by multiplying the diffuse term by (1 - (specular intensity AKA ref_at_norm_incidence)).

3. Bathe in the glory of physically based shading.


My bet is that this is 100x easier to tweak to realistic results compared to my current lighting.
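To make steps 1-2 concrete, here's a minimal, hedged sketch of the specular term (my own stand-in code, using a Beckmann distribution and Schlick's Fresnel approximation; f0 plays the role of ref_at_norm_incidence from the article):

```java
// Illustrative Cook-Torrance sketch; not the article's exact code.
// All dot products are assumed precomputed and clamped as usual.
public class CookTorrance {
    public static float specular(float nDotL, float nDotV, float nDotH,
                                 float vDotH, float roughness, float f0) {
        if (nDotL <= 0 || nDotV <= 0) return 0;
        float a2 = roughness * roughness;
        float nDotH2 = nDotH * nDotH;
        // D: Beckmann normal distribution term
        float d = (float) Math.exp((nDotH2 - 1) / (a2 * nDotH2))
                / (float) (Math.PI * a2 * nDotH2 * nDotH2);
        // F: Schlick's Fresnel approximation, f0 = reflectance at normal incidence
        float f = f0 + (1 - f0) * (float) Math.pow(1 - vDotH, 5);
        // G: Cook-Torrance geometry (masking/shadowing) term
        float g = Math.min(1, Math.min(2 * nDotH * nDotV / vDotH,
                                       2 * nDotH * nDotL / vDotH));
        return d * f * g / (4 * nDotL * nDotV);
    }

    // Step 2 from the list: energy-conserving diffuse weight.
    public static float diffuseWeight(float f0) { return 1.0f - f0; }
}
```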

#5137031 Average luminance (2x downsample) filter kernel

Posted by theagentd on 06 March 2014 - 06:48 PM

In your current implementation, you've written:

out.vColor += Sample(Luminance, in.vTex0 + float2(vTexel.y, vTexel.y));


You're using vTexel.y twice; the first component should presumably be vTexel.x. ^^
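For reference, a correct 2x downsample averages four distinct taps; here's a trivial pure-Java stand-in (illustrative only, not the poster's HLSL):

```java
// Illustrative stand-in for a 2x luminance downsample: average the
// 2x2 block of texels at (x, y) .. (x+1, y+1). Class name is mine.
public class DownsampleKernel {
    public static float average(float[][] lum, int x, int y) {
        return (lum[y][x] + lum[y][x + 1]
              + lum[y + 1][x] + lum[y + 1][x + 1]) * 0.25f;
    }
}
```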

#5136974 Motion vector calculation problem

Posted by theagentd on 06 March 2014 - 03:48 PM


you might reduce the issue by dividing by abs(w), as most of the problem arises due to the flip of x and y when you divide by something negative. it's still not correct, but far less noticeable and you get away with per-vertex cost.

I tried that right after my first post, but it didn't help. I also tried with -abs(w). It seems like the problem is that the motion vector becomes extremely large when a vertex lies very close to (linear) Z=0. Direction isn't really the main problem; it's the length of the vector. I clamp my motion vectors so my motion blur doesn't explode, but the motion vectors are several orders of magnitude too large even for unnoticeable movement.
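The clamping I mentioned is nothing fancy; roughly this (a sketch, names are mine):

```java
// Sketch of motion vector clamping: limit the screen-space motion
// vector's length so near-plane-grazing vertices (w close to 0) can't
// produce exploding blur. maxLength is in the same units as the vector.
public class MotionVectors {
    public static float[] clampMotion(float mx, float my, float maxLength) {
        float len = (float) Math.sqrt(mx * mx + my * my);
        if (len > maxLength && len > 0) {
            float scale = maxLength / len;
            return new float[] { mx * scale, my * scale };
        }
        return new float[] { mx, my }; // short enough, leave untouched
    }
}
```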

#5109480 Efficient frustum culling of static terrain and dynamic objects

Posted by theagentd on 15 November 2013 - 09:03 AM

I finally had time to finish my implementation!



Heavy test

Old culling: 73 FPS

New culling: 89 FPS


Extreme test

Old culling: 13 FPS (36 with multithreading)

New culling: 20 FPS (45 with multithreading)

(The multithreading is not optimized for the new culling method yet.)




Old implementation:

1. Update bounds of all objects.

2. For each frustum, test all objects against frustum.


New implementation:

1. Update bounds of all objects.

2a. Update the bounds of the octree to perfectly fit all objects based on their bounds.

2b. Clear all entries in the octree. (2a and 2b are done at the same time by traversing the tree.)

3. For each frustum, query the octree for intersecting objects.


There are many parameters to tweak, but so far an octree with 4 or 5 levels seems optimal, which can be updated in around 0.6 milliseconds. Although a tree with 5 levels was overall around 10-20% slower in both updating and culling than a 4-level tree, in the threaded tests 5 levels was actually slightly faster, which I believe is due to hyperthreading helping with cache misses when traversing the deeper tree. Performance was actually a bit worse for very simple scenes running at >150 FPS, but in those cases performance was not a problem in the first place. Octrees seem to, like deferred shading, sacrifice performance in simple scenes to improve scaling to more complex scenes, an excellent trade-off in my opinion.


I will profile my octree updating a bit to see if the automatic bounds computing really is worth it, etc. When I optimize my threading code to run other tasks in parallel with the octree updating I expect threaded performance to improve quite a bit there too, and octree updating overhead to have a smaller impact on overall performance.
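Step 2a above (fitting the octree bounds to the objects) is just an AABB union pass; something like this (an illustrative sketch with my own names and layout):

```java
// Sketch of fitting the octree root bounds to all object bounds.
// Each object AABB is given as {minX, minY, minZ, maxX, maxY, maxZ}.
public class OctreeBounds {
    public static float[] fitBounds(float[][] objects) {
        float[] b = { Float.POSITIVE_INFINITY, Float.POSITIVE_INFINITY,
                      Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY,
                      Float.NEGATIVE_INFINITY, Float.NEGATIVE_INFINITY };
        for (float[] o : objects) {
            for (int axis = 0; axis < 3; axis++) {
                b[axis]     = Math.min(b[axis],     o[axis]);     // grow min
                b[axis + 3] = Math.max(b[axis + 3], o[axis + 3]); // grow max
            }
        }
        return b; // tight AABB enclosing every object
    }
}
```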

#5003103 EVSM performance tip!

Posted by theagentd on 21 November 2012 - 09:09 PM

Hello. I have a small tip for anyone using EVSMs. I don't know if it's something completely obvious or so, but here it is anyway.

The papers I've found recommend rendering the shadow map to an MSAA 32-bit float RGBA render target to store the two moments and their squares (m1, m1*m1, m2, m2*m2). You also need a depth buffer, since the RGBA buffer can't be used as a depth buffer. This is incredibly expensive. For 4xMSAA, we get 16+4 bytes per MSAA sample. That's 80MB just for a 4-sample 1024^2 variance map! We also need a resolved variance map, so add another 16MB there. In total: 96MB just for a 1024^2 shadow map. Don't you dare try a resolution of 2048^2...

However, I found that I can reduce this memory footprint a lot. We don't have to calculate the moments until we resolve the MSAA texture! Instead of a 32-bit float RGBA texture + a depth texture, we can get by with only a 32-bit float depth texture. By outputting view-space depth and modifying the depth range, we can just pass it down to gl_FragDepth (in the case of OpenGL) in the shadow rendering shader. When resolving, we simply read the depth samples, calculate the moments and average them together! The result is that we only need 4 bytes per MSAA sample, period. That's 16MB for a 4xMSAA 1024^2 variance map. The resolved variance texture is identical to before, so that's another 16MB. In total: 32MB, which is a LOT better than 96MB.
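The resolve pass could be sketched like this (illustrative CPU-side code only; in practice this runs in a fragment shader, and the exponent constant c and exact moment layout follow the usual EVSM formulation):

```java
// Sketch of the EVSM resolve: read raw MSAA depth samples and only now
// compute the four moments (m1, m1*m1, m2, m2*m2) with the exponential
// warp m1 = e^(c*z), m2 = -e^(-c*z), averaging over the samples.
public class EvsmResolve {
    public static float[] resolve(float[] depthSamples, float c) {
        float[] m = new float[4];
        for (float z : depthSamples) {
            float pos = (float) Math.exp(c * z);   // positive warp
            float neg = (float) -Math.exp(-c * z); // negative warp
            m[0] += pos; m[1] += pos * pos;
            m[2] += neg; m[3] += neg * neg;
        }
        for (int i = 0; i < 4; i++) m[i] /= depthSamples.length;
        return m; // what would be written to the resolved variance map
    }
}
```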

This not only reduces VRAM usage a lot, it also massively reduces the bandwidth needed for the shadow map rendering and resolve passes. On my GTX 295 (only one GPU, equal to a GTX 260/275), performance is a lot better with my optimization: I'm getting 240 FPS using the standard technique (1024^2 + 4xMSAA) and 440 FPS with my new one, which is almost twice as fast. Quality is identical to the normal technique, since everything after resolving the MSAA texture is standard EVSM.

I hope someone finds this useful, and if something's unclear feel free to ask!

EDIT: Fun fact: It's not possible to create an 8xMSAA 32-bit float RGBA texture, but it is possible to create an 8xMSAA 32-bit float depth buffer, so my technique works with 8xMSAA while the standard technique does not (at least on my hardware). Even funnier: mine with 8xMSAA is 50% faster than the original with 4xMSAA and uses half as much memory.

#4948098 Rendering to texture/FBO issue

Posted by theagentd on 11 June 2012 - 03:21 AM

I think the rendering is working fine, but you're not clearing the FBO, causing the depth test to fail after the first frame. Make sure that all FBOs are cleared properly, both the shadow map and the shadow post processing buffer. I see for example that you call glViewport(), then glClear() and THEN glBindFramebuffer(). I don't think you wanted to clear a shadow map sized part of the screen, right? =S

#4947891 Rendering to texture/FBO issue

Posted by theagentd on 10 June 2012 - 05:52 AM

You don't have a depth buffer in your FBO, so the geometry just appears in the order it's drawn. You need to have a depth buffer in step 2 too since you're trying to draw all the geometry again, and you obviously need the closest depth there.

I get the feeling you're trying to do a screen-space blur of the shadow map to get smooth shadows, correct? This won't look very good and will be very inaccurate, especially at steep angles, since it's uniformly blurred. You'll also possibly get light bleeding where you're not supposed to unless you take the depth of neighboring pixels into account. Anyway, it's far from optimal. The easiest way to do soft shadows is PCF, which is basically sampling lots of nearby depth values from the shadow map and checking how many of them pass the depth comparison. This has problems with self-shadowing though, so you need a high depth bias, which in turn pretty much removes self-shadowing. If you're feeling really adventurous, look up variance shadow maps.
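A bare-bones PCF loop, just to show the idea (a CPU-side sketch with my own names; a real implementation samples the shadow map in a shader):

```java
// Minimal PCF sketch: count how many neighbouring shadow-map depths
// pass the comparison against the receiver depth (minus a bias), and
// return the passing fraction as the shadow factor in [0, 1].
public class Pcf {
    public static float pcf(float[][] shadowMap, int x, int y,
                            int radius, float receiverDepth, float bias) {
        int passed = 0, total = 0;
        for (int dy = -radius; dy <= radius; dy++) {
            for (int dx = -radius; dx <= radius; dx++) {
                total++;
                // lit if the stored occluder depth is not closer than us
                if (receiverDepth - bias <= shadowMap[y + dy][x + dx]) passed++;
            }
        }
        return (float) passed / total; // 1 = fully lit, 0 = fully shadowed
    }
}
```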

#4905792 Unlimited Detail Technology

Posted by theagentd on 24 January 2012 - 08:48 AM

They are showing that:
a) It can easily be animated
b) It has zero problems with memory size issues
c) It is unlimited

So seems all blaming comments here was just jealous.

Yeah, let me know when they release a white paper.

Also, if anyone is looking for a challenging drinking game, watch that video and take a shot every time you hear the word 'unlimited'.



#4905077 2D lighting- Would this work with my game?

Posted by theagentd on 22 January 2012 - 05:16 AM

This is completely possible and can look great. I implemented this a few years ago for a tile-based 2D game, and it looked great, but I never finished that game. Today the source code is way too hacked up and some resources are missing, so I sadly can't provide you with a screenshot. I just implemented hard shadows from this article using framebuffers in OpenGL.
It's easy to render the shadows of blocking tiles, as the technique in the article works fine for them, but it's also possible to make sprites cast shadows by rendering them in a black color, stretched away from the light. The real problem is making sprites receive correct lighting. A solution is to sample a line of the "lighting map" at the bottom of each sprite and apply that over the whole sprite, but this does not work too well for tall objects.
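The bottom-line sampling trick could be sketched like this (illustrative only; names and the lightMap layout are my own):

```java
// Sketch of sprite lighting from the "lighting map": average the light
// values along the sprite's bottom edge and apply that to the whole
// sprite. lightMap is indexed [y][x]; the sprite occupies columns
// [x, x + width) on row y.
public class SpriteLighting {
    public static float bottomRowLight(float[][] lightMap, int x, int y, int width) {
        float sum = 0;
        for (int i = 0; i < width; i++) {
            sum += lightMap[y][x + i];
        }
        return sum / width; // multiply the sprite's color by this
    }
}
```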

Jeez, now I want to implement this myself!!! T___T

EDIT: Managed to dig something out and get it running after some oiling up...

[screenshot of the 2D lighting demo]

It doesn't show any tile shadows, but it should give you an idea of how it looks. Well, maybe a little at least... Anyway, this demo runs at over 200 FPS on my laptop with over 300 lights. Granted, there's only a single shadow caster, and many of the lights have been culled, but it also runs with 1000 lights, barely managing 60 FPS...

#4902425 Translucent rendering based on depth sort : Pro and cons ?

Posted by theagentd on 13 January 2012 - 01:39 PM

You could keep a static triangle buffer and only generate a sorted index buffer each frame. This would reduce CPU load, but possibly increase GPU cost a little, since the triangle data will not be read in a linear order. Also, the order of non-overlapping triangles does not matter, of course; you might be able to take advantage of that.
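A sketch of what generating that sorted index buffer might look like (my own names; triangle depths are assumed to be precomputed per triangle, e.g. from the centroid's view-space z):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the suggestion above: keep the vertex buffer static and
// rebuild only the index buffer each frame, with triangles sorted
// back-to-front by precomputed view-space depth.
public class TriangleSorter {
    public static int[] sortedIndices(int[] triangles, float[] triangleDepths) {
        int triCount = triangleDepths.length;
        Integer[] order = new Integer[triCount];
        for (int i = 0; i < triCount; i++) order[i] = i;
        // back-to-front: larger depth (farther away) drawn first
        Arrays.sort(order, Comparator.comparingDouble(i -> -triangleDepths[i]));
        int[] out = new int[triCount * 3];
        for (int i = 0; i < triCount; i++) {
            // copy the 3 indices of each triangle in sorted order
            System.arraycopy(triangles, order[i] * 3, out, i * 3, 3);
        }
        return out; // upload this as the frame's index buffer
    }
}
```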