Questions to Yann: cache for shadow maps

Quote:Original post by Yann L
The shared pbuffer will also heavily reduce the number of context switches required.


so GL_EXT_render_target and ARB_super_buffer (both yet to be released) would solve that problem and let you set everything as a Render_Target, right ?
-* So many things to do, so little time to spend. *-
Quote:Original post by Yann L


The cache for the shadow maps is around 32 MB in the current engine, but it's a user setting and can be modified at will.

Generally, the caching is handled using a priority scheme. Each light has an importance factor associated with it, a visual priority. The higher the priority, the more important the shadow is deemed by the system, and the more resolution it gets. Lower priority maps get gradually less resolution. Direct sunlight always has maximum priority, and is guaranteed to get at least a 2048 map. In night scenes, the moon takes the role of the sun, but with reduced map resolution. Maps assigned to the sun or moon are never cached, as they are view-dependent and updated every frame.

All other light sources are then assigned maps from the pool, using a modified LRU scheme (modified to take the priorities into account). If a light source was moved, or geometry changed within its visual range, its associated shadow map is sent to an update manager. The manager tries to balance shadow map regeneration over several frames, in order to avoid updating hundreds of maps in a single frame. Again, the visual importance factor helps a lot: lights with lower priority can be updated less often, and their update can be deferred to a later frame. This can make the shadow of a low priority light lag behind the object (especially if the object moves very fast), but it is generally unnoticeable if the system is well balanced.

The priority is assigned depending on several different visual metrics: distance from the viewer and occlusion-based temporal coherence are the two most important ones. You can add several others, but this depends on your engine and the type of realism you're looking for. Also, the visual metrics are the perfect spot to include a user-adjustable shadow quality setting.
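For illustration, the selection and update-balancing part of such a cache could look roughly like the following C++ sketch. This is not the engine's actual code - the names (ShadowMapSlot, pickEvictionCandidate, updateDirtyMaps) and the exact scoring are invented, just to show the idea of a priority-biased LRU combined with a per-frame update budget.

#include <algorithm>
#include <vector>

// One cached shadow map. All names and fields here are illustrative only.
struct ShadowMapSlot {
    int   lightId;
    int   resolution;      // current map resolution, e.g. 128..2048
    float priority;        // visual importance, higher = more important
    int   lastUsedFrame;   // for the LRU part
    bool  dirty;           // light moved / geometry changed -> needs re-render
};

// Eviction: least recently used, but biased by priority, so important maps
// survive longer than unimportant ones of the same age.
int pickEvictionCandidate(const std::vector<ShadowMapSlot>& slots, int currentFrame)
{
    int best = -1;
    float bestScore = -1.0f;
    for (size_t i = 0; i < slots.size(); ++i) {
        float age   = float(currentFrame - slots[i].lastUsedFrame);
        float score = age / (1.0f + slots[i].priority);  // old + unimportant = evict
        if (score > bestScore) { bestScore = score; best = int(i); }
    }
    return best;
}

// Spread regeneration over several frames: re-render only the most important
// dirty maps this frame, defer the rest to later frames.
void updateDirtyMaps(std::vector<ShadowMapSlot>& slots, int budgetPerFrame)
{
    std::vector<ShadowMapSlot*> dirty;
    for (auto& s : slots)
        if (s.dirty) dirty.push_back(&s);

    std::sort(dirty.begin(), dirty.end(),
              [](const ShadowMapSlot* a, const ShadowMapSlot* b)
              { return a->priority > b->priority; });

    for (int i = 0; i < budgetPerFrame && i < int(dirty.size()); ++i) {
        // renderShadowMap(*dirty[i]);   // the actual depth pass would go here
        dirty[i]->dirty = false;
    }
}

The key point is that age and priority both feed into the eviction score, and that dirty maps are regenerated highest-priority-first until the per-frame budget is used up.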


Is the priority used to create the shadow map at the correct resolution also used to determine the most influencing light for a geometry chunk?
Thanks Davide
Quote:Original post by Ysaneya
Ok, so because you are using PSM/TSM, you need to update the sun's shadow map every frame.. that makes sense. But i have no idea how you were able to do it so quickly on a GF4 at 60 fps with scenes like 100 to 300k triangles in view. If you download the PSM demo on NVidia's site (which also has a fairly complex scene), the framerate is more like in the 20-30 fps range, on a Geforce fx. And you're rendering to a 2048x2048. Is there a trick to get such a speed ?

The 2048 map is in the new engine; the resolution on the GF4 was probably lower (I can't say for sure, as the resolution selection by the cache manager is dynamic, and depends on the viewpoint. It is also influenced by the framerate - if it drops too much, the resolutions are reduced). The view in the cathedral screenshot is also the optimal case for our modified PSM: orthogonal light coming from above the viewpoint, and from the side, almost at a 90° angle to the view direction. That's why the shadows still look crisp in that shot - although they aren't really that sharp, the complex geometry they are projected on gives that impression. Most of the lighting complexity actually comes from the simple N*L equation.

About the speed, well, the sun is coming from above and outside of the cathedral. The outside walls all act as occluders, with the upper windows as the only openings. From the light's view, a lot of geometry is culled away by the HOM. Furthermore, the inside is rendered to the shadowmap using a lower LOD. There isn't that much geometry to render to the SM, it's more a fillrate issue. That again can be reduced by clever resolution selection, using RTTs on larger maps, and using a very simple fragment pipeline setup in order to reduce bandwidth requirements. Also keep in mind that the depth render pass won't access any textures (except if you want alpha masking), which is often a fillrate bottleneck.
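As an aside, a "very simple fragment pipeline setup" for such a depth-only pass basically means disabling everything that costs fill rate. A rough OpenGL sketch (assuming the GL_DEPTH path; with a packed-RGBA map you would keep the colour writes and bind a packing fragment program instead - the function name is made up):

#include <GL/gl.h>

// Depth-only shadow pass on the GL_DEPTH path: no colour writes, no textures,
// no blending - the fragment pipeline only writes depth.
void beginDepthOnlyPass()
{
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDisable(GL_BLEND);
    glDisable(GL_TEXTURE_2D);            // unless alpha masking is needed
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_TRUE);
    glEnable(GL_POLYGON_OFFSET_FILL);    // optional bias against shadow acne
    glPolygonOffset(1.1f, 4.0f);
    // ... render the occluders with the low shadow LOD, then restore state ...
}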


Quote:
Were all of them per-pixel, or some per-vertex ?

That's dynamic. Switching from per-pixel to per-vertex lighting is a part of the light LOD system.

Quote:
What were you preprocessing for your lights ?

For standard dynamic lights, nothing except their influence bounding box. For virtual lights (a simple radiosity look-alike simulation), ambient shadows were approximated as a preprocess. It was then decided if the light requires a realtime shadow, or if the light transfer could be prestored as directional incidence components per vertex (read: a small cubemap per vertex). That's similar to spherical harmonics, except that incident light is only stored over a few quantized directions. AFAIR, we used 8 directions per vertex, encoded in two vertex streams as RGBA - one direction per component. The decoding is done in a vertex program. This system gives you more or less dynamic ambient light, e.g. for simulating the diffuse indirect light from a moving sun.
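To make that encoding a little more concrete, here is a rough C++ sketch of packing the 8 quantized incidence intensities into two RGBA8 vertex streams. The struct and function names are invented for the example, and the real engine format may well differ; the vertex program would then weight each of the 8 directions by the current light and sum them up.

#include <algorithm>
#include <cstdint>

// Per-vertex incident light over 8 fixed directions, packed into two RGBA8
// vertex attributes (one intensity per component). Illustrative format only.
struct PackedIncidence {
    uint8_t rgba0[4];   // intensities for directions 0..3
    uint8_t rgba1[4];   // intensities for directions 4..7
};

static uint8_t quantize(float x)   // map [0,1] to [0,255]
{
    float c = std::min(1.0f, std::max(0.0f, x));
    return uint8_t(c * 255.0f + 0.5f);
}

PackedIncidence packIncidence(const float incidence[8])
{
    PackedIncidence p;
    for (int i = 0; i < 4; ++i) {
        p.rgba0[i] = quantize(incidence[i]);
        p.rgba1[i] = quantize(incidence[i + 4]);
    }
    return p;
}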

Back then, this system was used per vertex. Later on, we added directional indirect light incidence maps (DILIMs), which take the per-vertex concept to the per-pixel level and allow much better precision. The system gives similar visual results to SH lighting, but doesn't have the low-frequency cutoff problem of spherical harmonics: it can store very sharp but still perfectly dynamic ambient light. Imagine each DILIM texel as a small cubemap storing the incident light over a few quantized directions.

Also, the concept of virtual lights was dropped, in favour of a special PR radiosity / photon mapper combo that creates the DILIMs directly (basically by storing directional incidence information per lightmap texel, instead of a simple RGB illumination value). The result of either the per-vertex directional information (as used in the cathedral) or the DILIMs is then combined with the light from the fully dynamic sources. This gives a very pleasing combo of fully dynamic light sources and a GI ambience.

Quote:
I see, the simplest explanation was the right one. Did you leave the shadows sharp on ATI cards ?

Yep, in the non-ARB_FP path, the shadows are non-filtered on ATI cards.

Quote:
That sounds interesting, but it's not clear to me how it worked. If your shadow maps contain a depth information, how can you downsample this ? The most immediate idea is to render-to-texture (say 512x512) in ortho mode a single quad using the original 1024x1024 texture. But how do you perform the depth-copy operation ?

What do you mean by depth-copy operation ? You simply render the texture values into the depth component (or another channel, depends on your shadow map format) using a fragment shader. The result is the same as you would get if you rendered normal geometry. We didn't use that system on the GF4, because it would be rather tricky to make it work on a texture shader setup (although not impossible). It really depends on the texture data format you use for your shadow maps. What format do you use ? A real depth format, a packed RGBA, a floating point texture ?
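To spell the simplest case of that downsampling out (packed RGBA in, packed RGBA out, no unpacking needed): draw a fullscreen quad into the smaller render target with nearest filtering, so packed depth values are copied rather than blended. This is only a sketch - it assumes the smaller pbuffer/texture target is already bound, and the function name is made up.

#include <GL/gl.h>

// Downsample a packed-RGBA shadow map by drawing a fullscreen quad into a
// smaller render target (already bound). Nearest filtering is essential:
// the packed depth values must be copied, not blended.
void downsampleShadowMap(GLuint srcTex, int dstW, int dstH)
{
    glViewport(0, 0, dstW, dstH);

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0.0, 1.0, 0.0, 1.0, -1.0, 1.0);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, srcTex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    glBegin(GL_QUADS);
      glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
      glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
      glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
      glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
    glEnd();

    // the result is then bound as the new, smaller shadow map
}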

Quote:
That's a neat idea, and i think i've done it the wrong way in my system so far. When my shadow maps are out of the view i simply free them from the cache to make room for other lights. It obviously becomes a nightmare when you are turning your head quickly. I need a bit of time to think about it and improve the priorities thing.

Only throw out maps based on frustum culling information if absolutely required (ie. if the alternative would be a cache overflow). In typical FPS games, the player will constantly turn their head around, so "it's behind the camera" is a bad metric. Occlusion is a much better and more reliable parameter.

Quote:
so GL_EXT_render_target and ARB_super_buffer (both yet to be released) would solve that problem and let you set everything as a Render_Target, right ?

Yep.

Quote:
Is the priority used to create the shadow map at the correct resolution also used to determine the most influencing light for a geometry chunk?

Not directly. The most influencing light per geometry chunk is an object space operation, and is based on distance, light intensity and attenuation. The priority metric, however, is a purely eye space and view-dependent operation. While it also uses distance and intensity as factors, it can take additional view-dependent factors such as screen-space coverage and occlusion into account.
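A stripped-down version of that object space selection might look like this (C++ sketch only; the Light struct, the linear falloff and the function names are placeholders for whatever attenuation model the engine actually uses):

#include <cmath>
#include <vector>

struct Light { float pos[3]; float intensity; float radius; };

// Object space influence of a light on a geometry chunk: intensity scaled by
// a distance attenuation towards the light's influence radius.
float lightInfluence(const Light& l, const float chunkCenter[3])
{
    float dx = l.pos[0] - chunkCenter[0];
    float dy = l.pos[1] - chunkCenter[1];
    float dz = l.pos[2] - chunkCenter[2];
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (dist >= l.radius) return 0.0f;      // outside the influence bounding volume
    return l.intensity * (1.0f - dist / l.radius);
}

const Light* mostInfluencingLight(const std::vector<Light>& lights,
                                  const float chunkCenter[3])
{
    const Light* best = nullptr;
    float bestInf = 0.0f;
    for (const Light& l : lights) {
        float inf = lightInfluence(l, chunkCenter);
        if (inf > bestInf) { bestInf = inf; best = &l; }
    }
    return best;
}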
Quote:
Furthermore, the inside is rendered to the shadowmap using a lower LOD.


Thanks, i think that's the main reason why i'm getting low framerates. I delayed the implementation of LOD until after the shadowing cache, but i can see now that i cannot get a good idea of the final performance without it. It's not rare for me to have to render up to 500k polys when updating a single shadow map.

Quote:
Also, the concept of virtual lights was dropped, in favour of a special PR radiosity / photon mapper combo


That's what i had in mind too, as you get the benefits of dynamic lighting/shadowing, but can also have static lights with soft shadows (thanks to the lightmap filtering) and ambient lighting.

Quote:
Yep, in the non-ARB_FP path, the shadows are non-filtered on ATI cards.


And in the ARB_FP path ? Are you antialiasing them in the fragment shader ? If so how many samples, and did you notice a large performance drop ? I also tried to dither it when using a low amount of samples (1 or 4 samples), but it looked quite ugly.

Quote:
What do you mean by depth-copy operation ? You simply render the texture values into the depth component (or another channel, depends on your shadow map format) using a fragment shader.


But can you really do that ? I read that accessing the depth component of a texture in a pixel shader is truncated to 8 bits on NVidia cards.

I'm using ARB_depth_texture for spotlights, and a cube map with depth encoded in RGBA for my point lights.

Another question before i forget, i am assuming you were using a cube map with encoded RGBA for point lights (as depth cube maps aren't supported), but doesn't that mean you also had sharp edges (no hardware PCF like with depth 2D textures on NVidia cards) ?

Y.
Quote:Original post by Ysaneya
Thanks, i think that's the main reason why i'm getting low framerates. I delayed the implementation of LOD until after the shadowing cache, but i can see now that i cannot get a good idea of the final performance without it. It's not rare for me to have to render up to 500k polys when updating a single shadow map.

LOD does a lot, obviously. You can also use image-based rendering techniques during the depth map creation. Parallax errors tend to be much less visible on a shadow map than they would be in an actual rendering. Also, aggressive occlusion culling can help a lot.

The human brain mainly uses shadows as cues to evaluate the depth and relative positioning of objects in an environment. However, the brain is no raytracer - when the environment is complex enough, it can't determine the projective accuracy of each single shadow. It will directly spot artifacts such as holes or light bleeding, but it will have a very hard time spotting small perspective anomalies and parallax errors in the shadows. As long as the global shadowing is mostly correct, the scene will look perfectly natural. Take advantage of this.

Quote:
And in the ARB_FP path ? Are you antialiasing them in the fragment shader ? If so how many samples, and did you notice a large performance drop ?

We have different fragment programs with different numbers of samples, and the system selects one based on a user preference setting. Obviously there is a performance drop, and unfortunately a significant one when using many samples. That's the price you pay for quality. But 4 samples are acceptable in terms of speed, and are comparable to nvidia's hardware PCF from a quality point of view. It doesn't look very nice if the map resolution is low, but at higher resolutions, quality is generally OK in most scenes. If people have more powerful hardware, they can increase the sample count.

Quote:
I also tried to dither it when using a low amount of samples (1 or 4 samples), but it looked quite ugly.

Hmm. How exactly did you do the sampling ? I suspect something specific, but, hmm, can you post the relevant fragment program snippet ?

Quote:
But can you really do that ? I read that accessing the depth component of a texture in a pixel shader is truncated to 8 bits on NVidia cards.

Yep, z-depth maps will be treated as 8-bit monochrome textures when bound as a colour texture. But there seems to be a little misunderstanding due to my sloppy terminology: when I said "depth map", I was referring to your texture containing depth values, I didn't mean the GL_DEPTH_COMPONENTxx format specifically. No, you can't directly render a GL_DEPTH texture to the depth buffer without going through an intermediate format (eg. as a packed RGBA or floating point depth map, it can be rendered to the "real" zbuffer using a fragment program). If you keep everything in packed RGBA, then downsampling the map on the hardware is rather straightforward. Don't forget to turn off bilinear filtering, though.
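For reference, the usual packed-RGBA8 depth trick looks roughly like this in C++ (the fragment program equivalent does the same multiply/floor/subtract chain to pack, and a weighted dot product to unpack; the exact channel order and weights are a matter of convention):

#include <cmath>
#include <cstdint>

// Pack a depth value in [0,1) into four 8-bit channels, 8 bits per channel.
void packDepthRGBA(float depth, uint8_t rgba[4])
{
    float d = depth;
    for (int i = 0; i < 4; ++i) {
        d *= 256.0f;
        float intPart = std::floor(d);
        rgba[i] = uint8_t(intPart > 255.0f ? 255.0f : intPart);
        d -= intPart;
    }
}

float unpackDepthRGBA(const uint8_t rgba[4])
{
    return rgba[0] / 256.0f
         + rgba[1] / (256.0f * 256.0f)
         + rgba[2] / (256.0f * 256.0f * 256.0f)
         + rgba[3] / (256.0f * 256.0f * 256.0f * 256.0f);
}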

Quote:
I'm using ARB_depth_texture for spotlights, and a cube map with depth encoded in RGBA for my point lights.

OK. In that case, the hardware supported downsampling will be a little more involved, and might require a readback operation.

Quote:
Another question before i forget, i am assuming you were using a cube map with encoded RGBA for point lights (as depth cube maps aren't supported),

On the ARB_FP path, yes. But not on the GF3/4 path, as this hardware can't unpack the texture as needed. First, I simulated depth cubemaps manually, by projecting six separate 2D maps and recombining the results. That added a lot of complexity to the render engine, and was less than optimal (it was the system used on the cathedral shot, and a main reason why I ran out of texture units on the GF4). I later dropped cubemaps in favour of dual paraboloid entirely. Even later, when I added the final ARB_FP code path, I used cubemaps again, with optional DP maps as a fallback option and as part of the LOD system.

Quote:
but doesn't that mean you also had sharp edges (no hardware PCF like with depth 2D textures on NVidia cards) ?

Well, on GeForce FX and above, the ARB_FP path is taken, just as with ATI cards - the filtering is then done in the fragment program, so the format doesn't really matter (although I use GL_DEPTH maps wherever possible on NV hardware, because the hardware PCF is faster - unless the user selects a high shadow quality setting).

On GF3/4, some variation of GL_DEPTH is used, projected in various ways depending on the light type. Antialiasing is then done by nvidia's hardware filter.
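For completeness, the state that triggers that hardware filter on a GL_DEPTH_COMPONENT map is just the ARB_shadow compare mode plus linear filtering - roughly like this (texture id and function name are illustrative only):

#include <GL/gl.h>
#include <GL/glext.h>   // ARB_shadow tokens

// Bind a GL_DEPTH_COMPONENT shadow map so that texture lookups return the
// result of the depth comparison; GL_LINEAR then blends the four comparison
// results, which is the "free" fake-PCF on GF3 and up.
void bindHardwarePCFShadowMap(GLuint shadowTex)
{
    glBindTexture(GL_TEXTURE_2D, shadowTex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_COMPARE_MODE_ARB,
                    GL_COMPARE_R_TO_TEXTURE_ARB);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_COMPARE_FUNC_ARB, GL_LEQUAL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
}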
Hello,

Quote:Also, aggressive occlusion culling can help a lot.


Do you use occlusion culling for all shadow maps, or maybe just for the sun shadow map? I know you use a software rasterizer to create the occlusion maps, but even if you take advantage of CPU/GPU concurrency, this seems to be a big overhead.

Maybe when traversing the tree you cull against the bounding volume of the light. This could save a bit of work.

Ciao, Thorris.
Quote:Original post by Thorris
Do you use occlusion culling for all shadow maps, or maybe just for the sun shadow map?

Always for the sun map. For the other lights, it depends. By default, the system will render an occlusion map per light. It then monitors the efficiency of the culling, ie. it maintains a simple ratio of culled nodes to total nodes over some frames. If the ratio drops below some threshold, then the occlusion is inefficient, and the system doesn't perform it anymore on that light. Until the light source moves, or some major geometry changes in its influence volume - then the system will try again.
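In code, such a monitor is little more than a running ratio and a threshold. A minimal C++ sketch (window size, threshold and all names are invented for the example, not taken from the engine):

// Track how useful occlusion culling is for one light. If the fraction of
// culled nodes stays too low over a window of frames, stop rendering an
// occlusion map for that light until it moves or its geometry changes.
struct OcclusionStats {
    int  culledNodes    = 0;
    int  totalNodes     = 0;
    int  frames         = 0;
    bool cullingEnabled = true;
};

void updateOcclusionStats(OcclusionStats& s, int culledThisFrame, int totalThisFrame)
{
    const int   kWindow    = 30;     // frames to average over (arbitrary)
    const float kThreshold = 0.15f;  // below 15% culled -> not worth the effort

    s.culledNodes += culledThisFrame;
    s.totalNodes  += totalThisFrame;
    if (++s.frames < kWindow) return;

    float ratio = s.totalNodes ? float(s.culledNodes) / float(s.totalNodes) : 0.0f;
    if (ratio < kThreshold)
        s.cullingEnabled = false;    // re-enable when the light or scene changes

    s.culledNodes = s.totalNodes = s.frames = 0;
}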

Quote:Original post by Thorris
Maybe when traversing the tree you cull against the bounding volume of the light. This could save a bit of work.

For point and spot lights, this influence distance clipping is implicitly done by the far plane of the light frustum.
Quote:Original post by Yann L
the price you pay for quality. But 4 samples are acceptable in terms of speed, and are comparable to nvidia's hardware PCF from a quality point of view.


Comparable to hardware PCF ? My experience has been the opposite, but i only tested my code on ATI cards, so i don't know if the result on NVidia cards looks similar or not (note to self: take some time to test it). My logic tells me it should as i'm only using the ARB_fp path. I only have 4 shades of brightness in my shadow edges (with 4 samples, that is), and from what i saw, NVidia's PCF is similar to bilinear filtering the results (it looks nice and smooth). It's passable if you are looking at the shadows from a distance, but if you zoom in, even 4 samples look quite ugly. 8 samples are a bit better but you can still notice the sampling. 16 samples is almost perfect but horribly slow. Do you have some screenshots of 4 samples so that i can compare with what i got ?

Quote:Original post by Yann L
Hmm. How exactly did you do the sampling ? I suspect something specific, but, hmm, can you post the relevant fragment program snippet ?


The sampling for dithering is done in eye space. That's one of the reasons why it looks ugly. I basically have a small noise texture (a pattern tiled a lot of times) that is used to randomly offset the tex coords after the projection of the shadow map.

I'm only posting the relevant parts of the shader:

# dithering
TEMP		texc;
TEMP		screenPos;
TEMP		dither;
# texcoord 1 contains the tex coords for shadow projection
TXP		dither, fragment.texcoord[1], texture[1], 2D;
MAD		dither, dither, 2.0, -1.0;
MUL		dither, dither, 0.0005;
RCP		texc.w, fragment.texcoord[1].w;
MUL		texc, fragment.texcoord[1], texc.w;
MUL		screenPos.x, texc.x, 400.0;
MUL		screenPos.y, texc.y, 300.0;
FRC		screenPos, screenPos;
SGE		dither, screenPos, 0.5;
ADD		dither.y, dither.y, dither.x;
SGE		dither.z, dither.y, 1.1;
SUB		dither.z, 1.0, dither.z;
MUL		dither.y, dither.y, dither.z;
MUL		dither, dither, 0.0005;


After this, "dither" contains an offset that is applied when sampling the shadow 4 times. I also tried with a regular pattern (not so random) but results the are always ugly, and the performance drop is tremendous.

Quote:Original post by Yann L
misunderstanding due to my sloppy terminology: when I said "depth map", I was referring to your texture containing depth values, I didn't mean the GL_DEPTH_COMPONENTxx format specifically. No, you can't directly render a GL_DEPTH texture to the depth buffer without going through an intermediate format (eg. as a packed RGBA or floating point depth map, it can be rendered to the "real" zbuffer using a fragment program). If you keep everything in packed RGBA, then downsampling the map on the hardware is rather straightforward. Don't forget to turn off bilinear filtering, though.


True, but then you do not benefit from NVidia's hardware PCF. All of that is becoming a bit confusing :) So as i understand it, since in your cathedral you were using hardware PCF, you were not able to use that shadow map redimensioning trick, or am i even more confused than i thought :p ? Or is there a way to enable hardware PCF in a pixel shader (doubtful) ?

Quote:Original post by Yann L
OK. In that case, the hardware supported downsampling will be a little more involved, and might require a readback operation.


I think i'll just switch all my shadow maps to pixel shaders with depth encoded as RGBA. That way i will have a single path for all cards. Remains the question of compatibility with older cards like the GF3/GF4. Can all of these shaders be implemented with NV pixel shaders ?

Quote:Original post by Yann L
hardware can't unpack the texture as needed. First, I simulated depth cubemaps manually, by projecting six separate 2D maps and recombining the results.


I also tried that, but the performance drop is quite impressive too...

Quote:Original post by Yann L
I later dropped cubemaps in favour of dual paraboloid entirely. Even later, when I added the final ARB_FP code path, I used cubemaps again, with optional DP maps as a fallback option and as part of the LOD system.


That makes me wonder.. i know the theory behind DP maps, but would you say it is really worth the effort ? As i understand it you need a pretty highly tessellated scene in order to avoid the artifacts due to the texture coordinate interpolation, which should no longer be linear.

Y.
Quote:Original post by Ysaneya
Comparable to hardware PCF ? My experience has been the opposite, but i only tested my code on ATI cards, so i don't know if the result on NVidia cards looks similar or not (note to self: take some time to test it). My logic tells me it should as i'm only using the ARB_fp path. I only have 4 shades of brightness in my shadow edges (with 4 samples, that is), and from what i saw, NVidia's PCF is similar to bilinear filtering the results (it looks nice and smooth).

Ah, that's exactly what I was suspecting in my post above :) OK, the problem is terminology again, or rather our loose use of the term "PCF". There is "real" PCF, and there is fake PCF. What you are (probably) doing is the real thing: at each fragment, take x samples and average the result. Of course, if you only take 4 samples, you'll get at most 4 shades of grey, which isn't enough. I agree that you need more samples for real PCF, that's what the "high quality" setting I mentioned in my post does - and that's why I'm disabling nvidia-style PCF when using that setting.

Now, nVidia does something very different. It's not percentage closer filtering at all, it's just a cheap approximation. It also uses 4 samples, thus our little misunderstanding above :) In fact nvidia is just doing that: bilinear filtering. They take four depth comparisons per fragment, but at the four shadowmap texel corners. Then they compute the fractional position of the current fragment within the shadow texel (in both u and v directions), and linearly interpolate between the results of the four corner comparisons. Basically, it's just bilinear filtering of four 1-sample shadow comparisons. You can simulate that behaviour in your pixel shader, and you'll get similar results to the hardware fake-PCF nvidia uses. It's also pretty fast, as it only uses 4 depthmap samples. Compared to real PCF with more samples, the quality will be worse, obviously.
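Expressed as plain C++ (just the math, operating on a CPU-side depth array; a fragment program version would do the same four comparisons and blend them with LRP-style interpolation), that fake PCF boils down to this sketch:

#include <cmath>

// Depth comparison at a single shadow map texel: 1 = lit, 0 = in shadow.
static float depthTest(const float* map, int size, int x, int y, float refDepth)
{
    x = x < 0 ? 0 : (x >= size ? size - 1 : x);
    y = y < 0 ? 0 : (y >= size ? size - 1 : y);
    return map[y * size + x] >= refDepth ? 1.0f : 0.0f;
}

// nvidia-style "fake PCF": compare against the four surrounding texels, then
// bilinearly blend the four 0/1 results using the fractional texel position.
float fakePCF(const float* map, int size, float u, float v, float refDepth)
{
    float x = u * size - 0.5f;   // shift so we interpolate between texel centres
    float y = v * size - 0.5f;
    int   x0 = int(std::floor(x)), y0 = int(std::floor(y));
    float fx = x - x0, fy = y - y0;

    float s00 = depthTest(map, size, x0,     y0,     refDepth);
    float s10 = depthTest(map, size, x0 + 1, y0,     refDepth);
    float s01 = depthTest(map, size, x0,     y0 + 1, refDepth);
    float s11 = depthTest(map, size, x0 + 1, y0 + 1, refDepth);

    float top    = s00 + fx * (s10 - s00);
    float bottom = s01 + fx * (s11 - s01);
    return top + fy * (bottom - top);    // smooth 0..1 lit factor across texels
}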

Quote:Original post by Yann L
True, but then you do not benefit from NVidia's hardware PCF. All of that is becoming a bit confusing :) So as i understand it, since in your cathedral you were using hardware PCF, you were not able to use that shadow map redimensioning trick, or am i even more confused than i thought :p ?

That's correct. We use that trick in later engine revisions.

Quote:
Or is there a way to enable hardware PCF in a pixel shader (doubtful) ?

Well, you can simulate it within the shader, but you can't control the hardware feature on a per-pixel basis.

Quote:
I think i'll just switch all my shadow maps to pixel shaders with depth encoded as RGBA. That way i will have a single path for all cards.

That's a possibility.

Quote:
Remains the question of compatibility with older cards like the GF3/GF4. Can all of these shaders be implemented with NV pixel shaders ?

Nope. GF3/4 don't do pixel shaders at all. They have "texture shaders", which are basically a set of predefined fragment programs encoded in the GPU. You can select and combine those programs, but this is rather limited. You can't unpack an RGBA-encoded depth map, and you can't do PCF in a shader. With a GF3/4, your only real option is to use GL_DEPTH maps and the built-in hardware fake-PCF. You then can't use native cubemaps anymore.

Quote:
That makes me wonder.. i know the theory behind DP maps, but would you say it is really worth the effort ? As i understand it you need a pretty highly tessellated scene in order to avoid the artifacts due to the texture coordinate interpolation, which should no longer be linear.

Depends on how you define high tessellation. I never encountered any major problems in our engine, but our scenes are in fact pretty well tessellated. There are also some tricks to optimize the technique, and partially avoid the tessellation issue. Here are some interesting notes.

Whether they are worth the effort or not highly depends on your engine, and on what kind of scenes you apply the shadows to. In our case, it was worth it - but your mileage may vary. They're the only viable option to get point lights on GF3/4 type hardware (due to the lack of shadow cubemaps), so that's a plus point. If you restrict your codepath to DX9 type hardware, then cubemaps will most certainly be easier and more versatile. If you have to increase your scene resolution just to fit DP mapping, then forget it. But if it works without changing the scene data, then you should give them a try.
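For anyone reading along, the core of DP mapping is just this projection (C++ sketch; d is the normalized direction from the light to the shaded point in the light's local frame, and the function name is made up):

// Project a direction onto one of the two paraboloid maps. Returns false if
// the direction belongs to the other hemisphere/map.
bool dualParaboloidUV(const float d[3], bool frontMap, float& u, float& v)
{
    float z = frontMap ? d[2] : -d[2];
    if (z <= 0.0f) return false;          // handled by the other map
    float inv = 1.0f / (1.0f + z);
    u = d[0] * inv * 0.5f + 0.5f;         // remap [-1,1] to [0,1]
    v = d[1] * inv * 0.5f + 0.5f;
    return true;
}

The 1/(1+z) term is exactly the non-linear part that makes coarse tessellation a problem: the hardware interpolates the resulting coordinates linearly across each triangle.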
Quote:Original post by Yann L
the problem is terminology again, or rather our loose use of the term "PCF". There is "real" PCF, and there is fake PCF. What you


I know :) When i mention "hardware PCF" for NVidia cards, i'm actually thinking of the bilinear filtering trick that makes the shadow edges smoother. It's not real PCF for sure.

Quote:Original post by Yann L
In fact nvidia is just doing that: bilinear filtering. They take four depth comparisons per fragment, but at the four shadowmap texel corners. Then they compute the fractional position of the current fragment within the shadow texel (in both u and v directions), and linearly interpolate between the results of the four corner comparisons. Basically, it's just bilinear filtering of four 1-sample shadow comparisons. You can simulate that behaviour in your pixel shader, and you'll get similar results to the hardware fake-PCF nvidia uses. It's also pretty fast, as it only uses 4 depthmap samples. Compared to real PCF with more samples, the quality will be worse, obviously.


That does sound interesting but i'm not sure i see how you can do that in a pixel shader.

Here's what i currently got:

[screenshots: 4-sample shadow edges, seen from a distance and zoomed in]

Particularly in that last shot, you can see the shades of gray with 4 samples.. it looks pretty good from the distance (first screen), but quite ugly when you start to zoom. I'll eventually switch to TSMs for directional lights but i'll still use antialiased shadow maps for spot or omni lights, so i'd like to fix that problem without having to use > 8 samples.

Speaking of TSMs, i've read the papers about it, but how well does it handle frustums with an infinite (or at least very far) far clipping plane ?

Quote:Original post by Yann L
Depends on how you define high tessellation. I never encountered any major problems in our engine, but our scenes are in fact pretty well tessellated. There are also some tricks to optimize the technique, and partially avoid the tessellation issue. Here are some interesting notes.


I know, Tom is an old colleague of mine, so i'm pretty aware of his work on DPSMs :) But i remember him saying it was only useful when the scene is already pretty well tessellated. I'm not sure i can guarantee that, but when i have some time i'll just test it and see if it's good/bad enough.

Y.
