Deferred lighting and instant radiosity
Fortunately, there's a solution to the problem.. enter the fantastic universe of deferred lighting !
Traditionally, it is possible implement dynamic lighting without any precomputations via forward lighting. The algorithm is surprisingly simple: in a first pass, the scene is rendered to the depth buffer and to the color buffer using a constant ambient color. Then, for each light you render the geometry that is affected by this light only, with additive blending. This light pass can include many effects, such as normal mapping/per pixel lighting, shadowing, etc..
This technique, used in games silmilar to Doom 3, does work well, but is very dependent on the granularity of the geometry. Let's take an object of 5K triangles that is partially affected by 4 lights. This means that to light this object, you will need to render 25K triangles over 5 passes total ( ambient pass + 4 lights passes, each 5K ). An obvious optimization is, given one light and one object, to only render the triangles of the object that are affected by the light, but this would require some precomputations that a game such as Infinity cannot afford, due to its dynamic and procedural nature.
Now let's imagine the following situation: you've got a battleship made of a dozen of 5K-to-10K triangles objects, and you want to place a hundred lights on its hull. How many triangles do you need to render to achieve this effect with forward lighting ? Answer: a lot. Really, a lot. Too much.
Another technique that is getting more and more often used in modern games is deferred lighting. It was a bit impractical before shader model 3.0 video cards, as it required many passes to render the geometry too. But using multiple render targets, it is possible to render all the geometry once, and exactly once ! independently of the number of lights in the scene. One light or a hundred lights: you don't need to re-render all the objects affected by the lights. Sounds magical, doesn't it ?
The idea with deferred lighting is that, in a forward pass, geometric informations are rendered to a set of buffers, usually called "geometry buffers" ( abbrev: G-buffers ). Those informations usually include the diffuse color ( albedo ), the normal of the surface, the depth or linear distance between the pixel and the camera, the specular intensity, self-illumination, etc.. Note that no lighting is calculated yet at this stage.
Once this is done, for each light, a bounding volume ( which can be as simple as a 12-triangles box for a point light ) is rendered with additive blending. In the pixel shader, the G-buffers are accessed to reconstruct the pixel position from the current ray and depth, then this position is then used to compute the light color and attenuation, do normal mapping or shadowing, etc..
There are a few tricks and specificities in Infinity. Let's have a quick look at them. First of all, the G-buffers.
I use 4 RGBAF16 buffers. They store the following data:
- R G B A
Buffer 1 FL FL FL Depth
Buffer 2 Diffuse Diffuse Diffuse Self-illum
Buffer 3 Normal Normal Normal Specular
Buffer 4 Velocity Velocity Extinction MatID
'FL' = Forward lighting. That's one of the specificity of Infinity. I still do one forward lighting pass, for the sun and ambient lighting ( with full per-pixel lighting, normal mapping and shadowing ) and store the result in the RGB channels of the first buffer. I could defer it too, but then I'd have a problem related to atmospheric scattering. At pixel level, the scattering equation is very simple: it's simply modulating an extinction color ( Fex ) and adding an in-scattering color ( Lin ):
Final = Color * Fex + Lin
Fex and Lin are computed per vertex, and require some heavy calculations. Moving those calculations per pixel would kill the framerate.
If I didn't have a forward lighting pass, I'd have to store the scattering values in the G-buffers. This would require 6 channels ( 3 for Fex and 3 for Lin ). Here, I can get away with only 4 and use a grayscale 'Extinction' for the deferred lights ( while sun light really needs an RGB color extinction ).
'Velocity' is the view-space velocity vector used for motion blur ( computed by taking the differences of positions of the pixel between the current frame and the last frame ).
'Normal' is stored in 3 channels. I have plans to store it in 2 channels only and recompute the 3rd in the shader. However this will require to encode the sign bit in one of the two channels, so I haven't implemented it yet. Normals ( and lighting in general ) are computed in view space.
'MatID' is an ID that can be used in the light shader to perform material-dependent calculations.
As you can see, there's no easy way to escape using 4 G-buffers.
As for the format, I use F16. It is necessary both for storing the depth, but also encoding values in HDR.
At first, I was a bit disapointed by the performance hit / overhead caused by G-buffers. There are 4 buffers after all, in F16: that requires a lot of bandwidth. On an ATI X1950 XT, simply setting up the G-buffers and clearing them to a constant color resulted in a framerate of 130 fps at 1280x1024. That's before even sending a single triangle. As expected, changing the screen resolution dramatically changed the framerate, but I found this overhead to be linear with the screen resolution.
I also found yet-another-bug-in-the-ATI-OpenGL-drivers. The performance of clearing the Z-buffer only was dependent on the number of color attachments. Clearing the Z-buffer when 4 color buffers are attached ( even when color writes are disabled ) took 4 more time than clearing the Z-buffer when only 1 color buffer was attached. As a "fix", I simply dettach all color buffers when I need to clear the Z-buffer alone.
Once the forward lighting pass is done and all this data is available in the G-buffers, I perform frustum culling on the CPU to find all the lights that are visible in the current camera's frustum. Those lights are then sorted by type: point lights, spot lights, directional lights and ambient point lights ( more on that last category later ).
The forward lighting ( 'FL' ) color is copied to an accumulation buffer. This is the buffer in which all lights will get accumulated. The depth buffer used in the forward lighting pass is also bound to the deferred lighting pass.
For each light, a "pass" is done. The following states are used:
* depth testing is enabled ( that's why the forward lighting's depth buffer is reattached )
* depth writing is disabled
* culling is enabled
* additive blending is enabled
* if the camera is inside the light volume, the depth test function is set to GREATER, else it uses LESS
A threshold is used to determine if the camera is inside the light volume. The value of this threshold is chosen to be at least equal to the znear value of the camera. Bigger values can even be used, to reduce a bit the overdraw. For example, for a point light, a bounding box is used and the test looks like this:
const SBox3DD& bbox = pointLight->getBBoxWorld();
SBox3DD bbox2 = bbox;
bbox2.m_min -= SVec3DD(m_camera->getZNear() * 2.0f);
bbox2.m_max += SVec3DD(m_camera->getZNear() * 2.0f);
bbox2.m_min -= SVec3DD(pointLight->getRadius());
bbox2.m_max += SVec3DD(pointLight->getRadius());
TBool isInBox = bbox2.isIn(m_camera->getPositionWorld());
m_renderer->setDepthTesting(true, isInBox ? C_COMP_GREATER : C_COMP_LESS);
Inverting the depth test to GREATER as the camera enters the volume allows to discard pixels in the background / skybox very quickly.
I have experimented a bounding sphere for point lights too, but found that the reduced overdraw was cancelled out by the larger polycount ( a hundred polygons, against 12 triangles for the box ).
I haven't implemented spot lights yet, but I'll probably use a pyramid or a conic shape as their bounding volume.
As an optimization, all lights of the same type are rendered with the same shader and textures. This means less state changes, as I don't have to change the shader or textures between two lights.
For each light, a Z range is determined on the cpu. For point lights, it is simply the distance between the camera and the light center, plus or minus the light radius. When the depth is sampled in the shader, the pixel is discarded if the depth is outside this Z range. This is the very first operation done by the shader. Here's a snippet:
vec4 ColDist = texture2DRect(ColDistTex, gl_FragCoord.xy);
if (ColDist.w < LightRange.x || ColDist.w > LightRange.y)
There isn't much to say about the rest of the shader. A ray is generated from the camera's origin / right / up vectors and current pixel position. This ray is multiplied by the depth value, which gives a position in view space. The light position is uploaded to the shader as a constant in view space; the normal, already stored in view space, is sampled from the G-buffers. It is very easy to implement a lighting equation after that. Don't forget the attenuation ( color should go to black at the light radius ), else you'll get seams in the lighting.
In a final pass, a shader applies antialiasing to the lighting accumulation buffer. Nothing particularly innovative here: I used the technique presented in GPU Gems 3 for Tabula Rasa. An edge filter is used to find edges either in the depth or the normals from the G-buffers, and "blur" pixels in those edges. The parameters had to be adjusted a bit, but overall I got it working in less than an hour. The quality isn't as good as true antialiasing ( which cannot be done by the hardware in a deferred lighting engine ), but it is acceptable, and the performance is excellent ( 5-10% hit from what I measured ). Here's a picture showing the edges on which pixels are blurred for antialiasing:
Once I got my deferred lighting working, I was surprised to see how well it scaled with the number of lights. In fact, the thing that matters is pixel overdraw, which is of course logical and expected given the nature of deferred lighting, but still I found it amazing that as long as overdraw remained constant, I could spawn a hundred light and have less than a 10% framerate hit.
This lead me to think about using the power of deferred lighting to add indirect lighting via instant radiosity.
The algorithm is relatively simple: each light is set up and casts N photon rays in a random direction. At each intersection of the ray with the scene, a photon is generated and stored in a list. The ray is then killed ( russian roulette ) or bounces recursively in a new random direction. The photon color at each hit is the original light color multiplied by the surface color recursively at each bounce. I sample the diffuse texture with the current hit's barycentric coordinates to get the surface color.
In my tests, I use N = 2048, which results in a few thousands photons in the final list. This step takes around 150 ms. I have found that I could generate around 20000 photons per second in a moderately complex scene ( 100K triangles ), and it's not even optimized to use many CPU cores.
In a second step, a regular grid is created and photons that share the same cell get merged ( their color is simply averaged ). Ambient point lights are then generated for each cell with at least one photon. Depending on N and the granularity of the grid, it can result in a few dozen ambient point lights, up to thousands. This step is very fast: around one millisecond per thousand photons to process.
You can see indirect lighting in the following screenshot. Note how the red wall leaks light on the floor and ceiling. Same for the small green box. Also note that no shadows are used for the main light ( located in the center of the room, near the ceiling ), so some light leaks on the left wall and floor. Finally, note the ambient occlusion that isn't fake: no SSAO or precomputations! There's one direct point light and around 500 ambient point lights in this picture. Around 44 fps on an NVidia 8800 GTX in 1280x1024 with antialiasing.
I have applied deferred lighting and instant radiosity to Wargrim's hangar. I took an hour to texture this hangar with SpAce's texture pack. I applied a yellow color to the diffuse texture of some of the side walls you'll see in those screenshots: note how light bounces off them, and created yellow-ish ambient lighting around that area.
There are 25 direct point lights in the hangar. Different settings are used for the instant lighting, and as the number of ambient point lights increase, their effective radius decrease. Here are the results for different grid sizes on a 8800 GTX in 1280x1024:
Cell size # amb point lights Framerate
0.2 69 91
0.1 195 87
0.05 1496 46
0.03 5144 30
0.02 10605 17
0.01 24159 8
I think this table is particularly good at illustrating the power of deferred lighting. Five thousand lights running at 30 fps ! And they're all dynamic ( although in this case they're used for ambient lighting, so there would be no point in that ): you can delete them or move every single of them in real time without affecting the framerate !
In the following screenshots, a few hundred ambient point lights were used ( sorry, I don't remember the settings exactly ). You'll see some green dots/spheres in some pics: those highlight the position of ambient lights.
Full lighting: direct lighting + ambient lighting
Direct lighting only
Ambient ( indirect ) lighting only