Before we get to each algorithm, we will introduce some notations and conventions that will be used in the article. Our goal in the following algorithms is to render as efficiently as possible an arbitrary cube map located in an arbitrary location in the scene. The cube we are rendering to is aligned on the axes of the world basis (X, Y, Z).

It is assumed in this article that the reader is already familiar with the concept of cube mapping as well as the fundamentals of working with vertex and geometry shaders.

Algorithm 1 - Multi-Pass Rendering

Multi-Pass Rendering is the simplest algorithm and also the only one that can be used on Direct3D9-class hardware. The idea is to render the scene to each one of the cube faces separately. The (naive) algorithm could look like this:

BaseObjectSet = Select Objects to be Rendered on the Cube Map (1) For each face F of the cube map Set the Render Target corresponding to F Set the view and projection matrices corresponding to F Foreach Object O in BaseObjectSet Render( O ) End End (1) is a step that allows you to select a subset of the scene according to the context of the cube map. For example, if you are rendering a cube map for shadow mapping, you will here select only the objects in the light range and casting shadows. If it is a reflection cube-map, you could include only objects close enough to the cube position.

If BaseObjectSet contains N objects, we will then render exactly 6N objects and make six render target switches.

To improve on this first version of the algorithm, we can notice that we could apply one more culling step: as we are rendering one face at a time, we can test the current object to see if it is in the current view frustum (constructed from the view/projection matrices for the current face). The improved algorithm would then look like this:

BaseObjectSet = Select Objects to be Rendered on the Cube Map (1) For each face F of the cube map Set the Render Target corresponding to F Set the view and projection matrices corresponding to F Compute the view frustum VF corresponding to F Foreach Object O in BaseObjectSet If O is visible in VF Then Render( O ) End End We can thus skip objects that are not visible on all the faces. An object of standard size such as a character or a car, will usually only be visible on two to three faces at one time. Thus, at worst we still render 6N objects but in practice we will render on average 2N to 4N objects (this of course depends on the exact set of objects in the scene). We however still have to switch the render target six times.

Algorithm 2 - Single-Pass Rendering Using Only a Geometry Shader

Direct3D10 comes with a sample program that renders a cube map in a single pass using a geometry shader. The idea is to pass six view/projection matrices to the geometry shader, and for each object primitive, generate six other primitives that are projected onto their respective cube face. We can thus take our previous algorithm and factorize some operations:

BaseObjectSet = Select Objects to be Rendered on the Cube Map (1) Set the six Render Targets corresponding to all the faces as an array Set the six view and projection matrices corresponding to all the faces as an array Foreach Object O in BaseObjectSet Render( O ) End As you can see, we got rid of the cube face loop. Before iterating the object set, we compute and set the needed variables for all faces at once (matrices and render targets). Similarly, there will be only one draw call per object. The number of API calls has been divided by six here compared to the worst-case scenario! Here is what the vertex shader and geometry shader could look like:

VS_OUTPUT VertexShader( VS_INPUT input ) { VS_OUTPUT output = (VS_OUTPUT) 0; output.position = mul( float4( input.position, 1), matWorld ); // Position in world space return output; } [maxvertexcount(18)] void GeometryShader( triangle VS_OUTPUT input[3], inout TriangleStream CubeMapStream ) { // For each face of the cube, create a new triangle [unroll] for( int f = 0; f < 6; ++f ) { GS_OUTPUT output = (GS_OUTPUT) 0; // Assign triangle to the render target corresponding to this cube face output.RTIndex = f; // For each vertex of the triangle, compute screen space position and pass distance [unroll] for( int v = 0; v < 3; ++v ) { output.position = mul( input[v].position, matViewProj[f] ); CubeMapStream.Append( output ); } // New triangle CubeMapStream.RestartStrip(); } } The vertex shader simply transforms the vertices from object space to world space. Most of the work is done in the geometry shader. In fact our face loop from the first algorithm can be found here: we have traded CPU cycles for GPU cycles. One problem that seems obvious in this naive implementation is that there is no culling on a face basis. Which means that the geometry shader will always be outputting six times more vertices.

One way to go around this problem is to test the object against each of the faces' respective frusta and set a flag to be used in the geometry shader to skip generation of some primitives. We can then change the algorithm as follow:

BaseObjectSet = Select Objects to be Rendered on the Cube Map (1) Set the six Render Targets corresponding to all the faces as an array Set the six view and projection matrices corresponding to all the faces as an array Compute the six view frusta corresponding to all the faces Foreach Object O in BaseObjectSet For each face F of the cube map Test O with the F's frustum If O is visible, set the geometry shader flag for F (cubeFaceWriteFlag |= 1 << F) End Render( O ) End [maxvertexcount(18)] void GeometryShader( triangle VS_OUTPUT input[3], inout TriangleStream CubeMapStream ) { // For each face of the cube, create a new triangle [unroll] for( int f = 0; f < 6; ++f ) { // Only output primitives to the required cube faces [branch] if ( ( cubeFaceWriteFlag & ( 1 << f ) ) != 0 ) { GS_OUTPUT output = (GS_OUTPUT) 0; // Assign triangle to the RT corresponding to this cube face output.RTIndex = f; // For each vertex of the triangle, compute screen space position and pass distance [unroll] for( int v = 0; v < 3; ++v ) { output.position = mul( input[v].position, matViewProj[f] ); CubeMapStream.Append( output ); } // New triangle CubeMapStream.RestartStrip(); } } } Although we have managed to skip generation of some primitives, the geometry shader is still not optimal. Indeed, the geometry shaders are usually optimized for generation of at most four vertices whereas we are generating at most 18.

Algorithm 3 - Single-Pass Rendering Using a Geometry Shader and Geometry Instancing

Geometry instancing is a concept that has already been around in Direct3D9. However, Direct3D10 and its geometry shaders open some new ways of using it. In Direct3D9, one would pass a vertex buffer and a buffer containing instance data (for example six transformation matrices) and the API will, in a single draw call, render the vertex buffer six times, each with different instance data. Very nice, but useless for us when we need to render to a cube map: there is no way to select dynamically the render target that will be used for a given instance.

This is where the geometry shaders come into picture. It is indeed possible as seen in the previous algorithm to indicate per primitive to which render target it should be rendered. The idea will then be to have an instance buffer dynamically filled with the face indices on which the object is visible. So here we are with a first version of our third algorithm:

Select Objects to be Rendered on the Cube Map (1) Set the six Render Targets corresponding to all the faces as an array Set the six view and projection matrices corresponding to all the faces as an array Compute the six view frusta corresponding to all the faces Foreach Object O in BaseObjectSet Reset Dynamic Instance Buffer For each face F of the cube map Test O with the F's frustum If O is visible, append F to the instance buffer End Render six instances of O using our dynamic instance buffer End The algorithm is pretty similar to the previous one: we manage to minimise the number of API calls by rendering to all the cube faces at once. By testing against the face frusta, we only append to the instance buffer the relevant faces, thus not instancing any invisible geometry (as in the previous algorithm). Even though this looks pretty smooth in our pseudo algorithm, there are some hidden complications.

In Direct3D, as we are free to organise the vertex/instance data as we wish, we must tell the API (using vertex layouts) how data we are providing is organised. Usually the engine will store this per mesh and switch the vertex layout when drawing the object. Indeed, we want to keep the freedom of being able to specify any vertex layout (maybe one for characters, one for terrain, one for sky, etc.). In this algorithm, we are appending instance data and thus have to append some information to the existing vertex layout. In the Kourjet Engine for example, the new vertex layouts are computed on the fly the first time we draw an object and kept in a cache belonging to the render path.

The vertex shader and geometry shader are pretty simple once we got the vertex layout right. We basically use the vertex shader to do all transformations and the geometry shader simply to direct the primitive to the appropriate render target. Here is the source code:

struct VS_INPUT { float3 position : POSITION; uint cubeFace : CUBEFACE; }; struct VS_OUTPUT { float4 position : SV_POSITION; // Position in screen space uint cubeFace : CUBEFACE; }; struct GS_OUTPUT { float4 position : SV_POSITION; // Position in screen space uint RTIndex : SV_RenderTargetArrayIndex; }; VS_OUTPUT VertexShader( VS_INPUT input ) { VS_OUTPUT output = (VS_OUTPUT) 0; float4 P = mul( float4( input.position, 1), matWorld ); // Position in world space output.position = mul( P, matViewProj[ input.cubeFace ] ); output.cubeFace = input.cubeFace; return output; } [maxvertexcount(3)] void GeometryShader( triangle VS_OUTPUT input[3], inout TriangleStream CubeMapStream ) { GS_OUTPUT output = (GS_OUTPUT) 0; // Assign triangle to the render target corresponding to this cube face output.RTIndex = input[0].cubeFace; // For each vertex of the triangle [unroll] for( int v = 0; v < 3; ++v ) { output.position = input[v].position; CubeMapStream.Append( output ); } // New triangle CubeMapStream.RestartStrip(); }

Conclusion

The algorithm pros and cons

Pros Cons Algorithm 1

- Works on Direct3D9 hardware
- Can do visibility culling on each face's frustum
- Need to switch render targets six times
- At worst six draw calls per object Algorithm 2
- One render target switch
- One draw call per object
- Geometry shader could be less than optimal depending on the hardware and the vertex format [sup][1][/sup]
- No real culling per face (vertex shader is still run six times)
- Direct3D10 only Algorithm 3
- One render target switch
- One draw call per object
- Can do visibility culling on each face's frustum (instancing only geometry if visible on a face)
- Must maintain a cache of duplicated vertex layouts for instancing
- Direct3D10 only

Source code and implementation

This article does not come with any particular source code attached. However implementation of these algorithms applied to rendering of shadow maps can be found in the Kourjet Engine source code, and in particular the following files can be of interest:

- Algorithm 1 and its corresponding effect file
- Algorithm 2 and its corresponding effect file
- Algorithm 3 and its corresponding effect file

Reference articles & other links

- The blog posts about the implementation of these algorithms in Kourjet for shadow mapping (part 1, part 2 and part 3)
- In this paper on GPU programming, NVidia gives some recommandations for Geometry Shaders, and in particular in section 4.6, page 40, they say: The performance of a GS is inversely proportional to the output size (in scalars) declared in the Geometry Shader, which is the product of the vertex size and the number of vertices (maxvertexcount). This performance degradation however occurs at particular output sizes, and is not smooth. For example, on a GeForce 8800 GTX a GS that outputs at most 1 to 20 scalars in total would run at peak performance, whereas a GS would run at 50% of peak performance if the total maximum output was declared to be between 27 and 40 scalars.

More concretely, if your vertex declaration is:

float3 pos: POSITION; float2 tex: TEXCOORD;

Each vertex is 5 scalar attributes in size.

If the GS defines a maxvertexcount of 4 vertices, then it will run at full speed (20 scalar attributes).

But if you increase the vertex size by adding a float3 normal, the number of scalars will increase by 4 * 3, putting your total at 32 scalar attributes. This will result in the GS running at 50% peak performance.