Today I made my GPU cry. Again... :)
I've been a bit busy lately, had to finish off and handin my final year dissertation (See my Image Of The Day) and all sorts of fun things. In my new found spare time I put together this miniarticle and sample code for you guys...
An Example of Optimal Terrain Rendering
I've always been interested in terrain rendering, not only are the end results particularly pretty (browse the Image Of The Day archives for examples) but they're also quite easy to program and often prove to be very good examples of many common graphics algorithms.
Whilst reading through this thread I thought "I want to try that". So I did [smile]
The basic idea
Geometry is built from a number of vertices, each vertex has to be transformed by the GPU in order for it to be correctly presented on the screen. The more vertices you have, the more work the GPU does and the slower the framerate. Simple really!
Direct3D offers a neat optimization for this process  Index Buffers. These can offer a simple reduction in VRAM storage due to eliminating duplicate vertexbuffer data, but they also enable the "Vertex Cache" on your GPU. Making use of the cache can mean that the GPU doesn't need to transform every vertex of every triangle  it can load results of previous transforms from the cache and skip almost all processing. The key is in making efficient use of the cache and maximizing the "hit rate" so we get as many triangles from as little work as possible.
For this example I'm using a simple heightmap based terrain engine  a very simple method that most people will be familiar with. If not, this article might be a good start (if not, try google!).
Heightmap terrains are typically rendered via LOD and heirarchical culling algorithms, neither of which are relevant to this particular example. Quite simply, the huge amounts of geometry involved in a decent terrain engine is providing the test data for the algorithms.
I've designed the C++/D3D9 source code around a Terrain::ITerrainRenderer interface with 3 "in box" subclasses, each of which taking the same source data and rendering it in a slightly different way. The use of a base/interface class intentionally leaves the door open for you to implement your own methods for additional experimentation.
As shown in the following illustration, the display has a dropdown list that you can use to swap between methods and a set of the corresponding statistics. Changing the current rendering mode will change the statistics output, but the actual image should remain identical.
A final point to be made about the previous image  the lighting model. I quite specifically chose one of the most performanceunfriendly lighting models I know of (OrenNayar) and plugged it into a vs_2_0 shader; this intentionally stresses the vertex processing units in your GPU. It also makes any differences much more apparent  a more trivial N dot L approach uses so few instructions that its difficult to notice any improvements.
You can view a disassembly of of the effect file here, noting that the vertex shader weighs in at 91 instructions. Anything to reduce the number of times we have to invoke this shader is A Good Thing(TM).
Simple Triangle Lists
This is the most basic method implemented and uses only a vertex buffer. Given that its not optimized for storage or performance its more of a baseline reference than a serious contender.
For a given terrain of M x N vertices, this method creates (M1) x (N1) "quads", each of two triangles. Each triangle is obviously made up of 3 vertices.
Triangles: 2 * (M1) x (N1)
Vertices: 6 * (M1) x (N1)
In ascii art:
and
Indexed Triangle Lists
As you should be able to tell from the previous method, a single quad has only 4 vertices  yet we were storing many more than this. For a single quad the topright and bottomleft vertices were duplicated, but if you consider the neighbouring quads then almost every vertex is duplicated four or six times. Not particularly smart!
Applying an index buffer to this allows you to store a single vertex for each unique point (giving M x N vertices in total), a substantial saving if you've got a complex vertex type (e.g. with tangents and bitangents for "normal mapping" shaders), however it also enables the GPU to use the posttransform vertex cache.
Essentially, the GPU can look at the 3 indices for an incoming triangle and compare those against the vertices currently stored in the posttransform cache. If there is a match then the GPU can take the results directly from the cache, if not it processes it as normal. A nice little speedup with no penalties  the worst case is still the same as the previous method.
Triangles: 2 * (M1) x (N1)
Vertices: M x N
Indices 6 * (M1) x (N1)
The vertices would be laid out as follows:
And the indices are used to link them together
and
The performance advantage comes when you consider multiple triangles in a row. After the first triangles in a row has been rendered, all subsequent triangles should have 2 vertices in every triangle in the posttransform cache. Therefore you can send a new triangle for rasterization by only transforming a single vertex! However, each time you start a new row you wont have any of the vertices required in the cache. As an example:
Let M = 128
Let N = 128
A total of 32,258 triangles.
For each column we have 127 quads, thus 254 triangles. The first triangle always transforms 3 vertices, but the remaining 253 triangles need only transform 1  meaning that for a given column we transform 256 vertices.
Because we have to start again at each row, we simply multiply 127 rows by 256 transforms each  giving us 32,512 transforms for the whole terrain patch. We can therefore work out that we get 1.001 triangles for every vertex we transform. That is already a 3fold improvement over the first example!
Optimized Indexed Triangle Lists
This is the method that this sample has been building up to [smile]
The previously mentioned thread contains a couple of excellent replies from namethatnobodyelsetook and JohnBolton, although neither specifically take credit for the "Priming the Vertex Cache" technique (I dont know where it originated either!). This technique simply seeks to improve upon the previous design, whilst it might be good its more a convenient arrangement for the programmer rather than one tailored to the underlying hardware.
The aforementioned thread has a great discussion of how the technique works, so I wont waste space my covering it again here. Needless to say the trick revolves around ordering the indices such that you try and get the bestcase performance from the posttransform vertex cache. Using D3D9's queries you can get the size of the vertex cache (24 in the case of my GeForce 6800) but this is almost always going to be smaller the the height/width of the terrain segment we wish to render. Thus it becomes necessary to divide into a number of strips  each strip being 2 vertices less than the known cache size. In a similar way to the previous 'Indexed Triangle List' method, each time a new strip is started we lose the use of the cache, but the fact that this is now substantially less often should be a bonus.
Whilst it might not be "realworld", taking a 12 element cache and a row size of 10:
First we render degenerate triangles (which wont appear on screen, but the vertex processing will still get invoked) to make sure that the cache contains the first row of vertices:
The cache therefore looks like:
[0,1,2,3,4,5,6,7,8,9,?,?]
From here we render each row of each strip in the same way as the previous techniques. The advantage being that you need only transform 1 vertex for every other triangle, and by the time we've finished one row we've got the cache's contents perfectly set up for the next row.
To work through the numbers in a similar way to the previous method:
Let M = 128
Let N = 128
A total of 32,258 triangles.
Assume a cache size of 24 (standard across Nvidia's 4,5,6 and 7 series).
Therefore we have a row of 22 vertices giving us 42 triangles. Firstly 11 degenerate triangles are rendered  each transforming 2 vertices. Next we begin the real triangle rendering, a total of 42 triangles  but we only end up transforming the 22 vertices that make up the subsequent row.
Therefore the perstrip overhead is 22 vertices, and then a further 22 vertices per row with 127 rows giving us 2,816 vertex transformations.
The tricky part comes from taking 22 vertex strips into a 128x128 grid. In total it requires 7 strips (6 full size, 1 smaller), so a simple multiplication yields 17,152 transformations. It is debatable whether you count the degenerate triangles, if you do you get 32,323 triangles or 32,258 triangles without  hardly much of a difference. Assuming you count all triangles then you get 1.885 triangles for every vertex transform. Again, a tidy 88% improvement over the previous method.
Results
The application that is provided as part of this article isn't specifically built to be a benchmarking tool, so it would be unwise to take the results presented here as any sort of definitive performance metric  equally, using this code to compare multiple GPU's is probably not a fair comparison. Instead the results should be used as a form of guidance to compare similar rendering methods. Across the same GPU you should be able to draw rough comparisons  and as shown in the following graphs its fairly clear where the differences are!
The first graph shows a comparison of the three methods, but there are two important characteristics.
The above graph shows the peak throughput of the three methods. The results aren't unexpected and match the calculations previously in this article. If the first, "simple", method is taken as a baseline then the second, "indexed", is 2.7x faster  not far off the 3x faster from the initial calculations. Calculations also showed the 3rd, "optimized" method to be 60% faster  which isn't too far off the theoretical 88% improvement.
It is worth noting that the peak throughput as well as the differences between methods varies depending on the batch size  the 320,000 triangle batch used in the above example is intentionally very high. Sampling different batch sizes generates the following graph:
As can be clearly seen, the advantages of the nontrivial methods only starts to show from a batch size of 45,000 and above. The 3^{rd} optimized method only really kicks in above 125,000. It is likely that the 67 million triangles/second that the simple method routinely clocks up is actually the peak rate for the GPU  21 million vertices at 91 instructions a piece generates a cool 1.9 billion vertex shader instructions per second [oh]
Adding Additional Methods
As has been previously mentioned, the sample code provided with this article is designed to allow you to extend and compare your own methods. An obvious choice here would be to implement one of the many trianglestrip methods; general wisdom suggests you wont achieve the peak rates shown here, but it could still be a viable alternative.
Check out the ITerrainRenderer.h header file  subclass it as appropriate (use the existing 3 implementations as a reference) and then add it to the list starting on line 24 (Terrain::ITerrainRenderer *modes[] = ).
I've specifically avoiding shipping a binary with the source code so as not to get bogged down in D3DX DLL versioning problems; given the simplicity of the code you shouldn't have any problems compiling it against any of the 2005 or 2006 DirectX SDK's. You can download the source code here.
Conclusion
Whilst I take full credit for the implemention and writeup I can't take credit for the original idea and took my inspiration from this thread and its participants (if I hadn't made that clear by now [lol]). As you can probably tell by their positions in the Top50 board, if you come across posts from namethatnobodyelsetook, hplus0603 or JohnBolton they're probably worth reading [wink]
Hopefully you'll find the information I've presented in this article interesting, but better yet  take it further! Maybe it'll give you some ideas about things to look into, optimizations to try...
Comments are greatly appreciated [smile]
An Example of Optimal Terrain Rendering
I've always been interested in terrain rendering, not only are the end results particularly pretty (browse the Image Of The Day archives for examples) but they're also quite easy to program and often prove to be very good examples of many common graphics algorithms.
Whilst reading through this thread I thought "I want to try that". So I did [smile]
The basic idea
Geometry is built from a number of vertices, each vertex has to be transformed by the GPU in order for it to be correctly presented on the screen. The more vertices you have, the more work the GPU does and the slower the framerate. Simple really!
Direct3D offers a neat optimization for this process  Index Buffers. These can offer a simple reduction in VRAM storage due to eliminating duplicate vertexbuffer data, but they also enable the "Vertex Cache" on your GPU. Making use of the cache can mean that the GPU doesn't need to transform every vertex of every triangle  it can load results of previous transforms from the cache and skip almost all processing. The key is in making efficient use of the cache and maximizing the "hit rate" so we get as many triangles from as little work as possible.
For this example I'm using a simple heightmap based terrain engine  a very simple method that most people will be familiar with. If not, this article might be a good start (if not, try google!).
Heightmap terrains are typically rendered via LOD and heirarchical culling algorithms, neither of which are relevant to this particular example. Quite simply, the huge amounts of geometry involved in a decent terrain engine is providing the test data for the algorithms.
I've designed the C++/D3D9 source code around a Terrain::ITerrainRenderer interface with 3 "in box" subclasses, each of which taking the same source data and rendering it in a slightly different way. The use of a base/interface class intentionally leaves the door open for you to implement your own methods for additional experimentation.
As shown in the following illustration, the display has a dropdown list that you can use to swap between methods and a set of the corresponding statistics. Changing the current rendering mode will change the statistics output, but the actual image should remain identical.
A final point to be made about the previous image  the lighting model. I quite specifically chose one of the most performanceunfriendly lighting models I know of (OrenNayar) and plugged it into a vs_2_0 shader; this intentionally stresses the vertex processing units in your GPU. It also makes any differences much more apparent  a more trivial N dot L approach uses so few instructions that its difficult to notice any improvements.
You can view a disassembly of of the effect file here, noting that the vertex shader weighs in at 91 instructions. Anything to reduce the number of times we have to invoke this shader is A Good Thing(TM).
Simple Triangle Lists
This is the most basic method implemented and uses only a vertex buffer. Given that its not optimized for storage or performance its more of a baseline reference than a serious contender.
For a given terrain of M x N vertices, this method creates (M1) x (N1) "quads", each of two triangles. Each triangle is obviously made up of 3 vertices.
Triangles: 2 * (M1) x (N1)
Vertices: 6 * (M1) x (N1)
In ascii art:
++
/
/
+
and
+


++
Indexed Triangle Lists
As you should be able to tell from the previous method, a single quad has only 4 vertices  yet we were storing many more than this. For a single quad the topright and bottomleft vertices were duplicated, but if you consider the neighbouring quads then almost every vertex is duplicated four or six times. Not particularly smart!
Applying an index buffer to this allows you to store a single vertex for each unique point (giving M x N vertices in total), a substantial saving if you've got a complex vertex type (e.g. with tangents and bitangents for "normal mapping" shaders), however it also enables the GPU to use the posttransform vertex cache.
Essentially, the GPU can look at the 3 indices for an incoming triangle and compare those against the vertices currently stored in the posttransform cache. If there is a match then the GPU can take the results directly from the cache, if not it processes it as normal. A nice little speedup with no penalties  the worst case is still the same as the previous method.
Triangles: 2 * (M1) x (N1)
Vertices: M x N
Indices 6 * (M1) x (N1)
The vertices would be laid out as follows:
+ +
+ +
And the indices are used to link them together
++
/
/
+
and
+


++
The performance advantage comes when you consider multiple triangles in a row. After the first triangles in a row has been rendered, all subsequent triangles should have 2 vertices in every triangle in the posttransform cache. Therefore you can send a new triangle for rasterization by only transforming a single vertex! However, each time you start a new row you wont have any of the vertices required in the cache. As an example:
Let M = 128
Let N = 128
A total of 32,258 triangles.
For each column we have 127 quads, thus 254 triangles. The first triangle always transforms 3 vertices, but the remaining 253 triangles need only transform 1  meaning that for a given column we transform 256 vertices.
Because we have to start again at each row, we simply multiply 127 rows by 256 transforms each  giving us 32,512 transforms for the whole terrain patch. We can therefore work out that we get 1.001 triangles for every vertex we transform. That is already a 3fold improvement over the first example!
Optimized Indexed Triangle Lists
This is the method that this sample has been building up to [smile]
The previously mentioned thread contains a couple of excellent replies from namethatnobodyelsetook and JohnBolton, although neither specifically take credit for the "Priming the Vertex Cache" technique (I dont know where it originated either!). This technique simply seeks to improve upon the previous design, whilst it might be good its more a convenient arrangement for the programmer rather than one tailored to the underlying hardware.
The aforementioned thread has a great discussion of how the technique works, so I wont waste space my covering it again here. Needless to say the trick revolves around ordering the indices such that you try and get the bestcase performance from the posttransform vertex cache. Using D3D9's queries you can get the size of the vertex cache (24 in the case of my GeForce 6800) but this is almost always going to be smaller the the height/width of the terrain segment we wish to render. Thus it becomes necessary to divide into a number of strips  each strip being 2 vertices less than the known cache size. In a similar way to the previous 'Indexed Triangle List' method, each time a new strip is started we lose the use of the cache, but the fact that this is now substantially less often should be a bonus.
Whilst it might not be "realworld", taking a 12 element cache and a row size of 10:
First we render degenerate triangles (which wont appear on screen, but the vertex processing will still get invoked) to make sure that the cache contains the first row of vertices:
0 1 2 3 4 5 6 7 8 9
++ ++ ++ ++ ++
The cache therefore looks like:
[0,1,2,3,4,5,6,7,8,9,?,?]
From here we render each row of each strip in the same way as the previous techniques. The advantage being that you need only transform 1 vertex for every other triangle, and by the time we've finished one row we've got the cache's contents perfectly set up for the next row.
To work through the numbers in a similar way to the previous method:
Let M = 128
Let N = 128
A total of 32,258 triangles.
Assume a cache size of 24 (standard across Nvidia's 4,5,6 and 7 series).
Therefore we have a row of 22 vertices giving us 42 triangles. Firstly 11 degenerate triangles are rendered  each transforming 2 vertices. Next we begin the real triangle rendering, a total of 42 triangles  but we only end up transforming the 22 vertices that make up the subsequent row.
Therefore the perstrip overhead is 22 vertices, and then a further 22 vertices per row with 127 rows giving us 2,816 vertex transformations.
The tricky part comes from taking 22 vertex strips into a 128x128 grid. In total it requires 7 strips (6 full size, 1 smaller), so a simple multiplication yields 17,152 transformations. It is debatable whether you count the degenerate triangles, if you do you get 32,323 triangles or 32,258 triangles without  hardly much of a difference. Assuming you count all triangles then you get 1.885 triangles for every vertex transform. Again, a tidy 88% improvement over the previous method.
Results
The application that is provided as part of this article isn't specifically built to be a benchmarking tool, so it would be unwise to take the results presented here as any sort of definitive performance metric  equally, using this code to compare multiple GPU's is probably not a fair comparison. Instead the results should be used as a form of guidance to compare similar rendering methods. Across the same GPU you should be able to draw rough comparisons  and as shown in the following graphs its fairly clear where the differences are!
The first graph shows a comparison of the three methods, but there are two important characteristics.
 The terrain segment is set to 400x400, the largest my test hardware (GeForce 6800) can handle. The intention being to keep the GPU busy and not to add call/contextswitch overhead to the benchmark
 The cameras view is zoomed out and rotated such that the rasterizer/pixelshader is not given any work to do. The intention being to measure the vertex processing capabilities rather than get a result that may be biased by any shading/rasterization.
The above graph shows the peak throughput of the three methods. The results aren't unexpected and match the calculations previously in this article. If the first, "simple", method is taken as a baseline then the second, "indexed", is 2.7x faster  not far off the 3x faster from the initial calculations. Calculations also showed the 3rd, "optimized" method to be 60% faster  which isn't too far off the theoretical 88% improvement.
It is worth noting that the peak throughput as well as the differences between methods varies depending on the batch size  the 320,000 triangle batch used in the above example is intentionally very high. Sampling different batch sizes generates the following graph:
As can be clearly seen, the advantages of the nontrivial methods only starts to show from a batch size of 45,000 and above. The 3^{rd} optimized method only really kicks in above 125,000. It is likely that the 67 million triangles/second that the simple method routinely clocks up is actually the peak rate for the GPU  21 million vertices at 91 instructions a piece generates a cool 1.9 billion vertex shader instructions per second [oh]
Adding Additional Methods
As has been previously mentioned, the sample code provided with this article is designed to allow you to extend and compare your own methods. An obvious choice here would be to implement one of the many trianglestrip methods; general wisdom suggests you wont achieve the peak rates shown here, but it could still be a viable alternative.
Check out the ITerrainRenderer.h header file  subclass it as appropriate (use the existing 3 implementations as a reference) and then add it to the list starting on line 24 (Terrain::ITerrainRenderer *modes[] = ).
I've specifically avoiding shipping a binary with the source code so as not to get bogged down in D3DX DLL versioning problems; given the simplicity of the code you shouldn't have any problems compiling it against any of the 2005 or 2006 DirectX SDK's. You can download the source code here.
Conclusion
Whilst I take full credit for the implemention and writeup I can't take credit for the original idea and took my inspiration from this thread and its participants (if I hadn't made that clear by now [lol]). As you can probably tell by their positions in the Top50 board, if you come across posts from namethatnobodyelsetook, hplus0603 or JohnBolton they're probably worth reading [wink]
Hopefully you'll find the information I've presented in this article interesting, but better yet  take it further! Maybe it'll give you some ideas about things to look into, optimizations to try...
Comments are greatly appreciated [smile]
0
Sign in to follow this
Followers
0
14 Comments
Recommended Comments
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now