Today I made my GPU cry. Again... :-)

Published May 03, 2006
Advertisement
I've been a bit busy lately, had to finish off and hand-in my final year dissertation (See my Image Of The Day) and all sorts of fun things. In my new found spare time I put together this mini-article and sample code for you guys...
An Example of Optimal Terrain Rendering

I've always been interested in terrain rendering, not only are the end results particularly pretty (browse the Image Of The Day archives for examples) but they're also quite easy to program and often prove to be very good examples of many common graphics algorithms.

Whilst reading through this thread I thought "I want to try that". So I did [smile]

The basic idea

Geometry is built from a number of vertices, each vertex has to be transformed by the GPU in order for it to be correctly presented on the screen. The more vertices you have, the more work the GPU does and the slower the frame-rate. Simple really!

Direct3D offers a neat optimization for this process - Index Buffers. These can offer a simple reduction in VRAM storage due to eliminating duplicate vertex-buffer data, but they also enable the "Vertex Cache" on your GPU. Making use of the cache can mean that the GPU doesn't need to transform every vertex of every triangle - it can load results of previous transforms from the cache and skip almost all processing. The key is in making efficient use of the cache and maximizing the "hit rate" so we get as many triangles from as little work as possible.

For this example I'm using a simple height-map based terrain engine - a very simple method that most people will be familiar with. If not, this article might be a good start (if not, try google!).

Height-map terrains are typically rendered via LOD and heirarchical culling algorithms, neither of which are relevant to this particular example. Quite simply, the huge amounts of geometry involved in a decent terrain engine is providing the test data for the algorithms.

I've designed the C++/D3D9 source code around a Terrain::ITerrainRenderer interface with 3 "in box" sub-classes, each of which taking the same source data and rendering it in a slightly different way. The use of a base/interface class intentionally leaves the door open for you to implement your own methods for additional experimentation.

As shown in the following illustration, the display has a drop-down list that you can use to swap between methods and a set of the corresponding statistics. Changing the current rendering mode will change the statistics output, but the actual image should remain identical.



A final point to be made about the previous image - the lighting model. I quite specifically chose one of the most performance-unfriendly lighting models I know of (Oren-Nayar) and plugged it into a vs_2_0 shader; this intentionally stresses the vertex processing units in your GPU. It also makes any differences much more apparent - a more trivial N dot L approach uses so few instructions that its difficult to notice any improvements.

You can view a disassembly of of the effect file here, noting that the vertex shader weighs in at 91 instructions. Anything to reduce the number of times we have to invoke this shader is A Good Thing(TM).

Simple Triangle Lists

This is the most basic method implemented and uses only a vertex buffer. Given that its not optimized for storage or performance its more of a base-line reference than a serious contender.

For a given terrain of M x N vertices, this method creates (M-1) x (N-1) "quads", each of two triangles. Each triangle is obviously made up of 3 vertices.

Triangles: 2 * (M-1) x (N-1)
Vertices: 6 * (M-1) x (N-1)

In ascii art:

+--+  / /+


and

   +   |   |+--+


Indexed Triangle Lists

As you should be able to tell from the previous method, a single quad has only 4 vertices - yet we were storing many more than this. For a single quad the top-right and bottom-left vertices were duplicated, but if you consider the neighbouring quads then almost every vertex is duplicated four or six times. Not particularly smart!

Applying an index buffer to this allows you to store a single vertex for each unique point (giving M x N vertices in total), a substantial saving if you've got a complex vertex type (e.g. with tangents and bitangents for "normal mapping" shaders), however it also enables the GPU to use the post-transform vertex cache.

Essentially, the GPU can look at the 3 indices for an incoming triangle and compare those against the vertices currently stored in the post-transform cache. If there is a match then the GPU can take the results directly from the cache, if not it processes it as normal. A nice little speed-up with no penalties - the worst case is still the same as the previous method.

Triangles: 2 * (M-1) x (N-1)
Vertices: M x N
Indices 6 * (M-1) x (N-1)

The vertices would be laid out as follows:

+  ++  +


And the indices are used to link them together

+--+  / /+


and

   +   |   |+--+


The performance advantage comes when you consider multiple triangles in a row. After the first triangles in a row has been rendered, all subsequent triangles should have 2 vertices in every triangle in the post-transform cache. Therefore you can send a new triangle for rasterization by only transforming a single vertex! However, each time you start a new row you wont have any of the vertices required in the cache. As an example:

Let M = 128
Let N = 128

A total of 32,258 triangles.

For each column we have 127 quads, thus 254 triangles. The first triangle always transforms 3 vertices, but the remaining 253 triangles need only transform 1 - meaning that for a given column we transform 256 vertices.

Because we have to start again at each row, we simply multiply 127 rows by 256 transforms each - giving us 32,512 transforms for the whole terrain patch. We can therefore work out that we get 1.001 triangles for every vertex we transform. That is already a 3-fold improvement over the first example!

Optimized Indexed Triangle Lists

This is the method that this sample has been building up to [smile]

The previously mentioned thread contains a couple of excellent replies from namethatnobodyelsetook and JohnBolton, although neither specifically take credit for the "Priming the Vertex Cache" technique (I dont know where it originated either!). This technique simply seeks to improve upon the previous design, whilst it might be good its more a convenient arrangement for the programmer rather than one tailored to the underlying hardware.

The aforementioned thread has a great discussion of how the technique works, so I wont waste space my covering it again here. Needless to say the trick revolves around ordering the indices such that you try and get the best-case performance from the post-transform vertex cache. Using D3D9's queries you can get the size of the vertex cache (24 in the case of my GeForce 6800) but this is almost always going to be smaller the the height/width of the terrain segment we wish to render. Thus it becomes necessary to divide into a number of strips - each strip being 2 vertices less than the known cache size. In a similar way to the previous 'Indexed Triangle List' method, each time a new strip is started we lose the use of the cache, but the fact that this is now substantially less often should be a bonus.

Whilst it might not be "real-world", taking a 12 element cache and a row size of 10:

First we render degenerate triangles (which wont appear on screen, but the vertex processing will still get invoked) to make sure that the cache contains the first row of vertices:

0  1  2  3  4  5  6  7  8  9+--+  +--+  +--+  +--+  +--+


The cache therefore looks like:

[0,1,2,3,4,5,6,7,8,9,?,?]

From here we render each row of each strip in the same way as the previous techniques. The advantage being that you need only transform 1 vertex for every other triangle, and by the time we've finished one row we've got the cache's contents perfectly set up for the next row.

To work through the numbers in a similar way to the previous method:

Let M = 128
Let N = 128

A total of 32,258 triangles.

Assume a cache size of 24 (standard across Nvidia's 4,5,6 and 7 series).

Therefore we have a row of 22 vertices giving us 42 triangles. Firstly 11 degenerate triangles are rendered - each transforming 2 vertices. Next we begin the real triangle rendering, a total of 42 triangles - but we only end up transforming the 22 vertices that make up the subsequent row.

Therefore the per-strip overhead is 22 vertices, and then a further 22 vertices per row with 127 rows giving us 2,816 vertex transformations.

The tricky part comes from taking 22 vertex strips into a 128x128 grid. In total it requires 7 strips (6 full size, 1 smaller), so a simple multiplication yields 17,152 transformations. It is debatable whether you count the degenerate triangles, if you do you get 32,323 triangles or 32,258 triangles without - hardly much of a difference. Assuming you count all triangles then you get 1.885 triangles for every vertex transform. Again, a tidy 88% improvement over the previous method.

Results

The application that is provided as part of this article isn't specifically built to be a benchmarking tool, so it would be unwise to take the results presented here as any sort of definitive performance metric - equally, using this code to compare multiple GPU's is probably not a fair comparison. Instead the results should be used as a form of guidance to compare similar rendering methods. Across the same GPU you should be able to draw rough comparisons - and as shown in the following graphs its fairly clear where the differences are!

The first graph shows a comparison of the three methods, but there are two important characteristics.

  1. The terrain segment is set to 400x400, the largest my test hardware (GeForce 6800) can handle. The intention being to keep the GPU busy and not to add call/context-switch overhead to the benchmark
  2. The cameras view is zoomed out and rotated such that the rasterizer/pixel-shader is not given any work to do. The intention being to measure the vertex processing capabilities rather than get a result that may be biased by any shading/rasterization.




The above graph shows the peak throughput of the three methods. The results aren't unexpected and match the calculations previously in this article. If the first, "simple", method is taken as a base-line then the second, "indexed", is 2.7x faster - not far off the 3x faster from the initial calculations. Calculations also showed the 3rd, "optimized" method to be 60% faster - which isn't too far off the theoretical 88% improvement.

It is worth noting that the peak throughput as well as the differences between methods varies depending on the batch size - the 320,000 triangle batch used in the above example is intentionally very high. Sampling different batch sizes generates the following graph:



As can be clearly seen, the advantages of the non-trivial methods only starts to show from a batch size of 45,000 and above. The 3rd optimized method only really kicks in above 125,000. It is likely that the 6-7 million triangles/second that the simple method routinely clocks up is actually the peak rate for the GPU - 21 million vertices at 91 instructions a piece generates a cool 1.9 billion vertex shader instructions per second [oh]

Adding Additional Methods

As has been previously mentioned, the sample code provided with this article is designed to allow you to extend and compare your own methods. An obvious choice here would be to implement one of the many triangle-strip methods; general wisdom suggests you wont achieve the peak rates shown here, but it could still be a viable alternative.

Check out the ITerrainRenderer.h header file - subclass it as appropriate (use the existing 3 implementations as a reference) and then add it to the list starting on line 24 (Terrain::ITerrainRenderer *modes[] = ).

I've specifically avoiding shipping a binary with the source code so as not to get bogged down in D3DX DLL versioning problems; given the simplicity of the code you shouldn't have any problems compiling it against any of the 2005 or 2006 DirectX SDK's. You can download the source code here.

Conclusion

Whilst I take full credit for the implemention and write-up I can't take credit for the original idea and took my inspiration from this thread and its participants (if I hadn't made that clear by now [lol]). As you can probably tell by their positions in the Top-50 board, if you come across posts from namethatnobodyelsetook, hplus0603 or JohnBolton they're probably worth reading [wink]

Hopefully you'll find the information I've presented in this article interesting, but better yet - take it further! Maybe it'll give you some ideas about things to look into, optimizations to try...

Comments are greatly appreciated [smile]
Previous Entry Initial results...
Next Entry Jack: one D3DX: nil
0 likes 14 comments

Comments

Iced_Eagle
Holy cow amazing article!! :D

I'm still reading it, but just wanted to say this was an awesome article...
May 03, 2006 09:07 PM
ET3D
Nice demo.

I had to remove several lines of 'InheritedPropertySheets="C:\Visual Studio 8\VC\VCProjectDefaults\UpgradeFromVC71.vsprops' to get the project to load.

Also, the demo crashes for me when size is over 512x512. The default 50x50 doesn't show any improvement with any method, so it's not a good size.
May 04, 2006 12:35 AM
ET3D
As I suspected, using D3DXOptimizeFaces is more optimal than the manual optimisation you've included.

Here are some measurements on my Radeon 9800 Pro:

400x400:
Simple: 6.05M t/s
Indexed: 18.47M t/s
Optimized: 28.67M or 28.99M t/s (90-91 fps)
D3DX: 30.88M t/s

350x350:
Simple: 6.09M t/s
Indexed: 16-20M t/s
Optimized: 26-30M t/s
D3DX: 29-31M t/s

300x300:
Simple: 6.08M t/s
Indexed: 16-20M t/s
Optimized: 26-29M t/s
D3DX: 28-31M t/s

200x200:
Simple: 5-6.5M t/s
Indexed: 16-18M t/s
Optimized: 19M t/s
D3DX: 19M t/s


Not hugely significant, but a 5-10% improvement is still worthwhile. Plus it's simpler to just use the D3DX function instead of doing special tricks.

Note: The code couldn't find the vertex cache on my hardware, so used 10, which is less than optimal (which is 14). I suspect that on the GeForce the difference will be less, since strips will be wider (as it has a larger cache). I sent you the updated sample, and I'll be happy to see what numbers you get.

Note also that I haven't used D3DXOptimizeVertices, which optimises vertex loading, and changes the order of vertices in the vertex buffer itself. This could possibly improve things a little further.
May 04, 2006 01:17 AM
jollyjeffers
Thanks for the comments... [smile]

Eyal - I got your email, and I'll see about adding an updated journal entry to cover my results later.

I do wonder why the query doesn't find the correct cache size for ATI hardware though. Oh well [headshake]

Cheers,
Jack
May 04, 2006 07:10 AM
ET3D
The query seems to not exist. This is strange, as it's a really basic query that doesn't seem to need real hardware support.

Anyway, I sent you a new version with a further method which renders the terrain in small batches (5000 triangles), and strangely seems to work a tad faster (1 fps difference) than the D3DX optimised method.
May 04, 2006 07:42 AM
Washu
That actually has been covered before, see here: My own little DirectX FAQ, see the Vertex cache optimisation entry, and previous ones. [grin] If you hung out with me more on #graphicsdev, you would have already known about that!
May 04, 2006 10:56 AM
Cypher19
Quote:That actually has been covered before, see here: My own little DirectX FAQ, see the Vertex cache optimisation entry, and previous ones. If you hung out with me more on #graphicsdev, you would have already known about that!


I'm on #dev all of the time and I didn't know that *scratches head*
May 04, 2006 05:57 PM
jollyjeffers
hehe, I've not kept up to date with recent entries - but I have read every bit of Tom's DX FAQ on a couple of occasions. The guy certainly knows his stuff [grin]

I have to admit that I didn't really like the IRC "format" on my brief exploration a while back. Its also another time sink that I dont really need [grin]

Jack
May 05, 2006 08:30 AM
X3
and what was the purpose of test ? :) well, we all know that we should use indexed geometry and optimize it (d3d optimized mesh sample)... so i dont see any connection with real-world terrain rendering techniques....
May 06, 2006 12:58 PM
ET3D
Perhaps "we all know", but there are a lot of people not included in this "we". There are a lot of people who are learning Direct3D and find it helpful to be shown what effect optimisations have. So it's good we have people like Jack to write samples.
May 06, 2006 04:32 PM
jollyjeffers
Quote:and what was the purpose of test ? :) well, we all know that we should use indexed geometry and optimize it (d3d optimized mesh sample)... so i dont see any connection with real-world terrain rendering techniques....
In addition to what ET3D replied with...

Where did you find out that you should optimize your layout? Way back when you were learning - did you know why you had to optimize the layout? Did you know what happened if you didn't optimize the layout? What exactly does this "black box" D3DX optimize function do?

The "best practices" documents/samples are great, but sometimes its useful to have a "what if" comparison. The more experienced developers could just say "DO THIS".. "DO THAT"... "DONT DO THAT"... and the n00b learners might well follow their advice.

Alternatively we could point people at a sample like this and say "Well, you should do it this way, look what happens if you dont..." Hopefully then they can realise for themselves why the more experienced guy told them to do X in a certain way...

Cheers,
Jack
May 06, 2006 06:36 PM
jollyjeffers
Quote:i was just confused with that header
Ok - I will concede that the choice of title could have been better [smile]

fwiw, I was planning on doing more than I've already written about. Provided I get the time for that then it will be more relevant/related to terrain..

If you've read this journal entry, you might want to check out this one as it includes updated code and results!

Cheers,
Jack
May 11, 2006 03:55 AM
Arex
I took a quick look at your code. Code :

while( S_FALSE == pVCacheQuery->GetData( &VCacheStats, sizeof( D3DDEVINFO_VCACHE ), D3DGETDATA_FLUSH ) );

Bad, bad.. :)

This is might end up as an endless loop, try running it like 16 times or so, but you should never ever do things like this. :)
October 20, 2006 07:37 AM
You must log in to join the conversation.
Don't have a GameDev.net account? Sign up!
Advertisement
Advertisement