# Terrain Rendering: very high memory consumption problem


## Recommended Posts

Hi; I am currently working on a project (a strategy game) in which I create and draw a terrain using a height map. The terrain consists of 1024x1024 vertices and 1023x1023 cells (two triangles each). I use a quadtree for space partitioning and frustum testing: the terrain is divided into 4 sub-areas, and those are divided into further sub-areas until they fall below a certain triangle count limit (currently 1000). I use one big vertex buffer which holds the whole terrain geometry (every one of the 1024x1024 vertices). The leaf nodes of the quadtree hold only small index buffers, which they use to draw the portion of the terrain that lies in their bounding box.

The problem is that this program consumes too much memory while running (according to Task Manager, it varies between 250 MB and 450 MB). A vertex holds just the x, y, z coordinates and two pairs of u, v coordinates for the textures (I use multitexturing), and I use 32-bit index data in every leaf node. I keep all such data in Direct3D's managed pool (D3DPOOL_MANAGED). According to the Direct3D documentation, Direct3D keeps a copy of managed-pool data in system memory; however, switching to D3DPOOL_DEFAULT doesn't seem to solve the memory issue (it consumes barely less memory than D3DPOOL_MANAGED). On top of that, when I try to exit my game, closing the program takes nearly 2-3 minutes. (I release the big VB and the IBs in the leaf nodes of the quadtree in a recursive fashion.) I have checked my code countless times for a memory leak and haven't found one. So my question is: what could be the reason for such high memory consumption? The total size of all the IBs and the VB, together with the texture data, is obviously lower than 250 MB.

##### Share on other sites
The long closing time of your program suggests to me that either:

- You have a leak somewhere, and the debug runtime prints a message to the output panel for each unreleased object (which would make the program exit slowly).

- You're doing something a lot worse than you think in terms of the data structure.

I'd suggest that you try to time the release of the data structure, and see if the time is taken there. If it's there, try to add statistics to your quadtree (number of leaves, etc.), and see if anything looks wrong there.

BTW, assuming your terrain data is normal for such data, the indices of tiles of the same size will be the same except for a vertex offset. In this case, storing just the offset would be enough.

Although that won't help much unless you change your terrain to 1024x1024 squares (1025x1025 vertices) instead of 1023x1023, so that it will divide nicely. It may be that this is one reason your quadtree algorithm is having problems in the first place.

[Edited by - ET3D on May 11, 2008 6:39:33 AM]

##### Share on other sites
Instead of one huge vertex buffer for all 2^20 vertices, it may be a considerable saving if you break down your terrain into fixed size tiles as the first thing.

You can still use your quadtree if you want (having several tiles in one node at the upper levels), although it may not be worth it.
I'm brute forcing my terrain tiles for frustum culling and use the simplest possible mip-style LOD without worrying about stitching (skirts instead). For my needs, this works just fine, but it obviously depends a lot on what your needs (such as maximum viewing range) are.

The reason why fixed size tiles are an advantage is that you can have a vertex buffer which only contains the height, and another one that is reused for every tile (constant), containing the longitude/latitude bit of the coordinates (which you can reuse for texcoord, too). That way, you reduce your memory footprint and upload bandwidth by 2/3. It takes 3-4 lines of code for the vertex shader, but so what... it's fast.

You may have to play with the fixed tile size. Again, it depends on your needs (terrain resolution, viewing range), so there is no single correct size (but something like 65x65 is probably a good start, to keep the number of batches reasonable).

##### Share on other sites
samoth's post made me notice that you're breaking your terrain into tiles that are too small. 1000 triangles is 500 quads, which means your tiles are roughly 15x15 to 16x16. That's somewhat low if you draw each with a separate draw call, which I assume you do, since you said there's an IB per leaf.

The higher you go in size (say to 64x64), the less relevant the quadtree becomes, since just going over each tile and deciding whether to cull it won't take much CPU time (256 such tests for 64x64 tiles on a 1024x1024 terrain). I agree with samoth that this looks like a good solution.

An alternative would be to keep your current quadtree as is, but instead of using separate index buffers, keep index lists and bundle them into a dynamic index buffer. This will allow you to draw everything in one call (subject to the max primitive, etc., caps). Won't save you the memory as samoth described, but may be an easier change.

##### Share on other sites
OK. First of all, congrats on a working terrain renderer implementation. It's a great starting point, but I'm afraid you'll have to scrap a lot of code if you want to move further; that's normal in the world of gfx programming. And you're limited in how big a terrain you can handle.

You're currently consuming a massive amount of RAM/VRAM needlessly for:
- Indices
- Vertices

Indices
Since you've got just one VB, all your indices have to be unique. That is 1023x1023x2x3 ≈ 6.2M indices, which at 32 bits weigh ~24 MB. On top of that, a lot of that is duplicated at each quadtree level. I'd guess that you have at least 4 additional quadtree levels to get down to 23x23 vertices, i.e. 4x24 = 96 MB. In total you need at least 24+96 = 120 MB just for indices. This is also hurting your bandwidth, since at any time almost all quadtree levels are in use, so almost all of the 120 MB is being accessed. That's just bad.
If you had a chunk of, say, 256x256, you'd need an IB of 256x256x2x3 ≈ 0.39M indices, which weighs just 1.5 MB. And if your terrain were 32768x32768, you would still consume just 1.5 MB for indices, compared to 32x32x6.2M ≈ 6.4G indices (over 25 GB at 32 bits) in your current scheme.
Besides, you can easily create a LOD scheme this way - just create separate IBs with fewer triangles; they'll consume about 1 MB in addition, which is nothing.

Vertices
You use 7 floats per vertex (28 bytes). So, in total it's 1024x1024x28 ≈ 28 MB. There's a lot of duplicated and easily calculated data in there.
If you divided your terrain into chunks of, say, 256x256, a separate stream would hold the XZ positions and UV1/UV2 of each vertex of a chunk, i.e. 24 bytes per vertex. That stream would consume 256x256x24 ≈ 1.5 MB - but even if you wanted to render a 32768x32768 terrain, it would still consume just those 1.5 MB and no more.
Then you need a stream of YPOS for each chunk, i.e. 256x256x4 = 0.25 MB. There's no duplicate data in it, just YPOS (and usually you won't need 4 bytes; often 2 bytes are enough - i.e. a 16-bit heightmap - and you can use the remaining 2 bytes for anything else). In total you'll need 4x4 = 16 of these chunks, i.e. just 4 MB.
So in total that's just 4.25 MB for vertices. It might be a little higher depending on your texturing, but that's quite a difference compared to 28 MB.

Then, before rendering each chunk, just set a VS constant with the chunk's offset and add that offset to Stream0's XZ position inside the shader, and you're set.

Fast tip: another memory-reduction step could be to use the pixel shader and keep the heightmap as DXT-compressed images, in which case you'd consume just 0.5 byte per vertex (instead of the current 4), i.e. 1/8th of the current VB size. In our case that would be 1024x1024x0.5 = 0.5 MB for the terrain heightmap.

Not bad, if you ask me ;-) And if that is still too much, then you've got to generate the mesh procedurally on the fly and thus give a good use to those 3-7 idling cores ;-)

Textures
You could also easily spend ~90 MB on the texture, if you had a 4096x4096 colormap (at 32 bits) and created a full mipmap chain upon loading (4096x4096x4 x 4/3 ≈ 89 MB). So check your textures too, if you don't update that code often. Fast tip: use DXT and it'll consume just 1/8th of the space, i.e. ~11 MB.

##### Share on other sites
In addition to the good answers you've got so far I'll throw in a few things I've found in my numerous trips in terrain rendering:

1. Use float16 for your XYZ elements. Can be hit or miss, but I found only edge cases where the reduced resolution had a visual impact (think huge draw distances) and the 50% saving is compelling.

2. Don't bother with 32-bit IBs. Not only are they a bit of a con (most hardware only allows up to 24 bits to be used, but you always store 32 bits - an instant 25% waste), but the performance has been questionable on some generations of GPU. Switching identical data from 16 to 32 bit indices can have a significant hit. Off the top of my head you should be able to squeeze a 128x128 patch into a single 16-bit IB, which is fine.

3. As has been mentioned, look into offsets. Index data is often a repeating pattern, so you can usually store a single hierarchy of LOD data in one IB and just use offset parameters to DIP calls to get it working across the entire terrain.

4. Again, as has been suggested, skip the quadtree. Conceptually they're a perfect data structure for terrain rendering but these days linking the CPU and GPU using algorithms like this often slows performance! Firing big chunks of brute-force rendering at a GPU is usually better than having the CPU micro-manage/optimize for a smaller data set. Similar arguments go against LOD algorithms like ROAM.

hth
Jack

##### Share on other sites
Quote:
 Original post by jollyjeffers: Use float16 for your XYZ elements. Can be hit or miss, but I found only edge cases where the reduced resolution had a visual impact (think huge draw distances)
Could you explain this more? Since the XZ grid is regular, I can't see how it could become a problem: the XZ position is always just a multiple of some base spacing between two vertices, which is constant throughout the whole terrain. Maybe you were referring to the Z-buffer precision issue with huge terrains (in which case it's recommended to use a linear Z-buffer)?

Quote:
 Original post by jollyjeffers: ...and the 50% saving is compelling.
Although I always pack all 3D data, in this particular case I haven't found a difference in performance between a non-packed base stream and a packed one. Of course, the 50% saving can be decompressed for free inside the vertex shader, so it doesn't hurt to use it. But in our case, the base VB is only 1.5 MB (256x256 x 6 floats), so we would shave off only 0.75 MB. But of course, every KB counts.
Generally, it's good to use 16-bit floats for UVs, since you get a precision of ~0.01 texel, which is more than enough for common textures/uses.

Quote:
 Original post by jollyjeffers: Don't bother with 32-bit IBs. Not only are they a bit of a con (most hardware only allows up to 24 bits to be used but you always store 32 bits - instant 25% waste) but the performance has been questionable on some generations of GPU.
Which ones were those? Also, are there already cards that can use more than 24 bits of indices? Even my 7950 stops at 24 bits. Maybe the 8800 series doesn't?

Quote:
 Original post by jollyjeffers: As has been mentioned, look into offsets. Index data is often a repeating pattern, so you can usually store a single hierarchy of LOD data in one IB and just use offset parameters to DIP calls to get it working across the entire terrain.
I'd like to clarify a bit more what Jack means here. The idea is to put all LOD levels into one IB and, instead of switching among different IBs (one per LOD), just use the offset. This way one IB is enough for the whole terrain at all LODs, so you won't ever have to switch it.

##### Share on other sites
Quote:
Quote:
 Original post by jollyjeffers: Use float16 for your XYZ elements. Can be hit or miss, but I found only edge cases where the reduced resolution had a visual impact (think huge draw distances)
Could you explain this more? Since the XZ grid is regular, I can't see how it could become a problem: the XZ position is always just a multiple of some base spacing between two vertices, which is constant throughout the whole terrain. Maybe you were referring to the Z-buffer precision issue with huge terrains (in which case it's recommended to use a linear Z-buffer)?
To be honest, I can't really remember whether I found the root cause to be the use of FP16 or the more common depth buffer resolution issues. It was a pet project of mine circa 2005/2006 when I was playing with terrain that could be millions of units in size. I just have a vague memory that it was switching from FP32 to FP16 that introduced the artifacts.

Quote:
 Original post by VladR: I haven't found a difference in performance between a non-packed base stream and a packed one. Of course, the 50% saving can be decompressed for free inside the vertex shader, so it doesn't hurt to use it. But in our case, the base VB is only 1.5 MB (256x256 x 6 floats), so we would shave off only 0.75 MB. But of course, every KB counts.
I was thinking in the context of storage savings rather than any performance gain. Like yourself I've never observed a notable performance difference between the two.

The last time I was experimenting with this, the saving was on an XYZ+TBN+UV1+UV2 vertex - 40 bytes down to 20 bytes (or down to 12 bytes with a GS [grin]) - and the saving was more noticeable.

Quote:
Quote:
 Original post by jollyjeffers: Don't bother with 32-bit IBs. Not only are they a bit of a con (most hardware only allows up to 24 bits to be used but you always store 32 bits - instant 25% waste) but the performance has been questionable on some generations of GPU.
Which ones were those ?
From what I remember, it was at least the GeForce FX series and possibly some of the pre-SM3 ATI models. Back when some GPUs still had limited internal precision and/or separate hardware for different precisions.

Quote:
 Original post by VladR: Also, are there already cards that can use more than 24 bits of indices? Even my 7950 stops at 24 bits. Maybe the 8800 series doesn't?
I doubt it, but if any do, it'd be the D3D10 parts. If you think about it, a 2^24 allowance gives you 16.7 million unique vertices - that's at least 320 MB of vertex data in a single buffer! The use cases for those sorts of numbers are probably almost non-existent outside the CAD world.

Cheers,
Jack

##### Share on other sites
Quote:
 Original post by jollyjeffers: I doubt it, but if any do, it'd be the D3D10 parts. If you think about it, a 2^24 allowance gives you 16.7 million unique vertices - that's at least 320 MB of vertex data in a single buffer! The use cases for those sorts of numbers are probably almost non-existent outside the CAD world.
Well, since we're talking about terrain rendering, it could very well be just 64 MB (16.7M x 4 bytes), since 2 bytes are for a 16-bit heightmap and the remaining 2 bytes can be used for anything else, e.g. UV or XZ offset or whatever else is needed - or maybe a color (assuming lighting is done through normal maps). But I'd think that such a huge VB would just slow things down. I remember some paper from nVidia where the size of the VB directly influenced performance, and past some threshold the big VBs actually slowed things down.

Actually, I made a mistake: those 24 bits would mean you get 16.7M vertices. But I always get just up to 1,048,576 vertices, which is a relatively low number even for smaller VBs (e.g. a 4 MB VB isn't very big). I believe it's stored in Caps.MaxVertexIndex. Even on a GF7600 it's been capped at 2^20. If they finally raised it to 2^24, that's great.

##### Share on other sites
Just a simple suggestion aside from all this arguing about index counts, vertex counts, etc..

If you can use vertex textures, use them and abandon all other forms of terrain rendering.

##### Share on other sites
Quote:
 Original post by Matt Aufderheide: If you can use vertex textures, use them and abandon all other forms of terrain rendering.
The last time I looked into this was the GF6 timeframe, and whilst it technically worked, it was too slow to be a major architecture feature - good for eye candy (e.g. water surfaces) but not for a terrain.

Has the performance profile improved sufficiently since ~2005?

Jack

##### Share on other sites
Exactly. A GF6 can render a massive terrain using the traditional technique at upwards of 300-600 FPS. No chance of that with vertex texturing for the same set of data.

Plus, since you're already taxing the pixel shader units, you're left with much less performance for actual per-pixel effects, be it post-processing, normal mapping, whatever.

On the other hand, if you can declare that an 8800GT card is required just to run the game, then by all means go for it. But if you want your software to be playable on GF3-class cards too (and GF2-class for backwards compatibility), there's no other way than the traditional one.

I think the traditional technique will be with us for at least the next 5 years.

##### Share on other sites
Quote:
 Original post by samoth: The reason why fixed size tiles are an advantage is that you can have a vertex buffer which only contains the height, and another one that is reused for every tile (constant), containing the longitude/latitude bit of the coordinates (which you can reuse for texcoord, too). That way, you reduce your memory footprint and upload bandwidth by 2/3. It takes 3-4 lines of code for the vertex shader, but so what... it's fast.

I think I can figure out how to set up the multiple vertex streams on my own, but how do you access them from the shader? This approach is exactly how I want to do my own terrain, but I have come unstuck working out how to get at the height etc. data.

##### Share on other sites
When you declare the vertex format that the shader will use, you specify, for each stream, the actual registers and their size/type.

Then, inside the shader, just use the registers that you put in the vertex declaration.

However, bear in mind that having more streams can kill performance. In my latest experiments, 3 was the critical number; having 4 streams for terrain hurt performance a lot. I had to reduce the stream count to 3 and live with some duplicated data (which means bigger VBs), since the performance drop was unacceptable (400 fps down to 80 fps). And that's with cache-friendly vertex format sizes.

Anyone got an idea why that is? Maybe the HW isn't well equipped to work with 4 and more vertex streams? Maybe it can't execute instructions in advance and has to wait until all streams are fetched? BTW, it's happening on a 7950GT (but also on a 7600 and a 6600, as far as I've had a chance to test).

##### Share on other sites
Thanks very much for that. I have just had a bit of a eureka moment and finally get it! Hopefully now I will finally get my terrain component going at a sensible rate.