67rtyus

Terrain Rendering: very high memory consumption problem


Hi! I am currently working on a project (a strategy game) in which I create and draw a terrain from a height map. The terrain consists of 1024x1024 vertices and 1023x1023 cells (two triangles each). I use a quadtree for space partitioning and frustum testing: the terrain is divided into 4 sub-areas, and those are divided into new sub-areas until they fall below a certain triangle count limit (currently 1000). I use one big vertex buffer which holds the whole terrain geometry (every one of the 1024x1024 vertices). The leaf nodes of the quadtree hold just small index buffers, which they use to draw the portion of the terrain that lies in their bounding box.

The problem is that this program consumes too much memory while running (according to the Task Manager, between 250 MB and 450 MB). A vertex holds just the x,y,z coordinates and two pairs of u,v coordinates for the textures (I use multitexturing), and I use 32-bit index data in every leaf node. I keep all of this data in Direct3D's managed pool (D3DPOOL_MANAGED). According to the Direct3D documentation, Direct3D keeps a copy of managed-pool data in system memory; however, switching to D3DPOOL_DEFAULT doesn't seem to solve the memory issue (it consumes barely less memory than D3DPOOL_MANAGED). Also, when I try to exit my game, closing the program takes 2-3 minutes (I release the big VB and the IBs in the leaf nodes of the quadtree, recursively). I have checked my code countless times for a memory leak, and it seems to run properly without one.

So my question is: what could be the reason for such high memory consumption? The total size of all the IBs and the VB, together with the texture data, is obviously lower than 250 MB.
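For reference, the raw buffer sizes described above can be tallied; a quick sketch (helper names are made up, sizes assume the 28-byte vertex and one 32-bit index per triangle corner as stated):

```cpp
#include <cassert>
#include <cstddef>

// Sizes as described in the post: 1024x1024 vertices, 28-byte vertex
// (3 floats position + 2x2 floats texcoords), and 32-bit indices over
// 1023x1023 cells of two triangles each.
const std::size_t kVerts      = 1024u * 1024u;
const std::size_t kVertexSize = 7u * sizeof(float);   // 28 bytes
const std::size_t kTris       = 1023u * 1023u * 2u;
const std::size_t kIndexSize  = sizeof(unsigned int); // 32-bit

std::size_t vertexBufferBytes() { return kVerts * kVertexSize; }
std::size_t indexBufferBytes()  { return kTris * 3u * kIndexSize; }
```

That comes to roughly 28 MB of vertices plus ~24 MB of leaf indices, which indeed does not explain 250+ MB on its own. D3DPOOL_MANAGED will at least double it with its system-memory copy, but the remainder has to come from somewhere else (textures and their mip chains, allocator overhead, or a leak).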

The long closing time of your program suggests to me that either:

- You have a leak somewhere, and get prints to the output panel (which would make the program exit slowly).

- You're doing something a lot worse than you think in terms of the data structure.

I'd suggest that you try to time the release of the data structure, and see if the time is taken there. If it's there, try to add statistics to your quadtree (number of leaves, etc.), and see if anything looks wrong there.

BTW, assuming your terrain data is normal for such data, the indices of tiles of the same size will be the same except for a vertex offset. In this case, storing just the offset would be enough.

Although that won't help much unless you change your terrain to 1024x1024 squares (1025x1025 vertices) instead of 1023x1023, so that it will divide nicely. It may be that this is one reason your quadtree algorithm is having problems in the first place.
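The offset idea can be sketched like this (hypothetical helper names; one index pattern is built for an n x n quad tile and then reused for every tile by adding that tile's first-vertex offset, instead of storing a separate IB per leaf):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build the index pattern once for an n x n quad tile living inside a
// vertex grid of 'pitch' vertices per row. Indices are relative to the
// tile's top-left vertex, so the same pattern works for every tile.
std::vector<std::uint32_t> buildTilePattern(int n, int pitch) {
    std::vector<std::uint32_t> idx;
    idx.reserve(static_cast<std::size_t>(n) * n * 6);
    for (int z = 0; z < n; ++z)
        for (int x = 0; x < n; ++x) {
            std::uint32_t tl = static_cast<std::uint32_t>(z * pitch + x);
            std::uint32_t tr = tl + 1;
            std::uint32_t bl = tl + pitch;
            std::uint32_t br = bl + 1;
            idx.insert(idx.end(), { tl, bl, tr,  tr, bl, br }); // two tris per quad
        }
    return idx;
}

// The per-tile "index data" then shrinks to a single number: the offset
// of the tile's top-left vertex in the big VB, which D3D9 can add for
// you via the BaseVertexIndex parameter of DrawIndexedPrimitive.
std::uint32_t tileBaseVertex(int tileX, int tileZ, int n, int pitch) {
    return static_cast<std::uint32_t>(tileZ * n * pitch + tileX * n);
}
```

With a 1025x1025-vertex grid (pitch 1025), every same-sized tile shares one pattern and differs only in its base vertex.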

[Edited by - ET3D on May 11, 2008 6:39:33 AM]

Instead of one huge vertex buffer for all 2^20 vertices, you may see a considerable saving if you break your terrain down into fixed-size tiles as the first step.

You can still use your quadtree if you want (having several tiles in one node at the upper levels), though it may hardly be worth it.
I'm brute forcing my terrain tiles for frustum culling and use the simplest possible mip-style LOD without worrying about stitching (skirts instead). For my needs, this works just fine, but it obviously depends a lot on what your needs (such as maximum viewing range) are.

The reason why fixed size tiles are an advantage is that you can have a vertex buffer which only contains the height, and another one that is reused for every tile (constant), containing the longitude/latitude bit of the coordinates (which you can reuse for texcoord, too). That way, you reduce your memory footprint and upload bandwidth by 2/3. It takes 3-4 lines of code for the vertex shader, but so what... it's fast.

You may have to play with the fixed tile size. Again, it depends on your needs (terrain resolution, viewing range), so there is no single correct size (but something like 65x65 is probably a good start, to keep the number of batches reasonable).
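A quick check of the 2/3 figure (hypothetical helper names, using the suggested 65x65-vertex tile): once the x/z pair moves into one shared, constant stream that doubles as texcoords, only the height remains unique per tile.

```cpp
#include <cassert>
#include <cstddef>

// Per-tile positional data with a fat vertex: x, y, z = 3 floats each.
std::size_t perTileBytesFat(std::size_t verts)   { return verts * 3 * sizeof(float); }

// Split scheme: the x/z (longitude/latitude) stream is shared by every
// tile and paid for once, so the per-tile stream holds only the height.
std::size_t perTileBytesSplit(std::size_t verts) { return verts * 1 * sizeof(float); }
```

For a 65x65 tile that is ~16.5 KB of heights instead of ~49.5 KB of positions per tile, i.e. exactly the 2/3 cut in footprint and upload bandwidth described above.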

samoth's post made me notice that you're breaking your terrain into too small tiles. 1000 triangles is 500 quads, which would mean that your tiles will be 15x15 to 16x16. That's somewhat low if you draw each with a separate drawing call, which I assume you do, since you said an IB per leaf.

The higher you go in size (say to 64x64), the less relevant the quadtree is, since just going over each tile and deciding whether to cull it won't take too much CPU time (256 such tests for 64x64 tiles on a 1024x1024 terrain). I agree with samoth that this looks like a good solution.

An alternative would be to keep your current quadtree as is, but instead of using separate index buffers, keep index lists and bundle them into a dynamic index buffer. This will allow you to draw everything in one call (subject to the max primitive, etc., caps). Won't save you the memory as samoth described, but may be an easier change.
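The dynamic-IB variant might look like this (hypothetical types; each visible leaf keeps a plain CPU-side index list, and the lists are appended into one array per frame):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Leaf {
    std::vector<std::uint32_t> indices; // CPU-side index list, not a D3D buffer
};

// Gather the index lists of all frustum-visible leaves into one
// contiguous array. In D3D9 this array would then be copied into a
// D3DUSAGE_DYNAMIC index buffer (locked with D3DLOCK_DISCARD) and the
// whole terrain drawn with a single DrawIndexedPrimitive call.
std::vector<std::uint32_t> bundleVisible(const std::vector<const Leaf*>& visible) {
    std::vector<std::uint32_t> out;
    std::size_t total = 0;
    for (const Leaf* l : visible) total += l->indices.size();
    out.reserve(total);
    for (const Leaf* l : visible)
        out.insert(out.end(), l->indices.begin(), l->indices.end());
    return out;
}
```

This keeps the existing quadtree and big VB untouched; only the per-leaf IB objects go away.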

OK. First of all, congrats on a working terrain renderer implementation. It's a great starting point, but I'm afraid you'll have to scrap a lot of code if you want to move further; that's normal in the world of gfx programming. And you're limited in how big a terrain you can handle.

You're currently consuming a massive amount of RAM/VRAM needlessly for:
- Indices
- Vertices

Indices
Since you've got just one VB, all your indices have to be unique. That is 1023x1023x2x3 = 6.2M indices, which at 32 bits weigh ~24 MB. On top of that, if indices are stored at more than just the leaf level, a lot of that is duplicated with each quadtree level. I'd guess you have at least 4 additional quadtree levels to get down to leaves of ~23x23 vertices, i.e. 4x24 = 96 MB; in total at least 24+96 = 120 MB just for indices. This also eats bandwidth, since at any time almost all quadtree levels are in use, so almost all of those 120 MB are being accessed. That's just bad.
If you had a chunk of, say, 256x256, you'd need an IB of 256x256x2x3 = 0.393M indices, which weighs just 1.5 MB. And if your terrain were 32768x32768, you would still consume just 1.5 MB for indices, compared to 32x32 x 6.2M ≈ 6.4 billion indices (~25 GB) in your current scheme.
Besides, you can easily build a LOD scheme this way - just create separate IBs with fewer triangles; it'll consume about 1 MB in addition, which is nothing.
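The extra LOD cost is easy to bound (hypothetical helper; a mip-style chain where each level halves the quad count per side of one shared 256x256-quad chunk pattern):

```cpp
#include <cassert>
#include <cstddef>

// Bytes of 32-bit index data added by 'levels' coarser LODs below the
// full-resolution pattern of a quads x quads chunk.
std::size_t lodChainExtraBytes(std::size_t quads, int levels) {
    std::size_t total = 0;
    for (int i = 0; i < levels; ++i) {
        quads /= 2;                       // halve resolution per level
        total += quads * quads * 2 * 3 * 4; // tris * 3 indices * 4 bytes
    }
    return total;
}
```

Four extra levels under a 256x256 chunk come to ~0.5 MB, comfortably within the "about 1 MB in addition" estimate (the geometric series converges to a third of the base IB).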

Vertices
You use 7 floats per vertex (28 bytes). So, in total, it's 1024x1024x28 = 28 MB. There's a lot of duplicated and easily calculated data in there.
If you divided your terrain into chunks of, say, 256x256, a separate stream would hold the XZ positions and UV1/UV2 of each vertex of the chunk, i.e. 24 bytes per vertex. That stream would consume 256x256x24 = 1.5 MB, but even if you wanted to render a terrain of 32768x32768, it would still consume just those 1.5 MB and not more.
Then you need a stream of YPOS for each chunk, i.e. 256x256x4 = 0.25 MB. There's no duplicate data in it, just YPOS (and usually you won't need 4 bytes - often 2 bytes, i.e. a 16-bit heightmap, are enough, and you can use the remaining 2 bytes for anything else). In total you'll need 4x4 = 16 of these chunks, i.e. just 4 MB.
So, in total, that's just 1.5+4 = 5.5 MB for vertices. It might be a little higher depending on your texturing, but that's quite a difference compared to 28 MB.
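The vertex arithmetic above, as a checkable sketch (256x256-vertex chunks; shared XZ+UV1+UV2 stream at 24 bytes per vertex, per-chunk height stream at 4 bytes):

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kChunkVerts = 256u * 256u;

// One shared stream: x,z + uv1 + uv2 = 6 floats = 24 bytes per vertex,
// allocated once no matter how many chunks the terrain has.
std::size_t sharedStreamBytes() { return kChunkVerts * 24; }

// One height stream per chunk: 4 bytes per vertex (2 would do with a
// 16-bit heightmap). A 1024x1024 terrain needs 4x4 = 16 chunks.
std::size_t heightStreamsBytes(std::size_t chunks) { return chunks * kChunkVerts * 4; }
```

That is ~1.5 MB shared plus ~4 MB of heights for the whole 1024x1024 terrain, versus ~28 MB for the monolithic 28-byte-vertex buffer.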

Then, before rendering each chunk, just set a VS constant with the chunk's offset, add that offset to Stream0's XZ position inside the shader, and you're set.

Fast tip: another memory-reduction step could be to use the pixel shader and keep the heightmap as DXT-compressed images, in which case you'd consume just 0.5 bytes per vertex (instead of the current 4), i.e. 1/8th of the current size of the VBs. In our case that would be 1024x1024x0.5 = 0.5 MB for the terrain heightmap.

Not bad, if you ask me ;-) And if that is still too much, then you've got to generate the mesh procedurally on the fly and thus give a good use to those 3-7 idling cores ;-)

Textures
You could also easily spend ~90 MB on the texture if you had a 4096x4096 colormap (in 32 bits) and created a full mipmap chain (4096x4096x4 x 4/3) upon loading. So check the textures too, if you haven't touched that code in a while. Fast tip: use DXT and it'll consume just 1/8th of the space, i.e. ~11 MB.

In addition to the good answers you've got so far I'll throw in a few things I've found in my numerous trips in terrain rendering:

  1. Use float16 for your XYZ elements. Can be hit or miss, but I found only edge cases where the reduced resolution had a visual impact (think huge draw distances) and the 50% saving is compelling.

  2. Don't bother with 32-bit IBs. Not only are they a bit of a con (most hardware only allows up to 24 bits to be used but you always store 32 bits - an instant 25% waste), but the performance has been questionable on some generations of GPU: switching identical data from 16- to 32-bit indices can cause a significant hit. Off the top of my head, you should be able to squeeze a 128x128 patch into a single 16-bit IB, which is fine.

  3. As has been mentioned, look into offsets. Index data is often a repeating pattern, so you can usually store a single hierarchy of LOD data in a single IB and just use offset parameters in DIP calls to get it working across the entire terrain.

  4. Again, as has been suggested, skip the quadtree. Conceptually they're a perfect data structure for terrain rendering but these days linking the CPU and GPU using algorithms like this often slows performance! Firing big chunks of brute-force rendering at a GPU is usually better than having the CPU micro-manage/optimize for a smaller data set. Similar arguments go against LOD algorithms like ROAM.
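Point 1 above amounts to converting vertex elements at build time. A minimal float32-to-float16 packer for illustration (truncating, denormals flushed to zero - a real tool would round to nearest, but this shows where the precision goes):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack an IEEE-754 float into a 16-bit half (1 sign, 5 exponent,
// 10 mantissa bits). Keeping only the top 10 mantissa bits is exactly
// the 50% size cut - and the resolution loss - being discussed.
std::uint16_t floatToHalf(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, 4);
    std::uint16_t sign = static_cast<std::uint16_t>((bits >> 16) & 0x8000u);
    std::int32_t  exp  = static_cast<std::int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    std::uint32_t man  = (bits >> 13) & 0x3FFu;  // top 10 mantissa bits
    if (exp <= 0)  return sign;                  // underflow -> signed zero
    if (exp >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u); // -> inf
    return static_cast<std::uint16_t>(sign | (exp << 10) | man);
}
```

With only 10 mantissa bits, integer grid coordinates stay exact up to 2048, which is why the artifacts tend to appear only at very large draw distances.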


hth
Jack

Quote:
Original post by jollyjeffers
  1. Use float16 for your XYZ elements. Can be hit or miss, but I found only edge cases where the reduced resolution had a visual impact (think huge draw distances)
Could you explain this more? Since the XZ grid is regular, I can't see how it could become a problem: the XZ position is always just a multiple of some base spacing between two vertices, which is constant throughout the whole terrain. Maybe you were referring to the Z-buffer precision issue with huge terrains (in which case it's recommended to use a linear Z-buffer)?

Quote:
Original post by jollyjeffers
and the 50% saving is compelling.
Although I always pack all 3D data, in this particular case I haven't found a performance difference between a non-packed base stream and a packed one. Of course, the 50% saving can be decompressed for free inside the vertex shader, so it doesn't hurt to use it. But in our case the base VB is only 1.5 MB (256x256x6 floats), so we would shave off only 0.75 MB. Then again, every KB counts.
Generally, it's good to use 16-bit floats for UVs, since you get a precision of 0.01 texel, which is more than enough for common textures/uses.

Quote:
Original post by jollyjeffers
Don't bother with 32-bit IBs. Not only are they a bit of a con (most hardware only allows up to 24 bits to be used but you always store 32 bits - instant 25% waste) but the performance has been questionable on some generations of GPU.
Which ones were those? Also, are there already cards that can use more than 24 bits of indices? Even my 7950 halts at 24 bits. Maybe the 8800 series doesn't?


Quote:
Original post by jollyjeffers
As has been mentioned, look into offsets. Index data is often a repeating pattern, so you can usually store a single hierarchy of LOD data in a single IB and just use offset parameters in DIP calls to get it working across the entire terrain.
I'd like to clarify a bit more what Jack means here. The idea is to put all LOD levels into one IB and, instead of switching among different IBs (one per LOD), just use the offset. This way one IB is enough for the whole terrain at all LODs, so you won't ever have to switch it.
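In sketch form (hypothetical layout: all LOD index patterns of a chunk concatenated once, so that each draw only picks a start offset and primitive count for the DIP call):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LodRange { std::size_t startIndex; std::size_t triCount; };

// Concatenate the index patterns of every LOD of one chunk into a single
// buffer and remember where each level starts. At draw time, pick a
// level and pass (StartIndex, PrimitiveCount) to DrawIndexedPrimitive -
// the index buffer itself is never switched.
std::vector<LodRange> packLods(std::size_t quads, int levels,
                               std::vector<std::uint16_t>& ib) {
    std::vector<LodRange> ranges;
    for (int l = 0; l < levels; ++l, quads /= 2) {
        LodRange r{ ib.size(), quads * quads * 2 };
        // Real code would emit the actual triangle indices here; the
        // placeholder zeros just reserve the right amount of space.
        ib.resize(ib.size() + r.triCount * 3, 0);
        ranges.push_back(r);
    }
    return ranges;
}
```

A 128x128-quad chunk (which still fits 16-bit indices) with three levels packs its full-detail pattern at offset 0 and each coarser pattern immediately after.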

Quote:
Original post by VladR
Quote:
Original post by jollyjeffers
1. Use float16 for your XYZ elements. Can be hit or miss, but I found only edge cases where the reduced resolution had a visual impact (think huge draw distances)
Could you explain this more? Since the XZ grid is regular, I can't see how it could become a problem: the XZ position is always just a multiple of some base spacing between two vertices, which is constant throughout the whole terrain. Maybe you were referring to the Z-buffer precision issue with huge terrains (in which case it's recommended to use a linear Z-buffer)?
To be honest, I can't really remember whether I found the root cause to be the use of FP16 or the more common depth buffer resolution issues. It was a pet project of mine circa 2005/2006, when I was playing with terrain that could be millions of units in size. I just have a vague memory that it was switching from FP32 to FP16 that introduced the artifacts.

Quote:
Original post by VladR
I haven't found a difference in performance between a non-packed base stream and a packed one. Of course, the 50% saving can be decompressed for free inside the vertex shader, so it doesn't hurt to use it. But in our case, the base VB is only 1.5 MB (256x256x6 floats), so we would shave off only 0.75 MB. But of course, every KB counts.
I was thinking in the context of storage savings rather than any performance gain. Like yourself, I've never observed a notable performance difference between the two.

The last time I was experimenting with this, the saving was on an XYZ+TBN+UV1+UV2 vertex - 40 bytes down to 20 bytes (or down to 12 bytes with a GS [grin]) - and the saving was more noticeable.

Quote:
Original post by VladR
Quote:
Original post by jollyjeffers
Don't bother with 32-bit IBs. Not only are they a bit of a con (most hardware only allows up to 24 bits to be used but you always store 32 bits - instant 25% waste) but the performance has been questionable on some generations of GPU.
Which ones were those?
From what I remember, it was at least the GeForce FXs and possibly some of the pre-SM3 ATI models - back when some GPUs still had limited internal precision and/or separate hardware for different precisions.

Quote:
Original post by VladR
Also, are there already cards that can use more than 24 bits of indices? Even my 7950 halts at 24 bits. Maybe the 8800 series doesn't?
I doubt it, but if any do, it'd be the D3D10 parts. If you think about it, a 2^24 allowance gives you 16.7 million unique vertices - that's at least 320 MB of vertex data in a single buffer! The use cases for those sorts of numbers are probably almost non-existent outside of the CAD world.


Cheers,
Jack

Quote:
Original post by jollyjeffers
I doubt it, but if any do, it'd be the D3D10 parts. If you think about it, a 2^24 allowance gives you 16.7 million unique vertices - that's at least 320 MB of vertex data in a single buffer! The use cases for those sorts of numbers are probably almost non-existent outside of the CAD world.
Well, since we're talking about terrain rendering, it could very well be just 64 MB (16.7M x 4 bytes), since 2 bytes go to a 16-bit heightmap and the remaining 2 bytes can be used for anything else, e.g. UV or XZ offset or whatever else is needed - or maybe a color (assuming lighting is done through normal maps). But I'd think such a huge VB would just slow things down. I remember some paper from nVidia where the size of the VB directly influenced performance, and beyond some threshold big VBs actually slowed things down.

Actually, I made a mistake above: those 24 bits would indeed mean 16.7M vertices. But I always get just up to 1,048,576 vertices, which is a relatively low number even for smaller VBs (e.g. a 4 MB VB isn't very big). I believe it's reported in Caps.MaxVertexIndex. Even on a GF7600 it's been capped at 2^20. If they finally raised it to 2^24, that's great.

Quote:
Original post by Matt Aufderheide
If you can use vertex textures, use them and abandon all other forms of terrain rendering.
The last time I looked into this was the GF6 timeframe, and whilst it technically worked, it was too slow to be a major architecture feature - good for eye candy (e.g. water surfaces) but not for a terrain.

Has the performance profile improved sufficiently since ~2005?

Jack

Exactly. A GF6 can render a massive terrain using the traditional technique at frame rates in excess of 300-600 fps. No chance of that with vertex texturing for the same set of data.

Plus, since you're already taxing the pixel shader units, you're left with much less performance for actual per-pixel effects, be it post-processing, normal mapping, whatever.

On the other hand, if you can declare that an 8800GT card is required just to run the game, then by all means go for it. But if you want your software to be playable on GF3-class cards too (and GF2-class for backwards compatibility), there's no way other than the traditional one.

I think the traditional technique will be with us for the next 5 years at least.

Quote:
Original post by samoth
The reason why fixed size tiles are an advantage is that you can have a vertex buffer which only contains the height, and another one that is reused for every tile (constant), containing the longitude/latitude bit of the coordinates (which you can reuse for texcoord, too). That way, you reduce your memory footprint and upload bandwidth by 2/3. It takes 3-4 lines of code for the vertex shader, but so what... it's fast.

I think I can figure out how to set up the multiple vertex streams on my own, but how do you access them via the shader? This approach is exactly how I want to do my own terrain, but I have come unstuck working out how to get at the height data.

When you declare the vertex format that the shader will use, you specify, per stream, the actual registers and their size/type.

Then, inside the shader, just use the registers that you put in the vertex declaration.

However, bear in mind that having more streams can kill performance. In my latest experiments, 3 was the critical number: using 4 streams for the terrain hurt performance a lot. I had to reduce the stream count to 3 and live with some duplicated data (which means bigger VBs), since the performance drop was unacceptable (400 fps down to 80 fps). And that's with cache-friendly vertex format sizes.

Anyone got an idea why that is? Maybe the hardware isn't well equipped for working with 4 or more vertex streams? Maybe it can't execute instructions in advance and has to wait until all streams are fetched? BTW, it's happening on a 7950GT (but also on a 7600 and a 6600, as far as I've had a chance to test).

