Norman Barrows

instancing issues



 

DX9.0c, shader model 2 code. Hardware instancing.

 

just got shader_test5 running - instancing with texturing and lighting and alpha test.

 

I can draw 14,400 instances of a 30-40 vertex plant mesh with unique worldmats, no problem: 62 fps rock solid with vsync on, and no culling at all.

 

the problem seems to be the size of the instance data.

 

Kick it up from 120*120 to 130*130 = 16,900 instances and it blows up.

 

I create an array of 4x4 worldmats, set the mats, then copy them to a VB.

 

the array is currently a local variable allocated on the stack.

 

130*130 worldmats is an array a little over 1 meg (16,900 mats * 64 bytes each). Would that make it blow up? Win32... I wouldn't think so.

 

a concern is that my terrain chunks are 300*300 in size. so far i can't get over 120*120.

 

 

ideas:

 

put the array on the heap.

 

lock the vb and set it directly, removing the need for an array.

 

pass x, y, z, and yr (all I really need) and compute the worldmat from those in the shader. That saves me 12 floats per instance. I assume HLSL under SM 2 or 3 can do that, correct?

 

use smaller 100x100 "grass chunks" that are separate from terrain chunks.

 

ideally, i'd like to have plants just be part of a terrain chunk, with 300*300 = 90,000 instances per chunk.

 

But that's 90,000 instances * at least 4 floats per instance * 4 bytes per float = 1.44 million bytes. What's the max size of a VB?

 

is it just me or does 16,900 seem a little low to be running out of ram?

 

 

EDIT: OK, looks like there's a 1 meg default limit on the stack. That answers that. Well, I have a few gig of heap to work with (working set size of about 700-800 meg the last time I looked), and that would be preferable to increasing the stack. In general, I would prefer a solution that doesn't require "big memory", but I have the megs available to burn if I need to use them.
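Something like this is what the heap option would look like (just a sketch: build_instance_mats and the translation-only fill are stand-ins for my actual plant placement code):

#include <d3dx9.h>
#include <vector>

// sketch: put the instance worldmats on the heap instead of the stack.
// 130*130 = 16,900 D3DXMATRIX * 64 bytes = a little over 1 meg, which busts
// the 1 meg default stack but is nothing on the heap.
void build_instance_mats(std::vector<D3DXMATRIX> &mats, int width)
{
    mats.resize(width * width);                      // heap allocation
    for (int i = 0; i < width * width; i++)
        D3DXMatrixTranslation(&mats[i], (float)(i % width), 0.0f, (float)(i / width));
    // ...then lock the VB and copy &mats[0] into it as before...
}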

 

But i would think that instance bandwidth would be a greater issue.

 

Should I just pass x, y, z, and yr? Can HLSL create a worldmat from those params? Premature optimization?
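If so, the instance stream shrinks from four float4s (a full worldmat) down to one. A sketch of the declaration side (the stream 0 layout here assumes a position/normal/uv vertex, and create_instance_decl is a made-up name; the vertex shader would rebuild the rotation/translation from yr via sincos):

#include <d3d9.h>

// sketch: per-instance data reduced to one float4 = (x, y, z, yr).
// stream 0 = the plant mesh verts, stream 1 = per-instance data.
const D3DVERTEXELEMENT9 g_decl[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,   0 },
    { 0, 24, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    { 1,  0, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 }, // x,y,z,yr
    D3DDECL_END()
};

IDirect3DVertexDeclaration9 *create_instance_decl(IDirect3DDevice9 *dev)
{
    IDirect3DVertexDeclaration9 *decl = NULL;
    dev->CreateVertexDeclaration(g_decl, &decl);
    return decl;
}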

 

any problem with locking and editing the vb directly? this is done before the render loop in the test. in the game this would be foreground or background chunk generation.

 

what about max VB size? check caps? 

 

time to google!

Edited by Norman Barrows


any problem with locking and editing the vb directly? this is done before the render loop in the test. in the game this would be foreground or background chunk generation.

Writing into mapped buffers directly is perfectly fine. However, writing via memcpy is the safest way to go.
The memory pages of mapped buffers are often configured in write-combine cache mode instead of the typical write-back cache mode, which means that any read instructions will completely destroy performance.
 
The reason I say memcpy is preferred is that you can be sure it will only write to the buffer and not read from it. It's also going to do a contiguous write with no holes. These things make write-combining happy.

The typical example of failure is some innocent looking code that writes a zero to the buffer, such as:
*buffer = 0;
but this can compile into:
AND DWORD PTR [EAX],0
aka:
*buffer = *buffer & 0; (a read-modify-write operation!)
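In practice that means: build the instance data in an ordinary system-memory array, then do one straight memcpy into the locked pointer. A minimal D3D9 sketch (the function and variable names are just for illustration):

#include <d3dx9.h>
#include <string.h>

// build in system RAM, then one contiguous write into the locked VB;
// nothing ever reads from the write-combined memory.
void fill_instance_vb(IDirect3DVertexBuffer9 *vb,
                      const D3DXMATRIX *mats, unsigned count)
{
    void *p = NULL;
    if (SUCCEEDED(vb->Lock(0, (UINT)(count * sizeof(D3DXMATRIX)), &p, 0)))
    {
        memcpy(p, mats, count * sizeof(D3DXMATRIX));   // write-only, no holes
        vb->Unlock();
    }
}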


i changed the array to a D3DXMATRIX pointer.

 

then i set it to the void pointer returned by lock

 

mWorld = (D3DXMATRIX*)p

 

then i just use array addressing to set the mats

 

mWorld[i] = Mmat

i++

 

Where Mmat is my "matrix manipulator" matrix. Imagine a register, but it's a 4x4 mat, and you can perform ops on it like set to identity, cat (concatenate) a scale, cat rotations, and cat a translation. I use it as a temporary variable to create mats.

 

I.e., I use good old C-style pointer addressing (the array version thereof); *(mWorld+i)=Mmat would do the same thing.

 

that seemed to work just fine.
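In code form, what I'm doing looks roughly like this (a cleaned-up sketch; the D3DXMatrixTranslation call stands in for my Mmat manipulator ops):

#include <d3dx9.h>

// cast the lock pointer to D3DXMATRIX* and index it directly.  each
// assignment writes a whole matrix and never reads the mapped memory,
// so write-combining stays happy.
void fill_instance_vb_direct(IDirect3DVertexBuffer9 *vb, int num_instances)
{
    void *p = NULL;
    if (FAILED(vb->Lock(0, 0, &p, 0)))       // 0,0 = lock the whole buffer
        return;

    D3DXMATRIX *mWorld = (D3DXMATRIX*)p;
    D3DXMATRIX Mmat;
    for (int i = 0; i < num_instances; i++)
    {
        D3DXMatrixTranslation(&Mmat, (float)(i % 1000), 0.0f, (float)(i / 1000));
        mWorld[i] = Mmat;                    // same as *(mWorld + i) = Mmat
    }
    vb->Unlock();
}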

 

so then i started kicking up the number of instances.

 

shader_test5 is now drawing 1,000,000 grass plants in a 1000x1000 foot area around the origin, each with a unique worldmat, in a single draw call, with the camera at 0,5,-20 looking in the positive z direction. No culling whatsoever, still rock solid at 62 FPS with vsync on. Each plant is 30-40 verts, about half a foot across and half a foot tall, one every foot. I'm not drawing ground or anything else, just clearing the screen and drawing the plants at y=0.

 

so that works. 
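For reference, the draw setup behind that is the standard dx9 hardware instancing arrangement, roughly like this (a sketch; the parameter names are stand-ins, not my actual variables):

#include <d3dx9.h>

// stream 0 = the plant mesh, stream 1 = one worldmat per instance,
// one DrawIndexedPrimitive call for all instances.
void draw_plants_instanced(IDirect3DDevice9 *dev,
                           IDirect3DVertexDeclaration9 *decl,
                           IDirect3DVertexBuffer9 *plant_vb, UINT vert_stride,
                           IDirect3DIndexBuffer9 *plant_ib,
                           IDirect3DVertexBuffer9 *instance_vb,
                           UINT num_instances, UINT num_verts, UINT num_tris)
{
    dev->SetVertexDeclaration(decl);
    dev->SetStreamSource(0, plant_vb, 0, vert_stride);
    dev->SetIndices(plant_ib);
    dev->SetStreamSource(1, instance_vb, 0, (UINT)sizeof(D3DXMATRIX));

    // stream 0 repeats num_instances times; stream 1 advances once per instance
    dev->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | num_instances);
    dev->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1);

    dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, num_verts, 0, num_tris);

    // reset so later non-instanced draws aren't affected
    dev->SetStreamSourceFreq(0, 1);
    dev->SetStreamSourceFreq(1, 1);
}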

 

The plan is to add 4 arrays of 25,000 worldmats each to a terrain chunk, along with 4 meshes and textures for the plants. When I generate a chunk, I'll use the generic random map to determine mesh, texture, scale, rotation, and jitter based on position, so it's deterministic. Each array will hold the worldmats for one mesh and texture combo. I figure 4 types of plants should be sufficient.
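Roughly what I have in mind for the deterministic part (a sketch only: hash2() and the value ranges are placeholders for my generic random map lookups):

#include <d3dx9.h>

// pick mesh/texture, scale, rotation, and jitter from world position only,
// so a chunk regenerates identically every time.
static unsigned hash2(int x, int z)            // placeholder position hash
{
    unsigned h = (unsigned)(x * 73856093) ^ (unsigned)(z * 19349663);
    h ^= h >> 13; h *= 0x5bd1e995; h ^= h >> 15;
    return h;
}

void plant_worldmat_at(int x, int z, D3DXMATRIX *out, int *plant_type)
{
    unsigned h = hash2(x, z);
    *plant_type = h & 3;                                    // 1 of 4 mesh/tex combos
    float yr    = ((h >> 2) & 255) * (6.2832f / 256.0f);    // rotation 0..2pi
    float scale = 0.75f + ((h >> 10) & 255) / 512.0f;       // 0.75 .. 1.25
    float jx    = (((h >> 18) & 31) / 31.0f) - 0.5f;        // jitter +/- half a foot
    float jz    = (((h >> 23) & 31) / 31.0f) - 0.5f;

    D3DXMATRIX s, r, t;
    D3DXMatrixScaling(&s, scale, scale, scale);
    D3DXMatrixRotationY(&r, yr);
    D3DXMatrixTranslation(&t, x + jx, 0.0f, z + jz);
    *out = s * r * t;                                       // scale, rotate, then translate
}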

 

Adding 90,000 worldmats to a chunk (90,000 * 64 bytes = 5.76 meg per chunk) raised the size of the chunk cache to 595 meg, which is still reasonable.

 

The problem now is that this will add to the time to generate a chunk, which is already a tad slow. Stepping up from 1.3GHz to 4.0GHz didn't help as much as hoped for.

 

The big problem is that the world map changes every day. The game models climate change, so at midnight a forest might change to savanna, or scrub might change to grasslands. So I can't just pre-compute terrain chunks when I generate the world; they might only be valid for one game day at most.

 

So the plan is to optimize terrain chunk generation, then test and time generating a chunk vs loading it from disk. If loading is faster, I'll save after generating, with a gametime stamp. When I need a chunk, if it's on disk and not so old that it might have changed, I load it from disk, else I generate it.
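In C++-ish pseudocode, the decision would look something like this (every name here is an invented placeholder, just to illustrate the plan):

// a chunk saved to disk carries the game day it was generated on; if the
// climate may have changed since then, regenerate instead of loading.
bool chunk_on_disk(int mx, int mz, int *day_stamp_out);   // hypothetical helpers
int  current_game_day();
void load_chunk_from_disk(int mx, int mz);
void generate_chunk(int mx, int mz);
void save_chunk_to_disk(int mx, int mz, int day_stamp);

void get_chunk(int mx, int mz)
{
    int day_stamp;
    if (chunk_on_disk(mx, mz, &day_stamp) && day_stamp == current_game_day())
        load_chunk_from_disk(mx, mz);                     // still valid today
    else
    {
        generate_chunk(mx, mz);                           // may have changed at midnight
        save_chunk_to_disk(mx, mz, current_game_day());   // re-stamp with today's date
    }
}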

 

I have both foreground and background chunk generation. The only real problem is when you start up the game and it has to generate a bunch of chunks in the foreground, or when you tab to another band member elsewhere in the world and it needs a whole bunch of different chunks. And occasionally background generation can't keep up, and it has to generate a chunk in the foreground.

 

Randomly generated, seamless, persistent, and modifiable I can handle with no load-screen-type delays. But when you throw in an ever-changing world? I may just have to live with the occasional delay from loading / generating chunks.

 

Smaller chunks are another option. I have tested chunk sizes from 100x100 to 400x400, but 300x300 seemed fastest from a drawing point of view. 100x100 would generate in about 1/9th the time, though.

 

Although I haven't timed it yet, it's not really that bad, maybe 1/4 second to generate a chunk. So about a one to two second delay when it has to generate them all in the foreground.

 

Another concern is when the player modifies the world.  right now i just clear the terrain chunk and collision map caches, forcing a re-gen of everything. if i'm also paging from disk, what do i do? invalidate every chunk on the disk?

 

More things to think about. Guess I'd better go play some more Skyrim. I find it's just the right level of distraction to let my subconscious work these things out. In figuring out all this shader stuff, I've probably only actually spent 6 to 8 hours on it. The rest of the time I've been thinking about it while playing Skyrim, Shogun 2, and Rome 2 over the last week or two. You know how it is. You try something, it's not working. You lose motivation. You take a break and play some Skyrim SE, and then spend all your time running around looking at their plant models! <g>. Pretty soon you get motivated again, and then it's back to work.

 

I do have to admit that some parts of Skyrim SE are rather impressive when it comes to nature scenes. If everything weren't about 2x normal size it would be great! <g>.

Edited by Norman Barrows


ok, now i have a new problem:

 

when i add 90,000 worldmats to a chunk:

 

struct chunk2
{
int active,mx,mz,x,z,age;
tex_list t;
drawinfo_list d;
unsigned char gmap[CHUNKSIZE][CHUNKSIZE];
D3DXMATRIX mWorld[300*300]; 
};
 
and set it to draw 1,000,000 instances:
 
#define WIDTH 1000
 
(where it draws WIDTH*WIDTH instances), the call to create the vertex buffer (64 bytes per worldmat * 1,000,000 instances = 64 meg):
 
result=Zd3d_device_ptr->CreateVertexBuffer(64*WIDTH*WIDTH,D3DUSAGE_WRITEONLY,0,D3DPOOL_MANAGED,&m1.vb,NULL);
 
fails with:
 
    case E_OUTOFMEMORY:
        msg3("shader test 5: create vertex buffer: out of memory");
        return;
 
For the moment, I've reduced the size of the terrain chunk cache from 60 to 30 chunks. But the player can now control up to 50 band members, and the background chunk generation algo should look ahead at least a 2 chunk radius around each band member (3 would be nicer...).
 
I know! I know! the numbers just don't add up.
 
D3DPOOL_SYSTEMMEM wouldn't help I assume...
 
It's kind of weird that just one more call to create a VB would fail after I've already allocated the equivalent of 60 of them. OTOH, those 60 are arrays of 4x4 mats in the data segment, not actual allocated VBs.
 
without the worldmats, the cache size is 250-300 meg. with them, 595 meg.
 
just coincidence that list #61 breaks the bank?
 
I'll mod init_chunks to allocate 4 real VBs of 25,000 instances each to make sure.
 
But this points to a deeper issue...
 
The amount of data in a terrain chunk (renderables, instance worldmats, collision maps, etc.) is having an impact on both generation time and memory usage.
 
The only things i can think of are:
* optimizing chunk generation
* optimizing chunk culling
* using smaller chunks
* a smaller chunk cache
* something more extreme, like when a player first enters a map square, generating all the chunks in that map square (5 miles across = 88 chunks of 300 feet each across) and saving them to disk, then paging them. This might take tens of seconds to generate.
 
if you want to cache the chunks in visible range of up to 50 PCs at once, and the chunk size is large...
 
Am I looking at load screens? Or "loading area..." messages?
 
Large worlds can be streamed off disk, but how do you also work in parts of that world changing every game day? Stream in the background (foreground if needed), and change in the background?
 
It would be nice to stay away from paging off disk. Local map unexplored bitmasks and band-at-shelter trading inventory lists are the only things I page so far.
 
 
 
 
 
Edited by Norman Barrows

Try DEFAULT instead of MANAGED.

MANAGED allocates in both system RAM and GPU RAM so that it can recover from a lost device automatically.

DEFAULT is GPU RAM only (and requires you to handle the lost device recovery yourself).
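For example, the failing call above becomes (and the buffer then has to be released and recreated when the device is lost/reset):

// same call, DEFAULT pool: vidmem only, no system RAM backing copy
result=Zd3d_device_ptr->CreateVertexBuffer(64*WIDTH*WIDTH,
                                           D3DUSAGE_WRITEONLY,
                                           0,
                                           D3DPOOL_DEFAULT,   // was D3DPOOL_MANAGED
                                           &m1.vb,
                                           NULL);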

Try DEFAULT instead of MANAGED.

 

good point. i'll switch that.

 

but I just tried this: 

 

struct chunk2
{
int active,mx,mz,x,z,age;
tex_list t;
drawinfo_list d;
unsigned char gmap[CHUNKSIZE][CHUNKSIZE];
// D3DXMATRIX mWorld[300*300]; 
IDirect3DVertexBuffer9 *mWorld1;
IDirect3DVertexBuffer9 *mWorld2;
IDirect3DVertexBuffer9 *mWorld3;
IDirect3DVertexBuffer9 *mWorld4;
};
 
 
fn v init_chunk_VBs
i a
HRESULT result;
4 a max_chunks
  cr result Zd3d_device_ptr->CreateVertexBuffer 64*25000 D3DUSAGE_WRITEONLY 0 D3DPOOL_MANAGED &CL[a].mWorld1 NULL
  != result D3D_OK
    c Zmsg2 "init chunk VB's error"
    c exit 1
    .
  cr result Zd3d_device_ptr->CreateVertexBuffer 64*25000 D3DUSAGE_WRITEONLY 0 D3DPOOL_MANAGED &CL[a].mWorld2 NULL
  != result D3D_OK
    c Zmsg2 "init chunk VB's error"
    c exit 1
    .
  cr result Zd3d_device_ptr->CreateVertexBuffer 64*25000 D3DUSAGE_WRITEONLY 0 D3DPOOL_MANAGED &CL[a].mWorld3 NULL
  != result D3D_OK
    c Zmsg2 "init chunk VB's error"
    c exit 1
    .
  cr result Zd3d_device_ptr->CreateVertexBuffer 64*25000 D3DUSAGE_WRITEONLY 0 D3DPOOL_MANAGED &CL[a].mWorld4 NULL
  != result D3D_OK
    c Zmsg2 "init chunk VB's error"
    c exit 1
    .
  .
.
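(For anyone not fluent in my shorthand, here's a rough C++ equivalent of the above, using the same CL / max_chunks / Zmsg2 / Zd3d_device_ptr names:)

// 4 VBs of 25,000 worldmats (64 * 25000 bytes each) for every chunk in the cache
void init_chunk_VBs()
{
    for (int a = 0; a < max_chunks; a++)
    {
        IDirect3DVertexBuffer9 **vb[4] =
            { &CL[a].mWorld1, &CL[a].mWorld2, &CL[a].mWorld3, &CL[a].mWorld4 };
        for (int i = 0; i < 4; i++)
        {
            HRESULT result = Zd3d_device_ptr->CreateVertexBuffer(
                64*25000, D3DUSAGE_WRITEONLY, 0,
                D3DPOOL_MANAGED, vb[i], NULL);
            if (result != D3D_OK)
            {
                Zmsg2("init chunk VB's error");
                exit(1);
            }
        }
    }
}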
 

 

And I call init_chunk_VBs() at the end of initprog(), just before I display the chunk cache size (now down to 125 meg), and then call shader_test5() - which still works with 1,000,000 instances, 4 allocated VBs of 25,000 worldmats per chunk, and a cache size of 30. The original cache size was 30; then I kicked it up to 60, then 90, as I kept increasing MAX_BANDMEMBERS.

 

So it didn't work before because the worldmat data was in the data segment, and it works now because it's on the heap and in vidram? (Not that it really matters.)

 

EDIT:

 

I just kicked MAX_CHUNKS back up to 90, and it still works. Cache size at 90 is 375 meg. Of course that doesn't include the 576 meg worth of worldmat VBs in vidram.

 

I'd kick it up higher, but the chunks currently use the shared mesh asset pool for their 4 interleaved ground meshes (one per ground texture tile type), and use a lookup table to map a chunk index to the corresponding meshIDs. Right now the lookup table (and the reserved slots in the mesh pool) only goes up to 90; I'd have to add more code to the lookup table to add more chunks (and maybe kick up the max size of the mesh pool). What I really need to do is make the ground meshes part of a chunk, instead of storing them in the shared mesh asset pool. Then I could easily change cache size and address the meshes without the need for a lookup table. A spot that could stand some refactoring.

OTOH, I'm not sure I NEED a cache bigger than 90. It seems that in a game, the ideal size for a party under human control is 4; any more and you stop really caring. I need 4 or 6 chunks around each band member to be able to tab between them seamlessly, so with a cache of 90 I can go up to 15 band members before it starts generating chunks in the foreground when you tab between band members. This assumes the band members are all so distant from one another that none of them can see the same chunks. Most of the time the band tends to stick together, with perhaps 20% out gathering resources or exploring or whatever at any given time. Even when I was playing bands as large as 32 members, at most you might have 4 or 5 groups at any time (who can all see more or less the same terrain chunks).

 

 

Since memory is becoming a potential issue, what's the easiest way to determine how much RAM and vidram a game uses? Surely there must be something better than adding up the sizeof() of everything...
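One option, at least for ballpark numbers: query the process working set via psapi, and ask D3D9 for its estimate of available video memory (a sketch; report_memory is a made-up helper, and GetAvailableTextureMem is only approximate):

#include <windows.h>
#include <psapi.h>      // link with psapi.lib
#include <d3d9.h>
#include <stdio.h>

// ballpark figures only, not an exact budget
void report_memory(IDirect3DDevice9 *dev)
{
    PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        printf("working set: %u meg\n", (unsigned)(pmc.WorkingSetSize >> 20));

    printf("approx free vidram: %u meg\n", dev->GetAvailableTextureMem() >> 20);
}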

Edited by Norman Barrows
