VladR

Usefulness of Geometry Instancing

This topic is 4049 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Until now, I was under the assumption that Geometry Instancing saves on the number of DIP calls and reduces the load on the vertex-shader units through some intelligent HW that knows which parts of the VS need to be re-executed for a given instance, thus raising the maximum vertex throughput. But I just read this old paper (http://download.nvidia.com/developer/presentations/2004/6800_Leagues/6800_Leagues_SM3_Best_Practices.pdf) and it seems that the only performance benefit comes from savings in the CPU overhead of DIP calls.

Who would be so stupid as to render a forest through 10,000 DIP calls anyway? I mean, everybody batches somehow. Thus, if your current system renders the items in up to 10 calls, there's no performance reason to switch to instancing, as long as memory is not an issue. And memory can't be an issue, since even if your Vertex Buffer contained 100,000 vertices, a separate stream containing indices into the constant buffer (holding per-instance data) would be less than 0.4 MB. In fact, instancing might be a little slower, since there's a fixed overhead when using it.

So instancing seems to be useful only when one is lazy, or must code a rendering routine for some huge crowd within 2 hours and has no time for optimizations. Which would be ridiculous, but that's how it seems to be, since if instancing doesn't save you from vertex transforms, you could easily batch it yourself anyway. Am I therefore right that each vertex of each instance goes through the vertex shader anyway?
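
A minimal sketch of the manual batching described above, written in Python only to show the data layout. The function name `build_manual_batch`, the one-triangle mesh and the 4-byte index size are all assumptions for illustration; the post does not say how the batch is actually built.

```python
def build_manual_batch(mesh_vertices, mesh_indices, instance_count):
    """Replicate one mesh instance_count times into a single batch.

    Returns (vertices, instance_ids, indices). instance_ids is the extra
    per-vertex stream that selects a slot in the shader constant array.
    """
    vertices, instance_ids, indices = [], [], []
    for instance in range(instance_count):
        base = len(vertices)  # first vertex of this copy in the batch
        vertices.extend(mesh_vertices)
        instance_ids.extend([instance] * len(mesh_vertices))
        indices.extend(base + i for i in mesh_indices)
    return vertices, instance_ids, indices


# Hypothetical one-triangle "mesh" replicated 3 times.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
verts, ids, idx = build_manual_batch(triangle, [0, 1, 2], 3)

INDEX_BYTES = 4  # assumed size of one instance index in the extra stream
stream_bytes_100k = 100_000 * INDEX_BYTES  # 400,000 bytes for 100,000 vertices
```

Note that the vertex and index data grow with the number of copies; only the per-instance constants stay small.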

Yes, every vertex of each instance goes through the vertex shader; the benefit of instancing is not to do with saving vertex shader work (which is rarely a bottleneck anyway).

The benefit of instancing is that it lets you advance through different vertex streams with different frequencies and it lets you loop over the same vertex data more than once. This allows you to draw the same model multiple times in a single draw call with different positions, using only a single copy of the vertex data, a single copy of the index data and an additional stream containing one copy of each instance's transform. Without instancing you have to have multiple copies of your vertex and index data, which wastes memory.
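
As a rough CPU-side illustration of the stream frequencies described above (not real driver or hardware behaviour), the sketch below loops over one copy of the index data once per instance and pairs each fetched vertex with that instance's data. The name `instanced_fetch`, the quad and the 2D offsets are made up for the example.

```python
def instanced_fetch(vertex_stream, index_stream, instance_stream):
    """Yield (vertex, instance_data) pairs as an instanced draw would feed them.

    The index data is looped over once per instance, while the instance
    stream advances by one element per instance.
    """
    for instance_data in instance_stream:   # advances once per instance
        for index in index_stream:          # same indices reused on every loop
            yield vertex_stream[index], instance_data


quad = [(0, 0), (1, 0), (1, 1), (0, 1)]      # single copy of the vertex data
quad_indices = [0, 1, 2, 0, 2, 3]            # single copy of the index data
offsets = [(0, 0), (10, 0), (20, 0)]         # one entry per instance

shader_inputs = list(instanced_fetch(quad, quad_indices, offsets))
# Stand-in for the vertex shader: apply the per-instance offset.
positions = [(vx + ox, vy + oy) for (vx, vy), (ox, oy) in shader_inputs]
```

Every index of every instance still produces one vertex shader input (18 here), while storage stays at 4 vertices, 6 indices and 3 offsets.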

Having multiple copies of the data not only wastes memory, it wastes bandwidth and cache usage as well, which can be pretty significant. Since instancing is practically just a "mod" in the input assembler hardware, it's fairly obvious why this functionality is useful :)

Quote:
Original post by mattnewport
the benefit of instancing is not to do with saving vertex shader work (which is rarely a bottleneck anyway).
Well, I personally lean toward using as many triangles as possible, usually about 500,000 per scene, since that's smooth even on extreme low-end cards (e.g. GF6600). I wanted to implement fully polygonal blades of grass, for which I need at least another 500,000 triangles (i.e. over 1M tris per frame) for the immediate surroundings of the player (that's including 5 LODs).
I hoped that instancing would offload the VS pipes so that the number could be at least doubled, until I read the above paper and realized that its purpose is completely different. Thus, there's no way around the maximum vertex throughput that the card is physically capable of (based on the number of VS pipes and the clock frequency).

Quote:
Original post by mattnewport
Without instancing you have to have multiple copies of your vertex and index data which wastes memory.

Again, this is not an issue in general, since my vertices are pretty compressed and tend to consume around 12-16 bytes (and grass vertices specifically could easily be compressed down to 8 bytes). I have yet to benchmark whether it's faster to spend a few instructions decompressing them or to use them uncompressed and take up more bandwidth.
But, frankly, 500,000 vertices * 8 bytes = 4 MB. What's that on today's low-end 128/256 MB cards?
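
One way an 8-byte grass vertex could be laid out, as a sketch only. The layout below (three unsigned bytes for a patch-local position, two signed bytes for the normal's x/z, two unsigned bytes for UV, one pad byte) is an assumption, not the format the poster actually uses, and `pack_grass_vertex` is a hypothetical helper.

```python
import struct

# Hypothetical layout: xyz (3 x uint8), normal x/z (2 x int8), uv (2 x uint8), 1 pad byte.
VERTEX_FORMAT = "<3B2b2Bx"


def pack_grass_vertex(pos, normal_xz, uv):
    """Quantize a vertex with components in [0, 1] (normal in [-1, 1]) to 8 bytes."""
    px, py, pz = (round(c * 255) for c in pos)
    nx, nz = (round(c * 127) for c in normal_xz)
    u, v = (round(c * 255) for c in uv)
    return struct.pack(VERTEX_FORMAT, px, py, pz, nx, nz, u, v)


packed = pack_grass_vertex((0.5, 1.0, 0.25), (0.0, -1.0), (1.0, 0.0))
vertex_bytes = struct.calcsize(VERTEX_FORMAT)   # size of one packed vertex
total_bytes = 500_000 * vertex_bytes            # footprint of 500,000 such vertices
```

With this layout the shader would have to rescale the bytes and rebuild the normal's y component, which is the decompression cost weighed against bandwidth above.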


AndyTX: What exactly do you mean by wasted cache here? Since each vertex goes through the vertex shader anyway, what good is the cache here? Post-transform is obviously of no use and pre-transform is not big enough. Maybe it saves some vertex fetches from the pipeline or some other associated pipeline activity?

Besides, if my instances have around 180-350 vertices, they don't fit into the cache anyway, so I still don't see what you're pointing at.

Quote:
Original post by VladR
Besides, if my instances have around 180-350 Vertices, they don't fit into cache anyway, so I still don't see what you're pointing at.


If your vertices are 16 bytes then 350 vertices is less than 6K - that will comfortably fit into many cards' pre-transform vertex cache. Newer cards are quite likely to have 32K vertex caches - the hardware companies traditionally don't give out cache sizes but AMD has broken the trend with the new 2900XT and that has a 32K L1 vertex cache (shared with unfiltered texture fetches) and a 256K L2 cache (shared with all texture fetches). As far as I know 32K is not uncommon for a pre-transform vertex cache and certainly many cards will have at least 8K.
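
The arithmetic above can be checked with a few lines. This sketch treats the pre-transform cache as a plain byte budget, which is a simplification: line sizes, associativity and sharing with texture fetches are ignored, and the helper names are made up.

```python
KIB = 1024


def mesh_bytes(vertex_count, vertex_size):
    """Raw size of one instance's vertex data in bytes."""
    return vertex_count * vertex_size


def fits_in_cache(vertex_count, vertex_size, cache_bytes):
    """True if the whole mesh fits in a cache of cache_bytes (byte budget only)."""
    return mesh_bytes(vertex_count, vertex_size) <= cache_bytes


largest_instance = mesh_bytes(350, 16)            # 5600 bytes, under 6K
fits_small = fits_in_cache(350, 16, 8 * KIB)      # fits an 8K cache
fat_in_small = fits_in_cache(350, 32, 8 * KIB)    # 32-byte vertices do not
fat_in_large = fits_in_cache(350, 32, 32 * KIB)   # but they fit in 32K
```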

You might be able to get away with 12 to 16 byte vertices for grass, but if you're trying to instance a more complex model with multiple UVs and lighting or skinning information then they can easily get to 32 bytes or more, and vertex data can end up consuming quite a bit of memory. Plus, if you're changing which instances are getting drawn from frame to frame or need to update instance data in the vertex buffer, there's potentially a fair bit of CPU overhead copying the data around or modifying it all the time, even if all your modifications are just to the index buffer.

Thank you very much for sharing the cache size numbers. I didn't know that the pre-transform cache is at least 8 KB on many cards and up to 32 KB on newer ones. That could take some burden off the pipeline, which never hurts and helps maximize performance.
Plus, I can base my further experiments on this number and check the actual threshold by watching how the performance differs between various instance sizes. Now that I think of it, this could also explain various weird performance issues in the past (even when the vertex size wasn't aligned ideally).

Do you happen to know if there's some latency involved with the pre-transform cache if the vertex size isn't a multiple of 16 bytes? Or is indexing into the cache equally expensive regardless of the index value?

Quote:
You might be able to get away with 12 to 16 byte vertices for grass
Actually, for a simple instanced grass patch, 8 bytes is enough for position/normal/UV ;-)
Quote:
but if you're trying to instance a more complex model with multiple UVs and lighting or skinning information then they can easily get to 32 bytes or more, and vertex data can end up consuming quite a bit of memory.
Yeah, an army of 500 characters is probably the only good reason for HW instancing.

Although, now that I think of it, you just need to update the constant array holding the WorldMatrix (and the rotation around the Y-axis) each frame, so it's still very little data, and update the IB only when huge changes occur on the playfield (which is once every few seconds). Sure, it's not as easy as just setting up the streams for instances, but it isn't anything complicated either. Most importantly, the HW streaming method is virtually bug-free compared to a manual implementation. Anyway, at least for a hundred characters, it's easier.
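
A small sketch of the per-frame constant update mentioned above: one world matrix per instance, built from a position and a rotation around the Y-axis. It assumes the row-vector convention (translation in the last row) commonly used with Direct3D; the function names and the sample instances are hypothetical.

```python
import math


def world_matrix(position, yaw):
    """4x4 world matrix (row-vector convention): rotate around Y, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = position
    return [
        [c,   0.0, -s,  0.0],
        [0.0, 1.0, 0.0, 0.0],
        [s,   0.0, c,   0.0],
        [x,   y,   z,   1.0],
    ]


def transform_point(point, m):
    """Multiply the row vector (px, py, pz, 1) by m and return the xyz part."""
    row = (point[0], point[1], point[2], 1.0)
    return tuple(sum(row[k] * m[k][col] for k in range(4)) for col in range(3))


def update_instance_constants(instances):
    """Rebuild the constant array: one matrix per (position, yaw) pair."""
    return [world_matrix(pos, yaw) for pos, yaw in instances]


constants = update_instance_constants([((1.0, 2.0, 3.0), 0.0),
                                       ((0.0, 0.0, 0.0), math.pi / 2)])
moved = transform_point((1.0, 0.0, 0.0), constants[0])
turned = transform_point((1.0, 0.0, 0.0), constants[1])
```

Only this small array changes per frame; the batched vertex and index data stay untouched until the set of visible instances changes.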

BTW, does anyone have some links or tips on how the pre-transform / post-transform cache works in more detail?

I've written about instancing quite a bit in the past so I won't repeat it, but it's another feature similar to, say, point sprites: it's great when you first hear about it. I.e. instancing, woohoo, punch fist in the air. But that soon wears off when you think about it/study it a bit more.

Geometry shaders, occlusion queries etc., now these OTOH are interesting + useful.

I disagree that instancing is not useful. Besides the obvious "grass, trees, crowds" examples, it gets a lot more useful when combined with the geometry shader and the ability to send geometry to different render targets and viewports. For example, instancing can be used rather than geometry shader cloning to render multiple perspectives, or multiple shadow map splits (CSM, PSSM), etc. fairly efficiently. The problem with GS amplification is that it's hard to implement with any level of efficiency and it tends to break parallelism. Instancing does not have this problem.
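
One possible way to drive several shadow map splits from a single instanced draw is to issue objects x splits instances and decode the flat instance ID. The sketch below shows only that index mapping; it is an assumed scheme rather than the one the poster necessarily has in mind, and how the split index is then routed to a render target or viewport is left out.

```python
def instances_to_draw(object_count, split_count):
    """Total instance count when every object is drawn once per split."""
    return object_count * split_count


def decode_instance(instance_id, split_count):
    """Map a flat instance ID to (object_index, split_index)."""
    return divmod(instance_id, split_count)


total = instances_to_draw(2, 3)
assignments = [decode_instance(i, 3) for i in range(total)]
```

Each object appears once per split, so the per-instance work stays independent across instances.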

Besides, my argument would be: why *not* use instancing? It will never be slower than the alternatives and is often much easier to implement.

Now I'm not going to say instancing is useless, since it seems faster and more efficient than not using it, but I'm confused as to why it doesn't actually perform as well as it should.

Here's an example:
In Nvidia's SDK 9.5, they have a DirectX instancing sample. Now, when I try to run that on my 8800 GTX with 2000 ships and 5000 rocks, for a total of 7000 instances, 23 draw calls, and about 525k vertices rendered in total, I get 13 FPS. Clearly the number of draw calls is not the problem, and 525k vertices is virtually nothing for the 8800, so why the huge slowdown?

Actually, the same type of thing happens with their OpenGL sample as well. With 32000 instances and about 60k triangles total, I get around 20 FPS, which again doesn't seem right considering that instancing supposedly only sends one copy of the vertex data along with a second vertex buffer filled with the per-instance data.
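
A back-of-the-envelope check of the figures quoted above, as a sketch. It only converts the reported frame rates into vertices per second and average frame time per instance; it says nothing about where the time actually goes, and the helper names are made up.

```python
def vertices_per_second(vertices_per_frame, fps):
    """Vertex rate implied by a per-frame vertex count and a frame rate."""
    return vertices_per_frame * fps


def microseconds_per_instance(fps, instance_count):
    """Average frame time attributable to each instance, in microseconds."""
    frame_time_us = 1_000_000 / fps
    return frame_time_us / instance_count


d3d_rate = vertices_per_second(525_000, 13)       # DirectX sample: vertices per second
d3d_cost = microseconds_per_instance(13, 7_000)   # about 11 microseconds per instance
gl_cost = microseconds_per_instance(20, 32_000)   # about 1.56 microseconds per instance
```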

Am I missing something here, or shouldn't instancing be working a lot better than it does?

I investigated using the geometry shader in DX10 to get around this, but the geometry shader seems to be a whole lot slower and extremely unoptimized ATM. I guess efficient geometry shaders won't come until the next generation of gfx cards.

Quote:
Original post by AndyTX
Besides my argument would be why *not* use instancing? It will never be slower than the alternatives and is often much easier to implement.

The nvidia guys disagree: from memory, with D3D9 instancing was slower if the mesh had more than ~120 vertices; with OpenGL (+ D3D10 also, I assume) this figure is gonna be lower.
