Usefulness of Geometry Instancing


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

18 replies to this topic

#1 VladR   Members   -  Reputation: 722

Posted 15 May 2007 - 06:00 AM

So far, I lived under the assumption that Geometry Instancing reduces both the number of DIP calls and the load on the vertex-shader units, through some intelligent HW that knows which parts of the VS need to be re-executed per instance, thus raising the maximum vertex throughput. But I just read this old paper (http://download.nvidia.com/developer/presentations/2004/6800_Leagues/6800_Leagues_SM3_Best_Practices.pdf) and it seems that the only performance benefit comes from savings in the CPU overhead of DIP calls.

Who would be so stupid as to render a forest through 10.000 DIP calls anyway? I mean, everybody is batching somehow. Thus, if your current system renders the items through up to 10 calls, there's no performance reason to switch it to instancing, if memory is not an issue. And memory can't be an issue, since even if your Vertex Buffer contained 100.000 vertices, a separate stream containing indices into the constant buffer (holding per-instance data) would be less than 0.4 MB. In fact, instancing might be a little bit slower, since there's a fixed overhead when using it.

So, instancing seems to be useful only when one is lazy or must code a rendering routine for some huge crowd within 2 hours and has no time for optimizations. Which would be ridiculous, but that's how it seems to be, since if instancing doesn't save you from vertex transforms, you could easily batch it yourself anyway. Am I therefore right that each vertex of each instance goes through the vertex shader anyway?

#2 mattnewport   GDNet+   -  Reputation: 1029

Posted 15 May 2007 - 07:56 AM

Yes, every vertex of each instance goes through the vertex shader. The benefit of instancing has nothing to do with saving vertex shader work (which is rarely a bottleneck anyway).

The benefit of instancing is that it lets you advance through different vertex streams at different frequencies and lets you loop over the same vertex data more than once. This allows you to draw the same model multiple times in a single draw call with different positions, using only a single copy of the vertex data, a single copy of the index data and an additional stream containing one copy of each instance's transform. Without instancing you have to have multiple copies of your vertex and index data, which wastes memory.
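A minimal CPU-side sketch of the "different frequencies" idea, assuming a simplified case where the per-instance stream holds a translation offset rather than a full transform. The names `Vec3`, `FetchedVertex` and `emulateInstancedFetch` are hypothetical and not part of any graphics API; the loop only imitates what the input assembler does when it steps one stream per vertex and the other per instance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// One element as the vertex shader would receive it: a model-space position
// from the per-vertex stream plus an offset from the per-instance stream.
struct FetchedVertex {
    Vec3 position;  // advances once per vertex
    Vec3 offset;    // advances once per instance
};

// Emulates an instanced draw: the same mesh is walked once per instance,
// and the instance stream only steps when the mesh loop wraps around.
std::vector<FetchedVertex> emulateInstancedFetch(
    const std::vector<Vec3>& meshVertices,
    const std::vector<Vec3>& instanceOffsets)
{
    std::vector<FetchedVertex> fetched;
    fetched.reserve(meshVertices.size() * instanceOffsets.size());
    for (std::size_t inst = 0; inst < instanceOffsets.size(); ++inst) {
        for (std::size_t v = 0; v < meshVertices.size(); ++v) {
            fetched.push_back({meshVertices[v], instanceOffsets[inst]});
        }
    }
    // Every vertex of every instance is still produced (and would be shaded).
    assert(fetched.size() == meshVertices.size() * instanceOffsets.size());
    return fetched;
}
```

Note that only one copy of the mesh and one entry per instance are stored, yet the output still contains vertices × instances elements, which matches the point that instancing saves storage and draw calls rather than vertex shader work.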

#3 AndyTX   Members   -  Reputation: 802

Posted 15 May 2007 - 07:58 AM

Having multiple copies of the data not only wastes memory, it wastes bandwidth and cache usage as well, which can be pretty significant. Since instancing is practically just a "mod" in the input assembler hardware, it's fairly obvious why this functionality is useful :)

#4 VladR   Members   -  Reputation: 722

Posted 15 May 2007 - 08:56 AM

Quote:
Original post by mattnewport
the benefit of instancing is not to do with saving vertex shader work (which is rarely a bottleneck anyway).
Well, I personally lean towards using as many triangles as possible - usually about 500.000 per scene, since that's smooth even on extreme low-end cards (e.g. GF6600). I wanted to implement fully polygonal blades of grass, for which I need at least another 500.000 triangles (i.e. over 1M tris per frame) for the immediate surroundings of the player (that's including 5 LODs).
I hoped that instancing would off-load the VS pipes and that the number could be at least doubled, until I read the above paper and realized that its purpose is completely different. Thus, there's no way around the maximum vertex throughput that the card is physically capable of (based on the number of VS pipes and the clock frequency).

Quote:
Original post by mattnewport
Without instancing you have to have multiple copies of your vertex and index data, which wastes memory.

Again, this is not an issue in general, since my vertices are pretty compressed and tend to consume around 12-16 bytes (and grass vertices specifically could easily be compressed down to 8 bytes). I have yet to benchmark whether it's faster to spend a few instructions decompressing them or to use them uncompressed but take up more bandwidth.
But, frankly, 500.000 vertices * 8 bytes = 4 MB. What's that on currently low-end 128/256 MB cards?
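The arithmetic above can be checked with a small sketch. The helper names `bufferBytes` and `fractionOfVideoMemory` are hypothetical, and megabytes are taken as decimal (10^6 bytes), which is what the 4 MB figure assumes.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of a vertex buffer with a fixed per-vertex stride.
std::uint64_t bufferBytes(std::uint64_t vertexCount, std::uint64_t strideBytes) {
    assert(strideBytes > 0);
    return vertexCount * strideBytes;
}

// Fraction of video memory a buffer occupies, with memory given in decimal MB.
double fractionOfVideoMemory(std::uint64_t bytes, std::uint64_t videoMemoryMB) {
    assert(videoMemoryMB > 0);
    return static_cast<double>(bytes) / (static_cast<double>(videoMemoryMB) * 1.0e6);
}
```

With these, `bufferBytes(500000, 8)` gives 4,000,000 bytes, and `fractionOfVideoMemory(4000000, 128)` comes out at roughly 3% of a 128 MB card, so the storage cost alone is indeed small. The replies below argue that bandwidth and cache behaviour matter more than raw capacity.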


AndyTX: What exactly do you mean by wasted cache here? Since each vertex goes through the vertex shader anyway, what good is the cache here? Post-transform is obviously of no use and pre-transform is not big enough. Maybe saving some vertex fetches from the pipeline, or some other associated pipeline activity?

Besides, if my instances have around 180-350 vertices, they don't fit into the cache anyway, so I still don't see what you're pointing at.

#5 mattnewport   GDNet+   -  Reputation: 1029

Posted 15 May 2007 - 10:03 AM

Quote:
Original post by VladR
Besides, if my instances have around 180-350 vertices, they don't fit into the cache anyway, so I still don't see what you're pointing at.


If your vertices are 16 bytes then 350 vertices is less than 6K - that will comfortably fit into many cards' pre-transform vertex cache. Newer cards are quite likely to have 32K vertex caches - the hardware companies traditionally don't give out cache sizes but AMD has broken the trend with the new 2900XT and that has a 32K L1 vertex cache (shared with unfiltered texture fetches) and a 256K L2 cache (shared with all texture fetches). As far as I know 32K is not uncommon for a pre-transform vertex cache and certainly many cards will have at least 8K.

You might be able to get away with 12 to 16 byte vertices for grass, but if you're trying to instance a more complex model with multiple UVs and lighting or skinning information then they can easily get to 32 bytes or more, and vertex data can end up consuming quite a bit of memory. Plus, if you're changing which instances are getting drawn from frame to frame, or need to update instance data in the vertex buffer, there's potentially a fair bit of CPU overhead in copying the data around or modifying it all the time, even if all your modifications are just to the index buffer.
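A rough sketch of the cache estimate above, assuming the simplest possible model: a mesh "fits" if vertex count times stride is no larger than the cache. Real hardware also has cache-line granularity and (on the 2900XT, per the post) shares the cache with texture fetches, neither of which is modelled here. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of pre-transform vertex data touched by one instance of a mesh.
std::size_t meshVertexBytes(std::size_t vertexCount, std::size_t strideBytes) {
    assert(strideBytes > 0);
    return vertexCount * strideBytes;
}

// True if the whole mesh fits in a pre-transform cache of the given size.
// Ignores index data, cache-line granularity and sharing with other fetches.
bool fitsInCache(std::size_t vertexCount, std::size_t strideBytes,
                 std::size_t cacheBytes) {
    return meshVertexBytes(vertexCount, strideBytes) <= cacheBytes;
}
```

For the numbers in this thread, `meshVertexBytes(350, 16)` is 5600 bytes, which fits in both an 8K and a 32K cache, while the same 350 vertices at 32 bytes each (11200 bytes) would overflow 8K but still fit in 32K.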

#6 VladR   Members   -  Reputation: 722

Posted 15 May 2007 - 11:15 AM

Thank you very much for sharing the cache size numbers. I didn't know that the pre-transform cache is at least 8 KB on many cards and up to 32 KB on newer ones. That could take some burden off the pipeline, which never hurts and is the only way to maximize performance.
Plus, I can base my further experiments on this number and check the actual threshold by watching how the performance differs between various instance sizes. This could also explain various weird performance issues in the past, now that I think of it (even if the vertex size wasn't aligned ideally).

Do you happen to know if there's some latency involved with the pre-transform cache if the vertex size isn't a multiple of 16 bytes? Or is indexing into the cache equally expensive regardless of the index value?

Quote:
You might be able to get away with 12 to 16 byte vertices for grass
Actually, for a simple instanced grass patch, 8 bytes is enough for position/normal/UV ;-)
Quote:
but if you're trying to instance a more complex model with multiple UVs and lighting or skinning information then they can easily get to 32 bytes or more, vertex data can end up consuming quite a bit of memory.
Yeah, an army of 500 characters is probably the only good reason for HW instancing.

Although, now that I think of it, you just need to change the constant array holding the WorldMatrix (and rotation around the Y-axis) each frame, so it's still very little data, and update the IB only when huge changes occur on the playfield (which is once every few seconds). Sure, it's not as easy as just setting up the streams for instances, but it isn't anything complicated either. Most importantly, the HW streaming method is virtually bug-free compared to a manual implementation. Anyway, at least for a hundred characters, it's easier.
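The manual alternative described above could look roughly like the following sketch. It assumes the vertex buffer already holds several consecutive copies of the mesh, each copy tagged with its slot in the constant array, so that only the index buffer has to be rebuilt when the set of visible copies changes. The function name `buildBatchedIndices` and this exact layout are assumptions for illustration, not something the thread specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rebuilds the index buffer for a manually batched draw. Copy k of the mesh
// is assumed to start at vertex k * verticesPerMesh in the vertex buffer;
// only the copies listed in visibleCopies end up in the batch.
std::vector<std::uint32_t> buildBatchedIndices(
    const std::vector<std::uint32_t>& meshIndices,
    std::uint32_t verticesPerMesh,
    const std::vector<std::uint32_t>& visibleCopies)
{
    std::vector<std::uint32_t> batched;
    batched.reserve(meshIndices.size() * visibleCopies.size());
    for (std::uint32_t copy : visibleCopies) {
        const std::uint32_t base = copy * verticesPerMesh;
        for (std::uint32_t index : meshIndices) {
            assert(index < verticesPerMesh);  // indices must stay inside one copy
            batched.push_back(base + index);
        }
    }
    return batched;
}
```

This is the CPU work that hardware instancing removes: the rebuild cost grows with indices per mesh times visible copies, whereas with instancing only the small per-instance stream changes.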

BTW, does anyone have some links or tips on how the pre-transform / post-transform cache works in more detail?

#7 zedz   Members   -  Reputation: 291

Posted 15 May 2007 - 11:18 AM

I've written about instancing quite a bit in the past so won't repeat it, but it's another feature similar to, say, point sprites: it's great when u first hear about it (instancing, woohoo, punch fist in the air), but that soon wears off when u think about/study it a bit more.

Geometry shaders, occlusion queries etc., now these OTOH are interesting + useful.

#8 AndyTX   Members   -  Reputation: 802

Posted 15 May 2007 - 12:35 PM

I disagree that instancing is not useful. Besides the obvious "grass, trees, crowds" examples, it gets a lot more useful when combined with the geometry shader and the ability to send geometry to different render targets and viewports. For example, instancing can be used rather than geometry shader cloning to render multiple perspectives, or multiple shadow map splits (CSM, PSSM), etc. fairly efficiently. The problem with GS amplification is that it's hard to implement with any level of efficiency and it tends to break parallelism. Instancing does not have this problem.

Besides, my argument would be: why *not* use instancing? It will never be slower than the alternatives and is often much easier to implement.

#9 coderchris   Members   -  Reputation: 207

Posted 15 May 2007 - 01:19 PM

Now I'm not going to say instancing is useless, since it seems faster and more efficient than not using it, but I'm confused as to why it doesn't actually perform as well as it should.

Here's an example:
In Nvidia's SDK 9.5, they have a DirectX instancing sample. Now, when I try to run that on my 8800 GTX with 2000 ships and 5000 rocks, for a total of 7000 instances, 23 draw calls, and about 525k vertices rendered in total, I get 13 FPS. Clearly the number of draw calls is not the problem and 525k vertices is virtually nothing for the 8800, so why the huge slowdown?

Actually, the same type of thing happens with their OpenGL sample as well. With 32000 instances and about 60k triangles total, I get around 20 FPS, which again doesn't seem right considering that instancing supposedly only sends one copy of the vertex data along with a second vertex buffer filled with the per-instance data.

Am I missing something here, or shouldn't instancing be working a lot better than it does?

I investigated using the geometry shader in DX10 to get around this, but the geometry shader seems to be a whole lot slower and extremely unoptimized ATM. I guess efficient geometry shaders won't come until the next generation of gfx cards.
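To put the figures above into per-second terms, here is a small sketch. The helper name `elementsPerSecond` is hypothetical, and the calculation says nothing about what the hardware can actually sustain; it only converts the reported per-frame counts and frame rates.

```cpp
#include <cassert>
#include <cstdint>

// Elements (vertices or triangles) submitted per second at a given frame rate.
std::uint64_t elementsPerSecond(std::uint64_t elementsPerFrame,
                                std::uint64_t framesPerSecond) {
    assert(framesPerSecond > 0);
    return elementsPerFrame * framesPerSecond;
}
```

The DirectX sample works out to `elementsPerSecond(525000, 13)`, about 6.8 million vertices per second, and the OpenGL sample to `elementsPerSecond(60000, 20)`, 1.2 million triangles per second. Rates that low suggest the limit lies somewhere other than vertex processing.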



#10 zedz   Members   -  Reputation: 291

Posted 15 May 2007 - 10:19 PM

Quote:
Original post by AndyTX
Besides my argument would be why *not* use instancing? It will never be slower than the alternatives and is often much easier to implement.

The NVIDIA guys disagree. From memory, with D3D9 instancing was slower if meshes were greater than ~120 vertices; with OpenGL (+ D3D10 also, I assume) this figure is gonna be less.



#11 AndyTX   Members   -  Reputation: 802

Posted 16 May 2007 - 03:43 AM

Quote:
Original post by zedz
Quote:
Original post by AndyTX
Besides my argument would be why *not* use instancing? It will never be slower than the alternatives and is often much easier to implement.

The NVIDIA guys disagree. From memory, with D3D9 instancing was slower if meshes were greater than ~120 vertices; with OpenGL (+ D3D10 also, I assume) this figure is gonna be less.

I don't see how that's possible except for poor driver coding. Of course OpenGL is an entirely different (and messy) story, but D3D9 and especially D3D10 should see a benefit from instancing and at the very least not be any slower than multiple calls or a large duplicated buffer (the latter two makes NO sense).

#12 eq   Members   -  Reputation: 654

Posted 16 May 2007 - 06:31 AM

Quote:
I don't see how that's possible except for poor driver coding. Of course OpenGL is an entirely different (and messy) story, but D3D9 and especially D3D10 should see a benefit from instancing and at the very least not be any slower than multiple calls or a large duplicated buffer (the latter two makes NO sense).

Couldn't it be that they have messed up the pre T&L cache?
I.e: vertex 0 in instance 0 IS NOT EQUAL TO vertex 0 in instance 1 (as far as the cache is concerned).
Anyway this should be fixed by now or else... ;)



#13 zedz   Members   -  Reputation: 291

Posted 16 May 2007 - 08:42 AM

Quote:
at the very least not be any slower than multiple calls or a large duplicated buffer (the latter two makes NO sense)

read these

From an NVIDIA PDF:
Quote:
Use of instancing isn’t free. There is a small amount of per instance overhead in the driver (luckily this is much less than normal draw calls). Also, since we are passing down extra instance data in the vertex stream, all the instance data adds to our vertex stride, which will reduce our vertex cache efficiency. As well, the instance data may require that we do extra math ops per-vertex when we could be doing that math per-instance. Since with instancing, you might pass down the world transform, you may have to do a matrix multiply to obtain the WorldViewProjection matrix in the vertex shader. None of these concerns are too bad, and in many situations, instancing is a win.


From an ATI guy:
Quote:

It's not nearly as dramatic in OpenGL as in D3D since you don't have the context switch to ring0 and back to ring3 again for each draw call. I'm doubtful instancing will ever be particularly useful on the OpenGL side. It's hard enough to motivate on the D3D side. If your object has > 100 triangles, the bottleneck has already moved to the vertex shader even with a simple shader. Perhaps future hardware will change the balance and an instancing feature can be considered then, but today it's not really needed.

Not sure what you're saying "not really" about, as the data passing the AGP is still the same regardless of pre-T&L cache utilization, but anyway, I understand your argument, but I'm not sure I agree for three reasons. The first is that in general memory access is seldom the bottleneck anyway, so it usually doesn't matter that much. The other is that instancing already screws up for the pre-T&L already as it requires two vertex streams. The third reason I'm afraid involves some non-public information that I can't disclose, which I believe is good argument why what you say isn't the case in practice, but it's hard of course to make this convincing without going into details.


#14 AndyTX   Members   -  Reputation: 802

Posted 16 May 2007 - 11:47 AM

Yes, I've heard similar comments myself. The problem is that while they *may* apply to D3D9, they certainly do not apply to D3D10 or even OpenGL as of the G80 extensions.

The reason is that there's no longer a need to pass "per-instance" data, since an instance ID is accessible in all shaders. This in turn can be used to index into constant arrays, textures, etc. as necessary.

Therefore anything that would normally have changed between different draw calls (matrices, constants, etc) can simply be dumped into an array and selected at very little - if any - additional cost.

Thus with instancing you *save* the draw call and lose nothing. Instancing can be faster in theory due to less data movement and tighter assumptions that the driver/hardware can make about the incoming streams.
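A minimal CPU-side sketch of the instance-ID approach described above, assuming the per-instance data is a 3x4 world transform kept in a plain array. `vertexShader` here is an ordinary C++ function standing in for a real shader, and `Mat3x4`, `makeTranslation` and `transformPoint` are hypothetical helpers, not API calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 3x4 affine transform: 3x3 rotation/scale plus a translation column.
struct Mat3x4 { float m[3][4]; };

// Builds a pure translation, used here as the simplest per-instance transform.
Mat3x4 makeTranslation(float tx, float ty, float tz) {
    return {{{1.0f, 0.0f, 0.0f, tx},
             {0.0f, 1.0f, 0.0f, ty},
             {0.0f, 0.0f, 1.0f, tz}}};
}

// Applies the affine transform to a point.
Vec3 transformPoint(const Mat3x4& t, const Vec3& p) {
    return {
        t.m[0][0] * p.x + t.m[0][1] * p.y + t.m[0][2] * p.z + t.m[0][3],
        t.m[1][0] * p.x + t.m[1][1] * p.y + t.m[1][2] * p.z + t.m[1][3],
        t.m[2][0] * p.x + t.m[2][1] * p.y + t.m[2][2] * p.z + t.m[2][3]};
}

// Stand-in for a vertex shader: the instance ID selects the transform from an
// array, so no per-instance data has to travel in the vertex stream itself.
Vec3 vertexShader(const Vec3& position, std::size_t instanceId,
                  const std::vector<Mat3x4>& worldTransforms) {
    assert(instanceId < worldTransforms.size());
    return transformPoint(worldTransforms[instanceId], position);
}
```

The same vertex at position (1, 2, 3) comes out at (6, 2, 3) for an instance translated by 5 along X and at (1, 9, 3) for one translated by 7 along Y; the only thing that changes between instances is the array index.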

However I do certainly agree that given the currently fairly low state change overhead it isn't useful for as many things as it used to be (still grass seems like a prime candidate...). That said there's still no excuse for it being *slower* except poor handling by the drivers/hardware (since there's theoretically fewer total operations, not more).

#15 zedz   Members   -  Reputation: 291

Posted 16 May 2007 - 08:20 PM

Quote:
Original post by AndyTX
However I do certainly agree that given the currently fairly low state change overhead it isn't useful for as many things as it used to be (still grass seems like a prime candidate...). That said there's still no excuse for it being *slower* except poor handling by the drivers/hardware (since there's theoretically fewer total operations, not more).

Bugger theory, results are what matters (my mantra).
My main gripe with instancing is that it deals with small data repeated a lot. U mention grass: now if u deal with grass simply (as all games do), instancing is great, but if u have (like I have) each blade having its own spring system, instancing does not work, full stop, unless u transfer bucket loads of info with each instance, hence making it slower.
To sum up my drift, ultimately there's a set goal with graphics: 'to model or improve how we perceive reality'. Another analogy (not instancing, but closely related):
I was looking at the beta of Halo 3 a couple of days ago, and the tiled textures of the ground looked terrible. With some instancing (of texture :)) it would look better but would still have a uniform/unnatural appearance to it. What we want is unique texturing, e.g. megatexture (though u can't zoom in with that).



#16 remigius   Members   -  Reputation: 1172

Posted 16 May 2007 - 09:09 PM

I've always been a big fan of hardware instancing, since it does provide a performance gain over manual DIPs and it's very tidy compared to manual batching. Going from this thread though, it seems it isn't the cure-all I thought it to be, but I'm a bit puzzled why even the NVIDIA and ATI guys say it isn't really useful.

From my own experiments, I found HW instancing is on par with constants instancing performance-wise, and easier to use at that. Considering constants instancing is a limited form of batching, there doesn't seem to be that much of a performance drop between HW instancing and batching, at least in my experiments on ATI hardware. While 'true batching' may be easier on the GPU, it does introduce at least some CPU overhead (especially in dynamic scenarios), and since most games are still reported to be CPU limited, I'm not entirely convinced HW instancing is that useless.


Quote:
It's hard enough to motivate on the D3D side. If your object has > 100 triangles, the bottleneck has already moved to the vertex shader even with a simple shader. Perhaps future hardware will change the balance and an instancing feature can be considered then, but today it's not really needed.


Is this solely due to the pre-T&L cache problems then? If "general memory access is seldom the bottleneck anyway", the cache thing should not matter, right? And regardless of whether you use HW instancing or batching, the same number of vertices will need to be transformed with the ViewProjection matrix and I can't imagine that single additional matrix mul for the world matrix makes that much of a difference.

I'm going with the "The third reason I'm afraid involves some non-public information that I can't disclose", that they've botched up something with the drivers and/or hardware. True, the results are what matters, but it'd be a shame to add HW instancing to the list of dubious technologies (like VTF and apparently the GS) that aren't up to spec.

Rim van Wersch [ MDXInfo ] [ XNAInfo ] [ YouTube ] - Do yourself a favor and bookmark this excellent free online D3D/shader book!

#17 AndyTX   Members   -  Reputation: 802

Posted 17 May 2007 - 02:49 AM

Quote:
Original post by zedz
bugger theory, results are what matters ( my mantra )

That's fine, although if it is just a driver thing, it can be fixed, which is an important distinction to make. Furthermore I've seen no indication that instancing is any less efficient than otherwise in D3D10 (on G80/R600) and I am extremely skeptical that it is given its first-class integration into the API. I'd have to see hard numbers to question that.

And yes of course we all agree that variability is the best, but at a certain point the hardware can only handle so many state transitions, etc. without breaking the parallelism (and thus performance) of the computations. Instancing can help here giving an obvious way to inform the driver of further work that can be parallelized. With some cleverness and use of the geometry shader you *could* handle things like deforming grass quite efficiently with instancing, although manually feeding the driver more data is another way to increase the parallelism.

In any case I don't find that I use instancing a lot myself, but it's certainly a useful tool that is good to have on the shelf that can solve some "practical" problems elegantly and efficiently.


#18 damianGray   Members   -  Reputation: 126

Posted 19 May 2007 - 07:02 PM

Quote:
Original post by coderchris
Now I'm not going to say instancing is useless, since it seems faster and more efficient than not using it, but I'm confused as to why it doesn't actually perform as well as it should.

Here's an example:
In Nvidia's SDK 9.5, they have a DirectX instancing sample. Now, when I try to run that on my 8800 GTX with 2000 ships and 5000 rocks, for a total of 7000 instances, 23 draw calls, and about 525k vertices rendered in total, I get 13 FPS. Clearly the number of draw calls is not the problem and 525k vertices is virtually nothing for the 8800, so why the huge slowdown?

Actually, the same type of thing happens with their OpenGL sample as well. With 32000 instances and about 60k triangles total, I get around 20 FPS, which again doesn't seem right considering that instancing supposedly only sends one copy of the vertex data along with a second vertex buffer filled with the per-instance data.

Am I missing something here, or shouldn't instancing be working a lot better than it does?

I investigated using the geometry shader in DX10 to get around this, but the geometry shader seems to be a whole lot slower and extremely unoptimized ATM. I guess efficient geometry shaders won't come until the next generation of gfx cards.


I'm fairly new to the whole coding in 3D thing, but on my PC that example runs fine with max ships OR max rocks, but not with both. When I studied the actual behavior of the ships I noticed that they change direction a lot more when there are a lot more rocks, so I think the bottleneck in this case is actually the algorithm for ships dodging rocks rather than the number of objects displayed on screen... I could be wrong of course, but it's what made the most sense to me without actually delving too deeply into the code.

#19 Basiror   Members   -  Reputation: 241

Posted 20 May 2007 - 10:45 AM

Isn't instancing more or less the same as rendering the same geometry several times with different transformation matrices in just one draw call? At least that's how I understand it.

So the real advantage of geometry instancing should be rendering complex meshes several times, since geometry shaders only have information about the current face, right? You can't clone a mesh with information about only a single face. Well, in theory you could, but you would have to pass a ton of transformation matrices that require more registers than are available.

From wikipedia
Quote:

A geometry shader can generate new primitives from existing primitives like pixels, lines and triangles.

Geometry shader is executed after Vertex shader and its input is the whole primitive or primitive with adjacency information. For example, when operating on triangles, three vertices are geometry shader's input. Geometry shader can then emit zero or more primitives, which are rasterized and their fragments ultimately passed to Pixel shader.

Typical uses of a geometry shader include point sprite generation, geometry tessellation, shadow volume extrusion, single pass rendering to a cube map.


Populating some terrain patch with grass polygons should be easy to do with a geometry shader, no need to use instancing here.


seems like geforce 8 supports instancing for opengl
opengl instancing



