
VBO Pooling: Does it make sense?


10 replies to this topic

#1 metsfan   Members   -  Reputation: 654


Posted 08 November 2012 - 04:30 PM

Hello all,

In the project I am currently working on, the objects on screen are very transitory. There are never more than a few dozen objects on screen at a time, but the set of objects changes all the time, and every object has slightly different geometry. Obviously I could just create a new VBO and delete the old one every time a new object comes in and an old object is removed, but does it make more sense to create a VBO pool? I would allocate a preset number of VBOs up front; any time one is needed, I pull an available VBO from the pool, and when an object is freed, its VBO is returned to the pool. The pool would also automatically resize if it grew too big or ran out of objects.
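
A minimal sketch of such a pool might look like this (VboPool, acquire, and release are hypothetical names; the handle generator is injected in place of glGenBuffers so the pooling logic stands on its own):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the pool described above. In real code the
// generator would wrap glGenBuffers() and acquire()/release() would hand
// out GLuint buffer names.
class VboPool {
public:
    explicit VboPool(std::uint32_t (*generate)()) : generate_(generate) {}

    std::uint32_t acquire() {
        if (free_.empty())
            return generate_();        // pool grows on demand
        std::uint32_t id = free_.back();
        free_.pop_back();
        return id;
    }

    void release(std::uint32_t id) { free_.push_back(id); }

    std::size_t idle() const { return free_.size(); }

private:
    std::uint32_t (*generate_)();
    std::vector<std::uint32_t> free_;
};
```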

Does this sort of thing sound like a good idea, or a waste of time?

Thanks.

-Adam


#2 L. Spiro   Crossbones+   -  Reputation: 14240


Posted 08 November 2012 - 06:54 PM

It doesn’t make a lot of sense, since you can’t control what OpenGL is doing inside the driver. If your goal is to avoid run-time allocation of VBOs, that means allocating them all up front.
If you end up not using them all, you have allocated unnecessarily, which could itself be a burden on the driver. You never know.
It’s fine to just allocate them when necessary, and only when necessary.



L. Spiro
It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#3 mhagain   Crossbones+   -  Reputation: 8276


Posted 08 November 2012 - 08:34 PM

I wouldn't create and destroy VBOs at runtime - object creation and deletion is generally quite an expensive process.

For your case I'd look at how much of your data can be kept completely static. OK, you've got a number of objects with different geometry, but you'll probably find that there are multiple instances of the same object type in use, just with different transforms, so those are good candidates for static storage. If other per-object properties differ, you can pull them out as shader uniforms and keep only the common data in static VBOs; a little extra shader work is frequently substantially cheaper than constantly updating VBO data.

If that doesn't apply then I'd go for a streaming buffer pattern - have a read of this page for further info on implementing that.
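
The streaming pattern generally boils down to buffer orphaning. A rough sketch (the function and parameter names are mine, and it obviously needs a live GL context, so treat it as an outline rather than a drop-in):

```cpp
#include <GL/glew.h>  // any loader exposing core GL entry points will do

// Per-frame streaming via buffer orphaning; `vbo` was created once with
// glGenBuffers(), and `verts`/`bytes` hold this frame's vertex data.
void streamVertices(GLuint vbo, const void* verts, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Respecifying the store with NULL "orphans" the old memory: the
    // driver can hand back fresh storage instead of stalling while the
    // GPU is still reading the previous frame's data.
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, verts);
}
```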

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#4 swiftcoder   Senior Moderators   -  Reputation: 10364


Posted 09 November 2012 - 05:03 PM

I wouldn't create and destroy VBOs at runtime - object creation and deletion is generally quite an expensive process.

I'm not convinced this is entirely true for OpenGL resource handles.

glGenBuffers() really doesn't do that much work - most of the cost is when you define the buffer with glBufferData(). In fact, before the call to glBufferData, OpenGL knows neither the size nor the type of memory to allocate, so very little actual work can be done during glGenBuffers(). Similarly, calling glDeleteBuffers() is hardly more expensive than the tear-down that happens when you re-load the buffer via glBufferData().

You are of course correct that it is key to reduce the number of buffer updates, but creation and destruction shouldn't be a performance issue.

Edited by swiftcoder, 09 November 2012 - 05:04 PM.

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#5 samoth   Crossbones+   -  Reputation: 5032


Posted 10 November 2012 - 07:33 AM

Buffer objects are created when first bound (see the reference page; this is also among the tips in the "OpenGL Insights" book). The memory backing the buffer object is allocated when you call glBufferData et al.

Thus, if you want to do this optimization, you should also bind each buffer at least once. Even then, this doesn't reserve the memory (but reserving memory up front is something you probably can't, and shouldn't, do in any case).

#6 larspensjo   Members   -  Reputation: 1557


Posted 10 November 2012 - 04:35 PM

Some ideas, most of which depend on there being at least some algorithmic relation between the objects:
  • Use indexed drawing. Maybe you can have one "big" predefined VBO and only need to update the indices.
  • Use glBufferSubData(). Maybe some of the vertex data is the same and some is not.
  • Is the setup similar to animation? Animation can be greatly improved by using bones. Vertices are attached to bones, and you only have to move a few bones; the shader then computes the new vertex positions from the bone data, which is a much smaller set. The technique extends to arbitrarily complex transformations, as long as the number of vertices is fairly constant and has a mathematical relation to a smaller set of data.
  • Use a geometry shader to create the vertices you need.

Current project: Ephenation.
Sharing OpenGL experiences: http://ephenationopengl.blogspot.com/

#7 max343   Members   -  Reputation: 340


Posted 10 November 2012 - 05:43 PM

Some ideas, most of which depend on there being at least some algorithmic relation between the objects:

  • Use indexed drawing. Maybe you can have one "big" predefined VBO and only need to update the indices.
  • Use glBufferSubData(). Maybe some of the vertex data is the same and some is not.
  • Is the setup similar to animation? Animation can be greatly improved by using bones. Vertices are attached to bones, and you only have to move a few bones; the shader then computes the new vertex positions from the bone data, which is a much smaller set. The technique extends to arbitrarily complex transformations, as long as the number of vertices is fairly constant and has a mathematical relation to a smaller set of data.
  • Use a geometry shader to create the vertices you need.


Not everything here is entirely accurate.
  • Indexed drawing is very useful, but you should be careful not to abuse it. For a pipeline stage to run efficiently, most data should be cacheable; abusing indexed drawing can actually backfire by causing a lot of cache misses in the vertex shader (obviously this will only happen with relatively large vertex buffers).
  • glBufferSubData is a double-edged sword. It's really nice to be able to update parts of a buffer with one API call, but in many cases it actually ruins performance, because many synchronizations have to occur (basically you'll be stalling the hardware). The right way to do partial buffer updates is more complicated: it involves glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT, manually managing fences, and of course doing it on two threads. Basically, avoid partial buffer updates whenever you can, and when you can't, try to split your buffers so that you can. And if that fails, I don't envy you.
  • Bones are great. Use them whenever you can. They have their limitations, and some coding is required to do them right, but supporting them is practically a must.
  • If performance is your concern, avoid the geometry shader (GS) like the plague. There are some cases in which you won't lose performance by using a GS, but there are no cases in which it will give you better performance than the alternatives. This is mostly a hardware limitation, but (unfortunately) it's here to stay.
    One good thing about the GS is that it makes your code neater, shorter, and cleaner. So if you can put up with some annoying performance issues, the GS is awesome.
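
A very rough single-threaded sketch of that glMapBufferRange scheme (names and the three-region split are my own invention; the second-thread part and all error handling are omitted, and it needs a live GL context):

```cpp
#include <GL/glew.h>  // any loader exposing GL 3.2+ sync objects
#include <cstdint>

// The buffer is split into kRegions chunks, each guarded by a fence, and
// mapped with GL_MAP_UNSYNCHRONIZED_BIT so the driver doesn't stall for us.
const int kRegions = 3;
GLsync fences[kRegions] = {};
int region = 0;

void* beginWrite(GLuint vbo, GLsizeiptr regionBytes)
{
    if (fences[region]) {
        // Ideally this never blocks; if it does, the GPU is still
        // reading the region we are about to overwrite.
        glClientWaitSync(fences[region], GL_SYNC_FLUSH_COMMANDS_BIT,
                         UINT64_MAX);
        glDeleteSync(fences[region]);
        fences[region] = nullptr;
    }
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    return glMapBufferRange(GL_ARRAY_BUFFER,
                            region * regionBytes, regionBytes,
                            GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
}

void endWrite()  // call after issuing the draws that read this region
{
    glUnmapBuffer(GL_ARRAY_BUFFER);
    fences[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    region = (region + 1) % kRegions;
}
```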

Edited by max343, 10 November 2012 - 05:45 PM.


#8 mhagain   Crossbones+   -  Reputation: 8276


Posted 10 November 2012 - 06:45 PM

max343 is more or less correct on everything, but the point about not abusing indexes needs elaboration.

That shouldn't be read as an all-out proscription on using indexed drawing; rather it's a caution to make sure that your verts and indexes are properly ordered so as to make the most optimal use of your hardware's vertex caches (in fact, indexed drawing is a requirement for your hardware's vertex cache to even activate, so if you're not using indexes you by definition do not have a vertex cache).

So if your indexes are jumping randomly around your vertex buffer, there is a higher likelihood that an upcoming vertex is not already in the cache, and a higher likelihood that vertices which would otherwise be reused are replaced in the cache sooner than they should be.

It's also important to make sure that the index sizes and values you use are actually supported in hardware. Thankfully 32-bit support is now essentially ubiquitous (with the possible exception of some mobile devices), but one often sees GL tutorial code using GL_UNSIGNED_BYTE... which brings me to the next point...

It's tempting to see indexed drawing as being all about memory saving because that's something that's directly measurable by you in your own code, but it's really only a small part of the story. Getting more efficient vertex cache usage is where the real performance benefit lies, as well as the ability to stitch together multiple disjoint primitives (and mix fans and strips) without needing to invoke the Cthulhu that is degenerate triangles.
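
How well an index ordering uses the cache can be estimated offline. Below is a rough, hypothetical diagnostic that replays an index stream through a simulated FIFO post-transform cache; real hardware caches behave differently, so treat the numbers as relative, not absolute:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Average Cache Miss Ratio: transformed vertices per triangle when the
// index stream is replayed through a FIFO cache of `cacheSize` entries.
// Lower is better; a cache-oblivious ordering tends toward 3.0.
double averageCacheMissRatio(const std::vector<std::uint32_t>& indices,
                             std::size_t cacheSize) {
    std::deque<std::uint32_t> cache;
    std::size_t misses = 0;
    for (std::uint32_t idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
            ++misses;                       // vertex must be re-shaded
            cache.push_back(idx);
            if (cache.size() > cacheSize)
                cache.pop_front();          // FIFO eviction
        }
    }
    return static_cast<double>(misses) / (indices.size() / 3);
}
```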



#9 Aks9   Members   -  Reputation: 914


Posted 11 November 2012 - 07:47 AM

...That shouldn't be read as an all-out proscription on using indexed drawing; rather it's a caution to make sure that your verts and indexes are properly ordered so as to make the most optimal use of your hardware's vertex caches (in fact, indexed drawing is a requirement for your hardware's vertex cache to even activate, so if you're not using indexes you by definition do not have a vertex cache)...

Can anyone provide a useful link, or results from any experiment, that would confirm the story about the post-transform vertex cache on Fermi and later cards?
I have read a lot about it (and implemented some of the schemes), and there are indeed benefits on older cards, but I saw no improvement on Fermi.
It also depends significantly on the driver and on how vertices are distributed between multiple processing units. We would have to delve deep into GPU architecture and driver design to get a correct answer; it is much simpler to carry out experiments. That's why I'm asking for your results: are there any benefits to optimizing indexing on modern GPUs?

#10 mhagain   Crossbones+   -  Reputation: 8276


Posted 11 November 2012 - 11:02 AM

Can anyone provide a useful link, or results from any experiment, that would confirm the story about the post-transform vertex cache on Fermi and later cards?


I'd be interested in seeing that too, but until then it's reasonable to assume that one doesn't want to restrict one's target hardware to post-Fermi only.



#11 max343   Members   -  Reputation: 340


Posted 11 November 2012 - 06:13 PM

Can anyone provide a useful link, or results from any experiment, that would confirm the story about the post-transform vertex cache on Fermi and later cards?
I have read a lot about it (and implemented some of the schemes), and there are indeed benefits on older cards, but I saw no improvement on Fermi.
It also depends significantly on the driver and on how vertices are distributed between multiple processing units. We would have to delve deep into GPU architecture and driver design to get a correct answer; it is much simpler to carry out experiments. That's why I'm asking for your results: are there any benefits to optimizing indexing on modern GPUs?


The biggest difference in the memory department in Fermi was that NVIDIA introduced a proper L1/L2 hierarchy: roughly 768 KB of L2 and 16 KB of L1 (in the default configuration). This means you can now reason about how to use the cache well, or how to defeat it (if you're really into that). Before Fermi, caching in NVIDIA's hardware was basically a black box that involved a lot of finger crossing.
Since the introduction of a normal cache architecture, raw VRAM transfer rate is no longer the interesting number; L2 misses are. A rule of thumb is that a miss at one cache level costs roughly 10 times more cycles to fetch the same data from the next level up, and you always start from L1, where fetches are very cheap.

With larger buffers, all you'll probably see is capacity misses, and they'll generally apply only to L1. These are not so bad, and this is what was being measured up until now; except that there was no L1/L2 on pre-Fermi hardware, so essentially you'd pay something like an L2 miss.
Triggering L2 misses, on the other hand, won't go unnoticed. They're not hard to trigger: keep in mind that the cache line size is 128 bytes (so there are about 6K lines), assume some associativity/eviction policy, and then use any of the widely known ways to create conflict misses for that configuration. Random jumps in the buffer won't give the desired result, as they'll map themselves uniformly onto the L2, but a simple regular pattern should do the trick quite well.
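
A hypothetical sketch of such a "simple pattern": byte offsets that all map to the same cache set, assuming a 768 KB L2 with 128-byte lines (close to the figures above) and a guessed 16-way associativity, since the real organization isn't documented. Touching more of these than the cache has ways forces conflict misses:

```cpp
#include <cstddef>
#include <vector>

// Byte offsets that all map to the same cache set: with `sets` sets of
// `lineBytes` lines, addresses a multiple of sets*lineBytes apart share
// a set, so touching more than `ways` of them evicts earlier entries.
std::vector<std::size_t> conflictPattern(std::size_t cacheBytes,
                                         std::size_t lineBytes,
                                         std::size_t ways,
                                         std::size_t count) {
    const std::size_t sets   = cacheBytes / (lineBytes * ways);
    const std::size_t stride = sets * lineBytes;   // same-set stride
    std::vector<std::size_t> offsets(count);
    for (std::size_t i = 0; i < count; ++i)
        offsets[i] = i * stride;
    return offsets;
}
```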



