Mind if we discuss VBO streaming?

Greetings.

I've decided to do some experiments with different methods of VBO streaming in order to see what kind of performance I can get. Most of the ideas I used came from this discussion, but I'm a little reluctant to necro that thread since it is so old.

Basically, I don't have any goal in mind except to maximize the number of verts I can render for a particle system (or anything else that requires geometry that is updated every frame). So, I've set up a test project that generates a bunch of particles and renders them as falling snow. The verts contain positions only (no normals, texture coordinates, or anything like that). The primitive type is GL_POINTS, texturing and lighting are disabled, and there is absolutely nothing else in the scene. I'm trying to minimize all the variables so I can concentrate specifically on the performance issues inherent in moving a bunch of vertex data from the CPU to the GPU. Here is a screenshot:

[screenshot: the falling-snow test scene]

I've decided to try three different approaches and compare them:

1. Arrays

With this method, I use straight vertex arrays without any sort of VBO. This is my baseline. It should be the slowest because it is not asynchronous.

2. VBO w/ Orphan

With this method, I use a VBO. However, as an added twist, I call glBufferData each frame, passing NULL as the data. That way, it will allocate a new buffer for me, thus orphaning the old buffer. I then use glMapBuffer to set the data.

3. VBO w/ glMapBufferRange

This is similar to Method 2 above, except that when I call glBufferData, I allocate a chunk that is much larger than I need (say, 10x larger). That basically gives me 10 segments that I can treat like 10 separate buffers. I use glMapBufferRange to load data into the first segment, and then I use glVertexPointer and glDrawArrays to draw that segment. On the next frame, I use glMapBufferRange to load data into the second segment, and then I draw that segment. Each time I call glMapBufferRange, I pass in GL_MAP_UNSYNCHRONIZED_BIT so that it won't block. Yet I can be sure there won't be any read/write collisions, because each frame uses a different part of the buffer than the last. Once all 10 segments are used up, I call glBufferData again (with NULL as the data pointer) to orphan the old buffer and start a new one. A simplified sketch of methods 2 and 3 follows below.
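
Roughly, the per-frame update for methods 2 and 3 looks like the following (a simplified sketch only; particlePositions, vertCount, and the segment counter are placeholders, and error checking is omitted):

```cpp
#include <GL/glew.h>   // or whatever GL loader is in use
#include <cstring>     // memcpy

// Method 2: orphan the buffer each frame, then map and fill it.
void drawMethod2(GLuint vbo, const float* particlePositions, GLsizei vertCount)
{
    const size_t bytes = vertCount * 3 * sizeof(float);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // Passing NULL orphans the old storage; the driver hands us a fresh block.
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);

    void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    memcpy(dst, particlePositions, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)0);
    glDrawArrays(GL_POINTS, 0, vertCount);
    glDisableClientState(GL_VERTEX_ARRAY);
}

// Method 3: one buffer ~10x larger than a frame's worth of data, written
// segment by segment with an unsynchronized map; orphan only when all
// segments have been used. 'segment' should start at SEGMENTS so the first
// call allocates the storage.
const int SEGMENTS = 10;

void drawMethod3(GLuint vbo, int& segment, const float* particlePositions, GLsizei vertCount)
{
    const size_t segmentBytes = vertCount * 3 * sizeof(float);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    if (segment == SEGMENTS) {
        // All segments consumed: orphan the whole buffer and start over.
        glBufferData(GL_ARRAY_BUFFER, SEGMENTS * segmentBytes, NULL, GL_STREAM_DRAW);
        segment = 0;
    }

    const size_t offset = segment * segmentBytes;
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, offset, segmentBytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(dst, particlePositions, segmentBytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)offset);
    glDrawArrays(GL_POINTS, 0, vertCount);
    glDisableClientState(GL_VERTEX_ARRAY);
    ++segment;
}
```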

Method 3 seems to be what everyone recommends, and Method 1 was sure to be the slowest. What I found, in fact, was that they had exactly the same performance! If anything, Method 1 was a little faster (but it was close enough to call it a tie). For about 22,000 verts, each took about 2.2 ms per frame. For 400,000 verts, each took about 50 ms per frame. There was no difference in either case. I even tried switching the buffer usage between GL_STREAM_DRAW and GL_DYNAMIC_DRAW, to no avail. VSync is disabled. So, I'm pretty stumped.

I have a few questions about this:

1. Do those frame times sound reasonable? 2.2 ms for 22k verts and 50 ms for 400k verts? My computer isn't super powerful; it's a laptop with an Intel Core i5 at 2.30 GHz, 4 GB of RAM, and a GeForce GT 550M. I know it's impossible to tell me how it should perform from these stats alone, but maybe if I'm an order of magnitude off, it will jump out at someone.

2. It seems like whatever my bottleneck is, it isn't affected by which method I choose. Where should I look next to find the bottleneck? I've looked at it through gDEBugger, and unfortunately it won't show me the orphaned VBOs, so I'm not 100% sure whether I'm filling up memory. I will say this: I've timed the updating of the particles on the CPU side, and I've also timed the glMapBufferRange/memcpy/glUnmapBuffer code block, and neither takes more than a fraction of a millisecond. I was kind of surprised by that last one, actually, because I expected the memcpy to be the bottleneck.

3. Has anyone ever implemented a particle system using transform feedback? I haven't looked into this OpenGL feature at all, but it seems like it would allow one to upload the verts to the GPU once, transform them while performing the physics in the vertex shader, and then store the transformed verts in a different VBO. Then, one could use that "Result VBO" as the starting point for the next frame and just ping pong VBOs that way. I'm not entirely sure I understand the transform feedback feature yet, though.
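
From what I've gathered so far, the update pass would look roughly like this (just a sketch of my current understanding, not tested; it assumes a vertex shader whose output variable, here called outPosition, was registered with glTransformFeedbackVaryings before linking):

```cpp
#include <GL/glew.h>  // or whatever GL loader is in use

// Ping-pong update pass: read positions from vbo[src], run the physics in
// the vertex shader, capture the shader's output into vbo[dst].
// Assumes the program was linked after something like:
//   const char* varyings[] = { "outPosition" };
//   glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
void updateParticles(GLuint program, GLuint vbo[2], int& src, GLsizei particleCount)
{
    const int dst = 1 - src;

    glUseProgram(program);

    // The source buffer feeds the vertex shader as attribute 0...
    glBindBuffer(GL_ARRAY_BUFFER, vbo[src]);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const GLvoid*)0);

    // ...and the destination buffer captures what the shader writes out.
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, vbo[dst]);

    glEnable(GL_RASTERIZER_DISCARD);        // update-only pass, draw nothing
    glBeginTransformFeedback(GL_POINTS);
    glDrawArrays(GL_POINTS, 0, particleCount);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);

    src = dst;  // next frame reads what was just written; render from vbo[src]
}
```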
Ah, well I guess my computer was choosing "Integrated Graphics" when running my app. It's usually pretty good about choosing the right graphics chip, but not this time.

I set it to use the dedicated GPU globally, and that did give each method a significant performance boost, with Method 3 performing about 33% better than Method 1.

When rendering 600k verts, I now get 131fps with Method 1 and 175fps with Method 3.

600k verts! That doesn't sound too shabby to me. =)
You have to remember that the driver is responsible for doing all the dirty work on the back end, and it doesn't guarantee that any of the above methods will do what they are supposed to do, only that the resulting image will be the same. So Method 1 might actually be doing what Method 3 does behind the scenes, but again, it's completely up to the driver to implement things however it sees fit.
3. Has anyone ever implemented a particle system using transform feedback? I haven't looked into this OpenGL feature at all, but it seems like it would allow one to upload the verts to the GPU once, transform them while performing the physics in the vertex shader, and then store the transformed verts in a different VBO. Then, one could use that "Result VBO" as the starting point for the next frame and just ping pong VBOs that way. I'm not entirely sure I understand the transform feedback feature yet, though.
I have.

My first realtime particle system stored things in textures, which were updated with FBOs and GLSL shaders. It was a mess, but it worked.

My second used transform feedback. It's a really powerful feature. With every particle rendered as a single pixel, it handled 2.7 million particles at 55 fps (GeForce 580M). The main problem ended up being rasterization and memory.

However, even better is my OpenCL particle system. Not only was it a WHOLE LOT simpler than the transform feedback implementation, it also runs significantly faster: it was running 4 million particles at 60 fps, and it took 8 million particles to bring it down to 42.

My advice: it's good to know what the fastest method of data transfer is, but the real problem is having to do that data transfer in the first place. Basically everything cool can be done on the GPU, and a particle system is probably the best example. Skip transform feedback and roll an OpenCL mini-library. Then implement everything in it and skip the graphics bus.
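
For what it's worth, the GL interop side of the OpenCL approach is small; the pattern is basically this (a minimal sketch; creation of the context with GL sharing, the queue, and the update kernel is omitted, and names like updateKernel are assumptions):

```cpp
#include <CL/cl_gl.h>  // OpenCL/OpenGL sharing API

// Share the particle VBO with OpenCL so the update never crosses the bus.
// 'context' must have been created with GL-sharing properties
// (CL_GL_CONTEXT_KHR etc.); 'queue' and 'updateKernel' setup is omitted.
cl_int err;
cl_mem clVbo = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, vbo, &err);

// Per frame: let GL finish, hand the buffer to CL, run the update kernel,
// hand it back, then draw the VBO with glDrawArrays as usual.
glFinish();
clEnqueueAcquireGLObjects(queue, 1, &clVbo, 0, NULL, NULL);

clSetKernelArg(updateKernel, 0, sizeof(cl_mem), &clVbo);
clSetKernelArg(updateKernel, 1, sizeof(float), &dt);
size_t globalSize = particleCount;
clEnqueueNDRangeKernel(queue, updateKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

clEnqueueReleaseGLObjects(queue, 1, &clVbo, 0, NULL, NULL);
clFinish(queue);
```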

-Ian


For fun, this is what 8 million particles looks like:
[screenshot: 8 million particles rendered as points]


. . . a solid.

[size="1"]And a Unix user said rm -rf *.* and all was null and void...|There's no place like 127.0.0.1|The Application "Programmer" has unexpectedly quit. An error of type A.M. has occurred.
[size="2"]

That is an interesting comparison, Geometrian. I always thought transform feedback would have minimal impact on performance compared to compute shaders and OpenCL, but from your example it seems it makes a lot of difference. Are there any settings you could tweak to optimize the performance?

Perhaps this is going a bit off topic, but I am curious to know. Does anyone know how compute shaders compare to OpenCL and transform feedback?
Yeah, I'm actually extremely interested in that comparison. When I read about transform feedback, that's pretty much exactly what it sounded like to me -- OpenCL-like functionality within OpenGL. I have to say, Geometrian, your post was very enlightening.

slicer4ever, that's a good point.
I've done some work with compute shaders in D3D (not GL), and in general I've found that they're a good bit slower for this kind of simple use case, partially because of resource binding rules and partially because current hardware incurs some overhead when switching compute shaders on and off each frame. Future hardware should be better, of course, and more complex use cases do exist where the gain from using a CS outweighs the overhead; something like this is just currently more likely to come out on the wrong side of the tradeoff.

In all cases your option (3) should be faster, but buffer updates aren't the only cost in your code. You're also drawing a lot of particles, so you have quite substantial fillrate overhead too, and with a sufficiently high number of particles that's just going to swamp the time taken to do buffer updates (especially when you come to do more complex per-fragment ops on them). That's what I'd identify as a major cause of your times coming out more equal than they otherwise would be.


Interesting. Do you suppose I could test that by making sure that the particles are out of view, and thus clipped?
About test setup #3:

I would use a sync object instead of orphaning: no need to trash the memory, just reuse it (with range invalidation [a GPU->CPU readback is VERY costly and unnecessary here] and the unsynchronized flag, of course). Also, roughly 3x-4x buffer size overhead should be more than enough (my own streaming world-geometry buffer has a lag of only 2 frames). Please test it; I'm expecting an improvement.
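
Roughly what I have in mind (a sketch only; region count and layout are up to you):

```cpp
#include <GL/glew.h>

// One fence per buffer region: wait on a region's fence before overwriting
// it, so the buffer can be reused forever without orphaning.
const int REGION_COUNT = 4;                 // ~3x-4x a frame's worth of data
GLsync regionFence[REGION_COUNT] = {};

void* mapRegion(GLuint vbo, int region, size_t regionBytes)
{
    // If the GPU might still be reading this region, wait until it is done.
    if (regionFence[region]) {
        glClientWaitSync(regionFence[region], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1));
        glDeleteSync(regionFence[region]);
        regionFence[region] = 0;
    }

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    return glMapBufferRange(GL_ARRAY_BUFFER, region * regionBytes, regionBytes,
                            GL_MAP_WRITE_BIT |
                            GL_MAP_INVALIDATE_RANGE_BIT |   // never read old data back
                            GL_MAP_UNSYNCHRONIZED_BIT);     // we do our own syncing
}

// Call right after issuing the draw that reads from this region:
void fenceRegion(int region)
{
    regionFence[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```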

Bottleneck:
* CPU bound: per-vertex calculations on the CPU side can be quite costly. IF you cannot do them on the GPU side, then de-interleaving the vertex positions might speed things up considerably by letting you use vector instructions (SSE2 or whatever else is available); see the sketch after this list.
* GPU bound: hell no, the GPU is bored to death ... it is hardly doing anything with those vertices in comparison.
* Transfer bound: I somewhat suspect your test implementation is mostly CPU bound, but the transfer probably still takes quite a lot of time. If your streaming is done right (can you show the relevant code?), then you cannot do any better there and need to consider alternatives (transform feedback / OpenCL, as mentioned).
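
To illustrate the vector-instruction point: a sketch only, assuming positions and velocities are kept as 16-byte-aligned, tightly packed float arrays padded to a multiple of four floats (the function name is just for illustration):

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// pos += vel * dt, four floats per instruction. 'count' is the total number
// of floats (i.e. 3 * particle count), padded up to a multiple of 4.
void integratePositions(float* pos, const float* vel, size_t count, float dt)
{
    const __m128 vdt = _mm_set1_ps(dt);
    for (size_t i = 0; i < count; i += 4)
    {
        __m128 p = _mm_load_ps(pos + i);
        __m128 v = _mm_load_ps(vel + i);
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_store_ps(pos + i, p);
    }
}
```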
Orphaning is actually an incredibly cheap operation - D3D has been doing it since version 7 back in the 1990s or thereabouts, and drivers/hardware are built around this usage scheme. With orphaning, the first one or two times you may get some memory allocation, but after that the driver is able to detect that the programmer is doing this and will keep previously used copies of the buffer memory around, flipping between them as required - classic double-buffering, but managed for you by the driver. So memory trashing doesn't occur; the driver will just hand back a chunk of memory that had previously been used, and all is well.

If you're still concerned about that, there is an alternative method. If you size your buffer so that it's always at least large enough for 3 or more frames of data, you don't even need to orphan at all: by the time you reset your position back to 0, the GPU will already have finished with the vertices at the start of the buffer, so you can just map the range as normal. This obviously needs you to know a bit more about how much data you're pushing, what your maximums are, etc., but if your use case meets those requirements it can work very well. No need for sync objects, and everything runs smoothly.
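
In rough code the idea is something like this (just a sketch, assuming a known per-frame maximum):

```cpp
#include <GL/glew.h>
#include <cstring>

// The buffer holds several frames' worth of data; by the time the write
// position wraps back to 0 the GPU has long since finished reading that
// part, so neither orphaning nor sync objects are needed.
const int FRAMES_IN_FLIGHT = 3;
size_t    writeOffset      = 0;   // persists across frames

// Returns the offset to pass to glVertexPointer/glDrawArrays for this frame.
size_t streamVerts(GLuint vbo, const float* data, size_t bytes, size_t frameBytesMax)
{
    // 'vbo' was allocated once up front with FRAMES_IN_FLIGHT * frameBytesMax bytes.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    if (writeOffset + bytes > FRAMES_IN_FLIGHT * frameBytesMax)
        writeOffset = 0;          // wrap: this part of the buffer is free again

    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, writeOffset, bytes,
                                 GL_MAP_WRITE_BIT);  // a plain mapped range is enough here
    memcpy(dst, data, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    const size_t drawOffset = writeOffset;
    writeOffset += bytes;
    return drawOffset;
}
```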


