I've decided to do some experiments with different methods of VBO streaming in order to see what kind of performance I can get. Most of the ideas I used came from this discussion, but I'm a little bit reluctant to necro that thread since it is so old.
Basically, I don't have any goal in mind except to maximize the number of verts that I can render for a particle system (or something else that requires geometry that is updated every frame). So, I've set up a test project that generates a bunch of particles and renders them as falling snow. The verts contain positions only (no normals, texture coordinates, or anything like that). The primitive type is GL_POINTS, texturing and lighting are disabled, and there is absolutely nothing else in the scene. I'm trying to eliminate as many variables as possible so I can concentrate specifically on the performance issues inherent in moving a bunch of vertex data from the CPU to the GPU. Here is a screenshot:

I've decided to try three different approaches and compare them:
1. Arrays
With this method, I use straight vertex arrays without any sort of VBO. This is my baseline. It should be the slowest because it is not asynchronous.
2. VBO w/ Orphan
With this method, I use a VBO. However, as an added twist, I call glBufferData each frame, passing NULL as the data. That way, it will allocate a new buffer for me, thus orphaning the old buffer. I then use glMapBuffer to set the data.
3. VBO w/ glMapBufferRange
This is similar to method 2 above, except that when I call glBufferData, I allocate a chunk that is much larger than I need (say, 10X larger). So, that basically gives me 10 segments that I can treat like 10 separate buffers. I use glMapBufferRange to load data into the first segment, and then I use glVertexPointer and glDrawArrays to draw that segment. On the next frame, I use glMapBufferRange to load data into the second segment, and then I draw that segment. Each time I call glMapBufferRange, I pass in GL_MAP_UNSYNCHRONIZED_BIT so that it won't block. Yet, I can be sure there won't be any read/write collisions because each frame uses a different part of the buffer than the last. Once all 10 segments are used up, I call glBufferData again (with NULL as the data ptr) to orphan the old buffer and start a new one.
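To make the segment bookkeeping in method 3 concrete, here is a rough sketch of just the offset logic in plain C. It is not the code from the test project: StreamRing, ring_init and ring_next_offset are made-up names for illustration, and the GL calls appear only in comments because they need a live context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: tracks which segment of an oversized VBO to fill next. */
typedef struct {
    size_t   segment_bytes;  /* bytes needed for one frame of vertex data */
    unsigned segment_count;  /* e.g. 10 for a buffer 10X larger than needed */
    unsigned next;           /* index of the next unused segment */
} StreamRing;

static void ring_init(StreamRing *ring, size_t segment_bytes, unsigned segment_count)
{
    assert(segment_bytes > 0 && segment_count > 0);
    ring->segment_bytes = segment_bytes;
    ring->segment_count = segment_count;
    ring->next = segment_count;  /* forces an allocation on first use */
}

/* Returns the byte offset to map this frame. Sets *orphan to 1 when the
 * caller should first re-specify the whole buffer, e.g.
 *   glBufferData(GL_ARRAY_BUFFER, segment_bytes * segment_count, NULL, GL_STREAM_DRAW);
 * and then map the segment with
 *   glMapBufferRange(GL_ARRAY_BUFFER, offset, segment_bytes,
 *                    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
 */
static size_t ring_next_offset(StreamRing *ring, int *orphan)
{
    *orphan = (ring->next == ring->segment_count);
    if (*orphan)
        ring->next = 0;  /* fresh storage: start again at segment 0 */
    return (size_t)ring->next++ * ring->segment_bytes;
}
```

The returned offset is also what the draw for that frame needs, either as the byte offset passed to glVertexPointer or, divided by the vertex size, as the first-vertex argument of glDrawArrays.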
Method 3 seems to be what everyone is recommending, and Method 1 was sure to be the slowest. What I found, in fact, was that they both had the same exact performance! In fact, Method 1 was a little faster (but it was close enough to call it a tie). For about 22,000 verts, each took about 2.2 ms per frame. For 400,000 verts, each took about 50 ms per frame. There was no difference in either case. I even tried changing the buffer usage between GL_STREAM_DRAW and GL_DYNAMIC_DRAW to no avail. VSync is disabled. So, I'm pretty stumped.
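For reference, here is the arithmetic behind those frame times as a small C sketch. The 12 bytes per vertex is an assumption (three 32-bit floats for a position-only vertex), and the function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: one vertex is three 32-bit floats (x, y, z) = 12 bytes. */
enum { BYTES_PER_VERT = 3 * 4 };

/* Vertices handled per millisecond of frame time. */
static double verts_per_ms(double verts, double frame_ms)
{
    assert(frame_ms > 0.0);
    return verts / frame_ms;
}

/* Upload rate in megabytes (10^6 bytes) per second implied by one frame. */
static double upload_mb_per_s(double verts, double frame_ms)
{
    assert(frame_ms > 0.0);
    double bytes = verts * BYTES_PER_VERT;
    return bytes / (frame_ms / 1000.0) / 1e6;
}
```

Plugging in the measurements gives about 10,000 verts/ms at 22,000 verts and 8,000 verts/ms at 400,000 verts, or roughly 120 MB/s and 96 MB/s under the 12-byte assumption. So the cost per vertex is nearly constant across both sizes, and the implied upload rate is modest, which seems consistent with the raw transfer not being the limiting factor.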
I have a few questions about this:
1. Do those frame times sound reasonable? 2.2 ms for 22k verts and 50 ms for 400k verts? My computer isn't super-powerful; it's a laptop with Intel Core i5 2.30GHz, 4.00 GB RAM, GeForce GT 550M. I know it's impossible to tell me how it should perform from these stats, but maybe if I'm an order of magnitude off, it will jump out at someone.
2. It seems like whatever my bottleneck is, it isn't affected by which method I choose. Where should I look next to find the bottleneck? I've looked at it through gDEBugger, and unfortunately it won't show me the orphaned VBOs, so I'm not 100% sure whether I'm filling up memory. I will say this: I've timed the updating of the particles on the CPU side, and I've also timed the glMapBufferRange/memcpy/glUnmapBuffer code block, and neither takes more than a fraction of a millisecond. I was kind of surprised about that last one, actually, because I expected the memcpy to be the bottleneck.
3. Has anyone ever implemented a particle system using transform feedback? I haven't looked into this OpenGL feature at all, but it seems like it would allow one to upload the verts to the GPU once, transform them while performing the physics in the vertex shader, and then store the transformed verts in a different VBO. Then, one could use that "Result VBO" as the starting point for the next frame and just ping pong VBOs that way. I'm not entirely sure I understand the transform feedback feature yet, though.
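As far as I understand it, the ping-pong part of that idea would look something like the sketch below. This is only a CPU stand-in in plain C, not real transform feedback: Vert, step_particle and ping_pong_step are made-up names, the constant fall speed is an assumption, and the GL calls that would replace the loop are mentioned only in comments.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vert;

/* Stand-in for the physics a vertex shader would do. */
static Vert step_particle(Vert v, float dt)
{
    v.y -= 1.0f * dt;  /* assumed fall speed of 1 unit per second */
    return v;
}

/* One frame: read from buffers[*src], write into the other buffer, then swap.
 * With real transform feedback the loop would be a draw call issued between
 * glBeginTransformFeedback(GL_POINTS) and glEndTransformFeedback(), with the
 * destination VBO bound via glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, ...). */
static void ping_pong_step(Vert *buffers[2], int *src, size_t count, float dt)
{
    assert(*src == 0 || *src == 1);
    int dst = 1 - *src;
    for (size_t i = 0; i < count; ++i)
        buffers[dst][i] = step_particle(buffers[*src][i], dt);
    *src = dst;  /* this frame's result is next frame's input */
}
```

The point of the swap is that neither buffer is ever read and written in the same pass, so the vertex data could stay on the GPU after the initial upload.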







