First of all: I've found that the reason for the slowdown does not seem to lie in how I upload the data. See below.
But to answer your question: My VBO handler works like this:
- handler.nextFrame() is called when a new frame starts. It notifies the handler that a new frame has started and that it should start using the next set of VBOs. When X (= 6) frames have passed it will loop around and start using the first set of buffers again.
- handler.nextBuffer() retrieves a previously allocated VBO. If it runs out of VBOs (the number of VBOs needed per frame depends on how many different types of meshes are drawn) it allocates a new one and stores it so that it can be reused later.
- vbo.ensureCapacity() ensures that the retrieved VBO has enough capacity. If it is too small, it calls glBufferData() to resize it to the requested capacity.
- vbo.map()/unmap() simply call glMapBufferRange() and glUnmapBuffer().
All in all, this means that new buffers are almost never allocated except during the first 6 frames (the number of mesh types is constant), and that they are resized almost as rarely. Once there are enough buffers and those buffers are big enough, the handler doesn't have to do anything at all. In my test scene this stabilizes after looking around for about 2-3 seconds. The reduced VRAM efficiency is not a problem since each buffer is less than 50 KB in size.
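The handler described above can be sketched roughly like this. It is only a model of the bookkeeping, not my actual code: the Vbo class is a stand-in that tracks capacity instead of calling glBufferData(), and the names VboHandler, allocations and reallocations are made up for illustration.

```python
class Vbo:
    """Stand-in for a GL buffer object; it only tracks its capacity in bytes."""

    def __init__(self):
        self.capacity = 0
        self.reallocations = 0

    def ensure_capacity(self, required):
        # Real code would call glBufferData() here to resize the buffer.
        if self.capacity < required:
            self.capacity = required
            self.reallocations += 1


class VboHandler:
    """Keeps one list of reusable VBOs per frame in flight (6 in the post)."""

    def __init__(self, frames_in_flight=6):
        self.sets = [[] for _ in range(frames_in_flight)]
        self.frame = 0
        self.cursor = 0
        self.allocations = 0

    def next_frame(self):
        # Move on to the next set of buffers, looping around after the last one.
        self.frame = (self.frame + 1) % len(self.sets)
        self.cursor = 0

    def next_buffer(self):
        current = self.sets[self.frame]
        if self.cursor == len(current):
            # Ran out of buffers for this frame: allocate one and keep it for reuse.
            current.append(Vbo())
            self.allocations += 1
        vbo = current[self.cursor]
        self.cursor += 1
        return vbo


# Simulate 12 frames that each need three buffers of fixed (hypothetical) sizes.
handler = VboHandler()
for _ in range(12):
    handler.next_frame()
    for size in (1000, 2000, 3000):
        handler.next_buffer().ensure_capacity(size)
print(handler.allocations)
```

With a constant number of mesh types, the allocation count stops growing once every one of the six sets has been used once, which is the steady state described above.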
I did some profiling to identify potential bottlenecks, and discovered something very suspicious: glDrawElementsInstanced() is taking a very large chunk of the time it takes to render a frame! The test was set up like this:
- The rendering resolution was set to 192x108 (1920/10 x 1080/10) to ensure I'm not fragment limited.
- The view distance was increased a lot so much more of the world is visible.
- The meshes were replaced with simple quads (4 verts, 2 tris).
This resulted in around 101 000 meshes being drawn per frame. Then I modified my pseudo-instancing renderer to draw with glDrawElementsInstanced() instead. Data is still uploaded using glUniform() and rendered in batches of 128 as usual, so this completely eliminates any use of (dynamic) VBOs! I did the test on my GTX 295 with only one GPU enabled. Profiling the pseudo-instancing code with plain glDrawElements() gave the following:
- Runs at 112 FPS.
- Frustum culling takes 80.6% of the time.
- glUniform() takes 8.1% of the time and glDrawElements() takes 2.4%, for a total of 10.5%.
- The remaining 8.9% are miscellaneous OpenGL calls and some collision detection.
- GPU-Z reports 53% GPU usage, 9% memory controller load.
These results are pretty much expected.
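For reference, this is how 101 000 meshes in batches of 128 end up as 790 draw calls. A small sketch only; batch_ranges is a made-up helper and not part of my renderer.

```python
BATCH_SIZE = 128  # instances per draw call, as in the test above


def batch_ranges(instance_count, batch_size=BATCH_SIZE):
    """Yield (start, count) pairs that together cover all instances."""
    for start in range(0, instance_count, batch_size):
        yield start, min(batch_size, instance_count - start)


batches = list(batch_ranges(101_000))
print(len(batches))   # number of draw calls per frame
print(batches[-1])    # the last batch is only partially filled
```

Each (start, count) pair corresponds to one round of glUniform() uploads followed by one draw call.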
Pseudo-instancing code with glDrawElementsInstanced() call instead:
- Runs at 54 FPS.
- glDrawElementsInstanced() suddenly accounts for 49.4% of the CPU time!
- Frustum culling takes only 37.7% of the time.
- glUniform() takes around 7.5% of the time.
- GPU-Z reports 88% GPU usage, 4% memory controller load.
These results pretty much prove that this has got to be a driver bug, and that I mistakenly blamed glMapBufferRange() for the slowness. The inflated GPU load makes no sense. The weirdest part is that the CPU overhead of glDrawElementsInstanced() seems to scale with the number of instances drawn, effectively making it pretty useless. Of course it's faster to do one glDrawElementsInstanced() call than to do 101 000 glDrawElements() calls each frame, but batching those meshes together into 790 glDrawElements() calls is still more than 10x faster!
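To put the two draw call figures on the same scale, the percentages can be converted to milliseconds per frame. This is a rough estimate that assumes the profiler percentages are shares of the total frame time (1000 / FPS), which may not be exactly true; call_ms is a made-up helper.

```python
def call_ms(fps, share):
    """Milliseconds per frame spent in a call, given the FPS and its share of the frame."""
    return 1000.0 / fps * share


batched_ms = call_ms(112, 0.024)    # glDrawElements(), 790 calls per frame
instanced_ms = call_ms(54, 0.494)   # glDrawElementsInstanced()
ratio = instanced_ms / batched_ms

print(f"batched:   {batched_ms:.2f} ms per frame")
print(f"instanced: {instanced_ms:.2f} ms per frame")
print(f"ratio:     {ratio:.1f}x")
```

Under that assumption the batched draw calls cost roughly 0.2 ms per frame while the instanced calls cost roughly 9 ms, which is well beyond the 10x mentioned above.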