Ah, if only I could point you to an older thread about this kind of thing. Sadly I can't seem to search keywords in my own posts only, so I can't find it.
Anyway, ARB_shader_draw_parameters doesn't seem to be supported on any GL 3 hardware (and I've read that for some reason it has a non-negligible performance impact). So the idea is to work around that using this beautifully named draw call:
glDrawElementsInstancedBaseVertexBaseInstance
It comes from the ARB_base_instance extension, made core in 4.2, which is also supported by all the GL 3 hardware you should care about.
Now this gives you a couple of things:
A way to specify a vertex offset to start drawing from in a vertex buffer.
A way to specify an index offset to start drawing from in an index buffer.
A way to specify the instance you're starting to draw from in instanced rendering.
With this you can have a single VAO, with all your meshes, and a way to combine instanced rendering with normal rendering in a single draw call (like Vulkan!).
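To make the "single VAO with all your meshes" part concrete, here is a rough sketch of the CPU-side bookkeeping. MeshPool, MeshRange and indexByteOffset are names I made up for illustration, and I'm assuming 32-bit indices and tightly packed float vertices. The actual GL call is left as a comment since it needs a live context:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of where one mesh lives inside the shared buffers.
struct MeshRange {
    std::uint32_t indexCount;  // number of indices the mesh uses
    std::uint32_t firstIndex;  // position of its first index in the shared index buffer
    std::int32_t  baseVertex;  // value added to every index fetched for this mesh
};

// Hypothetical CPU-side staging for one shared vertex buffer and one shared index buffer.
struct MeshPool {
    std::vector<float>         vertices;  // floatsPerVertex floats per vertex
    std::vector<std::uint32_t> indices;
    std::size_t                floatsPerVertex = 3;  // assumption: position only

    // Appends a mesh and returns the offsets needed to draw it later.
    MeshRange add(const std::vector<float>& meshVertices,
                  const std::vector<std::uint32_t>& meshIndices) {
        MeshRange range;
        range.indexCount = static_cast<std::uint32_t>(meshIndices.size());
        range.firstIndex = static_cast<std::uint32_t>(indices.size());
        range.baseVertex = static_cast<std::int32_t>(vertices.size() / floatsPerVertex);
        vertices.insert(vertices.end(), meshVertices.begin(), meshVertices.end());
        indices.insert(indices.end(), meshIndices.begin(), meshIndices.end());
        return range;
    }
};

// Byte offset to pass as the `indices` argument of the draw call.
inline std::size_t indexByteOffset(const MeshRange& range) {
    return range.firstIndex * sizeof(std::uint32_t);
}

// With a GL 4.2 / ARB_base_instance context, drawing `instanceCount` instances
// of `range` starting at `baseInstance` would look roughly like:
//   glDrawElementsInstancedBaseVertexBaseInstance(
//       GL_TRIANGLES, range.indexCount, GL_UNSIGNED_INT,
//       reinterpret_cast<const void*>(indexByteOffset(range)),
//       instanceCount, range.baseVertex, baseInstance);
```

Because each mesh keeps its own zero-based indices and relies on baseVertex, you never have to rewrite index data when appending to the pool.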
You need a way to bind these single/multiple mesh instances to their respective transform/material data, right? And uploading a single uniform per draw call won't cut it, since you might have instanced calls with more than one instance. So you need a way to upload data for several different instances, preferably without caring if they're of the same mesh or not, so you can issue a single upload and then make draw calls as needed that operate on that uploaded data.
Now the issue here is that the instance ID still works as usual, that is, it's zero-based. With a normal instanced draw call, if you tell it to draw 5 instances, it starts at zero. So if you're drawing 5 instances of mesh 10 in your draw list, you need it to fetch transform data at indices 10, 11, 12, 13 and 14, which means you need a way to tell the shader that it should start from index 10.
A way to get around this is to specify a separate instanced attribute buffer that contains the indices, from 0 up to the maximum number of instances you can draw with a single uniform buffer upload (usually 4096 with a 64 KB UBO limit).
Now the trick here is to tell the instanced draw call to start at instance 10, which will fetch the instanced attribute at index 10, which will be the real index you want (ten!). gl_InstanceID will still start at zero, but you don't care about that, because you got the index you want automatically fetched from the instanced attribute buffer.
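A small sketch of what goes into that index buffer and how the instance limit falls out of the UBO size. The function names and kInstanceIndexLocation are mine, the per-instance size is an assumption about your data layout, and the GL setup is only shown as comments:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// How many instances fit in one uniform buffer upload.
// Both arguments are assumptions about your setup: query GL_MAX_UNIFORM_BLOCK_SIZE
// for the real limit and use the std140 size of your own per-instance data.
inline std::size_t maxInstancesPerUpload(std::size_t uboBytes, std::size_t bytesPerInstance) {
    return uboBytes / bytesPerInstance;
}

// Contents of the instanced attribute buffer: element i simply holds the value i.
inline std::vector<std::uint32_t> makeInstanceIndexData(std::size_t maxInstances) {
    std::vector<std::uint32_t> data(maxInstances);
    std::iota(data.begin(), data.end(), 0u);
    return data;
}

// With a GL 3.3+ context, the upload and attribute setup would look roughly like
// (instanceIndexVbo and kInstanceIndexLocation are hypothetical names):
//   glBindBuffer(GL_ARRAY_BUFFER, instanceIndexVbo);
//   glBufferData(GL_ARRAY_BUFFER, data.size() * sizeof(std::uint32_t),
//                data.data(), GL_STATIC_DRAW);
//   glEnableVertexAttribArray(kInstanceIndexLocation);
//   glVertexAttribIPointer(kInstanceIndexLocation, 1, GL_UNSIGNED_INT, 0, nullptr);
//   glVertexAttribDivisor(kInstanceIndexLocation, 1);  // advance once per instance
```

For reference, 64 KB divided by 16 bytes per instance gives 4096, while a full mat4 per instance (64 bytes) would give 1024, so the limit depends on how much data you store per instance.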
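If it helps, this is roughly what the GPU does with that attribute, simulated on the CPU. For an attribute with divisor d, the element fetched for a given instance is instance / d plus the base instance, while gl_InstanceID does not include the base instance. simulateInstancedFetch and InstanceInputs are illustration names of mine, and the shader string is only a guess at a matching vertex shader, with made-up attribute and block names and an assumed 1024-entry mat4 array:

```cpp
#include <cstdint>
#include <vector>

// What one instance sees in the vertex shader.
struct InstanceInputs {
    std::uint32_t glInstanceID;  // what gl_InstanceID would read
    std::uint32_t fetchedIndex;  // value fetched from the instanced attribute buffer
};

// CPU simulation of instanced attribute fetching with a base instance.
inline std::vector<InstanceInputs> simulateInstancedFetch(
        const std::vector<std::uint32_t>& indexBuffer,
        std::uint32_t instanceCount,
        std::uint32_t baseInstance,
        std::uint32_t divisor = 1) {
    std::vector<InstanceInputs> out;
    for (std::uint32_t instance = 0; instance < instanceCount; ++instance) {
        // The base instance offsets the attribute fetch, not gl_InstanceID.
        const std::uint32_t element = instance / divisor + baseInstance;
        out.push_back({instance, indexBuffer.at(element)});
    }
    return out;
}

// Hypothetical vertex shader that consumes the fetched index.
static const char* kVertexShaderSketch = R"glsl(#version 330 core
layout(location = 0) in vec3 inPosition;
layout(location = 1) in uint inInstanceIndex;  // instanced attribute, divisor 1

layout(std140) uniform Transforms {
    mat4 modelViewProj[1024];  // assumption: 64 KB block, one mat4 per instance
};

void main() {
    gl_Position = modelViewProj[inInstanceIndex] * vec4(inPosition, 1.0);
}
)glsl";
```

Drawing 5 instances with a base instance of 10 gives gl_InstanceID values 0 to 4 but fetched indices 10 to 14, which is exactly the lookup you need into the uploaded data.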
So your render loop becomes something like:
for (allPassedShaderPrograms) {
    program.bind();
    do {
        // Fill the UBOs as much as you can with the render tasks' data.
        for (allUniformBuffersInThisBatch)
            update(renderTask);
        // Draw all the render tasks that had their data uploaded.
        for (allTasksThatCouldBeUpdated)
            draw(renderTask);
        // And repeat while there are tasks to draw.
    } while (thereAreMoreTasksToDraw);
}
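The "fill, draw, repeat" part of that loop can be sketched without any GL at all. This is not the code from my renderer, just a minimal version of the idea: RenderTask, PlannedDraw and planBatches are illustration names, and splitting a task that does not fit across two uploads is a choice I'm making here for simplicity:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical input: draw `instanceCount` instances of mesh `meshId`.
struct RenderTask {
    std::uint32_t meshId;
    std::uint32_t instanceCount;
};

// One draw call, with the base instance that points at its uploaded data.
struct PlannedDraw {
    std::uint32_t meshId;
    std::uint32_t instanceCount;
    std::uint32_t baseInstance;
};

// One buffer upload followed by the draws that read from it.
using Batch = std::vector<PlannedDraw>;

// Packs tasks into batches of at most `capacity` instances each.
inline std::vector<Batch> planBatches(const std::vector<RenderTask>& tasks,
                                      std::uint32_t capacity) {
    std::vector<Batch> batches;
    if (capacity == 0) return batches;  // nothing can be uploaded

    Batch current;
    std::uint32_t used = 0;  // instance slots already filled in the current upload
    for (const RenderTask& task : tasks) {
        std::uint32_t remaining = task.instanceCount;
        while (remaining > 0) {
            if (used == capacity) {  // buffer full: close this batch, start a new one
                batches.push_back(current);
                current.clear();
                used = 0;
            }
            const std::uint32_t n = std::min(remaining, capacity - used);
            current.push_back({task.meshId, n, used});
            used += n;
            remaining -= n;
        }
    }
    if (!current.empty()) batches.push_back(current);
    return batches;
}
```

Each Batch then maps to one pass of the do/while body: upload the data for its draws, then issue one base-instance draw call per PlannedDraw.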
You can find a more detailed explanation here if you read the PDF: http://www.gamedev.net/blog/2042/entry-2261259-from-yaml-to-renderer-in-50ms/
Keep in mind that my renderer is pretty basic, no multi-layered materials or anything, so your mileage may vary. But those are the basics to get more draw calls per buffer upload.
EDIT: Also, these kinds of approaches were described in NVIDIA's advanced OpenGL scene rendering presentations from GTC; they did one and updated it each year with different methodologies and benchmarks. You can google those. The idea is more or less the same: how to minimize buffer uploads, how to get the most out of your draw calls, and how to efficiently upload instance IDs for indexed resources (whether they're UBOs, TBOs, SSBOs, etc).