The CPU cost of a draw call depends on the state changes that preceded it.
As I see it, you'd have two ways of doing it. Either way you end up with one draw call per distinct mesh; the thing that differs is the UBO updating scheme (no scheme in the first one, a batching scheme in the second one):
- Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetching the transform at index 0. Repeat for every single mesh.
- Update the transform UBO with as many transforms as will fit, then issue one glDraw*Instanced call per mesh, increasing the base instance ID by one each time, until you run out of transforms in the UBO (using the instanced index buffer trick you mentioned, since the instance ID is always 0).
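To make the difference between the two schemes concrete, here is a rough sketch that only counts the calls each one would issue. It does not touch OpenGL at all: `CallLog`, `drawOneByOne` and `drawBatched` are made-up names for illustration, and the comments note where the real calls (for example glBufferSubData and glDrawElementsInstancedBaseInstance) would presumably go. The UBO capacity is an assumed parameter, not something queried from a driver.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GL backend: it only records what would be
// issued, so the sketch runs without a GL context.
struct CallLog {
    std::size_t uboUpdates = 0;           // e.g. glBufferSubData or a mapped write
    std::size_t drawCalls = 0;            // e.g. glDrawElementsInstancedBaseInstance
    std::vector<unsigned> baseInstances;  // base instance passed to each draw
};

// Scheme 1: upload one transform per mesh, always read it from slot 0.
CallLog drawOneByOne(std::size_t meshCount) {
    CallLog log;
    for (std::size_t mesh = 0; mesh < meshCount; ++mesh) {
        ++log.uboUpdates;                // new transform written to slot 0
        ++log.drawCalls;                 // single-instance draw
        log.baseInstances.push_back(0);  // base instance never changes
    }
    return log;
}

// Scheme 2: fill the UBO with as many transforms as fit, then issue one
// draw per mesh, bumping the base instance to select the transform slot.
// uboCapacity is the assumed number of transforms that fit in one UBO.
CallLog drawBatched(std::size_t meshCount, std::size_t uboCapacity) {
    assert(uboCapacity > 0);
    CallLog log;
    for (std::size_t first = 0; first < meshCount; first += uboCapacity) {
        const std::size_t count = std::min(uboCapacity, meshCount - first);
        ++log.uboUpdates;  // one large upload covering `count` transforms
        for (std::size_t slot = 0; slot < count; ++slot) {
            ++log.drawCalls;
            log.baseInstances.push_back(static_cast<unsigned>(slot));
        }
    }
    return log;
}
```

With 1000 meshes and an assumed capacity of 256 transforms per UBO, `drawOneByOne(1000)` records 1000 updates and 1000 draws, while `drawBatched(1000, 256)` records the same 1000 draws but only 4 updates. The draw count is identical; only the number of UBO updates changes.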
Apparently setting the base instance ID state is much cheaper than binding a new UBO, which makes sense, as there's a tonne of resource management code that has to run behind the scenes whenever you bind any resource, especially if it's an orphaned resource.
Also, yes, updating one large UBO is going to be much cheaper than updating thousands of small ones, especially if you use persistently mapped, unsynchronized updates.
On the GPU side, draw calls themselves are essentially free. What costs you is context/segment switches. If two draw calls use the same "context", the GPU can bundle them together, avoiding stalls.
Certain state changes "roll the context"/"begin a segment"/etc, which means the next draw can't overlap with the previous one.
It would be interesting to find out where base-instance-ID state and UBO bindings stand with regard to context rolls on different GPUs...