Another way to put it: don't put the uniforms for group A at offset 0 and the uniforms for group B at offset 12 (right after a single vec3). If you're updating an odd amount of data like one vec3, pad out to the next 64-byte boundary before starting group B. Do not trust my 64-byte number, though. I have no idea whether 64 bytes is the right value. I seem to remember that the size of a single 4x4 matrix was the right alignment, but I could easily be remembering wrong, and it could have changed since then anyway.
I don't know which calls or vendors took the hit; the end result was simply to X-byte align each group of uniforms. You'll also have to test that this actually performs better, since the extra padding may hurt you if you are bandwidth starved.
I know I've said it before, and I'm sure you already know it, but make sure you are actually bottlenecked on something this fixes before spending a lot of time tuning it. Also make sure you're late enough in development that the perf data you collect is representative. Unless you're comfortable writing development code that already accounts for future optimizations (one variable for group alignment, one for map alignment, etc.), you're best off delaying this type of work as long as humanly possible. For example, you want to 1-byte align during development so you catch any bugs where you write past the end of your buffer, instead of having the padding hide them.
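To make the "variable for group alignment" idea concrete, here is a minimal sketch in plain C. The names group_alignment, align_up and layout_groups are made up for illustration, and 64 is still only a guess that you would have to measure. It only computes offsets and makes no GL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob. Leave it at 1 during development so that writes
   past the end of a group are not hidden by padding; raise it later to a
   value you have actually measured (64 is only a guess). */
static size_t group_alignment = 1;

/* Round offset up to the next multiple of alignment (alignment must be > 0). */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment > 0);
    return (offset + alignment - 1) / alignment * alignment;
}

/* Store the start offset of each uniform group in offsets[] and return the
   total number of bytes the buffer needs. sizes[i] is the size of group i. */
static size_t layout_groups(const size_t *sizes, size_t count, size_t *offsets)
{
    size_t cursor = 0;
    for (size_t i = 0; i < count; ++i) {
        cursor = align_up(cursor, group_alignment);
        offsets[i] = cursor;
        cursor += sizes[i];
    }
    return cursor;
}
```

With group_alignment at 1, a 12-byte group followed by a 64-byte group lands at offsets 0 and 12. With it set to 64, the second group moves to offset 64.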
OK, I think I understand. If we are thinking about the same thing, calls to glBindBufferRange actually have to use an offset that is a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which is 256 on my machine (and, I think, on many other cards).
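As a rough sketch of what that constraint means for per-object data: each object's block has to start on a multiple of the queried alignment, so the stride between blocks is the block size rounded up. The helper names ubo_stride and ubo_offset are hypothetical, and the GL calls appear only in a comment because they need a live context.

```c
#include <stddef.h>

/* Bytes between consecutive per-object uniform blocks: the block size
   rounded up to the driver's required offset alignment. */
static size_t ubo_stride(size_t block_size, size_t offset_alignment)
{
    return (block_size + offset_alignment - 1) / offset_alignment
           * offset_alignment;
}

/* Byte offset of object number object_index inside the shared buffer. */
static size_t ubo_offset(size_t object_index, size_t block_size,
                         size_t offset_alignment)
{
    return object_index * ubo_stride(block_size, offset_alignment);
}

/* With a live GL context this would be used roughly like so
   (binding, ubo and i are placeholders; 64 is sizeof one float4x4):

     GLint align = 0;
     glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
     glBindBufferRange(GL_UNIFORM_BUFFER, binding, ubo,
                       (GLintptr)ubo_offset(i, 64, (size_t)align), 64);
*/
```

With an alignment of 256, a 64-byte matrix per object therefore costs 256 bytes of buffer per object, which is the bandwidth trade-off mentioned earlier in the thread.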
Just now I tried to evaluate whether the uniform updates are a bottleneck in my case. For this test I stripped the rendering pipeline down as far as I could with regard to OpenGL interaction. I simulated the performance of the "optimized" uniform updates by replacing the glBufferSubData() upload of a float4x4 with a glBindBufferRange() call.
I compared the two approaches with 1K and 4K draw calls for very simple geometry (same vb for every draw call) and could not see any noticeable difference.
I concluded that the optimized version could not possibly be faster than just calling glBindBufferRange() for every differently transformed object, which in turn means this is not my bottleneck.
So has the driver situation improved or is my test/conclusion flawed?