I'm the author of the original recommendation, and it came about through considerable experimentation in an attempt to get performance out of GL UBOs comparable to what you can get from D3D cbuffers.
The issue was that in D3D the per-object orphan/write/use/orphan/write/use/orphan/write/use pattern (with a single small buffer containing storage for a single object) works: it runs fast and gives robust and consistent behaviour across different hardware vendors. In GL none of this happens.
The answer to the question "have you tried <insert tired old standard buffer object recommendation here>?" is "yes - and it didn't work".
The only solution that worked robustly across different hardware from different vendors was to iterate over each object twice. This may not be such a big deal, as you're probably already doing it anyway (e.g. once to build a list of drawable objects, once to draw them). The first iteration chooses a range for the object in a single large UBO (sized for the maximum number of objects you want to handle in a frame) and writes in its data; the second calls glBindBufferRange and draws.
The data was originally written to a system memory intermediate storage area, then that was loaded (between the two iterations) to the UBO using glBufferSubData. Mapping (even MapBufferRange with the appropriate flags) sometimes worked on one vendor but failed on another and gave no appreciable performance difference anyway, so it wasn't pursued much. glBufferData (NULL) before the glBufferSubData call gave no measurable performance difference. glBufferData on its own gave no measurable performance difference. Different usage hints were a waste of time as drivers ignore these anyway.
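For anyone who wants to see the shape of the two-pass scheme, here is a rough C++ sketch of the CPU side only. It is not the code used in the experiments described above: UboStaging, ObjectUniforms, alignUp and kNoRange are names made up for illustration, and the uniform layout is an assumption. The one detail added beyond the description is that each range is rounded up to the alignment reported by GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, since glBindBufferRange requires uniform buffer offsets to be a multiple of that value. The GL calls themselves appear only as comments so the block compiles without a context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-object uniform block; the layout is an assumption.
struct ObjectUniforms {
    float modelViewProj[16];
    float colour[4];
};

// Round size up to the next multiple of alignment.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0);
    return (size + alignment - 1) / alignment * alignment;
}

// System-memory staging area that mirrors one large UBO.
class UboStaging {
public:
    // Returned by write() when the buffer has no room left.
    static constexpr std::size_t kNoRange = static_cast<std::size_t>(-1);

    UboStaging(std::size_t maxObjects, std::size_t offsetAlignment)
        : stride_(alignUp(sizeof(ObjectUniforms), offsetAlignment)),
          bytes_(stride_ * maxObjects),
          used_(0) {}

    // Pass 1: claim the next aligned range and copy the object's data in.
    // The returned byte offset is what pass 2 hands to glBindBufferRange.
    std::size_t write(const ObjectUniforms& u) {
        if (used_ + stride_ > bytes_.size()) return kNoRange;
        const std::size_t offset = used_;
        std::memcpy(bytes_.data() + offset, &u, sizeof u);
        used_ += stride_;
        return offset;
    }

    void reset() { used_ = 0; }                        // call once per frame
    const unsigned char* data() const { return bytes_.data(); }
    std::size_t used() const { return used_; }         // bytes to upload

private:
    std::size_t stride_;                // aligned size of one object's range
    std::vector<unsigned char> bytes_;  // intermediate storage
    std::size_t used_;                  // bytes claimed so far this frame
};

// With a live GL context the frame would look roughly like this
// (ubo, bindingPoint, maxObjects and the draw call are placeholders):
//
//   GLint align = 0;
//   glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
//   UboStaging staging(maxObjects, align);
//   // pass 1: offsets[i] = staging.write(uniforms for object i);
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, staging.used(), staging.data());
//   // pass 2: glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo,
//   //                           offsets[i], sizeof(ObjectUniforms));
//   //         then issue the draw call for object i.
```

The point of the layout is that there is exactly one upload per frame and one glBindBufferRange per object, with no per-object buffer updates in between.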
I'm fully satisfied that all of this is 100% a specification/driver issue, coming in particular from the OpenGL buffer object specification and from poor driver implementations. After all, the hardware vendors can write a driver where the single-small-buffer and per-object orphan/write/use pattern works and works fast - they've done it in their D3D drivers. It would be interesting to test whether the GL4.4 immutable buffer storage functionality helps at all, but in the absence of AMD and Intel implementations I don't see anything meaningful or useful coming from such a test.
Finally, by way of comparison with standalone uniforms, the problem there was that in GL standalone uniforms are per-program state and there are no global/shared uniforms, outside of UBOs.
Edited by mhagain, 26 February 2014 - 02:44 AM.