EDIT: This seems to be caused by glDrawElementsInstanced() being extremely CPU intensive. See my 4th post further down.
I have a situation where I need to render a large number of instances of a few different, very simple meshes (100-300 triangles each). Rendering them one by one turned out to be too CPU intensive, so instancing seemed like the perfect solution, except that it requires OGL3. I therefore came up with a "pseudo"-instancing method: I duplicate my model 128 times in a VBO (instead of storing it just once), which lets me render up to 128 tiles in a single draw call. The instance positions are uploaded to a uniform vec3, which a shader uses to position each instance.
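To make the pseudo-instancing idea concrete, here is a rough CPU-side sketch of how the duplicated VBO data could be built. It is not my actual code: the function names, the extra per-vertex "copy index" float (which a shader could use to look up its position in a uniform array), and the batching helper are all assumptions made for illustration. No GL calls are made here.

```python
from array import array

MAX_BATCH = 128  # number of mesh copies stored in the VBO, as described above


def build_duplicated_mesh(vertices, indices, copies=MAX_BATCH):
    """Duplicate a mesh `copies` times (hypothetical helper).

    vertices: list of (x, y, z) tuples; indices: list of ints.
    Each output vertex is (x, y, z, copy_index), so a shader could pick
    the matching instance position from a uniform array (an assumption).
    """
    vertex_data = array('f')
    index_data = array('I')  # unsigned int: 128 copies can exceed 65535 vertices
    n = len(vertices)
    for copy in range(copies):
        for (x, y, z) in vertices:
            vertex_data.extend((x, y, z, float(copy)))
        # Indices of each copy are shifted past the previous copies' vertices.
        index_data.extend(i + copy * n for i in indices)
    return vertex_data, index_data


def split_into_batches(positions, batch_size=MAX_BATCH):
    """Split instance positions into groups of at most batch_size, one per draw call."""
    return [positions[i:i + batch_size]
            for i in range(0, len(positions), batch_size)]


def index_count_for_batch(batch, indices_per_mesh):
    """Number of indices to draw so only the first len(batch) copies are rendered."""
    return len(batch) * indices_per_mesh
```

For each batch, the positions would be uploaded as uniforms and a single ordinary indexed draw issued with `index_count_for_batch(...)` indices, so a partially filled batch just draws fewer copies.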
Now I've also implemented an OGL3 version where I upload my per-instance data using super efficient, manually synchronized VBO mapping instead of glUniform3f, and render the geometry using real instancing. However, this turned out to be remarkably slower on Nvidia hardware, to the point where my pseudo-instancing was 40% faster than real instancing. On the other hand, on AMD and Intel hardware real instancing is faster (sometimes much faster).
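For comparison, the real instancing path needs only one copy of the mesh plus a tightly packed buffer of per-instance positions. The sketch below shows just the packing step; the function name and the 12-byte vec3 layout are assumptions for illustration, and the mapping and synchronization of the VBO are left out.

```python
import struct

BYTES_PER_INSTANCE = 12  # one vec3 of 32-bit floats (assumed layout)


def pack_instance_positions(positions):
    """Pack (x, y, z) tuples as tightly packed little-endian 32-bit floats.

    The resulting bytes are what would be written into the mapped
    per-instance VBO (hypothetical helper, not the original code).
    """
    buf = bytearray(len(positions) * BYTES_PER_INSTANCE)
    for i, (x, y, z) in enumerate(positions):
        struct.pack_into('<3f', buf, i * BYTES_PER_INSTANCE, x, y, z)
    return buf
```

On the GL side, the attribute reading this buffer would be set to advance once per instance with glVertexAttribDivisor(attrib, 1), and the whole set drawn with a single glDrawElementsInstanced() call whose instance count is the number of packed positions.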
Here's my test result data. Test1 = real instancing, Test2 = pseudo-instancing.