I'm sorry for taking so long to respond. Work is killing me...
My manual "synchronization" is actually no synchronization at all. I'm using a rolling-buffer approach: I allocate and resize VBOs as they are needed, and ensure that the same VBO is not reused until at least 6 frames have passed. 6 frames is a lot of time, and should be much longer than the OpenGL driver is prepared to let the GPU fall behind before stalling the CPU. Neither decreasing nor increasing this value has any effect on performance (although low values introduce artifacts, of course). Regardless, the fact that I am using GL_MAP_UNSYNCHRONIZED_BIT should disable all synchronization and be the fastest way of doing this. I don't really care whether this is 100% correct or safe at this point; my point is that at the moment I'm not doing any two-way communication with the GPU at all, so I don't see any possible way that the performance problems on Nvidia cards are my fault.
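To make the scheme concrete, here's a rough sketch of the rotation logic (not my exact code; the struct, names, and ring size of 6 are illustrative). A buffer name only comes back around after 6 frames, which is the whole basis for skipping synchronization:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the rolling-buffer scheme: a ring of
 * NUM_BUFFERS VBOs is cycled through, so a buffer is only reused
 * after NUM_BUFFERS frames, by which point the GPU is assumed to be
 * done with it and GL_MAP_UNSYNCHRONIZED_BIT is "safe enough". */
#define NUM_BUFFERS 6

typedef struct {
    unsigned vbo[NUM_BUFFERS]; /* GL buffer names from glGenBuffers() */
    uint64_t frame;            /* monotonically increasing frame counter */
} BufferRing;

/* Pick the VBO for the current frame and advance the ring. */
static unsigned ring_next(BufferRing *r) {
    unsigned name = r->vbo[r->frame % NUM_BUFFERS];
    r->frame++;
    /* The caller would then do something like:
     *   glBindBuffer(GL_ARRAY_BUFFER, name);
     *   void *p = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
     *       GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
     * fill p, then glUnmapBuffer(GL_ARRAY_BUFFER). */
    return name;
}
```

The key property is simply that the same name never shows up twice within any window of 6 consecutive frames.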
My instancing algorithm is as simple as it can get. I upload a buffer (using the VBO handling described above) filled with the 3D position of each instance (16-bit values, padded from 6 to 8 bytes), which is read in the shader as a per-instance attribute (glVertexAttribDivisor(instancePositionLocation, 1)). Everything is then drawn with a single call to glDrawElementsInstanced().
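For clarity, the per-instance element and the GL setup look roughly like this (a sketch under my description above; the struct and variable names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One per-instance element as described: a 16-bit 3D position,
 * padded from 6 to 8 bytes for alignment. */
typedef struct {
    int16_t x, y, z;
    int16_t pad; /* unused padding to reach 8 bytes */
} InstancePos;

/* The GL-side setup would be roughly (names illustrative):
 *
 *   glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
 *   glVertexAttribPointer(instancePositionLocation, 3, GL_SHORT,
 *                         GL_FALSE, sizeof(InstancePos), (void *)0);
 *   glVertexAttribDivisor(instancePositionLocation, 1); // advance once per instance
 *   glEnableVertexAttribArray(instancePositionLocation);
 *
 *   glDrawElementsInstanced(GL_TRIANGLES, indexCount,
 *                           GL_UNSIGNED_SHORT, 0, instanceCount);
 */
```

With the divisor set to 1, the attribute pointer steps through the instance buffer once per instance rather than once per vertex, so the whole scene goes out in one draw call.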
Concerning performance, 5 out of the 7 cards perform as I expect. The AMD HD5500 and the Intel HD5000 are both heavily limited by fragment performance, not vertex performance. I'd also argue that the AMD cards are too slow when doing pseudo-instancing, not the other way around. The performance numbers also add up when comparing the cards:
GTX 295 vs HD7790: The GTX 295 was only running on one GPU. With both enabled I get around 90% higher FPS, which puts it very close to the HD7790. Those two cards have very similar theoretical compute performance.