The reason I say my results are reliable (regardless of my flawed presentation of them) is that they represent the most primitive set of API calls possible.
As outlined above, unless you completely change your approach to benchmarking the APIs, your results are invalid. It's not a question of presentation; it's your core benchmarking method that is flawed. If you want reliable benchmarking results, then I suggest you first learn how to properly benchmark a graphics system, and then build a performance analysis framework around this. You will be surprised by the results - because there will more than likely be zero difference, except for the usual statistical fluctuations.
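As a hedged illustration of what "measuring the GPU side" could look like, here is a minimal sketch using an OpenGL timer query (ARB_timer_query / GL 3.3+). drawScene() is a placeholder for whatever workload is under test, not anything from the posts above:

// Minimal sketch: measure actual GPU execution time with a timer query.
// Assumes an active GL 3.3+ context; drawScene() is a placeholder.
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
drawScene();                          // the workload under test
glEndQuery(GL_TIME_ELAPSED);

// Blocks until the GPU has finished this work, then reads nanoseconds.
GLuint64 gpuTimeNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuTimeNs);
printf("GPU time: %.3f ms\n", gpuTimeNs / 1.0e6);

glDeleteQueries(1, &query);

Comparing this GPU time against wall-clock frame time is one way to tell whether a test is actually GPU-bound or just measuring CPU-side submission cost.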
I am not even assigning textures to slots in those benchmarks, or changing any non-essential states such as culling or lighting. All matrix operations come from my own library and carry exactly the same overhead in DirectX.
That leaves very little room for common OpenGL performance pitfalls. In such a simple system, I am open to ideas about what I may have missed that could help OpenGL come closer to Direct3D.
I tried combinations of VBOs: VBOs for only large buffers, VBOs for all buffers, and so on.
Redundancy checks prevent the same shader from being set twice in a row. That did help a lot. But it helped Direct3D just as much.
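A simplified sketch of that kind of redundancy check (the names are illustrative, not the poster's actual code):

// Sketch: skip the driver call when the same program is already bound.
static GLuint g_currentProgram = 0;

void useProgramCached(GLuint program)
{
    if (program == g_currentProgram)
        return;                    // redundant state change - filtered out
    glUseProgram(program);
    g_currentProgram = program;
}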
I have no logic loop, only rendering. So although you say I am benchmarking the CPU, I am only doing so insofar as I am bound to call OpenGL API functions from the CPU - which is how everyone else is bound, too.
This perfectly outlines what I was trying to explain to you in my post above:
you don't understand what you are benchmarking. You are analyzing the CPU-bound API overhead. Your numbers may even be meaningful within this particular context. However - and that is the point - these numbers don't say anything about the API's 'performance' (if there even is such a thing)!
I will try to explain this a bit more. What you need to understand is that the GPU is an independent processing unit that largely operates without CPU interference. Assume you have a modern SM4+ graphics card. Assume further a single uber-shader (which may not always be a good design choice, but let's take this as an example), fully atlased/arrayed textures, uniform data blocks, and no blending / single-pass rendering. Rendering a full frame would essentially look like this:
ActivateShader()
ActivateVertexStreams()
UploadUniformDataBlock()
RenderIndexedArray()
Present/SwapBuffers() -> Internal_WaitForGPUFrameFence()
In practice you would use at least a few state changes and possibly multiple passes, but the basic structure could look like this. What happens here? The driver (through the D3D/OpenGL API) sends some very limited data to the GPU (the large data blocks are already in VRAM) - and then waits for the GPU to complete the frame, unless it can defer a new frame or queue up more frames to the command FIFO. Yup, the driver waits. This is the situation we call GPU-bound. Being fill-rate limited, texture or vertex stream memory bandwidth bound, vertex transform bound - all of these are GPU-bound scenarios.
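To make the skeleton concrete, it could map onto real OpenGL 3.1+ calls roughly as below. This is a sketch only: program, vao, ubo, indexCount and frameUniforms are assumed to exist from initialization, and the swap call is platform-specific (SwapBuffers on Windows):

// Rough GL translation of the frame skeleton above (GL 3.1+, UBOs).
glUseProgram(program);                        // ActivateShader()
glBindVertexArray(vao);                       // ActivateVertexStreams()

glBindBuffer(GL_UNIFORM_BUFFER, ubo);         // UploadUniformDataBlock()
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(frameUniforms), &frameUniforms);

glDrawElements(GL_TRIANGLES, indexCount,      // RenderIndexedArray()
               GL_UNSIGNED_INT, 0);

SwapBuffers(hdc);                             // Present() - driver may now wait on a GPU fence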
And now comes the interesting part: neither OpenGL nor D3D has anything to do with any of this! Once the data and the commands are on the GPU, the processing will be exactly the same whether you're using OpenGL or D3D. There will be absolutely no performance difference. Zero. Nada.
What you are measuring is only the part where data is manipulated on the CPU side. This part is only relevant if you are CPU-bound, i.e. if the GPU is waiting for the CPU. And that is a situation an engine programmer will do anything to avoid. It's a worst-case scenario, because it implies you aren't using the GPU to its full potential. Yet this is exactly the situation you are currently benchmarking!
If you are CPU-bound, this usually means that you are not batching your geometry correctly or that you are doing too many state changes. It doesn't imply that the API is 'too slow'; it usually implies that your engine design, data structures or assets need to be improved. Sometimes this is not possible, or a major challenge - especially if you are working on legacy engines designed around a high-frequency state-changing / FFP paradigm. But often it is possible, especially on engines designed around a fully unified memory and shader-based paradigm, and you can tip the balance back to being GPU-bound.
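To make the batching point concrete, here is a hedged sketch (not from either post) of collapsing many per-object draw calls into a single instanced call; all names are illustrative:

// Before: CPU-bound - one uniform upload and one draw call per object.
for (int i = 0; i < objectCount; ++i)
{
    glUniformMatrix4fv(modelLoc, 1, GL_FALSE, objects[i].modelMatrix);
    glDrawElements(GL_TRIANGLES, objects[i].indexCount, GL_UNSIGNED_INT, 0);
}

// After: one call submits all instances (GL 3.1+); the shader reads
// per-instance matrices from a buffer indexed with gl_InstanceID.
glBindVertexArray(sharedVao);
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, objectCount);

The GPU does the same amount of work in both cases; only the CPU-side submission cost shrinks.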
So, in conclusion: CPU-side API performance doesn't really matter. If you are GPU-bound, even a tenfold performance difference would have zero impact on framerate, since it would be entirely amortized while waiting for the GPU. Sure, you can measure these differences - but they are meaningless in a real-world scenario. It is much more important to optimize your algorithms, which will have orders of magnitude more effect on framerate than the CPU call overhead.
* changed "open source" to "many open source extensions"
There are no "open source extensions". OpenGL has absolutely nothing to do with open source.