OpenGL 2.x vs 3.x

Quote:Original post by Aks9
But I still claim that implementing the same thing in shaders (exactly as it is in the fixed functionality) cannot be faster. It is a totally different thing if you skip implementing something in your custom shaders.


Most (if not all) modern cards don't have fixed-function hardware. The driver implements the fixed function through shaders. So your claim is nonsense :).
Quote:Original post by HuntsMan
Most (if not all) modern cards don't have fixed-function hardware. The driver implements the fixed function through shaders. So your claim is nonsense :).


Exactly! And those are default shaders provided by the driver. If you can compete with the driver developers at shader tuning, then I admire you! [cool]
Quote:Original post by Aks9
Exactly! And those are default shaders provided by the driver. If you can compete with the driver developers at shader tuning, then I admire you! [cool]

Depends. Some time ago, the performance of ATI's FFP emulation was abysmal, for example. You could reimplement the whole FFP in the most naive way imaginable, and it would still be twice as fast as the vendor-supplied one. This has changed with the newer AMD drivers, of course, but better optimizations are always conceivable. Keep in mind that high-speed support for the FFP is rapidly dropping in priority for the GPU manufacturers. It just has to work for the legacy CAD folks, but top performance is not required, since everybody else has moved on to shaders.

That said, reimplementing the FFP with shaders yourself is obviously nonsense from a practical point of view.
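Just to illustrate how little is involved: here is a rough, untested sketch of such a naive replacement, covering one directional light (GL_LIGHT0-style) plus GL_MODULATE texturing, written as C string literals. All the names here are placeholders of mine, not anyone's production code.

/* Naive FFP-equivalent shader pair (sketch): per-vertex directional
   lighting plus modulate texturing, using the legacy GLSL built-ins. */
static const char *ffp_vs =
    "varying vec4 v_color;\n"
    "varying vec2 v_uv;\n"
    "void main()\n"
    "{\n"
    "    vec3 n = normalize(gl_NormalMatrix * gl_Normal);\n"
    "    vec3 l = normalize(gl_LightSource[0].position.xyz);\n" /* directional */
    "    float d = max(dot(n, l), 0.0);\n"
    "    v_color = gl_FrontMaterial.ambient * gl_LightSource[0].ambient\n"
    "            + gl_FrontMaterial.diffuse * gl_LightSource[0].diffuse * d;\n"
    "    v_uv = gl_MultiTexCoord0.xy;\n"
    "    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;\n"
    "}\n";

static const char *ffp_fs =
    "varying vec4 v_color;\n"
    "varying vec2 v_uv;\n"
    "uniform sampler2D tex0;\n"
    "void main()\n"
    "{\n"
    "    gl_FragColor = texture2D(tex0, v_uv) * v_color;\n" /* GL_MODULATE */
    "}\n";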
Quote:Original post by Yann L
Depends. Some time ago, the performance of ATI's FFP emulation was abysmal, for example. You could reimplement the whole FFP in the most naive way imaginable, and it would still be twice as fast as the vendor-supplied one. This has changed with the newer AMD drivers, of course, but better optimizations are always conceivable.

I haven't had an opportunity to program for AMD/ATI, but my shaders on NVIDIA hardware are only a few percent faster than the standard fixed functionality. And that is only the case if I skip part of the full implementation (of the lighting, for example).

Quote:Original post by Yann L
Keep in mind that high-speed support for the FFP is rapidly dropping in priority for the GPU manufacturers. It just has to work for the legacy CAD folks, but top performance is not required, since everybody else has moved on to shaders.

Can you point to some source for that statement? I thought the same way, but the current situation strongly contradicts it. I have carried out many experiments with the GL 3.2 core profile and so far I observe no speedup at all in any aspect of code execution. I would be glad if it weren't true, but so far it is.
Quote:Original post by Aks9
I haven't had an opportunity to program for AMD/ATI, but my shaders on NVIDIA hardware are only a few percent faster than the standard fixed functionality. And that is only the case if I skip part of the full implementation (of the lighting, for example).

You mean that your simplified shaders are only marginally faster than the full FFP? That is impossible, since on all modern HW, FFP vertex and fragment processing is entirely implemented as shaders (except for the parts that are still fixed-function, like blending). You are doing something wrong. Either you are writing highly unoptimized shaders, or, and this is more likely, you are profiling incorrectly.

Quote:Original post by Aks9
Can you point to some source for that statement? I thought the same way, but the current situation strongly contradicts it. I have carried out many experiments with the GL 3.2 core profile and so far I observe no speedup at all in any aspect of code execution. I would be glad if it weren't true, but so far it is.

There is no FFP in the 3.2 core profile. Are you comparing 3.2 core shaders with the 2.x FFP? In that case, see above. You are doing something wrong. Probably the profiling.
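Also double-check that you are actually running on a core context; it has to be requested explicitly. A rough WGL sketch (assuming a dummy legacy context is already current so that wglGetProcAddress works; hdc is a placeholder for your window's DC):

/* Request a 3.2 core context via WGL_ARB_create_context(_profile). */
const int attribs[] = {
    WGL_CONTEXT_MAJOR_VERSION_ARB, 3,
    WGL_CONTEXT_MINOR_VERSION_ARB, 2,
    WGL_CONTEXT_PROFILE_MASK_ARB,  WGL_CONTEXT_CORE_PROFILE_BIT_ARB,
    0
};
PFNWGLCREATECONTEXTATTRIBSARBPROC wglCreateContextAttribsARB =
    (PFNWGLCREATECONTEXTATTRIBSARBPROC)
        wglGetProcAddress("wglCreateContextAttribsARB");
HGLRC core = wglCreateContextAttribsARB(hdc, 0, attribs);
if (!core) { /* the driver refused a 3.2 core context */ }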
You are probably right again.

The shaders are not unoptimized. In fact, they are pretty simple: just per-vertex lighting and modulate texturing. No conditional branching, no multipassing, no invariance qualifiers, no loops... nothing that could slow them down.

But I have to admit that my application is pretty complex. Although all VBOs are static, there can be anywhere from a few hundred up to 65K of them. To reduce CPU L2 cache pollution I'm using the bindless extensions (sketched below). Of course, bindless is not the problem; I only mention it to emphasize that the test was not a single-triangle application. But you are probably right that a bottleneck exists and that it is in neither the vertex nor the fragment shader.
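For completeness, my bindless setup follows the usual pattern from the GL_NV_shader_buffer_load and GL_NV_vertex_buffer_unified_memory specs, roughly like this (vbo, vbo_size, and vertex_count stand in for the real objects):

GLuint64EXT addr;
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY); /* pin the GPU address */

glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV); /* pull vertices by address */
glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, vbo_size);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);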

Can you suggest any profiler that can be used to find the bottleneck? gDEBugger is probably one of them, and I'll try to dig it out...

Thank you!
Quote:Original post by Aks9
But you are probably right that a bottleneck exists and that it is in neither the vertex nor the fragment shader.

Correct. If the performance difference between two very different shaders is very small, then something is wrong. Usually it means, as you said, that the bottleneck is somewhere else, and that the shaders cannot be profiled unless they are artificially made the bottleneck (or with access to GPU internal perf counters, see below).
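A crude way to force that without any tools is to time the same scene twice, once at full size and once with a 1x1 viewport: the vertex work stays identical while the fragment work all but disappears. A sketch (drawScene() and timer(), returning seconds, are placeholders):

double t0, t_full, t_tiny;
int i;

glViewport(0, 0, width, height);
glFinish();                            /* drain the pipeline before timing */
t0 = timer();
for (i = 0; i < 100; ++i) drawScene();
glFinish();
t_full = timer() - t0;

glViewport(0, 0, 1, 1);                /* same vertex load, almost no fragments */
glFinish();
t0 = timer();
for (i = 0; i < 100; ++i) drawScene();
glFinish();
t_tiny = timer() - t0;

/* t_full >> t_tiny: fragment-bound; t_full ~ t_tiny: vertex- or CPU-bound */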

Quote:Original post by Aks9
Can you suggest any profiler that can be used to find the bottleneck? gDEBugger is probably one of them, and I'll try to dig it out...

Yes, gDEBugger, together with NVIDIA PerfKit and the instrumented drivers.
Quote:Original post by Aks9
Exactly! And those are default shaders provided by the driver. If you can compete with the driver developers at shader tuning, then I admire you!

This is ridiculous. I'll give you an example from the CPU world where an end user writes assembly for functions such as memcpy that beats the versions written by Intel/Microsoft: http://www.agner.org/optimize/#asmlib. Click on Manual, page 3, Table 1.5, and note especially the performance difference between memcpy implementations on unaligned operands; it is quite significant.
Analogously, I can quite easily believe that even shader functionality might be better optimized by a good programmer than by the driver writers!
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~
Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Quote:Original post by Prune
... I'll give you an example from the CPU world where an end user writes assembly for functions such as memcpy that beats the versions written by Intel/Microsoft: http://www.agner.org/optimize/#asmlib. Click on Manual, page 3, Table 1.5, and note especially the performance difference between memcpy implementations on unaligned operands; it is quite significant...

I'm glad that there are such enthusiasts and experts. [smile]

But writing the fixed-functionality shaders for the whole world of users that still relies on them is a completely different thing from optimizing a memory copy. I don't say it is impossible, only that it is not likely for most programmers.

Thank you for the useful link!
You can use strictly generic attributes in GL 2.x; however, the NVIDIA driver will not allow you to mix them with the built-ins, to which it has given static bindings. As long as you commit to all generics and don't try to cheat, you'll be fine.
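Roughly, the all-generics setup looks like this (the in_* names and the Vertex struct are placeholders; bind the locations before linking, and never touch gl_Vertex and friends in the shader):

/* needs <stddef.h> for offsetof */
glBindAttribLocation(prog, 0, "in_position");
glBindAttribLocation(prog, 1, "in_normal");
glBindAttribLocation(prog, 2, "in_texcoord");
glLinkProgram(prog);

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const GLvoid *)offsetof(Vertex, position));
/* ...and likewise for locations 1 and 2... */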
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

