Wild Magic 5 removes inheritance for concrete renderers for performance

18 comments, last by Dave Eberly 13 years, 7 months ago
I was skimming through the newly released Wild Magic 5 from geometrictools.com and read the overview of changes from the previous version. The following statement caught my eye:
Quote:
/../
The main drawback to this approach is that Renderer contained a large number
of virtual functions. In an application with a large number of calls to the
virtual functions, there is a performance hit due to those calls. Specifically,
there are many data cache misses due to the lookup of the function pointers in
the virtual function table (the tables are global data). WM5 has a concrete
class Renderer that does not have virtual functions. The class is implemented for each graphics API. The code for these APIs is also part of WM5
LibGraphics. The selection of the API is controlled via build configurations.


Now I can understand that virtual functions are a performance hit, but skimming through the source code of Wild Magic 5 (Wm5Renderer.cpp in particular), I was stunned to find that each Enable()/Disable()/Bind()/Unbind() call did a std::map::find() operation to look up the corresponding platform-dependent concrete class (VertexShader <-> PdrVertexShader). To me this seems like an even worse performance hit than the actual virtual function call. After all, if the majority of your renderer's work is in executing virtual function calls or looking up platform-specific implementations from a std::map, you are surely doing something wrong, no?

(the Pdr prefix is his notation of implementation specific class, decided at compile-time, e.g. there is a PdrVertexBuffer for both OpenGL and D3D9)

I'm very curious about the profiling results if one were to eliminate all std::map::find() lookups and instead store an opaque pointer to the implementation, e.g. VertexShader would contain a void* storing the PdrVertexShader.
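
A minimal sketch of the two approaches being compared, assuming hypothetical stand-in types. The class names follow the post, but the members (`pdrHandle`, `EnableViaMap`, `EnableViaHandle`, `enableCount`) are invented for illustration and are not WM5's actual API. It shows a lookup through std::map::find() on every call next to a handle that is stored once at bind time.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for a platform-dependent shader object.
struct PdrVertexShader {
    int enableCount = 0;
    void Enable() { ++enableCount; }
};

// Hypothetical stand-in for the platform-independent resource.
struct VertexShader {
    void* pdrHandle = nullptr;  // opaque pointer, filled in once by Bind()
};

class Renderer {
public:
    // Creates the platform-dependent object and caches its address.
    // std::map never relocates its nodes, so the stored pointer stays
    // valid until the entry is erased.
    void Bind(VertexShader& shader) {
        auto result = mShaders.emplace(&shader, PdrVertexShader{});
        shader.pdrHandle = &result.first->second;
    }

    // Lookup on every call: an O(log n) search of the tree.
    void EnableViaMap(const VertexShader& shader) {
        auto it = mShaders.find(&shader);
        assert(it != mShaders.end());
        it->second.Enable();
    }

    // Proposed alternative: one pointer load and a cast, no search.
    void EnableViaHandle(const VertexShader& shader) {
        assert(shader.pdrHandle != nullptr);
        static_cast<PdrVertexShader*>(shader.pdrHandle)->Enable();
    }

private:
    std::map<const VertexShader*, PdrVertexShader> mShaders;
};
```

Both paths reach the same object; whether the difference is measurable in a real frame would have to be settled by profiling, which is exactly the open question here.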

I would like to think there are ways to still have inheritance and virtual functions for renderer implementations, but the virtual function call is only done once in the Draw() call from the application, and not for every state change, as would seem to be the case in Wild Magic.
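
One way this could look, as a rough sketch with made-up names (`RendererBase`, `GLRendererSketch`, `DrawAll` and the counters are all hypothetical, not taken from Wild Magic): the base class exposes a single virtual Draw(), and every state change inside it is an ordinary non-virtual member call.

```cpp
#include <vector>

struct Visual { int id; };  // hypothetical stand-in for a drawable object

class RendererBase {
public:
    virtual ~RendererBase() = default;
    // The only virtual entry point: one dynamic dispatch per visual drawn.
    virtual void Draw(const Visual& visual) = 0;
};

class GLRendererSketch final : public RendererBase {
public:
    void Draw(const Visual& visual) override {
        // All of these are direct, non-virtual calls.
        EnableVertexBuffer(visual);
        EnableShaders(visual);
        IssueDrawCall(visual);
    }

    int stateChanges = 0;  // counters only exist to make the sketch observable
    int drawCalls = 0;

private:
    void EnableVertexBuffer(const Visual&) { ++stateChanges; }
    void EnableShaders(const Visual&) { ++stateChanges; }
    void IssueDrawCall(const Visual&) { ++drawCalls; }
};

// Application-side loop: it only ever sees the abstract interface.
void DrawAll(RendererBase& renderer, const std::vector<Visual>& visuals) {
    for (const Visual& v : visuals) {
        renderer.Draw(v);
    }
}
```

With this layout the number of virtual calls grows with the number of visuals, not with the number of state changes per visual.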

Thoughts?

[Edited by - void0 on September 13, 2010 7:40:34 AM]
Optimizing on that level nowadays is usually a complete waste of time. In a big application/game, there is so much other stuff going on that optimizing function calls is completely pointless. Tweaking on the algorithmic level normally has a much larger impact on an application than most other optimizations. These things are normally not even dependent on the language and APIs used. But algorithmic improvement is one of the hardest tasks to do. That's why lots of people try to optimize in areas that seem promising at first but in fact are not, like optimizing function calls that are in any case only called once per frame. So in my experience, that type of optimization is usually a waste of time, unless you have optimized every single other algorithm in your engine.

Edit:

Considering the author of the engine seems to be quite capable of doing algorithms and math (at least his books give that impression), there is probably no other place for improvements in his engine anymore. Still, under normal circumstances, such optimizations should not be considered as high priority.

Edit 2:

I also don't think the removal of these virtual functions brings a reasonable performance gain in a real world application. The std::map find lookup also doesn't seem to be an optimal solution.
Quote:Original post by void0
I was stunned to find that each Enable()/Disable()/Bind()/Unbind() call did a std::map::find() operation to lookup the corresponding platform dependent concrete class (VertexShader <-> PdrVertexShader). This to me seems like an even worse performance hit than the actual virtual function call.
A direct function call is one step. A vtable-based virtual requires 3 steps (even more if you're using multiple inheritance, or your compiler uses a memory-optimized vtable implementation). The bigger issue is that vtables cause problems for a cpu's caching and branch prediction, which slows performance even more.

Without looking at the source, but assuming the author knows what he's doing, I would guess the map lookup in at least some cases to be amortized over several calls. In other words, a single lookup allows the conversion to a concrete class, thereby replacing multiple virtual calls with direct ones.

Quote:and instead store an opaque pointer to the implementation, e.g. VertexShader would contain a void* storing the PdrVertexShader.
To do this, you either break polymorphism, or you need to store a pointer for each derived class. Then you need some method of locating the proper pointer, given the name or other identifying token of the derived class. And then voila! you have reimplemented dynamic dispatch via an indexed lookup, function ptr table, or similar approach anyway.
Quote:Original post by Echkard
Without looking at the source, but assuming the author knows what he's doing, I would guess the map lookup in at least some cases to be amortized over several calls. In other words, a single lookup allows the conversion to a concrete class, thereby replacing multiple virtual calls with direct ones.


Take a look here, at Wm5Renderer.cpp. I might be missing something, but it doesn't look like that's what's happening.

Quote:...you need some method of locating the proper pointer, given the name or other identifying token of the derived class. And then voila! you have reimplemented dynamic dispatch via an indexed lookup, function ptr table, or similar approach anyway.


Why can't you just have different .cpp files with the various implementations, and link to the correct one at build time? Why do you need to make the decision at runtime?
Quote:Original post by Echkard
Quote:Original post by void0
I was stunned to find that each Enable()/Disable()/Bind()/Unbind() call did a std::map::find() operation to lookup the corresponding platform dependent concrete class (VertexShader <-> PdrVertexShader). This to me seems like an even worse performance hit than the actual virtual function call.
A direct function call is one step. A vtable-based virtual requires 3 steps (even more if you're using multiple inheritance, or your compiler uses a memory-optimized vtable implementation). The bigger issue is that vtables cause problems for a cpu's caching and branch prediction, which slows performance even more.


I don't know if I'd call the find operation 'one step'. It involves a RB tree lookup with O(log n) complexity. Granted, I have no idea how vtables are implemented so I should not talk about things I don't know much about. Do you have any good resources explaining the complexity of vtables and the cpu caching and branch prediction?

Quote:
Quote:and instead store an opaque pointer to the implementation, e.g. VertexShader would contain a void* storing the PdrVertexShader.
To do this, you either break polymorphism, or you need to store a pointer for each derived class. Then you need some method of locating the proper pointer, given the name or other identifying token of the derived class. And then voila! you have reimplemented dynamic dispatch via an indexed lookup, function ptr table, or similar approach anyway.


I should have explained his approach in more detail. He is not using polymorphism at all; instead each renderer implementation has a std::map of VertexBuffer to PdrVertexBuffer, IndexBuffer to PdrIndexBuffer, etc. (The Pdr prefix is his notation for an implementation-specific class; PdrIndexBuffer would be the OpenGL or D3D9 implementation, decided at compile-time).
The classes are all concrete, no inheritance. Storing the opaque pointer requires no conversion. This would perhaps be classified as 'ugly' in the OOP/OOD sense, as the class needs to expose this opaque pointer in its public interface. But hey, we are talking about performance, right?

Quote:Original post by void0
I don't know if I'd call the find operation 'one step'. It involves a RB tree lookup...
I was speaking of a direct function call, such as calling a class's non-virtual member.

Quote:Granted, I have no idea how vtables are implemented so...
It's really not that complex, if you're talking about a non-chained vtable:

1. Lookup vptr
2. Add on table offset (if mult. inheritance being used)
3. Lookup function ptr
4. Indirect call via resultant ptr

A chaining table is more complex, but that's pretty rare for C++.
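
The steps above can be written out by hand. This is only a sketch of what a typical single-inheritance vtable dispatch amounts to (so step 2 is skipped); real compilers generate this implicitly, the layout is implementation-defined, and every name here (`VTable`, `Object`, `VirtualDraw`, ...) is hypothetical.

```cpp
struct Object;                           // forward declaration
using DrawFn = int (*)(const Object*);   // signature of the "virtual" function

// One table per class, stored as global data and shared by all instances.
struct VTable {
    DrawFn draw;
};

struct Object {
    const VTable* vptr;  // the pointer a compiler would hide in each object
    int value;
};

// Two "overrides" of the same virtual function.
int DrawA(const Object* self) { return self->value + 1; }
int DrawB(const Object* self) { return self->value * 2; }

const VTable kTableA = { &DrawA };
const VTable kTableB = { &DrawB };

int VirtualDraw(const Object* obj) {
    const VTable* table = obj->vptr;  // step 1: load the vptr from the object
    DrawFn fn = table->draw;          // step 3: load the function pointer from its slot
    return fn(obj);                   // step 4: indirect call through that pointer
}
```

The two dependent loads in `VirtualDraw` are where the cache misses mentioned earlier in the thread can occur, since the table lives in global data away from the object itself.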

Quote:I should have explained his approach more detail. He is not using
polymorphism at all, instead each renderer implementation has a std::map of VertexBuffer to PdrVertexBuffer, IndexBuffer to PdrIndexBuffer, etc...
Right, but this is simply simulated polymorphism, implemented via a custom dynamic dispatch method, rather than vtables. Viewed from on high, polymorphism is simply a way to present a uniform runtime interface to objects of different type. My point was that, if you don't want a map lookup or a vtable, you still need some way at runtime to "morph" a render into a Dx9Render, OpenGL render, etc.

Quote:Original post by Gage64
Why can't you just have different .cpp files with the various implementations, and link to the correct one at build time? Why do you need to make the decision at runtime?
Because at build time, you don't know whether you're running on DX9, OpenGL, or something else. If you built for a specific platform, then yes you don't need polymorphism.

Quote:Take a look here, at Wm5Renderer.cpp. I might be missing something, but it doesn't look like that's what's happening.
Take a look at (version 4) Wm4Dx9Renderer.h. There's a whole page full of virtual members that have been replaced with concrete ones in the new version, not simply bind() and lock(). Without delving deeper, I imagine that once you've bound to the platform-dependent functions, all those additional calls are now concrete -- so while the bind itself is a bit slower due to the map lookup, the much larger number of subsequent calls more than compensates for it.
Quote:Original post by Echkard
Quote:Original post by Gage64
Why can't you just have different .cpp files with the various implementations, and link to the correct one at build time? Why do you need to make the decision at runtime?
Because at build time, you don't know whether you're running on DX9, OpenGL, or something else. If you built for a specific platform, then yes you don't need polymorphism.
Actually, it looks like this *is* what's happening...
His code basically boils down to something along the lines of:
#if defined(WM5_USE_DX9)
typedef IDirect3DIndexBuffer9* PdrIndexBuffer;
#elif defined(WM5_USE_OPENGL)
typedef GLuint PdrIndexBuffer;
#endif
This is the standard approach I've seen in most engines. If building on Windows, use DX. If building on Mac/Linux, use GL. If building on console X, use API X...
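
A compilable sketch of the same idea without the DirectX or OpenGL headers. The macro `SKETCH_USE_DX9` and the `ApiName()` members are hypothetical and exist only for illustration; the point is that the concrete type is fixed by the build configuration, so the Renderer needs no virtual functions.

```cpp
#include <string>
#include <type_traits>

// Hypothetical build switch; a real project would set this in its build
// configuration rather than in the source file.
#if defined(SKETCH_USE_DX9)
struct PdrIndexBuffer {
    static const char* ApiName() { return "DX9"; }
};
#else
struct PdrIndexBuffer {
    static const char* ApiName() { return "OpenGL"; }
};
#endif

// Concrete class: the platform-dependent type is resolved when compiling,
// so every call below is a direct call.
class Renderer {
public:
    std::string ApiName() const { return PdrIndexBuffer::ApiName(); }
};

// Compile-time check that the class carries no vtable.
static_assert(!std::is_polymorphic<Renderer>::value,
              "Renderer is expected to have no virtual functions");
```

The trade-off is that switching APIs means rebuilding, which matches the "If building on Windows, use DX" approach described above.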

The map/find operations mentioned by the OP are part of the engine's logic, and are unrelated to the virtual-function/API-selection mechanism.
Quote:Original post by cruZ
Optimizing on that level nowadays is usually a complete waste of time. In a big application/game, there is so much other stuff going on, that optimizing function calls is completely pointless.
Quote:In an application with a large number of calls to the virtual functions, there is a performance hit due to those calls. Specifically, there are many data cache misses due to the lookup of the function pointers in the virtual function table (the tables are global data).
Cache-misses are one of the biggest performance killers on current-gen hardware. The bad thing is, that in a badly designed engine they won't even make any single function show up as a bottleneck on your profiler, so depending on your tools, you might not even be alerted to the fact that most of your code is 15% slower than it should be. You've got no time-hogging function to point the finger at, but they're there, in the background, leeching your cycles.
Quote:Original post by Hodgman
The map/find operations mentioned by the OP are part of the engine's logic, and is unrelated to the virtual-function/API-selection mechanism.


Yes, that is correct. My point is that going to such great lengths to remove the virtual function calls, while not addressing the obvious issue of multiple map lookup operations for each visual object, seems odd. If you look at the implementation of Renderer::Draw(VisualSet), which is executed from the application, it basically boils down to:

foreach(Visual in VisualSet)  Renderer::Draw(Visual)


Then take a look at Renderer::Draw(Visual). Here Enable()/Disable()/Bind()/Unbind() etc. are executed. Every single frame. For every single visual, multiple map lookups are done. This seems really wasteful.

Quote:Cache-misses are one of the biggest performance killers on current-gen hardware. The bad thing is, that in a badly designed engine they won't even make any single function show up as a bottleneck on your profiler, so depending on your tools, you might not even be alerted to the fact that most of your code is 15% slower than it should be. You've got no time-hogging function to point the finger at, but they're there, in the background, leeching your cycles.


Interesting! I need to learn more about cache misses. Know of any good resources?
Edit: I found this excellent article: Gallery of Processor Cache Effects

[Edited by - void0 on September 13, 2010 12:56:17 PM]
I'm curious as to whether they saw any actual performance increase. I'd expect that the DirectX/OpenGL calls would be where the actual bottlenecks reside, but I could be wrong. This solution is pretty ugly and hacky, and I'd be loath to use it unless I saw some pretty significant boosts.
Quote:Original post by Hodgman
The map/find operations mentioned by the OP are part of the engine's logic, and is unrelated to the virtual-function/API-selection mechanism.
The documentation for the library says: "The Bind call creates a platform-dependent object that corresponds to the platform-independent resource. For example, Bind applied to a VertexBuffer will create a corresponding platform-dependent object PdrVertexBuffer"
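
A rough sketch of what such a Bind could look like, assuming a map keyed on the platform-independent object. Only the names VertexBuffer, PdrVertexBuffer and Bind come from the quoted documentation; the members, `Unbind` and `NumBound` are hypothetical and this is not WM5's actual code.

```cpp
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical platform-independent resource.
struct VertexBuffer {
    int numElements = 0;
};

// Hypothetical platform-dependent counterpart; a real one would create
// a GL or D3D buffer in its constructor.
struct PdrVertexBuffer {
    explicit PdrVertexBuffer(const VertexBuffer& vb) : size(vb.numElements) {}
    int size;
};

class Renderer {
public:
    // Creates the platform-dependent object on first use; later calls
    // for the same resource are no-ops.
    void Bind(const VertexBuffer* vbuffer) {
        if (mVertexBuffers.find(vbuffer) == mVertexBuffers.end()) {
            mVertexBuffers[vbuffer] = std::unique_ptr<PdrVertexBuffer>(
                new PdrVertexBuffer(*vbuffer));
        }
    }

    // Destroys the platform-dependent object, if one exists.
    void Unbind(const VertexBuffer* vbuffer) {
        mVertexBuffers.erase(vbuffer);
    }

    std::size_t NumBound() const { return mVertexBuffers.size(); }

private:
    std::map<const VertexBuffer*, std::unique_ptr<PdrVertexBuffer>> mVertexBuffers;
};
```

Note that none of these member functions is virtual; the map lookup, not dynamic dispatch, is what connects the two object families.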

I only glanced at the code, but the docs seem to be correct. Wm5Renderer contains platform independent calls; the dependent calls are now all concrete -- there are no virtual functions.

If the library only allows compile-time platform selection, then those few remaining independent calls in Wm5Renderer could also be rewritten with static bindings. I don't know why he left them, but the performance hit should be minimal if these are only being executed a few times per frame. In the old version, some virtual functions were likely being executed many thousands of times per frame. That penalty may not leap out at you like a map lookup does, but if it's being executed orders of magnitude more often, it is going to be a far larger hit.
