# Designing Good C++ Game Middleware

General and Gameplay Programming

Panagiotis Christopoulos Charitos (@Godlike) is the maintainer of the AnKi 3D Engine, available on GitHub at https://github.com/godlikepanos/anki-3d-engine. Learn more at http://anki3d.org/.

For many years I've been evaluating and using various game-specific open source libraries while designing and implementing my own. Although many libraries are quite competent at what they do, their overall design often leaves a few things to be desired. Some of the concepts described here may sound naive, but you can't imagine how many libraries get them wrong. This article focuses on a few good practices that designers and implementers of performance-critical libraries should be aware of:

• What public interfaces should look like.
• Data oriented design.
• Memory management.
• And some general concepts.

Who is the target audience:

• People who want to create performance critical libraries/middleware.
• People who want to attract serious users and not only hobbyists.
• Mainly people working on open source.

Who is not the target audience:

• People who want to create middleware solely for their own amusement.
• C++ purists.

## Introduction

Looking at the forest and not the trees, a performance-critical library should:

• be all about performance (obviously),
• have a specialized set of functionality and not try to do more than it should,
• and integrate tightly with the engine.

A recurring example based on a pseudo (rigid body) physics library will be used throughout this article to demonstrate some of the good practices. This pseudo physics library exposes a number of collision objects (e.g. sphere, AABB, plane) and a rigid body object that points to some of those collision objects.

This library has an “update” function (also known as stepSimulation) that:

• iterates the created rigid bodies,
• does coarse collision detection (aka broadphase: it basically checks the AABBs of the collision shapes) and gathers pairs of potential colliders,
• performs more refined collision detection (aka narrowphase. It checks the actual collision shapes) on those pairs,
• and finally it runs various solvers.

From now on, “user” refers to the user(s) of a library.

## Public Interfaces

A collection of public interfaces is the means by which a user interacts with your library. I can't stress enough how important interfaces are for a library and its initial appeal. Designing good interfaces is a big challenge and sometimes half of the work. So, what makes a good interface? It should be minimal, documented, self-documented, stable, flexible and extendable.

What makes an interface minimal? A minimal interface should avoid clutter. Internal functionality, or functionality the user is not expected to interact with, shouldn't obscure the user's view. It's amazing how many libraries get that wrong. One solution that hides private functionality is the PIMPL idiom. PIMPL is not great in terms of performance though, since it implies an indirection (a pointer dereference) and a memory allocation, so try to avoid it. Similar solutions that prevent inlining should also be avoided.
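To make the trade-off concrete, here is a minimal sketch of the PIMPL idiom being discussed (all class and member names are made up for illustration). The public class only holds a pointer to a hidden implementation, which costs a heap allocation at construction and a pointer dereference on every call:

```cpp
#include <cassert>
#include <memory>

// Normally RigidBodyImpl would live in the .cpp file, invisible to
// users; it is inlined here so the sketch is self-contained.
class RigidBodyImpl
{
public:
    float m_mass = 0.0f;
};

class RigidBody
{
public:
    // The heap allocation that PIMPL implies.
    RigidBody() : m_impl(std::make_unique<RigidBodyImpl>()) {}

    // Every access pays a pointer dereference, and calls can't be
    // inlined when the implementation is hidden in another TU.
    void setMass(float mass) { m_impl->m_mass = mass; }
    float getMass() const { return m_impl->m_mass; }

private:
    std::unique_ptr<RigidBodyImpl> m_impl;
};
```

The private details are nicely hidden, but at exactly the cost described above: an extra allocation plus an indirection on every access.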

Documentation is very important as well. Using Doxygen is a pretty standard way to document your interfaces. Even if no one ever generates the HTML documentation, having a universally accepted way to document your code is a plus.

Self-documented code is even more important than Doxygen documentation. Having some rules that govern the logic of your library will help people understand and reason about every piece of your library's functionality. One simple example is the use of the "const" keyword. A const method signals read-only access, and read-only methods are generally safe to call from multiple threads at the same time. The same applies to const arguments, member variables etc. Some languages (Rust, for instance) make all of their variables immutable by default; that's how important const is. So don't be lazy and use const everywhere.
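A small sketch of that convention (hypothetical names): const marks the read-only methods that are safe to call concurrently, while non-const methods advertise that they mutate state and need external synchronization.

```cpp
#include <cassert>

// Hypothetical rigid body illustrating const as self-documentation.
class RigidBody
{
public:
    // const: read-only, safe to call from many threads at once.
    float getMass() const { return m_mass; }

    // non-const: mutates state, requires external synchronization.
    void setMass(float mass) { m_mass = mass; }

private:
    float m_mass = 1.0f;
};
```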

A more complex example of self-documentation is a scheme that governs the ownership and lifetime of objects (or memory in general). AnKi and Qt use a scheme where passing an object as a pointer means that ownership of that object is shared (or, less often, that the pointer's consumer takes full ownership, or that the argument is optional and may be nullptr). In all other cases references are used. Co-ownership practically means that the pointed-to object should be deleted after all the objects that co-own it. Example:

```cpp
class Foo
{
public:
    // boo is passed as a pointer. This means that Foo co-owns boo. boo
    // shouldn't be deleted before a Foo that co-owns it is.
    void someMethod(Boo* boo);

    // hoo is passed as a reference. hoo can be deleted right after the
    // call to someMethod2.
    void someMethod2(Hoo& hoo);

    // Method returns a pointer. This means that the caller of newLoo
    // should take (co-)ownership of the pointer.
    Loo* newLoo();

    // Method returns a reference. The caller of getLoo shouldn't try to
    // take ownership of the reference.
    Loo& getLoo();
};
```

```cpp
Boo* boo = new Boo();
Foo* foo = new Foo();
foo->someMethod(boo); // foo co-owns boo
delete foo; // First delete the "owner"
delete boo; // Then delete the "owned"
```

The stability, flexibility and extensibility of interfaces are pretty abstract and subjective notions, so I won't discuss them in depth.

Another interesting concept revolves around the programming language of the public interfaces. Many libraries have a C++ implementation but a C public interface. This is generally a good idea because C forces the public interface to be minimal and clean, and at the same time it makes the library easy to embed into other languages (e.g. Python bindings). But that doesn't apply to everything, so keep that in mind.
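A hedged sketch of what such a C facade over the pseudo physics library might look like (all names hypothetical). An opaque handle keeps every C++ detail out of the public header, which is also what keeps the ABI stable and bindings trivial:

```cpp
#include <cassert>

extern "C" {
// Public header side: an opaque handle and plain C functions.
typedef struct MyLibContextT MyLibContextT;

MyLibContextT* myLibCreateContext(void);
void myLibDestroyContext(MyLibContextT* ctx);
int myLibGetBodyCount(const MyLibContextT* ctx);
}

// Implementation side (would live in a .cpp file): free to use C++.
struct MyLibContextT
{
    int m_bodyCount = 0;
};

extern "C" MyLibContextT* myLibCreateContext(void)
{
    return new MyLibContextT();
}

extern "C" void myLibDestroyContext(MyLibContextT* ctx)
{
    delete ctx;
}

extern "C" int myLibGetBodyCount(const MyLibContextT* ctx)
{
    return ctx->m_bodyCount;
}
```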

## Data oriented design

Cache misses are one of the worst performance offenders nowadays and minimizing them should be a priority. Constructing a data-oriented public interface for your library will play a vital role in performance. The pseudo physics engine is a prime example where the wrong interfaces will result in suboptimal performance.

So let’s imagine that our pseudo physics library exposes the rigid body class in a way that allows the user to place it in memory however they want:

```cpp
class MyLibRigidBody
{
public:
    void setForce(...);

    void setMass(...);

private:
    // Internal members
    float m_mass;
    Vec3 m_gravity;
    // ...
};
```

The library's context holds a list of rigid bodies that it iterates during simulation. The user pushes their own rigid bodies down to the context:

```cpp
class MyLibContext
{
public:
    // ...
    void pushRigidBody(MyLibRigidBody* body)
    {
        m_rigidBodies.pushBack(body);
    }

    void popRigidBody(MyLibRigidBody* body)
    {
        // ...
    }
    // ...

private:
    // ...
    Vector<MyLibRigidBody*> m_rigidBodies;
    // ...
};
```

And the update function iterates the user provided rigid bodies for various operations. Example:

```cpp
void update(MyLibContext& ctx, double deltaTime)
{
    // Broadphase (simplified)
    Vector<Pair> pairs;
    for(unsigned i = 1; i < ctx.m_rigidBodies.getSize(); ++i)
    {
        const MyLibRigidBody& a = *ctx.m_rigidBodies[i - 1];
        const MyLibRigidBody& b = *ctx.m_rigidBodies[i];

        if(collide(a, b))
        {
            pairs.pushBack({a, b});
        }
    }

    // Narrowphase
    for(Pair& pair : pairs)
    {
        if(detailedCollide(pair.a, pair.b))
        {
            // Append to a new vector
        }
    }

    // Run the simulation
    runSolver(deltaTime, ...);
}
```

The fact that the library allows the user to allocate MyLibRigidBody themselves sounds like a nice idea. You might think this is good design since the library hands some responsibility (the allocation of MyLibRigidBody) to the user. Well, it's not.

The update function iterates all the rigid bodies one after the other and does some computations (broadphase). Ideally, all of those rigid bodies should be in a contiguous piece of memory and they should be visited in the order they are laid out in memory; that is the way to minimize cache misses. Giving the car keys to the user might not be the best thing to do in this example.

Instructing the user to pack their rigid bodies into one huge contiguous array is also not enough. The update function iterates the m_rigidBodies array in an order only the MyLibContext knows. As mentioned before, for optimal cache performance the order of m_rigidBodies should match the user's memory layout. That's not easy in the given example, especially if the user pushes and pops rigid bodies all the time.

In this case, having your library allocate MyLibRigidBody instead of the user might be a better idea:

```cpp
class MyLibContext
{
public:
    // ...
    MyLibRigidBody* newRigidBody();
    // ...

private:
    // ...
    Vector<MyLibRigidBody> m_rigidBodies;
    // ...
};
```

## Multithreading

Having a thread-aware library is very important nowadays, since multi-core hardware has been the de facto standard for ages. At the same time your library shouldn't be trying to solve problems that the user can solve better.

So the first big thing is to show which functionality is thread-safe and which isn't. The section about interfaces and const correctness covered that, so I won't expand on it.

Since your library shouldn't be the one solving the thread-scaling problem, does that mean that thread contention and protection (smart pointers, locks and other sync primitives) are outside its scope? Yes and no. Smart pointers should largely be avoided since atomic operations are not free; ownership should be the responsibility of the user, with the library providing documentation and proper interfaces. But what about locks? There are cases where a function is largely thread-safe but one small part of its work requires a lock. If the locking is left to the user, the critical section becomes wider, since the user will have to lock around the whole function. In that case, having the library handle its own critical sections might be better for performance. One example is malloc-like functions: it's a given that they should be thread-safe by design, and they are a prime example of functionality that should be responsible for protecting its own critical sections. But things should be transparent. Having the option to lock or not is one solution; having your library accept an interface for the lock itself is another. A mutex abstraction that the user implements might be a good idea, especially for game engines that instrument their mutexes (a prime example of the "tight integration" we mentioned earlier).
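One possible shape for such a mutex abstraction (a sketch with made-up names): the library only sees the interface, so an engine can plug in an instrumented mutex of its own, while the library keeps its critical section as small as possible.

```cpp
#include <cassert>
#include <mutex>

// The abstraction the library accepts.
class MyLibMutexInterface
{
public:
    virtual ~MyLibMutexInterface() = default;
    virtual void lock() = 0;
    virtual void unlock() = 0;
};

// Default implementation for users who don't want to provide one.
class MyLibDefaultMutex final : public MyLibMutexInterface
{
public:
    void lock() override { m_mtx.lock(); }
    void unlock() override { m_mtx.unlock(); }

private:
    std::mutex m_mtx;
};

// Library code: mostly lock-free work with one small critical section
// protected internally, so users don't have to lock the whole call.
void updateBodyCount(MyLibMutexInterface& mtx, int& sharedCounter)
{
    // ... lots of thread-safe work would happen here ...
    mtx.lock();
    ++sharedCounter; // the small critical section
    mtx.unlock();
}
```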

Next, avoid global variables (meaning mutable global variables, not your typical `static const char* someString;`) and that includes singletons. Globals may make sense for some thread-local arenas or other use cases, but generally they imply laziness and create issues around multithreading. Try to avoid them.

## Memory Management

As mentioned before, the library should have tight integration with the engine and as little responsibility as possible. Memory management is a great example where both rules apply. Ideally, the library should accept a number of user-provided allocators (you'll see later why you might need more than one) and leave memory management to the user. An allocator could have a pretty simple interface like this one:

```cpp
class MyLibAllocator
{
public:
    void* m_userData;

    void* (*m_allocationCallback)(
        void* userData, unsigned size, unsigned alignment);

    void (*m_freeCallback)(void* userData, void* ptr);
};
```

There are a few things to note about MyLibAllocator. It doesn't use virtual methods: virtuals imply an indirection through the vtable, and since the allocation functions will be called a lot in performance-critical code, plain old C callbacks are preferable. Accepting an explicit alignment in the allocation callback is also important.
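To make this concrete, here's a sketch of plugging a default aligned allocator into those callbacks. The struct repeats the MyLibAllocator layout from above, and makeDefaultAllocator is a made-up helper; std::aligned_alloc is C++17 (and not available on MSVC):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

struct MyLibAllocator
{
    void* m_userData = nullptr;

    void* (*m_allocationCallback)(
        void* userData, unsigned size, unsigned alignment) = nullptr;

    void (*m_freeCallback)(void* userData, void* ptr) = nullptr;
};

static void* defaultAlloc(void* /*userData*/, unsigned size, unsigned alignment)
{
    // aligned_alloc wants the size rounded up to a multiple of alignment.
    const unsigned rounded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, rounded);
}

static void defaultFree(void* /*userData*/, void* ptr)
{
    std::free(ptr);
}

MyLibAllocator makeDefaultAllocator()
{
    MyLibAllocator alloc;
    alloc.m_allocationCallback = defaultAlloc;
    alloc.m_freeCallback = defaultFree;
    return alloc;
}
```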

Another thing that many (actually all) libraries get wrong is that they take the user-provided allocator and store it in global variables (most of the time two callbacks and maybe a void*). This is a side effect of C++: overridden operator new and operator delete are static methods, which creates some complications (a long story I won't expand on here). But we already discussed that globals are bad, and that applies to the way the allocator is stored. Instead, have a context that holds the user allocator, and require all objects of your library to know about that context. If an object knows the context, it can also allocate memory.

```cpp
class MyLibContext
{
public:
    // Create a context with an optional user allocator
    MyLibContext(const MyLibAllocator& userAllocator = {})
        : m_allocator(userAllocator)
    {}

private:
    // A copy of the user allocator
    MyLibAllocator m_allocator;
};
```

The library shouldn't necessarily be tied to a single allocator, since it's absolutely possible and desirable to accept specialized allocators besides a global one. For example, our pseudo physics library can use the global allocator (stored in the MyLibContext) to allocate the MyLibSphereCollisionShape or MyLibRigidBody classes, but also take a fast allocator to be used only for the duration of a single function. Example:

```cpp
// Library code:
extern void update(MyLibAllocator& fastAllocator, double deltaTime);

// User code:
MyLibAllocator superFastStackAllocator;
update(superFastStackAllocator, dt);
superFastStackAllocator.freeAllMemory();
```

So, the fastAllocator could be a fast linear allocator that doesn't do any per-allocation deallocations. The update function performs temporary allocations using that allocator and, when it's done, the application frees all the memory at once or recycles it.
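A minimal sketch of such a linear (bump) allocator, assuming a user-provided buffer; a real one would hook its allocate into the MyLibAllocator callbacks. Allocations just advance an offset, and freeing everything is a single reset:

```cpp
#include <cassert>
#include <cstddef>

class LinearAllocator
{
public:
    LinearAllocator(void* buffer, std::size_t size)
        : m_buffer(static_cast<unsigned char*>(buffer)), m_size(size)
    {}

    void* allocate(std::size_t size, std::size_t alignment)
    {
        // Bump the offset up to the requested alignment.
        const std::size_t aligned = (m_offset + alignment - 1) / alignment * alignment;
        if(aligned + size > m_size)
            return nullptr; // Out of memory; real code might assert or grow
        m_offset = aligned + size;
        return m_buffer + aligned;
    }

    // No per-allocation free: everything is released in one shot.
    void freeAllMemory() { m_offset = 0; }

private:
    unsigned char* m_buffer;
    std::size_t m_size;
    std::size_t m_offset = 0;
};
```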

## General Concepts

Avoid exceptions. Not because they are bad but because game engines traditionally avoid them and discourage them.

The internal source code of your library should be in pristine shape, just like your public interfaces. Self-documented code and regular documentation are pretty important for advanced users who may end up debugging your code or wanting to extend it. Using clang-format (a popular C/C++ code formatter used in projects such as LLVM), for example, will also earn your codebase points, as it implies consistency.

If possible, try to avoid STL containers. STL containers are too generic for most use cases, and whenever you see the words "too generic" expect performance issues. If you really, really want to use them, make sure you have built an STL-compatible allocator to pass into them.
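If STL containers are unavoidable, a minimal STL-compatible allocator looks roughly like this (a sketch that forwards to plain operator new; a real one would forward to the context's MyLibAllocator callbacks):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template<typename T>
struct MyLibStlAllocator
{
    using value_type = T;

    MyLibStlAllocator() = default;

    // Required so containers can rebind to their internal node types.
    template<typename U>
    MyLibStlAllocator(const MyLibStlAllocator<U>&) {}

    T* allocate(std::size_t n)
    {
        // A real implementation would call the context's allocation callback.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template<typename T, typename U>
bool operator==(const MyLibStlAllocator<T>&, const MyLibStlAllocator<U>&) { return true; }

template<typename T, typename U>
bool operator!=(const MyLibStlAllocator<T>&, const MyLibStlAllocator<U>&) { return false; }
```

With this, `std::vector<int, MyLibStlAllocator<int>>` draws its memory from user-controlled code instead of the default heap.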

## Conclusion

The general theme of this article is to always assume that the users of your middleware are smarter than you. Not because they really are, but because they might have use cases you haven't even imagined. The users of your library are software engineers, and the more experienced they get, the more their OCD (Obsessive Compulsive Disorder) kicks in. Things that don't appear important to you might be important to them, so take feedback seriously.

This article has been co-published with permission from Panagiotis Christopoulos Charitos, maintainer for the Anki 3D Engine. (Twitter: @anki3d)

## User Feedback

Your `void popRigidBody(MyLibRigidBody* body)` is a bit too similar to the 'push' function; it's impossible to store the popped body outside the function.

EDIT: That is, unless you copy the body content. In that case however, you'd expect the 'push' function to make a copy too.

Edited by Alberth

---

On 4/13/2019 at 1:12 PM, Alberth said:

> Your `void popRigidBody(MyLibRigidBody* body)` is a bit too similar to the 'push' function; it's impossible to store the popped body outside the function.
>
> EDIT: That is, unless you copy the body content. In that case however, you'd expect the 'push' function to make a copy too.

This is almost pseudo-code. The details don't matter much, especially since this is part of a technique that should be avoided.

But to satisfy your curiosity: rename this method to removeRigidBody. removeRigidBody will iterate m_rigidBodies until it finds the pointer and then remove it.

---

> Having const methods most of the time implies thread-safety

I think this leads to a false feeling of thread-safety. An object or a function being const doesn't mean that the implementation or usage is thread-safe by default, not even "most of the time". You can only be sure that the author of the code intended the function not to change any data on the object itself, not that data access is thread-safe.

```cpp
template<typename T> class Array
{
    inline size_t Size() const
    { ... }

    inline void Resize(size_t newSize)
    { ... }
};
```

In this example, Array<T>::Size() is const-correct, as the size will never be changed by the function, and an array passed to a function as Array<T> const& can expose its size even through a const reference. But that doesn't mean Size() is thread-safe, since another thread could still call Resize() through a non-const pointer or reference to the array.

> Next thing, avoid global variables ... generally they imply laziness and create issues around multithreading

I think this statement is too general too. As with const correctness, it depends on how and when the global variable is used, and often there is a reason a global variable is used.

```cpp
int64 initialTimeStamp = GetUTCFileTime();

int64 MyTime::GetUTCTime()
{
    return initialTimeStamp + GetTicksSinceStartup();
}
```

This example illustrates a case where a global variable is used to maintain an initial state that is obtained from a "slow" function like obtaining the file time from the OS to provide a faster way to get current UTC time, maybe for logs or to display in a game UI. The global variable is initialized once but never changed again so it "is thread-safe" too.

> Virtual methods have an indirection through the vtable. Since the allocation functions will be used a lot and because performance is critical

VTable lookups have a performance impact at some level, as the compiler needs at least an additional instruction for the lookup, but they shouldn't in general be considered slow, especially compared to memory acquisition. The real performance impact here is memory allocation/deallocation, so a library that calls itself "high performance" should decide carefully when and how memory is acquired instead of strictly avoiding inheritance.

One should always avoid multiple inheritance, but in the allocator case you will most likely have an abstract base class or interface and inherit just one derived class from it.

> Avoid exceptions. Not because they are bad

While I agree with your statement, note that exceptions are bad for performance when they are thrown; even if they're caught and silently handled, an exception causes a performance impact.

> whenever you see the word “too generic” expect performance issues

You should be more specific; templates are generic but don't have to be considered slow in general. Otherwise game engines would have really big trouble using dynamic lists or arrays. Again it depends: a memory lookup in a hash table is slow, but that doesn't mean that the hash table itself is slow.

---

10 minutes ago, Shaarigan said:

> I think this leads to a false feeling of thread-safety. An object or a function being const doesn't mean that the implementation or usage is thread-safe by default, not even "most of the time"

I chose my words poorly here, sorry. Yes, it doesn't really imply thread-safety, but at least it gives some guarantees. In that case "const" basically states that a method is read-only and that it's perfectly safe to have multiple threads calling read-only methods at the same time.

10 minutes ago, Shaarigan said:

> I think this statement is too general too. As with const correctness, it depends on how and when the global variable is used, and often there is a reason a global variable is used.

In your example you set the global once and never touch it again, so it falls under "meaning mutable global variables, not your typical `static const char* someString;` thing". In your example initialTimeStamp should be const.

11 minutes ago, Shaarigan said:

> VTable lookups have a performance impact at some level as the compiler needs at least an additional instruction for the lookup, but they shouldn't in general be considered slow, especially against memory acquisition.

The additional instruction(s) might not be the issue here, true. The issue is the additional cache miss on the vtable. VTables are global and all class instances point to them, so you have to fetch a cache line located who knows where.

11 minutes ago, Shaarigan said:

> While I agree with your statement, note that exceptions are bad for performance when they are thrown; even if they're caught and silently handled, an exception causes a performance impact.

True, but there are other hidden issues with exceptions. To use them you need RAII, and RAII is not always great for performance. Imagine you have a Vector<int> myVector. If you're using custom allocators and have to have RAII, then myVector should know how to deallocate itself, so it should hold a copy of (or a pointer to) the allocator. That's an additional member in myVector, and it creates a larger data structure. Larger structures may result in more cache misses. The alternative is to be explicit, but that's not compatible with RAII: myAllocator.destroy(myVector).

11 minutes ago, Shaarigan said:

> You should be more specific; templates are generic but don't have to be considered slow in general. Otherwise game engines would have really big trouble using dynamic lists or arrays.

"Too generic" is mentioned in the context of STL.

---

> Many libraries have a C++ implementation but C public interfaces. This is generally a good idea because C will force the public interface to be minimal and clean and at the same time it will make the library easy to embed into other languages (e.g. Python bindings). But that doesn’t apply to everything so keep that in mind.

The core reason I've seen for this is because C++ doesn't have a stable ABI between different compilers. And then you get this mess: https://www.sfml-dev.org/download/sfml/2.5.1/ if you want to provide binaries.

With a C API, the ABI is the same for all compilers, and some other languages can hook directly into it (like Python with ctypes).

---

11 minutes ago, Daid said:

> The core reason I've seen for this is because C++ doesn't have a stable ABI between different compilers. And then you get this mess: https://www.sfml-dev.org/download/sfml/2.5.1/ if you want to provide binaries.
>
> With a C API, the ABI is the same for all compilers, and some other languages can hook directly into it (like Python with ctypes)

That is a good point. Please note that the article is primarily dealing with open source software, which makes this a non-issue.

---

8 hours ago, Godlike said:

> That is a good point. Please note that the article is primarily dealing with open source software, which makes this a non-issue.

Note that SFML is an open source library, and they have this issue ;-)

I do have to say that your article is kinda limited to "the exact scope of your project". Even if it hits quite a few valid points.

Also, about const: adding "const" can be important for other reasons than just the API. It also impacts performance (the result of a const function can be re-used by the compiler in certain cases), and occasionally what you are allowed to do (if you deal with rvalue temporaries). And also, there is no guarantee that a const method does not modify an object; I've seen something along these lines:

```cpp
class A
{
public:
    Matrix4 getTransformationMatrix() const
    {
        if(matrix_outdated)
            (const_cast<A*>(this))->updateTransformationMatrix();
        return transformation_matrix;
    }
};
```

Where a local cache is updated in a const function. It's const in the sense that it returns the same value every time, but it's not const in the sense that it does modify internal state, and thus it's not thread-safe for reading. I would never do this, but I've seen it.

---

Well..... I break many of these, ha! ha! But then I mostly write libraries for myself (at least until I start to recruit some help), so I guess this doesn't really apply to me directly. I do pay attention to cache misses a lot. On the other hand I don't necessarily think that threading should be outside of library code. I like to use thread pools and pass in the number of threads that I allow a library to use. I also think smart pointers can be used in many cases; for instance I sometimes combine them with thread-specific heaps. For other problems I may implement a mini garbage collector that just works on one particular data set and runs between iterations. With 64-bit machines you have a lot of address space, so this gives you a lot of memory management flexibility; for instance it becomes easier to design things to keep your data contiguous to help with cache issues. It really depends on the exact problem you are trying to solve, and benchmarking is really the only way to know for sure if you are doing things efficiently. Even then, there may be alternate solutions you may not have even thought of.

In short, I don't think you can lay down hard and fast rules that apply to all cases, but interesting read......

---

6 hours ago, Daid said:

> Note that SFML is an open source library, and they have this issue ;-)
>
> I do have to say that your article is kinda limited to "the exact scope of your project". Even if it hits quite a few valid points.

As for your second point: yes, it's very difficult to cover all kinds of projects. Maybe the title of the article should have been something like "Designing good C++ game middleware that will be integrated into the critical path of game engines" or something.

5 hours ago, Gnollrunner said:

> On the other hand I don't necessarily think that threading should be outside of library code. I like to use thread pools and pass in the number of threads that I allow a library to use.

Do you have an example where threadpools inside a library are useful? Maybe I'm missing a use case.

---

14 minutes ago, Godlike said:

> Do you have an example where threadpools inside a library are useful? Maybe I'm missing a use case.

For instance my voxel library looks down a giant octree and tries to identify where there will be mesh geometry. This is constantly changing because of LOD. It may have to look far down past the current leaf nodes in the tree. To do this it builds temporary lightweight sub-octrees (to be converted to the real tree later) and runs lots of user-defined functions such as noise functions. I want to do this in parallel for different parts of the tree, so when I hit leaf nodes that I may need to extend, I push them on a queue and, when a thread in the thread pool is free, it picks one up and processes it.

There may be some way to take this out of the library but right now it's a single call, which handles all the necessary threading, and thread specific heaps so I don't need to mess with it externally.

---

8 hours ago, Gnollrunner said:

> For instance my voxel library looks down a giant octree and tries to identify where there will be mesh geometry. This is constantly changing because of LOD. It may have to look far down past the current leaf nodes in the tree. To do this it builds temporary lightweight sub-octrees (to be converted to the real tree later) and runs lots of user-defined functions such as noise functions. I want to do this in parallel for different parts of the tree, so when I hit leaf nodes that I may need to extend, I push them on a queue and, when a thread in the thread pool is free, it picks one up and processes it.
>
> There may be some way to take this out of the library but right now it's a single call, which handles all the necessary threading, and thread-specific heaps so I don't need to mess with it externally.

So you have your own implementation of worker threads, which is exactly the point. You might now have multiple thread pools with workers, which can mean more active threads at once than you would want, hurting performance. Imagine having a few of those libraries that think it's fine to have their own internal thread pool, and you end up with a bunch of thread pools and no control left.

I would recommend having the option to provide a custom threadpool/job-queue and provide a default implementation as well for people that don't want to hook it into anything else.

---

2 minutes ago, Daid said:

> So you have your own implementation of worker threads, which is exactly the point. You might now have multiple thread pools with workers, which can mean more active threads at once than you would want, hurting performance. Imagine having a few of those libraries that think it's fine to have their own internal thread pool, and you end up with a bunch of thread pools and no control left.
>
> I would recommend having the option to provide a custom threadpool/job-queue and provide a default implementation as well for people that don't want to hook it into anything else.

That's exactly the point of the article. If a single entry point of a library wakes up N threads, how would an engine (the user) integrate that into a job/task-based system? The entry point would starve the engine's threads and vice versa. It's not ideal for performance.

---

48 minutes ago, Godlike said:

> That's exactly the point of the article. If a single entry point of a library wakes up N threads, how would an engine (the user) integrate that into a job/task-based system? The entry point would starve the engine's threads and vice versa. It's not ideal for performance.

First, N is a number that you specify. Second, I simply don't assume a programmer doesn't know what he's doing. However, I really don't want to get into some fuzzy religious discussion. In my experience there are almost no hard and fast rules that should always be adhered to. If I remember correctly, even C++17 has parallel support in the standard library.
