# C++ How to distribute executables that take maximum advantage of the client's hardware

Hi there,

I am toying around with my own small C++/OpenGL game engine, which I might use to publish some small games in the future. However, one question has been bothering me for a while now: how do I make sure that my executable takes full advantage of the client system's hardware if some features need to be known at compile time? For example:

My engine supports SIMD registers of different widths (SSE and AVX). If I compile with AVX2 enabled, the code won't run on processors that do not support AVX2. If I use just SSE 4.2, I don't utilize the full potential of newer processors. How is such a problem usually solved?

Do I build an executable / library for every kind of architecture option and select the right one dynamically on the client system?

Alternatively, I could build the binaries on the client system during installation, but I am not sure if this is a good way.

So how is this usually handled?

Greetings

##### Share on other sites
49 minutes ago, DerTroll said:

Do I build an executable / library for every kind of architecture option and select the right one dynamically on the client system?

That's a fairly straightforward way to do it. You can also compile different variations of intensive functions (e.g. visibility culling) within the one exe, and at startup, fetch a function pointer to the appropriate version and use that to call the function every frame.

Code generated by the ISPC compiler can implement that method automatically, adding only something like 7 ns of overhead per function call (so practically nothing if your functions operate on large amounts of data).
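To make the function-pointer idea concrete, here is a minimal sketch, assuming GCC or Clang on x86 (it relies on the `__builtin_cpu_supports` builtin and the `target` function attribute). The names `add_scalar`, `add_avx2`, `select_add` and `g_add` are made up for illustration, and the AVX2 variant is just a plain loop that the compiler is allowed to vectorize with AVX2; it is not a hand-tuned kernel.

```cpp
#include <cstddef>

// Baseline version: compiled with the default instruction set of the build.
void add_scalar(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define DEMO_HAS_AVX2_VARIANT 1
// Same loop, but the compiler may emit AVX2 instructions for this function
// only. The rest of the translation unit keeps the baseline instruction set.
__attribute__((target("avx2")))
void add_avx2(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}
#endif

using AddFn = void (*)(float*, const float*, std::size_t);

// Picks the best variant the running CPU supports.
AddFn select_add() {
#if defined(DEMO_HAS_AVX2_VARIANT)
    __builtin_cpu_init();  // needed if this runs before main(), harmless otherwise
    if (__builtin_cpu_supports("avx2")) return add_avx2;
#endif
    return add_scalar;
}

// Resolved once at startup; the per-frame code just calls through the pointer.
static const AddFn g_add = select_add();
```

Per-frame code would then call `g_add(dst, src, n)` without caring which variant is behind it. If the variants are built with compiler flags instead of the attribute, each one should live in its own source file so that AVX2 instructions cannot leak into code that runs on every CPU.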

##### Share on other sites

There is normally a way to query support at runtime, e.g.:

`__cpuid` on Windows (MSVC)

`__get_cpuid` on Linux (GCC/Clang)

`android_getCpuFeatures` on Android

Personally I tend to call these and choose a different path according to what is available, but there are other options.
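As a rough sketch of what such a query can look like on x86, the code below wraps `__cpuidex` (MSVC) and `__get_cpuid_count` (GCC/Clang) and reads the SSE 4.2, AVX and AVX2 bits. The struct `CpuFeatures` and the helpers `cpuid`, `read_xcr0` and `query_cpu_features` are hypothetical names, not part of any library. Note that AVX also needs the operating system to save the YMM registers, which is why the sketch checks XCR0 as well.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>
  #include <immintrin.h>
  #define DEMO_X86_MSVC 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  #include <cpuid.h>
  #define DEMO_X86_GNU 1
#endif

struct CpuFeatures {
    bool sse42 = false;
    bool avx   = false;
    bool avx2  = false;
};

// Runs CPUID for (leaf, subleaf); returns false if the leaf is not available.
static bool cpuid(std::uint32_t leaf, std::uint32_t subleaf, std::uint32_t out[4]) {
#if defined(DEMO_X86_MSVC)
    int r[4];
    __cpuid(r, 0);  // leaf 0: EAX holds the highest supported leaf
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuidex(r, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) out[i] = static_cast<std::uint32_t>(r[i]);
    return true;
#elif defined(DEMO_X86_GNU)
    return __get_cpuid_count(leaf, subleaf, &out[0], &out[1], &out[2], &out[3]) != 0;
#else
    (void)leaf; (void)subleaf; (void)out;
    return false;  // not an x86 build: report nothing
#endif
}

// Reads XCR0, which tells us which register states the OS saves and restores.
// Only call this when CPUID reports OSXSAVE.
static std::uint64_t read_xcr0() {
#if defined(DEMO_X86_MSVC)
    return _xgetbv(0);
#elif defined(DEMO_X86_GNU)
    std::uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    return 0;
#endif
}

CpuFeatures query_cpu_features() {
    CpuFeatures f;
    std::uint32_t r[4] = {0, 0, 0, 0};
    if (!cpuid(1, 0, r)) return f;
    const std::uint32_t ecx = r[2];
    f.sse42 = ((ecx >> 20) & 1u) != 0;               // leaf 1, ECX bit 20
    const bool osxsave = ((ecx >> 27) & 1u) != 0;    // leaf 1, ECX bit 27
    const bool cpu_avx = ((ecx >> 28) & 1u) != 0;    // leaf 1, ECX bit 28
    // XCR0 bits 1 and 2: the OS preserves XMM and YMM state.
    const bool ymm_ok = osxsave && ((read_xcr0() & 0x6u) == 0x6u);
    f.avx = cpu_avx && ymm_ok;
    if (f.avx && cpuid(7, 0, r)) {
        f.avx2 = ((r[1] >> 5) & 1u) != 0;            // leaf 7, EBX bit 5
    }
    return f;
}
```

Calling `query_cpu_features()` once at startup and caching the result is usually enough, since the feature set does not change while the program runs.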

##### Share on other sites
3 hours ago, Hodgman said:

You can also compile different variations of intensive functions (e.g. visibility culling) within the one exe, and at startup, fetch a function pointer to the appropriate version and use that to call the function every frame.

3 hours ago, lawnjelly said:

Personally I tend to call these and choose a different path according to what is available, but there are other options.

I also thought about something like that, but I wasn't sure if this is a good solution. So basically, if one reduces it to a single runtime switch, I would do something like this (pseudocode)?

    if (hasAVX)
        MainFunctionAVX();
    else if (hasSSE)
        MainFunctionSSE();
    else
        MainFunctionSerial();

As long as a processor without AVX support never takes the AVX branch, the program will not crash, even though the binary contains compiled AVX code?

##### Share on other sites
1 hour ago, DerTroll said:

I also thought about something like that, but I wasn't sure if this is a good solution. So basically, if one reduces it to a single runtime switch, I would do something like this (pseudocode)?

    if (hasAVX)
        MainFunctionAVX();
    else if (hasSSE)
        MainFunctionSSE();
    else
        MainFunctionSerial();

As long as a processor without AVX support never takes the AVX branch, the program will not crash, even though the binary contains compiled AVX code?

It's a serviceable solution. Using a function pointer is doing the same thing in a more elegant way, and the Intel compiler is presumably doing something similar underneath. It is also transparent to the user, which is a big concern, and space efficient.

Obviously it is important to do this switch at a slightly higher level than the atomic functions, i.e. choose SSE 4.2 once and then do something 1000x, rather than switching 1000x, once per call.
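A small sketch of what "switching at a higher level" can mean in practice: the branch on the SIMD level happens once per batch, and the chosen kernel then loops over all elements. Everything here is hypothetical: the `SimdLevel` enum, the `scale_*` kernels (which are identical stand-ins for real SSE/AVX implementations) and the `g_dispatch_count` counter, which exists only to show how often the switch runs.

```cpp
#include <cstddef>

enum class SimdLevel { Serial, SSE, AVX };

// Stand-in kernels: in a real engine each would contain the matching
// SIMD implementation. Here they all run the same plain loop.
void scale_serial(float* v, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) v[i] *= k;
}
void scale_sse(float* v, std::size_t n, float k) { scale_serial(v, n, k); }
void scale_avx(float* v, std::size_t n, float k) { scale_serial(v, n, k); }

// Instrumentation only: counts how many times the switch was evaluated.
static int g_dispatch_count = 0;

// One dispatch decision per batch, not one per element.
void scale_batch(SimdLevel level, float* v, std::size_t n, float k) {
    ++g_dispatch_count;
    switch (level) {
        case SimdLevel::AVX: scale_avx(v, n, k);    break;
        case SimdLevel::SSE: scale_sse(v, n, k);    break;
        default:             scale_serial(v, n, k); break;
    }
}
```

Scaling 1000 floats through `scale_batch` costs one switch, whereas calling a per-element function that switches internally would evaluate it 1000 times.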

There are some cases where some more care / a different method might be warranted - some people use SIMD accelerated vector etc classes throughout their code rather than just at bottlenecks. You may also be able to tell the compiler to try to autovectorize and do specific SIMD optimizations throughout, which again might warrant a different approach (maybe different builds / choosing different DLL at startup etc).

There's no problem afaik in having parts of your code that don't get called (or rather I have never run into any lol ). Much as if you have part that would crash normally, if it doesn't get called it's not a problem (you can have data in there too afaik). There might be virus scanners etc that try and figure out what is going on in unreached code, but even they shouldn't choke. On the other hand modern virus scanners / OSes don't like it when you try to alter the code segment, so you can't e.g. load the program then fixup the part that does the SIMD, for security reasons.

##### Share on other sites
15 hours ago, lawnjelly said:

Obviously it is important to do this switch at a slightly higher level than the atomic functions, i.e. choose SSE 4.2 once and then do something 1000x, rather than switching 1000x, once per call.

Absolutely right. Even though branches on the CPU are not as disastrous as on the GPU, I try to avoid them as much as possible. Every time I benchmark code, avoiding branches or moving them to a level that minimizes their execution yields the highest speedup.

15 hours ago, lawnjelly said:

some people use SIMD accelerated vector etc classes throughout their code rather than just at bottlenecks. You may also be able to tell the compiler to try to autovectorize and do specific SIMD optimizations throughout, which again might warrant a different approach (maybe different builds / choosing different DLL at startup etc).

I wrote my own linear algebra library based on a flexible SSE/AVX implementation. It taught me a lot about vectorization, so I use it whenever I can. The auto-vectorization of GCC and Clang yields quite good results, but only if the handwritten implementation would be easy. For example, I recently wrote an SSE accelerated Gaussian elimination algorithm for dense matrices, which I think is almost at maximum efficiency for the matrix size I am aiming for. I compared it to a non-SSE version of the code and some popular linear algebra library. At first I was shocked that the gain of my handwritten SSE version was less than 10% when compared to the simple implementation. However, then I realized that this was only the case when the matrix size was a multiple of the register size. If not, the auto-vectorization compared rather poorly.

However, back to the topic:

Since I also use automatic vectorization, because you never know what the compiler can further optimize, I guess choosing between differently compiled versions of a dynamic library at runtime would be the way to go for me. This is because I can't tell the compiler for each code section which auto-vectorization level it has to apply. I am still not sure how to select a dynamic library at runtime. I have never done this before.
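For what it's worth, selecting a library at runtime usually comes down to `dlopen`/`dlsym` on Linux and `LoadLibraryA`/`GetProcAddress` on Windows. The following is only a sketch under assumptions: the file names (`libengine_avx2.so` and friends) and the exported symbol `run_frame` are invented for illustration, and the library would have to export that symbol with C linkage for the lookup to work.

```cpp
#include <string>

#if defined(_WIN32)
  #include <windows.h>
#else
  #include <dlfcn.h>  // on older glibc versions, link with -ldl
#endif

// Hypothetical naming scheme: one build of the engine library per SIMD level.
std::string engine_library_name(bool has_avx2) {
#if defined(_WIN32)
    return has_avx2 ? "engine_avx2.dll" : "engine_sse42.dll";
#else
    return has_avx2 ? "./libengine_avx2.so" : "./libengine_sse42.so";
#endif
}

using RunFrameFn = void (*)();

// Loads the library at `path` and looks up the (hypothetical) exported
// function `run_frame`. Returns nullptr if the library or symbol is missing.
RunFrameFn load_run_frame(const std::string& path) {
#if defined(_WIN32)
    HMODULE lib = LoadLibraryA(path.c_str());
    if (!lib) return nullptr;
    return reinterpret_cast<RunFrameFn>(GetProcAddress(lib, "run_frame"));
#else
    void* lib = dlopen(path.c_str(), RTLD_NOW);
    if (!lib) return nullptr;
    return reinterpret_cast<RunFrameFn>(dlsym(lib, "run_frame"));
#endif
}
```

The executable itself would be built for the lowest supported instruction set, detect the CPU features at startup, and then call something like `load_run_frame(engine_library_name(hasAVX2))`, falling back to the baseline library if the result is `nullptr`.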

On the other side, there is still the option of compiling different executables.

Greetings

##### Share on other sites

There are some good options to runtime support of hardware variants addressed here.

The one big problem that hasn't been addressed is the testability of each. It's fine to write a whole lot of alternative code, but it's much harder to make sure you get full test coverage. This is especially true for cases where you don't have access to hardware, such as an Intel CPU without SSE instructions (last made, what, 20 years ago?). That kind of consideration makes it more effective to just decide not to support hardware that lacks certain basic features. It's a cost-based analysis: is it more profitable to spend time supporting hardware used by 1 or 2 consumers, or to focus on a quality product for the remaining 99.9% of your target market?
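One way to keep the variants that you can run honest is to check each of them against a plain reference implementation, including sizes that are not a multiple of the register width. The sketch below is only an illustration: `dot_reference`, `dot_four_lanes` (a scalar stand-in for a 4-wide SIMD kernel with a tail loop), `matches_reference` and `all_sizes_match` are hypothetical names, and a real test would register the actual SIMD kernels and skip those the test machine does not support.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using DotFn = float (*)(const float*, const float*, std::size_t);

// Straightforward reference implementation.
float dot_reference(const float* a, const float* b, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Stand-in for a 4-wide SIMD kernel: four partial sums plus a scalar tail
// for the elements that do not fill a whole "register".
float dot_four_lanes(const float* a, const float* b, std::size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) lane[k] += a[i + k] * b[i + k];
    float s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i) s += a[i] * b[i];  // tail
    return s;
}

// Compares one variant against the reference for a given input size.
bool matches_reference(DotFn fn, std::size_t n) {
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i % 7) - 3.0f;
        b[i] = static_cast<float>(i % 5) + 1.0f;
    }
    const float expected = dot_reference(a.data(), b.data(), n);
    const float actual   = fn(a.data(), b.data(), n);
    return std::fabs(expected - actual) <= 1e-4f * (1.0f + std::fabs(expected));
}

// Covers empty input, sizes below one register, and non-multiples of 4.
bool all_sizes_match(DotFn fn) {
    for (std::size_t n = 0; n <= 33; ++n)
        if (!matches_reference(fn, n)) return false;
    return true;
}
```

This does not solve the problem of hardware you cannot get your hands on, but it at least catches variants that disagree on the machines you do have.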

##### Share on other sites
6 minutes ago, Bregma said:

There are some good options to runtime support of hardware variants addressed here.

The one big problem that hasn't been addressed is the testability of each. It's fine to write a whole lot of alternative code, but it's much harder to make sure you get full test coverage. This is especially true for cases where you don't have access to hardware, such as an Intel CPU without SSE instructions (last made, what, 20 years ago?). That kind of consideration makes it more effective to just decide not to support hardware that lacks certain basic features. It's a cost-based analysis: is it more profitable to spend time supporting hardware used by 1 or 2 consumers, or to focus on a quality product for the remaining 99.9% of your target market?

Good point. I am not targeting architectures with SSE support below 4.2. But I have a #define that switches to AVX2 if it is available. So in this case compiling two executables is probably the easiest way. But I am not sure if it will stay that simple in the future. Therefore, the purpose of this discussion was to find possible solution strategies in case it ever becomes a problem.

I use Travis CI to test my code and I think as long as the code is running on those machines with -march=native I am fine. I won't aim to support systems with older hardware.

Greetings

##### Share on other sites

1) SIMD

3) OGL

You have two ways:

1) compile time polymorphism

2) runtime polymorphism

In the worst case you ship all features on all platforms and hope they match.

##### Share on other sites
7 minutes ago, Makusik Fedakusik said:

3

Thanks, this is a useful link. OGL versions and the maximum number of threads are not a problem in my case.
