SIMD with C++

Started by
6 comments, last by Jan Wassenberg 16 years ago
I've been searching around a bit for information on how C++ compilers support SIMD instructions, hoping to find the simplest but still portable way to give functions such as matrix multiplication the optional ability to be compiled with such instructions. It appears that both gcc and MSVC support SIMD instructions somehow, but all the search results were rather vague about it. So, some questions:
1) Can compilers automatically turn standard C++ code (without any non-standard C++ headers or datatypes) into assembly code that contains SIMD instructions, given the right compiler flags? E.g. if I use just float and double datatypes in a way that would allow usage in SIMD instructions, can the compiler internally convert them to vectors of 4 floats or doubles and use them? It is not clear to me from the explanations of gcc and SIMD whether the flags for SSE and so on will make the optimizer produce SIMD instructions for standard C++ code, or only allow usage of special datatypes.
2) Headers like emmintrin.h: how many headers of this kind exist, and which compilers support them?
3) Datatypes like __m128: how many datatypes of this kind exist, and which compilers support them?
4) C preprocessor symbols such as __SSE__ are defined in gcc if you compile with SSE support (similarly for SSE2 etc.). Are the same preprocessor symbols also used by other compilers such as MSVC?
5) Apparently there are libraries for using SIMD instructions in a portable way. However, is it also possible to code SIMD instructions in a fairly portable way (for desktop PCs) by using the preprocessor macros (to have a non-SIMD alternative if not supported), datatypes and headers mentioned above?
6) Does the gaming industry use these instructions often, and what are the most popular ways?
7) Do you have to use datatypes like __m128 everywhere to gain an advantage, or can you take multiple standard C++ numbers (e.g. from your matrix), convert them to __m128 & co, do the multiplication, and convert the result back to regular C++ datatypes, and will it still be more efficient then?
8) Is it true that only AMD supports 3DNow! and Intel does not? Does AMD support SSE? Is 3DNow! used by anybody at all, given that it doesn't work on many mainstream processors?
Thanks [Edited by - Lode on April 3, 2008 4:16:25 AM]
I'd suggest simply not supporting it to start with, especially when you need to handle multiple compilers. Most games won't benefit significantly from SSE anyway. It also adds significant quantities of extra coding and testing work.

If once you've written the game you find it's running too slow and profiling finds that the game is CPU limited and the hotspots are in functions where SSE would be useful, that is the point to consider using it.

Unfortunately one big downside of SSE is that on the slower CPUs where you want the most extra performance the SSE support is worst (they may not even have it at all).
1)
a) Intel compiler: Yes, and rather sophisticated, however there is a big gotcha. The compiler will insert conditional branches to non-optimized code with abysmal performance for AMD processors (see for example Agner Fog's docs for a reference). For me, this rules out the Intel compiler for any kind of development, regardless of how good it is otherwise.
b) gcc 3.4 : Yes for scalar floating point math, and works very well, too. No support for auto-vectorization.
c) gcc 4.2 : Yes for scalar floating point math, highly optimized. Yes for auto-vectorization if -ftree-vectorize is used, but only works on relatively simple loops.
d) gcc 4.3+: Supposedly better auto-vectorizer, have not tried, however.
e) Microsoft : don't know, don't care
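
To make point 1 concrete, here is a rough sketch of the kind of "relatively simple loop" an auto-vectorizer is aimed at. The function name scale_add is made up for illustration. Whether a given compiler actually emits packed SSE instructions for it depends on the compiler version and flags (for gcc, something like g++ -O2 -ftree-vectorize -msse2), so check the generated assembly rather than assuming.

```cpp
#include <cstddef>

// Hypothetical helper: out[i] = a[i] * scale + b[i].
// Plain standard C++: no intrinsics, no special datatypes.
// A loop of this shape (unit stride, no branches, no dependency between
// iterations) is a typical candidate for auto-vectorization.
void scale_add(const float* a, const float* b, float* out,
               std::size_t n, float scale)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * scale + b[i];
}
```

The compiler cannot know that out does not overlap a or b, so it may add a runtime overlap check or fall back to scalar code.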

2) At least 4 of them exist, and every compiler should support them
mmintrin.h : MMX
xmmintrin.h : SSE (also includes mmintrin.h)
emmintrin.h : SSE2 (also includes xmmintrin.h)
pmmintrin.h : SSE3 (...)

3) All compilers should support these:
__m128 4 floats
__m128i 128 bits of integers (16 x 8-bit, 8 x 16-bit, 4 x 32-bit or 2 x 64-bit)
__m128d 2 doubles
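
As a minimal sketch of how the headers from 2) and the datatypes from 3) fit together, the following adds four floats with a single SSE instruction. The function name add4 is invented for the example; the intrinsics themselves come from xmmintrin.h. It assumes an x86 target with SSE enabled.

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical helper: out[0..3] = a[0..3] + b[0..3].
void add4(const float* a, const float* b, float* out)
{
    __m128 va  = _mm_loadu_ps(a);     // load 4 floats, no alignment required
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  // four additions at once
    _mm_storeu_ps(out, sum);          // store 4 floats, no alignment required
}
```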

4) no idea

5) Preprocessor macros won't help much, since the preprocessor only knows what platform you compile for, not what platform the code runs on. You will still have to check at runtime or make different binaries.
Or, do what I do, simply restrict yourself to SSE/SSE2 and write "only Pentium IV, Athlon 64, or better" onto the package in big letters. This somehow limits your potential customers, but then again, if someone still uses a Pentium II, he is probably not a customer you want, anyway.
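
For what it's worth, a compile-time fallback along the lines of questions 4 and 5 could look roughly like this. The names EXAMPLE_USE_SSE and multiply are made up. As far as I know MSVC does not define __SSE__; it defines _M_IX86_FP (1 for SSE, 2 for SSE2) on 32-bit x86 targets and _M_X64 on x64 targets, which is what the sketch tests for. Remember this only selects what gets compiled, not what the user's CPU supports.

```cpp
#include <cstddef>

// Compile-time selection only: this says what the compiler was told to
// target, not what the machine running the program can do.
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  define EXAMPLE_USE_SSE 1      // hypothetical macro name
#  include <xmmintrin.h>
#else
#  define EXAMPLE_USE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] * b[i] for i in [0, n).
void multiply(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#if EXAMPLE_USE_SSE
    // SSE path: four elements per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    // Scalar path: the whole array without SSE, otherwise the leftover tail.
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```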

6) This could fill an entire thread alone. In short, they sure do.

7) You can mix and match, although you have to be aware that conversions do impact performance negatively. SSE works best if it can prefetch and load a big bulk of aligned data, perform a dozen vector operations and puke the result back into RAM bypassing caches. Of course it still works in other situations, but not nearly as well.
There are other things to pay attention to, besides using __m128, for example you cannot allocate objects for use with SSE using operator new (not unless you have overloaded it, anyway) if you want to be able to use aligned loads/stores. The unaligned versions are much slower and pretty much ruin the point of using SSE. Also, there are a few quirks with SSE instructions, which differ between manufacturers, too.
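
To illustrate the alignment point, here is a sketch that gets 16-byte aligned heap memory from _mm_malloc (declared via xmmintrin.h in gcc and MSVC) instead of operator new, and then uses the aligned load/store intrinsics. The helper names are hypothetical, and it assumes an x86 target with SSE enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_malloc, _mm_free, _mm_load_ps, _mm_store_ps

// Hypothetical helper: allocates n floats on a 16-byte boundary.
// The result must be released with _mm_free, not delete or free.
float* alloc_aligned_floats(std::size_t n)
{
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

// Hypothetical helper: doubles n floats in place.
// Preconditions: data is 16-byte aligned and n is a multiple of 4.
void double_in_place(float* data, std::size_t n)
{
    assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);
    assert(n % 4 == 0);
    const __m128 two = _mm_set1_ps(2.0f);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);            // aligned load
        _mm_store_ps(data + i, _mm_mul_ps(v, two));  // aligned store
    }
}
```

Passing a misaligned pointer to _mm_load_ps or _mm_store_ps typically crashes, which is why the sketch asserts on the address first.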

8) All reasonably recent AMDs support SSE and SSE2, only the newest dual-cores support SSE3. Funnily, many more AMD processors support SSE than MMX.
SSE and SSE2 should be available on pretty much any Intel chip you encounter, and SSE3 on every reasonably recent one. I don't think any Intel chip supports 3DNow, but honestly I don't know.
Of course, you should never rely on such features, but check cpuid instead.
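
A runtime check might be sketched as follows. It reads CPUID leaf 1, where EDX bit 25 reports SSE, EDX bit 26 reports SSE2 and ECX bit 0 reports SSE3. The struct and function names are invented, and the sketch assumes either MSVC (__cpuid from intrin.h) or gcc/clang (__get_cpuid from cpuid.h); on anything else it simply reports no support.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>   // __cpuid (MSVC)
#  define EXAMPLE_HAVE_CPUID 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#  include <cpuid.h>    // __get_cpuid (gcc and clang)
#  define EXAMPLE_HAVE_CPUID 1
#else
#  define EXAMPLE_HAVE_CPUID 0   // not x86, or unknown compiler
#endif

// Hypothetical result type for the example.
struct SimdSupport {
    bool sse;
    bool sse2;
    bool sse3;
};

// Hypothetical helper: queries the CPU the program is running on.
SimdSupport detect_simd()
{
    SimdSupport s = {false, false, false};
#if EXAMPLE_HAVE_CPUID
    unsigned ecx = 0, edx = 0;
#  if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};      // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    ecx = static_cast<unsigned>(regs[2]);
    edx = static_cast<unsigned>(regs[3]);
#  else
    unsigned eax = 0, ebx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return s;                    // leaf 1 not available
#  endif
    s.sse  = ((edx >> 25) & 1u) != 0;
    s.sse2 = ((edx >> 26) & 1u) != 0;
    s.sse3 = ((ecx >> 0)  & 1u) != 0;
#endif
    return s;
}
```

The result would normally be computed once at startup and used to pick between an SSE and a plain code path.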

I'd recommend you start by laying out all your data so it is "SSE ready". This enables you to add as much SIMD as you need at any time later, and it enables the compiler to better optimize from the start.
Turning on SSE math can be a big performance boost, especially when using expensive operations. If you think you can safely forget about pre-2000 processors (I do), then you should always enable SSE math in your compiler options. It doesn't take any extra work on your side, is perfectly safe, and can make some calculations 2 to 5 times faster (for example square roots).
1)
Yes and no. I think only the Intel compiler currently has the ability to auto-vectorise code. I think GCC and MSVC just use the SSE registers and instructions as a faster FPU, as many of the SSE instructions are lower precision but take fewer clock cycles to complete. So switching from FPU to SSE, even if you're not using the SIMD capabilities, can give performance benefits. It might also free up the other regular x86 registers, which might decrease memory fetches.
1) Visual C++ and GCC can't auto-vectorize code, no. Intel's compiler can do it, at least in simple cases. On the whole, it's probably best to do it yourself.

Quote:
It is not clear to me from the explanations of gcc and SIMD, whether the flags for SSE and so on will make the optimizer produce SIMD instructions for standard C++ code, or only allow usage of special datatypes.

What the SSE flags really do is just tell the compiler to generate floating-point code that uses the SSE scalar instructions rather than x87. (SSE consists of SIMD instructions as well as plain old-fashioned scalar instructions.) These might in some cases be marginally faster than x87 (the "old" FP unit), since 1) they don't have to work with 80-bit precision internally, and 2) they work with "normal" registers you can use freely, whereas x87's registers form a kind of stack (IIRC, one argument to every instruction must be in the "top" register, which obviously leads to some unnecessary register moves).
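
As a small illustration of the scalar case: the code below is ordinary C++ with no intrinsics, and only the compiler switches decide whether it ends up as x87 or SSE scalar instructions. The function name length3 is made up. The switches named in the comments are the ones I know of for 32-bit x86 builds; check your compiler's documentation for its defaults.

```cpp
#include <cmath>

// Hypothetical helper: Euclidean length of a 3-component vector.
// Nothing here mentions SSE. On 32-bit x86, building with e.g.
//   g++ -O2 -msse2 -mfpmath=sse file.cpp     (gcc)
//   cl /O2 /arch:SSE2 file.cpp               (MSVC)
// asks for SSE scalar instructions instead of x87.
// On x86-64 targets SSE2 is the default for floating-point math.
float length3(const float* v)
{
    return std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}
```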

7)
Try it. [grin]
Obviously, having to shuffle your data around, in and out of SSE registers and to and from SIMD code implies some overhead. What is fastest depends on what your code does.
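
One way to try it: the sketch below keeps the matrix and vector as ordinary float arrays, loads them into SSE registers only for the multiplication, and stores the result back into a plain float array. The function name and the column-major layout are assumptions made for the example, and whether it beats scalar code in your program is something only a profiler can tell you.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: out = M * v, where M is a 4x4 matrix stored
// column-major in 16 plain floats and v, out are 4 plain floats.
void mat4_mul_vec4(const float* m, const float* v, float* out)
{
    // Load the four columns (unaligned loads, so any float array works).
    __m128 c0 = _mm_loadu_ps(m + 0);
    __m128 c1 = _mm_loadu_ps(m + 4);
    __m128 c2 = _mm_loadu_ps(m + 8);
    __m128 c3 = _mm_loadu_ps(m + 12);

    // result = c0*v[0] + c1*v[1] + c2*v[2] + c3*v[3]
    __m128 r = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(v[3])));

    // Convert back to regular floats.
    _mm_storeu_ps(out, r);
}
```

The loads and the store are the overhead in question; the more work is done between them, the less they matter.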

8)
I think Intel still doesn't support 3DNow!, that's true. (Unless AMD made it a requirement in 64-bit mode, but there wouldn't be much point.)
However, AMD supports SSE, SSE2 and SSE3 depending on the age of the CPU. (IIRC, SSE was first supported on the Athlon XP, SSE2 on the first Athlon 64s, and SSE3 a year or two later.)
In other words, all recent AMD CPUs support all the SSE instructions you're going to need.
Quote:Original post by Spoonbender
1) Visual C++ and GCC can't auto-vectorize code, no.
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
Docs on compiler support for SSE for Visual C++ here: http://msdn2.microsoft.com/en-us/library/y0dh78ez(vs.80).aspx

Game Programming Blog: www.mattnewport.com/blog

Quote:For me, this rules out the Intel compiler for any kind of development, regardless of how good it is otherwise.

Since you reference Agner's manuals, you may also have come across mention of the code snippet that replaces Intel's deliberately obtuse CPU detection. Doesn't seem right to reject ICC out of hand due to this (overcomable) hurdle.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3

