CPU-specific instructions. When and why.

Started by
15 comments, last by ldeej 16 years, 2 months ago
Hi gamedev, My questions are regarding CPU-specific instruction sets like SSE/MMX. Is there any good literature on what kinds of problems these are intended to solve? Does anyone have any good examples of how they used instructions to solve a problem they had? Do any compilers take advantage of these instructions when compiling C++ code, or does the programmer have to inject their own assembly that specifically calls these instructions? Any links, comments or answers are appreciated. Thanks!
SSE and the like are SIMD instructions (single instruction, multiple data). They are used for performance reasons. Compilers will use them if they can, but most code is too general to generate good SIMD from, so it is usually done by hand.
While reading a book about game maths, I came across SSE. SSE registers are SIMD (single instruction, multiple data), which means that with one instruction you can operate on 128 bits of data (four floats). Let me just copy something I have over at my game-development blog:

Quote:So I asked the friendly community over at gamedev.net if they knew of any assembly tutorials (mainly concerning SSE and MMX)... SSE isn't all that hard. If you really want to start learning it, read through these tutorials very quickly:

* http://www.neilkemp.us/v3/tutorials/SSE_Tutorial_1.html
* http://www.3dbuzz.com/vbforum/showthread.php?t=104753

And then, use this guide as a reference to available instructions:

* http://www.intel80386.com/simd/mmx2-doc.html
Yes, but outdoing the compiler without a lot of specific knowledge isn't so easy. Most times your hand-written code will be slower than regular C/C++ code.

"Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety." --Benjamin Franklin

Quote:Original post by troll_coder
Is there any good literature on what kinds of problems these are intended to solve?

Check the CPU manuals on Intel's and AMD's websites.
Essentially, they're meant to accelerate certain kinds of floating-point math by allowing each instruction to operate on four different values simultaneously.
It can be handy in vector/matrix math, for example. (If you have a coordinate in your world and a vector you want to move by, you can add the two together in a single instruction, updating the x, y and z values at once.) Matrix multiplications can also be done quite a bit faster, if you're careful.

Quote:
Do any compilers take advantage of these instructions when compiling C++ code

A few. GCC didn't, last I checked. Visual Studio doesn't do it either. I believe Intel's C++ compiler can do it to some extent, but I haven't personally used it.

Quote:
or does the programmer have to inject their own assembly that specifically calls these instructions?

Usually, yeah, that's what you have to do.
A better approach might be to use compiler intrinsics, however. These are special identifiers that get compiled to specific SSE instructions, but they have a few advantages:
- They're written directly in your C++ code (asm instructions have to be in special asm blocks, and aren't allowed at all in 64-bit C/C++ code, at least in VC++)
- Because the compiler understands these intrinsics, and they're interleaved with the rest of your code, the compiler *might* be able to optimize them better than if you'd used asm blocks.

And as said above, you have to really know what you're doing if you want to outperform your compiler. It's not that the compiler generates perfect code (it doesn't, far from it. As mentioned, it typically doesn't do SIMD instructions at all), but simply that there are a lot of pitfalls in assembly programming, and a lot of ways you can ruin performance when using SSE.
The thing to note is that SSE stands for "Streaming SIMD Extensions", meaning it works best when used on a lot of data.

Here's a good article on SSE (funny enough, it's on apple.com).
There has been some confusion on this thread, I wanted to clarify the compiler support question:

Every modern C++ compiler for x86 supports SSE intrinsics (GCC, Microsoft's, and Intel's); the difference is to what extent the instruction set is used. Intel has excellent support for SSE intrinsics, and is probably the best compiler in that regard.

As a developer you have three options, with increasing levels of difficulty:
1. Let the compiler use the instruction set automatically where it thinks it makes sense (this is usually a compiler switch that you enable)
2. Use compiler intrinsics to hand-code SSE; you get a set of function-like calls that let you write low-level SSE code in a C++-friendly way (some compilers have trouble reordering and optimizing SSE intrinsics)
3. Write the SSE code yourself using inline assembly (you do the low-level work directly in assembly)
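For option 1, the switch differs per compiler. A few hedged examples (check your own compiler version's documentation for the exact spelling):

```shell
g++ -O2 -msse2 main.cpp       # GCC: permit SSE2 instructions in codegen
cl /O2 /arch:SSE2 main.cpp    # Visual C++: permit SSE2 instructions
icc -O2 -xSSE2 main.cpp       # Intel C++: generate SSE2 code, may vectorize
```

As discussed elsewhere in this thread, enabling the switch does not guarantee vectorized output; on some compilers it mainly swaps x87 scalar math for scalar SSE.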

SIMD instruction sets such as MMX, 3DNow!, SSE, AltiVec (PowerPC), and NEON (ARM) are suited to tasks which apply the same series of instructions over large amounts of data. Good examples include software vertex processing, where you apply the same 4x4 matrix transform to a (generally large) number of vertices, and DSP algorithms, among others.

Most compilers either don't support auto-vectorization (targeting SIMD instruction sets from normal, high-level code) or aren't very good at it. Intel's compiler is the leader among commercial compilers in auto-vectorization. GCC doesn't currently support auto-vectorization, AFAIK, but GCC 4 laid down a lot of foundation for advanced optimizations like auto-vectorization, so it may be coming in the future. I don't believe Microsoft's compiler does auto-vectorization, but I could be wrong.

All of these compilers support intrinsics, which basically replace inline SSE assembly with higher-level constructs that look a lot like functions. For example, the intrinsic to add two packed 4-float vectors is _mm_add_ps, with a signature like __m128 _mm_add_ps(__m128 a, __m128 b). Of course, they don't act like functions on the back end -- there, they act as a hint to the compiler that there is an appropriate SIMD instruction which should be used *if* it helps.

Generally, when deciding if assembly is worthwhile, you must know a few things:
- Know where the program is slower than required and where the bottleneck is.
- Know that the bottleneck is, or can be, suited to the SIMD approach.
- Know that you can write better SIMD code than what the compiler generates.

throw table_exception("(? ???)? ? ???");

Quote:Original post by ldeej
There has been some confusion on this thread, I wanted to clarify the compiler support question:

Every modern C++ compiler for x86 supports SSE intrinsics (GCC, Microsoft's, and Intel's); the difference is to what extent the instruction set is used. Intel has excellent support for SSE intrinsics, and is probably the best compiler in that regard.

As a developer you have three options, with increasing levels of difficulty:
1. Let the compiler use the instruction set automatically where it thinks it makes sense (this is usually a compiler switch that you enable)
2. Use compiler intrinsics to hand-code SSE; you get a set of function-like calls that let you write low-level SSE code in a C++-friendly way (some compilers have trouble reordering and optimizing SSE intrinsics)
3. Write the SSE code yourself using inline assembly (you do the low-level work directly in assembly)


I don't have a link to the article, but a year or two ago there was an article about how Intel's compiler checks both the processor's SSE flag and whether or not the processor is an Intel processor, so that it won't run SSE code on AMD systems even if they support SSE.
Quote:Original post by ldeej
Every modern C++ compiler for x86 supports SSE intrinsics (GCC, Microsoft's, and Intel's); the difference is to what extent the instruction set is used. Intel has excellent support for SSE intrinsics, and is probably the best compiler in that regard.

Not exactly.
Every one of them allows you to use these intrinsics to hand-code SIMD instructions.
But GCC and VC++ cannot by themselves vectorize your code. They will *never* take your plain C++ code and transform it into SIMD.
(They will, if you enable the right compiler setting, use the scalar SSE instructions instead of the x87 floating point ones, but then it's still only operating on a single value at a time, not vectorized, not SIMD)

That is all the "SSE" switch does on VC++ and GCC.

This topic is closed to new replies.
