CPU specific instructions. When and why.

Hi gamedev, my questions are regarding CPU-specific instruction sets like SSE/MMX. Is there any good literature on what kinds of problems these are intended to solve? Does anyone have good examples of how they used these instructions to solve a problem they had? Do any compilers take advantage of these instructions when compiling C++ code, or does the programmer have to inject their own assembly that specifically calls these instructions? Any links, comments or answers are appreciated. Thanks!

SSE and the like are SIMD instructions (single instruction, multiple data). They are used for performance reasons. Compilers will use them if they can, but typical code is too general to generate good SIMD from, so it is usually done by hand.

While reading a book about game maths, I came across SSE. SSE registers use SIMD (single instruction, multiple data), which means that with one instruction you can operate on 128 bits of data (four floats). Let me just copy something I have over at my game-development blog:

Quote:
So I asked the friendly community over at gamedev.net if they knew of any assembly tutorials (mainly concerning SSE and MMX)... SSE isn't all that hard; it's pretty easy. If you really want to start learning it, read through these tutorials very quickly:

* http://www.neilkemp.us/v3/tutorials/SSE_Tutorial_1.html
* http://www.3dbuzz.com/vbforum/showthread.php?t=104753

And then, use this guide as a reference to available instructions:

* http://www.intel80386.com/simd/mmx2-doc.html

Quote:
Original post by troll_coder
Is there any good literature on what kinds of problems these are intended to solve?

Check the CPU manuals on Intel's and AMD's websites.
Essentially, they're meant to accelerate certain kinds of floating-point math by allowing each instruction to operate on four values simultaneously.
This can be handy in vector/matrix math, for example: if you have a coordinate in your world and a vector you want to move by, you can add the two together in a single instruction, updating the x, y and z values at once. Matrix multiplications can also be done quite a bit faster, if you're careful.
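As a sketch of that coordinate example (my own illustration, not from the thread; the function name is hypothetical, and the 3D vectors are padded to four floats to fill the 128-bit register), using the SSE1 intrinsics header rather than raw assembly:

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cassert>

// Move a position by a velocity: one packed add (ADDPS) updates x, y, z
// (and a padding lane) simultaneously.
void move_position(const float pos[4], const float vel[4], float out[4]) {
    __m128 p = _mm_loadu_ps(pos);          // load four floats (unaligned)
    __m128 v = _mm_loadu_ps(vel);
    _mm_storeu_ps(out, _mm_add_ps(p, v));  // single SIMD add, then store
}
```

The same work done with plain floats would take three separate adds; here the whole vector moves in one instruction.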

Quote:

Do any compilers take advantage of these instructions when compiling c++ code

A few. GCC didn't last I checked. Visual Studio doesn't do it either. I believe Intel's C++ compiler can do it to some extent, but haven't personally used it.

Quote:

or does the programmer have to inject their own assembly that specifically calls these instructions?

Usually, yeah, that's what you have to do.
A better approach might be to use compiler intrinsics, however. These are special identifiers that get compiled to specific SSE instructions, but they have a few advantages:
- They're written directly in your C++ code (asm instructions have to be in special asm blocks, and aren't allowed at all in 64-bit C/C++ code, at least in VC++)
- Because the compiler understands these intrinsics, and they're interleaved with the rest of your code, it *might* be able to optimize them better than if you'd used asm blocks.
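As an illustration of the point (my own sketch): intrinsics sit directly in C++ source, so the compiler can inline and schedule them like any other expression. Here is a 4-float dot product; the shuffle-based horizontal sum is a common idiom because SSE1 has no horizontal-add instruction:

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cassert>

// Dot product of two 4-float vectors, written with intrinsics instead of
// an asm block. MULPS multiplies all four lanes at once; the shuffles fold
// the four partial products down to a single sum in lane 0.
float dot4(const float a[4], const float b[4]) {
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // [p1, p0, p3, p2] + [p0, p1, p2, p3] = pairwise sums
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(prod, shuf);
    // Swap the pair halves and add again: every lane holds the full sum
    shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
    sums = _mm_add_ps(sums, shuf);
    return _mm_cvtss_f32(sums);  // extract lane 0
}
```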

And as said above, you really have to know what you're doing if you want to outperform your compiler. It's not that the compiler generates perfect code (it doesn't, far from it; as mentioned, it typically doesn't emit SIMD instructions at all), but simply that there are a lot of pitfalls in assembly programming, and a lot of ways to ruin performance when using SSE.

There has been some confusion on this thread, I wanted to clarify the compiler support question:

Every single modern C++ compiler for x86 takes advantage of SSE intrinsics (GCC, Microsoft's, and Intel's); the difference is to what extent the instruction set is used. Intel has excellent support for SSE intrinsics and is probably the best compiler in that regard.

As a developer you have three options, with increasing levels of difficulty:
1. Let the compiler use the instruction set automatically where it thinks it makes sense (this is usually a compiler switch that you enable)
2. Use compiler intrinsics for hand-coding SSE: you get a set of function calls that let you write C++-friendly low-level SSE code (some compilers have issues reordering and optimizing SSE intrinsics)
3. Write the SSE code yourself using inline assembly (you do the low-level stuff directly in assembly)

SIMD instruction sets such as MMX, 3Dnow!, SSE, Altivec (PowerPC), and NEON (ARM) are suited to tasks which apply the same series of instructions over large amounts of data. Good examples include software vertex processing, where you apply the same 4x4 matrix transform to a (generally large) number of vertices, and DSP algorithms, among others.

Most compilers either don't support auto-vectorization (targeting SIMD instruction sets from normal, high-level code) or aren't very good at it. Intel's compiler is the leader among commercial compilers in auto-vectorization. GCC doesn't currently support auto-vectorization, AFAIK, but GCC 4 laid down a lot of foundation for advanced optimizations like auto-vectorization, so it may be coming in the future. I don't believe Microsoft's compiler does auto-vectorization, but I could be wrong.

All of these compilers support intrinsics, which basically replace inline SSE assembly with higher-level constructs that look a lot like functions. For example, the intrinsic to add two 4-float vectors is __m128 _mm_add_ps(__m128 a, __m128 b). Of course, they don't act like functions on the back end -- there, they act as a hint to the compiler that there is an appropriate SIMD instruction which should be used *if* it helps.

Generally, when deciding if assembly is worthwhile, you must know a few things:
- Know where the program is slower than required and where the bottleneck is.
- Know that the bottleneck is, or can be, suited to the SIMD approach.
- Know that you can write better SIMD code than what the compiler generates.

Quote:
Original post by ldeej
There has been some confusion on this thread, I wanted to clarify the compiler support question:

Every single modern C++ compiler for x86 takes advantage of SSE intrinsics (GCC, Microsoft's, and Intel's); the difference is to what extent the instruction set is used. Intel has excellent support for SSE intrinsics and is probably the best compiler in that regard.

As a developer you have three options, with increasing levels of difficulty:
1. Let the compiler use the instruction set automatically where it thinks it makes sense (this is usually a compiler switch that you enable)
2. Use compiler intrinsics for hand-coding SSE: you get a set of function calls that let you write C++-friendly low-level SSE code (some compilers have issues reordering and optimizing SSE intrinsics)
3. Write the SSE code yourself using inline assembly (you do the low-level stuff directly in assembly)


I don't have a link to the article, but a year or two ago there was an article about how Intel's compiler checks the processor's SSE flag and whether or not the processor is an Intel processor, so that it won't run SSE code on AMD systems even if they support SSE.

Quote:
Original post by ldeej
Every single modern C++ compiler for x86 takes advantage of SSE intrinsics (GCC, Microsoft's, and Intel's); the difference is to what extent the instruction set is used. Intel has excellent support for SSE intrinsics and is probably the best compiler in that regard.

Not exactly.
Every one of them allows you to use these intrinsics to hand-code SIMD instructions.
But GCC and VC++ cannot by themselves vectorize your code. They will *never* take your plain C++ code and transform it into SIMD.
(They will, if you enable the right compiler setting, use the scalar SSE instructions instead of the x87 floating-point ones, but then it's still only operating on a single value at a time: not vectorized, not SIMD.)

That is all the "SSE" switch does on VC++ and GCC.
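The scalar/packed distinction can be seen directly in the intrinsics (my own sketch; the function name is made up): ADDSS, the scalar form the switch enables, touches only lane 0, while ADDPS, the packed SIMD form, adds all four lanes.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cassert>

// Contrast scalar SSE (ADDSS) with packed SSE (ADDPS) on the same inputs.
void scalar_vs_packed(const float a[4], const float b[4],
                      float scalar_out[4], float packed_out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    // ADDSS: lane 0 = a0 + b0; lanes 1-3 are copied unchanged from a
    _mm_storeu_ps(scalar_out, _mm_add_ss(va, vb));
    // ADDPS: all four lanes added -- this is the SIMD version
    _mm_storeu_ps(packed_out, _mm_add_ps(va, vb));
}
```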

Quote:
Original post by Spoonbender
Quote:

Do any compilers take advantage of these instructions when compiling c++ code

A few. GCC didn't last I checked. Visual Studio doesn't do it either. I believe Intel's C++ compiler can do it to some extent, but haven't personally used it.

By default they do not use them.

For Visual C++, look up /arch:SSE and /arch:SSE2 to enable some scalar floating-point stuff. SSE2 also allows some 64-bit integer operations.
For Intel's optimizing compiler, look up the /Qx set of options. It is much more aggressive than Visual Studio, and can vectorize some loops.

[caution] If you generate instructions for a chipset, and the CPU running the program doesn't support them, expect Bad Things to happen! This is why they are disabled by default. It is also why Visual Studio through 2003 targeted the 486 chipset: it is better to slow people down a little in exchange for a lot of stability.


The Microsoft Visual C++ compiler doesn't generate code for SSE3 or SSE4.

The Intel compiler has some additional options to generate multiple code paths for all their different chips (See /Qa), Visual C++ does not.

Quote:
Original post by frob
For Visual C++, look up /arch:SSE and /arch:SSE2 to enable some scalar floating point stuff.

Yeah, that's what I said. [grin]
With that flag, they switch to using the SSE scalar instructions. But they still don't use the SIMD versions at all. (Unlike Intel's compiler, which can vectorize *some* things)

I misspoke. I said "every single modern C++ compiler for X86 takes advantage of sse intrinsics"; instead it should be "every single modern C++ compiler for X86 takes advantage of the SSE instruction set".

The original question was:

Quote:
Original post by troll_coder
Do any compilers take advantage of these instructions when compiling c++ code


This is your answer:
Quote:
Original post by Spoonbender
A few. GCC didn't last I checked. Visual Studio doesn't do it either. I believe Intel's C++ compiler can do it to some extent, but haven't personally used it.


Your answer is wrong. The question is not whether compilers take advantage of auto-vectorization, or whether they do the latest and greatest optimizations with SSE; the question is whether compilers take advantage of certain instruction sets, and my answer is that in the case of SSE almost all modern compilers do.
So you are answering the wrong question.

Modern compilers still do analysis on the cost of running regular FP instructions vs. SSE instructions vs. mixed instructions, and will choose SSE when the cost analysis indicates that is the best option. Somebody mentioned that compilers support different versions of SSE, and that the optimizer sometimes does not do as good a job when SSE is enabled, which is true, but the compiler certainly takes advantage of the instruction set, which was the original question.

Quote:
Original post by ldeej
Your answer is wrong. The question is not whether compilers take advantage of auto-vectorization, or whether they do the latest and greatest optimizations with SSE; the question is whether compilers take advantage of certain instruction sets, and my answer is that in the case of SSE almost all modern compilers do.
So you are answering the wrong question.

Perhaps. But all instructions are CPU specific, so the question is meaningless if you interpret it like that. [wink]
Yes, the compiler takes advantage of CPU-specific instructions when generating code, because CPU-specific instructions are all the CPU understands...

But yes, I chose to interpret the question as asking about SIMD instructions specifically.
You're right though, the compiler is fully capable of using scalar SSE instructions by itself. If that was what the OP meant, I stand corrected [grin]

Even the best auto-vectorising compiler will not beat hand-tuned assembly. It is common practise to use the code generated by a compiler as a reference and then hand-tune it.

Intrinsics are used because they are cross-platform and cross-compiler. When you use an assembly block, the compiler will not rearrange the code in that block to optimise it.

SIMD instruction sets are for optimising highly parallel algorithms. Anything to do with video, audio, and other forms of multimedia may benefit from these optimisations (hence MMX: MultiMedia eXtensions).
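A sketch of that multimedia pattern (hypothetical function; assumes SSE1 and, for brevity, a sample count that is a multiple of four): applying a gain to an audio buffer, four samples per instruction.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cassert>

// The kind of loop SIMD sets were built for: the same operation (a gain
// multiply) applied uniformly across a large buffer of samples.
void apply_gain(float* samples, int count, float gain) {
    __m128 g = _mm_set1_ps(gain);  // broadcast the gain to all four lanes
    for (int i = 0; i < count; i += 4) {
        __m128 s = _mm_loadu_ps(samples + i);
        _mm_storeu_ps(samples + i, _mm_mul_ps(s, g));  // 4 samples at once
    }
}
```

A production version would also handle buffers whose length is not a multiple of four, e.g. with a scalar tail loop.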

Intel publish some good books on the topic, in particular I recommend The Software Vectorization Handbook. http://www.intel.com/intelpress/programming.htm?iid=prodmap_tb+prog

Quote:
Original post by TheGilb
Even the best auto-vectorising whatnot will not beat hand tuned assembly. It is common practise to use the code generated by a compiler as a reference and then hand tune.


My suggestion here would be to let the compiler do the work, or use intrinsics for a problem domain that is well constrained and that you understand well.

Hand-tuning assembly (especially SSE) requires a lot of knowledge, and not just of SSE (that is the easy part): it requires knowledge of how the CPU handles the instructions (pipelines, instruction fetches, memory latency, etc.). Most SSE code generated by the compiler is not very good (though it usually beats the default floating-point code), and hand-tuning it usually requires taking a completely different approach. Coming up with a mediocre SSE implementation of something is easy; an optimal implementation is much harder.

For example, I have seen several SSE implementations of matrix multiply, some very concise and easy to understand, but the ones we have measured to perform very well are usually incredibly long and obscure, and were written by experts in the field of optimization.
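For reference, a concise version of the easy-to-understand kind might look like this (my own sketch, deliberately not hand-tuned): a 4x4 matrix times a 4-vector, with the matrix stored column-major as four SSE registers. The storage layout and function name are assumptions for the example.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cassert>

// result = M * v, where cols holds the matrix columns contiguously.
// Each column is scaled by one component of v, then the columns are summed.
void mat4_mul_vec4(const float cols[16], const float v[4], float out[4]) {
    __m128 c0 = _mm_loadu_ps(cols + 0);
    __m128 c1 = _mm_loadu_ps(cols + 4);
    __m128 c2 = _mm_loadu_ps(cols + 8);
    __m128 c3 = _mm_loadu_ps(cols + 12);
    __m128 r  = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, r);
}
```

The heavily tuned variants mentioned above differ mainly in aligned loads, register scheduling, and batching many vectors per call; the arithmetic is the same.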

Either way, make sure that you have a testbed to measure performance changes, using SSE does not guarantee that your code is going to run faster.

