I've been doing a lot of work with SSE-related instructions lately, and finally got fed up with the myriad of move instructions available to load and store data to and from the XMM registers. The differences between some are so subtle and poorly documented that it can be hard to tell that there is even any difference at all, which makes choosing the right one for the job almost impossible. So I sat down and poured through the Intel instructions references and optimization manuals, as well as several supplemental sources on the internet, in order to build up some notes on the differences. I figured I might as well document them all here for everyone to use.
The name of the game with picking any instruction is performance, and you always want to choose the one that will get the job done in the least time using the least amount of space. Thus the recommendations here are geared towards these two goals. Each instruction has several bits of information associated with it that we must take into account:
- The type of the data it works with, be it integers, single precision floating point, or double precision floating point.
- The size of the data it moves. This can range from 32-bits to 128-bits.
- Whether it deals with unaligned memory or can be used with aligned memory only.
- If the move only affects a portion of a register, what happens to the remaining bits in that register after the instruction finishes.
- Any other special side-effects that the instruction may have.
Let's start off with the 128-bit moves. These move an entire XMM register's worth of data at a time, making them conceptually simpler. There are seven instructions in this category:
All of these instructions move 128-bits worth of data. Breaking it down further, the first three instructions work with aligned data, whereas the next three are the unaligned versions of the first (we'll talk about the last one in a minute, since it's a bit special). The aligned versions offer better performance, but if you haven't ensured that your data is allocated on a 16-byte boundary, you'll have to use one of the unaligned instructions in order to load. When doing register-to-register (reg-reg) moves, it's best to use the aligned versions.
Each of the three instructions in each category (aligned and unaligned) operate on a different data type. Those with a 'd' suffix work on doubles; those with an 's' work on singles, and movdqa works on double quadwords (integers). This is usually a source of confusion for people, myself included, since regardless of the data type, 128-bits are still being moved, and a move shouldn't care about the raw memory it's moving. The differences here are subtle and easily overlooked, and have to do with the way the superscalar execution engine is structured internally in the microarchitecture. There are several "stacks" internally that can execute various instructions on one of several execution units. In order to better split up instructions to increase parallelism, each move instruction annotates the XMM register with an invisible flag to indicate the type of the data it holds. If you use it for something other than its intended type it will still operate as expected; however, many architectures will experience an extra cycle or two of latency due to the bypass delay of forwarding the value to the proper port.
So for the most part, you should try to use the move instruction that corresponds with the operations you are going to use on those registers. However, there is an additional complication. Loads and stores to and from memory execute on a separate port from the integer and floating point units; thus instructions that load from memory into a register or store from a register into memory will experience the same delay regardless of the data type you attach to the move. Thus in this case, movaps, movapd, and movdqa will have the same delay no matter what data you use. Since movaps (and movups) is encoded in binary form with one less byte than the other two, it makes sense to use it for all reg-mem moves, regardless of the data type.
Finally, there is the lddqu instruction which we have neglected to consider. This is a specialty instruction that handles unaligned loads for any data type, specifically designed to avoid cache-line splits. It operates by finding the closest aligned address before the one we want to load, and then loading the entire 32-byte block and indexing to get the 128-bits we addressed. This can be faster than normal unaligned loads, but doing the load in this way makes stores back to the same address much slower, so if store-to-load forwarding is expected, use one of the standard unaligned loads.
In addition to these instructions, there are four extra 128-bit moves that require mentioning:
These are the non-temporal loads and stores, so named since they hint to the processor that they are one-off in the current block of code and should not require bringing the associated memory into the cache. Thus, you should only use these when you're sure that you won't be doing more than one read or write into the given cache line. The first instruction, movntdqa, is the only non-temporal load, so it's what you have to use even when loading floating point data. The other three are data-specific stores from an XMM register into memory, one each for integers, doubles, and singles. All of these instructions only operate on aligned addresses; there are no unaligned non-temporal moves
Next we come to the moves that operate on 32 and 64-bits of data, which is less than the full size of the XMM registers. Thus this introduces a new wrinkle; namely, what happens to the remaining bits in the register during the move.
movd / movq
movss / movsd
movlps / movlpd
movhps / movhpd
The first instruction in each pair listed above operates on singles (ie. 32 bits of data) and the second works on doubles, which is 64 bits of data. The first set, comprising the first four instructions, generally fill the extra bits in the XMM register with zero. The second set does not; it leaves them as they are. I'll discuss in a moment why this is not necessarily a good thing. movd moves 32 bits between memory and a register. It cannot, however, move between two XMM registers, which is an oddity that the rest of the instructions listed here do not share. movq will always zero extend during any move, including between memory and between registers. movd and movq are meant for integer data.
movss and movsd are meant for floating point data, and only perform zero extension when moving between memory and a register. When used to move between two XMM registers, they do NOT fill the remaining space with zeroes, which is confusing. movlps and movlpd generally perform the same operation, moving 32 and 64 bits of data respectively. They do not, however, perform a zero extension in any case. movhps and movhpd are slightly different from the others in that they move their data to and from the high qword of the XMM register instead of the low qword like the others. They don't do zero extension either.
Since the second set of instructions don't do zero extension, you might think that they would be slightly faster than ones that have to do the extra filling of zeroes. However, these instructions can introduce a false dependence on previous instructions, since the processor doesn't know whether you intended to use the extra data you didn't end up erasing. During out-of-order execution, this can cause stalls in the pipeline while the move instruction waits for any previous instructions that have to write to that register. If you didn't actually need this dependence, you've unnecessarily introduced a slowdown into your application.
[size="2"]There are several other instructions that have special side-effects during the move. Generally these are easier to see the usage, since there is only one for a given operation.
movddup - Moves 64 bits, and then duplicates it into the upper half of the register.
movdq2q - Moves an XMM register into an old legacy MMX register, which requires a transition of the x87 FP stack.
movq2dq - Same as above, except in the opposite direction.
movhlps / movlhps - Moves two 32-bit floats from high-to-low or low-to-high within two XMM registers. The other qwords are unaffected.
movsldup - Moves 2 32-bit floats from the low dwords of two XMM registers into the low dwords of a single destination XMM register, and then duplicates them into the upper dword of each half. Kind of confusing to describe, but the diagram in the documentation makes it easy to visualize if you want to use it.
movmskps / movmskpd - Moves the sign bits from the given floats or doubles into a standard integer register.
maskmovdqu - Selectively moves bytes from a register into a memory location using a specified byte mask. This is a non-temporal instruction and can be quite slow, so avoid using it when another instruction will suffice.
There are a lot of SSE move instructions, as you can see from the above. It annoys me when I don't understand something, and whenever I needed a move I would get bogged down trying to decide which was best. Hopefully these notes will help others make a more informed decision, and shed light on some of the more subtle differences that are hard to find in the documentation.
Besides various forum entries and random webpages found through judicious Googling, I took a lot of information from:
- [size="2"][size="4"][size="2"]Intel Optimization Manual
- [size="2"][size="4"][size="2"]Intel Instruction Reference
- [size="2"][size="4"][size="2"]Agner Fog's Optimization Manual