# Accessing SSE (__m128) vector's fields


## Recommended Posts

Hello! Just for fun, I'm trying to write a 4-float vector class that uses SSE instructions. I've seen this sort of code in many places:

```cpp
struct Vec4
{
    union {
        __m128 v;
        struct { float x, y, z, w; };
    };
};
```
This code compiles just fine, but one comment in Microsoft's documentation raises some questions:
Quote:
 You should not access the __m128 fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.
To me it looks like using this construct is equivalent to accessing each field directly. What do I risk if I keep it anyway? Illegal or faulty code? Lower performance (if this library were used in a real project)?

What I believe is that every time I access a field directly, some extra load/store operations will occur, which will hurt performance a bit. On the other hand, if I *do* need to read or write a field between two calculations without accessing it directly, I will have to do the store/load myself.

What do you think/know about this? Thank you for reading.

JA

##### Share on other sites
Hi janta,
You're pretty much right on. You don't have to worry about incorrect code being generated, but accessing a field will move your xmm register back out to the stack, load the component into a scalar register to do some math, store it back to the stack, and finally reload the xmm register with the new value. If you care at all, I'll explain why this is worse than it sounds.

Now, the big problem is that store-forwarding doesn't work for you in this case. Store-forwarding can apply when you store data from a register to memory and immediately load it back. In general, stores don't go directly to L1, but instead get pushed out to a small (generally cache-line-sized) store queue. The reason is that you typically have a bunch of stores in a row, and it's better for the memory architecture to flush that queue all at once instead of a few bytes per instruction.

With store forwarding, special logic detects the reload case and allows the processor to grab the value on its way to the store queue instead of going back to memory. Unfortunately, if the load overlaps the store but doesn't use the same alignment, it breaks the store-forward. It's not clear to me if store-forwards even work on the xmm registers in the first place!

The end result is that you stall first waiting for your store queue to get flushed back to L1, then you get hit again by the same issue when reloading that xmm register. As for the penalty... I'm not sure there's an official number, but in cases where I've hit it, it was something like 20 cycles.

##### Share on other sites
I'll amend what I wrote by mentioning that the compiler could use the *ss form of the instructions to carry out the floating-point math on your vector component. I'm pretty sure it doesn't, but... it could generate a shufps or movhlps to move your component into the preferred SSE slot, manipulate it via addss, mulss, etc., and shufps it back into the original vector. Everything stays in SSE registers, so none of the penalties I mentioned before apply, and it would be much faster.

If I see a compiler do this... maybe that day I will stop writing assembly :)

##### Share on other sites
That is useful information, thanks a lot. When I get the time I will just try it and see what asm code gets generated, then I will post my results here.

JA