issch

SSE intrinsics: Extracting floats from __m128

Hi,

I am mucking about with some SSE code at the moment and I'm a little confused about unpacking the 4 packed floats from an __m128.

Originally, I used the union trick (or one of the variants that use structs instead of an array of floats):

union {
    __m128 m128;
    float f[4];
};




But then I read that this could make the generated code more complex than it needs to be, because of potential pointer aliasing or similar issues.

If I only need to extract a single scalar value, the official method appears to be to use _mm_store_ss:

float result;
_mm_store_ss(&result, m128);
return result;



instead of:

return f[0];




This works fine for results of calculations (especially if they use horizontal operations, i.e., anywhere I can be sure the result will be in the correct element), but if I want to access any other element of the __m128 vector, I need to shuffle first, e.g.:


float result;
_mm_store_ss(&result, _mm_shuffle_ps(m128, m128, _MM_SHUFFLE(2,2,2,2)));
return result;




which also works fine, but seems overly complicated (and adds an extra instruction).

So my question is: which way is better: casting, the union trick, or movss + shufps? Or is there another, better way?

Thanks!

Out of curiosity, how often are you accessing one element which isn't the first?

You can always store out all 4 float values and then access them however you please. Not sure which is preferred, but generally I've found I don't really need a single number all that often.
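A minimal sketch of the "store all four, then index" approach, assuming the destination is an ordinary float array with no particular alignment (hence the unaligned store). The helper names `store4` and `third_lane_via_store` are made up for illustration.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_storeu_ps

// Hypothetical helper: copy all four lanes of v into out[0..3].
// _mm_storeu_ps places no alignment requirement on the destination.
inline void store4(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);
}

// Hypothetical example: read lane 2 after storing the whole vector.
inline float third_lane_via_store(__m128 v) {
    float tmp[4];
    store4(v, tmp);
    return tmp[2];
}
```

Whether the temporary array actually touches memory depends on the compiler and optimization settings.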

You're right. It is not that common an operation, and storing all four at once will work perfectly well for most cases. I mean, if the code reads and writes single elements a lot, then it's probably gaining nothing from SSE.

Having said that, I'd still like to know what the correct way of accessing a single element is.

What do you want to do with the element?
If you want to do more operations on that one float, the best way might be to just shuffle it into an __m128 and use the _ss instructions to operate on it. If you enable SSE2 or compile for x64 in VC++, for example, all floating point operations are done like that anyway, and any method you use to store a float will probably be optimized into keeping it in an xmm register. If you actually need to get it into a normal float, then I would just store all 4 and take the one you need from that. I don't know if there is any one correct way.
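As a rough sketch of the "shuffle, then keep working with _ss instructions" idea: the hypothetical helper below broadcasts lane 2 of `v` and multiplies it by the low lane of `s` with `_mm_mul_ss`, so the value never has to become a plain float until the caller asks for it. The function name and the choice of lane 2 are assumptions for illustration only.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_mul_ss, _mm_cvtss_f32

// Hypothetical helper: the low lane of the result is v[2] * s[0].
// _mm_mul_ss only multiplies the low lanes; the upper three lanes of the
// result are copied from the first operand (here the broadcast of v[2]).
inline __m128 lane2_times_low(__m128 v, __m128 s) {
    __m128 lane2 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_mul_ss(lane2, s);
}
```

If a plain float is needed at the very end, `_mm_cvtss_f32(result)` returns the low lane of the vector.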

Quote:
Original post by issch
Having said that, I'd still like to know what the correct way of accessing a single element is.


The *correct* way is to *NOT* access individual elements. Doing so slows the code down and is best avoided as much as possible. If you actually need to access the elements as floats, then however you do it, it will incur a performance penalty. FWIW, shuffling and _mm_store_ss is the best bet.
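The shuffle plus _mm_store_ss combination can be wrapped once so the call sites stay readable. This is only a sketch: `extract_lane` is a made-up name, and the lane is a template parameter because `_mm_shuffle_ps` needs its mask as a compile-time constant.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_store_ss, _MM_SHUFFLE

// Hypothetical helper: return lane `Lane` (0..3) of v as a plain float.
template <int Lane>
inline float extract_lane(__m128 v) {
    static_assert(Lane >= 0 && Lane < 4, "lane index must be 0..3");
    float result;
    // Broadcast the requested lane to every position, then store the low lane.
    _mm_store_ss(&result,
                 _mm_shuffle_ps(v, v, _MM_SHUFFLE(Lane, Lane, Lane, Lane)));
    return result;
}
```

For example, `extract_lane<2>(m128)` corresponds to the `_MM_SHUFFLE(2,2,2,2)` snippet from the original question.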

Thanks for the replies! That's cleared it up.

I don't expect to need to access the elements individually very often. Outside of my SSE code, I would usually want to store all 4 floats at once, e.g., to pass to OpenGL or something like that.

Why aren't you considering _mm_store_ps?

The compiler is likely to simply discard this instruction if everything fits together correctly, so a shuffle would be a costly operation (except on Core i7) when the access could be free thanks to compiler optimizations.
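A small sketch of the _mm_store_ps variant, assuming a C++11 compiler for `alignas`. Unlike the unaligned store, `_mm_store_ps` requires a 16-byte aligned destination, so the hypothetical `Vec4` wrapper below enforces that alignment.

```cpp
#include <xmmintrin.h>  // SSE: _mm_store_ps

// Hypothetical wrapper type: four floats with the 16-byte alignment
// that _mm_store_ps requires.
struct Vec4 {
    alignas(16) float f[4];
};

// Hypothetical helper: store all four lanes, then read any of them as floats.
inline Vec4 to_vec4(__m128 v) {
    Vec4 out;
    _mm_store_ps(out.f, v);
    return out;
}
```

Whether the store is actually elided is up to the optimizer, so it is worth checking the generated code in an optimized build.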

OK, depending on the optimization settings and the codebase, several things can happen to your code:

1) If the compiler actually stores the result to main memory (i.e. via _mm_store_ss) and immediately reads it back, it's going to cause a load-hit-store that will dwarf any performance concern the shuffle could raise. Basically, if this function isn't inlined, lives in a separate .cpp file from where it's called or is a DLL export function, and whole program optimization is disabled, this is what is going to happen. However, if the function is large, it probably had to save and restore a large number of SSE registers in its entry/exit code, so this might not be a problem at all.

2) The MSVC (and Intel) compilers understand these intrinsics quite well and can avoid actually storing variables to memory, especially if they are function-local variables. In this case you will see the shuffle followed by some math instead of a store and/or a load. Here the compiler is working in your favor with the shuffle, since it can figure out that it's safe to switch to the SSE scalar ops and work on your data directly.


Note: You have to look at optimized builds to even analyze what's going on, as the debug SSE code gen in MSVC is completely atrocious, probably to make watch variables work more cleanly.
