# SSE2 - generate 0xFFFFFFFF.... in 1 instruction? [C++]


## Recommended Posts

I want to generate a 128-bit vector of all 1 bits.
```cpp
__m128 a = _mm_setzero_ps();
return _mm_cmpeq_ps( a, a );
```


This is as close as I can get (2 instructions). I can't leave 'a' uninitialized (at least Visual Studio 2005 won't let me; the disassembly shows a value being loaded into it from memory), and inline assembly isn't boiling down to one vector instruction. :(
```cpp
__m128 a;
__asm {
	movaps		xmmword ptr [a], xmm7
}
return _mm_cmpeq_ps( a, a );
```


Intrinsics-only version is preferable anyhow, since it's not dependent on xmm7. Any ideas? (SSE2)

##### Share on other sites
This got close, but it still ended up writing to memory and then fetching again (the register keyword didn't help). Is there a better way to transfer from inline asm to C++ variables?

```cpp
#pragma warning(disable:4700)
__forceinline __m128 getOnes()
{
	register __m128 a;
	__asm {
		cmpeqps		xmm0, xmm0
		movaps		xmmword ptr [a], xmm0
	}
	return a;
}
```

(Note that the inline asm "cmpeqps xmm0, xmm0" alone would work perfectly, since SSE returns vectors in register xmm0... the only problem is I want the function inlined.)

[Edited by - discman1028 on March 14, 2008 12:53:44 PM]

##### Share on other sites
I'm almost afraid to ask, but why do you want to do this?

##### Share on other sites
There are a bunch of reasons you'd want a mask of FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF for vector operations. Page 5-19 of the IA-32 Software Developer's Manual shows how to generate some useful constants using only SSE instructions, but it's assembly code.

##### Share on other sites
Declare an unsigned int array[4] = {-1,-1,-1,-1}; or similar and load it with movaps.

##### Share on other sites
Quote:
Original post by outRider
Declare an unsigned int array[4] = {-1,-1,-1,-1}; or similar and load it with movaps.

That's what I've been doing -- but the compare approach is faster, as it doesn't have to touch memory.

##### Share on other sites
My question is more why are you obsessed with getting this operation down to a single instruction?

##### Share on other sites
Quote:
Original post by discman1028
I want to generate a 128-bit vector of all 1 bits. __m128 a = _mm_setzero_ps(); return _mm_cmpeq_ps( a, a ); Any ideas? (SSE2)
What do you need the _mm_setzero_ps() for?

I mean, you compare an xmm register with itself; that's always true regardless of the contents, isn't it?
This would not be the case with normal FPU math, which might compare an 80/96-bit register with a 32-bit memory location (yielding the well-known floating-point comparison dread).
However, the floats inside your xmm register are all exactly 32 bits, so there should be no issue?

##### Share on other sites
Tried that, and it seems to work ok (which of course proves nothing...).

```cpp
#include <xmmintrin.h>
#include <stdio.h>

int main()
{
	int f[4];
	__m128 a;
	__m128 b;
	__m128 c;
	a = _mm_cmpeq_ps( a, a );
	b = _mm_cmpeq_ps( b, b );
	c = _mm_cmpeq_ps( c, c );
	_mm_storeu_ps((float*) f, a);
	printf("%x %x %x %x\n", f[0], f[1], f[2], f[3]);
	_mm_storeu_ps((float*) f, b);
	printf("%x %x %x %x\n", f[0], f[1], f[2], f[3]);
	_mm_storeu_ps((float*) f, c);
	printf("%x %x %x %x\n", f[0], f[1], f[2], f[3]);
	return 0;
}
```
This shows ffffffff ffffffff ffffffff ffffffff three times, as expected.

##### Share on other sites
I think what he means is that it still generates a load off the stack for a when it doesn't have to. The _mm_setzero_ps() should reduce to xorps xmm#,xmm# instead of a load.

##### Share on other sites
Quote:
Original post by SiCrane
My question is more why are you obsessed with getting this operation down to a single instruction?

One instruction is better than two. :) I'm not obsessed. :) At least not ATM.

Quote:
Original post by samoth
I mean, you compare an xmm register with itself, this is always true, regardless of the contents, isn't it?

Yeah -- my problem is that I can't get it to compile. Read the third sentence of my first post.

Quote:
Original post by outRider
I think what he means is that it still generates a load off the stack for a when it doesn't have to.

Exactly... so what I currently have is the solution proposed at the top of my first post. It produces an 'xor' and a 'cmpeq', which is better than a 'movaps' and a 'cmpeq'. However, a lone 'cmpeq' would be best.

##### Share on other sites
Can't you just OR it with the NOT of itself? A or (not A) = 1. It requires another register, but it's still pretty cheap.

##### Share on other sites
Quote:
Original post by samoth
However, the floats inside your xmm register are all exactly 32 bits, so there should be no issue?

Except for one cute little case:
NaN != NaN

[grin]

Quote:
 One instruction is better than two. :) I'm not obsessed. :) At least not ATM.

If you can't come up with a rational reason why it matters, I'd say yes, you are obsessed.
Does the additional instruction actually matter? If it's for performance reasons, I assume you have profiled it.

##### Share on other sites
Spoonbender: If performance didn't matter, why would he be using SSE? ;)

NerdInHisShoe: there is an instruction for and-not, but not or-not. As such the dependency chain is not shorter than XOR/CMP.

Unfortunately I have no better solution to offer; this seems to be a regrettable limitation of the intrinsics (an _mm_setones_ps would be needed).

##### Share on other sites
Quote:
Original post by Spoonbender
Except for one cute little case: NaN != NaN [grin]
Dang, you got me there! [grin]
But hey, you could use PCMPEQ[B|W|D] instead, these work bitwise and don't care about NaNs.

##### Share on other sites
Quote:
 But hey, you could use PCMPEQ[B|W|D] instead, these work bitwise and don't care about NaNs.

That would work but incurs a "reformatting delay". XMM registers include a few hidden status bits, which need to be regenerated when casting from int (__m128i) to float/double (__m128). For float data, the _ps instructions should therefore be used whenever possible; additional bonus: their encodings are shorter than those of SSE2 integer instructions.

##### Share on other sites
Quote:
 But hey, you could use PCMPEQ[B|W|D] instead, these work bitwise and don't care about NaNs.

Yeah, I thought of that soon after -- comparing ints would be best, even though it's SSE2+ only. SSE2 is the baseline of the SSE series anyway, IMHO.

Quote:
Original post by Jan Wassenberg
Quote:
 But hey, you could use PCMPEQ[B|W|D] instead, these work bitwise and don't care about NaNs.

That would work but incurs a "reformatting delay". XMM registers include a few hidden status bits, which need to be regenerated when casting from int (__m128i) to float/double (__m128). For float data, the _ps instructions should therefore be used whenever possible; additional bonus: their encodings are shorter than those of SSE2 integer instructions.

I did not know about either of those two facts!

I reinterpret_cast from __m128 <--> __m128i all the time, and the disassembly looks perfect, so I assumed I was golden. Can you give more details on where this "reformatting delay" shows up in HERE or elsewhere? I would be very interested.

And do the shorter (SSE1?) encodings translate to less latency?

Quote:
Original post by Spoonbender
If you can't come up with a rational reason why it matters, I'd say yes, you are obsessed. Does the additional instruction actually matter? If it's for performance reasons, I assume you have profiled it.

Fine-grained optimizations such as these won't show up in a profile. Don't fall into the "premature optimization is the root of all evil"-preaching crowd.

[Edited by - discman1028 on March 18, 2008 5:23:48 PM]

##### Share on other sites
Quote:
Original post by discman1028
I reinterpret_cast from __m128 <--> __m128i all the time, and the disassembly looks perfect, so I assumed I was golden. Can you give more details on where this "reformatting delay" shows up in HERE or elsewhere?

Found it in Appendix E of http://cr.yp.to/bib/2004/-amd-25112.pdf, but still not a ton of detail. I know that moving between the FP stack, the general-purpose registers, and the XMM registers is not so good. But within the XMM registers... I thought the __m128 and __m128i types were just for type safety.

I know it's not always bad at least. For example:

I went from _mm_shuffle_ps() for duplicating the float components of a __m128 to an _mm_shuffle_epi32() operation (SSE2) on a __m128i, reinterpret_casting from and to __m128 before and after. 'pshufd' (the single instruction emitted by _mm_shuffle_epi32()) requires fewer registers as operands, so the compiler proved to have better register allocation, and profiles show that using that integer shuffle for all my single-vector float shuffles was about 10-20% faster.

[Edited by - discman1028 on March 18, 2008 5:55:23 PM]

##### Share on other sites
Quote:
Original post by discman1028
One instruction is better than two. :) I'm not obsessed. :) At least not ATM.

Not always - two fast instructions are better than one slow instruction. CMPPS isn't particularly fast; the logical ops are quicker.

##### Share on other sites
Quote:
Original post by Jerax
Quote:
Original post by discman1028
One instruction is better than two. :) I'm not obsessed. :) At least not ATM.

Not always - two fast instructions are better than one slow instruction. CMPPS isn't particularly fast; the logical ops are quicker.

(Of course, I meant one instruction of the two: e.g. xor then cmpps is slower than cmpps alone.)

[Edited by - discman1028 on March 20, 2008 3:36:55 PM]

##### Share on other sites
Quote:

Sorry about the "delay" (heh), was without internet access for a week whilst helping someone move house.

Quote:
 I reinterpret_cast from __m128 <--> __m128i all the time, and the disassembly looks perfect, so I assumed I was golden. Can you give more details on where this "reformatting delay" shows up in HERE or elsewhere? I would be very interested.

Yes, this is a microarchitectural issue and thus does not show up on the architectural level (i.e. the instructions).
The previously mentioned AMD manual is unfortunately rather thin on details, but you can read more about it in Agner Fog's microarchitecture and "Optimizing Assembly" manuals. (They contradict each other slightly as to which processors exhibit this behavior.)

Quote:
 And, does the shorter (SSE1?) encodings translate to less latency?

No, but it is relevant for the fetch/decode stage. Absent other bottlenecks, we can now rely on the execution units to handle 3-4 instructions per cycle. However, instruction fetch bandwidth has stagnated at 16 bytes/cycle, and the more complex IA-32 and AMD64 instructions can easily reach 8 bytes, so this limit can definitely be encountered. To achieve peak throughput, it is once again relevant to keep encodings small.

Quote:
 I went from _mm_shuffle_ps() for duplicating the float-components of a __m128, to a _mm_shuffle_epi32() operation (SSE2) on a __m128i. I had to reinterpret_cast from and to __m128 before and after. 'pshufd' (the single instruction emitted by _mm_shuffle_epi32()) requires less registers as operands, and as such, the compiler proved to have better register allocation, and profiles show that using that integer shuffle for all my single-vector float shuffles was about 10-20% faster.

Yes, it is a bit strange that SHUFPS and PSHUFD are so different in their operation. However, if using the 'wrong' instruction brings more gains than the reformatting delay, then go for it :)

##### Share on other sites
Ahhh, Agner's stuff rules. ;)

Quote:

And there it all is in bold, including the shuffle suggestion. I should save myself some time and read these cover-to-cover. :)

Thanks Jan.

##### Share on other sites
As an aside: it would be awesome if everybody who is interested in halfway-decent SSE compiler optimization went here and left a comment, or at least rated the topic "important": linky

##### Share on other sites
Hopefully this will be an intrinsic soon. :)