SSE2 - generate 0xFFFFFFFF.... in 1 instruction? [C++]


I want to generate a 128-bit vector of all 1 bits.
 __m128 a = _mm_setzero_ps();
 return _mm_cmpeq_ps( a, a );

This is as close as I can get (2 instructions). I can't leave 'a' uninitialized (at least Visual Studio 2005 won't let me; the disassembly shows a value being loaded into it from memory), and inline assembly isn't boiling down to one vector instruction. :(
 __m128 a;
 __asm {
 	movaps	xmmword ptr [a], xmm7	; "initialize" a from whatever happens to be in xmm7
 }
 return _mm_cmpeq_ps( a, a );

An intrinsics-only version is preferable anyhow, since it doesn't depend on xmm7. Any ideas? (SSE2)

This got close, but it still ended up writing to memory and then fetching again (the register keyword didn't help). Is there a better way to transfer from inline asm to C++ variables?


#pragma warning(disable:4700)	// C4700: uninitialized local variable used

__forceinline __m128 getOnes()
{
	register __m128 a;
	__asm {
		cmpeqps	xmm0, xmm0				; xmm0 == xmm0 -> all ones
		movaps	xmmword ptr [a], xmm0	; spill to 'a' so C++ can return it
	}
	return a;
}

(Note that the inline asm "cmpeqps xmm0, xmm0" alone would work perfectly, since the calling convention returns __m128 values in xmm0... the only problem is I want the function inlined.)

I'm almost afraid to ask, but why do you want to do this?

Declare an unsigned int array[4] = {-1,-1,-1,-1}; or similar and load it with movaps.
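
For concreteness, a sketch of that approach (identifiers are mine; __declspec(align(16)) is MSVC-style alignment, matching the compiler used in this thread):

 #include <xmmintrin.h>

 // All-ones constant kept in aligned memory: a single movaps, but it
 // does touch memory.
 __declspec(align(16)) static const unsigned int g_ones[4] =
 	{ 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF };

 static inline __m128 onesFromMemory()
 {
 	return _mm_load_ps( (const float*)g_ones );	// movaps from the constant
 }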

Quote:
Original post by outRider
Declare an unsigned int array[4] = {-1,-1,-1,-1}; or similar and load it with movaps.


That's what I've been doing -- the compare trick is faster, though, as it doesn't have to touch memory.

My question is more why are you obsessed with getting this operation down to a single instruction?

Quote:
Original post by discman1028
I want to generate a 128-bit vector of all 1 bits.

__m128 a = _mm_setzero_ps();
return _mm_cmpeq_ps( a, a );

Any ideas? (SSE2)
What do you need the _mm_setzero_ps() for?

I mean, you're comparing an xmm register with itself; that is always true regardless of the contents, isn't it?
This would not be the case if we were talking about normal FPU math, which might compare an 80/96-bit register with a 32-bit memory location (yielding the well-known floating-point comparison dread).
However, the floats inside your xmm register are all exactly 32 bits, so there should be no issue?

Tried that, and it seems to work ok (which of course proves nothing...).

#include <xmmintrin.h>
#include <stdio.h>

int main()
{
	int f[4];

	// a, b, c are deliberately left uninitialized; the compare with self
	// should yield all ones regardless of their contents (NaNs aside).
	__m128 a;
	__m128 b;
	__m128 c;
	a = _mm_cmpeq_ps( a, a );
	b = _mm_cmpeq_ps( b, b );
	c = _mm_cmpeq_ps( c, c );

	_mm_storeu_ps((float*) f, a);
	printf("%x %x %x %x\n", f[0], f[1], f[2], f[3]);
	_mm_storeu_ps((float*) f, b);
	printf("%x %x %x %x\n", f[0], f[1], f[2], f[3]);
	_mm_storeu_ps((float*) f, c);
	printf("%x %x %x %x\n", f[0], f[1], f[2], f[3]);

	return 0;
}
This shows ffffffff ffffffff ffffffff ffffffff three times, as expected.

I think what he means is that it still generates a load off the stack for 'a' when it doesn't have to. The _mm_setzero_ps() should reduce to xorps xmm#,xmm# instead of a load.
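
For reference, a minimal sketch of the version being discussed (the function name is mine; the commented codegen is what one hopes for, not a guarantee):

 #include <xmmintrin.h>

 // With optimizations on, a compiler that recognizes _mm_setzero_ps()
 // should emit roughly:
 //     xorps   xmm0, xmm0
 //     cmpeqps xmm0, xmm0
 static inline __m128 allOnesPs()
 {
 	__m128 zero = _mm_setzero_ps();		// xorps: no memory access
 	return _mm_cmpeq_ps( zero, zero );	// 0.0f == 0.0f -> 0xFFFFFFFF per lane
 }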

Quote:
Original post by SiCrane
My question is more why are you obsessed with getting this operation down to a single instruction?


One instruction is better than two. :) I'm not obsessed. :) At least not ATM.

Quote:
Original post by samoth
I mean, you're comparing an xmm register with itself; that is always true regardless of the contents, isn't it?


Yeah -- my problem is that I can't get it to compile. Read the third sentence of my first post.

Quote:
Original post by outRider
I think what he means is that it still generates a load off the stack for 'a' when it doesn't have to.


Exactly... so what I currently have is the solution proposed at the top of my first post. It just produces an 'xor' and a 'cmpeq', which is better than a 'movaps' and a 'cmpeq'. However, a lone 'cmpeq' would be best.

Quote:
Original post by samoth
However, the floats inside your xmm register are all exactly 32 bits, so there should be no issue?


Except for one cute little case:
NaN != NaN

[grin]

Quote:
One instruction is better than two. :) I'm not obsessed. :) At least not ATM.

If you can't come up with a rational reason why it matters, I'd say yes, you are obsessed.
Does the additional instruction actually matter? If it's for performance reasons, I assume you have profiled it.

Spoonbender: If performance didn't matter, why would he be using SSE? ;)

NerdInHisShoe: there is an instruction for and-not, but not or-not. As such the dependency chain is not shorter than XOR/CMP.

Unfortunately I have no better solution to offer; this seems to be a regrettable limitation in the intrinsics (an _mm_setones_ps would be needed).
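
To illustrate why and-not doesn't help (a sketch; the name is mine): ANDNPS computes ~a & b, so feeding it the same register twice yields zero, the opposite of what's wanted.

 #include <xmmintrin.h>

 // ~x & x == 0 in every lane. An or-not (~x | x) would give all ones in
 // one instruction, but no such instruction exists.
 static inline __m128 andnotSelf( __m128 x )
 {
 	return _mm_andnot_ps( x, x );	// always zero
 }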

Quote:
Original post by Spoonbender
Except for one cute little case:
NaN != NaN

[grin]
Dang, you got me there! [grin]
But hey, you could use PCMPEQ[B|W|D] instead; these work bitwise and don't care about NaNs.
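
For concreteness, a sketch of that variant (names are mine; whether the compiler folds the setzero away is not guaranteed):

 #include <emmintrin.h>	// SSE2

 // pcmpeqd on a register compared with itself yields all ones, NaN
 // patterns included. The setzero merely placates the compiler's
 // uninitialized-variable check.
 static inline __m128i allOnesSi128()
 {
 	__m128i x = _mm_setzero_si128();
 	return _mm_cmpeq_epi32( x, x );	// each 32-bit lane -> 0xFFFFFFFF
 }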

Quote:
But hey, you could use PCMPEQ[B|W|D] instead; these work bitwise and don't care about NaNs.

That would work but incurs a "reformatting delay". XMM registers include a few hidden status bits, which need to be regenerated when casting from int (__m128i) to float/double (__m128). For float data, the _ps instructions should therefore be used whenever possible; additional bonus: their encodings are shorter than those of SSE2 integer instructions.
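
A contrived sketch of the crossing being described (names are mine; the arithmetic is meaningless, only the instruction domains matter; the SSE2 cast intrinsics stand in for reinterpret_cast on compilers that have them):

 #include <emmintrin.h>

 // The output of an integer XMM instruction (paddd) feeds a floating-point
 // XMM instruction (addps); per the description above, this is where the
 // reformatting delay would be paid. The cast itself emits no code.
 static inline __m128 crossDomains( __m128i ai, __m128i bi, __m128 c )
 {
 	__m128i sum  = _mm_add_epi32( ai, bi );	// integer domain
 	__m128  sumf = _mm_castsi128_ps( sum );	// reinterpretation only
 	return _mm_add_ps( sumf, c );			// float domain consumes it
 }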

Quote:
But hey, you could use PCMPEQ[B|W|D] instead; these work bitwise and don't care about NaNs.


Yeah, I thought of that soon after -- comparing ints would be best, even though it's SSE2+ only. SSE2 is the baseline of the SSE series anyway, IMHO.

Quote:
Original post by Jan Wassenberg
Quote:
But hey, you could use PCMPEQ[B|W|D] instead; these work bitwise and don't care about NaNs.

That would work but incurs a "reformatting delay". XMM registers include a few hidden status bits, which need to be regenerated when casting from int (__m128i) to float/double (__m128). For float data, the _ps instructions should therefore be used whenever possible; additional bonus: their encodings are shorter than those of SSE2 integer instructions.


I did not know about either of those two facts!

I reinterpret_cast between __m128 and __m128i all the time, and the disassembly looks perfect, so I assumed I was golden. Can you give more details on where this "reformatting delay" shows up, HERE or elsewhere? I would be very interested.

And do the shorter (SSE1?) encodings translate to less latency?

Quote:
Original post by Spoonbender
If you can't come up with a rational reason why it matters, I'd say yes, you are obsessed.
Does the additional instruction actually matter? If it's for performance reasons, I assume you have profiled it.


Fine-grained optimizations such as these won't show up in a profile. Don't fall in with the "premature optimization is the root of all evil"-preaching crowd.

Quote:
Original post by discman1028
I reinterpret_cast between __m128 and __m128i all the time, and the disassembly looks perfect, so I assumed I was golden. Can you give more details on where this "reformatting delay" shows up, HERE or elsewhere?


Found it in Appendix E of http://cr.yp.to/bib/2004/-amd-25112.pdf, but still not a ton of detail. I know that moving between the FP stack, general-purpose registers, and XMM registers is not so good. But within the XMM registers... I thought the __m128 and __m128i types were there for type safety only.

I know it's not always bad, at least. For example:

I went from _mm_shuffle_ps() to a _mm_shuffle_epi32() operation (SSE2) for duplicating the float components of a __m128, reinterpret_casting from and to __m128 before and after. 'pshufd' (the single instruction emitted by _mm_shuffle_epi32()) requires fewer registers as operands, so the compiler proved to have better register allocation, and profiles show that using that integer shuffle for all my single-vector float shuffles was about 10-20% faster.
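
Presumably something like this (a sketch with my names; the cast intrinsics stand in for the reinterpret_casts):

 #include <emmintrin.h>

 // Broadcasting lane 0 of a float vector, both ways.

 // SSE1: shufps must reuse one of its sources as the destination.
 static inline __m128 splatX_shufps( __m128 v )
 {
 	return _mm_shuffle_ps( v, v, _MM_SHUFFLE(0,0,0,0) );
 }

 // SSE2: pshufd takes separate source and destination registers, which can
 // ease register allocation (at the risk of a reformatting delay on AMD).
 static inline __m128 splatX_pshufd( __m128 v )
 {
 	__m128i vi = _mm_castps_si128( v );
 	return _mm_castsi128_ps( _mm_shuffle_epi32( vi, _MM_SHUFFLE(0,0,0,0) ) );
 }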

Quote:
Original post by discman1028
One instruction is better than two. :) I'm not obsessed. :) At least not ATM.


Not always - two fast instructions are better than one slow instruction. CMPPS isn't particularly fast; the logical ops are quicker.

Quote:
Original post by Jerax
Quote:
Original post by discman1028
One instruction is better than two. :) I'm not obsessed. :) At least not ATM.


Not always - two fast instructions are better than one slow instruction. CMPPS isn't particularly fast; the logical ops are quicker.


(Of course, I meant one instruction of the two: xor followed by cmpps is slower than cmpps alone.)


Anyhow: Any more info about that "reformatting delay", Jan?

Quote:
Anyhow: Any more info about that "reformatting delay", Jan?

Sorry about the "delay" (heh), was without internet access for a week whilst helping someone move house.

Quote:
I reinterpret_cast between __m128 and __m128i all the time, and the disassembly looks perfect, so I assumed I was golden. Can you give more details on where this "reformatting delay" shows up, HERE or elsewhere? I would be very interested.

Yes, this is a microarchitectural issue and thus does not show up at the architectural level (i.e. in the instructions).
The previously mentioned AMD manual is unfortunately rather thin on details, but you can read more about it in Agner Fog's microarchitecture and "Optimizing Assembly" manuals. (They contradict each other slightly with respect to which processors exhibit this behavior.)

Quote:
And do the shorter (SSE1?) encodings translate to less latency?

No, but it is relevant for the fetch/decode stage. Absent other bottlenecks, we can nowadays rely on the execution units to handle 3..4 instructions per cycle. However, instruction fetch bandwidth has stagnated at 16 bytes/cycle, and the more complex IA-32 and AMD64 instructions can easily reach 8 bytes, so this window limit can definitely be hit. To achieve peak throughput, it is once again important to keep encodings small.

Quote:
I went from _mm_shuffle_ps() to a _mm_shuffle_epi32() operation (SSE2) for duplicating the float components of a __m128, reinterpret_casting from and to __m128 before and after. 'pshufd' (the single instruction emitted by _mm_shuffle_epi32()) requires fewer registers as operands, so the compiler proved to have better register allocation, and profiles show that using that integer shuffle for all my single-vector float shuffles was about 10-20% faster.

Yes, it is a bit strange that SHUFPS and PSHUFD are so different in their operation. However, if using the 'wrong' instruction brings more gains than the reformatting delay, then go for it :)

Ahhh, Agner's stuff rules. ;)

Quote:

There is a penalty for using the wrong type of instructions on AMD processors in some cases. The reason is that the processor stores extra information about floating point numbers in XMM registers in order to remember if the number is zero, normal or denormal. This information is lost if an instruction intended for integer vectors is used for moving the floating point data. The processor needs one or two clock cycles extra for re-generating the lost information. This is called a reformatting delay. The reformatting delay occurs whenever the output of an integer XMM instruction is used as input for a floating point XMM instruction, except when the floating point XMM instruction does nothing else than writing the value to memory. Interestingly, there is no reformatting delay when using single-precision XMM instructions for double-precision data or vice versa.

The reformatting delay occurs only in AMD processors. I have observed no reformatting delays on any Intel processor.

Using an instruction of a wrong type can be advantageous in cases where there is no reformatting delay and in cases where the gain by using a particular instruction is more than the reformatting delay. Some cases are described below.

Using the shortest instruction

The instructions for packed single precision floating point numbers, with names ending in PS, are one byte shorter than equivalent instructions for double precision or integers. For example, you may use MOVAPS instead of MOVAPD or MOVDQA for moving data to or from memory or between registers. A reformatting delay occurs in AMD processors when using MOVAPS for moving the result of an integer instruction to another register, but not when moving data to or from memory.

Using the most efficient instruction

There are several different ways of reading an XMM register from unaligned memory. The typed instructions are MOVDQU, MOVUPD, and MOVUPS. These are all quite inefficient. LDDQU is faster, but requires the SSE3 instruction set. On many processors, the most efficient way of reading an XMM register from unaligned memory is to read 64 bits at a time using MOVQ and MOVHPS. Likewise, the fastest way of writing to unaligned memory may be to use MOVLPS and MOVHPS.

An efficient way of setting a vector register to zero is PXOR XMM0,XMM0. The P4 and P4E processors recognize this instruction as being independent of the previous value of XMM0, while it does not recognize this for XORPS or XORPD. The PXOR instruction is therefore preferred for setting a register to zero.

The integer versions of the Boolean vector instructions (PAND, PANDN, POR, PXOR) can use the FADD or FMUL unit in an AMD64 processor, while the floating point versions can use only the FMUL unit.

Using an instruction that is not available for other types of data

There are many situations where it is advantageous to use an instruction intended for a different type of data simply because an equivalent instruction doesn't exist for the type of data you have. The instructions for single precision float vectors are available in the SSE instruction set, while the equivalent instructions for double precision and integers require the SSE2 instruction set. Using MOVAPS instead of MOVAPD or MOVDQA for moving data makes the code compatible with processors that have SSE but not SSE2.

There are many useful instructions for data shuffling that are available for only one type of data. These instructions can easily be used for other types of data than they are intended for. The reformatting delay, if any, is often less than the cost of alternative solutions. The data shuffling instructions are listed in the next paragraph.

...

13.3 Shuffling data

Vectorized code sometimes needs a lot of instructions for swapping and copying vector elements and putting data into the right positions in the vectors. The need for these extra instructions reduces the advantage of using vector operations. It can often be an advantage to use a shuffling instruction that is intended for a different type of data than you have, as explained in the previous paragraph. Some instructions that are useful for data shuffling are listed below.

...

Using the PSHUFD instruction can often save a move instruction. Can be used for higher element sizes as well.


And there it all is in bold, including the shuffle suggestion. I should save myself some time and read these cover-to-cover. :)
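
As an aside, the unaligned-read trick from the quote might look like this with intrinsics (a sketch, names mine; _mm_loadl_pi emits MOVLPS, a close stand-in for the MOVQ the manual mentions):

 #include <xmmintrin.h>

 // Read 16 unaligned bytes as two 64-bit halves instead of one movups.
 static inline __m128 loadUnaligned64x2( const float* p )
 {
 	__m128 v = _mm_setzero_ps();					// any value; silences warnings
 	v = _mm_loadl_pi( v, (const __m64*)p );			// movlps: low 64 bits
 	v = _mm_loadh_pi( v, (const __m64*)(p + 2) );	// movhps: high 64 bits
 	return v;
 }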

Thanks Jan.

As an aside: it would be awesome if everybody who is interested in halfway-decent SSE compiler optimization went here and left a comment, or at least rated the topic "important": linky
