C dot product vs SSE dot product

Started by Shnoutz
12 comments, last by Zoner 12 years, 3 months ago
I recently updated my vector library to use SSE, and I am wondering why my dot product implementation is much slower with SSE.

I tried the usual dot4 code...


v1.x * v2.x + v1.y * v2.y + v1.z * v2.z + v1.w * v2.w


...versus the sse version...


__m128 t = _mm_mul_ps(v1, v2);
__m128 dot = _mm_add_ps(
    _mm_add_ps(
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 0, 0, 0)),
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1))),
    _mm_add_ps(
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 2, 2, 2)),
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 3, 3))));


And I ran both in a 10000 iteration loop, and I found that the SSE version was 26 times slower than the original (0.000001 secs vs 0.000026 secs).

I compared the assembly output of both...


fld dword ptr [edi+20h]
fmul dword ptr [edi+10h]
fld dword ptr [edi+24h]
fmul dword ptr [edi+14h]
faddp st(1),st
fld dword ptr [edi+28h]
fmul dword ptr [edi+18h]
faddp st(1),st
fld dword ptr [edi+2Ch]
fmul dword ptr [edi+1Ch]
faddp st(1),st


...versus...


movaps xmm0,xmmword ptr [edi+20h]
mulps xmm0,xmmword ptr [edi+10h]
movaps xmm1,xmm0
shufps xmm1,xmm0,0FFh
movaps xmm2,xmm0
shufps xmm2,xmm0,0AAh
addps xmm1,xmm2
movaps xmm2,xmm0
shufps xmm2,xmm0,55h
shufps xmm0,xmm0,0
addps xmm2,xmm0
addps xmm1,xmm2


...and I don't see why SSE is not faster.

All I can think of is the compiler is better at rescheduling floating point operations. At this point I am wondering if I should avoid SSE for dot products.


All I can think of is the compiler is better at rescheduling floating point operations

It is. And it gets even more complicated. See here.
Your heavy use of shufps is probably slowing you down a lot, especially if you have an Intel CPU older than Penryn (Penryn introduced the "Super Shuffle Engine", which made shuffles faster). I think you can replace the shuffles with shifts and get better results. Try the pslldq instruction: shift the register left by 4 bytes and add it to its non-shifted self, then shift the result left by 8 bytes and add it to its non-shifted version. This way you get the sum of the 4 elements in the leftmost 32 bits.
If I understand what they say in the link provided by Antheus, they show how to compute the dot product of two arbitrarily sized vectors. I only care about 4-component vectors, so I don't really need the SSE add instructions they use. I should just do an SSE multiplication of both vectors and sum the resulting vector with normal floating point arithmetic (they use LEA to add them in the link).

Edit :
I just tried that and it failed miserably. :)

@D_Tr: what function would you use to "shift"?
I added a link to my previous post to a page that describes the instruction and the corresponding intrinsic. The intrinsic, according to the linked-to page, is this: __m128i _mm_slli_si128(__m128i a, int imm). The immediate value is the number of bytes you want to shift the register by. I think the intrinsics are the same across all major compilers.
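Putting it together with intrinsics, the whole horizontal sum would look roughly like this (untested sketch; the _mm_castps_si128 / _mm_castsi128_ps calls are only there to reinterpret the register so the integer byte shift can be applied to float data):


__m128 t = _mm_mul_ps(v1, v2);                                         // the four products
t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 4)));  // shift left 4 bytes, add
t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 8)));  // shift left 8 bytes, add
// the sum of all four elements is now in the topmost (leftmost) element of t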

EDIT: You can also have a look at the movlhps instruction. It can save you a move instruction, since it takes 2 register operands (unlike pslldq) and you do not need to copy the register before performing the instruction. As I have sketched it in assembly, it should look like this:


movaps xmm0, [X]     ; load the first vector
mulps xmm0, [Y]      ; xmm0 = the four products
movlhps xmm1, xmm0   ; copy the two low products into the high half of xmm1
addps xmm0, xmm1     ; high half of xmm0 now holds xx'+zz' and yy'+ww'
movaps xmm1, xmm0
pslldq xmm1, 4       ; shift a copy left by 4 bytes
addps xmm0, xmm1     ; topmost element of xmm0 holds the full sum


Still, it is 7 instructions long...
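In intrinsics that would be roughly the following (untested sketch; _mm_movelh_ps(t, t) does the movlhps step, and passing t for both operands keeps the low lanes well defined instead of leaving whatever was in the register):


__m128 t = _mm_mul_ps(v1, v2);                 // the four products
t = _mm_add_ps(t, _mm_movelh_ps(t, t));        // lanes 2,3 now hold xx'+zz' and yy'+ww'
t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 4)));
// the topmost element of t now holds the full dot product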
SSE does have a dot product instruction (dpps), but unfortunately it was only added in SSE4.1, so lots of commonly used CPUs don't have it (the Steam hardware survey says about 50% of users have it at the moment). This means you'd have to either build two exes, or do some sort of run-time decision making to use it.
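For reference, the SSE4.1 path plus a run-time check would look something like this (a sketch only; __cpuid/intrin.h is the MSVC way of doing the check, and the dp_ps code has to live in a translation unit the compiler is allowed to build with SSE4.1):


#include <intrin.h>      // __cpuid (MSVC)
#include <smmintrin.h>   // SSE4.1 intrinsics

// Run-time check for the SSE4.1 bit (CPUID leaf 1, ECX bit 19).
static bool HasSSE41()
{
    int info[4];
    __cpuid(info, 1);
    return (info[2] & (1 << 19)) != 0;
}

// 0xFF: upper nibble = multiply all four lanes,
// lower nibble = broadcast the sum to all four lanes of the result.
static __m128 Dot4SSE41(__m128 a, __m128 b)
{
    return _mm_dp_ps(a, b, 0xFF);
}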

If you need to do lots of dot products without that instruction, then the best option is to keep the data in SoA form. That is something like:

struct SoAVector4_1000
{
    float x[1000];
    float y[1000];
    float z[1000];
    float w[1000];
};


You can then compute four or more dot products simultaneously, and it's significantly quicker than doing one at a time from AoS data.
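As a rough sketch of that (the function name is mine; it assumes the struct above and the output array are 16-byte aligned and count is a multiple of four, otherwise use _mm_loadu_ps / _mm_storeu_ps):


// Four dot products per iteration, no shuffles needed.
void DotProductsSoA(const SoAVector4_1000& a, const SoAVector4_1000& b,
                    float* out, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 d = _mm_mul_ps(_mm_load_ps(&a.x[i]), _mm_load_ps(&b.x[i]));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(&a.y[i]), _mm_load_ps(&b.y[i])));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(&a.z[i]), _mm_load_ps(&b.z[i])));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(&a.w[i]), _mm_load_ps(&b.w[i])));
        _mm_store_ps(&out[i], d);   // dot products i..i+3
    }
}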
I thought about it, and I reduced the number of intrinsics required to compute a dot4. It's not as good as a structure of arrays, but it's better than the first implementation I proposed.

(I really don't like working with structures of arrays... when technical constraints kill application design...)


__m128 m = _mm_mul_ps(v1, v2);                                                 // the four products
__m128 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));       // x+y, x+y, z+w, z+w
__m128 result = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));  // full dot in all four lanes


You get the resulting dot product in 5 intrinsics.
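Since every lane of result ends up holding the same value, getting a plain float out (if that is what the interface returns) is just something like:


float dot = _mm_cvtss_f32(result);   // read lane 0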


(I really don't like working with structures of arrays... when technical constraints kill application design...)


Nope - your application is defying the purpose of SIMD: single instruction, multiple data.

SIMD, GPUs, even MPI. What you really want is as much data as possible on which to apply the operation.

The link above references optimization manuals which list the performance penalties of individual operations. With per-element processing, these penalties cannot be avoided, so the solution is to structure the data in a CPU-friendly way.

To optimize dot products, perform 1000 of them at the same time before moving on. Ever since the 486 (cache hierarchy + pipeline), individual instructions haven't been what matters. Optimizing a single dot product, no matter the technique, won't get you past memory and instruction latency, and there regular scalar code is already about as good as it gets. To benefit, change the application to work on more data at once.

The code you have now, 'dot(a,b)', is bottlenecked regardless of how it's implemented internally, even with the SSE4 dot product intrinsic.
What I mean is that I do not want to change the interface the program is currently using; I want to do some transparent optimization.

I am sure I am not alone in this situation. I can't go and convert our mesh format from AoS to SoA without someone noticing it... but I can optimize my vector library and no one will notice... they will only see that the performance improved a bit.

