# C dot product vs SSE dot product


## Recommended Posts

I recently updated my vector library to use sse and I am wondering why my dot product implementation is much slower using SSE.

I tried the usual dot4 code...

 v1.x * v2.x + v1.y * v2.y + v1.z * v2.z + v1.w * v2.w 

...versus the sse version...

    __m128 t = _mm_mul_ps(v1, v2);
    _mm_add_ps(
        _mm_add_ps(
            _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 0, 0, 0)),
            _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1))),
        _mm_add_ps(
            _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 2, 2, 2)),
            _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 3, 3))));

And I ran both in a 10000 iteration loop and I found that the sse version was 26 times slower than the original (0.000001 secs vs 0.000026 secs).

I compared the assembly output of both...

    fld   dword ptr [edi+20h]
    fmul  dword ptr [edi+10h]
    fld   dword ptr [edi+24h]
    fmul  dword ptr [edi+14h]
    faddp st(1),st
    fld   dword ptr [edi+28h]
    fmul  dword ptr [edi+18h]
    faddp st(1),st
    fld   dword ptr [edi+2Ch]
    fmul  dword ptr [edi+1Ch]
    faddp st(1),st

...versus...

    movaps xmm0,xmmword ptr [edi+20h]
    mulps  xmm0,xmmword ptr [edi+10h]
    movaps xmm1,xmm0
    shufps xmm1,xmm0,0FFh
    movaps xmm2,xmm0
    shufps xmm2,xmm0,0AAh
    addps  xmm1,xmm2
    movaps xmm2,xmm0
    shufps xmm2,xmm0,55h
    shufps xmm0,xmm0,0
    addps  xmm2,xmm0
    addps  xmm1,xmm2

...and I don't see why the SSE version is not faster.

All I can think of is that the compiler is better at rescheduling floating-point operations. At this point I am wondering if I should avoid SSE for dot products.

##### Share on other sites

> All I can think of is the compiler is better at rescheduling floating point operations

It is. And it gets even more complicated. See here.

##### Share on other sites
Your heavy use of shufps is probably slowing you down a lot, especially if you have an Intel CPU older than Penryn. (Penryn introduced the "Super Shuffle Engine", which made shuffles faster.) I think you can replace the shuffles with shifts and get better results. Try the pslldq instruction. You can shift the register left by 4 bytes and add it to its non-shifted self, then shift the result left by 8 bytes and add it to its non-shifted version. This way you get the sum of the four elements in the leftmost 32 bits.
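The shift trick described above could be sketched with intrinsics like this (the function name is ours; pslldq is an integer instruction, so SSE2 and a pair of casts are assumed):

```cpp
#include <emmintrin.h>  // SSE2: _mm_slli_si128 (pslldq) and the casts

// Sketch of the shift-based horizontal add: the running sums accumulate
// upward, so the full sum lands in the highest (leftmost) lane.
static inline float dot4_shift(__m128 v1, __m128 v2)
{
    __m128 t = _mm_mul_ps(v1, v2);  // lanes: [x, y, z, w]
    // Shift left by 4 bytes and add: lane i now holds elem[i] + elem[i-1].
    t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 4)));
    // Shift left by 8 bytes and add: the top lane now holds x+y+z+w.
    t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 8)));
    // Broadcast the top lane down so it can be extracted as a scalar.
    return _mm_cvtss_f32(_mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 3, 3)));
}
```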

##### Share on other sites
If I understand what they say in the link provided by Antheus, they show how to dot two arbitrarily sized vectors. I only care about 4-component vectors, so I don't really need the SSE add instructions they use. I should just do an SSE multiplication of both vectors and sum the resulting vector with normal floating-point arithmetic (they use LEA to add them in the link).

Edit :
I just tried that and it failed miserably.

@D_Tr: what function would you use to "shift"?

##### Share on other sites
I added a link to my previous post to a page that describes the instruction and the corresponding intrinsic. The intrinsic, according to the linked-to page, is `__m128i _mm_slli_si128(__m128i a, int imm)`. The immediate value is the number of bytes you want to shift the register by. I think the intrinsics are the same across all major compilers.

EDIT: You can also have a look at the movlhps instruction. It can save you a move instruction, since it takes two register operands unlike pslldq, so you do not need to copy the register before performing the instruction. As I've thought it out in assembly, it should look like this:

    movaps  xmm0, [X]
    mulps   xmm0, [Y]
    movlhps xmm1, xmm0
    addps   xmm0, xmm1
    movaps  xmm1, xmm0
    pslldq  xmm1, 4
    addps   xmm0, xmm1

Still, it is 7 instructions long...
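An intrinsics rendering of that sequence might look like the sketch below (function name is ours). Note that the assembly above reads xmm1 before writing its low half; zeroing it first avoids adding leftover garbage (possibly NaN) into the intermediate lanes:

```cpp
#include <emmintrin.h>  // SSE2 for _mm_slli_si128; movlhps itself is SSE1

// Sketch of the movlhps + pslldq sequence from the post above.
static inline float dot4_movlhps(__m128 v1, __m128 v2)
{
    __m128 t  = _mm_mul_ps(v1, v2);                   // [x, y, z, w]
    __m128 lo = _mm_movelh_ps(_mm_setzero_ps(), t);   // [0, 0, x, y]
    t = _mm_add_ps(t, lo);                            // [x, y, z+x, w+y]
    // Shift left 4 bytes and add: full sum ends up in the top lane.
    t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 4)));
    return _mm_cvtss_f32(_mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 3, 3)));
}
```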

##### Share on other sites
SSE does have a dot product instruction (dpps), but unfortunately it was only added with SSE4.1, so lots of commonly used CPUs don't have it (the Steam hardware survey says that about 50% of users have it at the moment). This means you'd have to either build two executables, or do some sort of run-time dispatch to use it.

If you need to do lots of dot products without that instruction, then the best option is to keep the data in SoA (structure-of-arrays) form. That is, something like:

    struct SoAVector4_1000
    {
        float x[1000];
        float y[1000];
        float z[1000];
        float w[1000];
    };

You can then compute four or more dot products simultaneously, and it's significantly quicker than doing one at a time from AoS data.
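The SoA approach could be sketched as follows (struct and function names are ours, sizes illustrative): each loop iteration produces four finished dot products with no shuffles at all.

```cpp
#include <xmmintrin.h>  // SSE1

// Separate arrays per component, as suggested above.
struct SoAVectors {
    float x[1000], y[1000], z[1000], w[1000];
};

// out[i] = dot(a[i], b[i]), four vectors per iteration.
// n is assumed to be a multiple of 4 in this sketch; unaligned loads
// are used so the struct needs no special alignment.
void dot_soa(const SoAVectors* a, const SoAVectors* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 d = _mm_mul_ps(_mm_loadu_ps(a->x + i), _mm_loadu_ps(b->x + i));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a->y + i), _mm_loadu_ps(b->y + i)));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a->z + i), _mm_loadu_ps(b->z + i)));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a->w + i), _mm_loadu_ps(b->w + i)));
        _mm_storeu_ps(out + i, d);  // four results at once
    }
}
```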

##### Share on other sites
I thought about it and reduced the number of instructions required to compute a dot4. It's not as good as a structure of arrays, but it's better than the first implementation I proposed.

(I really don't like working with structures of arrays... when technical constraints kill application design...)

    __m128 m = _mm_mul_ps(v1, v2);
    __m128 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 result = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));

You get the resulting dot product, broadcast to all four lanes, in 5 intrinsics.
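Wrapped into a complete function (the name dot4 is ours), with the scalar read out of lane 0, the sequence above could look like this:

```cpp
#include <xmmintrin.h>  // SSE1

// The two shuffle/add steps pair-swap and then half-swap the lanes,
// so both partial sums broadcast; every lane ends up holding x+y+z+w.
static inline float dot4(__m128 v1, __m128 v2)
{
    __m128 m = _mm_mul_ps(v1, v2);                                      // [x, y, z, w]
    __m128 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));  // [x+y, x+y, z+w, z+w]
    __m128 r = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));  // sum in all lanes
    // _mm_cvtss_f32 extracts lane 0 without a memory store.
    return _mm_cvtss_f32(r);
}
```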

##### Share on other sites

> (I really don't like working with structure of arrays.... When technical constraints kills application design...)

Nope, your application is defeating the purpose of SIMD: single instruction, multiple data.

SIMD, GPUs, even MPI. What you really want is as much data as possible on which to apply the operation.

The link above references optimization manuals which list the performance penalties of individual operations. With per-element processing, these cannot be avoided. So the solution is to structure the data in a CPU-friendly way.

To optimize dot products, perform 1000 of them at a time before moving on. Ever since the 486 (cache hierarchy + pipeline), individual instructions haven't been what matters. No matter the technique, optimizing a single dot product won't yield results beyond memory and instruction latency, and there the regular code is already about as good as it gets. To benefit, change the application to work on more data.

The code you have now, dot(a, b), is bottlenecked regardless of how it's implemented internally, even if it uses the SSE4 dot intrinsic.

##### Share on other sites
What I mean is that I do not want to change the interface the program currently uses; I want to do a transparent optimization.

I am sure I am not alone in this situation. I can't go and convert our mesh format from AoS to SoA without someone noticing it... but I can optimize my vector library and no one will notice; they will only see that the performance improved a bit.
