C dot product vs SSE dot product

Started by Shnoutz
12 comments, last by Zoner 12 years, 3 months ago
I recently updated my vector library to use SSE, and I am wondering why my dot product implementation is much slower with SSE.

I tried the usual dot4 code...


v1.x * v2.x + v1.y * v2.y + v1.z * v2.z + v1.w * v2.w


...versus the sse version...


__m128 t = _mm_mul_ps(v1, v2);
__m128 dot = _mm_add_ps(
    _mm_add_ps(
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 0, 0, 0)),
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1))),
    _mm_add_ps(
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 2, 2, 2)),
        _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 3, 3))));


And I ran both in a 10000 iteration loop, and I found that the SSE version was 26 times slower than the original (0.000001 secs vs 0.000026 secs).

I compared the assembly output of both...


fld dword ptr [edi+20h]
fmul dword ptr [edi+10h]
fld dword ptr [edi+24h]
fmul dword ptr [edi+14h]
faddp st(1),st
fld dword ptr [edi+28h]
fmul dword ptr [edi+18h]
faddp st(1),st
fld dword ptr [edi+2Ch]
fmul dword ptr [edi+1Ch]
faddp st(1),st


...versus...


movaps xmm0,xmmword ptr [edi+20h]
mulps xmm0,xmmword ptr [edi+10h]
movaps xmm1,xmm0
shufps xmm1,xmm0,0FFh
movaps xmm2,xmm0
shufps xmm2,xmm0,0AAh
addps xmm1,xmm2
movaps xmm2,xmm0
shufps xmm2,xmm0,55h
shufps xmm0,xmm0,0
addps xmm2,xmm0
addps xmm1,xmm2


...and I don't see why SSE is not faster.

All I can think of is the compiler is better at rescheduling floating point operations. At this point I am wondering if I should avoid SSE for dot products.


All I can think of is the compiler is better at rescheduling floating point operations

It is. And it gets even more complicated. See here.
Your heavy use of shufps is probably slowing you down a lot, especially if you have an Intel CPU older than Penryn (Penryn introduced the "Super Shuffle Engine", which made shuffles faster). I think you can replace the shuffles with shifts and get better results. Try the pslldq instruction: shift the register left by 4 bytes and add it to its non-shifted self, then shift the result left by 8 bytes and add it to its non-shifted version. This way you get the sum of the 4 elements in the leftmost 32 bits.
If I understand what they say in the link provided by Antheus, they show how to compute the dot product of two arbitrarily sized vectors. I only care about 4-component vectors, so I don't really need the SSE add instructions they use. I should just do an SSE multiplication of both vectors and sum the resulting vector with normal floating point arithmetic (they use LEA to add them in the link).

Edit :
I just tried that and it failed miserably. :)

@D_Tr: what function would you use to "shift"?
I added a link to my previous post to a page that describes the instruction and the corresponding intrinsic. The intrinsic, according to the linked-to page, is this: __m128i _mm_slli_si128(__m128i a, int imm). The immediate value is the number of bytes you want to shift the register by. I think the intrinsics are the same across all major compilers.
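Putting it together with intrinsics, the whole horizontal sum would look roughly like this (untested sketch; the _mm_castps_si128 / _mm_castsi128_ps calls are only there to reinterpret the register so the integer byte shift can be applied to float data):


__m128 t = _mm_mul_ps(v1, v2);                                         // the four products
t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 4)));  // shift left 4 bytes, add
t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 8)));  // shift left 8 bytes, add
// the sum of all four elements is now in the topmost (leftmost) element of t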

EDIT: You can also have a look at the movlhps instruction. It can save you a move instruction, since it takes 2 register operands (unlike pslldq) and you do not need to copy the register before performing the instruction. As I have sketched it in assembly, it should look like this:


movaps xmm0, [X]     ; load the first vector
mulps xmm0, [Y]      ; xmm0 = the four products
movlhps xmm1, xmm0   ; copy the two low products into the high half of xmm1
addps xmm0, xmm1     ; high half of xmm0 now holds xx'+zz' and yy'+ww'
movaps xmm1, xmm0
pslldq xmm1, 4       ; shift a copy left by 4 bytes
addps xmm0, xmm1     ; topmost element of xmm0 holds the full sum


Still, it is 7 instructions long...
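In intrinsics that would be roughly the following (untested sketch; _mm_movelh_ps(t, t) does the movlhps step, and passing t for both operands keeps the low lanes well defined instead of leaving whatever was in the register):


__m128 t = _mm_mul_ps(v1, v2);                 // the four products
t = _mm_add_ps(t, _mm_movelh_ps(t, t));        // lanes 2,3 now hold xx'+zz' and yy'+ww'
t = _mm_add_ps(t, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(t), 4)));
// the topmost element of t now holds the full dot product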
SSE does have a dot product instruction (dpps), but unfortunately it was only added in SSE4.1, so lots of commonly used CPUs don't have it (the Steam hardware survey says about 50% of users have it at the moment). This means you'd have to either build two exes, or do some sort of run-time decision making to use it.
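For reference, the SSE4.1 path plus a run-time check would look something like this (a sketch only; __cpuid/intrin.h is the MSVC way of doing the check, and the dp_ps code has to live in a translation unit the compiler is allowed to build with SSE4.1):


#include <intrin.h>      // __cpuid (MSVC)
#include <smmintrin.h>   // SSE4.1 intrinsics

// Run-time check for the SSE4.1 bit (CPUID leaf 1, ECX bit 19).
static bool HasSSE41()
{
    int info[4];
    __cpuid(info, 1);
    return (info[2] & (1 << 19)) != 0;
}

// 0xFF: upper nibble = multiply all four lanes,
// lower nibble = broadcast the sum to all four lanes of the result.
static __m128 Dot4SSE41(__m128 a, __m128 b)
{
    return _mm_dp_ps(a, b, 0xFF);
}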

If you need to do lots of dot products without that instruction, then the best option is to keep the data in SoA form. That is something like:

struct SoAVector4_1000
{
    float x[1000];
    float y[1000];
    float z[1000];
    float w[1000];
};


You can then compute four or more dot products simultaneously, and it's significantly quicker than doing one at a time from AoS data.
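As a rough sketch of that (the function name is mine; it assumes the struct above and the output array are 16-byte aligned and count is a multiple of four, otherwise use _mm_loadu_ps / _mm_storeu_ps):


// Four dot products per iteration, no shuffles needed.
void DotProductsSoA(const SoAVector4_1000& a, const SoAVector4_1000& b,
                    float* out, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 d = _mm_mul_ps(_mm_load_ps(&a.x[i]), _mm_load_ps(&b.x[i]));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(&a.y[i]), _mm_load_ps(&b.y[i])));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(&a.z[i]), _mm_load_ps(&b.z[i])));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(&a.w[i]), _mm_load_ps(&b.w[i])));
        _mm_store_ps(&out[i], d);   // dot products i..i+3
    }
}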
I thought about it, and I reduced the number of intrinsics required to compute a dot4. It's not as good as a structure of arrays, but it's better than the first implementation I proposed.

(I really don't like working with structures of arrays... when technical constraints kill application design...)


__m128 m = _mm_mul_ps(v1, v2);                                                 // the four products
__m128 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));       // x+y, x+y, z+w, z+w
__m128 result = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));  // full dot in all four lanes


You get the resulting dot product in 5 intrinsics.
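Since every lane of result ends up holding the same value, getting a plain float out (if that is what the interface returns) is just something like:


float dot = _mm_cvtss_f32(result);   // read lane 0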


(I really don't like working with structures of arrays... when technical constraints kill application design...)


Nope - your application is defying the purpose of SIMD: single instruction, multiple data.

SIMD, GPUs, even MPI. What you really want is as much data as possible on which to apply the operation.

The link above references optimization manuals which list the performance penalties of individual operations. With per-element processing, these penalties cannot be avoided, so the solution is to structure the data in a CPU-friendly way.

To optimize dot products, perform 1000 of them at the same time before moving on. Ever since the 486 (cache hierarchy + pipeline), individual instructions haven't been what matters. Optimizing a single dot product, no matter the technique, won't get you past memory and instruction latency, and there regular scalar code is already about as good as it gets. To benefit, change the application to work on more data at once.

The code you have now, 'dot(a,b)', is bottlenecked regardless of how it's implemented internally, even with the SSE4 dot product intrinsic.
What I mean is that I do not want to change the interface the program is currently using; I want to do some transparent optimization.

I am sure I am not alone in this situation. I can't go and convert our mesh format from AoS to SoA without someone noticing it... but I can optimize my vector library and no one will notice... they will only see that the performance improved a bit.

