my SSE code is slower than normal! why?

Hello! Could somebody tell me why my SSE code for vector and matrix operations is slower than the plain version? I must be doing something wrong, but I don't know what. Here is my vector addition, for example (I'm using intrinsics from xmmintrin.h with VC++ 6 SP5).

Data type:

typedef union
{
    __m128 data;
    float elements[4];
} vector4d;

With SSE (using intrinsics):

inline vector4d add(vector4d a, vector4d b)
{
    vector4d c;
    c.data = _mm_add_ps(a.data, b.data);
    return c;
}

Without:

inline vector4d add_sisd(vector4d a, vector4d b)
{
    return set(a.elements[0]+b.elements[0], a.elements[1]+b.elements[1], a.elements[2]+b.elements[2], 0);
}

(set() simply returns a vector4d with the parameter values.)

"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin

Guest Anonymous Poster
Hmm, one thing that is really important is memory alignment. I think SSE data should be aligned to 16-byte boundaries (can anyone confirm this?); anything that isn't 16-byte aligned can incur a high penalty. This could easily happen if you're using data allocated on the stack (i.e. local variables) or dynamically allocated memory, if you didn't use the special intrinsic aligned memory allocation calls for VC (I can't remember the exact function names; they're in the Processor Pack help).
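A minimal sketch of aligned dynamic allocation, assuming the 16-byte requirement discussed later in the thread. The helper names alloc_aligned_floats and free_aligned_floats are hypothetical; _mm_malloc and _mm_free are the aligned allocation calls that common compilers make available through xmmintrin.h, which may or may not be the ones the poster had in mind.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; also brings in _mm_malloc/_mm_free

// Hypothetical helper: allocates `count` floats on a 16-byte boundary.
// Returns a null pointer on failure. Release with free_aligned_floats.
inline float* alloc_aligned_floats(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Memory from _mm_malloc must be released with _mm_free, not free/delete.
inline void free_aligned_floats(float* p)
{
    _mm_free(p);
}
```

Memory obtained this way can be used with the aligned load and store intrinsics (_mm_load_ps, _mm_store_ps).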

Have you tried prefetching the data using the prefetch instructions? That can often provide a significant boost (prefetchnta in asm works best for me).

The other question is whether this is an appropriate application. Generally speaking, you get the best results when you process large chunks of data in one go (rather than dipping in and out of SSE for, say, just one op).
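To illustrate the "large chunks in one go" point, here is a rough sketch that adds two whole float arrays four elements at a time. The function name add_arrays_sse is hypothetical, and the code assumes that all three pointers are 16-byte aligned and that count is a multiple of 4; real code would need a fallback for other cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, count).
// Assumes a, b and out are 16-byte aligned and count is a multiple of 4.
void add_arrays_sse(const float* a, const float* b, float* out, std::size_t count)
{
    assert(count % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    for (std::size_t i = 0; i < count; i += 4) {
        __m128 va = _mm_load_ps(a + i);             // aligned load of 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 additions, aligned store
    }
}
```

The loop keeps the work inside SSE registers for the whole array, instead of paying the setup and copy cost around every single add() call.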

What happens if you get rid of the temp variable in the SSE version? I.e.:

inline vector4d add(vector4d a, vector4d b)
{
return _mm_add_ps(a.data, b.data);
}

Maybe you can do something like this with some kind of cast?
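As written, the version without the temporary returns an __m128 where a vector4d is expected, so it needs some form of conversion. One possible sketch, assuming a C++11 or later compiler (so not VC++ 6 as-is), uses aggregate initialization of the union's first member instead of a cast. Taking the arguments by const reference is also an assumption made here, not something from the original post.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

typedef union
{
    __m128 data;
    float elements[4];
} vector4d;

// Braced initialization sets the first union member (data),
// so no named temporary and no cast are needed.
inline vector4d add(const vector4d& a, const vector4d& b)
{
    return vector4d{ _mm_add_ps(a.data, b.data) };
}
```

Whether this is actually faster than the original depends on what the compiler does with the union; an optimizing compiler may well generate the same code for both.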

Right, that's all I can think of at the moment. Good luck!

You are most likely just trying to do ops on unaligned data. If the memory isn't aligned, those ops become very inefficient. Note also that these kinds of ops are normally used on large amounts of data, not just one value... I assume you knew that and were just testing with a single data value ;P

Hope I've helped.

[shameless plug]I did write a tutorial on SSE2 which you may find helpful: My amazing tutorial[/shameless plug]

It covers SSE2, but should be pretty relevant to SSE as well.

Dwiel

Yes, I actually tried it on single vectors. The variables should be aligned to 16-byte boundaries, but I don't really know what that means. Could somebody explain it to me?
The other thing is prefetching. How does that work (and how do I process huge arrays efficiently)? I did not find any docs that specify the SSE operations exactly...

"Knowledge is no more expensive than ignorance, and at least as satisfying." -Barrin

Guest Anonymous Poster
okay:

Alignment to a 16-byte boundary means that the address of the data (i.e. of its first byte) is exactly divisible by 16 with no remainder: 0x00543210 and 0x00437780 would be okay, while 0x00683837 and 0x00278274 would not (in hex, the last digit of the address needs to be zero for the data to be 16-byte aligned). I believe there are some macros defined in VC that ensure 16-byte alignment of specific variables; it's in the help files.
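The divisibility rule can be checked in code. A small sketch follows: is_aligned_16 and AlignedVec4 are hypothetical names, and alignas is standard C++11, so an older compiler such as VC++ 6 would need a vendor-specific alignment attribute instead.

```cpp
#include <cstdint>

// Hypothetical helper: true when ptr sits on a 16-byte boundary,
// i.e. the lowest hex digit of its address is zero.
inline bool is_aligned_16(const void* ptr)
{
    return (reinterpret_cast<std::uintptr_t>(ptr) & 0xF) == 0;
}

// Hypothetical type: alignas(16) asks the compiler to place every
// instance (stack, global or array element) on a 16-byte boundary.
struct alignas(16) AlignedVec4
{
    float elements[4];
};
```

Calling is_aligned_16 on the address of an AlignedVec4 variable should return true, while an address such as 0x00683837 fails the check because its last hex digit is 7.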

Prefetch: the principle is that you tell the processor in advance that it will need some data, so that it can load it into the cache ready for when it is needed. For example, in a loop you might tell the CPU to prefetch the data for the next iteration while you process the current iteration (if that makes it any clearer...).
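A rough sketch of that loop pattern, using the _mm_prefetch intrinsic with the _MM_HINT_NTA hint (the intrinsic counterpart of prefetchnta). The function name sum_with_prefetch and the look-ahead distance of 64 floats are assumptions for illustration only; a useful distance depends on the CPU and the workload and has to be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics, including _mm_prefetch

// Hypothetical example: sums an array while asking the CPU to start
// fetching data that a later iteration will need.
float sum_with_prefetch(const float* data, std::size_t count)
{
    const std::size_t PREFETCH_AHEAD = 64;  // assumed distance, in floats

    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        // Hint only: request data for a future iteration, staying in bounds.
        if (i + PREFETCH_AHEAD < count)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  // unaligned load is allowed here
    }

    // Combine the four partial sums, then handle any leftover elements.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < count; ++i)
        total += data[i];
    return total;
}
```

A prefetch is only a hint: the result is the same with or without it, so the benefit shows up only in timing, and mainly on arrays too large to stay in the cache.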

Thanks! I get the prefetch idea, but I need the specifications of the SSE functions that do this job (what exactly do they do, and how?).
Any specs/tutorials on this?

"Thanks! I get the prefetch idea, but I need the specifications of the SSE functions that do this job (what exactly do they do, and how?).
Any specs/tutorials on this?"


intel.com

I think that pdf would be under 5 MB.
Actually 3.5, xxx and 5.5 MB
Happy DL.
