

This topic is now archived and is closed to further replies.


sse-accelerated vector arithmetics question

This topic is 5545 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.



Hi. I coded a vector library some time ago, and now I've started to *optimize* it using SSE inline assembly. I'm stuck on even the easiest of all operations: addition. To be able to use "movaps" I have to align the vector's coordinates to a 2 byte boundary, right? (I've done that.) Then I have to load the four coordinates of the first vector into an SSE register (xmm0) and those of the other vector into another one (xmm1) using movaps. Then I should be able to add them using addps and transfer the result back to the returned vector's coordinates with movaps again, right? This is the code:
#ifdef SSE
inline vector operator+(vector v1, const vector &v2)
{
#ifdef _UNIX_
    asm (
        "mov %0, %%eax\n"               // eax = address of v1
        "mov %1, %%edx\n"               // edx = address of v2

        "movaps (%%eax), %%xmm0\n"      // xmm0 = v1
        "movaps (%%edx), %%xmm1\n"      // xmm1 = v2
        "addps   %%xmm1, %%xmm0\n"      // xmm0 += xmm1
        "movaps  %%xmm0, (%%eax)\n"     // v1 = xmm0
        :                               // no outputs
        :"m"(&v1), "m"(&v2)
        :"eax", "edx", "xmm0", "xmm1"
    );
#else // i didn't test this part
    __asm {
        movaps xmm0, v1
        movaps xmm1, v2
        addps xmm0, xmm1
        movaps v1, xmm0
    }
#endif
    return v1;
}
#else
inline vector operator+(vector v1, const vector& v2)
{
    v1.x+=v2.x; v1.y+=v2.y; v1.z+=v2.z;
    return v1;
}
#endif
When I compile it with Intel C it just segfaults, and if I compile it with gcc it returns _random_ results. Does anybody see what's wrong?
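For comparison, here is a minimal sketch of the same addition written with SSE intrinsics instead of inline assembly. One fact it relies on: movaps (and `_mm_load_ps` / `_mm_store_ps`) require the address to be on a 16-byte boundary and fault otherwise, which would be consistent with the segfault described above. The `vec4` type and the `is_aligned16` helper are hypothetical names made up for this example, not part of the poster's library, and `alignas` assumes a C++11 compiler. It is also possible, though not confirmed by the thread, that the gcc results come from the asm statement not telling the compiler that `v1` is modified (no "memory" clobber); intrinsics sidestep that question.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the poster's vector type. The w member pads the
// struct to four floats (16 bytes), and alignas(16) asks the compiler to
// place every instance on a 16-byte boundary.
struct alignas(16) vec4 {
    float x, y, z, w;
};

// True when p lies on a 16-byte boundary, as aligned SSE loads/stores require.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

inline vec4 operator+(const vec4& a, const vec4& b) {
    // Debug-build check: fail loudly instead of faulting inside the load.
    assert(is_aligned16(&a) && is_aligned16(&b));

    __m128 va = _mm_load_ps(reinterpret_cast<const float*>(&a));  // aligned load (movaps)
    __m128 vb = _mm_load_ps(reinterpret_cast<const float*>(&b));  // aligned load (movaps)

    vec4 r;
    _mm_store_ps(reinterpret_cast<float*>(&r), _mm_add_ps(va, vb));  // addps, aligned store
    return r;
}
```

Because the compiler sees the loads and stores as ordinary operations, it can keep values in registers across several additions, which hand-written asm blocks usually prevent.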

Is it really worth doing this? A decent compiler will probably produce the best optimisations for you anyway.
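To illustrate the point about leaving it to the compiler, here is a minimal sketch of a plain C++ addition written so that an optimizer has a fair chance of turning it into a single packed add. The `vec4f` type and `add` function are hypothetical names for this example. Whether a given compiler actually vectorizes it depends on the compiler version, target, and flags, so checking the generated assembly is the only way to be sure.

```cpp
// Hypothetical four-float vector stored as an array, aligned to 16 bytes
// so that aligned SSE loads are legal if the compiler chooses to use them.
struct alignas(16) vec4f {
    float v[4];
};

// Fixed trip count, no aliasing between inputs and the local result:
// a simple pattern for an auto-vectorizer to recognize.
inline vec4f add(const vec4f& a, const vec4f& b) {
    vec4f r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}
```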

When I'm in command, every mission's a suicide mission!

Yes, I guess it should be, since vector addition is performed thousands of times per frame...

my algorithm uses:
2*mov ( 1 clock cycle in this case )
3*movaps ( 3 clock cycles )
1*addps ( 1 to 3 clock cycles)
-> 12 to 14 clock cycles

gcc's (-march=athlon-xp -O2) algorithm uses:
6*fld ( 1 to 2 cycles )
3*fadd (1 cycle )
3*fstp ( 1 to 2 cycles )
-> 21 to 30 clock cycles

intelC's (-march=pentiumiii -O2) algorithm uses:
-> 14 to 23 clock cycles
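The totals above are plain sums of instruction count times per-instruction cost. The sketch below redoes that arithmetic for the two itemized listings, using only the numbers quoted in the post; `CycleRange` and `cost` are hypothetical helpers for this example. It is a naive model that ignores pipelining and the difference between latency and throughput, so it says little about real timings. Note that the itemized x87 counts add up to 12 to 21 rather than the 21 to 30 quoted, so that figure presumably includes costs not listed.

```cpp
// A best-case / worst-case cycle estimate.
struct CycleRange {
    int lo;
    int hi;
};

// `count` instructions, each costing between `lo` and `hi` cycles.
constexpr CycleRange cost(int count, int lo, int hi) {
    return {count * lo, count * hi};
}

// Summing two estimates adds the bounds separately.
constexpr CycleRange operator+(CycleRange a, CycleRange b) {
    return {a.lo + b.lo, a.hi + b.hi};
}

// SSE version: 2 mov (1 cycle), 3 movaps (3 cycles), 1 addps (1 to 3 cycles).
constexpr CycleRange sse_total = cost(2, 1, 1) + cost(3, 3, 3) + cost(1, 1, 3);

// x87 version: 6 fld (1 to 2), 3 fadd (1), 3 fstp (1 to 2).
constexpr CycleRange x87_total = cost(6, 1, 2) + cost(3, 1, 1) + cost(3, 1, 2);

static_assert(sse_total.lo == 12 && sse_total.hi == 14, "matches the 12 to 14 quoted");
static_assert(x87_total.lo == 12 && x87_total.hi == 21, "itemized x87 sum");
```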
