# sse-accelerated vector arithmetics question

Hi. I coded a vector library some time ago, and now I've started to *optimize* it using SSE inline assembly. I'm stuck on even the easiest of all operations: addition. To be able to use "movaps" I have to align the vector's coordinates to a 2 byte boundary, right? (I did that.) Then I have to load the four coordinates of the first vector into an SSE register (xmm0) and those of the other vector into another one (xmm1) using movaps. Then I should be able to add them using addps and transfer them back to the returned vector's coordinates with movaps again, right? This is the code:
```cpp
#ifdef SSE
inline vector operator+(vector v1, const vector &v2)
{
#ifdef _UNIX_
    asm(
        "mov %0, %%eax\n"
        "mov %1, %%edx\n"

        "movaps (%%eax), %%xmm0\n"
        "movaps (%%edx), %%xmm1\n"
        "movaps  %%xmm0, (%%eax)\n"
        :
        :"m"(&v1), "m"(&v2)
        :"eax", "edx", "xmm0", "xmm1"
    );
#else // i didn't test this part
    __asm {
        movaps xmm0, v1
        movaps xmm0, v2
        movaps v1, xmm0
    }
#endif
    return v1;
}
#else
inline vector operator+(vector v1, const vector& v2)
{
    v1.x+=v2.x; v1.y+=v2.y; v1.z+=v2.z;
    return v1;
}
#endif
```

When I compile it with intelC it just segfaults, and when I compile it with gcc it returns _random_ results. Does anybody see what's wrong?
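For comparison, the load / add / store sequence described in the post can be sketched with SSE intrinsics rather than inline assembly. This is only a minimal sketch under stated assumptions: `Vec4` is a hypothetical stand-in for the poster's `vector` class (whose definition is not shown), and `alignas` needs a C++11 compiler. One known fact it relies on: `movaps` (and the matching `_mm_load_ps` / `_mm_store_ps` intrinsics) requires the memory operand to be aligned to 16 bytes, and a misaligned address raises a fault.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical stand-in for the poster's vector class (an assumption, not the
// original type). alignas(16) gives every instance the 16-byte alignment that
// aligned SSE loads and stores need.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Adds all four lanes with one packed add.
inline Vec4 operator+(const Vec4& a, const Vec4& b)
{
    // Debug-build check: an aligned load from a misaligned address would fault.
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);

    Vec4 r;
    __m128 va = _mm_load_ps(&a.x);            // aligned load of a.x..a.w
    __m128 vb = _mm_load_ps(&b.x);            // aligned load of b.x..b.w
    _mm_store_ps(&r.x, _mm_add_ps(va, vb));   // packed add, then aligned store
    return r;
}
```

With this sketch, `Vec4 c = a + b;` adds `x`, `y`, `z` and `w` in one instruction. On 32-bit x86 targets the compiler may need SSE enabled explicitly (for example `-msse` with gcc).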

##### Share on other sites
Is it really worth doing this? A decent compiler will probably produce the best optimisations for you anyway.
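As an illustration of what "leaving it to the compiler" could look like, here is a minimal sketch of a plain C++ addition with a fixed four-element loop. `PlainVec4` and `add_plain` are hypothetical names, not part of the poster's library. An optimizing compiler with SSE enabled is allowed to turn such a loop into a packed add, but whether it actually does depends on the compiler, its version and the flags used, so the generated assembly should be inspected before drawing conclusions.

```cpp
// Hypothetical plain-C++ vector type: no intrinsics, no inline assembly.
struct PlainVec4 {
    float v[4];
};

// Element-wise addition written as a simple fixed-length loop, which gives
// the optimizer a chance to vectorize it on its own.
inline PlainVec4 add_plain(const PlainVec4& a, const PlainVec4& b)
{
    PlainVec4 r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}
```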

---
When I'm in command, every mission's a suicide mission!

##### Share on other sites
Yes, I guess it should be, since a vector addition is performed thousands of times per frame...

my algorithm uses:
2*mov ( 1 clock cycle in this case )
3*movaps ( 3 clock cycles )
1*addps ( 1 to 3 clock cycles)
-> 12 to 14 clock cycles

gcc's (-march=athlon-xp -O2) algorithm uses:
9*mov
6*fld ( 1 to 2 cycles )
3*fstp ( 1 to 2 cycles )
-> 21 to 30 clock cycles

intelC's (-march=pentiumiii -O2) algorithm uses:
2*mov
6*fld