# SIMD Vector Instruction Problems


## Recommended Posts

Hi,
I am currently trying to convert some of my math library to take advantage of x86 SSE instructions. However, I am having problems when building with -O3 optimization in gcc: the results of the vector addition do not match the expected values. Upon further testing, I got incorrect results with -O1 as well.

main.cpp:
```cpp
#include <Aleph/vec4f.h>
#include <iostream>

using namespace aleph;
using namespace std;

int main( int _argc, char **_argv )
{
    vec4f a( 1.0f, 2.0f, 3.0f, 1.0f );
    vec4f b( 4.0f, 2.0f, 1.0f, 0.0f );
    vec4f c = a + b;
    a = c;
    cout << "c = " << c << endl;
    return 0;
}
```

The vec4f class overloads the addition operator (note that this is inlined):
```cpp
class vec4f
{
public:
    //...
    vec4f operator+( const vec4f &_b ) const
    {
        vec4f ret;
        add4f( m_fvData, _b.m_fvData, ret.m_fvData );
        return ret;
    };
    //...
private:
    float m_fvData[4];
};
```

add4f is defined as follows:
```cpp
#define FORCEINLINE __attribute__( ( always_inline ) )

namespace aleph
{
    FORCEINLINE void add4f( const float *_a, const float *_b, float *_r );
}

#ifdef ALEPH_SIMD
void aleph::add4f( const float *_a, const float *_b, float *_r )
{
    __asm__ volatile(
        "movups (%1), %%xmm0\n"
        "movups (%2), %%xmm1\n"
        "addps %%xmm1, %%xmm0\n"
        "movups %%xmm0, %0\n"
        : "=m" ( *_r )
        : "r" ( _a ), "r" ( _b )
    );
}
#else
#warning simulating SIMD instructions
void aleph::add4f( const float *_a, const float *_b, float *_r )
{
    _r[0] = _a[0] + _b[0];
    _r[1] = _a[1] + _b[1];
    _r[2] = _a[2] + _b[2];
    _r[3] = _a[3] + _b[3];
}
#endif // ALEPH_SIMD
```

When SIMD operations are simulated, the addition works correctly. When the application is built in debug mode, the addition also works correctly. However, when built using optimizations, only the first element of the vector is successfully added, while the other elements yield incorrect results that are consistent across runs.

It seems to me that gcc is modifying the order of the instructions in some way for optimization purposes, but I cannot figure out how. Does anyone with experience using inline asm know of any way to prevent this? This is my first venture in using SIMD operations, so I feel like there is something fundamental I am missing (my knowledge of gcc's inline asm is also relatively shaky).

##### Share on other sites
If you compile with -O3 -g you can use objdump on the executable or the object file to view the produced assembly:

```
objdump -S -d <executable>
```

You can compare that to your inlined assembly code if you suspect the compiler messed up.

If the compiler really did mess up, you might try using the intrinsics for SIMD instructions instead of inline assembly. They are even compatible with Microsoft's compiler.

##### Share on other sites

> It seems to me that gcc is modifying the order of the instructions in some way for optimization purposes, but I cannot figure out how. Does anyone with experience using inline asm know of any way to prevent this? This is my first venture in using SIMD operations, so I feel like there is something fundamental I am missing (my knowledge of gcc's inline asm is also relatively shaky).

I'd say you are missing the intrinsics, like `__m128` and `_mm_add_ps`. It is a lot easier than using inline assembly.

##### Share on other sites
Hi, thanks for your replies. I managed to solve the problem. It turns out that in certain situations the compiler optimized away the initialization of the vectors. This occurred because I did not indicate in the inline assembler that the code clobbered the locations in memory. This was solved by modifying the inline asm as follows:

```cpp
void aleph::add4f( const float *_a, const float *_b, float *_r )
{
    __asm__ volatile(
        "movups (%1), %%xmm0\n"
        "movups (%2), %%xmm1\n"
        "addps %%xmm1, %%xmm0\n"
        "movups %%xmm0, %0\n"
        : "=m" ( *_r )
        : "r" ( _a ), "r" ( _b )
        : "memory", "xmm0", "xmm1" // also declare the registers the asm overwrites
    );
}
```

##### Share on other sites
GCC has vector support that can generate SIMD code automatically, without requiring intrinsics or inline asm. Docs here:
http://gcc.gnu.org/o...Extensions.html

Out of interest, I'm pretty sure most compilers already generate SIMD code for general floating point operations these days, rather than generating code to use the FPU. Of course that's still operating on single values, so it's not exactly optimal :-)

Andy Firth at Bungie did a nice writeup of his approach to cross platform, SIMD-supporting math libs here.
And I rather like the vectorial library (still a work in progress, but certainly usable), which applies a lot of those ideas, uses the gcc vector extensions and supports Intel, ARM and PowerPC.

The golden rule: make sure you look at the code the compiler generates to be sure it's doing what you think it is!

Hope some of this is useful!
