SIMD Vector Instruction Problems


Hi,
I am currently trying to convert some of my math library to take advantage of x86 SSE instructions. However, I am running into problems when building with gcc's -O3 optimization: the results of the vector addition do not match the expected values. Further testing showed incorrect results with -O1 as well.

main.cpp:
[code]
#include <Aleph/vec4f.h>
#include <iostream>

using namespace aleph;
using namespace std;

int
main( int _argc, char **_argv )
{
    vec4f a( 1.0f, 2.0f, 3.0f, 1.0f );
    vec4f b( 4.0f, 2.0f, 1.0f, 0.0f );

    vec4f c = a + b;
    a = c;

    cout << "c = " << c << endl;

    return 0;
}
[/code]

The vec4f class overloads the addition operator (note that this is inlined):
[code]
class vec4f
{
public:
    //...
    vec4f operator+( const vec4f &_b ) const
    {
        vec4f ret;
        add4f( m_fvData, _b.m_fvData, ret.m_fvData );
        return ret;
    }
    //...
private:
    float m_fvData[4];
};
[/code]

add4f is defined as follows:
[code]
#define FORCEINLINE __attribute__( ( always_inline ) )

namespace aleph
{
    FORCEINLINE void add4f( const float *_a, const float *_b, float *_r );
}

#ifdef ALEPH_SIMD

void
aleph::add4f( const float *_a, const float *_b, float *_r )
{
    __asm__ volatile( "movups (%1), %%xmm0\n"
                      "movups (%2), %%xmm1\n"
                      "addps %%xmm1, %%xmm0\n"
                      "movups %%xmm0, %0\n"
                      : "=m" ( *_r )
                      : "r" ( _a ), "r" ( _b ) );
}

#else

#warning simulating SIMD instructions
void
aleph::add4f( const float *_a, const float *_b, float *_r )
{
    _r[0] = _a[0] + _b[0];
    _r[1] = _a[1] + _b[1];
    _r[2] = _a[2] + _b[2];
    _r[3] = _a[3] + _b[3];
}

#endif // ALEPH_SIMD
[/code]

When SIMD operations are simulated, the addition works correctly. When the application is built in debug mode, the addition also works correctly. However, when built using optimizations, only the first element of the vector is successfully added, while the other elements yield incorrect results that are consistent across runs.

It seems to me that gcc is reordering or eliminating instructions for optimization purposes, but I cannot figure out how. Does anyone with experience using inline asm know of a way to prevent this? This is my first venture into SIMD operations, so I feel like there is something fundamental I am missing (my knowledge of gcc's inline asm is also relatively shaky).

If you compile with -O3 -g you can use objdump on the executable or the object file to view the generated assembly:

[code]objdump -S -d executable[/code]

You can compare that to your inline assembly code if you suspect the compiler messed up.

If the compiler really messed up, you might try using the SIMD intrinsics instead of inline assembly. They are even compatible with Microsoft's compiler.

[quote name='othello' timestamp='1298056545' post='4776036']
It seems to me that gcc is modifying the order of the instructions in some way for optimization purposes, but I cannot figure out how. Does anyone with experience using inline asm know of any way to prevent this? This is my first venture in using SIMD operations, so I feel like there is something fundamental I am missing (my knowledge of gcc's inline asm is also relatively shaky).
[/quote]
I'd say you are missing the intrinsics like __m128 and _mm_add_ps. They are a lot easier than using inline assembly.
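For illustration, here is a minimal sketch of what an intrinsics version of the poster's add4f might look like (the function name and the scalar fallback are my own, not from the thread). With intrinsics the compiler tracks the data flow itself, so no clobber lists are needed:

```cpp
#include <cassert>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>

// Hypothetical intrinsics version of add4f: unaligned loads (like movups),
// a packed add, and an unaligned store.
inline void add4f_intrin( const float *_a, const float *_b, float *_r )
{
    __m128 a = _mm_loadu_ps( _a );
    __m128 b = _mm_loadu_ps( _b );
    _mm_storeu_ps( _r, _mm_add_ps( a, b ) );
}
#else
// Scalar fallback so the sketch also builds on non-x86 targets.
inline void add4f_intrin( const float *_a, const float *_b, float *_r )
{
    for ( int i = 0; i < 4; ++i )
        _r[i] = _a[i] + _b[i];
}
#endif
```

Because the loads, add, and store are ordinary expressions to the compiler, it can schedule and register-allocate around them safely, which is exactly what the raw asm block prevents.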

Hi, thanks for your replies. I managed to solve the problem. It turns out that in certain situations the compiler optimized away the initialization of the vectors, because I had not indicated in the inline assembly that the code clobbers memory. Adding a "memory" clobber fixed it:

[code]
void
aleph::add4f( const float *_a, const float *_b, float *_r )
{
    __asm__ volatile( "movups (%1), %%xmm0\n"
                      "movups (%2), %%xmm1\n"
                      "addps %%xmm1, %%xmm0\n"
                      "movups %%xmm0, %0\n"
                      : "=m" ( *_r )
                      : "r" ( _a ), "r" ( _b )
                      : "memory" );
}
[/code]
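As an aside (this is my own variation, not part of the fix above): a blanket "memory" clobber forces gcc to assume the asm may touch any memory, which inhibits optimization. gcc's extended-asm documentation suggests typed memory operands instead, which tell the compiler exactly which 16 bytes are read and written. Note that the asm also overwrites xmm0 and xmm1, which strictly should be declared as clobbers too:

```cpp
#include <cassert>

#if defined(__SSE__) || defined(__x86_64__)
// Sketch: precise memory constraints instead of a "memory" clobber.
// %0..%2 are memory operands here, so no parentheses in the asm template.
inline void add4f_typed( const float *_a, const float *_b, float *_r )
{
    __asm__( "movups %1, %%xmm0\n"
             "movups %2, %%xmm1\n"
             "addps  %%xmm1, %%xmm0\n"
             "movups %%xmm0, %0\n"
             : "=m" ( *(float (*)[4]) _r )
             : "m" ( *(const float (*)[4]) _a ),
               "m" ( *(const float (*)[4]) _b )
             : "xmm0", "xmm1" );
}
#else
// Scalar fallback so the sketch also builds on non-x86 targets.
inline void add4f_typed( const float *_a, const float *_b, float *_r )
{
    for ( int i = 0; i < 4; ++i )
        _r[i] = _a[i] + _b[i];
}
#endif
```

Casting to `float (*)[4]` and dereferencing makes the operand cover the whole four-float array, so gcc knows both inputs are read in full and the output is written in full, without pessimizing unrelated memory accesses.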

GCC has vector support that can generate SIMD code automatically, without requiring intrinsics or inline asm. Docs here:
[url="http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html"]http://gcc.gnu.org/o...Extensions.html[/url]
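A minimal sketch of those vector extensions (the typedef and function name here are my own, not from the thread): arithmetic operators work element-wise on vector types, and gcc emits SSE, AltiVec, or NEON instructions where the target supports them.

```cpp
#include <cassert>

// A 16-byte vector of four floats, using gcc's generic vector extension.
typedef float v4f __attribute__( ( vector_size( 16 ) ) );

// Element-wise addition; on SSE targets this typically compiles to addps.
inline v4f add4f_vec( v4f a, v4f b )
{
    return a + b;
}
```

Unlike intrinsics, this code is target-independent: the same source builds on x86, ARM, and PowerPC, with gcc picking the instruction set.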

Out of interest, I'm pretty sure most compilers already generate scalar SSE code for general floating point operations these days, rather than x87 FPU code. Of course that's still operating on single values, so it's not exactly optimal :-)

Andy Firth at Bungie did a nice writeup of his approach to cross platform, SIMD-supporting math libs [url="http://andyfirth.blogspot.com/2010/07/becoming-console-programmer-math.html"]here[/url].
And I rather like the [url="http://github.com/scoopr/vectorial"]vectorial[/url] library (still a work in progress, but certainly usable), which applies a lot of those ideas, uses the gcc vector extensions and supports intel, arm and powerpc.

The golden rule: make sure you look at the code the compiler generates to be sure it's doing what you think it is!

Hope some of this is useful!
