mmx matrix multiply

Started by
11 comments, last by carllloyd 16 years, 5 months ago
I have a matrix, which is 4x4 floats, and a Vector3, which is 3 floats. Up till now I've been using plain old C++ to do my transform. Now I have to do a fair amount of matrix multiplies on the CPU, and it's taking most of my CPU time.

Vector3 result = someMatrix.Transform(someVector3);

inline Vector3 Matrix4::Transform(const Vector3& vector) const
{
	Vector3 ret;
	ret.x = (m[0] * vector.x) + (m[4] * vector.y) + (m[8] * vector.z) + m[12];
	ret.y = (m[1] * vector.x) + (m[5] * vector.y) + (m[9] * vector.z) + m[13];
	ret.z = (m[2] * vector.x) + (m[6] * vector.y) + (m[10] * vector.z) + m[14];
	return ret;
}
So I tried converting it to MMX like so:

inline Vector3 Matrix4::Transform(const Vector3& vector) const
{
	__m128 x = _mm_set_ps1(vector.x);
	__m128 y = _mm_set_ps1(vector.y);
	__m128 z = _mm_set_ps1(vector.z);

	__align16 Matrix4 alignMat(m);

	__m128* matX = (__m128*)&alignMat.m[0];
	__m128* matY = (__m128*)&alignMat.m[4];
	__m128* matZ = (__m128*)&alignMat.m[8];
	__m128* matTrans = (__m128*)&alignMat.m[12];

	__align16 Vector4 ret;
	__m128* r = (__m128*)&ret;
	*r = _mm_mul_ps(*matX, x);
	*r = _mm_add_ps(*r, _mm_mul_ps(*matY, y));
	*r = _mm_add_ps(*r, _mm_mul_ps(*matZ, z));
	*r = _mm_add_ps(*r, *matTrans);

	return ret;
}
Now it turns out the old code is faster, and I think it's due to a number of factors:
1. Returning a class, so the copy constructor is called (both versions have to do this anyway). I could speed this up by passing in a Vector3 to write to, but the MMX code would still need to copy to the Vector3 after completing, as it handles 4 floats at a time.
2. My matrix and vectors aren't 16-byte aligned, so in the multiply I have to make temporary variables that are aligned so I can run the MMX code. I guess I could remove my Vector2 and Vector3 classes and just use Vector4, but the idea was to not have to upload redundant data to the GPU from my maths library.
I am no expert in x86 assembly, but if I'm not mistaken, MMX can only handle integers.

You should check SSE / 3DNow! instead.
I think the OP means SSE. You're right on both your suspicions, but the one that is mainly responsible is more or less (2). People often think that SSE can be dropped into the core of an algorithm (like a matrix multiply routine or similar) and provide a big speed-up. This is almost never the case, as to get any decent performance improvement you need to a) make your input and output data formats SSE-friendly and b) batch your work together to process a large chunk of data at once.

In your case, this would mean making the input and output memory locations 16-byte aligned, using Vector4 for the vectors, and possibly rewriting your inner loop to batch all your vectors to be transformed (along with the matrices) together and do them all in one go. You should find you get a significant performance improvement if you do this.
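As a rough sketch of what that advice could look like (not the OP's actual classes): the types Vec4 and Mat4 and the function TransformArray below are hypothetical names, the matrix is assumed to use the same column-major layout as the OP's m[16], and w is treated as 1 so the translation column is simply added, as in the original scalar code. alignas needs C++11; on older compilers a vendor-specific alignment attribute would take its place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical 16-byte aligned types (assumptions, not the OP's classes).
struct alignas(16) Vec4 { float x, y, z, w; };
struct alignas(16) Mat4 { float m[16]; };   // column-major, as in the OP's code

// Transforms 'count' vectors in one call. The matrix columns are loaded once
// and reused for every vector, and no temporary aligned copies are needed.
void TransformArray(const Mat4& mat, const Vec4* in, Vec4* out, std::size_t count)
{
    // _mm_load_ps / _mm_store_ps require 16-byte aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    const __m128 col0 = _mm_load_ps(&mat.m[0]);
    const __m128 col1 = _mm_load_ps(&mat.m[4]);
    const __m128 col2 = _mm_load_ps(&mat.m[8]);
    const __m128 col3 = _mm_load_ps(&mat.m[12]);   // translation column

    for (std::size_t i = 0; i < count; ++i)
    {
        // Broadcast each component across all four lanes.
        const __m128 x = _mm_set1_ps(in[i].x);
        const __m128 y = _mm_set1_ps(in[i].y);
        const __m128 z = _mm_set1_ps(in[i].z);

        // result = col0*x + col1*y + col2*z + col3 (w assumed to be 1)
        __m128 r = _mm_add_ps(_mm_mul_ps(col0, x), col3);
        r = _mm_add_ps(r, _mm_mul_ps(col1, y));
        r = _mm_add_ps(r, _mm_mul_ps(col2, z));

        _mm_store_ps(&out[i].x, r);
    }
}
```

Whether this actually beats the scalar version depends on how many vectors are processed per call and on the compiler, so it is worth profiling both.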
To save the copy constructor call declare the function as:

// Could use & instead of * if you prefer for the result
inline void Matrix4::Transform(const Vector3& vector, Vector3 *pDest) const
{
	Vector3 ret;
	ret.x = (m[0] * vector.x) + (m[4] * vector.y) + (m[8] * vector.z) + m[12];
	ret.y = (m[1] * vector.x) + (m[5] * vector.y) + (m[9] * vector.z) + m[13];
	ret.z = (m[2] * vector.x) + (m[6] * vector.y) + (m[10] * vector.z) + m[14];
	*pDest = ret;
}


Also make sure that function is in the .h file and not the .cpp so it'll get inlined in other files.

In addition make sure your default constructor for a Vector3 does nothing and is also declared in the .h file. You can make other constructors that initialize the members if you want to (e.g. one that takes 3 float parameters).

If you want it to go faster than that then you'll want to make a function that takes an array of vectors as the input, and returns an array as the result which will go noticeably quicker than doing one at a time, especially with SSE. To save on writing it yourself you may find D3DXVec3TransformArray() useful ;)
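A minimal sketch of the array-based interface described above, in plain C++ with no SSE. The struct layouts and the name TransformArray are assumptions for illustration; this is not the signature of D3DXVec3TransformArray.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the OP's classes.
struct Vector3 { float x, y, z; };

struct Matrix4
{
    float m[16];   // column-major, as in the OP's code

    // Transforms 'count' vectors from 'in' to 'out' in a single call, so the
    // per-call overhead and the returned temporary are paid once, not per vector.
    void TransformArray(const Vector3* in, Vector3* out, std::size_t count) const
    {
        for (std::size_t i = 0; i < count; ++i)
        {
            const Vector3 v = in[i];   // local copy, so in == out also works
            out[i].x = (m[0] * v.x) + (m[4] * v.y) + (m[8] * v.z) + m[12];
            out[i].y = (m[1] * v.x) + (m[5] * v.y) + (m[9] * v.z) + m[13];
            out[i].z = (m[2] * v.x) + (m[6] * v.y) + (m[10] * v.z) + m[14];
        }
    }
};
```

A simple loop like this also gives the compiler a chance to unroll or vectorise it on its own.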
Quote:Original post by Adam_42
To save the copy constructor call declare the function as:

*** Source Snippet Removed ***

Also make sure that function is in the .h file and not the .cpp so it'll get inlined in other files.


EVIL EVIL EVIL.

If you want to avoid copy constructors then you should do it the "correct" way, without changing the public interface, which generally means using expression templates.

Quote:In addition make sure your default constructor for a Vector3 does nothing and is also declared in the .h file. You can make other constructors that initialize the members if you want to (e.g. one that takes 3 float parameters).


Also evil. If you use expression templates then the compiler will be able to optimize away the initialization (actually, if done properly, you don't even have to rely on the compiler's optimizer; you can use the type system to ensure that it doesn't get initialized unless it's used before being assigned).
So what I'm thinking of doing is this:

Make my matrix and Vector4 classes 16-byte aligned to be SSE friendly.

i.e.
class Vector4
{
	__align16 float v[4];
};

Thing is, will this create problems where the first vector in an array may include some padding?

e.g.
std::vector<Vector4>

Will there be some form of padding somewhere in here? Or will the compiler automatically do something like this:
__align16 std::vector<Vector4>

Remove my Vector3 and Vector2 classes, as these can't be accelerated via SSE easily.

I can then change my transform to:
inline void Matrix4::Transform(const Vector4& vector, Vector3& result) const

That way I can avoid both my problems.
Sound like a good solution?

BTW, how do expression templates avoid using the copy constructor?
Quote:Original post by Julian90
EVIL EVIL EVIL.

If you want to avoid copy constructors then you should do it the "correct" way, without changing the public interface, which generally means using expression templates.

Quote:In addition make sure your default constructor for a Vector3 does nothing and is also declared in the .h file. You can make other constructors that initialize the members if you want to (e.g. one that takes 3 float parameters).


Also evil. If you use expression templates then the compiler will be able to optimize away the initialization (actually, if done properly, you don't even have to rely on the compiler's optimizer; you can use the type system to ensure that it doesn't get initialized unless it's used before being assigned).


I think this is slightly heavy-handed. Expression templates are indeed a useful technique for minimising the creation of temporaries. However, they are also complex and have disadvantages of their own. It is perfectly reasonable to have a Transform function with signature

void Transform(Vector3& v, const Matrix4& m)

which avoids the copy constructor at no cost in readability. Adam_42's recommendation that the default constructor not initialise the vector is standard practice when creating a non-ET based math library which the OP clearly has.

Given the original question asked about optimising code to use SSE, effectively telling the OP to "write a new math library" doesn't really deal with the issue at hand.
I have a somewhat related question to the OP. I read an article which somewhat vaguely hinted, but never explicitly stated, that you need the intel c++ compiler for SSE instructions to compile properly (he said the intel compiler also automatically aligns all vectors and matrices for the user btw).
1. Did he mean that only the intel compiler can produce fast SSE code? Or are there other compilers which are known to produce just as fast code? For instance, will visual studio express do it? What about the normal visual studio? And what about GCC?
2. I've heard AMD has similar instructions for speeding up matrix and vector operations. Does AMD support SSE, or do they have a second instruction set? If yes, doesn't that mean that programmers must ship their programs with 2 different builds, one adapted to AMD, and one adapted to Intel if you want fast operations on both CPU types?
Quote:Original post by d00fus
I think this is slightly heavy-handed. Expression templates are indeed a useful technique for minimising the creation of temporaries. However, they are also complex and have disadvantages of their own.

<snip>

Given the original question asked about optimising code to use SSE, effectively telling the OP to "write a new math library" doesn't really deal with the issue at hand.


OK, sorry, agreed. I was too "heavy-handed"; I'm just tired and have a real dislike for the "return value as reference" type of thing. How about I try to be a bit more helpful.

Quote:so what im thinking of doing is this:
make my matrix and vector4 classes 16byte aligned to be sse friendly.
<snip>


This will work but, as you've noticed, it has some disadvantages: you have to use Vector4s where you really want Vector3s, and hence waste some computation. There are alternatives, which I'll outline below (these have their own disadvantages, one of which is that they require larger-scale changes, so just using Vector4 may still be preferable).

Quote:Thing is, will this create problems where the first vector in an array may include some padding

eg:
stl::vector<Vector4>

will there be some form of padding between some where here? or will the compiler automatically do something like this:
__align16 stl::vector<Vector4>


The variables only get aligned automatically when the compiler places them itself, for example on the stack. To align dynamically allocated data structures you need to use _aligned_malloc, which for the standard C++ library containers generally means writing a custom allocator.
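A minimal sketch of such a custom allocator, under a few assumptions: it is written against C++11, the names AlignedAllocator and Vector4 are illustrative, and it uses _mm_malloc / _mm_free from <xmmintrin.h> rather than _aligned_malloc, because those are available on GCC and Clang as well as Visual C++.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h>

// Hypothetical aligned vector type. alignas affects the type itself, so
// sizeof(Vector4) stays 16 and array elements are packed with no padding.
struct alignas(16) Vector4 { float v[4]; };
static_assert(sizeof(Vector4) == 16, "no padding expected inside Vector4");

// Minimal C++11 allocator that hands out 16-byte aligned heap blocks.
template <typename T, std::size_t Alignment = 16>
struct AlignedAllocator
{
    typedef T value_type;

    AlignedAllocator() {}
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) {}

    // Needed explicitly because of the non-type template parameter.
    template <typename U>
    struct rebind { typedef AlignedAllocator<U, Alignment> other; };

    T* allocate(std::size_t n)
    {
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

// Usage: the vector's storage now starts on a 16-byte boundary.
typedef std::vector<Vector4, AlignedAllocator<Vector4> > Vector4Array;
```

From C++17 onwards std::allocator itself honours over-aligned types, so a custom allocator like this is mainly of interest for older toolchains.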

Quote:remove my vector3 and vector2 classes, as these cant be accelerated via sse easily


They can be accelerated, just not as easily; once again, see below.

Quote:I can then change my transform to:
inline void Matrix4::Transform(const Vector4& vector, Vector3& result) const

that way i can avoid both my problems.
Sound like a good solution?


Sounds good.

Quote:BTW how does expression templates avoid using the copy constructor?


Basically you return a temporary object which represents the computation to be done. Then, when the temporary object is assigned to the place where you want the result, you can perform the computation in place without creating the intermediate temporaries. It makes more sense with an example:

#include <cstddef>

struct Vector3
{
    float x, y, z;

    // The templated constructor below suppresses the implicit default
    // constructor, so one is declared explicitly
    Vector3() : x(0), y(0), z(0) { }

    // Standard operator[]
    float operator[](std::size_t i) const { return i == 0 ? x : (i == 1 ? y : z); }

    // Constructor and operator= to create a Vector3 from a stored
    // computation (expression template)
    template<typename T>
    Vector3(const T& rhs) : x(rhs[0]), y(rhs[1]), z(rhs[2]) { }

    template<typename T>
    Vector3& operator=(const T& rhs) { x = rhs[0]; y = rhs[1]; z = rhs[2]; return *this; }
};

// The temporary that represents addition
template<typename Left, typename Right>
struct Add
{
    Add(const Left& lhs, const Right& rhs) : lhs(lhs), rhs(rhs) { }

    // Calculate the i'th result
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }

private:
    Left lhs;
    Right rhs;
};

template<typename Left, typename Right>
Add<Left, Right> operator+(const Left& left, const Right& right)
{
    return Add<Left, Right>(left, right);
}

// Example
int main()
{
    // Now assuming you have a decent optimizing compiler this:
    Vector3 a, b, c;
    Vector3 d = a + b + c;

    // Which would normally compile as
    //     Vector3 temp1 = a + b;
    //     Vector3 d = temp1 + c;
    // Will become
    //     d.x = a.x + b.x + c.x;
    //     d.y = a.y + b.y + c.y;
    //     d.z = a.z + b.z + c.z;
}


But as was mentioned above this is probably overkill for your situation.

Now for the alternative way to vectorize things that I mentioned earlier. The basic idea is that instead of working on one vector at a time, you work on four. So, for example, rather than computing four separate dot products and trying to use SSE inside each one, you use SSE to compute all four dot products at the same time, e.g.

#include <xmmintrin.h>

// Represents four Vector3's
struct Vector4x3
{
    __m128 x;
    __m128 y;
    __m128 z;
};

// Computes four dot products on Vector3's
__m128 dot(const Vector4x3& lhs, const Vector4x3& rhs)
{
    Vector4x3 temp;
    temp.x = _mm_mul_ps(lhs.x, rhs.x);
    temp.y = _mm_mul_ps(lhs.y, rhs.y);
    temp.z = _mm_mul_ps(lhs.z, rhs.z);
    return _mm_add_ps(temp.z, _mm_add_ps(temp.x, temp.y));
}
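The same four-at-a-time layout can be applied to the OP's matrix transform. The following is only a sketch under assumptions: Pack4, MulAdd and Transform4 are hypothetical helper names, the matrix is a plain column-major float[16] as in the OP's code, and w is taken to be 1 for every point.

```cpp
#include <xmmintrin.h>

// Four Vector3's stored component-wise (structure of arrays).
struct Vector4x3
{
    __m128 x;
    __m128 y;
    __m128 z;
};

// Packs four (x, y, z) points into the component-wise layout.
inline Vector4x3 Pack4(const float p0[3], const float p1[3],
                       const float p2[3], const float p3[3])
{
    Vector4x3 v;
    v.x = _mm_setr_ps(p0[0], p1[0], p2[0], p3[0]);
    v.y = _mm_setr_ps(p0[1], p1[1], p2[1], p3[1]);
    v.z = _mm_setr_ps(p0[2], p1[2], p2[2], p3[2]);
    return v;
}

// Computes a*x + b*y + c*z + d for four points at once.
inline __m128 MulAdd(float a, float b, float c, float d, const Vector4x3& v)
{
    __m128 r = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a), v.x), _mm_set1_ps(d));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(b), v.y));
    return _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(c), v.z));
}

// Transforms four points by a column-major 4x4 matrix (w assumed to be 1).
inline Vector4x3 Transform4(const float m[16], const Vector4x3& v)
{
    Vector4x3 r;
    r.x = MulAdd(m[0], m[4], m[8],  m[12], v);
    r.y = MulAdd(m[1], m[5], m[9],  m[13], v);
    r.z = MulAdd(m[2], m[6], m[10], m[14], v);
    return r;
}
```

Every lane does useful work here, which is the main attraction over padding each Vector3 out to a Vector4. The cost is that the data has to be stored, or repacked, in component-wise form.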
Quote:Original post by all_names_taken
I have a somewhat related question to the OP. I read an article which somewhat vaguely hinted, but never explicitly stated, that you need the intel c++ compiler for SSE instructions to compile properly (he said the intel compiler also automatically aligns all vectors and matrices for the user btw).
1. Did he mean that only the intel compiler can produce fast SSE code? Or are there other compilers which are known to produce just as fast code? For instance, will visual studio express do it? What about the normal visual studio? And what about GCC?
2. I've heard AMD has similar instructions for speeding up matrix and vector operations. Does AMD support SSE, or do they have a second instruction set? If yes, doesn't that mean that programmers must ship their programs with 2 different builds, one adapted to AMD, and one adapted to Intel if you want fast operations on both CPU types?


Typically to use SSE you will either write the assembly by hand or use so-called intrinsics, which are slightly higher level but map more or less directly to assembly instructions. To write the assembly by hand all you need is the ability to write inline assembly, which any decent compiler should provide. AFAIK VS, GCC and Intel all support intrinsics as well.

AMD supports SSE also; the level of support will depend on the chip. AMD also originally introduced 3DNow! (and a bunch of enhancements), which was intended to be a similar SIMD instruction set. You may need to write different code for Intel and AMD but typically not because of needing different instructions, but rather to handle differences in memory access latencies and other platform specific characteristics.
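To illustrate the point about intrinsics being usable across compilers, here is a small sketch that picks an SSE path at compile time and falls back to plain C++ otherwise. MATH_HAS_SSE and Add4 are hypothetical names; the predefined macros checked are the ones GCC/Clang (__SSE__) and Visual C++ (_M_X64, _M_IX86_FP) define when SSE code generation is enabled.

```cpp
// Decide at compile time whether SSE intrinsics can be used.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define MATH_HAS_SSE 1
#include <xmmintrin.h>
#else
#define MATH_HAS_SSE 0
#endif

// Adds two groups of four floats; the same source builds with or without SSE.
inline void Add4(const float* a, const float* b, float* out)
{
#if MATH_HAS_SSE
    // Unaligned loads and stores, so callers need no special alignment.
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

This only covers what the compiler is allowed to emit. Checking what the CPU supports at run time is a separate step and is not shown here.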
