Quote:Original post by d00fus
I think this is slightly heavy-handed. Expression templates are indeed a useful technique for minimising the creation of temporaries. However, they are also complex and have disadvantages of their own.
<snip>
Given the original question asked about optimising code to use SSE, effectively telling the OP to "write a new math library" doesn't really deal with the issue at hand.
OK, sorry, agreed. I was too heavy-handed; I'm just tired and have a real dislike for "return value as reference" style interfaces. Let me try to be a bit more helpful.
Quote:so what im thinking of doing is this:
make my matrix and vector4 classes 16byte aligned to be sse friendly.
<snip>
This will work but, as you've noticed, it has the disadvantage that you have to use Vector4s where you really want Vector3s, which wastes some computation. There are alternatives, which I'll outline below. These have their own disadvantages, one of which is that they require larger-scale changes, so just using Vector4 may still be preferable.
Quote:Thing is, will this create problems where the first vector in an array may include some padding
eg:
stl::vector<Vector4>
will there be some form of padding between some where here? or will the compiler automatically do something like this:
__align16 stl::vector<Vector4>
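There is no padding between the elements (sizeof(Vector4) is already 16), so what matters is the address of the first element. The alignment specifier only takes effect for variables the compiler places itself, such as those on the stack. Memory that a container allocates dynamically is not covered, so you need to use something like _aligned_malloc (MSVC) or _mm_malloc to allocate data structures that must be aligned. In practice that generally means writing a custom allocator for the standard C++ library containers.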
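As a rough sketch of what such an allocator could look like, assuming a C++11 compiler and the _mm_malloc/_mm_free pair from <xmmintrin.h>: the names AlignedAllocator, Vector4Array and storage_is_aligned are made up for illustration, and the Vector4 here is only a stand-in for your own class.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h>   // _mm_malloc / _mm_free

// Stand-in for the real Vector4 class (assumption: four packed floats).
struct Vector4 { float x, y, z, w; };

// Minimal allocator that hands out storage aligned to Alignment bytes.
template<typename T, std::size_t Alignment = 16>
struct AlignedAllocator
{
    typedef T value_type;

    AlignedAllocator() { }
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) { }

    // Needed explicitly because of the non-type Alignment parameter.
    template<typename U>
    struct rebind { typedef AlignedAllocator<U, Alignment> other; };

    T* allocate(std::size_t n)
    {
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

// All instances are interchangeable, so they always compare equal.
template<typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template<typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

typedef std::vector<Vector4, AlignedAllocator<Vector4> > Vector4Array;

// True if the first element sits on a 16-byte boundary.
bool storage_is_aligned(const Vector4Array& v)
{
    return reinterpret_cast<std::uintptr_t>(v.data()) % 16 == 0;
}
```

With something like this, a Vector4Array keeps its first element on a 16-byte boundary even after it grows, so aligned loads such as _mm_load_ps should be safe on its elements.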
Quote:remove my vector3 and vector2 classes, as these cant be accelerated via sse easily
They can be accelerated, just not as easily; once again, see below.
Quote:I can then change my transform to:
inline void Matrix4::Transform(const Vector4& vector, Vector3& result) const
that way i can avoid both my problems.
Sound like a good solution?
Sounds good.
Quote:BTW how does expression templates avoid using the copy constructor?
Basically, you return a temporary object that represents the computation to be done. When that temporary is assigned to the place where you want the result, the computation can be performed in place without creating the intermediate temporaries. It makes more sense with an example:
#include <cstddef>

struct Vector3
{
    float x, y, z;

    Vector3() : x(0), y(0), z(0) { }

    // Standard operator[]
    float operator[](std::size_t i) const { return i == 0 ? x : (i == 1 ? y : z); }

    // Converting constructor and operator= to create a Vector3 from a stored
    // computation (expression template)
    template<typename T>
    Vector3(const T& rhs) : x(rhs[0]), y(rhs[1]), z(rhs[2]) { }

    template<typename T>
    Vector3& operator=(const T& rhs)
    {
        x = rhs[0]; y = rhs[1]; z = rhs[2];
        return *this;
    }
};

// The temporary that represents addition
template<typename Left, typename Right>
struct Add
{
    Add(const Left& lhs, const Right& rhs) : lhs(lhs), rhs(rhs) { }

    // Calculate the i'th result
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }

private:
    Left lhs;
    Right rhs;
};

// Note: this operator+ is unconstrained and will match any two types;
// a real library would restrict it to vector expressions.
template<typename Left, typename Right>
Add<Left, Right> operator+(const Left& left, const Right& right)
{
    return Add<Left, Right>(left, right);
}

// Example
int main()
{
    // Now assuming you have a decent optimizing compiler this:
    Vector3 a, b, c;
    Vector3 d = a + b + c;

    // Which would normally compile as
    //     Vector3 temp1 = a + b;
    //     Vector3 d = temp1 + c;
    // Will become
    //     d.x = a.x + b.x + c.x;
    //     d.y = a.y + b.y + c.y;
    //     d.z = a.z + b.z + c.z;
}
But as was mentioned above this is probably overkill for your situation.
Now for the alternative way to vectorize things that I mentioned earlier. The basic idea is that instead of working on one vector at a time, you work on four. For example, rather than computing four separate dot products and trying to use SSE within each one, you use SSE to compute all four dot products at the same time, e.g.
#include <xmmintrin.h>

// Represents four Vector3's
struct Vector4x3
{
    __m128 x;   // x components of the four vectors
    __m128 y;   // y components of the four vectors
    __m128 z;   // z components of the four vectors
};

// Computes four dot products on Vector3's
__m128 dot(const Vector4x3& lhs, const Vector4x3& rhs)
{
    Vector4x3 temp;
    temp.x = _mm_mul_ps(lhs.x, rhs.x);
    temp.y = _mm_mul_ps(lhs.y, rhs.y);
    temp.z = _mm_mul_ps(lhs.z, rhs.z);
    return _mm_add_ps(temp.z, _mm_add_ps(temp.x, temp.y));
}
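To show how four ordinary vectors might get into and out of that layout, here is a small sketch. Vec3, pack and dot4 are hypothetical names invented for the example, and Vector4x3 and dot are repeated only so the snippet compiles on its own. In real code you would probably store your data in this layout from the start rather than repacking for every call, since the packing can cost more than the dot products save.

```cpp
#include <xmmintrin.h>

// Hypothetical plain scalar vector, standing in for your Vector3 class.
struct Vec3 { float x, y, z; };

// Four Vector3's stored component-wise.
struct Vector4x3 { __m128 x, y, z; };

// Gathers four scalar vectors into one Vector4x3.
// _mm_set_ps takes its arguments from the highest lane down to the lowest,
// so 'a' ends up in lane 0 and 'd' in lane 3.
Vector4x3 pack(const Vec3& a, const Vec3& b, const Vec3& c, const Vec3& d)
{
    Vector4x3 r;
    r.x = _mm_set_ps(d.x, c.x, b.x, a.x);
    r.y = _mm_set_ps(d.y, c.y, b.y, a.y);
    r.z = _mm_set_ps(d.z, c.z, b.z, a.z);
    return r;
}

// Four dot products at once: lane i holds dot(lhs_i, rhs_i).
__m128 dot(const Vector4x3& lhs, const Vector4x3& rhs)
{
    __m128 xx = _mm_mul_ps(lhs.x, rhs.x);
    __m128 yy = _mm_mul_ps(lhs.y, rhs.y);
    __m128 zz = _mm_mul_ps(lhs.z, rhs.z);
    return _mm_add_ps(zz, _mm_add_ps(xx, yy));
}

// Writes out[i] = dot(lhs[i], rhs[i]) for i = 0..3.
void dot4(const Vec3 lhs[4], const Vec3 rhs[4], float out[4])
{
    __m128 r = dot(pack(lhs[0], lhs[1], lhs[2], lhs[3]),
                   pack(rhs[0], rhs[1], rhs[2], rhs[3]));
    _mm_storeu_ps(out, r);   // unaligned store, so 'out' needs no special alignment
}
```

Note that no lane ever needs to be summed with another lane of the same register, which is exactly what makes a single Vector3 dot product awkward in SSE.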