sgtcodeboy

Vector Length using SIMD


I'm learning SIMD so I can do 3D math with it (vectors, matrices, etc.). I figured out how to calculate the length of a vector, but my implementation seems a bit verbose, and I'm wondering if there's a better way. I know you're not supposed to access the members of the __m128 directly. The generated assembly shows a lot of movaps instructions.


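#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_shuffle_ps, _mm_mul_ps, ...

// Builds the 8-bit immediate for _mm_shuffle_ps: two bits per result lane, lane 0 in the low bits.
#define SHUFFLE_PARAMS(x,y,z,w) ((x) | ((y) << 2) | ((z) << 4) | ((w) << 6))

// Each macro broadcasts one lane of v to all four lanes.
#define _mm_replicate_x_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAMS(0,0,0,0))
#define _mm_replicate_y_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAMS(1,1,1,1))
#define _mm_replicate_z_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAMS(2,2,2,2))
#define _mm_replicate_w_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAMS(3,3,3,3))

// Returns the length of the xyz part of vec, replicated in all four lanes.
__m128 vector_length(__m128 vec)
{
    __m128 squared = _mm_mul_ps(vec,vec); // (x*x, y*y, z*z, w*w)

    __m128 x = _mm_replicate_x_ps(squared);
    __m128 y = _mm_replicate_y_ps(squared);
    __m128 z = _mm_replicate_z_ps(squared);

    // The w lane is left out of the sum on purpose.
    __m128 added = _mm_add_ps(_mm_add_ps(x,y),z);

    __m128 length = _mm_sqrt_ps(added);
    return length;
}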

Honestly, this is a really bad use of SIMD code.

If you wish to do this, then make it a batch operation over a large number of vectors whose lengths you compute simultaneously. Otherwise you're just wasting your time writing inefficient SIMD code.

Thank you for your reply. I'm just learning, so all feedback is good. I understand the point about batching and I'll definitely keep it in mind. Assuming I'm inside the batch, is there a better or different way to handle the mechanics of calculating a vector's length with SIMD?


struct Vec3
{
    __m128 x;
    __m128 y;
    __m128 z;
};

__m128 vector_length(const Vec3& a)
{
    const __m128 x2 = _mm_mul_ps(a.x, a.x);
    const __m128 y2 = _mm_mul_ps(a.y, a.y);
    const __m128 z2 = _mm_mul_ps(a.z, a.z);
    return _mm_sqrt_ps( _mm_add_ps(z2,_mm_add_ps(x2, y2)) );
}


Quote:
Original post by RobTheBloke
*** Source Snippet Removed ***


What? A 48-byte 3-component vector? I don't think that's what Washu had in mind. Batching means 'use structure-of-arrays and compute lengths in parallel,' not 'increase storage overhead by 400%.' As was said, this is actually not a problem SSE is well-suited to solve. You could try using the horizontal adds introduced in SSE3 but be advised this will still likely result in marginal improvement.
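For reference, the SSE3 horizontal add mentioned here is _mm_hadd_ps from <pmmintrin.h>. Below is a minimal sketch of a single-vector length built on it, not code from the thread: the name vector_length_hadd and the SSE3_TARGET macro are made up for illustration, and the w lane is masked to zero on the assumption that only x, y and z should contribute. Whether it beats the shuffle version would need measuring.

```cpp
#include <pmmintrin.h> // SSE3: _mm_hadd_ps (also pulls in the SSE and SSE2 headers)

// GCC and Clang need SSE3 enabled for this function (or -msse3 for the whole file).
#if defined(__GNUC__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

// Hypothetical helper: length of the xyz part of vec, replicated in all
// four lanes like the original vector_length().
SSE3_TARGET inline __m128 vector_length_hadd(__m128 vec)
{
    // All-ones in lanes 0..2, zero in lane 3, so w cannot leak into the sum.
    const __m128 xyz_mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    const __m128 squared  = _mm_and_ps(_mm_mul_ps(vec, vec), xyz_mask); // (x2, y2, z2, 0)

    const __m128 pairs = _mm_hadd_ps(squared, squared); // (x2+y2, z2, x2+y2, z2)
    const __m128 sum   = _mm_hadd_ps(pairs, pairs);     // x2+y2+z2 in every lane
    return _mm_sqrt_ps(sum);
}
```

This still spends a whole register on one vector, so it only shortens the code; it does not change the point that a single length is a poor fit for SSE.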

Newbie clarification here: when we're talking about a structure of arrays and calculating in parallel, are we talking about something like this


struct vector_lengths
{
    __m128 Vec[NN];
    __m128 Length[NN];
};

for( int i = 0; i < NN; i++)
{
    // SIMD code to calculate vector length here..
}




in order to avoid the cost of the function calls, etc., as opposed to something like CUDA, which would actually calculate all of the lengths in parallel on anywhere from 32 to hundreds of threads?
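For comparison, a structure-of-arrays batch is more commonly laid out as one plain float array per component, with each loop iteration consuming four vectors. The sketch below assumes that layout; the function name vector_lengths and its parameters are illustrative, not from the thread, and unaligned loads are used so the arrays need no special alignment.

```cpp
#include <cmath>       // std::sqrt for the scalar tail
#include <cstddef>     // std::size_t
#include <xmmintrin.h> // SSE

// Hypothetical batch helper: xs, ys, zs and lengths each hold count floats,
// and element i of each array belongs to vector i.
inline void vector_lengths(const float* xs, const float* ys, const float* zs,
                           float* lengths, std::size_t count)
{
    std::size_t i = 0;

    // Main loop: four vectors per iteration, one per SIMD lane.
    for (; i + 4 <= count; i += 4)
    {
        const __m128 x = _mm_loadu_ps(xs + i);
        const __m128 y = _mm_loadu_ps(ys + i);
        const __m128 z = _mm_loadu_ps(zs + i);

        const __m128 sum = _mm_add_ps(_mm_mul_ps(z, z),
                           _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)));
        _mm_storeu_ps(lengths + i, _mm_sqrt_ps(sum));
    }

    // Scalar tail for the last count % 4 vectors.
    for (; i < count; ++i)
        lengths[i] = std::sqrt(xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i]);
}
```

Here every lane of every multiply, add and square root does useful work, and no shuffles are needed at all.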

Quote:
Original post by InvalidPointer
Quote:
Original post by RobTheBloke
*** Source Snippet Removed ***


What? A 48-byte 3-component vector? I don't think that's what Washu had in mind. Batching means 'use structure-of-arrays and compute lengths in parallel,' not 'increase storage overhead by 400%.' As was said, this is actually not a problem SSE is well-suited to solve. You could try using the horizontal adds introduced in SSE3 but be advised this will still likely result in marginal improvement.


I think you've misunderstood the code. That is actually a good use of SSE and will calculate the length of four vectors in parallel.

Vector1{ x[0], y[0], z[0] }
Vector2{ x[1], y[1], z[1] }
Vector3{ x[2], y[2], z[2] }
Vector4{ x[3], y[3], z[3] }

Output{ Vector1_length, Vector2_length, Vector3_length, Vector4_length }
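To make the lane mapping above concrete, here is a small sketch that packs four xyz triples into Rob's Vec3 and stores the four lengths. Vec3 and vector_length are as posted; the four_lengths helper and its float[4][3] input are assumptions added for illustration.

```cpp
#include <xmmintrin.h> // SSE

struct Vec3
{
    __m128 x; // x components of four different vectors
    __m128 y; // y components of the same four vectors
    __m128 z; // z components of the same four vectors
};

inline __m128 vector_length(const Vec3& a)
{
    const __m128 x2 = _mm_mul_ps(a.x, a.x);
    const __m128 y2 = _mm_mul_ps(a.y, a.y);
    const __m128 z2 = _mm_mul_ps(a.z, a.z);
    return _mm_sqrt_ps(_mm_add_ps(z2, _mm_add_ps(x2, y2)));
}

// Hypothetical helper: v[i] is the xyz triple of vector i, out[i] receives its length.
inline void four_lengths(const float v[4][3], float out[4])
{
    Vec3 a;
    // _mm_set_ps lists its arguments from lane 3 down to lane 0.
    a.x = _mm_set_ps(v[3][0], v[2][0], v[1][0], v[0][0]);
    a.y = _mm_set_ps(v[3][1], v[2][1], v[1][1], v[0][1]);
    a.z = _mm_set_ps(v[3][2], v[2][2], v[1][2], v[0][2]);
    _mm_storeu_ps(out, vector_length(a));
}
```

Gathering the lanes with _mm_set_ps is only there to show which value ends up where. In a real batch the data would already be stored per component, so each register is filled with a single load.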

In fairness, Rob's sample could have named things a bit more clearly -- Vec3 could have been named something like Vectors4SOA, and its members x, y and z could have been named xs, ys and zs. The function's parameter 'a' could have been better named 'vectors' or even 'soa', and then x2, y2 and z2 could have been xs2, ys2 and zs2.

Check out the difference:


struct Vectors4SOA
{
    __m128 xs;
    __m128 ys;
    __m128 zs;
};

__m128 vector_length(const Vectors4SOA& soa)
{
    const __m128 xs2 = _mm_mul_ps(soa.xs, soa.xs);
    const __m128 ys2 = _mm_mul_ps(soa.ys, soa.ys);
    const __m128 zs2 = _mm_mul_ps(soa.zs, soa.zs);
    return _mm_sqrt_ps( _mm_add_ps(zs2,_mm_add_ps(xs2, ys2)) );
}



Not to knock Rob's code or anything -- it certainly does what's intended. But I was confused myself at first glance over the code until I had actually taken the time to read it and follow what was going on in my head. A few extra characters here and there direct early assumptions in the correct direction so that one doesn't have to mentally simulate the code to follow it.
