If you're alright with restricting yourself to SSE3+, you can use the _mm_hadd_ps intrinsic to save several steps in your code.

I find that SIMD is not very useful for your standard coordinate vectors (i.e. position), but are far more useful when you are dealing with many independent parallel operations (which vector magnitude is not, hence the horizontal add) and can manage to store your data aligned and in a SIMD-friendly format. For instance, you could write a much more efficient function to compute the magnitude of 4 different vectors at once, returning a 128 bit value containing the result of all 4, no shuffling required.

For this reason, my math library contains a specialized SIMDScalar<Type,N> template class that represents an N-wide SIMD value of the given type (so that you can support doubles, ints, or whatever else). I then have a SIMDVector3<Type,N> class which implements an N-wide standard 3D coordinate vector using a SIMDScalar for each of the x, y, and z components. You can then use a SIMDVector3 to perform parallel operations on many 3D vectors at once, stored in structure-of-arrays format. The advantage of doing it this way is that you avoid all of the packing/unpacking that would be necessary (and slow) if you were using an __m128 to make a 4-component coordinate vector class as you are trying to do.