
# fast matrix4x4 and vec4 multiply. SSE?

###
#2
Crossbones+ - Reputation: **7082**

Posted 09 June 2014 - 09:11 AM

It depends on how many such operations you perform. I managed to achieve an 11x speedup in one particular case that combined matrix multiplication with other code.

    // Note: _mm_permute_ps requires AVX and _mm_fmadd_ps requires FMA.
    // On plain SSE, use _mm_shuffle_ps and _mm_mul_ps + _mm_add_ps instead.
    __m128 vec_x = _mm_permute_ps(vector4, 0x00);
    __m128 vec_y = _mm_permute_ps(vector4, 0x55);
    __m128 vec_z = _mm_permute_ps(vector4, 0xAA);
    __m128 vec_w = _mm_permute_ps(vector4, 0xFF);

    // Assume mat4_1, mat4_2, mat4_3, mat4_4 are the four vectors scaled by
    // x, y, z, w: the matrix's columns for M * v, or its rows for v * M.
    __m128 res0 = _mm_mul_ps(vec_x, mat4_1);
    __m128 res1 = _mm_fmadd_ps(vec_y, mat4_2, res0);
    __m128 res2 = _mm_fmadd_ps(vec_z, mat4_3, res1);
    __m128 res3 = _mm_fmadd_ps(vec_w, mat4_4, res2);

    // return res3; it is the transformed vector4.
    // For a vector3 direction (w = 0) the last term is zero, so drop it
    // altogether; for a point (w = 1) it reduces to adding mat4_4.
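Since the question is about SSE specifically, here is a rough sketch of the same idea using only SSE1 intrinsics (`_mm_shuffle_ps`, `_mm_mul_ps`, `_mm_add_ps`), applied to a whole batch so the matrix is loaded only once. The function name `transform_batch_sse`, the column-major matrix layout and the packed `x y z w` vector layout are my assumptions, not something from the post above, so adapt them to your own data.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE1: _mm_loadu_ps, _mm_shuffle_ps, _mm_mul_ps, _mm_add_ps

// Hypothetical helper (not from the original post).
// cols : 16 floats, assumed column-major (4 columns of 4 floats each)
// in   : count packed vectors, 4 floats each (x, y, z, w)
// out  : count packed results; may be the same buffer as in
void transform_batch_sse(const float* cols, const float* in, float* out,
                         std::size_t count)
{
    assert(cols && in && out);

    // Load the four columns once; they are reused for every vector.
    const __m128 c0 = _mm_loadu_ps(cols + 0);
    const __m128 c1 = _mm_loadu_ps(cols + 4);
    const __m128 c2 = _mm_loadu_ps(cols + 8);
    const __m128 c3 = _mm_loadu_ps(cols + 12);

    for (std::size_t i = 0; i < count; ++i) {
        const __m128 v = _mm_loadu_ps(in + 4 * i);

        // Broadcast each component to all four lanes
        // (same masks as 0x00, 0x55, 0xAA, 0xFF).
        const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
        const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
        const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
        const __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));

        // result = x*c0 + y*c1 + z*c2 + w*c3, with separate mul and add
        // because FMA is not part of SSE.
        __m128 r = _mm_mul_ps(x, c0);
        r = _mm_add_ps(r, _mm_mul_ps(y, c1));
        r = _mm_add_ps(r, _mm_mul_ps(z, c2));
        r = _mm_add_ps(r, _mm_mul_ps(w, c3));

        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

The unaligned loads and stores keep the sketch safe for any float pointer; if your data is 16-byte aligned you could switch to `_mm_load_ps` / `_mm_store_ps`. Whether this actually beats what the compiler generates for scalar code is something you would have to measure in your own case.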

###
#3
Members - Reputation: **955**

Posted 09 June 2014 - 11:23 AM

First of all, you should work with interleaved arrays, both for the matrices and for the vectors, and cache the parameters in local variables.

    void transform4x3mat(const float* mat, const float* vec, float* res)
    {
        // Cache the vec values in case res points to the same vector.
        float x = vec[0];
        float y = vec[1];
        float z = vec[2];
        // w is most likely always 1, so the 4th column is simply added
        // instead of being multiplied by w.

        res[0] = x * mat[0]  + y * mat[1]  + z * mat[2]  + mat[3];
        res[1] = x * mat[4]  + y * mat[5]  + z * mat[6]  + mat[7];
        res[2] = x * mat[8]  + y * mat[9]  + z * mat[10] + mat[11];
        // For a projection matrix, compute this 4th component of the result;
        // otherwise set it straight to 1.
        res[3] = x * mat[12] + y * mat[13] + z * mat[14] + mat[15];
    }

This function transforms a column vector (res = M * v), assuming the matrix is stored row by row (row-major) in memory.
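To make the layout concrete, here is a small reference sketch with the same convention (row-major storage, column vector) that also takes w from the input instead of assuming 1. The name `transform_row_major` is just something I made up for illustration.

```cpp
#include <cassert>

// Hypothetical reference implementation: res = M * v, where mat holds a
// 4x4 matrix as 16 floats in row-major order and vec holds (x, y, z, w).
void transform_row_major(const float* mat, const float* vec, float* res)
{
    assert(mat && vec && res);

    // Copy the input first so that res may point to the same vector as vec.
    const float v[4] = { vec[0], vec[1], vec[2], vec[3] };

    for (int row = 0; row < 4; ++row) {
        const float* m = mat + 4 * row;  // start of this row in memory
        // Each result component is the dot product of one matrix row with v.
        res[row] = m[0] * v[0] + m[1] * v[1] + m[2] * v[2] + m[3] * v[3];
    }
}
```

With this layout a translation by (10, 20, 30) sits at indices 3, 7 and 11 of the array, so transforming the point (1, 2, 3, 1) gives (11, 22, 33, 1). If your translation ends up at indices 12, 13 and 14 instead, your matrix is laid out the other way round and needs a transpose (or the row-vector form v * M).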

**Edited by JohnnyCode, 09 June 2014 - 11:24 AM.**