# fast matrix4x4 and vec4 multiply. SSE?


Hi guys, I'm looking for a fast matrix4x4 by vec4/vec3 multiply. It might be SSE code or something similar. Would this bring a big speedup to my C++ code? Can anyone help me? Thanks in advance.

##### Share on other sites
Zaoshi Kaba    8434

It depends on how many such operations you perform. I managed to achieve an 11x speedup in one particular case, which involved matrix multiplication plus other code.

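    // Note: _mm_permute_ps requires AVX and _mm_fmadd_ps requires FMA3 (#include <immintrin.h>).
    // Broadcast each component of the vector across all four lanes.
    __m128 vec_x = _mm_permute_ps(vector4, 0x00);
    __m128 vec_y = _mm_permute_ps(vector4, 0x55);
    __m128 vec_z = _mm_permute_ps(vector4, 0xAA);
    __m128 vec_w = _mm_permute_ps(vector4, 0xFF);

    // mat4_1, mat4_2, mat4_3, mat4_4 are the matrix's four columns when computing M * v
    // with a column vector (equivalently, its four rows when computing v * M with a row vector).
    __m128 res0 = _mm_mul_ps(vec_x, mat4_1);
    __m128 res1 = _mm_fmadd_ps(vec_y, mat4_2, res0);
    __m128 res2 = _mm_fmadd_ps(vec_z, mat4_3, res1);
    __m128 res3 = _mm_fmadd_ps(vec_w, mat4_4, res2);
    // res3 is the transformed vector4.

    // For a vector3 the w term depends on what the vector represents:
    // for a direction (w = 0), drop the vec_w * mat4_4 term altogether;
    // for a point (w = 1), add mat4_4 directly instead of multiplying by vec_w.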
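The snippet above relies on `_mm_permute_ps` (AVX) and `_mm_fmadd_ps` (FMA3), so it will not run on CPUs that only have SSE. Below is a minimal sketch of the same broadcast-and-accumulate idea using only SSE1 intrinsics. The function names `transform_sse` and `transform_sse_arrays` are hypothetical, and the sketch assumes the matrix is passed as four columns (column-major storage in the array wrapper); treat it as an illustration under those assumptions, not a tuned implementation.

```cpp
#include <xmmintrin.h>  // SSE1: _mm_shuffle_ps, _mm_mul_ps, _mm_add_ps, loads/stores

// Hypothetical helper: computes x*col0 + y*col1 + z*col2 + w*col3,
// where (x, y, z, w) are the four lanes of v.
inline __m128 transform_sse(__m128 col0, __m128 col1, __m128 col2, __m128 col3, __m128 v)
{
    // Broadcast each component of v across all four lanes (SSE1 equivalent of _mm_permute_ps).
    const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    const __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));

    // Separate multiply and add instead of a fused multiply-add.
    __m128 r = _mm_mul_ps(x, col0);
    r = _mm_add_ps(r, _mm_mul_ps(y, col1));
    r = _mm_add_ps(r, _mm_mul_ps(z, col2));
    r = _mm_add_ps(r, _mm_mul_ps(w, col3));
    return r;
}

// Hypothetical wrapper for plain float arrays.
// m: 16 floats, column-major (assumption); v and out: 4 floats each.
// Unaligned loads/stores are used so the pointers need no special alignment.
inline void transform_sse_arrays(const float* m, const float* v, float* out)
{
    const __m128 r = transform_sse(_mm_loadu_ps(m + 0), _mm_loadu_ps(m + 4),
                                   _mm_loadu_ps(m + 8), _mm_loadu_ps(m + 12),
                                   _mm_loadu_ps(v));
    _mm_storeu_ps(out, r);
}
```

When transforming many vectors by the same matrix, loading the four columns once outside the loop and calling `transform_sse` per vector avoids reloading the matrix each time.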


##### Share on other sites
JohnnyCode    1046

In the first place, you should work with interleaved arrays, both for the matrix and for the vector, and cache the parameters.

    void transform4x3mat(const float* mat, const float* vec, float* res)
    {
        // Cache the vector components in case res points to the same memory as vec.
        const float x = vec[0];
        const float y = vec[1];
        const float z = vec[2];
        // w is most likely always 1, so the 4th column is added without multiplying by w.

        res[0] = x * mat[0]  + y * mat[1]  + z * mat[2]  + mat[3];
        res[1] = x * mat[4]  + y * mat[5]  + z * mat[6]  + mat[7];
        res[2] = x * mat[8]  + y * mat[9]  + z * mat[10] + mat[11];
        // For a projection matrix, compute this 4th component of the result;
        // otherwise it can be set straight to 1.
        res[3] = x * mat[12] + y * mat[13] + z * mat[14] + mat[15];
    }

This function does a column-vector transformation (res = M * v), assuming the matrix is laid out row by row in memory.
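With a row-by-row layout like this, each result component is a dot product of one matrix row with the vector, which does not map directly onto the broadcast-style SSE approach shown earlier in the thread. One way to bridge the two, sketched below, is to transpose the four rows into columns with the `_MM_TRANSPOSE4_PS` macro and then accumulate. The function name `transform_rowmajor_sse` is hypothetical, and the sketch assumes a full 4-component vector and SSE1 support.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics and the _MM_TRANSPOSE4_PS macro

// Hypothetical helper: out = M * v for a row-major 4x4 matrix m (16 floats)
// and a 4-component column vector v. Safe when out aliases v, because v is
// fully loaded into a register before the result is stored.
inline void transform_rowmajor_sse(const float* m, const float* v, float* out)
{
    // Load the four rows of the matrix.
    __m128 c0 = _mm_loadu_ps(m + 0);
    __m128 c1 = _mm_loadu_ps(m + 4);
    __m128 c2 = _mm_loadu_ps(m + 8);
    __m128 c3 = _mm_loadu_ps(m + 12);

    // Transpose in place: c0..c3 now hold the four columns.
    _MM_TRANSPOSE4_PS(c0, c1, c2, c3);

    const __m128 vec = _mm_loadu_ps(v);

    // result = x*col0 + y*col1 + z*col2 + w*col3
    __m128 r = _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(0, 0, 0, 0)), c0);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(1, 1, 1, 1)), c1));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 2, 2, 2)), c2));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(3, 3, 3, 3)), c3));

    _mm_storeu_ps(out, r);
}
```

The transpose has a cost of its own, so for a batch of vectors sharing one matrix it is probably worth doing the load and transpose once, outside the loop.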

Edited by JohnnyCode

##### Share on other sites

JohnnyCode's got the right idea. Brandon Jones' glMatrix unrolls all the code for things like that because WebGL needs all the speed it can get.
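glMatrix itself is a JavaScript library, so the following is only a C++ sketch of the same fully unrolled pattern. It assumes the column-major (OpenGL-style) layout that glMatrix uses, where element `m[col * 4 + row]` holds the entry at that row and column. The function name `transform_mat4_unrolled` is hypothetical.

```cpp
// Hypothetical helper: out = M * v with M stored column-major (16 floats).
// No loops and no temporaries beyond the cached vector components, so the
// compiler sees sixteen independent multiplies it can schedule freely.
inline void transform_mat4_unrolled(const float* m, const float* v, float* out)
{
    // Cache the inputs so the function still works when out aliases v.
    const float x = v[0];
    const float y = v[1];
    const float z = v[2];
    const float w = v[3];

    out[0] = m[0] * x + m[4] * y + m[8]  * z + m[12] * w;
    out[1] = m[1] * x + m[5] * y + m[9]  * z + m[13] * w;
    out[2] = m[2] * x + m[6] * y + m[10] * z + m[14] * w;
    out[3] = m[3] * x + m[7] * y + m[11] * z + m[15] * w;
}
```

Compared with JohnnyCode's function, only the indexing differs: there the matrix is read row by row, here column by column, so passing a matrix in the wrong layout effectively applies its transpose.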