for each triangle (let's call it abc, since it has vertices a, b, c) I need to cross and normalize to get the normal,

Presumably (but I'm not an SSE expert, so someone may contradict me): the best performance for such a problem comes with a memory layout where each SSE register holds the same component of 4 different vertices, i.e.

    uint count = ( vertices.length + 3 ) / 4;
    __m128 verticesX[count];
    __m128 verticesY[count];
    __m128 verticesZ[count];

Fill the arrays with the data of the vertices a, b, c of the first 4-tuple of triangles, then of the second 4-tuple of triangles, and so on. In memory you then have something like:

    verticesX[0] : tri[0].vertex_a.x, tri[1].vertex_a.x, tri[2].vertex_a.x, tri[3].vertex_a.x
    verticesX[1] : tri[0].vertex_b.x, tri[1].vertex_b.x, tri[2].vertex_b.x, tri[3].vertex_b.x
    verticesX[2] : tri[0].vertex_c.x, tri[1].vertex_c.x, tri[2].vertex_c.x, tri[3].vertex_c.x
    verticesX[3] : tri[4].vertex_a.x, tri[5].vertex_a.x, tri[6].vertex_a.x, tri[7].vertex_a.x
    verticesX[4] : tri[4].vertex_b.x, tri[5].vertex_b.x, tri[6].vertex_b.x, tri[7].vertex_b.x
    verticesX[5] : tri[4].vertex_c.x, tri[5].vertex_c.x, tri[6].vertex_c.x, tri[7].vertex_c.x
    ...

verticesY, verticesZ: analogously, but with the .y and .z components
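To make the layout above concrete, here is a rough sketch of how the packing could be done. The names Vec3, Triangle, PackedVertices and packTriangles are made up for illustration, and instead of arrays of __m128 it uses flat float arrays in which every run of 4 consecutive floats corresponds to one register (this is an assumption to keep the example simple, not something the layout requires). A partial last group is padded by repeating the last triangle.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical input types, only for this sketch.
struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };

// Hypothetical container: 12 floats per group of 4 triangles in each array
// (4 lanes for vertex a, then 4 for vertex b, then 4 for vertex c).
struct PackedVertices {
    std::vector<float> x, y, z;
};

PackedVertices packTriangles(const std::vector<Triangle>& tris) {
    const std::size_t groups = (tris.size() + 3) / 4;  // round up to 4-tuples
    PackedVertices out;
    out.x.resize(groups * 12);
    out.y.resize(groups * 12);
    out.z.resize(groups * 12);

    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            std::size_t idx = g * 4 + lane;
            if (idx >= tris.size()) idx = tris.size() - 1;  // pad the last group
            const Triangle& t = tris[idx];
            const Vec3 v[3] = { t.a, t.b, t.c };
            for (std::size_t k = 0; k < 3; ++k) {
                // "Register" number g*3 + k, lane number lane.
                const std::size_t slot = (g * 3 + k) * 4 + lane;
                out.x[slot] = v[k].x;
                out.y[slot] = v[k].y;
                out.z[slot] = v[k].z;
            }
        }
    }
    return out;
}
```

With this, packed.x[0..3] plays the role of verticesX[0], packed.x[4..7] of verticesX[1], and so on; each run can be loaded with _mm_loadu_ps.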

    dx01 = verticesX[i+0] - verticesX[i+1];
    dy01 = verticesY[i+0] - verticesY[i+1];
    dz01 = verticesZ[i+0] - verticesZ[i+1];
    dx02 = verticesX[i+0] - verticesX[i+2];
    dy02 = verticesY[i+0] - verticesY[i+2];
    dz02 = verticesZ[i+0] - verticesZ[i+2];
    nx = dy01 * dz02 - dz01 * dy02;
    ny = dz01 * dx02 - dx01 * dz02;
    nz = dx01 * dy02 - dy01 * dx02;
    len = sqrt(nx * nx + ny * ny + nz * nz);
    nx /= len;
    ny /= len;
    nz /= len;

should result in the normals of 4 triangles per run.
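Spelled out with actual SSE intrinsics, the computation above might look roughly like the following. This is only a sketch: the function name computeNormals4 and the flat float-array parameters (12 floats per group of 4 triangles in xs, ys, zs; 4 floats per group in the outputs) are my assumptions, and unaligned loads and stores are used so that no alignment has to be guaranteed. Degenerate triangles give a length of 0 and therefore NaN components; they would need separate handling.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_sub_ps, _mm_sqrt_ps, ...
#include <cstddef>

// Hypothetical helper: computes unit normals for `groups` 4-tuples of triangles.
void computeNormals4(const float* xs, const float* ys, const float* zs,
                     std::size_t groups,
                     float* outNx, float* outNy, float* outNz) {
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t i = g * 12;  // start of this group: a-lanes, b-lanes, c-lanes

        // Edge vectors a - b and a - c for 4 triangles at once.
        const __m128 dx01 = _mm_sub_ps(_mm_loadu_ps(xs + i), _mm_loadu_ps(xs + i + 4));
        const __m128 dy01 = _mm_sub_ps(_mm_loadu_ps(ys + i), _mm_loadu_ps(ys + i + 4));
        const __m128 dz01 = _mm_sub_ps(_mm_loadu_ps(zs + i), _mm_loadu_ps(zs + i + 4));
        const __m128 dx02 = _mm_sub_ps(_mm_loadu_ps(xs + i), _mm_loadu_ps(xs + i + 8));
        const __m128 dy02 = _mm_sub_ps(_mm_loadu_ps(ys + i), _mm_loadu_ps(ys + i + 8));
        const __m128 dz02 = _mm_sub_ps(_mm_loadu_ps(zs + i), _mm_loadu_ps(zs + i + 8));

        // Cross product, component by component.
        const __m128 nx = _mm_sub_ps(_mm_mul_ps(dy01, dz02), _mm_mul_ps(dz01, dy02));
        const __m128 ny = _mm_sub_ps(_mm_mul_ps(dz01, dx02), _mm_mul_ps(dx01, dz02));
        const __m128 nz = _mm_sub_ps(_mm_mul_ps(dx01, dy02), _mm_mul_ps(dy01, dx02));

        // Length of each of the 4 normals, then normalize.
        const __m128 lenSq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(nx, nx), _mm_mul_ps(ny, ny)),
                                        _mm_mul_ps(nz, nz));
        const __m128 len = _mm_sqrt_ps(lenSq);

        _mm_storeu_ps(outNx + g * 4, _mm_div_ps(nx, len));
        _mm_storeu_ps(outNy + g * 4, _mm_div_ps(ny, len));
        _mm_storeu_ps(outNz + g * 4, _mm_div_ps(nz, len));
    }
}
```

If full precision is not needed, _mm_rsqrt_ps followed by three multiplications is a common, faster but approximate replacement for the sqrt and the divisions.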

then i need to multiply it by model_pos matrix

Doing the same trickery with the model matrix requires each of its components to be replicated 4 times, so that each register holds 4 times the same value. It is not clear to me what "model_pos" means, but if it is the transform that relates the model to the world, all you need is the 3x3 sub-matrix that stores the rotational part since the vectors you are about to transform are direction vectors.
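A small sketch of that replication, assuming the 3x3 rotational part is available as 9 floats in row-major order (m[row * 3 + col]) and the normals of 4 triangles are stored as 4 floats per component. The function name transformNormals4 and this matrix layout are assumptions of mine, not something from your code. _mm_set1_ps puts the same scalar into all 4 lanes of a register.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_mul_ps, _mm_add_ps, ...

// Hypothetical helper: out = M * n for 4 normals at once.
// m is a row-major 3x3 matrix (assumed layout), n* and out* hold 4 floats each.
void transformNormals4(const float m[9],
                       const float* nx, const float* ny, const float* nz,
                       float* outX, float* outY, float* outZ) {
    const __m128 x = _mm_loadu_ps(nx);
    const __m128 y = _mm_loadu_ps(ny);
    const __m128 z = _mm_loadu_ps(nz);

    // Each matrix component replicated into all 4 lanes.
    const __m128 m00 = _mm_set1_ps(m[0]), m01 = _mm_set1_ps(m[1]), m02 = _mm_set1_ps(m[2]);
    const __m128 m10 = _mm_set1_ps(m[3]), m11 = _mm_set1_ps(m[4]), m12 = _mm_set1_ps(m[5]);
    const __m128 m20 = _mm_set1_ps(m[6]), m21 = _mm_set1_ps(m[7]), m22 = _mm_set1_ps(m[8]);

    // One matrix row per output component.
    const __m128 rx = _mm_add_ps(_mm_add_ps(_mm_mul_ps(m00, x), _mm_mul_ps(m01, y)),
                                 _mm_mul_ps(m02, z));
    const __m128 ry = _mm_add_ps(_mm_add_ps(_mm_mul_ps(m10, x), _mm_mul_ps(m11, y)),
                                 _mm_mul_ps(m12, z));
    const __m128 rz = _mm_add_ps(_mm_add_ps(_mm_mul_ps(m20, x), _mm_mul_ps(m21, y)),
                                 _mm_mul_ps(m22, z));

    _mm_storeu_ps(outX, rx);
    _mm_storeu_ps(outY, ry);
    _mm_storeu_ps(outZ, rz);
}
```

When many groups are transformed with the same matrix, the nine replicated registers can be set up once outside the loop. Note that this only keeps the normals correct and unit length for a pure rotation; if the matrix also contains non-uniform scaling, normals have to be transformed with the inverse transpose and renormalized.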