For each triangle (let's call it abc; it has vertices a, b, c) I
need to cross and normalize to get the normal,
Presumably (but I'm not an SSE expert, so someone may contradict me) the best performance for such a problem comes with a memory layout where each SSE register holds the same component of 4 vertices, i.e.
    uint count = ( vertices.length + 3 ) / 4;
    __m128 verticesX[count];
    __m128 verticesY[count];
    __m128 verticesZ[count];
Fill the arrays with the data of the vertices a, b, c of the first 4-tuple of triangles, then of the second 4-tuple of triangles, and so on. In memory you then have something like:
    verticesX[0] : tri[0].vertex_a.x, tri[1].vertex_a.x, tri[2].vertex_a.x, tri[3].vertex_a.x
    verticesX[1] : tri[0].vertex_b.x, tri[1].vertex_b.x, tri[2].vertex_b.x, tri[3].vertex_b.x
    verticesX[2] : tri[0].vertex_c.x, tri[1].vertex_c.x, tri[2].vertex_c.x, tri[3].vertex_c.x
    verticesX[3] : tri[4].vertex_a.x, tri[5].vertex_a.x, tri[6].vertex_a.x, tri[7].vertex_a.x
    verticesX[4] : tri[4].vertex_b.x, tri[5].vertex_b.x, tri[6].vertex_b.x, tri[7].vertex_b.x
    verticesX[5] : tri[4].vertex_c.x, tri[5].vertex_c.x, tri[6].vertex_c.x, tri[7].vertex_c.x
    ...
verticesY, verticesZ: analogously, but with the .y and .z components
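To make the layout described above concrete, here is a minimal sketch of how the three arrays could be filled from an ordinary array of triangles. The types `Vec3` and `Triangle` and the function `pack_triangles` are made up for illustration (the original post does not define them), and the sketch assumes the triangle count is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ps */

/* Hypothetical input types, assumed for this sketch only. */
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 a, b, c; } Triangle;

/* Packs each group of 4 triangles into 3 registers per component:
   index 3*g + 0 holds vertex a, 3*g + 1 vertex b, 3*g + 2 vertex c.
   Assumes tri_count is a multiple of 4 and that vx, vy, vz each have
   room for 3 * tri_count / 4 registers. */
static void pack_triangles(const Triangle *tris, size_t tri_count,
                           __m128 *vx, __m128 *vy, __m128 *vz)
{
    for (size_t g = 0; g < tri_count / 4; ++g) {
        const Triangle *t = tris + 4 * g;
        size_t i = 3 * g;

        /* _mm_set_ps takes its arguments from the highest lane down to
           the lowest, so t[0] ends up in lane 0. */
        vx[i + 0] = _mm_set_ps(t[3].a.x, t[2].a.x, t[1].a.x, t[0].a.x);
        vx[i + 1] = _mm_set_ps(t[3].b.x, t[2].b.x, t[1].b.x, t[0].b.x);
        vx[i + 2] = _mm_set_ps(t[3].c.x, t[2].c.x, t[1].c.x, t[0].c.x);

        vy[i + 0] = _mm_set_ps(t[3].a.y, t[2].a.y, t[1].a.y, t[0].a.y);
        vy[i + 1] = _mm_set_ps(t[3].b.y, t[2].b.y, t[1].b.y, t[0].b.y);
        vy[i + 2] = _mm_set_ps(t[3].c.y, t[2].c.y, t[1].c.y, t[0].c.y);

        vz[i + 0] = _mm_set_ps(t[3].a.z, t[2].a.z, t[1].a.z, t[0].a.z);
        vz[i + 1] = _mm_set_ps(t[3].b.z, t[2].b.z, t[1].b.z, t[0].b.z);
        vz[i + 2] = _mm_set_ps(t[3].c.z, t[2].c.z, t[1].c.z, t[0].c.z);
    }
}
```

If the triangle count is not a multiple of 4, one option is to pad the last group with copies of a valid triangle and ignore the extra results.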
    dx01 = verticesX[i+0] - verticesX[i+1];
    dy01 = verticesY[i+0] - verticesY[i+1];
    dz01 = verticesZ[i+0] - verticesZ[i+1];
    dx02 = verticesX[i+0] - verticesX[i+2];
    dy02 = verticesY[i+0] - verticesY[i+2];
    dz02 = verticesZ[i+0] - verticesZ[i+2];
    nx = dy01 * dz02 - dz01 * dy02;
    ny = dz01 * dx02 - dx01 * dz02;
    nz = dx01 * dy02 - dy01 * dx02;
    len = sqrt(nx * nx + ny * ny + nz * nz);
    nx /= len;
    ny /= len;
    nz /= len;
This should result in the normals of 4 triangles per run.
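Written out with actual SSE intrinsics, the computation above might look roughly like the following. This is only a sketch: the function name `compute_normals4` and the output arrays are my own choices, and it assumes the inputs use the layout with 3 registers (vertices a, b, c) per group of 4 triangles.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_sub_ps, _mm_mul_ps, _mm_sqrt_ps, ... */

/* Computes unit normals for group_count groups of 4 triangles each.
   vx, vy, vz hold 3 registers per group (vertex a, b, c);
   nx, ny, nz receive 1 register per group. */
static void compute_normals4(const __m128 *vx, const __m128 *vy,
                             const __m128 *vz, size_t group_count,
                             __m128 *nx, __m128 *ny, __m128 *nz)
{
    for (size_t g = 0; g < group_count; ++g) {
        size_t i = 3 * g;

        /* Edge vectors a - b and a - c, one component per register. */
        __m128 dx01 = _mm_sub_ps(vx[i + 0], vx[i + 1]);
        __m128 dy01 = _mm_sub_ps(vy[i + 0], vy[i + 1]);
        __m128 dz01 = _mm_sub_ps(vz[i + 0], vz[i + 1]);
        __m128 dx02 = _mm_sub_ps(vx[i + 0], vx[i + 2]);
        __m128 dy02 = _mm_sub_ps(vy[i + 0], vy[i + 2]);
        __m128 dz02 = _mm_sub_ps(vz[i + 0], vz[i + 2]);

        /* Cross product of the two edges, for 4 triangles at once. */
        __m128 x = _mm_sub_ps(_mm_mul_ps(dy01, dz02), _mm_mul_ps(dz01, dy02));
        __m128 y = _mm_sub_ps(_mm_mul_ps(dz01, dx02), _mm_mul_ps(dx01, dz02));
        __m128 z = _mm_sub_ps(_mm_mul_ps(dx01, dy02), _mm_mul_ps(dy01, dx02));

        /* Length of each of the 4 cross products. */
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                            _mm_mul_ps(y, y)),
                                 _mm_mul_ps(z, z));
        __m128 len = _mm_sqrt_ps(len2);

        /* Normalize. A degenerate triangle has len == 0 and yields NaN. */
        nx[g] = _mm_div_ps(x, len);
        ny[g] = _mm_div_ps(y, len);
        nz[g] = _mm_div_ps(z, len);
    }
}
```

The square root and the three divisions could be traded for `_mm_rsqrt_ps` plus three multiplications if an approximate result is acceptable.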
Then I need to multiply it by the model_pos matrix
Doing the same trickery with the model matrix requires each of its components to be replicated 4 times, so that each register holds 4 times the same value. It is not clear to me what "model_pos" means, but if it is the transform that relates the model to the world, all you need is the 3x3 sub-matrix that stores the rotational part since the vectors you are about to transform are direction vectors.
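As a rough illustration of the replicated-matrix idea, the following sketch transforms 4 direction vectors at once by a 3x3 matrix. The function name `transform_dirs4` is hypothetical, and the storage convention is an assumption on my part: `m` is taken to be row-major (`m[3 * row + col]`) and applied as n' = M * n.

```c
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_mul_ps, _mm_add_ps */

/* Transforms 4 direction vectors (one component per register) by the
   3x3 matrix m. Assumed convention: row-major storage, n' = M * n. */
static void transform_dirs4(const float m[9],
                            __m128 nx, __m128 ny, __m128 nz,
                            __m128 *ox, __m128 *oy, __m128 *oz)
{
    /* Replicate every matrix component into all 4 lanes of a register. */
    __m128 m00 = _mm_set1_ps(m[0]), m01 = _mm_set1_ps(m[1]), m02 = _mm_set1_ps(m[2]);
    __m128 m10 = _mm_set1_ps(m[3]), m11 = _mm_set1_ps(m[4]), m12 = _mm_set1_ps(m[5]);
    __m128 m20 = _mm_set1_ps(m[6]), m21 = _mm_set1_ps(m[7]), m22 = _mm_set1_ps(m[8]);

    /* Each output component is a row of the matrix dotted with (nx, ny, nz). */
    *ox = _mm_add_ps(_mm_add_ps(_mm_mul_ps(m00, nx), _mm_mul_ps(m01, ny)),
                     _mm_mul_ps(m02, nz));
    *oy = _mm_add_ps(_mm_add_ps(_mm_mul_ps(m10, nx), _mm_mul_ps(m11, ny)),
                     _mm_mul_ps(m12, nz));
    *oz = _mm_add_ps(_mm_add_ps(_mm_mul_ps(m20, nx), _mm_mul_ps(m21, ny)),
                     _mm_mul_ps(m22, nz));
}
```

When the same matrix is applied to many groups of normals, the nine replicated registers would be set up once outside the loop rather than on every call.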