I got an array with input geometry, this is 100k entries
array of triangles
You currently have an "array of structures", or AoS for short, i.e. a vertex (the structure) sequenced into an array. For SSE it is often better to have a "structure of arrays", or SoA for short. This means splitting the vertex into parts, where each part gets its own array. These arrays can usually be organized better w.r.t. SSE.
What are the semantics of the 9 floats of your vertex, and what operations should be done on them?
for each triangle (let's call it abc - it has vertices a, b, c) I need to cross and normalize to get the normal:
n = (b-a) x (c-a)
n = normalize(n)
then I need to multiply it by the model_pos matrix:
n = model_pos * n
then I need to calc some light factors:
Color *= (n * LightDir) * LightColor
though here with the color it is a bit more complicated, as this base color to multiply is in packed ARGB format, not float3 like the whole rest
more info is in the other thread near here, "how to optymize this with sse or sse intrinsics" - I'm still a bit confused with this, as SSE is a bit new to me
Make an array of __m128 and make your triangles and normals indices into that array. To fill the elements of the __m128, make a union with an array of floats, or define four elements a, b, c, d or x, y, z, w - however you want. For the transformation you go blindly through the array; the compiler will detect the best way to use SSE instructions.
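Such a union could look like this - a minimal sketch, with the Vec4 name being an illustration (and note that reading a different union member than the one last written is a widely supported compiler guarantee in GCC/MSVC, not strictly standard C++):

```cpp
#include <xmmintrin.h>

// One vector viewable either as a whole __m128 register, as an indexed
// float array, or as named components. __m128 members are automatically
// 16-byte aligned, so the union is too. The anonymous struct is a
// widely supported compiler extension (GCC, Clang, MSVC).
union Vec4
{
    __m128 m;                       // whole register, for intrinsics
    float  f[4];                    // indexed access
    struct { float x, y, z, w; };   // named access
};
```

Filling the lanes then works either through `f[]`/`x..w` or with `_mm_set_ps` (which takes its arguments from the highest lane down to the lowest).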
for each triangle (let's call it abc - it has vertices a, b, c) I need to cross and normalize to get the normal
Presumably (but I'm not an SSE expert, so someone may contradict me): The best performance for such a problem comes with a memory layout where each SSE register holds the same components of 4 vertices. I.e.
uint count = ( vertices.length + 3 ) / 4;
__m128 verticesX[count];
__m128 verticesY[count];
__m128 verticesZ[count];
Fill the arrays with the data of the vertices a, b, c of the first 4-tuple triangles, then of the second 4-tuple of triangles, and so on. In memory you then have something like:
verticesX[0] : tri[0].vertex_a.x, tri[1].vertex_a.x, tri[2].vertex_a.x, tri[3].vertex_a.x
verticesX[1] : tri[0].vertex_b.x, tri[1].vertex_b.x, tri[2].vertex_b.x, tri[3].vertex_b.x
verticesX[2] : tri[0].vertex_c.x, tri[1].vertex_c.x, tri[2].vertex_c.x, tri[3].vertex_c.x
verticesX[3] : tri[4].vertex_a.x, tri[5].vertex_a.x, tri[6].vertex_a.x, tri[7].vertex_a.x
verticesX[4] : tri[4].vertex_b.x, tri[5].vertex_b.x, tri[6].vertex_b.x, tri[7].vertex_b.x
verticesX[5] : tri[4].vertex_c.x, tri[5].vertex_c.x, tri[6].vertex_c.x, tri[7].vertex_c.x
...
verticesY: analogously, but with the .y component
verticesZ: analogously, but with the .z component
dx01 = verticesX[i+0] - verticesX[i+1];
dy01 = verticesY[i+0] - verticesY[i+1];
dz01 = verticesZ[i+0] - verticesZ[i+1];
dx02 = verticesX[i+0] - verticesX[i+2];
dy02 = verticesY[i+0] - verticesY[i+2];
dz02 = verticesZ[i+0] - verticesZ[i+2];
nx = dy01 * dz02 - dz01 * dy02;
ny = dz01 * dx02 - dx01 * dz02;
nz = dx01 * dy02 - dy01 * dx02;
len = sqrt(nx * nx + ny * ny + nz * nz);
nx /= len;
ny /= len;
nz /= len;
should result in the normals of 4 triangles per run (with i advancing in steps of 3 per group).
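Put into a complete loop with intrinsics, the scheme above could look like this - a sketch under the SoA layout just described; the function and parameter names are illustrative, not from the original post:

```cpp
#include <cstddef>
#include <xmmintrin.h>

// verticesX/Y/Z hold three __m128 per group of 4 triangles (vertex a,
// then b, then c); one normalized normal register per group is written
// to normalsX/Y/Z. Computes 4 normals per iteration.
void computeNormals(const __m128* verticesX, const __m128* verticesY,
                    const __m128* verticesZ, std::size_t groups,
                    __m128* normalsX, __m128* normalsY, __m128* normalsZ)
{
    for (std::size_t g = 0; g < groups; ++g)
    {
        const std::size_t i = g * 3; // a, b, c of this 4-triangle group

        // edge vectors b-a and c-a, for 4 triangles at once
        const __m128 ex1 = _mm_sub_ps(verticesX[i + 1], verticesX[i + 0]);
        const __m128 ey1 = _mm_sub_ps(verticesY[i + 1], verticesY[i + 0]);
        const __m128 ez1 = _mm_sub_ps(verticesZ[i + 1], verticesZ[i + 0]);
        const __m128 ex2 = _mm_sub_ps(verticesX[i + 2], verticesX[i + 0]);
        const __m128 ey2 = _mm_sub_ps(verticesY[i + 2], verticesY[i + 0]);
        const __m128 ez2 = _mm_sub_ps(verticesZ[i + 2], verticesZ[i + 0]);

        // cross product n = (b-a) x (c-a), component-wise on 4 triangles
        const __m128 nx = _mm_sub_ps(_mm_mul_ps(ey1, ez2), _mm_mul_ps(ez1, ey2));
        const __m128 ny = _mm_sub_ps(_mm_mul_ps(ez1, ex2), _mm_mul_ps(ex1, ez2));
        const __m128 nz = _mm_sub_ps(_mm_mul_ps(ex1, ey2), _mm_mul_ps(ey1, ex2));

        // normalize: divide by sqrt(nx^2 + ny^2 + nz^2)
        const __m128 len = _mm_sqrt_ps(_mm_add_ps(
            _mm_add_ps(_mm_mul_ps(nx, nx), _mm_mul_ps(ny, ny)),
            _mm_mul_ps(nz, nz)));
        normalsX[g] = _mm_div_ps(nx, len);
        normalsY[g] = _mm_div_ps(ny, len);
        normalsZ[g] = _mm_div_ps(nz, len);
    }
}
```

Note there is no shuffling anywhere in the loop: with this vertical layout the cross product is just plain multiplies and subtracts across lanes.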
then i need to multiply it by model_pos matrix
Doing the same trickery with the model matrix requires each of its components to be replicated 4 times, so that each register holds 4 times the same value. It is not clear to me what "model_pos" means, but if it is the transform that relates the model to the world, all you need is the 3x3 sub-matrix that stores the rotational part since the vectors you are about to transform are direction vectors.
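The replication described above could be sketched like this, assuming a row-major 3x3 rotational part stored as 9 floats (the function name and layout are assumptions, not from the original post):

```cpp
#include <xmmintrin.h>

// Transforms 4 direction vectors at once (SoA registers nx/ny/nz) by
// the 3x3 rotational part of the model transform. Each matrix element
// is replicated ("splatted") into all 4 lanes with _mm_set1_ps, so one
// register holds 4 copies of the same matrix value.
void transformDirections4(const float m[9], __m128& nx, __m128& ny, __m128& nz)
{
    const __m128 rx = _mm_add_ps(_mm_add_ps(
        _mm_mul_ps(_mm_set1_ps(m[0]), nx),
        _mm_mul_ps(_mm_set1_ps(m[1]), ny)),
        _mm_mul_ps(_mm_set1_ps(m[2]), nz));
    const __m128 ry = _mm_add_ps(_mm_add_ps(
        _mm_mul_ps(_mm_set1_ps(m[3]), nx),
        _mm_mul_ps(_mm_set1_ps(m[4]), ny)),
        _mm_mul_ps(_mm_set1_ps(m[5]), nz));
    const __m128 rz = _mm_add_ps(_mm_add_ps(
        _mm_mul_ps(_mm_set1_ps(m[6]), nx),
        _mm_mul_ps(_mm_set1_ps(m[7]), ny)),
        _mm_mul_ps(_mm_set1_ps(m[8]), nz));
    nx = rx; ny = ry; nz = rz;
}
```

In a real loop the nine `_mm_set1_ps` splats would of course be hoisted out and done once, since the matrix is the same for all triangles of the model.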
the question was more about alignment, say
void f()
{
float3 x;
}
what 'keyword' to use to align it to 16 bytes in GCC?
as to SSE, I will be trying to learn both the vertical way (this is the 4-pack) and the horizontal way (harder, but sometimes easier to integrate such a procedure), thanks
If you use __m128 type variables, they are automatically 16-byte aligned if you do not force a special alignment.
Currently compilers are really bad at autovectorization.
haegarr already pointed out a simple solution; however, SoA is not really comfortable, especially if you want to modify a vertex buffer (the most common case), where the data is usually interleaved (AoS).
If you can convert the interleaved data to SoA, use the _mm_stream_ps intrinsic for the stores.
If we are talking about interleaved data, e.g. Triangle triangles[trianglesMAX], where Triangle has 3 vectors of 3 floats each, and it is not possible (or not worth it) to convert to SoA:
alternatively I could copy each record to a local variable buffer and then work on this - not sure if this wouldn't be better
Well, this is a good solution. Because the data is not aligned, you should use movups instructions (_mm_loadu_ps, for example) to load your data into the xmm registers. While writing SSE code, keep in mind that __m128 is not just a plain type - it maps to an xmm register.
Pay attention when writing the cross product operation - it is really tricky.
Here is my implementation; you should try to find a better solution.
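To illustrate the unaligned-load case (the Triangle layout here is an assumption, not from the original post): loading one vertex of an interleaved 9-float triangle with movups reads 4 floats, which is fine as long as those 4 floats stay inside allocated memory.

```cpp
#include <xmmintrin.h>

// Interleaved (AoS) triangle: a.xyz, b.xyz, c.xyz - 9 floats, no
// padding, so no 16-byte alignment guarantee. Layout is an assumption.
struct Triangle { float v[9]; };

// Unaligned load of vertex a: reads v[0..3]. The 4th float (b.x) ends
// up as junk in the w lane but the read stays inside the struct, so it
// is safe. Loading vertex c (v[6..8]) the same way would read 4 bytes
// past the struct - safe only when another triangle follows in memory.
inline __m128 loadVertexA(const Triangle& t)
{
    return _mm_loadu_ps(&t.v[0]);
}
```

The junk w lane is usually harmless for 3-component math, but it must not feed into a length or dot product without being masked or ignored.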
cross(a.m_M128, b.m_M128)
__m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); // (Y Z X 0)
__m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); // (Y Z X 0)
// i(ay*bz - by*az) + j(bx*az - ax*bz) + k(ax*by - bx*ay)
T = _mm_mul_ps(T, b.m_M128); // bx * ay, by * az, bz * ax
V = _mm_mul_ps(V, a.m_M128); // ax * by, ay * bz, az * bx
V = _mm_sub_ps(V, T);
V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));
// V now holds the result
Helpful link: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
If you prefer to write autovectorization-friendly code, please tell me and I will do my best to help you.
I use a gcc-4.8.x compiler on my Linux machine and it did a good job even though I did not use __m128 data. Using a struct with 4 floats activated the SSE instruction set and made the software pretty fast. Multiplying the list of vectors by a matrix was even converted to SSE instructions. It may be getting better, but only within a range of single clock cycles.
Just because there are SSE instructions doesn't mean that the compiler did a great job.
For example, if you want to sum two vec4 vectors and you're not writing autovectorization-friendly code, there will be, let's say, 4 addss instructions (+ tons of movss) instead of a single addps.
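A sketch of what "autovectorization friendly" can mean here: a flat loop over contiguous floats, which GCC or Clang at -O2/-O3 (with the vectorizer enabled) can typically turn into addps, whereas summing through a 4-float struct lane by lane often ends up as scalar addss.

```cpp
#include <cstddef>

// Vectorizer-friendly: contiguous data, a simple counted loop, no
// hidden aliasing between out and the inputs assumed by the caller.
// count is the total number of floats; the compiler emits a packed
// (addps) main loop plus a scalar remainder on its own.
void addArrays(const float* a, const float* b, float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = a[i] + b[i];
}
```

Whether the packed loop actually materializes is worth verifying in the generated assembly rather than assuming.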
EDIT:
what 'keyword' to use to align it to 16 bytes in GCC?
__attribute__((aligned(16))) - GCC
__declspec(align(16)) - MSVC
alignas(16) - C++11, not implemented on all compilers yet.
Those will also modify the sizeof(TYPE).
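That sizeof change can be shown directly (a minimal sketch using the C++11 spelling):

```cpp
// A plain 3-float struct vs. the same struct forced to 16-byte
// alignment. The aligned version is padded so that array elements stay
// aligned, which is why sizeof grows from 12 to 16 on common ABIs.
struct Float3              { float x, y, z; };
struct alignas(16) Float3A { float x, y, z; };

static_assert(sizeof(Float3)  == 12, "3 floats, no padding");
static_assert(sizeof(Float3A) == 16, "padded up to its alignment");
static_assert(alignof(Float3A) == 16, "16-byte aligned");
```

The padding matters when such structs are memcpy'd or written to files: the extra 4 bytes per element travel along.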