sse-alignment troubles

Started by
32 comments, last by jbadams 9 years, 9 months ago

I've got an array with input geometry, about 100k entries.

It is an array of triangles:

Triangle triangles[trianglesMAX];

Each triangle is 9 floats (36 bytes). Unfortunately this is not a multiple of 16 bytes, and to pad it up I would need to add a whole 12 empty bytes. So I wonder how I could change this. Should I change Triangle to an SseTriangle (or whatever a better name for this would be) which is built from 12 floats (48 bytes)? That would grow the input RAM size from 3.6 MB to 4.8 MB, which could slow down memory access.

Alternatively I could copy each record to a local buffer variable and then work on that. I'm not sure if this wouldn't be better (I'm still a little confused here).

The other question is which 'keyword' to use in GCC to align global and local storage variables for use with SSE. I'm mainly using C types like float3, float9, float12; how do I align those?

SSE has a type called __m128 which is probably aligned automatically, but it would be inconvenient for me to use it everywhere. In some SSE-agnostic places I would like to use float3 and then cast it to __m128, so I would prefer to use both.

You currently have an "array of structures", or AoS for short: a vertex (the structure) repeated in an array. For SSE it is often better to have a "structure of arrays", or SoA for short. This means splitting the vertex into parts, with each part getting its own array. These arrays can usually be organized better w.r.t. SSE.

What are the semantics of the 9 floats of your vertex, and what operations should be done on them?

For each triangle (let's call it abc, with vertices a, b, c) I need to take a cross product and normalize it to get the normal:

n = (b-a) x (c-a)

n = normalize (n)

Then I need to multiply it by the model_pos matrix:

n = model_pos * n

Then I need to calculate some light factors:

Color *= (n * LightDir) * LightColor

With the color it is a bit more complicated, though, as the base color to multiply is in packed ARGB format, not float3 like everything else.

More info is in another thread near here, "how to optymize this with sse or sse intrinsics". I'm still a bit confused by this, as SSE is a bit new to me.

Make an array of __m128 and make your triangles and normals indices into that array.

To fill the elements of a __m128, make a union with an array of floats, or define four elements a, b, c, d or x, y, z, w, however you want.

For the transformation you just walk through the array blindly.

The compiler will detect the best way to use SSE instructions.


For each triangle (let's call it abc, with vertices a, b, c) I
need to take a cross product and normalize it to get the normal,

Presumably (but I'm not an SSE expert, so someone may contradict me): The best performance for such a problem comes with a memory layout where each SSE register holds the same components of 4 vertices. I.e.


// 4 triangles per group, 3 registers (vertices a, b, c) per group
const size_t count = 3 * ( ( trianglesMAX + 3 ) / 4 );
__m128 verticesX[count];
__m128 verticesY[count];
__m128 verticesZ[count];

Fill the arrays with the data of the vertices a, b, c of the first 4-tuple of triangles, then of the second 4-tuple of triangles, and so on. In memory you then have something like:


verticesX[0] : tri[0].vertex_a.x, tri[1].vertex_a.x, tri[2].vertex_a.x, tri[3].vertex_a.x 
verticesX[1] : tri[0].vertex_b.x, tri[1].vertex_b.x, tri[2].vertex_b.x, tri[3].vertex_b.x
verticesX[2] : tri[0].vertex_c.x, tri[1].vertex_c.x, tri[2].vertex_c.x, tri[3].vertex_c.x
verticesX[3] : tri[4].vertex_a.x, tri[5].vertex_a.x, tri[6].vertex_a.x, tri[7].vertex_a.x 
verticesX[4] : tri[4].vertex_b.x, tri[5].vertex_b.x, tri[6].vertex_b.x, tri[7].vertex_b.x
verticesX[5] : tri[4].vertex_c.x, tri[5].vertex_c.x, tri[6].vertex_c.x, tri[7].vertex_c.x 
...
verticesY: analogously, but with the .y component

verticesZ: analogously, but with the .z component
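
A minimal sketch of how the interleaved triangles could be packed into that layout. The `Triangle` field names, the `Lane4` wrapper and the `packTriangles` helper are assumptions made up for illustration; `Lane4` simply stands in for one `__m128` worth of floats so the arrays can live in a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed AoS input record: vertices a, b, c with x, y, z each (9 floats, 36 bytes).
struct Triangle {
    float a[3], b[3], c[3];
};

// Four floats with 16-byte alignment, i.e. the contents of one SSE register.
struct alignas(16) Lane4 {
    float v[4];
};

// Hypothetical SoA container following the layout described above.
struct TriangleSoA {
    std::vector<Lane4> x, y, z;
};

// Packs n triangles; unused lanes of the last group are zero-filled.
inline TriangleSoA packTriangles(const Triangle* tris, std::size_t n) {
    assert(tris != nullptr || n == 0);
    const std::size_t groups = (n + 3) / 4;   // 4 triangles per group
    TriangleSoA out;
    out.x.assign(groups * 3, Lane4{});        // 3 entries (a, b, c) per group
    out.y.assign(groups * 3, Lane4{});
    out.z.assign(groups * 3, Lane4{});
    for (std::size_t t = 0; t < n; ++t) {
        const std::size_t base = (t / 4) * 3; // first entry of this triangle's group
        const std::size_t lane = t % 4;       // position inside the register
        const float* verts[3] = { tris[t].a, tris[t].b, tris[t].c };
        for (std::size_t k = 0; k < 3; ++k) {
            out.x[base + k].v[lane] = verts[k][0];
            out.y[base + k].v[lane] = verts[k][1];
            out.z[base + k].v[lane] = verts[k][2];
        }
    }
    return out;
}
```

Each `Lane4` can then be loaded with `_mm_load_ps(entry.v)`. The packing itself is scalar, so it only pays off if the packed data is reused or the geometry is static.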

Then computations along the following scheme (with i advancing by 3 per group of 4 triangles)

dx01 = verticesX[i+0] - verticesX[i+1];
dy01 = verticesY[i+0] - verticesY[i+1];
dz01 = verticesZ[i+0] - verticesZ[i+1];
dx02 = verticesX[i+0] - verticesX[i+2];
dy02 = verticesY[i+0] - verticesY[i+2];
dz02 = verticesZ[i+0] - verticesZ[i+2];

nx = dy01 * dz02 - dz01 * dy02;
ny = dz01 * dx02 - dx01 * dz02;
nz = dx01 * dy02 - dy01 * dx02;

len = sqrt(nx * nx + ny * ny + nz * nz);

nx /= len;
ny /= len;
nz /= len;

should result in the normals of 4 triangles per run.
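
Written out with intrinsics, the scheme above might look roughly like this. It is only a sketch: the function name `normals4` and its pointer-based signature are assumptions, and it uses a full-precision `_mm_sqrt_ps` plus `_mm_div_ps` rather than any faster approximation.

```cpp
#include <cassert>
#include <xmmintrin.h>

// vx, vy, vz each point at 3 registers: one component of vertex a, b, c
// for 4 triangles. Writes the unit normals to n[0] (x), n[1] (y), n[2] (z).
// Degenerate triangles give a zero length and therefore NaN lanes.
inline void normals4(const __m128* vx, const __m128* vy, const __m128* vz, __m128* n) {
    assert(vx && vy && vz && n);

    // Edge vectors a - b and a - c, as in the scheme above.
    const __m128 dx01 = _mm_sub_ps(vx[0], vx[1]);
    const __m128 dy01 = _mm_sub_ps(vy[0], vy[1]);
    const __m128 dz01 = _mm_sub_ps(vz[0], vz[1]);
    const __m128 dx02 = _mm_sub_ps(vx[0], vx[2]);
    const __m128 dy02 = _mm_sub_ps(vy[0], vy[2]);
    const __m128 dz02 = _mm_sub_ps(vz[0], vz[2]);

    // Cross product, one component of 4 normals per register.
    const __m128 nx = _mm_sub_ps(_mm_mul_ps(dy01, dz02), _mm_mul_ps(dz01, dy02));
    const __m128 ny = _mm_sub_ps(_mm_mul_ps(dz01, dx02), _mm_mul_ps(dx01, dz02));
    const __m128 nz = _mm_sub_ps(_mm_mul_ps(dx01, dy02), _mm_mul_ps(dy01, dx02));

    // Length of each of the 4 normals.
    const __m128 len = _mm_sqrt_ps(
        _mm_add_ps(_mm_add_ps(_mm_mul_ps(nx, nx), _mm_mul_ps(ny, ny)),
                   _mm_mul_ps(nz, nz)));

    n[0] = _mm_div_ps(nx, len);
    n[1] = _mm_div_ps(ny, len);
    n[2] = _mm_div_ps(nz, len);
}
```

Note that (a - b) x (a - c) equals (b - a) x (c - a), because negating both operands leaves the cross product unchanged.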


Then I need to multiply it by the model_pos matrix

Doing the same trickery with the model matrix requires each of its components to be replicated 4 times, so that each register holds 4 times the same value. It is not clear to me what "model_pos" means, but if it is the transform that relates the model to the world, all you need is the 3x3 sub-matrix that stores the rotational part since the vectors you are about to transform are direction vectors.
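
The replication can be sketched with `_mm_set1_ps`, which broadcasts one float into all 4 lanes. The name `rotate4` and the row-major layout of the 3x3 part are assumptions here; adapt the indexing to however the matrix is actually stored.

```cpp
#include <cassert>
#include <xmmintrin.h>

// m: 3x3 rotational part, row-major (assumed convention), applied as n' = m * n.
// x, y, z hold one component of 4 direction vectors each and are updated in place.
inline void rotate4(const float* m, __m128& x, __m128& y, __m128& z) {
    assert(m);
    const __m128 ix = x, iy = y, iz = z;  // keep the inputs; the outputs overwrite them

    x = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[0]), ix),
                              _mm_mul_ps(_mm_set1_ps(m[1]), iy)),
                   _mm_mul_ps(_mm_set1_ps(m[2]), iz));
    y = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[3]), ix),
                              _mm_mul_ps(_mm_set1_ps(m[4]), iy)),
                   _mm_mul_ps(_mm_set1_ps(m[5]), iz));
    z = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[6]), ix),
                              _mm_mul_ps(_mm_set1_ps(m[7]), iy)),
                   _mm_mul_ps(_mm_set1_ps(m[8]), iz));
}
```

In a loop over many groups, the nine broadcasts would normally be hoisted out and done once per matrix.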

The question was more about alignment. Say:

void f()
{
    float3 x;
}

Which 'keyword' should I use to align it to 16 bytes in GCC?

As for SSE, I will be trying to learn both the vertical way (the 4-pack one) and the horizontal way (harder, but sometimes easier to integrate into such a procedure). Thanks.

If you use __m128 type variables, they are automatically 16-byte aligned unless you force a different alignment.

Currently compilers are really bad at autovectorization.

haegarr already pointed out a simple solution; however, SoA is not really comfortable, especially if you want to modify a vertex buffer (the most common case), where the data is usually interleaved (AoS).

If you can convert the interleaved data to SoA, use the _mm_stream_ps intrinsic (a non-temporal store; the destination must be 16-byte aligned).

If we are talking about interleaved data (e.g. Triangle triangles[trianglesMAX], where Triangle has 3 vectors of 3 components each) and it is not possible (or not worth it) to convert to SoA:


Alternatively I could copy each record to a local buffer variable and then work on that. I'm not sure if this wouldn't be better

Well, this is a good solution. Because the data is not aligned, you should use movups instructions (_mm_loadu_ps, for example) to load your data into the xmm registers. While writing SSE code, keep in mind that __m128 is not just a type.
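
Both options can be sketched as follows, assuming a tightly packed 9-float `Triangle`; the struct layout and the helper names are made up for illustration. One thing to watch: `_mm_loadu_ps` always reads 4 floats, so loading a 3-float vertex this way also reads the float that follows it. For the last vertex of the last triangle in an array that would be past the end of the data.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Assumed packed layout: a.xyz, b.xyz, c.xyz (36 bytes, no padding).
struct Triangle {
    float v[9];
};

// Unaligned load (movups). Lane 3 receives whatever float follows the vertex,
// so p[3] must be readable memory.
inline __m128 loadVec3Unaligned(const float* p) {
    assert(p);
    return _mm_loadu_ps(p);
}

// Copy to a local 16-byte-aligned buffer first, then do an aligned load.
// Never touches p[3]; lane 3 is set to zero.
inline __m128 loadVec3Safe(const float* p) {
    assert(p);
    alignas(16) float tmp[4] = { p[0], p[1], p[2], 0.0f };
    return _mm_load_ps(tmp);
}
```

A possible usage is `loadVec3Unaligned` for vertices a and b of every triangle, and `loadVec3Safe` for vertex c wherever the extra read could run off the end of the array.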

Pay attention when writing the cross product operation; it is really tricky.

Here is my implementation (the cross code below). You should try to find a better solution.

EDIT:


Which 'keyword' should I use to align it to 16 bytes in GCC?

__attribute__((aligned(16))) GCC

__declspec(align(16)) MSVC

alignas(16) C++11, not implemented on all compilers yet.

cross (a.m_M128, b.m_M128 )


__m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)
__m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)


//i(ay*bz - by*az)  + j(bx*az - ax*bz)  + k(ax*by - bx*ay)
T = _mm_mul_ps(T, b.m_M128);//bx * ay, by * az, bz * ax
V = _mm_mul_ps(V, a.m_M128);//ax * by, ay * bz, az * bx
V = _mm_sub_ps(V, T);

//V holds the result
V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));
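
For reference, a self-contained variant of the same three-shuffle idea using only the standard `_MM_SHUFFLE` macro. It assumes that `SGE_SIMD_SHUFFLE(1, 2, 0, 3)` above selects lanes (y, z, x, w), which is what the comments suggest; the function name `cross3` is made up.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Cross product of the xyz lanes of a and b.
// The w lane of the result is aw*bw - aw*bw, i.e. 0 for finite inputs.
inline __m128 cross3(__m128 a, __m128 b) {
    // _MM_SHUFFLE(3, 0, 2, 1) picks lanes (y, z, x, w).
    const __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));

    // (ax*by - ay*bx, ay*bz - az*by, az*bx - ax*bz) = (cz, cx, cy)
    const __m128 c_zxy = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));

    // Rotate the lanes once more to get (cx, cy, cz, 0).
    return _mm_shuffle_ps(c_zxy, c_zxy, _MM_SHUFFLE(3, 0, 2, 1));
}
```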

Helpful link:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/.

If you prefer to write autovectorization-friendly code, please tell me and I will do my best to help you.

I use a gcc-4.8.x compiler on my Linux machine and it did a good job even though I did not use __m128 data. Using a struct with 4 floats activated the SSE instruction set and made the software pretty fast. Even multiplying the list of vectors by a matrix was converted to SSE instructions. It may get better, but only in the range of single clock cycles.

Just because there are SSE instructions doesn't mean that the compiler did a great job.

For example, if you want to add two vec4 vectors and you're not writing autovectorization-friendly code, there may be 4 addss (plus tons of movss) instead of a single addps.
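
To make the difference concrete, here is a small sketch with a hypothetical `Vec4` type. The first function leaves it to the compiler, which may emit four addss or one addps depending on version and flags; the second asks for a single packed add explicitly.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical 4-float vector type.
struct Vec4 {
    float v[4];
};

// Plain C++: whether this becomes 4x addss or 1x addps is up to the compiler.
inline Vec4 addPlain(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}

// Intrinsics: two unaligned loads, one packed add, one unaligned store.
inline Vec4 addSse(const Vec4& a, const Vec4& b) {
    Vec4 r;
    _mm_storeu_ps(r.v, _mm_add_ps(_mm_loadu_ps(a.v), _mm_loadu_ps(b.v)));
    return r;
}
```

Compiling with `g++ -O2 -S` and looking for addss versus addps in the generated assembly shows what the compiler actually did with the plain version.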

EDIT:


Which 'keyword' should I use to align it to 16 bytes in GCC?

__attribute__((aligned(16))) GCC

__declspec(align(16)) MSVC

alignas(16) C++11, not implemented on all compilers yet.

When applied to a type, those will also round sizeof(TYPE) up to a multiple of the alignment.
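
A small sketch of those spellings in use. The `float3` definition is an assumption (three plain floats), and `float3a`, `float3g`, `isAligned16` and `sumAligned` are names made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

struct float3 { float x, y, z; };               // assumed definition: 12 bytes
struct alignas(16) float3a { float x, y, z; };  // C++11 spelling, aligned type

static_assert(sizeof(float3) == 12, "plain float3 has no padding");
static_assert(sizeof(float3a) == 16, "aligning the type rounds sizeof up to 16");
static_assert(alignof(float3a) == 16, "type is 16-byte aligned");

#if defined(__GNUC__)
// GCC spelling of the same thing.
struct float3g { float x, y, z; } __attribute__((aligned(16)));
static_assert(sizeof(float3g) == 16 && alignof(float3g) == 16, "same effect as alignas");
#endif

inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Aligning a single local variable: the element type stays a plain float,
// only this array's address is forced to a 16-byte boundary.
inline float sumAligned() {
    alignas(16) float buf[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    assert(isAligned16(buf));
    const __m128 v = _mm_load_ps(buf);    // aligned load (movaps) is safe here
    alignas(16) float out[4];
    _mm_store_ps(out, _mm_add_ps(v, v));  // doubles every lane
    return out[0] + out[1] + out[2] + out[3];
}
```

Because the aligned type grows to 16 bytes, an array of `float3a` costs the same memory as an array of 4-float vectors. Aligning only selected variables avoids that for the SSE-agnostic code.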

This topic is closed to new replies.
