# sse-alignment troubles

## Recommended Posts

I got an array with input geometry, this is 100k entries

array of triangles

Triangle triangles[trianglesMAX];

each traingle is 9 floats (it is 36 bytes each)

unfortunately this is not multiply of 16 bytes, and unfortunatelly
to pad this up i need tu add whole 12 empty bytes

so i wonder how i could change this, should i chnge the triangle
for SseTriangle (or what better name for this) which is build from 12
floats (48 bytes) ? - it would grow input ram size from 3.6 MB to 4.8 MB
this could slow the ram thing

alternatively i could copy each record to local variable buffer and then
work on this - not sure if this wouldnt be better

(at all i got yet a little confusion here)

other question is how 'keyword' in the GCC to use to align global and local
storage variables for the usage of sse, i mean im mainly using the c types
like float3 float9 float12 - how to align that?

sse has some type called __m128 probably this is automaticaly aligned
but it would be unconvinient to me to use this in all the places - in some
sse agnostic places i would be liking using float3 then recast it to _m128
so i would prefer to use them both

##### Share on other sites

You currently have an "array of structures" or AoS for short, i.e. a vertex (the structure) sequenced into an array. For SSE it is often better to have a "structure of arrays" or SoA for short. This means to split the vertex into parts, and each part gets its own array. These arrays can usually be organized better w.r.t. to SSE.

Which semantics have the 9 floats of your vertex, and what operation should be done on them?

##### Share on other sites

You currently have an "array of structures" or AoS for short, i.e. a vertex (the structure) sequenced into an array. For SSE it is often better to have a "structure of arrays" or SoA for short. This means to split the vertex into parts, and each part gets its own array. These arrays can usually be organized better w.r.t. to SSE.

Which semantics have the 9 floats of your vertex, and what operation should be done on them?

for each traingle (lets call him abc - it has vertices abc) I

need to cross  and normalize to get the normal ,

n = (b-a) x (c-a)

n = normalize (n)

then i need to multiply it by model_pos matrix

n = modelpos * n

then i need to calc some light factors

Color *=  (n * LightDir) * LightColor

though here with color it is a bit more complicated as this base color to multiply is in packed ARGB format not float3 as the whole rest

more info is in the anothers thread near here "how to optymize this with sse or sse intrinsics" - im a bit confused with this still as sse is a bit new

to me

##### Share on other sites

Make an array of __m128 and make your triangles and normals to be indices into that array.

To fill the elements of the __m128 make a union with an array of floats or define four elements a, b, c d or x,y,z,w. However you want.

For the transformation you go blindly through the array.

The compiler will detect the best way to use sse instructions.

##### Share on other sites

for each traingle (lets call him abc - it has vertices abc) I
need to cross  and normalize to get the normal ,

Presumably (but I'm not an SSE expert, so someone may contradict me): The best performance for such a problem comes with a memory layout where each SSE register holds the same components of 4 vertices. I.e.

uint count = ( vertices.length + 3 ) / 4;
__m128 verticesX[count];
__m128 verticesY[count];
__m128 verticesZ[count];

Fill the arrays with the data of the vertices a, b, c of the first 4-tuple triangles, then of the second 4-tuple of triangles, and so on. In memory you then have something like:

verticesX[0] : tri[0].vertex_a.x, tri[1].vertex_a.x, tri[2].vertex_a.x, tri[3].vertex_a.x
verticesX[1] : tri[0].vertex_b.x, tri[1].vertex_b.x, tri[2].vertex_b.x, tri[3].vertex_b.x
verticesX[2] : tri[0].vertex_c.x, tri[1].vertex_c.x, tri[2].vertex_c.x, tri[3].vertex_c.x
verticesX[3] : tri[4].vertex_a.x, tri[5].vertex_a.x, tri[6].vertex_a.x, tri[7].vertex_a.x
verticesX[4] : tri[4].vertex_b.x, tri[5].vertex_b.x, tri[6].vertex_b.x, tri[7].vertex_b.x
verticesX[5] : tri[4].vertex_c.x, tri[5].vertex_c.x, tri[6].vertex_c.x, tri[7].vertex_c.x
...
verticesX: analogously, but with the .y component

verticesZ: analogously, but with the .z component

Then computations along the scheme
dx01 = verticesX[i+0] - verticesX[i+1];
dy01 = verticesY[i+0] - verticesY[i+1];
dz01 = verticesZ[i+0] - verticesZ[i+1];
dx02 = verticesX[i+0] - verticesX[i+2];
dy02 = verticesY[i+0] - verticesY[i+2];
dz02 = verticesZ[i+0] - verticesZ[i+2];

nx = dy01 * dz02 - dz01 * dy02;
ny = dz01 * dx02 - dx01 * dz02;
nz = dx01 * dy02 - dy01 * dx02;

len = sqrt(nx * nx + ny * ny + nz * nz);

nx /= len;
ny /= len;
nz /= len;

should result in the normals of 4 triangles per run.

then i need to multiply it by model_pos matrix

Doing the same trickery with the model matrix requires each of its components to be replicated 4 times, so that each register holds 4 times the same value. It is not clear to me what "model_pos" means, but if it is the transform that relates the model to the world, all you need is the 3x3 sub-matrix that stores the rotational part since the vectors you are about to transform are direction vectors.

##### Share on other sites

the question was more for alignment , say

void f()

{

float3 x;

}

what 'keyword' tu use to align it to 16 bytes in GCC?

as to sse i will be trying to learn both vertical (this is 4 pack) way and horizontal way (harder but sometimes easier to integrate such procedure), tnx

##### Share on other sites

If you use __m128 type variables they are automaticly 16Byte aligned if you do not force a special alignment.

##### Share on other sites

Currently compilers are really bad at autovectorization.

haegarr already pointed a simple solution, however SoA is not really comfortable, especially if you want to modify a vertex buffer(the most common case) where the data is usually interlaced (AoS).

If you can convert interlaced data to SoA use _mm_stream_ps intrinsic.

If we are talking about the interlaced data eg. (Triangle triangles[trianglesMAX], where Triangle has 3 vector that contain 3 points)  and it is not possible(worth it) to convert to SoA:

alternatively i could copy each record to local variable buffer and thenwork on this - not sure if this wouldnt be better

Well this is a good solution. Because the data is not aligned you should use movups(_mm_loadu_ps for example) instructions to load your data into the xmm registers. While wrting SSE code keep in mind that __m128 is not just a type.

Pay attention when writing the cross product operation it is really tricky.

Here is my implementation. You should try to find a better solution

EDIT:

what 'keyword' tu use to align it to 16 bytes in GCC?

__attribute__(align(16)) GCC

__declspec(align(16)) MSVC

alignas(16) C++ 11 not implemented on all compilers yet.

cross (a.m_M128, b.m_M128 )

__m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)
__m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)

//i(ay*bz - by*az)  + j(bx*az - ax*bz)  + k(ax*by - bx*ay)
T = _mm_mul_ps(T, b.m_M128);//bx * ay, by * az, bz * ax
V = _mm_mul_ps(V, a.m_M128);//ax * by, ay * bz, az * bx
V = _mm_sub_ps(V, T);

//V holds the result
V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));


If you prefer to write autovectorization friendly code, please tell me I will do my best to help you.

Edited by imoogiBG

##### Share on other sites

I use a gcc-4.8.x compiler on my linux machine and it did  a good job even as I did not use __m128 data. Using a struct with 4 floats activated the SSE instruction set and made the software pretty fast. Multiplying the list of vectors by a matrix even has been converted to SSE instructions. It maybe getting better but only within single clockcycles range.

##### Share on other sites

I use a gcc-4.8.x compiler on my linux machine and it did  a good job even as I did not use __m128 data. Using a struct with 4 floats activated the SSE instruction set and made the software pretty fast. Multiplying the list of vectors by a matrix even has been converted to SSE instructions. It maybe getting better but only within single clockcycles range.

Just because there are SSE instructions that doesn't mean that the compiler did a great job.

For example if you want to sum 2 vec4 vectors and if you're not writing autovect friendly code lets say there will be 4 addss (+ tons of movss) instead of a single addps.

EDIT:

what 'keyword' tu use to align it to 16 bytes in GCC?

__attribute__((align(16))) GCC

__declspec(align(16)) MSVC

alignas(16) C++11 not implemented on all compilers yet.

Those will also modify the sizeof(TYPE).

Edited by imoogiBG

##### Share on other sites

You are right about the instructions used. But the time needed to transform about 300k vectors is nearly the same. This is what I ment before. The difference between hand optimized code and the output of the compiler may differ only in single clock cycles.

Sure it can be of some interest. But in most cases it makes more sense to do a highlevel optimization.

##### Share on other sites

You are right about the instructions used. But the time needed to transform about 300k vectors is nearly the same.....But in most cases it makes more sense to do a highlevel optimization.

Well this depends on the situation(prefetch friendlyness, instruction cache, code complexity). Usually I convet to sse code, that is executed more than one and it is crtical.

Of course I hate to write SSE directly, it breaks everything, the code is ugly....blah

But there are a lot of cases where intrinsic code is much faster. Hopefully In the future that code will be scraped.

Edited by imoogiBG

##### Share on other sites

Currently compilers are really bad at autovectorization.

haegarr already pointed a simple solution, however SoA is not really comfortable, especially if you want to modify a vertex buffer(the most common case) where the data is usually interlaced (AoS).

If you can convert interlaced data to SoA use _mm_stream_ps intrinsic.

If we are talking about the interlaced data eg. (Triangle triangles[trianglesMAX], where Triangle has 3 vector that contain 3 points)  and it is not possible(worth it) to convert to SoA:

alternatively i could copy each record to local variable buffer and thenwork on this - not sure if this wouldnt be better

Well this is a good solution. Because the data is not aligned you should use movups(_mm_loadu_ps for example) instructions to load your data into the xmm registers. While wrting SSE code keep in mind that __m128 is not just a type, the compiler will try to put every __m128 into the registers.

Pay attention when writing the cross product operation it is really tricky.

Here is my implementation. You should try to find a better solution

EDIT:

what 'keyword' tu use to align it to 16 bytes in GCC?

__attribute__(align(16)) GCC

__declspec(align(16)) MSVC

alignas(16) C++ 11 not implemented on all compilers yet.

cross (a.m_M128, b.m_M128 )

__m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)
__m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)

//i(ay*bz - by*az)  + j(bx*az - ax*bz)  + k(ax*by - bx*ay)
T = _mm_mul_ps(T, b.m_M128);//bx * ay, by * az, bz * ax
V = _mm_mul_ps(V, a.m_M128);//ax * by, ay * bz, az * bx
V = _mm_sub_ps(V, T);

//V holds the result
V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));


If you prefer to write autovectorization friendly code, please tell me I will do my best to help you.

alright,

as i understand there are two types of sse vectorization "vertical" when you are processing nearly the same way as scalar code but with 4-packs - and "horizontal"

- i would like to train a bit both of them, i got an idea to do some very basic tests today, i may give a results here if someone is interested

could you maybe say if i can typedef this "__attribute__(align(16)) GCC"

to get an always aligned version of "float3" struct i use for 3d?

i would like to typedef float3a (and float4a) type to be always aligned to 16 (but with no to much waste of spacei mean

float3a tab[1000]; should be aligned and have 16k bytes

##### Share on other sites

typedef __declspec(align(16)) vec3 vec3a; is ignored and vec3a is NOT aligned at 16. This is on MSVC and I think that gcc will do the same.

struct __attribute__((align(16))) avec3 : public vec3
{};

Should do the trick but pay attention that sizeof(vec3) != sizeof(avec3) != 12.
sizeof(avec3) should be 16.
Also keep in mind that new operator DOES NOT return aligned addresses, I you want to use new, overload it and use something like aligned_malloc.

EDIT:

I assume in this post that we are talking about float vectors.

as i understand there are two types of sse vectorization "vertical" when you are processing nearly the same way as scalar code but with 4-packs - and "horizontal"

No one calls it that way. Your algorithm depends on the data layout. So there are 2 major data layouts Structure of Arrays and Array of Strcutures (called interlaced).

Edited by imoogiBG

##### Share on other sites

typedef __declspec(align(16)) vec3 vec3a; is ignored and vec3a is NOT aligned at 16. This is on MSVC and I think that gcc will do the same.

struct __attribute__((align(16))) avec3 : public vec3
{};

Should do the trick but pay attention that sizeof(vec3) != sizeof(avec3) != 12.
sizeof(avec3) should be 16.
Also keep in mind that new operator DOES NOT return aligned addresses, I you want to use new, overload it and use something like aligned_malloc.

EDIT:

I assume in this post that we are talking about float vectors.

as i understand there are two types of sse vectorization "vertical" when you are processing nearly the same way as scalar code but with 4-packs - and "horizontal"

No one calls it that way. Your algorithm depends on the data layout. So there are 2 major data layouts Structure of Arrays and Array of Strcutures (called interlaced).

great,

this naming my be not official but i think this is worth to use as it describes two main styles of writing i think

i did some tests

struct  __attribute__((align(16))) float3A { float x,y,z;};
typedef __attribute__((align(16))) float3 float3At;

float3 f1;
float3A f2;
float3At f3;

void tests()
{

exit(0);

}

and opposite as you said the two first were aligned to 4 only the last one (this with typedef is aligned 16), very fine - this works *

if you would like to talk its possible that i would be having some more questions (dont wannt to be sse master but would like to train basics of it to be able to use it quick sometimes )

gives 12 12 12 not exactly you say, maybe compilers differ sadly im using  "# GNU C++ (tdm-1) version 4.7.1 (mingw32)" (c++ compilation though on c only code) - much tnx for pointing this out

* edit

damn it seemed that it was aligned only by chance so any one of this is not working

edit

this is working

typedef float3 float3At __attribute__ ((aligned (16)));

-> 16

but there is another problem

float3At tab[1000];

tests.c:118:18: error: alignment of array elements is greater than element size
Edited by fir

##### Share on other sites

You are right about the instructions used. But the time needed to transform about 300k vectors is nearly the same. This is what I ment before. The difference between hand optimized code and the output of the compiler may differ only in single clock cycles.

Sure it can be of some interest. But in most cases it makes more sense to do a highlevel optimization.

If so how will you explain such results:

for(int i=0; i<1000*1000; i++)
{
tab3[i] =  tab1[i] * tab2[i];    //multiply 1 mln floats
}

for(int i=0; i<1000*1000; i+=4)
{
__m128 a = _mm_load_ps(  (float*)&tab1[i] );
__m128 b = _mm_load_ps(  (float*)&tab2[i] );
__m128 c = _mm_mul_ps(  a,b );
_mm_store_ps(&tab3[i], c);

}

consecutive runs

floats

4.58 ms 4.58 ms 4.73 ms

m128

2.91 ms 2.95 ms 2.92 ms

?

its even more speedup that i have expected - i am expecting
more 30% of speedup than 3x becouse as far as i know, most codes are
ram bound not calculation bound so sse would not help much in this cases (as im not sure how it can halp with improving ram throughtput
but probably and sadly its abilities here counts zero )
But for me 30 % is sometimes worth it - Also in some othercases of
heavy calculation with a less ram flow it can bring more like 3 times
speedup

so speaking about no difference you are maybe spreading some misinformation (as some antyoptimization propaganda does - from my
experiences my efforts to optymize codes with "nanooptymizations" over algorythmic improvements in most cases was giving 2X 3X speedup
9relative to base case with -O3, i mean when starting with srightforward (but not laggy) c code compiled with -O3 searching for various "picooptymizations" often gave 2x sometimes 3x speedup)

##### Share on other sites

so speaking about no difference you are maybe spreading some misinformation (as some antyoptimization propaganda does - from my
experiences my efforts to optymize codes with "nanooptymizations" over algorythmic improvements in most cases was giving 2X 3X speedup
9relative to base case with -O3, i mean when starting with srightforward (but not laggy) c code compiled with -O3 searching for various "picooptymizations" often gave 2x sometimes 3x speedup)

Your first implementation is not a vector operation. It is a simple multiplication of values. And you do some highlevel optimizations, as I suggested, because you re-organize the data to take some benefits of vector operations. This has nothing todo with misinformation. You are doing some different things.

But think of it like you always do. You are right and I am some misinformation spreading noob. You are the king of codeing.

typedef float tElement;

typedef union tagVector {
__m128 xmm;
struct {
tElement x,y,z,w;
} e;
}tVector;

typedef struct tagMatrix {
tVector v[4];
} tMatrix;

#if 1

tiew[s].e.x = m.v[0].e.x*tiev[s].e.x + m.v[1].e.x*tiev[s].e.y + m.v[2].e.x*tiev[s].e.z +m.v[3].e.x*tiev[s].e.w;
tiew[s].e.y=m.v[0].e.y*tiev[s].e.x + m.v[1].e.y*tiev[s].e.y + m.v[2].e.y*tiev[s].e.z +m.v[3].e.y*tiev[s].e.w;
tiew[s].e.z=m.v[0].e.z*tiev[s].e.x + m.v[1].e.z*tiev[s].e.y + m.v[2].e.z*tiev[s].e.z +m.v[3].e.z*tiev[s].e.w;
tiew[s].e.w=m.v[0].e.w*tiev[s].e.x + m.v[1].e.w*tiev[s].e.y + m.v[2].e.w*tiev[s].e.z +m.v[3].e.w*tiev[s].e.w;
#else
a=_mm_mul_ps(tiev[s].xmm, m.v[0].xmm);
b=_mm_mul_ps(tiev[s].xmm, m.v[1].xmm);
c=_mm_mul_ps(tiev[s].xmm, m.v[2].xmm);
d=_mm_mul_ps(tiev[s].xmm, m.v[3].xmm);
#endif

The two make not many difference in speed. The first generates lots of skalar operations the second use the packed operations as expected. And the first is minimal faster, for what hell ever.

By giving the compiler a hint what datatype you use, __m128, it is able to produce highly optimized code that maybe only be topped by timeconsuming developing nano optimized code. The difference is..... the first, faster version is portable to any architecture.

##### Share on other sites

so speaking about no difference you are maybe spreading some misinformation (as some antyoptimization propaganda does - from my
experiences my efforts to optymize codes with "nanooptymizations" over algorythmic improvements in most cases was giving 2X 3X speedup
9relative to base case with -O3, i mean when starting with srightforward (but not laggy) c code compiled with -O3 searching for various "picooptymizations" often gave 2x sometimes 3x speedup)

Your first implementation is not a vector operation. It is a simple multiplication of values. And you do some highlevel optimizations, as I suggested, because you re-organize the data to take some benefits of vector operations. This has nothing todo with misinformation. You are doing some different things.

But think of it like you always do. You are right and I am some misinformation spreading noob. You are the king of codeing.

what?

this is the revriting the loop to sse so this is the thing talken here

(though i must say now that the benchmark is very context sensitive - i moved the loops to some outer loop and i cannot reproduce the results they seem now about equal in speed) (need some more testing and some more arithmetic cases to show the speedup as i said most code cases are ram bound i think)

besides who said i am the king of coding? why you inventing that? im moderately experienced but i dislike beliving in statements who are not good substantiation - and im also indifferent to propaganda

Edited by fir

##### Share on other sites

You are right. I am the noob. You did a great job in rewriting a loop.

##### Share on other sites

You are right. I am the noob.

Youre wrong - i was not saying you are a noob - I dont know what youare talkin about at all

I was refering to this

"I use a gcc-4.8.x compiler on my linux machine and it did  a good job even as I did not use __m128 data. Using a struct with 4 floats activated the SSE instruction set and made the software pretty fast. Multiplying the list of vectors by a matrix even has been converted to SSE instructions. It maybe getting better but only within single clockcycles range."

if you are saying that compiler is doing close thing to hand sse optymizations i think youre wrong and saying so you propaply are spreading misleading information (though still i cannot prove it but i see no reasons to belive this ) - some mine previous experiences with the things some call "nanooptymizations" showed that this general statemants (that it bring no optimization) were  not true in my cases

Edited by fir

##### Share on other sites

Yes this is true.

##### Share on other sites

Guys take it easy.

Could you provide the both multiplication test cases code?  I think that something is not clear.

##### Share on other sites

Guys take it easy.

Could you provide the both multiplication test cases code?  I think that something is not clear.

I think my first results was wrong it probably was dependant on warming of the cache  - now i see no difference in such simple multiplication

will do some more tests later

(as ti test cases i use  QueryPerformanceCounter it returns some ticks that you could translate to nanoseconds etc - then i just put test code beetween two QueryPerformanceCounter calls  - it is also good to put it into some loop etc, sometimes i was using rtdsc call too but rarely)

Edited by fir

##### Share on other sites
By giving the compiler a hint what datatype you use, __m128, it is able to produce highly optimized code

Well this is kind of true. The optimization comes from the default alignment, not because of the __m128 itself. Another great hit to the autovectorizator is by using directly arrays.

Im going to show a bit silly example but it show what compilers are doing and what they don't. I will be using latest MS compiler, GCC would to similar things, clang probably will give best results(I have no idea actually never worked with that compiler, only heard that autovect is really good on clang).

the code:

template<class T>
void init(T& a)
{
}

template<class T>
void use(const T& a)
{
fwrite(&a , sizeof(T), 1, (FILE*)53);
}

union Vec
{
struct {float x,y,z,w;};
float v[4];
};

init and use are functions that will disable the compile time computations.

Compiling

Vec a,b, r;

init(a);
init(b);

r.x = a.x + b.x;
r.y = a.y + b.y;
r.z = a.z + b.z;
r.w = a.w + b.w; use(r);

will produce:

.....bla bla.....
movss       dword ptr [ebp-34h],xmm0
movss       xmm0,dword ptr [ebp-20h]
movss       dword ptr [ebp-30h],xmm0
movss       xmm0,dword ptr [ebp-1Ch]
movss       dword ptr [ebp-2Ch],xmm0
movss       xmm0,dword ptr [ebp-18h]
movss       dword ptr [ebp-28h],xmm0  .....bla

Compiling

for(int t = 0; t < 4; ++t)
{
r.v[t] = a.v[t] + b.v[t];
}

Will produce:

bla.....blaaa movups      xmm1,xmmword ptr [esp+34h]
push        35h
movups      xmm0,xmmword ptr [esp+28h]
push        1
lea         eax,[esp+4Ch]
push        10h
push        eax
movups      xmmword ptr [esp+54h],xmm1
blaaaa

The second version is really nice and absolutely *faster*.

If we add __m128 (as tribad says) or add align 16 we will get a bit better results.

Yes this is silly you should profile real applications! I've learned that the bad way

EDIT:

There are missing instrcutions in the 1st case(1 more addss) but you've got the point.

Adding alignment to the first(xyzw) case will cure the instruction bloat but :

I'm strongly against adding a (__m128)/(adding align) members to a general purpose strcture for 2 reasons:

1) the new operator and the alignment

2) portability

You should use some ugly named strictures that scream SIMD_SEE_STUFF_VECTOR_NO_GENERAL_PURPOUSE_TYPE_DUDE something like this.

EDIT2:

@fir I don't know what you're trying to achieve? Is it an urgent thing for work or you're just learning?

If you're learning I suggest you to look at already written math libraries DirectXMath glm Bullet Vectormath and see how they are used in big projects, how they are using things. Then pick a small demo from DIrectX SDK for example (the triangle picking one) rewrite it using SSE and autovect code, profile the results, this is the best way to learn how to SSE.

Edited by imoogiBG

##### Share on other sites
This topic is now closed to further replies.

• ## Partner Spotlight

• ### Forum Statistics

• Total Topics
627680
• Total Posts
2978609

• 13
• 12
• 10
• 12
• 22