# sse-alignment troubles

33 replies to this topic

### #21 Tribad   Members   -  Reputation: 955

Posted 29 June 2014 - 12:56 PM

Yes this is true.

### #22 imoogiBG   Members   -  Reputation: 1465

Posted 29 June 2014 - 01:07 PM

Guys take it easy.

Could you provide the code for both multiplication test cases? I think that something is not clear.

### #23 fir   Members   -  Reputation: -460

Posted 29 June 2014 - 01:12 PM

> Guys take it easy.
>
> Could you provide the code for both multiplication test cases? I think that something is not clear.

I think my first results were wrong; they probably depended on the cache being warmed up - now I see no difference in such a simple multiplication

will do some more tests later

(as to test cases: I use QueryPerformanceCounter, which returns ticks that you can translate to nanoseconds etc. - then I just put the test code between two QueryPerformanceCounter calls - it is also good to put it into some loop etc.; sometimes I was using rdtsc too, but rarely)
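The "code between two timer reads, inside a loop" approach described above could look roughly like the sketch below. It is not fir's actual test: `best_time_ns` and `scale_and_sum` are hypothetical names, and `std::chrono::steady_clock` is used as a portable stand-in for the `QueryPerformanceCounter` pair.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` `runs` times and returns the fastest
// wall-clock time in nanoseconds. std::chrono::steady_clock stands in for
// the two QueryPerformanceCounter calls described in the post.
template <class F>
long long best_time_ns(F&& work, int runs = 10)
{
    long long best = -1;
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (best < 0 || ns < best) best = ns;   // keep the fastest run
    }
    return best;
}

// Hypothetical workload: multiply every element by a constant and sum it up.
inline float scale_and_sum(std::vector<float>& data, float k)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] *= k;
        sum += data[i];
    }
    return sum;   // returned so the optimizer cannot discard the loop
}
```

Taking the minimum over several runs is one common choice; it hides warm-up effects, which is exactly why the first run and the later runs should also be looked at separately.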

Edited by fir, 29 June 2014 - 01:21 PM.

### #24 phantom   Moderators   -  Reputation: 8496

Posted 29 June 2014 - 01:40 PM

Your "test" is a simple forward array walk over 3.81Meg of data; this is the thing a CPU's pre-fetcher eats for lunch and 3.81Meg is probably going to sit in the L2 cache of your CPU quite nicely too so a second run right after the fact is going to remove the memory fetch difference.
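A minimal sketch of how to see this effect: time several back-to-back walks over the same array and keep each run's time instead of a single number. The names `walk` and `time_runs_ns` are made up for illustration; 1,000,000 floats (4,000,000 bytes, about 3.81 MiB) would be one array size that matches the figure above, but that is an assumption about the original test.

```cpp
#include <chrono>
#include <vector>

// Walks the array front to back once and returns the sum.
inline float walk(const std::vector<float>& data)
{
    float sum = 0.0f;
    for (float x : data) sum += x;
    return sum;
}

// Times `runs` back-to-back walks over the same array, one entry per run.
// Run 0 pays for whatever is not yet in cache; later runs start warm.
inline std::vector<long long> time_runs_ns(const std::vector<float>& data,
                                           int runs, float& sink)
{
    std::vector<long long> times;
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        sink = walk(data);   // stored so the walk is not optimized away
        const auto stop = std::chrono::steady_clock::now();
        times.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    return times;
}
```

Comparing `times[0]` against the later entries shows how much of a measurement is memory fetch rather than arithmetic; the actual numbers depend entirely on the machine.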

Things like this are why rubbish micro-benchmarks are no good.
If you used a real profiler, instead of something lashed together with a poor timing system, then you would have seen that 'oh, memory and timings are different'.

You can NOT reason about a program in the way you want without either bags more of experience (which you don't have) or the correct tools (which you refuse to use for god knows what reason).

You then compound this by basically accusing everyone of lying with calls of 'propaganda' about well established facts which have been learnt through years of research by others and then passed down to those who have come afterwards.

Your whole line of thought and reasoning, in this thread and the 3 others you have active (which is frankly overkill; you are lucky I wasn't about this weekend otherwise I might have closed them to keep the discussions focused as it's basically all the same subject!), is deeply deeply flawed.

### #25 imoogiBG   Members   -  Reputation: 1465

Posted 29 June 2014 - 01:42 PM

> By giving the compiler a hint what datatype you use, __m128, it is able to produce highly optimized code

Well, this is kind of true. The optimization comes from the default alignment, not from __m128 itself. Another great hint to the auto-vectorizer is using arrays directly.

I'm going to show a slightly silly example, but it shows what compilers do and what they don't. I will be using the latest MS compiler; GCC would do similar things, and clang will probably give the best results (I have no idea actually, I've never worked with that compiler, I've only heard that auto-vectorization is really good on clang).

the code:

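#include <cstdio>   // fwrite, FILE

// Placeholder for real initialization (intentionally empty).
template<class T>
void init(T& a)
{
}

// Forces the result to be "used". The bogus FILE* means this is only
// meant for reading the disassembly; do not actually run it.
template<class T>
void use(const T& a)
{
    fwrite(&a, sizeof(T), 1, (FILE*)53);
}

union Vec
{
    struct { float x, y, z, w; };   // anonymous struct (compiler extension)
    float v[4];
};

init and use are only there to stop the compiler from doing the computation at compile time.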

Compiling

Vec a, b, r;

init(a);
init(b);

r.x = a.x + b.x;
r.y = a.y + b.y;
r.z = a.z + b.z;
r.w = a.w + b.w;
use(r);

will produce:

.....bla bla.....
movss       dword ptr [ebp-34h],xmm0
movss       xmm0,dword ptr [ebp-20h]
movss       dword ptr [ebp-30h],xmm0
movss       xmm0,dword ptr [ebp-1Ch]
movss       dword ptr [ebp-2Ch],xmm0
movss       xmm0,dword ptr [ebp-18h]
movss       dword ptr [ebp-28h],xmm0
.....bla

Compiling

for (int t = 0; t < 4; ++t)
{
    r.v[t] = a.v[t] + b.v[t];
}

Will produce:

bla.....blaaa
movups      xmm1,xmmword ptr [esp+34h]
push        35h
movups      xmm0,xmmword ptr [esp+28h]
push        1
lea         eax,[esp+4Ch]
push        10h
push        eax
movups      xmmword ptr [esp+54h],xmm1
blaaaa

The second version is really nice and absolutely *faster*.
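To repeat the comparison on another compiler, the two variants can be put into separate functions and the generated assembly inspected (for example with `/FA` on MSVC or `-S` on GCC/clang, with optimizations on). This is only a sketch: the names `VecXYZW`, `VecArr`, `add_members` and `add_loop` are made up here, and two plain structs replace the union so that no compiler extension is needed. What the compiler emits will differ between compilers, versions and flags.

```cpp
// Same data, two layouts: named members vs. a plain array.
struct VecXYZW { float x, y, z, w; };
struct VecArr  { float v[4]; };

// Member-by-member addition, as in the first listing.
VecXYZW add_members(VecXYZW a, VecXYZW b)
{
    VecXYZW r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

// Loop over an array, as in the second listing; a fixed trip count of 4
// is an easy pattern for an auto-vectorizer to recognize.
VecArr add_loop(VecArr a, VecArr b)
{
    VecArr r;
    for (int t = 0; t < 4; ++t)
    {
        r.v[t] = a.v[t] + b.v[t];
    }
    return r;
}
```

Passing and returning by value keeps aliasing out of the picture, so any difference in the output comes from how the additions are written rather than from pointer analysis.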

If we add __m128 (as Tribad says) or add 16-byte alignment we will get slightly better results.

Yes, this is silly, you should profile real applications! I've learned that the hard way.

EDIT:

There are instructions missing from the listing for the 1st case (one more addss) but you get the point.

Adding alignment to the first (xyzw) case will cure the instruction bloat, but:

I'm strongly against adding __m128 members or alignment to a general-purpose structure, for 2 reasons:

1) the new operator and the alignment

2) portability

You should use some ugly-named structures that scream SIMD_SEE_STUFF_VECTOR_NO_GENERAL_PURPOUSE_TYPE_DUDE, something like this.
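A sketch of that separation, under stated assumptions: `Vec4`, `SimdVec4`, `is_aligned_16` and `to_simd` are hypothetical names, and `alignas` is the standard C++11 spelling used here in place of compiler-specific alignment keywords. The general-purpose type keeps its default alignment, and only the loudly named type asks for 16 bytes.

```cpp
#include <cstdint>

// General-purpose vector: default alignment, safe to new/store anywhere.
struct Vec4 { float x, y, z, w; };

// SIMD-only type: 16-byte aligned, meant for hot loops and not as a
// general-purpose member of other structures.
struct alignas(16) SimdVec4 { float v[4]; };

static_assert(alignof(Vec4) == alignof(float), "Vec4 keeps float alignment");
static_assert(alignof(SimdVec4) == 16, "SimdVec4 must be 16-byte aligned");
static_assert(sizeof(SimdVec4) == 16, "SimdVec4 must be exactly four floats");

// True if p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Explicit conversion at the boundary between general and SIMD code.
inline SimdVec4 to_simd(const Vec4& a)
{
    SimdVec4 r = {{a.x, a.y, a.z, a.w}};
    return r;
}
```

The point about `new` still applies: before C++17, plain `new` is not required to honor the extra alignment, so heap allocations of such a type need an aligned allocator or C++17 aligned `new`.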

EDIT2:

@fir I don't know what you're trying to achieve. Is it an urgent thing for work, or are you just learning?

If you're learning, I suggest you look at already written math libraries (DirectXMath, glm, Bullet, Vectormath) and see how they are used in big projects. Then pick a small demo from the DirectX SDK, for example (the triangle picking one), rewrite it using SSE and auto-vectorized code, and profile the results. This is the best way to learn SSE.

Edited by imoogiBG, 29 June 2014 - 05:46 PM.

### #26 fir   Members   -  Reputation: -460

Posted 29 June 2014 - 02:17 PM

In general it is sad that SSE will not improve memory throughput even a bit (as I have known for something like 3 years, and this is the greatest disappointment here), so it is probably only able to improve register arithmetic, and I have no intuition whether that will help in my rasterizer case. I will need to do some tests and then I will know.

I'm not expecting too much, as I said, but some 'training' is not bad ;\

### #27 fir   Members   -  Reputation: -460

Posted 29 June 2014 - 02:31 PM

> @fir I don't know what you're trying to achieve. Is it an urgent thing for work, or are you just learning?
>
> If you're learning, I suggest you look at already written math libraries (DirectXMath, glm, Bullet, Vectormath) and see how they are used in big projects. Then pick a small demo from the DirectX SDK, for example (the triangle picking one), rewrite it using SSE and auto-vectorized code, and profile the results. This is the best way to learn SSE.

I want to do some 'training' - I can spend a week or so on this - nothing urgent, but I want to be able to write simple intrinsics stuff

A good way, I think, is to rewrite a couple of simple snippets from scalar into 4-packs and compare the times - though as I said, I now have the feeling that what is worth rewriting is mostly (or only) the stuff that does a larger amount of register arithmetic - the hotspots that are really memory flow with a moderate amount of arithmetic will probably gain almost nothing
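As a starting point for that kind of exercise, here is a minimal sketch of one scalar snippet next to its 4-pack rewrite. `add_scalar` and `add_sse` are hypothetical names; the intrinsics (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) are the standard SSE ones from `<xmmintrin.h>`, and unaligned loads are used so the arrays need no special alignment.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// Scalar reference: out[i] = a[i] + b[i].
inline void add_scalar(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// 4-pack version: four floats per iteration, scalar tail for the rest.
inline void add_sse(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];      // leftover elements
}
```

A single add per loaded element is exactly the "memory flow with a moderate amount of arithmetic" case, so on large arrays this pair is likely to show little difference; a kernel with many more operations per load is where the gap should appear. The compiler may also auto-vectorize `add_scalar` on its own, so check the disassembly before comparing times.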

### #28 imoogiBG   Members   -  Reputation: 1465

Posted 29 June 2014 - 03:20 PM

This is my last post in this thread, simply because someone is downvoting fir's posts for no reason.

Don't get me wrong, I'm not defending/offending him, I'm just trying to exchange useful information with all of you.

I think that if you put a downvote on someone's post, you MUST tell him why he is wrong; at the end of the day we are programmers, we use facts.

If someone is asking the same question again and again, tell him that you've already answered it and explain to him how the current issue is related to the old one. If someone refuses to listen, leave one downvote with a comment and abandon the thread.

Just... we are losing the point of the forums; it is supposed to be a place where we become better programmers, not a place for flaming.

Yes, fir is a bit offensive in his posts, but maybe this is the way he speaks/thinks in English, I don't know, but it is obvious that (at least in this topic) the offensiveness isn't intended.

Let's just use nice words and say meaningful things.

Edited by imoogiBG, 29 June 2014 - 03:21 PM.

### #29 fir   Members   -  Reputation: -460

Posted 29 June 2014 - 03:25 PM

> This is my last post in this thread, simply because someone is downvoting fir's posts for no reason.
>
> Don't get me wrong, I'm not defending/offending him, I'm just trying to exchange useful information with all of you.
>
> I think that if you put a downvote on someone's post, you MUST tell him why he is wrong; at the end of the day we are programmers, we use facts.
>
> If someone is asking the same question again and again, tell him that you've already answered it and explain to him how the current issue is related to the old one. If someone refuses to listen, leave one downvote with a comment and abandon the thread.
>
> Just... we are losing the point of the forums; it is supposed to be a place where we become better programmers, not a place for flaming.
>
> Yes, fir is a bit offensive in his posts, but maybe this is the way he speaks/thinks in English, I don't know, but it is obvious that (at least in this topic) the offensiveness isn't intended.
>
> Let's just use nice words and say meaningful things.

don't worry, I very much like these downvotes (I'm more worried when I'm upvoted; happily I'm on the way to the biggest score here, real nice)

(help me get to -20 000, that's something I'm really looking for; do not upvote me, it's banal)

Edited by fir, 29 June 2014 - 03:39 PM.

### #30 Madhed   Crossbones+   -  Reputation: 3453

Posted 29 June 2014 - 05:09 PM

http://www.codersnotes.com/sleepy

### #31 BitMaster   Crossbones+   -  Reputation: 5151

Posted 30 June 2014 - 01:17 AM

> Don't get me wrong, I'm not defending/offending him, I'm just trying to exchange useful information with all of you.
> I think that if you put a downvote on someone's post, you MUST tell him why he is wrong; at the end of the day we are programmers, we use facts.

In an ideal world, yes. In practice, you need to check fir's history. The surprising thing is not that he is getting downvoted without further comment; the surprising thing is that there are people left who are actually willing to engage with him in something resembling a constructive way. Whatever else he is, he is very good at burning bridges for absolutely no reason at all.

### #32 fir   Members   -  Reputation: -460

Posted 30 June 2014 - 02:15 AM

> Don't get me wrong, I'm not defending/offending him, I'm just trying to exchange useful information with all of you.
> I think that if you put a downvote on someone's post, you MUST tell him why he is wrong; at the end of the day we are programmers, we use facts.
>
> In an ideal world, yes. In practice, you need to check fir's history. The surprising thing is not that he is getting downvoted without further comment; the surprising thing is that there are people left who are actually willing to engage with him in something resembling a constructive way. Whatever else he is, he is very good at burning bridges for absolutely no reason at all.

this topic is about SSE - I think you should answer that, not add to the tons of blabla already present here.

### #33 Ohforf sake   Members   -  Reputation: 1949

Posted 30 June 2014 - 03:48 PM

Sadly the other thread with the weird stack alignment bug got closed, so I can't post there anymore. But it seems like fir in his foresight created enough of them for everyone, so I'm just gonna post here.

For those of you, other than fir, who might have stumbled (or at some point in the future will stumble) across the same problem (aligned loads from the stack, generated by the compiler, hitting unaligned addresses), do not despair. There is another tool, next to the debugger, that fir also doesn't need, but that is very handy in this case, and which helped me a lot when I had to solve the very same problem (and I'm only sharing this because it was a real WTF? moment for me). This tool is called google.

Just google for "GCC windows stack alignment" and pick (for example) the 3rd link that comes up, and you will get a nice explanation of the problem, alongside the solution.
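For readers who land here without the link: the post does not say which fix that page gives, so the following is an assumption based on how this problem is usually handled. 32-bit GCC assumes a 16-byte aligned stack on function entry, while code entered from outside (thread start routines, OS callbacks) may arrive with only 4-byte alignment. The two commonly used remedies are the `-mstackrealign` compiler flag and the `force_align_arg_pointer` function attribute on such entry points. A minimal sketch, where `REALIGN_STACK`, `Vec4` and `entry_point_local_is_aligned` are hypothetical names:

```cpp
#include <cstdint>

// Only meaningful for GCC-compatible compilers targeting 32-bit x86;
// everywhere else the macro expands to nothing.
#if defined(__GNUC__) && defined(__i386__)
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

struct alignas(16) Vec4 { float v[4]; };

// Stand-in for an entry point called by code you do not control
// (for example a thread start routine). With the attribute, the function
// prologue re-aligns the stack before any aligned local is touched.
REALIGN_STACK bool entry_point_local_is_aligned()
{
    Vec4 local = {{1.0f, 2.0f, 3.0f, 4.0f}};
    return reinterpret_cast<std::uintptr_t>(&local) % 16 == 0;
}
```

Building the whole program with `-mstackrealign` has the same effect without annotating individual functions, at the cost of a slightly longer prologue in every function.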

Or, you can post a disassembly dump in the nearest internet community and wait for someone to figure it out, while you chill and play tetris *ducks and runs for the door*.

### #34 jbadams   Senior Staff   -  Reputation: 21919

Posted 01 July 2014 - 06:42 AM

This is obviously no longer on topic and is unlikely to get back to it -- topic closed.