Archived

This topic is now archived and is closed to further replies.

Kefdman

is it me or my compiler???

Recommended Posts

Kefdman    122
Hi! I am obtaining strange results from my program... Im using this function to multiply two 4x4 matrices together, and it is running really slow. I made another version of it and it is really fast. I really cant figure out why one is faster than the other. Is it me or my compiler that is causing that???? Here''s the slow one, it takes 29 seconds to multiply 1000000 matrices: inline static void m_mult(float * output, const float * matrice1, const float * matrice2) { // ligne 1 output[0] = matrice1[0] * matrice2[0] + matrice1[1] * matrice2[4] + matrice1[2] * matrice2[8] + matrice1[3] * matrice2[12]; output[1] = matrice1[0] * matrice2[1] + matrice1[1] * matrice2[5] + matrice1[2] * matrice2[9] + matrice1[3] * matrice2[13]; output[2] = matrice1[0] * matrice2[2] + matrice1[1] * matrice2[6] + matrice1[2] * matrice2[10] + matrice1[3] * matrice2[14]; output[3] = matrice1[0] * matrice2[3] + matrice1[1] * matrice2[7] + matrice1[2] * matrice2[11] + matrice1[3] * matrice2[15]; // ligne 2 output[4] = matrice1[4] * matrice2[0] + matrice1[5] * matrice2[4] + matrice1[6] * matrice2[8] + matrice1[7] * matrice2[12]; output[5] = matrice1[4] * matrice2[1] + matrice1[5] * matrice2[5] + matrice1[6] * matrice2[9] + matrice1[7] * matrice2[13]; output[6] = matrice1[4] * matrice2[2] + matrice1[5] * matrice2[6] + matrice1[6] * matrice2[10] + matrice1[7] * matrice2[14]; output[7] = matrice1[4] * matrice2[3] + matrice1[5] * matrice2[7] + matrice1[6] * matrice2[11] + matrice1[7] * matrice2[15]; // ligne 3 output[8] = matrice1[8] * matrice2[0] + matrice1[9] * matrice2[4] + matrice1[10] * matrice2[8] + matrice1[11] * matrice2[12]; output[9] = matrice1[8] * matrice2[1] + matrice1[9] * matrice2[5] + matrice1[10] * matrice2[9] + matrice1[11] * matrice2[13]; output[10] = matrice1[8] * matrice2[2] + matrice1[9] * matrice2[6] + matrice1[10] * matrice2[10] + matrice1[11] * matrice2[14]; output[11] = matrice1[8] * matrice2[3] + matrice1[9] * matrice2[7] + matrice1[10] * matrice2[11] + matrice1[11] * matrice2[15]; // ligne 4 output[12] = matrice1[12] * matrice2[0] + matrice1[13] * matrice2[4] + matrice1[14] * matrice2[8] + matrice1[15] * matrice2[12]; output[13] = matrice1[12] * matrice2[1] + matrice1[13] * matrice2[5] + matrice1[14] * matrice2[9] + matrice1[15] * matrice2[13]; output[14] = matrice1[12] * matrice2[2] + matrice1[13] * matrice2[6] + matrice1[14] * matrice2[10] + matrice1[15] * matrice2[14]; output[15] = matrice1[12] * matrice2[3] + matrice1[13] * matrice2[7] + matrice1[14] * matrice2[11] + matrice1[15] * matrice2[15]; } And here is the faster one, it takes 3 seconds to do the same thing!!! inline static MATRIX m_mult(const MATRIX matrice1, const MATRIX matrice2) { MATRIX output; // ligne 1 output.element[0] = matrice1.element[0] * matrice2.element[0] + matrice1.element[1] * matrice2.element[4] + matrice1.element[2] * matrice2.element[8] + matrice1.element[3] * matrice2.element[12]; output.element[1] = matrice1.element[0] * matrice2.element[1] + matrice1.element[1] * matrice2.element[5] + matrice1.element[2] * matrice2.element[9] + matrice1.element[3] * matrice2.element[13]; output.element[2] = matrice1.element[0] * matrice2.element[2] + matrice1.element[1] * matrice2.element[6] + matrice1.element[2] * matrice2.element[10] + matrice1.element[3] * matrice2.element[14]; output.element[3] = matrice1.element[0] * matrice2.element[3] + matrice1.element[1] * matrice2.element[7] + matrice1.element[2] * matrice2.element[11] + matrice1.element[3] * matrice2.element[15]; // ligne 2 output.element[4] = matrice1.element[4] * matrice2.element[0] + matrice1.element[5] * matrice2.element[4] + matrice1.element[6] * matrice2.element[8] + matrice1.element[7] * matrice2.element[12]; output.element[5] = matrice1.element[4] * matrice2.element[1] + matrice1.element[5] * matrice2.element[5] + matrice1.element[6] * matrice2.element[9] + matrice1.element[7] * matrice2.element[13]; output.element[6] = matrice1.element[4] * matrice2.element[2] + matrice1.element[5] * matrice2.element[6] + matrice1.element[6] * matrice2.element[10] + matrice1.element[7] * matrice2.element[14]; output.element[7] = matrice1.element[4] * matrice2.element[3] + matrice1.element[5] * matrice2.element[7] + matrice1.element[6] * matrice2.element[11] + matrice1.element[7] * matrice2.element[15]; // ligne 3 output.element[8] = matrice1.element[8] * matrice2.element[0] + matrice1.element[9] * matrice2.element[4] + matrice1.element[10] * matrice2.element[8] + matrice1.element[11] * matrice2.element[12]; output.element[9] = matrice1.element[8] * matrice2.element[1] + matrice1.element[9] * matrice2.element[5] + matrice1.element[10] * matrice2.element[9] + matrice1.element[11] * matrice2.element[13]; output.element[10] = matrice1.element[8] * matrice2.element[2] + matrice1.element[9] * matrice2.element[6] + matrice1.element[10] * matrice2.element[10] + matrice1.element[11] * matrice2.element[14]; output.element[11] = matrice1.element[8] * matrice2.element[3] + matrice1.element[9] * matrice2.element[7] + matrice1.element[10] * matrice2.element[11] + matrice1.element[11] * matrice2.element[15]; // ligne 4 output.element[12] = matrice1.element[12] * matrice2.element[0] + matrice1.element[13] * matrice2.element[4] + matrice1.element[14] * matrice2.element[8] + matrice1.element[15] * matrice2.element[12]; output.element[13] = matrice1.element[12] * matrice2.element[1] + matrice1.element[13] * matrice2.element[5] + matrice1.element[14] * matrice2.element[9] + matrice1.element[15] * matrice2.element[13]; output.element[14] = matrice1.element[12] * matrice2.element[2] + matrice1.element[13] * matrice2.element[6] + matrice1.element[14] * matrice2.element[10] + matrice1.element[15] * matrice2.element[14]; output.element[15] = matrice1.element[12] * matrice2.element[3] + matrice1.element[13] * matrice2.element[7] + matrice1.element[14] * matrice2.element[11] + matrice1.element[15] * matrice2.element[15]; return output; } BTW: im using gcc 2.95.4 20011006, on a celeron laptop, on linux 2.4 Ah! i noticed another thing: when i enable -O3 -march=i686 -s -ffast-math it becomes about 10% slower...

Share this post


Link to post
Share on other sites
Guest Anonymous Poster   
Guest Anonymous Poster
It''s hard to say why the two are different without knowing how MATRIX is defined.

I notice you''re using float values in the first slower one. I seem to recall that doubles are actually more efficient on modern processors. Anyone out there know if this is right?

Share this post


Link to post
Share on other sites
Mark_C    122
I didnt put a timer in there but with the "slow one" I can do a million in about the same amount of time 2-3 seconds. Using VC++ 6.0 in Windows I highly doubt VC++ is working better than GCC or Windows is working better than Linux.

Share this post


Link to post
Share on other sites
Kefdman    122
So i am gonna to use the faster one, but i still cant understand why it is faster... before writing them, i tought that the one with pointers to floats would be much faster than the one that returns 16 floats, moving them around uselessly in memory. Anyways, my code in general is so messed up that one more or one less thing like that wont matter much.
thank for answering.

Share this post


Link to post
Share on other sites
Kefdman    122
quote:
Original post by a person
its called turning on the optimization flags. possible you forgot to do that, since gcc dont optimize by default -O2 should help.


i tried it with -O3 -march=i686 -s -ffast-math
and it slowed it down by about 10%

Share this post


Link to post
Share on other sites
NickB    146
Try making your compiler "assume no aliasing" for the function (could try for the whole file, but that may be dangerous) - see explaination later in this post. In MSVC you can do this by inserting this pragma before the function:
  
#pragma optimise("a",on)

inline static void m_mult(float * output, const float * matrice1, const float * matrice2)
{
// ligne 1

output[0] = matrice1[0] * matrice2[0] + matrice1[1] * matrice2[4] + matrice1[2] * matrice2[8] + matrice1[3] * matrice2[12];
output[1] = matrice1[0] * matrice2[1] + matrice1[1] * matrice2[5] + matrice1[2] * matrice2[9] + matrice1[3] * matrice2[13];
output[2] = matrice1[0] * matrice2[2] + matrice1[1] * matrice2[6] + matrice1[2] * matrice2[10] + matrice1[3] * matrice2[14];
output[3] = matrice1[0] * matrice2[3] + matrice1[1] * matrice2[7] + matrice1[2] * matrice2[11] + matrice1[3] * matrice2[15];
// ligne 2

output[4] = matrice1[4] * matrice2[0] + matrice1[5] * matrice2[4] + matrice1[6] * matrice2[8] + matrice1[7] * matrice2[12];
output[5] = matrice1[4] * matrice2[1] + matrice1[5] * matrice2[5] + matrice1[6] * matrice2[9] + matrice1[7] * matrice2[13];
output[6] = matrice1[4] * matrice2[2] + matrice1[5] * matrice2[6] + matrice1[6] * matrice2[10] + matrice1[7] * matrice2[14];
output[7] = matrice1[4] * matrice2[3] + matrice1[5] * matrice2[7] + matrice1[6] * matrice2[11] + matrice1[7] * matrice2[15];
// ligne 3

output[8] = matrice1[8] * matrice2[0] + matrice1[9] * matrice2[4] + matrice1[10] * matrice2[8] + matrice1[11] * matrice2[12];
output[9] = matrice1[8] * matrice2[1] + matrice1[9] * matrice2[5] + matrice1[10] * matrice2[9] + matrice1[11] * matrice2[13];
output[10] = matrice1[8] * matrice2[2] + matrice1[9] * matrice2[6] + matrice1[10] * matrice2[10] + matrice1[11] * matrice2[14];
output[11] = matrice1[8] * matrice2[3] + matrice1[9] * matrice2[7] + matrice1[10] * matrice2[11] + matrice1[11] * matrice2[15];
// ligne 4

output[12] = matrice1[12] * matrice2[0] + matrice1[13] * matrice2[4] + matrice1[14] * matrice2[8] + matrice1[15] * matrice2[12];
output[13] = matrice1[12] * matrice2[1] + matrice1[13] * matrice2[5] + matrice1[14] * matrice2[9] + matrice1[15] * matrice2[13];
output[14] = matrice1[12] * matrice2[2] + matrice1[13] * matrice2[6] + matrice1[14] * matrice2[10] + matrice1[15] * matrice2[14];
output[15] = matrice1[12] * matrice2[3] + matrice1[13] * matrice2[7] + matrice1[14] * matrice2[11] + matrice1[15] * matrice2[15];
}


EXPLAINATION:

I haven''t had time to check this out myself, but the issue is probably due to the compiler assuming that aliasing may be occuring. Basically the problem as far as the compiler is concerned is that the matrix you are writing to (output) _MIGHT_ be the same one as one of your input matricies (it can''t know until run time) - ie the output pointer _might_ be an alias (ie different name, same thing) as the input pointers.

Therefore to play safe and ensure that it''s always using current values for the matricies the compiler reloads any values it may have cached in the registers from the input matrices - which may ammount (relatively) to a lot of extra overhead Telling the compiler to "assume no aliasing" means that it will assume that all the pointers to values (that matter - ie ones that are written) are unique, so that it will not need to take the precaution of reloading the values whenever it does a write. This does of course burden you with ensuring that you don''t try anything strange, like passing it the same matrix as the output and one of the input matrixes.

The reason that the second function didn''t suffer the same problem is that the matrix you were writing to is a local matrix, which the compiler could automatically guarantee that it was not being written ''behind its back'', meaning it didn''t need to have the overhead of reloading the values.

Well done for reading this far I''m sorry if it''s a bit long and boring - but there you go, I hope it''s solved the problem & explained why it''s occuring.

NickB

PS. I''m sorry, but I don''t know what the optimisation pragma''s are for GCC - especially no aliasing.

Share this post


Link to post
Share on other sites
Kefdman    122
Hey thanks a lot NickB, im now looking how to disable aliasing in gcc, that was probably what was causing my problem. because i had expected the one with pointers to be much faster than the one that passed it all by copy!!
GCC probably decided to use aliasing because my program does a lot of crazy stuff with pointers and it keeps segfaulting all the time.

Edited by - kefdman on December 13, 2001 9:43:15 PM

Share this post


Link to post
Share on other sites
NickB    146
It might be worth getting to the bottom of the segmentation faults though...that could be causing really nasty slow downs...make sure you've got enough stack space - I think you can alter that under GCC in the binarys for some OS's - it's unusual that you are running out of space, but this could happen if you call alot of functions at once in a chain (ie a calls b which calls c etc.), if you're using recusive functions or if you have an abnormally large number of local variables (or a combination of these - long call chains with recursive functions all containing lots of local variables)

PS: Null & Void - aliasing seems unlikely to be the major problem I agree - but then I assumed everything was running normally

Edited by - NickB on December 13, 2001 10:34:08 PM

Share this post


Link to post
Share on other sites
Kefdman    122
NickB: lol no, segfaults don''t slow down my program, they only crash it!
I dont have a lot of recursive functions, i doubt the stack is the problem. These segfaults are caused by my program trying to access memory it didn''t malloc first!
I didn''t want to bother you with that, I only meant that I write really bad programs.

Share this post


Link to post
Share on other sites