Optimization Puzzle

Started by
56 comments, last by MaulingMonkey 18 years, 7 months ago
Hiya Guys. I've been told over and over that the Win32 API function 'sqrt' is slow, Slow, SLOW... So of course I opted to utilize SSE2 technology in order to work out the norm of a vector, packed as 2 double-precision floats. No problem there... But in case I AM doing something wrong, here is my inlined ASM code, packaged as a function called 'FastNorm':

inline double FastNorm (const __m128d vector)
{
    double norm;

    __asm
    {
        movapd xmm0, vector
        mulpd  xmm0, xmm0
        movapd xmm1, xmm0
        shufpd xmm0, xmm0, _MM_SHUFFLE2 (0, 1)
        addpd  xmm0, xmm1
        sqrtpd xmm0, xmm0
        movlpd norm, xmm0
    }

    return norm;
}

Now get this: when I compare the above code to this one, which INCLUDES the dreaded 'sqrt' function;

double Norm (const CSSE2Vector &vector)
{
    return double (sqrt ((vector.x * vector.x) + (vector.y * vector.y)));
}

where CSSE2Vector is a small class which simply encapsulates the __m128d data type, the 'Norm' function actually runs much FASTER than 'FastNorm' in Release builds using MS Visual C++ 6.0. In fact, 'Norm' runs as fast as a null function (a function with no code and no arguments). In Debug builds 'FastNorm' runs faster than 'Norm', so no problem there.

I'm using the high-performance counter to time BOTH functions under identical conditions. But I have a feeling the damn compiler is running some optimization routines that somehow beat SSE2 code. Or it's doing some damn loop unrolling (I use about a billion iterations in the loop that tests the code), and has thus taken the invariant test code outta the main test loop, thus skewing my code times.

Something is wrong here, 'cos I be damned if 'sqrt' can beat SSE2. Any of you boffins know how to tame my compiler, so that it DOESN'T run stealth optimizations in my code??? Otherwise, I gotta stick to using the 'sqrt' version to norm all my vectors in a Release build... Suspecting that it's faster than SSE2 only 'cos I couldn't time the damn thing properly :-(

Plz Help!!!
Try looking at the assembly being generated?

Also, sqrt isn't a Win32 API function; it's a standard library function.
Thx Sherlock! (j/k)

Ah... Win32 API / ANSI function calls... All the same to me really when you've programmed for a while...

Guess it's a bad, lazy way of saying non-DirectX function calls...

Cos they're all slow to me.... Well mostly...

Didn't REALLY wanna look at asm code mate... Inline asm is about all I can handle for now, due to its convenience...

Have you seen the asm code the compiler produces???

If I was an asm nut, then sure I would have done that LONG ago...

No, MS Visual C++ is over-complicated... And I'm sure there's a way to tame the damn compiler...

At least I'll gain some control if I wanna do precise code timings...

Looking at convoluted compiler-generated asm is the LAST resort.

But its an idea that I might have to consider :-(
Quote: Original post by TerrorFLOP
Something is wrong here, 'cos I be damned if 'sqrt' can beat SSE2. Any of you boffins know how to tame my compiler, so that it DOESN'T run stealth optimizations in my code???


So... you want the compiler to do less work so you can do more? Please, explain to me how that's a good idea.

If you don't want it to optimize, turn off optimizations. Another option would maybe be to tell the compiler you aren't using the standard library and then link the standard library by hand.

Quote:
Otherwise, I gotta stick to using the 'sqrt' version to norm all my vectors in a Release build... Suspecting that it's faster than SSE2 only 'cos I couldn't time the damn thing properly :-(


It's possible it is using the SSE2 code. Face it, compilers, by and large, are better than humans at optimizing code.

And, for what it's worth, if someone doesn't know the difference between Win32 functions and ANSI functions, especially if they call themselves TerrorFLOP (the pun strikes terror in my heart), I doubt they're better at low level optimizations than a compiler (almost certainly coded by people who know the difference between Win32 and ANSI).
Easy there matey.

I did say I was lazy when it comes to the classification of different APIs.

My problem, not yours... Not that it IS a problem...

Listen, I understand compilers are the BEST optimizers... That was NOT my point.

My point is that it's difficult to time your code when the compiler alters the code that you are trying to time.

If you have any ideas on how to temporarily disable that feature, then I'm interested...

If not... Go bug some other poor soul with incomplete and vague answers.

Plz answer the specific question:

Can you disable the compiler's feature to alter one's code...

A simple yes or no would suffice.

I'm experimenting with various code here and I wanna know which one works best.

Is that re-phrased question straightforward enough?
Quote:And, for what it's worth, if someone doesn't know the difference between Win32 functions and ANSI functions, especially if they call themselves TerrorFLOP (the pun strikes terror in my heart), I doubt they're better at low level optimizations than a compiler (almost certainly coded by people who know the difference between Win32 and ANSI).


Even today's compilers are limited in some of the optimizations they can do, and there's still a need to develop better algorithms for compilers, especially for stuff like avoiding "optimization blockers". But if I'm not mistaken, at least ONE problem related to that is NP-complete, so in that case the best an algorithm could do (if someone created one for it) is an approximation...
True optimization is, first and foremost, improving the algorithms. Then you can think about whether assembly is still worth it...

The general problem is that most people who try to do stuff in assembly don't know that much assembly, and quite often use instructions that cost lots of cycles in a few lines, or lots of lines where something could be written more tersely with simpler instructions... It's assembly. If you really wanted "ease of reading" in your code, you wouldn't be programming in assembly.

My advice is... If you really want to optimize, try learning more assembly, or try thinking of a different way to do it.

P.S. - I heard that Quake3 source code had an implementation of sqrt in assembly... Maybe you could look at it if you want ideas...
Let me clarify my problem clearly...

As it seems some peeps, who shall forever remain anonymous (and thank God, eh?), haven't really grasped my problem.

Compilers use what is called 'loop-invariant code motion'.

What that means in layman's terms is that ANY code you write within ANY loop that does NOT depend upon the loop index is classified as loop-invariant.

Most optimizing compilers are constructed to detect such code in loops, and thus MOVE the invariant code OUT of the loop in order to increase speed.

Here is an example of my dilemma; the compiler has re-written my test code from this:


<CODE TO LOG PERFORMANCE COUNTER GOES HERE>
for (register int i = 1000000000; i >= 0; --i)
{
<TEST CODE THAT DOES NOT DEPEND ON i>
}
<CODE TO LOG PERFORMANCE COUNTER GOES HERE SO A TIME DIFFERENTIAL CAN BE OBTAINED>


to the following:


<CODE TO LOG PERFORMANCE COUNTER GOES HERE>
<TEST CODE THAT DOES NOT DEPEND ON i IS NOW MOVED HERE BY THE COMPILER!!!>
for (register int i = 1000000000; i >= 0; --i)
{
}
<CODE TO LOG PERFORMANCE COUNTER GOES HERE SO A TIME DIFFERENTIAL CAN BE OBTAINED>


Do you see the problem? I wish to test the code, but when I build a Release version of the code, with all optimizations active, all I do is test a null loop.

Hence why when I test THIS function;


inline double Tools::Norm (const CSSE2Vector &A)
{
return double (sqrt ((A.x * A.x) + (A.y * A.y)));
}


I get a report back saying it’s just as FAST as a NULL function, which is obvious bull.

Now, can anyone tell me if there is a way to tell the compiler NOT to perform loop-invariant code motion?

Much Appreciated.
Although it is probably naive, the way I test code like this is to pass a random number to the function I want to test. I think this works when you want to compare two different functions because the generation of the random number is a common factor of the process. I guess you could also time the generation of the random number if you wanted a better estimate of the time that each function takes to perform its task.


-Josh

--www.physicaluncertainty.com
--linkedin
--irc.freenode.net#gdnet

Thx k0n.

A helpful reply at last LOL...

The thing is, I agree with you that asm is tricky stuff.

Hence my reluctance to look at compiler-generated asm code...

Perhaps one day I'll jump straight into asm, but with today's high-spec APIs I guess I try to avoid such an endeavour.

However, the small piece of simple SSE2 asm that I wrote is hardly complex.

And the only reason why I've attempted asm is because of 'sqrt's apparent slowness.

Many books I’ve bought on C/C++ optimization all hail the compiler, but they also try to encourage us to use highly optimised SIMD instructions for critical code, and I suppose I agree.

I will look into those areas you mentioned... Thanks very much for your input... But c'mon, a handful of simple SIMD instructions gets beat by 'sqrt' in Release versions only???

(In Debug, my code wins big time!)

No... My compiler is being too clever... Sure it’s helpful... But I want it to behave JUST a little so I can accurately test my code...

IF that is NOT possible (MS VC++ is complex and I've spent years on the damn thing... Well more on programming... But I guess I still have a lot to learn on the IDE)... Then sure I'll quit trying to time code...

Or perhaps invest in an expensive profiler...

But I ain't giving up just quite yet LOL!!!

Thanks Anyway :-)
Maybe the volatile keyword could help?

This topic is closed to new replies.
