Optimization Puzzle

56 comments, last by MaulingMonkey 18 years, 7 months ago
Thx jjd...

You know, I actually thought of that...

I didn't run the tests though 'cos I thought it sounded a bit crazy (LOL.... No offence mate!)

Mmmm... Perhaps now that you have mentioned I'll give that idea a go...

However, I've a funny feeling that the ever so helpful compiler will say:

'Great!!! Another loop invariant that does not depend on the looping index'

and move the code accordingly...

Sure, helpful in MOST situations, but not in code timing...

Last resort I guess is to somehow incorporate the looping index in the test code...

I might not get an accurate test... But at least it won't move the code so that it CAN be tested...

Gonna try some ideas now...

Thx!
TerrorFlop:
One thing you could try is to take the assembly the compiler generates (to call normal sqrt) and put that in an __asm block so that the compiler will treat it as the same "black box" and not optimize it. You could also try __declspec(noinline) in case the compiler is just unrolling and inlining the whole deal...
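Something like this, roughly (just a sketch, assuming MSVC, and the wrapper name is made up):

    #include <cmath>

    // Wrap the call under test so the optimizer has to treat it as an opaque,
    // non-inlinable function during timing runs (MSVC-specific keyword).
    __declspec(noinline) double SqrtUnderTest(double x)
    {
        return std::sqrt(x);
    }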

However, the BEST (and I highly recommend this if you're into this sort of thing) is to go to AMD's website and download CodeAnalyst. It allows you to 'record' portions of your code, and then it simulates everything that is happening on the processor during that time and shows you a graph of where all the cycles are going and which stages of the pipeline are stalling for each instruction. It takes a lot of practice to really understand everything that's happening, but the "Green is Good, Red is Bad" thing will be a good start :)

Anyway, regarding the "Compiler is the God of Optimization" idea... That would be great, wouldn't it? It makes sense, right? Unfortunately, the compiler is *terrible* at optimizing, and especially at intrinsics. I mean jokingly bad. The only thing the compiler actually does is inline and unroll loops, which is exactly what distorts the sort of tests TerrorFlop is trying to run. The real genius is the processor, especially those AMDs. The re-arrangement of crappy code that thing can do on the fly is remarkable; it's indescribable what sort of genius went into producing that in silicon at 3 GHz.

Anyway, back to TF: The test you're running... it's not a very good way to test SSE. The one thing the processor can't do is magically eliminate latency stalls, like the ones you get from using xmm0 as both destination and source in almost every instruction... it'll fill those stalls with out-of-order instructions, but if all you're doing is a tight loop that keeps stalling, there's not much it can hope to do. Do you need to use doubles? This can be written much faster with floats.

[edit] Another thing. I'm not sure about doubles, but for floats there are 2 instructions: sqrtps and rsqrtps. The first is the full IEEE float sqrt, and the second is a 12-bit accurate inverse square root that runs about 7 times faster (and it also eliminates the divide you'd otherwise need to do a normalize). So it's likely that the "meat" of your function, the square root, is occupying 80% of your cycles in both functions. See, this is the sort of thing CodeAnalyst is for.[/edit]
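(For floats those map onto two intrinsics... just a sketch, the function names are mine and this assumes SSE is available:)

    #include <xmmintrin.h>   // SSE intrinsics

    // Full IEEE single-precision square root of four packed floats (sqrtps).
    __m128 FullSqrt(__m128 v)      { return _mm_sqrt_ps(v); }

    // ~12-bit accurate reciprocal square root (rsqrtps): much faster, and it already
    // gives the 1/sqrt a normalize needs, so the divide disappears too.
    __m128 FastInvSqrt(__m128 v)   { return _mm_rsqrt_ps(v); }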
Just include the loop index in the arguments to the function. No more invariant.

Thx Ajas.

You've certainly given me food for thought there and I'll check out CodeAnalyst.

Think I'll just give up anyway... (the other coding ideas I wrote still come up with ‘sqrt’ as fast as a bleeding NULL statement!!!)

You would have thought that accurately timing code would be pretty easy, but the compiler can and does get in the way.

However testing 'sqrt's actual asm code directly sounds like a damn good idea...

The SSE2 code I wrote... And let's face it... It ain't complex (but yeah, I know a little bit about Intel's CPU instruction re-ordering... Damn complex subject!!!)... IS fast...

I was simply bemused that 'sqrt' is 'faster' despite it having such a bad rep in most optimization books I've read.

And the fact that it’s as fast (apparently) as a NULL function sure as hell doesn't make sense.

But thx man... Gonna go rip out 'sqrt's innards and learn from its dark secrets LOL!
Anon said:

Just include the loop index in the arguments to the function. No more invariant.



Thx for the tip.... But even that does not appear to work. The compiler still performs its magic... Somehow... :-(

(Which is damn useful 99.999% of the time... But not for accurately testing code)

__________________________________________________________
Optimizing Code Really F's U Up!!! DON'T DO IT!!! (LOL)
It could be that it's detecting that you're not storing the result anywhere, and since it knows the sqrt function has no side effects it just optimizes out the whole thing? Maybe try storing the results in a preallocated array. Then the time it takes for the memory writes becomes part of the test... but you can view it as added realism. Who knows, it could be that under realistic usage one method is faster sometimes and the other is faster other times, due to inlining + out-of-order execution + memory usage patterns.
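Something along these lines, say (just a sketch; the array size and names are made up):

    #include <cmath>

    enum { N = 1000000 };
    static double results[N];   // preallocated sink: the stores make the work observable

    void TimedLoop(const double* input)
    {
        for (int i = 0; i < N; ++i)
            results[i] = std::sqrt(input[i]);   // result is stored, so the call can't be discarded
    }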
An extract from MSDN Online:

_______________________________________________________________________________
Often, some code inside a loop, which is repeated many times, does not change values during the loop's execution. These loop invariants can be calculated once before the loop runs. Removing or hoisting loop invariants generally improves speed without affecting size very much. For example, invariant hoisting can remove the arithmetic from the inside of this loop:

for (i = 0; i < 1000; i++) {
    A = a + b;
}

The product is

t = a + b;
for (i = 0; i < 1000; i++) {
    A = t;
}

This saves 999 addition operations at the cost of one temporary variable, which is likely to go into a register.
________________________________________________________________________________


So even trying to force code to be NON-invariant is pointless... The compiler does what the hell it wants...

Most times that's helpful.... Most times...

Programmer 0 - Compiler 1 :-(
Quote:Original post by TerrorFLOP
Thx Sherlock! (j/k)

Ah... Win32 API / ANSI function calls... All the same to me really when you've programmed for a while...

for a while, but not for long enough.

It is a standard library call... heck, not even a call (I can bet it is inlined). In release mode with optimizations on, it is going to have equal or greater speed than whatever you can ever write.

Anyway, if you need this sqrt to compute a distance, chances are you can do your work with the squared radius and do not need sqrt at all.
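(Meaning something like this; just a sketch, the names are made up:)

    // Instead of: if (sqrt(dx*dx + dy*dy) < radius) ...
    // compare squared quantities and skip the sqrt entirely:
    bool WithinRadius(double dx, double dy, double radius)
    {
        return (dx * dx + dy * dy) < (radius * radius);
    }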

P.S. A benchmark is worth precisely nothing when the compiler is "smarter" than the writer of the benchmark, and that is certainly the case here. The compiler will not compile things that are not necessary, that is. You need to do something useful with the results (like add them up and print the sum), make all the inputs different, etc. It's not just invariants: if the result goes nowhere, no computation is done. Better yet, benchmark your actual code. Chances are the compiler will optimize out a lot of unnecessary code there too.

Anyway, just forget about assembly-level optimizations. If you cannot write a benchmark, you surely cannot make optimizations better than the compiler can, as the compiler clearly "knows" more optimizations than you do.
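(For what it's worth, the kind of loop the optimizer can't gut looks roughly like this... just a sketch, with made-up sizes and names:)

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double sum = 0.0;
        for (int i = 0; i < 10000000; ++i)
            sum += std::sqrt((double)i);   // the input varies every iteration and feeds the sum

        std::printf("%f\n", sum);          // the sum is observable, so none of the work can be removed
        return 0;
    }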
1. Putting hand-coded asm into your code makes it hard for the compiler to optimise the code around it, so unless it's a substantial chunk of asm any speed improvement may well be lost.

2. You might be better off using intrinsics in the source code (not asm) - e.g. here.

3. Your compiler probably supports generating SSE instructions anyway (even if it doesn't manage the vector forms) - why not just enable this flag in the compiler? Of course, you need to know that the target machine will support them.

4. If you really care about speed, use floats not doubles, which may require reworking your algorithms to handle the loss in precision.

5. Doing sqrt(x*x + y*y) can generate precision problems when x and y are large - that's the point of the hypot function (there's a small sketch of the difference after this post). It may be that the increased accuracy with hypot means you can use floats, not doubles.

I don't know whether the above suggestions will help - my own experience has been that trying to speed things up at this low level (SSE intrinsics, prefetching etc) invariably slows things down!
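(Regarding point 5, roughly this; just a sketch with made-up values:)

    #include <cmath>
    #include <cstdio>

    int main()
    {
        float x = 3.0e20f, y = 4.0e20f;

        // x*x overflows the float range, so the naive form blows up to infinity...
        std::printf("naive: %g\n", std::sqrt(x * x + y * y));

        // ...while hypot rescales internally and stays finite (about 5e20 here).
        std::printf("hypot: %g\n", std::hypot(x, y));
        return 0;
    }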
Dmytry...

I find your lack of IQ... Disturbing to say the least.

No wonder there have been posts from the moderators about peeps like you.

If I wish to time my code... Then that is entirely my business.

I've timed code before... And I shall do so... AGAIN.

There's even code in the online MSDN (and in several other books) which shows you how to implement code-timing logic, and I've incorporated some of that code into my own timing routines. If you have a problem with other coders trying to time and experiment with their code... Then you have a serious problem dude.
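(For the record, the usual Win32 timing skeleton looks roughly like this; just a sketch, and the code under test is a placeholder:)

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);

        // ... code under test goes here ...

        QueryPerformanceCounter(&stop);
        std::printf("%f seconds\n",
            (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart);
        return 0;
    }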

With regards to the 'radius squared' issue... You have NO idea why I would choose to norm vectors do you?

Go check out Newton's Law Of Gravitation... The VECTOR version (since force IS a vector)... And you'll realize that the famous law utilizes the norm cubed.
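(That is, roughly this; just a sketch, the struct and names are mine:)

    #include <cmath>

    struct Vec3 { double x, y, z; };

    // Force on body 1 from body 2: F = G*m1*m2 * (r2 - r1) / |r2 - r1|^3.
    // The unit vector hides a |r|, so the denominator is the norm CUBED,
    // and you cannot get the norm without a sqrt.
    Vec3 Gravity(Vec3 p1, Vec3 p2, double m1, double m2, double G)
    {
        Vec3 d = { p2.x - p1.x, p2.y - p1.y, p2.z - p1.z };
        double rSq   = d.x * d.x + d.y * d.y + d.z * d.z;
        double invR3 = 1.0 / (rSq * std::sqrt(rSq));   // 1 / |r|^3
        double s     = G * m1 * m2 * invR3;
        Vec3 f = { s * d.x, s * d.y, s * d.z };
        return f;
    }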

Anyway, WHY am I trying to explain myself to you?

If you've nothing constructive to say man... THEN DON'T SAY!!!

I'm sure the rest of the civilized world will get by just fine without your 'insights' (if indeed you can call them that).

This topic is closed to new replies.
