Back to General and Gameplay Programming

How to profile an inner inner loop

General and Gameplay Programming Programming

Started by marumaru January 26, 2009 03:54 AM

11 comments, last by swiftcoder 15 years, 2 months ago

marumaru

122

Author

January 26, 2009 03:54 AM

Hi guys, I'm trying to figure out why a certain piece of code is slower than it should be. The time-consuming bit is a loop that gets executed a few hundred billion times per run. Said loop contains a fairly complex control flow, and a number of potentially time consuming pieces. The problem is that I can't for the life of me figure out how long each of those pieces take relative to one another. I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time. The statement-level data is sketchy at best, especially since gcc rearranges everything during optimization. The results are inconsistent with results I observe by commenting out sections of the code. Profiling unoptimized code might give more reliable output, but I don't think it would tell me anything useful. I even tried using the RDTSC instruction to measure clock cycles, but the instruction itself is too slow relative to the length of the code I'm trying to measure, since each iteration takes very little time. Depending on where I put it, RDTSC alone sometimes doubles the runtime of the program, so obviously I can't trust any results from that. Any bright ideas, or am I going to have to bone up on my assembly and try to read gcc's optimized output?

Morrandir

304

January 26, 2009 04:30 AM

Hi!

Have you tried optimizing with a more high level tool, like AMD CodeAnalyst? Don't know if it would help in your situation or not, but it's a wonderful (and free) tool that is definitely worth a try. It has helped me find fairly low-level bottlenecks in the past (like massive cache misses).

Regarding assembly knowledge, with a good profiler you can usually get by just looking up the meaning of the instructions that take a lot of cycles. At least that's what I usually do and it works so far :)

Zahlman

1,682

January 26, 2009 10:59 AM

Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.

outRider

852

January 26, 2009 11:10 AM

Did you try OProfile's annotated disassembly output, or just annotated source? Generally I look at the output of opannotate -a -s <binary>. Maybe if you post that output for the function in question someone might shed some light on your problem.

marumaru

122

Author

January 26, 2009 11:31 AM

Quote:Original post by Morrandir
Hi!

Have you tried optimizing with a more high level tool, like AMD CodeAnalyst? Don't know if it would help in your situation or not, but it's a wonderful (and free) tool that is definitely worth a try. It has helped me find fairly low-level bottlenecks in the past (like massive cache misses).

Regarding assembly knowledge, with a good profiler you can usually get by just looking up the meaning of the instructions that take a lot of cycles. At least that's what I usually do and it works so far :)

Thanks, I'll give it a try.

Some further investigation-by-commenting-out seems to indicate that the big slowdown is actually due to a bit of code that just reads from a 2D array of booleans once per iteration, so I imagine it must indeed be due to a very expensive cache miss. Guess I'll need to re-structure the data to be more cache friendly.

Quote:
Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.

lol

Unfortunately my problem is very fundamentally quadratic in a value that can get pretty large. The very part of the algorithm that's running slower than expected is the part that's supposed to be reducing the number of times the loop runs. [grin]

I admit "hundred billion times" is a bit of an exaggeration though; in the majority of cases the loop drops out very early and only really does a lot of work in a few tens of millions of the iterations.

Quote:
Did you try OProfile's annotated disassembly output, or just annotated source? Generally I look at the output of opannotate -a -s <binary>. Maybe if you post that output for the function in question someone might shed some light on your problem.

I've looked at the annotated disassembly, though I admit not very carefully. I had some difficulty figuring out which bits of assembly corresponded to what, since as I mentioned gcc does a hell of a job mangling my code. This is probably the route I'll have to go, though.

iMalc

2,466

January 26, 2009 12:22 PM

Quote:Original post by Zahlman
Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.

This requires some serious consideration. It sounds like the code must be fairly well optimised at the low level already, since adding an rdtsc slowed it down a lot. Usually the only thing you can do in such a case is to reduce how often the thing is called.
If you're absolutely sure that you must perform more low-level optimisation then you'll need to show code. And that's not something we want you to know the answer immediately to. You need to seriously reconsider every possible way to call it less.

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

marumaru

122

Author

January 26, 2009 12:35 PM

Quote:Original post by iMalc
Quote:Original post by Zahlman
Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.
This requires some serious consideration. It sounds like the code must be fairly well optimised at the low level already, since adding an rdtsc slowed it down a lot. Usually the only thing you can do in such a case is to reduce how often the thing is called.
If you're absolutely sure that you must perform more low-level optimisation then you'll need to show code. And that's not something we want you to know the answer immediately to. You need to seriously reconsider every possible way to call it less.

Don't get me wrong; I understand that algorithmic optimizations are generally more effective than code-level optimizations. The code I'm trying to profile *is* an algorithmic optimization, that successfully reduces the number of iterations by orders of magnitude, yet against all odds fails to speed up the runtime.

stonemetal

288

January 26, 2009 12:51 PM

Quote:I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time.

Have you tried pulling chunks of the function you are worried about out in to separate functions so that callgrind can inspect the pieces?

Adam_42

3,664

January 26, 2009 03:27 PM

Quote:Some further investigation-by-commenting-out seems to indicate that the big slowdown is actually due to a bit of code that just reads from a 2D array of booleans once per iteration, so I imagine it must indeed be due to a very expensive cache miss. Guess I'll need to re-structure the data to be more cache friendly.

You might find http://www.agner.org/optimize/ useful. The matrix transposition example (section 8.10 of manual 1) shows options for cutting down on cache issues when accessing a big 2D array.

Also don't forget that you can always store booleans as a single bit each to fit more of them in the cache.

marumaru

122

Author

January 26, 2009 04:56 PM

Quote:Original post by stonemetal
Quote:I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time.

Have you tried pulling chunks of the function you are worried about out in to separate functions so that callgrind can inspect the pieces?

I suspect that doing that would significantly change the performance characteristics of the code.

Quote:
You might find http://www.agner.org/optimize/ useful. The matrix transposition example (section 8.10 of manual 1) shows options for cutting down on cache issues when accessing a big 2D array.

Also don't forget that you can always store booleans as a single bit each to fit more of them in the cache.

Cool, that looks like a really useful resource. Thanks!

How to profile an inner inner loop

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

How to profile an inner inner loop

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines