How to profile an inner inner loop

Hi guys, I'm trying to figure out why a certain piece of code is slower than it should be. The time-consuming bit is a loop that gets executed a few hundred billion times per run. Said loop contains fairly complex control flow and a number of potentially time-consuming pieces. The problem is that I can't for the life of me figure out how long each of those pieces takes relative to the others.

I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time. The statement-level data is sketchy at best, especially since gcc rearranges everything during optimization, and the results are inconsistent with what I observe by commenting out sections of the code. Profiling unoptimized code might give more reliable output, but I don't think it would tell me anything useful.

I even tried using the RDTSC instruction to measure clock cycles, but the instruction itself is too slow relative to the length of the code I'm trying to measure, since each iteration takes very little time. Depending on where I put it, RDTSC alone sometimes doubles the runtime of the program, so obviously I can't trust any results from that.

Any bright ideas, or am I going to have to bone up on my assembly and try to read gcc's optimized output?

Hi!

Have you tried profiling with a higher-level tool, like AMD CodeAnalyst? I don't know whether it would help in your situation, but it's a wonderful (and free) tool that is definitely worth a try. It has helped me find fairly low-level bottlenecks in the past (like massive cache misses).

Regarding assembly knowledge: with a good profiler you can usually get by just looking up the meaning of the instructions that take a lot of cycles. At least, that's what I usually do, and it has worked so far :)

Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.

Did you try OProfile's annotated disassembly output, or just the annotated source? Generally I look at the output of opannotate -a -s <binary>. If you post that output for the function in question, someone might be able to shed some light on your problem.

Quote:
Original post by Morrandir
Hi!

Have you tried profiling with a higher-level tool, like AMD CodeAnalyst? I don't know whether it would help in your situation, but it's a wonderful (and free) tool that is definitely worth a try. It has helped me find fairly low-level bottlenecks in the past (like massive cache misses).

Regarding assembly knowledge: with a good profiler you can usually get by just looking up the meaning of the instructions that take a lot of cycles. At least, that's what I usually do, and it has worked so far :)

Thanks, I'll give it a try.

Some further investigation-by-commenting-out seems to indicate that the big slowdown is actually in a bit of code that just reads from a 2D array of booleans once per iteration, so I imagine the cost is indeed a very expensive cache miss. Guess I'll need to restructure the data to be more cache-friendly.
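For what it's worth, a minimal illustration of why access pattern against a big 2D array matters (hypothetical names; the flattened char array stands in for the boolean grid): iterating in storage order touches each cache line once and uses all of it, while iterating against the storage order fetches a whole line for every single element read.

```cpp
#include <vector>
#include <cstddef>

// Same sum, two traversal orders over a row-major 2D array (flattened).
long long sum_row_major(const std::vector<char>& a, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += a[r * cols + c];   // contiguous: each cache line is used fully
    return s;
}

long long sum_col_major(const std::vector<char>& a, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += a[r * cols + c];   // stride of `cols` bytes: poor locality when cols is large
    return s;
}
```

Both functions compute the same result; on a grid larger than the cache, the second one can be many times slower purely because of the traversal order.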
Quote:

Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.

lol

Unfortunately my problem is fundamentally quadratic in a value that can get pretty large. The very part of the algorithm that's running slower than expected is the part that's supposed to reduce the number of times the loop runs. [grin]

I admit "hundred billion times" is a bit of an exaggeration, though; in the majority of cases the loop drops out very early, and it only really does a lot of work in a few tens of millions of the iterations.
Quote:

Did you try OProfile's annotated disassembly output, or just annotated source? Generally I look at the output of opannotate -a -s <binary>. Maybe if you post that output for the function in question someone might shed some light on your problem.

I've looked at the annotated disassembly, though admittedly not very carefully. I had some difficulty figuring out which bits of assembly corresponded to what since, as I mentioned, gcc does a hell of a job mangling my code. This is probably the route I'll have to go, though.

Quote:
Original post by Zahlman
Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.
This requires some serious consideration. It sounds like the code must already be fairly well optimised at the low level, since adding an rdtsc slowed it down so much. Usually the only thing you can do in such a case is reduce how often the thing is called.
If you're absolutely sure that you must perform more low-level optimisation, then you'll need to show code. But don't expect an immediate answer; you should first seriously reconsider every possible way to call it less.

Quote:
Original post by iMalc
Quote:
Original post by Zahlman
Have you considered the possibility that the real problem is that the code runs too many times? Maybe you can change the algorithm in a way that avoids having such a highly-repeated inner loop.
This requires some serious consideration. It sounds like the code must already be fairly well optimised at the low level, since adding an rdtsc slowed it down so much. Usually the only thing you can do in such a case is reduce how often the thing is called.
If you're absolutely sure that you must perform more low-level optimisation, then you'll need to show code. But don't expect an immediate answer; you should first seriously reconsider every possible way to call it less.

Don't get me wrong; I understand that algorithmic optimizations are generally more effective than code-level optimizations. The code I'm trying to profile *is* an algorithmic optimization that successfully reduces the number of iterations by orders of magnitude, yet against all odds it fails to speed up the runtime.

Quote:
I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time.


Have you tried pulling the chunks of the function you're worried about out into separate functions, so that callgrind can inspect the pieces?

Quote:
Some further investigation-by-commenting-out seems to indicate that the big slowdown is actually in a bit of code that just reads from a 2D array of booleans once per iteration, so I imagine the cost is indeed a very expensive cache miss. Guess I'll need to restructure the data to be more cache-friendly.


You might find http://www.agner.org/optimize/ useful. The matrix transposition example (section 8.10 of manual 1) shows options for cutting down on cache issues when accessing a big 2D array.

Also don't forget that you can always store booleans as a single bit each to fit more of them in the cache.
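A sketch of the bit-per-boolean idea (the type and its names are illustrative, not from the thread): packing 64 flags into each 64-bit word makes the grid 8× denser than one bool per byte, so far more of it stays resident in cache.

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// A 2D boolean grid stored one bit per cell, row-major.
struct BitGrid {
    size_t cols;
    std::vector<uint64_t> words;

    BitGrid(size_t rows, size_t cols_)
        : cols(cols_), words((rows * cols_ + 63) / 64, 0) {}

    void set(size_t r, size_t c) {
        size_t i = r * cols + c;                    // flat bit index
        words[i / 64] |= uint64_t(1) << (i % 64);
    }

    bool get(size_t r, size_t c) const {
        size_t i = r * cols + c;
        return (words[i / 64] >> (i % 64)) & 1;
    }
};
```

The shift-and-mask per access is a couple of cycles; a cache miss avoided is a couple of hundred, so the trade is usually a win when the grid is large.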

Quote:
Original post by stonemetal
Quote:
I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time.


Have you tried pulling the chunks of the function you're worried about out into separate functions, so that callgrind can inspect the pieces?

I suspect that doing that would significantly change the performance characteristics of the code.
Quote:

You might find http://www.agner.org/optimize/ useful. The matrix transposition example (section 8.10 of manual 1) shows options for cutting down on cache issues when accessing a big 2D array.

Also don't forget that you can always store booleans as a single bit each to fit more of them in the cache.

Cool, that looks like a really useful resource. Thanks!

Quote:
Original post by shavi
Quote:
Original post by stonemetal
Quote:
I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time.
Have you tried pulling the chunks of the function you're worried about out into separate functions, so that callgrind can inspect the pieces?
I suspect that doing that would significantly change the performance characteristics of the code.
Quite possibly much less so than you might think. Aggressive inlining should put most of the chunks back in, unless there is benefit to leaving them out. It also tends to be a fairly minor code modification, and will allow you to figure out exactly which chunk is chewing up most of the cycles.
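A sketch of that workflow (all names hypothetical): mark the extracted helpers noinline in profiling builds so callgrind attributes cost to each chunk as a separate frame, and let gcc inline them back for normal builds. Note that __attribute__((noinline)) is gcc/clang-specific.

```cpp
// Compile with -DPROFILE_CHUNKS to keep the chunks as separate functions
// for callgrind; without it, the compiler is free to inline them back.
#ifdef PROFILE_CHUNKS
#define CHUNK __attribute__((noinline))
#else
#define CHUNK inline
#endif

// Stand-ins for the expensive pieces of the real loop body.
CHUNK bool check_candidate(int i) { return (i % 7) != 0; }
CHUNK int accumulate(int acc, int i) { return acc + i; }

int run(int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        if (check_candidate(i))     // chunk A shows up as its own cost centre
            acc = accumulate(acc, i);  // chunk B likewise
    return acc;
}
```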

Quote:
Original post by swiftcoder
Quote:
Original post by shavi
Quote:
Original post by stonemetal
Quote:
I've tried running it through tools like oprofile and callgrind, but the most I can reliably get out of them is that indeed, the function that runs this loop takes a long time.
Have you tried pulling the chunks of the function you're worried about out into separate functions, so that callgrind can inspect the pieces?
I suspect that doing that would significantly change the performance characteristics of the code.
Quite possibly much less so than you might think. Aggressive inlining should put most of the chunks back in, unless there is benefit to leaving them out. It also tends to be a fairly minor code modification, and will allow you to figure out exactly which chunk is chewing up most of the cycles.

Hm, interesting. Ok, I'll give that a try then.
