
Best way to measure performance


Hi, I would like to test a few things performance-wise, just to be sure that some of my ideas for getting higher performance (micro-optimization and more) actually work. I know performance is really an 'in-theory' thing and depends on a lot of variables. So my question is: what would be the best / most stable OS / compiler / way to measure time / number of repetitions, etc.? Also, I noticed that the first few times I run something (on XP), it's always slower than later runs. So I guess I should leave out the few highest and lowest measurements? So, what would be the best way to perform these measurements? Thanks! PS: this would be in C and/or C++, where I'd like to compile to ASM as well. I currently use GCC and WinXP, though I have an Ubuntu box downstairs I could use as well.

An often recommended technique is profiling. A profiler can tell you how much of your time is being spent in which function, how many times that function is called, how long it takes, etc. It's a great tool for the kind of thing you're talking about.

You'll want to profile under realistic conditions: compile in release mode, run it for a good amount of time, etc. Note that profiling does add a little overhead, though it's probably negligible. Also note that this is for whole-program testing. Seeing how one line of code performs is a lot harder, especially if it turns out your compiler optimized it away.

Note that you'll probably not get a LOT of data on the micro-optimizations you mentioned. A lot of micro-optimizations cut off so little time that the profiler won't find it. If you're good with assembly, I think you can read the timestamp counter from the CPU yourself, and that'll be a lot more accurate.

But, you mentioned "what would be the best / most stable OS / compiler".

You'll want to use your target OS. One reason is if you're testing on Linux but release on Windows, your Linux test results are invalid, especially if you depend on any OS-specific functionality.

I don't know that there's one BEST compiler. Each one has advantages and disadvantages; you'd just have to see what works. If you're using gcc, you can probably just use gcc.

I read in a few other topics about things loading faster the second time because the OS caches the file. I don't know whether it'd run better the second time, though.

If you're curious about optimization and performance, this is a pretty good guide. The Optimizing software in C++ one contains a section called "Finding the biggest time consumers" with a subsection about profilers. It also contains some good info about optimization and WHEN to optimize.

EDIT:
One last thing. "So, what would be the best way to perform these measurements?"
Whether or not my advice is any good kind of depends on what you're making, I think. I mean, if I wanted to reduce lag in my game, I might use a profiler, but if I wanted to stress test my application and how fast it could run through a MB of data, benchmarking might be more appropriate. So, what ARE you doing?

Thanks for the reply.

I want to test a few micro-optimizations, like ++i versus i++ (I just want to test it myself), and other similar things, but also small libraries I wrote: how fast they run, what SPF (seconds per frame) they give me, how much overhead std::string::substr() adds... I want to be sure about things, performance-wise.

I'm not too familiar with ASM, but I can read it a bit. Exporting the ASM sources sounds like a solid idea...using Release mode and the optimizations I'd use when I'd distribute it...

For stress-testing, a profiler would be good to have.

Thanks.

Quote:
Original post by Decrius
I want to test a few micro-optimizations ++i and i++

The best test is to look at the generated assembly.

Furthermore, i++ can NEVER be faster than ++i. I ALWAYS write ++i.
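For what it's worth, here's a minimal pair you can feed to gcc with -S to compare the generated assembly yourself; with optimizations on, both loops typically compile to the same code when i is an int:

```cpp
// Compile with: g++ -O2 -S incr.cpp   (then inspect incr.s)

// Post-increment in the loop header.
int sum_post(int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += i;
    return s;
}

// Pre-increment in the loop header.
int sum_pre(int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += i;
    return s;
}
```

If the two functions produce identical assembly, the micro-benchmark question answers itself for that type on that compiler.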

Quote:
Original post by DevFred
Furthermore, i++ can NEVER be faster than ++i. I ALWAYS write ++i.


Yes, true. I want to see the difference though - how much it differs, etc.

Quote:
Original post by DevFred
Furthermore, i++ can NEVER be faster than ++i. I ALWAYS write ++i.
That isn't entirely true - if i is an integral type, there is really no reason why the compiler cannot compile both expressions to the identical assembly. For aggregate types you are of course correct.

Quote:
Original post by Decrius

So, what would be the best way to perform these measurements?


There is no best way.

Micro-optimizations depend completely on the context. While ++i may take less than one cycle, the cost of loading the value from memory on first access may be hundreds of cycles.

In algorithms, running out of registers affects performance. Pipelining is an issue as well, so is cache behavior.

Quote:
most stable OS / compiler
This question translates to: "I'm trying to improve my SUV. Which race track and which Formula 1 team should I use to measure performance?"

You use your target environment. It doesn't matter how things work elsewhere.

Quote:
way to measure time
One that is accurate enough. If your process takes 17 hours, a stopwatch will do. If it takes 15 cycles, RDTSC is likely optimal. It depends.

Quote:
number of repetitions
Anywhere between 0 and infinity. For some optimizations, you may not need to run any tests at all. For others, you need to run just enough to collect enough samples for a statistically accurate measurement.

For arbitrary tests, I used a loop:
int n = 1;
while (true) {
    time t1 = now();
    for (int i = 0; i < n; i++) run_single_test();
    time t2 = now();
    time delta = t2 - t1;

    if (delta < desired_time) {
        n = n * 2;
    } else {
        break;
    }
}

for (int i = 0; i < m; i++) {
    time t1 = now();
    for (int j = 0; j < n; j++) {
        run_single_test();
    }
    time t2 = now();
    record_single_measurement(t2 - t1);
}


The above is convenient for automated profiling. For each test, you specify how long it should run. The profiler then calibrates the number of repetitions (n) and number of measurements (m).

The nice thing about this approach is that you can measure a single instruction (which will get executed a bazillion times) or an expensive algorithm (which gets tested 12 times). Furthermore, it's possible to run statistical analysis on the results to determine whether they are consistent. This information can be used to re-run the simulation with a longer desired_time to compensate for inaccurate timers.

I generally use this type of profiling as part of unit tests. It helps as a metric to see if a change to the code affected performance. But unlike other unit tests, the results are only logged; they don't cause an error if the algorithm is suddenly slower (due to different absolute values on different machines).

Quote:
Original post by swiftcoder
Quote:
Original post by DevFred
Furthermore, i++ can NEVER be faster than ++i. I ALWAYS write ++i.
That isn't entirely true - if i is an integral type, there is really no reason why the compiler cannot compile both expressions to the identical assembly. For aggregate types you are of course correct.

Although you're right that the compiler can recognize that, he's still technically correct.

He said !(speed(i++) < speed(++i)), which is the same as speed(i++) >= speed(++i).

Which means that i++ can be SLOWER THAN OR EQUAL TO ++i. That satisfies both what you're saying and what he said.

It's impressive how much content a couple of words can express, isn't it?

Cheers
Dark Sylinc

Quote:
Original post by Matias Goldberg
Quote:
Original post by swiftcoder
Quote:
Original post by DevFred
Furthermore, i++ can NEVER be faster than ++i. I ALWAYS write ++i.
That isn't entirely true - if i is an integral type, there is really no reason why the compiler cannot compile both expressions to the identical assembly. For aggregate types you are of course correct.

Although you're right that the compiler can recognize that, he's still technically correct.

He said !(speed(i++) < speed(++i)), which is the same as speed(i++) >= speed(++i).

Which means that i++ can be SLOWER THAN OR EQUAL TO ++i. That satisfies both what you're saying and what he said.

It's impressive how much content a couple of words can express, isn't it?
LOL - I must have been skimming rather more than usual. I primarily replied because I thought DevFred was stating that one should always use preincrement, but I see on re-reading that he only stated that he himself uses preincrement. My apologies [smile]

Performance is relative.

Most chunks of code have no significant performance differences. Some chunks of code are hit really hard. A profiler helps there. Measure before, measure after, take the version that works best.

Another nice tool is map files. They can tell you a lot about your code without resorting to looking at the raw assembly. Reading map files to identify performance issues is an art.

For very fine tuning you can look at the generated optimized code. This is often a last resort, only after you have identified the bottleneck, identified and corrected all the high level problems (like algorithm choice), identified and corrected issues in surrounding code, and still know that the very small gains will be worth the huge time investment.

Yeah, speaking of small gains, most optimizations like using ++i instead of i++ will gain less than one-thousandth of a second. Actually, much less. This only matters if you're looking at a performance-critical loop that iterates so often that you notice.

However, it seems the OP is more interested in best practices. Writing overall better code is worth it, I think, so long as you don't constantly refactor.

Quote:
Original post by Decrius
I'm not too familiar with ASM, but I can read it a bit. Exporting the ASM sources sounds like a solid idea...using Release mode and the optimizations I'd use when I'd distribute it...


Two tips:
First, do not allow your compiler to inline functions when you look at the asm; that way you'll know where to look. This isn't exactly a realistic condition, but it lets you see what the code looks like. Alternatively, with gcc, I can add inline assembly like asm("# comment"); and when I compile with -S, I can see exactly where the function is implemented. (There are ways to have g++ print the source and assembly together, but I find the assembly gets mangled, so I rely on my own methods.)

Second, know that compilers don't always do the obvious thing. For example, try dividing by seven. So what you're looking at might not be clear at all. But if you post here, I think we can help make sense of the assembly.

On the increment thing: I think some compilers can tell how to optimize the increment even if i is not an integer or a pointer. And i++ is sometimes actually faster, according to the optimization guide I linked. It says the CPU needs the address of a variable so many cycles before it loads it, so i++ will be faster when loading from memory because the address can already be calculated:
    x = array[i++]
is faster than
    x = array[++i]


I think it's best to use ++i most of the time just so you KNOW it's optimized, but it sounds like speed(i++) CAN be greater than speed(++i).

Quote:
Original post by Splinter of Chaos
I think it's best to use ++i most of the time just so you KNOW it's optimized, but it sounds like speed(i++) CAN be greater than speed(++i).
That brings up the major argument against hand-optimising this sort of thing - it is darn complicated, and results in little gain. The optimisations which can give incredible performance increases involve careful cache, memory and I/O management - you might save 10-50 cycles with clever assembly tricks, but a single cache miss costs you hundreds of cycles, and a page fault or I/O operation costs millions.

Thanks guys, great replies. I guess in the land of optimization, ASM is king :). Might be handy to learn ASM any time soon :)

And yes, I agree. Micro-optimization is really for the hobbyist; it's really insignificant in regular programs, and can be misleading. It's difficult to say whether things are faster or slower micro-wise if you don't test them.

I want to test some algorithms as well though, so this'll come in handy :)

I currently use the Code Profiler that ships with CodeBlocks (which is the gcc one?). It looks a bit limited information-wise, any suggestions for a decent code profiler?

Quote:
Original post by Decrius
I currently use the Code Profiler that ships with CodeBlocks (which is the gcc one?). It looks a bit limited information-wise, any suggestions for a decent code profiler?
Would that be gprof? Like all things GNU, it is quite powerful, but it takes a lot of getting used to. Give the documentation a thorough browse, or get yourself a graphical profiler.

Quote:
Original post by Splinter of Chaos
    x = array[i++]
is faster than
    x = array[++i]


Be careful here, those two expressions have different semantics.

Doesn't the C++ standard allow the returned value of i++ to be optimised out if it is not used, even for overloaded operators? For example, this would allow i++ in a for loop to not generate an unused temporary return if i is an iterator. I'm sure that all sensible compilers do this, in which case i++ or ++i would do exactly the same thing in a loop situation. However, I guess ++i is certain to be optimal - it's a minor point, I think.

Also, to emphasise, always profile and optimise the most time consuming functions. Anything else is simply a waste of time. Why get a 0.001% improvement when you could get a 30% improvement for the same effort?

Quote:
Original post by AshleysBrain
Doesn't the C++ standard allow i++ to optimise out the returned value if it is not used, even for overloaded operators? For example, this would allow i++ in a for loop to not generate an unused temporary return if i is an iterator. I'm sure that all sensible compilers do this, in which case i++ or ++i would do exactly the same thing in a loop situation.
What if ++i and i++ are overloaded to do entirely different things? Despite the expectation that they represent pre and post-increment, the programmer is allowed to overload them to do whatever he wants: ++i could create a database, and i++ reformat your hard drive, for all the compiler knows.

Quote:
Original post by swiftcoder
What if ++i and i++ are overloaded to do entirely different things? Despite the expectation that they represent pre and post-increment, the programmer is allowed to overload them to do whatever he wants: ++i could create a database, and i++ reformat your hard drive, for all the compiler knows.


You'd be crazy if you made those operators do different things! Surely an absolute recipe for disaster. Overloaded operators should do the obvious thing - i.e. both increment overloads should increment the object in some way. If it's not obvious, a named function really should be used.

Anyway, I can't remember where I read that C++ allows removal of unused temporary return values for i++ - does anyone know if that's allowed by the standard?

Quote:
Original post by AshleysBrain
Quote:
Original post by swiftcoder
What if ++i and i++ are overloaded to do entirely different things? Despite the expectation that they represent pre and post-increment, the programmer is allowed to overload them to do whatever he wants: ++i could create a database, and i++ reformat your hard drive, for all the compiler knows.
You'd be crazy if you made those operators do different things! Surely an absolute recipe for disaster. Overloaded operators should do the obvious thing - ie. both increment overloads should increment the object in some way. If it's not obvious, a named function really should be used.
OK, then by your logic, << and >> should only be used for bitshifts, right? Unfortunately, std::iostream uses them for input/output. And what about boost::spirit, which overloads pretty much every operator in fairly unusual ways...

Quote:
Original post by DevFred
Quote:
Original post by Splinter of Chaos
    x = array[i++]
is faster than
    x = array[++i]


Be careful here, those two expressions have different semantics.


QFE

Quote:
Original post by AshleysBrain
For example, this would allow i++ in a for loop to not generate an unused temporary return if i is an iterator.

Maybe, but here is the important question: WHY write for loops with i++ in the first place? What is the benefit of i++ over ++i? I find the semantics of ++i a lot simpler to understand and explain to others.

Quote:
Original post by AshleysBrain
Quote:
Original post by swiftcoder
What if ++i and i++ are overloaded to do entirely different things? Despite the expectation that they represent pre and post-increment, the programmer is allowed to overload them to do whatever he wants: ++i could create a database, and i++ reformat your hard drive, for all the compiler knows.


You'd be crazy if you made those operators do different things!


Good point. A better example would be an iterator class that does a lot of extra leg-work behind the scenes. Even if the compiler still could make the same optimization, it might not take that chance. Because, for all it knows, ++i could create a database, and i++ reformat your hard drive.

Getting into ASM at the moment, and I've compiled a few C sources down to assembly. How would I find out how many cycles one instruction takes? And is this CPU-specific? Or is there another way to measure the cycles of a program? And yes, it's mostly curiosity :-)

PS: I'm using the latest NASM assembler on WinXP

Quote:
Original post by Decrius
Getting into ASM at the moment, and I've compiled a few C sources down to assembly. How would I find out how many cycles one instruction takes? And is this CPU-specific? Or is there another way to measure the cycles of a program? And yes, it's mostly curiosity :-)
You need to download the manuals for your particular model of CPU (it will be different for each). However, as has been said before, cycle counts are entirely useless as a measure of application performance on a modern CPU.

Thanks guys :)

Also, about the post and pre increment debate, this is what http://www.agner.org/optimize/optimizing_cpp.pdf (page 29) says:
Quote:
Increment and decrement operators

The pre-increment operator ++i and the post-increment operator i++ are as fast as additions. When used simply to increment a variable, it makes no difference whether you use pre-increment or post-increment. The effect is simply identical. For example, for (i=0; i<n; i++) is the same as for (i=0; i<n; ++i). But when the result of the expression is used, then there may be a difference in efficiency. For example, x = array[i++] is more efficient than x = array[++i] because in the latter case, the calculation of the address of the array element has to wait for the new value of i which will delay the availability of x for approximately two clock cycles. Obviously, the initial value of i must be adjusted if you change pre-increment to post-increment.

There are also situations where pre-increment is more efficient than post-increment. For example, in the case a = ++b; the compiler will recognize that the values of a and b are the same after this statement so that it can use the same register for both, while the expression a = b++; will make the values of a and b different so that they cannot use the same register.

Everything that is said here about increment operators also applies to decrement operators.
