cpu-cycles cost of different operations (c++)?

6 comments, last by DeadXorAlive 16 years, 6 months ago
Topic. Like, are there any good figures on this, preferably some average over different CPUs so you can get an overview? Compare stuff like these lines and more: p=a+4; p=a*b*c; p=pow(a,b); forLoops fileAccess functionCalls etc, you get the idea. I know there is no absolute answer to this, but maybe some estimates? Thanks Erik
The cpu cycles cost? Compile with assembly listing output, count the instructions (including loops), that's your baseline cpu cycles cost. For instance, on the x86, I'd estimate:

p=a+4;        // between 0 and 3
p=a*b*c;      // between 0 and 7
p=pow(a,b);   // between 0 and 4, plus the contents of the pow function,
              // possibly less if inlined.
forLoops      // Depends on the condition and step, I'd say between 0 and 5
              // for your typical integer loop.
fileAccess    // Anywhere between 0 and infinite cycles, depending on the
              // filesystem.
functionCalls // Depends on the function call, argument count and type,
              // etc. So, between 0 and infinite.


However, the cpu-cycle cost of an expression, alone, is an extremely imprecise metric: while you can say in the general case that a 10-clock-cycle operation is faster than a 100,000-clock-cycle operation, it might happen that a 100-clock-cycle operation is faster than a 10-clock-cycle one (because of pipeline stall, hardware accesses, latency, throughput or cache miss considerations).

A typical example:
int data[512][512];
int x = 0;
for (int i = 0; i < 512; ++i)
  for (int j = 0; j < 512; ++j)
    x += data[i][j];
for (int i = 0; i < 512; ++i)
  for (int j = 0; j < 512; ++j)
    x += data[j][i];


The two looping blocks will generate nearly the same machine code, yet one will run much faster than the other because it's cache-friendly.
Quote:Like is there any good calculations on this, pref some average of different CPUs so you can get a overview.


It cannot be said for the examples you've given.

CPU cycles can only be evaluated at assembly level.

In just about any above-assembly language, the compiler will transform that code into something else: it may remove it, pre-calculate it, unroll the loop, or do something weird.

When it comes to differences between processors it gets even more annoying, due to memory cache sizes, pipeline behaviour and possibly vectorization.

First 3 examples also depend heavily on the type of variables used.

Modern CPUs are superscalar, which means they are able to execute more than one instruction per cycle thanks to pipelining and multiple execution units. But that cannot be evaluated on a per-instruction basis.

This is why: profile, profile, profile. Although things are documented, such performance aspects could be said to be almost non-deterministic.

In addition, the operations listed can often be replaced with different ones. pow(a,b) can be much simpler if a and b are ints, and a is (power of) 2.

Quote:fileAccess


This is like measuring the time of a hiking trip with a nanosecond-precision watch. File access is measured in milliseconds, or microseconds for cached files, but all of this passes through huge layers of the OS.
Yes, I know, but is there no site or something that tries to cover this? Don't tell me again that it "depends" (I stated in the original post that I was aware). I mean to get a rough estimate?

Maybe CPU-cycles was misleading. I'm not interested in exactly how many cycles it takes but in how much it affects performance and fps. What I'm looking for is some survey where they have tried different code on different machines and tried to give them values, so we can get an estimate of what costs what. Is there any? I searched a bit...

E
Well, what you're asking about sounds exactly like what this talk answered...

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

If you start looking at every sum you make, each multiply you make, you'll end up optimizing code instead of writing it. Games are complex. Better to look at the algorithm, not at the code. Let the people who dedicate their programming lives to optimizing code take care of that. Look at the Intel and AMD libraries, which are optimized. Or try SIMDx86 or my own library, x86mph.

Anyway here it goes
p=a+4; //About 2-3 cycles
p=a*b*c; //About 2-4 cycles
p=pow(a,b); //2 pushes + 1 call + what's inside pow. Avoid using it if you just need p=a*a*a
forLoops //3 cycles in the best scenario per iteration, + what's inside the loop
fileAccess //Impossible to determine (but slow)
functionCalls //1-2 cycles per call, + 2-4 cycles per parameter, + 1 cycle for returning from the call + 0-1 cycle for returning a specific value (i.e. an integer) + 0-14 cycles if some or all of the register values must be saved (usually between 2-8 cycles) + what's inside the call


Hope this helps
Dark Sylinc

Also note that some instructions may take fewer or more cycles in the future. This is all very relative. And as already said, cache is very important, along with instruction pairing, stalls, etc.
Avoid reading and immediately writing to the same memory location, avoid excessive branching (i.e. more than 3), and try to group memory accesses of the same size (i.e. don't access a word, then a dword, then a word, then a dword, then a byte; group them: dword, dword, do something, word, word, do something, byte).
Read Intel's IA-32 manual for better comprehension.
OK, thanks. Got a clearer picture of it now.
E
Agner Fog's optimization manuals provide a lot of information. The C++ guide is a worthwhile read; it makes you realize how many things are involved. There is also a manual that lists performance statistics for instructions on different CPUs.

This topic is closed to new replies.
