Calculating CPU usage

12 comments, last by donguow 12 years ago
Hi guys,

I'm looking for ways (or tools) on Windows to calculate the CPU usage of a program. Specifically, I aim to measure how many instructions per second (IPS) a program takes for its execution. If anyone knows, please help.

Thanks,
Question: why exactly do you want to know this? If you're trying to profile code, I'd use a profiler. If you've got some special needs/interests that actually require IPS, though, it might be good to share your end goal so that the right tool can be suggested. Also, I'm curious why IPS is needed, as I've never had a need to calculate it.
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
Hi Corntalks,

My case is that I have several graphics applications and would like to compare them in terms of CPU usage. To the best of my knowledge, IPS could be a good metric. As you can see, it's of course not compulsory to use IPS; there may be better parameters to look at. Anyway, I am open to your suggestions/recommendations :)
I don't get it. Maybe my view of how programs are executed is outdated on modern machines. A process is allowed to run by the scheduler, and it runs at the CPU's clock speed. Programs don't run faster or slower because "they use a lot of CPU". IPS also seems like a pointless concept: a div takes more cycles than an add. Does that mean a program doing lots of divs while executing for the same amount of time is using less CPU, because you get a much lower IPS?

You could use a clock that measures actual process or even thread time (I'm not sure what the equivalent is on Windows). A normal timer would not just tell you "this code gave up its time slice more often or spent more time in blocking operations"; it would also tell you "there might have been more higher-priority processes active at the time, or the scheduler simply had a bad day".

In the same way, timing how long your code executes only works for small sections of code and with multiple test runs (or by using a clock that only counts time actually spent in this thread). By just measuring the total elapsed time, you would also count all the time spent in completely different processes whenever the scheduler interrupts you. As an alternative, you could set your thread to realtime priority, but you had better make very sure you have a way of ending it and don't get stuck in some loop (or it's reset time).
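On Windows, the equivalent of such a per-process clock would be something like GetProcessTimes (or GetThreadTimes for a single thread), which reports the kernel and user time actually charged to your process. A minimal sketch, with error handling and the actual workload left out:

[code]
#include <windows.h>
#include <cstdio>

// Convert a FILETIME (100-nanosecond units) to seconds.
static double FileTimeToSeconds(const FILETIME& ft)
{
    ULARGE_INTEGER u;
    u.LowPart  = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart * 100e-9;
}

int main()
{
    // ... run the workload you want to measure here ...

    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime))
    {
        // Kernel + user time is CPU time actually charged to this process,
        // regardless of what other processes were doing in the meantime.
        printf("kernel: %.3f s, user: %.3f s\n",
               FileTimeToSeconds(kernelTime), FileTimeToSeconds(userTime));
    }
    return 0;
}
[/code]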
f@dz http://festini.device-zero.de
I second Trienco +1

[quote]I aim to measure how many instructions per second (IPS) a program takes for its execution. If anyone knows, please help.[/quote]
This metric is meaningless. First, an instruction would be counted at the assembly level, i.e. individual MOVs or ADDs, not your standard "i += 3;" kind of high-level code. Secondly, instructions don't all take one cycle to execute. Third, on multiprocessor machines your program may run on multiple CPUs at the same time. And in fact, on any scheduled system your program may be interrupted for several hundred cycles to give other programs time to execute.

So as you can see, instructions per second has no meaning. A better metric is clock cycles, but that only has meaning if you are looking at a section of code which runs on a single CPU (you can usually ignore the effects of the scheduler if the code section is small enough, as they will be amortized over repeated averages).
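As an illustration of measuring cycles for a small code section, here is a rough sketch using MSVC's __rdtsc intrinsic; SectionUnderTest is just a placeholder for whatever code you care about, and the run count is arbitrary:

[code]
#include <intrin.h>   // __rdtsc (MSVC)
#include <cstdio>

static volatile int sink;

// Placeholder for the small code section being measured.
static void SectionUnderTest()
{
    int sum = 0;
    for (int i = 0; i < 1000; ++i)
        sum += i * i;
    sink = sum;   // keep the optimizer from removing the loop
}

int main()
{
    const int runs = 100000;

    unsigned long long start = __rdtsc();
    for (int i = 0; i < runs; ++i)
        SectionUnderTest();
    unsigned long long cycles = __rdtsc() - start;

    // Scheduler interruptions are amortized over the large number of runs.
    printf("~%llu cycles per run\n", cycles / runs);
    return 0;
}
[/code]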

And instructions per second is a bad metric for CPU usage anyway, all of the above aside. Can I measure my CPU's performance by having it do a bunch of NOPs and seeing how many it can do per second? No, it will just go into sleep mode since none of its units are actually being used. The same is true for many other instructions.

[quote]As an alternative, you could set your thread to realtime priority, but you had better make very sure you have a way of ending it and don't get stuck in some loop (or it's reset time).[/quote]
This doesn't quite solve the problem either, on Windows anyway. Many system processes also run at realtime priority (despite what MSDN says), and the scheduler will still interrupt you since your program will have the same priority. And I think the newer Windows versions implement fair scheduling, which distributes time by giving a "score" to each process based on its priority, with the score never being zero, so all threads (even idle ones) always get to run at some point, even if the realtime thread may be running, say, 100x more often.
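For reference, a minimal sketch of the realtime-priority approach being discussed, with all the caveats above:

[code]
#include <windows.h>

int main()
{
    // Raise the process to the realtime class, then the current thread to
    // time-critical. As noted above, other realtime threads can still
    // preempt you, and a runaway loop at this priority can make the
    // machine very hard to use.
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    // ... timing-sensitive measurement code goes here ...
    return 0;
}
[/code]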

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

The tool "perfmon" that comes with Windows has all of what you need.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
CodeAnalyst and VTune are great for figuring out where your app is spending its time on the CPU. CodeAnalyst is free but really only useful on AMD chips; VTune has a trial period I think, and is only useful on Intel chips. Next to that, you will want NVIDIA Parallel Nsight and AMD GPU PerfStudio 2 to figure out how much GPU time your shaders are using; these again only work on NVIDIA and AMD hardware respectively.

The only real way of figuring out how much time an app is spending and where it is being spent is by using a profiler. Sadly, most CPU profilers don't give you a per-frame breakdown; this is where the console profilers far outclass the traditional ones, by the way, as in a game you often want a per-frame breakdown. It might be that the GPU profilers in conjunction with the CPU profilers give you this, but I do not know.

[edit]
Did a little more digging, and AMD GPU PerfStudio has a per-frame breakdown under this section: http://developer.amd...meProfiler.aspx. And I am guessing NVIDIA offers something similar in Nsight; see http://www.nvidia.com/object/parallel-nsight.html, go to the section on graphics and it will tell you it has frame timings and profiling.
[/edit]

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

The whole IPS thing gets even worse when you realize the superscalar nature of current CPUs. CPUs will stuff up to 4 instructions per cycle into their pipelines, so cycles/instruction on current hardware is often in the range 0.5-0.3 and in theory can go down to 0.25. So the "div takes more cycles" of olden times is more like "div blocks more resources" nowadays, since instructions actually take less than one cycle, so to speak (in the throughput sense, not latency).

And then SIMD makes this go entirely out of the window... SSE/AVX instructions tend to affect the instructions-per-cycle figure negatively, but each of them does multiple operations, so to speak. So having an AVX-heavy loop that pushes a mul/add each cycle will result in tools like Intel's VTune reporting a seemingly mediocre throughput (around 2 instructions retired per cycle), while your program is pushing 16 single-precision flops per cycle and core...
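To make the flop arithmetic concrete: one 8-wide AVX multiply plus one 8-wide AVX add is 16 single-precision flops from only two arithmetic instructions. A tiny illustrative loop with intrinsics (the function name and sizes are made up for the example):

[code]
#include <immintrin.h>   // AVX intrinsics

// y[i] = a[i] * b[i] + c[i] over n floats (n assumed to be a multiple of 8).
// Each iteration issues one 8-wide multiply and one 8-wide add:
// 16 single-precision flops, but only two arithmetic instructions.
void MulAdd(const float* a, const float* b, const float* c, float* y, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 vy = _mm256_add_ps(_mm256_mul_ps(va, vb), vc);
        _mm256_storeu_ps(y + i, vy);
    }
}
[/code]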

I'm also going to say: either try to use a profiler cleverly, or stick a bunch of timers in your code and average over a big enough sample set to get good statistics.
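A minimal sketch of the "bunch of timers" approach using std::chrono; DoFrameWork is a placeholder for whatever you actually want to measure:

[code]
#include <chrono>
#include <cstdio>

// Placeholder workload; replace with the code you actually want to time.
static void DoFrameWork()
{
    volatile int x = 0;
    for (int i = 0; i < 100000; ++i)
        x += i;
}

int main()
{
    using clock = std::chrono::steady_clock;

    const int samples = 1000;
    double totalMs = 0.0;

    for (int i = 0; i < samples; ++i)
    {
        auto start = clock::now();
        DoFrameWork();
        auto end = clock::now();
        totalMs += std::chrono::duration<double, std::milli>(end - start).count();
    }

    // Averaging over many samples smooths out scheduler noise.
    printf("average: %.4f ms over %d samples\n", totalMs / samples, samples);
    return 0;
}
[/code]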

[quote]The whole IPS thing gets even worse when you realize the superscalar nature of current CPUs. CPUs will stuff up to 4 instructions per cycle into their pipelines. So instead of cycles/instruction, tools such as Intel VTune give you "retired instructions per cycle", which in theory can go up to 4 on current hardware and often reaches values in the range 2-3. So the "div takes more cycles" of olden times is more like "div blocks more resources" nowadays, since instructions actually take less than one cycle, so to speak (in the throughput sense, not latency).[/quote]

It's worse than that...

A simple div instruction against a memory address can explode to upwards of thousands of cycles wasted while waiting on a main-memory fetch (if the operand wasn't in the cache; it depends on memory latency and other things). Even if it is in the cache, you still spend time waiting on the cache (even the L1/DCache usually has a latency greater than a single cycle).

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.


[quote]Even if it is in the cache, you still spend time waiting on the cache (even the L1/DCache usually has a latency greater than a single cycle).[/quote]
That is usually the point where you start considering the moon phase as a possible influence... :D I had this funky case where I got higher performance accessing half the data from memory (L1) instead of registers, since I would get resource stalls on AVX registers because the loop would reuse them too quickly (every 8 cycles), while apparently Sandy Bridge CPUs manage to stream data from L1 at pretty much register speed.

