How to measure the time it takes to evaluate an expression

7 comments, last by frob 11 years, 6 months ago
say I have a boolean expression

a<b

How can I measure the time it takes for my computer to evaluate a single expression?
uhm...

I guess you could make a timestamp before and after the expression and look at the difference.

It will barely differ though, unless you have a really complex expression.
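
For example, something like this (a minimal sketch assuming C++11 <chrono>; the variable names are just for illustration):

[code]
#include <chrono>
#include <cstdio>

int main()
{
    int a = 1, b = 2;

    auto t0 = std::chrono::high_resolution_clock::now();
    volatile bool r = (a < b);              // volatile so the compiler keeps the comparison
    auto t1 = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    printf("%lld ns (mostly timer overhead)\n", (long long)ns);
    return r;
}
[/code]

As noted above, for a single comparison the difference will be dominated by the cost of reading the clock itself.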
Would the result be inaccurate because it is such a small difference?
And how can I obtain the true time it takes to evaluate just the expression, not other things such as assigning the value to a variable?

Normally when I just code [font=courier new,courier,monospace]1==2;[/font], compilers see the code as having no effect and discard it.

Would the result be inaccurate because it is such a small difference?


There are high-precision timers (on Windows they are called "Multimedia Timers") that can give you, I believe, microsecond resolution (maybe even nanosecond, though at that level it would probably be inaccurate). Why do you need such precise timing for only a single operation? If you are trying to figure out the clock speed of the user's processor, there are better ways.
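
For reference, the Windows high-resolution counter mentioned later in the thread is QueryPerformanceCounter; a minimal sketch (the code being timed is just a placeholder):

[code]
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);       // counter ticks per second
    QueryPerformanceCounter(&start);

    // ... code being timed goes here ...

    QueryPerformanceCounter(&stop);
    double us = (stop.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
    printf("%.3f microseconds\n", us);
}
[/code]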
I started programming in the 80s. Back then it would take a microprocessor a well-defined number of cycles to execute each instruction, and this number of cycles was documented and not that hard to measure. You could look at a loop, add the time it would take to run each instruction, add them up and you would get precisely how long it would take to run through the loop once.

Since then microprocessors have been getting increasingly sophisticated, and the question you are asking is no longer meaningful.

To give you a flavor of the things that make an answer impossible, consider that the processor can evaluate multiple instructions in parallel. Because of pipelining, it might take the processor 4 cycles to execute the instruction, but it can execute one such instruction per cycle. There are multiple instruction decoders, and precisely what instructions are around the one you are trying to evaluate might change the availability of certain decoders, therefore changing the time it takes to execute the instruction. This is particularly nasty because the fact that you are measuring time before and after the instruction might change what you are trying to measure.

You also have to know that the time it takes to evaluate the expression depends dramatically on whether the operands are in registers, L1, L2, or L3 cache, main memory, or a memory page that has been swapped to disk.

The performance would also be different if you use that expression as an integer (0 or 1), if you use it in an if-statement with a complex body, or if you use it in an if-statement whose body is a simple assignment (because there are special conditional move instructions). In the case of the if-statement with a complex body, the performance would additionally differ depending on whether the CPU can guess (through heuristics) the value of the expression or not.

Also, your thread might be interrupted right around the evaluation of the expression, either because of preemptive multithreading or because of a hardware interrupt, and the execution might be restarted milliseconds later. All these things make the results of your timing impossible to reproduce.

To make things worse, if the evaluation of your expression doesn't change the observable behavior of your program, it is legitimate for the compiler to optimize it away and do nothing.

With all those warnings in mind, a precise method to measure time on Windows is described here.
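
To illustrate that point about the optimizer (a sketch, not the method the link describes): a comparison whose result is never observed can legally be compiled to nothing.

[code]
#include <cstdio>

int main(int argc, char**)
{
    int a = argc, b = 2;        // values the compiler can't fold at compile time

    a < b;                      // result never used: the compiler may emit no code here

    volatile bool r = (a < b);  // the volatile store forces the comparison to happen
    return r ? 0 : 1;
}
[/code]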
I'm doing an experiment.


[quote name='lride' timestamp='1349398907' post='4986974']
Would the result be inaccurate because it is such a small difference?


There are high-precision timers (on Windows they are called "Multimedia Timers") that can give you, I believe, microsecond resolution (maybe even nanosecond, though at that level it would probably be inaccurate). Why do you need such precise timing for only a single operation? If you are trying to figure out the clock speed of the user's processor, there are better ways.
[/quote]

Are you saying it would still be inaccurate at the nanosecond level?
Thank you. I understand.


and the question you are asking is no longer meaningful.


Are you saying it would still be inaccurate at the nanosecond level?

Note that on the nanosecond level, even calling your timing functions will show noise. Just calling QueryPerformanceCounter, storing it in a variable/register, and then calling it again will take some time. So you have to remember that the time you get back includes your operation and your call to QueryPerformanceCounter.

To help minimize this inaccuracy, you'll want to repeat your operation until it takes a significant amount of time compared to the noise introduced by your timing calls. But like alvaro said, modern CPUs are incredibly complicated, and the timings you get back in your testing setup may not match the average results in a real application.
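
A minimal sketch of that repeat-it-in-a-loop approach (assuming C++11 <chrono>; the iteration count and names are arbitrary):

[code]
#include <chrono>
#include <cstdio>

int main()
{
    const long long N = 100000000;          // repeat enough that timer noise is negligible
    volatile int a = 1, b = 2;              // volatile so the reads aren't hoisted out of the loop
    long long hits = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < N; ++i)
        hits += (a < b);                    // use the result so the loop isn't deleted
    auto t1 = std::chrono::steady_clock::now();

    double total_ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    printf("hits=%lld, roughly %.3f ns per iteration (includes loop overhead)\n",
           hits, total_ns / N);
}
[/code]

Even then you're measuring the comparison plus the loop counter, the volatile reads, and everything else in the loop body, which is exactly the kind of adjacent-code effect discussed elsewhere in this thread.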

To be more accurate, you'd read the generated assembly, look up the specs for your CPU, and calculate by hand how much time an average expression takes to evaluate. But even then there are all sorts of things the CPU can do that will make your calculated time just a guideline and not a fixed absolute.

So I guess it all depends on how accurate you want to be.

Are you saying it would still be inaccurate at the nanosecond level?

Yes.


Again, consider the modern CPU architecture.

First, let's consider the number of stages inside a CPU: The early x86-based processors only had a single stage. You could look up in the processor manual how many cycles each instruction could take. Then around the 486 timeline, a pipeline was introduced to the x86 series so some operations could be completed in parallel with other operations. Simple integer comparisons are much faster than multiplication or division, so there was no point in delaying one when it didn't impact the other. The Pentium III had 10 stages in its pipeline. The Willamette core had 20. Prescott had 31 stages. Etc.

Let's look at just a few of those stages.

First the instruction must be decoded. The CPU's decoder can convert multiple instructions into micro-operations; exactly how many varies depending on the CPU in use. Some instructions are big, and the CPU can decode only a single instruction per clock cycle. Other instructions, such as the simple integer logical operation "a<b" you presented, are small; aligned correctly, your CPU may decode one, two, three, four, or maybe eight instructions at once. This will vary based on the CPU core and what happens to be next to it in the compiled code.

Addresses are calculated, registers are processed, data must be marked as dirty, etc. All of this takes a different number of CPU cycles depending on the processor microarchitecture and the code that happens to be around it.

Eventually all those micro-operations make their way to the out-of-order (OOO) core. The OOO core can reorder operations and will perform each one as soon as possible. Again, some operations are fast; many effectively take a fraction of a CPU cycle: the core may perform three or four comparison micro-ops in a single cycle, whereas a more complex division micro-op may take five or ten cycles. Each CPU core actually has a half dozen or more (depending on the microarchitecture) execution units inside it. For example, it may have 3 integer add/subtract/compare units each capable of 4 micro-ops per cycle, 4 general-purpose integer units, and 3 FPU/SIMD units, all of them pulling from the same OOO core. A microarchitecture may be able to perform 10, 20, 30, or even 50+ micro-operations per clock cycle, all depending on the CPU version, what is adjacent to the operation in question, and whatever else is sitting inside the OOO core.

Then those micro-ops must be put back in order and retired. Again, this will take a varying amount of time depending on the microarchitecture.


If your single comparison operation happens to be next to a bunch of other lightweight instructions, it may pass through the pipeline in 10 or 20 CPU cycles; assuming a 3.4 GHz processor, that might be around 5 ns. Or if it happens to be next to several branches, floating-point math, and other heavy instructions, it may pass through the pipeline in 100 CPU cycles, which may be around 30 ns. Note that your individual comparison operation didn't change, only the adjacent code, but it dramatically changed the time it took to leave the CPU.
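
Those figures are just cycles divided by the clock rate; a throwaway sketch of the arithmetic (the 3.4 GHz clock and the cycle counts are the assumed numbers from the paragraph above):

[code]
#include <cstdio>
#include <initializer_list>

int main()
{
    const double clock_ghz = 3.4;                   // assumed clock rate from above
    for (int cycles : {10, 20, 100})                // assumed pipeline transit times from above
        printf("%3d cycles at %.1f GHz ~ %.1f ns\n", cycles, clock_ghz, cycles / clock_ghz);
}
[/code]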



Considering all of that, the time required for a single integer compare operation will come out as just random noise. The actual time it takes is so vanishingly small that measurement is useless.

This topic is closed to new replies.
